Informatics Musings: Back to the Basics

Sometimes it's good to go back to the basics for a refresher lesson. Last year I did some training with analysts and realized not everyone who works with data knows what data normalization actually means. Most have heard of it but not all understand it.
So I'm going to take a few posts and go back to the basics.

Data Normalization

The concept of data normalization was first described by Edgar F. Codd in 1970, while he was working at IBM. He is credited with creating the theoretical basis for relational databases.
The purposes of normalized data structures included:

- minimizing data storage
- reducing data write times
- ensuring data consistency
- ensuring data queries would return meaningful results

The first three goals can be met by ensuring each data element is stored only once. The fourth requires some additional structure. The normalization rules that E. F. Codd developed address all four.

First Normal Form

There are different definitions of the first normal form (1NF) on the internet, but I find it easiest to think of it in these terms:

- Each field has only one value
- Each row has a unique key

Consider a dataset with Year, Winner, and Department fields that lists the winners of the company's Employee of the Year award for the past few years. It violates 1NF because the first row contains two values in the Year field: George won the award in 2010 and again in 2014, and both wins have been stored in a single record.
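A violation like this can be detected mechanically. The sketch below is one possible way to check the two 1NF rules in Python, assuming each row is a dict and a multi-valued field is represented as a list. The function name `find_1nf_problems`, the second winner, and the department names are made up for illustration; only George and his two award years come from the example.

```python
def find_1nf_problems(rows, key_fields):
    """Return a list of messages describing 1NF violations in rows."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Rule 1: each field has only one value.
        multi = [f for f, v in row.items() if isinstance(v, (list, tuple, set))]
        for field in multi:
            problems.append(f"row {index}: {field} holds more than one value")
        if any(f in multi for f in key_fields):
            continue  # a multi-valued key cannot be compared meaningfully
        # Rule 2: each row has a unique key.
        key = tuple(row[f] for f in key_fields)
        if key in seen_keys:
            problems.append(f"row {index}: duplicate key {key}")
        seen_keys.add(key)
    return problems


# Hypothetical data: departments and "Winner B" are placeholders.
awards = [
    {"Year": [2010, 2014], "Winner": "George", "Department": "Dept A"},
    {"Year": 2012, "Winner": "Winner B", "Department": "Dept B"},
]

problems = find_1nf_problems(awards, ["Year"])
print(problems)  # ['row 0: Year holds more than one value']
```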

Separating the 2010 and 2014 information into distinct rows makes the dataset compliant with 1NF. The Year column then provides a unique key for each row.
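The split described above can be sketched in a few lines of Python. This is an illustration only: it assumes rows are dicts with the multi-valued field stored as a list, and the helper name `split_multi_valued`, the department names, and the 2012 winner are placeholders that do not come from the original dataset.

```python
def split_multi_valued(rows, field):
    """Emit one row per value of `field`, copying the other fields."""
    result = []
    for row in rows:
        value = row[field]
        values = value if isinstance(value, (list, tuple)) else [value]
        for single in values:
            new_row = dict(row)      # copy so the input row is not modified
            new_row[field] = single
            result.append(new_row)
    return result


# Hypothetical data: only George's two award years come from the text.
awards = [
    {"Year": [2010, 2014], "Winner": "George", "Department": "Dept A"},
    {"Year": 2012, "Winner": "Winner B", "Department": "Dept B"},
]

normalized = sorted(split_multi_valued(awards, "Year"), key=lambda r: r["Year"])
years = [row["Year"] for row in normalized]
year_is_unique_key = len(years) == len(set(years))

print(years)               # [2010, 2012, 2014]
print(year_is_unique_key)  # True
```

Note that George's name and department now appear in two rows. That repetition is acceptable under 1NF, which only asks for single-valued fields and a unique key.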

Now let's extend the dataset back one more year, to 2009, when there was an unusual situation: two people, Sam and Maria, shared the award. Once again the dataset violates 1NF, this time because two fields, Winner and Department, hold multiple values.

We solve the problem as before, by splitting the 2009 record into two rows, one for each award recipient. Now, however, Year is no longer a unique key: 2009 appears twice, so we need another field to identify each row uniquely. Winner works nicely, so Year and Winner together become the unique key. A key made up of more than one field is called a compound key (also commonly called a composite key), and that leads nicely into the next article on Second Normal Form (2NF).
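To make the key problem concrete, here is a small Python sketch. It assumes the shared 2009 record stores Winner and Department as lists whose entries pair up by position, which is an assumption about how the original record was laid out. The department names and the helper `is_unique_key` are hypothetical.

```python
def is_unique_key(rows, key_fields):
    """True if no two rows share the same values for key_fields."""
    keys = [tuple(row[f] for f in key_fields) for row in rows]
    return len(keys) == len(set(keys))


# Hypothetical 2009 record: the departments are placeholders.
shared = {
    "Year": 2009,
    "Winner": ["Sam", "Maria"],
    "Department": ["Dept C", "Dept D"],
}

# One row per recipient, pairing each winner with their department.
split_2009 = [
    {"Year": shared["Year"], "Winner": winner, "Department": dept}
    for winner, dept in zip(shared["Winner"], shared["Department"])
]

awards = split_2009 + [
    {"Year": 2010, "Winner": "George", "Department": "Dept A"},
    {"Year": 2014, "Winner": "George", "Department": "Dept A"},
]

print(is_unique_key(awards, ["Year"]))            # False: 2009 appears twice
print(is_unique_key(awards, ["Year", "Winner"]))  # True
```

Winner alone would not work as a key either, because George appears in two rows. Only the combination of Year and Winner is unique.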

Friday, 20 March 2015

Back to the Basics - First Normal Form
