Sometimes it's good to go back to the basics for a refresher lesson. Last year I did some training with analysts and realized not everyone who works with data knows what data normalization actually means. Most have heard of it but not all understand it.
So I'm going to take a few posts and go back to the basics.
Data Normalization
The concept of data normalization was first described by
Edgar F. Codd in 1971 while he was working at IBM. He is credited with creating the theoretical basis for relational databases.
The purpose of normalized data structures included:
- minimizing data storage
- reducing data write times
- ensuring data consistency
- ensuring data queries would return meaningful results
The first 3 bullets can be accomplished by ensuring each data element is only stored once. The fourth bullet requires some additional structure. All four purposes were accomplished by the normalization rules that E.F. Codd developed.
First Normal Form
There are different definitions of the first normal form (1NF) on the internet, but I find it easiest to think of it in these terms:
- Each field has only one value
- Each row has a unique key
The dataset below shows the winners of the company's Employee of the Year award for the past few years. It violates 1NF because it contains two year values in the first row. It appears that George won the award in 2010 and again in 2014, so that information has been stored in a single record.
By separating out the 2010 and 2014 information into distinct rows, this dataset becomes compliant with 1NF. The Year column provides a unique key for each row.
However, let's take the dataset back one year when there was an unusual situation. Two people, Sam and Maria, shared the award in 2009. Once again our dataset violates 1NF because two fields have multiple values, Winner and Department.
We solve the problem just like we did before by separating out the 2009 record into two rows, one for each award recipient. However, now the Year is no longer a unique key. The year 2009 shows up twice, so we need to add another field to uniquely identify each row. Winner works nicely, so now Year and Winner together become the unique key. Whenever there is more than one field in a key, it is called a Compound Key, and that leads nicely into the
next article on Second Normal Form (2NF).