Friday 27 March 2015

Back to the Basics - Second Normal Form

This is the second article on normalized data structures. Click here to read about First Normal Form.

Second Normal Form

Second Normal Form (2NF) builds on 1NF: a dataset must first comply with 1NF, and when records are identified by a compound key, every non-key field must depend on the whole key rather than on just part of it. A compound key uses more than one field to uniquely identify a record.


The dataset above requires two fields to uniquely identify each row: Year and Winner. Therefore this table is compliant with 2NF.

If a table does not have a compound key, then any table in 1NF is automatically in 2NF as well.
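To make the compound-key idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the award example from these posts; the department values are hypothetical placeholders, not from the actual dataset.

```python
import sqlite3

# In-memory database for illustration only.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE award_winners (
        year       INTEGER,
        winner     TEXT,
        department TEXT,
        PRIMARY KEY (year, winner)  -- compound key: Year + Winner together
    )
""")

# Two winners in the same year are fine: the rows differ on Winner.
# (Department values here are made-up placeholders.)
con.execute("INSERT INTO award_winners VALUES (2009, 'Sam',   'Sales')")
con.execute("INSERT INTO award_winners VALUES (2009, 'Maria', 'IT')")

# But an exact duplicate of (year, winner) is rejected by the key.
try:
    con.execute("INSERT INTO award_winners VALUES (2009, 'Sam', 'Sales')")
except sqlite3.IntegrityError as e:
    print("rejected duplicate key:", e)
```

The database enforces uniqueness on the pair of fields, which is exactly what makes a compound key a key.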

Because 1NF and 2NF do not provide much value in a querying environment, they are rarely discussed in the real world. Either a table is 3NF compliant or it is not normalized at all.

However, the prevalence of spreadsheets means many datasets are created with a completely free data structure. Data can be organized in a spreadsheet without regard to any database rules, and as long as the data stays in a spreadsheet this isn't a problem. The challenge comes when spreadsheet data needs to be loaded into a database system or data warehouse -- that's when the need for structured data becomes apparent.

Friday 20 March 2015

Back to the Basics - First Normal Form

Sometimes it's good to go back to the basics for a refresher lesson. Last year I did some training with analysts and realized not everyone who works with data knows what data normalization actually means. Most have heard of it but not all understand it.
So I'm going to take a few posts and go back to the basics.

Data Normalization

The concept of data normalization was first described by Edgar F. Codd in 1971 while he was working at IBM. He is credited with creating the theoretical basis for relational databases.
The purposes of normalized data structures include:
  • minimizing data storage
  • reducing data write times
  • ensuring data consistency
  • ensuring data queries would return meaningful results
The first three bullets can be accomplished by ensuring each data element is stored only once. The fourth bullet requires some additional structure. All four purposes were accomplished by the normalization rules that E.F. Codd developed.

First Normal Form

There are different definitions of the first normal form (1NF) on the internet, but I find it easiest to think of it in these terms:
  • Each field has only one value
  • Each row has a unique key
The dataset below shows the winners of the company's Employee of the Year award for the past few years. It violates 1NF because it contains two year values in the first row. It appears that George won the award in 2010 and again in 2014, so that information has been stored in a single record.


By separating out the 2010 and 2014 information into distinct rows, this dataset becomes compliant with 1NF. The Year column provides a unique key for each row.


However, let's take the dataset back one year, when there was an unusual situation. Two people, Sam and Maria, shared the award in 2009. Once again our dataset violates 1NF, because two fields, Winner and Department, have multiple values.


We solve the problem just as we did before, by separating the 2009 record into two rows, one for each award recipient. However, now Year is no longer a unique key: the year 2009 shows up twice, so we need to add another field to uniquely identify each row. Winner works nicely, so Year and Winner together become the unique key. Whenever a key is made up of more than one field, it is called a compound key, and that leads nicely into the next article on Second Normal Form (2NF).
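A quick sketch in Python shows why Year alone fails as a key after the split, while the (Year, Winner) pair succeeds. The rows follow the example above; an extra 2010 row is included only to make the check meaningful.

```python
# Rows after splitting the shared 2009 award into two records.
rows = [
    (2009, "Sam"),
    (2009, "Maria"),
    (2010, "George"),
]

# Year alone is no longer unique: 2009 appears twice.
years = [year for year, _ in rows]
assert len(years) != len(set(years))

# But (Year, Winner) together identify every row uniquely,
# so the pair works as a compound key.
assert len(rows) == len(set(rows))

print("compound key (Year, Winner) is unique across all rows")
```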