Friday 24 April 2015

Back to the Basics - Snowflake Schema

In the last post we introduced the Star Schema. Next we will look at a modification of the Star Schema called the Snowflake Schema and explain why you should try to avoid it.

Snowflake Schema

The Star Schema is a de-normalized data structure designed to optimize aggregation query response times. It contains a fact table surrounded by a number of entity tables.


A common expansion of the Star Schema is the Snowflake.  A Snowflake Schema is simply a Star Schema where one or more of the entities have been partially normalized. This means some of the repeated details for an entity table have been hived off into a separate details table to reduce data repetition in the main entity table. An example of a Snowflake Schema is below.



Note that the original Star Schema tables are still there, but new Entity Details tables have been added around the edges, making it roughly resemble a snowflake.

Snowflake Schemas are rarely symmetric. Some entities may have multiple detail tables (such as Entity 1 above with 3) while other entities may have none (Entity 2).

What type of data is stored in the Entity Detail tables? Anything that is repeated frequently in the main Entity tables. Hierarchy levels or long textual description fields are common candidates for snowflaking.

For example, an Entity called Cost Center may have a hierarchy starting at the top with Company, then going down to Business Unit, Department and Cost Center. The facts are stored at the cost center level, so the Company, Business Unit and Department values will be repeated frequently in the entity table. Therefore, those details could be put in a separate entity details table called Cost Center Hierarchy with the cost center key remaining in the main entity table.
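As a rough sketch of that snowflaked arrangement (all table and column names here are my own, not from the article, and this is only one possible layout), the repeated hierarchy values are hived off into a details table and the main entity keeps the cost centre key plus a reference to it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical snowflaked Cost Center entity: Company, Business Unit and
# Department move to a details table; the main entity table keeps only the
# cost centre key, its name, and a reference to the hierarchy row.
conn.executescript("""
CREATE TABLE cost_center_hierarchy (
    hierarchy_key INTEGER PRIMARY KEY,
    company       TEXT,
    business_unit TEXT,
    department    TEXT
);

CREATE TABLE cost_center (
    cost_center_key  INTEGER PRIMARY KEY,
    cost_center_name TEXT,
    hierarchy_key    INTEGER REFERENCES cost_center_hierarchy (hierarchy_key)
);
""")
```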

Avoid Snowflaking

Now that we've reviewed what the Snowflake Schema is, here is why you should avoid using it: snowflaking works against the sole purpose of the Star Schema.

By normalizing an intentionally de-normalized data structure, the additional joins slow query response time in order to save a bit of disk space. This is ridiculous when the only reason the Star Schema exists is to improve query response time. Most instances where I have seen snowflaking used in real-world data models have been the result of a DBA (database administrator) returning to old habits and normalizing tables because that's what they've always done in the world of OLTP. It's a tough habit to break.
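To make the cost concrete, here is a hedged comparison using the hypothetical Cost Center tables sketched earlier (the fact table name is also mine): the same departmental total needs one extra join once the hierarchy has been snowflaked.

```python
# Star schema version: the department name sits directly on the entity table,
# so the aggregation needs only one join.
star_query = """
SELECT cc.department, SUM(f.amount) AS total
FROM fact_table f
JOIN cost_center cc ON cc.cost_center_key = f.cost_center_key
GROUP BY cc.department;
"""

# Snowflaked version: the department was hived off to the details table,
# so a second join is needed just to recover the same grouping column.
snowflake_query = """
SELECT h.department, SUM(f.amount) AS total
FROM fact_table f
JOIN cost_center cc          ON cc.cost_center_key = f.cost_center_key
JOIN cost_center_hierarchy h ON h.hierarchy_key    = cc.hierarchy_key
GROUP BY h.department;
"""
```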

One Exception: Bridge Tables

A situation that diagrammatically resembles a Snowflake Schema is a Bridge Table, although I do not consider it snowflaking. A bridge table contains the conversion mapping logic between two related entity tables.

Consider two data marts, each with its own star schema. One mart contains sales data and the other manufacturing data. We want to connect the two fact tables through the Region entities to permit analysis of sales and production data together. One way to accomplish this is to conform the Region dimensions so sales and manufacturing use the same regional boundaries. That would permit a single entity table to be used by both fact tables. However, that is not always possible from a business perspective. Sales divides geography based on the number of customers and prospects in an area, while manufacturing divides geography based on factory and warehouse locations. Forcing sales to use manufacturing's boundaries, or vice versa, would create a lot of unnecessary hassle for the business with essentially no real benefit.

An alternative is to use a Bridge Table that equates sales regions and production regions. For example, Toronto may be in sales region Central East but in production region Ontario. The diagram below shows the Regions Bridge Table linking these two data marts.


Again, this appears similar to a snowflake of the Regions entities, but it actually involves no normalization so I don't consider it snowflaking.
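Here is a hedged sketch of how the Toronto example might look (table and column names are mine, and I assume each sales region maps to a single production region): the bridge simply pairs the two region keys, and a query can roll sales facts up to production regions by walking through it.

```python
# Hypothetical bridge pairing the two Region entities.
bridge_ddl = """
CREATE TABLE region_bridge (
    sales_region_key      INTEGER,   -- e.g. Central East
    production_region_key INTEGER    -- e.g. Ontario
);
"""

# Sales revenue rolled up by production region, via the bridge.
bridge_query = """
SELECT pr.production_region_name, SUM(sf.revenue) AS total_revenue
FROM sales_fact sf
JOIN sales_region sr      ON sr.sales_region_key      = sf.sales_region_key
JOIN region_bridge rb     ON rb.sales_region_key      = sr.sales_region_key
JOIN production_region pr ON pr.production_region_key = rb.production_region_key
GROUP BY pr.production_region_name;
"""
```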

A different definition of Bridge Tables exists, stating that they resolve many-to-many relationships between entities, but those situations should typically be resolved through the fact table. Bridge Tables should resolve different aggregation standards for similar entities. In addition to the different geographical entities described above, time is another frequent candidate for bridging when different data marts require different fiscal reporting periods.

Outrigger Tables

One other term that deserves comment comes from Ralph Kimball, one of the grandfathers of data warehousing and star schemas. While he strongly advises against snowflaking, he does permit some normalization of the star schema using something he calls Outrigger Tables. I will let him define it from his website:

"A dimension can contain a reference to another dimension table. For instance, a bank account dimension can reference a separate dimension representing the date the account was opened. These secondary dimension references are called outrigger dimensions. Outrigger dimensions are permissible, but should be used sparingly. In most cases, the correlations between dimensions should be demoted to a fact table, where both dimensions are represented as separate foreign keys."
One of Ralph's colleagues, Margy Ross, elaborates further:
"Outriggers are similar to snowflakes in that they’re used for many-to-one relationships, however they’re more limited. Outriggers are dimension tables joined to other dimension tables, but they’re just one more layer removed from the fact table, rather than being fully normalized snowflakes. Outriggers are most frequently used when one standard dimension table is referenced in another dimension, such as a hire date attribute in the employee dimension table."
Margy continues:
"Like many things in life, outriggers are acceptable in moderation, but they should be viewed as the exception rather than the rule. If outriggers are rampant in your dimensional model, it’s time to return to the drawing board given the potentially negative impact on ease-of-use and query performance."
While I agree with Margy that outriggers should be used sparingly, I fail to see a distinction between snowflaked tables and outriggers. The only apparent difference is that if snowflaked tables are small, then we can call them "outriggers" and they become permissible. I don't see the need for such a distinction. Snowflakes are snowflakes, even though they come in different sizes.

Avoid snowflaking at all costs. If there are too many outriggers or bridges in your model, it usually means the fact table is the problem and it needs to be split into two (or more) fact tables. Further discussion about cube design will follow in a later post.

Friday 17 April 2015

Back to the Basics - Star Schema

In the last article we covered OLAP and OLTP. This article looks at the data models that support those two data structures.

Entity-Relationship Diagrams

In the mid-1970s, as normalized relational databases became popular, the Entity-Relationship Diagram (ERD) emerged as a method to model and understand them. Any relational database can be modeled using an ERD, even if it is not in Third Normal Form.

The most common form of ERD contains boxes, which represent tables, and lines connecting them, which represent joins on foreign keys. Within the boxes all of the fields are listed, with the key fields first.

Here is an example of an ERD for one portion of a financial system database involving journal entries. It is only one of hundreds of such diagrams that together describe all of the tables in this financial database, but it serves as a good example for our purposes. It is far too small to be legible, but it is easy to see that there are a lot of tables (boxes) and a lot of joins (lines) between them.

Analytical Weaknesses of 3NF

The diagram above makes it apparent how Third Normal Form complicates database querying. Whenever a query involves a join, it takes longer to process; the more joins in a query, the slower it runs. If you have a business question that involves linking two fields from two different tables, the query's performance will depend on where those two fields are located.

For example, if the two fields are in two adjacent tables, there will only be one join necessary and the query should run quickly. The diagram below highlights a query that links the Project table to the Journal Line table.

If, however, the two fields happen to be in the Project table and the Ledger Group table, the query will require 5 joins. The latter query will take noticeably longer to run than the first, even though both may be valid business questions to ask of the data.
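As a hedged illustration of the difference (only Project, Journal Line and Ledger Group are named in this discussion; every other table and column name below is a placeholder I invented), the two queries might look something like this:

```python
# Adjacent tables: Project joins directly to Journal Line -- one join.
one_join_query = """
SELECT p.project_name, SUM(jl.amount) AS total
FROM journal_line jl
JOIN project p ON p.project_id = jl.project_id
GROUP BY p.project_name;
"""

# Distant tables: reaching Ledger Group from Project means walking a chain of
# intermediate tables -- five joins in all (intermediate names are hypothetical).
five_join_query = """
SELECT p.project_name, lg.ledger_group_name, SUM(jl.amount) AS total
FROM journal_line jl
JOIN project p         ON p.project_id       = jl.project_id
JOIN journal_header jh ON jh.journal_id      = jl.journal_id
JOIN ledger_line ll    ON ll.journal_id      = jh.journal_id
JOIN ledger l          ON l.ledger_id        = ll.ledger_id
JOIN ledger_group lg   ON lg.ledger_group_id = l.ledger_group_id
GROUP BY p.project_name, lg.ledger_group_name;
"""
```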


Of course, queries can have far more than 5 joins. If we consider the other 100 such ERDs in this financial system, some business questions may occasionally require fields from different modules that involve a dozen or more joins.

This inconsistent analysis query performance in 3NF caused enough of a problem that an alternative data structure was created. Enter the Star Schema.

Star Schema

Whereas 3NF treats all tables as entities, Star Schema involves two different types of tables: fact tables and entity tables.

Fact tables contain records with numeric values, such as financial transactions, and foreign keys to the entities. Fact records are typically summed, though occasionally they are counted instead.

Entity tables contain descriptive data about the facts, such as department name or product details. They permit the "slicing and dicing" of the fact records.

Combined, the fact table and entity tables make up the Star Schema.


The advantage of the Star Schema is that no matter which entities you wish to query, the query involves a maximum of 2 joins. In addition, any join between entities must pass through the fact table. This guarantees there will be data to aggregate for any inter-entity query.
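A hedged sketch of that property (the fact and entity names here are hypothetical): even two entities from opposite sides of the star need only two joins, both through the fact table.

```python
# Maximum of two joins, no matter which pair of entities is involved.
two_join_query = """
SELECT d.department_name, p.product_name, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN department d ON d.department_key = f.department_key
JOIN product p    ON p.product_key    = f.product_key
GROUP BY d.department_name, p.product_name;
"""
```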

What is not apparent from this diagram is that it is not normalized. For example, if Entity 1 contains department data, the fields may be Company, Business Unit, Department and Cost Centre. While any cost centre may only appear once in this entity table, the Company, Business Unit and Department names may repeat thousands of times. This is acceptable in a Star Schema because it permits faster queries and users are not updating the records. Star Schema databases are typically updated nightly using automated processes, so repeating a business unit name thousands of times is not a problem.

Isn't denormalized data inherently inefficient though? Why not try to normalize those entity tables? That's the next topic.

Friday 10 April 2015

Back to the Basics - OLAP vs OLTP

In the last article we looked at Third Normal Form. In this article we look at a de-normalized data structure for Business Intelligence data analysis called OLAP.

OLAP vs. OLTP

In the 1980s a new method of structuring data for analysis emerged, called On-Line Analytical Processing (OLAP). It stood in contrast to the prevailing method for structuring data, which was On-Line Transactional Processing (OLTP).

OLTP was designed to accomplish certain purposes with regard to databases, namely:
  • efficient data storage
  • fast performance when writing data (either creating or updating records)
  • rapid query response time for returning individual records
OLTP accomplishes these purposes by strictly adhering to Third Normal Form (3NF). No data element is stored twice anywhere in the database, which ensures efficient storage and writing performance.

The majority of database systems in the world are OLTP. Think of your bank's system that tracks your chequing account transactions or your grocery store's Point-of-Sale system at the check-out counter that tallies your scanned products. Transactions still rule the world and OLTP systems will always have a prominent place in it.

However, for all its strengths, OLTP has a major weakness -- it can be slow when trying to aggregate large amounts of transaction data. For instance, my bank system is very fast at processing a cash withdrawal at an ATM or showing my last 10 transactions for the month. However, if an analyst at the bank wants to know the average deposit amount across all personal chequing accounts in eastern Canada in 2014, that may take a while. In fact, it may slow down the entire system while the query crunches through the millions of records necessary to find the answer.
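For a sense of what that analyst's question looks like against a normalized OLTP schema, here is a rough sketch (all table and column names are hypothetical): the joins are simple, but the query still has to crunch through every transaction row for the year.

```python
# Average 2014 deposit across personal chequing accounts in eastern Canada
# (a sketch against a hypothetical normalized banking schema).
avg_deposit_query = """
SELECT AVG(t.amount) AS avg_deposit
FROM account_transaction t
JOIN account a ON a.account_id = t.account_id
JOIN branch b  ON b.branch_id  = a.branch_id
WHERE a.account_type     = 'PERSONAL_CHEQUING'
  AND b.region           = 'EASTERN_CANADA'
  AND t.transaction_type = 'DEPOSIT'
  AND t.transaction_date BETWEEN '2014-01-01' AND '2014-12-31';
"""
```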

This is why OLAP was created. It had one purpose only:
  • fast performance for aggregation queries
Note there is nothing mentioned about efficient data storage or data writing performance. OLAP only cares about data aggregation performance. That single purpose is tremendously freeing because all other concerns become secondary. If storing the same data element in 3 places in the database will improve aggregation query performance, then we can do it. If it slows down data update performance, that's OK. If finding an individual record takes longer than in an OLTP system, who cares?

The main consequence of OLAP's singular purpose is that it does NOT follow 3NF. OLAP is a non-normalized data structure. Some Database Administrators (DBAs) are so used to 3NF that they try to force normalization into OLAP, but that defeats the purpose of OLAP. Normalization slows queries because it introduces additional table joins, and joins are the enemy of fast query performance.

In the next article we will compare the data models that support OLTP and OLAP.




Friday 3 April 2015

Back to the Basics - Third Normal Form

This is the last in a series of articles on data normalization. Click here for the previous article on Second Normal Form.

Third Normal Form

The best-known form of data normalization is Third Normal Form (3NF). When a database is said to be normalized, it usually means it's in 3NF.

Using the example from the previous article, we had our list of Employee of the Year award winners in 2NF. However, it is not in 3NF yet. In order to achieve that level of normalization, every attribute must describe the key.

In this case, the compound key is Year and Winner. The attributes are Reason and Department.

The Reason attribute describes the key. Why did the award go to Annette in 2011? Because she implemented the ERP system on time and on budget.

However, does Department actually describe the key? Is IT describing the award that went to Annette in 2011? It's really describing Annette more than the award. This means the table is not in 3NF.

Does it matter that this data is not in 3NF? It does. When we try to ask a question of the data like, "Tell me which department each winner is from," there is a challenge when we reach George Smith. He appears twice and so does his department. When they are the same, it doesn't matter which department value we return, but what if they differ? What if Department equals Sales for the 2010 record but Merchandising for the 2014 record? Which department is correct for George? Did he change departments between those dates? Or is one of the entries simply an error? How do we know which record is wrong and which is correct? If we need to update George's department, how many records do we need to update? This kind of query ambiguity is why 3NF is important.

To resolve this problem, we need to normalize the data structure. Since Department describes the employees and not the awards, we need to create a separate table for employees as below.


Now the Employee table has a key called Employee and the Department attribute describes the key. Both tables are now compliant with 3NF.
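As a small, runnable sketch of those two tables (the column names, George's department and the missing award reasons are my own placeholders for illustration), the department question now has exactly one answer per winner:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The two 3NF tables described above (layout is my own guess).
conn.executescript("""
CREATE TABLE employee (
    employee   TEXT PRIMARY KEY,
    department TEXT
);

CREATE TABLE award (
    year   INTEGER,
    winner TEXT REFERENCES employee (employee),
    reason TEXT,
    PRIMARY KEY (year, winner)
);
""")

# George appears twice in award but only once in employee, so his
# department is stored exactly once.
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("Annette", "IT"), ("George Smith", "Sales")])
conn.executemany("INSERT INTO award VALUES (?, ?, ?)",
                 [(2011, "Annette", "Implemented the ERP system on time and on budget"),
                  (2010, "George Smith", None),
                  (2014, "George Smith", None)])

# "Tell me which department each winner is from" -- one unambiguous answer each.
for year, winner, department in conn.execute("""
        SELECT a.year, a.winner, e.department
        FROM award a
        JOIN employee e ON e.employee = a.winner
        ORDER BY a.year"""):
    print(year, winner, department)
```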

In honour of the creator of Third Normal Form, it can be summarized in one sentence, often referred to as the Database Administrator's Oath:

Each attribute will provide a fact about the key, the whole key, and nothing but the key, so help me Codd!

Benefits of Using 3NF


Note that George is listed twice in the Award table because he won two awards, but he is listed once in the Employee table. There is no longer any ambiguity about his department. It is stored once, so queries will always return the correct department for him and any changes to his department will only have to be written in one place. To repeat from our first article, the main benefits of using normalized data structures are as follows:

  • minimizing data storage
  • reducing data write times
  • ensuring data consistency
  • ensuring data queries return meaningful results

Sometimes the purpose of our database differs from these goals, though. What if data storage is cheap and we no longer care about reducing write times because we rarely update a record once it is written? This is where de-normalized data structures come into play. We will look at one such example in the next article.