Friday 22 May 2015

Designing Data Cubes - Part 2

In Part 1, we looked at identifying the facts and dimensions in your data. The next step is to organize the data elements into distinct cubes.

Step 2 - Pick the Grain of the Fact Table

In data cubes, grain is not referring to wheat. Grain refers to the level of detail that the facts describe. Every fact in a fact table should be at the same grain. For instance, with time one fact could measure a daily amount and another could measure an annual amount. If you need both types of measures in your data warehouse, it is best to use two fact tables at a minimum.

This principle is key for designing data cubes. Each star schema only contains one fact table, so if your data demands multiple fact tables then your data mart will require multiple cubes. Do not try to jam different grains of data into a single cube as it will cause a lot of hassles when you try to query it.

For example, financial data may be budgeted at the monthly level but actual expenses may be tracked at a weekly level. The temptation is to pack them into a single cube because the budget reports contain both actual and budgeted expenses. Do not fall into that trap. Keep the fact table grains consistent and separate, and then combine the data from both cubes when it's time to build the report.

Even though grain is always discussed in the context of the fact table, it really is the dimensions that determine grain. Like the example above, the Time dimension determines the lowest possible grain in its associated fact table. If Time only contains months and years, then the fact table cannot store data at a daily grain. Similarly, if the Geography dimension only goes to the Province level of detail, the fact table cannot store data at the City level.

Aren't Multi-Grain Fact Tables Healthy?

What's the big deal? Why can't I mix the grains in one fact table if they are regularly needed together for reporting? Multi-grain bagels are supposed to be good for me, so why not multi-grain cubes?

Of course anyone can mix grains in a single fact table. No one has to follow best practices. Just like you can build a transactional database that doesn't follow 3rd Normal Form, you can build a multi-grain fact table. It will just make your life difficult down the road.

For example, you may choose to keep budgets at a monthly level and actual expenses at a daily level in a single cube. What are the consequences?

Performance.

Cubes require that all facts be stored at the lowest grain. Even though you want to store the budget measure at a monthly level, you must store it at the daily level. You will have to choose how to do that. You could divide your monthly $100 Office Supplies budget by 30 and store $3.33 in each day of April. This means you need to include logic in your ETL to know how many days are in each month. Some reports may now have rounding errors when you report your monthly budgets. That may or may not be a problem for your users, or it may be a bit of hassle for your BI report writers to hide the rounding issues. You could also decide to store the entire monthly budget in the first day of each month to avoid rounding problems, but this means the budget amounts could disappear if a user is allowed to drill down to the weekly or daily levels.

The performance is affected during write processes (more cells to write to, and probably some minor arithmetic) and during read processes (more cells to aggregate). Write processes are typically done at night using ETL programs, so most users will not notice it. If however you are doing multiple data refreshes during the day or near-real-time mini-batches, performance hits can have major implications. If ETL processes are being run hourly during the day and they now take 2 hours to complete, your near-real-time data warehouse is going to turn back into a nightly data warehouse.

The example of monthly versus daily grain will not result in a significant increase in the number of potential data cells. If however you have a Product dimension and you are considering putting a budget grain of 500 Product Categories and an actual expense grain of 50,000 Products in the same cube, the performance impact could be huge.

In the end, the decision to mix grains is one of estimating the performance degradation against the convenience of having both of the different grains of facts together in one cube. If your cube is small, you can get away with it.

No comments:

Post a Comment