Friday, 15 May 2015

Designing Data Cubes - Part 1

OLAP data cubes have become very common and the current generation of BI tools are making it easier to create a cube quickly and without a lot of training. This is a good thing, but it also means that some people who build their own cubes have not had the opportunity to learn the design principles that underlie strong cube performance. This article (and a few subsequent ones) will cover some of the key design steps necessary to have a well-functioning data cube.

There are lots of books and websites out there that teach you how to build data warehouses and cubes, and some of them are excellent (such as the Kimball Group). I am not going to duplicate those but instead focus on a few issues I've seen in my experience, especially with smaller implementations of data cubes. It seems the smaller the project is, the less likely that good cube design principles will be used and the more likely that performance and maintenance problems will develop.

Step 1 - Distinguish the Facts from the Dimensions

When it's time to build a new data cube (or your first one), most data analysts generally know which datasets they want included. (It's not always the case, but it's generally true.) However you cannot just shovel piles of data into a cube and expect it to answer all your business questions quickly and correctly. You need to structure your data cube properly based on the nature and characteristics of your specific data.

The first step is to look at the data that is available and distinguish the facts from the dimensions.

Facts

Facts are the data elements that sum. Terms like amount, dollars, units or hours usually signify that we're dealing with a fact. Some facts are counted or averaged rather than summed but it is an aggregation nonetheless.

Any transactional dataset will typically have a numeric element that can be aggregated. Transactions include things such as payments, purchases, deposits, orders, and sales.

Facts also tend to change frequently and apply either to a point in time or a relatively short time period.

Dimensions

Dimensions are descriptive data elements that provide more details about the facts. They permit the aggregated transactional data to be "sliced 'n' diced." Examples of dimensions are time, product, department, account, region and customer.

Dimensions change infrequently and the tend to apply to relatively longer periods of time.

Dimensions typically do not have elements that can be summed, but that is not a hard-and-fast rule. For example, a Product source dataset could have a descriptive field called Product Weight which stores the number of grams per package. Weight is numeric but it is not necessarily transactional. If users are going to slice the transactional data by package size (e.g. sales of 100g vs 500g packages) then this numeric field is descriptive and belongs in a dimension. If however users want to report on the average package size sold, then the Product Weight field needs to be aggregated and therefore belongs with the facts.

Don't Mix Them Up!

Much grief can be avoided by correctly identifying your facts and dimensions before starting any construction of your cube. When I find slow-performance cubes with incredibly complex queries or lots of dummy dimension values, it usually means the cube was poorly designed at the start. When you put a dimension in a fact table or vice-versa, it will make your life difficult for as long as that cube exists.

Examples

  1. Department
    A department (such as Human Resources, Finance, Research & Development) is a data element that describes an organization. It is something that we would want to use to slice 'n' dice our financial data. Therefore it must be classified as a dimension.
  2. Purchase Orders
    A company uses purchase orders when they want to buy something from one of their suppliers. There tend to be lots of purchase orders and they usually are active until the item is received. They are typically linked to invoices and shipping receipts. Therefore purchase orders are
    transactional and must be classified as a fact.
  3. Employee Headcount
    A useful indicator on financial statements is the number of employees in the company. Since "count" is in the data element name, it's obvious this element needs to be aggregated. Therefore Headcount must be a fact.
  4. Tweets
    This is not a typical data element that would be found in a cube, but let's assume a company wants to track every tweet that is sent with their company's official hashtag. Would tweets be a dimension or a fact?
    Tweets are extremely short-lived. You would never want to slice 'n' dice other data by a single tweet. Therefore tweets would be considered as facts even though they may not contain numeric data. In practice, tweet counts would be the most likely use of a tweet data cube.
  5. Assets
    Assets are things that by definition are long-lasting, which means they must be dimensions. Assets such as a building or a computer server will last multiple years and therefore will have multiple transactions applied to them over time.
    Be aware though that assets also generate transactions, so a data element such as Asset Purchase would be a fact because purchases are transactions. Similarly Asset Depreciation Expense is a numeric amount that is calculated each month or year over the course of its life and therefore is a fact. Do not let a descriptive term like Asset blind you to the real nature of the data element.

Facts or Dimensions? Take the Quiz

Demonstrate your ability for identifying business data elements as either facts or dimensions! It's simple to use and it's quick.



Want to understand why the answers are what they are? See the answer discussion page here.

No comments:

Post a Comment