Friday 29 May 2015

Designing Data Cubes - Part 3

This is part of a series on designing OLAP cubes. Click here for the last article.

Step 3 - Designing the Dimensions

So far we've looked at distinguishing the facts from the dimensions in the source data and ensuring each fact table is at one common grain. Now we can focus on designing the dimensions.

With dimensions, you really only have two main design decisions:
  • Which dimensions are required in this cube?
  • What will the dimension hierarchies be, if any?
These appear to be simple questions, but some datasets can cause confusion when setting up the dimensions. Getting the dimension design wrong can result in poor cube performance and sometimes erroneous report results.

Some dimensions are easy to design, such as Time. Everyone knows that Days roll up to Weeks, roll up to Months, roll up to Quarters, roll up to Years. However, it is not always clear which data elements belong together in one hierarchy and which ones deserve their own dimensions.

For example, the data element Sales Region rolls up to Sales District and drills down to Customer. It's obvious those elements belong together in one hierarchy. However, Sales Manager rolls up to Sales Director and also drills down to Customer. Customer is common to both potential hierarchies. Should the Sales Managers and Directors be part of the Sales Region hierarchy or should they be in their own dimension? Or should they be an alternate hierarchy above Customers in one dimension? The answer depends on the specific nature of these data elements for this particular company.
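To make the two candidate roll-up paths concrete, here is a minimal Python sketch. All customer, manager, district and region names are invented for illustration; the point is only that the same leaf (Customer) rolls up two different paths.

```python
# Geographic hierarchy: Customer -> Sales District -> Sales Region
district_of = {"Acme Corp": "District 7", "Beta Ltd": "District 7"}
region_of_district = {"District 7": "West Region"}

# Management hierarchy: Customer -> Sales Manager -> Sales Director
manager_of = {"Acme Corp": "J. Smith", "Beta Ltd": "K. Jones"}
director_of_manager = {"J. Smith": "P. Brown", "K. Jones": "P. Brown"}

def rollup_geographic(customer):
    """Roll a customer up the geographic path."""
    return region_of_district[district_of[customer]]

def rollup_management(customer):
    """Roll a customer up the management path."""
    return director_of_manager[manager_of[customer]]

print(rollup_geographic("Acme Corp"))  # West Region
print(rollup_management("Acme Corp"))  # P. Brown
```

Whether these two paths belong in one dimension as alternate hierarchies or in two separate dimensions is exactly the design decision the data alone cannot settle.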

I have found there are two guidelines that have helped me answer these dimension design questions:
  1. Eliminate ALL unnecessary dimensions and hierarchy levels from the cube
  2. Use Entity-Relationship diagrams to guide the dimension design

Eliminate Unnecessary Dimensions and Hierarchy Levels

The common temptation I have seen in many workplaces is to build a cube that can answer every ad hoc question that could ever be asked of the data. That is the worst thing anyone could do when designing a cube. Always design cubes to accomplish a particular purpose. By implication, this means the cube will not accomplish every purpose. That's OK because cubes designed to accomplish all purposes usually end up accomplishing none of them.

If none of your organization's reports contain weekly summary levels, then don't include weeks in your time dimension hierarchy. If your scheduling system books staff by the shift, then don't build a reporting cube that shows hourly schedules. Keep your cubes consistent with your business processes.

When I review cubes and ask why certain dimensions have been designed the way they are, there are some terms that can indicate unnecessary dimensions in a cube:
  • Just in case:  "We added Weeks to the Time hierarchy just in case someone would want to build a weekly report."  If no one has had a business need to build a weekly report yet, don't presume that business need is suddenly going to appear after you build your cube.
  • Future requirement: "We built hourly schedule reporting functionality because the HR manager said that they might consider moving to hourly staff scheduling in a couple years." Don't build to imagined future requirements. Build to today's requirements to ensure you meet today's requirements. Failing to meet today's requirements but meeting non-existent future requirements does not equal success.
  • One person asked for it: "We didn't see any existing reports with this dimension, but one user requested it." If that one user is the project owner who is funding this cube, then it's probably a good idea to include it. Otherwise, leave it out.
  • It was in the source data, so we put it in the cube: "We put every data element we found into the cube." This may sound comprehensive but it is too large a purpose for one cube. Most source systems have data elements that are no longer used, so including everything guarantees a cluttered cube.
Keeping the cube dimensions and hierarchies clear of clutter improves performance significantly. Be ruthless when considering what elements should be included and excluded from your cube.


The second guideline will be tackled in the next post.

Friday 22 May 2015

Designing Data Cubes - Part 2

In Part 1, we looked at identifying the facts and dimensions in your data. The next step is to organize the data elements into distinct cubes.

Step 2 - Pick the Grain of the Fact Table

In data cubes, grain is not referring to wheat. Grain refers to the level of detail that the facts describe. Every fact in a fact table should be at the same grain. For instance, with time, one fact could measure a daily amount and another could measure an annual amount. If you need both types of measures in your data warehouse, it is best to use at least two fact tables.

This principle is key for designing data cubes. Each star schema only contains one fact table, so if your data demands multiple fact tables then your data mart will require multiple cubes. Do not try to jam different grains of data into a single cube as it will cause a lot of hassles when you try to query it.

For example, financial data may be budgeted at the monthly level but actual expenses may be tracked at a weekly level. The temptation is to pack them into a single cube because the budget reports contain both actual and budgeted expenses. Do not fall into that trap. Keep the fact table grains consistent and separate, and then combine the data from both cubes when it's time to build the report.
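As a sketch of that report-time combination, here is a small Python example. The account names and amounts are made up, and the "cubes" are simple in-memory structures: budget facts at monthly grain, actual facts at weekly grain, rolled up and joined only in the report layer.

```python
from collections import defaultdict

# Budget facts at monthly grain: (month, account) -> amount
budget = {("2015-04", "Office Supplies"): 100.00}

# Actual expense facts at weekly grain: (month, week, account, amount)
actuals = [
    ("2015-04", 1, "Office Supplies", 22.50),
    ("2015-04", 2, "Office Supplies", 31.00),
    ("2015-04", 3, "Office Supplies", 18.75),
    ("2015-04", 4, "Office Supplies", 25.00),
]

# Roll the weekly actuals up to the budget's monthly grain...
actual_by_month = defaultdict(float)
for month, week, account, amount in actuals:
    actual_by_month[(month, account)] += amount

# ...then combine the two grains only when building the report.
report = {
    key: {"budget": budget[key], "actual": actual_by_month[key]}
    for key in budget
}
print(report[("2015-04", "Office Supplies")])
# {'budget': 100.0, 'actual': 97.25}
```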

Even though grain is always discussed in the context of the fact table, it really is the dimensions that determine grain. As in the example above, the Time dimension determines the lowest possible grain in its associated fact table. If Time only contains months and years, then the fact table cannot store data at a daily grain. Similarly, if the Geography dimension only goes to the Province level of detail, the fact table cannot store data at the City level.

Aren't Multi-Grain Fact Tables Healthy?

What's the big deal? Why can't I mix the grains in one fact table if they are regularly needed together for reporting? Multi-grain bagels are supposed to be good for me, so why not multi-grain cubes?

Of course anyone can mix grains in a single fact table. No one has to follow best practices. Just like you can build a transactional database that doesn't follow 3rd Normal Form, you can build a multi-grain fact table. It will just make your life difficult down the road.

For example, you may choose to keep budgets at a monthly level and actual expenses at a daily level in a single cube. What are the consequences?

Performance.

Cubes require that all facts be stored at the lowest grain. Even though you want to store the budget measure at a monthly level, you must store it at the daily level. You will have to choose how to do that. You could divide your monthly $100 Office Supplies budget by 30 and store $3.33 in each day of April. This means you need to include logic in your ETL to know how many days are in each month. Some reports may now have rounding errors when you report your monthly budgets. That may or may not be a problem for your users, or it may be a bit of hassle for your BI report writers to hide the rounding issues. You could also decide to store the entire monthly budget in the first day of each month to avoid rounding problems, but this means the budget amounts could disappear if a user is allowed to drill down to the weekly or daily levels.
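The rounding trade-off can be demonstrated in a few lines, using the $100 budget from the example:

```python
# Allocating a $100 monthly budget across 30 days at 2-decimal precision
# introduces a rounding discrepancy when the days are re-aggregated.
monthly_budget = 100.00
days_in_month = 30

daily_amount = round(monthly_budget / days_in_month, 2)   # 3.33 per day
reaggregated = round(daily_amount * days_in_month, 2)     # sum of the 30 days

discrepancy = round(monthly_budget - reaggregated, 2)
print(daily_amount, reaggregated, discrepancy)  # 3.33 99.9 0.1
```

Ten cents may be immaterial, but it is exactly the kind of discrepancy that report writers end up having to explain or hide.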

The performance is affected during write processes (more cells to write to, and probably some minor arithmetic) and during read processes (more cells to aggregate). Write processes are typically done at night using ETL programs, so most users will not notice it. If however you are doing multiple data refreshes during the day or near-real-time mini-batches, performance hits can have major implications. If ETL processes are being run hourly during the day and they now take 2 hours to complete, your near-real-time data warehouse is going to turn back into a nightly data warehouse.

The example of monthly versus daily grain will not result in a significant increase in the number of potential data cells. If however you have a Product dimension and you are considering putting a budget grain of 500 Product Categories and an actual expense grain of 50,000 Products in the same cube, the performance impact could be huge.
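A quick back-of-the-envelope calculation shows why, using the member counts from the examples above (with a simplifying assumption of 30-day months, and all other dimensions held fixed):

```python
# Time grain change: months -> days over a 2-year cube.
months = 24
days = months * 30            # ~720 members at daily grain
time_factor = days / months   # 30x more potential cells

# Product grain change: categories -> individual products.
product_categories = 500
products = 50_000
product_factor = products / product_categories  # 100x more potential cells

print(time_factor, product_factor)  # 30.0 100.0
```

The potential cell count scales with the product of the dimension sizes, so the Product change is a far bigger performance risk than the Time change.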

In the end, the decision to mix grains is one of estimating the performance degradation against the convenience of having both of the different grains of facts together in one cube. If your cube is small, you can get away with it.

Friday 15 May 2015

Designing Data Cubes - Fact or Dimension Quiz Answers

Have you taken the Facts or Dimensions quiz?  Here are some explanations for the answers.

Note that the quiz website randomizes the question order, so your quiz will be in a different order than the list below.

Quiz Answers

  • Customer - Dimension
    This is a data element that we would want to use to slice 'n' dice our data, such as grouping sales dollars by customer. Therefore it must be a dimension.
  • Invoice - Fact
    An invoice represents a sales transaction, and transactional data elements are facts.
  • Region - Dimension
    This is a common dimension that permits geographical slicing of the data.
  • Fiscal Year - Dimension
    Time is the most common dimension in cubes. Whether it's calendar or fiscal time periods, time will always be a dimension.
  • Company - Dimension
    If a corporation owns multiple companies, then there would be a need to slice the data by company. This must be a dimension.
  • Cost Center - Dimension
    Cost centers are similar to departments but typically at a lower level of detail, such as a function or project. Since departments are modeled as a dimension, then cost centers must be a dimension as well.
  • Asset Purchases - Fact
    As discussed in the blog article, purchases are transactions and are therefore facts.
  • Basketball Sales - Fact
    Sales are always transactions, regardless of the item being sold.
    (And yes, data cubes are flexible enough to store data on round objects!)
  • Currency Conversion Rate - Fact
    Currency rates change rapidly, so that means it must be a fact.
  • Product Inventory Counts - Fact
    Inventory counts are numeric and would need to be aggregated, so they need to be modeled as a fact.
  • Financial Account - Dimension
    Financial accounts are categories such as Office Supplies Expense or Interest Earned. The finances always need to be aggregated by accounts for financial reporting (i.e. slice the data by account), so accounts will always be a dimension.
  • Insurance Policy Type - Dimension
    Type implies a category of some kind. Even though we don't know what policy types might be included here (e.g. auto or home, personal or business, etc.) we know that there will always be fewer policy types than policies themselves. Since type would be used to slice the policy transactions in some way, it must be a dimension.
  • Patents - Dimension
    This is an unusual element to include in a data cube, but it's possible. Patents are very long-lasting, being in effect for multiple decades. If royalties are being paid to us by other companies who are using our patent, then those royalty transactions could be grouped by patent so we could see which ones are contributing the most royalty revenues. Patents have to be a dimension.
  • Contracts - Fact or Dimension
    This is a difficult one to model because businesses apply the term broadly enough that it can behave as either a dimension or a fact.
    The most common application of a contract would be as a dimension, where contract is a legal container for many invoices to be charged against it. Think of an office building construction project. The entire project may last for years, and a contract will exist for all the truckloads of cement required and another contract for all the electrical installation work, etc. Many invoices will be generated over the years under each contract. Clearly contract is a dimension.
    However, now think of a self-employed handyman who does bathroom renovations for home owners. Each contract lasts 1-4 weeks and only one invoice is ever billed per contract. In that case, contract is essentially equivalent to an invoice and it behaves as a transaction. You would never want to group data by contract because there is only ever one transaction per contract. There is no data to aggregate. Even though these contracts are still functioning as legal agreements, in this situation contract must be treated as a fact.
    This is a great example of how a single business term can mean different things in different organizations, and why it is so important to understand your organization's specific requirements before assuming how the data model and cubes should look.
  • Employee Satisfaction Survey Responses - Fact
    Survey responses are usually numerous (at least you hope so -- response rates are often low!) and they typically contain answers that can be tallied numerically. Numeric data like this must be modeled as a fact. The tricky part with this data element is that it can be long-lasting, in that the survey may only be taken once per year. Even though it is not a frequent transaction, it is still a transaction. A similar example would be asset depreciation, which also may only be calculated once per year. Infrequent transactions are still transactions and must therefore be modeled as facts.

Click here to return to the original article.

Designing Data Cubes - Part 1

OLAP data cubes have become very common, and the current generation of BI tools makes it easier to create a cube quickly and without a lot of training. This is a good thing, but it also means that some people who build their own cubes have not had the opportunity to learn the design principles that underlie strong cube performance. This article (and a few subsequent ones) will cover some of the key design steps necessary to have a well-functioning data cube.

There are lots of books and websites out there that teach you how to build data warehouses and cubes, and some of them are excellent (such as the Kimball Group). I am not going to duplicate those but instead focus on a few issues I've seen in my experience, especially with smaller implementations of data cubes. It seems the smaller the project is, the less likely that good cube design principles will be used and the more likely that performance and maintenance problems will develop.

Step 1 - Distinguish the Facts from the Dimensions

When it's time to build a new data cube (or your first one), most data analysts generally know which datasets they want included. (It's not always the case, but it's generally true.) However, you cannot just shovel piles of data into a cube and expect it to answer all your business questions quickly and correctly. You need to structure your data cube properly based on the nature and characteristics of your specific data.

The first step is to look at the data that is available and distinguish the facts from the dimensions.

Facts

Facts are the data elements that sum. Terms like amount, dollars, units or hours usually signify that we're dealing with a fact. Some facts are counted or averaged rather than summed, but these are aggregations nonetheless.

Any transactional dataset will typically have a numeric element that can be aggregated. Transactions include things such as payments, purchases, deposits, orders, and sales.

Facts also tend to change frequently and apply either to a point in time or a relatively short time period.

Dimensions

Dimensions are descriptive data elements that provide more details about the facts. They permit the aggregated transactional data to be "sliced 'n' diced." Examples of dimensions are time, product, department, account, region and customer.

Dimensions change infrequently and they tend to apply to relatively longer periods of time.

Dimensions typically do not have elements that can be summed, but that is not a hard-and-fast rule. For example, a Product source dataset could have a descriptive field called Product Weight which stores the number of grams per package. Weight is numeric but it is not necessarily transactional. If users are going to slice the transactional data by package size (e.g. sales of 100g vs 500g packages) then this numeric field is descriptive and belongs in a dimension. If however users want to report on the average package size sold, then the Product Weight field needs to be aggregated and therefore belongs with the facts.
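A short sketch of the two treatments of Product Weight, with invented sales rows, shows the difference: the same numeric field is used once as a slicer and once as an aggregated measure.

```python
from collections import defaultdict

# Illustrative sales rows: (product, package_weight_g, units_sold)
sales = [
    ("Coffee A", 100, 40),
    ("Coffee B", 500, 10),
    ("Coffee C", 100, 25),
]

# Used as a dimension: slice unit sales by package size.
units_by_weight = defaultdict(int)
for product, weight, units in sales:
    units_by_weight[weight] += units
print(dict(units_by_weight))  # {100: 65, 500: 10}

# Used as a fact: average package size sold, weighted by units.
total_units = sum(units for _, _, units in sales)
avg_weight_sold = sum(weight * units for _, weight, units in sales) / total_units
print(round(avg_weight_sold, 1))  # 153.3
```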

Don't Mix Them Up!

Much grief can be avoided by correctly identifying your facts and dimensions before starting any construction of your cube. When I find slow-performance cubes with incredibly complex queries or lots of dummy dimension values, it usually means the cube was poorly designed at the start. When you put a dimension in a fact table or vice-versa, it will make your life difficult for as long as that cube exists.

Examples

  1. Department
    A department (such as Human Resources, Finance, Research & Development) is a data element that describes an organization. It is something that we would want to use to slice 'n' dice our financial data. Therefore it must be classified as a dimension.
  2. Purchase Orders
    A company uses purchase orders when they want to buy something from one of their suppliers. There tend to be lots of purchase orders and they usually remain active until the item is received. They are typically linked to invoices and shipping receipts. Therefore purchase orders are transactional and must be classified as a fact.
  3. Employee Headcount
    A useful indicator on financial statements is the number of employees in the company. Since "count" is in the data element name, it's obvious this element needs to be aggregated. Therefore Headcount must be a fact.
  4. Tweets
    This is not a typical data element that would be found in a cube, but let's assume a company wants to track every tweet that is sent with their company's official hashtag. Would tweets be a dimension or a fact?
    Tweets are extremely short-lived. You would never want to slice 'n' dice other data by a single tweet. Therefore tweets would be considered as facts even though they may not contain numeric data. In practice, tweet counts would be the most likely use of a tweet data cube.
  5. Assets
    Assets are things that by definition are long-lasting, which means they must be dimensions. Assets such as a building or a computer server will last multiple years and therefore will have multiple transactions applied to them over time.
    Be aware though that assets also generate transactions, so a data element such as Asset Purchase would be a fact because purchases are transactions. Similarly Asset Depreciation Expense is a numeric amount that is calculated each month or year over the course of its life and therefore is a fact. Do not let a descriptive term like Asset blind you to the real nature of the data element.
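The tweet example (number 4 above) is worth a tiny sketch: the fact carries no numeric measure, so the natural aggregation is a count along the Time dimension. The sample tweets below are invented.

```python
from collections import Counter

# Hypothetical tweets captured for the company's official hashtag.
tweets = [
    {"date": "2015-05-01", "user": "a"},
    {"date": "2015-05-01", "user": "b"},
    {"date": "2015-05-02", "user": "a"},
]

# Counting tweets per day -- the most likely use of a tweet fact table.
tweets_per_day = Counter(t["date"] for t in tweets)
print(dict(tweets_per_day))  # {'2015-05-01': 2, '2015-05-02': 1}
```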

Facts or Dimensions? Take the Quiz

Test your ability to identify business data elements as either facts or dimensions! It's quick and simple.



Want to understand why the answers are what they are? See the answer discussion page here.

Friday 8 May 2015

Back to the Basics - Crow's Foot Diagrams

In an earlier post we touched on the topic of Entity-Relationship Diagrams (ERDs). In this post we will explain one of the more common diagramming standards for ERDs. To begin though we need to understand the types of relationships between entities.

Entity Relationships

In ERDs, entities are shown as boxes and relationships between them are shown as lines. For example, the entity Author has a relationship to the entity Book.

While knowing that a relationship exists between these entities is helpful, it would be more helpful to convey information about the nature of that relationship in the diagram. For instance, does every author have a relationship to a book or only some authors? Can a book exist without an author? Can a single author write multiple books? Can a book be written by multiple authors?

These relationship questions boil down to two that apply to every entity relationship:
  1. Is the relationship optional or mandatory?
  2. Is a single relationship or multiple relationships possible per instance of each entity?
The first question deals with the minimum number of connections (0 if optional, 1 if mandatory) and the second question deals with the maximum number of connections (1 or many). We don't try to specify a firm maximum number because whether it's 2 or 2 million, the impact on the database design will be the same.

Many relationship diagramming methods exist for ERDs but I have found the Crow's Foot design the easiest for me to understand and the most commonly used in the companies I've worked with.

Crow's Foot Diagrams

The Crow's Foot connector uses 3 symbols to describe the minimum and maximum number of relationships: 0, 1 and < (for "many" -- I think of it as a greater-than sign).

Two symbols are used at the end of each line. The first is the minimum and the second is the maximum. There are four acceptable combinations of values as shown below.

The "many" symbol resembles a crow's foot, hence the name of this diagramming method.

Some people view the number zero as a letter "O" which stands for "Optional." Both interpretations are equivalent.

Now we can modify the relationship diagram for Authors and Books.

We can read the relationships in both directions. The symbol at the destination end of the line is the one that applies to that direction. These connectors show that each author writes one to many books and each book is written by one to many authors.

Note that these relationships are mandatory (i.e. minimum of one). If an author has not written any books, they cannot be considered an author. Similarly a book cannot exist without an author.

Note that the greater-than sign always means "many" even if it appears backwards (like a less-than sign).

Here is a finance example from a General Ledger data model.

In this system, financial accounts can have sub-accounts but they are not mandatory. For example, the account Office Supplies Expense may have sub-accounts called Copier Paper, Copier Toner and General Office Supplies. However the account Electricity Utility Expense has no sub-accounts because there isn't any further cost breakdown possible. This diagram reads that each account can have zero to many sub-accounts, but that each sub-account must belong to one and only one account. While sub-accounts are optional, once one is created then it has a mandatory relationship back to a single account. Sub-accounts cannot be orphans.

Next example: users and software licences.

Each user must have one and only one licence. (We will stick to the legal options only -- no pirated software!) If a person does not have a licence then they cannot be considered a user of that software. However a licence may be purchased and not assigned to any user. It may be waiting for a new hire who starts next quarter. Therefore a licence can be assigned to zero or one user.

You can begin to see that the definition of the entity can affect the definition of the relationship. The entity User contains an inherent assumption that must be understood in order for the relationship to be modeled correctly, specifically that there is no pirated software allowed.


The importance of entity definition shows up in this example as well. Here a customer buys one to many products, while a product may be bought by no one or many customers. If a person doesn't buy any of our products, they cannot be considered our customer. If however we want to track everyone who walks into our store, then we need a different model. Some people could walk in, look around and leave without buying and therefore without becoming customers. If we call everyone who enters our store "Shoppers" then that model would look like the following:

Now the connector from Shopper to Product becomes optional.

But What About Other Combinations?

You may have noticed that selecting two values (min, max) from three different symbols (0, 1, <) should give us more than just four combinations. Obviously the maximum cannot be less than the minimum so those situations are nonsensical. However, there still are two other combinations:
  • Zero and only Zero -- this effectively means there is no relationship between these two entities and therefore no line, so it is a non-existent situation in an ERD.
  • Many and Many -- both the minimum and the maximum are "many," meaning the minimum is greater than one. This situation sounds plausible.
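The combination arithmetic can be verified with a short enumeration, ordering the symbols 0 < 1 < many:

```python
from itertools import product

# Order the three symbols so "max < min" pairs can be detected.
order = {"0": 0, "1": 1, "<": 2}
symbols = ["0", "1", "<"]

all_pairs = list(product(symbols, repeat=2))                        # 9 pairs
nonsensical = [(mn, mx) for mn, mx in all_pairs if order[mx] < order[mn]]
standard = [("0", "1"), ("1", "1"), ("0", "<"), ("1", "<")]
others = [p for p in all_pairs if p not in nonsensical and p not in standard]

print(len(all_pairs), len(nonsensical), len(standard), others)
# 9 3 4 [('0', '0'), ('<', '<')]
```

Nine pairs minus three nonsensical ones leaves the four standard combinations plus the two discussed here.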
In fact, it's not hard to imagine an entity relationship where the minimum is 2. Consider the diagram below for a genealogy database:

Each parent can have one to many children, but each child must have 2 and only 2 parents. (We'll stick to genetic parents for simplicity.) What Crow's Foot symbol should we use for representing a "2-and-only-2" relationship?

Whenever you encounter a situation like this, it means you need to redefine your entity definitions. You need to take the model to a lower level of detail.

In this case, we need to ask why we created an entity called Parent. Why not two entities called Father and Mother? Or Man and Woman? In genealogy software, those distinctions are bound to be important for reporting.

Once we create two entities out of the Parent entity (Father and Mother), the relationship to Child simplifies to a minimum value of one. Now our "2-and-only-2" relationship has been resolved.



It's important to know that relational databases cannot resolve a connection where the minimum value is greater than 1. Whenever that shows up in your model, you have to create additional entities to simplify the relationships so the minimum connector is either 0 or 1.
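One way to see the resolution is in the resulting table design. This sketch (table and column names are illustrative, using SQLite) shows how the two mandatory single-parent relationships are enforced with NOT NULL foreign keys:

```python
import sqlite3

# The "2-and-only-2" Parent relationship becomes two mandatory
# single-valued relationships: Father and Mother.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE father (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE mother (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE child (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    father_id INTEGER NOT NULL REFERENCES father(id),  -- 1 and only 1
    mother_id INTEGER NOT NULL REFERENCES mother(id)   -- 1 and only 1
);
INSERT INTO father VALUES (1, 'John');
INSERT INTO mother VALUES (1, 'Mary');
INSERT INTO child  VALUES (1, 'Sam', 1, 1);
""")

# A child without both parents violates the model and is rejected.
try:
    conn.execute("INSERT INTO child (id, name, father_id) VALUES (2, 'Orphan', 1)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```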

Short-Form Description of Entity Relationships

One more important point. When describing an entity relationship in words, the practice is to summarize it by focusing on the maximum connector values only. Minimum values are omitted for simplicity as they have less effect on the data architecture design.

For example, we would say the entity Author has a many-to-many relationship with Book.

Account has a one-to-many relationship with Sub-account, and Sub-account has a many-to-one relationship with Account.

User has a one-to-one relationship with Licence.




While these little Crow's Foot symbols may seem insignificant, ERDs are incredibly helpful in understanding an organization's data and the business definition assumptions inherent in it. We will explore some applications of ERDs using Crow's Foot notation in future posts.

Friday 1 May 2015

Back to the Basics - ROLAP vs MOLAP

This is part of a series of posts on some of the basic concepts of data management and business intelligence concepts. We've previously discussed the differences between OLAP and OLTP. In this post we look at the two main implementations of OLAP: Relational OLAP and Multi-dimensional OLAP.


Review of OLAP

On-Line Analytical Processing (OLAP) is a method of structuring data to permit quick aggregation queries. It makes use of the Star Schema to limit the number of query joins when connecting different data entities. Business Intelligence tools that make use of OLAP typically permit data to be arranged in rows, columns and pages, with the intersecting cells storing the aggregation results (like Excel Pivot Tables). These tools also allow interactive "drag and drop" functionality to move entities between rows and columns and have the aggregation results recalculate automatically.

The user experience for OLAP is similar regardless of the back-end OLAP implementation, but it is important to understand that there are two ways to store and access data in OLAP. The choice affects performance and scalability as the data mart grows and expands and it can affect which tools can be used on the front end.

Relational OLAP

Relational OLAP (ROLAP) uses standard relational databases and tables to store and deliver the data to the BI tool. However, instead of organizing the data in Third Normal Form (3NF) it is organized using a Star Schema. Any BI tool accessing a ROLAP database must be able to aggregate the cell intersection values on the fly.

Because ROLAP uses a standard relational database, the OLAP queries are written using standard SQL scripts. No special query language or functions are required.
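As a minimal illustration, here is a toy star schema queried with ordinary SQL (shown via Python's sqlite3; table names and figures are invented):

```python
import sqlite3

# One fact table joined to two dimension tables, aggregated on the fly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region   TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Coffee'), (2, 'Tea');
INSERT INTO dim_region  VALUES (1, 'East'),   (2, 'West');
INSERT INTO fact_sales  VALUES (1, 1, 50.0), (1, 2, 75.0), (2, 1, 20.0);
""")

# Standard SQL -- no special OLAP query language required.
rows = conn.execute("""
    SELECT p.category, r.region, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_region  r ON r.region_id  = f.region_id
    GROUP BY p.category, r.region
    ORDER BY p.category, r.region
""").fetchall()
print(rows)
# [('Coffee', 'East', 50.0), ('Coffee', 'West', 75.0), ('Tea', 'East', 20.0)]
```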

Multi-dimensional OLAP

Multi-dimensional OLAP (MOLAP) uses a unique type of database to store and deliver the data to the BI tool. It does not consist of data tables per se but rather a multi-dimensional array that is designed to quickly aggregate the cell intersection values automatically.

Because MOLAP does not use a standard relational database, the OLAP queries are typically not written in SQL. Microsoft created a MOLAP query language for their own cube database (SSAS) called MDX and most other data cube vendors have since adopted it. (Oracle is the exception as they still use SQL with their cubes.) However MDX is not an industry standard. Microsoft owns it and they can modify it however and whenever they like.

MDX has some similarities to SQL but it is different enough that it requires dedicated training and practice. It is also very syntax-sensitive, meaning a missing comma in a particular place will cause the query to fail!

Comparison of ROLAP and MOLAP

Both ROLAP and MOLAP deliver all of the OLAP functionality to the end users, so from the users' perspective there is no apparent difference. There are differences in terms of performance though as well as in back-end maintenance considerations.

I have worked with multiple ROLAP and MOLAP clients across multiple database vendors and multiple BI tools. From this experience, these are the main differences I have witnessed.

Back-End Maintenance: Advantage MOLAP

MOLAP cubes can be implemented with relatively less back-end IT support than a ROLAP solution. While both require IT support, a ROLAP environment typically requires a small ETL department to maintain and load the data into the growing number of tables that support it. MOLAP also requires data loading and maintenance at the back-end, but it's usually possible for one DBA to maintain a series of MOLAP cubes.

For example, Prophix budgeting software, which sits on top of SSAS MOLAP cubes, actually permits the admin end users to do most of the data loading and maintenance processes themselves, thereby (almost) eliminating the need for IT support. I have never seen this kind of functionality built on top of a ROLAP database.

Scalability: Advantage ROLAP

As a data mart grows in terms of number of fact records and number of entity members, both MOLAP and ROLAP performance can deteriorate. However, MOLAP seems to do worse when dimensions get large. While you can throw more hardware (memory and processors) at a MOLAP problem to improve performance on a large cube, it has a limited effect. Eventually the only option left is to split a MOLAP cube into smaller cubes, and that may not suit the business' analytical needs.

When ROLAP databases slow down, there is the added flexibility of being able to add more aggregate tables to the database. Aggregate tables calculate and store the sums of certain cell intersections so the BI tool doesn't have to make those calculations on the fly. This reduces the load on the BI tool, and common reports that reference an aggregate table will generally load faster. These aggregate tables must be updated during the nightly ETL (Extract, Transform and Load) processes, so the trade-off is larger, slower and more complex ETL code. It also creates a more complex data model, as each slow-running BI report may require its own new aggregate table in the background. If end users have direct access to query the data tables, these aggregate tables can make their queries more complex to build, and if the aggregate tables include filter conditions, users can generate erroneous results if they are not aware of those hidden filters. While there are obviously trade-offs when considering aggregate tables, at least ROLAP provides the flexibility to choose between those trade-offs. MOLAP provides no such option.
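A sketch of the aggregate-table idea, again in SQLite with invented names: the monthly summary is built once during the load, and reports query it instead of the daily fact table.

```python
import sqlite3

# Monthly sums precomputed during "ETL" so reports read far fewer rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_expense (day TEXT, account TEXT, amount REAL);
INSERT INTO fact_expense VALUES
    ('2015-04-01', 'Office Supplies', 10.0),
    ('2015-04-02', 'Office Supplies', 15.0),
    ('2015-05-01', 'Office Supplies', 20.0);

-- The ETL step: rebuild the aggregate table once per load.
CREATE TABLE agg_expense_month AS
SELECT substr(day, 1, 7) AS month, account, SUM(amount) AS amount
FROM fact_expense
GROUP BY substr(day, 1, 7), account;
""")

# Reports hit the small aggregate table instead of the daily fact table.
rows = conn.execute(
    "SELECT month, amount FROM agg_expense_month ORDER BY month"
).fetchall()
print(rows)  # [('2015-04', 25.0), ('2015-05', 20.0)]
```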

When it comes to large-scale enterprise-wide data warehouses, MOLAP cannot keep up with the increasing data demands. ROLAP is the most common choice for these situations.

Query Language: Advantage ROLAP

ROLAP uses SQL to query the data while MOLAP uses MDX. This is a huge advantage for ROLAP for two reasons:
  1. There are far more SQL developers in the world than MDX developers, so staffing your OLAP team is much easier.
  2. SQL is a simpler language to learn than MDX, as MDX is more syntax-sensitive than SQL. (Translation: MDX is a fussier language that is more difficult to debug!)

Summary

Each method has its own strengths and weaknesses. MOLAP is more common among small and mid-size companies while ROLAP is more common in large companies. The key is to understand your priorities and choose accordingly.