Monday 16 November 2015

The Risk of Relying on a Single Data Source: A Lesson from NASA's MODIS Satellites

I read a fascinating article recently on the darkening of Greenland's ice sheet. NASA has been proclaiming for years that the ice is getting darker due to increased global pollution and is therefore reflecting less sunlight back into space, something that has ramifications for global warming. However, when researcher Chris Polashenski and his team analyzed Greenland ice samples to explain this darkening, they found ... nothing. Turns out there has been no darkening of the central ice sheet after all. Polashenski's explanation is that the satellite sensors on NASA's aging MODIS satellites are degrading and NASA has failed to correct for this error. I have not found any response from NASA yet, but I may have missed it.
Source
It appears that no matter how expensive and complex the measuring tools get, nothing beats boots on the ground to verify the data.

Calibrating Sensors

When I was in university, I had the privilege of working for the Canadian government doing research in preparation for the first RADARSAT satellite. When I was there in 1991, RADARSAT was still four years in the future, but the Europeans launched ERS-1 that summer and it carried the first spaceborne Synthetic Aperture Radar (SAR) system. Because Canada had a reliable airborne SAR system, we helped the Europeans calibrate their new satellite. We did that by having the satellite and the plane pass over the same stretch of eastern Lake Superior shoreline at the same time while people on the ground took measurements of the conditions that would affect radar return, such as air and water temperature, wave heights, wind speed, and so forth. The ground measurements and airborne radar images could be compared and then the ERS-1 images could be calibrated against both of them. Once calibrated to real-world data, the satellite SAR could image any other areas of the Earth's surface and the radar image could be reliably interpreted to describe the surface conditions.

Convair-580 C/X SAR
The Convair 580 C-X SAR Aircraft that helped calibrate the ERS-1 satellite in 1991. Source

The lesson I learned from those months doing radar research is that even the most complex sensors are only useful if they are calibrated against a real-world source.

Implications for Business Intelligence

As we move into the age of the Internet of Things (IoT), this problem will only become more prevalent. If we base any important decisions on a single measurement source, we run the risk of being fooled by a faulty sensor. Of course, it's not just technical gizmos that can fail. Any single source of data can be faulty, including someone's spreadsheet.

When I worked in the hospital sector in the late 1990's, the government funding agency sent out a spreadsheet showing how the complexity of cardiac surgery cases had mysteriously dropped during the first 6 months of that year across all cardiac hospitals in the province. As a result, they were clawing back over a million dollars in funding based on the argument that the hospitals were taking simpler cases and were redirecting the funds elsewhere. Our hospital CEO was concerned as it is no simple exercise to find money halfway through a budget year. A colleague of mine was assigned to discover if this complexity drop was real, so the first thing he did was talk to the Chief of Cardiac Surgery. Were cardiac surgery cases getting simpler? On the contrary, he answered, they were getting noticeably more complex. It was the same story when he spoke to surgeons at other hospitals. My colleague then set out to uncover the discrepancy, which he eventually traced to a formula error buried deep in the spreadsheet. He corrected the spreadsheet, sent it back to the government, and a few days later they sheepishly sent out the corrected spreadsheet showing a slight increase in cardiac surgery complexity across the province. The funding clawback was quietly cancelled. It was sloppy spreadsheet work on the government's part for sure, but the approach to solving the problem began by getting a second measurement source, in this case by having a conversation with a surgeon.

It reminds me of Segal's Law, which states:

A man with a watch always knows what time it is.
A man with two watches is never sure.

Every watch has a measurement error of some degree, but you are only confronted with that error when comparing to a second watch. When looking at one watch, we can easily forget about its inherent inaccuracies. NASA was looking at one watch for a decade and it was sure it had the correct time. Even the experts can make simple mistakes.

When it comes to any important business decision that you are basing on your BI metrics, always make sure you have a second watch.

Source: David Boettcher. Notice how these watches differ?

Friday 5 June 2015

Designing Data Cubes - Part 4

This is part of an ongoing series on data cube design. Click here for the previous article.

Use Entity-Relationship Diagrams to Guide the Dimension Design

In the previous post I discussed the need for eliminating unnecessary dimensions during the cube design phase. Once you have all the required dimensions, you need to arrange the components into hierarchies. It's not always obvious whether dimension elements belong together in the same dimension or in separate dimensions.

In an earlier post, I discussed the basics of ER Diagrams so I won't repeat that here. I have found taking the time to model the dimension elements properly makes the dimension design easier and helps avoid common design errors.

Situation 1:  Separate Dimensions

When two elements have a many-to-many relationship (such as Customer and Product), they absolutely should not be part of the same dimension. They are separate dimensions that connect to each other through the fact table. The temptation may be to put products as a child of customer so you can easily see which products each customer purchased, but when it's time to show all sales for Product A, it becomes a nightmare. I do not know of a valid exception to this principle, but I certainly have seen it ignored by new cube designers.
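
As a rough sketch of this principle (the table and column names below are hypothetical, not from any particular system), Customer and Product each get their own dimension table and meet only in the fact table:

  -- Two separate dimensions...
  CREATE TABLE dim_customer (
      customer_key  INT PRIMARY KEY,
      customer_name VARCHAR(100)
  );

  CREATE TABLE dim_product (
      product_key  INT PRIMARY KEY,
      product_name VARCHAR(100)
  );

  -- ...connected only through the fact table, which records each sale.
  CREATE TABLE fact_sales (
      customer_key INT REFERENCES dim_customer (customer_key),
      product_key  INT REFERENCES dim_product (product_key),
      sales_amount DECIMAL(12,2)
  );

  -- "All sales for Product A" stays a simple aggregation:
  SELECT p.product_name, SUM(f.sales_amount) AS total_sales
  FROM fact_sales f
  JOIN dim_product p ON p.product_key = f.product_key
  WHERE p.product_name = 'Product A'
  GROUP BY p.product_name;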


Situation 2:  Level-Based Hierarchy

When a series of entities are related as one-to-many going forward and one-and-only-one going backwards, they are part of the same dimension hierarchy. Further, it is a level-based hierarchy because each forward relationship is mandatory.

The example below shows part of the Canadian census geographic levels and how they are subdivided into smaller and smaller areas. While it's obvious they are part of the same dimension, working out the relationships helps determine the type of hierarchy to use.
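
Expressed as a relational dimension table, a level-based hierarchy could look like the minimal sketch below. The level names are assumptions standing in for whichever census levels the diagram shows; the point is that every row is populated at every level, which is what makes the hierarchy level-based.

  CREATE TABLE dim_geography (
      area_key           INT PRIMARY KEY,
      dissemination_area VARCHAR(50) NOT NULL,
      census_subdivision VARCHAR(50) NOT NULL,
      census_division    VARCHAR(50) NOT NULL,
      province           VARCHAR(50) NOT NULL   -- top level
  );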


Situation 3:  Parent-Child Hierarchy

Similar to Situation 2, entities that are related as zero-to-many going forward and one-and-only-one going backwards are also part of the same dimension hierarchy, but because the relationship is optional at one or more levels, it is a parent-child hierarchy. Just like with families, some people have children and grandchildren while other people do not. This asymmetry is not allowed in a level-based hierarchy.

The example below shows an account structure, where Account Group may be Asset, Account may be Bank Account, and Sub-account may be TD Bank Account #12345. Because one level of the hierarchy is optional (Sub-account), the entire hierarchy becomes parent-child.
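
A minimal relational sketch of that parent-child dimension, using hypothetical names: each member carries a reference to its parent, and that reference is allowed to be empty, which is what lets the hierarchy be asymmetric.

  CREATE TABLE dim_account (
      account_key        INT PRIMARY KEY,
      account_name       VARCHAR(100) NOT NULL,
      parent_account_key INT NULL REFERENCES dim_account (account_key)
  );

  -- Some branches go three levels deep, others stop earlier.
  INSERT INTO dim_account VALUES (1, 'Asset', NULL);
  INSERT INTO dim_account VALUES (2, 'Bank Account', 1);
  INSERT INTO dim_account VALUES (3, 'TD Bank Account #12345', 2);
  INSERT INTO dim_account VALUES (4, 'Petty Cash', 1);             -- no sub-account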


Situation 4:  Member Properties (Synonyms)

This situation is not as intuitive as the previous examples, but I have seen it built improperly in cubes on multiple occasions. Entities relate to one base element in what appears to be two distinct hierarchies. The example below shows Customers rolling up to geographic regions and also through the Sales department.


Because both sets are based on Customer, it doesn't make sense to create two separate dimensions. Building one dimension with a default hierarchy and an alternate hierarchy looks like a reasonable option. However, before making that decision it is important to understand the relationships between the hierarchies. How does Region relate to Regional Sales Managers? In this case, it turns out that each RSM oversees one and only one Region, and similarly for Sales VP's and Districts.


Since the geography and sales entities are therefore equivalent, we don't need two hierarchies. We just need alternate labels on one hierarchy. For example, in Microsoft cubes (SSAS), these alternate labels are called Member Properties. Some reports will show Districts and Regions while other reports show Sales VP's and RSM's, but the numbers in each bucket will always be identical.
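
A rough sketch of that single-hierarchy design in relational terms, with hypothetical column names: because each RSM maps to exactly one Region (and each Sales VP to one District), the sales titles become extra attributes on the geographic levels rather than a second hierarchy.

  CREATE TABLE dim_customer (
      customer_key           INT PRIMARY KEY,
      customer_name          VARCHAR(100),
      region                 VARCHAR(50),    -- hierarchy level
      regional_sales_manager VARCHAR(100),   -- member property of Region (1:1)
      district               VARCHAR(50),    -- hierarchy level
      sales_vp               VARCHAR(100)    -- member property of District (1:1)
  );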


I have found people who are new to cube design tend to shun data modeling exercises because it is time-consuming and in some cases appears unnecessary. However, understanding your data relationships clearly by building some simple ER diagrams really helps you design a cube that works efficiently.

Friday 29 May 2015

Designing Data Cubes - Part 3

This is part of a series on designing OLAP cubes. Click here for the last article.

Step 3 - Designing the Dimensions

So far we've looked at distinguishing the facts from the dimensions in the source data and ensuring each fact table is at one common grain. Now we can focus on designing the dimensions.

With dimensions, you really only have two main design decisions:
  • Which dimensions are required in this cube?
  • What will the dimension hierarchies be, if any?
These appear to be simple questions, but some datasets can cause confusion when setting up the dimensions. Getting the dimension design wrong can result in poor cube performance and sometimes erroneous report results.

Some dimensions are easy to design, such as Time. Everyone knows that Days roll up to Weeks, Weeks to Months, Months to Quarters, and Quarters to Years. However, it is not always clear which data elements belong together in one hierarchy and which ones deserve their own dimensions.

For example, the data element Sales Region rolls up to Sales District and drills down to Customer. It's obvious those elements belong together in one hierarchy. However, Sales Manager rolls up to Sales Director and also drills down to Customer. Customer is common to both potential hierarchies. Should the Sales Managers and Directors be part of the Sales Region hierarchy or should they be in their own dimension? Or should they be an alternate hierarchy above Customers in one dimension? The answer depends on the specific nature of these data elements for this particular company.

I have found there are two guidelines that have helped me answer these dimension design questions:
  1. Eliminate ALL unnecessary dimensions and hierarchy levels from the cube
  2. Use Entity-Relationship diagrams to guide the dimension design

Eliminate Unnecessary Dimensions and Hierarchy Levels

The common temptation I have seen in many workplaces is to build a cube that can answer every ad hoc question that could ever be asked of the data. That is the worst thing anyone could do when designing a cube. Always design cubes to accomplish a particular purpose. By implication, this means the cube will not accomplish every purpose. That's OK because cubes designed to accomplish all purposes usually end up accomplishing none of them.

If none of your organization's reports contain weekly summary levels, then don't include weeks in your time dimension hierarchy. If your scheduling system books staff by the shift, then don't build a reporting cube that shows hourly schedules. Keep your cubes consistent with your business processes.

When I review cubes and ask why certain dimensions have been designed the way they are, there are some terms that can indicate unnecessary dimensions in a cube:
  • Just in case:  "We added Weeks to the Time hierarchy just in case someone would want to build a weekly report."  If no one has had a business need to build a weekly report yet, don't presume that business need is suddenly going to appear after you build your cube.
  • Future requirement: "We built hourly schedule reporting functionality because the HR manager said that they might consider moving to hourly staff scheduling in a couple years." Don't build to imagined future requirements. Build to today's requirements to ensure you meet today's requirements. Failing to meet today's requirements but meeting non-existent future requirements does not equal success.
  • One person asked for it: "We didn't see any existing reports with this dimension, but one user requested it." If that one user is the project owner who is funding this cube, then it's probably a good idea to include it. Otherwise, leave it out.
  • It was in the source data, so we put it in the cube: "We put every data element we found into the cube." This may sound comprehensive but it is too large a purpose for one cube. Most source systems have data elements that are no longer used, so including everything guarantees a cluttered cube.
Keeping the cube dimensions and hierarchies clear of clutter improves performance significantly. Be ruthless when considering what elements should be included and excluded from your cube.


The second guideline will be tackled in the next post.

Friday 22 May 2015

Designing Data Cubes - Part 2

In Part 1, we looked at identifying the facts and dimensions in your data. The next step is to organize the data elements into distinct cubes.

Step 2 - Pick the Grain of the Fact Table

In data cubes, grain is not referring to wheat. Grain refers to the level of detail that the facts describe. Every fact in a fact table should be at the same grain. For instance, with time one fact could measure a daily amount and another could measure an annual amount. If you need both types of measures in your data warehouse, it is best to use two fact tables at a minimum.

This principle is key for designing data cubes. Each star schema only contains one fact table, so if your data demands multiple fact tables then your data mart will require multiple cubes. Do not try to jam different grains of data into a single cube as it will cause a lot of hassles when you try to query it.

For example, financial data may be budgeted at the monthly level but actual expenses may be tracked at a weekly level. The temptation is to pack them into a single cube because the budget reports contain both actual and budgeted expenses. Do not fall into that trap. Keep the fact table grains consistent and separate, and then combine the data from both cubes when it's time to build the report.
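
As a sketch of that separation (hypothetical table names and keys), the budget and actual facts live in their own fact tables at their own grains, and only the report query brings them together at a level they share:

  -- Budgets at a monthly grain
  CREATE TABLE fact_budget (
      month_key     INT,            -- e.g. 201506
      account_key   INT,
      budget_amount DECIMAL(12,2)
  );

  -- Actual expenses at a weekly grain
  CREATE TABLE fact_actuals (
      week_key      INT,            -- e.g. 201524 (year + week number)
      account_key   INT,
      actual_amount DECIMAL(12,2)
  );

  -- Combine only at report time, after each side is rolled up
  -- to the level the report needs (here, account for the first half-year).
  SELECT b.account_key, b.budget_amount, a.actual_amount
  FROM (SELECT account_key, SUM(budget_amount) AS budget_amount
        FROM fact_budget WHERE month_key BETWEEN 201501 AND 201506
        GROUP BY account_key) b
  JOIN (SELECT account_key, SUM(actual_amount) AS actual_amount
        FROM fact_actuals WHERE week_key BETWEEN 201501 AND 201526
        GROUP BY account_key) a
    ON a.account_key = b.account_key;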

Even though grain is always discussed in the context of the fact table, it really is the dimensions that determine grain. As in the example above, the Time dimension determines the lowest possible grain in its associated fact table. If Time only contains months and years, then the fact table cannot store data at a daily grain. Similarly, if the Geography dimension only goes to the Province level of detail, the fact table cannot store data at the City level.

Aren't Multi-Grain Fact Tables Healthy?

What's the big deal? Why can't I mix the grains in one fact table if they are regularly needed together for reporting? Multi-grain bagels are supposed to be good for me, so why not multi-grain cubes?

Of course anyone can mix grains in a single fact table. No one has to follow best practices. Just like you can build a transactional database that doesn't follow 3rd Normal Form, you can build a multi-grain fact table. It will just make your life difficult down the road.

For example, you may choose to keep budgets at a monthly level and actual expenses at a daily level in a single cube. What are the consequences?

Performance.

Cubes require that all facts be stored at the lowest grain. Even though you want to store the budget measure at a monthly level, you must store it at the daily level. You will have to choose how to do that. You could divide your monthly $100 Office Supplies budget by 30 and store $3.33 in each day of April. This means you need to include logic in your ETL to know how many days are in each month. Some reports may now have rounding errors when you report your monthly budgets. That may or may not be a problem for your users, or it may be a bit of hassle for your BI report writers to hide the rounding issues. You could also decide to store the entire monthly budget in the first day of each month to avoid rounding problems, but this means the budget amounts could disappear if a user is allowed to drill down to the weekly or daily levels.
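
A sketch of what that allocation step could look like in the ETL, assuming a hypothetical date dimension that carries a days-in-month attribute -- note the rounding residue the division leaves behind:

  -- Spread each monthly budget across the days of its month.
  -- 100.00 / 30 rounds to 3.33, and 30 x 3.33 = 99.90: the monthly
  -- total no longer adds back up exactly.
  INSERT INTO fact_daily_budget (date_key, account_key, budget_amount)
  SELECT d.date_key,
         b.account_key,
         ROUND(b.budget_amount / d.days_in_month, 2)
  FROM fact_budget b
  JOIN dim_date d ON d.month_key = b.month_key;   -- one row per day in the month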

The performance is affected during write processes (more cells to write to, and probably some minor arithmetic) and during read processes (more cells to aggregate). Write processes are typically done at night using ETL programs, so most users will not notice it. If however you are doing multiple data refreshes during the day or near-real-time mini-batches, performance hits can have major implications. If ETL processes are being run hourly during the day and they now take 2 hours to complete, your near-real-time data warehouse is going to turn back into a nightly data warehouse.

The example of monthly versus daily grain will not result in a significant increase in the number of potential data cells. If however you have a Product dimension and you are considering putting a budget grain of 500 Product Categories and an actual expense grain of 50,000 Products in the same cube, the performance impact could be huge.

In the end, the decision to mix grains comes down to weighing the estimated performance degradation against the convenience of having the different grains of facts together in one cube. If your cube is small, you can get away with it.

Friday 15 May 2015

Designing Data Cubes - Fact or Dimension Quiz Answers

Have you taken the Facts or Dimensions quiz?  Here are some explanations for the answers.

Note that the quiz website randomizes the question order, so your quiz will be in a different order than the list below.

Quiz Answers

  • Customer - Dimension
    This is a data element that we would want to use to slice 'n' dice our data, such as grouping sales dollars by customer. Therefore it must be a dimension.
  • Invoice - Fact
    An invoice represents a sales transaction, and transactional data elements are facts.
  • Region - Dimension
    This is a common dimension that permits geographical slicing of the data.
  • Fiscal Year - Dimension
    Time is the most common dimension in cubes. Whether it's calendar or fiscal time periods, time will always be a dimension.
  • Company - Dimension
    If a corporation owns multiple companies, then there would be a need to slice the data by company. This must be a dimension.
  • Cost Center - Dimension
    Cost centers are similar to departments but typically at a lower level of detail, such as a function or project. Since departments are modeled as a dimension, cost centers must be a dimension as well.
  • Asset Purchases - Fact
    As discussed in the blog article, purchases are transactions and are therefore facts.
  • Basket Ball Sales - Fact
    Sales are always transactions, regardless of the item being sold.
    (And yes, data cubes are flexible enough to store data on round objects!)
  • Currency Conversion Rate - Fact
    Currency rates change rapidly, so they must be modeled as a fact.
  • Product Inventory Counts - Fact
    Inventory counts are numeric and would need to be aggregated, so they need to be modeled as a fact.
  • Financial Account - Dimension
    Financial accounts are categories such as Office Supplies Expense or Interest Earned. The finances always need to be aggregated by accounts for financial reporting (i.e. slice the data by account), so accounts will always be a dimension.
  • Insurance Policy Type - Dimension
    Type implies a category of some kind. Even though we don't know what policy types might be included here (e.g. auto or home, personal or business, etc.) we know that there will always be fewer policy types than policies themselves. Since type would be used to slice the policy transactions in some way, it must be a dimension.
  • Patents - Dimension
    This is an unusual element to include in a data cube, but it's possible. Patents are very long-lasting, being in effect for multiple decades. If royalties are being paid to us by other companies who are using our patent, then those royalty transactions could be grouped by patent so we could see which ones are contributing the most royalty revenues. Patents have to be a dimension.
  • Contracts - Fact or Dimension (it depends)
    This is a difficult one to model because businesses can apply its meaning broadly enough to make it behave as a dimension or a fact.
    The most common application of a contract would be as a dimension, where contract is a legal container for many invoices to be charged against it. Think of an office building construction project. The entire project may last for years, and a contract will exist for all the truckloads of cement required and another contract for all the electrical installation work, etc. Many invoices will be generated over the years under each contract. Clearly contract is a dimension.
    However, now think of a self-employed handyman who does bathroom renovations for home owners. Each contract lasts 1-4 weeks and only one invoice is ever billed per contract. In that case, contract is essentially equivalent to an invoice and it behaves as a transaction. You would never want to group data by contract because there is only ever one transaction per contract. There is no data to aggregate. Even though these contracts are still functioning as legal agreements, in this situation contract must be treated as a fact.
    This is a great example of how a single business term can mean different things in different organizations, and why it is so important to understand your organization's specific requirements before assuming how the data model and cubes should look.
  • Employee Satisfaction Survey Responses - Fact
    Survey responses are usually numerous (at least you hope for that -- not many people fill them out!) and they typically contain responses that can be tallied numerically. Numeric data like this must be modeled as a fact. The tricky part with this data element is that it can be long-lasting, in that the survey may only be taken once per year. Even though it is not a frequent transaction, it is still a transaction. A similar example would be asset depreciation, which also may only be calculated once per year. Infrequent transactions are still transactions and therefore must be modeled as facts.

Click here to return to the original article.

Designing Data Cubes - Part 1

OLAP data cubes have become very common, and the current generation of BI tools is making it easier to create a cube quickly and without a lot of training. This is a good thing, but it also means that some people who build their own cubes have not had the opportunity to learn the design principles that underlie strong cube performance. This article (and a few subsequent ones) will cover some of the key design steps necessary to have a well-functioning data cube.

There are lots of books and websites out there that teach you how to build data warehouses and cubes, and some of them are excellent (such as the Kimball Group). I am not going to duplicate those but instead focus on a few issues I've seen in my experience, especially with smaller implementations of data cubes. It seems the smaller the project is, the less likely that good cube design principles will be used and the more likely that performance and maintenance problems will develop.

Step 1 - Distinguish the Facts from the Dimensions

When it's time to build a new data cube (or your first one), most data analysts generally know which datasets they want included. (It's not always the case, but it's generally true.) However you cannot just shovel piles of data into a cube and expect it to answer all your business questions quickly and correctly. You need to structure your data cube properly based on the nature and characteristics of your specific data.

The first step is to look at the data that is available and distinguish the facts from the dimensions.

Facts

Facts are the data elements that sum. Terms like amount, dollars, units or hours usually signify that we're dealing with a fact. Some facts are counted or averaged rather than summed, but they are aggregated nonetheless.

Any transactional dataset will typically have a numeric element that can be aggregated. Transactions include things such as payments, purchases, deposits, orders, and sales.

Facts also tend to change frequently and apply either to a point in time or a relatively short time period.

Dimensions

Dimensions are descriptive data elements that provide more details about the facts. They permit the aggregated transactional data to be "sliced 'n' diced." Examples of dimensions are time, product, department, account, region and customer.

Dimensions change infrequently and they tend to apply to relatively longer periods of time.

Dimensions typically do not have elements that can be summed, but that is not a hard-and-fast rule. For example, a Product source dataset could have a descriptive field called Product Weight which stores the number of grams per package. Weight is numeric but it is not necessarily transactional. If users are going to slice the transactional data by package size (e.g. sales of 100g vs 500g packages) then this numeric field is descriptive and belongs in a dimension. If however users want to report on the average package size sold, then the Product Weight field needs to be aggregated and therefore belongs with the facts.
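
A quick sketch of the two placements for that Product Weight field (hypothetical names): as a descriptive attribute it lives in the dimension and is used for slicing, and as a measure it lives in the fact table and is aggregated.

  -- Descriptive use: slice sales by package size
  CREATE TABLE dim_product (
      product_key      INT PRIMARY KEY,
      product_name     VARCHAR(100),
      package_weight_g INT              -- 100, 500, ...
  );

  -- Aggregated use: report the average package size sold
  CREATE TABLE fact_sales (
      product_key   INT,
      sales_amount  DECIMAL(12,2),
      weight_sold_g INT                 -- summed or averaged like any other fact
  );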

Don't Mix Them Up!

Much grief can be avoided by correctly identifying your facts and dimensions before starting any construction of your cube. When I find slow-performance cubes with incredibly complex queries or lots of dummy dimension values, it usually means the cube was poorly designed at the start. When you put a dimension in a fact table or vice-versa, it will make your life difficult for as long as that cube exists.

Examples

  1. Department
    A department (such as Human Resources, Finance, Research & Development) is a data element that describes an organization. It is something that we would want to use to slice 'n' dice our financial data. Therefore it must be classified as a dimension.
  2. Purchase Orders
    A company uses purchase orders when they want to buy something from one of their suppliers. There tend to be lots of purchase orders and they usually are active until the item is received. They are typically linked to invoices and shipping receipts. Therefore purchase orders are transactional and must be classified as facts.
  3. Employee Headcount
    A useful indicator on financial statements is the number of employees in the company. Since "count" is in the data element name, it's obvious this element needs to be aggregated. Therefore Headcount must be a fact.
  4. Tweets
    This is not a typical data element that would be found in a cube, but let's assume a company wants to track every tweet that is sent with their company's official hashtag. Would tweets be a dimension or a fact?
    Tweets are extremely short-lived. You would never want to slice 'n' dice other data by a single tweet. Therefore tweets would be considered as facts even though they may not contain numeric data. In practice, tweet counts would be the most likely use of a tweet data cube.
  5. Assets
    Assets are things that by definition are long-lasting, which means they must be dimensions. Assets such as a building or a computer server will last multiple years and therefore will have multiple transactions applied to them over time.
    Be aware though that assets also generate transactions, so a data element such as Asset Purchase would be a fact because purchases are transactions. Similarly Asset Depreciation Expense is a numeric amount that is calculated each month or year over the course of its life and therefore is a fact. Do not let a descriptive term like Asset blind you to the real nature of the data element.

Facts or Dimensions? Take the Quiz

Demonstrate your ability to identify business data elements as either facts or dimensions! It's simple to use and it's quick.



Want to understand why the answers are what they are? See the answer discussion page here.

Friday 8 May 2015

Back to the Basics - Crow's Foot Diagrams

In an earlier post we touched on the topic of Entity-Relationship Diagrams (ERDs). In this post we will explain one of the more common diagramming standards for ERDs. To begin though we need to understand the types of relationships between entities.

Entity Relationships

In ERDs, entities are shown as boxes and relationships between them are shown as lines. For example, the entity Author has a relationship to the entity Book.

While knowing that a relationship exists between these entities is helpful, it would be more helpful to convey information about the nature of that relationship in the diagram. For instance, does every author have a relationship to a book or only some authors? Can a book exist without an author? Can a single author write multiple books? Can a book be written by multiple authors?

These relationship questions boil down to two that apply to every entity relationship:
  1. Is the relationship optional or mandatory?
  2. Is a single relationship or multiple relationships possible per instance of each entity?
The first question deals with the minimum number of connections (0 if optional, 1 if mandatory) and the second question deals with the maximum number of connections (1 or many). We don't try to specify a firm maximum number because whether it's 2 or 2 million, the impact on the database  design will be the same.

Many relationship diagramming methods exist for ERDs but I have found the Crow's Foot design the easiest for me to understand and the most commonly used in the companies I've worked with.

Crow's Foot Diagrams

The Crow's Foot connector uses 3 symbols to describe the minimum and maximum number of relationships:  0, 1 and < (for "many" -- it looks like a greater-than or less-than sign, depending on which way it points)

Two symbols are used at the end of each line. The first is the minimum and the second is the maximum. There are four acceptable combinations of values as shown below.

Because the "many" symbol resembles a crow's foot, hence the name of this diagramming method.

Some people view the number zero as a letter "O" which stands for "Optional." Both interpretations are equivalent.

Now we can modify the relationship diagram for Authors and Books.

We can read the relationships in both directions. The symbol at the destination end of the line is the one that applies to that direction. These connectors show that each author writes one to many books and each book is written by one to many authors.

Note that these relationships are mandatory (i.e. minimum of one). If an author has not written any books, they cannot be considered an author. Similarly a book cannot exist without an author.

Note that the greater-than sign always means "many" even if it appears backwards (like a less-than sign).

Here is a finance example from a General Ledger data model.

In this system, financial accounts can have sub-accounts but they are not mandatory. For example, the account Office Supplies Expense may have sub-accounts called Copier Paper, Copier Toner and General Office Supplies. However the account Electricity Utility Expense has no sub-accounts because there isn't any further cost breakdown possible. This diagram reads that each account can have zero to many sub-accounts, but that each sub-account must belong to one and only one account. While sub-accounts are optional, once one is created then it has a mandatory relationship back to a single account. Sub-accounts cannot be orphans.

Next example: users and software licences.

Each user must have one and only one licence. (We will stick to the legal options only -- no pirated software!) If a person does not have a licence then they cannot be considered a user of that software. However a licence may be purchased and not assigned to any user. It may be waiting for a new hire who starts next quarter. Therefore a licence can be assigned to zero or one user.

You can begin to see that the definition of the entity can affect the definition of the relationship. The entity User contains an inherent assumption that must be understood in order for the relationship to be modeled correctly, specifically that there is no pirated software allowed.


The importance of entity definition shows up in this example as well. Here a customer buys one to many products, while a product may be bought by no one or many customers. If a person doesn't buy any of our products, they cannot be considered our customer. If however we want to track everyone who walks into our store, then we need a different model. Some people could walk in, look around and leave without buying and therefore without becoming customers. If we call everyone who enters our store "Shoppers" then that model would look like the following:

 Now the connector from Shopper to Product becomes optional.

But What About Other Combinations?

You may have noticed that selecting two values (min, max) from three different symbols (0, 1, <) should give us more than just four combinations. Obviously the maximum cannot be less than the minimum so those situations are nonsensical. However, there still are two other combinations:
  • Zero and only Zero -- this effectively means there is no relationship between these two entities and therefore no line, so it is a non-existent situation in an ERD.
  • Many to Many -- this means the minimum is greater than one. This situation sounds plausible.
In fact, it's not hard to imagine an entity relationship where the minimum is 2. Consider the diagram below for a genealogy database:

Each parent can have one to many children, but each child must have 2 and only 2 parents. (We'll stick to genetic parents for simplicity.) What Crow's Foot symbol should we use for representing a "2-and-only-2" relationship?

Whenever you encounter a situation like this, it means you need to redefine your entity definitions. You need to take the model to a lower level of detail.

In this case, we need to ask why we created an entity called Parent. Why not two entities called Father and Mother? Or Man and Woman? In genealogy software, those distinctions are bound to be important for reporting.

Once we create two entities out of the Parent entity (Father and Mother), the relationship to Child simplifies to a minimum value of one. Now our "2-and-only-2" relationship has been resolved.



It's important to know that relational databases cannot resolve a connection where the minimum value is greater than 1. Whenever that shows up in your model, you have to create additional entities to simplify the relationships so the minimum connector is either 0 or 1.
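
As a sketch of how the Father/Mother resolution lands in a relational schema (hypothetical table names), the "2-and-only-2" rule becomes two separate mandatory one-and-only-one relationships, each with a minimum connector of 1:

  CREATE TABLE father (father_id INT PRIMARY KEY, full_name VARCHAR(100));
  CREATE TABLE mother (mother_id INT PRIMARY KEY, full_name VARCHAR(100));

  -- Each child must reference exactly one father and exactly one mother.
  CREATE TABLE child (
      child_id  INT PRIMARY KEY,
      full_name VARCHAR(100),
      father_id INT NOT NULL REFERENCES father (father_id),
      mother_id INT NOT NULL REFERENCES mother (mother_id)
  );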

Short-Form Description of Entity Relationships

One more important point. When describing an entity relationship in words, the practice is to summarize it by focusing on the maximum connector values only. Minimum values are omitted for simplicity as they have less effect on the data architecture design.

For example, we would say the entity Author has a many-to-many relationship with Book.

Account has a one-to-many relationship with Sub-account, and Sub-account has a many-to-one relationship with Account.

 User has a one-to-one relationship with Licence.




While these little Crow's Foot symbols may seem insignificant, ERDs are incredibly helpful in understanding an organization's data and the business definition assumptions inherent in it. We will explore some applications of ERDs using Crow's Foot notation in future posts.

Friday 1 May 2015

Back to the Basics - ROLAP vs MOLAP

This is part of a series of posts on some of the basic concepts of data management and business intelligence. We've previously discussed the differences between OLAP and OLTP. In this post we look at the two main implementations of OLAP: Relational OLAP and Multi-dimensional OLAP.


Review of OLAP

On-Line Analytical Processing (OLAP) is a method of structuring data to permit quick aggregation queries. It makes use of the Star Schema to limit the number of query joins when connecting different data entities. Business Intelligence tools that make use of OLAP typically permit data to be arranged in rows, columns and pages, with the intersecting cells storing the aggregation results (like Excel Pivot Tables). These tools also allow interactive "drag and drop" functionality to move or add entities on rows to columns and have the aggregation results recalculate automatically.

The user experience for OLAP is similar regardless of the back-end OLAP implementation, but it is important to understand that there are two ways to store and access data in OLAP. The choice affects performance and scalability as the data mart grows and expands and it can affect which tools can be used on the front end.

Relational OLAP

Relational OLAP (ROLAP) uses standard relational databases and tables to store and deliver the data to the BI tool. However, instead of organizing the data in Third Normal Form (3NF) it is organized using a Star Schema. Any BI tool accessing a ROLAP database must be able to aggregate the cell intersection values on the fly.

Because ROLAP uses a standard relational database, the OLAP queries are written using standard SQL scripts. No special query language or functions are required.

Multi-dimensional OLAP

Multi-dimensional OLAP (MOLAP) uses a unique type of database to store and deliver the data to the BI tool. It does not consist of data tables per se but rather a multi-dimensional array that is designed to quickly aggregate the cell intersection values automatically.

Because MOLAP does not use a standard relational database, the OLAP queries are typically not written in SQL. Microsoft created a MOLAP query language for their own cube database (SSAS) called MDX and most other data cube vendors have since adopted it. (Oracle is the exception as they still use SQL with their cubes.) However MDX is not an industry standard. Microsoft owns it and they can modify it however and whenever they like.

MDX has some similarities to SQL but it is different enough that it requires dedicated training and practice. It is a very syntactic language, meaning that a missing comma in a particular place will cause the query to fail!

Comparison of ROLAP and MOLAP

Both ROLAP and MOLAP deliver all of the OLAP functionality to the end users, so from the users' perspective there is no apparent difference. There are differences in terms of performance though as well as in back-end maintenance considerations.

I have worked with multiple ROLAP and MOLAP clients across multiple database vendors and multiple BI tools. From this experience, these are the main differences I have witnessed.

Back-End Maintenance: Advantage MOLAP

MOLAP cubes can be implemented with relatively less back-end IT support than a ROLAP solution. While both require IT support, a ROLAP environment typically requires a small ETL department to maintain and load the data into the growing number of tables that support it. MOLAP also requires data loading and maintenance at the back-end, but it's usually possible for one DBA to maintain a series of MOLAP cubes.

For example, Prophix budgeting software, which sits on top of SSAS MOLAP cubes, actually permits the admin end users to do most of the data loading and maintenance processes themselves, thereby (almost) eliminating the need for IT support. I have never seen this kind of functionality built on top of a ROLAP database.

Scalability: Advantage ROLAP

As a data mart grows in terms of number of fact records and number of entity members, both MOLAP and ROLAP performance can deteriorate. However, MOLAP seems to do worse when dimensions get large. While you can throw more hardware (memory and processors) at a MOLAP problem to improve performance on a large cube, it has a limited effect. Eventually the only option left is to split a MOLAP cube into smaller cubes, and that may not suit the business' analytical needs.

When ROLAP databases slow down, there is the added flexibility of being able to add more aggregate tables to the database. Aggregate tables calculate and store the sums of certain cell intersections so the BI tool doesn't have to make those calculations on the fly. This reduces the load on the BI tool, and common reports that reference an aggregate table will generally load faster. These aggregate tables must be updated during the nightly ETL (Extract, Transform and Load) processes, so the trade-off is larger, slower and more complex ETL code. It also creates a more complex data model, as each slow-running BI report may require its own new aggregate table in the background. If end users have direct access to query the data tables, these aggregate tables can make their queries more complex to build, and if the aggregate tables include filter conditions, users can generate erroneous results if they are not aware of those hidden filters. While there are obviously trade-offs when considering aggregate tables, at least ROLAP provides the flexibility to choose between those trade-offs. MOLAP provides no such option.
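
As a rough sketch of the aggregate-table approach (hypothetical table and column names), the nightly ETL pre-computes the sums that common reports need so the BI tool no longer aggregates raw fact rows on the fly:

  -- Refreshed as part of the nightly ETL
  DELETE FROM agg_sales_month_region;

  INSERT INTO agg_sales_month_region (month_key, region_key, sales_amount)
  SELECT d.month_key,
         c.region_key,
         SUM(f.sales_amount)
  FROM fact_sales f
  JOIN dim_date d     ON d.date_key     = f.date_key
  JOIN dim_customer c ON c.customer_key = f.customer_key
  GROUP BY d.month_key, c.region_key;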

When it comes to large-scale enterprise-wide data warehouses, MOLAP cannot keep up with the increasing data demands. ROLAP is the most common choice for these situations.

Query Language: Advantage ROLAP

ROLAP uses SQL to query the data while MOLAP uses MDX. This is a huge advantage for ROLAP for two reasons:
  1. There are far more SQL developers in the world than MDX developers, so staffing your OLAP team is much easier.
  2. SQL is a simpler language to learn than MDX as MDX is more syntactic than SQL. (Translation: MDX is a fussier language that is more difficult to debug!)

Summary

Each method has its own strengths and weaknesses. MOLAP is more common among small and mid-size companies while ROLAP is more common in large companies. The key is to understand your priorities and choose accordingly.

Friday 24 April 2015

Back to the Basics - Snowflake Schema

In the last post we introduced the Star Schema. Next we will look at a modification of the Star Schema called the Snowflake Schema and explain why you should try to avoid it.

Snowflake Schema

The Star Schema is a de-normalized data structure designed to optimize aggregation query response times. It contains a fact table surrounded by a number of entity tables.


A common expansion of the Star Schema is the Snowflake.  A Snowflake Schema is simply a Star Schema where one or more of the entities have been partially normalized. This means some of the repeated details for an entity table have been hived off into a separate details table to reduce data repetition in the main entity table. An example of a Snowflake Schema is below.



Note that the original Star Schema tables are still there, but new Entity Details tables have been added around the edges, making it roughly resemble a snowflake.

Snowflake Schemas are rarely symmetric. Some entities may have multiple detail tables (such as Entity 1 above with 3) while other entities may have none (Entity 2).

What type of data is stored in the Entity Detail tables? Anything that is repeated frequently in the main Entity tables. Hierarchy levels or long textual description fields are common candidates for snowflaking.

For example, an Entity called Cost Center may have a hierarchy starting at the top with Company, then going down to Business Unit, Department and Cost Center. The facts are stored at the cost center level, so the Company, Business Unit and Department values will be repeated frequently in the entity table. Therefore, those details could be put in a separate entity details table called Cost Center Hierarchy with the cost center key remaining in the main entity table.
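
A sketch of that snowflaked arrangement with hypothetical names -- the repeated hierarchy values move into a details table, which is exactly the extra join the next section objects to:

  CREATE TABLE dim_cost_center_hierarchy (
      hierarchy_key INT PRIMARY KEY,
      department    VARCHAR(50),
      business_unit VARCHAR(50),
      company       VARCHAR(50)
  );

  -- The main entity table keeps only the cost center plus a key
  -- pointing at the hierarchy details above.
  CREATE TABLE dim_cost_center (
      cost_center_key INT PRIMARY KEY,
      cost_center     VARCHAR(50),
      hierarchy_key   INT REFERENCES dim_cost_center_hierarchy (hierarchy_key)
  );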

Avoid Snowflaking

Now that we've reviewed what the Snowflake Schema is, this is why you should avoid using it: Snowflaking works against the sole purpose of the Star Schema.

Normalizing an intentionally de-normalized data structure adds joins that slow query response time, all to save a bit of disk space. This is ridiculous when the only reason the Star Schema exists is to improve query response time. Most instances where I have seen snowflaking used in real-world data models have been the result of a DBA (database administrator) returning to old habits and normalizing tables because that's what they've always done in the world of OLTP. It's a tough habit to break.

One Exception: Bridge Tables

A situation that diagrammatically resembles a Snowflake Schema is a Bridge Table, although I do not consider it snowflaking. A bridge table contains the conversion mapping logic between two related entity tables.

Consider two data marts, each with their own star schema. One mart contains sales data and the other manufacturing data. We want to connect the two fact tables through the Region entities to permit analysis of sales and production data together. One way to accomplish this is to conform the Region dimensions so sales and manufacturing use the same regional boundaries. That would permit a single entity table to be used by both fact tables. However, that is not always possible from a business perspective. Sales divides geography based on the number of customers and prospects in an area while manufacturing divides geography based on factory and warehouse locations. Forcing sales to use manufacturing's boundaries or vice-versa would create a lot of unnecessary hassles for the business with essentially no real benefits.

An alternative is to use a Bridge Table that equates sales regions and production regions. For example, Toronto may be in sales region Central East but in production region Ontario. The diagram below shows the Regions Bridge Table linking these two data marts.
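
A sketch of that bridge with hypothetical names: the bridge table pairs each sales region with its equivalent production region, and each fact table is aggregated on its own side before the two are matched through the bridge.

  CREATE TABLE bridge_region (
      sales_region_key      INT,
      production_region_key INT      -- e.g. Central East paired with Ontario
  );

  SELECT b.sales_region_key,
         b.production_region_key,
         s.total_sales,
         m.total_units
  FROM bridge_region b
  JOIN (SELECT sales_region_key, SUM(sales_amount) AS total_sales
        FROM fact_sales GROUP BY sales_region_key) s
    ON s.sales_region_key = b.sales_region_key
  JOIN (SELECT production_region_key, SUM(units_produced) AS total_units
        FROM fact_manufacturing GROUP BY production_region_key) m
    ON m.production_region_key = b.production_region_key;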


Again, this appears similar to a snowflake of the Regions entities, but it actually involves no normalization so I don't consider it snowflaking.

A different definition of Bridge Tables exists that states they resolve many-to-many relationships between entities, but those situations should typically be resolved through the fact table. Bridge Tables should resolve different aggregation standards for similar entities. In addition to different geographical entities (described above), time is another frequent candidate for bridging when different data marts require different fiscal reporting periods.

Outrigger Tables

One other term that deserves comment comes from Ralph Kimball, one of the grandfathers of data warehousing and star schemas. While he strongly advises against snowflaking, he does permit some normalization of the star schema using something he calls Outrigger Tables. I will let him define it from his website:

"A dimension can contain a reference to another dimension table. For instance, a bank account dimension can reference a separate dimension representing the date the account was opened. These secondary dimension references are called outrigger dimensions. Outrigger dimensions are permissible, but should be used sparingly. In most cases, the correlations between dimensions should be demoted to a fact table, where both dimensions are represented as separate foreign keys."
One of Ralph's colleagues, Margy Ross, elaborates further:
"Outriggers are similar to snowflakes in that they’re used for many-to-one relationships, however they’re more limited. Outriggers are dimension tables joined to other dimension tables, but they’re just one more layer removed from the fact table, rather than being fully normalized snowflakes. Outriggers are most frequently used when one standard dimension table is referenced in another dimension, such as a hire date attribute in the employee dimension table."
Margy continues:
"Like many things in life, outriggers are acceptable in moderation, but they should be viewed as the exception rather than the rule. If outriggers are rampant in your dimensional model, it’s time to return to the drawing board given the potentially negative impact on ease-of-use and query performance."
While I agree with Margy that outriggers should be used sparingly, I fail to see a distinction between snowflaked tables and outriggers. The only apparent difference is that if snowflaked tables are small, then we can call them "outriggers" and they become permissible. I don't see the need for such a distinction. Snowflakes are snowflakes, even though they come in different sizes.

Avoid snowflaking at all costs. If there are too many outriggers or bridges in your model, it usually means the fact table is the problem and it needs to be split into two (or more) fact tables. Further discussion about cube design will follow in a later post.

Friday 17 April 2015

Back to the Basics - Star Schema

In the last article we covered OLAP and OLTP. This article looks at the data models that support those two data structures.

Entity-Relationship Diagrams

In the mid-1970's as normalized relational databases became popular, the Entity-Relationship Diagram (ERD) emerged as a method to model and understand them. Any relational database can be modeled using an ERD, even if it is not in Third Normal Form.

The most common form of ERD contains boxes, which represent tables, and lines connecting them, which represent joins between foreign keys. Within the boxes all of the fields are listed, with the key fields being listed first.

Here is an example of an ERD for one portion of a financial system database involving journal entries. It is only one of hundreds of such diagrams that together describe all of the tables in this financial database, but it serves as a good example for our purposes. It is far too small to be legible, but it is easy to see that there are a lot of tables (boxes) and a lot of joins (lines) between them.

Analytical Weaknesses of 3NF

The diagram above makes it apparent how Third Normal Form complicates database querying. Whenever a query involves a join, the query will take longer to process. The more joins in a query the slower it runs. If you have a business question that involves linking two fields from two different tables, the query performance will depend on where those two fields are located.

For example, if the two fields are in two adjacent tables, there will only be one join necessary and the query should run quickly. The diagram below highlights a query that links the Project table to the Journal Line table.

If however the two fields happen to be in the Project table and the Ledger Group table, the query will require 5 joins. The latter query will take noticeably longer to run than the first query, even though both may be valid business questions to ask of the data.
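
To give a feel for the difference, here is a hedged sketch of what the five-join query might look like. The intermediate table and key names are placeholders, since the real ones depend on the financial system's schema; the join path is the point, not the names.

  SELECT DISTINCT p.project_name, lg.ledger_group_name
  FROM project p
  JOIN journal_line jl    ON jl.project_id      = p.project_id
  JOIN journal_header jh  ON jh.journal_id      = jl.journal_id
  JOIN ledger l           ON l.ledger_id        = jh.ledger_id
  JOIN ledger_group_map m ON m.ledger_id        = l.ledger_id
  JOIN ledger_group lg    ON lg.ledger_group_id = m.ledger_group_id;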


Of course, queries can have far more than 5 joins. If we consider the other 100 such ERD's in this financial system, some business questions may occasionally require fields from different modules that involve a dozen or more joins.

This inconsistent analysis query performance in 3NF caused enough of a problem that an alternative data structure was created. Enter the Star Schema.

Star Schema

Whereas 3NF treats all tables as entities, Star Schema involves two different types of tables: fact tables and entity tables.

Fact tables contain records with numeric values, such as financial transactions, and foreign keys to the entities. Fact records typically are summed but occasionally they can be counted instead.

Entity tables contain descriptive data about the facts, such as department name or product details. They permit the "slicing and dicing" of the fact records.

Combined, the fact table and entity tables make up the Star Schema.


The advantage of the Star Schema is that no matter which entities you wish to query, it involves a maximum of 2 joins. In addition, any join between entities must pass through the fact table. This guarantees there will be data to aggregate for any inter-entity query.
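
A minimal sketch of such a query against the generic Star Schema above (the entity and column names are assumptions): whichever two entities the business question involves, the query passes through the fact table and never needs more than two joins.

  SELECT e1.department,
         e2.product_name,
         SUM(f.amount) AS total_amount
  FROM fact_table f
  JOIN entity_1 e1 ON e1.entity_1_key = f.entity_1_key
  JOIN entity_2 e2 ON e2.entity_2_key = f.entity_2_key
  GROUP BY e1.department, e2.product_name;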

What is not apparent from this diagram is that it is not normalized. For example, if Entity 1 contains department data, the fields may be Company, Business Unit, Department and Cost Centre. While any cost centre may only appear once in this entity table, the Company, Business Unit and Department names may repeat thousands of times. This is acceptable in Star Schema as it permits faster queries and users are not updating the records. Star Schema databases are typically updated nightly using automated processes, so repeating a business unit name thousands of times is not a problem.

Isn't denormalized data inherently inefficient though? Why not try to normalize those entity tables? That's the next topic.

Friday 10 April 2015

Back to the Basics - OLAP vs OLTP

In the last article we looked at Third Normal Form. In this article we look at a de-normalized data structure for Business Intelligence data analysis called OLAP.

OLAP vs. OLTP

In the 1980's a new method of structuring data for analysis emerged called On-Line Analytical Processing (OLAP). It was in contrast to the prevailing method for structuring data, which was On-Line Transactional Processing (OLTP).

OLTP was designed to accomplish certain purposes with regard to databases, namely:
  • efficient data storage
  • fast performance when writing data (either creating or updating records)
  • rapid query response time for returning individual records
OLTP accomplishes these purposes by strictly adhering to Third Normal Form (3NF). No data element is stored twice anywhere in the database, which ensures efficient storage and writing performance.

The majority of database systems in the world are OLTP. Think of your bank's system that tracks your chequing account transactions or your grocery store's Point-of-Sale system at the check-out counter that tallies your scanned products. Transactions still rule the world and OLTP systems will always have a prominent place in it.

However, for all its strengths, OLTP has a major weakness -- it can be slow when trying to aggregate large amounts of transaction data. For instance, my bank system is very fast at processing a cash withdrawal at an ATM or showing my last 10 transactions for the month. However, if an analyst at the bank wants to know the average deposit amount across all personal chequing accounts in eastern Canada in 2014, that may take a while. In fact, it may slow down the entire system while the query crunches through the millions of records necessary to find the answer.
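
The analyst's question translates to something like the sketch below (hypothetical table and column names). It is a perfectly reasonable query, but on a normalized transaction table it has to crunch through every matching row while the same system is trying to process withdrawals.

  SELECT AVG(t.amount) AS avg_deposit
  FROM account_transaction t
  JOIN account a ON a.account_id = t.account_id
  WHERE t.transaction_type = 'DEPOSIT'
    AND a.account_type     = 'PERSONAL_CHEQUING'
    AND a.region           = 'EASTERN_CANADA'
    AND t.transaction_date >= DATE '2014-01-01'
    AND t.transaction_date <  DATE '2015-01-01';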

This is why OLAP was created. It had one purpose only:
  • fast performance for aggregation queries
Note there is nothing mentioned about efficient data storage or data writing performance. OLAP only cares about data aggregation performance. That single purpose is tremendously freeing because all other concerns become secondary. If storing the same data element in 3 places in the database will improve aggregation query performance, then we can do it. If it slows down data update performance, that's OK. If finding an individual record takes longer than in an OLTP system, who cares?

The main consequence of OLAP's singular purpose is that it does NOT follow 3NF. OLAP is a non-normalized data structure. Some Database Administrators (DBA's) are so used to using 3NF that they try to force normalization into OLAP, but that defeats the purpose of OLAP. Normalization slows queries because it introduces additional table joins, and joins are the enemy of fast query performance.

In the next article we will compare the data models that support OLTP and OLAP.




Friday 3 April 2015

Back to the Basics - Third Normal Form

This is the last in a series of articles on data normalization. Click here for the previous article on Second Normal Form.

Third Normal Form

The best-known form of data normalization is Third Normal Form (3NF). When a database is said to be normalized, it usually means it's in 3NF.

Using the example from the previous article, we had our list of Employee of the Year award winners in 2NF. However, it is not in 3NF yet. In order to achieve that level of normalization, every attribute must describe the key.

In this case, the compound key is Year and Winner. The attributes are Reason and Department.

The Reason attribute is describing the keys. Why did the award go to Annette in 2011? Because she implemented the ERP system on time and on budget.

However, does Department actually describe the key? Is IT describing the award that went to Annette in 2011? It's really describing Annette more than the award. This means the table is not in 3NF.

Does it matter that this data is not in 3NF? It does. When we try to ask a question of the data like, "Tell me which department each winner is from," there is a challenge when we reach George Smith. He appears twice and so does his department. When they are the same, it doesn't matter which department value we return, but what if they differ? What if Department equals Sales for the 2010 record but Merchandising for the 2014 record? Which Department is correct for George? Did he change departments between those dates? Or is one of the entries simply an error? How do we know which record is wrong and which is correct? If we need to update George's department, how many records do we need to update? This kind of query ambiguity is why 3NF is important.

To resolve this problem, we need to normalize the data structure. Since Department describes the employees and not the awards, we need to create a separate table for employees as below.


Now the Employee table has a key called Employee and the Department attribute describes the key. Both tables are now compliant with 3NF.
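
As a sketch in SQL (hypothetical column types), the normalized structure looks like the following, with Department stored once per employee and the award table keeping only facts about its key:

  CREATE TABLE employee (
      employee   VARCHAR(100) PRIMARY KEY,
      department VARCHAR(100)              -- stored once per employee
  );

  CREATE TABLE award (
      award_year INT,
      winner     VARCHAR(100) REFERENCES employee (employee),
      reason     VARCHAR(200),
      PRIMARY KEY (award_year, winner)     -- the compound key
  );

  -- "Which department is each winner from?" now has one unambiguous answer.
  SELECT a.award_year, a.winner, e.department
  FROM award a
  JOIN employee e ON e.employee = a.winner;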

In honour of the creator of Third Normal Form, it can be summarized in one sentence, often referred to as the Database Administrator's Oath:

Each attribute will provide a fact about the key, the whole key, and nothing but the key, so help me Codd!

Benefits of Using 3NF


Note that George is listed twice in the Award table because he won two awards, but he is listed once in the Employee table. There is no longer any ambiguity about his department. It is stored once, so queries will always return the correct department for him and any changes to his department will only have to be written in one place. To repeat from our first article, the main benefits of using normalized data structures are as follows:

  • minimizing data storage
  • reducing data write times
  • ensuring data consistency
  • ensuring data queries would return meaningful results

Sometimes the purpose of our database varies from these goals though. What if data storage is cheap and we no longer care about reducing write times because we rarely update a record once it is written in the database? This is where de-normalized data structures come into play. We will look at one such example in the next article.



Friday 27 March 2015

Back to the Basics - Second Normal Form

This is the second article on normalized data structures. Click here to read about First Normal Form.

Second Normal Form

Second Normal Form (2NF) can only exist if a dataset complies with 1NF and each record has a unique compound key. A compound key uses more than one field to uniquely identify a record.


The dataset above requires two fields to uniquely identify each row: Year and Winner. Therefore this table is compliant with 2NF.
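
In SQL terms, a minimal sketch of that compound key might look like the following (hypothetical column types). Year alone cannot be the primary key once a single year can have two winners, so the key spans both fields.

  CREATE TABLE employee_of_the_year (
      award_year INT,
      winner     VARCHAR(100),
      department VARCHAR(100),
      reason     VARCHAR(200),
      PRIMARY KEY (award_year, winner)   -- compound key
  );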

If a table does not have a compound key, then 2NF is automatically equivalent to 1NF.

Because 1NF and 2NF do not provide much value in a querying environment, they are rarely discussed in the real world. Either a table is 3NF compliant or it is not normalized at all.

However, the prevalence of spreadsheets means many datasets are created with a completely free data structure. Data can be organized in a spreadsheet without regard to any database rules, and as long as the data stays in a spreadsheet this isn't a problem. The challenge comes when spreadsheet data needs to be loaded into a database system or data warehouse -- that's when the need for structured data becomes apparent.

Friday 20 March 2015

Back to the Basics - First Normal Form

Sometimes it's good to go back to the basics for a refresher lesson. Last year I did some training with analysts and realized not everyone who works with data knows what data normalization actually means. Most have heard of it but not all understand it.
So I'm going to take a few posts and go back to the basics.

Data Normalization

The concept of data normalization was first described by Edgar F. Codd in 1971 while he was working at IBM. He is credited with creating the theoretical basis for relational databases.
The purpose of normalized data structures included:
  • minimizing data storage
  • reducing data write times
  • ensuring data consistency
  • ensuring data queries would return meaningful results
The first 3 bullets can be accomplished by ensuring each data element is only stored once.  The fourth bullet requires some additional structure. All four purposes were accomplished by the normalization rules that E.F. Codd developed.

First Normal Form

There are different definitions of the first normal form (1NF) on the internet, but I find it easiest to think of it in these terms:
  • Each field has only one value
  • Each row has a unique key
The dataset below shows the winners of the company's Employee of the Year award for the past few years. It violates 1NF because it contains two year values in the first row. It appears that George won the award in 2010 and again in 2014, so that information has been stored in a single record.


 By separating out the 2010 and 2014 information into distinct rows, this dataset becomes compliant with 1NF. The Year column provides a unique key for each row.


However, let's take the dataset back one year when there was an unusual situation.  Two people, Sam and Maria, shared the award in 2009. Once again our dataset violates 1NF because two fields have multiple values, Winner and Department.


We solve the problem just like we did before by separating out the 2009 record into two rows, one for each award recipient. However, now the Year is no longer a unique key.  The year 2009 shows up twice, so we need to add another field to uniquely identify each row. Winner works nicely, so now Year and Winner together become the unique key. Whenever there is more than one field in a key, it is called a Compound Key, and that leads nicely into the next article on Second Normal Form (2NF).