Management for All: DATA WAREHOUSE SCHEMA

Sunday, June 22, 2014

DATA WAREHOUSE SCHEMA

One of the key questions to be answered by the database designer is: How can we design a database that allows unknown queries to be performant? This question encapsulates the differences between designing for a data warehouse and designing for an operational system. In a data warehouse one designs to support the business process rather than specific query requirements. In order to achieve this, the designer must understand the way in which the information within the data warehouse will be used.

In general, the queries directed at a data warehouse tend to ask questions about

some essential fact, analyzed in different ways. For example reporting on:

-The average number of light bulbs sold per store over the past month

-The top ten most viewed cable-TV programs during the past week

-Top spending customers over the past quarter

-Customers with average credit card balances more than Rs.10,000 during the past year.

Each of these queries has one thing in common: they are all based on factual data. The content and presentation of the results may differ between examples, but the factual and transactional nature of the underlying data is the same.

sbobet asia indonesia Read More About News, Trends and Ideas that matter most to Entrepreneurs Visit SnapMunk

Table 1: Fact Tables and Attributes

REQUIREMENT	FACTS	ATTRIBUTES
Sales of Light Bulbs	EPOS Transaction.	Quantity Sold Product Identifier (SKU) Store Identifier Data and Time Revenue Achieved.
Cable Programs	Cable Pay-per-view Transaction	Customer Identifier Cable Channel Watched Program Watched Data and Time Duration Household Identifier.
Customer Spend	Loyalty Card Transaction	Customer Identifier Store Identifier Transaction Value Date and Time
Customer Account	Account Transactions	Customer Identifier Account Number Type of Transaction Destination Account Number

Fact data possesses some characteristics that allow the underlying information in the database to be structured. Facts are transactions that have occurred at some point in the past, and are unlikely to change in the future. Facts can be analyzed in different ways by cross-referencing the facts with different reference information.

For example, we can look at sales by store, sales by region, or sales by product. In a data warehouse facts also tend to have few attributes, because there are no operational data overheads. For each of the examples described, the attributes of the fact could be as listed in Table 1 above.

One of the major technical challenges within the design of a data warehouse is to structure a solution that will be effective for a reasonable period of time (at least three to five years). This implies that the data should not have to be restructured when the business changes or the query profiles change. This is an important point, because in more traditional applications it is not uncommon to restructure the underlying data in order to address query performance issues.

Fig : 1 Star Schema

The inherent advantage of factual transactions is that their content is unlikely to change, regardless of how it is analyzed. Also, the majority of the data volume of a data warehouse comes from the factual information. It is therefore possible to treat fact data as read-only data, and the reference data (used to interpret the fact data) as data that is liable to change over time. Thus, if reference data needs to change, the voluminous fact data would not have to be changed or restructured.

Star schemas (so called due to their ‘star like’ appearance) are physical database structures that store the factual data in the ‘center’, surrounded by the reference (or dimension) data (see Figure 1 above).

The dimension tables should be relatively small (typically less than 10 GB in total) in comparison to the size of the data warehouse, so that restructuring costs are small as long as the keys to the fact tables are not changed. In star schema arrangements, the reference information is often denormalized to a single table to speed up query performance. The redundancy overheads are acceptable, as the sizes involved are small and even the reference information changes infrequently.

The dimensional information, represented by the reference data is often organized in form of concept hierarchies that are used for analysis, and designing data warehouse aggregations.

A typical concept hierarchy for a retail chain is depicted in Figure 2 below :

Fig. 2 : Concept Hierarchies

The number of concept hierarchies that can be defined on any given dimension is by no means restricted to one. There may be several arbitrary concept hierarchies in use in any organization to enable data analysis from various angles. Figure 3 below shows how the ‘time’ dimension in a retail data warehouse (represented by the day / date reference data) may be organized into several concept hierarchies or groupings (which are single level concept hierarchies – ‘Easter’ and ‘Summer’ in Figure 3)

Fig. 3 : Multiple Hierarchies

When the concept hierarchies and groupings are incorporated into a star schema diagram (like Figure 4 below), the appearance resembles a ‘snowflake’. Hence, schemas of this type are called Snowflake schemas.

Fig 4 : Snowflake Schema

Linkbar

Management for All

Subscribe through E-mail