Custom Search

Popular Posts

Friday, June 20, 2014

DATA WAREHOUSE


To facilitate strategic decision-making, we need a new breed of information delivery environment, called a data warehouse. The concept of a data warehouse given by Bill Inmon, the father of data warehousing, is depicted in Figure-1. 

Figure -1 : What is Data Warehousing ?

 
The defining characteristics of a data warehouse are:


Subject-orientation: Data warehouse data are arranged and optimized to provide answers to questions coming from diverse functional area within a company. Therefore, the data warehouse contains data organized and summarized by topic, such as sales, marketing, finance, distribution, and transportation. For each one of these topics the data warehouse contains specific subjects of interest - products, customers, departments, regions, promotions, and so on. Note that this form of data organization is quite different from the more functional or process-oriented organization of typical transaction systems. 

Time-variancy: We have already noted that the DSS data include a time element (see Table-1). In contrast to the operational data, which focus on current transactions, the warehouse data represent the flow of data through time. The data warehouse can even contain projected data generated through statistical and other models. 

Non-volatility: Once data enter the data warehouse they are never removed. Because the data in the data warehouse represent the company’s entire history, the operational data representing the near-tern history, are always added to it. Because data are never deleted and new data are always added, the data warehouse is always growing. That is why the DSS DBMS must be able to support multi-gigabyte and even multi-terabyte database and multiprocessor hardware.

Integration: The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization. Thus the data warehouse consolidates data from multiple and diverse sources with diverse formats. Data integration implies a well-organized effort to define and standardize all data elements. This integration effort can be time-consuming but, once accomplished, it provides a unified view of the overall organizational situation. Data integration enhances decision-making and helps managers to better understand the company’s operations. This understanding can be translated into recognition of strategic business opportunities.   

Table-1 summarizes the differences between the data in a data warehouse and that in an operational database.

Table-1 : Operational Data and Data Warehouse Data

CHARCTERISTIC
OPERATIONAL DATABASE DATA
DATA WAREHOUSE DATA
Integrated
Similar data can have different representations or meanings. For example, telephone numbers may be stored as 033-29-70701 or as 0332970701, and a given condition may be labelled as T/F or 0/1 or Y/N. A sales value may be shown in thousands or in millions.
Provide a unified view of all data elements with a common
definition and representation for all business units.
Subject-oriented
Data are stored with a functional, or process, orientation. For example, data may be stored for invoices, payments, credit amounts, and so on.
Data are stored with a subject orientation that facilities, multiple views of the data and facilitates decision making .For example, sales may be recorded by product, by division, by manager, or by region.
Time-variant
Data are recorded as current transactions. For example, the sales data may be the sale of a product on a given date, such as Rs. 342.78 on 12-AUG-2012.
Data are recorded with a historical perspective in mind. Therefore, a time dimension is added to facilitate data analysis and various time comparisons
Non-volatile
Data updates are frequent and common. For example, an inventory amount changes with each sale. Therefore, the data environment is fluid.
Data cannot be changed. Data are only added periodically from historical systems. Once the data are properly stored, no changes are allowed. Therefore the data environment is relatively static.

 
Figure-2 : Warehouse architecture
 

THE LOAD MANAGER

Data flows into the data warehouse through the load manager. The data is mostly extracted from the operational database(s) and other internal sources (like archived historical data), and supplemented by data imported from external sources. Externally sourced data can greatly enhance the value of information generated from a data warehouse. For example Transco, the gas pipeline operator in UK, uses weather forecast data from the British Met Office on a regular basis to determine demand for gas (the main source of energy used for heating homes and offices) in various areas of the country. The weather data is fed into a model that incorporates several other factors (e.g. day of the week, internal data about customers’ usage patterns, demographic and economic profile data, alternate sources of energy, types of buildings in the area) to arrive at a demand forecast. Types of data from external sources that may be included in data warehouse are: financial indicators and statistics of the industry, market share data of competitors, demographic data, weather data, credit worthiness data, readership / viewer survey data for advertising media, specially commissioned surveys and so on. External data is usually obtained from commercial database services or government agencies (e.g. Equifax, Reuters, Met Office, census agency, industry associations, stock exchanges, local government statistics service). The data from such diverse sources will obviously be in different incompatible formats and will be distributed through various media. Some of them may be available on a downloadable format on the Internet; others may be distributed on CD-ROMs, while some may only be available on printed media. Some data may be available for free but most data (particularly when used for commercial purposes) have to be purchased. 

The load manager primarily performs what is termed an Extract-Transform-Load (ETL) operation.

-Data Extraction
-Data Transformation
-Data Loading

Data Extraction: This function has to deal with numerous data sources. Appropriate techniques have to be employed for each data source. Source data may be from different source machines in diverse data formats. Part of the source data may be in relational database systems. Some data may be on other legacy network and hierarchical data models. Many data sources may still be in flat files. There may also be the need to include data from spread sheets and local departmental data sets. Data extraction can become quite a complex operation at times.

Various tools are available on the market for data extraction. Use of outside tools may be considered suitable for certain data sources. For the other data sources, inhouse programs may need to be developed to do the data extraction. Purchasing outside tools may entail high initial costs. In-house programs, on the other hand, may mean ongoing costs for development and maintenance.

After extraction, the data needs to be kept somewhere for further preparation. Sometimes the extraction function is performed in the legacy platform itself if that approach suits the designed framework. More frequently, data warehouse implementation teams extract the source data into a separate physical environment from which moving the data into the data warehouse would be easier. In the separate environment, the source data may be extracted into a group of flat files, or an intermediate relational database, or a combination of both. This physical environment is called the data-staging area. 

Data Transformation: In every system implementation, data conversion is an important function. For example, when implementing an operational system such as a magazine subscription application, the database has to be initially populated with data from the existing system records. The conversion may either be from a manual system or from a file-oriented system to a modern system supported with relational database tables. In either case, the data will need to be converted from the existing systems. So, what is so different for a data warehouse? Why is data transformation for a data warehouse more involved than that for an operational system?

As already discussed, data for a data warehouse comes from many disparate sources. If data extraction for a data warehouse poses great challenges, data transformation presents even greater challenges. Another factor in the data warehouse is that the data feed is not just an initial one-time load. The ongoing changes will have to continue to be picked up from the source systems. Any transformation tasks are set up for the initial load will have to be adapted for the ongoing revisions as well.

A number of individual tasks are performed as part of data transformation. First, the data extracted from each source is cleaned. Cleaning may be correction of misspellings, or may include resolutions of conflicts between state codes and pin codes in the source data, or may deal with providing default values for missing data elements, or elimination of duplicates when the same data is brought in from multiple source systems.

Standardization of data elements forms a large part of data transformation. The data types and field lengths for same data elements retrieved from the various sources need to be standardized. Semantic standardization is another major task. Synonyms and homonyms have to be resolved. Resolution of synonyms is required when two or more terms from different source systems mean the same thing. On the other hand, when a single term means many different things in different source systems, resolution of homonyms have to be performed. 

Data transformation involves many forms of combining pieces of data from the different sources. In some cases, data from a single source record or related data elements from many source records are combined. In other situations, data transformation may also involve purging source data that is not useful and/or separating out source records into new combinations. During data transformation sorting and merging of data takes place on a large scale in the data staging area. 

In many cases, the keys chosen for the operational systems are field values with built-in meanings. For example, the product key value may be a combination of characters indicating the product category, the code of the warehouse where the product is stored, and some code to show the production batch. Primary keys in the data warehouse cannot have built-in meanings. Therefore, data transformation also includes the assignment of surrogate keys derived from the source system primary keys.

A grocery chain point-of-sale operational system keeps the unit sales and revenue amounts by individual transactions at the checkout counter at each store. But in the data warehouse, it may not be necessary to keep the data at this detailed level. It may be more appropriate to summarize the totals by product at each store for a given day and keep the summary totals of the sale units and revenue in the data warehouse’s storage. In such cases, the data transformation function would include such summarization processing.

The end result of the data transformation function is a collection of integrated data that is cleaned, standardized, and summarized. Now the data is ready to be loaded into each data set in the data warehouse. 

Data Loading: Two distinct groups of tasks form the data loading function. After completion of the design and construction of the data warehouse, when it goes live for the first time, the initial loading of data is done. The initial load moves large volumes of data and takes substantial amount of time, but it is a one-time effort. As the data warehouse starts functioning, extraction of additions (and changes) to the source data continues on an ongoing basis, together with the transformation and loading operations. 

THE QUERY MANAGER

The query manager provides an interface between the data warehouse and its users. It performs tasks like directing the queries to the appropriate tables, generating views on an ad-hoc basis if required, monitoring the effectiveness of indexes and summary data, and query scheduling. 

Data Warehouse Design Considerations

The key considerations involved in the design of a data warehouse are:

-Time Span
-Granularity
-Dimensionality
-Aggregations
-Partitioning 

Time span: Operational data represent current (atomic) transactions. Such transactions might define a purchase order, a sales invoice, an inventory movement, and so on. In short, operational data cover a short time frame. In contrast, data warehouse data tend to cover a longer time frame. Managers are seldom interested in a specific sales invoice to customer X; rather they tend to focus on sales generated during the last month, the last year, or the last five years. Rather than concern themselves with a single customer purchase, they might be interested in the buying pattern of such a customer or groups of customers. In short, data warehouse data tend to be historic in nature. That is, the data warehouse data represent company transactions up to a given point in time, yesterday, last week, last month, and the like. The time period for which data is held in the data warehouse is determined by the data analysis requirements of the users of the data warehouse. These needs, in turn, arise from the changes in the business environment that a particular organization needs to monitor, in its effort to stay ahead of its competitors. Since, a data warehouse’s size depends on the span of time for which data is stored, the time span covered by the data warehouse is an important design consideration. If, for example, the environment changes rapidly, the data required for analysis would relate more often to the recent past, rather than that over several years or decades. In that case the designers of the data warehouse need to consider whether or not the cost incurred in holding data for indefinitely long time spans would be worthwhile. 

Granularity: According to Inmon, the single most important design aspect of a data warehouse is the decision on granularity. It refers to the level of detail or summarization available in units of data in the data warehouse. The more detail there is, the lower the level of granularity. The less detail there is, the higher the level of granularity.  

Operational data represent specific transactions that occur at a given time, such as customer purchase of product X in store A. Thus, granularity is taken for granted to be of the lowest level, in operational systems. Data warehouse data must be presented at different levels of aggregation, from highly summarized to near atomic. This requirement is based on the fact that managers at different levels in the organization require data with different levels of aggregation. It is also possible that a single problem requires data with different summarization levels. For example, if a manager must analyze sales by region, (s)he must be able to access data showing the sales by region, by city within the region, by store within the city within the region, and so on. In this case, the manager requires summarized data to compare the regions, but (s)he also needs data in a structure that enables him or her to decompose (drill down) the data into more atomic components (that is, data at lower levels of aggregation). For example, it is necessary to be able to drill down to the stores within the region in order to compare store performance by region. Granularity level in a data warehouse cannot, therefore, be assumed. 

The decision on granularity level profoundly affects both the volume of data that resides in the data warehouse, and the type of query that can be answered. A trade off exists between the volume of data in the data warehouse and the level of detail of queries. Some data warehouses are designed to support dual granularity. In such environments some data (usually the most recent) is held at a relatively low level of granularity, while the rest is held in more summarized form (i.e. at a higher granularity level). This enables detailed analysis at the same time allows reduction of data volume. 

Dimensionality: This is probably the most distinguishing characteristic of a data warehouse. From the data analyst’s point of view, the data is always related in many different ways. For example, when we analyze product sales by a customer during a given time span, we are likely to ask how many widgets of type X were sold to customer Y during the last six months. In fact, the question tends to expand quickly to include many different data dimensions. For instance, we might want know how the product X fared relative to product Z during the past six months, by region, state, city, store, and customer (or sales of various products by quarters by country). In this case, both place and time are part of the picture. In general, data analysis tends to include many data dimensions, producing a multidimensional view of the data.  

The data model used for modelling data warehouses is known as the dimensional model. The numerical measurements related to the business (like sales volumes) are stored in fact tables. The descriptions of the dimensions are stored in dimension tables. The number and types of dimensions and facts that are to be stored in a data warehouse is a very important design decision, and (much like the decision on granularity) affects both the data volume and the types of analysis that can be supported.

Aggregations: We have seen how data analysis queries directed at data warehouses involve dimensions. Another very common type of query directed at data warehouses involves sums of values along the different dimensions. For example: what is the total sales volume of LAPTOP during the past 4 quarters?  Answering this query using the fact and dimension tables would involve summing up the individual sales volume figures over the 4 quarters and the 3 counties. In real situations, similar queries might involve retrieving and summing hundreds or thousands of individual values. To avoid excessive processing load on the data warehouse arising from frequently asked queries of this type, it is often decided, at design time, to store some pre-calculated aggregations, along with the base facts, in the data warehouse. This decision affects the data volume and performance of certain types of queries. 

Partitioning: One of the essences of the data warehouse is flexible data storage, management, and access. When data resides in large physical units, among other things it cannot be:

-indexed easily
-sequentially scanned, if needed
-restructured easily
-backed up conveniently
-recovered easily
-monitored easily 

In short, having a big mass of data defeats much of the purpose of the data warehouse. The purpose of partitioning is to break the data into smaller (more manageable) physical units of storage. The criteria used for dividing the data can be: date, line of business / product category, geography / location, organizational / administrative unit, or any combination of these.

0 comments:

Blog Widget by LinkWithin