Saturday, January 17, 2015

DataWareHouse Concepts in Breif

What is DataWareHouse?


A warehouse is a place where something is stored a goods is stored. e.g. An eCommerce Website have a warehouse where they will project a demand for a product they will procureth (get/obtain) from client and store in the warehouse. As soon as the supplier places an order the good is immediately dispatched.

Taking Warehouse out of this equation, what do you get? You have the customer directly going to the eCommerce website, the eCommerce site is not storing any goods, when an order is placed the eCommerce website people will directly go to the supplier and ask for the product. 

Now if 100 supplier go to the manufacturer asking for the product becasue there is demand from the consumer, imagine the strained (adj - showing signs of nervous tension or tiredness) that is placed on the manufacturer to supply those products. That is looking from manufacturer percepective.
Now looking from customer perspective there will be delay in delivering the product which customer has ordered. Nobody would like a delay. If there is delay of 2-3 days it will affect customer satisfaction level and you will loose customer.

Similar to this is the "Data WareHouse", where you will be storing the data that you will precure from the transaction system.

There are two terms:
1. OLTP - Online Transaction Processing.
2. OLAP - Online Analytical Processing.

Analytical processing is where you use the Data Warehouse. Transaction processing is something that you will use to record each and every transaction.

Example 1 - ATM
 We all use ATMs. Every transaction we do in ATM is record in OLTP system. Besides ATM when you go to bank account and perform transaction, even that is recorded. So there are multiple sources feeding into a particular system. Now if you want to perfrom query on the system you will have to join multiple sources. Which have different formating type of their own. Second disadvantage is number of transaction - 100 of customers use an ATM in a given day, and millions of queries are being hit on OLTP system. Imagine the load going into OLTP system. So this system is definetely not used for Query purpose or Analytical purpose. This system is only used to record a transaction. 

Example 2 - Railway Reservation
There are many ways in which you can reserve a railway ticket. You can go through mobile, Directly go to railway station, railway website, go to n number of agents that are spread across the city. These are multiple desperate sources. Format of ticket booking on website is entrily different from booking it from mobile. Data Type and Kind of data is changing. Multiple deseparate sources make it difficult for quering. 

Based on this the enduser, at the end of the day would want reporting. He want data for his reporting purpose. So we create an alternate system. This is called as OLAP system.

As you can see from the diagram there are multiple sources Source 1, Source 2, Source 3 feeding into the Data WareHouse. This is not simple loading. Lots of calculating, varification is done before loading the data into the WareHouse and at the end you have multiple users trying to access this warehouse and trying to get data and generate reports.

Why do we need DataWareHouse

We all know this is an era of competition, it is not just taking smart decision but taking the decision on time.  Assume a super market chain is not implementing any datawarehouse. It is very difficult for the supermarket to analyse:
What product are sold
What product are not selling
When is the time selling goes up
What is the age of person buying a particular product.
None of these analysis is possible. What actually happen is some goods getting out some good getting in without knowing anything.

That is not an ideal scenario. I want to make a decision when I say this particular product hit among the age group 18-25. I will intensify my effort to market that particular product to that particular age group. Also from the data I see this particular product is not selling as it is expected. So I will analyse why and the reason for this. This particular product is not selling during a particular time (like Dec-Feb) or it is not selling throughout the year. Then I will be in a better situation to take the decision of either not to take the product or suggest some improvement so that the product start selling.

These are smarter decision which if you do not take on time customer will move out. 

When I talk about stretegic value  that is given to a company. Taking example of procurement department. Every company procures certain things from suppliers. There will be hundred of suppliers supplying a product to company (for example Desktop, Laptops, stationary etc). Now before making purchase the ompany definetely ink a contract with the supplier, saying if you are charging 100 for a particular product please give us at a discount of 80 Rupees and what are the terms that should be followed when procurement is made etc.
Now what is the gaurantee that the supplier is following all the terms set in the contract. When a particular purchase is made, the suppliers provide an invoice to the company. That invoice is record of that transaction. Now I want to match this invoice data with actual contract data to see whether all the terms are being matched or not. If the invoice says product has been charged 100 instead of 80,how you will analyse if the data is not there. 100 of suppliers are providing such invoices to the company. You cannnot go and sit with each and every suppliers. 
What we will do now is: We will load the data into dimensions and extract reports and see whether there is a match or mismmatch. If there is a mismatch we will schedule a meeting with the supplier and ask him why he has done so. This is stretegic advantage and also it gives the company more negotiating power. You can question for mismatched. But in the absence of data you end up paying more.

  •  Primary reason for a Datawarehouse is, for a company to get the extra edge over its competitors.
  •  This extra edge can be gained by taking smarter decisions in a timely manner.
  •  Smarter decision can be taken only if the executives responsible for taking such decision have data at their disposal.


There was a time when fact based decision and experienced based decision making is much more prevelant. We are moving from that area and going into an area fact based decision have gain importance in our life. So data is very essential.

Some stretegic Questions that a manager has to answer:
Q. How do we increase the market share of this company by 5%?
Q. Which product is not doing well in the market?
Q. Which agent need help with selling policies? 
Example from analysing data of Warehouse you can find an agent not selling policies as per expectations. You will identify whether there is deficiency in company or Employee. Does he need more training. 
Q. What is the quality of the customer service provided and what improvement are needed?

Why is DataWareHouse so important?


Picture: Why is DataWareHouse so important

What approach manager follow to arrive at final decision?

What is the quality of the customer service provided and what improvement are needed?

This is a larger question. Now he will break it down into smaller questions.

Subset Question 1: How many customer feedback we have in last 6 months? 
Now he will fire query on database which has all the customer details.

Subset Question 2: How many customer have given a feedback of Excellent, How many Average and How many Bad?
You will go to the database and by grouping you will find how many excellent, Average or Bad feedback you got.

Subset Question 3: Now I need to know why are people giving bad, Average and Excellent? What are the comments or improvement areas highlighted by customers who have rated us bad or average?
May be one person said your distribution channel is not good, may be some say support is not good, and that is the reason why they have given.  So we can identify why they have given that feedback.

All these questions combined you will have an overall picture of the quality of customer service and what improvement needed.

We will hit the data warehouse to get results of these questions, then consolidate these results and arrive at a final conclusion. In the absense of data there is no way of knowing.

One more important factor is trends. Because data ware house hold the entrire history. Jan 2015 to Aug 2014. It holds data till august 2014. With this data we can see what trend is happening:
  •  A Particular product is being sold at a particular month at higher level.
  •  A Particular product sale goes down at a particular month. 
  •  And Does that happens at a regular interval?

We can spot that trend this history. 

When we spot that trend, because of this history and when I spot that trend, I can make out common factor and I can take a decision to why that is happening. 
An operation system do not provide trends. They are transaction system and you may purge the data in an operation system over time. You cannot maintain from starting till end. But in DataWareHouse the main purpose is to maintain history. You will not be deleting history in immediate future.
Result are provided in ready to access format where you can take the analysis.

What is ETL?


The full form of ETL is Extract Transform and Load.

Now question are: 
  • What to extract? 
  • From where to Extract?
  • What to Transform
  • How to Transform?
  • Where to Load it?

As you know there are multiple desperate sources that are loading the data into the datawarehouse so you need to extract from the multiple different sources. There is no consistency in the data in the OLTP system, so you need to standardize data that is coming in and then you have to load it into the datawarehouse. 

Usually Most of the companies (Banking and Insurance sector) use Mainframe system, which are very legacy system (Very old) and very difficult for reporting. And now they are trying to move to DataWareHouse system.  

Usually in a production environment the files( or data) is extracted from the mainframe and send to a unix or windows server in a file format. Each file will have a specific file format. They can send multiple file as well depending on the requirement. 

Let's say they send the file at 3 AM in the early morning, so we process those files using an ETL tool. Some of the ETL tool are: Informatica and Data Stage. These are not open source. Talent is one open source.

We use any one of these ETL tool to cleanse the data. Cleansing is important because special character might come, emailID not in proper format etc. So the Unwanted spaces, unwanted character can to be removed, Data need to be cleansed using the ETL tools.

Then they(File, data from different sources) are loaded into an area called the Staging area. In the Staging area all the business rules are applied. For example there is a Bussiness rule saying that a particular record that is coming in always be presend in the master table record. If it is not present we will not be moving it further. So we will have to look up on the master table, if the record is present we will move it further next level else left it in Staging.
Then we load in into the Dimention table. Schedular are also available to run the job exactly at 3 AM ( or any particular time) or you can run the job when the file arrives. There can be time dependency as well as file dependency. 

At the end we need to validate whether the job has been executed successfully or the data has been loaded successfully or not.

DataWareHouse Architecture


There are multiple transactional systems Source 1, Source 2. Now these source are not only Mainframes, it can be SAP, Flat files. So there can be combination of sources. ETL tool is used to load the DataWareHouse and the DataMarts.

Difference between DataMart and DataWareHouse is: 
DataWareHouse is used across the organization and DataMart are used for individual customized reporting. For example there are multiple department in the company. Finance department is different from the marketing department. They all draw data from different sources and they need customized reporting. Finance department is mainly concern with the statistics, Marketing department is mainly concern with the promotions. The marketing department may not get the finance information. So for customized report they create subset of the datawarehouse which are called Datamarts.

There are two approaches:
1. First load the DataWareHouse and then load the DataMart.
2. Load the DataMart and then Load the DataWareHouse.

Advantage of these approaches I will be post in upcoming posts.

Repoting (Data Access Layer):
User actually accesses the DataWareHouse and generates the report. All these reporting tool are meant to make the front interface very easy for the consumer because people at decision making level usually not concerned with technical details. They are not concerned with writing queries and extracting reports. They are mainly concern with neat usable report only.

So all these reporting tool does that at the front end but at the backend they generate the queries, hit the datawarehouse and user get the report just in time. These reporting tool can also schedule the job to run and generates the report depending on a particular schedule. 

Advantages of DataWareHouse

  • Standardizes data across an organization.
  • Smarter decision for companies - Move toward fact based decisions 
  • Reduce Cost 
    • Drop products that are not doing well.
    • Negotiate for improvement with suppliers
  • Increase Revenue
    • Work on high selling products
    • Customer Satisfaction - Know what is working and what is not.

1 comment:

Mongodb explain() Query Analyzer and it's Verbosity

First creating 1 million documents: > for(i=0; i<100; i++) { for(j=0; j<100; j++) {x = []; for(k=0; k<100; k++) { x.push({a:...