Step-by-Step Guide to Data Analysis

For  most businesses and government agencies, lack of data isn’t a problem.  In fact, it’s the opposite: there’s often too much information available  to make a clear decision.

With so much data to sort through, you need something more from your data:

  • You need to know it is the right data for answering your question;
  • You need to draw accurate conclusions from that data; and
  • You need data that informs your decision-making process.

In short, you need better data analysis. With the right data analysis  process and tools, what was once an overwhelming volume of disparate  information becomes a simple, clear decision point.

To improve your data analysis skills and simplify your decisions, execute these five steps in your data analysis process:

Step 1: Define Your Questions

In your organizational or business data analysis, you must begin with  the right question(s). Questions should be measurable, clear and  concise. Design your questions to either qualify or disqualify potential  solutions to your specific problem or opportunity.

For example, start with a clearly defined problem: A government contractor is experiencing rising costs and is no longer able to submit competitive contract proposals. One of many questions that could help solve this business problem is: Can the company reduce its staff without compromising quality?

Step 2: Set Clear Measurement Priorities

This step breaks down into two sub-steps: A) Decide what to measure, and B) Decide how to measure it.

A) Decide What To Measure

Using the government contractor example, consider what kind of data  you’d need to answer your key question. In this case, you’d need to know  the number and cost of current staff and the percentage of time they  spend on necessary business functions. In answering this question, you  likely need to answer many sub-questions (e.g., Are staff currently  under-utilized? If so, what process improvements would help?). Finally,  in your decision on what to measure, be sure to include any reasonable  objections any stakeholders might have (e.g., If staff are reduced, how  would the company respond to surges in demand?).

B) Decide How To Measure It

Thinking about how you measure your data is just as important,  especially before the data collection phase, because your measuring  process either backs up or discredits your analysis later on. Key  questions to ask for this step include:

  • What is your time frame? (e.g., annual versus quarterly costs)
  • What is your unit of measure? (e.g., USD versus Euro)
  • What factors should be included? (e.g., just annual salary versus annual salary plus cost of staff benefits)
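
As a rough illustration of how these choices change the number you end up analyzing, the sketch below computes staff cost under different time frames and cost factors. The StaffMember class, the total_cost function and all figures are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class StaffMember:
    # Hypothetical record layout; all amounts are in USD per year.
    name: str
    annual_salary_usd: float
    annual_benefits_usd: float


def total_cost(staff, include_benefits=True, period="annual"):
    """Sum staff cost in USD for the chosen cost factors and time frame."""
    periods_per_year = {"annual": 1, "quarterly": 4}[period]
    total = 0.0
    for member in staff:
        cost = member.annual_salary_usd
        if include_benefits:
            cost += member.annual_benefits_usd
        total += cost
    return total / periods_per_year


# Made-up figures, for illustration only.
staff = [
    StaffMember("Analyst A", 80_000, 20_000),
    StaffMember("Analyst B", 60_000, 15_000),
]

salary_only = total_cost(staff, include_benefits=False)
fully_loaded = total_cost(staff)
quarterly_loaded = total_cost(staff, period="quarterly")
print(salary_only, fully_loaded, quarterly_loaded)
```

The same staff list yields three different "costs" depending on the measurement decisions, which is why those decisions should be fixed before data collection starts.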

Step 3: Collect Data

With your question clearly defined and your measurement priorities  set, now it’s time to collect your data. As you collect and organize  your data, remember to keep these important points in mind:

  • Before you collect new data, determine what information could be  collected from existing databases or sources on hand. Collect this data  first.
  • Determine a file storing and naming system ahead of time to help all  tasked team members collaborate. This process saves time and prevents  team members from collecting the same information twice.
  • If you need to gather data via observation or interviews, then  develop an interview template ahead of time to ensure consistency and  save time.
  • Keep your collected data organized in a log with collection dates  and add any source notes as you go (including any data normalization  performed). This practice validates your conclusions down the road.
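
One possible shape for such a collection log is a small CSV file with one row per collected dataset. The sketch below is an assumption about how a team might do this, not a prescribed format: the field names and the append_log_entry helper are invented for the example, and an in-memory buffer stands in for a real log file.

```python
import csv
import io
from datetime import date

# Hypothetical column names for the collection log.
LOG_FIELDS = ["collected_on", "dataset", "source", "normalization_notes"]


def append_log_entry(log_file, dataset, source, notes="", collected_on=None):
    """Append one row to the log, writing the header if the log is empty."""
    writer = csv.DictWriter(log_file, fieldnames=LOG_FIELDS)
    if log_file.tell() == 0:
        writer.writeheader()
    writer.writerow({
        "collected_on": (collected_on or date.today()).isoformat(),
        "dataset": dataset,
        "source": source,
        "normalization_notes": notes,
    })


# In-memory buffer used here so the example runs without touching disk.
log = io.StringIO()
append_log_entry(log, "staff_costs", "HR payroll export",
                 notes="converted EUR to USD", collected_on=date(2024, 1, 15))
append_log_entry(log, "utilization", "timesheet system",
                 collected_on=date(2024, 1, 16))

rows = list(csv.DictReader(io.StringIO(log.getvalue())))
for row in rows:
    print(row["collected_on"], row["dataset"], row["source"])
```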

Step 4: Analyze Data

After you’ve collected the right data to answer your question from  Step 1, it’s time for deeper data analysis. Begin by manipulating your  data in a number of different ways, such as plotting it out and finding  correlations or by creating a pivot table in Excel. A pivot table lets  you sort and filter data by different variables and lets you calculate  the mean, maximum, minimum and standard deviation of your data.
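
For readers working outside Excel, the grouping and summary statistics a pivot table provides can be approximated in a few lines of code. This is a minimal sketch using only the Python standard library; the pivot_summary function and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up records: (department, weekly hours spent on core business functions).
records = [
    ("Engineering", 32.0),
    ("Engineering", 36.0),
    ("Engineering", 40.0),
    ("Proposals", 20.0),
    ("Proposals", 30.0),
]


def pivot_summary(rows):
    """Group values by key and compute count, mean, min, max and sample stdev."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    summary = {}
    for key, values in groups.items():
        summary[key] = {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            # The sample standard deviation needs at least two values.
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


summary = pivot_summary(records)
for department, stats in summary.items():
    print(department, stats)
```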

As you manipulate data, you may find you have the exact data you  need, but more likely, you might need to revise your original question  or collect more data. Either way, this initial analysis of trends,  correlations, variations and outliers helps you focus your data analysis on better answering your question and any objections others might have.

During this step, data analysis tools and software are extremely helpful. Minitab and Stata are both good software packages for advanced statistical data analysis, and Visio is useful for diagramming processes and results. However, in most cases, nothing quite compares to Microsoft Excel as a decision-making tool.

Step 5: Interpret Results

After analyzing your data and possibly conducting further research, it’s finally time to interpret your results. As you interpret your analysis, keep in mind that you can never prove a hypothesis true; you can only fail to reject it. No matter how much data you collect, chance could always interfere with your results.
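
One way to get a feel for the role of chance is a permutation test, which is a technique chosen here for illustration and not one the steps above prescribe. The sketch below shuffles the group labels many times and estimates how often a difference in means at least as large as the observed one appears by chance alone. The permutation_p_value function and the sample figures are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate the share of random relabelings whose difference in means
    is at least as large as the observed difference."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up data: hours per proposal before and after a process change.
before = [42, 45, 39, 47, 44, 41, 46, 43]
after = [38, 40, 37, 41, 39, 36, 42, 38]
p_value = permutation_p_value(before, after)
print(f"Estimated p-value: {p_value:.4f}")
```

A small value suggests the observed difference would be unusual if the change had no effect, but it does not prove the hypothesis; a large value only means the data fail to rule out chance.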

As you interpret the results of your data, ask yourself these key questions:

  • Does the data answer your original question? How?
  • Does the data help you defend against any objections? How?
  • Are there any limitations on your conclusions, any angles you haven’t considered?

If your interpretation of the data holds up under all of these  questions and considerations, then you likely have come to a productive  conclusion. The only remaining step is to use the results of your data  analysis process to decide your best course of action.


Data Analysis


  • Decide on the objectives: The first step of the  data value chain must happen before there is data: the business unit has  to decide on objectives for the data science teams. These objectives  usually require significant data collection and analysis. Since we are  looking at data to drive decision-making, we need a measurable way to  know if the business is advancing toward its goals. Key metrics or  performance indicators must be identified early in the process.


  • Identify business levers: The business should make changes to improve the key metrics and reach  its goals. If there is nothing that can be changed, there can be no  improvement regardless of how much data is collected and analyzed.  Identifying the goals, metrics and levers early in the project provides  the project with direction and avoids meaningless data analysis. For  example, the goal can be improving customer retention, one of the  metrics can be percent of customers renewing their subscriptions, and  the business levers can be design of the renewal page, timing and  content of reminder emails and special promotions.


  • Data collection: Cast a wide net for data. More data, especially data from more diverse sources, enables finding better correlations, building better models and finding more actionable insights. Big data economics mean that while individual records are often useless, having every record available for analysis can provide real value. Companies are instrumenting their websites to closely track user clicks and mouse movements and attaching RFID tags to products to track their movements through stores, while coaches attach sensors to athletes’ bodies to track the way they move.


  • Data cleaning: The first step in data analysis is to improve data quality. Data scientists correct spelling mistakes, handle missing data and weed out nonsense information. This is the most critical step in the data value chain: even with the best analysis, junk data will generate wrong results and mislead the business. More than one company has been surprised to discover that a large percentage of its customers live in Schenectady, NY, a small city with a population of fewer than 70,000 people. However, Schenectady has ZIP code 12345, so it is disproportionately represented in almost every customer profile database, because consumers are often reluctant to enter their real details into online forms. Analyzing this data will result in erroneous conclusions unless the data analysts take steps to validate and clean the data. It is especially important that this step scales, because a continuous data value chain requires incoming data to be cleaned immediately and at very high rates. This usually means automating the process, but it doesn’t mean humans can’t be involved.


  • Data modeling: Data scientists build models that correlate the data with the business outcomes and make recommendations regarding changes to the levers identified earlier. This is where the unique expertise of data scientists becomes critical to business success: correlating the data and building models that predict business outcomes. Data scientists must have a strong background in statistics and machine learning to build scientifically accurate models and avoid the traps of meaningless correlations and of overfitting, where a model is so reliant on existing data that its future predictions are useless. But statistical background is not enough; data scientists need to understand the business well enough to recognize whether the results of the mathematical models are meaningful and relevant.


  • Grow a data science team: Since data scientists are notoriously difficult to hire, it’s a good  idea to build a data science team that allows those with an advanced  degree in statistics to focus on data modeling and predictions, while  others in the team—qualified infrastructure engineers, software  developers and ETL experts—build the necessary data collection  infrastructure, data pipeline and data products that enable streaming  the data through the models and displaying the results to the business  in the form of reports and dashboards. These teams typically use  large-scale data analysis platforms like Hadoop to automate the data  collection and analysis and run the entire process as a product.


  • Optimize and repeat: The data value chain is a repeatable process and leads to continuous  improvements, both to the business and to the data value chain itself.  Based on the results of the model, the business will make changes to the  driving levers and the data science team will measure the results.  Based on the results, the business can decide on further action while  the data science team improves its data collection, data cleanup and  data models. The faster the business can repeat the process, the sooner  it can make course corrections and get value out of the data. Ideally,  after multiple iterations, the model will generate accurate predictions,  the business will reach the predefined goals, and the resulting data  value chain will be used for monitoring and reporting as everyone moves  on to solve the next business challenge.
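
To make the cleaning, metric and lever ideas in the list above more concrete, here is a minimal sketch of one pass through the chain, using the customer retention example. It assumes a hypothetical record layout (zip, reminder_variant, renewed) and invented data. Only ZIP code 12345 comes from the text above; the other placeholder values are assumptions. Suspect records are flagged, not dropped, since some customers really do live in Schenectady.

```python
from collections import defaultdict

# 12345 is the placeholder discussed above; the others are assumed examples.
SUSPECT_ZIPS = {"12345", "00000", "99999"}


def clean_records(records):
    """Normalize fields, drop rows missing the metric, flag suspect ZIP codes."""
    cleaned = []
    for record in records:
        zip_code = (record.get("zip") or "").strip()
        variant = (record.get("reminder_variant") or "").strip().lower()
        if not variant or record.get("renewed") is None:
            continue  # missing data: cannot contribute to the metric
        cleaned.append({
            "zip": zip_code,
            "zip_suspect": zip_code in SUSPECT_ZIPS,
            "reminder_variant": variant,
            "renewed": bool(record["renewed"]),
        })
    return cleaned


def renewal_rate_by_variant(records):
    """Key metric: percent of customers renewing, per business-lever variant."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [renewed, total]
    for record in records:
        counts = totals[record["reminder_variant"]]
        counts[0] += int(record["renewed"])
        counts[1] += 1
    return {variant: 100.0 * renewed / total
            for variant, (renewed, total) in totals.items()}


# Made-up raw data with typical problems: padding, mixed case, missing values.
raw = [
    {"zip": "10001", "reminder_variant": "7_days", "renewed": True},
    {"zip": " 12345 ", "reminder_variant": "7_days", "renewed": False},
    {"zip": "60614", "reminder_variant": "7_Days ", "renewed": True},
    {"zip": "94105", "reminder_variant": "30_days", "renewed": True},
    {"zip": "73301", "reminder_variant": "30_days", "renewed": None},
    {"zip": "30301", "reminder_variant": "30_days", "renewed": False},
]

cleaned = clean_records(raw)
suspect_count = sum(record["zip_suspect"] for record in cleaned)
rates = renewal_rate_by_variant(cleaned)
print(f"{len(cleaned)} usable records, {suspect_count} with a suspect ZIP code")
print(rates)
```

In a repeated cycle, the business would change a lever (here, the timing of the reminder email), and the same functions would be rerun on new data to measure the effect.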