Why Data Preparation?
The Challenging Reality
You are a Human Resource Information System (HRIS) Analyst, and one of your responsibilities is to gather data to help your organization submit the Equal Employment Opportunity Form (EEO-1). The concept is straightforward: you just need to retrieve employee data, specifically race, gender, and job category. However, the task is easier said than done because of several challenges. One of the fields you need, race, is not captured in the HR database; the gender field has more blank values than you believe it should; and quite a few new employees have had the wrong job categories assigned to them. Because you trust the reports in the HRIS, you decide to leverage your Microsoft Excel skills to compare and validate the data in the HR database against those reports.
Unfortunately, validating the quality of the data in the HR database is taking longer than anticipated, so you decide to enlist the assistance of the IT/Data team that built the HR database. You show them the discrepancies between the database and the HRIS reports and ask how long it will take to remedy the situation. They inform you that, unless this is a priority, it will take approximately 3-6 months because they have a backlog of requests. Furthermore, they will only be able to address the data gaps for new employees, not existing employees. Disappointed, you return to your Excel spreadsheet to finish the data validation. Over time, you develop a complex process in Excel that works despite the long wait for assistance from IT.
Just as you are getting comfortable with your Excel spreadsheet, you hear the big announcement regarding a change to the EEO-1 Form. The Equal Employment Opportunity Commission (EEOC) is now requesting that employee salary and hours worked be added as a second component to help better identify potential pay discrimination. Fortunately, you have about 3 months before this new data must be submitted, so you quickly inform the IT/Data team of the change in your requirements and the expanded scope so they can add the new data to the HR database. The response you get is not inspiring; in fact, it deflates you. They inform you yet again that it will take more time, 6-9 months this time, which means you will miss the deadline if you rely solely on them. What do you do? Will you outsource the work to consultants? Will you hire temps to lighten the load? Will you resurrect your Excel skills once more to accomplish the task? How long will it take you to do it in Excel?
Data Prep To The Rescue
This story highlights what I believe to be the most powerful and compelling reason for Data Prep: Time to Insight. Most people have heard that information workers spend 80% of their time cleaning and preparing data; however, I believe that many don’t understand the significance of this widely cited statistic. Quite simply, an employee who works with data could be spending up to 80% of their workday cleaning and preparing it, which leaves as little as 20% for the analysis itself. It’s not just an issue for the HRIS Analyst. Other professionals face the same challenge: Data Engineers, Data Scientists, Accountants, Marketing Analysts, Healthcare Analysts, and so on. The compounding impact of that 80% across an entire organization is what we need to be aware of.
One of the contributing factors to a long time to insight is waiting on an IT/Data team to perform a centralized ETL (extract, transform, load) process that makes data available to end users. That IT/Data team is usually inundated with requests and lacks the agility to respond to them, let alone to any changes. With Data Prep, information workers no longer need to rely solely on an IT/Data team to provide data. Data Prep empowers everyday technical and non-technical professionals to perform the ETL process themselves for faster time to insight.
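To make the idea of a self-service ETL process concrete, here is a minimal sketch in Python of the kind of validation the HRIS Analyst in the story was doing by hand in Excel. Everything in it is an assumption made for illustration: the sample records are invented, the column names (employee_id, race, gender, job_category) are hypothetical, and it treats the HRIS report as the trusted source, as the analyst in the story does. A real Data Prep tool or script would read from actual exports rather than inline text.

```python
import csv
import io

# Invented sample extracts. In practice these would be files exported from
# the HR database and from the HRIS reporting tool.
HR_DATABASE_CSV = """employee_id,gender,job_category
101,F,Professionals
102,,Technicians
103,M,Sales Workers
"""

HRIS_REPORT_CSV = """employee_id,race,gender,job_category
101,Asian,F,Professionals
102,White,M,Technicians
103,Black or African American,M,Service Workers
"""

FIELDS = ("race", "gender", "job_category")


def extract(csv_text):
    """Extract: read CSV text into a dict of rows keyed by employee_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["employee_id"]: row for row in reader}


def transform(database_rows, report_rows):
    """Transform: prefer the trusted report value and log every discrepancy."""
    cleaned, issues = [], []
    for emp_id, db_row in sorted(database_rows.items()):
        report_row = report_rows.get(emp_id)
        if report_row is None:
            issues.append((emp_id, "all", "missing from HRIS report"))
            continue
        record = {"employee_id": emp_id}
        for field in FIELDS:
            # A column absent from the database (race here) counts as blank.
            db_value = (db_row.get(field) or "").strip()
            report_value = (report_row.get(field) or "").strip()
            if not db_value:
                issues.append((emp_id, field, "blank in database"))
            elif db_value != report_value:
                issues.append((emp_id, field, "database disagrees with report"))
            record[field] = report_value or db_value
        cleaned.append(record)
    return cleaned, issues


def load(records):
    """Load: write the cleaned records out as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["employee_id", *FIELDS], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


cleaned, issues = transform(extract(HR_DATABASE_CSV), extract(HRIS_REPORT_CSV))
eeo1_extract = load(cleaned)

for issue in issues:
    print(issue)
print(eeo1_extract)
```

The point of the sketch is not the specific code but the shape of the work: the analyst pulls both sources, reconciles them under an explicit rule, keeps a log of what was wrong, and produces a clean extract without waiting in the IT/Data team's queue.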
While there are many tools on the market today that do Data Prep, I think it’s important that we first understand that it is a discipline: a useful skill set for anyone working with data, because it will save you time.