All Things Data Prep

Unveiling Data Prep

What is Data Preparation?

Imagine that you're having friends over for dinner and you've decided to make a meal you're a novice at, let's say Pad Thai.

Acquiring Ingredients: You get all the ingredients you need from a variety of places; grocery stores, farmers markets, and even your local Thai shop.

Prepare the Meal: Now the fun starts as you follow a detailed step by step process for getting dinner ready. Chopping up the veggies, chopping the meat, cooking the noodles, stir frying etc.

Dinner is Served: Finally, you present the finished product; an edible Pad Thai meal. You go the extra mile and dish the meal aesthetically on dinner plates for everyone.

This is the essence of Data Preparation (Data Prep). I’ve heard Data Prep be referred to as Self-Service ETL (Extract, Transform, Load), but it so much more than that. Gartner provided their formal definition of Data Prep in the Market Guide for Data Preparation Tools, but I’d like to present you with my own definition as a practitioner:

Data Preparation is the process of making data user friendly and consumable for a Business or IT objective

This definition at its core consists of the following three tasks:

Data Acquisition

Before we can do anything with data we have to acquire it, gather it, from all the relevant sources. You want to know who the top 10 customers are by number of products purchased last month? Well that means our first step is to gather the list of all products that were purchased last month by customers. Without retrieving this information the only other option of answering that question is to guess. Data acquisition is challenging because data is disparate: spread across multiple systems and in a wide variety of formats. For example many organizations still have many of their core business processes running on the mainframe which generates text files that are largely unstructured. Further each line of business has systems and databases that store information in various formats (PDFs, Excel, XML, HTML, etc). What about external sources? Social media and data from the Web are proving to be more relevant as the global economy becomes more connected. What about Mergers & Acquisitions? Can two organizations merge and reconcile as one entity without acquiring all the right data for the project? Data acquisition is indeed the first hurdle to clear for business insight.

Data Transformation

Once data has been acquired it has to be made user friendly in the context of a Business or IT objective. The implication here is that data is not user friendly because data in its natural state is raw and unrefined. As such it needs to be made user friendly. Familiar synonyms for this are transformation and preparation. So what does it really mean to make data user friendly? Generally speaking we must first extract and organize the data into a table format with rows and columns. This is applicable to primarily unstructured data sources such as text, PDFs, XML, and HTML. Once the data is in a table format we can perform a variety of transformation functions that help to further clean and refine the data, and make sense of it in the context of the intended use. Common examples of these functions are creating calculated fields, summarizing, joining two or more tables together, removing or identifying duplicate records, and much more. 

Data Consumption

This is a term that is broad but apt for this final phase of the data preparation process. Now that data has been acquired, and transformed it is ready to be consumed by business professionals, IT professionals, processes, and systems. We consume data in a variety of manners, most common are through visualizations, dashboards, reports. Often a business user may want to 'consume' the transformed data by using it as a data source for an ancillary process such as Data Science, and Data Engineering. Finally systems and databases often have a set criteria and format that is required to ingest information. If we are migrating data from a legacy application into a modern application we have to ensure the data is prepared appropriately for the new system to receive it. Consuming data I believe also entails curating and sharing cleaned and transformed data to enhance knowledge management and collaboration.

There are two words I would like to conclude with that describe the nature of the Data Prep process. Those words are Iterative & Collaborative. The process of cleaning data in the context of an objective can often lead to several cycles of extracting and transforming data especially as we get a better understanding of the data and the objective at hand. In addition Data Prep need not be a solo (and siloed) undertaking. Instead a Data Prep task can be an opportunity for collaboration and leveraging the collective wisdom of the crowds.

Now that we have a working definition of Data Prep the next blog post shall highlight the key reasons Why Data Prep is so important.