Cleansing and Inspecting Data
Unfortunately for organisations (and data scientists), the perfect database does not exist. Data-sets are often incomplete, incorrect or poorly maintained, and this presents a major limitation for organisations seeking to gain actionable insight from their data. Nevertheless, cleansing and inspecting a data-set with the appropriate methodology ensures organisations are able to derive maximum value and insight from their data.
As we have outlined in a previous blog, generating meaningful insights requires data of sufficient quality and of a sufficient quantity. The better the quality of your data, the greater the depth of insight that can be gained. The more data you have, the more representative those insights become.
A recent article from Gartner states that poor data quality “weakens an organisation’s competitive standing and undermines critical business objectives” with a potential average financial cost of $15 million per year.
So, with that in mind, how can organisations effectively address the quality of their data and convert it into information capable of enabling meaningful action? In this blog, we’ll look at the processes by which data becomes information, focusing on data cleansing and inspection.
What is data?
We often speak in broad terms, and a word like “data” can mean a variety of things. In the context of data cleansing and inspection, “data” or “data-set” refers to a group of values that have not been processed, while “information” refers to data that has been processed in such a way that it enables action to occur.
The way we process a data-set depends a great deal on the type of value our clients hope to extract from it. Location, for example, can be expressed in a variety of ways, such as address, latitude & longitude, UTM or geohash, each with its own advantages and limitations. In this instance, the type of location data an organisation chooses to use will depend a great deal on the value it hopes to extract. A retail store hoping to find out more about where its customers are located would likely choose address, while a group of oceanographers studying specific points of the ocean might be more inclined to use latitude & longitude.
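To make these trade-offs concrete, the sketch below shows how a latitude & longitude pair can be collapsed into a geohash, a short string whose length controls precision. This is a minimal illustration of the standard geohash algorithm, not part of any particular pipeline; the coordinates used are just an example point.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=11):
    """Interleave longitude/latitude bits, then map each 5 bits to base-32."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, use_lon = [], True  # the first bit always subdivides longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon > mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat > mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        use_lon = not use_lon
    # pack each group of 5 bits into one base-32 character
    return "".join(
        _BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, precision * 5, 5)
    )

print(geohash_encode(57.64911, 10.40744))  # → "u4pruydqqvj"
```

A shorter `precision` yields a coarser cell, which is exactly the sort of trade-off an organisation weighs when deciding how to represent location.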
It is at this initial planning stage that the first instance of data inspection occurs, to determine whether the data-set is of sufficient quality and quantity to proceed. If so, we move on to the data cleansing process.
When considering the data cleansing process, it is best to imagine a funnel, with increasing complexity and detail at each subsequent level. In an ideal situation, should the data be well maintained and its format consistent, very little action is required on the part of the data scientist. Unfortunately, this is very rarely the case, so it is the responsibility of the data scientist to process and standardise the data as required, with increasing scrutiny, until the data is in a usable state. When cleansing a data-set, data scientists seek to:
- Remove null or invalid results
- Standardise data within a single relevant, usable format
- Unify disparate data sources in a consistent format
- Maintain the integrity of the source data-set
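The four goals above can be sketched in a few lines of Python. The sample rows, field names and date formats below are hypothetical, and a real pipeline would be considerably more involved, but the sketch shows each goal in miniature: invalid rows are dropped, dates and numbers are standardised, two sources are unified, and the originals are never mutated.

```python
from datetime import datetime

# Hypothetical rows from two sources with inconsistent formats.
SOURCE_A = [
    {"id": "1", "date": "2021-03-01", "value": "42.0"},
    {"id": "2", "date": "01/04/2021", "value": ""},    # null value
]
SOURCE_B = [
    {"id": "3", "date": "01/05/2021", "value": "17.5"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats

def parse_date(text):
    """Try each known format; return an ISO-8601 date or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return None

def cleanse(*sources):
    """Unify disparate sources into one list, dropping null/invalid rows.

    Works on copies, so the integrity of the source data-sets is maintained.
    """
    cleaned = []
    for source in sources:
        for rec in source:
            row = dict(rec)  # copy: never mutate the source data-set
            date = parse_date(row["date"])
            if not row["value"] or date is None:
                continue  # remove null or invalid results
            row["date"] = date                   # standardise format
            row["value"] = float(row["value"])   # standardise type
            cleaned.append(row)
    return cleaned

print(cleanse(SOURCE_A, SOURCE_B))
# record "2" is dropped; the remaining dates are unified to ISO-8601
```

Note that the output is a new structure: the raw sources survive untouched, which is what lets a later inspection stage go back and request more data if needed.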
As simple as this process might seem, real data-sets are often much larger and typically contain a wide variety of disparate values requiring greater scrutiny. For example, the solution we developed for Coriolis Technologies takes incredibly complex international trade data and cleanses it in a similar fashion to the process outlined above, albeit in greater detail, producing information that, once processed further, enables Coriolis Technologies to draw meaningful insights from the data.
Similarly, our work with The MIDAS Project takes a variety of disconnected healthcare data-sets from throughout Europe and processes them in a way that provides policymakers with insights capable of informing long-term decisions within the healthcare industry.
With the data-set cleansed, the next step is to determine its quality. Typically, this involves running tests against the data, such as comparisons and range checks. At this stage, the data scientist seeks to determine the following:
- How much usable data do I have?
- How complete is the data set?
- What form does it take?
- What type of data do I have?
Much like the initial data inspection, once these questions are answered, one of two things happens: the data-set is deemed useful and the project proceeds as intended, or additional data is requested to shore up any issues that might exist within the data-set.
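The four inspection questions map naturally onto a small quality report. The sketch below is one possible shape for such a report, under the assumption of list-of-dict records; the `ranges` argument is an illustrative stand-in for the comparisons and range checks mentioned above, not a fixed API.

```python
def inspect(records, required_fields, ranges=None):
    """Summarise how much usable data a data-set contains.

    `ranges` maps a field name to a (lo, hi) pair used as a simple
    range check on that field's values.
    """
    ranges = ranges or {}
    total = len(records)
    usable = 0
    for rec in records:
        complete = all(rec.get(f) not in (None, "") for f in required_fields)
        in_range = all(
            rec.get(f) is not None and lo <= rec[f] <= hi
            for f, (lo, hi) in ranges.items()
        )
        if complete and in_range:
            usable += 1
    return {
        "total": total,    # how much data do I have?
        "usable": usable,  # how much usable data do I have?
        "completeness": usable / total if total else 0.0,
        # what form/type does each field take?
        "field_types": {
            f: sorted({type(r.get(f)).__name__ for r in records})
            for f in required_fields
        },
    }

# Hypothetical sensor readings; one incomplete, one outside a plausible range.
readings = [
    {"sensor": "a", "temp": 21.5},
    {"sensor": "b", "temp": None},
    {"sensor": "c", "temp": 180.0},
]
report = inspect(readings, ["sensor", "temp"], ranges={"temp": (-40, 60)})
print(report)
```

A completeness score well below expectations is exactly the signal that would trigger the second path above: a request for additional data before the project proceeds.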
At Analytics Engines, we apply the appropriate business logic and processes to enable our clients and partners to utilise their data to reach their business goals. Cleansing and inspecting data is the first step that organisations must take in order to derive maximum value and insight from their data. To find out more, speak with us using the form below.