In a recent blog, we looked at the ways in which organisations can cleanse and inspect a dataset in order to derive the maximum value and insight from it. In this blog, we’ll revisit that topic in more detail and consider some of the more technical aspects involved in extracting and cleansing data.
Quite often, our clients will ask us to create a data solution that makes use of open data from online sources. When this happens, it’s critically important for us to collect, cleanse and maintain the data in the most effective way possible, while abiding by the data security policies of those sources.
When delivering these types of solutions, we tend to use a combination of four separate Python libraries. To collect and extract data from online sources, we use Requests, Beautiful Soup and Selenium, and once the data is collected, we can use a library like Pandas to do the cleansing and processing job. On occasion, the task of cleansing and processing data is part of the end solution itself.
An Application Programming Interface (API) is a set of functions, protocols and procedures that enable two separate systems to communicate. In the context of data collection, they allow us to securely access and work with an organisation’s dataset.
With Requests, datasets are returned to us as easy-to-use JSON objects that we can then import directly into a Pandas DataFrame and manipulate as needed.
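As a minimal sketch of that flow (the endpoint URL and field names below are illustrative, not a real API):

```python
import pandas as pd

# In practice the payload would come from an API call, e.g.:
#   import requests
#   payload = requests.get("https://api.example.com/cities").json()
# Here we use a sample payload of the same shape instead.
payload = [
    {"city": "London", "country": "UK", "population": 8982000},
    {"city": "Paris", "country": "France", "population": 2161000},
]

# A list of JSON objects imports directly into a DataFrame
df = pd.DataFrame(payload)
print(df)
```

Each JSON object becomes a row and each key becomes a column, so the dataset is immediately ready for the cleansing steps described later.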
On occasion, we may only need a small part of the data that’s deeply nested within a larger dataset. To get around this, we use an additional library called JSONPath to choose the data we want.
Unfortunately, all the data that we need will rarely be available through an API. It is more likely that the data will be available to us through a web page. In such cases, we’ll use Requests to scrape the HTML code and then use Beautiful Soup to parse it and extract the data we need.
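A minimal sketch of that pattern, parsing a small inline snippet instead of a live page (the table structure and URL in the comment are illustrative):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# In practice the HTML would come from Requests, e.g.:
#   import requests
#   html = requests.get("https://example.com/stats").text
html = """
<table id="populations">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>UK</td><td>66650000</td></tr>
  <tr><td>France</td><td>66990000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
# Skip the header row, then pull the text out of each data cell
for tr in soup.find("table", id="populations").find_all("tr")[1:]:
    country, population = [td.get_text() for td in tr.find_all("td")]
    rows.append({"country": country, "population": int(population)})
print(rows)
```

The resulting list of dictionaries can be passed straight into a Pandas DataFrame for cleansing.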
Some web pages are rendered client-side with JavaScript, so a plain HTTP request won’t return the data we need. In such instances, we use Selenium to drive a Chrome window and take the data directly from the rendered HTML. Sometimes the data we need may also be split over several web pages, so Selenium allows us to navigate a website by interacting with elements like buttons.
Once we’ve collected the data, it is collated within a Pandas DataFrame. Depending on the quality and completeness of the dataset, we may need to cleanse the data to make it usable. Data cleansing operations typically include grouping, joining, filling missing values and a range of other tasks.
Pandas has several grouping methods. Two methods we like are pivot_table and groupby.
pivot_table allows us to create a spreadsheet-style pivot table as a DataFrame. A pivot table is a table of statistics that summarises the data of a more extensive table.
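A short sketch of pivot_table on some illustrative data (the column names and values are assumptions, not from the original tables):

```python
import pandas as pd

# Population figures per country per year (illustrative values, in millions)
df = pd.DataFrame({
    "country": ["UK", "UK", "France", "France"],
    "year": [2019, 2020, 2019, 2020],
    "population": [66.8, 67.0, 67.0, 67.1],
})

# Countries become rows, years become columns, and values are aggregated
pivot = df.pivot_table(values="population", index="country",
                       columns="year", aggfunc="sum")
print(pivot)
```

The result reads just like a spreadsheet pivot table: one row per country with a column for each year.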
groupby is similar to the SQL GROUP BY statement in that it involves some combination of splitting the object, applying a function, and combining the results. It can be used to group large amounts of data and compute operations on those groups.
For example, if we were to apply a df.groupby('country').population.sum() operation to the table below…
… it would become:
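On city-level data of the same shape as the table above (the cities and figures here are illustrative), the operation looks like this:

```python
import pandas as pd

# City-level populations keyed by country (illustrative values)
df = pd.DataFrame({
    "city": ["London", "Manchester", "Paris", "Lyon"],
    "country": ["UK", "UK", "France", "France"],
    "population": [8_982_000, 553_000, 2_161_000, 513_000],
})

# Split by country, sum the populations, combine into one Series
totals = df.groupby("country").population.sum()
print(totals)
```

Each country appears once in the result, with the summed population of all its cities.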
Pandas has a join method that is quite similar to a SQL JOIN. It allows for inner, outer, left and right joins, using the index (or a specified column) as the key.
For example, if we had two datasets like the tables shown below:
By calling join, we can transform these two tables into one, merging on country.
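A minimal sketch of that merge on two illustrative tables (the countries and figures are assumptions, not the original data):

```python
import pandas as pd

# Two datasets that share a country column (illustrative values)
population = pd.DataFrame({
    "country": ["UK", "France", "Germany"],
    "population": [66.65, 66.99, 83.02],
})
area = pd.DataFrame({
    "country": ["UK", "France", "Germany"],
    "area_km2": [242_495, 551_695, 357_386],
})

# join matches on the index, so set country as the index on both sides first
merged = population.set_index("country").join(
    area.set_index("country"), how="inner")
print(merged)
```

The two tables become one, with a single row per country carrying both its population and its area.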
Missing Value Handling
Dataset completeness – or incompleteness – is quite a common issue. It’s not unusual for a dataset to have large gaps in it. It’s something that the cleansing process seeks to address. There are several ways to deal with missing data, such as dropna, fillna or interpolate.
Consider the following table:
dropna simply drops any entries that have a NaN value in the dataset, while fillna replaces all NaN values with a different value, typically 0. In the interest of maintaining the integrity of the original dataset, we think interpolate works pretty well.
dropna results in the following dataset:
fillna(0) results in the following dataset:
interpolate is where the missing values are estimated using an algorithm. Some of these algorithms are linear, spline and polynomial. Linear interpolation on the above dataset will result in the following:
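The three approaches can be compared side by side on a small series with gaps (the values stand in for the table above and are illustrative):

```python
import numpy as np
import pandas as pd

# A series with missing entries
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.dropna().tolist())       # [1.0, 3.0, 5.0] - NaN entries removed
print(s.fillna(0).tolist())      # [1.0, 0.0, 3.0, 0.0, 5.0] - NaN replaced by 0
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0] - linear estimates
```

Note how linear interpolation fills each gap with a value on the straight line between its neighbours, which is why it tends to preserve the shape of the original data better than dropping rows or filling with a constant.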
There are several other functions that we frequently use when processing a dataset that, while not strictly related to cleansing, make the data much easier to work with.
One such function is melt. Consider the table below. It doesn’t make sense to add a new column for every year; instead, the years should become rows.
If we apply melt to the dataset above, we receive the following. Although some of the data is repeated, the dataset is far easier to work with than before.
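A short sketch of melt on data of the same wide shape (the column names and figures are illustrative):

```python
import pandas as pd

# Wide format: one column per year (illustrative values, in millions)
wide = pd.DataFrame({
    "country": ["UK", "France"],
    "2019": [66.8, 67.0],
    "2020": [67.0, 67.1],
})

# melt turns the year columns into rows, keeping country as an identifier
long = wide.melt(id_vars="country", var_name="year", value_name="population")
print(long)
```

The country names are now repeated once per year, but adding a new year simply means appending rows rather than restructuring the table.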
Using data to empower decision making and improve your business is very much a process. While similarities do exist, no two problems are the same. The examples above are just some of the techniques that we like to use; however, it is important to remember that every project comes with its challenges and we adapt our approach accordingly.
Doing more with data
Organisations are not limited to using only data that they own. Public and open datasets provide organisations with a range of opportunities, capable of empowering meaningful change. Extracting and cleansing data – using techniques such as those above – is a critical process and must be performed correctly to derive maximum value and insight from the data available.