
Extracting and cleaning data using Python

In a recent blog, we looked at how organisations can cleanse and inspect a dataset in order to derive maximum value and insight from it. In this blog, we’ll revisit that topic in more detail and consider some of the more technical aspects of extracting and cleansing data.

Open Data

Quite often, our clients will ask us to create a data solution that makes use of open data from online sources. When this happens, it’s critically important for us to collect, cleanse and maintain the data in the most effective way possible, while abiding by the data security policies of those sources.

When delivering these types of solution, we tend to use a combination of four Python libraries. To collect and extract data from online sources, we use Requests, Beautiful Soup and Selenium; once the data is collected, we use Pandas to cleanse and process it. On occasion, the task of cleansing and processing data is part of the end solution itself.

Collecting Data


Requests

An Application Programming Interface (API) is a set of functions, protocols and procedures that enables two separate systems to communicate. In the context of data collection, APIs allow us to securely access and work with an organisation’s dataset.

Using Requests, we can call an API and receive the dataset as an easy-to-use JSON object, which we can then load directly into a Pandas DataFrame and manipulate as needed.
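As a minimal sketch of this workflow (the payload below is hard-coded to stand in for a real `requests.get(url).json()` response, so the example is self-contained):

```python
import json

import pandas as pd

# In practice the payload would come from an API call, e.g.:
#   payload = requests.get("https://api.example.com/cities").json()
# Here a hard-coded JSON response keeps the sketch self-contained.
payload = json.loads("""
[
  {"country": "UK",     "city": "London", "population": 8900000},
  {"country": "France", "city": "Paris",  "population": 2100000}
]
""")

# A list of flat JSON objects loads straight into a DataFrame.
df = pd.DataFrame(payload)
print(df)
```

Each JSON object becomes a row and each key becomes a column, ready for cleansing and processing.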

On occasion, we may only need a small piece of data that’s deeply nested within a larger dataset. In such cases, we use an additional library, JSONPath, to select just the data we want.

Beautiful Soup

Unfortunately, the data we need will rarely all be available through an API. More often, it will only be available to us through a web page. In such cases, we’ll use Requests to fetch the HTML and then use Beautiful Soup to parse it and extract the data we need.
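A minimal sketch of parsing an HTML table with Beautiful Soup (the HTML snippet is hard-coded to stand in for a real `requests.get(url).text` response):

```python
import pandas as pd
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# In practice this would come from requests.get(url).text;
# a hard-coded snippet keeps the sketch self-contained.
html = """
<table id="population">
  <tr><th>Country</th><th>City</th></tr>
  <tr><td>UK</td><td>London</td></tr>
  <tr><td>France</td><td>Paris</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Skip the header row, then collect the cell text of each data row.
rows = []
for tr in soup.find("table", id="population").find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

df = pd.DataFrame(rows, columns=["Country", "City"])
print(df)
```

Once the scraped rows are in a DataFrame, the same Pandas cleansing steps described below apply.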


Selenium

Some web pages are rendered client-side, so a plain request won’t return the data we need. In such instances, we use Selenium to drive a Chrome window and take the data directly from the rendered HTML. The data we need may also be split over several web pages, so Selenium allows us to navigate a website by interacting with elements like buttons.

Cleansing Data

Pandas DataFrame

Once we’ve collected the data, it is collated within a Pandas DataFrame. Depending on the quality and completeness of the dataset, we may need to cleanse the data to make it usable. Data cleansing operations typically include grouping, joining, filling missing values and a range of other tasks.


Grouping

Pandas has several grouping methods. Two methods we like are pivot_table and groupby.

pivot_table allows us to create a spreadsheet-style pivot table as a DataFrame. A pivot table is a table of statistics that summarises the data of a more extensive table.

groupby is similar to SQL’s GROUP BY statement in that it involves some combination of splitting the object, applying a function and combining the results. It can be used to group large amounts of data and compute operations on those groups.

For example, if we were to apply a df.groupby('Country').Population.sum() operation to the table below…

Country City Population
UK London 14 million
UK Glasgow 1.3 million
France Paris 12 million
France Toulouse 1.3 million

… it would become:

Country Population
UK 15.3 million
France 13.3 million
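The transformation above can be sketched with both methods (populations given in millions; column and method names follow the example):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["UK", "UK", "France", "France"],
    "City": ["London", "Glasgow", "Paris", "Toulouse"],
    "Population": [14.0, 1.3, 12.0, 1.3],  # millions
})

# groupby: split the rows by country, then sum each group's population.
totals = df.groupby("Country").Population.sum()
print(totals)

# pivot_table produces the same summary as a spreadsheet-style table.
pivot = df.pivot_table(index="Country", values="Population", aggfunc="sum")
print(pivot)
```

Both give UK 15.3 and France 13.3; groupby returns a Series while pivot_table returns a DataFrame, so the choice often comes down to the shape you want downstream.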


Joining

Pandas has a join method that is quite similar to SQL’s JOIN. It allows for inner, outer, left and right joins using a specified column or group of columns as keys.

For example, if we had two datasets like the tables shown below:

Country Year Val
UK 2018 1

Country Population Capital
UK 14 million London

By calling join, we can transform these two tables into one, merging on Country.

Country Year Val Population Capital
UK 2018 1 14 million London
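The merge above can be sketched as follows (note that Pandas’ join matches rows on the index, so we key both frames on Country first; the column names follow the example):

```python
import pandas as pd

measurements = pd.DataFrame({
    "Country": ["UK"],
    "Year": [2018],
    "Val": [1],
})
countries = pd.DataFrame({
    "Country": ["UK"],
    "Population": ["14 million"],
    "Capital": ["London"],
})

# join matches on the index, so set Country as the index on both sides.
joined = measurements.set_index("Country").join(countries.set_index("Country"))
print(joined.reset_index())
```

An equivalent alternative is pd.merge(measurements, countries, on="Country"), which takes column keys directly.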

Missing Value Handling

Dataset completeness – or incompleteness – is quite a common issue. It’s not unusual for a dataset to have large gaps in it, and this is something the cleansing process seeks to address. There are several ways to deal with missing data, such as dropna, fillna or interpolate.

Consider the following table:

Country Year Val
UK 2016 1
UK 2017 NaN
UK 2018 3

dropna simply drops any entries that have a NaN value in the dataset while fillna replaces all NaN values with a different value, typically 0. In the interest of maintaining the integrity of the original dataset, we think interpolate works pretty well.

dropna results in the following dataset:

Country Year Val
UK 2016 1
UK 2018 3

fillna(0) results in the following dataset:

Country Year Val
UK 2016 1
UK 2017 0
UK 2018 3

interpolate estimates the missing values using an algorithm, such as linear, spline or polynomial interpolation. Linear interpolation on the above dataset results in the following:

Country Year Val
UK 2016 1
UK 2017 2
UK 2018 3
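The three approaches can be compared on the same small frame (a minimal sketch of the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["UK", "UK", "UK"],
    "Year": [2016, 2017, 2018],
    "Val": [1.0, np.nan, 3.0],
})

dropped = df.dropna()        # removes the 2017 row entirely
filled = df.fillna(0)        # replaces the NaN with 0
# interpolate (default: linear) estimates the gap from its neighbours.
interpolated = df.assign(Val=df.Val.interpolate())

print(interpolated)
```

Linear interpolation fills the 2017 value with 2.0, the midpoint of its neighbours, which is why it preserves the shape of the original data better than dropping or zero-filling.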

Other functions

There are several other functions that we frequently use when processing a dataset which, while not strictly related to cleansing, make the data much easier to work with.

One such function is melt. Consider the table below. It doesn’t make sense to add a new column for every year; instead, the years should become rows.

Country 2015 2016 2017 2018
UK 1 2 2 3

If we apply melt to the dataset above, we receive the following. Although some of the data is repeated, the dataset is far easier to work with than before.

Country Year Val
UK 2015 1
UK 2016 2
UK 2017 2
UK 2018 3
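The reshaping above can be sketched as follows (column names follow the example):

```python
import pandas as pd

# One column per year: wide, and awkward to extend.
wide = pd.DataFrame({
    "Country": ["UK"],
    "2015": [1], "2016": [2], "2017": [2], "2018": [3],
})

# melt turns the year columns into rows: one (Country, Year, Val) per year.
long = wide.melt(id_vars="Country", var_name="Year", value_name="Val")
print(long)
```

New years then arrive as new rows rather than new columns, which is the shape that groupby, join and plotting functions expect.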

Using data to empower decision making and improve your business is very much a process. While similarities do exist, no two problems are the same. The examples above are just some of the techniques that we like to use; however, it is important to remember that every project comes with its challenges and we adapt our approach accordingly.

Doing more with data

Organisations are not limited to using only data that they own. Public and open datasets provide organisations with a range of opportunities, capable of empowering meaningful change.  Extracting and cleansing data – using techniques such as those above – is a critical process and must be performed correctly to derive maximum value and insight from the data available.

by PJ Kirk
