Considerations when approaching a data science project

Data science plays a central role in the development of a practical, valuable data solution. The data scientist’s primary role is to prepare and interrogate data to draw meaningful, actionable insights. Nick from our data science team sets out some of the key considerations when approaching a data science challenge.
Like a lot of former physicists, I’m fond of the wisdom of the American theoretical physicist Richard Feynman. Best known for his work in quantum mechanics, he was also a renowned teacher who held to strong scientific principles for understanding complex things. Feynman famously said, ‘The first principle is that you must not fool yourself, and you are the easiest person to fool’. While Feynman meant this in general terms, the approach is particularly important for data scientists.
Fooling yourself is easy when you are handling complex problems, especially if the solution to those problems looks simple at first glance. With that hard-earned lesson in mind, here are a few of the issues I consider at the early stage of any project.
- Are you and I missing something?
Most data problems are more complex than they first appear. An organisation may underestimate the depth of their problem, or a data scientist may miss complexity in the dataset due to an inaccurate idea of the customer’s workflow. In both cases, communication is important above all else. Talk to the customer. Find out about their business. Don’t overpromise.
- Friendly algorithms and unfriendly data
Conversations in data science and machine learning should focus on data. In order of importance: how to get it; where it came from; how to handle data protection; what can be done with it. Conversations steered by algorithms are a solution looking for a problem. Keep the conversation on data until the data pipeline no longer requires a data scientist’s oversight. Algorithms come later.
- Defence in simplicity
When in doubt, go simple. Simpler algorithms and processes are much more amenable to ad hoc analysis, and they are far easier to debug than piecing apart some transfer learning-driven, pre-trained HAL-alike. Complex machine learning models have their place, and many problem domains benefit greatly from them. However, don’t underestimate the power of simply taking stock of what kind of data you actually have and running it through the nearest decision tree (a sketch follows after this list).
- Software engineering is underrated
Or, Jupyter Lab is Not For Coding. Data science often comes across as woolly in the eyes of your Full Stack Software Head of Engineering, but it doesn’t have to be. An elegant, well-tested pipeline makes ad hoc analysis much easier. Integrating useful software engineering practices into data science work, from test-driven development through to something akin to Clean Code, makes the code readable, the results reproducible, and the scientific process followed during data exploration much easier to scale (see the testing sketch below).
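
To make the ‘nearest decision tree’ point concrete, here is a minimal sketch of a simple baseline in Python with pandas and scikit-learn. The dataset, file name, and column names are purely illustrative assumptions, not taken from any real project.

```python
# A "go simple first" baseline: take stock of the data,
# then fit a shallow decision tree before reaching for anything heavier.
# The CSV path and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("customer_events.csv")  # hypothetical dataset

# Take stock of what we actually have before modelling anything.
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head())

features = ["age", "visits_last_30d", "avg_order_value"]  # illustrative columns
X = df[features].fillna(df[features].median())
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A shallow tree is easy to inspect, explain, and debug.
baseline = DecisionTreeClassifier(max_depth=3, random_state=42)
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))
```

If a shallow tree like this already captures most of the signal, that is valuable to know before anything more elaborate is attempted; and when it doesn’t, its failures are at least easy to inspect.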
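
And to illustrate the software engineering point, a small sketch of what test-driven habits can look like around a data pipeline: a pure cleaning function with a pytest test alongside it. The function, its rules, and the example data are hypothetical, intended only to show the shape of the approach rather than any particular production pipeline.

```python
# A small, pure transformation that is easy to test.
# In practice the function and its test would live in separate modules;
# they are shown together here for brevity. Run the test with `pytest`.
import pandas as pd


def standardise_emails(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case email addresses, strip whitespace, and drop rows
    with no usable address. (The rules here are illustrative.)"""
    out = df.copy()
    out["email"] = out["email"].fillna("").astype(str).str.strip().str.lower()
    return out[out["email"] != ""]


def test_standardise_emails_normalises_and_drops_blanks():
    raw = pd.DataFrame({"email": ["  Alice@Example.COM ", "", None]})
    cleaned = standardise_emails(raw)
    assert list(cleaned["email"]) == ["alice@example.com"]
```

A handful of small tests like this also doubles as documentation of the assumptions baked into each cleaning step.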
All of these things go into making a data science project a worthwhile endeavour for any client: clean code, delivered by an organisation that is keen on communication, focuses on data sources and data cleaning long before algorithms are involved, and produces work you can rely upon.
Find out more
To find out more about Analytics Engines or to arrange a free consultation with one of our data experts, contact us using the form below: