Data Scientists, do good with Big Data

Earlier this month, I attended the O’Reilly Strata conference in London, which covered all things data science. Data science is the current IT industry hype, concerned with big data, another hype. Big data can be summarised as growing data that is unwieldy in size, potentially unstructured, and thus hard to manage and extract value from with current technology, tools, and algorithms. Data science is the attempt to address this.

The two topics emerged recently, after we started to collect and store a mind-boggling amount of (digital) data. The problem is that the speed of data collection is accelerating as a result of an explosion of electronic services, ubiquitous connectivity, and smart devices. And there is no end in sight: the next data wave will hit us with the ‘Internet of Things’. Its proponents predict that fantastically inexpensive chips and sensors will be added to even the cheapest, most perishable, and most expendable products. Everything from a bag of potatoes to a chocolate bar will be tracked from production to consumption in an attempt to optimise supply chains and provide new services to consumers.

Consequently, the industry’s interest in big data is rooted in the necessity to digest this data deluge and the idea that where there is data there ought to be information buried in it. Information about what people like, buy, do, need, want. That information is invaluable to companies trying to get you to buy (more of) their products and services.

Against this very worldly backdrop, one keynote at the conference stood out to me. Jake Porway, founder of DataKind, gave an inspiring talk about the ability of people working with big data, commonly referred to as data scientists, to do good with data. He pointed out that while data scientists can do good, they mostly don’t. Not because they are bad people, but because they, like most people, are making a living in an industry which, of course, is about making money and not about doing good.

Data Scientists

We, data scientists, wrestle information and, hopefully, economic value from datasets, often with the help of computer science in general, and machine learning and information retrieval algorithms in particular. We compute, predict, and infer who you might want to be friends with, which movies interest you, which email is spam, which documents or websites you are looking for, how creditworthy you are, which price will maximise revenue, and so on.

You may never have heard of data scientist as a degree or job title. The term data science is merely 11 years old, and the job title data scientist has only become popular in the last few years. As a result, none of today’s data scientists have been educated specifically for their job. We come from various hard science backgrounds like mathematics, computer science, and physics, often with academic research experience and PhDs. Additionally, data scientists, contrary to what the name and the previous description imply, are very hands-on people who can usually program and, at the very least, prototype solutions to data problems. That combination of skills and the lack of a degree in data science make us pretty rare (or ubiquitous, depending on how loosely you interpret the mentioned attributes). Josh Wills, Director of Data Science at Cloudera, defined it concisely in a talk as “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

Big Data, Good Data

The same technology, tools, and skills that help to predict behaviour, identify similarities, extract hard-to-reach data, and visualise it can also be used to help people. NGOs and governments in particular can benefit from the insight data science can provide, sometimes even with little effort and comparatively small data sets. The bad news is that, as with any hype, there is a shortage: a shortage of data scientists and similarly skilled data experts, who are currently highly valued by the industry.

That is where Jake Porway and DataKind try to help out. They understand the opportunity data science offers for the public good. DataKind brings together data scientists, hackers, visualisation experts, and the like to aid organisations that help people but cannot afford data specialists. One way DataKind does this is through weekend events called DataDives. There, organisations bring their data and questions and meet with volunteers, who typically are highly paid professionals, e.g. data scientists in large corporations, academics, or visualisation wizards at national newspapers, as well as bright individuals like students or hackers. They donate their time and experience to help answer the questions at hand.

An example is MIX, a non-profit organisation that collects, analyses, and publishes microfinance data. They improve market transparency to make microfinance more accessible to the poor around the world. DataKind organised the microfinance DataDive for MIX to get a picture of where extending financial services for the poor is most needed in South Africa, Kenya, and Rwanda. The data scientists set up web scrapers, little programs that collect and extract hard-to-reach data, and transformed the results. In this instance, they collected all the addresses of service locations from the major financial institutions’ websites, accumulating information about 60,000 points of service in the three countries. The outcome is clean, structured data available to anyone who wants to do good with it. DataKind went beyond only collecting the data and visualised it as an interactive map for analysis (see below).
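To give a feel for what such a "little program" does, here is a minimal sketch in Python. It is not the code the DataDive volunteers wrote. The page markup, the class names `branch`, `name`, and `address`, and the sample addresses are all invented for illustration, and a real scraper would also have to download pages and cope with each website's own layout.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this sketch. Real bank websites
# will look different and need their own extraction rules.
SAMPLE_PAGE = """
<ul>
  <li class="branch"><span class="name">Main Branch</span>
      <span class="address">12 Example Road, Nairobi</span></li>
  <li class="branch"><span class="name">Market Agency</span>
      <span class="address">3 Sample Street, Kigali</span></li>
</ul>
"""


class BranchParser(HTMLParser):
    """Collects the text of the name and address spans of each branch."""

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per finished branch entry
        self._current = {}   # the branch entry being assembled
        self._field = None   # which span we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "branch":
            self._current = {}
        elif tag == "span" and css_class in ("name", "address"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.records.append(self._current)
            self._current = {}


def scrape_branches(html):
    """Turn messy page markup into a clean list of structured records."""
    parser = BranchParser()
    parser.feed(html)
    return parser.records


branches = scrape_branches(SAMPLE_PAGE)
for branch in branches:
    print(branch["name"], "|", branch["address"])
```

The point is the transformation: text buried in a web page goes in, and a list of uniform records comes out that can be saved, shared, or plotted on a map.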

One of the major insights of the project was gained when they accessed the National Credit Regulator dataset in South Africa. It contains data about institutional financial services as well as cash loan shops akin to payday lenders. As it turns out, the latter are most prevalent in the poorest neighbourhoods. This is an important insight for the efforts of MIX and other NGOs: they can now focus their resources on these areas to help the poor with alternative financial services.


Jake and I spoke briefly at the conference. I was curious about DataKind’s potential engagement in Bangladesh. As it turns out, the demand for DataKind’s services greatly outstrips the young organisation’s capacity. So instead of waiting for DataKind to get ready for Bangladesh, I would like to appeal to you. Let’s organise our own DataDive.

Are you an organisation with (data and) questions that you consider important to making a difference in Bangladesh and that you cannot answer by yourself? Get in touch and briefly describe your challenge.

Are you a data scientist, hacker, mathematician, statistician, software engineer, designer, visualisation expert, or the like, living in Bangladesh or abroad, who wants to help Bangladesh? Get in touch and briefly describe what you can do to help.

Are you able to provide sponsorship, i.e. a venue, publicity, or money, for an event to help us? Get in touch and briefly describe what you can provide.

Once we have enough people interested, I will get in touch with Jake and organise the DhakaDataDive.

This article was written as part of a regular column by Christian Prokopp for the daily online newspaper.

One comment on “Data Scientists, do good with Big Data”

  1. Jon, Nov 6, 2012 11:33

    Nice article, I’ve recently started working with a range of organisations that are addressing the issue of the copious amounts of unstructured data in the world today.

    Your article was very insightful and allowed a non-technical layman like myself to understand and digest.

    Keep up the good work!
