How clean is your data?

Working with numbers is becoming more important than ever for journalists.

Vast amounts of data are being collected online, investigative journalism outfits like ProPublica are doing more and more work with large sets of publicly available data, and data visualisations are increasingly becoming a standard part of reporting. At the end of last year, Amy Webb, CEO of Webbmedia, named 'Big Data' as her first prediction of a major tech trend for 2011.

Tools already exist for journalists to exploit this growth in data. Nieman Lab reported earlier this week on Weave, an open-source internet platform for creating visualizations of "any available data by anyone for any purpose". Another example is Tableau Public, a data visualization tool that was billed as requiring "no technical ability" and being "easier to use than the wizard options that allow you to create graphs in Excel".

The results are often impressive. Take, for example, this post using Tableau from The Guardian's datablog last November, which allows users to explore data from the Office for National Statistics' Annual Survey of Hours and Earnings in order to compare the pay received by men and women.

But although there are many excellent and informative visualizations like this one by The Guardian, there is also an important question to be asked: how many times have you shared a data visualization because it looks pretty, rather than because you know the findings to be accurate? In the rush to provide readers with data in a pleasing format, journalists may be tempted or pressured to create or share visualizations without knowing whether the data they are based on is really clean.

Unreliable data abounds, and not just in the context of visualizations. To take one example, in this blog post, Tony Hirst describes Facebook 'laundering' data to make claims about the economic benefits that it has brought to Europe. Hirst describes the process: "We have some dodgy evidence, about which we're biased, so we give it to an 'independent' consultant who re-reports it, albeit with caveats, that we can then report, minus the caveats. Lovely, clean evidence. Our lobbyists can then go to a lazy policy researcher and take this scrubbed evidence, referencing it as finding in the Deloitte report, so that it can make its way into a policy briefing."

This is not a new phenomenon, but it may be one reason why Getstats, a campaign from the UK's Royal Statistical Society, is particularly welcome. Headed by former journalist David Walker, Getstats is campaigning to make numeracy an important part of journalists' training. Walker has proposed twelve points about understanding numbers that every journalist should know.

The first point advises reporters:

"You come across a number in a story or press release. Buyer beware. Before making it your own, ask who cooked it up; what are their credentials; are they selling something. What other evidence do we have (what numbers are they not showing us?); why this number, now? If the number comes from a study or research, has anyone reputable said it is any good?"

Other points include asking what kind of average is being quoted, understanding whether a sample of data is representative of a wider group, and checking whether results fall within a margin of error. Walker has reportedly said that his twelve points are "a starting point" and that he would "welcome feedback from journalism lecturers to help shape the minimum standards of an understanding of numbers". As big data becomes more important, these kinds of conversations should become more and more productive.
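To make two of those checks concrete, here is a minimal Python sketch. It is not part of Walker's twelve points: the salary figures and the poll result are invented purely for illustration, and the helper `margin_of_error` is a hypothetical name that uses the standard textbook approximation for a proportion drawn from a simple random sample.

```python
import math
import statistics

# Hypothetical salaries, invented for illustration only.
# One very high earner is enough to pull the mean away from the median.
salaries = [18_000, 21_000, 22_000, 24_000, 25_000, 27_000, 250_000]

mean_salary = statistics.mean(salaries)      # the "average" most press releases quote
median_salary = statistics.median(salaries)  # the middle value, less affected by outliers


def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    z=1.96 corresponds to a 95% confidence level. This is the usual
    normal approximation; it assumes random sampling, which real polls
    may not achieve.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Hypothetical poll: 52% of 1,000 respondents back a proposal.
share = 0.52
moe = margin_of_error(share, 1000)
lower, upper = share - moe, share + moe

print(f"mean: {mean_salary:,.0f}  median: {median_salary:,.0f}")
print(f"poll: {share:.0%} +/- {moe:.1%} (range {lower:.1%} to {upper:.1%})")
```

In this made-up data the mean is more than twice the median, so "the average salary" means very different things depending on which one is quoted. The poll's margin of error is roughly three percentage points, so the plausible range dips below 50% and a headline claiming a majority would not be safe.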

Sources: ProPublica, Nieman Lab (1) (2), (1) (2) (3), The Guardian, Tony Hirst,


Hannah Vinter


2012-02-02 16:18


© 2015 WAN-IFRA - World Association of Newspapers and News Publishers
