Open source text analysis tool exposes repurposed news | Opensource.com
Open source text analysis tool exposes repurposed news
Churnalism US is a new web tool and browser extension that allows anyone to compare the news you read against existing content to uncover possible instances of plagiarism. It is a joint project with the Media Standards Trust.
Simply feed in a link or block of text to the Churnalism site or let the browser extension run in the background to notify you of any matches of text from Churnalism's cache of documents. They include most articles in Wikipedia, press releases from PR Newswire, PR News Web, EurekaAlert!, congressional leadership offices, the White House, a sampling of Fortune 500 companies, prominent philanthropic foundations and much more. The browser extension available for Chrome, Internet Explorer, and Firefox (full approval pending) allows Churnalism to extract article text from a whitelist of common news sites and lets you know when something you're reading may be copied from another source. It's a rare occurrence, but it's not unprecedented. Just last week Tom Lee, a noted Churnalism beta tester and the Sunlight Labs Director, found through Churnalism that Reuters' prematurely published an obituary of the still-alive-human George Soros borrowed heavily from the collection of quotes on his Wikipedia page.
Sunlight’s Churnalism is based on a UK site of the same name and is driven by open-source text analysis technology dubbed SuperFastMatch, both developed by the awesome Media Standards Trust. For a deeper dive into the underlying technology and process behind the project, check out this detailed post from Drew Vogel, another developer on Churnalism.
With the extension installed, you can learn about the sourced and unsourced flow of text copied from somewhere else. For some anecdotal evidence from my experience using Churnalism, I've found a number of instances of articles about science topics relying heavily on press releases and study summaries. For example, take this piece on the BBC website about epilepsy and migraines. Churnalism found a significant portion of the text came from this press release in EurekaAlert! and let me know with a ribbon notification on the top of the page. By tapping the Show Me button on the notification, Churnalism overlays a side-by-side display of the article and the possible match with copied text highlighted for easy comparison:
The best way to detect influence and language sharing from other sources is to install the browser extension and continue consuming news. You'll slowly start uncovering overlaps of language seen in this CBS News report, this NY Daily News article, this piece on NBC News or maybe uncover a reverse application of Churnalism, like this New York Times article that is cited heavily in a Wikipedia article.
We understand the privacy sensitivities with an extension extracting text from what you read, so we've designed Churnalism to be quite customizable and never retain identifiable information such as your IP address. You can easily change which sites Churnalism runs on by going into the settings for the browser extension. We've provided a basic whitelist of major news sites, a listing of local news affiliates and the ability to let Churnalism run on any site with news or article in url, but all these can be removed or paired down (or expanded!) to whatever sites you're interested in.
We're very excited to get this project out into the public and hope to continue to improve the underlying software as there are some excited potential applications for large corpus text matching. We used the SuperFastMatch technology to look at model legislation and it drove stories like our look at how ALEC distributed the 'Stand Your Ground' legislation for adoption in a number of different states.
Let us know any interesting Churnalism matches you uncover!
Originally posted on the Sunlight Foundation blog. Reposted under Creative Commons.