A software engineer and data scientist explains why news makes a “good, reliable source” of data and how it simplified his workflow while he worked at a hedge fund.
Editor’s note: This article is the second of a two-part series examining how data from news archives can be mined to power research. It has been adapted from a recent conversation with Dwayne Desaulniers, an AP director of regional media.
I first came across data from The Associated Press while working at a hedge fund a few years ago. We were looking for information that might have a predictable impact on financial markets, and news seemed to be an obvious contributor.
We recognized pretty quickly that AP was a good, reliable source, and had some nice properties in its data that made a lot of the workflow easier.
- Specifically, AP’s comprehensive coverage means more news stories on a wide variety of topics, making the data flexible enough for many different research applications. And because AP journalists are the ones actually in the field collecting information, data from their stories are likely to be cleaner.
- A problem with scraping news stories off the web is that you’re often left with only a publication date, and even that may not be reliable. AP attaches to-the-second time stamps to each of its articles, allowing us to look at high-resolution interactions between news events.
- And finally, metadata tags from each story let us filter news not just by period of time, but also location, topic, industry and name classifications.
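The filtering the bullets above describe can be sketched in a few lines. This is a minimal illustration, not the actual AP schema: the `Story` fields and tag names are hypothetical stand-ins for the real metadata classifications.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical record structure; field names are illustrative,
# not the actual AP data schema.
@dataclass
class Story:
    headline: str
    published: datetime          # to-the-second time stamp
    tags: set = field(default_factory=set)  # location/topic/industry/name tags

def filter_stories(stories, start, end, required_tags):
    """Keep stories published in [start, end) that carry all required tags."""
    required = set(required_tags)
    return [s for s in stories
            if start <= s.published < end and required <= s.tags]

archive = [
    Story("Tesla reports quarterly earnings",
          datetime(2015, 2, 11, 16, 30, 5), {"Tesla Motors", "Earnings"}),
    Story("Automaker recalls vehicles",
          datetime(2015, 2, 11, 9, 0, 0), {"Autos"}),
]

hits = filter_stories(archive,
                      datetime(2015, 2, 11), datetime(2015, 2, 12),
                      {"Tesla Motors"})
print([s.headline for s in hits])  # → ['Tesla reports quarterly earnings']
```

The precise time stamps are what make the time window meaningful: with only a scraped publication date, the `start`/`end` bounds could not be tighter than a day.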
One application of how news can integrate into a workflow is seen in the slideshow below. You’ll see equity price data for Tesla Motors, with a blue line representing traffic to Tesla’s Wikipedia page and dots for earnings events related to the company.
Using data from AP, we could select an event and see information related to it, including news coverage on that particular day. We could filter specifically for stories related to Tesla because of the metadata tags and time stamps, and actually read the content to assess whether any news events were driving market behavior.
From there, taking a more quantitative and analytical approach, we took an underlying time series such as the price of a stock or futures instrument – this could also work for sales of a product – and then looked for events that occurred at the same discrete time that may have had some impact or reverberation.
We then lined those events up to our time series to see which posed some shock to the system. That gave us an indication that there may have been something interesting beyond random activity, and alerted us to pursue further analytical and statistical methods.
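The shock screen described above can be sketched as follows. This is a deliberately crude first pass under assumed inputs (a price series and event indexes into its return series): it flags events whose same-period return is an outlier, which is the cue to pursue the deeper statistical methods mentioned, not a conclusion in itself.

```python
import statistics

def event_shocks(prices, event_idx, k=2.0):
    """Crude shock screen: flag events whose period return deviates from
    the mean return by more than k standard deviations. A starting point
    for further analysis, not a full event study."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    mu = statistics.mean(returns)
    sigma = statistics.stdev(returns)
    return [(i, returns[i]) for i in event_idx
            if 0 <= i < len(returns) and abs(returns[i] - mu) > k * sigma]

# Mostly flat prices with one jump at return index 3;
# events occurred at return indexes 1 and 3.
prices = [100, 100.2, 100.1, 100.3, 108.0, 107.8, 108.1]
shocks = event_shocks(prices, [1, 3])
print([i for i, _ in shocks])  # → [3]
```

Only the event coinciding with the jump survives the screen; the other lines up with ordinary noise and is discarded before any further modeling.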
In addition to time series analysis, AP data can be used to examine how the sentiment of stories on certain topics affects the underlying metrics you’re tracking.
Filter through the news archive data for certain search terms from specific periods of time, and then feed the resulting stories into a natural language processor such as IBM Watson. You can then see how the public currently views people and issues in the news.
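The scoring step can be illustrated with a toy lexicon-based scorer. This is only a stand-in to show the shape of the workflow; a hosted service like IBM Watson would replace the `sentiment` function with a far richer model, and the word lists here are invented for the example.

```python
# Toy word lists -- a hosted NLP service would use a trained model instead.
POSITIVE = {"gain", "growth", "strong", "record", "beat"}
NEGATIVE = {"loss", "recall", "weak", "miss", "lawsuit"}

def sentiment(text):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

stories = [
    "Tesla posts record growth and beat estimates",
    "Automaker announces recall after weak quarter",
]
print([round(sentiment(s), 2) for s in stories])  # → [1.0, -1.0]
```

Averaging such scores over the filtered stories for each time period turns the archive into a sentiment time series that can be lined up against the underlying metric, just as the event series was above.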
When I worked in finance, I often heard buzz around using social media to produce trading signals. But I was surprised at how few trading firms looked at historical news data, which we found often produced a stronger signal than the noisy data from a handful of people typing something into Twitter.
You can tell there is a lot of thought put into how AP organizes and structures its data. And it’s consistent. You don’t see a lot of change in how the data is managed over time, making it easier to believe the models you’ve created will actually hold water when they start making predictions.
Adam is a software engineer and data scientist who has previously worked at Kyper Data, Flyberry Capital and MIT.