Scraping and Analyzing News Headlines

As a political news junkie, I was inspired to analyze news headlines after coming across three slightly different stories about the same topic. Specifically, Fox News, the New York Times, and the Washington Post each ran a story describing the announcement from North Korea on April 20, 2018 that the country would no longer pursue missile testing. These stories all refer to the same event, yet they use different language and convey different shades of information even in the headline. See here for a slightly more in-depth description of the differences between the three headlines.

Several questions came to mind after reading the three stories. Are there systematic differences between publications in the way they cover events or the information they include? In today's political environment, clouded by attacks on the media from the highest levels of government, this question is of vital interest to democracy and to news organizations themselves as they navigate an increasingly complex media landscape. A question on many minds concerns the political neutrality of the media: are legacy media primarily neutral gatekeepers of information and accountability, or do they tend to favor certain interests or even parties?

To answer these questions and to practice some new skills, I set out to collect data from the three news organizations mentioned above. I chose to scrape news headlines directly from the home page of each publication's website, building an ETL (Extract-Transform-Load) pipeline with Selenium and Pandas (a sketch of the extract step follows the list below). I made several choices here that deserve mention:

  • Web scraping instead of using an API or some other means of collecting news data. Why not use the News API or something similar?
  • Deploying the ETL pipeline and interactive infographic on Heroku.
  • Scraping only the headline text, source publication, and story URL. This excludes, for example, the category, author, and even the publication date. These can eventually be recovered by visiting the story's URL, though the URL itself is presumably not guaranteed to remain static.
  • Doing exploratory analysis in Python and writing the infographic in the new framework Dash, marketed as the Python alternative to Shiny.
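
To give a concrete flavor of the extract step, here is a minimal sketch of how headlines might be pulled from a home page with Selenium and gathered into a Pandas DataFrame. The URL and the CSS selector are placeholders invented for illustration; the real pipeline needs publication-specific selectors, which tend to change whenever a site is redesigned.

```python
# Minimal sketch of the extract step: Selenium + Pandas.
# The URL and the "h3 a" selector are illustrative placeholders only.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://www.example-news-site.com")  # placeholder home page
    rows = []
    # Hypothetical selector: anchors nested inside headline elements.
    for link in driver.find_elements(By.CSS_SELECTOR, "h3 a"):
        text = link.text.strip()
        url = link.get_attribute("href")
        if text and url:
            rows.append({"headline": text, "url": url, "publication": "example"})
finally:
    driver.quit()

# Deduplicate on URL before handing off to the load step.
headlines_df = pd.DataFrame(rows).drop_duplicates(subset="url")
print(headlines_df.head())
```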

To implement the ETL pipeline, I wrote a CLI (command-line interface) script that lets the user either write a CSV to a local file or load the data into a Postgres database hosted on Heroku; only new headlines are loaded into the database on each run. The script is configured to run every 10 minutes using Heroku Scheduler, a convenient UI that abstracts away the configuration of cron jobs.
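
The sketch below shows roughly how such a script can be structured, assuming a `headlines` table keyed by URL and Heroku's convention of exposing the connection string as `DATABASE_URL`. The table and column names and the `scrape_headlines()` helper are hypothetical, stand-ins for the actual script rather than its real contents.

```python
# Sketch of the CLI: write a local CSV and/or append only unseen headlines
# to a Postgres table. The headlines table, url column, and scrape_headlines
# helper are hypothetical names used for illustration.
import argparse
import os

import pandas as pd
from sqlalchemy import create_engine


def load_to_postgres(df: pd.DataFrame, engine) -> None:
    """Append only rows whose URL is not already in the (assumed) headlines table."""
    existing = pd.read_sql("SELECT url FROM headlines", engine)
    new_rows = df[~df["url"].isin(existing["url"])]
    new_rows.to_sql("headlines", engine, if_exists="append", index=False)


def main() -> None:
    parser = argparse.ArgumentParser(description="Scrape headlines and store them.")
    parser.add_argument("--csv", metavar="PATH", help="write results to a local CSV file")
    parser.add_argument("--db", action="store_true", help="load results into Postgres")
    args = parser.parse_args()

    df = scrape_headlines()  # hypothetical wrapper around the Selenium step above

    if args.csv:
        df.to_csv(args.csv, index=False)
    if args.db:
        # Heroku provides DATABASE_URL; newer SQLAlchemy wants the postgresql:// scheme.
        url = os.environ["DATABASE_URL"].replace("postgres://", "postgresql://", 1)
        load_to_postgres(df, create_engine(url))


if __name__ == "__main__":
    main()
```

Heroku Scheduler then simply invokes the script with the database flag on its 10-minute cadence.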

I used the following technologies and Python libraries:

  • Python 3
  • Pip inside an Anaconda environment for managing dependencies
  • Heroku for hosting the database, cron job, and infographic
  • PostgreSQL for data storage
  • Libraries:
    • Selenium and Pandas for scraping, deduplicating, and database loading
    • Matplotlib & Seaborn for visualization-based EDA
    • NLTK, TextBlob, and scikit-learn for textual analysis. This analysis is still in progress.
    • Dash for the public interactive infographic

So far, textual analysis has consisted primarily of stemming and tokenization with custom word analyzers, followed by principal-component analysis and k-means clustering to identify narratives and similar headlines (a sketch of this step follows the list below). Eventually, I'd like to return to the original questions about media neutrality, but there are some intermediate steps to take first. In particular, I'd like to:

  • Identify document clusters, a first step toward identifying narratives.
  • Identify individuals and organizations that tend to be mentioned together. This will require entity detection and analysis/visualization of the co-occurrence matrix (and associated weighted graph).
  • Identify narratives and topics, probably using Latent Dirichlet Allocation, as well as the entities associated with each topic.
  • Collect more information about each story using its URL, including author, category (e.g., opinion vs. lifestyle vs. politics), publication datetime, and more.
  • Conduct sentiment analysis to identify each headline's appraisal of the detected entities, as well as its overall polarity (a minimal sketch follows below). By comparing these results with the political affiliation of each entity, we may be able to observe patterns across publications. In particular, I hypothesize that WaPo and the NYTimes lean left while Fox News leans right.
  • Build a classifier to predict publication from headline and story features.
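
As a point of reference for the clustering work mentioned above, here is a minimal sketch of that kind of pipeline: TF-IDF vectorization with a custom stemming analyzer, PCA for dimensionality reduction, and k-means to group similar headlines. The analyzer, the toy headlines, and the parameter choices are illustrative and not necessarily what the project uses.

```python
# Sketch: TF-IDF with a custom stemming analyzer -> PCA -> k-means.
from nltk.stem.snowball import SnowballStemmer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")
base_analyzer = TfidfVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(doc):
    """Tokenize and clean with the default analyzer, then stem each token."""
    return [stemmer.stem(token) for token in base_analyzer(doc)]

headlines = [  # toy stand-in for the scraped corpus
    "North Korea says it will suspend missile tests",
    "Trump welcomes North Korea's halt to nuclear testing",
    "Markets rally as tech stocks rebound",
]

tfidf = TfidfVectorizer(analyzer=stemmed_analyzer)
X = tfidf.fit_transform(headlines)

# PCA needs a dense array; fine for a small corpus (TruncatedSVD scales better).
X_reduced = PCA(n_components=2).fit_transform(X.toarray())

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)  # cluster assignment per headline
```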

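For the sentiment step, TextBlob makes a first pass very simple. The snippet below only scores a headline's overall polarity and subjectivity; per-entity appraisal would first require the entity detection mentioned above. The example headline is made up.

```python
# Minimal illustration of headline-level sentiment with TextBlob.
# Polarity ranges from -1 (negative) to +1 (positive);
# subjectivity from 0 (objective) to 1 (subjective).
from textblob import TextBlob  # may need: python -m textblob.download_corpora

headline = "North Korea says it will suspend missile tests"  # example input
sentiment = TextBlob(headline).sentiment
print(sentiment.polarity, sentiment.subjectivity)
```
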
See the infographic (warning: work in progress).
See the code here.