Newspy - A Python News Website Parser

Since I started learning how to code, I’ve had the idea of creating a news parsing script. Early iterations of the idea relied heavily on a third-party project, newspaper3k. It’s a great, robust project that made my introduction to Python incredibly fun, but it fails to extract certain data points and has some bloat.

Thus I have decided to create a new, dumb Python-based news scraper: newspy. I’m not sure whether I’ll keep the repository set up as a fork of newspaper3k. The majority of the code will be new, but I will borrow some code - for example, newspaper3k’s author extraction.

Newspy?

I called the original iterations nuntium, Latin for newspaper. In hindsight, using Latin verbiage for something tech related is nauseating. So I’ve landed on newspy.

  • news because it’s a news specific website scraper
  • spy because it’s spying on publications’ publishing practices
  • py … Python.

Dumb Webpage Parsing

Newspy isn’t meant to be the most intelligent parser. What I’m aiming for is a set of specific data points that could be useful for machine learning later on. Beyond that, I want newspy to be akin to the web archive.

Archiving Articles

I want to routinely scan news websites and ingest their article content so that I can later go back and compare any changes to the articles. This is the primary reason I’m enamoured with this project / idea. An often overlooked issue within digital media - of any persuasion - is ghost editing.

An example of ghost editing is when an editor publishes a correction to a previous article. Then, at some point in the future, the editor’s note regarding the need/reason for the edit is removed whilst the edit itself remains.
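
Here is a minimal sketch of how that kind of comparison could work, assuming two plain-text snapshots of the same article have already been archived. The function names and the sample article text are hypothetical and not part of newspy; the sketch relies only on Python’s standard hashlib and difflib modules.

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the article text (a cheap 'did anything change?' check)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def removed_lines(old: str, new: str) -> list[str]:
    """Return the lines that appear in the old snapshot but not in the new one."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff prefixes lines unique to the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]


# Two hypothetical snapshots of the same article, archived at different times.
snapshot_v1 = "\n".join([
    "The council approved the budget on Tuesday.",
    "Editor's note: an earlier version misstated the day of the vote.",
])
snapshot_v2 = "The council approved the budget on Tuesday."

changed = fingerprint(snapshot_v1) != fingerprint(snapshot_v2)
dropped = removed_lines(snapshot_v1, snapshot_v2)

if changed:
    for line in dropped:
        print("Removed since last snapshot:", line)
```

Comparing fingerprints first keeps the routine scan cheap; the line-by-line diff only needs to run for articles whose hash has changed.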

The concern I have with ghost edits is not immediate; rather, it is the potential for history to be misconstrued because context is lost. A publication / author could make an outlandish remark and then deny it in the future.

Before the digital era, a state library would often hold copies of mainstream publications. I feel that we as a society writ large require a digital news archive: something like the web archive project, but with more importance / urgency.
