Following July’s Chipotle norovirus outbreak, I wrote an article analyzing ways we might profit from detecting food safety incidents before they are disclosed by the mainstream media. For a publicly traded company, a confirmed case of food contamination will probably drive the stock price down. Since the Chipotle case, iwaspoisoned.com has gained a great deal of traffic and become the go-to website for crowdsourced food contamination reporting.
Scrolling through the reports, you’ll see a fairly simple format: people claim to have gotten sick at eateries and include the location, their symptoms, and any other information they deem relevant. Of course, viewed in isolation, reports like this are highly unreliable. Deciding which of your three meals in a given day made you sick, and ruling out your own lack of hygiene as the cause, is unscientific to say the least. The crowdsourcing aspect is what makes the site work, and it is what has enabled it to detect outbreaks before health inspectors could visit a restaurant and confirm that something was wrong.
So we just write a program that takes a look at the reports and then we get rich, right? Not quite.
There are some prominent ‘big picture’ issues to deal with first. The overarching one is that scraping iwaspoisoned is against its Terms of Service. This is a restriction most sites have; why would they want their information taken and stored elsewhere? In truth, I have no intention of profiting from gathering this information, and I suppose that means I have to classify this post as ‘for educational purposes only.’ I do not support or encourage breaking websites’ terms of service.
The next issue is that reports posted to iwaspoisoned are moderated. Isn’t that a good thing? Not for us. It means that the reports, though they seem to be posted in nearly real time, aren’t. In the case of a flurry of reports, iwaspoisoned’s moderators can filter them as they see fit, which lets them (1) profit, if they want to, and (2) report the outbreak to the media to build the service’s credibility. In this sense we’re pretty much screwed, and it takes a lot of the potential away from the project. Still, beyond getting a cool data set to play with, there is a real chance that with your own careful analysis you can detect an unfolding event in the window between IWP noticing it and the mainstream media reporting it, and certainly before it is digested by the markets.
I’m going to show you how I developed a web scraper that builds a food poisoning dataset.
IWP does not expose all of its historical data. Currently, reports can only be viewed as far back as August 6th; perhaps the cutoff is volume-based or time-based. In any case, this will be the most challenging part of the project: we have to design a mechanism for accumulating the data ourselves. As soon as that is done, we’ve gained an edge over anybody else scraping IWP but not storing a chronological data set.
The fields provided by IWP are Title, Time, Location, Symptoms, Details, Doctor Visit (plus Diagnosis), and Comments. By clicking the title of a report, you can view the post on its own page, where it is tagged appropriately. If you want to analyze one company specifically, such as Chipotle Mexican Grill, there is a dedicated page where all of its reports are aggregated. However, that page suffers from the same problem as the main page: any information going back farther than roughly a month is inaccessible.
When I store the data, I want to be able to tag it too. Here’s what I want:
Location: (Address, City, State, Country)
Eatery Name: (Proper Name)
Date:
Details:
I’ve decided that the doctor’s visit, diagnosis, comments, and symptoms fields are pretty much useless. We know that if someone posted on the site, they got sick, and probably very sick. Details are somewhat useless too, but we may eventually be able to use that field to better home in on when the food was actually consumed.
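To make the target format concrete, here is roughly what one stored record should look like; the eatery, address, date, and details below are made up purely for illustration:

record = {
    "name": "Example Burrito Bar",                                          #proper name of the eatery
    "location": "Example Burrito Bar, 123 Main St, Springfield, IL, USA",   #address, city, state, country
    "date": "2017-08-15 21:30:00",                                          #when the report was posted
    "details": "Ate lunch here around noon, was sick by the evening."       #free-text description
}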
A scraper that pulls down a page:
from bs4 import BeautifulSoup
import requests

url_base = "https://iwaspoisoned.com"
page_query = "/?page="
page_number = 1

#Pull down the first page of reports and parse it
response = requests.get(url_base + page_query + str(page_number))
soup = BeautifulSoup(response.content, "html.parser")
reports = soup.findAll("div", {"class": "post-box"})

#Each report lives in its own post-box div; the heading holds the title and time,
#and the post-info div holds the labeled Location and Details paragraphs
for report in reports:
    name = ""
    location = ""
    date = ""
    details = ""
    header = report.find("div", {"class": "heading"})
    body = report.find("div", {"class": "post-info"})
    title = header.find("a")
    time = header.find("span").text
    paragraphs = body.findAll("p")
    for paragraph in paragraphs:
        span = paragraph.find("span")
        if "Location:" in str(span):
            #Strip the "Location:" label, keep the eatery name and the full location string
            span.extract()
            location_components = paragraph.text.split(",")
            name = location_components[0]
            location = paragraph.text
        if "Details:" in str(span):
            #Strip the "Details:" label and keep the free-text description
            span.extract()
            details = paragraph.text
This is a pretty typical web scraper setup. The good news is that it’s almost complete. Two things are missing. First, we need some way of actually storing the data; for the sake of example we’ll use SQLite, a local database. Second, we need to cycle through the pages and make sure we don’t insert duplicates. I’ll also do the minimum required formatting.
from bs4 import BeautifulSoup
import requests
import sqlite3
from datetime import datetime

url_base = "https://iwaspoisoned.com"
page_query = "/?page="
page_number = 1

conn = sqlite3.connect('poisonings.db')
c = conn.cursor()
#c.execute('''DROP TABLE poisonings''')
c.execute('''CREATE TABLE IF NOT EXISTS poisonings
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
              name text, location text, date text, details text)''')

#Find the newest time already in the database so we can stop once we reach it
newest_time_in_db = datetime.utcfromtimestamp(0)
for row in c.execute('SELECT * FROM poisonings'):
    row_datetime = datetime.strptime(row[3], "%Y-%m-%d %H:%M:%S")
    if row_datetime > newest_time_in_db:
        newest_time_in_db = row_datetime

while True:
    response = requests.get(url_base + page_query + str(page_number))
    print("Now scraping page " + str(page_number))

    #This is how you know there are no pages left
    if 'No incidents' in response.text:
        print("Reached last page - nothing left to do")
        break

    soup = BeautifulSoup(response.content, "html.parser")
    reports = soup.findAll("div", {"class": "post-box"})
    for report in reports:
        name, location, date, details = ("",) * 4
        header = report.find("div", {"class": "heading"})
        body = report.find("div", {"class": "post-info"})
        title = header.find("a")
        date = header.find("span").text

        #The page omits the year, so assume the report is from the current year
        date_parts = date.split(" ")
        full_date = date_parts[0] + " " + date_parts[1] + " " + str(datetime.now().year) + " " + date_parts[2]
        report_datetime = datetime.strptime(full_date, '%B %d %Y %I:%M%p')

        #Once you've reached a report that is already in the database, stop scraping
        if report_datetime <= newest_time_in_db:
            print("Overlapping times - aborting")
            conn.close()
            exit(0)

        paragraphs = body.findAll("p")
        for paragraph in paragraphs:
            span = paragraph.find("span")
            if "Location:" in str(span):
                span.extract()
                location_components = paragraph.text.split(",")
                name = location_components[0]
                location = paragraph.text
            if "Details:" in str(span):
                span.extract()
                details = paragraph.text

        if name and location and date and details:
            #A parameterized query handles quoting for us, so no manual escaping is needed
            c.execute("INSERT INTO poisonings (name, location, date, details) VALUES (?, ?, ?, ?)",
                      (name, location, str(report_datetime), details))
            conn.commit()

    page_number += 1

conn.close()
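Before wiring this up to a schedule, it’s worth sanity-checking what actually landed in the table. As a rough sketch, a query along these lines surfaces eateries with more than one report on the same day, which is exactly the kind of cluster we’d want to flag:

import sqlite3

conn = sqlite3.connect('poisonings.db')
c = conn.cursor()

#Count reports per eatery per calendar day and show the biggest clusters first
query = """SELECT name, date(date) AS day, COUNT(*) AS num_reports
           FROM poisonings
           GROUP BY name, date(date)
           HAVING COUNT(*) > 1
           ORDER BY num_reports DESC"""
for name, day, num_reports in c.execute(query):
    print(name + " | " + day + " | " + str(num_reports))

conn.close()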
So now, if we just run a cron job that matches our desired scraping frequency (let's say once a day), we're good to go.
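For reference, a crontab entry along these lines would do it; the interpreter and script paths are placeholders for wherever the scraper actually lives:

#Hypothetical crontab entry: run the scraper every day at noon
0 12 * * * /usr/bin/python3 /path/to/iwp_scraper.py >> /path/to/iwp_scraper.log 2>&1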
The only remaining problem is what to do with the data. If I knew for sure, I would be doing it! If you'd be interested in a more detailed tutorial on web scrapers or my thought process behind solving this problem, please let me know in the comments.