Most people who know me well know I’m a massive fan of the show Jeopardy!. I’m fairly good at answering the questions myself, but my batting average is below what an actual contestant on the show would be capable of. If I’m going to get better, I need to be able to identify what subjects and concepts appear most often on the show, and what my own areas of expertise are. This will require some data analysis.
Pulling Jeopardy! data from the online archive#
It doesn’t take long to notice that there’s a pattern in the subjects that appear on Jeopardy!. A typical show will contain:
- One category about history
- One category about science, especially medicine
- One category about literature, including poetry and plays
- One category about film or TV
- One category about either music or painting, or sometimes one of each
- One or two categories about geography
- One or two categories about wordplay and language
- A handful of miscellaneous topics, although the writers seem to have a fondness for current events, brands, pop culture, academia, and mythology.
To give an example of how set in stone this formula is, consider the categories from October 25th of this year:
Of course, it’s possible to get an idea of what’s on the show by watching it. However, this is rather slow compared to what’s possible, and I don’t have cable. To access the questions more quickly and easily, one can look at the online Jeopardy! archive at j-archive.com. In fact, I’ve been doing this for a while now. I have a massive spreadsheet where I log my scores for different “genres” of questions on the archive, in order to identify my strengths and weaknesses.
This approach is adequate, but it does have its limitations. The largest one is that it’s not possible to see the text of questions that have been answered wrong. It’s one thing to recognize that 41% of questions about Film and TV were answered correctly, but which movies are being asked about? Additionally, there’s no way to tell the values of questions, or if they appeared in Single or Double Jeopardy! In order to fix this, it’s important to build a comprehensive database of Jeopardy! questions, their answers, their subjects, and whether they were answered right or wrong.
The easiest approach is to scrape this data from the website directly. To do this I used a Python library called Beautiful Soup. Once the HTML code for a page of j-archive.com is saved, it can be split up into separate chunks for Single, Double, and Final Jeopardy!, allowing for the clues and categories to be easily sorted.
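As a rough sketch of that splitting step, the snippet below parses a tiny stand-in page with Beautiful Soup and groups categories and hints by round. The `div` ids and `td` class names are assumptions about how the archive's HTML is laid out, the sample page is invented, and `split_rounds` is a hypothetical helper rather than the code this project actually uses.

```python
from bs4 import BeautifulSoup

# Assumed ids of the <div> wrapping each round, mapped to the jtype codes
# used in this article (1 = Single, 2 = Double, 0 = Final).
ROUND_IDS = {
    "jeopardy_round": 1,
    "double_jeopardy_round": 2,
    "final_jeopardy_round": 0,
}

# A made-up miniature page standing in for a saved archive page.
SAMPLE_PAGE = """
<div id="jeopardy_round">
  <table>
    <tr><td class="category_name">SAMPLE CATEGORY A</td></tr>
    <tr><td class="clue_text">Sample hint one</td></tr>
  </table>
</div>
<div id="double_jeopardy_round">
  <table>
    <tr><td class="category_name">SAMPLE CATEGORY B</td></tr>
    <tr><td class="clue_text">Sample hint two</td></tr>
  </table>
</div>
<div id="final_jeopardy_round">
  <table>
    <tr><td class="category_name">SAMPLE CATEGORY C</td></tr>
    <tr><td class="clue_text">Sample final hint</td></tr>
  </table>
</div>
"""


def split_rounds(page_html):
    """Return {jtype: {"categories": [...], "hints": [...]}} for one page."""
    soup = BeautifulSoup(page_html, "html.parser")
    rounds = {}
    for div_id, jtype in ROUND_IDS.items():
        section = soup.find("div", id=div_id)
        if section is None:
            # Some games may lack a round entirely; skip it rather than fail.
            continue
        rounds[jtype] = {
            "categories": [
                td.get_text(" ", strip=True)
                for td in section.find_all("td", class_="category_name")
            ],
            "hints": [
                td.get_text(" ", strip=True)
                for td in section.find_all("td", class_="clue_text")
            ],
        }
    return rounds


rounds = split_rounds(SAMPLE_PAGE)
print(rounds[1]["categories"])
```

Keying the result by `jtype` means each chunk can later be written straight into the round column without a second lookup.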
There are a few issues with this data as it exists, which require some effort to clean. For one, not every question in Jeopardy! is available. If the contestants run out the clock before a question is asked, it won’t appear on the Jeopardy! archive. Another problem is that many categories require clarification from the host to provide necessary context.
Neither of these problems is especially troubling. Missing questions can simply be left out, and Ken Jennings’s comments can be included in the text of the category. A more difficult problem is with those clues that contain some sort of media. Many clues on Jeopardy! include an image, a sound, or a video (think of the video Daily Double, for instance). Although the archive we’re pulling from does provide links to this media, these links seem to always lead to a 404 error.
Although these clues are sometimes solvable without the media they refer to, they sometimes aren’t. My approach is simply to include a tag on each question marking whether it was supposed to include media, and otherwise to remove the broken links entirely.
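To illustrate the tagging idea, here is a minimal sketch using only the standard library. It assumes that media always appears in a clue's raw HTML as an `<a>` link, which is my guess rather than something confirmed about the archive, and `clean_hint` is a hypothetical name.

```python
import re

# Matches an <a ...>...</a> element and captures its visible text.
LINK_PATTERN = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
# Matches any remaining HTML tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_hint(raw_html):
    """Return (hint_text, has_media) for one clue's raw HTML."""
    # A link in the clue is treated as a sign that media was attached.
    has_media = bool(LINK_PATTERN.search(raw_html))
    # Keep the words inside the link but drop the (broken) link itself.
    text = LINK_PATTERN.sub(r"\1", raw_html)
    # Remove any other leftover markup and normalize whitespace.
    text = TAG_PATTERN.sub("", text)
    return " ".join(text.split()), has_media


example = 'This <a href="http://example.com/clue.jpg" target="_blank">painting</a> hangs in a museum'
print(clean_hint(example))
```

The returned flag maps directly onto a boolean `media` column, while the cleaned text is what the user would actually read.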
Once this is completed, the dataframe is organized and exported to a csv file. That dataframe looks a bit like this:
The columns of this sheet are:
column | type | description |
---|---|---|
category | str | The official name of the category given by the show. This is the phrase a contestant says before " for $800, Ken" |
hint | str | The actual text given by the host to the contestants. This is called the "hint" rather than "question", because the question is technically said by the contestant |
answer | str | The correct response to the hint. To avoid ambiguity, this is also often called the "response" in this article |
value | int | An integer indicating the point value of a question. The dollar sign is not included |
jtype | int | An integer indicating which round of Jeopardy! a row is included in. 1 indicates Single Jeopardy, 2 indicates Double Jeopardy, and 0 indicates Final Jeopardy |
media | bool | Indicates whether or not a hint includes media, such as sound, images, or video |
DD | bool | Indicates if a hint is one of the daily doubles |
date | date | This is the date when a question is aired, rather than the date it was downloaded, or the date it is viewed by the software user. |
game_name | str | A string giving a name to a game. Although almost all games have names of the form "Game #____", some are unique |
correct | int | An indicator for whether the user could answer the question correctly or not. 1 indicates a correct answer, -1 indicates an incorrect answer, and 0 indicates that the user would have skipped |
subjects | str | The broad subject or subjects a hint belongs to. Because there is some variance within categories, this applies to a specific question, not a category. Each subject is represented by a single letter in the string |
date_answered | date | The date at which the user entered whether they would have answered a question correctly or not |
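As one example of how these columns fit together, the sketch below computes the share of correct answers per subject letter from the `correct` and `subjects` columns. It assumes skipped and not-yet-answered questions should be left out of the percentage, which is my choice for illustration, and `accuracy_by_subject` is a hypothetical helper.

```python
import pandas as pd


def accuracy_by_subject(df):
    """Return a Series mapping each subject letter to the fraction answered correctly."""
    # Keep only questions that were actually attempted (1 = correct, -1 = incorrect).
    answered = df[df["correct"].isin([1, -1])].copy()
    # Turn a subjects string such as "lu" into the list ["l", "u"].
    answered["subject"] = answered["subjects"].fillna("").astype(str).apply(list)
    # One row per (question, subject) pair; questions without subjects drop out.
    exploded = (
        answered.explode("subject")
        .dropna(subset=["subject"])
        .reset_index(drop=True)
    )
    exploded["is_correct"] = exploded["correct"].eq(1)
    return exploded.groupby("subject")["is_correct"].mean()


sample = pd.DataFrame(
    {
        "correct": [1, -1, 1, 0],
        "subjects": ["lu", "l", "f", "f"],
    }
)
print(accuracy_by_subject(sample))
```

Because a question can carry several letters, it counts once toward each of its subjects.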
Compiling the mountain of data that exists currently#
This section will focus mostly on how this data was compiled and cleaned to be useful for what comes next. It won’t be long, but if you’re not interested in those kinds of mundane details, feel free to skip to the next section.
At time of writing, there are just over 9100 games of Jeopardy! available on the online archive, with another game added each weekday. At roughly 60 questions per game (a complete game has 61, but some clues are never revealed), it’s possible to build a collection of around 546,000 questions. This raises the question of how many of these questions to include. There are a handful of types of games we may choose not to include for being non-representative of standard modern Jeopardy!:
- Old episodes of Jeopardy! will lack questions relating to events that occurred after the date of airing. An episode from the 80's couldn't possibly ask any question about, for example, the Clinton presidency, or the Star Wars Prequels.
- Additionally, old episodes are more likely to ask questions that are no longer relevant to the modern show. The Jeopardy! writers like to ask questions pertaining to contemporary culture, and things like Jennifer Slept Here or J. J. Fad have been largely forgotten.
- Celebrity Jeopardy! has a well-known reputation for featuring questions easier than those from a standard game.
- Conversely, an episode from one of the many Tournament of Champions runs would likely feature questions more difficult than usual. These episodes have their difficulty adjusted to provide an exciting challenge for contestants that have already won a large number of games.
- Many episodes from early seasons are missing entirely from the archive. Many others are present, but are missing specific questions. In fact, if you should happen to be in possession of a VHS taping of Jeopardy! from the 80's or early 90's, I'm sure the people at the archive would be happy to hear about it! There's a decent chance you own a piece of lost media.
Despite all these problems, I’m choosing to include as much data as is available. Although older episodes are less relevant, I’ll simply choose to deprioritize those by placing them at the bottom of the dataframe. Regarding episodes from Celebrity Jeopardy! or other promotional events, the dataframe will include the exact title of each episode. The standard episodes are easily recognizable as their titles are all simply numbers.
After Monday, November 26, 2001, the value of each clue was doubled. The $300 clues from before that date were exchanged for $600 clues, and likewise for every other value available. For the sake of this project, all questions from before this change will have their values doubled, in order to normalize the data.
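A minimal sketch of that normalization with pandas might look like the following. It treats games aired before November 26, 2001 as using the old values, which is my reading of the cutoff. It doubles every such row without special handling for Daily Doubles or Final Jeopardy!, and `normalize_values` is a hypothetical helper.

```python
import pandas as pd

# Assumed first air date on which the doubled values were in effect.
VALUE_CHANGE_DATE = pd.Timestamp("2001-11-26")


def normalize_values(df, cutoff=VALUE_CHANGE_DATE):
    """Return a copy of df with pre-cutoff clue values doubled."""
    out = df.copy()
    aired = pd.to_datetime(out["date"])
    old_game = aired < cutoff
    # Double only the rows from games that aired before the change.
    out.loc[old_game, "value"] = out.loc[old_game, "value"] * 2
    return out


games = pd.DataFrame(
    {
        "date": ["1999-05-03", "2001-11-26", "2005-01-10"],
        "value": [300, 400, 600],
    }
)
print(normalize_values(games)["value"].tolist())
```

Keeping the cutoff as a parameter makes it easy to shift the boundary by a day if the first doubled game turns out to be dated differently in the archive.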
For posterity, I’ll include the code used to compile the full dataframe here. It’ll be changed after I write this.
# Datascraper is an uncreative name for a module built
# specifically for this project.
import Datascraper as ds
import pandas as pd
df = pd.DataFrame([], columns=['category', 'hint', 'answer', 'value', 'jtype', 'media', 'DD',
                               'date', 'game_name', 'correct', 'subjects', 'date_answered'])
# Each game's dataframe is collected in a list and concatenated once at the end,
# which avoids copying the ever-growing dataframe on every iteration
frames = [df]
# Here, the program collects each season available
for season in ds.find_seasons():
    # And here, the program looks through the directory of that season, and finds each game available
    for game in ds.find_games_in_season(season):
        # The data from that game is added to the list
        frames.append(ds.webpage_to_dataframe(game))
# The collected data is combined into one massive dataframe
df = pd.concat(frames)
# And that dataframe is exported (without the index, so it reads back in cleanly)
df.to_csv('jeopardydata.csv', index=False)
After this is complete, the end result is a csv file containing roughly 90 megabytes of questions, answers, categories, and other pieces of information. Admittedly, this is not the most compact this data could be. It would be totally possible to have a second dataframe containing the data from each game, rather than each question, with the questions only having the index of the game, rather than its entire title and date. Even the categories could be stored here, so the dataframe of questions would only have to contain the number of each category. That all said, I estimate this would have an overall minimal impact on the size of the file.
The date and name of each game represent a very small proportion of this data, and neither of these columns is substantially larger than other columns that consist of only integers, such as ‘value’. Therefore, we’ll keep this data in the form that it is. Although 90 megabytes is a solid chunk of data, it’s not a ridiculous amount.
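For anyone curious what the two-table layout would look like, here is a small sketch of the split. It is not something this project does, and `split_games` and the `game_id` column are hypothetical names.

```python
import pandas as pd


def split_games(df):
    """Split one flat dataframe into (games, questions) linked by game_id."""
    # One row per distinct game, in order of first appearance.
    games = df[["game_name", "date"]].drop_duplicates().reset_index(drop=True)
    games["game_id"] = games.index
    # Replace the repeated name/date pair on each question with the game's id.
    questions = df.merge(games, on=["game_name", "date"], how="left")
    questions = questions.drop(columns=["game_name", "date"])
    return games, questions


flat = pd.DataFrame(
    {
        "game_name": ["#2", "#2", "#1"],
        "date": ["2024-01-02", "2024-01-02", "2024-01-01"],
        "hint": ["first hint", "second hint", "third hint"],
    }
)
games_table, questions_table = split_games(flat)
print(questions_table)
```

Comparing the size of the two exported tables against the flat file would be one way to check the estimate above.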
Unfortunately, this chunk of data needs to grow with time. I compiled this dataframe over the weekend, but today is Monday, and a new game of Jeopardy! has been added to the archive that needs to be added to our dataframe. The code to handle this is shown here:
import Datascraper as ds
import pandas as pd
# First, the old dataframe is imported
print("Retrieving data...")
df = pd.read_csv("jeopardydata.csv")
# The newest game sits at the top of the dataframe
latest_game = df.iloc[0]['game_name']
newdf = pd.DataFrame([], columns=['category', 'hint', 'answer', 'value', 'jtype', 'media',
                                  'DD', 'date', 'game_name', 'correct', 'subjects', 'date_answered'])
print("Searching for new data...")
# Once we're certain we've found all the new questions, done_searching is set to True and the loop is exited.
done_searching = False
new_data = False
for season in ds.find_seasons():
    for game in ds.find_games_in_season(season):
        todaysdf = ds.webpage_to_dataframe(game)
        # this checks to see if the game currently being examined is already present in the dataframe
        if todaysdf.iloc[0]['game_name'] == latest_game:
            done_searching = True
            break
        newdf = pd.concat([newdf, todaysdf])
        new_data = True
    if done_searching:
        break
# Finally, the new data is placed on top of the existing dataframe and saved.
# Newer games go first so that df.iloc[0] still points at the latest game on the next run.
if new_data:
    df = pd.concat([newdf, df])
    df.to_csv('jeopardydata.csv', index=False)
Entering answers to the questions posed so far#
Welcome back to anyone who chose to skip over the previous section! This project now includes a comprehensive dataframe containing all the questions one could ever want, well ordered and cleaned. Although there are a huge number of interesting patterns to find just in the data that exists here, the focus of this project is to be able to find data about an individual’s areas of expertise. In order to do that, we’ll need a good way to enter that data quickly.
The program introduces questions like this: first, the category and value are listed. After that, the date is listed, along with indications of whether the question was a Daily Double or included media. Finally, the hint is given. This is technically an input. It doesn’t matter what the user types here, but the answer to the question isn’t listed until the user hits enter. This gives enough time to decide if one wants to skip the question, and if not, what one’s answer is.
print(row['category'] + " for $" + str(row['value']))
mediatext = ''
if row['media']:
    mediatext = ' with media'
DDtext = ''
if row['DD']:
    DDtext = ' Daily Double'
print(row['date'] + DDtext + mediatext)
empty = input(row['hint'])
After enter is hit, the answer is shown.
Seeing as we have a goal to answer these questions as quickly as possible, answering is done with a single keystroke:
, | . | / |
---|---|---|
skipped | correct | incorrect |
It’s also important to be able to list the subject (or subjects) of each question. For now, I’m choosing to have the user input the subject of each question after answering. Of course, this comes with a certain level of subjectivity, but I don’t think this can be meaningfully avoided. My original plan was to use AI or some other method to list the subject of each question, but I doubt this could meaningfully reduce subjectivity. It would make more sense, I think, to rely on the individual filling the questions out to decide this on a case-by-case basis for each question. These are the letters I’m using to represent each category:
a | b | c | d | e | f | g | h |
---|---|---|---|---|---|---|---|
anatomy & medicine | brands | current events & law | drama & theater | econ & business | film | geography | history |
i | j | l | m | n | o | p | r |
---|---|---|---|---|---|---|---|
internet & pop culture | fashion | languages | music | novels & non-fiction books | other | poetry | religion & mythology |
s | t | u | v | w | y | z |
---|---|---|---|---|---|---|
science | television | universities | visual arts & painting | wordplay | food & cooking (yum) | sportz |
Of course, anyone else who’s using this program is welcome to come up with their own categories. From my experience, this roughly covers the range of questions asked on the show.
Returning to our example question from before, I do happen to know the Latin word for law, even if I had no idea that was the motto of the University of North Dakota. I’ll type a “.” to indicate that I answered correctly, followed by an “l” to indicate that this is a question about languages, and a “u” to indicate that this is a question about university culture. The “.” is converted to an integer signifying that the question was answered correctly, and the date of answering is added to the dataframe automatically.
And it’s as quick as that! Three characters, and I’ve categorized my relationship with this question in as much detail as is required. To give a basic idea of how effective this system is, I can go through an entire episode’s worth of questions in roughly 11 minutes. Much quicker than the show’s half-hour run time!
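The keystroke entry described above could be parsed roughly like this. The key-to-integer mapping and the subject letters come from the tables in this section. The function name `parse_entry` and the choice to reject unknown letters are my own assumptions for the sketch.

```python
from datetime import date

# Single-keystroke results, mapped to the values stored in the 'correct' column.
RESULT_KEYS = {",": 0, ".": 1, "/": -1}
# The subject letters listed in the tables above.
SUBJECT_LETTERS = set("abcdefghijlmnoprstuvwyz")


def parse_entry(entry, today=None):
    """Turn an entry such as '.lu' into values for correct, subjects, and date_answered."""
    entry = entry.strip().lower()
    if not entry or entry[0] not in RESULT_KEYS:
        raise ValueError("entry must start with ',', '.', or '/'")
    subjects = entry[1:]
    unknown = sorted(set(subjects) - SUBJECT_LETTERS)
    if unknown:
        raise ValueError("unknown subject letters: " + "".join(unknown))
    return {
        "correct": RESULT_KEYS[entry[0]],
        "subjects": subjects,
        "date_answered": today if today is not None else date.today(),
    }


print(parse_entry(".lu", today=date(2024, 1, 1)))
```

Passing `today` explicitly is only there to make the function easy to test. In normal use it falls back to the current date.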
It was at this point that I ironed out the full functionality for demo purposes. So far, a decent handful of functions have been implemented (downloading data from the archive, giving subjects to questions, sorting and managing data, etc.), so it makes sense to add a menu for selecting among these.
Admittedly, this is hardly visually flashy, but it’s more than enough to get the job done. If you’d like to experience the program as it exists at this point, you can download this early version here.
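A text menu of this kind can be quite small. The sketch below is a generic version, not the project's actual menu: `run_menu` and the option labels are hypothetical, and the real program's functions would be passed in place of the placeholder.

```python
def run_menu(options, read=input, write=print):
    """Show a numbered menu until the user picks 0.

    options is a list of (label, function) pairs; read and write can be
    swapped out to drive the menu without a keyboard.
    """
    while True:
        for number, (label, _) in enumerate(options, start=1):
            write(f"{number}. {label}")
        write("0. Quit")
        choice = read("Select an option: ").strip()
        if choice == "0":
            return
        if choice.isdecimal() and 1 <= int(choice) <= len(options):
            # Run the function attached to the chosen label.
            options[int(choice) - 1][1]()
        else:
            write("Unrecognized option, please try again.")


if __name__ == "__main__":
    # Placeholder action standing in for the program's real functions.
    run_menu([("Say hello", lambda: print("Hello!"))])
```

Keeping the options in a list of pairs means a new feature only needs one extra entry rather than another branch in the loop.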
Analyzing response frequency#
Now that the entire dataframe is created, the sky’s the limit in terms of what data we can analyze. I want this article to focus mostly on the technological side of this project, and how it can be useful to figure out my own strengths and weaknesses — but it’s impossible to resist the temptation to look for patterns in word frequency before filling in my own answers. What are the most common correct responses, and when do they appear? Because I have so much to say about it, I’m choosing to split this question off into its own sub article:
This way, the article you’re reading currently can function more as an overview of the program’s functionality as a tool for Jeopardy! training.
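For a taste of what that frequency analysis involves, here is a minimal sketch that counts the most common values in the `answer` column. Lowercasing and trimming the responses before counting is my assumption about a sensible normalization, and `most_common_responses` is a hypothetical helper.

```python
from collections import Counter


def most_common_responses(answers, n=10):
    """Return the n most frequent responses as (response, count) pairs."""
    normalized = (
        answer.strip().lower()
        for answer in answers
        # Skip missing or empty responses.
        if isinstance(answer, str) and answer.strip()
    )
    return Counter(normalized).most_common(n)


sample_answers = ["China", "china ", "Paris", None, ""]
print(most_common_responses(sample_answers, n=1))
```

With the full dataframe, the same function could be called on `df['answer']` directly.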