A system for Jeopardy training and data analysis

The code for this project can be found at its github repository.

Most people who know me well know I’m a massive fan of the show Jeopardy!. I’m fairly good at answering the questions myself, but my batting average is below what an actual contestant on the show would be capable of. If I’m going to get better, I need to be able to identify what subjects and concepts appear most often on the show, and what my own areas of expertise are. This will require some data analysis.

Pulling Jeopardy! data from the online archive

It doesn’t take long to notice that there’s a pattern in the subjects that appear on Jeopardy!. A typical show will contain:

One category about history
One category about science, especially medicine
One category about literature, including poetry and plays
One category about film or TV
One category about either music or painting, or sometimes one of each
One or two categories about geography
One or two categories about wordplay and language
A handful of miscellaneous topics, although the writers seem to have a fondness for current events, brands, pop culture, academia, and mythology.

To give an example of how set in stone this formula is, consider the categories from October 25th of this year:

Of course, it’s possible to get an idea of what’s on the show by watching it. However, this is rather slow-paced compared to what’s possible, and I don’t have cable. To access the questions more quickly and easily, one can look at the online Jeopardy! archive at j-archive.com. In fact, I’ve been doing this for a while now. I have a massive spreadsheet where I log my scores for different “genres” of questions on the archive, in order to identify my strengths and weaknesses.

An adequate approach to analyzing Jeopardy!

This approach is adequate, but it does have its limitations. The largest one is that it’s not possible to see the text of questions that have been answered wrong. It’s one thing to recognize that 41% of questions about Film and TV were answered correctly, but which movies are being asked about? Additionally, there’s no way to tell the values of questions, or if they appeared in Single or Double Jeopardy! In order to fix this, it’s important to build a comprehensive database of Jeopardy! questions, their answers, their subjects, and whether they were answered right or wrong.

The easiest approach is to scrape this data from the website directly. To do this I used a Python library called Beautiful Soup. Once the HTML code for a page of j-archive.com is saved, it can be split up into separate chunks for Single, Double, and Final Jeopardy!, allowing for the clues and categories to be easily sorted.
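The splitting step can be sketched roughly like this. This is not the project's actual Datascraper code: the element ids (jeopardy_round, double_jeopardy_round, final_jeopardy_round) and class names (category_name, clue_text) are assumptions about the archive's markup, the function name split_rounds is hypothetical, and the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up stand-in for a saved archive page. The ids and class
# names are ASSUMPTIONS about the real markup, used only for illustration.
SAMPLE_HTML = """
<div id="jeopardy_round"><table><tr>
  <td class="category_name">WORLD CAPITALS</td>
  <td class="clue_text">This city on the Seine is the capital of France</td>
</tr></table></div>
<div id="double_jeopardy_round"><table><tr>
  <td class="category_name">THE PLANETS</td>
  <td class="clue_text">It is known as the Red Planet</td>
</tr></table></div>
<div id="final_jeopardy_round"><table><tr>
  <td class="category_name">U.S. PRESIDENTS</td>
  <td class="clue_text">He was the first U.S. president</td>
</tr></table></div>
"""

# Round codes follow the article: 1 = Single, 2 = Double, 0 = Final.
ROUND_IDS = {1: "jeopardy_round", 2: "double_jeopardy_round", 0: "final_jeopardy_round"}


def split_rounds(html):
    """Return {round_code: {"categories": [...], "hints": [...]}} for one page."""
    soup = BeautifulSoup(html, "html.parser")
    rounds = {}
    for jtype, div_id in ROUND_IDS.items():
        chunk = soup.find(id=div_id)
        if chunk is None:
            # Some games may be missing a round entirely; skip it.
            continue
        rounds[jtype] = {
            "categories": [td.get_text(strip=True)
                           for td in chunk.find_all(class_="category_name")],
            "hints": [td.get_text(strip=True)
                      for td in chunk.find_all(class_="clue_text")],
        }
    return rounds


rounds = split_rounds(SAMPLE_HTML)
print(rounds[1]["categories"])
```

Once each round is its own chunk, the categories and clues inside it can be paired up without worrying about which round they came from.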

There are a few issues with this data as it exists, which require some effort to clean. For one, not every question in Jeopardy! is available. If the contestants run out the clock before a question is asked, it won’t appear on the Jeopardy! archive. Another problem is that many categories require clarification from the host to provide necessary context.

An example of some of the potential problems with a Jeopardy! game

Neither of these problems is especially troubling. Missing questions can simply be left out, and Ken Jennings’s comments can be included in the text of the category. A more difficult problem is with those clues that contain some sort of media. Many clues on Jeopardy! can include an image, a sound, or a video (think of the video Daily Double, for instance). Although the archive we’re pulling from does provide a link to the media, these links seem to always lead to a 404 error.

the link on this question only leads to a 404 error

Although these are sometimes solvable without the media they refer to, they sometimes aren’t. My approach to this is simply to include a tag on each question marking if it was supposed to include media, and otherwise to totally remove the broken links.
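A minimal sketch of that tagging step, assuming the media links show up as ordinary anchor tags inside the clue's HTML (the function name strip_media_links and the sample clue are hypothetical):

```python
import re

# Matches opening and closing anchor tags only, e.g. <a href="..."> and </a>.
# The \b keeps tags such as <abbr> from being caught by mistake.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", flags=re.IGNORECASE)
ANCHOR_OPEN = re.compile(r"<a\b", flags=re.IGNORECASE)


def strip_media_links(hint_html):
    """Return (hint_text, has_media) for one clue.

    ASSUMPTION: a clue counts as "media" whenever it contains a link.
    The link text is kept; only the broken link itself is removed.
    """
    has_media = bool(ANCHOR_OPEN.search(hint_html))
    text = ANCHOR_TAG.sub("", hint_html)
    return text, has_media


text, has_media = strip_media_links(
    'This <a href="http://example.com/clue.jpg">building</a> is seen here'
)
print(text, has_media)
```

The returned flag is what would end up in the media column, while the cleaned text is what the user actually reads.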

Once this is completed, the dataframe is organized and exported to a csv file. That dataframe looks a bit like this:

The columns of this sheet are:

column	type	description
category	str	The official name of the category given by the show. This is the phrase a contestant says before " for $800, Ken"
hint	str	The actual text given by the host to the contestants. This is called the "hint" rather than "question", because the question is technically said by the contestant
answer	str	The correct response to the hint. To avoid ambiguity, this is also often called the "response" in this article
value	int	An integer indicating the point value of a question. The dollar sign is not included
jtype	int	An integer indicating which round of Jeopardy! a row is included in. 1 indicates Single Jeopardy, 2 indicates Double Jeopardy, and 0 indicates Final Jeopardy
media	bool	Indicates whether or not a hint includes media, such as sound, images, or video
DD	bool	Indicates if a hint is one of the daily doubles
date	date	This is the date when a question is aired, rather than the date it was downloaded, or the date it is viewed by the software user.
game_name	str	A string giving a name to a game. Although almost all games have names of the form "Game #____", some are unique
correct	int	An indicator for whether the user could answer the question correctly or not. 1 indicates a correct answer, -1 indicates an incorrect answer, and 0 indicates that the user would have skipped
subjects	str	The broad subject or subjects a hint belongs to. Because there is some variance within categories, this applies to a specific question, not a category. Each subject is represented by a single letter in the string
date_answered	date	The date at which the user entered whether they would have answered a question correctly or not
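To show how the correct and subjects columns fit together, here is a rough sketch of the per-subject accuracy calculation behind a figure like the 41% for Film and TV mentioned earlier. The rows are made up, the function name is hypothetical, and counting skipped questions as "not correct" is an assumption rather than a rule the project has settled on.

```python
from collections import defaultdict


def accuracy_by_subject(rows):
    """Return {subject_letter: fraction answered correctly}.

    Each row needs a 'subjects' string (one letter per subject) and a
    'correct' value of 1, -1, or 0. ASSUMPTION: skipped questions (0)
    count toward the total but not toward the correct answers.
    """
    totals = defaultdict(int)
    right = defaultdict(int)
    for row in rows:
        # A question tagged with several subjects counts once for each.
        for letter in row["subjects"]:
            totals[letter] += 1
            if row["correct"] == 1:
                right[letter] += 1
    return {letter: right[letter] / totals[letter] for letter in totals}


# Made-up answered rows, trimmed to the two columns that matter here.
sample_rows = [
    {"subjects": "f", "correct": 1},
    {"subjects": "ft", "correct": -1},
    {"subjects": "h", "correct": 0},
    {"subjects": "lu", "correct": 1},
]
print(accuracy_by_subject(sample_rows))
```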

Compiling the mountain of data that exists currently

This section will focus mostly on how this data was compiled and cleaned to be useful for what comes next. It won’t be long, but if you’re not interested in those kinds of mundane details, feel free to skip to the next section.

At time of writing, there are just over 9100 games of Jeopardy! available on the online archive, with another game added each weekday. At 61 (ish) questions per game, it’s possible to build a collection of around 555,000 questions (9,100 × 61 = 555,100). This raises the question of how many of these questions to include. There are a handful of types of games we may choose not to include for being non-representative of standard modern Jeopardy!:

Old episodes of Jeopardy! will lack questions relating to events that occurred after the date of airing. An episode from the '80s couldn't possibly ask any question about, for example, the Clinton presidency, or the Star Wars prequels.
Additionally, old episodes are more likely to ask questions that are no longer relevant to the modern show. The Jeopardy! writers like to ask questions pertaining to contemporary culture, and things like Jennifer Slept Here or J. J. Fad have been largely forgotten.
Celebrity Jeopardy! has a well-known reputation for featuring questions easier than those from a standard game.
Conversely, an episode from one of the many Tournament of Champions runs would likely feature questions more difficult than usual. These episodes have their difficulty adjusted to provide an exciting challenge for contestants that have already won a large number of games.
Many episodes from early seasons are missing entirely from the archive. Many others are present, but are missing specific questions. In fact, if you should happen to be in possession of a VHS taping of Jeopardy! from the '80s or early '90s, I'm sure the people at the archive would be happy to hear about it! There's a decent chance you own a piece of lost media.

Despite all these problems, I’m choosing to include as much data as is available. Although older episodes are less relevant, I’ll simply choose to deprioritize those by placing them at the bottom of the dataframe. Regarding episodes from Celebrity Jeopardy! or other promotional events, the dataframe will include the exact title of each episode. The standard episodes are easily recognizable as their titles are all simply numbers.

After Monday, November 26, 2001, the value of each clue was doubled. The $300 clues from before that date were exchanged for $600 clues, and likewise for every other value available. For the sake of this project, all questions from before this change will have their values doubled, in order to normalize the data.
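The normalization itself is a one-line rule. A minimal sketch, assuming the episode aired on November 26, 2001 is the first one with doubled values (the exact boundary and the function name are assumptions on my part):

```python
from datetime import date

# ASSUMPTION: games aired on or after this date already use the doubled values.
VALUE_CHANGE = date(2001, 11, 26)


def normalized_value(value, air_date):
    """Double the value of clues that aired before the change, leave the rest alone."""
    if air_date < VALUE_CHANGE:
        return value * 2
    return value


print(normalized_value(300, date(1999, 5, 3)))   # an old $300 clue
print(normalized_value(600, date(2010, 1, 4)))   # a modern $600 clue
```

Applied to every row, this puts old and new clues on the same scale, so a clue's value says the same thing about its position on the board regardless of when it aired.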

For posterity, I’ll include the code used to compile the full dataframe here. It’ll be changed after I write this.

# Datascraper is an uncreative name for a module built
# specifically for this project.
import Datascraper as ds
import pandas as pd

df = pd.DataFrame([], columns=['category', 'hint', 'answer', 'value', 'jtype', 'media', 'DD',
                             'date', 'game_name', 'correct', 'subjects', 'date_answered'])

# Here, the program collects each season available
for season in (ds.find_seasons()):
    # And here, the program looks through the directory of that season, and finds each game available
    for game in ds.find_games_in_season(season):
        # The data from that game is loaded onto a massive dataframe
        df = pd.concat([df, ds.webpage_to_dataframe(game)])
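    # And that dataframe is exported. index=False keeps pandas from writing an
    # extra unnamed index column that read_csv would pick up later.
    df.to_csv('jeopardydata.csv', index=False)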

After this is complete, the end result is a csv file containing roughly 90 megabytes of questions, answers, categories, and other pieces of information. Admittedly, this is not the most compact this data could be. It would be totally possible to have a second dataframe containing the data from each game, rather than each question, with the questions only having the index of the game, rather than its entire title and date. Even the categories could be stored here, so the dataframe of questions would only have to contain the number of each category. That all said, I estimate this would have an overall minimal impact on the size of the file.

The date and name of each game represent a very small proportion of this data, and neither of these columns is substantially larger than other columns that consist of only integers, such as ‘value’. Therefore, we’ll keep this data in the form that it is. Although 90 megabytes is a solid chunk of data, it’s not a ridiculous amount.
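For the curious, the two-table layout that I'm choosing not to use would look something like the sketch below. The names games, questions, and game_id are hypothetical, and the rows are placeholders rather than real archive data.

```python
import pandas as pd

# A made-up slice of the flat dataframe: game name and date repeat on every row.
flat = pd.DataFrame({
    "game_name": ["Game #2", "Game #2", "Game #1"],
    "date": ["2024-10-25", "2024-10-25", "2024-10-24"],
    "hint": ["hint A", "hint B", "hint C"],
    "value": [200, 400, 200],
})

# One row per game, with the row number serving as a compact id.
games = flat[["game_name", "date"]].drop_duplicates().reset_index(drop=True)
games.index.name = "game_id"

# Lookup from game name to id, then swap the repeated columns for that id.
game_ids = games.reset_index().set_index("game_name")["game_id"]
questions = flat.drop(columns=["game_name", "date"]).assign(
    game_id=flat["game_name"].map(game_ids)
)

# Joining the two tables recovers the original information when needed.
rebuilt = questions.join(games, on="game_id")
print(questions)
```

The questions table only carries a small integer per row, at the cost of needing a join whenever the game's name or date is wanted.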

Unfortunately, this chunk of data needs to grow with time. I compiled this dataframe over the weekend, but today is Monday, and a new game of Jeopardy! has been added to the archive that needs to be added to our dataframe. The code to handle this is shown here:

import Datascraper as ds
import pandas as pd

# First, the old dataframe is imported
print("Retrieving data...")
df = pd.read_csv("jeopardydata.csv")
latest_game = df.iloc[0]['game_name']

newdf = pd.DataFrame([], columns=['category', 'hint', 'answer', 'value', 'jtype', 'media',
                                  'DD', 'date', 'game_name', 'correct', 'subjects', 'date_answered'])

print("Searching for new data...")
# Once we're certain we've found all the new questions, done_searching is set to True and the loop is exited.
done_searching = False
new_data = False
for season in (ds.find_seasons()):

    for game in ds.find_games_in_season(season):
        todaysdf = ds.webpage_to_dataframe(game)
        # this checks to see if the game currently being examined is already present in the dataframe
        if todaysdf.iloc[0]['game_name'] == latest_game:
            done_searching = True
            break
        newdf = pd.concat([newdf, todaysdf])
        new_data = True

    if done_searching:
        break

# Finally, the new data is added to the existing dataframe and saved.
# The new games go first, so that df.iloc[0] is still the latest game on the next run.
if new_data:
    df = pd.concat([newdf, df])
    df.to_csv('jeopardydata.csv', index=False)

Entering answers to the questions posed so far

Welcome back to anyone who chose to skip over the previous section! This project now includes a comprehensive dataframe containing all the questions one could ever want, well ordered and cleaned. Although there are a huge number of interesting patterns to find just in the data that exists here, the focus of this project is to find data about an individual’s areas of expertise. In order to do that, we’ll need a good way to enter that data quickly.

The program introduces questions like this: first, the category and value are listed. After that, the date is listed, along with indications of whether the question was a Daily Double or included media. Finally, the hint is given. This is technically an input. It doesn’t matter what the user types here, but the answer to the question isn’t listed until the user hits enter. This gives enough time to decide if one wants to skip the question, and if not, what one’s answer is.

  print(row['category'] + " for $" + str(row['value']))
  mediatext = ''
  if row['media']:
      mediatext = ' with media'
  DDtext = ''
  if row['DD']:
      DDtext = ' Daily Double'
  print(row['date'] + DDtext + mediatext)
  empty = input(row['hint'])

After enter is hit, the answer is shown.

Seeing as we have a goal to answer these questions as quickly as possible, answering is done with a single keystroke:

,	.	/
skipped	correct	incorrect

It’s also important to be able to list the subject (or subjects) of each question. Now, I’m choosing to have the user input the subject of each question after answering. Of course, this comes with a certain level of subjectivity, but I don’t think this can be meaningfully avoided. My original plan was to use AI or some other method to list the subject of each question, but I doubt this could meaningfully reduce subjectivity. It would make more sense, I think, to rely on the individual filling the questions out to decide this on a case-by-case basis for each question. These are the letters I’m using to represent each subject:

a	b	c	d	e	f	g	h
anatomy & medicine	brands	current events & law	drama & theater	econ & business	film	geography	history

i	j	l	m	n	o	p	r
internet & pop culture	fashion	languages	music	novels & non-fiction books	other	poetry	religion & mythology

s	t	u	v	w	y	z
science	television	universities	visual arts & painting	wordplay	food & cooking (yum)	sportz

Of course, anyone else who’s using this program is welcome to come up with their own categories. From my experience, this roughly covers the range of questions asked on the show.

Returning to our example question from before, I do happen to know the Latin word for law, even if I had no idea that was the motto of the University of North Dakota. I’ll type a “.” to indicate that I answered correctly, followed by an “l” to indicate that this is a question about languages, and a “u” to indicate that this is a question about university culture. The “.” is converted to an integer signifying that the question was answered correctly, and the date of answering is added to the dataframe automatically.
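The conversion from keystrokes to dataframe values can be sketched as follows. The key and subject letters come from the tables above; the function name parse_entry, the error handling, and the optional today argument are assumptions for illustration, not the program's actual code.

```python
from datetime import date

# Keys from the answer table: skipped, correct, incorrect.
KEY_TO_CORRECT = {",": 0, ".": 1, "/": -1}
# Every subject letter from the subject tables (k, q, and x are unused).
SUBJECT_LETTERS = set("abcdefghijlmnoprstuvwyz")


def parse_entry(entry, today=None):
    """Turn an entry such as '.lu' into the values stored for a question."""
    if not entry or entry[0] not in KEY_TO_CORRECT:
        raise ValueError("entry must start with ',', '.', or '/'")
    subjects = entry[1:]
    unknown = set(subjects) - SUBJECT_LETTERS
    if unknown:
        raise ValueError(f"unknown subject letters: {sorted(unknown)}")
    return {
        "correct": KEY_TO_CORRECT[entry[0]],
        "subjects": subjects,
        # The answering date is filled in automatically.
        "date_answered": (today or date.today()).isoformat(),
    }


print(parse_entry(".lu"))
```

Rejecting unknown letters up front is a design choice on my part: a typo caught at entry time is cheaper than a mystery subject discovered during analysis.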

And it’s as quick as that! Three characters, and I’ve categorized my relationship with this question in as much detail as is required. To give a basic idea of how effective this system is, I can go through an entire episode’s worth of questions in roughly 11 minutes. Much quicker than the show’s half-hour run time!

It was at this point that I ironed out the full functionality for demo purposes. So far, a decent handful of functions have been implemented (downloading data from the archive, giving subjects to questions, sorting and managing data, etc.), so it makes sense to add a menu for selecting among these.

here's an example of the program in use now

Admittedly, this is hardly visually flashy, but it’s more than enough to get the job done. If you’d like to experience the program as it exists at this point, you can download this early version here.

Analyzing response frequency

Now that the entire dataframe is created, the sky’s the limit in terms of what data we can analyze. I want this article to focus mostly on the technological side of this project, and how it can be useful to figure out my own strengths and weaknesses, but it’s impossible to resist the temptation to look for patterns in word frequency before filling in my own answers. What are the most common correct responses, and when do they appear? Because I have so much to say about it, I’m choosing to split this question off into its own sub-article:

Reading into the common responses on Jeopardy!

30 October 2024·2730 words·13 mins

This way, the article you’re reading currently can function more as an overview of the program’s functionality as a tool for Jeopardy! training.
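As a small taste of what counting the most common responses involves, here is a minimal sketch using the standard library's Counter. The sample answers are made up, the function name is hypothetical, and lowercasing and trimming whitespace is an assumed normalization rather than the method used in the sub-article.

```python
from collections import Counter


def most_common_responses(answers, n=3):
    """Return the n most frequent responses as (response, count) pairs."""
    # ASSUMPTION: responses that differ only in case or padding are the same.
    normalized = (answer.strip().lower() for answer in answers)
    return Counter(normalized).most_common(n)


# Placeholder values standing in for the 'answer' column.
sample_answers = ["Australia", "australia ", "Chicago", "Australia", "Chicago", "Napoleon"]
print(most_common_responses(sample_answers, 2))
```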

After this point, this article is a work in progress
