The code for this project can be found at its GitHub repository.
This article functions as a bit of a journal documenting the development process for this project. If you would like to see the abbreviated documentation intended for users of this program, read the README file included in the project.

Most people who know me well know I’m a massive fan of the show Jeopardy!.[1] I’m fairly good at answering the questions myself, but my batting average is below what an actual contestant on the show would be capable of. If someone like me wanted to get better at Jeopardy!, they would need some sort of program to analyze their performance. The goal of this project is to build just such a program.
Pulling Jeopardy! data from the online archive#
It doesn’t take long to notice that there’s a general pattern in the subjects that appear on Jeopardy!. A typical show will contain:
- One category about history
- One category about science, especially medicine
- One category about literature, including poetry and plays
- One category about film or TV
- One category about either music or painting, or sometimes one of each
- One or two categories about geography
- One or two categories about wordplay and language
- A handful of miscellaneous topics, although the writers seem to have a fondness for current events, brands, pop culture, academia, and mythology.
To give an example of how common this formula is, consider the categories from October 25th of 2024:

Of course, it’s possible to get an idea of what topics are on the show just by watching it, but the show is full of interruptions, and I don’t have cable. To access the questions more quickly and easily, one can look at the online Jeopardy! archive at j-archive.com. I’ve been using the archive to play Jeopardy! for ages now, and I even have a massive spreadsheet where I log my scores for different “genres” of questions on the archive.

This approach is adequate, but clearly limited. We can’t see the text of the questions missed, their values, whether they appeared in Single or Double Jeopardy!,[2] or the date they were answered. In order to fix this, it’s important to build a comprehensive database of Jeopardy! questions, their answers, their categories, and all the other data we’re interested in tracking. We’ll use pandas, a popular Python library made for data analysis.
It would take ages to enter every question by hand, so we’ll pull this data directly from the internet. I used a Python library called Beautiful Soup to scrape all the data available at j-archive.com. Once the HTML code for a page is saved, it can be split up into separate chunks for Single, Double, and Final Jeopardy!, and the data for the clues and categories can be easily sorted.
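As a rough sketch of that first step, here’s how one might grab a game page and split it into rounds. The div ids are my reading of j-archive’s markup, and the real Datascraper module does far more parsing than this:

import requests
from bs4 import BeautifulSoup

def fetch_rounds(game_url):
    # Each game page wraps its rounds in containers with (assumed) ids like these
    soup = BeautifulSoup(requests.get(game_url).text, "html.parser")
    return (soup.find(id="jeopardy_round"),
            soup.find(id="double_jeopardy_round"),
            soup.find(id="final_jeopardy_round"))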
There are a few issues with this data as it exists, which require some effort to clean. For one, not every question written for Jeopardy! is available to us. If the contestants run out the clock before a question is asked, it obviously can’t appear on the archive. For another, many categories require clarification from the host to provide necessary context.

Neither of these problems is especially troubling: missing questions can simply be omitted, and Ken Jennings’s comments can be included in the text of the category. It’s much more difficult to handle those clues that contain some sort of media. Many clues on Jeopardy! can include an image, a sound, or a video.[3] Although the archive we’re pulling from does provide a link to allow for media, these links seem to always lead to a 404 error.

These are sometimes solvable without the media they refer to, and they sometimes aren’t. My approach to this is simply to include a tag on each question indicating if it was supposed to include media, and to totally remove the broken links.
Once this is completed, the dataframe is organized and exported to a CSV file, leaving us with a database that looks like this:

The columns of this sheet are:
column | datatype | description |
---|---|---|
category | str | The official name of the category given by the show. This is the phrase contestants say before "for $800, Ken". |
hint | str | The actual text given by the host to the contestants. This is called the "hint" rather than "question", due to Jeopardy!'s quirk of phrasing questions as answers and vice versa. |
response | str | The correct response to the hint. To avoid ambiguity, this is called "response", because the "answer" is given by the host. |
value | int | An integer indicating the point value of a question; the dollar sign is not included. Final Jeopardy! is given a value of 0. |
jtype | int | An integer indicating which round of Jeopardy! a row is included in. 1 indicates Single Jeopardy, 2 indicates Double Jeopardy, and 0 indicates Final Jeopardy. |
media | bool | Indicates whether or not a hint includes media, such as sound, images, or video. |
DD | bool | Indicates if a hint is a daily double |
date | str | This is the date when a question is aired, rather than the date it was downloaded, or the date it is viewed by the user. The string class is used rather than the date class because it interacts more smoothly with conversion into a CSV file. |
game_name | str | A string giving a name to a game. Although almost all games have names of the form "Game #____", some are unique. |
accuracy | int | An indicator for whether the user answered the question correctly or not. 1 indicates a correct answer, -1 indicates an incorrect answer, and 0 indicates that the user would have skipped. |
subjects | str | The broad subject or subjects a hint belongs to. Because there is some variance within categories, this applies to a specific question, not a category. Each subject is represented by a single letter within the string. |
date_answered | str | The date at which the user entered whether they would have answered a question correctly or not. |
Compiling the mountain of data that exists currently#
This section will focus mostly on how this data was compiled and cleaned to be useful for what comes next. It won’t be long, but if you’re not interested in those kinds of details, feel free to skip to the next section.
At the time of writing, there are just over 9100 games of Jeopardy! available on the online archive, with another game added each weekday. At 61(ish) questions per game, it’s possible to build a collection of around 546,000 questions. How many of these questions should be included? There are a handful of game types we may choose to exclude for not representing “standard” modern Jeopardy!:
- Old episodes of Jeopardy! will lack questions relating to events that occurred after the date of airing. An episode from the '80s couldn't possibly ask any question about, for example, the Clinton presidency, or the Star Wars prequels.
- Old episodes are more likely to ask questions that are no longer relevant to the modern world. The Jeopardy! writers like to ask questions pertaining to contemporary culture, and things like Jennifer Slept Here or J. J. Fad have been largely forgotten.
- Celebrity Jeopardy! has a well-known reputation for featuring questions easier than those from a standard game.
- Conversely, an episode from one of the many Tournament of Champions runs would likely feature questions more difficult than usual. These episodes have their difficulty adjusted to provide a competitive challenge for contestants that have already made numerous appearances.
- Many episodes from early seasons are missing entirely from the archive. Many others are present, but are missing questions.[4]
Despite all these problems, I’m choosing to include as much data as is available. Although older episodes are less relevant, I’ll simply choose to deprioritize them by placing them at the bottom of the dataframe. Celebrity Jeopardy! and other promotional events won’t cause a problem, since they can be easily differentiated from standard episodes by title.
On Monday, November 26, 2001, the value of each clue was doubled. The $300 clues from before that date were exchanged for $600 clues, and likewise for every other value available. For the sake of this project, all questions from before this change will have their values doubled, in order to normalize the data.
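A minimal sketch of that normalization, assuming the date strings parse cleanly with pandas:

import pandas as pd

# Clue values doubled on 2001-11-26, so double everything from before that date
before_doubling = pd.to_datetime(df["date"]) < pd.Timestamp("2001-11-26")
df.loc[before_doubling, "value"] = df.loc[before_doubling, "value"] * 2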
For posterity, I’ll include the code used to compile the full dataframe here. It’ll be changed after I write this in order to optimize for adding to an already existing dataframe.
# Datascraper is an uncreative name for a module built specifically for this project.
# It can take in a webpage and return a usable dataframe.
import Datascraper as ds
import pandas as pd

df = pd.DataFrame([], columns=['category', 'hint', 'response', 'value', 'jtype', 'media', 'DD',
                               'date', 'game_name', 'accuracy', 'subjects', 'date_answered'])
# Here, the program collects each season available
for season in ds.find_seasons():
    # And here, the program looks through the directory of that season, and finds each game available
    for game in ds.find_games_in_season(season):
        # The data from that game is loaded onto a massive dataframe
        df = pd.concat([df, ds.webpage_to_dataframe(game)])
# And that dataframe is exported
df.to_csv('jeopardydata.csv')
The end result is a CSV file containing roughly 90 megabytes of hints, correct responses, categories, and other minor pieces of information. Admittedly, this is not the most compact this data could be. It would be totally possible to have a second dataframe containing the data from each game, rather than each question, with the questions only having the index of the game, rather than its entire title and date. Even the categories could be stored here, so the dataframe of questions would only have to contain the number of each category. That all said, I estimate this would have an overall minimal impact on the size of the file.

The date and name of each game represent a very small proportion of this data, and neither of these columns is substantially larger than the columns that consist of only integers, such as ‘value’. Therefore, we’ll keep this data in the form that it is. Although 90 megabytes is a solid chunk of data, it’s not a ridiculous amount.
Up next, this chunk of data needs to grow with time. The next weekday after that dataframe was compiled, a new game of Jeopardy! was added to the archive, and needs to be appended to our dataframe. The code to handle this is shown here:
import Datascraper as ds
import pandas as pd

# First, the old dataframe is imported
print("Retrieving data...")
df = pd.read_csv("jeopardydata.csv")
# The most recent game sits at the top of the dataframe
latest_game = df.iloc[0]['game_name']
newdf = pd.DataFrame([], columns=['category', 'hint', 'response', 'value', 'jtype', 'media',
                                  'DD', 'date', 'game_name', 'accuracy', 'subjects', 'date_answered'])
print("Searching for new data...")
# Once we're certain we've found all the new questions, done_searching is set to True and the loop is exited.
done_searching = False
new_data = False
for season in ds.find_seasons():
    for game in ds.find_games_in_season(season):
        todaysdf = ds.webpage_to_dataframe(game)
        # This checks to see if the game currently being examined is already present in the dataframe
        if todaysdf.iloc[0]['game_name'] == latest_game:
            done_searching = True
            break
        newdf = pd.concat([newdf, todaysdf])
        new_data = True
    if done_searching:
        break
# Finally, the new data is prepended to the existing dataframe (so the newest games stay on top) and saved.
if new_data:
    df = pd.concat([newdf, df])
    df.to_csv('jeopardydata.csv', index=False)
Analyzing response frequency#
Now that the entire dataframe is created, the sky’s the limit in terms of what data we can analyze. I want this article to focus mostly on the technological side of this project, and how it can be useful to figure out my own strengths and weaknesses — but it’s impossible to resist the temptation to look for patterns in word frequency before filling in my own answers. What are the most common correct responses, and when do they appear? Because I have so much to say about it, I’m choosing to split this question off into its own sub-article:
This way, the article you’re reading currently can function more as an overview of the program’s functionality as a tool for Jeopardy! training.
Entering answers to the questions posed so far#
Welcome back to anyone who chose to skip over the previous section! This project now includes a comprehensive dataframe containing all the questions one could ever want, well-ordered and cleaned. Although there are plenty of interesting patterns to find already, the focus of this project is to be able to find data about an individual’s areas of expertise. In order to do that, we’ll need a good way to allow the user to enter their own data.
The program introduces questions like this: first, the category and value are listed. After that, the date is listed, along with indications of whether the question is a Daily Double or includes media. Finally, the hint is given. This is technically an input: it doesn’t matter what the user types here, but the answer to the question isn’t shown until the user hits enter. This gives enough time to decide if one wants to skip the question, and if not, what the answer is.
print(row['category'] + " for $" + str(row['value']))
mediatext = ''
if row['media']:
    mediatext = ' with media'
DDtext = ''
if row['DD']:
    DDtext = ' Daily Double'
print(row['date'] + DDtext + mediatext)
# This does not store the user's answer anywhere, and is only intended to wait for some user input.
input(row['hint'])

After enter is hit, the answer is shown.

Seeing as we have a goal to answer these questions as quickly as possible, answering is done with a single keystroke:
, | . | / |
---|---|---|
skipped | correct | incorrect |
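As a minimal sketch of how a keystroke becomes a row update (the helper is illustrative rather than the project’s exact code; the subject letters are handled similarly, as described below):

from datetime import date

key_to_accuracy = {",": 0, ".": 1, "/": -1}

def record_answer(df, index, keystroke):
    # Map the single keystroke to the accuracy codes from the column table
    df.loc[index, "accuracy"] = key_to_accuracy[keystroke]
    df.loc[index, "date_answered"] = str(date.today())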
It’s also important to be able to list the subject (or subjects) of each question. For now, I’m choosing to have the user input the subject of each question after answering. Of course, this comes with a certain level of subjectivity, but I don’t think this can be meaningfully avoided. My original plan was to use AI or some other method to list the subject of each question, but I doubt this could meaningfully reduce subjectivity. It makes more sense, I think, to rely on the individual filling the questions out to decide this on a case-by-case basis for each question. These are the letters I’m using to represent each category:
a | b | c | d | e | f | g | h |
---|---|---|---|---|---|---|---|
anatomy & medicine | brands | current events & law | drama & theater | econ & business | film | geography | history |
i | j | l | m | n | o | p | r |
---|---|---|---|---|---|---|---|
internet & pop culture | fashion | languages | music | novels & non-fiction books | other | poetry | religion & mythology |
s | t | u | v | w | y | z |
---|---|---|---|---|---|---|
science | television | universities | visual arts & painting | wordplay | food & cooking (yum) | sportz |
Of course, anyone else who uses this program is welcome to come up with their own categories. From my experience, this roughly covers the range of questions asked on the show.
Returning to our example question from before, I do happen to know the Latin word for law, even if I had no idea that was the motto of the University of North Dakota. I’ll type a “.” to indicate that I answered correctly, followed by an “l” to indicate that this is a question about languages, and a “u” to indicate that this is a question about university culture. The “.” is converted to an integer signifying that the question was answered correctly, and the date of answering is added to the dataframe automatically.

And it’s as quick as that! Three characters, and I’ve categorized my relationship with this question in as much detail as is required. To give a basic idea of how effective this system is, I can go through an entire episode’s worth of questions in roughly 11 minutes, which is much quicker than the show’s half-hour run time!
It was at this point in the development process that I ironed out the full functionality for demo purposes. So far, a decent handful of functions have been implemented (downloading data from the archive, giving subjects to questions, sorting and managing data, etc.), so it makes sense to add a menu for selecting among these.

Admittedly, this is hardly visually flashy, but it’s more than enough to get the job done for now!
Evaluating accuracy by game#
I’d like to be able to know about my accuracy per game. From what I understand, contestants that go on to appear on the show know about three quarters of the material, so that’s a good benchmark to aim for.[4] It’s fairly easy to take the dataframe containing all answered questions, separate them by game and correctness, and then collate those into one single dataframe. That looks a little like this:

Of course, these raw numbers aren’t especially useful on their own. We need to be able to see what the percentages are for each of these.

Now, this is all very good. My accuracy tends to fall a bit below what the contestants on the show manage, so I’ll have to isolate my weak areas and study those to improve. Exploring that aspect of my accuracy will come in later sections; for now there’s something else that feels missing from this table. How much does my accuracy drop moving from cheaper questions to more expensive ones? Having an accuracy of 75% isn’t especially useful if that accuracy is concentrated entirely in cheap questions that score a minimal amount of points. After all, the final questions are worth ten times the first ones. This concept can be expressed through Coryat score: the total number of points earned by a player if wagering is ignored. This is found by summing up the values of all correct guesses within a game, and subtracting the values of all incorrect guesses. Incidentally, accounting for Coryat score calculations was part of why I chose to code the correctness of an answer with 1, 0, and -1, and why I gave Final Jeopardy a value of 0. Not only is it very memory efficient, but it means a question’s Coryat contribution can be found by simply multiplying the “accuracy” cell in a row by that row’s “value” cell.
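As a quick sketch of that arithmetic (a groupby is shown here as one way to do the per-game tally that the next paragraph describes row by row):

# Each question's Coryat contribution is accuracy (-1/0/1) times value
answered = df.dropna(subset=["accuracy"]).copy()
answered["coryat"] = answered["accuracy"] * answered["value"]
coryat_by_game = answered.groupby("game_name")["coryat"].sum()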
I’ll create a new column called “coryat” that consists of only zeroes. Then, the program iterates through the entire abbreviated dataframe, and adds the Coryat score from each question to the appropriate row in the games dataframe. The end result looks like this:

I’m pleasantly surprised with my scores! After looking around, I found this article from a former contestant that suggests having an at-home Coryat score above $20,000, and this one which suggests scoring “consistently” around $28,000, or even $32,000 if you perform worse under pressure. In fact, the average Coryat score earned on the show in the current season is around $10,500. Of course, Coryat scores are bound to be lower when competing with others as opposed to playing by oneself, but even tripling that score only gives $31,500.
I wonder why my Coryat scores seem so high compared to my accuracy percentages. This will be explored in greater depth later, but I can state affirmatively that I’m pretty awful with movies, TV, and sports. The general wisdom is that subjects pertaining to pop culture are more common in single Jeopardy as opposed to double Jeopardy, so that might be an explanation. I also tend to be fairly conservative when guessing, so I skip a lot more than I get incorrect.
Let’s say I want to know my average Coryat score over all the games I’ve played. The problem here is that I haven’t actually finished all the games that appear in the games dataframe. For example, I’ve only answered 5 questions from show #9198, meaning that my Coryat score for that game is much smaller than it should be! The solution to this is to only include games that have been completed, but how do we tell which ones those are? Unfortunately, some games have fewer than the maximum 61 questions loaded into the archive. Show #9200 only has 60 questions, all of which I’ve answered, so labeling a game as complete when 61 questions have been answered is not sufficient.
For the sake of efficiency, I’ll start by labeling all games with 61 answered questions as complete.

Past that, it becomes necessary to compare the answered questions to the overall questions, and see if any remain unanswered. Thankfully, the dataframe of all answers is ordered by air date, so searching it will be fairly quick! I tried a handful of different ways of optimizing this, but in the end, the simplest solution was the best one. I simply marked any game with an unanswered question in the main dataframe as incomplete.
for idx in games.loc[games["complete"] != True].index:
    itr = 0
    while True:
        # An unanswered question from this game means the game is incomplete
        if df.loc[itr, "game_name"] == games.loc[idx, "game_name"] and df.loc[itr, "accuracy"] not in [-1, 0, 1]:
            games.loc[idx, "complete"] = False
            break
        # Once we've moved past this game in the ordered dataframe, every one of its questions was answered
        elif df.loc[itr, "game_name"] < games.loc[idx, "game_name"]:
            games.loc[idx, "complete"] = True
            break
        itr += 1
It’s straightforward to display the user’s Coryat scores within a histogram:
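A minimal matplotlib sketch, assuming the games dataframe built above (the actual figures are styled a bit differently):

import matplotlib.pyplot as plt

scores = games.loc[games["complete"] == True, "coryat"]
plt.hist(scores, bins=20)  # pass cumulative=True, density=True for the second chart below
plt.xlabel("Coryat score")
plt.ylabel("number of games")
plt.show()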

This seems like it might be normally distributed for me, but that might just be my habit of seeing normal distributions everywhere. In either case, I get the impression that a cumulative distribution would be more informative:

As I did here, it’s even possible to add a few benchmarks! I chose to add my median score, as well as some of the numbers suggested from those resources earlier.[5]
Evaluating accuracy by subject#
One of the biggest goals of this project is to analyze which topics and concepts the user needs to study and improve on in order to make the greatest improvement. As mentioned in the section on answering questions, the user is able to indicate the general topic of a question with a brief code consisting of one or more letters. The simplest and first question we might ask is “what is the user’s relative accuracy within each subject?”. Because the topics explored by a question are stored as single letters inside a string, we’ll need to separate these into distinct units. My original plan looked something like this:
- Create an abbreviated dataframe called adf that only contains questions that have already been answered (so unanswered questions don't have to be checked repeatedly, like earlier).
- Add a column to adf for each letter that can act as a code.
- Fill each new column in adf with a boolean value indicating if that letter appears in the "subjects" column
- Drop the "subjects" column from adf, as it is now redundant.
- Create a new dataframe called stdf with columns ["correct", "skip", "incorrect"], and rows indexed by letters 'a' through 'z'.
- Set each cell of stdf to the count of times that cell's letter appears in adf with an accuracy matching that cell's column.
- Drop rows in stdf that are indexed with an unused letter
After experimenting with this strategy for a bit, I’m not particularly satisfied with it. The abbreviated dataframe nearly doubles in size by adding 26 new columns, and the time to run the computations is non-trivial. On top of that, the computation would only grow as more questions are answered, making this increasingly inefficient over time. Steps 2 and 3 were eventually dropped, even if they make the implementation easier; the inefficiency bothered me too much. The resulting workaround looks like this:
# adf is the abbreviated dataframe. In fact, that's what adf stands for.
# stdf is the simplified topic dataframe. We'll hope to do some more complicated analysis on topics later.
# I normally would be reluctant to choose such abbreviated names, as I find it can cause confusion. However,
# due to the somewhat awkward way Pandas requires dataframe names to be repeated while altering dataframes,
# it's just too tempting to abbreviate in order to save space.
from string import ascii_lowercase

import pandas as pd

# Resetting the index keeps the .loc[index] lookups below valid after dropping unanswered rows
adf = df.dropna().reset_index(drop=True)
stdf = pd.DataFrame(columns=["correct", "skip", "incorrect"])
# This adds a row to stdf for each letter
for char in ascii_lowercase:
    stdf.loc[char] = [0, 0, 0]
# Earlier drafts of this function used some fancy math to increment each column by index. While that made for
# some nice code golf, it wasn't very easily readable. This provides a more human-friendly approach.
accuracy_translator = {1: "correct", 0: "skip", -1: "incorrect"}
# This loop runs through adf and appends each question to stdf appropriately
for index in range(len(adf)):
    for char in adf.loc[index, "subjects"]:
        stdf.loc[char, accuracy_translator[int(adf.loc[index, 'accuracy'])]] += 1
# It feels important to drop the unused letter categories after calculating their total. The alternative is to
# check all 3 other columns for 0s.
stdf["total"] = stdf.sum(axis=1)
stdf.drop(stdf.loc[stdf["total"] == 0].index, inplace=True)
stdf["correct_rate"] = stdf["correct"] / stdf["total"]
stdf["skip_rate"] = stdf["skip"] / stdf["total"]
stdf["incorrect_rate"] = stdf["incorrect"] / stdf["total"]
The user should be able to see visually what their percentages are. While pie or donut charts are an ideal way to represent ratio values, it would probably be overwhelming to present 26 pie charts simultaneously. A stacked bar graph would work better.
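Pandas’ built-in plotting makes this straightforward; a sketch assuming the stdf rate columns computed above:

import matplotlib.pyplot as plt

stdf[["correct_rate", "skip_rate", "incorrect_rate"]].plot(
    kind="bar", stacked=True, color=["tab:green", "tab:gray", "tab:red"])
plt.ylabel("proportion of questions")
plt.show()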

Suddenly, the single-letter topic names aren’t so useful! The likely best solution is to build a dictionary containing all the letters I use, along with their definitions:
topics = {'a':'medicine', 'b':'brands', 'c':'current events', 'd':'theatre', 'e':'econ', 'f':'movies',
'g':'geography', 'h':'history', 'i':'pop culture', 'j':'fashion', 'l':'languages', 'm':'music',
'n':'literature', 'o':'other', 'p':'poetry', 'r':'mythology', 's':'science', 't':'television',
'u':'universities', 'v':'art', 'w':'wordplay', 'y':'food', 'z':'sports'}

But I want this program to be useful for other people, and they might want to use different topics from what I do. For now, I’ll have a little text file called params.txt, and I’ll include the topics dictionary in there. A user may choose to alter that text file directly.
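A minimal sketch of loading it, assuming params.txt simply contains the dictionary literal shown above (the real parameter handling may grow beyond this):

import ast

# params.txt is assumed to hold a Python dict literal mapping letters to topic names
with open("params.txt") as f:
    topics = ast.literal_eval(f.read())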
After just a little bit more tweaking, the chart from before is looking very nice and readable:

I’m a bit surprised by my own numbers! I wasn’t expecting sports to be that high, and I wasn’t expecting geography to be that low.[6] Regardless, there’s still an additional problem with this graph: it doesn’t show the prevalence of these topics! The easiest solution is to graph the absolute appearances of these topics, rather than their rates:

Looking at the first graph, you can see that I don’t know a lot about universities (the topic largely has to do with college sports teams, which I’m even more clueless about than professional sports). However, based on the second graph, it probably doesn’t warrant being a top concern. Film and TV both seem somewhat minor, but taken as a whole they make up a huge number of questions! If I want to get better, that’s probably a good place to start.
Evaluating score rather than accuracy#
There is, however, an even more precise approach to performance by topic that could provide even more information. What if I considered the total score attributable to each of these topics? As I mentioned, the common knowledge is that some topics are cheaper than others: pop cultural references tend to stay within single Jeopardy, for example. Conversely, questions about America’s prestigious universities tend to make their way into double Jeopardy. Maybe learning more about Ivy League schools will look more important once I take the relative scores of these questions into account.
The previous code is tweaked to count total score. I’ve left the comments off so I don’t repeat myself:
adf = df.dropna().reset_index(drop=True)
stdf = pd.DataFrame(columns=["correct", "skip", "incorrect", "correct_score", "skip_score", "incorrect_score"])
for char in ascii_lowercase:
    stdf.loc[char] = [0, 0, 0, 0, 0, 0]
accuracy_translator = {1: "correct", 0: "skip", -1: "incorrect"}
score_translator = {1: "correct_score", 0: "skip_score", -1: "incorrect_score"}
for index in range(len(adf)):
    for char in adf.loc[index, "subjects"]:
        stdf.loc[char, accuracy_translator[int(adf.loc[index, 'accuracy'])]] += 1
        stdf.loc[char, score_translator[int(adf.loc[index, 'accuracy'])]] += int(adf.loc[index, 'value'])
stdf["total"] = stdf[["correct", "skip", "incorrect"]].sum(axis=1)
stdf["total_score"] = stdf[["correct_score", "skip_score", "incorrect_score"]].sum(axis=1)
stdf.drop(stdf.loc[stdf["total"] == 0].index, inplace=True)
stdf["correct_rate"] = stdf["correct"] / stdf["total"]
stdf["skip_rate"] = stdf["skip"] / stdf["total"]
stdf["incorrect_rate"] = stdf["incorrect"] / stdf["total"]
stdf["average_value"] = stdf["total_score"] / stdf["total"]
I’m curious how much truth there is to the idea that different topics appear with different average values. Graphing them side-by-side is straightforward:

This graph is actually a little misleading. Topics don’t appear at random across the Jeopardy! board: instead, they typically appear in groups of five, taking up each value in an entire Jeopardy! category. Therefore, you wouldn’t expect any topic to have an average value under the average value for the entire single Jeopardy board! The average value of all clues in single Jeopardy is ($200 + $400 + $600 + $800 + $1000) / 5 = $600, so we shouldn’t expect anything to have a value below that! A topic that only appears in double Jeopardy should similarly have an average value of $1200. Because of this, $600 and $1200 make appropriate bounds on what we could reasonably expect to find while looking at average values. Adjusting the chart to these bounds shows just how skewed the average values of different topics are:

Despite how striking this chart is, I wonder how robust these results are. Over 3 months of questions, I only found 24 questions about business and economics. What are the confidence intervals around that average value of $650? Unfortunately, trying to do the standard statistical analyses here fails to jump over the first hurdle. The values of answers are not at all independent! As mentioned, topics tend to appear in chunks that occupy an entire column of the board. We’ll explore this more a bit later on, but for now, just know that these values should be taken with a bit of a grain of salt.
Returning to the subject at hand, what is the total score across each topic? Just like before, we can build a stacked bar chart of total scores:

This is, in some ways, an improvement. For example, although I’ve missed more questions about TV than I have about music, it looks like the gaps in my music knowledge are responsible for more missed points! If I were going to start studying for trivia night, popular music might be a better starting point than television!
In other ways, this new chart shows some of the limitations of the present approach. Listing the total score like this takes up a large amount of space, preventing some of the labels from appearing. Additionally, comparison between these values is difficult for a few reasons: the total score for each topic changes as more questions are answered, it’s hard to compare large numbers without commas, and it’s hard to see how these values compare to the larger picture. The obvious solution is to display the percentages of scores as a proportion of the total, rather than their absolute value. We’ll divide the scores from each topic by the total score possible over all questions answered. The result looks like this:
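A sketch of that normalization under the column names used earlier; note that because a question can carry several topics, these shares can overlap:

# Total points on offer across every answered question
total_possible = adf["value"].sum()
for col in ["correct_score", "skip_score", "incorrect_score"]:
    stdf[col.replace("_score", "_share")] = stdf[col] / total_possible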

Note that the values on this graph add up to above 100%. That’s because questions may have multiple topics, meaning that the scores from topics may overlap.
Although I like this graph quite a bit, it still doesn’t quite satisfy me. It’s a bit inconvenient to compare the percentages of questions I missed. One possible solution is to display this bar chart with the split between correct and not-correct aligned at the center, a bit like a population pyramid. This is my favourite option so far.

Of course, there’s always more tweaking to be done. For now, let’s move on to other parts of this project.
Correcting misentered data#
There’s another problem I’ve noticed while working on this project: sometimes I’ll enter the subjects for a question incorrectly, and there’s no way to adjust after the fact! I’d like to include a method to re-enter previous inputs conveniently. At present, when I misenter data, I have to resort to one of a few possible courses of action to fix it:
- Close the program before it can save entered data, and reenter everything
- Type out a specialized line of code to edit things
- Open the .csv file in a spreadsheet editor and adjust it manually
All of these seem like a lot of time and effort to edit a single character. The easy solution is to allow for the user to enter a special symbol that goes back and erases the previous entry. We’ve already used commas, periods, and slashes to note accuracy, but the semicolon is unused! Here’s what the final result looks like:

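Under the hood, the rollback is simple; a hedged sketch (previous_index is whatever row the input loop handled last, and the real loop does more bookkeeping):

def undo_last(df, previous_index):
    # ";" clears the user's entry for the previously answered row so it can be redone
    df.loc[previous_index, ["accuracy", "subjects", "date_answered"]] = [None, None, None]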
Reviewing missed questions#
One of my main goals when starting this project was to implement the ability to review previously missed questions. When you want to improve, the most important thing is focusing on your weak points. It’s fairly easy to restrict the set of questions to those that have been answered incorrectly and belong to a specific topic.

In the implementation built here, it is also possible to see those questions that belong to one of two or more topics. For example, let’s say I want to see missed questions from either “TV” or “movies”, since they’re so similar. That’s totally possible.

It’s even possible to consider only questions that have both of two topic markers. Let’s say I want to review questions pertaining specifically to musical theater and opera. That would correspond to the questions that are categorized under both music and drama.

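Both filters come down to simple string matching on the subjects column. A sketch using the letter codes listed earlier ('f' film, 't' television, 'm' music, 'd' drama):

missed = adf[adf["accuracy"] == -1]
# "either" filter: any missed question tagged film OR television
tv_or_film = missed[missed["subjects"].str.contains("[ft]", regex=True)]
# "both" filter: missed questions tagged music AND drama
musical_theater = missed[missed["subjects"].str.contains("m")
                         & missed["subjects"].str.contains("d")]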
You can choose to display either the most recent questions fitting the requirements, or a random sample. In either case, I recognize that this display is a bit limited. To let the user review whatever they like, the questions fitting the description are also exported to a CSV file.
Frequency of topics within each game#
I mentioned near the beginning of this article that there are certain topics that consistently get one category per game. I’m curious what the distribution of topics looks like within each game. I mentioned that there seems to be about one category of science questions per game, for example; does the standard game of Jeopardy! contain 5 science questions? I’ve gone over my coding process in pretty good detail so far, so I’ll cut to the chase. After removing all the incomplete games, here’s the top of the dataframe of games:

It’s then possible to check the frequency distribution charts of quantities within each game. Here’s a handful:
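To give an idea of the counting involved, here’s a sketch that builds a per-game count for each topic and plots one distribution; topic_counts is a name I’m introducing here, and topics is the dictionary from params.txt:

import matplotlib.pyplot as plt
import pandas as pd

# Concatenate each game's subject codes, then count each topic letter per game
codes = adf.groupby("game_name")["subjects"].apply("".join)
topic_counts = pd.DataFrame({name: codes.str.count(letter)
                             for letter, name in topics.items()})
topic_counts["science"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("science questions per game")
plt.ylabel("number of games")
plt.show()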

Unsurprisingly, many of these distributions have a spike at 5, corresponding to a full category. However, they’re a good bit flatter than I expected! I mentioned that there’s almost always a category dedicated to movies, so why do so many games see fewer than 5 questions about movies? Maybe those games are the same ones that ask about TV, so the total number of questions about movies and TV stays above 5. For any pair of topics, we can create a table displaying how often each combination of their counts appears. Here’s a few examples of that:

I’ll admit, I’m a little disappointed with these results. I tried comparing similar topics, with the hope that each game would have a category fitting into either of the two. My dream is that there would be a clear line along the diagonal totalling to 5 questions, but the data seems much messier than that. For example, some games with very few questions about literature seem to make up for it with extra questions about poetry, but many games lack either![7]
Because of this messiness, it may be more enlightening to make a heatmap of the correlations between each category, rather than heatmaps of the count combinations between two categories. That looks like this:
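As a sketch of how such a heatmap can be drawn, reusing the hypothetical topic_counts dataframe from above:

import seaborn as sns
import matplotlib.pyplot as plt

corr = topic_counts.corr()  # Pearson by default; method="spearman" gives the alternative below
sns.heatmap(corr, cmap="coolwarm", center=0, square=True)
plt.show()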

In order to really exaggerate the severity of these correlations, we can also look at this chart with all values square-rooted:
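Since correlations can be negative, a plain square root won’t do; I’m reading “square-rooted” here as a sign-preserving root (an assumption about the exact transformation):

import numpy as np

stretched = np.sign(corr) * np.sqrt(corr.abs())
sns.heatmap(stretched, cmap="coolwarm", center=0, square=True)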

These both show Pearson coefficients, which I think is the right choice. If you’re curious what the Spearman versions look like, I’ll show those here:


From this, we see that the subject pairs with the most extreme correlations are:
correlation | first subject | second subject |
---|---|---|
0.33 | Pop culture | Poetry |
0.33 | Pop culture | Mythology |
0.26 | History | Television |
0.24 | Current Events | Theatre |
0.22 | Medicine | Food |
… | … | … |
-0.30 | Sports | Economics |
-0.35 | Television | Music |
-0.36 | Poetry | Literature |
-0.40 | Languages | Wordplay |
-0.53 | Medicine | Science |
It’s easy to make a few surface level observations.
- Most correlations are slightly negative. Because each game contains at most 61 questions, a question belonging to any subject means there is one fewer question that can fall under other subjects.
- There is a strong negative correlation between very similar subjects. It seems that from the perspective of the Jeopardy! writers, medicine and science are the same subject, so the number of questions they ask about them collectively is kept within certain bounds.
- In general, very few subjects show any sort of significant correlation.