This article was spun off from a larger, more technical article about how this data was collected. I wanted to focus on some fun statistical analysis of the answers on Jeopardy!, so feel free to skip the main article if you’re only interested in the stats.
What are the most common correct responses? #
In the previous article, I created a massive dataframe containing the categories, hints, answers, dates, and point values of every question aired on Jeopardy!. The first question that I’m curious about is simple: what are the most common correct responses on Jeopardy!? What people, places, and things are most valued by the Jeopardy! writers?
Simply counting the number of times each word or phrase appears as a correct response gives some surprising results:
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Response | Australia | China | Japan | Chicago | France | India | Spain | California | Mexico | Alaska | Canada | India | Hawaii | Florida | Texas |
Count | 511 | 484 | 472 | 471 | 462 | 434 | 427 | 420 | 390 | 380 | 376 | 354 | 354 | 351 | 340 |
Geographical locations are repeated far more frequently than famous people or works of art! The most frequent correct response is “What is Australia?”, appearing 511 times over the show’s run. After that comes China, then Japan, Chicago, France, and so on. As a matter of fact, all of the top 37 responses are geographic locations! (The trend is finally broken by Napoleon, appearing 243 times.)
Geography certainly isn’t a vastly more common subject than history, so what could cause this pattern? Part of the answer is most likely that geography questions draw from a narrower range of answers than other categories; there are fewer cities than people, after all. It’s also possible that locations are more common as responses, while historical figures appear more often in the clues themselves. I’ve asked a few friends what they think the most popular responses would be, and they usually intuit that the top responses would be geographical.
However, there is a second factor at play here: people’s names can appear in a number of different ways. Let’s return to Napoleon as an example. In addition to the 243 appearances of “Napoleon”, the table also contains:
Napoleon Bonaparte | Napoléon | Napoleon (Bonaparte) | Napoléon Bonaparte | Napoleon (I) | Napoléon (Bonaparte) |
---|---|---|---|---|---|
47 times | 14 times | 4 times | 3 times | 3 times | 1 time |
One might wonder why so many variations on a single name are possible. While contestants may say the full name while answering, the judges will often (but not always) accept just the surname, especially when referring to politicians. To eliminate ambiguity, the people maintaining the archive will often (but not always) add the given name to a response in parentheses. Ironically, Napoleon is a counterexample, as his first name is much more recognizable than his last.
Regardless of these details, this poses a serious problem for our understanding of the data. A first instinct might be to simply sum up every cell containing the word “Napoleon”, but this approach is fraught: it would sweep in over twenty instances of variations on “Napoleon III”, as well as 6 instances of “Napoleon Dynamite”. The problem is even worse for less distinctive names. Here’s a breakdown of answers containing “Ford” as the last part of a person’s name, appearing three or more times:

On a more personal note, the corresponding chart for people named Johnson is even more disastrous:

This is a massive problem! How could we possibly handle it? As I see it, there’s a handful of strategies that could be used to collect different versions of one person together, some of which might be used simultaneously:
- Find all variations of responses that end with the same surname, and sum up their counts into a single number.
- Take the counts of answers that consist of a single surname, and distribute that number proportionally across the different names with that surname.
- Ask an LLM to consider the question as a whole, and determine the exact identity of each human included.
- Group together variations that differ only by a pair of parentheses or diacritical marks. For example, replace “(Gerald) Ford” with “Gerald Ford”.
Option 1 can be disregarded immediately. This would require that Gerald Ford and Harrison Ford be counted as one person, which is unacceptable.
Option 2 seems like a better idea at first, but falls apart the more one considers it. It would be nice to split the 119 instances of “Ford” and distribute them across “Gerald Ford”, “Henry Ford”, “Betty Ford”, and “Harrison Ford”. However, it isn’t valid to assume that referring to someone by surname alone is as common for presidents as it is for actors. In fact, it isn’t fair to assume that all 119 instances of “Ford” refer to a person at all; surely, many refer to the Ford Motor Company.
Option 3 is tempting, but LLMs are always prone to error. This concern is easy to overstate; LLMs are getting more accurate all the time, especially on simple factual questions. However, determining their exact level of inaccuracy would take testing and comparison that is currently outside the scope of this project.
That leaves option 4, which really does seem reasonable. This is a small change overall: adding the 13 instances of “(Gerald) Ford” to the 128 instances of “Gerald Ford” is not likely to be hugely impactful. However, it’s also very unlikely to have negative side effects. This will be implemented moving forward.
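A minimal sketch of what that normalization might look like in Python; the function name is mine, and it assumes responses are plain strings:

```python
import re
import unicodedata

def normalize_answer(answer: str) -> str:
    """Fold together variants that differ only by diacritics or parentheses."""
    # Strip diacritical marks: "Napoléon" -> "Napoleon"
    decomposed = unicodedata.normalize("NFKD", answer)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop the parentheses themselves but keep what was inside: "(Gerald) Ford" -> "Gerald Ford"
    no_parens = stripped.replace("(", "").replace(")", "")
    # Tidy up any doubled whitespace left behind
    return re.sub(r"\s+", " ", no_parens).strip()

# All of these collapse to "Napoleon Bonaparte":
print({normalize_answer(s) for s in ["Napoléon Bonaparte", "Napoleon (Bonaparte)", "Napoleon Bonaparte"]})
```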
Finally, it’s probably a good idea to restrict the timespan the program searches. What’s considered important knowledge has changed over the years, as have the people on Jeopardy!’s writing staff. Viewing just the last 5 years seems like a decent compromise between quantity and relevance of data. With that, we can finally note the most common responses of the last few years. Here’s the top 75:
It’s also fairly easy to sift through the first rows of this table by hand and remove all answers that are geographic locations:
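Roughly, both steps look like this in pandas; df is the clue dataframe from the previous article, the air_date and answer column names are assumptions about its layout, and the geography set is hand-curated and obviously incomplete:

```python
import pandas as pd

# Keep only clues from the last five years of the archive.
recent = df[df["air_date"] >= df["air_date"].max() - pd.DateOffset(years=5)]
top_75 = recent["answer"].value_counts().head(75)

# Sift out a hand-made list of geographic answers.
GEOGRAPHIC = {"Australia", "China", "Japan", "Chicago", "France", "India", "Spain", "California"}
top_non_geographic = top_75[~top_75.index.isin(GEOGRAPHIC)]
```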
Separating data by value #
I’m also curious whether these results change for different point values. Chicago is a very well-known location for Americans, so it’s possible that it only appears so often because it makes for an easy “gimme” clue. In general, clues with higher point values are much harder; maybe common responses are only common for easy clues. Would Chicago still be the most common response if we limited our search to expensive clues? Let’s break the answers down by difficulty and see if the results change. To account for splitting the data into ten parts, I’ll extend the search window to 25 years. Here are the 20 most common responses for each clue value in the Single Jeopardy round:
$200 answer | count | $400 answer | count | $600 answer | count | $800 answer | count | $1000 answer | count |
---|---|---|---|---|---|---|---|---|---|
China | 108 | Australia | 64 | California | 57 | Chicago | 41 | Australia | 39 |
Hawaii | 106 | Alaska | 62 | Chicago | 54 | France | 36 | Maine | 35 |
Japan | 104 | Chicago | 58 | Australia | 45 | New York | 36 | Brazil | 35 |
California | 74 | California | 54 | China | 44 | California | 34 | Chicago | 34 |
Alaska | 73 | France | 54 | Texas | 44 | Australia | 34 | Greece | 32 |
Chicago | 72 | China | 53 | Spain | 42 | Spain | 34 | South Africa | 31 |
Australia | 67 | Japan | 53 | India | 39 | China | 33 | France | 29 |
Mexico | 66 | Spain | 53 | France | 38 | India | 33 | Sweden | 29 |
Florida | 65 | Canada | 52 | Florida | 38 | Alaska | 32 | Spain | 26 |
France | 64 | Mexico | 51 | Japan | 37 | Greece | 32 | Japan | 26 |
India | 59 | India | 50 | London | 35 | Minnesota | 32 | Oklahoma | 25 |
George Washington | 58 | Florida | 49 | Hawaii | 35 | Pennsylvania | 31 | Belgium | 25 |
Ireland | 58 | Boston | 44 | Germany | 35 | Mexico | 30 | Utah | 25 |
Boston | 56 | New York | 44 | Sweden | 35 | Maine | 30 | Texas | 24 |
Canada | 55 | Texas | 44 | New Orleans | 34 | Canada | 29 | Ireland | 24 |
Russia | 52 | Egypt | 40 | Italy | 34 | New Mexico | 29 | Wyoming | 24 |
Egypt | 50 | San Francisco | 38 | Alaska | 33 | Texas | 28 | Maryland | 24 |
New Orleans | 50 | London | 37 | Mars | 32 | Italy | 28 | Norway | 24 |
Paris | 50 | Switzerland | 37 | Greece | 31 | Israel | 28 | Portugal | 24 |
New York | 49 | Hawaii | 36 | South Africa | 31 | Montana | 28 | Thailand | 24 |
And for Double Jeopardy:
$400 answer | count | $800 answer | count | $1200 answer | count | $1600 answer | count | $2000 answer | count |
---|---|---|---|---|---|---|---|---|---|
China | 111 | Australia | 57 | Japan | 46 | Australia | 48 | Brazil | 33 |
Japan | 87 | Chicago | 57 | Australia | 45 | Sweden | 40 | Denmark | 33 |
France | 85 | India | 55 | Sweden | 45 | Georgia | 37 | Portugal | 32 |
Australia | 81 | Spain | 54 | France | 42 | Italy | 36 | India | 30 |
Paris | 77 | France | 47 | Spain | 40 | Brazil | 34 | Andrew Jackson | 30 |
California | 74 | China | 46 | Canada | 40 | Florida | 34 | Sweden | 28 |
Mexico | 73 | Mexico | 46 | India | 39 | France | 32 | Indonesia | 28 |
Cleopatra | 70 | Japan | 43 | Italy | 37 | Spain | 32 | Georgia | 27 |
Spain | 68 | Paris | 43 | Portugal | 36 | South Africa | 32 | Norway | 26 |
London | 67 | Egypt | 43 | Chicago | 35 | Maine | 32 | Poland | 26 |
Alaska | 67 | California | 42 | Denmark | 35 | India | 31 | the Netherlands | 26 |
Ireland | 67 | Ireland | 41 | Greece | 34 | Mexico | 31 | Spain | 25 |
Italy | 66 | Italy | 41 | China | 33 | Switzerland | 31 | North Carolina | 25 |
India | 64 | South Africa | 41 | Brazil | 33 | Norway | 30 | Finland | 25 |
Chicago | 62 | Canada | 39 | Paris | 32 | Portugal | 29 | South Africa | 24 |
Canada | 57 | Venus | 39 | South Africa | 32 | Chicago | 29 | Chicago | 24 |
George Washington | 56 | Napoleon | 37 | Texas | 32 | Denmark | 29 | New Hampshire | 24 |
Hawaii | 55 | Rome | 37 | Germany | 31 | Andrew Jackson | 28 | France | 23 |
Florida | 52 | Hamlet | 37 | New York | 31 | China | 27 | Egypt | 22 |
Egypt | 52 | Texas | 36 | Napoleon | 30 | Greece | 26 | the Philippines | 22 |
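For anyone following along in pandas, tables like the ones above boil down to a value count within each dollar value. Here’s a rough sketch, with the column names round, value, and answer standing in for whatever the real dataframe uses:

```python
# Sketch only: "df" is the full clue dataframe; "round", "value", and "answer" are assumed column names.
single = df[df["round"] == 1]
for value, clues in single.groupby("value"):
    print(f"Top responses for ${value} clues:")
    print(clues["answer"].value_counts().head(20))
```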
Here’s the same data, with all the geographic locations removed:
$200 answer | count | $400 answer | count | $600 answer | count | $800 answer | count | $1000 answer | count |
---|---|---|---|---|---|---|---|---|---|
George Washington | 58 | Ronald Reagan | 36 | Mars | 32 | 4 | 21 | Andrew Jackson | 17 |
red | 49 | 2 | 30 | 3 | 25 | Eisenhower | 21 | Grover Cleveland | 17 |
Abraham Lincoln | 47 | red | 29 | white | 24 | Mars | 18 | golf | 16 |
McDonald’s | 47 | gold | 29 | basketball | 23 | Venus | 18 | Theodore Roosevelt | 16 |
Napoleon | 46 | Wisconsin | 28 | Ronald Reagan | 22 | golf | 18 | 4 | 15 |
gold | 43 | George Washington | 27 | Venus | 22 | Eleanor Roosevelt | 18 | Calvin Coolidge | 15 |
Julius Caesar | 42 | tea | 27 | Richard Nixon | 22 | Theodore Roosevelt | 18 | white | 15 |
Lincoln | 41 | 3 | 26 | Napoleon | 21 | Andrew Jackson | 17 | 12 | 15 |
Madonna | 41 | Maine | 26 | Thomas Jefferson | 20 | blue | 17 | Henry VIII | 14 |
Elvis Presley | 39 | Sweden | 25 | baseball | 20 | Jacob | 17 | Julius Caesar | 14 |
water | 38 | Mars | 24 | Andrew Jackson | 20 | 3 | 16 | iron | 14 |
milk | 36 | Thomas Jefferson | 24 | George Washington | 19 | Richard Nixon | 16 | Uranus | 14 |
Cleopatra | 36 | coffee | 24 | Abraham Lincoln | 19 | Mark Twain | 16 | Solomon | 13 |
Babe Ruth | 35 | rice | 24 | Julius Caesar | 19 | Henry VIII | 16 | Jupiter | 13 |
white | 34 | World War I | 24 | 4 | 18 | Solomon | 15 | Saturn | 13 |
Moses | 34 | Elvis Presley | 23 | green | 17 | Pocahontas | 15 | Neptune | 13 |
2 | 34 | Venus | 23 | Jupiter | 17 | nitrogen | 15 | Woodrow Wilson | 13 |
Coca-Cola | 33 | Pennsylvania | 23 | blue | 17 | basketball | 14 | Job | 12 |
golf | 32 | New Jersey | 23 | Hamlet | 17 | Jupiter | 14 | Othello | 12 |
Richard Nixon | 32 | oil | 23 | Buddhism | 17 | 7 | 14 | John Adams | 12 |
$400 answer | count | $800 answer | count | $1200 answer | count | $1600 answer | count | $2000 answer | count |
---|---|---|---|---|---|---|---|---|---|
Cleopatra | 70 | Venus | 39 | Napoleon | 30 | Andrew Jackson | 28 | Andrew Jackson | 30 |
George Washington | 56 | Napoleon | 37 | Mozart | 30 | Woodrow Wilson | 22 | Woodrow Wilson | 21 |
Napoleon | 51 | Hamlet | 37 | Thomas Jefferson | 28 | Henry VIII | 21 | Eugene O’Neill | 20 |
Julius Caesar | 49 | Macbeth | 34 | David | 26 | Jupiter | 20 | William Faulkner | 19 |
Michelangelo | 47 | Julius Caesar | 32 | Galileo | 25 | Thomas Jefferson | 19 | Virginia Woolf | 18 |
Mars | 43 | Abraham Lincoln | 31 | Michelangelo | 24 | Theodore Roosevelt | 19 | Henry Moore | 18 |
Joan of Arc | 43 | Picasso | 31 | Hamlet | 23 | Richard III | 18 | Richard III | 17 |
Mark Twain | 38 | Thomas Jefferson | 30 | Mars | 22 | Eleanor Roosevelt | 18 | John Quincy Adams | 17 |
Abraham Lincoln | 38 | Cleopatra | 28 | Beethoven | 22 | King Lear | 17 | Maria Theresa | 17 |
Hamlet | 36 | Ronald Reagan | 28 | Theodore Roosevelt | 22 | Charlemagne | 17 | Claudius | 16 |
Alexander the Great | 36 | Mars | 27 | King Lear | 21 | A Midsummer Night’s Dream | 17 | Aeschylus | 16 |
Ronald Reagan | 35 | Mozart | 27 | Richard III | 20 | Rembrandt | 17 | Twelfth Night | 15 |
Romeo and Juliet | 35 | Lincoln | 26 | Henry VIII | 20 | George Eliot | 17 | Orpheus | 15 |
Columbus | 35 | Queen Victoria | 26 | Picasso | 19 | Herbert Hoover | 17 | John Adams | 15 |
Venus | 34 | George Washington | 25 | 3 | 19 | Archimedes | 17 | Raphael | 15 |
Agatha Christie | 34 | World War I | 25 | Sylvia Plath | 19 | Galileo | 16 | Nathaniel Hawthorne | 15 |
gold | 33 | David | 25 | Gerald Ford | 18 | Georgia O’Keeffe | 16 | Aristophanes | 15 |
Shakespeare | 33 | Michelangelo | 24 | John Adams | 18 | English | 16 | 7 | 15 |
water | 33 | Benjamin Franklin | 24 | Thomas Hardy | 18 | Venus | 15 | Sir Walter Scott | 14 |
Beethoven | 32 | Galileo | 24 | Venus | 17 | Dylan Thomas | 15 | Zachary Taylor | 14 |
Reading through this data is endlessly fascinating to me. Of course, I care more than the average bear about this show, so it’s hard for me to tell which of its peculiarities are interesting to the average person. Here are just a few of the odd patterns that appear when breaking the data down by value:
- Although the Beatles are notorious as a common subject on Jeopardy!, their prevalence as a response drops off rapidly after the $200 clue
- The same is true for Shakespeare, although the titles of his plays see reasonable representation across different clue values
- In general, some answers seem to favor certain values. “Barcelona” is three times as likely to be the answer to the $800 clue as to any other value in Single Jeopardy
- Aeschylus is an even more extreme example: he has appeared as an answer only three times in all of Single Jeopardy, but twice as a $1600 clue and sixteen times as a $2000 clue in Double Jeopardy
It’s worth noting that none of these numbers are incredibly large. I don’t want to overstate the significance of Aeschylus’ 16 appearances, especially since there have been about 36,000 clues worth $2000 in the past 25 years. Nonetheless, we’ve at least gained some insight into our question! It does seem that expensive clues tend to be more spread out than cheap ones, but this doesn’t seem to be the whole story. Let’s get a more in-depth picture of the frequencies of different answers. Here’s a graph showing the frequencies of answers by their index in the grid:

Converting both axes to a logarithmic scale brings the graph into clearer focus:

To get a better idea of how this data is shaped, we might separate this into the specific point values:

These graphs are not linear, so the equations that fit them closely are not simple. I doubt it would be informative to list every equation for the curves of best fit, but here’s a graph showing quadratic regression curves for the previous graph:

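As a rough illustration of the kind of fit shown above (using numpy and matplotlib; the variable df and the column names are the same assumptions as in the earlier sketches):

```python
import numpy as np
import matplotlib.pyplot as plt

# Rank-frequency curve for the $200 clues in Single Jeopardy, on log-log axes,
# with a quadratic regression curve laid on top.
counts = df[(df["round"] == 1) & (df["value"] == 200)]["answer"].value_counts()
log_rank = np.log10(np.arange(1, len(counts) + 1))
log_freq = np.log10(counts.to_numpy())

coeffs = np.polyfit(log_rank, log_freq, deg=2)  # quadratic fit in log-log space
plt.scatter(log_rank, log_freq, s=2, label="observed")
plt.plot(log_rank, np.polyval(coeffs, log_rank), label="quadratic fit")
plt.xlabel("log10(rank)")
plt.ylabel("log10(count)")
plt.legend()
plt.show()
```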
This brings the confirmation of our suspicions into clearer view. Easier clues, like those worth $200, focus predominantly on a small handful of answers, while more expensive clues are spread across a wider variety of topics. However, the fact that answers are more widely spread for difficult clues than for easy ones does not mean that they are all the same answers. As we’ve seen, some answers, like “Virginia Woolf” or “Aeschylus”, appear almost exclusively as responses to difficult clues, even though they appear relatively often.
To bring this into sharper focus, compare the distribution of “Virginia Woolf” in Double Jeopardy to that of “Elizabeth I”:
Response | $400 | $800 | $1200 | $1600 | $2000 |
---|---|---|---|---|---|
Virginia Woolf | 2 | 8 | 8 | 11 | 11 |
Elizabeth I | 21 | 8 | 7 | 4 | 0 |
Both women have appeared exactly 40 times in Double Jeopardy over the last 40 years, but there’s a clear difference in distribution! “Who is Virginia Woolf?” tends to be a much higher-scoring response than “Who is Elizabeth I?”. Typically, when I ask people what they assume the difference is between expensive and inexpensive clues in Jeopardy!, they reason that expensive clues ought to ask about more “obscure” people and things, and this does hold true for our examples of Woolf and Elizabeth. Virginia Woolf is mostly discussed in specialized English Lit courses at universities, while Elizabeth I is well known even to elementary students, so it’s probably fair to characterize Woolf as “more obscure”. That said, it’s significant that they appear with equal frequency. It’s as though Virginia Woolf is obscure enough to appear mostly in very expensive clues, but important enough to appear just as often as one of England’s most famous monarchs.
Measuring how interesting a response is #
These figures that appear frequently, but only as high-value answers, are very interesting to me. It’s as though the Jeopardy! writers have set out a specific list of cultural touchstones that separate highly knowledgeable people from everyone else. To find these figures more easily, I’ll have to define a metric that indicates how “expensive” an answer is, on average. I waffled on this for a while, but in the end I chose to treat an answer’s value as interval data rather than ordinal, and measured the mean value of each answer within each round. With this metric included, the distribution for Woolf and Elizabeth I looks like this:
Response | $400 | $800 | $1200 | $1600 | $2000 | Total J2 appearances | Average J2 value |
---|---|---|---|---|---|---|---|
Virginia Woolf | 2 | 8 | 8 | 11 | 11 | 40 | 1410 |
Elizabeth I | 21 | 8 | 7 | 4 | 0 | 40 | 740 |
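In pandas terms, this metric is just a groupby with a count and a mean. A minimal sketch, reusing the assumed column names from earlier (the real labels in my dataframe may differ):

```python
# Per-answer appearance count and mean clue value for Double Jeopardy.
double = df[df["round"] == 2]
j2_stats = (
    double.groupby("answer")["value"]
          .agg(total="count", average="mean")   # "total" and "average" mirror the J2 columns above
          .reset_index()
)
print(j2_stats[j2_stats["answer"].isin(["Virginia Woolf", "Elizabeth I"])])
```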
Out of pure interest, we might display every answer on a massive scatterplot, comparing its average value in the first round against its total number of appearances in that round.

And here’s a similar chart for the Double Jeopardy round:

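Given the summary frame sketched above, each of these charts is essentially one matplotlib call:

```python
import matplotlib.pyplot as plt

# One dot per answer: appearances on the x-axis, mean clue value on the y-axis.
plt.scatter(j2_stats["total"], j2_stats["average"], s=3, alpha=0.3)
plt.xlabel("Total Double Jeopardy appearances")
plt.ylabel("Average Double Jeopardy value")
plt.show()
```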
Thinking of “interesting” answers as those with very high average values isn’t adequate. To explain what I mean, consider that there are 13,385 answers that have appeared only as $2000 clues in Double Jeopardy, and 12,098 of these have appeared only once. Many of them are highly specific combinations of things that would only make sense in context. Some examples are:
- "B-29 (the one that carried the atomic bomb)"
- "BMO"
- "people who have seconds"
- "Runaway Bride of Chucky".
- "Sunday in the Park with George Orwell".
- "The Lion, the Witch, and the Wardrobe Malfunction".
If anything, the biggest surprises here are the answers I would have expected to be common, but aren’t. “The Man with the Golden Arm” and “The Three-Body Problem” each appear only once, although I thought they were fairly well-known and important books. Regardless, this demonstrates that we should think of “interesting” answers as those that balance average value against frequency of appearance, and that we need a method that isolates them. One option is to consider the Pareto frontier of the set of all answers. In this case, that means the set of answers whose average value is higher than that of every answer less common than it. In other words, these data points are “Pareto optimal”: they cannot be further optimized for frequency of appearance without sacrificing average value, and vice versa.
I find this a bit surprising, but Pandas doesn’t have a built-in function to find the Pareto frontier. I always found that a bit of a shame, so for the sake of anyone reading, I’m including my function for finding the Pareto frontier along two optimized variables. This function is limited to 2-D Pareto frontiers, and is highly optimized for this specific circumstance.
```python
import pandas as pd

# Given a dataframe df, return the Pareto front of that data as another dataframe, optimized along
# the columns named x and y. Any columns listed in "other" are carried along in the result.
def pareto_front_2D(df, x, y, other=[]):
    front = df[other + [x, y]].copy().dropna()
    # Quick pre-filter: the point with the highest y value dominates everything with a lower x value
    # (lower x and lower-or-equal y), so those rows can be discarded immediately. Then do the same
    # thing with the roles of x and y swapped.
    front = front.sort_values(by=[y, x], ascending=False, ignore_index=True)
    x_record = front.loc[0, x]
    front = front[front[x] >= x_record]
    front = front.sort_values(by=[x, y], ascending=False, ignore_index=True)
    y_record = front.loc[0, y]
    front = front[front[y] >= y_record]
    front.reset_index(drop=True, inplace=True)
    # The rows are now sorted by x (descending), so a row belongs on the front only if it sets a
    # new record along y; anything else is dominated by an earlier row and gets dropped.
    x_prev = front.loc[0, x]
    for i in front.index:
        if front.loc[i, y] < y_record:
            front.drop(index=i, inplace=True)
        elif front.loc[i, y] > y_record:
            y_record = front.loc[i, y]
            x_prev = front.loc[i, x]
        elif (i != 0) and (front.loc[i, x] < x_prev):
            # Ties the current y record but with a strictly smaller x: dominated, drop it.
            front.drop(index=i, inplace=True)
        else:
            x_prev = front.loc[i, x]
    return front
```
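Applied to a per-answer summary like the one sketched earlier, the call looks something like this (column names are again placeholders for whatever the real frame uses):

```python
# Answers whose average value beats that of every more common answer.
j2_front = pareto_front_2D(j2_stats, x="total", y="average", other=["answer"])
print(j2_front)
```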
The Pareto front for the single Jeopardy round looks like this:

And the chart for the Double Jeopardy round looks quite similar:

The precise values on the Pareto front for the first round of Jeopardy are:
answer | J1 total | J1 average |
---|---|---|
Chicago | 173 | 514.45 |
Australia | 143 | 534.27 |
Greece | 101 | 566.34 |
Brazil | 97 | 595.88 |
South Africa | 77 | 644.16 |
Maryland | 62 | 674.19 |
Oklahoma | 60 | 683.33 |
The Philippines | 53 | 690.57 |
Indonesia | 51 | 701.96 |
Ethiopia | 37 | 794.30 |
Grover Cleveland | 25 | 800 |
Ferrari | 17 | 811.76 |
Andromeda | 16 | 825 |
Martin Van Buren | 16 | 825 |
Nicaragua | 15 | 853.33 |
Bhutan | 15 | 853.33 |
Malaysia | 14 | 871.43 |
As You Like It | 13 | 876.92 |
Dorothy Parker | 12 | 883.33 |
Sisyphus | 11 | 927.27 |
Padua | 9 | 1000 |
Meanwhile, the Pareto front of the second round is:
answer | J2 total | J2 average |
---|---|---|
Chicago | 135 | 989.63 |
Australia | 131 | 1016.79 |
Brazil | 96 | 1133.33 |
South Africa | 91 | 1134.07 |
Georgia | 84 | 1285.71 |
The Philippines | 67 | 1367.16 |
Algeria | 46 | 1373.91 |
Finland | 44 | 1472.73 |
Andrew Jackson | 35 | 1485.71 |
Malta | 34 | 1576.47 |
George Sand | 25 | 1584 |
The Orinoco | 21 | 1600 |
Tonga | 20 | 1700 |
Ghana | 17 | 1741.18 |
Avignon | 16 | 1775 |
Mozambique | 15 | 1813.33 |
Sikhism | 15 | 1813.33 |
Aeschylus | 13 | 1938.46 |
Ganymede | 9 | 2000 |
Certainly, there are some interesting points in this data. I had no idea George Sand appeared so often, and as such a high-value response, for instance. Nonetheless, this method of analysis still leaves something to be desired: for one, it maintains the earlier bias towards geographic locations. More egregiously, it includes quite a few answers that are especially common without being high value; the first item on the Pareto front of the second round that I would really consider “interesting” is Malta.
Perhaps there’s some sort of value function that could be used instead to rank answers. The most natural choice is to total up the point values of each answer’s appearances, giving the total dollar amount an answer has been worth over the last 25 years. Unfortunately, I don’t find this solution very satisfying either. Looking at these totals, we see more of the usual suspects: Chicago, Australia, Andrew Jackson, etc. It seems that simply multiplying the total number of appearances by the average point value isn’t going to work either.
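For what it’s worth, that total is a one-line groupby under the same assumed layout as before:

```python
# Total dollar value each answer has been worth in Double Jeopardy over the window.
total_value = double.groupby("answer")["value"].sum().sort_values(ascending=False)
```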
Let’s take a moment to consider how the average point values are distributed. They look a bit like this:


The large spikes in these graphs come from the fact that most answers occur only once or twice, so their average values are very likely to land exactly on one of the few values a single clue can take. If we restrict the graphs to answers that have appeared more than 3 times over the time span investigated, we get a much clearer picture:


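For reference, the restricted histograms amount to something like this, reusing the summary frame from before:

```python
import matplotlib.pyplot as plt

# Distribution of average clue values, ignoring answers seen three or fewer times.
frequent = j2_stats[j2_stats["total"] > 3]
frequent["average"].plot.hist(bins=40)
plt.xlabel("Average Double Jeopardy value")
plt.show()
```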
This data is distributed roughly normally! That suggests a more reasonable metric: multiply the frequency of an answer’s appearances by the difference between its average value and the mean of all average values. This has the added benefit of eliminating anything whose average value isn’t above the mean. Additionally, every clue in Jeopardy! has a value divisible by $200, so dividing that factor out makes the metric that little bit cleaner.
To summarize: if an answer appears $n$ times with an average value of $\bar{v}$, then the “interest” of that answer can be calculated as

$$\text{interest} = n \cdot \frac{\bar{v} - 600}{200}$$
Of course, that only applies to Single Jeopardy, where the average value is 600. For Double Jeopardy, the calculation would be $n \cdot \frac{\bar{v} - 1200}{200}$. After ranking items by this metric, I’m happy to say that I’m broadly satisfied with it! It does seem to highlight topics that are important yet obscure; a lot of these names and places are standard parts of college curricula, but not well known by the general population, which is exactly what I wanted. I tinkered with a few alternate formulas, but they always seemed to emphasize what I wanted to see, rather than neutrally reflecting the data. Besides, the mathematical simplicity of this formula makes it feel like it has a very strong connection to the game. It’s easy to see how studying the topics that score high here would translate to an improved average score while playing Jeopardy! at home. It even seems to de-emphasize geography!
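Expressed in code, the Double Jeopardy version of the metric looks roughly like this (swap 1200 for 600 in Single Jeopardy; the frame and column names follow the earlier sketches):

```python
# "Interest" = appearances x (average value - 1200) / 200, per the formula above.
j2_stats["interest"] = j2_stats["total"] * (j2_stats["average"] - 1200) / 200
print(j2_stats.sort_values("interest", ascending=False).head(20))
```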
This time, while displaying the top scorers, I think I’ll separate the geographic answers out entirely. Without further ado, here are the top scorers! In this chart, J1 represents answers from the Single Jeopardy round, while J2 represents answers from Double Jeopardy.
Overall, I’m quite pleased with this! These results mostly line up with what I expected, with names like Keats, Solzhenitsyn, and Woolf winding up as big winners. There are a few unexpected results as well; I’m shocked that curling wound up so high, for example. If you want proof that this metric highlights the top-left edge of the data, look at the scatter plots from before, now with the 100 most “interesting” data points in orange:


All the code used is available in this project’s GitHub repository. Feel free to play around with it yourself, and if you find anything interesting or surprising, contact me and let me know!