This article was spun off from a larger, more technical article about how this data was collected. I wanted to focus on some fun statistical analysis of the answers on Jeopardy!, so feel free to skip the main article if you’re only interested in the stats.
What are the most common correct responses? #
In the previous article, I created a massive dataframe containing the categories, hints, answers, dates, and point values of every question aired on Jeopardy!. The first question that I’m curious about is simple: what are the most common correct responses on Jeopardy!? What people, places, and things are most valued by the Jeopardy! writers?
Simply counting the number of times each word or phrase appears as a correct response gives some surprising results:
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Response | Australia | China | Japan | Chicago | France | India | Spain | California | Mexico | Alaska | Canada | India | Hawaii | Florida | Texas |
Count | 511 | 484 | 472 | 471 | 462 | 434 | 427 | 420 | 390 | 380 | 376 | 354 | 354 | 351 | 340 |
Geographical locations are repeated far more frequently than famous people or works of art! The most frequent correct response is “What is Australia?”, appearing 511 times over the show’s run. After that comes China, then Japan, Chicago, France, and so on. As a matter of fact, all of the top 37 responses are geographic locations! (The trend is finally broken by Napoleon, appearing 243 times.)
Geography certainly isn’t a vastly more common subject than history, so what could cause this pattern? Part of the answer is most likely that geography questions draw from a narrower range of answers than other categories; there are fewer cities than people, after all. It’s also possible that locations are more common as responses, while historical figures appear more often in the clues themselves. I’ve asked a few friends what they think the most popular responses would be, and they usually intuit that the top responses would be geographical.
However, there is a second factor at play here: people’s names can appear in a number of different ways. Let’s return to Napoleon as an example. In addition to the 243 appearances of “Napoleon”, the table also contains:
Napoleon Bonaparte | Napoléon | Napoleon (Bonaparte) | Napoléon Bonaparte | Napoleon (I) | Napoléon (Bonaparte) |
---|---|---|---|---|---|
47 times | 14 times | 4 times | 3 times | 3 times | 1 time |
One might wonder why so many variations on a single name are possible. While contestants may say the full name while answering, the judges will often (but not always) accept just the surname, especially when referring to politicians. To eliminate ambiguity, the people maintaining the archive will often (but not always) add the given name to a response in parentheses. Ironically, Napoleon is a counterexample, as his first name is much more recognizable than his last.
Regardless of these details, this poses a serious problem for our understanding of the data. A first instinct might be to simply sum up every cell containing the word “Napoleon”, but this approach is fraught: it would sweep in over twenty instances of variations on “Napoleon III”, as well as 6 instances of “Napoleon Dynamite”. The problem is even worse for less distinctive names. Here’s a breakdown of answers containing “Ford” as the last part of a person’s name, appearing three or more times:

On a more personal note, the corresponding chart for people named Johnson is even more disastrous:

This is a massive problem! How could we possibly handle it? As I see it, there’s a handful of strategies that could be used to collect different versions of one person together, some of which might be used simultaneously:
- Find all variations of responses that end with the same surname, and sum up their counts into a single number.
- Take the counts of answers that consist of a single surname, and distribute that number proportionally across the different names with that surname.
- Ask an LLM to consider the question as a whole, and determine the exact identity of each human included.
- Group together variations that differ only by a pair of parentheses or diacritical marks. For example, replace “(Gerald) Ford” with “Gerald Ford”.
Option 1 can be disregarded immediately. This would require that Gerald Ford and Harrison Ford be counted as one person, which is unacceptable.
Option 2 seems like a better idea at first, but falls apart the more one considers it. It would be nice to split the 119 instances of “Ford” and distribute them across “Gerald Ford”, “Henry Ford”, “Betty Ford”, and “Harrison Ford”. However, it isn’t valid to assume that referring to someone by surname alone is as common for presidents as it is for actors. In fact, it isn’t fair to assume that all 119 instances of “Ford” refer to a person at all; surely, many refer to the Ford Motor Company.
Option 3 is tempting, but LLMs are always prone to error. This concern is easy to overstate; LLMs are getting more accurate all the time, especially on simple factual questions. However, determining their exact level of inaccuracy would take testing and comparison that is currently outside the scope of this project.
That leaves option 4, which really does seem reasonable. This is a small change overall: adding the 13 instances of “(Gerald) Ford” to the 128 instances of “Gerald Ford” is not likely to be hugely impactful. However, it’s also very unlikely to have negative side effects. This will be implemented moving forward.
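A minimal sketch of what that normalization might look like in Python; the function name is mine, and it assumes responses are plain strings:

```python
import re
import unicodedata

def normalize_answer(answer: str) -> str:
    """Fold together variants that differ only by diacritics or parentheses."""
    # Strip diacritical marks: "Napoléon" -> "Napoleon"
    decomposed = unicodedata.normalize("NFKD", answer)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop the parentheses themselves but keep what was inside: "(Gerald) Ford" -> "Gerald Ford"
    no_parens = stripped.replace("(", "").replace(")", "")
    # Tidy up any doubled whitespace left behind
    return re.sub(r"\s+", " ", no_parens).strip()

# All of these collapse to "Napoleon Bonaparte":
print({normalize_answer(s) for s in ["Napoléon Bonaparte", "Napoleon (Bonaparte)", "Napoleon Bonaparte"]})
```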
Finally, it’s probably a good idea to restrict the timespan the program searches. What’s considered important knowledge has changed over the years, as have the people on Jeopardy!’s writing staff. Viewing just the last 5 years seems like a decent compromise between quantity and relevance of data. With that, we can finally note the most common responses of the last few years. Here’s the top 75:
It’s also fairly easy to sift through the first rows of this table by hand and remove all answers that are geographic locations:
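Roughly, both steps look like this in pandas; df is the clue dataframe from the previous article, the air_date and answer column names are assumptions about its layout, and the geography set is hand-curated and obviously incomplete:

```python
import pandas as pd

# Keep only clues from the last five years of the archive.
recent = df[df["air_date"] >= df["air_date"].max() - pd.DateOffset(years=5)]
top_75 = recent["answer"].value_counts().head(75)

# Sift out a hand-made list of geographic answers.
GEOGRAPHIC = {"Australia", "China", "Japan", "Chicago", "France", "India", "Spain", "California"}
top_non_geographic = top_75[~top_75.index.isin(GEOGRAPHIC)]
```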
Separating data by value #
I’m also curious whether these results change for different point values. Chicago is a very well-known location for Americans, so it’s possible that it only appears so often because it makes for an easy “gimme” clue. In general, clues with higher point values are much harder; maybe common responses are only common for easy clues. Would Chicago still be the most common response if we limited our search to expensive clues? Let’s break the answers down by difficulty and see if the results change. To account for splitting the data into ten parts, I’ll extend the search window to 25 years. Here are the 20 most common responses for each clue value in the Single Jeopardy round:
$200 answer | count | $400 answer | count | $600 answer | count | $800 answer | count | $1000 answer | count |
---|---|---|---|---|---|---|---|---|---|
China | 108 | Australia | 64 | California | 57 | Chicago | 41 | Australia | 39 |
Hawaii | 106 | Alaska | 62 | Chicago | 54 | France | 36 | Maine | 35 |
Japan | 104 | Chicago | 58 | Australia | 45 | New York | 36 | Brazil | 35 |
California | 74 | California | 54 | China | 44 | California | 34 | Chicago | 34 |
Alaska | 73 | France | 54 | Texas | 44 | Australia | 34 | Greece | 32 |
Chicago | 72 | China | 53 | Spain | 42 | Spain | 34 | South Africa | 31 |
Australia | 67 | Japan | 53 | India | 39 | China | 33 | France | 29 |
Mexico | 66 | Spain | 53 | France | 38 | India | 33 | Sweden | 29 |
Florida | 65 | Canada | 52 | Florida | 38 | Alaska | 32 | Spain | 26 |
France | 64 | Mexico | 51 | Japan | 37 | Greece | 32 | Japan | 26 |
India | 59 | India | 50 | London | 35 | Minnesota | 32 | Oklahoma | 25 |
George Washington | 58 | Florida | 49 | Hawaii | 35 | Pennsylvania | 31 | Belgium | 25 |
Ireland | 58 | Boston | 44 | Germany | 35 | Mexico | 30 | Utah | 25 |
Boston | 56 | New York | 44 | Sweden | 35 | Maine | 30 | Texas | 24 |
Canada | 55 | Texas | 44 | New Orleans | 34 | Canada | 29 | Ireland | 24 |
Russia | 52 | Egypt | 40 | Italy | 34 | New Mexico | 29 | Wyoming | 24 |
Egypt | 50 | San Francisco | 38 | Alaska | 33 | Texas | 28 | Maryland | 24 |
New Orleans | 50 | London | 37 | Mars | 32 | Italy | 28 | Norway | 24 |
Paris | 50 | Switzerland | 37 | Greece | 31 | Israel | 28 | Portugal | 24 |
New York | 49 | Hawaii | 36 | South Africa | 31 | Montana | 28 | Thailand | 24 |
And for Double Jeopardy:
$400 answer | count | $800 answer | count | $1200 answer | count | $1600 answer | count | $2000 answer | count |
---|---|---|---|---|---|---|---|---|---|
China | 111 | Australia | 57 | Japan | 46 | Australia | 48 | Brazil | 33 |
Japan | 87 | Chicago | 57 | Australia | 45 | Sweden | 40 | Denmark | 33 |
France | 85 | India | 55 | Sweden | 45 | Georgia | 37 | Portugal | 32 |
Australia | 81 | Spain | 54 | France | 42 | Italy | 36 | India | 30 |
Paris | 77 | France | 47 | Spain | 40 | Brazil | 34 | Andrew Jackson | 30 |
California | 74 | China | 46 | Canada | 40 | Florida | 34 | Sweden | 28 |
Mexico | 73 | Mexico | 46 | India | 39 | France | 32 | Indonesia | 28 |
Cleopatra | 70 | Japan | 43 | Italy | 37 | Spain | 32 | Georgia | 27 |
Spain | 68 | Paris | 43 | Portugal | 36 | South Africa | 32 | Norway | 26 |
London | 67 | Egypt | 43 | Chicago | 35 | Maine | 32 | Poland | 26 |
Alaska | 67 | California | 42 | Denmark | 35 | India | 31 | the Netherlands | 26 |
Ireland | 67 | Ireland | 41 | Greece | 34 | Mexico | 31 | Spain | 25 |
Italy | 66 | Italy | 41 | China | 33 | Switzerland | 31 | North Carolina | 25 |
India | 64 | South Africa | 41 | Brazil | 33 | Norway | 30 | Finland | 25 |
Chicago | 62 | Canada | 39 | Paris | 32 | Portugal | 29 | South Africa | 24 |
Canada | 57 | Venus | 39 | South Africa | 32 | Chicago | 29 | Chicago | 24 |
George Washington | 56 | Napoleon | 37 | Texas | 32 | Denmark | 29 | New Hampshire | 24 |
Hawaii | 55 | Rome | 37 | Germany | 31 | Andrew Jackson | 28 | France | 23 |
Florida | 52 | Hamlet | 37 | New York | 31 | China | 27 | Egypt | 22 |
Egypt | 52 | Texas | 36 | Napoleon | 30 | Greece | 26 | the Philippines | 22 |
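For anyone following along in pandas, tables like the ones above boil down to a value count within each dollar value. Here’s a rough sketch, with the column names round, value, and answer standing in for whatever the real dataframe uses:

```python
# Sketch only: "df" is the full clue dataframe; "round", "value", and "answer" are assumed column names.
single = df[df["round"] == 1]
for value, clues in single.groupby("value"):
    print(f"Top responses for ${value} clues:")
    print(clues["answer"].value_counts().head(20))
```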
Here’s the same data, with all the geographic locations removed:
$200 answer | count | $400 answer | count | $600 answer | count | $800 answer | count | $1000 answer | count |
---|---|---|---|---|---|---|---|---|---|
George Washington | 58 | Ronald Reagan | 36 | Mars | 32 | 4 | 21 | Andrew Jackson | 17 |
red | 49 | 2 | 30 | 3 | 25 | Eisenhower | 21 | Grover Cleveland | 17 |
Abraham Lincoln | 47 | red | 29 | white | 24 | Mars | 18 | golf | 16 |
McDonald’s | 47 | gold | 29 | basketball | 23 | Venus | 18 | Theodore Roosevelt | 16 |
Napoleon | 46 | Wisconsin | 28 | Ronald Reagan | 22 | golf | 18 | 4 | 15 |
gold | 43 | George Washington | 27 | Venus | 22 | Eleanor Roosevelt | 18 | Calvin Coolidge | 15 |
Julius Caesar | 42 | tea | 27 | Richard Nixon | 22 | Theodore Roosevelt | 18 | white | 15 |
Lincoln | 41 | 3 | 26 | Napoleon | 21 | Andrew Jackson | 17 | 12 | 15 |
Madonna | 41 | Maine | 26 | Thomas Jefferson | 20 | blue | 17 | Henry VIII | 14 |
Elvis Presley | 39 | Sweden | 25 | baseball | 20 | Jacob | 17 | Julius Caesar | 14 |
water | 38 | Mars | 24 | Andrew Jackson | 20 | 3 | 16 | iron | 14 |
milk | 36 | Thomas Jefferson | 24 | George Washington | 19 | Richard Nixon | 16 | Uranus | 14 |
Cleopatra | 36 | coffee | 24 | Abraham Lincoln | 19 | Mark Twain | 16 | Solomon | 13 |
Babe Ruth | 35 | rice | 24 | Julius Caesar | 19 | Henry VIII | 16 | Jupiter | 13 |
white | 34 | World War I | 24 | 4 | 18 | Solomon | 15 | Saturn | 13 |
Moses | 34 | Elvis Presley | 23 | green | 17 | Pocahontas | 15 | Neptune | 13 |
2 | 34 | Venus | 23 | Jupiter | 17 | nitrogen | 15 | Woodrow Wilson | 13 |
Coca-Cola | 33 | Pennsylvania | 23 | blue | 17 | basketball | 14 | Job | 12 |
golf | 32 | New Jersey | 23 | Hamlet | 17 | Jupiter | 14 | Othello | 12 |
Richard Nixon | 32 | oil | 23 | Buddhism | 17 | 7 | 14 | John Adams | 12 |
$400 answer | count | $800 answer | count | $1200 answer | count | $1600 answer | count | $2000 answer | count |
---|---|---|---|---|---|---|---|---|---|
Cleopatra | 70 | Venus | 39 | Napoleon | 30 | Andrew Jackson | 28 | Andrew Jackson | 30 |
George Washington | 56 | Napoleon | 37 | Mozart | 30 | Woodrow Wilson | 22 | Woodrow Wilson | 21 |
Napoleon | 51 | Hamlet | 37 | Thomas Jefferson | 28 | Henry VIII | 21 | Eugene O’Neill | 20 |
Julius Caesar | 49 | Macbeth | 34 | David | 26 | Jupiter | 20 | William Faulkner | 19 |
Michelangelo | 47 | Julius Caesar | 32 | Galileo | 25 | Thomas Jefferson | 19 | Virginia Woolf | 18 |
Mars | 43 | Abraham Lincoln | 31 | Michelangelo | 24 | Theodore Roosevelt | 19 | Henry Moore | 18 |
Joan of Arc | 43 | Picasso | 31 | Hamlet | 23 | Richard III | 18 | Richard III | 17 |
Mark Twain | 38 | Thomas Jefferson | 30 | Mars | 22 | Eleanor Roosevelt | 18 | John Quincy Adams | 17 |
Abraham Lincoln | 38 | Cleopatra | 28 | Beethoven | 22 | King Lear | 17 | Maria Theresa | 17 |
Hamlet | 36 | Ronald Reagan | 28 | Theodore Roosevelt | 22 | Charlemagne | 17 | Claudius | 16 |
Alexander the Great | 36 | Mars | 27 | King Lear | 21 | A Midsummer Night’s Dream | 17 | Aeschylus | 16 |
Ronald Reagan | 35 | Mozart | 27 | Richard III | 20 | Rembrandt | 17 | Twelfth Night | 15 |
Romeo and Juliet | 35 | Lincoln | 26 | Henry VIII | 20 | George Eliot | 17 | Orpheus | 15 |
Columbus | 35 | Queen Victoria | 26 | Picasso | 19 | Herbert Hoover | 17 | John Adams | 15 |
Venus | 34 | George Washington | 25 | 3 | 19 | Archimedes | 17 | Raphael | 15 |
Agatha Christie | 34 | World War I | 25 | Sylvia Plath | 19 | Galileo | 16 | Nathaniel Hawthorne | 15 |
gold | 33 | David | 25 | Gerald Ford | 18 | Georgia O’Keeffe | 16 | Aristophanes | 15 |
Shakespeare | 33 | Michelangelo | 24 | John Adams | 18 | English | 16 | 7 | 15 |
water | 33 | Benjamin Franklin | 24 | Thomas Hardy | 18 | Venus | 15 | Sir Walter Scott | 14 |
Beethoven | 32 | Galileo | 24 | Venus | 17 | Dylan Thomas | 15 | Zachary Taylor | 14 |
Reading through this data is endlessly fascinating to me. Of course, I care more than the average bear about this show, so it’s hard for me to tell which of its peculiarities are interesting to the average person. Here are just a few of the odd patterns that appear when breaking the data down by value:
- Although the Beatles are notorious as a common subject on Jeopardy!, their prevalence as a response drops off rapidly after the $200 clue
- The same is true for Shakespeare, although the titles of his plays see reasonable representation across different clue values
- In general, some answers seem to favor certain values. “Barcelona” is three times as likely to be the answer to the $800 clue as to any other value in Single Jeopardy
- Aeschylus is an even more extreme example: he has appeared as an answer only three times in all of Single Jeopardy, but twice as a $1600 clue and sixteen times as a $2000 clue in Double Jeopardy
It’s worth noting that none of these numbers are incredibly large. I don’t want to overstate the significance of Aeschylus’ 16 appearances, especially since there have been about 36,000 clues worth $2000 in the past 25 years. Nonetheless, we’ve at least gained some insight into our question! It does seem that expensive clues tend to be more spread out than cheap ones, but this doesn’t seem to be the whole story. Let’s get a more in-depth picture of the frequencies of different answers. Here’s a graph showing the frequencies of answers by their index in the grid:

Converting both axes to a logarithmic scale brings the graph into clearer focus:

To get a better idea of how this data is shaped, we might separate this into the specific point values:

These graphs are not linear, so the equations that fit them closely are not simple. I doubt it would be informative to list every equation for the curves of best fit, but here’s a graph showing quadratic regression curves for the previous graph:

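As a rough illustration of the kind of fit shown above (using numpy and matplotlib; the variable df and the column names are the same assumptions as in the earlier sketches):

```python
import numpy as np
import matplotlib.pyplot as plt

# Rank-frequency curve for the $200 clues in Single Jeopardy, on log-log axes,
# with a quadratic regression curve laid on top.
counts = df[(df["round"] == 1) & (df["value"] == 200)]["answer"].value_counts()
log_rank = np.log10(np.arange(1, len(counts) + 1))
log_freq = np.log10(counts.to_numpy())

coeffs = np.polyfit(log_rank, log_freq, deg=2)  # quadratic fit in log-log space
plt.scatter(log_rank, log_freq, s=2, label="observed")
plt.plot(log_rank, np.polyval(coeffs, log_rank), label="quadratic fit")
plt.xlabel("log10(rank)")
plt.ylabel("log10(count)")
plt.legend()
plt.show()
```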
This brings the confirmation of our suspicions into clearer view. Easier clues, like those worth $200, focus predominantly on a small handful of answers, while more expensive clues are spread across a wider variety of topics. However, the fact that answers are more widely spread for difficult clues than for easy ones does not mean that they are all the same answers. As we’ve seen, some answers, like “Virginia Woolf” or “Aeschylus”, appear almost exclusively as responses to difficult clues, even though they appear relatively often.
To bring this into sharper focus, compare the distribution of “Virginia Woolf” in Double Jeopardy to that of “Elizabeth I”:
Response | $400 | $800 | $1200 | $1600 | $2000 |
---|---|---|---|---|---|
Virginia Woolf | 2 | 8 | 8 | 11 | 11 |
Elizabeth I | 21 | 8 | 7 | 4 | 0 |
Both women have appeared exactly 40 times in Double Jeopardy over the last 40 years, but there’s a clear difference in distribution! “Who is Virginia Woolf?” tends to be a much higher-scoring response than “Who is Elizabeth I?”. Typically, when I ask people what they assume the difference is between expensive and inexpensive clues in Jeopardy!, they reason that expensive clues ought to ask about more “obscure” people and things, and this does hold true for our examples of Woolf and Elizabeth. Virginia Woolf is mostly discussed in specialized English Lit courses at universities, while Elizabeth I is well known even to elementary students, so it’s probably fair to characterize Woolf as “more obscure”. That said, it’s significant that they appear with equal frequency. It’s as though Virginia Woolf is obscure enough to appear mostly in very expensive clues, but important enough to appear just as often as one of England’s most famous monarchs.
Measuring how interesting a response is #
These figures that appear frequently, but only as high-value answers, are very interesting to me. It’s as though the Jeopardy! writers have set out a specific list of cultural touchstones that separate highly knowledgeable people from everyone else. To find these figures more easily, I’ll have to define a metric that indicates how “expensive” an answer is, on average. I waffled on this for a while, but in the end I chose to treat an answer’s value as interval data rather than ordinal, and measured the mean value of each answer within each round. With this metric included, the distribution for Woolf and Elizabeth I looks like this:
Response | $400 | $800 | $1200 | $1600 | $2000 | Total J2 appearances | Average J2 value |
---|---|---|---|---|---|---|---|
Virginia Woolf | 2 | 8 | 8 | 11 | 11 | 40 | 1410 |
Elizabeth I | 21 | 8 | 7 | 4 | 0 | 40 | 740 |
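In pandas terms, this metric is just a groupby with a count and a mean. A minimal sketch, reusing the assumed column names from earlier (the real labels in my dataframe may differ):

```python
# Per-answer appearance count and mean clue value for Double Jeopardy.
double = df[df["round"] == 2]
j2_stats = (
    double.groupby("answer")["value"]
          .agg(total="count", average="mean")   # "total" and "average" mirror the J2 columns above
          .reset_index()
)
print(j2_stats[j2_stats["answer"].isin(["Virginia Woolf", "Elizabeth I"])])
```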
Out of pure interest, we might display every answer on a massive scatterplot, comparing its average value in the first round against its total number of appearances in that round.

And here’s a similar chart for the Double Jeopardy round:

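Given the summary frame sketched above, each of these charts is essentially one matplotlib call:

```python
import matplotlib.pyplot as plt

# One dot per answer: appearances on the x-axis, mean clue value on the y-axis.
plt.scatter(j2_stats["total"], j2_stats["average"], s=3, alpha=0.3)
plt.xlabel("Total Double Jeopardy appearances")
plt.ylabel("Average Double Jeopardy value")
plt.show()
```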
Thinking of “interesting” answers as those with very high average values isn’t adequate. To explain what I mean, consider that there are 13,385 answers that have appeared only as $2000 clues in Double Jeopardy, and 12,098 of these have appeared only once. Many of them are highly specific combinations of things that would only make sense in context. Some examples are:
- "B-29 (the one that carried the atomic bomb)"
- "BMO"
- "people who have seconds"
- "Runaway Bride of Chucky".
- "Sunday in the Park with George Orwell".
- "The Lion, the Witch, and the Wardrobe Malfunction".
If anything, the biggest surprises here are the answers I would have expected to be common, but aren’t. “The Man with the Golden Arm” and “The Three-Body Problem” each appear only once, although I thought they were fairly well-known and important books. Regardless, this demonstrates that we should think of “interesting” answers as those that balance average value against frequency of appearance, and that we need a method that isolates them. One option is to consider the Pareto frontier of the set of all answers. In this case, that means the set of answers whose average value is higher than that of every answer less common than it. In other words, these data points are “Pareto optimal”: they cannot be further optimized for frequency of appearance without sacrificing average value, and vice versa.
I find this a bit surprising, but Pandas doesn’t have a built-in function to find the Pareto frontier. I always found that a bit of a shame, so for the sake of anyone reading, I’m including my function for finding the Pareto frontier along two optimized variables. This function is limited to 2-D Pareto frontiers, and is highly optimized for this specific circumstance.
```python
import pandas as pd

# Given a dataframe df, return the Pareto front of that data as another dataframe, optimized along
# the columns named x and y. Any columns listed in "other" are carried along in the result.
def pareto_front_2D(df, x, y, other=[]):
    front = df[other + [x, y]].copy().dropna()
    # Quick pre-filter: the point with the highest y value dominates everything with a lower x value
    # (lower x and lower-or-equal y), so those rows can be discarded immediately. Then do the same
    # thing with the roles of x and y swapped.
    front = front.sort_values(by=[y, x], ascending=False, ignore_index=True)
    x_record = front.loc[0, x]
    front = front[front[x] >= x_record]
    front = front.sort_values(by=[x, y], ascending=False, ignore_index=True)
    y_record = front.loc[0, y]
    front = front[front[y] >= y_record]
    front.reset_index(drop=True, inplace=True)
    # The rows are now sorted by x (descending), so a row belongs on the front only if it sets a
    # new record along y; anything else is dominated by an earlier row and gets dropped.
    x_prev = front.loc[0, x]
    for i in front.index:
        if front.loc[i, y] < y_record:
            front.drop(index=i, inplace=True)
        elif front.loc[i, y] > y_record:
            y_record = front.loc[i, y]
            x_prev = front.loc[i, x]
        elif (i != 0) and (front.loc[i, x] < x_prev):
            # Ties the current y record but with a strictly smaller x: dominated, drop it.
            front.drop(index=i, inplace=True)
        else:
            x_prev = front.loc[i, x]
    return front
```
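Applied to a per-answer summary like the one sketched earlier, the call looks something like this (column names are again placeholders for whatever the real frame uses):

```python
# Answers whose average value beats that of every more common answer.
j2_front = pareto_front_2D(j2_stats, x="total", y="average", other=["answer"])
print(j2_front)
```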
The Pareto front for the single Jeopardy round looks like this:

And the chart for the Double Jeopardy round looks quite similar:

The precise values on the Pareto front for the first round of Jeopardy are:
answer | J1 total | J1 average |
---|---|---|
Chicago | 173 | 514.45 |
Australia | 143 | 534.27 |
Greece | 101 | 566.34 |
Brazil | 97 | 595.88 |
South Africa | 77 | 644.16 |
Maryland | 62 | 674.19 |
Oklahoma | 60 | 683.33 |
The Philippines | 53 | 690.57 |
Indonesia | 51 | 701.96 |
Ethiopia | 37 | 794.30 |
Grover Cleveland | 25 | 800 |
Ferrari | 17 | 811.76 |
Andromeda | 16 | 825 |
Martin Van Buren | 16 | 825 |
Nicaragua | 15 | 853.33 |
Bhutan | 15 | 853.33 |
Malaysia | 14 | 871.43 |
As You Like It | 13 | 876.92 |
Dorothy Parker | 12 | 883.33 |
Sisyphus | 11 | 927.27 |
Padua | 9 | 1000 |
Meanwhile, the Pareto front of the second round is:
answer | J2 total | J2 average |
---|---|---|
Chicago | 135 | 989.63 |
Australia | 131 | 1016.79 |
Brazil | 96 | 1133.33 |
South Africa | 91 | 1134.07 |
Georgia | 84 | 1285.71 |
The Philippines | 67 | 1367.16 |
Algeria | 46 | 1373.91 |
Finland | 44 | 1472.73 |
Andrew Jackson | 35 | 1485.71 |
Malta | 34 | 1576.47 |
George Sand | 25 | 1584 |
The Orinoco | 21 | 1600 |
Tonga | 20 | 1700 |
Ghana | 17 | 1741.18 |
Avignon | 16 | 1775 |
Mozambique | 15 | 1813.33 |
Sikhism | 15 | 1813.33 |
Aeschylus | 13 | 1938.46 |
Ganymede | 9 | 2000 |
Certainly, there are some interesting points in this data. I had no idea George Sand appeared so often, and as such a high-value response, for instance. Nonetheless, this method of analysis still leaves something to be desired: for one, it maintains the earlier bias towards geographic locations. More egregiously, it includes quite a few answers that are especially common without being high value; the first item on the Pareto front of the second round that I would really consider “interesting” is Malta.
Perhaps there’s some sort of value function that could be used instead to rank answers. The most natural choice is to total up the point values of each answer’s appearances, giving the total dollar amount an answer has been worth over the last 25 years. Unfortunately, I don’t find this solution very satisfying either. Looking at these totals, we see more of the usual suspects: Chicago, Australia, Andrew Jackson, etc. It seems that simply multiplying the total number of appearances by the average point value isn’t going to work either.
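For what it’s worth, that total is a one-line groupby under the same assumed layout as before:

```python
# Total dollar value each answer has been worth in Double Jeopardy over the window.
total_value = double.groupby("answer")["value"].sum().sort_values(ascending=False)
```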
Let’s take a moment to consider how the average point values are distributed. They look a bit like this:


The large spikes in these graphs come from the fact that most answers occur only once or twice, so their average values are very likely to land exactly on one of the few values a single clue can take. If we restrict the graphs to answers that have appeared more than 3 times over the time span investigated, we get a much clearer picture:


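For reference, the restricted histograms amount to something like this, reusing the summary frame from before:

```python
import matplotlib.pyplot as plt

# Distribution of average clue values, ignoring answers seen three or fewer times.
frequent = j2_stats[j2_stats["total"] > 3]
frequent["average"].plot.hist(bins=40)
plt.xlabel("Average Double Jeopardy value")
plt.show()
```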
This data is distributed roughly normally! That suggests a more reasonable metric: multiply the frequency of an answer’s appearances by the difference between its average value and the mean of all average values. This has the added benefit of eliminating anything whose average value isn’t above the mean. Additionally, every clue in Jeopardy! has a value divisible by $200, so dividing that factor out makes the metric that little bit cleaner.
To summarize: if an answer appears $n$ times with an average value of $\bar{v}$, then the “interest” of that answer can be calculated as

$$\text{interest} = n \cdot \frac{\bar{v} - 600}{200}$$
Of course, that only applies to Single Jeopardy, where the average value is 600. For Double Jeopardy, the calculation would be $n \cdot \frac{\bar{v} - 1200}{200}$. After ranking items by this metric, I’m happy to say that I’m broadly satisfied with it! It does seem to highlight topics that are important yet obscure; a lot of these names and places are standard parts of college curricula, but not well known by the general population, which is exactly what I wanted. I tinkered with a few alternate formulas, but they always seemed to emphasize what I wanted to see, rather than neutrally reflecting the data. Besides, the mathematical simplicity of this formula makes it feel like it has a very strong connection to the game. It’s easy to see how studying the topics that score high here would translate to an improved average score while playing Jeopardy! at home. It even seems to de-emphasize geography!
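Expressed in code, the Double Jeopardy version of the metric looks roughly like this (swap 1200 for 600 in Single Jeopardy; the frame and column names follow the earlier sketches):

```python
# "Interest" = appearances x (average value - 1200) / 200, per the formula above.
j2_stats["interest"] = j2_stats["total"] * (j2_stats["average"] - 1200) / 200
print(j2_stats.sort_values("interest", ascending=False).head(20))
```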
This time, while displaying the top scorers, I think I’ll separate the geographic answers out entirely. Without further ado, here are the top scorers! In this chart, J1 represents answers from the Single Jeopardy round, while J2 represents answers from Double Jeopardy.
Overall, I’m quite pleased with this! These results mostly line up with what I expected, with names like Keats, Solzhenitsyn, and Woolf winding up as big winners. There are a few unexpected results as well; I’m shocked that curling wound up so high, for example. If you want proof that this metric highlights the top-left edge of the data, look at the scatter plots from before, now with the 100 most “interesting” data points in orange:


All the code used is available in this project’s GitHub repository. Feel free to play around with it yourself, and if you find anything interesting or surprising, contact me and let me know!