
Reading into the common responses on Jeopardy!

·5072 words·24 mins
Jules Johnson
The code for this project can be found at its GitHub repository.

This article was spun off from a larger, more technical article on the subject of how this data was collected. I wanted to focus on some fun statistical analysis for the answers on Jeopardy!, so feel free to skip the main article if you’re only interested in the stats.

What are the most common correct responses?

In the previous article, I created a massive dataframe containing the categories, hints, answers, dates, and point values of every question aired on Jeopardy!. The first question that I’m curious about is simple: what are the most common correct responses on Jeopardy!? What people, places, and things are most valued by the Jeopardy! writers?

Simply performing a count on the number of times a word or phrase appears as a correct response gives some surprising results:

rank response count
1 Australia 511
2 China 484
3 Japan 472
4 Chicago 471
5 France 462
6 India 434
7 Spain 427
8 California 420
9 Mexico 390
10 Alaska 380
11 Canada 376
12 India 354
13 Hawaii 354
14 Florida 351
15 Texas 340
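
A count like this takes only a line or two of Pandas. Here's a minimal sketch on a toy dataframe; the column name `response` is an assumption made for illustration and may not match the name used in the repository.

```python
import pandas as pd

# Toy stand-in for the full dataframe of clues.
# The column name "response" is an assumption, not necessarily the real one.
clues = pd.DataFrame({
    "response": ["Australia", "China", "Australia", "Napoleon", "China", "Australia"],
})

# value_counts() tallies identical strings and sorts them from most to least common.
top_responses = clues["response"].value_counts().head(15)
print(top_responses)
```

Note that this treats every distinct string as a distinct response, which is exactly why name variations become a problem.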

Geographical locations are repeated far more frequently than famous people or works of art! The most frequent correct response is “What is Australia?”, appearing 511 times over the show’s run. After that comes China, then Japan, Chicago, France, and so on. As a matter of fact, all of the top 37 responses are geographic locations! (The trend is broken by Napoleon, appearing 243 times.)

Geography certainly isn’t a vastly more common subject than history, so what could cause this pattern? Part of the answer is most likely that geography questions tend to draw on a narrower range of answers than other categories. There are fewer cities than people, after all. It’s also possible that locations are more common as responses, while historical figures appear more in the questions themselves. I’ve asked a few friends what they think the most popular responses would be, and they usually intuit that the top responses would be geographical.

However, there is a second answer at play here, which is that people’s names can appear a number of different ways. Let’s return to Napoleon as an example. In addition to 243 appearances of “Napoleon”, the table also contains:

response count
Napoleon Bonaparte 47
Napoléon 14
Napoleon (Bonaparte) 4
Napoléon Bonaparte 3
Napoleon (I) 3
Napoléon (Bonaparte) 1

One might be curious why so many variations on a single name are possible. While contestants may say the full name while answering, the judges will often (but not always) allow a contestant to say only the surname, especially when referring to politicians. To eliminate ambiguity, the people maintaining the archive will often (but not always) add the given name to a response in parentheses. Ironically, Napoleon is a counterexample, as his first name is much more recognizable than his last name.

Regardless of these details, this does pose a serious problem to our understanding of this data. A first instinct at a solution might be to simply sum up the instances of each cell containing the word “Napoleon”, but this approach is fraught. Such an approach would include over twenty instances of variations on “Napoleon III”, as well as 6 instances of “Napoleon Dynamite”. This problem is even worse for less distinctive names. Here’s a breakdown of answers containing “Ford” as the last part of a person’s name, appearing three or more times:

On a more personal note, the corresponding chart for people named Johnson is even more disastrous:

This is a massive problem! How could we possibly handle this? As I see it, there are a handful of strategies that could be used to collect different versions of one person together, some of which might be used simultaneously:

  1. Find all variations of responses that end with the same surname, and sum up their counts into a single number.
  2. Take the counts of answers that consist of a single surname, and distribute that number proportionally across the different names with that surname.
  3. Ask an LLM to consider the question as a whole, and determine the exact identity of each human included.
  4. Group together variations that differ only by a pair of parentheses or diacritical marks. For example, replace "(Gerald) Ford" with "Gerald Ford".

Option 1 can be disregarded immediately. This would require that Gerald Ford and Harrison Ford be counted as one person, which is unacceptable.

Option 2 seems like a better idea at first, but falls apart the more one considers it. It would be nice to split the 119 instances of “Ford” and add those to “Gerald Ford”, “Henry Ford”, “Betty Ford” and “Harrison Ford”. However, it’s not a totally valid assumption that referring to someone by one name is equally common for presidents as it is for actors. In fact, it’s not fair to assume that all 119 instances of “Ford” refer to a person at all. Surely, many refer to the Ford Motor Company.

Option 3 is tempting, but LLMs are always prone to error. This concern is easy to overstate; LLMs are getting more accurate all the time, especially as pertains to simple informational questions. However, determining the exact level of inaccuracy would take some testing and comparison that is currently outside the scope of this project.

That leaves option 4, which really does seem reasonable. This is a small change overall: Adding the 13 instances of “(Gerald) Ford” to the 128 instances of “Gerald Ford” is not likely to be hugely impactful. However, it’s also very unlikely to have negative side effects. This will be implemented moving forward.
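
Option 4 can be sketched with nothing but the standard library. The function name `normalize_response` is hypothetical, and the repository may go about this differently; the idea is just to drop the parentheses (keeping what's inside them) and strip accents before counting.

```python
import unicodedata
from collections import Counter

def normalize_response(response: str) -> str:
    """Remove parentheses (keeping their contents) and strip diacritical marks."""
    without_parens = response.replace("(", "").replace(")", "")
    # NFKD splits an accented letter into a base letter plus a combining mark,
    # so the marks can be filtered out on their own.
    decomposed = unicodedata.normalize("NFKD", without_parens)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse any stray whitespace left behind.
    return " ".join(without_marks.split())

raw = ["(Gerald) Ford", "Gerald Ford", "Napoléon (Bonaparte)", "Napoleon Bonaparte"]
merged_counts = Counter(normalize_response(r) for r in raw)
print(merged_counts)
```

One side effect worth knowing about: a response like "Napoleon (I)" becomes "Napoleon I", which still stays separate from plain "Napoleon".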

Finally, it’s probably a good idea to restrict the timespan the program searches. What’s considered important knowledge has changed over the years, as have the people on Jeopardy!’s writing staff. Viewing just the last 5 years seems like a decent compromise between quantity and relevance of data. We can finally note the most common responses in the last few years. Here’s the top 75:

response count
Chicago 58
Australia 48
Florida 43
Philadelphia 40
California 39
Brazil 37
India 37
Georgia 34
Alaska 34
Jupiter 34
Ireland 33
China 33
Greece 33
Mars 33
Texas 32
Poland 31
Japan 31
Spain 30
Boston 30
San Francisco 30
Mexico 29
New Orleans 29
Cuba 29
the Philippines 29
France 28
Switzerland 27
Norway 27
Hawaii 27
Virginia 27
Venice 26
Egypt 26
Paris 26
Iceland 25
Michigan 25
Portugal 25
Iran 25
Atlanta 25
Canada 25
Argentina 24
the Netherlands 24
Beethoven 24
Germany 23
Italy 23
Florence 23
Dublin 23
Sweden 23
New Zealand 23
South Africa 23
Amsterdam 22
Denmark 22
Scotland 22
Massachusetts 22
mercury 21
London 21
Ethiopia 21
Maine 21
New Mexico 21
Puerto Rico 21
Venus 21
Yellowstone 20
the Mississippi 20
Peru 20
Morocco 20
Colombia 20
Antarctica 20
New Jersey 20
Chile 20
Vienna 20
Napoleon 20
St. Louis 19
Pennsylvania 19
the Thames 19
Madagascar 19
Joan of Arc 19
Seattle 19

It’s also fairly easy to sift through the first rows of this table by hand and remove all answers that are geographic locations:

answer count
Jupiter 34
Mars 33
Beethoven 24
mercury 21
Venus 21
Joan of Arc 19
Napoleon 19
Hamlet 19
Saturn 19
Picasso 18
iron 18
Macbeth 18
Mercury 18
David 18
Tesla 18
Galileo 17
Jordan 17
Richard III 17
Neptune 17
Wilson 17
Julius Caesar 17
Cleopatra 16
the Moon 16
Cinderella 16
Hamilton 16
lead 16
carbon dioxide 15
Mozart 15
Moses 15
hydrogen 15
Churchill 15
Lady Gaga 15
tea 15
soccer 15
Solomon 15
baseball 15
Lincoln 15
Harvard 15
the liver 15
smallpox 14
John Quincy Adams 14
Henry VIII 14
a horse 14
Copernicus 14
Catherine the Great 14
Buddhism 14
the Amazon 15
Alexander the Great 14
World War I 14
Exodus 14
Teddy Roosevelt 14
Eisenhower 14
the heart 14
gold 14
Wagner 14
12 13
Robinson Crusoe 13
Marie Antoinette 13
the Titanic 13
cricket 13
Grey’s Anatomy 13
The Phantom of the Opera 13
Jaws 13
John Quincy Adams 13
Twelfth Night 13
Nero 13
Carmen 13
bamboo 13
King Lear 13
the Statue of Liberty 13
Marshall 13
John 13
Nixon 13
Madonna 13
teeth 13
the Louvre 13

Separating data by value

I’m also curious whether these results change for different point values. Chicago is a very well known location for Americans, so it’s possible that Chicago only appears so often because it’s an easy “gimme” question. In general, clues with higher point values are much harder; maybe common responses are only common for easy questions. Would Chicago still be the most common response if we limit our search to expensive questions? Let’s break down answers by difficulty, and see if the results change. In order to account for splitting the data into ten parts, I’ll extend the search up to 25 years. Here’s the 20 most common responses for each clue value in Single Jeopardy:

$200 answer count $400 answer count $600 answer count $800 answer count $1000 answer count
China 108 Australia 64 California 57 Chicago 41 Australia 39
Hawaii 106 Alaska 62 Chicago 54 France 36 Maine 35
Japan 104 Chicago 58 Australia 45 New York 36 Brazil 35
California 74 California 54 China 44 California 34 Chicago 34
Alaska 73 France 54 Texas 44 Australia 34 Greece 32
Chicago 72 China 53 Spain 42 Spain 34 South Africa 31
Australia 67 Japan 53 India 39 China 33 France 29
Mexico 66 Spain 53 France 38 India 33 Sweden 29
Florida 65 Canada 52 Florida 38 Alaska 32 Spain 26
France 64 Mexico 51 Japan 37 Greece 32 Japan 26
India 59 India 50 London 35 Minnesota 32 Oklahoma 25
George Washington 58 Florida 49 Hawaii 35 Pennsylvania 31 Belgium 25
Ireland 58 Boston 44 Germany 35 Mexico 30 Utah 25
Boston 56 New York 44 Sweden 35 Maine 30 Texas 24
Canada 55 Texas 44 New Orleans 34 Canada 29 Ireland 24
Russia 52 Egypt 40 Italy 34 New Mexico 29 Wyoming 24
Egypt 50 San Francisco 38 Alaska 33 Texas 28 Maryland 24
New Orleans 50 London 37 Mars 32 Italy 28 Norway 24
Paris 50 Switzerland 37 Greece 31 Israel 28 Portugal 24
New York 49 Hawaii 36 South Africa 31 Montana 28 Thailand 24

And for Double Jeopardy:

$400 answer count $800 answer count $1200 answer count $1600 answer count $2000 answer count
China 111 Australia 57 Japan 46 Australia 48 Brazil 33
Japan 87 Chicago 57 Australia 45 Sweden 40 Denmark 33
France 85 India 55 Sweden 45 Georgia 37 Portugal 32
Australia 81 Spain 54 France 42 Italy 36 India 30
Paris 77 France 47 Spain 40 Brazil 34 Andrew Jackson 30
California 74 China 46 Canada 40 Florida 34 Sweden 28
Mexico 73 Mexico 46 India 39 France 32 Indonesia 28
Cleopatra 70 Japan 43 Italy 37 Spain 32 Georgia 27
Spain 68 Paris 43 Portugal 36 South Africa 32 Norway 26
London 67 Egypt 43 Chicago 35 Maine 32 Poland 26
Alaska 67 California 42 Denmark 35 India 31 the Netherlands 26
Ireland 67 Ireland 41 Greece 34 Mexico 31 Spain 25
Italy 66 Italy 41 China 33 Switzerland 31 North Carolina 25
India 64 South Africa 41 Brazil 33 Norway 30 Finland 25
Chicago 62 Canada 39 Paris 32 Portugal 29 South Africa 24
Canada 57 Venus 39 South Africa 32 Chicago 29 Chicago 24
George Washington 56 Napoleon 37 Texas 32 Denmark 29 New Hampshire 24
Hawaii 55 Rome 37 Germany 31 Andrew Jackson 28 France 23
Florida 52 Hamlet 37 New York 31 China 27 Egypt 22
Egypt 52 Texas 36 Napoleon 30 Greece 26 the Philippines 22

Here’s the same data, with all the geographic locations removed:

$200 answer count $400 answer count $600 answer count $800 answer count $1000 answer count
George Washington 58 Ronald Reagan 36 Mars 32 4 21 Andrew Jackson 17
red 49 2 30 3 25 Eisenhower 21 Grover Cleveland 17
Abraham Lincoln 47 red 29 white 24 Mars 18 golf 16
McDonald’s 47 gold 29 basketball 23 Venus 18 Theodore Roosevelt 16
Napoleon 46 Wisconsin 28 Ronald Reagan 22 golf 18 4 15
gold 43 George Washington 27 Venus 22 Eleanor Roosevelt 18 Calvin Coolidge 15
Julius Caesar 42 tea 27 Richard Nixon 22 Theodore Roosevelt 18 white 15
Lincoln 41 3 26 Napoleon 21 Andrew Jackson 17 12 15
Madonna 41 Maine 26 Thomas Jefferson 20 blue 17 Henry VIII 14
Elvis Presley 39 Sweden 25 baseball 20 Jacob 17 Julius Caesar 14
water 38 Mars 24 Andrew Jackson 20 3 16 iron 14
milk 36 Thomas Jefferson 24 George Washington 19 Richard Nixon 16 Uranus 14
Cleopatra 36 coffee 24 Abraham Lincoln 19 Mark Twain 16 Solomon 13
Babe Ruth 35 rice 24 Julius Caesar 19 Henry VIII 16 Jupiter 13
white 34 World War I 24 4 18 Solomon 15 Saturn 13
Moses 34 Elvis Presley 23 green 17 Pocahontas 15 Neptune 13
2 34 Venus 23 Jupiter 17 nitrogen 15 Woodrow Wilson 13
Coca-Cola 33 Pennsylvania 23 blue 17 basketball 14 Job 12
golf 32 New Jersey 23 Hamlet 17 Jupiter 14 Othello 12
Richard Nixon 32 oil 23 Buddhism 17 7 14 John Adams 12
$400 answer count $800 answer count $1200 answer count $1600 answer count $2000 answer count
Cleopatra 70 Venus 39 Napoleon 30 Andrew Jackson 28 Andrew Jackson 30
George Washington 56 Napoleon 37 Mozart 30 Woodrow Wilson 22 Woodrow Wilson 21
Napoleon 51 Hamlet 37 Thomas Jefferson 28 Henry VIII 21 Eugene O’Neill 20
Julius Caesar 49 Macbeth 34 David 26 Jupiter 20 William Faulkner 19
Michelangelo 47 Julius Caesar 32 Galileo 25 Thomas Jefferson 19 Virginia Woolf 18
Mars 43 Abraham Lincoln 31 Michelangelo 24 Theodore Roosevelt 19 Henry Moore 18
Joan of Arc 43 Picasso 31 Hamlet 23 Richard III 18 Richard III 17
Mark Twain 38 Thomas Jefferson 30 Mars 22 Eleanor Roosevelt 18 John Quincy Adams 17
Abraham Lincoln 38 Cleopatra 28 Beethoven 22 King Lear 17 Maria Theresa 17
Hamlet 36 Ronald Reagan 28 Theodore Roosevelt 22 Charlemagne 17 Claudius 16
Alexander the Great 36 Mars 27 King Lear 21 A Midsummer Night’s Dream 17 Aeschylus 16
Ronald Reagan 35 Mozart 27 Richard III 20 Rembrandt 17 Twelfth Night 15
Romeo and Juliet 35 Lincoln 26 Henry VIII 20 George Eliot 17 Orpheus 15
Columbus 35 Queen Victoria 26 Picasso 19 Herbert Hoover 17 John Adams 15
Venus 34 George Washington 25 3 19 Archimedes 17 Raphael 15
Agatha Christie 34 World War I 25 Sylvia Plath 19 Galileo 16 Nathaniel Hawthorne 15
gold 33 David 25 Gerald Ford 18 Georgia O’Keeffe 16 Aristophanes 15
Shakespeare 33 Michelangelo 24 John Adams 18 English 16 7 15
water 33 Benjamin Franklin 24 Thomas Hardy 18 Venus 15 Sir Walter Scott 14
Beethoven 32 Galileo 24 Venus 17 Dylan Thomas 15 Zachary Taylor 14
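
Tables like the ones above boil down to a grouped count. Here's a rough sketch on toy data; the column names `round`, `value`, and `response`, the round labels, and the helper name `top_responses_by_value` are all assumptions for illustration.

```python
import pandas as pd

# Toy data; the real dataframe and its column names may differ.
clues = pd.DataFrame({
    "round": ["J1", "J1", "J1", "J1", "J2", "J2"],
    "value": [200, 200, 200, 400, 400, 400],
    "response": ["China", "China", "Hawaii", "Australia", "Cleopatra", "Cleopatra"],
})

def top_responses_by_value(df, round_name, n=20):
    """Return the n most common responses for each clue value in one round."""
    subset = df[df["round"] == round_name]
    counts = (
        subset.groupby("value")["response"]
        .value_counts()      # counts per (value, response), sorted within each value
        .rename("count")
        .reset_index()
    )
    # head(n) on the grouped frame keeps the first n rows of every value.
    return counts.groupby("value").head(n).reset_index(drop=True)

j1_top = top_responses_by_value(clues, "J1", n=1)
print(j1_top)
```

Removing the geographic locations is still a manual step; nothing in this sketch knows that Chicago is a place and Cleopatra is not.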

Reading through this data is endlessly fascinating to me. Of course, I care more than the average bear about this show, so it’s hard for me to tell what peculiarities of this show are interesting to the average person. Here are just a few of the odd patterns present when breaking this data down by value:

  • Although the Beatles are notorious as a common subject on Jeopardy!, their prevalence as a response drops off rapidly after the $200 clue
  • The same is true for Shakespeare, although the titles of his plays see reasonable representation across different clue values
  • In general, some answers seem to favor certain values. "Barcelona" is three times as likely to be the answer to the $800 clue as to any other value in Single Jeopardy
  • Aeschylus is an even more extreme example. He has appeared as an answer only three times in Single Jeopardy, twice in a $1600 clue, and sixteen times under the $2000 clue.

It’s worth noting that none of these numbers are incredibly large. I don’t want to overstate the significance of Aeschylus’ 16 appearances, especially since there have been about 36,000 questions worth $2000 in the past 25 years. Nonetheless, we’ve at least gained some insight into our question! It does seem that expensive questions tend to be more spread out than cheap questions, but this doesn’t seem to be the whole story. Let’s get a more in-depth picture of the frequencies of different answers. Here’s a graph showing the frequencies of answers by their index in the grid:

Converting both axes into logarithmic axes brings the graph into clearer focus:

To get a better idea of how this data is shaped, we might separate this into the specific point values:

These graphs are not linear, so the equations that fit them closely are not simple. I doubt it would be informative to list every equation for the curves of best fit, but here’s a graph showing quadratic regression curves for the previous graph.
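
For anyone who wants to reproduce a fit of this kind, here's a rough sketch using NumPy. It assumes the quadratic is fitted on the log-log axes (log of count against log of rank), which is my reading of the graphs rather than something stated above; the helper name `fit_log_quadratic` is made up for illustration.

```python
import numpy as np

def fit_log_quadratic(counts):
    """Fit log10(count) as a quadratic in log10(rank).

    Returns the coefficients [a, b, c] of a*t**2 + b*t + c, where t = log10(rank).
    Assumes every count is positive.
    """
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]  # most common first
    ranks = np.arange(1, len(counts) + 1)
    return np.polyfit(np.log10(ranks), np.log10(counts), deg=2)

# Synthetic counts that follow an exact quadratic in log-log space.
t = np.log10(np.arange(1, 51))
synthetic_counts = 10 ** (2.0 - 0.5 * t - 0.1 * t ** 2)
coefficients = fit_log_quadratic(synthetic_counts)
print(coefficients)
```

A more negative leading coefficient means the counts fall away faster as you move down the ranking.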

This more clearly confirms our suspicions. Easier clues, like the ones worth $200, focus more predominantly on a small handful of answers, while more expensive clues are spread across a wider variety of topics. However, the fact that answers are more widely spread for difficult clues than for easy ones does not mean that they are all the same answers. As we’ve seen, some answers, like “Virginia Woolf” or “Aeschylus”, appear almost exclusively as responses to difficult questions, even if they appear relatively often.

To bring this into sharper focus, compare the distribution of “Virginia Woolf” in Double Jeopardy to that of “Elizabeth I”:

$400 $800 $1200 $1600 $2000
Virginia Woolf 2 8 8 11 11
Elizabeth I 21 8 7 4 0

Both women have appeared exactly 40 times in Double Jeopardy over the last 40 years, but there’s a clear difference in distribution! “Who is Virginia Woolf?” tends to be a much higher scoring answer than “Who is Elizabeth I?”. Typically, when I ask people what they assume is the difference between expensive and inexpensive clues in Jeopardy!, they reason that expensive clues ought to ask about more “obscure” people and things, and this does hold true for our examples of Woolf and Elizabeth. Virginia Woolf is mostly talked about only in specialized English Lit courses at universities, while Elizabeth I is well known even to elementary students, so it’s probably fair to characterize Woolf as “more obscure”. That said, it’s significant that they appear with equal frequency. It’s as though Virginia Woolf is obscure enough to appear mostly in very expensive questions, but important enough to appear just as often as one of England’s most famous monarchs.

Measuring how interesting a response is

These figures that appear frequently, but only as high-value answers, are very interesting to me. It’s as though the Jeopardy! writers have set out a specific list of cultural touchstones that separate highly knowledgeable people from everyone else. In order to find these figures more easily, I’ll have to define some metric that indicates how “expensive” a question is, on average. I waffled on this for a while, but in the end I chose to treat an answer’s value as interval data rather than ordinal, and measured the mean value of each answer within each round. With this metric included, the distribution for Woolf and Elizabeth I looks like this:

$400 $800 $1200 $1600 $2000 Total J2 appearances Average J2 value
Virginia Woolf 2 8 8 11 11 40 1410
Elizabeth I 21 8 7 4 0 40 740
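
The averages follow directly from the per-value counts. As a quick worked check (the helper name `mean_clue_value` is just for illustration):

```python
# Double Jeopardy clue values and the per-value appearance counts from the table.
J2_VALUES = [400, 800, 1200, 1600, 2000]
appearances = {
    "Virginia Woolf": [2, 8, 8, 11, 11],
    "Elizabeth I": [21, 8, 7, 4, 0],
}

def mean_clue_value(counts, values):
    """Weighted mean of the clue values, weighted by number of appearances."""
    return sum(c * v for c, v in zip(counts, values)) / sum(counts)

for name, counts in appearances.items():
    print(name, sum(counts), mean_clue_value(counts, J2_VALUES))
# Virginia Woolf 40 1410.0
# Elizabeth I 40 740.0
```

Woolf's 40 appearances are worth $56,400 in total and Elizabeth's are worth $29,600, which is where the two averages come from.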

Out of pure interest, we might choose to display each answer on a massive scatterplot, comparing their average value in the first round with their total number of appearances in the first round.

And here’s a similar chart for the Double Jeopardy round:

Thinking of “interesting” answers as those with very high average values isn’t adequate. To explain what I mean by this, consider the fact that there are 13385 answers that have appeared only as $2000 clues in Double Jeopardy, and 12098 of these have appeared only once. Many of these are highly specific combinations of things that would only make sense in context. Some examples are:

  • "B-29 (the one that carried the atomic bomb)"
  • "BMO"
  • "people who have seconds"
  • "Runaway Bride of Chucky"
  • "Sunday in the Park with George Orwell"
  • "The Lion, the Witch, and the Wardrobe Malfunction"

If anything, the biggest surprises here are the answers I would have expected to be common, but aren’t. “The Man with the Golden Arm” and “The Three-Body Problem” both appear only once, although I thought they were fairly well known and important books. Regardless, this demonstrates that we need to think of “interesting” answers as those that balance average value and frequency of appearance, and that we should find a method that isolates these “interesting” answers. One option is to consider the Pareto frontier of the set of all answers. In this case, that means the set of answers that have a higher average value than every answer more common than them. In other words, these data points are “Pareto optimal”: they cannot be further optimized for frequency of appearance without sacrificing average value, and vice versa.

I find this a bit surprising, but Pandas doesn’t have a built-in function to find the Pareto frontier. For the sake of anyone reading, I’m including my function for finding the Pareto frontier along two optimized variables. This function is limited to 2-D Pareto frontiers, and is written specifically for this circumstance.

import pandas as pd

# Given a dataframe df, returns the Pareto front of that data as another dataframe, maximizing the columns named
# x and y. All columns in the optional "other" list are included in the returned dataframe as well.
def pareto_front_2D(df, x, y, other=None):

    other = list(other) if other is not None else []
    front = df[other + [x, y]].dropna()

    # Nothing to do for an empty dataframe.
    if front.empty:
        return front.reset_index(drop=True)

    # The point with the highest y value has the lowest x value that can appear on the front: anything with a
    # lower x value cannot have a higher y value, and so is not worth including. We also do it the other direction.

    front = front.sort_values(by=[y, x], ascending=False, ignore_index=True)
    x_record = front[x].iloc[0]
    front = front[front[x] >= x_record]

    front = front.sort_values(by=[x, y], ascending=False, ignore_index=True)
    y_record = front[y].iloc[0]
    front = front[front[y] >= y_record].reset_index(drop=True)

    # With the rows sorted by x in descending order, a row belongs on the front only if it sets a new record along
    # the metric y, or if it exactly ties the row that set the current record (same x and same y).

    keep = []
    x_at_record = front.loc[0, x]

    for i in front.index:

        if front.loc[i, y] > y_record:
            y_record = front.loc[i, y]
            x_at_record = front.loc[i, x]
            keep.append(i)

        elif (front.loc[i, y] == y_record) and (front.loc[i, x] == x_at_record):
            keep.append(i)

    # Rows are collected rather than dropped in place, so a dropped row is never looked up again.
    return front.loc[keep].reset_index(drop=True)
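
As a cross-check, the same idea can be written without an explicit loop. This is only a sketch under my own assumptions, not the function used for the charts: `pareto_front_vectorized` is a hypothetical name, both columns are maximized, and exact ties are kept.

```python
import pandas as pd

def pareto_front_vectorized(df, x, y):
    """Rows that no other row beats on both x and y (exact ties are kept)."""
    pts = df.dropna(subset=[x, y]).sort_values(
        by=[x, y], ascending=False, ignore_index=True
    )
    # Best y within each distinct x, in descending order of x.
    group_best = pts.groupby(x, sort=False)[y].max()
    # Best y among rows with a strictly larger x (NaN for the largest x).
    best_before = group_best.cummax().shift(1)
    same_x_best = pts[x].map(group_best)
    larger_x_best = pts[x].map(best_before)
    # Keep a row if it is the best at its own x and beats everything more common.
    keep = (pts[y] == same_x_best) & (larger_x_best.isna() | (pts[y] > larger_x_best))
    return pts[keep].reset_index(drop=True)

toy = pd.DataFrame({
    "answer": ["A", "B", "C", "D", "E", "F", "G"],
    "total": [10, 8, 8, 5, 3, 3, 1],
    "average": [500, 400, 600, 600, 900, 900, 100],
})
toy_front = pareto_front_vectorized(toy, "total", "average")
print(toy_front)
```

In the toy data, D is dropped because C matches its average while appearing more often, and E and F both survive because they tie exactly.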

The Pareto front for the Single Jeopardy round looks like this:

And the chart for the Double Jeopardy round looks quite similar:

The precise values on the Pareto front for the first round of Jeopardy! are:

answer J1 total J1 average
Chicago 173 514.45
Australia 143 534.27
Greece 101 566.34
Brazil 97 595.88
South Africa 77 644.16
Maryland 62 674.19
Oklahoma 60 683.33
The Philippines 53 690.57
Indonesia 51 701.96
Ethiopia 37 794.30
Grover Cleveland 25 800
Ferrari 17 811.76
Andromeda 16 825
Martin Van Buren 16 825
Nicaragua 15 853.33
Bhutan 15 853.33
Malaysia 14 871.43
As You Like It 13 876.92
Dorothy Parker 12 883.33
Sisyphus 11 927.27
Padua 9 1000

Meanwhile, the Pareto front of the second round is:

answer J2 total J2 average
Chicago 135 989.63
Australia 131 1016.79
Brazil 96 1133.33
South Africa 91 1134.07
Georgia 84 1285.71
The Philippines 67 1367.16
Algeria 46 1373.91
Finland 44 1472.73
Andrew Jackson 35 1485.71
Malta 34 1576.47
George Sand 25 1584
The Orinoco 21 1600
Tonga 20 1700
Ghana 17 1741.18
Avignon 16 1775
Mozambique 15 1813.33
Sikhism 15 1813.33
Aeschylus 13 1938.46
Ganymede 9 2000

Certainly, there are some interesting points in this data. I had no idea George Sand appeared so often and as such a high value response, for instance. Nonetheless, this method of analysis does still leave something to be desired: it maintains the previous bias towards geographic locations, for one. More egregiously, it provides a high number of answers that are especially common without being high value; the first item on the Pareto front of the second round I would really consider “interesting” is Malta.

Perhaps there’s some sort of value function that could be used instead to rank answers. The most natural solution is to total up the point values of each answer’s appearances, giving the total dollar amount an answer has been worth over the last 25 years. Unfortunately, I don’t find this solution very satisfying either. Looking at the results of these totals, we see more of the usual suspects: Chicago, Australia, Andrew Jackson, etc. It seems that simply multiplying the total number of appearances by the average point value isn’t going to work either.

Let’s take a moment to consider the distribution of average point values. The distributions for average values look a bit like this:

The large spikes in this graph are due to the fact that most answers occur only once or twice, and therefore average values are very likely to be one of the possible values that a single answer can have. If we restrict these graphs to answers that have appeared more than 3 times over the time span investigated, we get a much clearer picture:

This data is distributed roughly normally! That gives the impression that it would be more reasonable to multiply the frequency of an answer’s appearance by the difference between its average value and the mean of all average values. This would have the added benefit of eliminating anything with an average value not above the mean. Additionally, every clue in Jeopardy! has a value divisible by 200. Dividing out this factor would make our metric that little bit cleaner.

To summarise, if an answer appears $x$ times with an average value of $y$, then the “interest” of that answer can be calculated as

$$x\cdot(y-600)/200$$

Of course, that only applies to Single Jeopardy, where the average value is 600. For Double Jeopardy, the calculation would be $x\cdot(y-1200)/200$. After ranking items by this metric, I’m happy to say that I’m broadly satisfied with it! It does seem to highlight topics that are important yet obscure; a lot of these names and places are standard parts of college curricula, but not well known by the general population, which is exactly what I wanted. I tinkered with a few alternate formulas, but they always seemed to emphasize what I wanted to see, rather than neutrally reflecting the data. Besides, the mathematical simplicity of this formula makes it feel like it has a very strong connection to the game. It’s easy to see how studying the topics that score high here would translate to an improved average score while playing Jeopardy! at home. It even seems to de-emphasize geography!
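
To make the formula concrete, here's a small sketch that applies it to two rows from the Pareto tables earlier in the article. The function name `interest` and the round labels are my own choices for illustration.

```python
# Mean clue value per round, as used in the formula.
ROUND_MEAN = {"J1": 600, "J2": 1200}

def interest(appearances, average_value, round_name):
    """appearances * (average value - round mean) / 200."""
    return appearances * (average_value - ROUND_MEAN[round_name]) / 200

# (answer, round, total appearances, average value) taken from the Pareto tables.
examples = [
    ("Grover Cleveland", "J1", 25, 800),
    ("Malta", "J2", 34, 1576.47),
]

scores = {name: round(interest(total, avg, rnd)) for name, rnd, total, avg in examples}
print(scores)
```

Rounded to whole numbers, Grover Cleveland comes out at 25 and Malta at 64. An answer whose average sits below the round mean gets a negative score, however often it appears.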

This time, while displaying the top scorers, I think I’ll separate the geographic answers out entirely. Without further ado, here’s the top scorers! In this chart, J1 represents answers from the Single Jeopardy round, while J2 represents answers from Double Jeopardy.

geographic J1 answers interest
Ethiopia 36
Thailand 27
Indonesia 26
Oklahoma 25
Jordan 24
Hungary 24
the Philippines 24
Maryland 23
Wyoming 21
Bhutan 19
Lebanon 19
Sri Lanka 19
Malaysia 19
Andromeda 18
Oregon 18
Padua 18
Mississippi 18
Guam 18
Libya 17
Memphis 17
South Africa 17
Singapore 17
Maui 17
Turkey 17
Carthage 16
St. Augustine 16
Uranus 16
Maine 16
The Atlas Mountains 16
Chad 15
Dubai 15
Crete 15
Montana 15
Iowa 15
Bavaria 15
Andorra 15
Windsor 15
Panama 15
Namibia 14
Milan 14
Other J1 answers interest
Grover Cleveland 25
Andrew Jackson 19
As You Like It 18
Sisyphus 18
Martin Van Buren 18
Ferrari 18
Jacob 17
The Crimean War 17
Dorothy Parker 17
William Jennings Bryan 17
Richard III 17
Earl Warren 17
Twelfth Night 16
Titus Andronicus 16
The Rosetta Stone 16
Caligula 16
Samuel Pepys 16
Howard Hughes 16
Lolita 15
Wagner 15
William McKinley 15
National Geographic 15
Job 15
Sinclair Lewis 15
Joshua 15
cholera 14
Robert the Bruce 14
Jack London 14
Patrick Henry 14
Van Buren 14
curling 14
Sikhism 13
Henry James 13
Guy Fawkes 13
Phosphorus 13
John Tyler 13
Strom Thurmond 13
deciduous 13
Ulysses 13
Trotsky 13
geographic J2 answers interest
Malta 64
Finland 60
the Philippines 56
Tonga 50
Kazakhstan 46
Avignon 46
Ghana 46
Mozambique 46
Andromeda 42
Qatar 42
Botswana 42
the Orinoco 42
Bhutan 42
Nigeria 40
Yemen 40
Algeria 40
San Marino 40
Liechtenstein 38
Bahrain 38
Djibouti 38
Georgia 36
Io 36
East Timor 36
Ganymede 36
Ethiopia 34
the Bay of Biscay 34
Ur 34
Montenegro 34
the Tagus 34
Lebanon 34
Fiji 34
the Caspian Sea 32
Angola 32
Borneo 32
Timbuktu 32
Thermopylae 32
Nunavut 32
Cornwall 32
Suriname 30
El Salvador 30
other J2 answers interest
Andrew Jackson 50
George Sand 48
Aeschylus 48
Sikhism 46
Twelfth Night 44
Petrarch 44
Henry Moore 44
Andromeda 42
Virginia Woolf 42
Fidelio 40
Voltaire 38
Caligula 38
Maria Theresa 38
Aaron Copland 38
Zachary Taylor 38
Charles II 38
John Locke 38
Herodotus 38
Raphael 36
Much Ado About Nothing 36
August Wilson 36
Zoroastrianism 36
Richard Wright 36
Pericles 36
Ambrose Bierce 36
The Sun Also Rises 34
Marc Chagall 34
Billy Budd 34
William Butler Yeats 34
Ovid 34
Solzhenitsyn 34
Gilgamesh 34
John Donne 34
Shelley 32
Langston Hughes 32
Skylab 32
Cicero 32
John Keats 32
the Knights Templar 32
the Etruscans 32

Overall, I’m quite pleased with this! These results mostly line up with what I expected, with names like Keats, Solzhenitsyn, and Woolf winding up as big winners. There’s a few unexpected results as well — I’m shocked curling wound up being so high up, for example. If you want proof that this metric highlights the top left edge of the data, look at the scatter plots from before, now with the 100 most “interesting” datapoints in orange:

All the code used is available in this project’s Github repository. Feel free to play around with it yourself! If you find anything interesting or surprising, feel free to contact me and let me know!