Reading into the common responses on Jeopardy!

·5072 words·24 mins
Jules Johnson
The code for this project can be found at its GitHub repository.

This article was spun off from a larger, more technical article on the subject of how this data was collected. I wanted to focus on some fun statistical analysis for the answers on Jeopardy!, so feel free to skip the main article if you’re only interested in the stats.

What are the most common correct responses?

In the previous article, I created a massive dataframe containing the categories, hints, answers, dates, and point values of every question aired on Jeopardy!. The first question that I’m curious about is simple: what are the most common correct responses on Jeopardy!? What people, places, and things are most valued by the Jeopardy! writers?

Simply performing a count on the number of times a word or phrase appears as a correct response gives some surprising results:

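| rank | response | count |
|---|---|---|
| 1 | Australia | 511 |
| 2 | China | 484 |
| 3 | Japan | 472 |
| 4 | Chicago | 471 |
| 5 | France | 462 |
| 6 | India | 434 |
| 7 | Spain | 427 |
| 8 | California | 420 |
| 9 | Mexico | 390 |
| 10 | Alaska | 380 |
| 11 | Canada | 376 |
| 12 | India | 354 |
| 13 | Hawaii | 354 |
| 14 | Florida | 351 |
| 15 | Texas | 340 |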
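For readers following along in pandas, a count like this is close to a one-liner. The sketch below assumes a dataframe with a `response` column; that column name and the toy rows are placeholders for illustration, not the project's actual schema.

```python
import pandas as pd

# Toy stand-in for the full clue dataframe (the real one has one row per clue).
clues = pd.DataFrame({
    "response": ["Australia", "China", "Australia", "Napoleon", "China", "Australia"],
})

# value_counts() tallies identical strings and sorts them by count, descending.
response_counts = clues["response"].value_counts()
print(response_counts.head(3))
```

Note that this treats every distinct string as a distinct response, which is exactly the limitation discussed next.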

Geographical locations are repeated far more frequently than famous people or works of art! The most frequent correct response is “What is Australia?”, appearing 511 times over the show’s run. After that come China, then Japan, Chicago, France, and so on. As a matter of fact, all of the top 37 responses are geographic locations! (The trend is broken by Napoleon, appearing 243 times.)

Geography certainly isn’t a vastly more common subject than history, so what could cause this pattern? Part of the answer is most likely that answers to geography questions tend to focus on a narrower range of answers than other categories. There are fewer cities than people, after all. It’s also possible that locations are more common as responses, while historical figures appear more in the questions themselves. I’ve asked a few friends what they think the most popular responses would be, and they usually intuit that the top responses would be geographical.

However, there is a second answer at play here, which is that people’s names can appear a number of different ways. Let’s return to Napoleon as an example. In addition to 243 appearances of “Napoleon”, the table also contains:

| response | appearances |
|---|---|
| Napoleon Bonaparte | 47 |
| Napoléon | 14 |
| Napoleon (Bonaparte) | 4 |
| Napoléon Bonaparte | 3 |
| Napoleon (I) | 3 |
| Napoléon (Bonaparte) | 1 |

One might be curious why so many variations on a single name are possible. While contestants may say the full name while answering, the judges will often (but not always) allow a contestant to say only the surname, especially when referring to politicians. To eliminate ambiguity, the people maintaining the archive will often (but not always) add the given name to a response in parentheses. Ironically, Napoleon is a counterexample, as his first name is much more recognizable than his last name.

Regardless of these details, this does pose a serious problem to our understanding of this data. A first instinct at a solution might be to simply sum up the instances of each cell containing the word “Napoleon”, but this approach is fraught. Such an approach would include over twenty instances of variations on “Napoleon III”, as well as 6 instances of “Napoleon Dynamite”. This problem is even worse for less distinctive names. Here’s a breakdown of answers containing “Ford” as the last part of a person’s name, appearing three or more times:

On a more personal note, the corresponding chart for people named Johnson is even more disastrous:

This is a massive problem! How could we possibly handle this? As I see it, there’s a handful of strategies that could be used to collect different versions of one person together, some of which might be used simultaneously:

  1. Find all variations of responses that end with the same surname, and sum up their counts into a single number.
  2. Take the counts of answers that consist of a single surname, and distribute that number proportionally across the different names with that surname.
  3. Ask an LLM to consider the question as a whole, and determine the exact identity of each human included.
  4. Group together variations that differ only by a pair of parentheses or diacritical marks. For example, replace "(Gerald) Ford" with "Gerald Ford".

Option 1 can be disregarded immediately. This would require that Gerald Ford and Harrison Ford be counted as one person, which is unacceptable.

Option 2 seems like a better idea at first, but falls apart the more one considers it. It would be nice to split the 119 instances of “Ford” and add those to “Gerald Ford”, “Henry Ford”, “Betty Ford” and “Harrison Ford”. However, it’s not safe to assume that referring to someone by surname alone is equally common for presidents as it is for actors. In fact, it’s not fair to assume that all 119 instances of “Ford” refer to a person at all. Surely, many refer to the Ford Motor Company.

Option 3 is tempting, but LLMs are always prone to error. This concern is easy to overstate; LLMs are getting more accurate all the time, especially as pertains to simple informational questions. However, determining the exact level of inaccuracy would take some testing and comparison that is currently outside the scope of this project.

That leaves option 4, which really does seem reasonable. This is a small change overall: Adding the 13 instances of “(Gerald) Ford” to the 128 instances of “Gerald Ford” is not likely to be hugely impactful. However, it’s also very unlikely to have negative side effects. This will be implemented moving forward.
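As a rough illustration of option 4, here is a minimal sketch of how such a normalization might look. The function name `normalize_response` is my own placeholder, and the repository's actual implementation may well differ; this only shows the idea of dropping parentheses and diacritical marks.

```python
import re
import unicodedata

def normalize_response(text: str) -> str:
    """Drop parentheses (keeping their contents) and strip diacritical marks."""
    # Remove only the bracket characters, so "(Gerald) Ford" becomes "Gerald Ford".
    text = text.replace("(", "").replace(")", "")
    # NFKD splits "é" into "e" plus a combining accent, which is then discarded.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse any doubled whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_response("(Gerald) Ford"))         # Gerald Ford
print(normalize_response("Napoléon (Bonaparte)"))  # Napoleon Bonaparte
```

This merges “Napoléon (Bonaparte)” with “Napoleon Bonaparte”, but deliberately leaves a bare “Napoleon” or “Ford” alone, since those are the ambiguous cases.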

Finally, it’s probably a good idea to restrict the timespan the program searches. What’s considered important knowledge has changed over the years, as have the people on Jeopardy!’s writing staff. Viewing just the last 5 years seems like a decent compromise between quantity and relevance of data. With that in place, we can finally note the most common responses of the last few years. Here’s the top 75:

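| response | count |
|---|---|
| Chicago | 58 |
| Australia | 48 |
| Florida | 43 |
| Philadelphia | 40 |
| California | 39 |
| Brazil | 37 |
| India | 37 |
| Georgia | 34 |
| Alaska | 34 |
| Jupiter | 34 |
| Ireland | 33 |
| China | 33 |
| Greece | 33 |
| Mars | 33 |
| Texas | 32 |
| Poland | 31 |
| Japan | 31 |
| Spain | 30 |
| Boston | 30 |
| San Francisco | 30 |
| Mexico | 29 |
| New Orleans | 29 |
| Cuba | 29 |
| the Philippines | 29 |
| France | 28 |
| Switzerland | 27 |
| Norway | 27 |
| Hawaii | 27 |
| Virginia | 27 |
| Venice | 26 |
| Egypt | 26 |
| Paris | 26 |
| Iceland | 25 |
| Michigan | 25 |
| Portugal | 25 |
| Iran | 25 |
| Atlanta | 25 |
| Canada | 25 |
| Argentina | 24 |
| the Netherlands | 24 |
| Beethoven | 24 |
| Germany | 23 |
| Italy | 23 |
| Florence | 23 |
| Dublin | 23 |
| Sweden | 23 |
| New Zealand | 23 |
| South Africa | 23 |
| Amsterdam | 22 |
| Denmark | 22 |
| Scotland | 22 |
| Massachusetts | 22 |
| mercury | 21 |
| London | 21 |
| Ethiopia | 21 |
| Maine | 21 |
| New Mexico | 21 |
| Puerto Rico | 21 |
| Venus | 21 |
| Yellowstone | 20 |
| the Mississippi | 20 |
| Peru | 20 |
| Morocco | 20 |
| Colombia | 20 |
| Antarctica | 20 |
| New Jersey | 20 |
| Chile | 20 |
| Vienna | 20 |
| Napoleon | 20 |
| St. Louis | 19 |
| Pennsylvania | 19 |
| the Thames | 19 |
| Madagascar | 19 |
| Joan of Arc | 19 |
| Seattle | 19 |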
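Restricting the count to a recent window is a small filter on the air date. The sketch below assumes columns named `air_date` and `response` and a helper called `recent_counts`; all three are illustrative names rather than the project's own.

```python
import pandas as pd

# Toy data; the real dataframe has one row per clue with its air date.
clues = pd.DataFrame({
    "air_date": pd.to_datetime(["2012-03-01", "2021-06-15", "2023-11-02", "2024-01-20"]),
    "response": ["Chicago", "Chicago", "Australia", "Chicago"],
})

def recent_counts(df, years=5, today=None):
    """Count responses among clues aired within the last `years` years."""
    today = pd.Timestamp(today) if today is not None else pd.Timestamp.today()
    cutoff = today - pd.DateOffset(years=years)
    recent = df[df["air_date"] >= cutoff]
    return recent["response"].value_counts()

counts = recent_counts(clues, years=5, today="2024-06-01")
print(counts)
```

Passing `today` explicitly keeps the result reproducible; leaving it out uses the current date.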

It’s also fairly easy to sift through the first rows of this table by hand and remove all answers that are geographic locations:

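| response | count |
|---|---|
| Jupiter | 34 |
| Mars | 33 |
| Beethoven | 24 |
| mercury | 21 |
| Venus | 21 |
| Joan of Arc | 19 |
| Napoleon | 19 |
| Hamlet | 19 |
| Saturn | 19 |
| Picasso | 18 |
| iron | 18 |
| Macbeth | 18 |
| Mercury | 18 |
| David | 18 |
| Tesla | 18 |
| Galileo | 17 |
| Jordan | 17 |
| Richard III | 17 |
| Neptune | 17 |
| Wilson | 17 |
| Julius Caesar | 17 |
| Cleopatra | 16 |
| the Moon | 16 |
| Cinderella | 16 |
| Hamilton | 16 |
| lead | 16 |
| carbon dioxide | 15 |
| Mozart | 15 |
| Moses | 15 |
| hydrogen | 15 |
| Churchill | 15 |
| Lady Gaga | 15 |
| tea | 15 |
| soccer | 15 |
| Solomon | 15 |
| baseball | 15 |
| Lincoln | 15 |
| Harvard | 15 |
| the liver | 15 |
| smallpox | 14 |
| John Quincy Adams | 14 |
| Henry VIII | 14 |
| a horse | 14 |
| Copernicus | 14 |
| Catherine the Great | 14 |
| Buddhism | 14 |
| the Amazon | 15 |
| Alexander the Great | 14 |
| World War I | 14 |
| Exodus | 14 |
| Teddy Roosevelt | 14 |
| Eisenhower | 14 |
| the heart | 14 |
| gold | 14 |
| Wagner | 14 |
| 12 | 13 |
| Robinson Crusoe | 13 |
| Marie Antoinette | 13 |
| the Titanic | 13 |
| cricket | 13 |
| Grey’s Anatomy | 13 |
| The Phantom of the Opera | 13 |
| Jaws | 13 |
| John Quincy Adams | 13 |
| Twelfth Night | 13 |
| Nero | 13 |
| Carmen | 13 |
| bamboo | 13 |
| King Lear | 13 |
| the Statue of Liberty | 13 |
| Marshall | 13 |
| John | 13 |
| Nixon | 13 |
| Madonna | 13 |
| teeth | 13 |
| the Louvre | 13 |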

Separating data by value

I’m also curious whether these results change for different point values. Chicago is a very well known location for Americans, so it’s possible that Chicago only appears so often because it’s an easy “gimme” question. In general, clues with higher point values are much harder; maybe common responses are only common for easy questions. Would Chicago still be the most common response if we limit our search to expensive questions? Let’s break down answers by clue value and see if the results change. In order to account for splitting the data into ten parts, I’ll extend the search to the last 25 years. Here are the 20 most common responses for each clue value in Single Jeopardy:

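| $200 answer | count | $400 answer | count | $600 answer | count | $800 answer | count | $1000 answer | count |
|---|---|---|---|---|---|---|---|---|---|
| China | 108 | Australia | 64 | California | 57 | Chicago | 41 | Australia | 39 |
| Hawaii | 106 | Alaska | 62 | Chicago | 54 | France | 36 | Maine | 35 |
| Japan | 104 | Chicago | 58 | Australia | 45 | New York | 36 | Brazil | 35 |
| California | 74 | California | 54 | China | 44 | California | 34 | Chicago | 34 |
| Alaska | 73 | France | 54 | Texas | 44 | Australia | 34 | Greece | 32 |
| Chicago | 72 | China | 53 | Spain | 42 | Spain | 34 | South Africa | 31 |
| Australia | 67 | Japan | 53 | India | 39 | China | 33 | France | 29 |
| Mexico | 66 | Spain | 53 | France | 38 | India | 33 | Sweden | 29 |
| Florida | 65 | Canada | 52 | Florida | 38 | Alaska | 32 | Spain | 26 |
| France | 64 | Mexico | 51 | Japan | 37 | Greece | 32 | Japan | 26 |
| India | 59 | India | 50 | London | 35 | Minnesota | 32 | Oklahoma | 25 |
| George Washington | 58 | Florida | 49 | Hawaii | 35 | Pennsylvania | 31 | Belgium | 25 |
| Ireland | 58 | Boston | 44 | Germany | 35 | Mexico | 30 | Utah | 25 |
| Boston | 56 | New York | 44 | Sweden | 35 | Maine | 30 | Texas | 24 |
| Canada | 55 | Texas | 44 | New Orleans | 34 | Canada | 29 | Ireland | 24 |
| Russia | 52 | Egypt | 40 | Italy | 34 | New Mexico | 29 | Wyoming | 24 |
| Egypt | 50 | San Francisco | 38 | Alaska | 33 | Texas | 28 | Maryland | 24 |
| New Orleans | 50 | London | 37 | Mars | 32 | Italy | 28 | Norway | 24 |
| Paris | 50 | Switzerland | 37 | Greece | 31 | Israel | 28 | Portugal | 24 |
| New York | 49 | Hawaii | 36 | South Africa | 31 | Montana | 28 | Thailand | 24 |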

And for Double Jeopardy:

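| $400 answer | count | $800 answer | count | $1200 answer | count | $1600 answer | count | $2000 answer | count |
|---|---|---|---|---|---|---|---|---|---|
| China | 111 | Australia | 57 | Japan | 46 | Australia | 48 | Brazil | 33 |
| Japan | 87 | Chicago | 57 | Australia | 45 | Sweden | 40 | Denmark | 33 |
| France | 85 | India | 55 | Sweden | 45 | Georgia | 37 | Portugal | 32 |
| Australia | 81 | Spain | 54 | France | 42 | Italy | 36 | India | 30 |
| Paris | 77 | France | 47 | Spain | 40 | Brazil | 34 | Andrew Jackson | 30 |
| California | 74 | China | 46 | Canada | 40 | Florida | 34 | Sweden | 28 |
| Mexico | 73 | Mexico | 46 | India | 39 | France | 32 | Indonesia | 28 |
| Cleopatra | 70 | Japan | 43 | Italy | 37 | Spain | 32 | Georgia | 27 |
| Spain | 68 | Paris | 43 | Portugal | 36 | South Africa | 32 | Norway | 26 |
| London | 67 | Egypt | 43 | Chicago | 35 | Maine | 32 | Poland | 26 |
| Alaska | 67 | California | 42 | Denmark | 35 | India | 31 | the Netherlands | 26 |
| Ireland | 67 | Ireland | 41 | Greece | 34 | Mexico | 31 | Spain | 25 |
| Italy | 66 | Italy | 41 | China | 33 | Switzerland | 31 | North Carolina | 25 |
| India | 64 | South Africa | 41 | Brazil | 33 | Norway | 30 | Finland | 25 |
| Chicago | 62 | Canada | 39 | Paris | 32 | Portugal | 29 | South Africa | 24 |
| Canada | 57 | Venus | 39 | South Africa | 32 | Chicago | 29 | Chicago | 24 |
| George Washington | 56 | Napoleon | 37 | Texas | 32 | Denmark | 29 | New Hampshire | 24 |
| Hawaii | 55 | Rome | 37 | Germany | 31 | Andrew Jackson | 28 | France | 23 |
| Florida | 52 | Hamlet | 37 | New York | 31 | China | 27 | Egypt | 22 |
| Egypt | 52 | Texas | 36 | Napoleon | 30 | Greece | 26 | the Philippines | 22 |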
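Per-value tables like the two above come down to a grouped count followed by a top-N cut within each group. The following is a minimal sketch under the assumption that the dataframe has `round`, `value`, and `response` columns; those names, the helper `top_responses`, and the toy rows are mine, not taken from the repository.

```python
import pandas as pd

# Toy data standing in for 25 years of clues.
clues = pd.DataFrame({
    "round": ["J1", "J1", "J1", "J1", "J2", "J2", "J2"],
    "value": [200, 200, 200, 400, 400, 400, 400],
    "response": ["China", "China", "Hawaii", "Australia", "China", "Japan", "China"],
})

def top_responses(df: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """Return the n most common responses for every (round, value) pair."""
    counts = (
        df.groupby(["round", "value", "response"])
          .size()
          .reset_index(name="count")
          .sort_values(["round", "value", "count"], ascending=[True, True, False])
    )
    # head(n) on a groupby keeps the first n rows of each group, in sorted order.
    return counts.groupby(["round", "value"]).head(n).reset_index(drop=True)

top = top_responses(clues, n=1)
print(top)
```

With the real data, `n=20` would give one column pair of the tables above for each (round, value) group.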

Here’s the same data, with all the geographic locations removed:

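| $200 answer | count | $400 answer | count | $600 answer | count | $800 answer | count | $1000 answer | count |
|---|---|---|---|---|---|---|---|---|---|
| George Washington | 58 | Ronald Reagan | 36 | Mars | 32 | 4 | 21 | Andrew Jackson | 17 |
| red | 49 | 2 | 30 | 3 | 25 | Eisenhower | 21 | Grover Cleveland | 17 |
| Abraham Lincoln | 47 | red | 29 | white | 24 | Mars | 18 | golf | 16 |
| McDonald’s | 47 | gold | 29 | basketball | 23 | Venus | 18 | Theodore Roosevelt | 16 |
| Napoleon | 46 | Wisconsin | 28 | Ronald Reagan | 22 | golf | 18 | 4 | 15 |
| gold | 43 | George Washington | 27 | Venus | 22 | Eleanor Roosevelt | 18 | Calvin Coolidge | 15 |
| Julius Caesar | 42 | tea | 27 | Richard Nixon | 22 | Theodore Roosevelt | 18 | white | 15 |
| Lincoln | 41 | 3 | 26 | Napoleon | 21 | Andrew Jackson | 17 | 12 | 15 |
| Madonna | 41 | Maine | 26 | Thomas Jefferson | 20 | blue | 17 | Henry VIII | 14 |
| Elvis Presley | 39 | Sweden | 25 | baseball | 20 | Jacob | 17 | Julius Caesar | 14 |
| water | 38 | Mars | 24 | Andrew Jackson | 20 | 3 | 16 | iron | 14 |
| milk | 36 | Thomas Jefferson | 24 | George Washington | 19 | Richard Nixon | 16 | Uranus | 14 |
| Cleopatra | 36 | coffee | 24 | Abraham Lincoln | 19 | Mark Twain | 16 | Solomon | 13 |
| Babe Ruth | 35 | rice | 24 | Julius Caesar | 19 | Henry VIII | 16 | Jupiter | 13 |
| white | 34 | World War I | 24 | 4 | 18 | Solomon | 15 | Saturn | 13 |
| Moses | 34 | Elvis Presley | 23 | green | 17 | Pocahontas | 15 | Neptune | 13 |
| 2 | 34 | Venus | 23 | Jupiter | 17 | nitrogen | 15 | Woodrow Wilson | 13 |
| Coca-Cola | 33 | Pennsylvania | 23 | blue | 17 | basketball | 14 | Job | 12 |
| golf | 32 | New Jersey | 23 | Hamlet | 17 | Jupiter | 14 | Othello | 12 |
| Richard Nixon | 32 | oil | 23 | Buddhism | 17 | 7 | 14 | John Adams | 12 |

| $400 answer | count | $800 answer | count | $1200 answer | count | $1600 answer | count | $2000 answer | count |
|---|---|---|---|---|---|---|---|---|---|
| Cleopatra | 70 | Venus | 39 | Napoleon | 30 | Andrew Jackson | 28 | Andrew Jackson | 30 |
| George Washington | 56 | Napoleon | 37 | Mozart | 30 | Woodrow Wilson | 22 | Woodrow Wilson | 21 |
| Napoleon | 51 | Hamlet | 37 | Thomas Jefferson | 28 | Henry VIII | 21 | Eugene O’Neill | 20 |
| Julius Caesar | 49 | Macbeth | 34 | David | 26 | Jupiter | 20 | William Faulkner | 19 |
| Michelangelo | 47 | Julius Caesar | 32 | Galileo | 25 | Thomas Jefferson | 19 | Virginia Woolf | 18 |
| Mars | 43 | Abraham Lincoln | 31 | Michelangelo | 24 | Theodore Roosevelt | 19 | Henry Moore | 18 |
| Joan of Arc | 43 | Picasso | 31 | Hamlet | 23 | Richard III | 18 | Richard III | 17 |
| Mark Twain | 38 | Thomas Jefferson | 30 | Mars | 22 | Eleanor Roosevelt | 18 | John Quincy Adams | 17 |
| Abraham Lincoln | 38 | Cleopatra | 28 | Beethoven | 22 | King Lear | 17 | Maria Theresa | 17 |
| Hamlet | 36 | Ronald Reagan | 28 | Theodore Roosevelt | 22 | Charlemagne | 17 | Claudius | 16 |
| Alexander the Great | 36 | Mars | 27 | King Lear | 21 | A Midsummer Night’s Dream | 17 | Aeschylus | 16 |
| Ronald Reagan | 35 | Mozart | 27 | Richard III | 20 | Rembrandt | 17 | Twelfth Night | 15 |
| Romeo and Juliet | 35 | Lincoln | 26 | Henry VIII | 20 | George Eliot | 17 | Orpheus | 15 |
| Columbus | 35 | Queen Victoria | 26 | Picasso | 19 | Herbert Hoover | 17 | John Adams | 15 |
| Venus | 34 | George Washington | 25 | 3 | 19 | Archimedes | 17 | Raphael | 15 |
| Agatha Christie | 34 | World War I | 25 | Sylvia Plath | 19 | Galileo | 16 | Nathaniel Hawthorne | 15 |
| gold | 33 | David | 25 | Gerald Ford | 18 | Georgia O’Keeffe | 16 | Aristophanes | 15 |
| Shakespeare | 33 | Michelangelo | 24 | John Adams | 18 | English | 16 | 7 | 15 |
| water | 33 | Benjamin Franklin | 24 | Thomas Hardy | 18 | Venus | 15 | Sir Walter Scott | 14 |
| Beethoven | 32 | Galileo | 24 | Venus | 17 | Dylan Thomas | 15 | Zachary Taylor | 14 |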

Reading through this data is endlessly fascinating to me. Of course, I care more than the average bear about this show, so it’s hard for me to tell which peculiarities of this show are interesting to the average person. Here are just a few of the odd patterns present when breaking this data down by value:

  • Although the Beatles are notorious as a common subject on Jeopardy!, their prevalence as a response drops off rapidly after the $200 clue
  • The same is true for Shakespeare, although the titles of his plays see reasonable representation across different clue values
  • In general, some answers seem to favor certain values. "Barcelona" is three times as likely to be the answer to the $800 clue as to a clue of any other value in Single Jeopardy
  • Aeschylus is an even more extreme example. He has appeared as an answer only three times in Single Jeopardy, twice in a $1600 clue, and sixteen times under the $2000 clue.

It’s worth noting that none of these numbers are incredibly large. I don’t want to overstate the significance of Aeschylus’ 16 appearances, especially since there have been about 36,000 questions worth $2000 in the past 25 years. Nonetheless, we’ve at least gained some insight into our question! It does seem that expensive questions tend to be more spread out than cheap questions, but this doesn’t seem to be the whole story. Let’s get a more in-depth picture of the frequencies of different answers. Here’s a graph showing the frequencies of answers by their index in the grid:

Converting both axes into logarithmic axes brings the graph into clearer focus:

To get a better idea of how this data is shaped, we might separate this into the specific point values:

These graphs are not linear, so the equations that fit them closely are not simple. I doubt it would be informative to list every equation for the curves of best fit, but here’s a graph showing quadratic regression curves for the previous graph.
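A quadratic fit of this kind could be computed roughly as follows. This is a sketch under my own assumption that the regression is done on the log-log data (rank against count); the counts are just the first ten $200 values from the Single Jeopardy table, so the coefficients are illustrative only.

```python
import numpy as np

# First ten $200 counts from the Single Jeopardy table, most common first.
counts = np.array([108, 106, 104, 74, 73, 72, 67, 66, 65, 64], dtype=float)
ranks = np.arange(1, len(counts) + 1)

# Fit log10(count) = a*log10(rank)^2 + b*log10(rank) + c by least squares.
log_rank, log_count = np.log10(ranks), np.log10(counts)
a, b, c = np.polyfit(log_rank, log_count, deg=2)

def fitted_count(rank):
    """Evaluate the fitted curve back in ordinary (non-log) units."""
    lr = np.log10(rank)
    return 10 ** (a * lr**2 + b * lr + c)

print(a, b, c)
print(fitted_count(1), fitted_count(10))
```

Fitting in log space keeps the handful of very common answers from dominating the least-squares error.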

This more clearly confirms our suspicions. Easier clues, like the ones worth $200, focus more predominantly on a small handful of answers, while more expensive clues are spread across a wider variety of topics. However, the fact that answers are more widely spread for difficult clues than for easy ones does not mean that they are all the same answers. As we’ve seen, some answers, like “Virginia Woolf” or “Aeschylus”, appear almost exclusively as responses to difficult questions, even if they appear relatively often.

To bring this into sharper focus, compare the distribution of “Virginia Woolf” in Double Jeopardy to that of “Elizabeth I”:

| | $400 | $800 | $1200 | $1600 | $2000 |
|---|---|---|---|---|---|
| Virginia Woolf | 2 | 8 | 8 | 11 | 11 |
| Elizabeth I | 21 | 8 | 7 | 4 | 0 |

Both women have appeared exactly 40 times in Double Jeopardy over the last 40 years, but there’s a clear difference in distribution! “Who is Virginia Woolf” tends to be a much higher scoring answer than “Who is Elizabeth I”. Typically, when I ask people what they assume is the difference between expensive and inexpensive clues in Jeopardy!, they reason that expensive clues ought to ask about more “obscure” people and things, and this does hold true for our examples of Woolf and Elizabeth. Virginia Woolf is mostly talked about only in specialized English Lit courses at universities, while Elizabeth I is well known even to elementary students, so it’s probably fair to characterize Woolf as “more obscure”. That said, it’s significant that they appear with equal frequency. It’s as though Virginia Woolf is obscure enough to appear mostly in very expensive questions, but important enough to appear just as often as one of England’s most famous monarchs.

Measuring how interesting a response is

These figures that appear frequently, but only as high-value answers, are very interesting to me. It’s as though the Jeopardy! writers have set out a specific list of cultural touchstones that separate highly knowledgeable people from everyone else. In order to find these figures more easily, I’ll have to define some metric that indicates how “expensive” a question is, on average. I waffled on this for a while, but in the end I chose to treat an answer’s value as interval data rather than ordinal, and measured the mean value of each answer within each round. With this metric included, the distribution for Woolf and Elizabeth I looks like this:

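| | $400 | $800 | $1200 | $1600 | $2000 | Total J2 appearances | Average J2 value |
|---|---|---|---|---|---|---|---|
| Virginia Woolf | 2 | 8 | 8 | 11 | 11 | 40 | 1410 |
| Elizabeth I | 21 | 8 | 7 | 4 | 0 | 40 | 740 |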
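The averages can be checked by hand from the per-value counts: multiply each dollar value by the number of appearances at that value, add them up, and divide by the total number of appearances. A small sketch of that arithmetic (the helper name `average_value` is just for illustration):

```python
# Double Jeopardy clue values, in the same order as the table columns.
VALUES = [400, 800, 1200, 1600, 2000]

def average_value(appearances):
    """Mean clue value, weighting each dollar value by its number of appearances."""
    total = sum(appearances)
    return sum(v * n for v, n in zip(VALUES, appearances)) / total

woolf = [2, 8, 8, 11, 11]
elizabeth = [21, 8, 7, 4, 0]
print(average_value(woolf))      # 1410.0
print(average_value(elizabeth))  # 740.0
```

Woolf's 40 appearances add up to $56,400 and Elizabeth I's to $29,600, which gives averages of $1410 and $740 respectively.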

Out of pure interest, we might choose to display each answer on a massive scatterplot, comparing their average value in the first round with their total number of appearances in the first round.

And here’s a similar chart for the Double Jeopardy round:

Thinking of “interesting” answers as those with very high average values isn’t adequate. To explain what I mean by this, consider the fact that there are 13,385 answers that have appeared only as $2000 clues in Double Jeopardy, and 12,098 of these have appeared only once. Many of these are highly specific combinations of things that would only make sense in context. Some examples are:

  • "B-29 (the one that carried the atomic bomb)"
  • "BMO"
  • "people who have seconds"
  • "Runaway Bride of Chucky"
  • "Sunday in the Park with George Orwell"
  • "The Lion, the Witch, and the Wardrobe Malfunction"

If anything, the biggest surprises here are the answers I would have expected to be common, but aren’t. “The Man with the Golden Arm” and “The Three-Body Problem” both appear only once, although I thought they were fairly well known and important books. Regardless, this demonstrates that we need to think of “interesting” answers as those that balance average value and frequency of appearance, and that we should find a method that isolates these “interesting” answers. One option is to consider the Pareto frontier of the set of all answers. In this case, that means the set of answers that have a higher average value than every answer that appears at least as often. In other words, these data points are “Pareto optimal”: they cannot be further optimized for frequency of appearance without sacrificing average value, and vice versa.

I find this a bit surprising, but Pandas doesn’t have a built-in function to find the Pareto frontier. I always found that a bit of a shame, so for the sake of anyone reading, I’m including my function for finding the Pareto frontier along two optimized variables. This function is limited to 2-D Pareto frontiers, and is highly optimized for this specific circumstance.

import pandas as pd

# Given a dataframe df, returns the Pareto front of that data as another dataframe, maximizing both of the columns
# named x and y. All columns in the optional "other" list are included as well in the returned dataframe.
def pareto_front_2D(df, x, y, other=None):

    # Avoid a mutable default argument, and copy so the caller's list is never modified.
    other = list(other) if other is not None else []
    front = df[other + [x, y]].dropna()

    # Sort by x (ties broken by y), largest first. Walking down this order, a row belongs on the front only if it
    # sets a new record along y: any other row has both a lower-or-equal x and a lower-or-equal y than an earlier row.
    front = front.sort_values(by=[x, y], ascending=False, ignore_index=True)

    keep = []
    x_record, y_record = None, None

    for x_val, y_val in zip(front[x], front[y]):

        if y_record is None or y_val > y_record:
            # New record along y, so this row is on the front.
            x_record, y_record = x_val, y_val
            keep.append(True)

        else:
            # Exact duplicates of the current record holder are kept; everything else is dominated.
            keep.append(bool(x_val == x_record and y_val == y_record))

    return front.loc[keep].reset_index(drop=True)
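When working with a hand-rolled routine like this, it can help to have a slow but obviously correct version to compare against. The sketch below is such a cross-check and is not part of the original project: it tests every row against every other row, so it is quadratic in the number of rows. The toy data reuses three rows from the Single Jeopardy front, plus a made-up "Nowhere" row that should be filtered out.

```python
import pandas as pd

def pareto_front_bruteforce(df: pd.DataFrame, x: str, y: str) -> pd.DataFrame:
    """Keep rows that no other row beats. O(n^2), but easy to verify by eye."""
    pts = df.dropna(subset=[x, y])
    keep = []
    for _, p in pts.iterrows():
        # A row is dominated if another row is at least as good on both axes
        # and strictly better on at least one of them.
        dominated = (
            (pts[x] >= p[x]) & (pts[y] >= p[y])
            & ((pts[x] > p[x]) | (pts[y] > p[y]))
        ).any()
        keep.append(not dominated)
    return pts.loc[keep].sort_values(x, ascending=False).reset_index(drop=True)

toy = pd.DataFrame({
    "answer": ["Chicago", "Australia", "Padua", "Nowhere"],  # "Nowhere" is hypothetical
    "total": [173, 143, 9, 9],
    "average": [514.45, 534.27, 1000.0, 400.0],
})
front = pareto_front_bruteforce(toy, "total", "average")
print(front)
```

On a few thousand rows this is still fast enough to sanity-check a faster implementation before trusting it on the full dataset.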

The Pareto front for the Single Jeopardy round looks like this:

And the chart for the Double Jeopardy round looks quite similar:

The precise values on the Pareto front for the first round of Jeopardy! are:

| answer | J1 total | J1 average |
|---|---|---|
| Chicago | 173 | 514.45 |
| Australia | 143 | 534.27 |
| Greece | 101 | 566.34 |
| Brazil | 97 | 595.88 |
| South Africa | 77 | 644.16 |
| Maryland | 62 | 674.19 |
| Oklahoma | 60 | 683.33 |
| The Philippines | 53 | 690.57 |
| Indonesia | 51 | 701.96 |
| Ethiopia | 37 | 794.30 |
| Grover Cleveland | 25 | 800 |
| Ferrari | 17 | 811.76 |
| Andromeda | 16 | 825 |
| Martin Van Buren | 16 | 825 |
| Nicaragua | 15 | 853.33 |
| Bhutan | 15 | 853.33 |
| Malaysia | 14 | 871.43 |
| As You Like It | 13 | 876.92 |
| Dorothy Parker | 12 | 883.33 |
| Sisyphus | 11 | 927.27 |
| Padua | 9 | 1000 |

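Meanwhile, the Pareto front of the second round is:

| answer | J2 total | J2 average |
|---|---|---|
| Chicago | 135 | 989.63 |
| Australia | 131 | 1016.79 |
| Brazil | 96 | 1133.33 |
| South Africa | 91 | 1134.07 |
| Georgia | 84 | 1285.71 |
| The Philippines | 67 | 1367.16 |
| Algeria | 46 | 1373.91 |
| Finland | 44 | 1472.73 |
| Andrew Jackson | 35 | 1485.71 |
| Malta | 34 | 1576.47 |
| George Sand | 25 | 1584 |
| The Orinoco | 21 | 1600 |
| Tonga | 20 | 1700 |
| Ghana | 17 | 1741.18 |
| Avignon | 16 | 1775 |
| Mozambique | 15 | 1813.33 |
| Sikhism | 15 | 1813.33 |
| Aeschylus | 13 | 1938.46 |
| Ganymede | 9 | 2000 |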

Certainly, there are some interesting points in this data. I had no idea George Sand appeared so often and as such a high-value response, for instance. Nonetheless, this method of analysis does still leave something to be desired: for one, it maintains the previous bias towards geographic locations. More egregiously, it includes a large number of answers that are especially common without being high value: the first item on the Pareto front of the second round that I would really consider “interesting” is Malta.

Perhaps there’s some sort of value function that could be used instead to rank answers. The most natural solution is to total up the point values of each answer’s appearances, giving the total dollar amount that an answer has been worth over the last 25 years. Unfortunately, I don’t find this solution very satisfying either. Looking at the results of these totals, we see more of the usual suspects: Chicago, Australia, Andrew Jackson, etc. It seems that simply multiplying the total number of appearances by the average point value isn’t going to work either.

Let’s take a moment to consider the distribution of average point values. The distributions for average values look a bit like this:

The large spikes in this graph are due to the fact that most answers occur only once or twice, and therefore average values are very likely to be one of the possible values that a single answer can have. If we restrict these graphs to answers that have appeared more than 3 times over the time span investigated, we get a much clearer picture:

This data is distributed roughly normally! That gives the impression that it would be more reasonable to multiply the frequency of an answer’s appearance by the difference between its average value and the mean of all average values. This would have the added benefit of giving a score of zero or less to anything whose average value is not above the mean. Additionally, every clue value in Jeopardy! is divisible by 200. Dividing out this factor makes our metric that little bit cleaner.

To summarise, if an answer appears \(x\) times with an average value of \(y\), then the “interest” of that answer can be calculated as

$$x\cdot(y-600)/200 $$

Of course, that only applies to Single Jeopardy, where the average value is 600. For Double Jeopardy, the calculation would be \(x\cdot(y-1200)/200\). After ranking items by this metric, I’m happy to say that I’m broadly satisfied with it! It does seem to highlight topics that are important yet obscure; a lot of these names and places are standard parts of college curricula, but not well known by the general population, which is exactly what I wanted. I tinkered with a few alternate formulas, but they always seemed to emphasize what I wanted to see, rather than neutrally reflecting the data. Besides, the mathematical simplicity of this formula makes it feel like it has a very strong connection to the game. It’s easy to see how studying the topics that score high here would translate to an improved average score while playing Jeopardy! at home. It even seems to de-emphasize geography!
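As a quick worked example of the formula, here is a small sketch that plugs two rows from the Pareto tables into it. The function name `interest` and the `ROUND_MEAN` lookup are illustrative names of my own, not taken from the repository.

```python
# Mean clue value of each round, as used in the formulas above.
ROUND_MEAN = {"J1": 600, "J2": 1200}

def interest(appearances: int, average_value: float, game_round: str) -> float:
    """x * (y - round mean) / 200 for an answer with x appearances and average value y."""
    return appearances * (average_value - ROUND_MEAN[game_round]) / 200

# Padua: 9 Single Jeopardy appearances at an average of $1000.
print(round(interest(9, 1000, "J1")))      # 18
# Malta: 34 Double Jeopardy appearances at an average of $1576.47.
print(round(interest(34, 1576.47, "J2")))  # 64
```

An answer whose average value sits exactly at the round mean scores zero, no matter how often it appears, which is what pushes the Chicagos and Australias down the ranking.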

This time, while displaying the top scorers, I think I’ll separate the geographic answers out entirely. Without further ado, here they are! In these charts, J1 represents answers from the Single Jeopardy round, while J2 represents answers from Double Jeopardy.

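| geographic J1 answers | interest |
|---|---|
| Ethiopia | 36 |
| Thailand | 27 |
| Indonesia | 26 |
| Oklahoma | 25 |
| Jordan | 24 |
| Hungary | 24 |
| the Philippines | 24 |
| Maryland | 23 |
| Wyoming | 21 |
| Bhutan | 19 |
| Lebanon | 19 |
| Sri Lanka | 19 |
| Malaysia | 19 |
| Andromeda | 18 |
| Oregon | 18 |
| Padua | 18 |
| Mississippi | 18 |
| Guam | 18 |
| Libya | 17 |
| Memphis | 17 |
| South Africa | 17 |
| Singapore | 17 |
| Maui | 17 |
| Turkey | 17 |
| Carthage | 16 |
| St. Augustine | 16 |
| Uranus | 16 |
| Maine | 16 |
| The Atlas Mountains | 16 |
| Chad | 15 |
| Dubai | 15 |
| Crete | 15 |
| Montana | 15 |
| Iowa | 15 |
| Bavaria | 15 |
| Andorra | 15 |
| Windsor | 15 |
| Panama | 15 |
| Namibia | 14 |
| Milan | 14 |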

| other J1 answers | interest |
|---|---|
| Grover Cleveland | 25 |
| Andrew Jackson | 19 |
| As You Like It | 18 |
| Sisyphus | 18 |
| Martin Van Buren | 18 |
| Ferrari | 18 |
| Jacob | 17 |
| The Crimean War | 17 |
| Dorothy Parker | 17 |
| William Jennings Bryan | 17 |
| Richard III | 17 |
| Earl Warren | 17 |
| Twelfth Night | 16 |
| Titus Andronicus | 16 |
| The Rosetta Stone | 16 |
| Caligula | 16 |
| Samuel Pepys | 16 |
| Howard Hughes | 16 |
| Lolita | 15 |
| Wagner | 15 |
| William McKinley | 15 |
| National Geographic | 15 |
| Job | 15 |
| Sinclair Lewis | 15 |
| Joshua | 15 |
| cholera | 14 |
| Robert the Bruce | 14 |
| Jack London | 14 |
| Patrick Henry | 14 |
| Van Buren | 14 |
| curling | 14 |
| Sikhism | 13 |
| Henry James | 13 |
| Guy Fawkes | 13 |
| Phosphorus | 13 |
| John Tyler | 13 |
| Strom Thurmond | 13 |
| deciduous | 13 |
| Ulysses | 13 |
| Trotsky | 13 |
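
| geographic J2 answers | interest |
|---|---|
| Malta | 64 |
| Finland | 60 |
| the Philippines | 56 |
| Tonga | 50 |
| Kazakhstan | 46 |
| Avignon | 46 |
| Ghana | 46 |
| Mozambique | 46 |
| Andromeda | 42 |
| Qatar | 42 |
| Botswana | 42 |
| the Orinoco | 42 |
| Bhutan | 42 |
| Nigeria | 40 |
| Yemen | 40 |
| Algeria | 40 |
| San Marino | 40 |
| Liechtenstein | 38 |
| Bahrain | 38 |
| Djibouti | 38 |
| Georgia | 36 |
| Io | 36 |
| East Timor | 36 |
| Ganymede | 36 |
| Ethiopia | 34 |
| the Bay of Biscay | 34 |
| Ur | 34 |
| Montenegro | 34 |
| the Tagus | 34 |
| Lebanon | 34 |
| Fiji | 34 |
| the Caspian Sea | 32 |
| Angola | 32 |
| Borneo | 32 |
| Timbuktu | 32 |
| Thermopylae | 32 |
| Nunavut | 32 |
| Cornwall | 32 |
| Suriname | 30 |
| El Salvador | 30 |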

| other J2 answers | interest |
|---|---|
| Andrew Jackson | 50 |
| George Sand | 48 |
| Aeschylus | 48 |
| Sikhism | 46 |
| Twelfth Night | 44 |
| Petrarch | 44 |
| Henry Moore | 44 |
| Andromeda | 42 |
| Virginia Woolf | 42 |
| Fidelio | 40 |
| Voltaire | 38 |
| Caligula | 38 |
| Maria Theresa | 38 |
| Aaron Copland | 38 |
| Zachary Taylor | 38 |
| Charles II | 38 |
| John Locke | 38 |
| Herodotus | 38 |
| Raphael | 36 |
| Much Ado About Nothing | 36 |
| August Wilson | 36 |
| Zoroastrianism | 36 |
| Richard Wright | 36 |
| Pericles | 36 |
| Ambrose Bierce | 36 |
| The Sun Also Rises | 34 |
| Marc Chagall | 34 |
| Billy Budd | 34 |
| William Butler Yeats | 34 |
| Ovid | 34 |
| Solzhenitsyn | 34 |
| Gilgamesh | 34 |
| John Donne | 34 |
| Shelley | 32 |
| Langston Hughes | 32 |
| Skylab | 32 |
| Cicero | 32 |
| John Keats | 32 |
| the Knights Templar | 32 |
| the Etruscans | 32 |

Overall, I’m quite pleased with this! These results mostly line up with what I expected, with names like Keats, Solzhenitsyn, and Woolf winding up as big winners. There’s a few unexpected results as well — I’m shocked curling wound up being so high up, for example. If you want proof that this metric highlights the top left edge of the data, look at the scatter plots from before, now with the 100 most “interesting” datapoints in orange:

All the code used is available in this project’s GitHub repository. Feel free to play around with it yourself! If you find anything interesting or surprising, feel free to contact me and let me know!