Wednesday, November 28, 2012

Surname Ngrams

Google has an interesting feature: the Google Ngram Viewer. An n-gram is a sequence of n consecutive items, such as phonemes or words, taken from a span of text or speech, and the viewer tracks how often such sequences occur. The Google Ngram Viewer lets you search Google Books across a span of years for a series of terms, and it graphs the rate of appearance of those terms.
I thought it might be interesting to graph the frequency of several of my surnames. I chose the eight surnames of my great-great-grandparents. However, after the initial attempt, I removed 'Deutsch' from the list since it dwarfed the rest of the surnames. Terms that are relatively similar in frequency work best in these graphs.
Below is the chart for Cruvant, Blatt, Feinstein, Newmark, Vanevery, Denyer and Lichtman. The span of years I chose was 1900-2000. I selected English language texts. Below the graph are links to the actual Google Books search results for each term graphed for particular decades.

If you enlarge the graph and look carefully, you'll notice that the blue line for Cruvant appears briefly in the 1950s. This is due entirely to my cousin, Bernard Cruvant, who got some press for his psychiatric work. The surname Feinstein definitely grew in use in the last half of the century, going from least frequent to most. I was at first a little surprised that "Denyer" wasn't more common, but I would have to go back a couple of centuries for that particular spelling of the English word for someone who denies to be prevalent. I also hadn't noticed at first that the search is case-sensitive. When I plotted 'Denyer' vs. 'denyer', it was clear that in the 1700s the religious term made several appearances. If I hadn't limited the search to English texts, the Blatt surname would likely have increased in frequency significantly.

Some reviews I found online raised questions about the accuracy of the results.
1) The number of texts from particular time periods is likely uneven. Does Google factor this into the equation?

I believe the answer must be "yes." The Y-axis is a percentage, not a number of occurrences. Google claims each point on the graph is the frequency for a given year.
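To make the percentage idea concrete, here is a minimal sketch of how a per-year percentage could be computed from raw counts. It is only my illustration of the general idea, not Google's actual method; the function name and all of the numbers are invented.

```python
def relative_frequency(matches_by_year, totals_by_year):
    """Return {year: percentage of that year's words that matched the term}.

    Hypothetical helper for illustration; not part of any Google API.
    """
    result = {}
    for year, total in totals_by_year.items():
        if total == 0:
            continue  # no scanned text for this year, so nothing to plot
        result[year] = 100.0 * matches_by_year.get(year, 0) / total
    return result


# Invented numbers: ten times more matches in 1950, but also ten times more text.
matches = {1900: 5, 1950: 50}
totals = {1900: 1_000_000, 1950: 10_000_000}
freq = relative_frequency(matches, totals)
```

With these made-up counts both years come out to the same percentage, even though the raw number of matches differs tenfold. That is why an uneven amount of scanned text per year need not distort a graph drawn this way.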

Of course, the type of material that has been scanned (fiction, scholarly journals, reference works, etc.) will make a difference. I suspect the diversity of digitized material decreases the further back one goes.

2) Optical character recognition can easily get confused. One example is that in older texts the letter 's' often looks a lot like the modern 'f'.

3) If a particular term has had multiple spellings over time, that will naturally affect the graph as well, as I pointed out with the surname Denyer.
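If one had raw per-year counts for each spelling, points 2 and 3 could be partly worked around by adding the variants together before graphing. The sketch below assumes such counts are available as plain dictionaries; the helper names and the sample numbers are hypothetical, and the 's'-to-'f' rule is a deliberate simplification of how older typefaces and OCR actually behave.

```python
from collections import Counter


def merge_variants(counts_by_term, variants):
    """Sum per-year counts over every spelling listed in `variants`."""
    merged = Counter()
    for term in variants:
        merged.update(counts_by_term.get(term, {}))
    return dict(merged)


def long_s_misreadings(word):
    """Spellings OCR might produce if an old-style 's' were read as 'f'.

    Simplification: every lowercase 's' except a final one is swapped at once.
    """
    if len(word) < 2:
        return {word}
    body, last = word[:-1], word[-1]
    return {word, body.replace("s", "f") + last}


# Invented counts, for illustration only.
counts = {"Denyer": {1750: 2, 1950: 7}, "denyer": {1750: 9}}
combined = merge_variants(counts, ["Denyer", "denyer"])
suspects = long_s_misreadings("blessings")
```

Here `combined` holds one total per year across both capitalizations, and `suspects` lists the extra spelling one might also search for when older texts are involved.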
