This comment is awaiting moderation at The Aporetic's post on Google ngrams, "The Segway of Digital Searching," but I thought I'd get it up here quick, just so it may serve whatever function it may have. I have a lot more to say about this, especially in light of an interesting discussion at this past weekend's THATCamp. But time is short...
(And, on reflection, my attempt to shift the metaphor from ngrams as Segway to ngrams as primitive combustion engine may be a little strong...)
I want to defend the ngrams data and suggest that many of the things which are irking you are as much a function of the primitive search interface as they are of the data.
The reason to be excited is that, right now, so much data is available. What we can do with this stunningly enormous bag of glyphs (I would not even use the term "word" yet) is indeed primitive. But I do think it already opens itself to possibilities you haven't fully allowed. Searching on big abstract nouns is, at best, like trying to read braille through a burlap sack. But searching on other terms is more productive.
Proper names (with appropriate capitalization) produce results which, while unsurprising, seem basically right:
No new knowledge or research agenda here; but I'd be loath to say that even this primitive visualization is worthless. (I would further qualify the results and insist that what is of interest is the trend; even the relative heights can be confounded by other people with the same surname or names with unusual diacritics.) I, like Dan Cohen and Benjamin Schmidt, was impressed by the way in which the Science paper attempted to examine censorship and suppression by comparing ngram curves in different languages across the same historical period.
Or consider the wonderful search for "beft/best," which someone much cleverer than I came up with. (Dan Cohen mentions this example with reference to Danny Sullivan's post.) That one image confirms what we already know about book and typographical history. But it's hard to imagine a more compelling visualization of this fact, isn't it?
(I have turned off the smoothing in each case which, I think, prevents the ugliness of the underlying data from being prematurely obscured.)
My point? The data itself remains non-ideal; the OCR is, let's be kind and say, "imperfect." The absence of any link back to the texts from which the ngrams are extracted hampers research. There are reasons to question the quality of the metadata which provides the dates (or the justification/reasoning about how to handle multiple editions of a single work, etc). The lack of any real sense of precisely what books are being searched is a bigger problem still. The absence of periodicals, newspapers, etc, is an enormous lack. (All these points are eloquently made by Mark Davies of the Corpus of Historical American English.) Heck, I have deeper reservations than these about this sort of quantification of culture.
But to judge this data based on its current state and the currently usable interface is premature. The analogy is not to the Segway (we can do it, so what?), but to the first combustion engines (yeah it runs, but it doesn't take us anywhere).
But the data itself is available. There is nothing but will and know-how (and the frighteningly large processing requirements) preventing someone from taking the 4-gram or 5-gram data and making it queryable in precisely the fashion you describe: show me most frequent collocates of "hysteria." This seems worthy of (not uncritical) celebration and support.
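To make the collocate idea a little more concrete, here is a rough sketch of what such a query might look like. It is only an illustration under assumptions of mine: that each row of the downloaded n-gram files is tab-separated, with the n-gram first, the year second, and the match count third. The function name `collocates` and the sample rows are hypothetical (the counts are made up, not real ngram data). It also ignores the fact that overlapping 5-grams count the same stretch of text more than once, so the totals are a crude ranking rather than true frequencies.

```python
from collections import Counter


def collocates(lines, target, top=10):
    """Rank words that share an n-gram with `target` by summed match count.

    Assumes each line looks like: ngram<TAB>year<TAB>match_count<TAB>...
    (an assumption about the file layout, not a guarantee).
    """
    target = target.lower()
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip rows that do not have the assumed columns
        tokens = fields[0].lower().split(" ")
        if target not in tokens:
            continue
        try:
            match_count = int(fields[2])
        except ValueError:
            continue  # skip rows whose count column is not a number
        for token in tokens:
            if token != target:
                counts[token] += match_count
    return counts.most_common(top)


# Invented sample rows, for illustration only.
sample = [
    "a case of female hysteria\t1890\t12\t9",
    "hysteria and nervous disorders in\t1895\t7\t5",
    "the best of all possible\t1890\t40\t30",
]

for word, count in collocates(sample, "hysteria"):
    print(word, count)
```

On the real files, `lines` would be an open file handle rather than a list, and the processing requirements mentioned above become the actual problem: the loop is trivial, but the data is not small.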