as long as sexuality remains as integrated into social life in general as say, eating, its possibilities of symbolic extension are to that degree limited
—Jameson
The notorious difficulty of defining obscenity and pornography has a long history. But the term porn has, over the past decade or so, begun to crop up with new meanings, in new places. This post assumes that this accretion is not merely random and is subject to at least some degree of analysis and consideration and, in that spirit, tries to sort that meaning out. Be forewarned, however: this post will feature nothing that is itself prurient. [Note: Prurience is of key importance as a criterion in the legal definition of obscenity. "Pornographic" and "pornography," unlike "obscene" and "obscenity," do not have strict legal definitions.] Indeed, the meanings of porn are now freed from sexual explicitness, at the exact historical moment which has seen a relatively broad acceptance of a distinct and well-defined pornography industry (a process that I take to have been greatly accelerated by, but that predates, the internet).
I was reminded of the changing uses to which the word pornographic has been put when I happened to hear this exchange regarding the Jodi Arias trial a couple of weeks ago, between host of NPR's "Talk of the Nation" (requiescat in pace) Neal Conan and novelist Walter Mosley:
NEAL CONAN: ... is there a trial that you followed closely?
WALTER MOSLEY: You know, I actually try my best not to follow trials because there seems to be something a little pornographic and a little un-American about it. I kind of feel that if somebody is, like, being tried for something, that that's - it's not exactly a private thing, but it's a thing between them and the law, and that's the reason we have law, so I don't have to make a decision about it.
(Why We Can't Look Away from True-Life Courtroom Dramas, May 13, 2013)
What does "pornographic" here mean? In his comments, Mosley appeals to a notion of privacy, as if the "pornography" of high-profile trials (like that of Jodi Arias) inheres in a violation of the intimacy of a special, private moment between the defendant and the court. I will leave entirely unremarked both the peculiar notion of privacy between an individual and the state and what may be un-American about it.
But surely Mosley does not mean to indicate a violation of privacy, in the strict sense, so much as the particular kind of spectatorship these trials invite, the transformation of the trial into a spectacle. A trial, from such a perspective, is a sort of means towards an end (call that end, say, justice); but the trial becomes a spectacle when this means/ends logic is interrupted, when the end seems to become obscured and one revels in the trial as an end in itself. It is a violation not of any sort of privacy so much as of the usual solemnity and function of a trial.
This mode of spectatorship that confuses means and ends is what I take Mosley to have meant when he called such trials pornographic. Of course, what counts as a means and what counts as an end is hardly a simple question; and this observation alone doesn't really answer my broader question. Why is it that the term to name such spectacle is pornographic, rather than, say, sensational or exploitative or even the bland and polemically boring inappropriate?
Food Porn

Petits pains au chocolat by Laurence Vagner, on Flickr
The best evidence of the shifting meanings of pornographic is, I think, the tumblrs, pinterest boards, and blogs that treat the term porn as nearly a suffix to name not a particular kind of content (though there are plenty of those) but rather the particular mode of engagement that concerns Mosley when it is applied to trials. Things like "book porn," or "library porn," or "bookcase porn." [Note: With my examples, I fear I've given myself away.] Surely, the most common of such non-pornographic porn (and likely the origin of the instances I quoted and, no doubt, of many more) comes from food porn.
Food porn, if you're unfamiliar, has its own Wikipedia entry, though that page betrays a variety (bordering on incoherent) of definitions broader than what I would associate with the term. [Note: Among them, a suggestion that the porn of food porn comes from its nutritional qualities, i.e., food porn is "unhealthy" food, as in this "Right Stuff vs. Food Porn" eating advice column.] One can go to Food Porn Daily for what I take to be instances of this particular genre. And while there's plenty of variability in what gets tagged #foodporn, the instances at Food Porn Daily make the template pretty clear: pictures of food with highly saturated color; always in close-up, frequently at a low angle (a picture of cookies cooling on a baking sheet, shot from what looks like just inches above the baking sheet); at least one area of very crisp focus, often with a shallow depth of field (so that the front of these ricotta crepes with Smoked Salmon, Capers, Red Onions, and Cherry tomatoes is in sharp focus, even as the opposite end of the crepe blurs).
The food may be unplated, on a cutting board or still cooking in a pan; it may be plated and presented, prêt à manger; occasionally one finds a fork, having already separated out a bite, lying on the plate ready to be picked up. Occasionally, one even sees a small bite already taken (as is the case, I take it, in this image of a prosciutto, strawberry, and brie panino on crusty ciabatta). [Note: Banal Hypothesis: An analysis of the titles of menu items (à la Franco Moretti's examination of the lengths of novel titles in "Style, Inc.") would reveal a marked increase in complexity of dish names over the past two decades.] But any human presence is absent (though the image of a half-eaten hamburger in this story about the "end of food porn" might complicate that claim). The photos at Food Porn Daily achieve a remarkable stylistic consistency even though they are aggregated from a variety of sources. Indeed, food photography appears to be a well-established commercial specialization, and the internet does not lack helpful introductions and recommendations for creating food porn. That is, food porn has a well-established set of stylistic conventions, all of which work to solicit a certain type of gaze. All the things that we might wish to enjoy in a piece of food are translated into (or perhaps merely replaced by) visual analogues/correlates/substitutions/replacements. The papery texture of crisply cooked pastry dough must be captured at the right depth, with the right color, to suggest its materiality even in its absence; the doneness of an egg yolk (warm not slimy, cooked through but not solid) must inhere in the brightness of the yellow, perhaps even in the visual evidence of the flow from the (we infer) egg, just cut, only moments before, by some invisible hand.
If you wish to make these dishes, there is a Food Porn Daily cookbook; though the Amazon reviews confirm what surely you already knew: these pictures, not the dishes, are themselves what is to be consumed. You might try to make these dishes yourself; but really? Expert though you may be, your dish will never look like that anyway (even if you don't replace milk with Elmer's glue). At one level this is no different from the rise of the Food Network, or home improvement television, or other genres of "do it yourself" entertainment which cloak entertainment and aspiration in the "how to" and "do it yourself" rhetoric of self-betterment.
But food porn (and its fellow pornographic genres) certainly intensifies the divorce between means and ends. A bookshelf at bookshelfporn is unlikely to inspire one to remodel one's home, or take up saw and mallet. [Note: Wood-grained bookshelves and wood-grilled paninis? One could rewrite this entire post focusing just on the issue of class. Look at these bookcases; look at these dishes. Whose idea of the "good life" is this? And how do adjectives and adjectival phrases (crunchy ciabatta bread, etc.) capture that aspiration?] And even if so inclined, you could never build all the bookshelves at bookshelfporn. Overabundance and excess are part of the spectacle: one picture of a bookshelf is not bookshelfporn; but once it's aggregated with others, you're on your way (perhaps a bookshelf porn guide to carpentry is on its way). In food porn, the food itself is a necessary, but not sufficient, condition for the required spectacle. Such photos represent a condition that in many cases never existed: they are often postprocessed, or generated by techniques like HDR; the macro lenses and lighting, and the sort of focus and angle typical of these photos, create a prosthetic, inhuman gaze.
So, here's a definitional distinction that I'll hazard. While Walter Mosley can invoke the adjectival pornographic to describe cable news coverage of a trial, the term porn (as a quasi-suffix to indicate a certain genre of tumblr/blog/website) is reserved for visual material, and indeed, for collections of photographic images. The photographic conventions (which, it may be worth admitting, are not unrelated to pornography in the most conventional sense) are easily transferable to nearly any object that one could conceive a lust for (garden porn! bicycle porn! finch porn! banana peel porn! linotype porn!). [Note: Please, please, please be careful about googling any of those terms.]
Equally vital is the sense of an excessive superabundance of images generated by the fact of aggregation and collection (or curation, to speak in the argot of Web 2.0). Although food porn predates sites like tumblr and pinboard, for such a definition it is through such sites that (to adapt Hegelian terms) this particular idea of "porn" fully achieves its Concept. The mode of spectatorship that is the defining fact of this genre requires excess. One finds that sort of excess which itself inheres in the word obscene:
- Offending against moral principles, repugnant; repulsive, foul, loathsome. Now (also): spec. (of a price, sum of money, etc.) ridiculously or offensively high.
...
1974 Greenville (S. Carolina) News 23 Apr. 1/8 Energy officials have already predicted that first-quarter oil profits will be 'embarrassingly high' or 'whoppers'. Sen. Henry Jackson, D-Wash., has said they'll be 'almost obscene.'
("Obscene")
Another reason you can't eat the food porn: there is just too much of it.
So, a definition. Food porn—a genre of "content" (and here too, using the vacuous term content, the connection of this genre to the history of the web seems evident) that redeploys certain photographic conventions and a sense of excess in order to solicit a mode of spectacular consumption, of consumption (of visual apprehension) as an end in itself.
It is as "an end in itself" that it becomes possible, as Mosley does, to use the term pornographic to describe a way of watching a trial and to have the moral force of that statement be completely clear. In 2001 (a comparatively early moment in this history of the idea of food porn), Anthony Bourdain invoked this same sense of implicit morality, writing about "food porn" chiefly as a sort of vicariousness (at times, he seems to cross the vehicle and tenor of the metaphor, calling "food porn" not a substitute for food, but a "substitute for sex"). [Note: He also has a not very convincing argument that contemporary foodieism emerged out of the decline in sexual promiscuity in the wake of the AIDS crisis.]
An obvious moral attends such a definition of imagery and the modes of attention it solicits in terms of vicariousness. Food porn becomes a dangerous supplement to real culinary enjoyment. And, indeed, D. H. Lawrence's objections to pornography of a more conventional sort provide the template: pornography as a degraded, perverted, alienated, vicarious sexuality (just as, from that wider template of phonocentrism, writing is a sort of degraded/alienated speech). Bourdain, in 2001, expresses hope that we will soon move beyond mere food porn to actual food:
As we once only read about sex in the '50s before indulging ourselves indiscriminately in its pleasures in the '60s, '70s and early '80s, we might now also be approaching a crossroads. Instead of simply reading about small, good things and gaping at them in pictures, maybe we will begin, once again after a long, long absence, to cook it, rediscovering the best of ourselves and holding it close.
("Food Porn")
One must give Bourdain credit for mentions in the piece of the Olympia Press, etc., which prove an admirable familiarity with the print history of obscenity. And, from a very different corner, one finds Fredric Jameson drawing on this moral dimension of the term porn in his description of the pornographic in the opening lines of Signatures of the Visible (1990). Here, though, it is not vicariousness (the replacement of the proper object by its mere substitute) that is the problem, so much as the rapt, uncritical, unthinking fascination which marks the pornographic gaze:
The visual is essentially pornographic, which is to say that it has its end in rapt, mindless fascination; thinking about its attributes becomes an adjunct to that, if it is unwilling to betray its object; while the most austere films necessarily draw their energy from the attempt to repress their own excess (rather than from the more thankless effort to discipline the viewer). Pornographic films are thus only the potentiation of films in general, which ask us to stare at the world as though it were a naked body. (1)
[Note: The first sentence of that block quote won the Philosophy and Literature-sponsored third annual "bad writing" contest. I loathe such contests (even as I value enormously clear writing). I will grant, however, that the demonstrative that after the semicolon (which, I think, refers to "that fascination") is a little ungainly. I wonder as well whether potential mightn't be a fairer term for potentiation.] One can find this formulation ("the visual is essentially pornographic") quoted with some frequency. While it introduces Signatures of the Visible, it is not elaborated beyond this passage and serves chiefly as a way of introducing a sort of statement of values and commitments that informs Jameson's film criticism (of which Signatures is a collection; I recall liking the Dog Day Afternoon essay).
And while Jameson says "The visual is essentially pornographic," he (the thinker who offers Always historicize! as the "slogan" of "all dialectical thought" and the "moral" of The Political Unconscious [9]) surely would not offer this as a statement about "the visual" as such. [Note: Apparently, you can get a lot from a Jameson book without ever moving past the first page.] Indeed, the thinker who insists that "the senses are themselves not natural organs but rather the results of a long process of differentiation even within human history" (Political 62) is making not a statement about the visual as such, but about the visual at our own historical moment, or rather, the place of the visual at the start of the 1990s, prior to what we might imagine as our own post-cinematic moment.
Indeed, there are two histories that The Political Unconscious offers that help account for the suffixication of "porn" to name a genre of browsable imagery defined not by the content of the images, but by a mode of engagement ("rapt fascination").
Discussing psychoanalysis, Jameson suggests that the emergence of the psychoanalytic hermeneutic depends on a larger autonomization of sexuality itself:
The psychoanalytic demonstration of the sexual dimension of overtly nonsexual conscious experience and behavior is possible only when the sexual "dispositif" or apparatus has by a process of isolation, autonomization, specialization, developed into an independent sign system or symbolic dimension in its own right; as long as sexuality remains as integrated into social life in general as say, eating, its possibilities of symbolic extension are to that degree limited, and the sexual retains its status as a banal inner-worldly event and bodily function. (Political 64)
Food porn, like psychoanalysis, reveals the "sexual dimension of the overtly nonsexual" as well. Not, however, the same way psychoanalysis does—through a hermeneutics of revealing, discovering a latent sexuality behind some overt, apparently nonsexual meanings. The apparatus at stake here is the pornographic gaze itself, which has itself been sufficiently autonomized to be utterly separable even from pornography.
There is a second, broader history out of which food porn emerges: that of aesthetics itself. Food porn (and similar phenomena) represents the consuming of images as images, as a quasi-autonomous experience that is only tangentially connected to the culinary enjoyment (or whatever) that they appear to represent. [Note: Adorno refers at one point to non-aesthetic pleasure as merely culinary.] Of the emergence of visual art from the autonomization of sight itself, Jameson writes:
as sight becomes a separate activity in its own right, it acquires new objects that are themselves the products of a process of abstraction and rationalization which strips the experience of the concrete of such attributes as color, spatial depth, texture, and the like, which in their turn undergo reification. The history of forms evidently reflects this process, by which the visual features of ritual, or those practices of imagery still functional in religious ceremonies, are secularized and reorganized into ends in themselves, in easel painting and new genres like landscape, then more openly in the perceptual revolution of the impressionists, with the autonomy of the visual finally triumphantly proclaimed in abstract expressionism. So Lukács is not wrong to associate the emergence of modernism with the reification which is its precondition; but he oversimplifies and deproblematizes a complicated and interesting situation by ignoring the Utopian vocation of the newly reified sense, the mission of this heightened and autonomous language of color to restore at least symbolic experience of libidinal gratification to a world drained of it, a world of extension, gray and merely quantifiable. (Political 63)
The engagement solicited by food porn, and indeed by porn, is itself not unrelated to the aesthetic gaze and its history which Jameson offers here in miniature. The aesthetic as a separate domain insists on such autonomy. In a provocative aside in her essay on "Jane Austen and the Masturbating Girl," Eve Sedgwick connects the autonomous aesthetic of Kant to autoerotic pleasure: "the Aesthetic in Kant is both substantially indistinguishable from, and at the same time definitionally opposed against, autoerotic pleasure" (111). Definitionally opposed because, in the Kantian vocabulary, the pornographic would be merely agreeable; it is interested pleasure which gratifies some bodily need. The pleasure one takes in eating a hamburger, for Kant, is not aesthetic because it sates a hunger. But if I take pleasure not in the hamburger, but in an image of it, my pleasure begins to look less interested, less purposeful even as it maintains a certain... purposiveness; it begins to look, maybe eerily, aesthetic.
Of course, this slippery slope which leads to the equation of porn with the aesthetic is one which the twentieth century assiduously avoided, even as the substantial indistinguishability Sedgwick notes persisted. One can see it, for instance, in the similarity of the formulae art for art's sake and dirt for dirt's sake. This latter formula was invoked by Judge Woolsey in the famous 1933 Ulysses decision (one finds it today reprinted with the Viking edition). In finding that the work was nowhere "dirt for dirt's sake," Woolsey finds that the work is not obscene. And yet, that slogan itself, which offers a definition of obscenity, deliberately echoes the slogan-like assertion of aesthetic autonomy, "art for art's sake."
Works Cited
Jameson, Fredric. The Political Unconscious. Ithaca, NY: Cornell UP, 1981. Print.
———. Signatures of the Visible. New York: Routledge, 1990. Print.
"obscene, adj." OED Online. March 2013. Oxford University Press. Online.
Sedgwick, Eve Kosofsky. Tendencies. Durham: Duke UP, 1993. Print.
In my last post I talked about some of the challenges of reproducing the analysis, offered in Stephen Ramsay's Reading Machines, of the distinctiveness of characters' vocabulary in Woolf's The Waves. Here I follow up a bit more on why I was unable to get the same results Ramsay reports.
R, tm, and weightTfIdf
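To get R to generate tf-idf scores, one can do the following:
library(tm) # provides DocumentTermMatrix and weightTfIdf
# Generate a Document Term Matrix weighted by tf-idf (corpus is an existing tm corpus)
dtm <- DocumentTermMatrix(corpus, control=list(weighting=weightTfIdf))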
This was how I got my scores last time; and those scores didn't match the ones reported by Ramsay. So, one explanation is that I bollocks'd the data; the other is that there may be something going on under the hood.
To explore this latter possibility, I had a look at the weightTfIdf function. [Note: Doing so is easy; if you're in an R console, with the tm package loaded, just type weightTfIdf and you'll get all 26 lines of the code.] It is worth noting that by default, the function normalizes its scores by document length. That is, rather than calculating the tf-idf score using a raw count of the number of times a particular term appears in a single document (in our case, the number of times a particular character uses a word), it divides the raw term frequency by the total number of words in that document (in our case, the total number of words said by a character). This normalizing process explains why the scores Ramsay reports for Louis (12) start at 5.9, while my own scores were always small numbers (0.006..., 0.003..., etc.).
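To make the effect of normalization concrete, here is a minimal base-R sketch. It is not the tm implementation itself, only an illustration of the arithmetic described above; the matrix `m`, its two terms, and its counts are all made up for the example.

```r
# Hypothetical term-document matrix of raw counts (rows = terms, columns = documents).
m <- matrix(c(6, 0,
              2, 8),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("grained", "oak"), c("louis", "bernard")))

n_docs <- ncol(m)
df  <- rowSums(m > 0)        # number of documents containing each term
idf <- log2(n_docs / df)     # inverse document frequency, base 2

# Unnormalized: raw count times idf (idf is recycled down the rows).
raw_tfidf <- m * idf

# Normalized: divide each column by that document's total word count first.
norm_tf    <- sweep(m, 2, colSums(m), "/")
norm_tfidf <- norm_tf * idf

print(raw_tfidf)
print(norm_tfidf)
```

In this toy case each "document" is only eight words long, so the normalized scores are still sizeable; with monologues running to thousands of words, the same division produces the very small values mentioned above.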
Short digression on misunderstanding the code
When I first dug into the weightTfIdf code, I thought I had discovered a difference in implementation. [Note: In order to work with the data, this function transposes the Document Term Matrix into a Term Document Matrix; this means that rather than the columns being terms and the rows documents, the rows are terms and the columns documents. I don't know why the function does this; but it does, and that explains why we're looking at row sums in what follows rather than column sums.] The line in question:
lnrs <- log2(nDocs(m)/rs)
Here we're calculating part of the tf-idf score—the part that corresponds to Ramsay's log(N/df). Ignore the log2 for a moment and look at that divisor, rs. That should represent the number of documents which contain a particular term. How is that calculated? Well, the code says:
rs <- row_sums(m > 0)
That row_sums function comes from the slam package; and when I first saw this, I thought I had my explanation. Ah ha! If the divisor in the logarithm is the sum of a row, then it is dividing the number of documents not by the number of documents which contain the specified term, but by the sum of the row, that is, by the total occurrences of that term! Here is our explanation!
"Well, but what is that comparison, m > 0, doing in the function call," you ask? Good question! Well, I assumed that it was just a way to pass non-zero values to the row_sums function, and went about trying to rewrite the function properly. This heady delirium led to questions on StackOverflow and wasted time. Because, of course, I totally misunderstood how this part of the code works.
The trick is that a relational operator (like >) applied to a matrix returns a matrix of boolean values, evaluating that expression for each item in the matrix. So, each cell in the term document matrix is now either TRUE or FALSE based on whether the count for that term was greater than 0. So it might look something like this:
Docs
Terms        bernard.txt  jinny.txt  louis.txt  neville.txt  rhoda.txt  susan.txt
absorption   TRUE         FALSE      FALSE      FALSE        FALSE      FALSE
abstract     TRUE         FALSE      FALSE      FALSE        FALSE      FALSE
abstraction  FALSE        FALSE      TRUE       FALSE        FALSE      FALSE
absurd       TRUE         FALSE      FALSE      TRUE         FALSE      FALSE
absurdity    TRUE         FALSE      FALSE      TRUE         FALSE      FALSE
absurdly     FALSE        FALSE      FALSE      TRUE         FALSE      FALSE
And, since TRUE is treated as numerically equivalent to 1, and FALSE to 0, if we were to sum these rows, we would get... exactly what we were looking for; i.e., the number of documents in which the term appears. So, that was wasted time.
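The same behavior can be checked in a few lines of base R. This is only a sketch: it uses base `rowSums` as a stand-in for slam's `row_sums`, and the small count matrix is hypothetical.

```r
# Hypothetical counts for three terms across three speakers (rows = terms).
m <- matrix(c(3, 0, 0,
              1, 0, 2,
              0, 0, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("absorption", "absurd", "abstraction"),
                            c("bernard", "jinny", "louis")))

present <- m > 0          # logical matrix: TRUE wherever a count is non-zero
df    <- rowSums(present) # TRUE counts as 1, so this is the document frequency
total <- rowSums(m)       # by contrast, the total occurrences of each term

print(present)
print(df)
print(total)
```

Summing the logical matrix gives the number of documents containing each term, while summing the raw counts gives total occurrences, which is the quantity I had mistakenly assumed the function was using.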
Another thing to note here is that weightTfIdf uses log2(), the binary logarithm. [Note: This returns the exponent to which you would raise 2 in order to get the specified number; i.e., log2(16) = 4, because 2^4 = 16.] The logarithm here works essentially to scale the term's frequency based on how specific the term is to any particular document; a term which occurs in all documents will have df equal to N, so that N/df equals 1. And log(1) == log2(1) == 0. That is, it will push the weight to 0 (or, for terms which occur in nearly all documents, towards 0). Whereas a term which occurs in only one document will maximize the ratio, at N/1. A logarithmic function is used simply to dampen that effect, preventing scores from increasing linearly in the case of terms highly specific to a single document.
Does it make a difference whether one uses log2() or the natural logarithm or the base 10 "common" logarithm I learned back in school? I don't really know; I doubt it. The particular logarithmic function one chooses will change the score, but it wouldn't change the relationship among the terms (which had the highest score).
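A quick way to test that hunch is to score a handful of terms under all three logarithms and compare the resulting orderings. The sketch below uses made-up tf and df values for a six-document corpus; the names and numbers are assumptions for illustration, not figures taken from the analysis.

```r
# Hypothetical tf and df values for four terms in a six-document corpus.
tf <- c(western = 9, oak = 10, grained = 6, accent = 14)
df <- c(western = 1, oak = 2, grained = 1, accent = 3)
N  <- 6

score_log2  <- tf * log2(N / df)
score_ln    <- tf * log(N / df)
score_log10 <- tf * log10(N / df)

# Logarithms in different bases differ only by a constant factor,
# so the scores change but the ordering of the terms does not.
rank_log2  <- names(sort(score_log2,  decreasing = TRUE))
rank_ln    <- names(sort(score_ln,    decreasing = TRUE))
rank_log10 <- names(sort(score_log10, decreasing = TRUE))

print(rbind(rank_log2, rank_ln, rank_log10))
```

All three rows of the printed matrix list the terms in the same order, which is what one would expect given that changing the base only rescales every score by the same amount.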
Hand Simulations
After all that, I tried to tinker a bit to bring results into line. I stopped normalizing my data and tried different log functions in an attempt to match Ramsay's scores. Well, okay, let's take one last look.
Despairing, I returned to the clearest data Ramsay gives us (the list of terms and scores which appears on page 12) and picked a few, to see if I could manage this by hand. [Note: My high school computer science teacher (we switched from Pascal to C++ midway through my high school career) used to make us "hand simulate" algorithms: print out the source code, jot down the variable names, and debug by hand. Smart guy. (Though the phrase "hand simulation" may have been unfortunate.)] (I don't really mean by hand of course. Just... more slowly.) Here are the terms Ramsay lists for Louis, which have a score of 5.0021615: australian, beast, grained, thou, wilt.
So, I returned to my data.
> rawdtm <- DocumentTermMatrix(characters) # A basic Document Term Matrix of raw frequency counts
> terms <- c('australian','beast','grained','thou','wilt')
> inspect(rawdtm[,terms])
A document-term matrix (6 documents, 5 terms)
Non-/sparse entries: 6/24
Sparsity : 80%
Maximal term length: 10
Weighting : term frequency (tf)
Terms
Docs         australian  beast  grained  thou  wilt
bernard.txt  1           0      0        0     0
jinny.txt    0           0      0        0     0
louis.txt    6           6      6        6     6
neville.txt  0           0      0        0     0
rhoda.txt    0           0      0        0     0
susan.txt    0           0      0        0     0
This is already puzzling. These five terms all had the same score. Yet australian occurs six times in Louis but appears across two texts, while the other terms occur six times in Louis, and only in Louis.
So, back to the data. I opened up the original Gutenberg text and searched through it for australian. And, indeed, it looks like there are 7 total occurrences, one of which is from Bernard:
I thought how Louis would mount those steps in his neat suit with his cane in his hand
and his angular, rather detached gait. With his Australian accent
("My father, a banker at Brisbane") he would come, I thought, with
greater respect to these old ceremonies than I do, who have heard
the same lullabies for a thousand years.
Let's check grained. I found six occurrences; all in Louis's speech and, as I was doing this by hand, I noticed that all those occurrences of grained were in the phrase grained oak:
- "the grained oak panel of Mr Wickham's door"
- "...Mr Wickham's grained oak door"
- "...I am also one who will force himself to desert these windy and moonlit territories, these midnight wanderings, and confront grained oak doors."
- "I, the companion of Plato, of Virgil, will knock at the grained oak door."
- "I brought down my fist on my master's grained oak door."
- "...yet brought down my fist on the grained oak door"
Why does grained occur in Ramsay & Steger's list, but not oak? Well, maybe everyone talks about oak trees, but only Louis talks about grained oak. Back to our data:
>inspect(rawdtm[,c('grained','oak')])
A document-term matrix (6 documents, 2 terms)
Non-/sparse entries: 3/9
Sparsity : 75%
Maximal term length: 7
Weighting : term frequency (tf)
Terms
Docs         grained  oak
bernard.txt  0        2
jinny.txt    0        0
louis.txt    6        10
neville.txt  0        0
rhoda.txt    0        0
susan.txt    0        0
Hmm... well, let's do the math for these three terms (the first two of which got the same score in Ramsay's analysis; the last of which didn't appear at all). For these three terms, australian, grained, and oak, for Louis, we have:
            tf  df  N  1+tf*(log2(N/df))  1+tf*(log10(N/df))  1+tf*(log(N/df))
australian  6   2   6  10.50978           3.862728            7.591674
grained     6   1   6  16.50978           5.668908            11.75056
oak         10  2   6  16.84963           5.771213            11.98612
So oak should have the highest score of the three.
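For anyone who wants to repeat the arithmetic, here is a small base-R sketch of the calculation. The tf and df values are the ones read off the inspect() output earlier in this post; the formula 1 + tf * log(N/df) is simply the one used in the column headers of the table, and the variable names are my own.

```r
# Raw counts for Louis (tf) and document frequencies (df) for the three terms.
tf <- c(australian = 6, grained = 6, oak = 10)
df <- c(australian = 2, grained = 1, oak = 2)
N  <- 6   # six speakers, hence six documents

scores <- cbind(tf, df, N,
                log2  = 1 + tf * log2(N / df),
                log10 = 1 + tf * log10(N / df),
                ln    = 1 + tf * log(N / df))

print(round(scores, 5))

# Which term scores highest under each choice of logarithm?
top <- apply(scores[, c("log2", "log10", "ln")], 2, function(col) names(which.max(col)))
print(top)
```

Whichever base is used, oak comes out on top, with grained close behind and australian well below both.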
Something is clearly wrong.
I've returned to the raw counts, double-checked them against the original file, and computed these things individually. I've tried every approach I can imagine, but have been unable to produce any of the scores listed on page 12.
I suspect that there may be a problem in Ramsay and Steger's data. There is significant overlap between Ramsay and Steger's results and my own. It is for Louis that our data is most consistent; we share 19 of the top 24 terms. It is least similar for Bernard (sharing only 4 of 24 terms) and Rhoda (11 of 24). Did Ramsay and Steger's analysis, perhaps, discard the final chapter, where only Bernard talks (perhaps deliberately)? That would obviously greatly change the Bernard data set; and, by removing so much text, it could easily affect the df term for other characters, accounting for the other discrepancies. (But it wouldn't explain oak.)
It may very well be my data that is the problem, but if so I've been unable to locate the error. I'm abandoning trying to reconcile these approaches to a method that should, in principle, be eminently reproducible. I will though summarize the top 24 terms with highest tf-idf scores, using the best data I've got, and not normalizing.
> louis[1:24]
western     oak         beast       grained     thou        wilt        accent
23.264663   15.849625   15.509775   15.509775   15.509775   15.509775   14.000000
boasting    nile        average     clerks      stamps      australian  boys
12.679700   12.679700   10.339850   10.339850   10.000000   9.509775    8.000000
pitchers    steel       beaten      boast       bobbing     custard     eatingshop
7.924813    7.924813    7.754888    7.754888    7.754888    7.754888    7.754888
england     eyres       ham
7.754888    7.754888    7.754888
> bernard[1:24]
curiosity   hampton     letter      phrases     byron       elderly     heaven      married
25.84963    23.77444    23.26466    22.22858    22.18948    20.67970    20.67970    20.67970
observed    dinner      phrase      willow      fin         simple      describe    self
20.67970    20.60451    20.00000    19.01955    18.09474    18.09474    17.43459    17.43459
stick       sense       story       nature      pictures    thinking    canopy      enemy
17.43459    16.37895    16.00000    15.84963    15.84963    15.84963    15.50978    15.50978
> neville[1:24]
story       doomed      immitigable papers      byron       catullus
12.000000   10.339850   10.339850   10.339850   7.924813    7.924813
cheep       perfection  camel       detect      don         hosepipes
7.924813    7.924813    7.754888    7.754888    7.754888    7.754888
hubbub      loads       mallet      marvel      squirting   waits
7.754888    7.754888    7.754888    7.754888    7.754888    7.754888
boys        founder     knives      pocket      scene       shakespeare
7.000000    6.339850    6.339850    6.339850    6.339850    6.339850
> jinny[1:24]
tunnel      prepared    billowing   game        native      peers       quicker
9.509775    7.924813    7.754888    7.754888    7.754888    7.754888    7.754888
melancholy  bodies      band        cabinet     coach       crag        dazzle
6.339850    5.264663    5.169925    5.169925    5.169925    5.169925    5.169925
deftly      equipped    eyebrows    felled      haymarket   jump        lockets
5.169925    5.169925    5.169925    5.169925    5.169925    5.169925    5.169925
matthews    murmured    prepare
5.169925    5.169925    5.169925
> rhoda[1:24]
oblong      dips        tiger       fuller      swallow     fallen      steep       suspended
18.094738   12.924813   11.094738   10.339850   9.509775    8.000000    7.924813    7.924813
cliffs      minnows     pond        terror      bunch       foam        party       puddle
7.754888    7.754888    7.754888    7.000000    6.339850    6.339850    6.339850    6.339850
pools       violets     bow         caverns     chirp       choke       column      columns
6.000000    6.000000    5.169925    5.169925    5.169925    5.169925    5.169925    5.169925
> susan[1:24]
kitchen setter washing bury cart gate apron
15.849625 10.339850 10.339850 8.000000 7.924813 7.924813 7.754888
seasons squirrel windowpane beds butter clean wet
7.754888 7.754888 7.754888 6.339850 6.339850 6.339850 6.339850
blown winter baby bitten boil cabbages carbolic
6.000000 5.264663 5.169925 5.169925 5.169925 5.169925 5.169925
clara cradle eggs
5.169925 5.169925 5.169925
I am currently teaching a graduate course (eng630: "Digital Humanities": Emerging Tools and Debates in Literary Study) and, as much as possible, I'm trying to make clear the mechanics behind some of the text-analysis in the works we're reading. So, this week, as I prepared to discuss Stephen Ramsay's Reading Machines, I wanted to reproduce some of the analysis done there. The first chapter, for instance, offers a tf-idf reading of Woolf's The Waves. Here is how Ramsay describes it:
It is possible—and indeed an easy matter—to use a computer to transform Woolf's novel into lists of tokens in which each list represents the words spoken by the characters ordered from most distinctive to least distinctive term. Tf-idf, one of the classic formulas from the field of information retrieval, endeavours to generate lists of distinctive terms for each document in a corpus. We might therefore conceive of Woolf's novel as a 'corpus' of separate documents (each speaker's monologue representing a separate document), and use the formula to factor the presence of a word in a particular speaker's vocabulary against the presence of the word in other speakers' vocabularies. (11)
This post summarizes how I tried to do just that, and the different results I got. I'm not sure what accounts for the differences from Ramsay's (and Sara Steger's) results; I'll try to show you what I mean below. In a future post I'll use the same "method" on a different text (spoiler: it's Ulysses).
Readers familiar with The Waves, and the demands of text processing, will immediately recognize why the analysis of the characters' monologues would present itself as a tractable problem ("indeed an easy matter"). While, in theory, one could do a similar analysis for any novel (or any work with multiple speakers), the narrative structure of The Waves makes it particularly available to this sort of analysis. Chapters describing the process of the sun across the sky in the course of a single day alternate with chapters in which characters speak in semi-monologue about their lives. This device itself is the novel's most obvious departure from the conventions of narrative fiction, but it also makes it "an easy matter" (well, maybe for some people) to extract these dialogues. If you had good, marked-up data, you could easily extract this information (as Lincoln Mullen shows in this post, working with the Folger's TEI Shakespeare); but if all you have is unstructured plaintext, you're going to have a problem. Woolf's novel though, even in plaintext, carries a good deal of this informational structure in its novelistic form (there is, as they say, no such thing as an unmarked text).
Here is a chunk of The Waves, quoted at random:
'Where is Bernard?' said Neville. 'He has my knife. We were in
the tool-shed making boats, and Susan came past the door... Now we
must drop our toys. Now we must go in together. The copy-books are
laid out side by side on the green baize table.'
'I will not conjugate the verb,' said Louis, 'until Bernard has
said it...
There is always a short phrase (starting with an opening single quotation mark—i.e. an apostrophe—and a capital letter), some text, a closing single quote (variously punctuated), the word said followed by a character name and some punctuation mark, an opening single quotation mark and some words. This single "monologue" may continue into the next paragraph (which would then, consistent with convention, be opened by a single quotation mark—i.e. an apostrophe). Finally the monologue is closed by an apostrophe before the narrative turns to another character (and another opening apostro-quote), or to one of those sun-dappled interludes.
Whew; describing what is so obvious to any reader of the text is painful (as, I imagine, is reading my description of it), but it is this highly structured convention which makes Woolf's novel comparatively available to processing. Even absent TEI (or other) markup, Woolf's convention creates an ad-hoc ordered hierarchy of content objects, at least for the reader interested in the characters' monologues. Someone with more regex-fu than I have might be able to cut out character dialogue programmatically. (I think the regex would go something like /'([^,]+,)' said Louis, '(.*)'/ and would capture, as $1 and $2, the material the character says... in theory. I tried to sort this out, but quickly gave up.) Instead, I manually paged through the text and pasted together all the text said by a single character, so that everything from that opening apostro-quote to the closing apostro-quote would be on one line, and the phrase said [character name] would occur somewhere near the front of that line. (In emacs, checking twitter and listening to podcasts, this represented an hour and a half's labor; labor, mind you, which was sufficiently mindless that I enjoyed a beer. Though, as you'll see below, this fact led me to redo the entire thing.)
And so, with thanks to Woolf for her highly structured departure from novelistic convention, and to emacs for keybindings that made this somewhat less loathsome, you're ready to extract your data. It is now simply a matter of grepping the file for each character:
grep 'said Louis' the-waves.txt > characters/louis.txt
grep 'said Neville' the-waves.txt > characters/neville.txt
...
And so on. (The text Ramsay & Steger use, and that I also used, comes from Project Gutenberg Australia. Because of copyright, I cannot share the processed data I am working with, and, as you'll see, this extraction process is a crucial step. If you'd be interested in seeing or using this data, to save yourself the hour and a half's labor, however, just drop me an email. I have relatives in Australia who would be happy to host you during the term of your interaction with this copyrighted material.)
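For anyone who would rather not paste by hand, here is a minimal sketch of how the attribution step might be automated in base R. It is not the method behind the results in this post, and it rests on assumptions worth stating loudly: that paragraphs are separated by blank lines, that every paragraph of speech opens with a single quote, and that a new speaker is always announced by said plus a name. The function name split_by_speaker and the sample text are made up for illustration.

```r
# Hypothetical helper (not part of tm): group speech paragraphs by speaker.
split_by_speaker <- function(text, speakers) {
  # Split on one or more blank lines, then undo hard wraps inside paragraphs
  paragraphs <- strsplit(text, "\n([ \t]*\n)+")[[1]]
  paragraphs <- trimws(gsub("[ \t]*\n[ \t]*", " ", paragraphs))
  said <- paste0("said (", paste(speakers, collapse = "|"), ")")
  out <- setNames(vector("list", length(speakers)), speakers)
  current <- NA_character_
  for (p in paragraphs) {
    if (!startsWith(p, "'")) {
      # Narration or interlude: nobody is speaking
      current <- NA_character_
      next
    }
    # A "said <Name>" tag switches speaker; otherwise the speech continues
    hit <- regmatches(p, regexpr(said, p))
    if (length(hit) == 1) {
      current <- sub("said ", "", hit, fixed = TRUE)
    }
    if (!is.na(current)) {
      out[[current]] <- c(out[[current]], p)
    }
  }
  out
}

# A tiny invented sample in the shape of the novel's paragraphs
sample_text <- paste(
  "'Where is Bernard?' said Neville. 'He has my knife.",
  "We were in the tool-shed making boats.",
  "",
  "'Now we must go in together.'",
  "",
  "The sun rose higher.",
  "",
  "'I will not conjugate the verb,' said Louis.",
  sep = "\n"
)
speeches <- split_by_speaker(sample_text, c("Bernard", "Louis", "Neville"))
```

Each element of the result could then be written out with writeLines() to a file such as characters/louis.txt. Note that a paragraph missing its opening quote would be treated as narration and silently dropped, so the output would still want checking against the text.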
At which point, the actual, real analysis begins. Here is the code I used, in R (using the tm package), to get my results. It assumes that each individual's speech is contained in a single text file in a directory immediately below the working directory, called 'characters' (that's what all that grepping above was about).
# This code relies on the tm (text mining) package
library('tm')
# Create a corpus based on the subdirectory
characters <- Corpus(DirSource('characters/'))
# To aid processing let's make everything lower-case
# (newer releases of tm expect base functions to be wrapped, as in
# tm_map(characters, content_transformer(tolower)))
characters <- tm_map(characters, tolower)
# And remove punctuation
characters <- tm_map(characters, removePunctuation)
# And we'll remove stopwords - this step is optional. But in the version
# of the code I'm pasting here I removed them, in an effort (to no avail,
# alas!) to match Ramsay & Steger's result.
characters <- tm_map(characters, removeWords, stopwords('english'))
# Now, we create a Document Term Matrix - that is, a set of
# the frequencies for each word in each document. The secret
# sauce is that control=list(weighting=weightTfIdf) line, which
# asks that those not be raw counts, but tfidf scores.
dtm <- DocumentTermMatrix(characters, control=list(weighting=weightTfIdf))
And here is just a taste of what that looks like. This code asks R (personify much?) to let me see the 45th through 55th terms in the matrix (the terms are arranged alphabetically) for all texts. (You access the matrix by requesting row and column: matrix[row, column]; so the empty row field requests all rows (that is, all texts), and columns (which represent words) 45 through 55 (a range chosen entirely at random).)
> inspect(dtm[,45:55])
A document-term matrix (6 documents, 11 terms)
Non-/sparse entries: 16/50
Sparsity : 76%
Maximal term length: 13
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Terms
Docs account accounts accretions accumulating accumulation
bernard.txt 0.0002033962 0.0002033962 0.0001247118 0.0002033962 0.0002033962
jinny.txt 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
louis.txt 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
neville.txt 0.0000000000 0.0000000000 0.0004414937 0.0000000000 0.0000000000
rhoda.txt 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
susan.txt 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Terms
Docs accumulations accuracy accurately achieve acid
bernard.txt 0.0002033962 0.0002033962 0.0000000000 0.0001247118 0.0000000000
jinny.txt 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
louis.txt 0.0000000000 0.0000000000 0.0000000000 0.0004378349 0.0004378349
neville.txt 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0004414937
rhoda.txt 0.0000000000 0.0000000000 0.0007776662 0.0000000000 0.0000000000
susan.txt 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000
Terms
Docs acknowledge
bernard.txt 0.0000786844
jinny.txt 0.0000000000
louis.txt 0.0002762431
neville.txt 0.0002785515
rhoda.txt 0.0000000000
susan.txt 0.0000000000
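So, this shows us, for instance, that Bernard, Louis, and Neville all use the word acknowledge (Jinny, Rhoda, and Susan don't). Louis and Neville score higher than Bernard, and almost (though not exactly) the same as each other; because these scores are normalized by the length of each character's text, that may say as much about how much more Bernard talks as about how often anyone says acknowledge.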
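To make those numbers a little less of a black box, here is a minimal sketch of the arithmetic in base R. It follows the formula as I understand the tm documentation to describe it for weightTfIdf: a term's count in a document (divided by the total number of terms in that document when normalizing), multiplied by log2 of the number of documents over the number of documents containing the term. On that reading, a word found in three of six documents has an idf of log2(6/3) = 1, so its normalized score is simply its count divided by the document's length. The toy counts and the helper name tf_idf are invented for illustration; this is not tm's own code.

```r
# Toy term counts, invented for illustration: rows are documents, columns are terms
counts <- matrix(
  c(2, 1, 0,
    1, 0, 3,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("apple", "pear", "plum"))
)

# Hypothetical helper mirroring the formula described above
tf_idf <- function(counts, normalize = TRUE) {
  n_docs <- nrow(counts)
  doc_freq <- colSums(counts > 0)   # how many documents contain each term
  idf <- log2(n_docs / doc_freq)    # a term found in every document scores 0
  # Optionally divide each row by that document's total number of terms
  tf <- if (normalize) counts / rowSums(counts) else counts
  sweep(tf, 2, idf, `*`)            # scale each term's column by its idf
}

normalized <- tf_idf(counts)
unnormalized <- tf_idf(counts, normalize = FALSE)
```

In this toy corpus apple scores zero everywhere because every document contains it, while plum in document b scores 3 * log2(3) unnormalized and 0.75 * log2(3) once divided by that document's four terms. The sketch assumes every term occurs in at least one document; a column of all zeros would divide by zero.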
At this point, we've got the data. All that's needed is a little R data-finesse to get it back out in the order we want it.
I'm quite new to R, so I may be missing the better/more obvious way to do this, but this way seems to work. I load the data into a matrix, and then extract it into lists (I think I'm getting my R data types right), ordered by the word's score. We can then output as many (term, score) pairs from the re-ordered lists as we want (say, the top 24 terms).
m <- as.matrix(dtm)
bernard <- sort(m[1,], decreasing=TRUE)
jinny <- sort(m[2,], decreasing=TRUE)
louis <- sort(m[3,], decreasing=TRUE)
neville <- sort(m[4,], decreasing=TRUE)
rhoda <- sort(m[5,], decreasing=TRUE)
susan <- sort(m[6,], decreasing=TRUE)
>louis[1:24]
western accent grained thou wilt beast
0.006426702 0.005691854 0.004284468 0.004284468 0.004284468 0.003570390
boasting nile average clerks oak stamps
0.003502680 0.003502680 0.002856312 0.002856312 0.002762431 0.002762431
australian boys pitchers steel beaten bobbing
0.002627010 0.002209945 0.002189175 0.002189175 0.002142234 0.002142234
custard eatingshop england eyres fourthirty ham
0.002142234 0.002142234 0.002142234 0.002142234 0.002142234 0.002142234
> bernard[1:24]
thats hampton lady curiosity letter ones
0.002237358 0.001870677 0.001870677 0.001830566 0.001830566 0.001745965
elderly heaven married observed byron phrases
0.001627170 0.001627170 0.001627170 0.001627170 0.001621254 0.001610960
dinner willow phrase fin simple describe
0.001496542 0.001496542 0.001495004 0.001423774 0.001423774 0.001371830
self stick sense nature thinking canopy
0.001371830 0.001371830 0.001288768 0.001247118 0.001247118 0.001220377
> neville[1:24]
story ones doomed immitigable papers cheep
0.003342618 0.003090456 0.002880181 0.002880181 0.002880181 0.002207469
perfection camel detect hosepipes hubbub loads
0.002207469 0.002160136 0.002160136 0.002160136 0.002160136 0.002160136
mallet marvel squirting boys byron founder
0.002160136 0.002160136 0.002160136 0.001949861 0.001765975 0.001765975
scene shakespeare stair abject admirable ajax
0.001765975 0.001765975 0.001671309 0.001440091 0.001440091 0.001440091
> jinny[1:24]
tunnel prepared billowing game native peers
0.003833041 0.003194201 0.003125710 0.003125710 0.003125710 0.003125710
quicker melancholy bodies band bodys cabinet
0.003125710 0.002555361 0.002121992 0.002083807 0.002083807 0.002083807
coach crag dazzle deftly equipped eyebrows
0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 0.002083807
felled glasses jump lockets matthews murmured
0.002083807 0.002083807 0.002083807 0.002083807 0.002083807 0.002083807
> rhoda[1:24]
oblong dips tiger fuller themoh swallow
0.005443664 0.003888331 0.003337767 0.003110665 0.003110665 0.002860943
fallen suspended cliffs garland manybacked minnows
0.002707581 0.002384119 0.002332999 0.002332999 0.002332999 0.002332999
pond structure terror bunch foam moonlight
0.002332999 0.002332999 0.002105897 0.001907295 0.001907295 0.001907295
party puddle dream pools violets amorous
0.001907295 0.001907295 0.001805054 0.001805054 0.001805054 0.001555332
> susan[1:24]
kitchen setter washing windowpane bury cart
0.006213103 0.004053254 0.004053254 0.004053254 0.003136025 0.003106551
gate horses apron seasons squirrel beds
0.003106551 0.003106551 0.003039940 0.003039940 0.003039940 0.002485241
butter clean wet winter baby boil
0.002485241 0.002485241 0.002485241 0.002063764 0.002026627 0.002026627
cabbages carbolic clara cradle eggs ernest
0.002026627 0.002026627 0.002026627 0.002026627 0.002026627 0.002026627
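The six near-identical sort() calls could also be collapsed into one step. The following is only a sketch, assuming a numeric matrix whose row names are file names such as louis.txt (as in the inspect() output earlier) and whose column names are terms; the helper name top_terms and the toy matrix are made up for illustration.

```r
# Hypothetical helper: the n highest-scoring terms for every row of a score matrix
top_terms <- function(m, n = 24) {
  docs <- rownames(m)
  tops <- lapply(docs, function(d) head(sort(m[d, ], decreasing = TRUE), n))
  # Name each result after its file, minus the .txt extension
  setNames(tops, sub("\\.txt$", "", docs))
}

# Toy scores, invented for illustration
m_toy <- matrix(
  c(0.5, 0.2, 0.9,
    0.0, 0.1, 0.3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("louis.txt", "susan.txt"), c("accent", "oak", "western"))
)
top <- top_terms(m_toy, n = 2)
```

With the real data, top_terms(as.matrix(dtm))$louis should correspond to louis[1:24], without anyone having to remember which row number belongs to which character.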
My data doesn't quite match Ramsay & Steger's (qtd. in Ramsay 13); look at the Louis data to see what I mean (I've reordered the terms alphabetically so that you can see the similarities and differences more easily):
Louis
| Ramsay & Steger | Me |
| accent | accent |
| attempt | |
| australian | australian |
| average | average |
| beast | beast |
| beaten | beaten |
| | boasting |
| bobbing | bobbing |
| | boys |
| clerks | clerks |
| custard | custard |
| discord | |
| disorder | |
| eating-shop | eatingshop |
| england | england |
| eyres | eyres |
| four-thirty | fourthirty |
| grained | grained |
| ham | ham |
| mr | |
| nile | nile |
| | oak |
| pitchers | pitchers |
| | stamps |
| steel | steel |
| thou | thou |
| western | western |
| wilt | wilt |
The terms fourthirty and eatingshop are victims here of the way R removed punctuation. R can also explain one other difference: Ramsay's list has the word mr, which my list lacks; mr is on the list of stopwords I removed from the text. But the others? I don't have any explanation for those. Ramsay's list has these words, which my list lacks (in addition to mr): attempt, discord, and disorder. And my list has oak, stamps, boys, and boasting, which his lacks.
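The same comparison can be done mechanically rather than by eye. A small sketch in base R, using the two Louis lists exactly as they appear in the table; the variable names and the normalize helper are made up for illustration, and stripping hyphens is only a stand-in for whatever punctuation handling each analysis actually used.

```r
# Top-24 lists for Louis, copied from the comparison table
ramsay_steger <- c("accent", "attempt", "australian", "average", "beast",
                   "beaten", "bobbing", "clerks", "custard", "discord",
                   "disorder", "eating-shop", "england", "eyres", "four-thirty",
                   "grained", "ham", "mr", "nile", "pitchers", "steel", "thou",
                   "western", "wilt")
mine <- c("accent", "australian", "average", "beast", "beaten", "boasting",
          "bobbing", "boys", "clerks", "custard", "eatingshop", "england",
          "eyres", "fourthirty", "grained", "ham", "nile", "oak", "pitchers",
          "stamps", "steel", "thou", "western", "wilt")

# Strip hyphens so that "eating-shop" and "eatingshop" compare as equal
normalize <- function(terms) gsub("-", "", terms, fixed = TRUE)

shared    <- intersect(normalize(ramsay_steger), mine)
only_rs   <- setdiff(normalize(ramsay_steger), mine)
only_mine <- setdiff(mine, normalize(ramsay_steger))
```

Here shared holds the twenty terms common to both lists, only_rs the four that appear only in Ramsay & Steger's, and only_mine the four that appear only in mine.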
Well, so, okay; but pretty good, right? Well, maybe not. It only gets worse for the other characters. Here is a summary of the discrepancies for the other characters:
Bernard (4 Shared)
Here my list and Ramsay & Steger's are very different.
The lists share only four terms: letter, curiosity, simple, and canopy.
Ramsay & Steger's then has: arrive, bandaged, bowled, brushed, buzzing, complex, concrete, deeply, detachment, final, getting, hoot, hums, important, low, moffat, rabbit, thinks, tick, tooth
important would be removed by my stoplist... the rest though should otherwise be in my list.
But mine has: thats, hampton, lady, ones, elderly, heaven, married, observed, byron, phrases, dinner, willow, phrase, fin, describe, self, stick, sense, nature, thinking.
Let's look at some of the words and try to sort this out; hoot seems a pretty unique word. Going back through the text, I find seven instances of hoot or hoots. They break down this way by character:
- Louis: 'a siren hoots'
- Bernard: 'But now list; tick, tick; hoot, hoot; the world has hailed us back to it... Then tick, tick (the clock); then hoot, hoot (the cars)'; 'a siren hoots'
- Rhoda: 'the steamer hoots'
Well, hoot seems unique to Bernard. Okay, let me jump back into R.
>inspect(dtm[,c('hoot')])
A document-term matrix (6 documents, 1 terms)
Non-/sparse entries: 1/5
Sparsity : 83%
Maximal term length: 4
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Terms
Docs hoot
bernard.txt 0.0008135849
jinny.txt 0.0000000000
louis.txt 0.0000000000
neville.txt 0.0000000000
rhoda.txt 0.0000000000
susan.txt 0.0000000000
Just so that we aren't confused, let's grab the raw counts (rather than the tfidf scores).
> raw <- DocumentTermMatrix(characters)
> inspect(raw[,c('hoot')])
A document-term matrix (6 documents, 1 terms)
Non-/sparse entries: 1/5
Sparsity : 83%
Maximal term length: 4
Weighting : term frequency (tf)
Terms
Docs hoot
bernard.txt 4
jinny.txt 0
louis.txt 0
neville.txt 0
rhoda.txt 0
susan.txt 0
Well, that's no help then; hoot is unique to Bernard. At this point I begin to suspect something unpleasant. Maybe in my manual data munging, I bollocks'd something. It seems like I got the occurrences of hoot in there, attributed to the right person (though maybe I deleted some other hoots?); but if I deleted something, or double pasted something, that could change the complexion of the corpus as a whole, and so dilute the score (or inflate the score of some of these other terms showing up in my list).
So, at this point I went back and reprocessed the file to ensure I didn't break anything. I used this bit of elisp (courtesy of this) to remove hard newlines within a paragraph (I included it in a macro for a first pass):
(defun remove-line-breaks ()
"Remove line endings in a paragraph."
(interactive)
(let ((fill-column (point-max)))
(fill-paragraph nil)))
And I ran it again. My scores shifted ever so slightly, but my top terms for Bernard remained the same.
Back in R, let's compare my lowest-ranked term with hoot again:
>inspect(dtm[,c('canopy','hoot')])
A document-term matrix (6 documents, 2 terms)
Non-/sparse entries: 2/10
Sparsity : 83%
Maximal term length: 6
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Terms
Docs canopy hoot
bernard.txt 0.001219801 0.0008132009
jinny.txt 0.000000000 0.0000000000
louis.txt 0.000000000 0.0000000000
neville.txt 0.000000000 0.0000000000
rhoda.txt 0.000000000 0.0000000000
susan.txt 0.000000000 0.0000000000
> inspect(raw[,c('canopy','hoot')])
A document-term matrix (6 documents, 2 terms)
Non-/sparse entries: 2/10
Sparsity : 83%
Maximal term length: 6
Weighting : term frequency (tf)
Terms
Docs canopy hoot
bernard.txt 6 4
jinny.txt 0 0
louis.txt 0 0
neville.txt 0 0
rhoda.txt 0 0
susan.txt 0 0
That is to say, canopy, based on my raw counts, does look more distinctive than hoot. What about moffat (from Mrs Moffat in the text), which ranks high on Ramsay & Steger's list, but not at all on mine? (So, if mr showed up in their analysis, why not mrs here? Because other characters talk about other Mrses: Mrs Crane, Mrs Constable.)
> inspect(dtm[,c('moffat','canopy')])
A document-term matrix (6 documents, 2 terms)
Non-/sparse entries: 2/10
Sparsity : 83%
Maximal term length: 6
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Terms
Docs moffat canopy
bernard.txt 0.001219801 0.001219801
jinny.txt 0.000000000 0.000000000
louis.txt 0.000000000 0.000000000
neville.txt 0.000000000 0.000000000
rhoda.txt 0.000000000 0.000000000
susan.txt 0.000000000 0.000000000
> inspect(raw[,c('moffat','canopy')])
A document-term matrix (6 documents, 2 terms)
Non-/sparse entries: 2/10
Sparsity : 83%
Maximal term length: 6
Weighting : term frequency (tf)
Terms
Docs moffat canopy
bernard.txt 6 6
jinny.txt 0 0
louis.txt 0 0
neville.txt 0 0
rhoda.txt 0 0
susan.txt 0 0
So moffat's score is just the same as canopy's (but there are a lot of terms with that score, and terms with the same score are then ranked alphabetically, so it gets pushed off our top 24 list). (Ramsay & Steger's scores are likewise ranked alphabetically when they have equal scores; have a look at those lists on page 13, and you'll see islands of alphabetical ordering.)
So, let me jump back to my initial, raw file; I check there, and Moffat indeed occurs 6 times.
So what on earth is going on here? At this point, I don't know. Here are, I think, the possibilities. The fact that the greatest discrepancy comes from the character with the most monologue data is perhaps meaningful, but how it's meaningful is not obvious. So:
- Perhaps, despite two attempts to get this data all set, I bollocks'd something that is upsetting the scores.
- Perhaps Ramsay & Steger stemmed their data; I haven't got the stemmer working in R properly yet, so that could account for a difference (but terms like grained and bobbing appear on their list, and don't appear to have been stemmed).
- Could the interlude chapters be upsetting things? I discard them from my analysis entirely. If they were included they might change the overall complexion.
- While I am very wary of suggesting Ramsay & Steger's data is wrong, I will note that if they tried to manipulate this data using regex, the data itself isn't consistent. There are cases where paragraphs are missing opening apostro-quote marks (four of them by my count) and a paragraph missing a closing apostro-quote. Depending on how you built your regex, these could throw things off and produce the dilution effect I am worried about.
After tinkering for a bit, I suspected that this might be so. But looking at the raw counts for my data makes me doubt that. One thing you might suspect, if carving up the text into characters' monologues were the problem, would be that some key term might be misattributed; but, for instance, my raw counts of catullus seem consistent with Ramsay & Steger's results:
> inspect(dtm[,c('story','catullus')])
A document-term matrix (6 documents, 2 terms)
Non-/sparse entries: 5/7
Sparsity : 58%
Maximal term length: 8
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Terms
Docs story catullus
bernard.txt 0.0011797090 0.000124653
jinny.txt 0.0000000000 0.000000000
louis.txt 0.0002763958 0.000000000
neville.txt 0.0031438302 0.002076189
rhoda.txt 0.0000000000 0.000000000
susan.txt 0.0000000000 0.000000000
> inspect(raw[,c('story','catullus')])
A document-term matrix (6 documents, 2 terms)
Non-/sparse entries: 5/7
Sparsity : 58%
Maximal term length: 8
Weighting : term frequency (tf)
Terms
Docs story catullus
bernard.txt 15 1
jinny.txt 0 0
louis.txt 1 0
neville.txt 12 5
rhoda.txt 0 0
susan.txt 0 0
That is, Ramsay & Steger think Catullus is distinctive for Neville. And, indeed, it appears to be so. The difference between my results and theirs is in the tfidf score, that is, in how distinctive it is. If their corpus were constructed differently from mine in some way, it might affect how distinctive it is.
So, there may be a data carving problem; who miscarved, though, is not obvious from this data, I don't think. It is also possible that there may be some algorithmic difference; I am using the tf-idf algorithm built into R's tm package as a sort of black box. My scores are very different from the ones Ramsay & Steger share on pg. 12. So we're definitely doing something different. And that might account for these differences. What's clear is that I need to return to the algorithm to better understand what's going on here.
For now, though, I don't know. I'll here just summarize the data for the rest of the characters. (These are using the reprocessed data, so they may be a little different from above; there were no differences in top terms for Louis or Bernard, and these scores were extracted using exactly the same code as above.)
Neville (12 Shared)
- Shared: doomed, immitigable, papers, camel, detect, hubbub, loads, mallet, marvel, abject, admirable, ajax
- Just Ramsay & Steger: catullus, bookcase, bored, expose, incredible, lack, shoots, squirting, waits, stair, aloud
- Just mine: boys, byron, cheep, founder, hosepipes, ones, perfection, scene, shakespeare, story
Jinny (19 Shared)
- Shared: tunnel, prepared, melancholy, billowing, game, native, peers, quicker, band, cabinet, coach, crag, dazzle, deftly, equipped, eyebrows, felled, jump, lockets
- Just Ramsay & Steger: fiery, victory, banners, frightened, gaze
- Just mine: bodies, bodys, matthews, murmured, prepare
Rhoda (13 Shared)
- Shared: oblong, dips, bunch, fuller, party, cliffs, manybacked, minnows, pond, structure, tiger, swallow, bow
- Just Ramsay & Steger: moonlight, them— (this result indicates that we're handling punctuation differently...), allowed, empress, fleet, garland, immune, wonder, africa, amorous, attitude
- Just mine: caverns, chirp, choke, column, fallen, foam, pools, puddle, suspended, terror, violets
Susan (16 Shared)
- Shared: setter, washing, apron, squirrel, windowpane, kitchen, baby, bitten, boil, cabbages, carbolic, clara, cradle, eggs, ernest, seasons
- Just Ramsay & Steger: cow, pear, betty, hams, hare, lettuce, locked, maids
- Just mine: beds, bury, butter, cart, clean, gate, wet, winter
Oh, and here is the breakdown of the amount of text I have for each character:
wc *.txt
46 32608 182921 bernard.txt
33 6331 34467 jinny.txt
46 8905 49588 louis.txt
39 10011 55543 neville.txt
40 8147 44839 rhoda.txt
34 6131 33023 susan.txt
238 72133 400381 total
Bernard has the most (of course, because the final chapter is offered entirely in his voice), followed by Neville, Louis, Rhoda, Jinny, and Susan.
Works Cited
Ramsay, Stephen. Reading Machines. Urbana: U of Illinois P, 2011. Print.
This is an extended version of the (less than) two minute "dork short" or "lightning talk" I gave at THATCamp Virginia a while ago (this post has been sitting in the hopper for a while). I offer an observation, an anecdote, and a suggestion.
tl;dr: I'm trying to put together an edition of Claude McKay's Harlem Shadows. Would you like to help?
An Observation:
An enormous wealth of public domain material is available on the web, from sources like Project Gutenberg or The Oxford Text Archive or The Internet Archive or Google Books or smaller projects like The Modernist Journals Project.
Yet, in my experience, these texts seem underused. (Am I wrong?)
An Anecdote:
When I was a teaching assistant for UVA's twentieth-century literature survey a few years ago, the professors taught Claude McKay's Harlem Shadows. Published in 1922, Harlem Shadows is just inside the public domain.
The text they used was a cheap (though still in the neighborhood of $15; here it is at Amazon) paperback facsimile of the 1922 edition. When I opened this slight paperback, it looked eerily familiar.
Compare:


The top image is from the Google Books edition; the bottom is a scan I just made of the Kessinger edition.
Kessinger's "edition" of Harlem Shadows is printed from page images available at Google Books (scanned, in turn, from a copy at Indiana University library). They've cleaned up the title page a bit, but look at the distinct pencil marks. That's Kessinger's business: get new ISBNs for Google Books scans and then sell them. (When folks first noticed Kessinger doing this a while ago it caused some consternation.)
(Worth noting: there is another copy of Harlem Shadows (scanned from a copy held at Princeton) in GBooks, which misidentifies Max Eastman in the author metadata; in addition to the two Google Books copies, archive.org has two copies, one from the Library of Congress and one from the University of Toronto; all are the same edition. Thoughts on easily breaking up those four PDFs and digitally collating them?)
It seems unfortunate that right now a professor who wants to teach Harlem Shadows ends up assigning Kessinger's rather ugly print-out of a Google Books PDF.
A Suggestion:
Can we do something to make public domain texts more useful? Is there a place for (some) scholars to take the lead here? Rather than paying Kessinger to print out Google Books page-scans, could we not use the (in this case, multiple sets of) page-scans available from a variety of sources to put together a lightly marked up version of the text? Couldn't we draw on existing bibliographies to make clear what the book object represented by those scans actually is? And then, from our single encoding, could we not export to multiple formats: PDF (by way of LaTeX, for folks who want to print this thing out); HTML; and ePub (etc) for eReaders?
Such an idea is not novel; it is merely an expression of the dream of a markup language like TEI. Not so long ago, a proposal for "A Git Powered Project Gutenberg" led to a discussion on Hacker News, which in turn led to a hastily arranged group (which just as quickly disarranged itself), all focused around the idea of making public domain texts better. There is interest in improving the accessibility and usability of public domain texts and it isn't confined to academic literature departments.
Scholars could play a key role here by helping to establish a good text and providing annotations and glosses or other contextual material. In my wilder moments I imagine scholars providing a base text which then becomes the staple, raw ingredient in a variety of remix editions, produced for audiences varying from high school to the college classroom, and beyond. These texts in turn could be cut and remixed to produce a roll-your-own anthology.
An Acknowledgment and a Goal:
There are some excellent reasons why I shouldn't be doing this. First, in the specific case of Harlem Shadows, I am not a specialist in American, African American, or Caribbean literature in general, nor in Claude McKay's work in particular. Nor am I an expert in text markup. Nor am I sufficiently well versed in the dark bibliographical arts to really be handling the complexities of putting together a proper critical edition.
With those reservations stated, I'm trying to carve some time out to work on this nonetheless. One's reach should exceed one's grasp, else what's a public domain for? But boy would I love some help.
I've converted the plaintext, OCR'd version of Harlem Shadows available through archive.org to a lightly marked up TEI version of that text. This markup itself is worthy of scrutiny; but I wanted to have something to start with on the way to producing a proofread, bibliographically sound TEI version of the text; to that I'd like to add annotations and textual notes, as well as supplementary material: early reviews, maybe McKay's prose from this period, as relevant. Think Norton Critical Edition (minus the criticism, which is likely too thorny a permissions matter; though I'd love to be proved wrong on this front).
To begin:
- here is a github repository with my initial stab at marking up the text.
- here is a wiki to organize future work. (Let me know if you want to be added to the wiki).
(A minor technical note: For a while I was imagining that it would be possible to use stand-off markup to keep text and annotation completely separate. This would be great for many reasons; in theory, one could have different sets of notes for different audiences (the high school versus the college class room; a reading versus a scholarly edition); from the little reading I've done, that seems not easily feasible at the moment. For software developers, however, the problem of how to combine constantly evolving sets of dependent texts is simply a fact of life; version control systems, like git, provide some help in managing this problem.)
As a preliminary schedule: begin finalizing markup of the edition by the end of the summer. Continue collecting and adding supplementary material and annotations in the Fall. Then start working on processing the text out to desired formats (the TEI Stylesheets provide a great place to start); so that this time next summer, an edition of sorts (available in multiple formats) is done.
For now I'd be interested in other folks sharing their thoughts, criticism, or enthusiasm. Or, better yet, take some of this material and fix it or fork it.
During break, I've been enjoying reading David Graeber's Debt: The First Five Thousand Years. Graeber's description of the opposing logics of the market and the state so recalled to my mind the penultimate stanza of Auden's "September 1, 1939" that I couldn't resist quickly noting it here.
Here is Graeber:
This is the great trap of the twentieth century: on the one side is the logic of the market, where we like to imagine we all start out as individuals who don't owe each other anything. On the other is the logic of the state, where we all begin with a debt we can never truly pay. We are constantly told that they are opposites, and that between them they contain the only real human possibilities. But it's a false dichotomy. States created markets. Markets require states. Neither could continue without the other, at least, in anything like the forms we would recognize today.
These two traps are what Auden will call the romantic lie and the lie of authority; and what Graeber describes as the interdependence of the market and the state is what Auden will call (and famously regret calling) love:
All I have is a voice
To undo the folded lie,
The romantic lie in the brain
Of the sensual man-in-the-street
And the lie of Authority
Whose buildings grope the sky:
There is no such thing as the State
And no one exists alone;
Hunger allows no choice
To the citizen or the police;
We must love one another or die.