Mining Obscenity III: Failure, with Visualizations
So, lately I've been writing about my interest in the changing semantics of the word bitch, trying to pin down when it went from a term meaning primarily "female dog" to being primarily an obscenity. I still don't have a good answer. In this post I'll try to explain why. Along the way I'll talk about:
- exploring and visualizing the distribution of texts from Project Gutenberg
- a way to visualize the changing meaning of the term bitch
- some general comments, by way of conclusion, on what makes a useful tool for doing this sort of work.
Last time on "Mining Obscenity" . . .

So, as I've written about before, I became interested in tracing the changing definitions of the word "bitch": trying to get some idea of when the shift occurred from bitch being used in print to mean "a female dog" (and, I learned, sometimes other animals) to its being a (mildly) obscene obloquy. (This is, of course, just one change in the term's meanings; more recently, for instance, one could chart the way the term comes increasingly to be used by men to emasculate other men.) But I think (and hope) the general premise is clear enough: to provide some sense of when, historically, the obscene/derogatory meaning took precedence over "female dog," at least as reflected in print (which itself raises questions about how one gets a historically valid sample, et cetera, et cetera).
Exploring Project Gutenberg

So one source of textual data is Project Gutenberg. The amount of data ready to hand at Gutenberg, as well as its availability in vanilla plaintext, has made it attractive to folks doing stylometric analyses. And the very kind folks at Gutenberg have even provided a very helpful way of getting all their ebooks. (Zipped up, that is something like 14.5 gigabytes, according to the PG website.) Project Gutenberg also makes its catalog data available in one big RDF file.

As a preliminary step, I decided to start with this catalog file just to get some sense of the distribution of texts in the Gutenberg archive. So, using Python to extract data from the RDF and Processing to visualize it, I produced this picture of the distribution of texts.

[Figure: Authors in Project Gutenberg]

Each gray horizontal line represents the lifespan of an author who has at least one work in the Project Gutenberg archive. Authors with more than 50 works in the archive get more than a line; they get a box with their name in it. These "major authors" are then color coded: authors with more than 150 works in PG get a red box; authors with between 100 and 150 get a blue box; and authors with more than 50, but fewer than 100, get a green box. The lines are stacked (using a very crude algorithm; "major authors" aren't stacked the same way, they're just chucked at some height), so that the height of the stacked lines gives some insight into the number of authors writing at a given period.

It isn't especially pretty (and some boxes are less visible because they have been drawn over), mostly because my programming ability is pretty limited. But it offers some insight into the historical distribution of PG's authors. There are a lot of authors in the nineteenth and twentieth centuries, because the novel (with the predictable exception of Shakespeare) dominates PG's holdings.
(I've focused on the period from 1500 to 2000 here; PG includes some works from before 1500, some translations of the classics, some Li Po, some Confucius, and so on, but not too many by comparison.)

But there are still lots of problems. If you were paying attention, you'll have noted that I said that authors with more than 150 works in PG get a red box, which would seem to suggest that Shakespeare was even more prolific than you remembered. This inflated number arises because PG's Shakespeare holdings include a number of different versions of each of Shakespeare's plays, translations of some of them, as well as a version of The Complete Works. So what gets tallied up as a "work" is not really a work. (Of course, what exactly defines "a work," how we define its unity and its singularity, is just one more of those thorny questions that I'm trying to shunt aside to get some heuristic peek into literary history.)

This is (I hope) somewhat interesting, at least as a glimpse into PG. But if you've been paying attention, you should be asking (by now you're probably screaming in frustration): why visualize the lifetimes of authors rather than the publication dates of individual works? Well, that is simple: PG's catalog data does not include publication dates. (For that matter, it doesn't include any data about what edition a particular etext represents.) That's certainly a problem.
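Because the catalog is one enormous RDF file, a streaming SAX parser keeps memory use flat while pulling out author lifespans. Here is a minimal sketch in Python; the element names used (`agent`, `name`, `birthdate`, `deathdate`) are simplified stand-ins for illustration, since the real PG RDF schema is rather more involved:

```python
import xml.sax

class AuthorHandler(xml.sax.ContentHandler):
    """Collect (name, birth, death) triples from a PG-style RDF catalog.
    The element names here are assumptions, not the real PG schema."""

    def __init__(self):
        super().__init__()
        self.authors = []    # accumulated (name, birth, death) tuples
        self._current = {}   # fields for the author being parsed
        self._field = None   # element whose text we are currently inside

    def startElement(self, name, attrs):
        # Strip any namespace prefix, e.g. "pgterms:birthdate" -> "birthdate"
        local = name.split(":")[-1]
        if local == "agent":             # assumed per-author wrapper element
            self._current = {}
        elif local in ("name", "birthdate", "deathdate"):
            self._field = local
            self._current[local] = ""

    def characters(self, content):
        if self._field is not None:
            self._current[self._field] += content

    def endElement(self, name):
        local = name.split(":")[-1]
        if local in ("name", "birthdate", "deathdate"):
            self._field = None
        elif local == "agent":
            self.authors.append((
                self._current.get("name", "").strip(),
                int(self._current.get("birthdate", "0").strip() or 0),
                int(self._current.get("deathdate", "0").strip() or 0),
            ))

def parse_catalog(data):
    """Parse catalog bytes and return the list of author triples."""
    handler = AuthorHandler()
    xml.sax.parseString(data, handler)
    return handler.authors
```

For the real catalog you would pass the file to `xml.sax.parse` instead of holding the whole document in memory; the point of SAX here is precisely that the document never needs to be loaded whole.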
But What If We Just Ignored All That: Visualizing the Semantics of Bitch (with Bad Data)

Okay, so that's a problem. But let's say we ignored it and tried to forge ahead anyway. Maybe you could take Gutenberg's textual data and get metadata about the works from some other source. Great idea! But this solution proved more difficult than I could easily manage. Well, you could always just make the data up. Let's take each author's birth year, add it to the year in which s/he died, and divide by two, effectively assuming that each author produced all of their work in one great burst of creativity midway on life's journey. This assumption is so ugly as to call any resulting visualization severely into question, at least in terms of its philological accuracy. But as a proof of concept, I decided to make it anyway.

So, after waiting out the massive 15-gigabyte-ish download of PG's etexts, how would one proceed? Well, I imagine that there are other, perhaps better, ways to approach this, but I used rgrep to search all the files for instances of the (case-insensitive) string "bitch." With the right arguments, you can have rgrep return a line of context on either side of each occurrence of the searched-for term. The results will look something like this:
```
./etext97/itwls10.txt-4218-of the stag; but, partaking more of the nature of the domestic than
./etext97/itwls10.txt:4219:of the wild animal, it remained with the herd of cattle. A bitch
./etext97/itwls10.txt-4220-also was pregnant by a monkey, and produced a litter of whelps --
./etext05/8cptm10.txt-62244-"Yes; yes, by the stitching 'tis plain to be seen
./etext05/8cptm10.txt:62245:"It was made by that Bourbonite bitch, VICTORINE!"
./etext05/8cptm10.txt-62246-What a word for a hero!--but heroes _will_ err,
```

Above are two results from such a search; the middle line of each contains the searched-for term. At the beginning of each line is the file in which the grepped-for term occurs, followed by the line number, and then a line of text. Pipe all those results into a text file and you have the raw material you need. The file info (by way of etext number) can connect the text to its entry in the RDF catalog (and thence to the author, title, and birth/death date info).

Determining the meaning of "bitch" in these passages, though, is not an easy task. One can imagine a machine learning solution, but on such small samples it seems unlikely to work well, and it would introduce a whole other level of complexity. You could try simply searching for selected key terms (like "dog" or "litter") within a certain proximity of the occurrence of "bitch" and come to a conclusion based on the result. But since the number of results was relatively low (around 400), I thought it would be easier and better to just do it manually. To ease the task, I wrote a quick Python script to display each extract and accept as input a number (0-4) to classify the term. [Screenshot: the classification prompt] There are certainly other ways to break up the meanings, but after surveying the data this seemed sufficient. With this scheme, one could skip an entry if it was a false positive (for example, the name Bitchov or similar; there were actually a couple of these).
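The grep output can be parsed back into structured records, and the hand-classification can be driven by a small loop. Here is a rough sketch, assuming output in the shape of something like `grep -rinC 1 bitch .`; both the exact flags and the category codes are my reconstruction for illustration, not the original script:

```python
import re

# Match lines in grep's -n -C output use ":" as the separator; the
# surrounding context lines use "-", so this pattern skips them.
MATCH_LINE = re.compile(r"^(?P<path>\./\S+?\.txt):(?P<lineno>\d+):(?P<text>.*)$")

def parse_grep_output(lines):
    """Return (path, line_number, matched_text) for each match line."""
    records = []
    for line in lines:
        m = MATCH_LINE.match(line)
        if m:
            records.append((m.group("path"), int(m.group("lineno")), m.group("text")))
    return records

def classify(records, read_input=input):
    """Show each extract and record a 0-4 classification for it.
    The code meanings (0 = false positive, etc.) are assumptions."""
    labels = {}
    for path, lineno, text in records:
        print(f"{path}:{lineno}\n  {text}")
        answer = read_input("class (0-4)? ")
        if answer in {"0", "1", "2", "3", "4"}:
            labels[(path, lineno)] = int(answer)
    return labels
```

Passing `read_input` as a parameter is just a convenience so the loop can be exercised without a terminal; interactively, the default `input` does the job.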
I ranked "son-of-a-bitch" separately only because it occurred so frequently that it seemed worth keeping an eye on (as a specialized instance of the range of the term's obscene meaning); and I left open the possibility of ranking a term as "ambiguous" since, even with three lines of context, the term's meaning might not be obvious. (By keeping ambiguous results separate from false positives, "0", one could go back and grab more context to resolve the ambiguity.)

So, for a couple of days I left this simple program running. Whenever I had a few free minutes to do some simple classifying, while talking on the phone or waiting for water to boil, I classified some occurrences of the term "bitch." Once all of them had been classified and the output written to a file, it was time to return to Processing to try to visualize this. After some futzing around, here is what all that bitch data looked like.

[Figure: Visualization of the Relative Obscenity of "Bitch"]

Let me first reiterate that this visualization does not really show anything; the data it represents is fundamentally flawed. As I noted above, because dates of publication were not easily available, the dates used here are effectively inventions. (They are accurate within a tolerance of, say, half of three score and ten.) Moreover, even with all that text downloaded from Gutenberg, we still have a pretty small number of points to draw any conclusions from. (You'll note that, for purposes of visualization, I've grouped occurrences by the decade in which they occur, fudging the dates still further.) And, as if that weren't enough, recall that the same "work" can appear more than once in PG, leading to double counting. (I went through the data by hand to try to remove these, but I could have missed some.) So, this sure seems like a long blog post for a useless visualization, doesn't it?
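For what it's worth, the invented dates and the decade bucketing amount to two lines of arithmetic. Spelled out as a sketch (not the original code):

```python
def midpoint_year(birth, death):
    """The ugly assumption: pretend every work appeared at the
    midpoint of its author's life."""
    return (birth + death) // 2

def decade(year):
    """Bucket a year by decade for plotting, fudging the date further."""
    return (year // 10) * 10
```

So Jane Austen (1775-1817), for instance, would have all her works assigned to a single invented midpoint year, which then lands in a single decade bucket.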
Well, here is what I like: this visualization divides the two meanings of bitch horizontally. Points appearing below the center horizontal line represent instances of the term being used in its obscene sense (the color coding gives some further insight into how these break down using the four-part division discussed above); points above the line represent instances where the term is used in a non-obscene way (to mean "female dog"). This is simple, but it has the advantage of allowing that both meanings might be equally available, or available in some mixed proportion, at any historical moment. With a larger data set, and with correct publication dates, this seems to me an elegant way of answering the admittedly amorphous question with which I began (though I'm certainly open to criticism of this entire approach).

It could also be improved upon. You could keep in memory the text samples from which these points were derived, so that one could mouse over each point and get data about what author/work that point represents, a keyword-in-context sort of view, and even a link to the full text. With a sufficiently complete data set, I would expect that we'd see that, during the twentieth century, occurrences of the term as an obscenity greatly increase while occurrences of the term meaning "female dog" decrease. Exactly where the obscene meaning takes precedence would be the interesting thing to know. (Indeed, it is the thing I was interested in originally.)
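The above/below-the-line layout is easy to reproduce outside of Processing. Here is a sketch of the point-placement logic in Python; the category codes are my assumed mapping (1 = "female dog", 2 = generic obscenity, 3 = "son of a bitch", with 0 and 4 set aside), not the original sketch:

```python
from collections import Counter

# Non-obscene senses stack upward from the axis; obscene senses stack
# downward. These code sets are assumptions for illustration.
NON_OBSCENE = {1}
OBSCENE = {2, 3}

def layout(occurrences):
    """occurrences: iterable of (decade, category) pairs.
    Returns a list of (decade, y, category) points, where y > 0 means
    above the axis and y < 0 means below it."""
    points = []
    above = Counter()   # current stack height above the axis, per decade
    below = Counter()   # current stack height below the axis, per decade
    for dec, cat in occurrences:
        if cat in NON_OBSCENE:
            above[dec] += 1
            points.append((dec, above[dec], cat))
        elif cat in OBSCENE:
            below[dec] += 1
            points.append((dec, -below[dec], cat))
        # false positives (0) and ambiguous cases (4) are dropped here
    return points
```

The signed y-coordinate is what lets both meanings coexist in one picture: a decade where the senses are equally available simply stacks points the same height in both directions.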
> ...I'm increasingly of the opinion that end-user application style software is not really what scholars who are serious about exploring the possibilities of using technology to enhance their research, or open new avenues of research, require. Rather, I'm beginning to feel that a good grounding in programming; a simple, expressive language; and good provision of libraries for abstracting over the data encodings and difficult algorithms required in each discipline will be much more conducive to interesting computational scholarship. The things that make computational scholarship interesting can't, I think, be packaged up in an end-user application. Like scholarship conducted in any paradigm, computational scholarship is interesting and worthwhile when it's exploratory. But the restrictions of an end-user application seriously stifle any possibility for exploration.

Such a statement has the potential to stir up a debate I've seen elsewhere about whether "Digital Humanists" should learn to program, which I have no interest in doing. Nevertheless, at least for tasks like the one I've (painfully) described here, I think the perspective Lewis describes is helpful. Insofar as I even made half a stab at solving this little riddle, it is because of the availability of a set of tools that are easy enough to be picked up by a nonspecialist, but supple enough to be used in unanticipated ways. In particular I would single out Python, the Natural Language Toolkit, and Processing. As has been noted elsewhere, Python's simplicity makes it fun to work with and perfect for these sorts of problems. In addition to Python's native facility with strings, the NLTK makes all sorts of text analysis tasks (frequency counts, etc.) very simple (and it is all wonderfully well documented). And Processing does for visualization what the NLTK does for text analysis.
Using them as I have here produces an admittedly heterogeneous solution, cobbled together out of what one can learn on the fly (the biggest challenge: figuring out SAX processing to handle PG's massive RDF catalog file). Instead of splitting the work among rgrep, Python, and Processing, one could do everything I've done here within a single language: there are graphics libraries for Python, and one could do all the string/data manipulation by way of Processing (perhaps with some help from native Java libraries). But it seems that using each tool in a task-specific way provides a helpful midway point between spending too much time trying to learn how to code and just waiting for the exact right tool to appear (in this case, the obscene-semantics-historical-separator; surely it's next from Google Labs).