
Text Visualization with Paper and Yarn

“I am quite content to go down to posterity as a scissors and paste man for that seems to me a harsh but not unjust description.”—James Joyce

So, let’s say you took Gabler’s edition of Ulysses, photocopied each page of “Wandering Rocks” (episode 10) at 50% of its normal size, and then taped them all together. You now have one long piece of paper. Cut at the breaks between the nineteen sections of which the episode is composed and you have nineteen pieces of paper—one for each of the episode’s sections. The sizes of these pieces, of course, would vary; the first (describing Father Conmee’s walk and trip on the tram) would be the longest.

Next, grab some yarn and some paper clips (because they’re handy). Cut some short lengths of yarn. Tie a paper clip to each end. Now let’s have a look at the second sheet (“Text Interruptions”) in this Google Doc, containing a list of moments in “Wandering Rocks” where a recognizable line from one section intrudes in another. Take one end of your paper-clip/yarn device and clip it to the line where a reference occurs; connect the other end to the area referenced (the Gabler edition has line numbers; that is chiefly why we’re using it).

When all is said and done, with some variability depending on how exactly you connect things, you might have something that looks like this:

Wandering Rocks, Visualized in Paper

(Check out this Flickr image for an annotated version of the same image.)

This is a sort of basic visualization of the connections between the sections of “Wandering Rocks.” Using scissors and some basic office supplies you can begin to get a grip on how, precisely, the various sections of the episode relate to one another.

This only visualizes, however, one of the ways in which these sections are connected. Certain characters, for example, help synchronize the sections by appearing in more than one section: this is not visualized (perhaps we could highlight proper names which occur in more than one section; or connect them with a different colored thread). Location also helps synchronize the sections: Bloom, Stephen, and his sister all appear at the book stall, for instance. Maybe we could lay the nineteen pieces of paper out on a map (how would we handle the episodes where characters are moving?). You’ll notice that I haven’t even tried to make sense of the final, nineteenth section, where many of the characters see the cavalcade as it moves through Dublin (its movement appears in a number of other sections too). Have a look at the Google Doc to see my raw data; if you think you can improve it, email me (cforster @ virginia.edu) and I’ll happily add you as an editor for the document.

It is also worth remembering that the chief unit of analysis here is the “line” in the Gabler edition. But not all “lines” within the narration are equal in terms of the time narrated. So you can line up the sections based on the synchronizations within the sections, but these provide only a point of synchronization; you cannot extrapolate out beyond that point.

There isn’t too much to be learned from this very basic attempt to get a handle on the complexity of this episode. But it does seem interesting that sections tend to branch out—rather than, for instance, many sections all referring to one section (though this situation is precisely that of the final section, which I have ignored; and, in another, of Father Conmee’s walk which, through its geographical progression, may relate to other sections in ways I have ignored).

This yarn stuff is fun, but wouldn’t it be nice to have this digitally? Let’s take this and do it in Processing. In trying to write up this same visualization in code, I think the chief lesson of playing with yarn is that there are essentially two key types of objects for this analysis: chunks of text (representable as a rectangle of length proportional to the amount of text they represent); and flexible connections between parts of the text (not necessarily between sections: a link could, theoretically, be within a single section).

These two types of things were instantiated as two basic classes in my code: textChunk and connection. A textChunk contains its starting line and ending line, its length (computed as the difference between those first two pieces of data; I keep it onboard rather than re-computing it constantly), and a quick description (stored as a String); each textChunk object also contains the coordinates of its current location on the screen. The connection objects similarly contain the points they link together (stored simply as two integers representing the two line numbers that are linked; we don’t even need x,y coordinates since we’re working with a basically one-dimensional representation of the text here). There are also a handful of methods for these objects: constructors to load up the data (though the way the data is currently stored/loaded is an embarrassment); some methods to draw the objects, etc.
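To make the two kinds of objects concrete, here is a minimal sketch in Python rather than Processing. It is not the code behind the images in this post: the field names, the vertical layout, the pixels_per_line parameter, and the line ranges in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """A section of the episode, drawn as a rectangle."""
    start_line: int      # first line of the section
    end_line: int        # last line of the section
    description: str     # short label shown on hover
    x: float = 0.0       # current screen position (can be dragged around)
    y: float = 0.0

    @property
    def length(self) -> int:
        # Derived on demand here; the post stores it on the object instead.
        return self.end_line - self.start_line

    def contains(self, line: int) -> bool:
        return self.start_line <= line <= self.end_line

    def y_of(self, line: int, pixels_per_line: float = 1.0) -> float:
        # Screen position of one line, assuming chunks are drawn vertically.
        return self.y + (line - self.start_line) * pixels_per_line


@dataclass
class Connection:
    """A link between two line numbers (possibly inside one chunk)."""
    from_line: int
    to_line: int

    def endpoints(self, chunks):
        # Resolve each line number to a screen point via the chunk that owns it.
        points = []
        for line in (self.from_line, self.to_line):
            owner = next(c for c in chunks if c.contains(line))
            points.append((owner.x, owner.y_of(line)))
        return tuple(points)


# Hypothetical line ranges, not taken from the Gabler edition.
chunks = [
    TextChunk(1, 205, "first section (made-up range)", x=20, y=10),
    TextChunk(206, 226, "second section (made-up range)", x=120, y=10),
]
link = Connection(from_line=213, to_line=100)
print(link.endpoints(chunks))
```

Because a connection stores only line numbers, its endpoints are recomputed from wherever the chunks currently sit, which is what lets the links stretch as the blocks are moved.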

Here is what it looks like, comparing my yarn visualization with my version in Processing (not too bad, huh?):

Two Visualizations Compared: One Paper, One Digital

(In mapping things out, I got some of the inspiration here from my friend & colleague Jean Bauer’s much more sophisticated tool for visualizing relational databases, Davila, also written in Processing; originally I was simply going to gut her code and repurpose it here; but her code is far more elegant than mine, and is designed for situations far more complex. It made sense to just start from scratch.)

Each object bridges the gap between what it represents (which remains basically static) and the current state of its representation (which can be moved around and interacted with).

Wandering Rocks Visualized

The interactions are basic. You can grab each textChunk and move it around manually. Hovering over a block will produce a little description of that block in the white section near the bottom of the window. You can hit ‘a’ and the blocks will automatically align. That function isn’t working perfectly yet, so I had to manually massage things a bit to get them to look as they do above.

But as you move the blocks around, the connections stretch and keep the links between the sections evident. The blocks lined up on the right hand side are those without connections. (Oh yeah, those curved white lines; they’re my beginning of an attempt to mark the skiff’s progress.)

There would be other ways to begin visualizing “Wandering Rocks” (and I’d love to hear suggestions). And there are certainly ways to improve this one. One could attach the entire text of each section (it’s available through Project Gutenberg, y’know), though I’m not sure what the advantage of doing that would be. The colors just alternate now (for odd and even sections), to avoid sheer monochromatism. But the color of the textChunk could be tied to location or character; similarly the color of the connection could be made meaningful in some way.

I may post the code if I can get it cleaned up enough; if you’d like to see it in its current state, just email me, and I’ll chuck you a tar ball with everything as it stands.

What we’re playing with here is the tension between narrated time and narrative time. This neglects the entire dimension of space, which is central to the text of “Wandering Rocks” itself. In the comments of my previous post, crazymonk pointed to these maps from the wonderful Robot Wisdom site, which is full of interesting Joyce material. The next step on this odd little project will be to continue to improve this visualization with an eye to moving towards a mapped visualization of the action of the episode. The simultaneity I’ve been trying to visualize here is directly connected to the way the episode attempts to unify diverse locations. Bringing together a basic geographical representation of the episode’s action (and the action of the novel) with the concerns I’m tinkering with here would allow this visualization to move from merely playing to something else, I think… Of which, more anon (or, anonish).


So, lately I’ve been writing about my interest in the changing semantics of the word bitch, trying to pin down when it went from a term meaning primarily “female dog” to being primarily an obscenity. I still don’t have a good answer. In this post I’ll try to explain why.

Along the way I’ll talk about:

That might seem like a lot of stuff, especially if you’re the poor soul reading this; those links above can hopefully get you where you might be interested in going (and of course, there are more entertaining places on the internet anyway you know).

I need to make very clear here, though, that despite all that follows, nothing below even approaches an answer to the question with which I began. I will share a sort of dummy visualization of the changing semantics of “bitch,” but it is worthless as an answer to that question: G.I.G.O.

Last time on “Mining Obscenity”. . .

So, as I’ve written about before, I became interested in tracing the changing definitions of the word “bitch,” of trying to get some idea about when the shift occurred from bitch being used in print to mean “a female dog” (and, I learned, sometimes other animals) to its being a (mildly) obscene obloquy. (This is, of course, just one change in the term’s meanings—more recently, for instance, one could chart the way the term comes increasingly to be used by men to emasculate other men.)

But I think (and hope) the general premise is clear enough: to provide some sense about when, historically, the obscene/derogatory meaning took precedence over “female dog”, at least as reflected in print (which itself raises questions about how one gets a historically valid sample, et cetera, et cetera).


Exploring Project Gutenberg

So one source of textual data is Project Gutenberg. The amount of data ready to hand at Gutenberg, as well as its availability in vanilla plaintext, has made it attractive to folks doing stylometric analyses. And the very kind folks at Gutenberg have even included a very helpful way of getting all their ebooks. (Zipped up, that is something like 14.5 gigabytes, according to the PG website.)

Project Gutenberg also makes available its catalog data in one big RDF file. As a preliminary step I decided to start with this catalog file just to get some sense of the distribution of texts in the Gutenberg archive. So, using Python to extract data from the RDF and Processing to visualize it, I produced this picture of the distribution of texts.
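For readers curious about the extraction step, here is a rough sketch of pulling author records out of an RDF file with Python’s built-in SAX parser, which reads the file as a stream instead of loading it all at once. The element names (pgterms:etext, dc:creator, dc:title), the “Name, birth-death” creator format, and the two sample records are assumptions invented for the example; check them against the real catalog file before relying on them.

```python
import re
import xml.sax

# A tiny made-up stand-in for the catalog; the real file is far larger.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1">
    <dc:creator>Example, Author, 1800-1870</dc:creator>
    <dc:title>A Made-Up Title</dc:title>
  </pgterms:etext>
  <pgterms:etext rdf:ID="etext2">
    <dc:creator>Sample, Writer, 1850-1910</dc:creator>
    <dc:title>Another Made-Up Title</dc:title>
  </pgterms:etext>
</rdf:RDF>"""

LIFESPAN_RE = re.compile(r"(\d{4})-(\d{4})")


def lifespan(creator):
    """Pull (birth, death) out of a creator string, or None if absent."""
    m = LIFESPAN_RE.search(creator)
    return (int(m.group(1)), int(m.group(2))) if m else None


class CatalogHandler(xml.sax.ContentHandler):
    """Collects one dict per etext element as the parser streams past."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None   # record being built, if inside an etext
        self._field = None     # "creator" or "title" while inside one

    def startElement(self, name, attrs):
        if name == "pgterms:etext":
            self._current = {"id": attrs.get("rdf:ID"), "creator": "", "title": ""}
        elif self._current is not None and name in ("dc:creator", "dc:title"):
            self._field = name.split(":")[1]

    def characters(self, content):
        # Text can arrive in several pieces, so append rather than assign.
        if self._current is not None and self._field:
            self._current[self._field] += content

    def endElement(self, name):
        if name == "pgterms:etext":
            self.records.append(self._current)
            self._current = None
        elif name in ("dc:creator", "dc:title"):
            self._field = None


handler = CatalogHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
for record in handler.records:
    print(record["id"], lifespan(record["creator"]), record["title"])
```

For a file on disk, xml.sax.parse(path, handler) takes the place of parseString.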

Authors in Project Gutenberg

Each gray horizontal line represents the lifespan of an author who has at least one work in the Project Gutenberg archive. Authors with more than 50 works in the archive get more than a line—they get a box with their name in it. These “major authors” are then color coded: authors with more than 150 works in PG get a red box; authors with between 100 and 150 get a blue box; and authors with more than 50, but less than 100, get a green box. The lines are stacked (using a very crude algorithm; “major authors” aren’t stacked the same way—they’re just chucked at some height), so that the height of the stacked lines gives some insight into the number of authors writing at a certain period.
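The color thresholds and the stacking can be sketched in a few lines of Python. The tier function follows the cutoffs just described; the stacking function is a plain greedy first-fit, which is only a guess at what a “very crude algorithm” might look like, not the one used for the picture. The lifespans in the example are illustrative.

```python
def tier(work_count):
    """Box color for an author, or None for a plain gray line."""
    if work_count > 150:
        return "red"
    if work_count >= 100:
        return "blue"
    if work_count > 50:
        return "green"
    return None


def stack(lifespans):
    """Assign each (birth, death) span to the lowest row where it fits.

    Greedy first-fit over spans sorted by birth year (an assumption,
    not the post's actual method). Returns (birth, death, row) tuples.
    """
    row_free_after = []   # row_free_after[i] = death year of last span in row i
    placed = []
    for birth, death in sorted(lifespans):
        for i, free_after in enumerate(row_free_after):
            if birth > free_after:
                row_free_after[i] = death
                placed.append((birth, death, i))
                break
        else:
            # No existing row has room, so open a new one on top.
            row_free_after.append(death)
            placed.append((birth, death, len(row_free_after) - 1))
    return placed


spans = [(1564, 1616), (1572, 1637), (1620, 1680)]
print(stack(spans))
print(tier(160), tier(120), tier(75), tier(10))
```

The number of rows in use at a given year is then the number of authors alive in that year, which is what the height of the stack is meant to suggest.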

It isn’t especially pretty (and some boxes are less visible because they have been drawn over), mostly because my programming ability is pretty limited. But it offers some insight into the historical distribution of PG’s authors. There are a lot of authors in the nineteenth/twentieth centuries, because the novel (with the predictable exception of Shakespeare) dominates PG’s holdings. (I’ve focused on the period from 1500 – 2000 here; PG includes some works in the period before 1500—some translations of the classics, some Li Po, some Confucius, and so on, but not too many by comparison).

But there are still lots of problems. If you were paying attention you’ll note that I said that authors with more than 150 works in PG get a red box, which would seem to suggest that Shakespeare was even more prolific than you remembered. This inflated number is because PG’s Shakespeare holdings include a number of different versions of each of Shakespeare’s plays, translations of some of them, as well as a version of The Complete Works. So what gets tallied up as a “work” is not really a work. (Of course what exactly defines “a work” —how we define its unity and its singularity—is just one more of those thorny questions that I’m trying to shunt aside to get some heuristic peek into literary history.)

This is (I hope) somewhat interesting, at least as a glimpse into PG. But if you’ve been paying attention you should be asking (by now you’re probably screaming in frustration): why are you visualizing the lifetimes of authors rather than publication dates of individual works? Well, that is simple. PG’s catalog data does not include publication dates. (For that matter, it doesn’t include any data about what edition a particular etext represents at all.)

Well, that’s certainly a problem.


But what if we just ignored all that: Visualizing the Semantics of Bitch (with Bad Data)

Okay, so that’s a problem. But let’s say we ignored this problem and tried to forge ahead anyway. Maybe you could take Gutenberg’s textual data and get metadata about the works from some other source. Great idea! But this solution proved more difficult than I could easily manage.

Well you could always just make the data up. Let’s just take each author’s birth year, add it to the year in which s/he dies, and divide by 2, effectively assuming that each author produced all of their work in one great burst of creativity midway on life’s journey.
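Spelled out, the invented date is just the midpoint of the lifespan. In this small Python sketch, the rounding down to a whole year and the function names are assumptions; Joyce’s dates (1882 to 1941) serve as the example.

```python
def estimated_year(birth, death):
    """Midpoint of an author's life, rounded down to a whole year."""
    return (birth + death) // 2


def worst_case_error(birth, death):
    """Largest possible gap between the midpoint and any year in the lifespan."""
    return (death - birth) / 2


print(estimated_year(1882, 1941))    # midpoint year for Joyce
print(worst_case_error(1882, 1941))  # how far off that guess could be, in years
```

The second function makes the cost of the assumption visible: the guess can be wrong by up to half the author’s lifespan.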

This would be an assumption so ugly as to call any resulting visualization severely into question, at least in terms of its philological accuracy. But as proof-of-concept, I decided to make the assumption anyway.

So, after waiting for the massive 15 gig-ish download of PG’s etexts, how would one proceed? Well, I imagine that there are other ways to approach this, perhaps better ways, but I used rgrep to search all the files for instances of the (case insensitive) string “bitch.” Using arguments, you can have rgrep return a line on either side of each occurrence of the searched-for term. The results will look something like this:

./etext97/itwls10.txt-4218-of the stag; but, partaking more of the nature of the domestic than
./etext97/itwls10.txt:4219:of the wild animal, it remained with the herd of cattle.  A bitch
./etext97/itwls10.txt-4220-also was pregnant by a monkey, and produced a litter of whelps
--
./etext05/8cptm10.txt-62244-"Yes; yes, by the stitching 'tis plain to be seen
./etext05/8cptm10.txt:62245:"It was made by that Bourbonite bitch, VICTORINE!"
./etext05/8cptm10.txt-62246-What a word for a hero!--but heroes _will_ err,

Above are two results from such a search; the middle line of each contains the searched-for term. At the beginning of each line is the file in which the grepped-for term occurs, followed by the line number, and then a line of text. Pipe all those results into a text file and you have the raw material you need. The file info (by way of etext number) can connect the text to its entry in the RDF catalog (and thence to the author, title, and birth/death date info).
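Turning that raw material back into structured records might look something like the following Python sketch. It assumes the layout shown in the sample: matching lines use a colon around the line number, context lines use a hyphen, and groups are separated by a line containing only two hyphens. The function name and record fields are invented for illustration, and the regular expression could misfire on file paths that themselves contain a hyphen followed by digits and another hyphen.

```python
import re

# path, then ":" (match) or "-" (context), line number, same separator, text
LINE_RE = re.compile(
    r"^(?P<path>.+?)(?P<sep>[:-])(?P<lineno>\d+)(?P=sep)(?P<text>.*)$"
)

SAMPLE = """\
./etext97/itwls10.txt-4218-of the stag; but, partaking more of the nature of the domestic than
./etext97/itwls10.txt:4219:of the wild animal, it remained with the herd of cattle.  A bitch
./etext97/itwls10.txt-4220-also was pregnant by a monkey, and produced a litter of whelps
--
./etext05/8cptm10.txt-62244-"Yes; yes, by the stitching 'tis plain to be seen
./etext05/8cptm10.txt:62245:"It was made by that Bourbonite bitch, VICTORINE!"
./etext05/8cptm10.txt-62246-What a word for a hero!--but heroes _will_ err,
"""


def parse_grep(output):
    """Return one record per group: file path, line number, context lines."""
    hits = []
    for block in output.split("\n--\n"):
        context = []
        hit = None
        for raw in block.splitlines():
            m = LINE_RE.match(raw)
            if not m:
                continue
            context.append(m["text"])
            if m["sep"] == ":":
                # The colon-separated line is the one holding the search term.
                hit = {"path": m["path"], "lineno": int(m["lineno"])}
        if hit:
            hit["context"] = context
            hits.append(hit)
    return hits


hits = parse_grep(SAMPLE)
for hit in hits:
    print(hit["path"], hit["lineno"], len(hit["context"]))
```

Each record keeps the file path, which is the piece needed to look the text up in the catalog.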

Determining the meaning of “bitch” in these passages, though, is not an easy task. One can imagine a machine learning solution, but on such small samples it seems unlikely to work well and would introduce a whole other level of complexity. You could try simply searching for selected key terms within a certain proximity of the occurrence of “bitch” (like “dog” or “litter”) and come to a conclusion based on the result. But since the number of results was relatively low (around 400 results), I thought it would be easier and better to just do it manually. To ease the task I wrote a quick Python script to display each extract and accept as input a number (0 – 4) to classify the term. Here is what it looked like:

There are certainly other ways to break up the meanings, but after surveying the data this seemed sufficient. With this scheme, one could skip an entry if it was a false positive (for example, the name Bitchov or similar; there were actually a couple of these). I ranked “son-of-bitch” separately only because it occurred so frequently that it might be worth keeping an eye on it (as a specialized instance of the range of the term’s obscene meaning); and I left open the possibility of ranking a term as “ambiguous” since, even with 3 lines of context, the term’s meaning might not be obvious. (By keeping ambiguous results separate from false positives, “0”, one could go back and grab more context to resolve the ambiguity.)
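A hand-classification loop of this kind is short to write. The Python sketch below is a reconstruction, not the original script: only “0” for false positives comes from the description above, while the assignment of 1 through 4 to the other categories, the function name, and the sample extracts are assumptions.

```python
LABELS = {
    0: "false positive (skip)",
    1: "female dog",            # numbering of 1-4 is assumed, not the post's
    2: "obscene/derogatory",
    3: "son-of-a-bitch",
    4: "ambiguous",
}


def classify(extracts, ask=input):
    """Show each extract and record the number typed in for it.

    `ask` defaults to the built-in input(); passing another callable
    makes it possible to drive the loop from a script.
    """
    results = []
    for index, lines in enumerate(extracts):
        print("\n".join(lines))
        while True:
            answer = ask("Class (0-4): ").strip()
            if answer.isdigit() and int(answer) in LABELS:
                results.append((index, int(answer)))
                break
            print("Please enter a number from 0 to 4.")
    return results


# Scripted demonstration with invented extracts and canned answers.
demo_extracts = [
    ["context before", "a line mentioning the term", "context after"],
    ["another before", "another mention", "another after"],
]
canned = iter(["1", "x", "2"])
demo_results = classify(demo_extracts, ask=lambda prompt: next(canned))
print(demo_results)
```

Calling classify(extracts) with no second argument prompts at the keyboard, which suits the few-free-minutes-at-a-time style of work.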

So, for a couple days I left this simple program running. Whenever I had a few free minutes to do some simple classifying while talking on the phone or waiting for water to boil, I classified some occurrences of the term “bitch.” Once all of them had been classified and the output written to a file, it was time to return to Processing to try to visualize this. After some futzing around, here is what all that bitch data looked like.

Visualization of the Relative Obscenity of "Bitch"

Let me first reiterate that this visualization does not really show anything, because the data it represents is fundamentally flawed. As I noted above, because date of publication was not easily available, the dates used here are effectively inventions. (They are accurate within a tolerance of, say, half three score and ten.) Moreover, even with all that text downloaded from Gutenberg, we still have a pretty small number of points to draw any conclusions from. (You’ll note that, for purposes of visualization, I’ve grouped occurrences by the decade in which they occur, fudging the dates still further.) And, as if that weren’t enough, let’s recall that the same “work” can appear more than once in PG, leading to double-counting. (I went through the data by hand to try to remove these, but I could have missed some.)

So, this sure seems like a long blog post for a useless visualization, doesn’t it?

Well, here is what I like: this visualization divides the two meanings of bitch horizontally. Points appearing below the center horizontal line represent instances of the term being used in its obscene sense (the color-code gives some further insight into how these break down using the 4-part division discussed above); points above the line represent instances where the term is used in a non-obscene way (to mean “female dog”). This is simple, but has the advantage of allowing that both meanings might be equally available, or available in some mixed proportion, at any historical moment. With a larger data set, and with correct publication dates, this seems to me to be an elegant way of answering the admittedly amorphous question with which I began (though I’m certainly open to criticism of this entire approach).
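The layout logic can be sketched without any graphics at all. This Python version groups occurrences by decade and stacks them upward or downward from the center line; the label strings, function names, and data points are made up for the example and are not the classified results.

```python
from collections import Counter


def decade(year):
    """Round a year down to the start of its decade."""
    return year - year % 10


def layout(points):
    """Place each (year, label) point at (decade, offset).

    Non-obscene uses get positive offsets (above the center line),
    obscene uses get negative ones (below it); each new point in a
    decade sits one step further from the line.
    """
    above = Counter()
    below = Counter()
    placed = []
    for year, label in points:
        d = decade(year)
        if label == "dog":
            above[d] += 1
            placed.append((d, above[d]))
        elif label == "obscene":
            below[d] += 1
            placed.append((d, -below[d]))
    return placed


# Invented sample data.
sample_points = [(1843, "dog"), (1848, "dog"), (1845, "obscene"), (1911, "obscene")]
print(layout(sample_points))
```

Keeping separate counters for the two directions is what lets both meanings show up in the same decade, in whatever proportion the data happens to give.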

It could also be improved upon. You could keep in memory the text samples from which these points were derived so that one could mouse-over each point and get data about what author/work that point represents, a keyword in context sort of view, and even a link to the full-text.

With a sufficiently complete data set, I would expect that we’d see that, during the twentieth century, the occurrences of the term as obscene would greatly increase while the occurrences of the term as meaning “female dog” would decrease. Exactly where the obscene meaning takes precedence would be the interesting thing to know. (Indeed, it is the thing I was interested in originally.)


A Final Thought

While I want to stress once again that this exercise in digital philological visualization (does that sound suitably buzzword-worthy to win me a prize of some sort?) fails, it fails because the data is not readily available; to get a meaningful result would require more, and better, data than is available from PG at present. (I’ll be putting this little toy problem on the back burner now, but would be interested in exploring other sources of data. Google Books is the obvious choice, but after spending some time playing with the Books API, I’m not sure the necessary data is currently available [nor am I confident that such a use is even permissible within the terms of use].)

If you will grant that ferreting out the historical contours of the changing uses of the term “bitch” is worthwhile (and maybe it isn’t; perhaps this whole post reeks of sheer pedantry), a visualization like this one seems to illustrate that change (or at least one aspect of it). And if you’ll grant all that, there is a final point worth making. This sort of visualization answers the question posed simply and without oversimplification, but it is tailor-made to this particular problem. This recalls something I recently read on the Humanist discussion list in a message by Richard Lewis. He wrote:

…I’m increasingly of the opinion that end user application style software is not really what scholars who are serious about exploring the possibilities of using technology to enhance their research or open new avenues of research require. Rather, I’m beginning to feel that a good grounding in programming, a simple, expressive language, and good provision of libraries for abstracting over data encodings and difficult algorithms required in each discipline will be much more conducive to interesting computational scholarship.

The things that make computational scholarship interesting can’t, I think, be packaged up in an end user application. Like scholarship conducted in any paradigm, computational scholarship is interesting and worthwhile when it’s exploratory. But the restrictions of an end user application seriously stiffle any possibility for exploration.

Such a statement has the potential to stir up a debate I’ve seen elsewhere about whether “Digital Humanists” should learn to program, which I have no interest in doing. Nevertheless, at least for tasks like the one I’ve (painfully) described here, I think the perspective Lewis describes is helpful. Insofar as I even made half a stab at solving this little riddle, it is because of the availability of a set of tools that are easy enough to be picked up by a nonspecialist, but supple enough to be used in unanticipated ways. In particular I would single out Python, the Natural Language Toolkit, and Processing. As has been noted elsewhere, Python’s simplicity makes it fun to work with and perfect for these sorts of problems. In addition to Python’s native facility with strings, the NLTK makes all sorts of text analysis tasks (frequency counts, etc.) very simple (and it is all wonderfully well documented). And Processing does for visualization what the NLTK does for text analysis.

Using them as I have here produces an admittedly heterogeneous solution, cobbled together out of what one can learn on the fly (biggest challenge—figuring out SAX processing to handle PG’s massive RDF catalog file). One could simply do everything I’ve done here using rgrep, Python, and Processing, within a single language: there are graphics libraries for Python, and one could do all the string/data manipulation by way of Processing (perhaps with some help from native Java libraries). But it seems that using a language in a task-specific way provides a helpful midway point between spending too much time trying to learn how to code, and just waiting for the exact right tool to appear (in this case, the obscene-semantics-historical-separator—surely it’s next from Google Labs).
