Chris Forster

I’ve been trying to think intelligently about the place of quantitative data in literary studies, especially in light of two excellent posts, one by Andrew Goldstone, the other by Tressie McMillan Cottom, both responding to this review by Ben Merriman.

But before I could even try to say something interesting in response, Ted Underwood announced that he was making available “a dataset for distant-reading literature in English, 1700-1922” (here is a link to the data). This post is a look at that data, mostly using R. I have, essentially, nothing thoughtful to offer in this post; instead, this is an exploration of this dataset (many, many thanks to Ted Underwood and HathiTrust for this fascinating bounty), studded with some anticlimaxes in the form of graphs that do little beyond give a sense of how one could begin to think about this dataset.

With the exception of a bash script (which may, though, be the most repurposable bit of code), everything here is done in R. I don’t like R, and I’m not very good with it (I think R’s data types are what make it a challenge; lists in particular seem to materialize out of nowhere and are frustrating to use), but it is great for making pretty graphs and for getting an initial handle on a bunch of data. I try to comment on, and explain, the code below (often in comments)—though if you’ve never looked at R, this may seem really weird. I also may have made some horrible mistakes; if so, please let me know.

The New HathiTrust Data Set

Underwood calls this dataset “an easier place to start with English-language literature” within the HathiTrust dataset. I had poked around the HathiTrust data before, and working with it really is a complicated undertaking. The dataset Underwood has provided makes this much, much easier.

The data can be downloaded here. In this post I’ll look at the fiction metadata and take a peek at the fiction word counts for the years 1915–1919. Those files look something like this:

  • fiction_metadata.csv: 17 megabytes, containing author, title, date, and place for each work of fiction. It also includes subjects, an id for HathiTrust (htid), and other fields.

  • fiction_yearly_summary.csv: 35 megabytes, containing token frequencies. The first 20 lines look like this.

year,word,termfreq,correctionapplied
1701,',162,0
1701,a,813,0
1701,further,2,0
1701,native,1,0
1701,forgot,9,0
1701,mayor,1,0
1701,wonder,13,0
1701,incapable,3,0
1701,reflections,5,0
1701,absence,5,0
1701,far,16,0
1701,performance,2,0
1701,say,44,43
1701,notorious,1,0
1701,words,15,0
1701,leaves,2,0
1701,unlucky,2,0
1701,aware,1,0
1701,differ,1,0
  • In a directory I uncompressed fiction_1915-1919.tar.gz. The result is 8656 files, each representing a single work, and totalling 827 megabytes. (827 megabytes of text is not “big data”—but it is enough to make toying with it on your laptop a little tricky at times. A quick way to do the uncompressing and counting from within R is sketched below.)
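If you prefer to stay in R for that unpacking step, here is a minimal sketch using untar and list.files; adjust the filename and exdir to match where you downloaded the archive and how it unpacks.

# Unpack the 1915-1919 fiction archive and count the per-volume files.
untar('fiction_1915-1919.tar.gz', exdir = 'fiction_1915-1919')
length(list.files('fiction_1915-1919', pattern = '\\.tsv$', recursive = TRUE))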

Examining the Metadata: Volumes of Fiction Per Year

So, let’s begin by loading our plotting library (ggplot2) and the fiction metadata file, fiction_metadata.csv.

library(ggplot2)

# Load the metadata from the CSV file
fiction.data <- read.csv('fiction_metadata.csv',header=T)

# Let's look at how many items we have for each date.
ggplot(fiction.data) + 
  geom_histogram(aes(x=fiction.data$date),binwidth=1) +
  ggtitle('Books per Year in Fiction Dataset') +
  xlab('Year') +
  ylab('Number of Books Per Year in Fiction Data')

Bar Plot of Works of Fiction Per Year in HathiTrust Dataset

This gives a sense of just how few books from before 1800 are in this dataset.

nrow(fiction.data)
   [1] 101948

nrow(subset(fiction.data,fiction.data$date < 1800))
   [1] 1129

That is, 101948 volumes total, 1129 of which were published prior to 1800, or about 1%. The number of volumes appearing in the dataset per year tends to increase steadily—with a few exceptions. That dip around 1861–1864 may be a result of particularly American factors influencing the dataset; and perhaps it is war again that accounts for some of the dip at the period’s end—though that dip seems to begin prior to 1914.
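If you want the exact numbers behind those dips (or the pre-1800 proportion), they can be pulled straight from the data frame; a minimal sketch:

# Volumes per year in a window around the 1861-1864 dip.
table(subset(fiction.data, date >= 1858 & date <= 1866)$date)

# Proportion of volumes published before 1800.
nrow(subset(fiction.data, date < 1800)) / nrow(fiction.data)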

Examining the Metadata: Change in Length of Volumes Over Time

The length of each volume is contained in the totalpages field in the metadata file. Let’s plot the length of works of fiction over time (so, plot date by totalpages).

ggplot(fiction.data,aes(x=fiction.data$date,y=fiction.data$totalpages)) +
  geom_point(pch='.',alpha=0.1,color='blue') +
  ggtitle('Length of Books by Year') +
  xlab('Year') +
  ylab('Length of Book, in Pages')

Not Especially Legible Plot of Length of Works of Fiction Over Time in HathiTrust Dataset

Interesting. It seems that, in the mid-eighteenth century near the dawn of the novel, works of fiction were around 300 pages long. Their length diversified over the course of the novel’s history, as novels grew both longer and shorter as the possibilities for fiction widened, perhaps as a function of increased readership stemming from both the decreasing cost of books and the increasing rate of literacy.

Well, not really. Matthew Lincoln has a very nice post about the dangers of constructing a “just-so” story (often to insist that this graph tells us “nothing new”). But there are at least two problems with the interpretation offered above—one broad and one more specific. Broadly, it is worth reiterating the danger of mistaking this data for an unproblematic representation of any particular historical phenomenon (say, especially, readership of novels). Underwood describes the dataset carefully as representing works held by “American university and public libraries, insofar as they were digitized in the year 2012 (when the project began).” And, of course, lots of other things which would be relevant to an investigation of fiction—think of pulp paperbacks and similar forms—will not be in that sample, because they were often not collected by libraries. (Likewise, as Underwood notes, pre-1800 books are more likely to be held in Special Collections, and therefore not digitized.)

The second point is specific to the graph above. That scatter plot is sparse in the early half of this period and very dense in the latter half. The translucency of each point (set by alpha=0.1) captures some of this, but nevertheless the graph as a whole overemphasizes the increased spread of the data, when really what is happening is an increase in the amount of data. If we plot things differently, I think this becomes evident. Let’s break down our data by decade, and then do a box plot of fiction length per decade:

# This helper function will convert a year into a "decade"
# through some simple division and then return the decade
# as a "factor" (an R data-type).
as.Decade <- function(year) {
  decade <- (as.numeric(year)%/%10)*10
  return(as.factor(decade))
}

# Add a "decade" column by applying our as.Decade function 
# to the data. (The unlist function... is because lapply returns
# a list, and I'm not very good at R, so that's how I got it to work.)
fiction.data$decade <- unlist(lapply(fiction.data$date, as.Decade))

# Box plot of our length data, grouped by decade
ggplot(fiction.data, 
  aes(x=fiction.data$decade,y=fiction.data$totalpages)) +
  geom_boxplot() +
  ggtitle('Length of Books, Grouped by Decades') +
  xlab('Decade') +
  ylab('Length of Books, in Pages')

Less Misleading Plot of Length Across Time in HathiTrust Fiction Dataset

This plot confirms that we do indeed see a greater range in the lengths of works of fiction over time (so my inference from the previous graph is not completely wrong). But the box plot clarifies what is, to me, a surprising constancy in the length of the works collected in this dataset. The apparent increase in variability in length is real—but it is not the most, or the only, salient feature of this data, a fact better captured by this second graph (the box plot) than by the scatter plot.
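To put numbers on that constancy, a quick aggregation of median page length per decade (a minimal sketch using the decade column we just added) works:

# Median page length per decade, to put numbers on the box plot.
aggregate(totalpages ~ decade, data = fiction.data, FUN = median)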

Summary: Frequently Occurring Terms

The file fiction_yearly_summary.csv contains the per-year frequencies of the top 10,000 most frequently occurring tokens in the fiction dataset. We can chart the fluctuations of a term’s use, for instance, across the period.

# Load our data
yearly.summary <- read.csv('fiction_yearly_summary.csv')

# Extract some meaningful bit, say, occurrences of `love`
love <- subset(yearly.summary, yearly.summary$word=='love')

# Plot it
ggplot(love,aes(x=love$year,y=love$termfreq)) +
  geom_line() +
  xlab("Occurences of token 'love'") + ylab('Year') +
  ggtitle('"Love" in the Dataset')

Unnormalized Occurrences of the Term 'Love' in the Dataset

Yet, of course, looking at that sharp rise, we quickly realize—yet again—the importance of normalization. We are not witnessing the explosion of love at the dawn of the twentieth century (and its nearly as rapid declension). We could normalize by adding all the words together—but we only have counts for the top 10,000 words. Thankfully, the dataset offers “three special tokens for each year: #ALLTOKENS counts all the tokens in each year, including numbers and punctuation; #ALPHABETIC only counts alphabetic tokens; and #DICTIONARYWORD counts all the tokens that were found in an English dictionary.”

So, let’s normalize by using DICTIONARYWORD.

# Let's extract the DICTIONARYWORD tokens into a data frame
yearly.total <- 
subset(yearly.summary,yearly.summary$word=='#DICTIONARYWORD')

# Let's simplify this dataframe to just what we're interested in.
yearly.total <- yearly.total[c('year','termfreq')]

# And rename the termfreq column to "total"
colnames(yearly.total) <- c('year','total')

# Now we can use merge to combine this data, giving each row 
# a column that contains the total number of (dictionary words)
# for that year. 
love.normalized <- merge(love, yearly.total, by=c('year'))

# This method profligately repeats data; but it makes things 
# easier. The result looks like this:
head(love.normalized)
>   year word termfreq correctionapplied  total
> 1 1701 love      222                 0  37234
> 2 1702 love        1                 0   7036
> 3 1703 love      524                 0 416126
> 4 1706 love       12                 0  36501
> 5 1708 love      578                 0 482779
> 6 1709 love      361                 0 133847

# Now, graph the data
ggplot(love.normalized,
       aes(x=love.normalized$year,
           y=(love.normalized$termfreq/love.normalized$total)))+
  geom_line() +
  xlab('Year') +
  ylab('Normalized Frequency of "love"') +
  ggtitle('The Fate of Love')

Normalized Plot of 'Love' in the Dataset

Well, that looks about right. Just for fun, let’s try a different term, one that is something less of an ever-fixed mark, but which perhaps alters its relative frequency when it historical alteration finds.

# We subset the term we're interested in.
america <- subset(yearly.summary, yearly.summary$word=='america')
# And normalize using our already-constructed yearly.total 
# data frame.
america.normalized <- merge(america, yearly.total, by=c('year'))

# Plot as before, though this time we'll use geom_smooth() 
# as well to add a quick "smooth" fit line to get a sense of 
# the trend. Minor digression: things like geom_smooth() are one 
# of the things that make R great (if very dangerous) for an 
# utter amateur.
ggplot(america.normalized,
    aes(x=america.normalized$year,
    y=(america.normalized$termfreq/america.normalized$total)))+
    geom_line() +
    geom_smooth() +
    xlab('Year') +
    ylab('Normalized Frequency of "america"') +
  ggtitle("Occurences of 'america' in the Dataset")

Occurrences of 'america' in the Dataset

Not sure there’s much surprising here, but okay, seems reasonablish.

Extracting Counts from Individual Volume Files

Now, what if you want to look at terms that don’t occur in the top 10,000? Then you need to dig in to the files for individual volumes. For simplicity’s sake, I’ll look only at one set of those files, representing volumes of fiction between 1915 and 1919, which I’ve uncompressed in a subdirectory called fiction_1915-1919.

I’ve been using R for everything so far, and I imagine you could use R to loop over the files in the directory, open them up, and look for a specified term. As someone who finds R idiosyncratic to the point of excruciation, this doesn’t sound particularly fun. R is great when you’re manipulating/plotting data frames—less so when doing more complicated tasks on the filesystem. So, to extract the information we want, I’ll use a simple bash script.

#!/bin/bash

# Our input directory
INPUTDIRECTORY=./fiction_1915-1919

# Let's take a single command line argument ($1) and store it
# as the value we're looking for (the proverbial needle in our
# data haystack).
NEEDLE=$1

# We use this convention, with find and while read 
# because a simple for loop, or ls, might have a problem
# with ~10000 files.
find "$INPUTDIRECTORY" -type f | while read -r file
do
    # For each file, we use grep to search for our term,
    # storing just the number of occurences in result.
    result=$(grep -w -m 1 "$NEEDLE" "$file" | awk '{ print $2 }')
    # Get the htid of the file we're looking at from the filename
    id=$(basename "$file" .tsv)
    # And then print the result to the screen
    echo "$id,$result"
done

I’m assuming some familiarity with bash scripts; to make a script executable, it’s enough to type chmod +x wordcounter.bash. Save this script to a file (say, wordcounter.bash), make it executable, and then run it with an argument: ./wordcounter.bash positivism and it will print its output to the screen; redirect that to a CSV file (type ./wordcounter.bash positivism > positivism.csv) and you can use it in R. Here is what the results look like when they start appearing on the screen:

bc.ark+=13960=t19k4r10s, 
bc.ark+=13960=t25b0m976, 
bc.ark+=13960=t6tx3tq53, 
chi.086426399, 
chi.086523141, 
chi.64465423, 
chi.73664930, 
coo.31924002898983, 
coo.31924013129774, 

Those gibberish-looking strings (bc.ark+=13960=t19k4r10s) are HathiTrust IDs. Then you get a comma, and after the comma the number of times the term appeared in the file… unless it didn’t appear, in which case you just get a blank.

Some Notes

This will only work on unixy systems—Linux, OSX, or (I assume) cygwin on Windows.

When a token does not appear in a file, this script outputs the htid, a comma, and then nothing. That’s fine—it’s easier to handle this after we’ve imported the resulting CSV (to, say, R) than it would have been to write some logic in this script to output 0. Also, this crude method is probably faster than doing it within R or Python and is certainly not slower. It could be sped up by doing something fancy, like parallelization. To search through the 8656 files of fiction_1915-1919 for one term took 1 minute and 12 seconds—a totally manageable timeframe. Assuming that rate (processing, say, 120 files/second) is roughly constant across the dataset of roughly 180,000 volumes, it should be possible to use this method to search for a term across all the volumes in the dataset in roughly 25 minutes, give or take. That is, of course, based on doing this on my laptop (with a 1.8GHz Core i5 CPU), with no parallelization (though this should be an eminently parallelizable task—like, really). Not fast, but totally manageable.
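For comparison, here is a minimal sketch of what the R loop alluded to above might look like. It assumes each per-volume .tsv file has the token in the first tab-separated column and its count in the second (which is what the awk call above relies on), and it will not be faster than the bash version.

# Count occurrences of a single term across every per-volume file.
count.term <- function(term, dir = 'fiction_1915-1919') {
  files <- list.files(dir, pattern = '\\.tsv$', full.names = TRUE)
  counts <- sapply(files, function(f) {
    fields <- strsplit(readLines(f, warn = FALSE), '\t')
    hit <- Filter(function(x) length(x) >= 2 && x[1] == term, fields)
    if (length(hit) == 0) 0 else as.numeric(hit[[1]][2])
  })
  data.frame(htid = sub('\\.tsv$', '', basename(files)),
             count = counts, row.names = NULL)
}

# e.g.: positivism <- count.term('positivism')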

Plotting Our Extracted Counts from Individual Volume Files

So, assuming the script works… back to R.

# Input the data culled by our custom bash script
gramophone <- read.csv('gramophone.csv')
film <- read.csv('film.csv')
typewriter <- read.csv('typewriter.csv')

# Remember all those spots where a token doesn't occur, 
# which appear as blanks? Those get read by R as NA 
# values. Here we replace them with zeros.
gramophone[is.na(gramophone)] <- 0
film[is.na(film)] <- 0
typewriter[is.na(typewriter)] <- 0

# Let's rename our columns
colnames(gramophone) <- c('htid','gramophone')
colnames(film) <- c('htid','film')
colnames(typewriter) <- c('htid','typewriter')

# We'll put this data together into one data frame
# for convenience sake.
gft <- merge(gramophone,film,by=c('htid'))
gft <- merge(gft,typewriter,by=c('htid'))

Right now, though, all we have is HathiTrust IDs and frequencies of our term (or terms). We have no information about date, or title. So let’s get that information from the metadata files we’ve worked with earlier.

# Now get the metadata from fiction_metadata.csv and
# merge based on htid.
fiction.data <- read.csv('fiction_metadata.csv',header=T)
gft <- merge(gft,fiction.data,by=c('htid'))

# To normalize let's load our annual totals as well. We can
# merge those with our dataframe based on date.

# Get Yearly Totals
yearly.summary <- read.csv('fiction_yearly_summary.csv')
yearly.total <- subset(yearly.summary,yearly.summary$word=='#DICTIONARYWORD')
yearly.total <- yearly.total[c('year','termfreq')]
colnames(yearly.total) <- c('date','total')

# Merge yearly totals with our main dataframe based on date.
gft <- merge(gft,yearly.total,by=c('date'))

# Our dataframe is now 23 columns:
colnames(gft)
> [1] "date"          "htid"          "gramophone"    "film"         
> [5] "typewriter"    "recordid"      "oclc"          "locnum"       
> [9] "author"        "imprint"       "place"         "enumcron"     
>[13] "subjects"      "title"         "prob80precise" "genrepages"   
>[17] "totalpages"    "englishpct"    "datetype"      "startdate"    
>[21] "enddate"       "imprintdate"   "total"        

# That's not crazy, but to make things easier to understand, 
# let's subset just the data we're interested in right now---say,
# the occurrence of our terms and their date.
gft.simple <- gft[,c('date','gramophone','film','typewriter','total')]

head(gft.simple)
>   date gramophone film typewriter     total
> 1 1915          0    0          0 106553905
> 2 1915          0    1          0 106553905
> 3 1915          0    0          0 106553905
> 4 1915          0    0          0 106553905
> 5 1915          0    0          0 106553905
> 6 1915          0    0          0 106553905
nrow(gft.simple)
> [1] 8655

Okay, looks good—there are our 8655 volumes, each with its date of publication, the occurrences of our three search terms (gramophone, film, and typewriter), and the total number of DICTIONARYWORDs for that year. Note that each row still represents a single volume—but we’ve discarded author, title, htid, etc. We’ve also added the total dictionary words for a volume’s year to each row (note the repeated totals in those first 1915 volumes), which is grossly inefficient. All this, however, is in the interest of simplicity—so that we can easily plot the relative occurrences of our selected terms (here, gramophone, film, and typewriter).
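If you would rather see those counts collapsed to one row per year, rather than one per volume, a quick aggregate call (a sketch on the same gft.simple data frame) does it:

# Sum the per-volume counts into yearly totals for each term.
aggregate(cbind(gramophone, film, typewriter) ~ date,
          data = gft.simple, FUN = sum)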

In order to make this data easily plottable, we need a few more R tricks: we need to reshape our data from “wide” format to “long” format (using the melt function from the reshape2 package). Then we can create a stacked bar graph of terms per year. Let’s start by plotting our raw counts.

# Our libraries
library(reshape2) # For melting data.
library(ggplot2)  # For graphing data.

# This next is necessary b/c R throws an error otherwise. 
# Not totally sure why...
gft.simple$date <- as.factor(gft.simple$date)

# Create a "long" format matrix, from our raw counts data.
gft.m <- melt(gft.simple[,c('date','gramophone','film','typewriter')],id.vars='date')

# Create a bar plot of all our values, coded by variable
ggplot(gft.m,
aes(factor(date),y=value,fill=variable)) +
  geom_bar(stat='identity') +
  xlab('Year') +
  ylab('Raw Word Occurrence') +
  ggtitle("Raw Counts for 'gramophone,' 'film,' and 'typewriter'")

Stacked Bar Chart of 'gramophone', 'film', and 'typewriter' Occurrences in the Dataset, 1915-1919

These are, though, raw counts. To normalize, we can divide the counts for our terms by the total and plot the result.

# We'll create a new data frame for our normalized data, 
# beginning with our simplified data.
gft.normalized <- gft.simple

# In this new dataframe, normalize our scores by dividing 
# the raw count in each row by the total in each row.
gft.normalized$gramophone <- gft.normalized$gramophone/gft.normalized$total
gft.normalized$film <- gft.normalized$film/gft.normalized$total
gft.normalized$typewriter <- gft.normalized$typewriter/gft.normalized$total

# How does it look?
head(gft.normalized)
> date gramophone         film typewriter     total
> 1 1915          0 0.000000e+00          0 106553905
> 2 1915          0 9.384921e-09          0 106553905
> 3 1915          0 0.000000e+00          0 106553905
> 4 1915          0 0.000000e+00          0 106553905
> 5 1915          0 0.000000e+00          0 106553905
> 6 1915          0 0.000000e+00          0 106553905

# Well, that looks about right. Let's begin our melt/plot 
# process again by creating a matrix.
gft.norm.m <- melt(gft.normalized[,c('date','gramophone','film','typewriter')],id.vars='date')
ggplot(gft.norm.m,aes(factor(date),y=value,fill=variable)) +
  geom_bar(stat='identity') +
  xlab('Year') +
  ylab('Normalized Word Frequency (by Year)') +
  ggtitle("Normalized Scores for 'gramophone,' 'film,' and 'typewriter'")

Normalized, Stacked Bar Chart

Normalization makes some minor adjustments, but the picture is pretty similar. I’m not sure I would want to make any claims as to the importance or meaning of these graphs: they cover a short historical span, and so far lack any richer contextualization. Like I said, for now, anticlimaxes.

Don’t get me wrong, Markdown’s great. Indeed, nearly all the writing I do now is in Markdown (or at least starts that way). There has been a good amount of writing about the virtues of Markdown for academic writing in particular, so I won’t rehearse those arguments here.

But Markdown, as it stands, has some drawbacks, which become acute when you are trying to extend it to cover the needs of academic writing (or, say, as a transcription format for texts).

The Problem

What I will describe as “problems” all stem from the fact that Markdown remains essentially a simplified syntax for HTML. A tool like Pandoc, which has a special (and especially powerful) flavor of Markdown all its own, helps reduce the borders between document formats. With Pandoc it becomes easy to convert HTML to LaTeX, or Rich Text Format to Word’s .docx. It could easily feel like Markdown is a universal document format—write it in Markdown, and publish as whatever.

That is a lovely dream—an easy-to-write plaintext format that can easily be output to any desired format. In reality, though, Markdown (even Pandoc’s Markdown) remains yoked to HTML, and so it suffers from some of its problems.

The problem I encounter most frequently in HTML (and in Markdown) concerns nesting a block quote within a paragraph. In short, can you have a block quote within a paragraph? If you’re writing HTML (or Markdown), the answer is no—HTML treats block quotes as block elements, which means that one cannot be contained within a paragraph (this restriction does not exist in LaTeX or TEI). Yet what could be more common in writing on works of literature? Representing poetry presents its own problems for HTML and Markdown. (By contrast to the challenge presented by the mere fact of poetry, note the many syntaxes/tools available for fenced code blocks, syntax highlighting, and so on; Markdown, for now, remains of greatest interest to software developers and so reflects their habits and needs.) (Note: If you’re looking for practical advice, you can easily represent poetry in Pandoc’s markdown using “line blocks”; this is not a perfect solution, but it will do for many needs.)

Perversely, Markdown also represents something of a step backward with regard to semantics. If you’ve spent some time with HTML, you may have noticed how HTML5 cements a model of HTML as a semantic markup language (with, implicitly, matters of presentation controlled by CSS). That means that the <i> tag, which long ago meant italics, has since acquired semantic meaning. According to the W3C, it should be used to “represent[] a span of text offset from its surrounding content without conveying any extra emphasis or importance, and for which the conventional typographic presentation is italic text; for example, a taxonomic designation, a technical term, an idiomatic phrase from another language, a thought, or a ship name.” In those instances where one wishes to express emphasis, use the <em> tag. If you need to mark a title, don’t simply italicize it; use <cite>. (But hold up: that cite element obscures the distinctions we normally make between italicizing certain titles and putting others in quotation marks.) In practice, of course, I doubt these distinctions are widely respected across the web; but all those at least potentially useful distinctions are lost in Markdown, whose syntax marks them all with * or _. Markdown is, in fact, rather unsemantic. (To a lesser degree, one might detect this tendency as well in the way headings—rather than divs—are Markdown’s primary way of structuring a document, but I’ll stop now.)

So, two points: Markdown inherits HTML’s document model, which includes an inability to nest block-level elements within paragraphs; and, in simplifying HTML, it produces a less semantically clear and rich format. (Technically, of course, one could simply include any HTML element for which Markdown offers no shortened syntax—like <cite>, for example.)

A Solution

On the CommonMark forum, some folks have proposed additional syntax to fix the latter problem and capture some of the semantic distinctions mentioned above (indeed, following the discussions over there has helped sensitize me to some of the challenges and limitations of markdown as a sort of universal format donor). So some of these issues could be resolved through extensions or modifications of Markdown.

Yet, given these deficits in Markdown, I wonder if it isn’t worth asking a more basic question—whether the plaintext format for “academic” writing should be so tightly yoked to HTML? If Markdown is, fundamentally, a simplified, plaintext syntax for HTML, could we imagine a similar, easy-to-write, plaintext format that wouldn’t be tied to HTML? Could we imagine, say, a format that would represent a simplification of syntax, not of HTML, but of a format better suited to the needs of representing more complex documents? Could we imagine a plaintext format that would be to TEI, say, what markdown is to HTML?

Such a format would not need to look particularly different from Markdown. Its syntax could overlap significantly; as in Pandoc’s Markdown format, file metadata (things like title, author, and so on) could appear (perhaps as YAML) at the front of the file (and be converted into elements within teiHeader). You could still use *, **, and []() as your chief tools; footnotes and references could be marked the same way (you could preserve Pandoc’s wonderful citation system, with such things represented as <refs> in TEI).

The most substantive difference would not be in syntax, but in the document model. Any Markdown file can contain HTML—all HTML is valid Markdown; this ensures that Markdown is never less powerful than HTML. But is that inheritance worth its costs if one wishes to do scholarly/academic, or similar types of, writing in plaintext? Projects exist to repurpose Pandoc markdown for scholarly writing: Tim T. Y. Lin’s ScholarlyMarkdown, or Martin Fenner’s similar project, or the workflow linked-to above, by Dennis Tenen and Grant Wythoff at the Programming Historian. What I’m imagining, though, is entirely less practical than any of these projects at the moment, because it would necessitate a change in the document model into which markdown is converted. Pandoc works its magic by reading documents from a source format (through a “reader”) into an intermediary format (a format of its own that you can view by outputting -t native), which it can then output (through a “writer”). Could TEI (or some representation of it), essentially, fulfill that role as intermediary format? (A Pandoc car with a TEI engine swapped in?)

I like writing in plaintext, but I don’t love being bound by the peculiarities that Markdown has inherited from HTML. So it is worth considering what it is that people like about Markdown. I suspect that most of the things people like about Markdown (free, easy to write, nonproprietary, easily usable with version control, and so on) have little to do with its HTML-based document model, but stem from its being a plaintext format (and from the existing infrastructure of scripts/apps/workflows around Markdown). TEI provides an alternative document model—indeed, a richer document model. Imagine a version of Pandoc that uses TEI (or a simplified TEI subset) behind the scenes as its native format. Folks often complain about the complexity and verbosity of TEI (and XML more generally), and not without reason. I would certainly never want to write TEI; but a simplified syntax that could then take advantage of all the virtues of TEI, that would be something.

[Closing Note: At one point I wondered how easy it would be to convert markdown to TEI with Pandoc… I’ve managed to finagle a set of scripts to do that; it’s janky, but for anyone interested, it’s here.]

Recall these lines from Clement C. Moore’s “A Visit from Saint Nicholas” (alternately titled “The Night Before Christmas” or “‘Twas the Night Before Christmas”), first published in 1823. (See the poem’s Wikipedia page for some notes on the contested question of its authorship.)

When what to my wondering eyes should appear, But a miniature sleigh and eight tiny reindeer…

But exactly how miniature is this sleigh, and how tiny are these reindeer? While Moore’s poem did a lot to consolidate the mythology of Santa Claus, one thing that has not survived from Moore’s Saint Nicholas is his height. Recall this insistence on the tininess of Santa and much of the confusion around his movement through chimney flues is eliminated. But it also lends a different stress to the description of the elf’s nose as “like a cherry” or of his “little round belly” that shakes “like a bowl full of jelly.” At stake here is not simply nose complexion nor belly texture, but size.

If today our Santa is bigger, it was not always so. And many earlier illustrations are consistent with Moore’s text. Consider these from a 1912 edition [archive.org] of the poem, illustrated by Jessie Wilcox Smith:

Santa

Santa Filling Stockings

Likewise, look at this svelte Santa, by Arthur Rackham, from this undated edition [HathiTrust], clearly small enough to slip easily down that chimney:

Santa Emerging from Chimney

You can find more Santas at the Public Domain Review, including a gun-toting, WWII Santa, or listen to the poem on wax cylinder [1914].

Spoilers Abound Below

Pioneer F Plaque Symbology

Revised December 5; in the first version, I confused the name of a character (calling Dr. Mann “Dr. Miller”).

Interstellar beats the drum of Dylan Thomas’s villanelle “Do Not Go Gentle Into that Good Night” pretty hard—reciting it on multiple occasions (though never all the way through, if I recall correctly, and so never really enjoying its full villanelle-ness). Poetry in the movies often serves a chiefly hortatory, emotive function; it is discourse of moral and emotional seriousness. It is recited by serious people (from memory, of course), and it shows their seriousness. And here seems no different. It confers dignity and emotional seriousness on what would otherwise be the mere extinction of humanity. (That summarizes, perhaps, my chief gripe about the movie: its bullying emotionalism. Its soundtrack, in particular, bullies you into feeling what it wants you to feel. As my much beloved Flophouse Podcast is fond of noting, is it really necessary to reinforce the stakes in this way? Is the drama of interstellar exploration so boring that only by augmenting it with heaping doses of Dylan Thomas, or a thudding score, will we realize its import?)

In the dystopian future of Interstellar, nearly all crops are dying from an unexplained blight, and NASA Scientist Prof. Brand (Michael Caine) is leading a secret team to save humanity. He offers the poem as a sort of allegory for the necessity of humanity resisting its fate. It is the species that must not go gentle into that good night. And so the addressee of Dylan Thomas’s poem, which is offered from a child to a father, is reversed. Thomas writes, in the villanelle’s conclusion:

And you, my father, there on the sad height,
Curse, bless, me now with your fierce tears, I pray.
Do not go gentle into that good night.
Rage, rage against the dying of the light.

While the poem advises resistance to closure and finality, the formal demands of the villanelle, which brings together its rhyming refrains in its closing couplet, inexorably move toward them. (Elizabeth Bishop’s perhaps superior villanelle “One Art” wonderfully expresses its emotion and irony by defying the meter of the villanelle in its final line.)

Thomas’s poem of grieving stands in tension with its form. Its rage is, of necessity, purely affective—it has no real consequence; death is as sure as the rhyme which snaps together the poem’s close. But not so in the dystopian future of Interstellar where Thomas’s words become not the lament of a child to a parent, but the advice of a father to his children. The generational logic of the poem is turned on its head and the poem becomes not the cry of the grieving child at a death as inevitable as the end of day, but an expression of the parent’s anxiety that children (not even his children; but children) will simply wither out of existence. Thomas’s poem grieves the natural course of things; Prof. Brand’s reading repurposes it as a resistance to the potential extinction of that putatively natural course.

And yet, the poem’s place in the film is vexed. It is recited by the characters who (after a precisely timed revelation) are revealed as something like the film’s “villains”—characters whose lies reveal that the will to live and the refusal to acquiesce are not, in themselves, particularly good things; raging ain’t so great after all. It turns out that Brand’s Plan A—the mass migration of the human population to another planet once he cracks a pesky gravity equation (which, like any good academic, he promises requires just a little more research)—is a noble lie. On his deathbed he reveals that he already knew the equation would never work out. Plan A was a false promise fed to people who would be unwilling to hazard the risks of interstellar travel unless their own lives, or those of their family, were guaranteed. After all, no one would sacrifice themselves merely for Plan B, wherein the human species is preserved in a sort of dorm-fridge full of petri dishes (“genetic samples”) and shipped off-planet. A process which the younger Prof. Brand (daughter of Caine’s Prof. Brand, played by Anne Hathaway) assures us would be totally effective and superior to earlier forms of colonization because it ensures genetic diversity. (Um… imperialist biopolitics much, Professor Brand?)

The other person we hear recite Thomas’s poem is Dr. Mann (played by a handsome, young up-and-comer), who deceives our intrepid explorers with forged data suggesting that the planet he is exploring is a reasonable prospect for human colonization. (I’m pretty sure it is Mann who recites it, but I’m not entirely positive… I’ve only seen the film once. Boy, this is all gonna be really unconvincing if I misremembered this.) Mann forges that data to justify his own worthiness to be retrieved from the planet. As Mann explains to Cooper (while he is killing him… he has really missed human conversation while in cryo-freeze), the will to live is simply too strong; Mann knows he’s a coward, but insists that Cooper has never had to face the sort of isolation and horror that he has. The will to live (that rage against the dying of the light) is so strong in Mann that he’s willing to lie, and to kill (both Cooper and Romilly), for it.

And so, the rage against death that Mann and Brand profess, by way of Thomas’s poem, is not a good in and of itself. Indeed, their recitations of the poem mark them as self-interested to the point of villainy. They quote the poem to buttress their rage against death itself—their own, individual death (in the case of the more villainous Mann) or that of the species (in the case of Brand). But the film ultimately rejects this position—it is not life which needs to continue (cue music and impassioned speech by Anne Hathaway) but love (and love of a very recognizable, reproductive sort). What old folks should do at the end of day, like the elderly Murph Cooper at film’s end, is not rage against the dying of the light, but quietly die in the peace and comfort of their children. Can one imagine a more forceful restoration of the conventional order of things than Murph quickly dispatching her father back to interstellar space in order to find a girlfriend? This is what T.J. West calls, fairly I think, the film’s “ruthlessly heterosexual love plot that could have come straight out of a screenwriter’s how-to manual.”

And so, the film refuses the queer reproduction of Plan B (I leave aside any potential connection one may see between the film’s Plan B and the contraceptive of the same name), and delights in a reproductive futurity for which the reuniting of Anne Hathaway’s character with Matthew McConaughey’s is important and meaningful. Thomas’s poem comes to stand not, as it might appear in the trailer, as some exhortation to intergalactic heroism in the face of global environmental catastrophe, but as the most explicit statement of the position to be resisted—one where the affective attachments of individuals (in Thomas’s poem, the speaker to his father) may be fundamentally at odds with the nature of the world in which we live (the necessity of death). Whatever the rage of Thomas’s poem accomplishes, it doesn’t set up colonies on distant planets.

Over and over, characters in the film (Cooper chiefly) are told that they must realize their mission is bigger than their petty human attachments. Cooper must think beyond his children; “You can’t just think about your family,” Doyle says, “You have to think bigger than that.” And he is echoed by Brand: “You might have to decide between seeing your children again and the future of the human race.” Brand herself must defer to objective facts in choosing which planet to visit; the data, not her love for Dr. Edmunds, must decide. But in the film all of this turns out to be untrue. John Brand’s insistence that “Nothing in our solar system can save us” is, at best, half true—it is the plucky Murph Cooper who saves the world from her childhood bedroom. “We must think not as individuals but as a species,” Prof. Brand insists. Interstellar goes out of its way—with some pretty cringe-inducing moments—to create a universe where precisely the opposite is true, where the affective attachments of individuals are what save the species. After all, if Cooper had listened to Brand (had listened to love) and gone to Edmunds’s planet rather than Mann’s, all would be well now.

Interstellar tackles a posthumanity-shaped problem, but answers it with a humanity so cloying it is almost (almost!) indigestible. It turns out that the problems of three little people do amount to a hill of beans in this crazy world—indeed, they amount to the whole world.

The Podcast as a Genre

What precisely is a podcast? I once heard a minimal definition of a podcast as an mp3 file attached to an RSS feed—which is to say, syndicated audio content on the internet. But looking around, there are plenty of podcasts that don’t meet these criteria: podcasts that lack an RSS feed (WHY?!?), to say nothing of “video podcasts” (which people are apparently still trying to make happen). “Podcast” can sometimes be used as a verb to mean something like “transmitting audio over the internet” (e.g. “Will you be podcasting that keynote lecture?”). Looking at iTunes, you realize plenty of “podcasts” are just radio shows put on the internet: iTunes’s most popular podcasts are mostly public radio fare (like “This American Life” and “Radiolab”).

But the podcast is not simply a technology or a channel. I’ve been listening to podcasts for a while now and have been curious to watch my habits slowly shift, moving away from “radio shows on the internet” (Fresh Air, whenever I want it!) to something else. This piece looks at the “return” of podcasts as a medium, mostly considering the podcast as a business model. It does, however, offer this, from “Planet Money” podcaster Alex Blumberg, on what makes podcasts different:

“It’s the most intimate of mediums. It’s even more intimate than radio. Often you’re consuming it through headphones. I feel like there’s a bond that’s created.” Source

That seems entirely right to me, and it helpfully points to some of the ways that what I’ll call the podcast as a genre differs from the podcast understood as just “radio over the internet.” The “podcast” as a form blurs the line between a medium (say, a recurring, asynchronously consumed type of audio—usually neither music nor fiction) and a genre. The podcast, as medium, has been enabled by readier access to bandwidth, software technologies like iTunes syndication and RSS, developments in hardware like relatively cheap but entirely decent microphones (woe unto the podcaster who relies on built-in mics on laptops and phones, for he shall receive low traffic), and of course the iPod. But these technologies, in their use, create a sort of gravitational pull toward a form that is less formal, more niche, and therefore oddly closer to a sort of specialized and heightened mode of casual conversation than it is to most radio genres.

When the costs of creating and distributing recordings of folks talking into microphones get way cheaper than the costs of writing/producing/reporting stories, you get a new sort of show—one where folks just sit around and talk. Central to the conventions of this genre is, I think, the group of regular or semi-regular folks who sit around and talk about something. Such are Leo Laporte’s TWiT podcasts; the original TWiT, one of the first podcasts I listened to, was indeed Leo Laporte sitting with folks (some of whom his listeners recognize as, like Laporte, erstwhile TechTV employees) and talking about the week’s technology news. This form tends to be parasitic on some other type of content—on news or culture (daily or weekly or semi-regularly), or even on a specific film or primary text. There has to be some reason, some excuse or alibi, for the conversation to exist—but the podcast offers a conversation rather than the news.

This may not seem especially novel—after all, personality-driven “analysis” now dominates cable news. Yet cable news analysis shows usually center on a single individual, and their dominant moods are outrage or indignation or derision; they tend to be centered on a personality (variably likeable or not) who offers a “perspective.” But what a podcast offers is not a perspective (or not chiefly a perspective) but something more like a performance of community. In place of the singular personality, we get personalities. A podcast tends to create characters, or caricatures, out of its hosts: for instance, Stephen Metcalf’s snobbish nostalgia for the world of print clashing regularly with Julia Turner’s culturally omnivorous techno-utopianism on the Slate Culturefest (both, of course, unfair exaggerations). But in other podcasts (perhaps notably, podcasts not affiliated with any large online media presence), this develops into a sense of shared reference—something like insiderness or knowingness. The result is that certain podcasts (the podcastiest of the podcasts, by my sense of the genre) rely heavily on inside jokes. Consider the following short phrases: “Who the hell is Casey?”; “Does this look clean to you?”; “The Port Hole of Time.” To the listeners of certain podcasts, they will immediately register as inside jokes—from, respectively: The Accidental Tech Podcast; Back to Work (quoting the film The Aviator, which in the universe of Back to Work is frequently referred to as simply the film); and The Flop House. Listeners of these podcasts (and I listen to all of these pretty faithfully, though the truly faithful will likely fault my selections) come to recognize these, and participate in the joke. These podcasts create a universe of reference alienating to the newcomer, but comforting to the regular. And the result is just wonderful. These are my guiltiest of guilty pleasures. I try to conceal my love for them, but I cannot.

That intimacy of the medium described by Alex Blumberg, created by the circumstances of consumption (on headphones or in the car; are these things great, or what?), manifests in the genre as a tendency towards dense self-reference.

The result is that the topic of the podcast can increasingly seem to be just an alibi for the interactions of its hosts. I don’t really care about Apple News, but I listen to ATP regularly. The greatest joy of The Flop House (a “bad movie” podcast, which reviews/discusses relatively recent theatrical “flops”) is the experience of hearing the hosts summarize the plot of a movie, and the digressions that ensue. One emphatically does not have to have seen the movie to enjoy the podcast, and unlike a review (or even the discussions of film and TV on the Slate Culturefest), it is completely beside the point whether you will see the movie at some point in the future. I suspect that I’ll never see the Bratz movie; but I shall cherish all the days of my life The Flop House’s discussion of it. Listen to early episodes and you’ll see that the plot summary initially presented a challenge—something they glossed over or tried to get past in order to get to the discussion (on at least one occasion they just read the Wikipedia summary of a movie). But the joy of the show is entirely in the interactions between its hosts, and so something as rote as a plot summary becomes the perfect opportunity for such interaction. It also explains why I, at least, find these sorts of shows more engaging than other audio content. The academic lecture, or even the Fresh Air-style interview, sometimes allows distraction. But the developing conversation, and tissue of self-reference, simulates the experience of interaction rather than, say, the communication of information. (What an interview show like Fresh Air lacks is the regularity of its participants; you’re usually learning something about a guest rather than listening in on a conversation between people who already know each other.)

By foregrounding in-jokes and habits of communication, the podcast turns out to be a cousin to that other “internetiest” of forms: the meme. The meme is likewise an in-joke, where the in-group is those folks who recognize the meme and understand its conventions. The humor of any individual “doge” meme (remember that?) is siphoned off from the larger system of doge memes that makes any particular meme legible and funny. (A picture of a cat with some funny, misspelled words, encountered in utter isolation, carved into the face of some alien moon millennia hence, would be funny because absurd—but it wouldn’t be a meme and wouldn’t participate in its humor.)

The affective range of the podcast is much wider than that of the meme, chiefly because hearing a conversation between the same set of people (semi)regularly opens more possibilities than silly pictures and block letters. (There, I said it; call me elitist.) But this affective depth cuts the other way—it also suggests what I find mildly unsettling about the form, and perhaps slightly embarrassing about my enjoyment of it. If I’m right that inside jokes, and a certain performance of knowing insiderness, are what separate the podcast as a genre from its radio peers, it also feels a little like media consumption as simulated friendship. Its enjoyments are those of easy familiarity and comfortable in-jokes, but with friends who aren’t yours. (You might call this the anxiety of authenticity, and I’ll just take my lumps for worrying over something as old-fashioned as authenticity.)

More troublingly, that same affective register (of chummy friendship and inside jokes) seems downright insidious when you realize how overwhelmingly the list of podcasts I’ve cited here is dominated by white guys. Insofar as the pleasures and affects of the genre are those associated with the proverbial boys’ club, it is dismaying to see how much of a boys’ club it often is.

What is a podcast? It is the humanization of the internet meme, a type of low-participation friendship, a reduced agency form of “hanging out.”

Yours in Flopitude, Chris [Last Name Withheld]
