Chris Forster 2015-09-30T12:06:05-04:00 http://cforster.com Chris Forster Creative Commons Attribution-ShareAlike 3.0 Unported License Two Cheers for Banned Books Week 2015-09-30T00:00:00-04:00 http://cforster.com/2015/09/banned-books/ <p><span class='marginnote'>My title, recalling E. M. Forster&rsquo;s (no relation, sadly) <em>Two Cheers for Democracy</em>, might be too generous. I can&rsquo;t imagine mustering more than two cheers for anything. Two is probably the utter limit of my cheering.</span></p> <p>Is &ldquo;Banned Books Week&rdquo; anachronistic? That&rsquo;s the claim of <a href="http://www.slate.com/articles/arts/culturebox/2015/09/banned_books_week_no_one_bans_books_anymore_and_censorship_of_books_is_incredibly.single.html">this article at Slate</a>. John Overholt, in <a href="https://twitter.com/john_overholt/status/648471105204830208">a single tweet</a>, manages to voice what I think are all the most pressing complaints about such a perspective:</p> <blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr">A. &ldquo;We won&rdquo; is a huge oversimplification.&#10;B. It&rsquo;s only true for the narrowest definition of ban.&#10;C. They keep trying. <a href="http://t.co/iHo3BNbGx9">http://t.co/iHo3BNbGx9</a></p>&mdash; John Overholt (@john_overholt) <a href="https://twitter.com/john_overholt/status/648471105204830208">September 28, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <p>And yet, as someone who spends quite a bit of time trying to think about what early twentieth-century censorship means, there is an important grain of truth in the Slate piece that is worth preserving, even if declarations of &ldquo;Mission Accomplished&rdquo; feel premature.</p> <p>The idea of book banning conjures images of state censorship and book burning&mdash;images one can find in abundance, for instance, in Kevin Birmingham&rsquo;s wonderful account of <em>Ulysses</em>, <a href="http://www.nybooks.com/articles/archives/2015/apr/23/ulysses-its-still-scandal/"><em>The Most Dangerous Book</em></a>; images of courtrooms where lawyers make grand appeals to <em>literary value</em> and <em>freedom of expression</em>.<span class='marginnote'>I pass over in silence distinctions between <em>books</em> and <em>literature</em>&hellip;</span> The scene that Michelle Anne Schingler describes, at <a href="http://bookriot.com/2015/09/29/hey-slate-banned-books-week-isnt-crock/">Book Riot</a>, however, is a very different one. Schingler wastes no time in affirming &ldquo;No,&rdquo; book banning is not simply over. Her examples all concern libraries; and indeed, as Graham contends in her piece at Slate, the most recent cases that the <a href="http://www.bannedbooksweek.org/about">Banned Books website</a> cites all concern attempts to limit or remove books from library collections or school settings.</p> <p>When Graham declares &ldquo;we won,&rdquo; that &ldquo;book banning&rdquo; is over, this is indeed (as Overholt suggests) an oversimplification. It imagines a single struggle, which reaches some sort of crisis, and ends. Graham is offering a narrative very close to that recounted by Charles Rembar in his memoir <em>The End of Obscenity</em>. Rembar was the defense attorney during many of the key 1960s obscenity trials in the United States, and his memoir wonderfully charts the erosion of state censorship in the period. 
Suppression of books on grounds of obscenity, Rembar suggests (I think rightly), ends after the trials of <em>Lady Chatterley&rsquo;s Lover</em>, <em>Tropic of Cancer</em>, and <em>Fanny Hill</em> in the United States. Starting in <a href="http://www.oyez.org/cases/1950-1959/1956/1956_582">Roth v. United States</a>, and culminating ultimately in the so-called <a href="http://en.wikipedia.org/wiki/Miller_test">&ldquo;Miller Test&rdquo;</a>, American jurisprudence evolves a set of standards that have the effect of ending the censorship of books on the grounds of obscenity.<span class='marginnote'>In English jurisprudence, the 1959 reform of the Obscene Publications Act (which enabled publication of Lawrence&rsquo;s <em>Lady Chatterley</em> by Penguin) plays the same role as the court cases discussed by Rembar.</span> After those trials, it has proved essentially impossible for a book to be banned on grounds of obscenity; contract, libel, and copyright all continue to shape cultural production in important ways (the last especially so), but obscenity and its particular brand of state-controlled book burning is indeed over. The Miller standard may justifiably be celebrated as a sort of liberal triumph.</p> <p><img src="http://images.all-free-download.com/images/graphiclarge/miller_logo_29850.jpg" alt="Miller Logo"></p> <p>And boy, do we love to tell this story. Birmingham offers a version in his account of <em>Ulysses</em>; we get a sort of version in <a href="http://www.imdb.com/title/tt1049402/">movies about Allen Ginsberg&rsquo;s &ldquo;Howl&rdquo;</a>; or in TV movies about <a href="http://www.imdb.com/title/tt0757175/">the <em>Chatterley</em> trial</a>. Folks love this tale of heroic lawyers fighting on behalf of great works of literature, against philistine puritans&mdash;figures like Anthony Comstock or William Joynson-Hicks (more commonly called simply &ldquo;Jix&rdquo;). We tell very similar stories about Elvis and his hips, or Lenny Bruce and his comedy&mdash;tales where transgression and freedom contend with (usually comically absurd) conservatism. (We even tell a version of this story about <a href="https://en.wikipedia.org/wiki/Footloose_(1984_film)">dancing in small towns</a>.) It&rsquo;s usually a narrative of triumph, told by liberal proponents who indeed end by declaring &ldquo;We won.&rdquo; And, as history, it is usually an oversimplification.<span class='marginnote'>For one thing (and this is a hobby horse of mine), it tends to remove books from a broader media history which shapes what it means to &ldquo;ban&rdquo; a &ldquo;book.&rdquo; I have a different story of my own, about the place of literature in the changing media ecology of the long twentieth century&hellip; but that&rsquo;s another tale for another time.</span></p> <p><img src="https://upload.wikimedia.org/wikipedia/en/e/e4/FootloosePoster.jpg" alt="Footloose"></p> <p>&ldquo;Banned Books Week&rdquo; conflates two narratives, perhaps deliberately. It inserts present instances of political struggle which involve books, particularly (like those described by Schingler) around libraries, into a longer history of book banning. It is, in some ways, a savvy rhetorical move to align parents who want to limit access to particular titles with Anthony Comstock and similar figures (after all, who wants to be <a href="https://www.youtube.com/watch?v=20jbY6awlTw">this guy</a>?). 
But this conflation also has the, I think unfortunate, effect of casting contemporary debates about education and the meaning of &ldquo;the public&rdquo; as matters of &ldquo;banning books.&rdquo; I think it makes more sense to understand <a href="http://bannedbooks.world.edu/2012/01/22/banned-books-awareness-beloved/">attempts to limit access to Toni Morrison&rsquo;s <em>Beloved</em></a>, not as a debate about book censorship continuous with the suppression of <em>Ulysses</em> or <em>Lady Chatterley&rsquo;s Lover</em>, but as part of the same political struggle over <a href="https://www.washingtonpost.com/opinions/whitewashing-civil-war-history-for-young-minds/2015/07/06/1168226c-2415-11e5-b77f-eb13a215f593_story.html">how to teach the causes of the American Civil War</a>, or even <a href="http://www.nytimes.com/2013/11/23/education/texas-education-board-flags-biology-textbook-over-evolution-concerns.html">whether to mention evolution</a>. These are debates about books; but more fundamentally they are debates about education and, more importantly, debates about <em>the public</em>. They ask not, &ldquo;Should this book be legally available?&rdquo;, but &ldquo;Should <em>my children</em> learn this?&rdquo; or &ldquo;Should <em>my tax dollars</em> pay for this?&rdquo; Defending against the active defunding of public goods by appealing to the &ldquo;freedom to read&rdquo; seems to me a tactic of ambivalent value.</p> <p>When &ldquo;Banned Books Week&rdquo; began in 1982, the heroic age of the struggle against state censorship of books in the United States was already over. In 1982, rather than the State of New York seeking to prevent folks from reading <em>Ulysses</em>, we find <a href="https://news.google.com/newspapers?id=9CsdAAAAIBAJ&amp;sjid=RqUEAAAAIBAJ&amp;pg=6717%2C5959414">the Moral Majority complaining about works like <em>Our Bodies, Our Selves</em></a>. This concern with women&rsquo;s sexuality and health is uncannily recalled when, earlier this month, a <a href="http://www.wbir.com/story/news/local/2015/09/07/local-mom-objects--controversial-book--summer-reading-list/71843596/">Knoxville parent complained</a> about the explicit references to women&rsquo;s bodies in <em>The Immortal Life of Henrietta Lacks</em>. Is this debate about women&rsquo;s sexual health and knowledge, either in the early 1980s or now, best understood as a debate about books? Or does it have more in common with a history that, at this moment, materializes as an effort to defund Planned Parenthood? </p> <p>Schingler writes, &ldquo;Reading only about people our parents and pastors are comfortable with isn’t an education, it’s an echo-chamber.&rdquo; I agree. Reading is a wonderful and potentially transformative experience. It should be celebrated and defended zealously. But if we find people seeking to limit access to books, we may wonder whether their target is <em>books</em> per se, or something else: <em>public education</em> or <em>women&rsquo;s health</em>, both of which require a well-funded state. Schingler writes, &ldquo;Libraries are a marketplace of ideas, and if they’re going to operate in a truly democratic fashion, all ideas should be represented.&rdquo; Maybe. But the arguments of would-be book-banners are, right now, often couched exactly in market terms&mdash;not that this or that book should not be published or legally allowed to circulate, but that <em>my tax dollars</em> shouldn&rsquo;t have to pay for it. 
We love the version of this conflict which is a struggle between freedom and censorship; but the conflict today is precisely one which takes place through appeals to market values&mdash;not between freedom and suppression, but <em>what</em> to fund according to what criteria. The real argument today seems less about &ldquo;freedom,&rdquo; than about our willingness to fund and maintain a robust sense of &ldquo;public goods.&rdquo;</p> <p>As a matter of rhetoric and political tactics, it perhaps makes sense to throw the weight of a long historical struggle against state censorship behind our own moment of squabbles in local school boards or funding lines in state budgets. We should be careful, though, that such rhetoric doesn&rsquo;t lead us to mistake a fight about public education or women&rsquo;s health or the rights of queer people for <em>the right to read</em>. Indeed, if we could add a little nuance and history to our sense of the long struggle to publish controversial books, we might even realize that the history of books and their banning is already replete with lessons for these distinct, but not unrelated, struggles (see, for instance, the case of <em>The Well of Loneliness</em> and its banning in England). </p> <p>So, two cheers for Banned Books Week and for all efforts to protect the freedom to read. The fullest possible access to the textual record is indeed a public good worthy of our time, attention, and dedication. It is not, though, the only good; it may not, at this moment, even be the most pressing one.</p> A Walk Through the Metadata: Gender in the HathiTrust Dataset 2015-09-08T00:00:00-04:00 http://cforster.com/2015/09/gender-in-hathitrust-dataset/ <p>[This post, featuring too many graphs, was created with <a href="http://yihui.name/knitr/"><code>knitr</code></a>. You can see the source that generated those graphs, and the rest of the post, <a href="http://cforster.com/files/2015-09-08-gender-in-hathitrust-dataset.Rmd">here</a>. <span class='addition'><strong>Update</strong>: Many thanks to <a href="https://twitter.com/lincolnmullen">Lincoln Mullen</a> who is not only the maintainer, and one of the authors of the <code>gender</code> package, but noted that I was (inaccurately) using publication dates, rather than author birthdates, when inferring the gender of names; he suggested a smart way of approximating author birth dates (used below), and also drew my attention to the <code>napp</code> dataset, accessible by the <code>gender</code> package. In light of his comments I made a few changes, re-ran the data, and have updated this post. The plots are new; I&rsquo;ve added a little new text as well which, like this, is in dark blue. 
<strong>9/11/15</strong></span>]</p> <div class='epigraph'>&ldquo;there is no such thing as distinguishing men from women.&rdquo;<br />&mdash;<a href="https://books.google.com/books?id=o4ZHAAAAYAAJ&lpg=PA478&ots=rSPKars5gB&dq=%22there%20is%20no%20such%20thing%20as%20distinguishing%20men%20from%20women%22&pg=PA478#v=onepage&q&f=false"><em>The Critical Review</em>, 1777</a> </div> <p>I&rsquo;ve been tinkering with the HathiTrust dataset that Ted Underwood and HathiTrust released last month.<span class='marginnote'>Some Links: [<a href="https://sharc.hathitrust.org/genre">The Dataset</a>]; [<a href="http://tedunderwood.com/2015/08/07/a-dataset-for-distant-reading-literature-in-english-1700-1922/">Ted Underwood&rsquo;s Discussion of It</a>]; [<a href="http://cforster.com/2015/08/exploring-hathitrust-dataset">My Previous Exploration of It, Mostly Using R</a>]</span> The thorniest questions I&rsquo;ve encountered concern how to handle/understand volumes and titles which occur in the dataset more than once (and some related issues&mdash;multivolume works, etc). I&rsquo;ll try to write a quick post about those issues in the future. For now, let&rsquo;s look at other ways we might explore this dataset.</p> <h2>Examining the Gender of Authorship in HathiTrust Summary Metadata</h2> <p>I find the metadata fascinating in a way that the actual <em>data</em> (the word frequencies for each volume) is not. Let&rsquo;s consider, for example, the relationship between authorship in the dataset and gender. Authorial gender is not one of the included metadata fields, but we might try to examine it by using the <a href="https://github.com/ropensci/gender"><code>gender</code></a> package for <code>R</code>. This package uses a variety of historical sources (Social Security data, US Census data, as well as some other sources) to intelligently infer gender based on first names <span class='strikethrough'>(since the package relies on largely US data, it might not be as successful in predicting gender of names in other Anglophone countries, including England&mdash;an issue I note here, parenthetically, and then ignore)</span>. 
<span class='addition'>The package can also use <code>napp</code> data, covering Canada, the United Kingdom, Germany, Iceland, Norway, and Sweden, from the years 1758 to 1910</span>.<span class='marginnote'>This <a href="http://apps.lincolnmullen.com/gender-predictor/">web app</a> nicely illustrates what the package does.</span> Here, for instance, is how you would load the package and infer the likely gender of the name a <em>George</em> born in 1819:</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="kn">library</span><span class="p">(</span>gender<span class="p">)</span> gender<span class="p">(</span><span class="s">&#39;george&#39;</span><span class="p">,</span>method<span class="o">=</span><span class="s">&#39;napp&#39;</span><span class="p">,</span>year<span class="o">=</span><span class="m">1819</span><span class="p">)</span> Source<span class="o">:</span> local <span class="kt">data frame</span> <span class="p">[</span><span class="m">1</span> x <span class="m">6</span><span class="p">]</span> name proportion_male proportion_female gender year_min year_max <span class="m">1</span> george <span class="m">1</span> <span class="m">0</span> male <span class="m">1819</span> <span class="m">1819</span></code></pre></div> <p>This sort of inference based on first names is obviously imperfect; there are cases where, for a variety of reasons, the prediction will be wrong. The suggestion that in 1819 the name <em>George</em> belongs to a man may be very wrong indeed if that particular George is the author of <em>Middlemarch</em>. For certain purposes, however, that misattribution may be exactly what we&rsquo;re interested in. The simplicity of the approach can be a strength. The package, in most cases, will make the same inference&mdash;even the same incorrect inference&mdash;about the gender of a name as a reader would. This makes it ideal if you&rsquo;re interested in how readers understood the authorship of the books they were reading, or how (perceived) authorial gender shaped the market for literature.<span class='marginnote'>Before we start taking this too seriously, Francis Beaumont&rsquo;s name is detected as female (understandable, but wrong); as is Oliver Goldsmith&rsquo;s (?!?!) in some years (e.g. 1792).</span></p> <p>Such a summary of the dataset is very different than, say, using the gender information inferred about a volume based on its author&rsquo;s first name to train a classifier on the volume-level word counts. Such a classifier could be used on texts from outside the dataset, or on texts <em>within</em> the dataset where an author&rsquo;s gender is unknown. One might try to use it to test Virginia Woolf&rsquo;s &ldquo;guess that Anon, who wrote so many poems without signing them, was often a woman&rdquo; (49). <em>That</em> sort of project, however, which would attempt to link &ldquo;gender&rdquo; (and in this case what exactly that word means becomes rather pressing) not simply to a name in a metadata field, but to a vocabulary (or some other representation of language use), would begin to encounter the thornier theoretical/methodological questions that I am so happy to skirt past here.</p> <p>Hewing to this more modest, and (I think) less theoretically fraught, goal of understanding the makeup of the dataset, I used the <code>gender</code> package to infer the gender of the author of each volume in the three HathiTrust datasets. <span class='addition'>To maximize recognition I used a somewhat heterdox method. 
Since the package expects birthdates, I subtracted 30 and 50 years from the publication date and passed this range to the package (this was Lincoln&rsquo;s, I think very reasonable, suggestion). I queried first against the <code>napp</code> data; if this returned no result I tried the <code>ipums</code> census data. In both cases, I massaged the dates so that if they were out of range, I checked against the earliest available date (a historically imprecise result strikes me as better than no result at all).</span> So to every row in the metadata summary files I added a column for gender, which represented the result of applying this idiosyncratic use of the <code>gender</code> function to the author&rsquo;s first name.<span class='marginnote'>For my purposes a &ldquo;first name&rdquo; is the first word after the comma in the <code>author</code> field; there would be better ways to do this.</span> (This process was actually rather time consuming&mdash;<em>hint</em>, <code>mclapply</code> is your friend, as are virtualized servers you can let hum away for hours. <span class='addition'>Lincoln notes that the package allows you to pass it a vector of names; this makes the process more efficient for large datasets, particularly when names are repeated. I nevertheless did it the (seriously) less efficient way, in part because I had written the code for an earlier version of the <code>gender</code> package, and in part because my odd use of <em>two</em> methods to try to find data complicates matters.</span>)</p>
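<p>The gist of that procedure, in a few lines of <code>R</code> (a sketch of the approach just described, not the code actually used for this post; the helper function and the exact clamping to each method&rsquo;s date range are only illustrative):</p> <div class="highlight"><pre><code class="language-r" data-lang="r"># A rough sketch of the inference described above (not the code actually used
# for this post). The "first name" is the first word after the comma in the
# author field; the assumed birth range is publication date minus 50 to minus 30.
library(gender)

infer_author_gender &lt;- function(author, pubdate) {
  first.name &lt;- sub('^[^,]*,\\s*', '', author)            # drop the surname
  first.name &lt;- tolower(strsplit(first.name, '\\s+')[[1]][1])
  birth.range &lt;- c(pubdate - 50, pubdate - 30)
  # try napp first (roughly 1758-1910), clamping out-of-range years to its limits
  result &lt;- gender(first.name, method = 'napp',
                   years = pmin(pmax(birth.range, 1758), 1910))
  if (nrow(result) == 0) {
    # fall back to the ipums census data (roughly 1789-1930)
    result &lt;- gender(first.name, method = 'ipums',
                     years = pmin(pmax(birth.range, 1789), 1930))
  }
  if (nrow(result) == 0) 'undetected' else result$gender
}

infer_author_gender('Eliot, George', 1871)   # 'male' -- the expected misattribution
</code></pre></div>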
<p>I then tallied up the number of works by men and women for each genre. (I did the tallying with Python; you can find the results of those tallies as CSVs <a href="https://gist.github.com/c-forster/2acc9f84a7ecc8375715">here</a>.) In addition to <code>male</code> and <code>female</code>, there are two other categories here: <code>missing</code> means a name was not provided (or, more precisely, was not detected by my script) in the HathiTrust metadata; <code>undetected</code> means that the <code>gender</code> package had no value for the &ldquo;name&rdquo; it was given (or, more precisely, whatever string it received from how I parsed the name). That is, <code>missing</code> means <em>no name</em> and <code>undetected</code> means that <code>gender</code> had no association for the name. (There are also columns for each of these values, normalized by the number of volumes in the dataset for that year.) Without further ado, three area graphs representing the gender breakdown of authorship in each of the HathiTrust datasets (fiction, poetry, drama).</p> <p><span class='marginnote'>If you right click and open each graph in another tab, they should be a bit bigger.</span></p> <p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/gender-area%20graphs-1.png" alt="center"> <img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/gender-area%20graphs-2.png" alt="center"> <img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/gender-area%20graphs-3.png" alt="center"> </p>
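<p>For the record, the tallying and normalization behind summaries like these amount to something like the following sketch, written in <code>R</code> rather than the Python actually used, and with assumed column names (<code>date</code>, plus the <code>gender</code> column added above):</p> <div class="highlight"><pre><code class="language-r" data-lang="r"># Sketch: count volumes per year and gender for one genre, then normalize by
# the number of volumes in that year. Assumes fiction.data has a "date" column
# and the added "gender" column (male, female, missing, undetected).
library(ggplot2)

tally &lt;- as.data.frame(table(year = fiction.data$date,
                             gender = fiction.data$gender))
totals &lt;- aggregate(Freq ~ year, data = tally, sum)
tally &lt;- merge(tally, totals, by = 'year', suffixes = c('', '.total'))
tally$proportion &lt;- tally$Freq / tally$Freq.total

ggplot(tally, aes(x = as.numeric(as.character(year)),
                  y = proportion, fill = gender)) +
  geom_area() +
  xlab('Year') + ylab('Proportion of volumes')
</code></pre></div>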
<p>The data before 1800 is sparse and so these graphs look a little volatile. The prevalence of <code>missing</code> and <code>undetected</code> in the fiction data before 1800, however, may reflect the lack of attribution common in the late eighteenth century. &ldquo;Over 80 per cent of all novel titles published in the 1770s and 1780s were published anonymously,&rdquo; James Raven claims in the introduction to the first volume of the two volume <em>The English Novel 1770&ndash;1829: A Bibliographical Survey of Prose Fiction Published in the British Isles</em> (41). (I&rsquo;ll abbreviate that <em>BSPF</em> for the rest of the post).</p> <p><span class='addition'><a href="https://twitter.com/heatherfro/status/641710059748204544">In a tweet</a>, <a href="https://twitter.com/heatherfro">Heather Froelich</a> asks, &ldquo;What&rsquo;s in those slices of undetected and missing texts.&rdquo; Looking at the amended metadata file, it looks like there are ~11,000 records with either <code>missing</code> or <code>undetected</code> gender (that&rsquo;s ~10% of the dataset). The most frequently occurring titles in the <code>missing</code> data are:</span></p> <div class="highlight"><pre><code class="language-text" data-lang="text">The New British novelist;
The British novelists,
Stories by American authors,
The Harvard classics shelf of fiction,
The German classics of the nineteenth and twentieth centuries,
The International library of famous literature,
The lady of the manor,
The book of the thousand nights and one night,
The book of the thousand nights and a night,
The thousand and one nights,
The Odyssey of Homer,
Stories by English authors,
The Bibliophile library of literature, art and rare manuscripts,
The masterpiece library of short stories,
Florence Macarthy,
</code></pre></div> <p><span class='addition'>So why is the author missing from these? Checking the full records, the most frequently occurring items in this list are multivolume collections of other works. <em>The New British Novelists</em> lists no author in the dataset&rsquo;s <code>author</code> metadata; the title appears in the fiction dataset 50 times. Checking <a href="http://catalog.hathitrust.org/Record/008558481">the original page images</a>, we see that this is a series, published starting in 1820, which collects different novels by major British novelists. It includes a range of major novels, many themselves multivolume works: <em>Clarissa</em>, <em>Robinson Crusoe</em>, <em>Humphrey Clinker</em>, and so on. <a href="http://catalog.hathitrust.org/Record/007688604"><em>The Harvard Classics Shelf of Fiction</em></a> appears to be a similar case. <span class='marginnote'>Is there an existing literature on these sorts of collections and their role in reputation creation/maintenance?</span> In the earlier period, there are titles (like <a href="http://catalog.hathitrust.org/Record/007700038"><em>The Infernal Wanderer</em></a>) which simply lack an author; others (like <a href="http://catalog.hathitrust.org/Record/008405386"><em>Turkish Tales</em></a>) lack an author in the dataset, but currently have one in HathiTrust (perhaps because this record has been updated since the dataset was exported); and quite a few don&rsquo;t meet my naming convention. Works by <a href="http://catalog.hathitrust.org/Record/001227706">Phalaris</a>, [Madame d&rsquo;] <a href="http://catalog.hathitrust.org/Record">Aulnoy</a>, [Mssr.] <a href="http://catalog.hathitrust.org/Record/100024486">Scarron</a>, [Mrs.] <a href="http://catalog.hathitrust.org/Record/000245324">Manley</a>, Voltaire, Virgil, and many others are &ldquo;missing&rdquo; because when I try to split &lsquo;em up (relying on a comma to separate first and last names), we get nothing. Some of these authors were referred to simply by a last name and title (Mrs. Manley) and this has entered the dataset as simply <code>Manley</code>.</span></p> <p><span class='addition'>In the <code>undetected</code> data, the most frequently occurring names are:</span></p> <div class="highlight"><pre><code class="language-text" data-lang="text">Bjørnson, Bjørnstjerne
Dostoyevsky, Fyodor
Orczy, Emmuska Orczy
Burgess, Gelett
Cullum, Ridgwell
Hearn, Lafcadio
Watanna, Onoto
MacManus, Seumas
Tagore, Rabindranath
Hemyng, Bracebridge
Gordon-Cumming, Roualeyn
Ritchie, Leitch
</code></pre></div> <p><span class='addition'>A look at the names is enough to guess why <code>gender</code> likely had a problem with them. (There are sufficiently few names here (321 unique, undetected names) that I am half tempted to put together a manual reconciliation of names and genders). It also provides a clear illustration of the implicit cultural construction of &ldquo;data.&rdquo; These &ldquo;undetected&rdquo; names are largely non-Anglophone names&mdash;and so the attempt to infer one culturally mediated category (<em>gender</em>) gets complicated by the complexities of another one (nationality). Names that are undetected are <em>not</em> randomly distributed through the data but are disproportionately non-Anglophone. </span></p> <p>To more clearly see the trends, let&rsquo;s look at works published under names that we have identified as female across genres; first raw counts and then as a proportion of all works published per year.</p> <p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/htdata-works-by-women-1.png" alt="center"> <img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/htdata-works-by-women-2.png" alt="center"> </p> <p>The second graph is the interesting one. Among the genres, female authors are best represented in fiction, and least well-represented in drama. The trend in fiction, however, is odd&mdash;while poetry and drama show upward trends (poetry&rsquo;s is slow and steady across the nineteenth century; drama&rsquo;s rather sudden after 1900), fiction has a high point in the early nineteenth century where women represent a larger proportion of fiction writers than anywhere else in this data. At times, early in the data, half of the works of fiction in the dataset are written by a woman (more on this figure below). Yet, over the course of the nineteenth century this proportion diminishes. When the graph ends in 1922, women represent about a quarter of the authors of each of the three genres.</p> <p><span class='addition'> On Twitter, <a href="https://twitter.com/cforster/status/641709866957021185">I suggested</a> that in the normalized data for fiction by women above, one sees a decline in works by women. This may be consistent with the <em>BSPF</em> data (which, in its admittedly narrower slice, shows a decline from 1815 to 1830). To get some sense of whether that&rsquo;s a fair description, let&rsquo;s isolate the fiction by women data, and add a rolling mean, with a window of 5.</span></p>
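<p>Something like the following produces that rolling mean (a sketch, not the code behind the plot below; the data frame and column names are stand-ins for the normalized fiction-by-women series):</p> <div class="highlight"><pre><code class="language-r" data-lang="r"># Sketch: a five-year centered rolling mean over the proportion of fiction by
# women. "women.fiction", "year", and "prop.female" are stand-in names, not the
# actual objects used to build the plot below.
library(ggplot2)

women.fiction$rolling &lt;- as.numeric(stats::filter(women.fiction$prop.female,
                                                  rep(1/5, 5), sides = 2))
ggplot(women.fiction, aes(x = year)) +
  geom_line(aes(y = prop.female), colour = 'grey60') +
  geom_line(aes(y = rolling)) +
  xlab('Year') + ylab('Proportion of fiction by women')
</code></pre></div>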
<p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/women-and-fiction-with-rolling-mean-1.png" alt="center"> </p> <p><span class='addition'>At some point, one is reading Rorschach plots; but this plot seems to suggest two periods of downward trends from 1805&ndash;1830, and then again from about 1885 to 1900. (That precipitous drop at the end is a function of the rolling average running out of data).</span></p> <h2>Data from <em>The English Novel: A Bibliographical Survey of Prose Fiction, 1770-1830</em></h2> <p>To get some sense of how reasonable these trendlines look, we might try to compare them to another source. I&rsquo;ve already quoted the <em>BSPF</em>, which offers a portrait of the authorship of novels between 1770 and 1830. The <em>BSPF</em> has totals based on both what is stated on title pages and in prefaces, as well as more comprehensive totals based on what the editors were able to infer about the authorship of works from other sources.<span class='marginnote'>For instance, if a work states that it is &ldquo;By the author of <em>Waverley</em>,&rdquo; one can make additional inferences about the author&rsquo;s gender.</span> There turns out to be a significant discrepancy between what a title page, or preface, states, and what one may be able to infer about the gender of an author with just a little more knowledge. The majority of novels in this period were published without a clear statement of authorship. But if we look at the more comprehensive portrait of authorship that the <em>BSPF</em> offers, the story is a little different.</p> <p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/bspf-data-1.png" alt="center"> </p> <p>The graph above summarizes the trends in the <em>inferred</em> data. It has three distinct moments&mdash;a predominance of &ldquo;anonymous&rdquo; or unattributed works until around 1800; a predominance of women writers during the first decades of the nineteenth century; and, finally, what Peter Garside calls &ldquo;the male invasion of mainstream fiction&rdquo; (2:63). <span class='marginnote'>Garside notes, for instance, &ldquo;the <em>publication</em> of Jane Austen&rsquo;s novels was achieved not against the grain but during a period of female ascendancy&rdquo; (2:75).</span> This data suggests that authors of novels were most likely to be, in this order, <em>anonymous</em>, <em>women</em>, and then <em>men</em>.</p> <p>The three waves visible in the graph above, however, are based on the inferences that the editors of the <em>BSPF</em> made to ascertain the gender of the authors in their bibliography. The metadata available on title pages&mdash;of the sort that&rsquo;s compiled in the HT metadata&mdash;often lacks information that might otherwise be available to most readers.</p> <blockquote> <p>Occasionally, full author names are found within a novel&mdash;as in a signed Preface, or through the inclusion of an engraved portrait or additional title-page&mdash;when the main title-page offers no direct authorial description. Augusta Ann Hirst&rsquo;s <em>Helen; or Domestic Occurences</em> (1807:28), for example, carries only the bare title on its title-page, though the full author&rsquo;s name appears immediately afterwards in a Dedication to the Countess Fitzwilliam, and the author&rsquo;s name later featured directly on the title-page in the Minerva reissue of 1808. (2:68)</p> </blockquote> <p>HathiTrust has <a href="http://catalog.hathitrust.org/Record/100004286/Home">a copy of <em>Helen, or, Domestic Occurrences: A Tale</em></a> (though it is not included in the fiction dataset). 
And indeed its title page lacks the author&rsquo;s name, though one can discover it in the dedication.</p> <p><img src="/images/helen_pages.png" alt="Title Page, and End of Dedication from *Helen*"></p> <p>Through the magic of librarians, the <a href="http://catalog.hathitrust.org/Record/100004286">HathiTrust record</a>, however, includes the correct author and even notes that its &ldquo;Dedication signed.&rdquo; </p> <p>Looking only at what one can infer about the authorial gender of works from the information available on the title page, most works would be &ldquo;anonymous,&rdquo; even if (some) contemporary readers may have been able to see through that anonymity. Note the difference between the trends in authorship when we look only at information available from examining &ldquo;proper names from title-pages and prefaces only&rdquo; and the inferred conclusions (all this data is taken from the wonderfully comprehensive <em>BSPF</em>). </p> <p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/bspf-data-2-1.png" alt="center"> </p> <p>The inferred trends for both male and female authorship are significantly higher than their stated counterparts (these terms, <em>inferred</em> and <em>stated</em>, are my clumsy language; for anyone interested, <em>The English Novel, 1770&ndash;1830</em> really is an invaluable, if imposingly weighty, resource). There are perhaps two interesting trends here. The decrease in anonymous authorship at the start of the nineteenth century coincides with a rise in female authorship; female authorship is <em>more public</em> than its male counterpart. After 1820 one sees a sharp rise in male authorship&mdash;which is itself a rise in <em>anonymous</em> male authorship.</p> <h2>Comparing HathiTrust and <em>BSPF</em></h2> <p>Using the method described above to infer authorship in the HathiTrust dataset should produce results similar to the raw, stated data in the <em>BSPF</em>. There are, though, a few differences to account for first. For one, James Raven&rsquo;s and Peter Garside&rsquo;s introductions to the two volumes of the <em>Bibliographical Survey of Prose Fiction</em> offer summary counts of &ldquo;New Novels,&rdquo; but the HathiTrust data represents <em>books</em> owned by libraries. To be able to compare the <em>BSPF</em> data with the HT data, we need to eliminate reprints (we only want <em>new</em> novels) and we need to count works, not books (so, multivolume works should be counted as a single work). I&rsquo;ve tried to do this rather crudely by creating for each work in the HT fiction dataset an &ldquo;ID&rdquo; which consists only of a work&rsquo;s title and its author.<span class='marginnote'>Using title alone as an ID could, in theory, lead to a problem if two works have the same title&mdash;which is actually quite common for multivolume sets, like <em>The Novels</em> of Walter Scott and <em>The Novels</em> of Charles Dickens, and similar.</span> My script loops over the works in the metadata summary, counting a work as &ldquo;new&rdquo; only if we haven&rsquo;t seen its ID before. Because we look only at title and author (and not <code>enumcron</code>), we also only count one volume from a multivolume work (though, as I mention above, this problem is quite a bit thornier than I&rsquo;m allowing here).</p>
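<p>In <code>R</code>, that deduplication amounts to something like this (a sketch, not the script actually used; it assumes the metadata columns are called <code>title</code>, <code>author</code>, and <code>date</code>):</p> <div class="highlight"><pre><code class="language-r" data-lang="r"># A minimal sketch of the deduplication described above (not the script
# actually used). The ID is just title plus author; after sorting by date,
# only the first volume carrying a given ID counts as a "new" work.
fiction.data &lt;- fiction.data[order(fiction.data$date), ]
fiction.data$work.id &lt;- paste(fiction.data$title, fiction.data$author,
                              sep = ' | ')
new.works &lt;- fiction.data[!duplicated(fiction.data$work.id), ]
</code></pre></div>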
<p>Second complication: <strong>geography</strong>. The HT dataset is culled from American libraries, whereas the <em>BSPF</em> data is focused on works published in &ldquo;the British Isles.&rdquo; Well, that raises an interesting question (digression ahead!): where were fiction volumes in the HathiTrust dataset published?</p> <p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/ht-places-of-publication-1.png" alt="center"> </p> <p>As this graph makes clear, most of the works in the HathiTrust dataset were published in 5 places (heck, many were published in <em>one</em> place). Those labels along the x-axis are <a href="http://www.loc.gov/marc/countries/countries_code.html">MARC country codes</a>; so the top publication locations are: New York State (<code>nyu</code>), England (<code>enk</code>), Massachusetts (<code>mau</code>), No place/Unknown (<code>xx</code>), Pennsylvania (<code>pau</code>), Illinois (<code>ilu</code>), Scotland (<code>stk</code>), Germany (<code>gw</code>). This summary, however, represents the entire HT fiction dataset&mdash;from 1700 to 1922. Let&rsquo;s look at just the portion covered by the <em>BSPF</em>, 1770 to 1830:</p> <p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/ht-places-of-publication-1770-1830-1.png" alt="center"> For this period the top two locations are England and Scotland. It seems unlikely, therefore, that any differences between the <em>BSPF</em> and the HT datasets could be attributed to the different geographical coverage of the two datasets. But, just to be sure, let&rsquo;s extract only the works from the fiction dataset published in England and Scotland and Ireland between 1770 and 1830, and compare the gender breakdown one last time.</p> <p>To create this subset of the HT summary metadata, I&rsquo;ve used some Python that tries to more closely match the parameters of the <em>BSPF</em> data: it covers only works published between 1770 and 1830, published in England, Scotland, or Ireland, and it tries to represent only &ldquo;new works.&rdquo; <span class='marginnote'>The Python that did this is <a href="https://gist.github.com/c-forster/caf3389c74fddffdfcd3">here</a>; the summary of the data is <a href="https://gist.github.com/c-forster/cb23a1224eadfe282257">here</a>.</span></p>
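<p>A rough <code>R</code> equivalent of that filter (again a sketch, not the Python linked in the margin; the MARC code for Ireland and the name of the <code>place</code> column are assumptions worth checking):</p> <div class="highlight"><pre><code class="language-r" data-lang="r"># Sketch of the subsetting described above, not the Python actually used.
# MARC place codes: enk = England, stk = Scotland; 'ie' for Ireland is an
# assumption -- check the MARC country-code list. So is the 'place' column name.
bspf.like &lt;- subset(fiction.data,
                    date &gt;= 1770 &amp; date &lt;= 1830 &amp;
                    place %in% c('enk', 'stk', 'ie'))
# keep only "new" works: the first occurrence of each title + author pair
bspf.like &lt;- bspf.like[order(bspf.like$date), ]
bspf.like &lt;- bspf.like[!duplicated(paste(bspf.like$title, bspf.like$author)), ]
</code></pre></div>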
<p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/compare-bspf-ht-1.png" alt="center"> </p> <p>We can get some sense of how the HT data compares to <em>BSPF</em> by plotting them together.</p> <p><img src="/../figs/2015-08-27-gender-in-hathitrust-dataset/dataset-comparisons-1.png" alt="center"> </p> <p>The data for female authorship in the two datasets (or rather, in the <em>BSPF</em> data and my weird manipulation of the HathiTrust data) seems, to my layman&rsquo;s eye, surprisingly consistent. Of course, recalling the difference (often between 10 and 20 percentage points) between authorial gender as determined by consulting title pages/prefaces and what the <em>BSPF</em> editors were able to infer, one might suggest (at least for the period 1770&ndash;1830) that the summary I offered above significantly underrepresents female authorship.</p> <p>The data for male and anonymous authorship is much less consistent; the <em>BSPF</em> reports more anonymous texts than my analysis of the HT metadata, while the HT data reports more male writers. I basically don&rsquo;t understand why this would be so&mdash;I would have expected, if anything, the opposite. The <code>anonymous</code> line for the HT data in the above graph combines both <code>missing</code> authors and <code>undetected</code>, treating as anonymous anything that couldn&rsquo;t be coaxed into another category; if anything, it should <em>overrepresent</em> anonymous writers. Perhaps this reflects something about the underlying data; or perhaps something about the way I carved up first names. For now, I just don&rsquo;t know. So, here ends our amble through the data.</p> <h2>Works Cited</h2> <p>Woolf, Virginia. <em>A Room of One&rsquo;s Own</em>. </p> <p>Raven, James, et al. <em>The English Novel 1770&ndash;1829: A Bibliographical Survey of Prose Fiction Published in the British Isles</em>. 2 vols. New York: Oxford University Press, 2000. Print.</p> Looking at a Dataset for Distant Reading: Some Anticlimaxes 2015-08-10T00:00:00-04:00 http://cforster.com/2015/08/exploring-hathitrust-dataset/ <p>I&rsquo;ve been trying to think intelligently about the place of quantitative data in literary studies, especially in light of two excellent posts, one by <a href="http://andrewgoldstone.com/blog/2015/08/08/distant/">Andrew Goldstone</a>, the other by <a href="http://tressiemc.com/2015/08/06/nascent-thoughts-on-text-analysis-across-disciplines/">Tressie McMillan Cottom</a>, both responding to <a href="http://bostonreview.net/books-ideas/ben-merriman-moretti-jockers-digital-humanities">this review</a> by Ben Merriman. </p> <p>But before I could even try to say something interesting in response, Ted Underwood <a href="http://tedunderwood.com/2015/08/07/a-dataset-for-distant-reading-literature-in-english-1700-1922/">announced</a> that he was making available &ldquo;a dataset for distant-reading literature in English, 1700-1922&rdquo; (here is <a href="https://sharc.hathitrust.org/genre">a link to the data</a>). This post is a look at that data, mostly using <a href="https://www.r-project.org/"><code>R</code></a>. I have, essentially, nothing thoughtful to offer in this post; instead, this is an exploration of this dataset (many, <em>many</em> thanks to Ted Underwood and HathiTrust for this fascinating bounty), studded with some anticlimaxes in the form of graphs that do little beyond giving a sense of how one could begin to think about this dataset.</p> <p>With the exception of a bash script (which may, though, be the most repurposable bit of code), everything here is done in <code>R</code>. I don&rsquo;t like <code>R</code>, and I&rsquo;m not very good with it,<span class='marginnote'>I think <code>R</code>&rsquo;s datatypes are what make it a challenge; lists in particular seem to materialize out of nowhere and are frustrating to use&hellip;</span> but it is great for making pretty graphs and getting an initial handle on a bunch of data. 
I try to comment on, and explain, the code below (often in comments)&mdash;though if you&rsquo;ve never looked at <code>R</code>, this may seem really weird. I also may have made some horrible mistakes; if so, please let me know.</p> <h2>The New HathiTrust Data Set</h2> <p>Underwood calls this dataset &ldquo;an easier place to start with English-language literature&rdquo; within the HathiTrust dataset. I had poked around the HathiTrust data before, and it really is a very complicated undertaking. This dataset that Underwood has provided makes this <em>much, much easier</em>.</p> <p>The data can be downloaded <a href="https://sharc.hathitrust.org/genre">here</a>. In this post I&rsquo;ll look at the fiction metadata, and take a peek at the fiction word counts for the years 1915&ndash;1919. Those files look something like this:</p> <ul> <li><p><code>fiction_metadata.csv</code>: 17 megabytes, containing author, title, date, and place for each work of fiction. It also includes subjects, an id for HathiTrust (<code>htid</code>), and other fields. </p></li> <li><p><code>fiction_yearly_summary.csv</code>: 35 megabytes, containing token frequencies. The first 20 lines look like this. </p></li> </ul> <div class="highlight"><pre><code class="language-text" data-lang="text">year,word,termfreq,correctionapplied
1701,&#39;,162,0
1701,a,813,0
1701,further,2,0
1701,native,1,0
1701,forgot,9,0
1701,mayor,1,0
1701,wonder,13,0
1701,incapable,3,0
1701,reflections,5,0
1701,absence,5,0
1701,far,16,0
1701,performance,2,0
1701,say,44,43
1701,notorious,1,0
1701,words,15,0
1701,leaves,2,0
1701,unlucky,2,0
1701,aware,1,0
1701,differ,1,0
</code></pre></div> <ul> <li><p>In a directory I uncompressed <code>fiction_1915-1919.tar.gz</code>. The result is 8656 files, each representing a single work, and totalling 827 megabytes. 
(827 megabytes of text is not &ldquo;big data&rdquo;&mdash;but it is enough to making toying with it on your laptop at times a little tricky.)</p></li> </ul> <h2>Examining the Metadata: Volumes of Fiction Per Year</h2> <p>So, let&rsquo;s begin, by loading our plotting library (<code>ggplot</code>) and the CSV file with the fiction metadata file <code>fiction_metadata.csv</code>.</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span> <span class="c1"># Load the metadata from the CSV vile</span> fiction.data <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;fiction_metadata.csv&#39;</span><span class="p">,</span>header<span class="o">=</span><span class="bp">T</span><span class="p">)</span> <span class="c1"># Let&#39;s look at how many items we have for each date.</span> ggplot<span class="p">(</span>fiction.data<span class="p">)</span> <span class="o">+</span> geom_histogram<span class="p">(</span>aes<span class="p">(</span>x<span class="o">=</span>fiction.data<span class="o">$</span><span class="kp">date</span><span class="p">),</span>binwidth<span class="o">=</span><span class="m">1</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&#39;Books per Year in Fiction Dataset&#39;</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Year&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&#39;Number of Books Per Year in Fiction Data&#39;</span><span class="p">)</span> </code></pre></div> <p><a href="/images/htdata-plot1.png"><img src="/images/htdata-plot1.png" alt="Bar Plot of Works of Fiction Per Year in HathiTrust Dataset"></a> </p> <p>This gives a sense of just how few books from before 1800 are in this dataset. </p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="kp">nrow</span><span class="p">(</span>fiction.data<span class="p">)</span> <span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">101948</span> <span class="kp">nrow</span><span class="p">(</span><span class="kp">subset</span><span class="p">(</span>fiction.data<span class="p">,</span>fiction.data<span class="o">$</span>date <span class="o">&lt;</span> <span class="m">1800</span><span class="p">))</span> <span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">1129</span> </code></pre></div> <p>That is, 101948 volumes total, 1129 of which were published prior to 1800, or about 1%. The number of volumes appearing in the dataset per year tends to increase constantly&mdash;with a few exceptions. That dip around 1861-1864 may be a result of particularly American factors influencing the dataset; and perhaps it is war again accounts for some of the dip at this period end&mdash;though that dip seems to begin prior to 1914. </p> <h2>Examining the Metadata: Change in Length of Volumes Over Time</h2> <p>The length of each volume is contained in the <code>totalpages</code> field in the metadata file. 
Let&rsquo;s plot the length of works of fiction over time (so, plot <code>date</code> by <code>totalpages</code>).</p> <div class="highlight"><pre><code class="language-r" data-lang="r">ggplot<span class="p">(</span>fiction.data<span class="p">,</span>aes<span class="p">(</span>x<span class="o">=</span>fiction.data<span class="o">$</span><span class="kp">date</span><span class="p">,</span>y<span class="o">=</span>fiction.data<span class="o">$</span>totalpages<span class="p">))</span> <span class="o">+</span> geom_point<span class="p">(</span>pch<span class="o">=</span><span class="s">&#39;.&#39;</span><span class="p">,</span>alpha<span class="o">=</span><span class="m">0.1</span><span class="p">,</span>color<span class="o">=</span><span class="s">&#39;blue&#39;</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&#39;Length of Books by Year&#39;</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Year&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&#39;Length of Book, in Pages&#39;</span><span class="p">)</span> </code></pre></div> <p><a href="/images/htdata-plot2.png"><img src="/images/htdata-plot2.png" alt="Not Especially Legible Plot of Length of Works of Fiction Over Time in HathiTrust Dataset"></a></p> <p>Interesting. It seems that, in the mid-eighteenth century near the dawn of the novel, works of fiction were around 300 pages long. Their length diversified over the course of the novel&rsquo;s history, as novels grew both longer and shorter as the possibilities for fiction widened, perhaps as a function of increased readership stemming from both the decreasing cost of books and the increasing rate of literacy. </p> <p>Well, <strong>not really</strong>. Matthew Lincoln has <a href="http://matthewlincoln.net/2015/03/21/confabulation-in-the-humanities.html">a very nice post</a> about the dangers of constructing a &ldquo;just-so&rdquo; story (often to insist that this graph tells us &ldquo;nothing new). But there are at least two problems with the interpretation offered above&mdash;one broad and one more specific. Broadly, it is worth reiterating the danger of mistaking this data for an unproblematic representation of any particular historical phenomenon (say especially <em>readership of novels</em>). Underwood describes the dataset carefully as representing works held by &rdquo;&lsquo;American university and public libraries, insofar as they were digitized in the year 2012 (when the project began).&rsquo;&ldquo; And, of course, lots of other things which would be relevant to an investigation of fiction&mdash;think of pulp paperbacks and similar forms&mdash;will not be in that sample, because they were often not collected by libraries. (Likeiwse, as Underwood notes, pre 1800 books are more likely to be held in Special Collections, and therefore not digitized).</p> <p>The second point is specific to the graph above. That scatter plot is sparse in the early half of this period and very dense in the latter half. The translucency of each point (set by <code>alpha=0.2</code>) captures some of this, but nevertheless the graph as a whole overemphases the increased <em>spread</em> of data, when really what is happening is an increase in the amount of data. If we plot things differently, I think this becomes evident. 
Let&rsquo;s breakdown our data by decade, and then do a <a href="https://en.wikipedia.org/wiki/Box_plot">box plot</a> per decade of fiction length:</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># This helper function will convert a year into a &quot;decade&quot;</span> <span class="c1"># through some simple division and then return the decade</span> <span class="c1"># as a &quot;factor&quot; (an R data-type).</span> as.Decade <span class="o">&lt;-</span> <span class="kr">function</span><span class="p">(</span>year<span class="p">)</span> <span class="p">{</span> decade <span class="o">&lt;-</span> <span class="p">(</span><span class="kp">as.numeric</span><span class="p">(</span>year<span class="p">)</span><span class="o">%/%</span><span class="m">10</span><span class="p">)</span><span class="o">*</span><span class="m">10</span> <span class="kr">return</span><span class="p">(</span><span class="kp">as.factor</span><span class="p">(</span>decade<span class="p">))</span> <span class="p">}</span> <span class="c1"># Add a &quot;decade&quot; column by applying our as.Decade function </span> <span class="c1"># to the data. (The unlist function... is because lapply returns</span> <span class="c1"># a list, and I&#39;m not very good at R, so that&#39;s how I got it to work.</span> fiction.data<span class="o">$</span>decade <span class="o">&lt;-</span> <span class="kp">unlist</span><span class="p">(</span><span class="kp">lapply</span><span class="p">(</span>fiction.data<span class="o">$</span><span class="kp">date</span><span class="p">,</span> as.Decade<span class="p">))</span> <span class="c1"># Box plot of our length data, grouped by decade</span> ggplot<span class="p">(</span>fiction.data<span class="p">,</span> aes<span class="p">(</span>x<span class="o">=</span>fiction.data<span class="o">$</span>decade<span class="p">,</span>y<span class="o">=</span>fiction.data<span class="o">$</span>totalpages<span class="p">))</span> <span class="o">+</span> geom_boxplot<span class="p">()</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&#39;Length of Books, Grouped by Decades&#39;</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Decade&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&#39;Length of Books, in Pages&#39;</span><span class="p">)</span> </code></pre></div> <p><a href="/images/htdata-plot3.png"><img src="/images/htdata-plot3.png" alt="Less Misleading Plot of Length Across Time in HathiTrust Fiction Dataset"></a></p> <p>This plot confirms that, indeed, we see a greater range in the lengths of works of fiction (so my inference from the previous graph is not completely wrong). But a box plot clarifies what is, to me, a surprising constancy in the length of the works collected in this dataset. The apparent increase in variability in length is real&mdash;but it is not the most, or the only, salient feature of this data; this fact is better captured in the second graph (the box plot).</p> <h2>Summary: Frequently Occurring Terms</h2> <p>The file <code>fiction_yearly_summary.csv</code> contains the per-year frequencies of the top 10,000 most frequently occuring tokens in the fiction dataset. 
We can chart the fluctuations of a term&rsquo;s use, for instance, across the period.</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Load our data</span>
yearly.summary <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;fiction_yearly_summary.csv&#39;</span><span class="p">)</span>
<span class="c1"># Extract some meaningful bit, say, occurrences of `love`</span>
love <span class="o">&lt;-</span> <span class="kp">subset</span><span class="p">(</span>yearly.summary<span class="p">,</span> yearly.summary<span class="o">$</span>word<span class="o">==</span><span class="s">&#39;love&#39;</span><span class="p">)</span>
<span class="c1"># Plot it</span>
ggplot<span class="p">(</span>love<span class="p">,</span>aes<span class="p">(</span>x<span class="o">=</span>love<span class="o">$</span>year<span class="p">,</span>y<span class="o">=</span>love<span class="o">$</span>termfreq<span class="p">))</span> <span class="o">+</span> geom_line<span class="p">()</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Year&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&quot;Occurrences of token &#39;love&#39;&quot;</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&#39;&quot;Love&quot; in the Dataset&#39;</span><span class="p">)</span></code></pre></div> <p><a href="/images/htdata-plot4.png"><img src="/images/htdata-plot4.png" alt="Unnormalized Occurrences of the Term &#39;Love&#39; in Dataset"></a></p> <p>Yet, of course, looking at that sharp rise, we quickly realize&mdash;yet again&mdash;the importance of normalization. We are not witnessing the explosion of love at the dawn of the twentieth century (and its nearly as rapid declension). We could normalize by adding all the words together&mdash;but we only have counts for the top 10,000 words. 
Thankfully, the dataset offers &ldquo;three special tokens for each year: #ALLTOKENS counts all the tokens in each year, including numbers and punctuation; #ALPHABETIC only counts alphabetic tokens; and #DICTIONARYWORD counts all the tokens that were found in an English dictionary.&rdquo;</p> <p>So, let&rsquo;s normalize by using DICTIONARYWORD.</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Let&#39;s extract the DICTIONARYWORD tokens into a data frame</span> yearly.total <span class="o">&lt;-</span> <span class="kp">subset</span><span class="p">(</span>yearly.summary<span class="p">,</span>yearly.summary<span class="o">$</span>word<span class="o">==</span><span class="s">&#39;#DICTIONARYWORD&#39;</span><span class="p">)</span> <span class="c1"># Let&#39;s simplify this dataframe to just what we&#39;re interested in.</span> yearly.total <span class="o">&lt;-</span> yearly.total<span class="p">[</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;year&#39;</span><span class="p">,</span><span class="s">&#39;termfreq&#39;</span><span class="p">)]</span> <span class="c1"># And rename the termfreq column to &quot;total&quot;</span> <span class="kp">colnames</span><span class="p">(</span>yearly.total<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;year&#39;</span><span class="p">,</span><span class="s">&#39;total&#39;</span><span class="p">)</span> <span class="c1"># Now we can use merge to combine this data, giving each row </span> <span class="c1"># a column that contains the total number of (dictionary words)</span> <span class="c1"># for that year. </span> love.normalized <span class="o">&lt;-</span> <span class="kp">merge</span><span class="p">(</span>love<span class="p">,</span> yearly.total<span class="p">,</span> by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;year&#39;</span><span class="p">))</span> <span class="c1"># This method profligately repreats data; but it makes things </span> <span class="c1"># easier. 
The result looks like this:</span> <span class="kp">head</span><span class="p">(</span>love.normalized<span class="p">)</span> <span class="o">&gt;</span> year word termfreq correctionapplied total <span class="o">&gt;</span> <span class="m">1</span> <span class="m">1701</span> love <span class="m">222</span> <span class="m">0</span> <span class="m">37234</span> <span class="o">&gt;</span> <span class="m">2</span> <span class="m">1702</span> love <span class="m">1</span> <span class="m">0</span> <span class="m">7036</span> <span class="o">&gt;</span> <span class="m">3</span> <span class="m">1703</span> love <span class="m">524</span> <span class="m">0</span> <span class="m">416126</span> <span class="o">&gt;</span> <span class="m">4</span> <span class="m">1706</span> love <span class="m">12</span> <span class="m">0</span> <span class="m">36501</span> <span class="o">&gt;</span> <span class="m">5</span> <span class="m">1708</span> love <span class="m">578</span> <span class="m">0</span> <span class="m">482779</span> <span class="o">&gt;</span> <span class="m">6</span> <span class="m">1709</span> love <span class="m">361</span> <span class="m">0</span> <span class="m">133847</span> <span class="c1"># Now, graph the data</span> ggplot<span class="p">(</span>love.normalized<span class="p">,</span> aes<span class="p">(</span>x<span class="o">=</span>love.normalized<span class="o">$</span>year<span class="p">,</span> y<span class="o">=</span><span class="p">(</span>love.normalized<span class="o">$</span>termfreq<span class="o">/</span>love.normalized<span class="o">$</span>total<span class="p">)))</span><span class="o">+</span> geom_line<span class="p">()</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Year&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&#39;Normalized Frequency of &quot;love&quot;&#39;</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&#39;The Fate of Love&#39;</span><span class="p">)</span> </code></pre></div> <p><a href="/images/htdata-plot5.png"><img src="/images/htdata-plot5.png" alt="Normalized Plot of &#39;Love&#39; in the Dataset"></a></p> <p>Well, that looks about right. Just for fun, let&rsquo;s try a different term, one that is something less of an ever-fixed mark, but which perhaps alters its relative frequency when it historical alteration finds. </p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># We subset the term we&#39;re interested in.</span> america <span class="o">&lt;-</span> <span class="kp">subset</span><span class="p">(</span>yearly.summary<span class="p">,</span> yearly.summary<span class="o">$</span>word<span class="o">==</span><span class="s">&#39;america&#39;</span><span class="p">)</span> <span class="c1"># And normalize using our already-constructed yearly.total </span> <span class="c1"># data frame.</span> america.normalized <span class="o">&lt;-</span> <span class="kp">merge</span><span class="p">(</span>america<span class="p">,</span> yearly.total<span class="p">,</span> by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;year&#39;</span><span class="p">))</span> <span class="c1"># Plot as before, though this time we&#39;ll use geom_smooth() </span> <span class="c1"># as well to add a quick &quot;smooth&quot; fit line to get a sense of </span> <span class="c1"># the trend. 
Minor digression: things like geom_smooth() are one </span> <span class="c1"># of the things that make R great (if very dangerous) for an </span> <span class="c1"># utter amateur.</span> ggplot<span class="p">(</span>america.normalized<span class="p">,</span> aes<span class="p">(</span>x<span class="o">=</span>america.normalized<span class="o">$</span>year<span class="p">,</span> y<span class="o">=</span><span class="p">(</span>america.normalized<span class="o">$</span>termfreq<span class="o">/</span>america.normalized<span class="o">$</span>total<span class="p">)))</span><span class="o">+</span> geom_line<span class="p">()</span> <span class="o">+</span> geom_smooth<span class="p">()</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Year&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&#39;Normalized Frequency of &quot;america&quot;&#39;</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&quot;Occurrences of &#39;america&#39; in the Dataset&quot;</span><span class="p">)</span> </code></pre></div> <p><a href="/images/htdata-plot6.png"><img src="/images/htdata-plot6.png" alt="Occurrences of &#39;america&#39; in the Dataset"></a></p> <p>Not sure there&rsquo;s much surprising here, but okay, seems reasonablish.</p> <h2>Extracting Counts from Individual Volume Files</h2> <p>Now, what if you want to look at terms that don&rsquo;t occur in the top 10,000? Then you need to dig into the files for individual volumes. For simplicity&rsquo;s sake, I&rsquo;ll look only at one set of those files, representing volumes of fiction between 1915 and 1919, which I&rsquo;ve uncompressed in a subdirectory called <code>fiction_1915-1919</code>. </p> <p>I&rsquo;ve been using <code>R</code> for everything so far, and I imagine you could use <code>R</code> to loop over the files in the directory, open them up and look for a specified term. As someone who finds <code>R</code> idiosyncratic to the point of excruciation, this doesn&rsquo;t sound particularly fun. <code>R</code> is great when you&rsquo;re manipulating/plotting data frames&mdash;less so when doing more complicated tasks on the filesystem.</p>
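<p>For the curious, here is roughly what that pure-<code>R</code> loop might look like&mdash;a minimal, untested sketch, which assumes (as the bash script below does) that each per-volume <code>.tsv</code> file puts a token in its first column and that token&rsquo;s count in its second:</p> <div class="highlight"><pre><code class="language-r" data-lang="r"># A sketch of the pure-R approach (untested): loop over the
# per-volume files and pull out the count for a single term.
files &lt;- list.files('fiction_1915-1919', pattern='\\.tsv$', full.names=TRUE)
term.counts &lt;- sapply(files, function(f) {
  # Read one volume's token/count table; the columns get the
  # default names V1 (token) and V2 (count).
  volume &lt;- read.delim(f, header=FALSE, quote='', stringsAsFactors=FALSE)
  # A term that never appears simply sums to zero.
  sum(volume$V2[volume$V1 == 'positivism'])
})
# Reassemble into a data frame keyed by HathiTrust ID.
positivism &lt;- data.frame(htid=sub('\\.tsv$', '', basename(files)),
                         count=term.counts)
</code></pre></div>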
<p>So, to extract the information we want, I&rsquo;ll use a simple bash script.</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#!/bin/bash</span> <span class="c"># Our input directory</span> <span class="nv">INPUTDIRECTORY</span><span class="o">=</span>./fiction_1915-1919 <span class="c"># Let&#39;s take a single command line argument ($1) and store it</span> <span class="c"># as the value we&#39;re looking for (the proverbial needle in our</span> <span class="c"># data haystack).</span> <span class="nv">NEEDLE</span><span class="o">=</span><span class="nv">$1</span> <span class="c"># We use this convention, with find and while read </span> <span class="c"># because a simple for loop, or ls, might have a problem</span> <span class="c"># with ~10000 files.</span> find <span class="nv">$INPUTDIRECTORY</span> <span class="p">|</span> <span class="k">while</span> <span class="nb">read </span>file <span class="k">do</span> <span class="c"># For each file, we use grep to search for our term,</span> <span class="c"># storing just the number of occurrences in result.</span> <span class="nv">result</span><span class="o">=</span><span class="k">$(</span>grep -w -m <span class="m">1</span> <span class="nv">$NEEDLE</span> <span class="nv">$file</span> <span class="p">|</span> awk <span class="s1">&#39;{ print $2 }&#39;</span><span class="k">)</span> <span class="c"># Get the htid of the file we&#39;re looking at from the filename</span> <span class="nv">id</span><span class="o">=</span><span class="k">$(</span>basename <span class="nv">$file</span> .tsv<span class="k">)</span> <span class="c"># And then print the result to the screen</span> <span class="nb">echo</span> <span class="nv">$id</span>,<span class="nv">$result</span> <span class="k">done</span> </code></pre></div> <p><span class='marginnote'>I&rsquo;m assuming some familiarity with bash scripts; to make a script executable, it&rsquo;s enough to type <code>chmod +x wordcounter.bash</code>.</span> Save this script to a file (say, <code>wordcounter.bash</code>), make it executable, and then run it with an argument: <code>./wordcounter.bash positivism</code> and it will output to the screen; pipe that to a csv (type <code>./wordcounter.bash positivism &gt; positivism.csv</code>) and you can use it in <code>R</code>. Here is what the results look like when they start appearing on the screen:</p> <div class="highlight"><pre><code class="language-text" data-lang="text">bc.ark+=13960=t19k4r10s, bc.ark+=13960=t25b0m976, bc.ark+=13960=t6tx3tq53, chi.086426399, chi.086523141, chi.64465423, chi.73664930, coo.31924002898983, coo.31924013129774, </code></pre></div> <p>Those gibberish-looking strings (<code>bc.ark+=13960=t19k4r10s</code>) are HathiTrust IDs. Then you get a comma, and after the comma the number of times the term appeared in the file&hellip; unless it <em>didn&rsquo;t</em> appear, in which case you just get a blank. </p> <h2>Some Notes</h2> <p>This will only work on unixy systems&mdash;Linux, OSX, or (I assume) cygwin on Windows. </p> <p>When a token does not appear in a file, this script outputs the <code>htid</code>, a comma, and then nothing. That&rsquo;s fine&mdash;it&rsquo;s easier to handle this after we&rsquo;ve imported the resulting csv (to, say, <code>R</code>) than it would have been to write some logic in this script here to output 0. Also, this crude method is probably faster than doing it within <code>R</code> or Python and is certainly not slower. 
It could be sped up by doing something fancy, like parallelization. To search through the 8656 files of <code>fiction_1915-1919</code> for one term took 1 minute and 12 seconds&mdash;a totally manageable timeframe. Assuming that rate (processing, say, 120 files/second) is roughly constant across the dataset of roughly 180,000 volumes, it should be possible to use this method to search for a term across all the volumes in the dataset in roughly 25 minutes, give or take. That is, of course, based on doing this on my laptop (with a 1.8GHz Core i5 CPU), no parallelization (though this should be an eminently parallelizable task&mdash;like really). Not fast, but totally manageable.</p> <h2>Plotting Our Extracted Counts from Individual Volume Files</h2> <p>So, assuming the script works&hellip; back to <code>R</code>.</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Input the data culled by our custom bash script</span> gramophone <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;gramophone.csv&#39;</span><span class="p">)</span> film <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;film.csv&#39;</span><span class="p">)</span> typewriter <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;typewriter.csv&#39;</span><span class="p">)</span> <span class="c1"># Remember all those spots where a token doesn&#39;t occur, </span> <span class="c1"># which appear as blanks? Those get read by R as NA </span> <span class="c1"># values. Here we replace them with zeros.</span> gramophone<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>gramophone<span class="p">)]</span> <span class="o">&lt;-</span> <span class="m">0</span> film<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>film<span class="p">)]</span> <span class="o">&lt;-</span> <span class="m">0</span> typewriter<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>typewriter<span class="p">)]</span> <span class="o">&lt;-</span> <span class="m">0</span> <span class="c1"># Let&#39;s rename our columns</span> <span class="kp">colnames</span><span class="p">(</span>gramophone<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">,</span><span class="s">&#39;gramophone&#39;</span><span class="p">)</span> <span class="kp">colnames</span><span class="p">(</span>film<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">,</span><span class="s">&#39;film&#39;</span><span class="p">)</span> <span class="kp">colnames</span><span class="p">(</span>typewriter<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">,</span><span class="s">&#39;typewriter&#39;</span><span class="p">)</span> <span class="c1"># We&#39;ll put this data together into one data frame</span> <span class="c1"># for convenience&#39;s sake.</span> gft <span class="o">&lt;-</span> <span class="kp">merge</span><span class="p">(</span>gramophone<span class="p">,</span>film<span class="p">,</span>by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">))</span> gft <span class="o">&lt;-</span> <span class="kp">merge</span><span
class="p">(</span>gft<span class="p">,</span>typewriter<span class="p">,</span>by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">))</span> </code></pre></div> <p>Right now, though, all we have is HathiTrust IDs and frequencies of our term (or terms). We have no information about date, or title. So let&rsquo;s get that information from the metadata files we&rsquo;ve worked with earlier.</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># From our custom culled data</span> gramophone <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;gramophone.csv&#39;</span><span class="p">)</span> film <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;film.csv&#39;</span><span class="p">)</span> typewriter <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;typewriter.csv&#39;</span><span class="p">)</span> <span class="c1"># All those spots where a token doesn&#39;t occur, which produce blank lines</span> gramophone<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>gramophone<span class="p">)]</span> <span class="o">&lt;-</span> <span class="m">0</span> film<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>film<span class="p">)]</span> <span class="o">&lt;-</span> <span class="m">0</span> typewriter<span class="p">[</span><span class="kp">is.na</span><span class="p">(</span>typewriter<span class="p">)]</span> <span class="o">&lt;-</span> <span class="m">0</span> <span class="kp">colnames</span><span class="p">(</span>gramophone<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">,</span><span class="s">&#39;gramophone&#39;</span><span class="p">)</span> <span class="kp">colnames</span><span class="p">(</span>film<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">,</span><span class="s">&#39;film&#39;</span><span class="p">)</span> <span class="kp">colnames</span><span class="p">(</span>typewriter<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">,</span><span class="s">&#39;typewriter&#39;</span><span class="p">)</span> <span class="c1"># put it all together with our main metadata data frame</span> gft <span class="o">&lt;-</span> <span class="kp">merge</span><span class="p">(</span>gramophone<span class="p">,</span>film<span class="p">,</span>by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">))</span> gft <span class="o">&lt;-</span> <span class="kp">merge</span><span class="p">(</span>gft<span class="p">,</span>typewriter<span class="p">,</span>by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">))</span> <span class="c1"># Now get the metadata from fiction_metadata.csv and</span> <span class="c1"># merge based on htid.</span> fiction.data <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;fiction_metadata.csv&#39;</span><span class="p">,</span>header<span class="o">=</span><span class="bp">T</span><span class="p">)</span> gft <span class="o">&lt;-</span> 
<span class="kp">merge</span><span class="p">(</span>gft<span class="p">,</span>fiction.data<span class="p">,</span>by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;htid&#39;</span><span class="p">))</span> <span class="c1"># To normalize let&#39;s load our annual totals as well. We can</span> <span class="c1"># merge those with our dataframe based on date.</span> <span class="c1"># Get Yearly Totals</span> yearly.summary <span class="o">&lt;-</span> read.csv<span class="p">(</span><span class="s">&#39;fiction_yearly_summary.csv&#39;</span><span class="p">)</span> yearly.total <span class="o">&lt;-</span> <span class="kp">subset</span><span class="p">(</span>yearly.summary<span class="p">,</span>yearly.summary<span class="o">$</span>word<span class="o">==</span><span class="s">&#39;#DICTIONARYWORD&#39;</span><span class="p">)</span> yearly.total <span class="o">&lt;-</span> yearly.total<span class="p">[</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;year&#39;</span><span class="p">,</span><span class="s">&#39;termfreq&#39;</span><span class="p">)]</span> <span class="kp">colnames</span><span class="p">(</span>yearly.total<span class="p">)</span> <span class="o">&lt;-</span> <span class="kt">c</span><span class="p">(</span><span class="s">&#39;date&#39;</span><span class="p">,</span><span class="s">&#39;total&#39;</span><span class="p">)</span> <span class="c1"># Merge yearly totals with our main dataframe based on date.</span> gft <span class="o">&lt;-</span> <span class="kp">merge</span><span class="p">(</span>gft<span class="p">,</span>yearly.total<span class="p">,</span>by<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;date&#39;</span><span class="p">))</span> <span class="c1"># Our dataframe is now 23 columns:</span> <span class="kp">colnames</span><span class="p">(</span>gft<span class="p">)</span> <span class="o">&gt;</span> <span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="s">&quot;date&quot;</span> <span class="s">&quot;htid&quot;</span> <span class="s">&quot;gramophone&quot;</span> <span class="s">&quot;film&quot;</span> <span class="o">&gt;</span> <span class="p">[</span><span class="m">5</span><span class="p">]</span> <span class="s">&quot;typewriter&quot;</span> <span class="s">&quot;recordid&quot;</span> <span class="s">&quot;oclc&quot;</span> <span class="s">&quot;locnum&quot;</span> <span class="o">&gt;</span> <span class="p">[</span><span class="m">9</span><span class="p">]</span> <span class="s">&quot;author&quot;</span> <span class="s">&quot;imprint&quot;</span> <span class="s">&quot;place&quot;</span> <span class="s">&quot;enumcron&quot;</span> <span class="o">&gt;</span><span class="p">[</span><span class="m">13</span><span class="p">]</span> <span class="s">&quot;subjects&quot;</span> <span class="s">&quot;title&quot;</span> <span class="s">&quot;prob80precise&quot;</span> <span class="s">&quot;genrepages&quot;</span> <span class="o">&gt;</span><span class="p">[</span><span class="m">17</span><span class="p">]</span> <span class="s">&quot;totalpages&quot;</span> <span class="s">&quot;englishpct&quot;</span> <span class="s">&quot;datetype&quot;</span> <span class="s">&quot;startdate&quot;</span> <span class="o">&gt;</span><span class="p">[</span><span class="m">21</span><span class="p">]</span> <span class="s">&quot;enddate&quot;</span> <span class="s">&quot;imprintdate&quot;</span> <span 
class="s">&quot;total&quot;</span> <span class="c1"># That&#39;s not crazy, but to make things easier to understand, </span> <span class="c1"># let&#39;s subset just the data we&#39;re interested in right now---say,</span> <span class="c1"># the occurrence of our terms and their date.</span> gft.simple <span class="o">&lt;-</span> gft<span class="p">[,</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;date&#39;</span><span class="p">,</span><span class="s">&#39;gramophone&#39;</span><span class="p">,</span><span class="s">&#39;film&#39;</span><span class="p">,</span><span class="s">&#39;typewriter&#39;</span><span class="p">,</span><span class="s">&#39;total&#39;</span><span class="p">)]</span> <span class="kp">head</span><span class="p">(</span>gft.simple<span class="p">)</span> <span class="o">&gt;</span> date gramophone film typewriter total <span class="o">&gt;</span> <span class="m">1</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">2</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">1</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">3</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">4</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">5</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">6</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0</span> <span class="m">0</span> <span class="m">106553905</span> <span class="kp">nrow</span><span class="p">(</span>gft.simple<span class="p">)</span> <span class="o">&gt;</span> <span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">8655</span> </code></pre></div> <p>Okay, looks good&mdash;there are our 8655 volumes, each with date of publication, the occurrences of our three search terms (<code>gramophone</code>, <code>film</code>, and <code>typewriter</code>), and the total number of DICTIONARYWORDs for that year. Note that each row still represents a single volume&mdash;but we&rsquo;ve discarded <code>author</code>, <code>title</code>, <code>htid</code>, etc. We&rsquo;ve also added the total dictionary words for a volume&rsquo;s year to each row (note the repeated totals in those first 1915 volumes), which is grossly inefficient. All this, however, is in the interest of simplicity&mdash;so that we can easily plot the relative occurrences of our selected terms (here, <code>gramophone</code>, <code>film</code>, and <code>typewriter</code>).</p> <p>In order to make this data easily plottable, we need some additional <code>R</code> tricks: we need to reformat our data from a &ldquo;data frame&rdquo; to a long &ldquo;data matrix&rdquo; (using the <code>melt</code> function). Then we can create a stacked bar graph of terms per year.</p>
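<p>Before doing that for real, a toy example may make clearer what <code>melt</code> actually does (the values here are invented, purely for illustration): every combination of date and term becomes its own row, which is the &ldquo;long&rdquo; shape that <code>ggplot</code> wants for a stacked bar chart.</p> <div class="highlight"><pre><code class="language-r" data-lang="r">library(reshape2)
# A tiny "wide" data frame with made-up values, just to show the reshaping.
toy &lt;- data.frame(date=c(1915, 1916), gramophone=c(0, 2), film=c(1, 3))
melt(toy, id.vars='date')
&gt;   date   variable value
&gt; 1 1915 gramophone     0
&gt; 2 1916 gramophone     2
&gt; 3 1915       film     1
&gt; 4 1916       film     3
</code></pre></div>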
<p>Let&rsquo;s start by plotting our raw counts.</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Our libraries</span> <span class="kn">library</span><span class="p">(</span>reshape2<span class="p">)</span> <span class="c1"># For melting data.</span> <span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span> <span class="c1"># For graphing data.</span> <span class="c1"># This next is necessary b/c R throws an error otherwise. </span> <span class="c1"># Not totally sure why...</span> gft.simple<span class="o">$</span>date <span class="o">&lt;-</span> <span class="kp">as.factor</span><span class="p">(</span>gft.simple<span class="o">$</span><span class="kp">date</span><span class="p">)</span> <span class="c1"># Create a &quot;long&quot; format matrix, from our raw counts data.</span> gft.m <span class="o">&lt;-</span> melt<span class="p">(</span>gft.simple<span class="p">[,</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;date&#39;</span><span class="p">,</span><span class="s">&#39;gramophone&#39;</span><span class="p">,</span><span class="s">&#39;film&#39;</span><span class="p">,</span><span class="s">&#39;typewriter&#39;</span><span class="p">)],</span>id.vars<span class="o">=</span><span class="s">&#39;date&#39;</span><span class="p">)</span> <span class="c1"># Create a bar plot of all our values, coded by variable</span> ggplot<span class="p">(</span>gft.m<span class="p">,</span> aes<span class="p">(</span><span class="kp">factor</span><span class="p">(</span><span class="kp">date</span><span class="p">),</span>y<span class="o">=</span>value<span class="p">,</span>fill<span class="o">=</span>variable<span class="p">))</span> <span class="o">+</span> geom_bar<span class="p">(</span>stat<span class="o">=</span><span class="s">&#39;identity&#39;</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Year&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&#39;Raw Word Occurrence&#39;</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&quot;Raw Counts for &#39;gramophone,&#39; &#39;film,&#39; and &#39;typewriter&#39;&quot;</span><span class="p">)</span> </code></pre></div> <p><a href="/images/htdata-plot7.png"><img src="/images/htdata-plot7.png" alt="Stacked Bar Chart of &#39;gramophone&#39;,&#39;film&#39;,&#39;typewriter&#39; occurrences, in Dataset, 1915-1919"></a> </p> <p>These are, though, raw counts. To normalize, we can divide the counts for our terms by the total and plot the result. 
</p> <div class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># We&#39;ll create a new data frame for our normalized data, </span> <span class="c1"># beginning with our simplified data.</span> gft.normalized <span class="o">&lt;-</span> gft.simple <span class="c1"># In this new dataframe, normalize our scores by dividing </span> <span class="c1"># the raw count in each row by the total in each row.</span> gft.normalized<span class="o">$</span>gramophone <span class="o">&lt;-</span> gft.normalized<span class="o">$</span>gramophone<span class="o">/</span>gft.normalized<span class="o">$</span>total gft.normalized<span class="o">$</span>film <span class="o">&lt;-</span> gft.normalized<span class="o">$</span>film<span class="o">/</span>gft.normalized<span class="o">$</span>total gft.normalized<span class="o">$</span>typewriter <span class="o">&lt;-</span> gft.normalized<span class="o">$</span>typewriter<span class="o">/</span>gft.normalized<span class="o">$</span>total <span class="c1"># How does it look?</span> <span class="kp">head</span><span class="p">(</span>gft.normalized<span class="p">)</span> <span class="o">&gt;</span> date gramophone film typewriter total <span class="o">&gt;</span> <span class="m">1</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0.000000e+00</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">2</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">9.384921e-09</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">3</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0.000000e+00</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">4</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0.000000e+00</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">5</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0.000000e+00</span> <span class="m">0</span> <span class="m">106553905</span> <span class="o">&gt;</span> <span class="m">6</span> <span class="m">1915</span> <span class="m">0</span> <span class="m">0.000000e+00</span> <span class="m">0</span> <span class="m">106553905</span> <span class="c1"># Well, that looks about right. 
Let&#39;s begin our melt/plot </span> <span class="c1"># process again by creating a matrix.</span> gft.norm.m <span class="o">&lt;-</span> melt<span class="p">(</span>gft.normalized<span class="p">[,</span><span class="kt">c</span><span class="p">(</span><span class="s">&#39;date&#39;</span><span class="p">,</span><span class="s">&#39;gramophone&#39;</span><span class="p">,</span><span class="s">&#39;film&#39;</span><span class="p">,</span><span class="s">&#39;typewriter&#39;</span><span class="p">)],</span>id.vars<span class="o">=</span><span class="s">&#39;date&#39;</span><span class="p">)</span> ggplot<span class="p">(</span>gft.norm.m<span class="p">,</span>aes<span class="p">(</span><span class="kp">factor</span><span class="p">(</span><span class="kp">date</span><span class="p">),</span>y<span class="o">=</span>value<span class="p">,</span>fill<span class="o">=</span>variable<span class="p">))</span> <span class="o">+</span> geom_bar<span class="p">(</span>stat<span class="o">=</span><span class="s">&#39;identity&#39;</span><span class="p">)</span> <span class="o">+</span> xlab<span class="p">(</span><span class="s">&#39;Year&#39;</span><span class="p">)</span> <span class="o">+</span> ylab<span class="p">(</span><span class="s">&#39;Normalized Word Frequency (by Year)&#39;</span><span class="p">)</span> <span class="o">+</span> ggtitle<span class="p">(</span><span class="s">&quot;Normalized Scores for &#39;gramophone,&#39; &#39;film,&#39; and &#39;typewriter&#39;&quot;</span><span class="p">)</span> </code></pre></div> <p><a href="/images/htdata-plot8.png"><img src="/images/htdata-plot8.png" alt="Normalized, Stacked Bar Chart"></a></p> <p>Normalization makes some minor adjustments, but the picture looks pretty similar. Not sure I would want to make any claims as to the importance or meaning of these graphs. They&rsquo;re over a short historical span, and so far lack any richer contextualization. Like I said, for now, anticlimaxes.</p> About 1400 Words of Skepticism about Markdown, and an Imagined Alternative 2015-06-29T00:00:00-04:00 http://cforster.com/2015/06/markdown-skepticism/ <p>Don&rsquo;t get me wrong, <a href="http://daringfireball.net/projects/markdown/">Markdown</a>&rsquo;s great. Indeed, nearly all the writing I do now is in Markdown (or at least starts that way). There has been a good amount of writing about the virtues of Markdown for academic writing in particular, so I&rsquo;ll just link to them here:</p> <ul> <li>W. Caleb McDaniel&rsquo;s <a href="http://wcm1.web.rice.edu/my-academic-book-in-plain-text.html">Why (and How) I Wrote My Academic Book in Plain Text</a></li> <li>Nikola Sander&rsquo;s <a href="http://nikolasander.com/writing-in-markdown/">Writing Academic Papers in Markdown Using Sublime Text and Pandoc</a></li> <li>Dennis Tennen and Grant Wythoff&rsquo;s <a href="http://programminghistorian.org/lessons/sustainable-authorship-in-plain-text-using-pandoc-and-markdown">Sustainable Authorship in Plaintext Using Pandoc and Markdown</a> (This latter I especially recommend).</li> </ul> <p>But Markdown, as it stands, has some drawbacks, which become acute when you are trying to extend it to cover the needs of academic writing (or, say, as <a href="http://web.uvic.ca/%7Emvp1922/otsummit/">a transcription format for texts</a>). </p> <h2>The Problem</h2> <p>What I will describe as &ldquo;problems&rdquo; all stem from the fact that Markdown remains essentially a simplified syntax for HTML. 
A tool like <a href="http://pandoc.org">Pandoc</a>, which has a special (and especially powerful) flavor of Markdown all its own, helps reduce the borders between document formats. With Pandoc it becomes easy to convert <code>HTML</code> to <code>LaTeX</code>, or Rich Text Format to Word&rsquo;s <code>.docx</code>. It could easily feel like Markdown is a universal document format&mdash;write it in Markdown, and publish as whatever.</p> <p>That is a lovely dream&mdash;an easy-to-write plaintext format that can easily be output to any desired format. In reality, though, Markdown (even Pandoc&rsquo;s Markdown) remains yoked to HTML, and so it suffers from some of its problems.</p> <p>The problem I encounter most frequently in HTML (and in Markdown) concerns nesting a block quote within a paragraph. In short, can you have a block quote <em>within</em> a paragraph? If you&rsquo;re writing HTML (or Markdown), the answer is no&mdash;HTML treats &ldquo;block quotes&rdquo; as <code>block</code> <em>elements</em>; this means that one cannot be contained within a paragraph (this restriction does not exist in LaTeX or TEI). Yet, what could be more common in writing on works of literature? Representing poetry presents its own problems for HTML and Markdown.<span class='marginnote'>By contrast to the challenge presented by the mere fact of poetry, note the many syntaxes/tools available for fenced code blocks, syntax highlighting, and so on; Markdown, for now, remains of greatest interest to software developers and so reflects their habits and needs.</span> (<em>Note</em>: If you&rsquo;re looking for practical advice, you can easily represent poetry in Pandoc&rsquo;s markdown using <a href="http://pandoc.org/README.html#line-blocks">&ldquo;line blocks&rdquo;</a>; this is not a perfect solution, but it will do for many needs).</p> <p>Perversely, markdown also represents something of a step backward with regard to <em>semantics</em>. If you&rsquo;ve spent some time with HTML, you may have noticed how HTML5 cements a model of HTML as a semantic markup language (with, implicitly, matters of presentation controlled by CSS). That means that the <code>&lt;i&gt;</code> tag, which long ago meant <em>italics</em>, has since acquired semantic meaning. <a href="https://docs.webplatform.org/wiki/html/elements/i">According to the w3c</a>, it should be used to &ldquo;represent[] a span of text offset from its surrounding content without conveying any extra emphasis or importance, and for which the conventional typographic presentation is italic text; for example, a taxonomic designation, a technical term, an idiomatic phrase from another language, a thought, or a ship name.&rdquo; In those instances where one wishes to express emphasis, use the <code>&lt;em&gt;</code> tag. If you need to mark a title, don&rsquo;t simply italicize it, use <a href="https://docs.webplatform.org/wiki/html/elements/cite"><code>&lt;cite&gt;</code></a>.<span class='marginnote'>But hold up, that <code>cite</code> element obscures the distinctions we normally make between italicizing certain titles and putting others in quotation marks.</span> In practice, of course, I doubt these distinctions are widely respected across the web; but all those at least <em>potentially</em> useful distinctions are lost in markdown, whose syntax marks them all with <code>*</code> or <code>_</code>. Markdown is, in fact, rather <em>unsemantic</em>. 
(To a lesser degree, one might detect this tendency as well in the way headings&mdash;rather than <code>divs</code>&mdash;are Markdown&rsquo;s primary way of structuring a document, but I&rsquo;ll stop now.) So, two points: Markdown inherits HTML&rsquo;s document model, which includes an inability to nest block-level elements within paragraphs; in simplifying HTML, it produces a less semantically clear and rich format. (Technically, of course, one could simply include any HTML element for which Markdown offers no shortened syntax&mdash;like <code>&lt;cite&gt;</code> for example.)</p> <h2>A Solution</h2> <p>On the <a href="http://talk.commonmark.org/">CommonMark forum</a>, some folks have proposed additional syntax to fix the latter problem, and capture some of the semantic distinctions mentioned above (indeed, following the discussions over there has helped sensitize me to some of the challenges and limitations of markdown as a sort of universal format donor). So, some of these issues could be resolved through extensions or modifications of Markdown. </p> <p>Yet, given these deficits in Markdown, I wonder if it isn&rsquo;t worth asking a more basic question&mdash;whether the plaintext format for &ldquo;academic&rdquo; writing should be so tightly yoked to HTML? If Markdown is, fundamentally, a simplified, plaintext syntax for HTML, could we imagine a similar, easy-to-write, plaintext format that wouldn&rsquo;t be tied to HTML? Could we imagine, say, a format that would represent a simplification of syntax, not of HTML, but of a format better suited to the needs of representing more complex documents? Could we imagine a plaintext format that would be to <a href="http://www.tei-c.org/index.xml">TEI</a>, say, what markdown is to HTML?</p> <p>Such a format would not need to <em>look</em> particularly different from Markdown. Its syntax could overlap significantly; as in Pandoc&rsquo;s Markdown format, file metadata (things like title, author, and so on) could appear (perhaps as YAML) at the front of the file (and be converted into elements within <code>teiHeader</code>). You could still use <code>*</code>, <code>**</code>, and <code>[]()</code> as your chief tools; footnotes and references could be marked the same way (you could preserve Pandoc&rsquo;s wonderful citation system, with such things represented as <code>&lt;refs&gt;</code> in TEI).</p> <p>The most substantive difference would not be in syntax, but in the document model. Any Markdown file can contain HTML&mdash;all HTML is valid markdown; this ensures that Markdown is never less powerful than HTML. But are the burdens of HTML worth the costs if one wishes to do scholarly/academic, or similar types of writing, in plaintext? Projects exist to repurpose Pandoc markdown for scholarly writing: Tim T. Y. Lin&rsquo;s <a href="http://scholarlymarkdown.com/Scholarly-Markdown-Guide.html#first-steps">ScholarlyMarkdown</a>, or Martin Fenner&rsquo;s <a href="http://blog.martinfenner.org/2013/06/29/metadata-in-scholarly-markdown/">similar project</a>, or the workflow linked-to above, by Dennis Tennen and Grant Wythoff at the <a href="http://programminghistorian.org/lessons/sustainable-authorship-in-plain-text-using-pandoc-and-markdown">Programming Historian</a>. What I&rsquo;m imagining, though, is entirely less practical than any of these projects at the moment because it would necessitate a change in the document model into which markdown is converted. 
Pandoc works its magic by reading documents from a source format (through a &ldquo;reader&rdquo;) into an intermediary format (a format of its own that you can view by outputting <code>-t native</code>), which it can then output (through a &ldquo;writer&rdquo;). Could TEI (or some representation of it), essentially, fulfill that role as intermediary format? (A Pandoc car with a TEI engine swapped in?)</p> <p>I like writing in plaintext, but I don&rsquo;t love being bound by the peculiarities that Markdown has inherited from HTML. So, it is worth considering what it is that people like about Markdown. I suspect that most of the things people like about Markdown (free, easy to write, nonproprietary, easily usable with version control, and so on) have little to do with its HTML-based document model but stem from its being a plaintext format (and the existing infrastructure of scripts/apps/workflows around markdown). TEI provides an alternative document model&mdash;indeed, a <em>richer</em> document model. Imagine a version of Pandoc that uses TEI (or a simplified TEI subset) behind the scenes as its native format. Folks often complain about the complexity and verbosity of TEI (and XML more generally), and not without reason. I would certainly never want to <em>write</em> TEI; but a simplified TEI syntax that could then take advantage of all the virtues of TEI, that would be something.</p> <p>[Closing Note: At one point I wondered how easy it would be to convert markdown to TEI with Pandoc&hellip; I&rsquo;ve managed to finagle a set of scripts to do that; it&rsquo;s janky, but for anyone interested, it&rsquo;s <a href="https://github.com/c-forster/markdown2tei">here</a>.]</p> About 300 Words, Reminding You About Santa Claus's Size 2014-12-22T00:00:00-05:00 http://cforster.com/2014/12/santa/ <p><img src="/images/hm-001.jpg" alt=""></p> <p>Recall these lines from Clement C. Moore&rsquo;s &ldquo;A Visit from Saint Nicholas&rdquo; (alternately titled &ldquo;The Night Before Christmas&rdquo; or &ldquo;&lsquo;Twas the Night Before Christmas&rdquo;), first published in 1823. <span class='marginnote'>See <a href="http://en.wikipedia.org/wiki/A_Visit_from_St._Nicholas">wikipedia page</a> for some notes on contentions with regard to its authorship.</span></p> <div class='poetry'> When what to my wondering eyes should appear, But a miniature sleigh and eight tiny reindeer&hellip; </div> <p>But, exactly <em>how miniature</em> is this sleigh, and <em>how tiny</em> are these reindeer? While Moore&rsquo;s poem did a lot to consolidate the mythology of Santa Claus, one thing that has <em>not</em> remained of Moore&rsquo;s Saint Nicholas is his height. Recalling this insistence on the tinyness of Santa eliminates much of the confusion around his movement through chimney flues. But it also lends a different stress to the description of the elf&rsquo;s nose as &ldquo;like a cherry&rdquo; or of his &ldquo;little round belly&rdquo; that shakes &ldquo;like a bowl full of jelly.&rdquo; At stake here is not simply nose complexion nor belly texture, but <em>size</em>.</p> <p>If today our Santa is bigger, it was not always so. And many earlier illustrations are consistent with Moore&rsquo;s text. 
Consider these from a <a href="https://archive.org/stream/twasnightbeforec00moor">1912 edition</a> [archive.org] of the poem, by Jessie Wilcox Smith:</p> <p><img src="/images/hm-002.jpg" alt="Santa" title="Illustration by Jessie Wilcox Smith"></p> <p><img src="/images/hm-003.jpg" alt="Santa Filling Stockings" title="Illustration by Jessie Wilcox Smith"></p> <p>Likewise, look at this svelte Santa, by Arthur Rackham from <a href="http://hdl.handle.net/2027/miun.aek2825.0001.001">this undated edition</a> [HathiTrust], who is clearly small enough to easily slip down that chimney:</p> <p><img src="/images/lippincot-santa-01.png" alt="Santa Emerging from Chimney" title="Illustration by Arthur Rackham"></p> <p>You can find more Santas at <a href="http://publicdomainreview.org/collections/a-pictorial-history-of-santa-claus/">the Public Domain Review</a>, including a gun-toting, WWII Santa, or <a href="http://cylinders.library.ucsb.edu/search.php?queryType=@attr+1=1020&amp;num=1&amp;start=1&amp;query=cylinder0198">listen to the poem on wax cylinder</a> [1914].</p> <h2>Editions of the poem:</h2> <ul> <li><a href="https://archive.org/stream/twasnightbeforec00moor">https://archive.org/stream/twasnightbeforec00moor</a></li> <li><a href="http://hdl.handle.net/2027/miun.aek2825.0001.001">http://hdl.handle.net/2027/miun.aek2825.0001.001</a></li> </ul> 1567 Words on <em>Interstellar</em> and Dylan Thomas 2014-12-03T00:00:00-05:00 http://cforster.com/2014/12/interstellar/ <p><strong>Spoilers Abound Below</strong></p> <p><a href="https://www.flickr.com/photos/nasacommons/9457918847" title="Pioneer F Plaque Symbology by NASA on The Commons, on Flickr"><img src="https://farm8.staticflickr.com/7357/9457918847_1682a27c1a.jpg" width="500" height="399" alt="Pioneer F Plaque Symbology"></a></p> <p><strong>Revised December 5; in the first version, I confused the name of a character (calling Dr. <em>Mann</em>, Dr. <em>Miller</em>.)</strong></p> <p><em>Interstellar</em> beats the drum of <a href="http://www.slate.com/blogs/browbeat/2014/11/05/_do_not_go_gentle_into_that_good_night_in_interstellar_back_to_school_and.html">Dylan Thomas&rsquo;s villanelle &ldquo;Do Not Go Gentle Into that Good Night&rdquo;</a> pretty hard&mdash;reciting it on multiple occasions (though never all the way through, if I recall correctly, and so never really enjoying its full <em>villanelle</em>-ness). Poetry in the movies often serves a chiefly hortatory, emotive function; it is discourse of moral and emotional seriousness. It is recited by serious people (from memory, of course), and it shows their seriousness. And here it seems no different. It confers dignity and emotional seriousness on what would otherwise be the mere extinction of humanity.<span class='marginnote'>That summarizes, perhaps, my chief gripe about the movie; its bullying emotionalism. Its soundtrack, in particular, bullies you into feeling what it wants you to feel. As my <a href="http://cforster.com/2014/09/podcast-as-a-genre/">much beloved</a> <a href="http://flophousepodcast.com">Flophouse Podcast</a> is fond of noting, is it really necessary to reinforce the stakes in this way? Is the drama of interstellar exploration so boring that only by augmenting it with heaping doses of Dylan Thomas, or a thudding score, will we realize its import?</span></p> <p>In the dystopian future of <em>Interstellar</em>, nearly all crops are dying from an unexplained blight, and NASA Scientist Prof. Brand (Michael Caine) is leading a secret team to save humanity. 
He offers the poem as a sort of allegory for the necessity of humanity resisting its fate. It is the <em>species</em> that must not go gentle into that good night. And so the addressee of Dylan Thomas&rsquo;s poem, which is offered from a child to a father, is reversed. <a href="http://www.poets.org/poetsorg/poem/do-not-go-gentle-good-night">Thomas writes</a>, in the villanelle&rsquo;s conclusion: </p> <blockquote> <p>And you, my father, there on the sad height,<br/> Curse, bless, me now with your fierce tears, I pray.<br/> Do not go gentle into that good night.<br/> Rage, rage against the dying of the light.</p> </blockquote> <p>While the poem advises resistance to closure and finality, the formal demands of the <a href="http://en.wikipedia.org/wiki/Villanelle">villanelle</a>, which brings together its rhyming refrains in its closing couplet, inexorably move toward them.<span class='marginnote'>Elizabeth Bishop&rsquo;s perhaps superior villanelle <a href="">&ldquo;One Art&rdquo;</a> wonderfully expresses its emotion and irony by defying the meter of the villanelle in its final line.</span> </p> <p>Thomas&rsquo;s poem of grieving stands in tension with its form. Its rage is, of necessity, purely affective&mdash;it has no real consequence; death is as sure as the rhyme which snaps together the poem&rsquo;s close. But not so in the dystopian future of <em>Interstellar</em> where Thomas&rsquo;s words become not the lament of a child to a parent, but the advice of a father to his <em>children</em>. The generational logic of the poem is turned on its head and the poem becomes not the cry of the grieving child at a death as inevitable as the end of day, but an expression of the parent&rsquo;s anxiety that children (not even <em>his</em> children; but <em>children</em>) will simply wither out of existence. Thomas&rsquo;s poem grieves the natural course of things; Prof. Brand&rsquo;s reading repurposes it as a resistance to the potential extinction of that putatively natural course. </p> <p>And yet, the poem&rsquo;s place in the film is vexed. It is recited by the characters who (after a precisely timed revelation) are revealed as something like the film&rsquo;s &ldquo;villains&rdquo;&mdash;characters whose lies reveal that the will to live and the refusal to acquiesce are not, in themselves, particularly good things; raging ain&rsquo;t so great after all. It turns out that Brand&rsquo;s <em>Plan A</em>&mdash;the mass migration of the human population to another planet once he cracks a pesky gravity equation (which, like any good academic, he promises requires just <em>a little more</em> research)&mdash;is a noble lie. On his deathbed he reveals that he already knew the equation would never work out. <em>Plan A</em> was a false promise fed to people who would be unwilling to hazard the risks of interstellar travel unless their own lives, or those of their family, were guaranteed. After all, no one would sacrifice themselves merely for <em>Plan B</em>, wherein the human species is preserved in a sort of dorm-fridge full of petri dishes (&ldquo;genetic samples&rdquo;), and shipped off-planet. A process which the younger Prof. Brand (daughter of Caine&rsquo;s Prof. Brand, played by Anne Hathaway) assures us would be totally effective and superior to earlier forms of colonization because it ensures genetic diversity.<span class='marginnote'>Um&hellip; imperialist biopolitics much, Professor Brand?</span> </p> <p>The other person we hear recite Thomas&rsquo;s poem is Dr. 
Mann (played by a handsome, young up-and-comer),<span class='marginnote'>I&rsquo;m <em>pretty sure</em> he recites it, but not entirely positive&hellip; I&rsquo;ve only seen the film once. Boy, this is all gonna be <em>really</em> unconvincing if I misremembered this.</span> who deceives our intrepid explorers with forged data suggesting that the planet he is exploring is a reasonable prospect for human colonization. Mann forges that data to justify his own worthiness to be retrieved from the planet. As Mann explains to Cooper (while he is killing him&hellip; he has really missed human conversation while in cryo-freeze), the will to live is simply too strong; Mann knows he&rsquo;s a coward, but insists that Cooper has never had to face the sort of isolation and horror that he has. The will to live (that rage against the dying of the light) is so strong in Mann that he&rsquo;s willing to lie, and to kill (both Cooper and Romilly) for it.</p> <p>And so, the rage against death that Mann and Brand profess, by way of Thomas&rsquo;s poem, is not a good in and of itself. Indeed, their recitations of the poem mark them as self-interested to the point of villainy. They quote the poem to buttress their rage against death itself&mdash;their own, individual death (in the case of the more villainous Mann) or that of the species (in the case of Brand). But the film ultimately rejects this position&mdash;it is not <em>life</em> which needs to continue (cue music and impassioned speech by Anne Hathaway) but love (and love of a very recognizable, reproductive sort). What old folks should do at the end of day, like the elderly Murph Cooper at film&rsquo;s end, is not rage against the dying of the light, but quietly die in the peace and comfort of their children. Can one imagine a more forceful restoration of the conventional order of things than Murph quickly dispatching her father back to interstellar space in order to find a girlfriend? This is what <a href="http://tjwest3.com/2014/11/09/review-interstellar/">T.J. West calls</a>, fairly I think, the film&rsquo;s &ldquo;ruthlessly heterosexual love plot that could have come straight out of a screenwriter’s how to manual.&rdquo; </p> <p>And so, the film refuses the queer reproduction of <em>Plan B</em> (I leave aside any potential connection one may see between the film&rsquo;s <em>Plan B</em> and the <a href="http://en.wikipedia.org/wiki/Levonorgestrel">contraceptive</a> of the same name), and delights in a reproductive futurity for which the reuniting of Anne Hathaway&rsquo;s character with Matthew McConaughey&rsquo;s is important and meaningful. Thomas&rsquo;s poem comes to stand not <a href="https://www.youtube.com/watch?v=Lm8p5rlrSkY">as it might appear in the trailer</a>, as some exhortation to intergalactic heroism in the face of global environmental catastrophe, but as the most explicit statement of the position to be resisted&mdash;one where the affective attachments of individuals (in Thomas&rsquo;s poem, the speaker to his father) may be fundamentally at odds with the nature of the world in which we live (the necessity of death). Whatever the rage of Thomas&rsquo;s poem accomplishes, it doesn&rsquo;t set up colonies on distant planets.</p> <p>Over and over characters in the film (Cooper chiefly) are told that they must realize their mission is bigger than their petty human attachments. 
Cooper must think beyond his children; &ldquo;You can&rsquo;t just think about your family,&rdquo; Doyle says, &ldquo;You have to think bigger than that.&rdquo; And he is echoed by Brand: &ldquo;You might have to decide between seeing your children again and the future of the human race.&rdquo; Brand herself must defer to objective facts in choosing which planet to visit; the data, not her love for Dr. Doyle, must decide. But in the film all of this turns out to be untrue. John Brand&rsquo;s insistence that &ldquo;Nothing in our solar system can save us,&rdquo; is, at best, half true&mdash;it is the plucky Murph Cooper who saves the world from her childhood bedroom. &ldquo;We must think not as individuals but as a species,&rdquo; Prof Brand insists. <em>Interstellar</em> goes out of its way&mdash;with some pretty cringe-inducing moments&mdash;to create a universe where precisely the opposite is true, where the affective attachments of individuals are what save the species. After all, if Cooper had listened to Brand (had listened to <em>love</em>) and gone to Doyle&rsquo;s planet rather than Mann&rsquo;s, all would be well now. </p> <p><em>Interstellar</em> tackles a posthumanity-shaped problem, but answers it with a humanity so cloying it is almost (almost!) indigestible. It turns out that the problems of three little people <em>do</em> amount to a hill of beans in this crazy world&mdash;indeed, they amount to the whole world.</p> The Podcast as a Genre 2014-09-12T00:00:00-04:00 http://cforster.com/2014/09/podcast-as-a-genre/ <p>What precisely is a podcast? I once heard a minimal definition of a podcast as an mp3 file attached to an RSS feed&mdash;which is to say, syndicated audio content on the internet. But looking around, there are plenty of podcasts that don&rsquo;t meet this criterion: podcasts that lack an RSS feed (WHY?!?), to say nothing of &ldquo;video podcasts&rdquo; (which people are apparently still trying to make happen). &ldquo;Podcast&rdquo; can sometimes be used as a verb to mean something like &ldquo;transmitting audio over the internet&rdquo; (e.g. &ldquo;Will you be podcasting that keynote lecture?&rdquo;). Looking at iTunes, you realize plenty of &ldquo;podcasts&rdquo; are just radio shows put on the internet: iTunes&rsquo;s most popular podcasts are mostly public radio fare (like &ldquo;This American Life&rdquo; and &ldquo;Radiolab&rdquo;). </p> <p>But, the podcast is not simply a technology or a channel. I&rsquo;ve been listening to podcasts for a while now and have been curious to watch my habits slowly shift, moving away from &ldquo;radio shows on the internet&rdquo; (<em>Fresh Air</em>, whenever I want it!) to something else. <a href="http://niemanstoryboard.org/stories/finding-the-tribe/">This piece</a> looks at the &ldquo;return&rdquo; of podcasts as a medium, mostly considering the podcast as a business model. It does, however, offer this, from &ldquo;Planet Money&rdquo; podcaster Alex Blumberg, on what makes podcasts different:</p> <blockquote> <p>&ldquo;It&rsquo;s the most intimate of mediums. It&rsquo;s even more intimate than radio. Often you’re consuming it through headphones. 
I feel like there&rsquo;s a bond that’s created.&rdquo; <a href="http://niemanstoryboard.org/stories/finding-the-tribe/">Source</a></p> </blockquote> <p>That seems entirely right to me, and it helpfully points to some of the ways that what I&rsquo;ll call podcasts <em>as a genre</em> differ from understanding podcasts as just &ldquo;radio over the internet.&rdquo; The &ldquo;podcast&rdquo; as a form blurs the line between a medium (say, a recurring, asynchronously consumed type of audio&mdash;usually neither music nor fiction) and a genre. The podcast, as medium, has been enabled by readier access to bandwidth, software technologies like iTunes syndication and RSS, and developments in hardware like relatively cheap but entirely decent microphones<span class='marginnote'>Woe unto the podcaster who relies on built-in mics on laptops and phones, for he shall receive low traffic.</span> and of course the iPod. But these technologies, in their use, create a sort of gravitational pull toward a form that is less formal, more niche, and therefore oddly closer to a sort of specialized and heightened mode of casual conversation than it is to most radio genres.</p> <p>When the costs of creating and distributing recordings of folks talking into microphones get <em>way</em> cheaper than the costs of writing/producing/reporting stories, you get a new sort of show&mdash;where folks just sit around and talk. Central to the conventions of this genre is, I think, the group of regular or semi-regular folks who sit around and talk about something. Such are Leo Laporte&rsquo;s <a href="http://www.twit.tv">TWIT</a> podcasts; the original TWiT, one of the first podcasts I listened to, was indeed Leo Laporte sitting with folks (some of whom his listeners recognize as, like Laporte, erstwhile TechTV employees) and talking about the week&rsquo;s technology news. This form tends to be parasitic on some other type of content&mdash;on news or culture (<a href="http://www.tommerritt.com/category/shows/daily-tech-news-show/">daily</a> or <a href="http://www.slate.com/articles/podcasts/culturegabfest.html">weekly</a> or <a href="http://digitalcampus.tv">semi-regularly</a>), or even on a specific film or primary text. There has to be some <em>reason</em>, some excuse or alibi, for the conversation to exist&mdash;but the podcast offers a conversation rather than the news. </p> <p>This may not seem especially novel&mdash;after all, personality-driven &ldquo;analysis&rdquo; now dominates cable news. Yet cable news analysis shows usually center on a single individual, and their dominant moods are outrage or indignation or derision; they tend to be centered on <strong>a personality</strong> (variably likeable or not) who offers a &ldquo;perspective.&rdquo; But what a podcast offers is not a perspective (or not <em>chiefly</em> a perspective) but something more like a performance of community. In place of the singular personality, we get personalities. A podcast tends to create characters, or caricatures, out of its hosts: for instance, Stephen Metcalf&rsquo;s snobbish nostalgia for the world of print clashing regularly with Julia Turner&rsquo;s culturally omnivorous techno-utopianism on the <a href="http://www.slate.com/articles/podcasts/culturegabfest.html">Slate Culturefest</a> (both, of course, unfair exaggerations). 
But in other podcasts (perhaps notably, podcasts not affiliated with any large online media presence), this develops into a sense of shared reference&mdash;something like <em>insiderness</em> or <em>knowingness</em>. The result is that certain podcasts (the podcastiest of the podcasts by my sense of the genre) rely heavily on inside jokes. Consider the following short phrases: &ldquo;Who the hell is Casey?&rdquo;; &ldquo;Does this look clean to you?&rdquo;; &ldquo;The Port Hole of Time.&rdquo; To the listeners of certain podcasts, they will immediately register as inside jokes&mdash;from, respectively: <a href="http://atp.fm">The Accidental Tech Podcast</a>; <a href="http://5by5.tv/b2w">Back to Work</a> (quoting the film <em>The Aviator</em>, which in the universe of <em>Back to Work</em> is frequently referred to as simply <em>the film</em>); and <a href="http://www.flophousepodcast.com/">The Flop House</a>. Listeners of these podcasts (and I listen to all of these pretty faithfully, though the truly faithful will likely fault my selections) come to recognize these, and participate in the joke. These podcasts create a universe of reference alienating to the newcomer, but comforting to the regular. And the result is just wonderful. These are my guiltiest of guilty pleasures. I try to conceal my love for them, but I cannot.</p> <p>That intimacy of the medium described by Alex Blumberg, created by the circumstances of consumption (on headphones or in the car<span class='marginnote'>Are <a href="http://www.amazon.com/MP3-Player-Cassette-Adapter-Equipment/dp/B003Q9LRPO">these things</a> great, or what?</span>), manifests in the genre as a tendency towards dense self-reference.</p> <p>The result is that the topic of the podcast can increasingly seem to be just an alibi for the interactions of its hosts. I don&rsquo;t really care about Apple News, but listen to <a href="http://atp.fm">ATP</a> regularly. The greatest joy of <em>The Flop House</em> (a &ldquo;bad movie&rdquo; podcast, which reviews/discusses relatively recent theatrical &ldquo;flops&rdquo;) is the experience of hearing the hosts <em>summarize the plot of a movie</em> and the digressions that ensue. One emphatically does not have to have seen the movie to enjoy the podcast, and unlike a review (or even the discussions of film and TV on the <em>Slate Culturefest</em>), it is completely beside the point whether you will see the movie at some point in the future. I suspect that I&rsquo;ll never see the <a href="http://www.imdb.com/title/tt0804452/"><em>Bratz</em> movie</a>; but I shall cherish all the days of my life <a href="http://www.flophousepodcast.com/2008/04/episode-14-bratz/"><em>The Flop House</em>&rsquo;s discussion of it</a>. Listen to early episodes and you&rsquo;ll see that the plot summary initially presented a challenge&mdash;something they glossed over or tried to get past in order to get to the discussion (on at least one occasion they just read the Wikipedia summary of a movie). But the joy of the show is entirely in the interactions between its hosts, and so something as rote as a plot summary becomes the perfect opportunity for such interaction. It also explains why I, at least, find these sorts of shows more engaging than other audio content. The academic lecture, or even <em>Fresh Air</em>-style interviews, sometimes allow distraction. But the developing conversation, and tissue of self-reference, simulates the experience of interaction rather than, say, the communication of information. 
(What an interview show like <em>Fresh Air</em> lacks is the regularity of its participants; you&rsquo;re usually learning something <em>about</em> a guest rather than listening to a conversation between people who already know each other.)</p> <p>By foregrounding in-jokes and habits of communication, the podcast turns out to be a cousin to that other &ldquo;internetiest&rdquo; of forms: the meme. The meme is likewise an in-joke, where the in-group is those folks who recognize the meme and understand its conventions. The humor of any individual <a href="http://knowyourmeme.com/memes/doge">&ldquo;doge&rdquo;</a> meme (remember that?) is siphoned off from the larger system of doge memes that makes any particular meme legible and funny. (A picture of a cat with some funny, misspelled words, encountered in utter isolation, carved into the face of some alien moon millennia hence, would be funny because absurd&mdash;but it wouldn&rsquo;t be a meme and wouldn&rsquo;t participate in its humor.)</p> <p>The affective range of the podcast is much wider than that of the meme, chiefly because hearing a conversation between the same set of people (semi)regularly opens more possibilities than silly pictures and block letters. (There I said it; call me elitist.) But this affective depth cuts the other way&mdash;it also suggests what I find mildly unsettling about the form, and perhaps slightly embarrassing about my enjoyment of it. If I&rsquo;m right that inside jokes, and a certain performance of knowing insiderness, are what separates the podcast as a genre from its radio peers, it also feels a little like media consumption as simulated friendship. Its enjoyments are those of easy familiarity and comfortable in-jokes, but with friends who aren&rsquo;t yours. (You might call this the anxiety of authenticity, and I&rsquo;ll just take my lumps for worrying over something as old-fashioned as authenticity.)</p> <p>More troublingly, that same affective register (of chummy friendship and inside jokes) seems downright insidious when you realize how overwhelmingly the list of podcasts I&rsquo;ve cited here is dominated by white guys. Insomuch as the pleasures and affects of the genre are those associated with the proverbial boys club, it is dismaying to see how much of a boy&rsquo;s club it often is.</p> <p>What is a podcast? It is the humanization of the internet meme, a type of low-participation friendship, a reduced-agency form of &ldquo;hanging out.&rdquo;</p> <p>Yours in Flopitude, Chris [Last Name Withheld]</p> From New Hampshire to Harlem, by Way of London 2013-10-10T00:00:00-04:00 http://cforster.com/2013/10/spring-new-hampshire/ <p>While I haven&rsquo;t been vocal about it, work has continued, in off moments and stolen time, on the online edition of Claude McKay&rsquo;s <em>Harlem Shadows</em> which I described <a href="/2012/06/drill-baby-drill/">some time ago</a>. At the present moment, <a href="http://roopikarisam.com/">Roopika Risam</a> and I have collected nearly all the textual variants and have marked them up in TEI; we have added (as yet unproofread) versions of early reviews and other supplemental material (and still more is being hunted down and added); and there is enough XSLT and CSS to hold the whole thing together, more or less. It is very much still a work in progress, but you can see the current state of its progress <a href="http://harlemshadows.org/beta/">here</a>. 
</p> <p>This process has also been an opportunity to understand the textual history of the poems of <em>Harlem Shadows</em>, including the relationship of the collection <em>Harlem Shadows</em> to McKay&rsquo;s earlier collection <em>Spring in New Hampshire</em>. The Jamaican poet who travels to rural Kansas in order to pursue a degree in agriculture and ends up being one of the early voices of the Harlem Renaissance manages to do so by passing through not only Harlem, but New Hampshire and, crucially, London. <em>Spring in New Hampshire</em> was how many readers first encountered McKay (including readers like Charlie Chaplin and Hubert Harrison), and the collection offers a valuable first draft of <em>Harlem Shadows</em>.</p> <p>The collection <em>Spring in New Hampshire</em> was first published in 1920. Its &ldquo;Acknowledgments&rdquo; page notes two facts which underscore this volume&rsquo;s importance in the emergence of <em>Harlem Shadows</em>.</p> <p><a href="/images/harlem-shadows/spring_acknowledgments.png"><img src="/images/harlem-shadows/spring_acknowledgments.png" alt="Acknowledgments are due to the Editors of The Seven Arts, the American Pearsons and The Liberator, where, as in the current issue of The Cambridge Magazine a number of the poems included in this volume have appeared. An American edition is being published simultaneously by Alfred A. Knopf, 220 West Forty-second Street, New York." title="Acknowledgments in Spring in New Hampshire" width="450"/></a></p> <p>First, when <em>Spring in New Hampshire</em> appeared, an American edition was clearly imagined as imminent. But the American edition, purportedly &ldquo;being published simultaneously by Alfred A. Knopf,&rdquo; never materialized. What did appear, two years later (published by Harcourt, Brace, and Co.), was <em>Harlem Shadows</em>. </p> <p>And if <em>Harlem Shadows</em> is substantially indebted to <em>Spring in New Hampshire</em> <span class="marginnote">About one third of <em>Harlem Shadows</em>&rsquo;s poems appear in <em>Spring</em>, among them &ldquo;Tropics in New York,&rdquo; &ldquo;The Barrier,&rdquo; &ldquo;North and South,&rdquo; &ldquo;Harlem Shadows,&rdquo; &ldquo;The Harlem Dancer,&rdquo; and &ldquo;The Lynching&rdquo;.</span>, <em>Spring in New Hampshire</em> in turn is less an origin than another gathering point for poems culled from elsewhere; this is especially the case with a large selection of poems which appear in the Summer 1920 issue of <em>The Cambridge Magazine</em>. The latter includes 23 of <em>Spring in New Hampshire</em>&rsquo;s 31 poems. And, with the exception of the dedication of &ldquo;Spring in New Hampshire&rdquo; (dedicated in <em>Spring</em> to &ldquo;J. L. J. F. E.&rdquo;<span class="marginnote">This would almost certainly be the Dutch bibliophile, and the man in part responsible for McKay&rsquo;s trip to London, J. L. J. F. 
Ezerman (Gosciak 117).</span>), there are <em>no textual differences</em> between the poems as they appear in <em>CM</em> and as they appear in <em>Spring</em>.</p> <p>To secure the point I&rsquo;m moving towards, compare these images, taken from the appearance of &ldquo;The Tropics in New York&rdquo; in <em>Cambridge Magazine</em> (top) and <em>Spring in New Hampshire</em> (bottom): <span class="marginnote">My thanks to <a href="http://twitter.com/nickmimic">Nicholas Morris</a> who takes no responsibility for this conjecture, but was enormously helpful in discussing its plausibility.</span></p> <p><a href="/images/harlem-shadows/tropics_compared.png"><img src="/images/harlem-shadows/tropics_compared.png" alt="Comparison of 'Tropics in New York' in both SPRING IN NEW HAMPSHIRE and CAMBRIDGE MAGAZINE." title="Comparison of two Appearances" width="450" /></a></p> <p>Do you see that imperfection in the &lsquo;I&rsquo; of &ldquo;I could no more gaze&rdquo; in both versions? Do they look identical to you too? It seems reasonable to conjecture that the <em>Cambridge Magazine</em> poems and <em>Spring in New Hampshire</em> were both printed, if not from a single setting of type, then at least from a setting of type which likely included some of the same typeset material (whether monotype, linotype, or set by hand) from the <em>Cambridge Magazine</em>.<span class="marginnote">In the interest of full disclosure, there is a <a href="/images/harlem-shadows/when-dawn_comparison.png">similar imperfection</a> in the <em>Cambridge Magazine</em> text of &ldquo;When Dawn Comes to the City&rdquo; which does not appear in <em>Spring in New Hampshire</em>; but this does not vitiate the possibility, and evidence, suggesting the two texts represent something like a single setting of type.</span></p> <p>The circumstances surrounding <em>Cambridge Magazine</em> likewise seem to confirm this possibility. <em>Cambridge Magazine</em>, at the time, was run by C. K. Ogden, with whom McKay spent time while visiting England in 1920. Ogden ran the magazine in collaboration with a number of his friends (among them, I. A. Richards). Of Ogden, McKay would write in his autobiography <em>A Long Way from Home</em>: &ldquo;besides steering me round the picture galleries and being otherwise kind, [Ogden] had published a set of my verses in his <em>Cambridge Magazine</em>. Later he got me a publisher&rdquo; (71). If McKay means that Ogden secured the publisher for <em>Spring in New Hampshire</em> (and that seems the most likely meaning here), it would certainly make sense that Ogden would go through the same publishing channels (including, perhaps, the same printer) as for the periodical for which he was responsible. <span class="marginnote">The frontmatter of <em>Spring in New Hampshire</em> (published by Grant Richards) lists the printer as &ldquo;The Morland Press.&rdquo;</span> And while Ogden (according to Josh Gosciak) authored the prefatory note for the appearance of the poems in <em>Cambridge Magazine</em>, it was I. A. Richards (a friend of Ogden&rsquo;s, who regularly appeared in the <em>Cambridge Magazine</em>, including the Summer 1920 issue in which McKay&rsquo;s poems appeared) who wrote the note for <em>Spring in New Hampshire</em> (after, according to McKay, George Bernard Shaw declined to write such an introductory note; <em>Long Way Home</em> 55). 
</p> <p>All of which is interesting and worthy of note insomuch as it suggests that <em>Harlem Shadows</em>, key document of the Harlem Renaissance, has its origins not only in Jamaica and New York, but in New Hampshire and London. This eclecticism was vital to Ogden&rsquo;s interest in McKay; Ogden was, at this moment, working with I. A. Richards on what would emerge as &ldquo;Basic English.&rdquo; In <a href="http://en.wikipedia.org/wiki/Basic_English">BASIC</a>, &ldquo;Ogden wanted a usable language that reflected the hybridity of the changing dynamic of cultures and languages in the Caribbean, Africa, the United States, and Asia&rdquo; (Gosciak 102). And in McKay, Ogden believed he had found a uniquely valuable voice in the development of such a language&mdash;a language, Gosciak describes, which would &ldquo;decolonize the dominant ideology that espoused war and imperialism&rdquo; (103). </p> <p>Yet, the way in which McKay and Ogden imagined this decolonization of English is somewhat surprising. McKay came to Ogden frustrated with what he perceived to be the limitations of his previously published poetry, in Jamaican dialect. Here is Gosciak again:</p> <blockquote> <p>[McKay&rsquo;s] reputation was as the &ldquo;Bobby Burns&rdquo; of Jamaican folk wisdom, who could write persuasive &ldquo;love songs&rdquo; in a sonorous dialect. But an exasperated McKay explained to Ogden: &ldquo;One can&rsquo;t express any deep thought to perfection in it, nor can it effectively bring forth the note of sorrow.&rdquo; Dialect was hackneyed, McKay concluded. &ldquo;I&rsquo;ve buried it and don&rsquo;t care to revive it again.&rdquo; Ogden was sympathetic to the poet&rsquo;s desires to internationalize his poetics, and he mentioned him in precision and exactness&mdash;de-emotionalizing his lyrics of the charged baggage of Harlem and race. (104)</p> </blockquote> <p>This tension is manifest in a disagreement between Ogden and McKay over what to title the collection. Ogden was interested in developing an international English. Gosciak, drawing on material in the <a href="http://library.mcmaster.ca/archives/findaids/findaids/o/ogden.htm">Papers of CK Ogden</a>, explains:</p> <blockquote> <p>McKay was opposed to [the title] <em>Spring in New Hampshire, and Other Poems</em>; he believed the title conjured associations with New England, which he felt was &ldquo;played out&rdquo;&hellip; McKay preferred &ldquo;a terse, simple thing&rdquo; for a title, such as &ldquo;Poems or Verse,&rdquo; which Ogden, too, appreciated. But McKay also had an eye for the New York&mdash;and Harlem&mdash;reading public. He suggested &ldquo;Dawn in New York,&rdquo; invoking imagery that would ultimately give texture to <em>Harlem Shadows</em> in 1922. Ogden persisted in his claims for the high lyricism of Frost, and eventually McKay came around to that aesthetic ground. (The choice of title, <em>Spring in New Hampshire</em>, was, as McKay acknowledged, a bold move for a poet who would very soon represent the Harlem Renaissance.) (Gosciak 105)</p> </blockquote> <p>There is a sort of confusion of motivations here; McKay&rsquo;s frustration with dialect and Ogden&rsquo;s attempt to decolonize English both find expression in the poems of <em>Spring in New Hampshire</em>&mdash;poems that rely on traditional forms&mdash;sonnets aplenty!&mdash;and frequently Victorian diction. 
Yet, Ogden&rsquo;s vision of de-colonizing English also involves a kind of de-racination, with the effect that Ogden preferred to see McKay&rsquo;s verse avoid any too-direct allusion to Harlem or race.</p> <p>And so <em>Spring in New Hampshire</em> ends up being as notable for what it <em>doesn&rsquo;t</em> share with <em>Harlem Shadows</em> as what it does. The most famous poem of <em>Harlem Shadows</em>, &ldquo;If We Must Die,&rdquo; had first appeared in <em>The Liberator</em> in 1919, but it was not included in <em>Spring in New Hampshire</em>. In <em>A Long Way from Home</em>, McKay recounts bringing a copy of <em>Spring in New Hampshire</em> to Frank Harris, of <em>Pearson&rsquo;s Magazine</em> (who had wanted to publish &ldquo;If We Must Die,&rdquo; though he lost out to <em>The Liberator</em>):</p> <blockquote> <p>[Harris] was pleased that I had put over the publication of a book of poems in London. &ldquo;It&rsquo;s a hard, mean city for any kind of genius,&rdquo; he said, &ldquo;and that&rsquo;s an achievement for you.&rdquo; He looked through the little brown-covered book. Then he ran his finger down the table of contents, closely scrutinizing. I noticed his aggressive brow become heavier and scowling. Suddenly he roared: &ldquo;Where is the poem?&hellip; That fighting poem, &lsquo;If We Must Die.&rsquo; Why isn&rsquo;t it printed here?&rdquo;</p> <p>I was ashamed. My face was scorched with fire. I stammered: &ldquo;I was advised to keep it out.&rdquo;</p> <p>&ldquo;You are a bloody traitor to your race, sir!&rdquo; Frank Harris shouted. &ldquo;A damned traitor to your own integrity. That&rsquo;s what the English and civilization have done to your people. Emasculated them. Deprived them of their guts. Better you were a head-hunting, blood-drinking cannibal of the jungle than a civilized coward. You were bolder in America. The English make obscene sycophants of their subject peoples. I am Irish and I know. But we Irish have guts you cannot rip out of us. I am ashamed of you, sir. It&rsquo;s a good thing you got out of England. It is no place for a genius to live.&rdquo;</p> <p>Frank Harris&rsquo;s words cut like a whip into my hide, and I was glad to get out of his uncomfortable presence. Yet I felt relieved after his castigation. The excision of the poem had been like a nerve cut out of me, leaving a wound which would not heal. And it hurt more every time I saw the damned book of verse. I resolved to plug hard for the publication of an American edition, which would include the omitted poem. (81-82)</p> </blockquote> <p>McKay here ends up being caught between two white editors, and their respective ways of imagining a response to British colonialism. (This situation recalls that of McKay and his relationship to dialect poetry discussed in Michael North&rsquo;s excellent chapter in <em>The Dialect of Modernism</em>.) </p> <p>That American edition that McKay resolves to publish after this encounter with Harris would, of course, be <em>Harlem Shadows</em>. <em>Harlem Shadows</em>, in McKay&rsquo;s depiction, is a <em>version</em> of <em>Spring in New Hampshire</em> and a repudiation of it. Elsewhere in his autobiography he writes, &ldquo;I was full and overflowing with singing and I sang all moods, wild, sweet, and bitter. I was steadfastly pursuing one object: the publication of an American book of verse. I desired to see &lsquo;If We Must Die,&rsquo; the sonnet I had omitted in the London volume, inside of a book&rdquo; (116). 
</p> <p>Yet, if McKay&rsquo;s comments encourage us to read <em>Harlem Shadows</em> as a re-politicized version of <em>Spring in New Hampshire</em>, &ldquo;If We Must Die&rdquo; itself, nevertheless, famously operates by abstracting the political violence of the &ldquo;Red Summer&rdquo; of 1919 into an unspecified &ldquo;we kinsmen&rdquo; against a &ldquo;common foe.&rdquo; And, indeed, in <em>Harlem Shadows</em>, as in <em>Spring in New Hampshire</em>, some of McKay&rsquo;s most explicitly political poetry of this period&mdash;poems like <a href="http://harlemshadows.org/beta/index.html#supp_mckay_to-the-white-fiends">&ldquo;To the White Fiends&rdquo;</a> or &ldquo;A Capitalist at Dinner,&rdquo; initially published in the same period, and in the same venues, as &ldquo;If We Must Die&rdquo;&mdash;remains excluded. </p> <p>All of which indicates the value of a comprehensive collection of all the contemporary poems and material which went into the making of <em>Harlem Shadows</em>&mdash;both through their inclusion and their exclusion. </p> <h2>Appendix: Tables of Contents Compared</h2> <p>Below I&rsquo;ve preserved the original orderings of the tables of contents for both <em>Harlem Shadows</em> and <em>Spring in New Hampshire</em> and used color (a lovely salmon) to indicate which titles are shared.</p> <p><hr /> <table> <tr> <td><strong>Harlem Shadows</strong></td> <td><strong>Spring in New Hampshire</strong></td> <tr> <td style='background-color:#adff2f'>The Easter Flower</td> <td style='background-color:#ffa07a'>Spring in New Hampshire</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>To One Coming North</td> <td style='background-color:#ffa07a'>The Spanish Needle</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>America</td> <td style='background-color:#ffa07a'>The Lynching</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Alfonso, Dressing to Wait at Table</td> <td style='background-color:#ffa07a'>To O. E. 
A.</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>The Tropics in New York</td> <td style='background-color:#40e0d0'>Alfonso, Dressing to Wait at Table, Sings</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Flame Heart</td> <td style='background-color:#40e0d0'>Flowers of Passion</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Home Thoughts</td> <td style='background-color:#40e0d0'>To Work</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>On Broadway</td> <td style='background-color:#ffa07a'>Morning Joy</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>The Barrier</td> <td style='background-color:#40e0d0'>Reminiscences</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Adolescence</td> <td style='background-color:#ffa07a'>On Broadway</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Homing Swallows</td> <td style='background-color:#40e0d0'>Love Song</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>The City&rsquo;s Love</td> <td style='background-color:#ffa07a'>North and South</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>North and South</td> <td style='background-color:#ffa07a'>Rest in Peace</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Wild May</td> <td style='background-color:#ffa07a'>A Memory of June</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>The Plateau</td> <td style='background-color:#ffa07a'>To Winter</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>After the Winter</td> <td style='background-color:#ffa07a'>Winter in the Country</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>The Wild Goat</td> <td style='background-color:#ffa07a'>After the Winter</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>Harlem Shadows</td> <td style='background-color:#ffa07a'>The Tropics in New York</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>The White City</td> <td style='background-color:#ffa07a'>I Shall Return</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>The Spanish Needle</td> <td style='background-color:#ffa07a'>The Castaways</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>My Mother</td> <td style='background-color:#40e0d0'>December 1919</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>In Bondage</td> <td style='background-color:#40e0d0'>Flame-Heart</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>December, 1919</td> <td style='background-color:#ffa07a'>In Bondage</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Heritage</td> <td style='background-color:#ffa07a'>Harlem Shadows</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>When I Have Passed Away</td> <td style='background-color:#ffa07a'>The Harlem Dancer</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Enslaved</td> <td style='background-color:#ffa07a'>A Prayer</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>I Shall Return</td> <td style='background-color:#ffa07a'>The Barrier</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>Morning Joy</td> <td style='background-color:#ffa07a'>When Dawn Comes to the City</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Africa</td> <td style='background-color:#40e0d0'>The Choice</td> </tr></p> <p><tr> <td style='background-color:#adff2f'>On a Primitive Canoe</td> <td style='background-color:#40e0d0'>Sukee River</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>Winter in the Country</td> <td 
style='background-color:#40e0d0'>Exhortation</td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>To Winter</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>Spring in New Hampshire</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>On the Road</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>The Harlem Dancer</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Dawn in New York</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>The Tired Worker</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Outcast</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>I Know My Soul</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Birds of Prey</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>The Castaways</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Exhortation: Summer, 1919</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>The Lynching</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Baptism</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>If We Must Die</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Subway Wind</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>The Night Fire</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Poetry</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>To a Poet</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>A Prayer</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>When Dawn Comes to the City</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>O Word I Love to Sing</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Absence </td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Summer Morn in New Hampshire</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>Rest in Peace</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>A Red Flower</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Courage</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>To O. E. 
A.</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Romance</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Flower of Love</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>The Snow Fairy</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>La Paloma in London</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#ffa07a'>A Memory of June</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Flirtation</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Tormented</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Polarity</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>One Year After</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>French Leave</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Jasmines</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Commemoration</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Memorial</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Thirst</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Futility</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p><tr> <td style='background-color:#adff2f'>Through Agony</td> <td style='background-color:#ffa07a'> </td> </tr></p> <p></table></p> <h1>Works Cited</h1> <ul> <li><p>Gosciak, Josh. <em>The Shadowed Country: Claude McKay and the Romance of the Victorians.</em> New Brunswick, N.J.: Rutgers University Press, 2006. Print.</p></li> <li><p>McKay, Claude. <em>A Long Way from Home</em>. Ed. Gene Andrew Jarrett. New Brunswick, N.J.: Rutgers University Press, 2007. Print.</p></li> </ul> Somewhere in New Jersey... 2013-09-27T00:00:00-04:00 http://cforster.com/2013/09/sysadmin-for-poets/ <p>In a <a href="/2013/08/teaching-dh">previous post</a> on a graduate course I taught last Spring, I mentioned the server I ended up using as a way to try to establish some uniformity of access to software packages and tools. In this post, I&rsquo;ll try to add a few details.</p> <h1>Virtual Private Servers</h1> <p>The first thing to understand about &ldquo;the server&rdquo; is that I rented a <em>VPS</em>, a &ldquo;virtual(ized) private server,&rdquo; rather than some shared space. If you don&rsquo;t know what that means, let me try to explain (if you do know what that means, you may want to move along to the next section). </p> <p>The word <em>server</em> itself is one of the slipperier bits of our contemporary argot. It can name a number of different links in a relationship between one piece of software and another (to say nothing of the people on either end, or perhaps in between). This slipperiness is only compounded with the advent of <em>virtual private servers</em>. </p> <p>Prior to virtualization, if you wanted to host a webpage, you could either run (or rent) your own dedicated hardware, or go with a &ldquo;shared hosting&rdquo; option. 
For most folks who run their own blog, shared hosting remains the obvious choice. This blog is hosted on such a shared host. Shared hosting relies on a few facts:<span class="marginnote"><strong>LAMP</strong>: that is Linux (an operating system to manage the hardware resources, to schedule processes&hellip; you know, to turn electricity, plastic, and rare earth elements into a computer), Apache (the most widely used web server), MySQL (the most widely used database, which has its own database <em>server</em>), and (usually) PHP (a scripting language).</span></p> <ul> <li>most folks&rsquo; hosting needs can be solved by a single &ldquo;stack&rdquo; of common software, usually the so-called <strong>LAMP</strong> stack</li> <li>most folks&rsquo; blogs don&rsquo;t get sufficient traffic to require that much hardware, and so many blogs (websites, whatever) can be &ldquo;served&rdquo; by a single (&ldquo;shared&rdquo;) host.</li> <li>the OS can easily separate out different users; that way, I can run Wordpress and create a database, and you can run Drupal and create a database, all on the same hardware, without either of us being able to destroy one another&rsquo;s data.</li> </ul> <p>The drawback to shared hosting is that you have very little control over the server. You usually can&rsquo;t, for instance, install new software, beyond packages of PHP scripts. Of course, for web hosting, that&rsquo;s no big deal; you don&rsquo;t really need to install anything beyond whatever CMS you want anyway. <span class="marginnote">Anyone who has had problems with a web host&rsquo;s version of PHP, or similar issues, knows that this isn&rsquo;t quite true.</span></p> <p>This compromise with respect to control over the server configuration is a function of cost (and of course, expertise; do you really want to have to worry about building that &ldquo;software stack&rdquo;?). To have complete control over a server would mean that you had a dedicated server (in this case, &ldquo;server&rdquo; means actual <em>hardware</em>), perhaps in your office (in theory it could be your laptop), in an IT closet, or rented in a server farm somewhere. With the advent of software virtualization, however, this changed. With virtualization, a single piece of hardware can run multiple &ldquo;virtual machines&rdquo;; the host system simulates another machine, on which you can then run software which itself doesn&rsquo;t really know the difference. (If you run Windows on a Mac with Parallels, this idea may be quite familiar.) You can do the same thing with a server. That computer that talked to the internet, and served web pages (or whatever), is now just a <em>virtual</em> machine. And such a virtualized private server (a VPS, as opposed to a <em>shared</em> host) reconfigures the costs involved, and makes it more affordable to give you more control over the software installed, without incurring the full cost of owning/renting/running an actual physical hardware server. </p> <p>A virtualized server is (<em>very</em> significantly) cheaper than running your own hardware, but gives you many of the advantages of actually running your own hardware. It is still, in general, more expensive than shared hosting. To give you some sense of the cost, the lowest tier <a href="https://www.linode.com/">Linode</a> VPS is $20/month. <a href="https://www.digitalocean.com/pricing">Digital Ocean&rsquo;s lowest tier</a> is only $5/month. 
</p> <h1>A Customized Server</h1> <p>Because it is so customizable, I thought a VPS could offer a solution to the challenge of trying to make software uniformly available to students enrolled in the digital humanities seminar I was teaching. With such a system I could give everyone access to Python and the NLTK (and its associated corpora) without having to ask folks to install that software on their own machines. I could install MALLET and R (and relevant R packages) and Stanford&rsquo;s Named Entity Recognizer and an XSLT processor. This was also a relatively flexible solution; if someone wanted to try something else, perhaps something I&rsquo;d never heard of, it was in many cases easy to install it on the server.</p> <p>The <em>easy</em> in that last sentence is a function of Linux package management. If you&rsquo;re used to installing from .EXEs (or .DMGs) you download from the web, the world of package management can seem arcane. However, for large pieces of software, package management systems are wonderful and can be (deceptively) simple to use. While getting everything up and running is a bit of a trick (you need to first install an operating system&mdash;about which a little more below&mdash;and then some basic software to let you connect to the server), installing a piece of software, like the R language, is as simple as typing:</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash">sudo apt-get install r-base</code></pre></div> <p>And then, you have R installed, and you&rsquo;re ready to go at the command line. In my experience, Linux package management is often <em>easier</em> than trying to handle software dependencies on other OSes (at least once you&rsquo;re familiar with the conventions of your package management system). </p> <p><em>Easy</em> is also relative; if you&rsquo;re comfortable at the command line, using a package manager feels intellectually more intuitive and comprehensible than dragging a DMG into an &ldquo;Applications&rdquo; folder, or double-clicking an EXE. But if you&rsquo;re uncomfortable with the command line, this will likely feel as uncomfortable as navigating your filesystem or anything else.</p> <h1>Server Setup</h1> <p>Before you can install packages, though, you need to first install a base Linux operating system. If you&rsquo;re unfamiliar with Linux, this may not be the kindest or easiest way to get acquainted with it, though the good folks at Linode (hardly unbiased) insist <a href="https://www.linode.com/faq.cfm">&ldquo;If you&rsquo;re looking to learn, there is no better environment. Experiment with the different Linux flavors, redeploy from scratch in a matter of minutes.&rdquo;</a>. I have spent too much of my life playing with Linux distros, and so this part of the process felt quite natural. And yet, I <em>still</em> managed to make what I now consider a wrong choice in configuring the server. Linux comes in a wide array of flavors or distributions.<span class="marginnote">Technically, &ldquo;Linux&rdquo; is not an OS, but an OS &ldquo;kernel.&rdquo; This distinction, and what we call things, can get <a href="http://en.wikipedia.org/wiki/GNU/Linux_naming_controversy">contentious</a>.</span> Of the options Linode offers (Ubuntu, Arch, OpenSuse, Gentoo, CentOS, Slackware, Debian, Fedora) I opted for Debian. Debian has a reputation for being a very stable distro; and so it would be a great choice for running a web-server. 
Of course, I <em>wasn&rsquo;t</em> running a web server, and so stability was not, in fact, my <em>chief</em> concern. I probably should have chosen a distribution which prioritized not stability, but the ease and availability of new and up-to-date software packages. Arch Linux, with its &ldquo;rolling release&rdquo; schedule (and the operating system <a href="http://www.mylinuxrig.com/post/11831613369/the-linux-setup-chris-forster-academic">I once loved</a> with a passion I&rsquo;ve not since been able to match) would have made more sense. It would have been <em>easier</em> with Arch to install the most up-to-date versions of certain Python packages, etc., etc. Oh well. Maybe next time. </p> <p>Once your base OS is installed, it&rsquo;s time to install the basic packages you will need to do <em>anything at all</em>. But since you are now responsible for a computer somewhere in New Jersey<span class="marginnote">I must say, the responsiveness of the Linode servers shocked me; that computer in NJ was consistently more responsive than my home media server. I regularly ran emacs sessions <em>on the server</em> and found them completely responsive.</span>, you need to worry about security. You don&rsquo;t want someone hijacking your VPS and using it to send spam or whatever else. Linode offers <a href="http://library.linode.com/securing-your-server">some tips</a> and I consulted <a href="http://feross.org/how-to-setup-your-linode/">this page</a> for some advice as well. It wasn&rsquo;t nearly as scary as I imagined. I installed <code>fail2ban</code> (and left the default settings), turned off root log-in from ssh, and that was about it. I have (so far) had no problems. </p> <h1>Connecting and Interacting with the Server</h1> <p>I&rsquo;ve sort of glossed over a pretty fundamental fact; with the exception of RStudio server (mentioned below), the only way to interact with the server as I&rsquo;ve described it is through <a href="http://en.wikipedia.org/wiki/Secure_Shell">SSH</a>; and so the only access you have to the server is command line access. For the most part, that was fine for the goals I had for the class. Command line access allowed people to experiment with Python and NLTK; they could run MALLET, and similar things. You couldn&rsquo;t run, say, Gephi on the server though.</p> <p>And this limitation proved frustrating for folks working with film and images. For one thing, moving large movie/image files back and forth to the server would have proved unpleasant; moreover, software like <a href="http://rsbweb.nih.gov/ij/">ImageJ</a> couldn&rsquo;t be installed on the server and had to be installed locally. 
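The shuttling itself, for what it&rsquo;s worth, is just <code>scp</code> or <code>rsync</code> over SSH; a minimal sketch (the hostname and paths here are placeholders, not my actual setup) looks something like this, though with multi-gigabyte video files it gets old quickly:</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash"># push a folder of images up to the server (placeholder host and paths)
rsync -avz --progress images/ username@server.example.com:/home/username/images/

# pull a finished file back down
scp username@server.example.com:/home/username/images/montage.png .</code></pre></div> <p>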
I did manage, though, to do some image manipulation with ImageMagick&mdash;here is every page of the <em>Little Review</em> in a single image (a larger, ~150M, image is available <a href="https://www.dropbox.com/s/idnsh1tmkteeevu/lr.png">here</a>):</p> <p><a href="/images/lr_smaller.png"><img src="/images/lr_smaller.png" alt="Little Review Montage" width="450"/></a></p> <p><a href="/images/lr_montage_close-up.png"><img src="/images/lr_montage_close-up.png" alt="A Close-Up of the Little Review Montage" width="450" /></a></p> <p>For folks who were interested in using <code>R</code>, <a href="http://www.rstudio.com/ide/docs/server/getting_started">RStudio Server</a> worked <em>wonderfully</em>; it allowed folks to connect to the server through their web browser and have an RStudio session that looked something like this:</p> <p><a href="/images/rstudio-running.png"><img src="/images/rstudio-running.png" alt="RStudio, Running" width="450" /></a></p> <p>While I gripe about <code>R</code>, RStudio is really excellent. If there is a comparable server project that will let folks run python and python packages (including matplotlib) through a browser in a similar way, please, <em>please</em> let me know. RStudio offers a great way to provide a standardized R environment with a common set of shared packages (and perhaps even data) available to all.</p> <p>There are other things one could do with this sort of set-up; if you wanted to run an old-school Bulletin Board System or MUD, you could do it (here are <a href="http://lunduke.com/?p=2156">some ideas</a> about running a BBS system). <span class="marginnote">You also might look at <a href="http://www.telnetbbsguide.com/ssh.htm">The BBS Corner&rsquo;s Telnet &amp; Dial-Up BBS Guide</a> or <a href="http://www.convolution.us/">Convolution BBS</a>.</span> As a way to offer <em>certain</em> pieces of software to people without requiring them to install them, running this server was very helpful. If you&rsquo;re doing tasks that can be scripted and which can often take a very long time to complete (such as topic modeling, or named entity extraction, or POS tagging)<span class="marginnote">I would add certain types of image manipulation&mdash;as I did with the <em>Little Review</em> example noted above; though as I say, shuttling the images back and forth can be unpleasant; this unpleasantness is partially allayed if you&rsquo;re scripting your image acquisition; &ldquo;a little <code>wget</code> magic&rdquo; or similar ought to work.</span> you can set them going, disconnect from the server and then check on them later (using a program like <a href="http://www.gnu.org/software/screen/"><code>screen</code></a> to make this easy; it&rsquo;s <em>really</em> great to be walking home, knowing that somewhere in New Jersey a computer is dutifully seeking 100 topics across 5000 documents).</p> <p>I&rsquo;ve skipped over some of the other unglamorous, sysadminy things one must do: creating user accounts for each person in the class; creating a shared directory that everyone could read and write to; and other things like that. 
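Concretely, that sort of thing mostly amounts to a handful of commands like these (the username, group name, and directory below are placeholders rather than a record of what I actually typed):</p> <div class="highlight"><pre><code class="language-bash" data-lang="bash"># an account for each student (placeholder username)
sudo adduser student1

# a shared, group-writable directory everyone in the class can use
sudo groupadd eng630
sudo usermod -a -G eng630 student1
sudo mkdir -p /srv/shared
sudo chown root:eng630 /srv/shared
sudo chmod 2775 /srv/shared   # the setgid bit keeps new files in the group</code></pre></div> <p>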
For someone comfortable at the command line, and interested to learn, all of that stuff is entirely manageable.</p> On "Teaching" "Digital Humanities" 2013-08-23T00:00:00-04:00 http://cforster.com/2013/08/teaching-dh/ <p>As this academic year warms up, some thoughts on the last one; last year I had the unexpected opportunity to teach a graduate seminar in the Spring.<span class="marginnote">I was not as diligent a blog coordinator, keeping up with summary posts, as I would have liked, but some summaries and links to student blogs are available <a href="http://630dh.cforster.com/">here</a>.</span> I wavered between a theories of modernism course (think: Hugh Kenner, Peter Bürger, Fredric Jameson, Susan Stanford Friedman) and an &ldquo;Intro DH&rdquo; class, settling on the latter simply because I thought it would be more valuable to graduate students (few of whom, in our department, have strong research interests in modernist studies). </p> <p>The course benefited from a number of sources; one is Scott Weingart&rsquo;s excellent <a href="http://www.scottbot.net/HIAL/?page_id=21794">list of DH syllabi</a>.<span class="marginnote">He says syllabi; I say syllabuses; it is a battle that has <a href="http://books.google.com/ngrams/graph?content=syllabi%2Csyllabuses&amp;year_start=1800&amp;year_end=2000&amp;corpus=15&amp;smoothing=3&amp;share=">raged for centuries</a>.</span> I should also thank a number of people who were kind enough to offer thoughts (and sometimes texts) as I put together the syllabus: Stéfan Sinclair, David Golumbia, and Brian Lennon offered suggestions. They were all more than generous; though, of course, they bear no responsibility whatsoever for the syllabus. I also was fortunate enough to end up corresponding with a number of other folks during the semester; thanks to all of them. </p> <p>If you&rsquo;re interested in seeing the syllabus, you can see it in <a href="/files/eng630_syllabus-final.pdf">[PDF]</a>, <a href="/files/eng630_syllabus-final.html">[HTML]</a>, or (heaven help you) on <a href="https://github.com/c-forster/eng630-syllabus">GitHub</a>.<span class="marginnote">In theory, the whole github syllabuses thing sounds promising; and for DH stuff, who knows? But really, just putting the stuff on the web is probably the best way to share teaching materials.</span></p> <p>The syllabus includes, at least to some extent, basically all the texts that, a while back, Brian Croxall mentioned as the &ldquo;usual&rdquo; DH reading list: </p> <blockquote class="twitter-tweet"><p><a href="https://twitter.com/digiwonk">@digiwonk</a> <a href="https://twitter.com/readywriting">@readywriting</a> Graphs, Maps, Trees would be first. Then Debates, two Blackwell books. Ramsay, Jockers, Kirschenbaum. The usual.</p>&mdash; Brian Croxall (@briancroxall) <a href="https://twitter.com/briancroxall/statuses/358076804084416512">July 19, 2013</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <p>I would have <em>loved</em> to use Jockers&rsquo;s <em>Macroanalysis</em>, but it was not out in time; I would certainly use it were I to teach the class again. 
Indeed, I could very easily imagine a class taught around <em>Macroanalysis</em> as its central text.</p> <p>I imagined the course as centered on a basically epistemological question: does the &ldquo;becoming digital of textuality&rdquo; <span class="marginnote">&ldquo;becoming digital of textuality&rdquo; is clumsy; but I think it gets at the thing I&rsquo;m interested in better than anything else.</span> change the sorts of knowledge about literature and culture that scholars produce? Does it offer, to put it more polemically, a <em>science</em> of culture? (I avoided this polemical formulation in class, not least because of the definition of &ldquo;science&rdquo; that it assumes.) This was the motivation for starting with C.P. Snow&rsquo;s &ldquo;The Two Cultures,&rdquo; a text <span class="marginnote">I would, in the future, however, excerpt Snow; there are a lot more relics of the Cold War in that essay than I recalled, not all of which were directly relevant.</span> which I sometimes see condescendingly referred to as if it were backward, outdated, or self-evidently wrong, when I find its core thesis remains provocative and at least partially compelling.</p> <p>The course was then organized around questions of digitization and textual representation<span class="marginnote">The Latour and Lowe essay, from <em>Switching Codes</em>, is a real gem, and one I don&rsquo;t see mentioned very frequently.</span>, and then a whole slew of things I called distant reading. </p> <p>I&rsquo;m not sure that there&rsquo;s anything in my life that I&rsquo;d call an unmitigated success, and this class is surely no exception. I think, though, that it did the job of familiarizing folks with at least some of what &ldquo;DH&rdquo; is, particularly for folks in English departments. </p> <p>I&rsquo;m not sure if I&rsquo;ll ever teach such a course again. But here are some things I think I learned&mdash;things I&rsquo;d change and things I&rsquo;d do the same way again:</p> <ul> <li><p><strong>More, not less, technical:</strong> I was acutely aware of the worry among students that this class would be &ldquo;highly technical&rdquo; and require students to have all sorts of prerequisite knowledge. To avoid that, I think I erred too far on the other side, and set the bar too low. I integrated some tools into class, while leaving the more technical stuff for supplementary (optional) hands-on sessions (we did a little Intro Python &amp; nltk; we did some topic modeling with MALLET, some web-scraping, etc). This was logistically a problem (finding a time that worked for everyone). More fundamentally, my sense is that the class would have been stronger had it been <em>more</em> technical. The evaluations seem to confirm this almost unanimously; everyone thought more hands-on work with software would be a benefit. </p> <p>What would that mean in practice? While we spent time talking about encoding and looking at some examples of TEI-encoded documents (including the <a href="http://www.folgerdigitaltexts.org/">Folger&rsquo;s</a>), we didn&rsquo;t actually encode anything. But until you&rsquo;ve had to complete a teiHeader, you don&rsquo;t really know the burden of metadata. Actually making folks encode a text would be one avenue I might pursue. 
By selecting texts with a sufficiently interesting textual history and encoding some apparatus, this could be an interesting assignment indeed.</p> <p><a href="http://mattwilkens.com/2012/12/31/dh-grad-course-reflections/">Matt Wilkens reports</a> having success using <a href="http://programminghistorian.org/lessons">&ldquo;The Programming Historian&rdquo;</a>; I could imagine doing something similar, and centering the class&rsquo;s practical activities around Python (about which more below); I would do so despite serious reservations about both my own qualification to be teaching such material (I am, after all, an autodidact in these matters) and the utility that a superficial command of a scripting language would give a graduate student in English.</p></li> <li><p><strong>One Tool Well</strong>: This is a corollary to the previous point; more technical, but also more focused. As we moved through the semester, the variety of different software packages we looked at increased: basic word frequencies with command line tools, R (particularly for mapping), MALLET, Stanford&rsquo;s Named Entity Recognizer, Python (with a number of libraries&mdash;particularly the NLTK), ImageJ, and others. There is a value to examining a diversity of software tools; but the costs, upon reflection, now seem too high. For someone coming to such technologies for the very first time, I think focusing on one, very flexible technology to do all (or at least most) of the things we would be interested in doing might be the best approach. And Python could fit the bill; certain things may be less pleasant in Python, but overall the consistency of a single language and syntax would have been a virtue.</p> <p>(I will say that while I think I&rsquo;d prefer Python over R, <em>if you were to use R</em>, running <a href="http://www.rstudio.com/ide/docs/server/getting_started">RStudio Server</a> would be a great way to provide a consistent software base for students; as a piece of software, it was pretty stellar. Speaking of which&hellip;)</p></li> <li><p><strong>Running a Server? Totally Worth It</strong>: One of the problems in a class like this is infrastructure; you want folks to be able to play with some of these tools, but trying to get MALLET, or Python plus a handful of libraries, installed on students&rsquo; machines can be very unpleasant. To create some consistency in the software available to folks and to avoid having to try to install software packages on 6 different OS versions and hardware platforms, I rented some server space with Linode (approximate cost: $25/month) and set up user accounts for everyone. Then I installed all the software packages we would be interested in using.</p> <p>Such a setup utterly lacks any GUI (unless you count browser access to something like RStudio), which requires folks to be comfortable at the command line. This was no small thing. Use of the server was essentially optional, and some folks simply never got interested in it. In the spirit of &ldquo;more, not less, technical,&rdquo; I&rsquo;d probably require more engagement with the server in a second version. I&rsquo;ll write up a more detailed explanation of the server setup I used soon; but this worked, overall, very well, and with some tweaks could work even better. </p></li> <li><p><strong>I&rsquo;d Probably Require Twitter</strong>: I suggested folks have a twitter account, but didn&rsquo;t require it. 
The folks who were on twitter, though, benefited from it (I think); it provided an additional shared context; when authors we were reading were on twitter or had blogs, I was sure to note it, and I sometimes saw the extra context folks had gleaned from these resources in their contributions to class.</p></li> </ul> <p>When I planned the syllabus, I was concerned to try to integrate criticism of &ldquo;DH&rdquo; into this class, to make it both a class <em>about</em> &ldquo;digital humanities&rdquo; and a digital humanities class. I wanted to leaven ambient excitement or &ldquo;buzz&rdquo; with skepticism, to use the tools but to do so with care and reflection, to balance the hacking with yacking (to invoke a short-hand that has produced much hand-wringing). I will say, however, that during class meetings, I generally found myself working harder to overcome the skepticism, rather than to contain the excitement; wanting there to indeed be a little less yack and a little more&hellip;</p>