A Walk Through the Metadata: Gender in the HathiTrust Dataset
[This post, featuring too many graphs, was created with
knitr. You can see the source that generated those graphs, and the rest of the post, here. Update: Many thanks to Lincoln Mullen who is not only the maintainer, and one of the authors of the
gender package, but noted that I was (inaccurately) using publication dates, rather than author birthdates, when inferring the gender of names; he suggested a smart way of approximating author birth dates (used below), and also drew my attention to the
napp dataset, accessible by the
gender package. In light of his comments I made a few changes, re-ran the data, and have updated this post. The plots are new; I’ve added a little new text as well which, like this, is in dark blue. 9/11/15]
—The Critical Review, 1777
I’ve been tinkering with the HathiTrust dataset that Ted Underwood and HathiTrust released last month. (Some links: [The Dataset]; [Ted Underwood’s Discussion of It]; [My Previous Exploration of It, Mostly Using R].) The thorniest questions I’ve encountered concern how to handle/understand volumes and titles which occur in the dataset more than once (and some related issues: multivolume works, etc.). I’ll try to write a quick post about those issues in the future. For now, let’s look at other ways we might explore this dataset.
Examining the Gender of Authorship in HathiTrust Summary Metadata
I find the metadata fascinating in a way that the actual data (the word frequencies for each volume) is not. Let’s consider, for example, the relationship between authorship in the dataset and gender. Authorial gender is not one of the included metadata fields, but we might try to examine it by using the
gender package for
R. This package uses a variety of historical sources (Social Security data, US Census data, as well as some other sources) to intelligently infer gender based on first names (since the package relies on largely US data, it might not be as successful in predicting gender of names in other Anglophone countries, including England—an issue I note here, parenthetically, and then ignore). The package can also use
napp data, covering Canada, the United Kingdom, Germany, Iceland, Norway, and Sweden, from the years 1758 to 1910. This web app nicely illustrates what the package does. Here, for instance, is how you would load the package and infer the likely gender of a George born in 1819:
library(gender)
gender('george',method='napp',year=1819)

Source: local data frame [1 x 6]

    name proportion_male proportion_female gender year_min year_max
1 george               1                 0   male     1819     1819
This sort of inference based on first names is obviously imperfect; there are cases where, for a variety of reasons, the prediction will be wrong. The suggestion that in 1819 the name George belongs to a man may be very wrong indeed if that particular George is the author of Middlemarch. For certain purposes, however, that misattribution may be exactly what we’re interested in. The simplicity of the approach can be a strength. The package, in most cases, will make the same inference (even the same incorrect inference) about the gender of a name as a reader would. This makes it ideal if you’re interested in how readers understood the authorship of the books they were reading, or how (perceived) authorial gender shaped the market for literature. (Before we start taking this too seriously, Francis Beaumont’s name is detected as female (understandable, but wrong); as is Oliver Goldsmith’s (?!?!) in some years (e.g. 1792).)
Such a summary of the dataset is very different than, say, using the gender information inferred about a volume based on its author’s first name to train a classifier on the volume-level word counts. Such a classifier could be used on texts from outside the dataset, or on texts within the dataset where an author’s gender is unknown. One might try to use it to test Virginia Woolf’s “guess that Anon, who wrote so many poems without signing them, was often a woman” (49). That sort of project, however, which would attempt to link “gender” (and in this case what exactly that word means becomes rather pressing) not simply to a name in a metadata field, but to a vocabulary (or some other representation of language use), would begin to encounter the thornier theoretical/methodological questions that I am so happy to skirt past here.
Hewing to this more modest, and (I think) less theoretically fraught, goal of understanding the makeup of the dataset, I used the
gender package to infer the gender of the author of each volume in the three HathiTrust datasets. To maximize recognition I used a somewhat heterodox method. Since the package expects birthdates, I subtracted 50 and 30 years from the publication date and passed the resulting range (publication year minus 50 through publication year minus 30) to the package (this was Lincoln’s, I think very reasonable, suggestion). I queried against first the
napp data; if this returned no result I tried the
ipums census data. In both cases, I massaged the dates so that if they were out of range, I checked against the earliest available date (a historically imprecise result strikes me as better than no result at all). So to every row in the metadata summary files I added a column for gender, which represented the result of applying this idiosyncratic use of the
gender function to the author’s first name. For my purposes a “first name” is the first word after the comma in the
author field; there would be better ways to do this. (This process was actually rather time consuming—hint,
mclapply is your friend, as are virtualized servers you can let hum away for hours. Lincoln notes that you can pass a vector of names to the package; this makes the process more efficient for large datasets, particularly when names are repeated. I nevertheless did it the (seriously) less efficient way, in part because I had written the code for an earlier version of the
gender package, and in part because my odd use of two methods to try to find data complicates matters.)
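The lookup procedure described above can be sketched in Python. This is only an illustration under stated assumptions, not the code used for this post: the actual inference was done with the R gender package, so each lookup here is a hypothetical stand-in callable backed by a toy dictionary. The napp range (1758 to 1910) comes from the description above; the range given for the census data is an assumption, as is pulling both ends of an out-of-range window back into the available years.

```python
import re

NAPP_YEARS = (1758, 1910)  # range stated in the post for the napp data


def first_name(author):
    """Return the first word after the comma in a 'Last, First' string."""
    if not author or "," not in author:
        return None  # no usable name: counted as "missing"
    after_comma = author.split(",", 1)[1]
    words = re.findall(r"[^\W\d_]+", after_comma)  # runs of letters only
    return words[0].lower() if words else None


def birth_year_range(pub_year, available):
    """Approximate birth years as publication year minus 50 to minus 30,
    pulling out-of-range values back to the nearest available year."""
    lo, hi = available
    start = min(max(pub_year - 50, lo), hi)
    end = min(max(pub_year - 30, lo), hi)
    return start, end


def infer_gender(author, pub_year, lookups):
    """Try each (lookup, available_years) pair in order; first hit wins.

    Each lookup is a hypothetical callable (name, start, end) returning
    'male', 'female', or None; it stands in for the R gender package.
    """
    name = first_name(author)
    if name is None:
        return "missing"
    for lookup, available in lookups:
        start, end = birth_year_range(pub_year, available)
        result = lookup(name, start, end)
        if result is not None:
            return result
    return "undetected"


# Toy stand-ins for the two data sources (not real data).
toy_napp = {"george": "male"}
toy_ipums = {"mary": "female"}
lookups = [
    (lambda n, s, e: toy_napp.get(n), NAPP_YEARS),
    (lambda n, s, e: toy_ipums.get(n), (1789, 1930)),  # assumed range
]

print(infer_gender("Eliot, George, 1819-1880", 1871, lookups))  # male
```

Returning "missing" and "undetected" as separate values keeps the two failure modes apart: one is a parsing problem, the other a gap in the name data.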
I then tallied up the number of works by men and women for each genre. (I did the tallying with Python; you can find the results of those tallies as CSVs here.) In addition to
male and female, there are two other categories here:
missing means a name was not provided (or, more precisely, was not detected by my script) in the HathiTrust metadata;
undetected means that the
gender package had no value for the “name” it was given (or, more precisely, whatever string it received from how I parsed the name). That is,
missing means no name and
undetected means that
gender had no association for the name. (There are also columns for each of these values normalized by the number of volumes in the dataset for that year). Without further ado, three area graphs representing the gender breakdown of authorship in each of the HathiTrust datasets (fiction, poetry, drama).
If you right click and open each graph in another tab, they should be a bit bigger.
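The tallies behind these graphs are simple to reproduce. The following is a minimal sketch, assuming each volume has already been reduced to a (year, category) pair; the function name and the `_norm` column suffix are illustrative choices, not the names used in the CSVs linked above.

```python
from collections import Counter, defaultdict

CATEGORIES = ("male", "female", "missing", "undetected")


def tally_by_year(rows):
    """rows: iterable of (year, category) pairs, one per volume.

    Returns {year: {category: count, category + '_norm': share}} where
    the share is the count divided by all volumes for that year.
    """
    counts = defaultdict(Counter)
    for year, category in rows:
        counts[year][category] += 1

    table = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        row = {}
        for cat in CATEGORIES:
            row[cat] = counts[year][cat]  # Counter gives 0 if absent
            row[cat + "_norm"] = counts[year][cat] / total
        table[year] = row
    return table


rows = [
    (1800, "female"), (1800, "male"), (1800, "undetected"), (1800, "female"),
    (1801, "missing"), (1801, "male"),
]
table = tally_by_year(rows)
print(table[1800]["female"], table[1800]["female_norm"])  # 2 0.5
```

Normalizing by the yearly total is what makes sparse early years look volatile: with only a handful of volumes, a single attribution moves the proportion a long way.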
The data before 1800 is sparse and so these graphs look a little volatile. The prevalence of
undetected in the fiction data before 1800, however, may reflect the lack of attribution common in the late eighteenth century. “Over 80 per cent of all novel titles published in the 1770s and 1780s were published anonymously,” James Raven claims in the introduction to the first volume of the two volume The English Novel 1770–1829: A Bibliographical Survey of Prose Fiction Published in the British Isles (41). (I’ll abbreviate that BSPF for the rest of the post).
In a tweet, Heather Froehlich asks, “What’s in those slices of undetected and missing texts.” Looking at the amended metadata file, it looks like there are ~11,000 records with either missing or
undetected gender (that’s ~10% of the dataset). The most frequently occurring titles in the
missing data are:
The New British novelist; The British novelists, Stories by American authors, The Harvard classics shelf of fiction, The German classics of the nineteenth and twentieth centuries, The International library of famous literature, The lady of the manor, The book of the thousand nights and one night, The book of the thousand nights and a night, The thousand and one nights, The Odyssey of Homer, Stories by English authors, The Bibliophile library of literature, art and rare manuscripts, The masterpiece library of short stories, Florence Macarthy,
So why is the author missing from these? Checking the full records, the most frequently occurring items in this list are multivolume collections of other works. The New British Novelists lists no author in the dataset’s
author metadata; the title appears in the fiction dataset 50 times. Checking the original page images, we see that this is a series, published starting in 1820, which collects different novels by major British novelists. It includes a range of major novels, many themselves multivolume works: Clarissa, Robinson Crusoe, Humphrey Clinker, and so on. The Harvard Classics Shelf of Fiction appears to be a similar case. Is there an existing literature on these sorts of collections and their role in reputation creation/maintenance? In the earlier period, there are titles (like The Infernal Wanderer) which simply lack an author; others (like Turkish Tales) lack an author in the dataset, but currently have one in HathiTrust (perhaps because this record has been updated since the dataset was exported); and quite a few don’t meet my naming convention. Works by Phalaris, [Madame d’] Aulnoy, [Mssr.] Scarron, [Mrs.] Manley, Voltaire, Virgil, and many others are “missing” because when I try to splice ‘em up (relying on a comma to separate first and last names), we get nothing. Some of these authors were referred to simply by a last name and title (Mrs. Manley) and this has entered the dataset as simply
undetected data; the most frequently occurring names are:
Bjørnson, Bjørnstjerne; Dostoyevsky, Fyodor; Orczy, Emmuska Orczy; Burgess, Gelett; Cullum, Ridgwell; Hearn, Lafcadio; Watanna, Onoto; MacManus, Seumas; Tagore, Rabindranath; Hemyng, Bracebridge; Gordon-Cumming, Roualeyn; Ritchie, Leitch
A look at the names is enough to guess why
gender likely had a problem with them. (There are sufficiently few names here (321 unique, individual, undetected names) that I am half tempted to put together a manual reconciliation for names and genders.) It also provides a clear illustration of the implicit cultural construction of “data.” These “undetected” names are largely non-Anglophone names, and so the attempt to infer one culturally mediated category (gender) gets complicated by the complexities of another one (nationality). Names that are undetected are not randomly distributed through the data but are disproportionately non-Anglophone.
To more clearly see the trends, let’s look at works published under names that we have identified as female across genres; first raw counts and then as a proportion of all works published per year.
The second graph is the interesting one. Among the genres, female authors are best represented in fiction, and least well-represented in drama. The trend in fiction, however, is odd—while poetry and drama show upward trends (poetry’s is slow and steady across the 19th century; drama’s rather sudden after 1900), fiction has a high point in the early nineteenth century where women represent a larger proportion of fiction writers than anywhere else in this data. At times, early in the data, half of the works of fiction in the dataset are written by a woman (more on this figure below). Yet, over the course of the nineteenth century this proportion diminishes. When the graph ends in 1922, women represent about a quarter of the authors of each of the three genres.
On Twitter, I suggested that in the normalized data for fiction by women above, one sees a decline in works by women. This may be consistent with the BSPF data (which, in its admittedly narrower slice, shows a decline from 1815 to 1830). To get some sense whether that’s a fair description, let’s isolate the fiction by women data, and add a rolling mean, with a window of 5.
At some point, one is reading Rorschach plots; but this plot seems to suggest two periods of downward trends, from 1805 to 1830, and then again from about 1885 to 1900. (That precipitous drop at the end is a function of the rolling average running out of data.)
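For readers who want to see what a rolling mean with a window of 5 does at the edges of a series, here is a minimal sketch. It assumes a centered window and leaves positions without a full window empty; the plot above may have been produced differently (for example, with a trailing window or with padding), which is the kind of choice that determines how the line behaves where the data runs out.

```python
def rolling_mean(values, window=5):
    """Centered rolling mean; positions without a full window get None."""
    if window % 2 == 0:
        raise ValueError("this sketch assumes an odd window size")
    half = window // 2
    out = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            out.append(None)  # not enough neighbours on one side
        else:
            chunk = values[i - half:i + half + 1]
            out.append(sum(chunk) / window)
    return out


yearly = [10, 20, 30, 40, 50, 60, 70]
smoothed = rolling_mean(yearly, window=5)
print(smoothed)  # [None, None, 30.0, 40.0, 50.0, None, None]
```

Leaving the edge positions empty avoids an artificial drop, at the cost of losing two years at each end of the series.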
Data from The English Novel: A Bibliographical Survey of Prose Fiction, 1770-1830
To get some sense about how reasonable these trendlines look, we might try to compare them to another source. I’ve already quoted the BSPF, which offers a portrait of the authorship of novels between 1770 and 1830. The BSPF has totals based on both what is stated on title pages and in prefaces, as well as more comprehensive totals based on what the editors were able to infer about the authorship of works from other sources. For instance, if a work states that it is “By the author of Waverley,” one can make additional inferences about the author’s gender. There turns out to be a significant discrepancy between what a title page, or preface, states, and what one may be able to infer about the gender of an author with just a little more knowledge. The majority of novels in this period were published without a clear statement of authorship. But if we look at the more comprehensive portrait of authorship that the BSPF offers, the story is a little different.
The graph above summarizes the trends in the inferred data. It has three distinct moments—a predominance of “anonymous” or unattributed works until around 1800; the predominance of women writers during the first decades of the nineteenth century, and concluding with what Peter Garside calls “the male invasion of mainstream fiction” (2:63). Garside notes, for instance, “the publication of Jane Austen’s novels was achieved not against the grain but during a period of female ascendancy” (2:75). This data suggests that authors of novels were most likely to be, in this order, anonymous, women, and then men.
The three waves visible in the graph above, however, are based on the inferences that the editors of the BSPF made to ascertain the gender of the authors in their bibliography. The metadata available on title pages (of the sort that’s compiled in the HT metadata) often lacks information that might otherwise be available to most readers.
Occasionally, full author names are found within a novel—as in a signed Preface, or through the inclusion of an engraved portrait or additional title-page—when the main title-page offers no direct authorial description. Augusta Ann Hirst’s Helen; or Domestic Occurences (1807:28), for example, carries only the bare title on its title-page, though the full author’s name appears immediately afterwards in a Dedication to the Countess Fitzwilliam, and the author’s name later featured directly on the title-page in the Minerva reissue of 1808. (2:68)
HathiTrust has a copy of Helen, or, Domestic Occurrences: A Tale (though it is not included in the fiction dataset). And indeed its title page lacks the author’s name, though one can discover it in the dedication.
Through the magic of librarians, the HathiTrust record, however, includes the correct author and even notes that its “Dedication signed.”
Looking only at what one can infer about the authorial gender of works from the information available on the title page, most works would be “anonymous,” even if (some) contemporary readers may have been able to see through that anonymity. Note the difference between the trends in authorship when we look only at information available from examining “proper names from title-pages and prefaces only” with the inferred conclusion (all this data is taken from the wonderfully comprehensive BSPF).
The inferred trends for both male and female authorship are significantly higher than their stated counterparts (these terms, inferred and stated are my clumsy language; for anyone interested, The English Novel, 1770–1830 really is an invaluable, if imposingly weighty, resource). There are perhaps two interesting trends here. The decrease in anonymous authorship at the start of the nineteenth century coincides with a rise in female authorship; female authorship is more public than its male counterpart. After 1820 one sees a sharp rise in male authorship—which is itself a rise in anonymous male authorship.
Comparing HathiTrust and BSPF
Using the method described above to infer authorial gender in the HathiTrust dataset should produce results similar to the raw, stated totals in the BSPF data. There are, though, a few differences to account for first. For one, James Raven’s and Peter Garside’s introductions to the two volumes of the Bibliographical Survey of Prose Fiction offer summary counts of “New Novels” but the HathiTrust data represents books owned by libraries. To be able to compare the BSPF data with the HT data, we need to eliminate reprints (we only want new novels) and we need to count works, not books (so, multivolume works should be counted as a single work). I’ve tried to do this rather crudely by creating for each work in the HT fiction dataset an “ID” which consists only of a work’s title and its author. (Using title alone as an ID could, in theory, lead to a problem if two works have the same title, which is actually quite common for multivolume sets, like The Novels of Walter Scott and The Novels of Charles Dickens, and similar.) My script loops over the works in the metadata summary, counting a work as “new” only if we haven’t seen its ID before. Because we look only at title and author (and not
enumcron), we also only count one volume from a multivolume work (though, as I mention above, this problem is quite a bit thornier than I’m allowing here).
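The "seen it before" logic can be sketched as follows. This is an illustration rather than the script used here: the field names (`title`, `author`, `date`) are hypothetical, and both the case and whitespace normalization and the sort by date (so that the earliest copy is the one kept) are assumptions added for the example.

```python
def work_id(title, author):
    """Crude work ID built from title and author only (no enumcron)."""
    def norm(text):
        # Lowercase and collapse whitespace; None becomes an empty string.
        return " ".join((text or "").lower().split())
    return (norm(title), norm(author))


def new_works(records):
    """Keep the first record seen for each (title, author) ID.

    records: iterable of dicts with hypothetical 'title', 'author', and
    'date' keys. Sorting by date is an assumption of this sketch.
    """
    seen = set()
    kept = []
    for rec in sorted(records, key=lambda r: r["date"]):
        wid = work_id(rec["title"], rec.get("author"))
        if wid not in seen:
            seen.add(wid)
            kept.append(rec)
    return kept


records = [
    {"title": "Waverley", "author": "Scott, Walter", "date": 1829},  # reprint
    {"title": "Waverley", "author": "Scott, Walter", "date": 1814},  # v.1
    {"title": "Waverley ", "author": "Scott, Walter", "date": 1814},  # v.2
    {"title": "Emma", "author": "Austen, Jane", "date": 1816},
]
unique = new_works(records)
print([(r["title"], r["date"]) for r in unique])
# [('Waverley', 1814), ('Emma', 1816)]
```

Because the ID ignores volume information, a reprint and a second volume are treated identically: both collapse into the first record with that title and author.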
Second complication: geography. The HT dataset is culled from American libraries, whereas the BSPF data is focused on works published in “the British Isles.” Well, that raises an interesting question (digression ahead!): where were fiction volumes in the HathiTrust dataset published?
As this graph makes clear, most of the works in the HathiTrust dataset were published in 5 places (heck, many were published in one place). Those labels along the x-axis are MARC country codes; so the top publication locations are: New York State (
nyu), England (
enk), Massachusetts (
mau), No place/Unknown (
xx), Pennsylvania (
pau), Illinois (
ilu), Scotland (
gw). This summary, however, represents the entire HT fiction dataset, from 1700 to 1922. Let’s look at just the portion covered by the BSPF, between 1770 and 1830:
For this period the top two locations are England and Scotland. It seems unlikely, therefore, that any differences between the BSPF and the HT datasets could be attributed to the different geographical coverage of the two datasets. But, just to be sure let’s extract only the works from the fiction dataset published in England and Scotland and Ireland between 1770 and 1830, and compare the gender breakdown one last time.
To create this subset of the HT summary metadata, I’ve used some Python that tries to more closely match the parameters of the BSPF data: it covers only works published between 1770 and 1830, published in England, Scotland, or Ireland, and it tries to represent only “new works.” The Python that did this is here; the summary of the data is here.
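As a rough idea of what the date and place filter involves, here is a minimal sketch; it is not the linked script. The field names are hypothetical, and of the three MARC country codes used, only `enk` appears in this post; `stk` (Scotland) and `ie` (Ireland) are taken from the MARC country code list.

```python
# MARC country codes for England, Scotland, and Ireland.
BRITISH_ISLES = {"enk", "stk", "ie"}


def in_bspf_scope(rec, start=1770, end=1830, places=BRITISH_ISLES):
    """True if a record falls inside the BSPF's dates and places.

    rec: dict with hypothetical 'date' (int) and 'place' (MARC code) keys.
    """
    place = (rec.get("place") or "").strip()
    return start <= rec["date"] <= end and place in places


volumes = [
    {"title": "Waverley", "date": 1814, "place": "stk"},
    {"title": "The Spy", "date": 1821, "place": "nyu"},
    {"title": "Emma", "date": 1816, "place": "enk"},
    {"title": "Jane Eyre", "date": 1847, "place": "enk"},
    {"title": "Castle Rackrent", "date": 1800, "place": "ie "},
]
subset = [v for v in volumes if in_bspf_scope(v)]
print([v["title"] for v in subset])
# ['Waverley', 'Emma', 'Castle Rackrent']
```

Stripping whitespace from the place code guards against padded values, since two-letter codes are often stored in a three-character field.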
We can get some sense of how the HT data compares to BSPF by plotting them together.
The data for female authorship in the two datasets (or rather, in the BSPF data and my weird manipulation of the HathiTrust data) seems, to my layman’s eye, surprisingly consistent. Of course, recalling the difference (often of between 10 and 20 percentage points) between authorial gender as determined by consulting title pages/prefaces and what the BSPF editors were able to infer, one might suggest (at least for the period 1770–1830) that the summary I offered above significantly underrepresents female authorship.
The data for male and anonymous authorship is much less consistent; BSPF reports more anonymous texts than my analysis of the HT metadata, while the HT data reports more male writers. I basically don’t understand why this would be so: I would have expected, if anything, the opposite. The
anonymous line for the HT data in the above graph combines both
missing authors and
undetected, treating as anonymous anything that couldn’t be coaxed into another category; if anything, it should overrepresent anonymous writers. Perhaps this reflects something about the underlying data; or perhaps something about the way I carved up first names. For now, I just don’t know. So, here ends our amble through the data.
Woolf, Virginia. A Room of One’s Own.
Raven, James et al. The English Novel 1770-1829: A Bibliographical Survey of Prose Fiction Published in the British Isles. 2 vols. New York: Oxford University Press, 2000. Print.