Supported by the Arts and Humanities Research Council


Attribution Corpora: an overview of text preparation (July 2019)

The new Cambridge Edition of the Works of Aphra Behn (forthcoming) is informed by computational stylistic approaches to Behn’s style and authorship, combined with traditional literary approaches to attribution.

A key requirement for computational stylistic investigation is the corpus of texts for analysis: works by Behn, the dubia, and writings by her contemporaries for comparative analysis. Decisions made in the compilation and preparation of those texts affect the linguistic material that is available for computational analysis, the kinds of results that are obtained, and thus the ways in which the findings can be compared with, and interpreted against, other scholarship (whether more traditional literary analysis or other computational work). Due to the importance of transparency and replicability, the following discussion provides a brief outline of the development of these corpora, the challenges and rationale underlying our decisions, and our future intentions for the further development and distribution of the texts. The focus is on the treatment of the texts, rather than the selection of their contents (watch out for a future update on the latter!).

Behn’s drama corpus (created in 2016):

Behn’s dramatic works are relatively plentiful: sixteen plays with secure attribution (dating from 1670-1690), alongside the plays of dubious attribution. There are no manuscript originals and – in most cases – only one lifetime edition of the printed text. The original corpus of Behn’s drama was created in 2016. The raw texts were mainly sourced from EEBO-TCP. Where the digital texts were not available, they were manually keyed by the General Editors.

The computational approach to style and authorship separates the full texts into their constituent linguistic parts – typically individual words, but this can also be word sequences (n-grams) or character sequences – for quantitative analysis. In our investigation of Behn’s works, we use the software package Intelligent Archive, which relies on XML tags to identify the parts of a text to include in analysis (see screenshot, below). This entails that features such as stage directions and speaker tags can be automatically excluded from the word counts for the dramatic texts. The unwanted paratextual material, including prologues, epilogues and songs, is placed in the TEI header, as IA does not include header information when extracting the lexical data. All of the dramatic texts are encoded using TEI Lite protocols, prepared using Oxygen XML Editor. The XML mark-up is therefore not as extensive and detailed as that used in, say, a digital edition, but it is sufficient for the requirements of the computational tests.

TEI Lite XML for Behn’s The Rover

Because the computational stylistic approaches used in our investigations are word-based, it is necessary to regularise the spelling of texts first printed in the latter half of the seventeenth century. Behn’s drama was regularised using VARD (a program developed by Alistair Baron) by Mel Evans and an RA, Georgia Priestley. In the first pass through Behn’s drama, back in 2015-16, this was done before the XML mark-up of the rest of the text. Unfortunately, this order of preparation makes recovering the original spelling of the text (after mark-up) somewhat convoluted; the process has been changed for subsequent texts and a future iteration of the Behn corpus will address this issue prior to its public release (see below).

Our regularisation choices reflect practices employed in previous studies of style and authorship, particularly in relation to early modern plays, as well as taking into account the particularities of the printed texts themselves. In brief, the regularised Behn drama corpus expands contractions (e.g. don’t to do not) on the basis that these may be the decision of the compositor, rather than Behn herself; unfortunately, there are no manuscripts available to check her preferences in a dramatic context. Other principles include regularising second-person verb endings (e.g. knowst to know), but preserving pronouns (you, thou); this decision reflects the potential skewing effect of pronouns in authorship analysis. Whilst pronouns can be treated as stopwords and ignored in the tests, the morphemic differentiation in verb-endings is more difficult to exclude comprehensively in this way, hence their regularisation at the text preparation stage.
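The principles above can be illustrated with a minimal Python sketch. It is not VARD and does not reproduce the project’s actual rule set: the two small lookup tables and the function name regularise are hypothetical, chosen only to show contractions being expanded and second-person verb endings being normalised while pronouns such as thou are left untouched.

```python
import re

# Hypothetical lookup tables for illustration only; a real VARD
# configuration is far larger and is curated by hand.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not"}
VERB_ENDINGS = {"knowst": "know", "knowest": "know", "lovest": "love"}

# A token is a run of letters and apostrophes (straight or typographic).
TOKEN = re.compile(r"[A-Za-z'’]+")


def regularise(text: str) -> str:
    """Expand contractions and normalise second-person verb endings."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        # Look words up in lower case with a straight apostrophe.
        key = word.lower().replace("’", "'")
        new = CONTRACTIONS.get(key) or VERB_ENDINGS.get(key)
        if new is None:
            return word  # anything not listed (e.g. 'thou') is unchanged
        # Preserve an initial capital from the original spelling.
        return new.capitalize() if word[0].isupper() else new

    return TOKEN.sub(replace, text)


example = regularise("Thou knowst I don’t love him")
```

Keeping the rules in plain lookup tables makes each regularisation decision visible and easy to revise, which matters when different corpora need to be brought into line with one another.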

Spelling regularisation is not a neutral process, as it necessitates judgements about how orthography and the lexical unit (a word) are conceived. There is no standard way of regularising early modern spelling – in part because it is so variable across time, genre and medium (i.e. print and manuscript) – which entails that comparisons between corpora are inevitably affected by different regularisation principles. The purpose of the regularised text (e.g. for undergraduate study vs. computational analysis) also shapes the decisions made. Thus, the potential significance of variation in regularisation is something we are addressing in subsequent iterations of the corpora and associated analyses.

Behn’s Drama Corpus (2018) and the contemporary Restoration drama corpus (2017-8):

In our current text preparation process (from 2017 onwards) the XML is done first, and the VARDing (spelling regularisation) is done second. This means that we have an original-spelling text to work from, with full mark-up, as well as versions with regularised spelling (with and without regularisation tags). This work continues to be undertaken by Mel Evans and postdoctoral research associate Alan Hogarth. At the time of writing there are 80 plays in the comparison corpus. The majority are works by Behn’s contemporaries (e.g. Thomas D’Urfey, John Dryden, Edward Ravenscroft) but it also includes those who preceded her – and provided source materials – such as Thomas Killigrew and Richard Brome, as well as those who succeeded Behn, some of whom were acquainted with her works, including Mary Pix and Charles Gildon. The temporal breadth is important both to identify and account for changes in theatrical fashions and developments in authorial style over time, as they pertain to Behn and works attributed to her.

Many of the seventeenth-century play texts were taken from EEBO-TCP, with the transcriptions checked and edited for accuracy with the base-text. Some were generously shared by the Visualising English Print (VEP) team, which meant that the dramatic texts were already regularised using VARD. The VEP spellings were updated to follow the regularisation system used in the Behn corpus. Other plays for earlier decades have been kindly provided by Hugh Craig; again, some differences in formatting and spelling regularisation meant that further editing was required to provide greater uniformity with the Behn corpus and other texts.

In summer 2018, the Behn corpus and a sub-set of the contemporary drama were updated to produce versions with regularised interjections (a linguistic category incorporating a lexical unit that stands as a discrete discourse unit such as oaths and exclamations); for instance, s’wounds and zounds are regularised to zounds. This enables a more robust comparison of this facet of Behn’s language with that of her contemporary dramatists (Evans, under review). These versions of the plays exist alongside the previous iteration.

Other changes have been made to the 2016 corpus of Behn’s drama, reflecting our findings, the developments in analytic approach and the software used. It appears that Stylo for R, for example, reads content in the TEI header (even when ‘xml plays’ is selected), so plain text versions of the drama corpora were made using XPath to extract only the dialogue for use with this software package.
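As a rough illustration of this kind of extraction, the Python sketch below pulls only the spoken text out of a small TEI-like document using the standard library’s ElementTree, which supports a limited XPath subset. The sample markup is invented for the example and is not the project’s encoding; it assumes un-namespaced element names (speaker, stage, teiHeader), whereas a namespaced TEI file would need the namespace added to each tag.

```python
import xml.etree.ElementTree as ET

# Elements whose own content is left out of the plain-text version.
EXCLUDED = {"speaker", "stage"}

# Invented sample markup (an assumption, not the project's files).
SAMPLE = """<TEI>
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <sp><speaker>Valeria</speaker>
      <p>Ha, ha, ha <stage>Laughing.</stage> I laugh to think how thou art fitted with a Lover</p>
    </sp>
  </body></text>
</TEI>"""


def spoken_text(element: ET.Element) -> str:
    """Collect text under an element, skipping excluded elements.

    The tail (text following an element's closing tag) is always kept,
    so dialogue that continues after a stage direction is not lost.
    """
    parts = []
    if element.tag not in EXCLUDED:
        if element.text:
            parts.append(element.text)
        for child in element:
            parts.append(spoken_text(child))
    if element.tail:
        parts.append(element.tail)
    return "".join(parts)


def extract_dialogue(xml_string: str) -> str:
    """Return the dialogue of the body as whitespace-normalised plain text."""
    root = ET.fromstring(xml_string)
    body = root.find("./text/body")  # the header is never visited
    return " ".join(spoken_text(body).split())


dialogue = extract_dialogue(SAMPLE)
```

Because only the body is searched, anything placed in the header is excluded regardless of how the downstream software treats header content.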

We appreciate that the formatting and regularisation processes change the complexion of the data, potentially in quite significant ways in terms of the lexical items available for quantitative analysis. For example, don’t and do not in a corpus entail a greater variety of forms than if all texts are regularised to do not; in the latter scenario, the contracted form is removed from the orthographic repertoire, which might boost (artificially, in a sense) the number of not forms in the corpus. It will be interesting to establish the impact the different regularisation decisions have on the computational results, and this is an aspect we are presently investigating.
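The effect can be made concrete with a small count. The sentence below is invented for the illustration rather than drawn from the corpus, and the simple string replacement stands in for the full regularisation process.

```python
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lower-cased, whitespace-separated word forms."""
    return Counter(text.lower().split())


# Invented example sentence containing both variants.
original = "I don't know and I do not care"
regularised = original.replace("don't", "do not")

before = word_counts(original)
after = word_counts(regularised)

# Number of distinct forms and frequency of 'not' in each version.
summary = {
    "types_before": len(before),
    "types_after": len(after),
    "not_before": before["not"],
    "not_after": after["not"],
}
```

In this toy case the regularised version has one fewer distinct form and twice as many instances of not, which is the kind of shift that could feed through into word-frequency tests.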

Workflow (2019):

Our work on the drama corpus is now informing our preparation of other texts for computational analysis. This includes works by Behn, the dubia, and the writings of her contemporaries in the genres of correspondence, prose and poetry. The workflow for the preparation of these texts is as follows:

  • Source the text in digital format (plain text) e.g. EEBO-TCP or an existing transcription from other reputable sources. When no reliable transcription is available, the text is manually keyed by the editorial team.
  • Add TEI Lite XML tags to the original spelling text
  • VARD the marked-up files, including the following principles:
    Expand contractions, regularise interjections, regularise second-person verb endings (e.g. knowest > know)
  • Use the XML version to create a stripped, plain-text version of the main text.

Iterations are saved at points 2, 3 and 4.

We will provide more information about these new corpora later in 2019.

Future Plans (2020-):

Following the conclusion of the E-ABIDA project, the plain text files for the complete drama corpus (including Behn, contemporary and dubia works) will be made available in both original and regularised spelling via the project website. It is our hope that other researchers will use these resources to enrich the study of Restoration drama and literature.

Looking at the small words

The Quantitative Big Picture

Computational approaches to authorial style and attribution are predominantly quantitative: the various techniques rely on the ability of computers to identify patterns in large datasets (i.e. frequencies of words across the plays of multiple authors), and for these patterns to align with the authorial style of one author or another. In our investigation of Behn’s authorial style, Alan Hogarth and I have been using various exploratory testing methods (Zeta, T-tests, Principal Components Analysis) to establish how distinct Behn is as a dramatic writer when compared to her contemporaries (other genres will follow shortly). The results suggest that Behn does, indeed, have a linguistic profile that stands in contrast to her contemporaries. Whilst a future blog post will talk more about these high-level patterns and comparisons, the present discussion will introduce some of the specific words that characterise Behn’s authorial style, on the basis of their contrastive frequencies in her works versus those of contemporary playwrights. This work introduces a qualitative dimension to the investigation, with the objective being to better understand how and why some words appear as marker-words for one author, rather than another.

Figure 1: PCA results for 33 Restoration comedies using 100 most frequent function words

Figure 1 shows a Principal Components Analysis of the 100 most frequent function words in a corpus of 33 comedies written by five Restoration dramatists, including Behn. This statistical sorting hat organises the dependent variables (the plays) according to the most dominant continuities or trends of the results for the independent variables (the words). The right-hand side of the line highlights the area where most of Behn’s comedies cluster, largely distinct from the comedies of Dryden, D’Urfey, Shadwell and Ravenscroft. The plays of D’Urfey (green crosses) intersect with some of Behn’s works, suggesting stylistic similarities; the plays of Dryden and Ravenscroft, by comparison, are much more distinct. Looking at the linguistic features that underlie this distribution of plays helps to explain these points of overlap between authors.
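For readers unfamiliar with the technique, the following Python sketch shows the general idea on a toy scale. It is not the software or the data used for Figure 1: the four short “texts” and the five-word list are invented stand-ins, and the first principal component is found by simple power iteration on the covariance matrix rather than a full decomposition.

```python
import math
from collections import Counter

# Invented toy texts and a tiny hypothetical word list (assumptions).
TEXTS = {
    "A1": "oh oh ha thou the and",
    "A2": "oh ha ha thee the and",
    "B1": "the the and and of oh",
    "B2": "the of of and the and",
}
VOCAB = ["oh", "ha", "the", "and", "of"]


def rel_freqs(text, vocab):
    """Relative frequency of each listed word in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def first_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance between every pair of words.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration from a non-uniform start vector.
    vec = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Project each centred text onto the component.
    scores = [sum(c[j] * vec[j] for j in range(m)) for c in centred]
    return vec, scores


rows = [rel_freqs(t, VOCAB) for t in TEXTS.values()]
loadings, scores = first_component(rows)
```

The scores place each text along the first component and the loadings show which words pull in which direction; the sign of a component is arbitrary, so only the relative positions are meaningful.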

Figure 2 shows the same distribution of plays, now with the words underlying their organisation. Within the Behn area of the chart, we find second-person pronoun forms (thou, thee, thy), connectives (and, so) and interjections (oh, ha). The rest of this post will discuss the functionality and distribution of interjections in Behn’s plays, compared with those of her contemporaries, and consider what this information may suggest about her approach to dramatic writing.


Figure 2: PCA results for 33 Restoration comedies using 100 most frequent function words, showing distributional relationship between word frequencies and play texts.


Interjections are powerful small words. They prototypically signal the emotional attitude of the speaker towards their situation, and this affective function has implications for their use in dramatic language. They can be grouped into two types: Primary forms, which function solely as interjections and generally have an atypical phonological and graphological structure (e.g. ugh), and Secondary forms, which are words transferred from other word classes, including euphemistic and taboo words (e.g. damn, blimey).

Interjections have been discussed by various scholars of the history of the language. Corpus-based studies show an increase in forms and frequencies of use in drama and other literary genres over the early modern period. This is thought to reflect a cultural shift towards the subjective expression of emotion, which was conveniently encapsulated through the use of interjection forms (see the work of Irma Taavitsainen (1995, 1997) and Culpeper and Kytö (2010)).

Whilst interjections can convey a speaker’s state of mind (e.g. oh may signal a speaker’s shock and surprise), they also have other pragmatic meanings. Interjections can be conative; that is, they instigate a hearer’s reaction (verbal or behavioural) to the speaker’s utterance. They can also organise the discourse, signalling the end of a turn, for example.

In the PCA results, Behn’s authorial style was differentiated from that of the four comparison authors on the basis of the frequencies of oh and ha.  These interjections are the two most-frequent forms in the comedy corpus as a whole (i.e. the most common forms in the 33 comedies combined). So why are they particularly significant in Behn’s plays? Is she using them differently when compared with the four male dramatists?


Oh has an undisputed position as the top interjection in the English language. It is the most frequent form in the extant literary canon for the early modern period, and retains this position in corpus-based analyses of present-day English (for discussion of these distributional trends, see Taavitsainen 1995; Norrick 2009). In Behn’s drama it has three main functions:

  • speaker-oriented: to signal primarily negative emotions and comprehension
  • addressee-oriented: to attract attention and direct a proposition, often combined with a vocative e.g. ‘oh Alphonso’
  • audience-oriented: to introduce an expository frame that licenses the narration of events on stage e.g. ‘oh I am weary of this’

These functions are comparable with previous surveys of early modern English drama (e.g. Culpeper and Kytö 2010). Indeed, Behn’s use of oh does not appear particularly distinctive when compared to her contemporaries: statistically, the frequency of oh in Behn’s comedies is not significantly different to that of the four male writers (Student’s t-test, p > 0.05). This suggests that, quantitatively, oh is a relatively consistent feature across Restoration comedy, which fits its status as the prototypical interjection in English. Its placement in the PCA chart therefore suggests that these pragmatic functions intersect with other stylistic choices associated with the 100 most frequent function words, such as the second-person pronouns and connectives, and that it is this co-textual environment which distinguishes oh in Behn’s linguistic style from that of other authors. We plan on looking further into this feature using corpus linguistic techniques, such as collocate analysis.


Ha is the second most frequent interjection in the 33-comedy corpus overall. In previous studies of early modern texts (e.g. Culpeper and Kytö 2010), ha shows a similar functional range to oh, used for speaker, addressee and textual-oriented functions, including the signalling of laughter in play texts. However, in the comedy corpus, ha shows significant differences in its distribution and use at an authorial level, particularly when compared with oh. Figure 3 shows the frequency of ha for each play for each author. Of these, D’Urfey uses ha significantly more frequently than other authors (p < 0.05). Behn’s usage contrasts significantly with Dryden’s, as does Shadwell’s. Therefore, the frequency profile for ha suggests that D’Urfey is a high-frequency user; Behn, Shadwell and Ravenscroft are mid-frequency users; and Dryden uses ha the least.

Figure 3: Frequency of HA in comedies of Behn and four contemporary writers.

This quantitative picture is very different from that of oh. A qualitative perspective sheds some light on what underpins these frequencies. The five plays representing Dryden include only two examples of ha, both of which occur in his adaptation of Molière’s Amphitryon (1668; Dryden 1690). Both are used for a conative function to elicit a response from the hearer, found in heated exchanges discussing sexual misdemeanours (arising from comic misunderstandings):

Amphitryon: Made haste to Bed: Ha, was it not so? Go on –  (Aside) And stab me with each Syllable thou speak’st (p.28)

Yet it seems that, for Dryden, the other functions associated with ha in the period were achieved through other methods, or were not required.

Conversely, the high frequency of ha in D’Urfey’s plays correlates with a broader functional base. The interjection is used for addressee-oriented functions, similar to Dryden, but also for speaker-oriented expression, such as signalling frustration. Its most prominent function in D’Urfey’s plays, however, is to signal laughter. The forms occur in strings of three to five forms (five being the longest strings of the five authors). In the next stage of analysis, these laughter strings will be categorised as a form distinct from other uses of ha, to see how that affects the distribution of the interjection as a marker of authorial style.

Behn’s profile is similar to D’Urfey’s in her use of ha for speaker-oriented functions (e.g. comprehension, perception) and to mark laughter (strings of three and four forms). However, Behn does not make extensive use of the addressee-oriented function seen in the plays of Dryden and D’Urfey, suggesting her repertoire did not include, or require, the interjection for this purpose. This finding has a potentially diagnostic application when investigating her dramatic dubia.

Valeria: Ha, ha, ha – I laugh to think how thou art fitted with a Lover, a fellow that I warrant loves every new Face he sees. (The Rover, p.30)
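One way such laughter strings might be separated from other uses of ha is sketched below in Python. This is an assumption about how the categorisation could be automated, not the project’s method: the function name classify_ha and the threshold of three forms (taken from the string lengths mentioned above) are illustrative choices, and the two test sentences are the examples quoted in this post.

```python
import re

# One 'ha', optionally followed by further 'ha' forms separated by
# commas and/or spaces. Word boundaries stop 'haste' from matching.
HA_RUN = re.compile(r"\bha\b(?:[\s,]+ha\b)*", re.IGNORECASE)
HA_FORM = re.compile(r"\bha\b", re.IGNORECASE)

VALERIA = ("Ha, ha, ha – I laugh to think how thou art fitted "
           "with a Lover")
AMPHITRYON = "Made haste to Bed: Ha, was it not so? Go on"


def classify_ha(text: str, min_run: int = 3):
    """Split runs of 'ha' into laughter strings and other uses.

    Returns two lists holding the length (number of forms) of each run:
    runs of at least min_run forms count as laughter, shorter ones as
    other uses. The threshold is an assumption for illustration.
    """
    laughter, other = [], []
    for match in HA_RUN.finditer(text):
        length = len(HA_FORM.findall(match.group(0)))
        if length >= min_run:
            laughter.append(length)
        else:
            other.append(length)
    return laughter, other


valeria_result = classify_ha(VALERIA)
amphitryon_result = classify_ha(AMPHITRYON)
```

Counting runs rather than individual forms means that a single burst of laughter contributes one event to the tally, however many times ha is repeated within it.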


Interjections and Authorial Style

The functional similarities and differences identified in relation to oh and ha provide an important perspective on the identification of these items as marker words in the quantitative analysis of Behn’s dramatic style, and that of her contemporaries. In this post, I’ve discussed only a few examples of the rich dataset that represents the interjectional forms and functions found in the comedy corpus. In the next steps, we will develop these findings to help inform our investigation of the dubia plays; in particular, the forms and functions of interjections (especially oh) provide an instructive perspective on the suspected editorial interference of Charles Gildon within Behn’s posthumously published play The Younger Brother (1696). We also plan on exploring interjections across genres, to establish the extent to which authors show continuities in their stylistic choices when working within different generic conventions, and the ramifications this may have on investigating authorship in mixed-genre datasets.


  • Culpeper, Jonathan, and Merja Kytö. 2010. Early Modern English Dialogues: Spoken Interaction as Writing. Studies in English Language. Cambridge, UK ; New York: Cambridge University Press.
  • Norrick, Neal R. 2009. ‘Interjections as Pragmatic Markers’. Journal of Pragmatics 41 (5): 866–91.
  • Taavitsainen, Irma. 1995. ‘Interjections in Early Modern English: From Imitation of Spoken to Conventions of Written Language’. In Historical Pragmatics: Pragmatic Developments in the History of English, edited by Andreas H Jucker, 439–66. Amsterdam; Philadelphia: John Benjamins Publishing Company.
  • ———. 1997. ‘Genre Conventions: Personal Affect in Fiction and Non-Fiction in Early Modern English’. In English in Transition: Corpus-Based Studies in Linguistic Variation and Genre Styles, edited by Matti Rissanen, Merja Kytö, and Kirsi Heikkonen. Topics in English Linguistics 23. Berlin ; New York: Mouton de Gruyter.

Aphra Behn in Lampeter

A few weeks ago, in late August, I went on one of the most pleasant and thought-provoking research trips I’ve taken recently. Lampeter is a small Welsh town set amid beautiful rolling hills (then somewhat parched after the heatwave of the early summer). It also has a historic University library, founded in the early nineteenth century, which includes among its collections a small but fascinating corpus of early Aphra Behn materials.

Among Lampeter’s Behn holdings is a copy of Aesops Fables (1687). Like many of the library’s early printed books, Aesops Fables was acquired through donation from the educational philanthropist Thomas Phillips (1760-1851). Phillips, who had also worked as an East India Company surgeon, probably did not set out to collect Aphra Behn but procured her works almost by accident, in the course of adding to his ever-growing library. If he was attracted by anything in particular in Aesops Fables, it is likely not to have been Aphra Behn’s poetic versions of the fables – physically, a small component of a large and complex volume – but rather Francis Barlow’s finely executed animal engravings, which are now recognised as a landmark achievement in seventeenth-century book illustration. Although Aesops Fables, compared with many early modern books, is not rare – the British Library alone holds four copies – Lampeter’s copy is remarkably well-preserved, including all the prefatory material in what seems to have been the original order. It does not, however, include plate 17 from the illustrated ‘Life of Aesop’, an image excised from several extant copies on grounds of supposed obscenity (it depicts Aesop’s master’s wife with her buttocks exposed). Whether Aphra Behn wrote the (unattributed) four-line poems accompanying each of the ‘Life’ illustrations, as well as the six-line fables that Barlow explicitly attributed to her in his preface, is among the questions that the Cambridge Behn project hopes in time to be able to answer.

Section of Plate 17. Image from Early English Books Online.

Lampeter’s other Behn holdings comprise political poems: two dating from 1685 (her elegy on Charles II and consolatory poem to Catherine of Braganza), and two from 1689 (A Congratulatory Poem to Her Sacred Majesty Queen Mary, upon her arrival in England and A Pindaric Poem to the Reverend Doctor Burnet). The poem to Gilbert Burnet – propagandist for William of Orange, and future Bishop of Salisbury – is a rare survival, one of only seven copies now known to be extant. I regret to report that I may have been the first person in its 329-year history actually to read the Lampeter witness, as its pages had never been cut.
