Editing Aphra Behn in the Digital Age

Supported by the Arts and Humanities Research Council

Attribution Corpora: an overview of text preparation (July 2019)

The new Cambridge Edition of the Works of Aphra Behn (forthcoming) is informed by computational stylistic approaches to Behn’s style and authorship, combined with traditional literary approaches to attribution.

A key requirement for computational stylistic investigation is the corpus of texts for analysis: works by Behn, the dubia, and writings by her contemporaries for comparative analysis. Decisions made in the compilation and preparation of those texts affect the linguistic material that is available for computational analysis, the kinds of results that are obtained, and thus the ways in which the findings can be compared with, and interpreted against, other scholarship (whether more traditional literary analysis or other computational work). Due to the importance of transparency and replicability, the following discussion provides a brief outline of the development of these corpora, the challenges and rationale underlying our decisions, and our future intentions for the further development and distribution of the texts. The focus is on the treatment of the texts, rather than the selection of their contents (watch out for a future update on the latter!).

Behn’s drama corpus (created in 2016):

Behn’s dramatic works are relatively plentiful: sixteen plays with secure attribution (dating from 1670-1690), alongside the plays of dubious attribution. There are no manuscript originals and – in most cases – only one lifetime edition of the printed text. The original corpus of Behn’s drama was created in 2016. The raw texts were mainly sourced from EEBO-TCP. Where the digital texts were not available, they were manually keyed by the General Editors.

The computational approach to style and authorship separates the full texts into their constituent linguistic parts – typically individual words, but this can also be word sequences (n-grams) or character sequences – for quantitative analysis. In our investigation of Behn’s works, we use the software package Intelligent Archive, which relies on XML tags to identify the parts of a text to include in analysis (see screenshot, below). This entails that features such as stage directions and speaker tags can be automatically excluded from the word counts for the dramatic texts. The unwanted paratextual material, including prologues, epilogues and songs, is placed in the TEI header, as IA does not include header information when extracting the lexical data. All of the dramatic texts are encoded using TEI Lite protocols, prepared using Oxygen XML Editor. The XML mark-up is therefore not as extensive and detailed as that used in, say, a digital edition, but it is sufficient for the requirements of the computational tests.

TEI Lite XML for Behn’s The Rover

Because the computational stylistic approaches used in our investigations are word-based, it is necessary to regularise the spelling of texts first printed in the latter half of the seventeenth century. Behn’s drama was regularised using VARD (a program developed by Alistair Baron) by Mel Evans and an RA, Georgia Priestley. In the first pass through Behn’s drama, back in 2015-16, this was done before the XML mark-up of the rest of the text. Unfortunately, this order of preparation makes recovering the original spelling of the text (after mark-up) somewhat convoluted; the process has been changed for subsequent texts and a future iteration of the Behn corpus will address this issue prior to their public release (see below).

Our regularisation choices reflect practices employed in previous studies of style and authorship, particularly in relation to early modern plays, as well as taking into account the particularities of the printed texts themselves. In brief, the regularised Behn drama corpus expands contractions (e.g. don’t to do not) on the basis that these may be the decision of the compositor, rather than Behn herself; unfortunately, there are no manuscripts available to check her preferences in a dramatic context. Other principles include regularising second-person verb endings (e.g. knowst to know), but preserving pronouns (you, thou); this decision reflects the potential skewing effect of pronouns in authorship analysis. Whilst pronouns can be treated as stopwords and ignored in the tests, the morphemic differentiation in verb-endings is more difficult to exclude comprehensively in this way, hence their regularisation at the text preparation stage.
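To make these principles concrete, here is a minimal sketch in Python. It is not the VARD rule set used for the corpus: the two lookup tables are hypothetical and contain only a handful of illustrative entries, and the function name `regularise` is our own label. It simply shows contractions being expanded and a second-person verb ending being regularised while pronouns such as thou and you pass through untouched.

```python
import re

# Hypothetical lookup tables for illustration only; the project's actual
# VARD rules are far larger and are not reproduced here.
CONTRACTIONS = {"don't": "do not", "can't": "cannot"}
VERB_ENDINGS = {"knowst": "know", "knowest": "know"}

WORD = re.compile(r"[A-Za-z'’]+")


def regularise(text: str) -> str:
    """Expand contractions and regularise second-person verb endings.

    Words found in neither table (including pronouns) are left as they are.
    """
    def swap(match: re.Match) -> str:
        word = match.group(0)
        # Treat typographic and straight apostrophes alike, ignore case.
        key = word.lower().replace("’", "'")
        return CONTRACTIONS.get(key) or VERB_ENDINGS.get(key) or word

    return WORD.sub(swap, text)


print(regularise("thou knowst I don't"))  # thou know I do not
```

A real implementation would also need to preserve capitalisation and record each change so that the original spelling can be recovered; this sketch does neither.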

Spelling regularisation is not a neutral process, as it necessitates judgements about how orthography and the lexical unit (a word) are conceived. There is no standard way of regularising early modern spelling – in part because it is so variable across time, genre and medium (i.e. print and manuscript) – which entails that comparisons between corpora are inevitably affected by different regularisation principles. The purpose of the regularised text (e.g. for undergraduate study vs. computational analysis) also shapes the decisions made. Thus, the potential significance of variation in regularisation is something we are addressing in subsequent iterations of the corpora and associated analyses.

Behn’s Drama Corpus (2018) and the contemporary Restoration drama corpus (2017-8):

In our current text preparation process (from 2017 onwards) the XML is done first, and the VARDing (spelling regularisation) is done second. This means that we have an original-spelling text to work from, with full mark-up, as well as versions with regularised spelling (with and without regularisation tags). This work continues to be undertaken by Mel Evans and postdoctoral research associate Alan Hogarth. At the time of writing there are 80 plays in the comparison corpus. The majority are works by Behn’s contemporaries (e.g. Thomas D’Urfey, John Dryden, Edward Ravenscroft) but it also includes those who preceded her – and provided source materials – such as Thomas Killigrew and Richard Brome, as well as those who succeeded Behn, some of whom were acquainted with her works, including Mary Pix and Charles Gildon. The temporal breadth is important both to identify and account for changes in theatrical fashions and developments in authorial style over time, as they pertain to Behn and works attributed to her.

Many of the seventeenth-century play texts were taken from EEBO-TCP, with the transcriptions checked and edited for accuracy with the base-text. Some were generously shared by the Visualising English Print (VEP) team, which meant that the dramatic texts were already regularised using VARD. The VEP spellings were updated to follow the regularisation system used in the Behn corpus. Other plays for earlier decades have been kindly provided by Hugh Craig; again, some differences in formatting and spelling regularisation meant that further editing was required to provide greater uniformity with the Behn corpus and other texts.

In summer 2018, the Behn corpus and a sub-set of the contemporary drama were updated to produce versions with regularised interjections (a linguistic category of lexical units that stand as discrete discourse units, such as oaths and exclamations); for instance s’wounds and zounds are regularised to zounds. This enables a more robust comparison of this facet of Behn’s language with that of her contemporary dramatists (Evans, under review). These versions of the plays exist alongside the previous iteration.

Other changes have been made to the 2016 corpus of Behn’s drama, reflecting our findings, the developments in analytic approach and the software used. It appears that Stylo for R, for example, reads content in the TEI header (even when ‘xml plays’ is selected), so plain text versions of the drama corpora were made using XPath to extract only the dialogue for use with this software package.
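For readers curious about what such an extraction involves, the following Python sketch shows one way it might be done. It is an illustration under several assumptions rather than our production script: the sample XML is invented (it is not a transcription of any play), it carries no TEI namespace, and the function names `spoken_text` and `dialogue` are hypothetical. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree` to find speeches in the body, then drops speaker tags and stage directions while keeping the words that follow them.

```python
import xml.etree.ElementTree as ET

# Invented sample, loosely modelled on TEI Lite drama mark-up.
SAMPLE = """<TEI>
  <teiHeader><note>Prologue parked in the header.</note></teiHeader>
  <text><body>
    <sp><speaker>FIRST</speaker>
      <p>I will go <stage>Aside</stage> and see.</p>
    </sp>
  </body></text>
</TEI>"""

SKIP = {"speaker", "stage"}  # elements excluded from the word counts


def spoken_text(elem):
    """Collect text inside elem, omitting SKIP elements but keeping
    the text that trails after them."""
    if elem.tag in SKIP:
        return []
    parts = [elem.text or ""]
    for child in elem:
        parts.extend(spoken_text(child))
        parts.append(child.tail or "")
    return parts


def dialogue(xml_string):
    """Return one whitespace-normalised string per speech in the body."""
    root = ET.fromstring(xml_string)
    speeches = root.findall(".//body//sp")  # the header is never visited
    return [" ".join("".join(spoken_text(sp)).split()) for sp in speeches]


print(dialogue(SAMPLE))  # ['I will go and see.']
```

Files that declare the TEI namespace would need it supplied in the path expressions, and a full XPath processor would allow the same selection to be written as a single expression.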

We appreciate that the formatting and regularisation processes change the complexion of the data, potentially in quite significant ways in terms of the lexical items available for quantitative analysis. For example, don’t and do not in a corpus entail a greater variety of forms than if all texts are regularised to do not; in the latter scenario, the contracted form is removed from the orthographic repertoire, which might boost (artificially, in a sense) the number of not forms in the corpus. It will be interesting to establish the impact the different regularisation decisions have on the computational results, and this is an aspect we are presently investigating.
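The effect can be seen in miniature with a few lines of Python. The sentence below is invented for the purpose, and `expand` is a hypothetical helper handling only the single contraction discussed above; the point is simply that expansion reduces the number of distinct forms while raising the count of not.

```python
from collections import Counter


def expand(tokens):
    """Replace the token "don't" with the two tokens "do" and "not"."""
    out = []
    for tok in tokens:
        out.extend(["do", "not"] if tok == "don't" else [tok])
    return out


# Invented example sentence containing both the contracted and full forms.
original = "I don't know and I do not care".split()
expanded = expand(original)

print(len(set(original)), Counter(original)["not"])  # 7 1
print(len(set(expanded)), Counter(expanded)["not"])  # 6 2
```

In this toy case the expanded text has one fewer word type and twice as many instances of not, which is the kind of shift that could feed through into frequency-based tests.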

Workflow (2019):

Our work on the drama corpus is now informing our preparation of other texts for computational analysis. This includes works by Behn, the dubia, and the writings of her contemporaries in the genres of correspondence, prose and poetry. The workflow for the preparation of these texts is as follows:

  • Source the text in digital format (plain text) e.g. EEBO-TCP or an existing transcription from other reputable sources. When no reliable transcription is available, the text is manually keyed by the editorial team.
  • Add TEI Lite XML tags to the original spelling text
  • VARD the marked-up files, including the following principles:
    Expand contractions, regularise interjections, regularise second-person verb endings (e.g. knowest > know)
  • Use the XML version to create a stripped, plain-text version of the main text.

Iterations are saved at points 2, 3 and 4.

We will provide more information about these new corpora later in 2019.

Future Plans (2020-):

Following the conclusion of the E-ABIDA project, the plain text files for the complete drama corpus (including Behn, contemporary and dubia works) will be made available in both original and regularised spelling via the project website. It is our hope that other researchers will use these resources to enrich the study of Restoration drama and literature.

Adventures in Antwerp

The view from the river back towards the old city.
All blog post images @_MelEvans

In May 2019, I had the pleasure of visiting Prof. Dr. Peter Petre and his colleagues at the University of Antwerp. Prof Petre leads the European Research Council-funded project ‘Mind-bending Grammars’, which investigates the cognitive and social dimensions of language change across the lifespan using the 90 million word corpus EMMA (Early Modern Multiloquent Authors). The project harnesses quantitative and computational techniques to test and develop linguistic theory surrounding the process of language change at the level of the individual.

Given my work on Behn’s authorial style using computational stylistic methods, the findings of which suggest evidence of linguistic developments over her career, there was a clear overlap with the MBG project. I was invited to discuss our approach to Behn’s style and the pervading questions of attribution with the project members, and they also shared their own insights and expertise in a very rich and productive meeting. My visit also included a guest lecture, in which I provided an overview of some of the most recent results of the attribution work to a mixture of students and staff at the university. This included our extensive work on Behn’s dramatic dubia, such as The Counterfeit Bridegroom, as well as some explorations into the socio-pragmatic properties (and their profiling potential) of dramatic interjections e.g. oh, ah and ha, to which the audience posed some very insightful and thought-provoking questions.

My talk also covered my early-stage editorial work on Behn’s correspondence sent from Antwerp in 1666, during her assignment as a royalist spy. On my wanderings around the old parts of the city, it was fun to try to envisage what Behn’s own experiences might have been, and what areas in the old town she may have made time to visit. Prof Petre and his team took me to the site of the inn, the Rosa Noble, which served as Behn’s accommodation during her mission, and for which she ended up in serious debt to her Antwerp landlord. Today it’s a main thoroughfare, but in Behn’s day it was a canal connecting the city to the vast Scheldt river.

I also made sure to visit the Plantin-Moretus Printing House Museum (as recommended on Twitter). This incredible building, which is the site of one of the oldest and most successful printing businesses in western Europe, dates back to the sixteenth century. The route through the museum takes you around the various rooms of the printing business, including the print room, the correction room, and the front-of-house book shop. On a glorious spring day, this was a wonderful place to explore (including the internal courtyard garden), and I heartily recommend it should you have a spare hour or so, regardless of your bibliophilic qualities.

It seems fair to say that Behn’s time in Antwerp, given her extensive and repetitive complaints about her financial difficulties, was certainly not as enjoyable as mine. I hope to return just as soon as a suitable (Behn-related) excuse arises.

Behn at BSECS

Over the past few years, the E-ABIDA panel at the January meeting of the British Society for Eighteenth-Century Studies has become an annual event. BSECS, regularly hosted by St Hugh’s College Oxford, provides an unparalleled opportunity for members of the Behn team to share their latest findings from across the wide range of research represented by the project. Participants in recent years have included Ros Ballaster, Robert D. Hume and Helen Wilcox, as well as the project’s PI, Elaine Hobby.

This year’s panel, in keeping with the conference theme of ‘Islands and Isolation’, considered issues of separation and connectedness in Behn’s work. Alan Hogarth, representing the digital humanities side of the project, reported on his joint research with Mel Evans in ‘Isolating the Idiolect: Forgery and Style in Behn’s Spying Letters’. Behn’s 1660s letters from the Low Countries, where she had been despatched as a spy for the crown, have long been the subject of great fascination for historians and literary scholars alike. The authenticity of a portion of these letters – the embedded missives from William Scot – has recently been called into question by important new research by Nadine Akkerman, who argues that Scot’s letters may in fact have been creative forgeries by Behn herself. Alan’s paper discussed what digital humanities methods can do to complement traditional literary approaches to authorial attribution, and reported on the results of computational analyses carried out by himself and Mel to test the authorship of the Scot letters. A jointly authored publication is to follow – watch this space for further news!

In the second paper of the panel, ‘Imitation and the Isolated Woman: Aphra Behn’s “Oenone to Paris” in Restoration Literary Culture’, Gillian Wright moved from espionage to poetry, addressing what is perhaps the pivotal poetic publication in Behn’s career. Published in 1680, Behn’s ‘Oenone to Paris’, a rendering of one of Ovid’s Heroides, involved collaboration with two of the leading figures in Restoration literary London, the poet and dramatist John Dryden, and the up-and-coming young bookseller Jacob Tonson.

Pieter Lastman, 1610. Paris and Oenone. <https://high.org/collections/paris-and-oenone/>

Through its inclusion in this prestigious volume, ‘Oenone to Paris’ put Behn on the map as a literary translator – or rather (not quite the same thing) as a literary imitator: the resonances of Dryden’s designation of her poem as ‘in Mr. Cowleys way of Imitation only’ are still debated (and were discussed in at least two other papers elsewhere in the conference). Gillian’s paper discussed Behn’s use (and non-use) of previous translations of ‘Oenone to Paris’ by George Turberville, Wye Saltonstall and John Sherburne in 1639, and briefly considered how her practice in this early poem compares with her methods in later imitative works such as Cowley’s sixth book of plants. A longer version of her paper is to be published in Early Modern Women’s Complaint: Gender, Form, and Politics, ed. Sarah C.E. Ross and Rosalind Smith.

The final paper in the panel, Claire Bowditch’s ‘Readers’ Responses and Press Variants in Aphra Behn’s Works’, reported on the findings of over three years of research in numerous scholarly libraries in the United Kingdom, Ireland, and the United States. With Elaine Hobby, Claire has been responsible for collating hundreds of early copies of Behn’s works for The Cambridge Edition of the Works of Aphra Behn, and has discovered some intriguing evidence of both in-press revisions to volumes issued during Behn’s lifetime and also early readers’ responses to her writings. Was Behn herself connected with these in-press revisions, and which of her works proved most appealing – or most provocative – for early readers? Again, a publication is in progress – watch this space.

The panel was ably chaired by Robert D. Hume (a member of the Project Management and Editorial Boards for the Cambridge Behn), and attracted an excellent (and pleasingly appreciative) audience. E-ABIDA will return to BSECS in 2020.

Looking at the small words

The Quantitative Big Picture

Computational approaches to authorial style and attribution are predominantly quantitative: the various techniques rely on the ability of computers to identify patterns in large datasets (i.e. frequencies of words across the plays of multiple authors), and for these patterns to align with the authorial style of one author or another. In our investigation of Behn’s authorial style, Alan Hogarth and I have been using various exploratory testing methods (Zeta, T-tests, Principal Components Analysis) to establish how distinct Behn is as a dramatic writer when compared to her contemporaries (other genres will follow shortly). The results suggest that Behn does, indeed, have a linguistic profile that stands in contrast to her contemporaries. Whilst a future blog post will talk more about these high-level patterns and comparisons, the present discussion will introduce some of the specific words that characterise Behn’s authorial style, on the basis of their contrastive frequencies in her works versus those of contemporary playwrights. This work introduces a qualitative dimension to the investigation, with the objective being to better understand how and why some words appear as marker-words for one author, rather than another.

Figure 1: PCA results for 33 Restoration comedies using 100 most frequent function words

Figure 1 shows a Principal Components Analysis of the 100 most frequent function words in a corpus of 33 comedies written by five Restoration dramatists, including Behn. This statistical sorting hat organises the dependent variables (the plays) according to the most dominant continuities or trends of the results for the independent variables (the words). The right-hand side of the line highlights the area where most of Behn’s comedies cluster, largely distinct from the comedies of Dryden, D’Urfey, Shadwell and Ravenscroft. The plays of D’Urfey (green crosses) intersect with some of Behn’s works, suggesting stylistic similarities; the plays of Dryden and Ravenscroft, by comparison, are much more distinct. Looking at the linguistic features that underlie this distribution of plays helps to explain these points of overlap between authors.
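For those unfamiliar with the technique, the short Python sketch below shows the core idea on a deliberately tiny, invented table of word frequencies (four imaginary plays, three imaginary words). It is not the procedure or software used to produce Figure 1: it finds only the first principal component, by power iteration on the covariance matrix, and skips steps such as standardising the variables that a full analysis would normally include. The names `TOY_FREQUENCIES` and `first_component` are hypothetical.

```python
# Invented relative frequencies: rows are plays, columns are words.
# Rows 0-1 stand for one imaginary author, rows 2-3 for another.
TOY_FREQUENCIES = [
    [5.0, 1.0, 2.0],
    [4.8, 1.2, 2.1],
    [1.0, 5.0, 2.0],
    [1.2, 4.9, 1.9],
]


def first_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the word-frequency columns.
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0] + [0.0] * (m - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Each play's score is its centred row projected onto the direction.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centred]
    return v, scores


direction, scores = first_component(TOY_FREQUENCIES)
for label, score in zip(["A1", "A2", "B1", "B2"], scores):
    print(label, round(score, 2))
```

In this toy table the two pairs of plays fall on opposite sides of zero along the first component, which is the one-dimensional analogue of the clustering visible in the chart.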

Figure 2 shows the same distribution of plays, now with the words underlying their organisation. Within the Behn area of the chart, we find second-person pronoun forms (thou, thee, thy), connectives (and, so) and interjections (oh, ha). The rest of this post will discuss the functionality and distribution of interjections in Behn’s plays, compared with those of her contemporaries, and consider what this information may suggest about her approach to dramatic writing.


Figure 2: PCA results for 33 Restoration comedies using 100 most frequent function words, showing distributional relationship between word frequencies and play texts.


Interjections are powerful small words. They prototypically signal the emotional attitude of the speaker towards their situation, and this affective function has implications for their use in dramatic language. They can be grouped into two types: Primary forms, whose function is solely as an interjection and generally have an atypical phonological and graphological structure (e.g. ugh) and Secondary forms, which are words transferred from other word classes, including euphemistic and taboo words (e.g. damn, blimey).

Interjections have been discussed by various scholars of the history of the language. Corpus-based studies show an increase in forms and frequencies of use in drama and other literary genres over the early modern period. This is thought to reflect a cultural shift towards the subjective expression of emotion, which was conveniently encapsulated through the use of interjection forms (see the work of Irma Taavitsainen (1995, 1997) and Culpeper and Kytö (2010)).

Whilst interjections can convey a speaker’s state of mind (e.g. oh may signal a speaker’s shock and surprise), they also have other pragmatic meanings. Interjections can be conative; that is, they instigate a hearer’s reaction (verbal or behavioural) to the speaker’s utterance. They can also organise the discourse, signalling the end of a turn, for example.

In the PCA results, Behn’s authorial style was differentiated from that of the four comparison authors on the basis of the frequencies of oh and ha.  These interjections are the two most-frequent forms in the comedy corpus as a whole (i.e. the most common forms in the 33 comedies combined). So why are they particularly significant in Behn’s plays? Is she using them differently when compared with the four male dramatists?


Oh has an undisputed position as the top interjection in the English language. It is the most frequent form in the extant literary canon for the early modern period, and retains this position in corpus-based analyses of present-day English (for discussion of these distributional trends, see Taavitsainen 1995; Norrick 2009). In Behn’s drama it has three main functions:

  • speaker-oriented: to signal primarily negative emotions and comprehension
  • addressee-oriented: to attract attention and direct a proposition, often combined with a vocative e.g. ‘oh Alphonso’
  • audience-oriented: to introduce an expository frame that licenses the narration of events on stage e.g. ‘oh I am weary of this’

These functions are comparable with previous surveys of early modern English drama (e.g. Culpeper and Kytö 2010). Indeed, Behn’s use of oh does not appear particularly distinctive when compared to her contemporaries: statistically, the frequency of oh in Behn’s comedies is not significantly different to that of the four male writers (Student’s t-test, p > 0.05). This suggests that, quantitatively, oh is a relatively consistent feature across Restoration comedy, which fits its status as the prototypical interjection in English. Its placement in the PCA chart therefore suggests that these pragmatic functions intersect with other stylistic choices associated with the 100 most frequent function words, such as the second-person pronouns and connectives, and that it is this co-textual environment which distinguishes oh in Behn’s linguistic style from that of other authors. We plan on looking further into this feature using corpus linguistic techniques, such as collocate analysis.
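As a reminder of what the test involves, here is a minimal Python sketch of the two-sample Student's t statistic with pooled variance. The frequency figures are invented for illustration and are not drawn from the comedy corpus, and `student_t` is a hypothetical helper. The sketch stops at the statistic and its degrees of freedom; obtaining a p-value requires the t distribution, for which a statistics library would normally be used.

```python
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic (equal variances assumed).

    Returns the statistic and its degrees of freedom.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5
    return t, na + nb - 2


# Invented per-play frequencies of a word for two imaginary authors.
author_one = [2.0, 4.0, 6.0]
author_two = [1.0, 2.0, 3.0]

t, df = student_t(author_one, author_two)
print(round(t, 3), df)  # 1.549 4
```

A small t value relative to its degrees of freedom, as here, corresponds to a p-value above 0.05, which is the pattern reported for oh.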


Ha is the second most frequent interjection in the 33-comedy corpus overall. In previous studies of early modern texts (e.g. Culpeper and Kytö 2010), ha shows a similar functional range to oh, used for speaker, addressee and textual-oriented functions, including the signalling of laughter in play texts. However, in the comedy corpus, ha shows significant differences in its distribution and use at an authorial level, particularly when compared with oh. Figure 3 shows the frequency of ha for each play for each author. Of these, D’Urfey uses ha more frequently than other authors (p < 0.05). Behn’s usage contrasts significantly with Dryden’s, as does Shadwell’s. Therefore, the frequency profile for ha suggests that D’Urfey is a high-frequency user; Behn, Shadwell and Ravenscroft are mid-frequency users; and Dryden uses ha the least.

Figure 3: Frequency of HA in comedies of Behn and four contemporary writers.

This quantitative picture is very different from that of oh. A qualitative perspective offers some light on what underpins these frequencies. The five plays representing Dryden include only two examples of ha, both of which occur in his adaptation of Molière’s Amphitryon (1668; Dryden 1690). Both are used for a conative function to elicit a response from the hearer, found in heated exchanges discussing sexual misdemeanours (arising from comic misunderstandings):

Amphitryon: Made haste to Bed: Ha, was it not so? Go on –  (Aside) And stab me with each Syllable thou speak’st (p.28)

Yet it seems that, for Dryden, the other functions associated with ha in the period were achieved through other methods, or were not required.

Conversely, the high frequency of ha in D’Urfey’s plays correlates with a broader functional base. The interjection is used for addressee-oriented functions, similar to Dryden, but also for speaker-oriented expression, such as signalling frustration. Its most prominent function in D’Urfey’s plays, however, is to signal laughter. The forms occur in strings of three to five forms (five being the longest strings of the five authors). In the next stage of analysis, these laughter strings will be categorised as a form distinct from other uses of ha, to see how that affects the distribution of the interjection as a marker of authorial style.

Behn’s profile is similar to D’Urfey’s in her use of ha for speaker-oriented functions (e.g. comprehension, perception) and to mark laughter (strings of three and four forms). However, Behn does not make extensive use of the addressee-oriented function seen in the plays of Dryden and D’Urfey, suggesting her repertoire did not include, or require, the interjection for this purpose. This finding has a potentially diagnostic application when investigating her dramatic dubia.

Valeria: Ha, ha, ha – I laugh to think how thou art fitted with a Lover, a fellow that I warrant loves every new Face he sees. (The Rover, p.30)
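The planned separation of laughter strings from single uses of ha could be prototyped along the following lines. This Python sketch is only a first approximation: the function names are hypothetical, and the threshold of two or more consecutive forms for counting a run as laughter is our assumption for illustration, not a settled coding decision. The two test lines are the quotations from Dryden and Behn given above.

```python
import re

# A run is one "ha" followed by any further "ha" forms separated only by
# punctuation or spaces. Word boundaries stop matches inside e.g. "haste".
RUN = re.compile(r"\bha\b(?:\W+ha\b)*", re.IGNORECASE)
SINGLE = re.compile(r"\bha\b", re.IGNORECASE)


def ha_runs(line):
    """Return the length of each consecutive run of ha in the line."""
    return [len(SINGLE.findall(m.group(0))) for m in RUN.finditer(line)]


def classify(run_length, laughter_min=2):
    """Label a run as laughter or as some other use (assumed threshold)."""
    return "laughter" if run_length >= laughter_min else "other"


dryden = "Made haste to Bed: Ha, was it not so?"
behn = "Ha, ha, ha – I laugh to think how thou art fitted with a Lover"

print([classify(n) for n in ha_runs(dryden)])  # ['other']
print([classify(n) for n in ha_runs(behn)])    # ['laughter']
```

Counting runs rather than individual forms would let a three-form laugh register as one event, which is one way the frequency profile for ha might change once laughter is treated separately.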


Interjections and Authorial Style

The functional similarities and differences identified in relation to oh and ha provide an important perspective on the identification of these items as marker words in the quantitative analysis of Behn’s dramatic style, and that of her contemporaries. In this post, I’ve discussed only a few examples of the rich dataset that represents the interjectional forms and functions found in the comedy corpus. In the next steps, we will develop these findings to help inform our investigation of the dubia plays; in particular, the forms and functions of interjections (especially oh) provide an instructive perspective on the suspected editorial interference of Charles Gildon within Behn’s posthumously published play The Younger Brother (1696). We also plan on exploring interjections across genres, to establish the extent to which authors show continuities in their stylistic choices when working within different generic conventions, and the ramifications this may have on investigating authorship in mixed-genre datasets.


  • Culpeper, Jonathan, and Merja Kytö. 2010. Early Modern English Dialogues: Spoken Interaction as Writing. Studies in English Language. Cambridge, UK ; New York: Cambridge University Press.
  • Norrick, Neal R. 2009. ‘Interjections as Pragmatic Markers’. Journal of Pragmatics 41 (5): 866–91. https://doi.org/10.1016/j.pragma.2008.08.005.
  • Taavitsainen, Irma. 1995. ‘Interjections in Early Modern English: From Imitation of Spoken to Conventions of Written Language’. In Historical Pragmatics: Pragmatic Developments in the History of English, edited by Andreas H Jucker, 439–66. Amsterdam; Philadelphia: John Benjamins Publishing Company. http://public.eblib.com/choice/publicfullrecord.aspx?p=680399.
  • ———. 1997. ‘Genre Conventions: Personal Affect in Fiction and Non-Fiction in Early Modern English’. In English in Transition: Corpus-Based Studies in Linguistic Variation and Genre Styles, edited by Matti Rissanen, Merja Kytö, and Kirsi Heikkonen. Topics in English Linguistics 23. Berlin ; New York: Mouton de Gruyter.

Aphra Behn in Lampeter

A few weeks ago, in late August, I went on one of the most pleasant and thought-provoking research trips I’ve taken recently. Lampeter is a small Welsh town set amid beautiful rolling hills (then somewhat parched after the heatwave of the early summer). It also has a historic University library, founded in the early nineteenth century, which includes among its collections a small but fascinating corpus of early Aphra Behn materials.

Among Lampeter’s Behn holdings is a copy of Aesops Fables (1687). Like many of the library’s early printed books, Aesops Fables was acquired through donation from the educational philanthropist Thomas Phillips (1760-1851). Phillips, who had also worked as an East India Company surgeon, probably did not set out to collect Aphra Behn but procured her works almost by accident, in the course of adding to his ever-growing library. If he was attracted by anything in particular in Aesops Fables, it is likely not to have been Aphra Behn’s poetic versions of the fables – physically, a small component of a large and complex volume – but rather Francis Barlow’s finely executed animal engravings, which are now recognised as a landmark achievement in seventeenth-century book illustration. Although Aesops Fables, compared with many early modern books, is not rare – the British Library alone holds four copies – Lampeter’s copy is remarkably well-preserved, including all the prefatory material in what seems to have been the original order. It does not, however, include plate 17 from the illustrated ‘Life of Aesop’, an image excised from several extant copies on grounds of supposed obscenity (it depicts Aesop’s master’s wife with her buttocks exposed). Whether Aphra Behn wrote the (unattributed) four-line poems accompanying each of the ‘Life’ illustrations, as well as the six-line fables that Barlow explicitly attributed to her in his preface, is among the questions that the Cambridge Behn project hopes in time to be able to answer.

Section of Plate 17. Image from Early English Books Online.

Lampeter’s other Behn holdings comprise political poems: two dating from 1685 (her elegy on Charles II and consolatory poem to Catherine of Braganza), and two from 1689 (A Congratulatory Poem to Her Sacred Majesty Queen Mary, upon her arrival in England and A Pindaric Poem to the Reverend Doctor Burnet). The poem to Gilbert Burnet – propagandist for William of Orange, and future Bishop of Salisbury – is a rare survival, one of only seven copies now known to be extant. I regret to report that I may have been the first person in its 329-year history actually to read the Lampeter witness, as its pages had never been cut.


Discovering ‘The Rover’: a new British Library resource

Having in the past wandered around the Shakespeare and Renaissance section of the British Library’s Discovering Literature website, I was delighted to be invited to write an article on Behn’s The Rover for their new Restoration and Eighteenth Century section. To me it is excellent news that the British Library has expanded its online teaching resources into Behn’s period, and I greatly enjoyed the chance to pitch my favourite Behn play to the anticipated ‘A-level student and general public’ reader.

The article draws on the work I’ve completed when editing The Rover for volume IV of the project’s forthcoming edition of Behn’s works (for CUP, scheduled for publication in 2020). I position the play in some of its key cultural contexts, and explore a little of its performance history both in 1677 and into the eighteenth century. The article includes information on the significance of the play’s setting in Naples, Italy; on stage courtesans; and on the English laws and conventions governing women’s place in society. I also address Restoration conceptualisations of masculinity and (of course) ideas of carnival.

Part of the delight – as well as the challenge – of writing for Discovering Literature is the need to construct an argument that is punctuated by frequent links to images and existing articles held within the British Library holdings and website archive. The benefits of such links are enormous when they are used effectively – as I believe they are here. Linked resources in the BL piece include an article by Matthew White on ‘The Turbulent Seventeenth Century’; another on an engraving of Charles I’s execution; a link to Coryate’s Crudities (1611) with its opinions on Italian courtesans (with connections to Othello and The Merchant of Venice); and the opportunity to explore Mary Astell’s Reflections upon Marriage (1700). Some of these links were my suggestion, and others were proposed by the Discovering Literature editors, drawing on their detailed knowledge of the BL catalogue and their understanding of the enterprise as a whole.

Title page from Mary Astell’s Reflections upon Marriage (1700). From the British Library collection (Public Domain image).

My favourite enhancements to the article are photographs from the Royal Shakespeare Company’s glorious 2016 production of The Rover, directed by Loveday Ingram. Joseph Millson won a Best Actor award for his portrayal of Willmore (see Billington’s favourable review in The Guardian). Also enlightening are the images from the Senate House Library’s copy of the 1677 Rover, which the British Library provided so that I could discuss its use as a prompt-copy for early eighteenth-century performances of the play. The copy shows that cuts were made to the play for these performances; most notably, a speech by the courtesan Angellica Bianca was deleted at the end of a scene, resulting in an increase in Willmore’s standing and a decrease in hers.

Like all Discovering Literature articles, my introduction to The Rover is freely available under a Creative Commons Licence. I very much hope it will help to attract yet more readers and directors to Behn’s fascinating, funny play.

Elaine Hobby.

First Steps: learning the art of computational stylistics

Over the last thirty years or so, the analytic potential of computers has been increasingly applied to language, and to the language of literature. The field of computational stylistics has flourished across research communities, albeit until recently focussed primarily on English literary texts. The objective of many scholars working with the quantitative processing power of computers has been authorship attribution; specifically, to establish the legitimacy of the purported author of a text or texts. The best-known area of (ongoing and often heated) investigation relates to Shakespeare.

However, computational stylistic work isn’t solely about authorship attribution. As just one example, Craig and Greatley-Hirsch’s recent (2017) publication testifies to the potential directions and avenues of quantitative approaches to early modern literature. And, despite the remit of the attribution team on the Aphra Behn project being precisely that (i.e. evaluating the traditional attributions of texts to Behn, however shaky and speculative), our initial work on Behn’s language has not considered authorship attribution at all. Instead, we have focussed on establishing the stylistic markers of her writing, specifically her dramatic writing, and tracing their evolution over the course of Behn’s career and lifetime. Our article ‘Style and Chronology: a stylometric investigation of Aphra Behn’s dramatic style and the dating of The Young King’, which appears in the most recent issue of Language and Literature, provides the first published report on the style and attribution analysis undertaken on the Aphra Behn (E-ABIDA) project.

The article documents our preliminary investigative work, which seeks to get a handle on Behn’s dramatic style. Her drama is copious, and presents some particularly acute challenges relating to dubiously attributed texts; challenges we will discuss in more detail in a future post. Significantly, from the perspective of computational stylistic methods, Behn’s dramatic works span a 20-year period, and the distribution of genres (tragedy, comedy, tragicomedy) is uneven across this time-frame. Whilst studies suggest that the authorial signal more often than not shines through the other facets of style (like genre and time-period) (Burrows and Craig 2012), the aforementioned properties of Behn’s literary data mean that it is desirable to obtain as much information as we can about what typifies Behn’s language in her plays. Our recent investigation considers whether Behn’s work has a chronological “signal”; that is, whether the computational analyses can differentiate her plays according to the time period in which they were written. This is the ‘stylochronometric’ part of the study. In this blog post, I discuss the main reasons for starting with this chronological focus, and offer some of our initial experiences of negotiating Behn’s language through this quantitative lens.

Theoretical Insights

Scholarship on Shakespeare and other writers has observed, explored and debated the chronological developments across literary careers (and Behn is no exception: the fast-changing political and economic context of Restoration theatre entailed that Behn had to be reactive (or anticipative) to sustain her career as a playwright). What is interesting, from a linguistic and cultural perspective, is the direction of chronological developments, be they early modern or from other literary periods in history, identified in individual authorial practices, on the one hand, and their intersection with cultural trends, on the other. Rybicki (2016) makes an important point when he observes how these two levels (micro- and macro-) overlap, with lifespan changes woven through cross-generational trajectories of change. We might consider this process analogous to a flock of birds; no one bird controls the whole flock, but instead each makes minute adjustments (anticipative and reactive) to ensure the continuation of their flight. The relationship between individual and social process in language change (literary and otherwise) is one that fascinates me, and this is an area to which I believe computational stylistics can really contribute.

In our analysis of Behn, we focus on Behn’s changing style within the framework of her lifespan – looking at what develops and how. Since this was an early study, we were not in a position to consider the relationship between Behn and her Restoration colleagues. However, we have since prepared extensive comparative corpora and hopefully we will be able to contextualise the changing stylistic preferences of one writer with those of many others. The ongoing project, Mind-Bending Grammars at the University of Antwerp, provides a valuable resource with similar objectives, albeit focussing on changes to morphosyntactic (grammatical) features. Excitingly, their investigation – which uses EEBO-TCP as its corpus – includes the grammar of Aphra Behn. Some of their early findings suggest she was among the more progressive individuals in the acquisition of innovative grammatical structures. These developments in quantitative methods and techniques, which can pull together and scrutinise the intersection of literary, linguistic and cultural change, present valuable opportunities for how we theorise and understand temporal developments in language and style.

Practical Applications

Identifying the chronological style of Behn also has more practical applications. By identifying key markers of Behn’s style that can be associated with different temporal periods, we are in a better position to judge the linguistic results produced for the works of questionable origin. Within the drama corpus, the majority of the dubia are dated to the late 1670s, placing them – we know now – at an apparent transitional period between Behn’s early (e.g. Forc’d Marriage, Dutch Lover) and mid-period (e.g. The Rover) plays. It will be interesting to establish the extent to which the developments identified in her dramatic works are apparent in her other writing, such as her poetry and her fictional prose. Studies (e.g. van Hulle and Kestemont 2016) have shown that writers working in multiple languages do not necessarily show synchronous change. How this relates to writers, like Behn, working in different genres remains to be seen.

Reading the Numbers

The investigation of Behn’s chronological style has also taught us a valuable lesson in what computational stylistics can, and cannot, provide by way of answers. Whilst the statistical findings are empirically robust, the results of exploratory tests such as Principal Components Analysis require careful interpretation by a human eye (for a gentle introduction, see this ‘PCA 4 dummies’ guide). The stylochronometric investigation sought to test the veracity of Behn’s claim in the dedicatory epistle, published with the play in 1682, that The Young King was in fact the earliest of her dramatic works. The computational analyses therefore had two objectives: firstly, to establish whether Behn’s dramatic works convey a sufficiently clear temporal signal in their stylistic properties; and secondly, to establish where The Young King positions itself when analysed using that temporal signal.

Exploratory statistical methods, such as PCA and cluster analysis, found that Behn’s dramatic works have a chronological signal: the works organised themselves into near-perfect date order using the linguistic criteria (e.g. most frequent words). This meant that the computational stylistic investigations had the potential to shed new light on Behn’s statement concerning the dating of The Young King. In an ideal world – the stylistic equivalent of a laboratory experiment, with clean surfaces and no confounding factors – the results would have provided a clear-cut indicator of the temporal markers of The Young King and its chronological position in Behn’s dramatic outputs. However, any study working with historical data has to confront the less-than-desirable make-up of the extant evidence; an example, perhaps, of what Labov (1994: 11) famously called the ‘bad data’ problem in the field of historical sociolinguistics. In our case, Behn’s dramatic output disproportionately favours comedies (12 firmly attributed). Tragedies (1) and tragicomedies (3) are far less frequent. This is not an issue in itself, necessarily, until you wish to compare texts representing different chronological sub-periods and the genre signal starts to interfere with the temporal and authorial elements of the results.
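
For readers unfamiliar with the ‘most frequent words’ criterion, the following is a minimal Python sketch of how such a feature table might be built. It is not the project’s own pipeline (which uses Intelligent Archive on TEI-encoded texts): the tokenisation rule, the function names `tokenize` and `mfw_table`, and the toy texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens.

    Assumption: internal apostrophes are kept, so a form like forc'd
    stays a single token rather than being split in two.
    """
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def mfw_table(texts, n=300):
    """Build a most-frequent-word table for a small corpus.

    texts: dict mapping a title to its plain text.
    Returns (vocab, rows): the n most frequent words across the whole
    corpus, and for each title the relative frequency (as a percentage
    of that text's tokens) of every word in vocab.
    """
    counts = {title: Counter(tokenize(body)) for title, body in texts.items()}
    corpus = Counter()
    for c in counts.values():
        corpus.update(c)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(corpus.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:n]]
    rows = {}
    for title, c in counts.items():
        total = sum(c.values())
        rows[title] = [100.0 * c[w] / total if total else 0.0 for w in vocab]
    return vocab, rows


# Toy stand-ins, not real play texts.
toy = {
    "text_a": "the king and the queen and the court",
    "text_b": "the rover and the sea",
}
vocab, rows = mfw_table(toy, n=3)
print(vocab)
print(rows)
```

Using relative rather than raw frequencies means that long and short plays can be compared on the same footing; each row of the table is then the ‘profile’ that exploratory methods such as PCA or cluster analysis take as input.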

Making sense of the results for The Young King analysis was therefore challenging. The extensive combination of tests – PCA, cluster analysis, Zeta – provided rich and overlapping results for the stylistic similarities between Behn’s plays, but it was difficult to assess the cause of those similarities. For example, the cluster analysis based on the 300 most frequent words in Behn’s drama (reproduced from Evans 2018) shows how The Young King clusters most closely with Abdelazar – Behn’s only tragedy.

Cluster Analysis of Behn’s plays and The Young King (300 Most Frequent Words)
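
As a rough illustration of what a cluster analysis does with word-frequency profiles, here is a small pure-Python sketch of agglomerative clustering. The distance measure (Euclidean) and linkage rule (average linkage) are assumptions chosen for simplicity, and may differ from the settings behind the figure above; the four ‘plays’ and their three-word profiles are invented toy values.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping a title to a list of word frequencies.
    Returns the merges in the order they were made, each as
    (members_of_cluster_1, members_of_cluster_2, distance).
    """
    clusters = [(title,) for title in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                # Average distance over all cross-cluster pairs of texts.
                d = sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical profiles: percentages for three very frequent words.
toy_profiles = {
    "play_a": [4.0, 3.0, 1.0],
    "play_b": [4.1, 2.9, 1.1],
    "play_c": [2.0, 1.0, 3.0],
    "play_d": [2.2, 1.1, 2.8],
}
merges = average_linkage(toy_profiles)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The order of the merges is what a dendrogram draws: texts joined early, at small distances, sit on neighbouring branches. The method reports that two texts are similar; it does not say whether date, genre or something else is responsible.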

The other clusters on the graph show how Behn’s plays group into temporal periods: early (up to 1673), middle (1677-1682) and late (1682-1690). Does The Young King group with Abdelazar because Behn was stretching the truth regarding the early dating of The Young King (Abdelazar was first performed in 1676-7)? Or is it because of genre similarities? PCA results, such as the one shown here, offer a similarly complex picture: The Young King and Abdelazar again group together, but whether this is because of chronology and/or genre is not clear.


PCA of Behn’s plays and The Young King; 300 most frequent words.
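
For those curious about what lies behind a PCA plot, the sketch below computes the first principal component of a small table of word frequencies in plain Python, using power iteration on the covariance matrix. It is a simplified illustration under stated assumptions, not the project’s procedure: only the first component is extracted, the data are centred but not rescaled, and the four rows are invented toy values rather than figures from Behn’s plays.

```python
import math


def center(rows):
    """Subtract each column's mean so every word variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n = len(centered)
    p = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=500):
    """Leading eigenvector of cov by power iteration (unit length).

    Limitation: a start vector exactly orthogonal to the leading
    component would not converge to it; this does not arise with the
    toy data below.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def pca_scores(rows):
    """Return the first component's word weights and each text's score on it."""
    centered = center(rows)
    pc1 = first_component(covariance(centered))
    scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
    return pc1, scores


# Hypothetical profiles for four texts (rows) and three words (columns).
toy_rows = [
    [4.0, 3.0, 1.0],
    [4.1, 2.9, 1.1],
    [2.0, 1.0, 3.0],
    [2.2, 1.1, 2.8],
]
pc1, scores = pca_scores(toy_rows)
print("word weights:", [round(w, 3) for w in pc1])
print("text scores:", [round(s, 3) for s in scores])
```

Texts with similar scores fall close together on the plot, and the word weights show which words pull a text towards one end of the axis or the other. As with clustering, the arithmetic identifies the pattern, and the interpretation is left to the reader.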

Our current thinking is that, if the similarity between The Young King and Abdelazar is wholly attributable to genre, then we would expect to find an equivalent proximity between The Young King and Forc’d Marriage – which is not only a tragicomedy but also one of Behn’s earliest plays, thus carrying a double-whammy of stylistic signals (genre, time-period) if Behn’s dating claim for The Young King is accurate. This association does not emerge in any of the tests we conducted. On this basis, it seems more likely that the version of The Young King we have today (published in 1682) was at least heavily revised by Behn later in her career, leaving stylistic traces more typical of her mid-career writings. If our interpretation is accurate, then this provides a new perspective on Behn’s understanding of the likely reception of her work by Restoration audiences, and some of the strategies required to maintain a successful literary career.

Next Steps

One of the most important lessons of our early investigations into Behn’s style is that, for the most part, the results of computational stylistics are indicators, not pronouncements. There is an art, a humanity, to the reading and understanding of quantitative statistical trends – to recognising and appreciating the nuances in why Behn’s language use looks as it does, whether that’s pronouns, verb choices, or sequences of letter-forms or words.

As we go forward with the attribution analysis proper, our investigations will strive to take these early findings on board, and incorporate them into our methodological decisions (i.e. what texts to compare/contrast; how to organise different texts as “representatives” of a particular time-period or genre), our theoretical perspectives (lifespan literary change is a complex but definite and quantitatively chartable phenomenon) and our understanding of an author’s position within wider changes within the linguistic and cultural systems.

Watch this space for more reports on the attribution analyses.


Mind-Bending Grammars Project; led by Peter Petre, University of Antwerp

PCA 4 Dummies


Burrows, John, and Hugh Craig. 2012. ‘Authors and Characters’. English Studies 93 (3): 292–309. https://doi.org/10.1080/0013838X.2012.668786.

Craig, D. H., and Brett Greatley-Hirsch. 2017. Style, Computers, and Early Modern Drama: Beyond Authorship. Cambridge; New York, NY: Cambridge University Press.

Evans, Mel. 2018. ‘Style and Chronology: A Stylometric Investigation of Aphra Behn’s Dramatic Style and the Dating of The Young King’. Language and Literature, May. https://doi.org/10.1177/0963947018772505.

Labov, William. 1994. Principles of Linguistic Change. Vol. 1: Internal Factors. Language in Society 20. Oxford: Blackwell.

Rybicki, Jan. 2016. ‘Vive La Différence: Tracing the (Authorial) Gender Signal by Multivariate Analysis of Word Frequencies’. Digital Scholarship in the Humanities 31 (4): 746–61. https://doi.org/10.1093/llc/fqv023.