Supported by the Arts and Humanities Research Council

Category: Attribution and Authorship

Adventures in Antwerp

The view from the river back towards the old city.
All blog post images @_MelEvans

In May 2019, I had the pleasure of visiting Prof. Dr. Peter Petre and his colleagues at the University of Antwerp. Prof. Petre leads the European Research Council-funded project ‘Mind-bending Grammars’, which investigates the cognitive and social dimensions of language change across the lifespan using the 90-million-word corpus EMMA (Early Modern Multiloquent Authors). The project harnesses quantitative and computational techniques to test and develop linguistic theory surrounding the process of language change at the level of the individual.

Given my work on Behn’s authorial style using computational stylistic methods, the findings of which suggest evidence of linguistic developments over her career, there was a clear overlap with the MBG project. I was invited to discuss our approach to Behn’s style and the pervading questions of attribution with the project members, and they also shared their own insights and expertise in a very rich and productive meeting. My visit also included a guest lecture, in which I provided an overview of some of the most recent results of the attribution work to a mixture of students and staff at the university. This included our extensive work on Behn’s dramatic dubia, such as The Counterfeit Bridegroom, as well as some explorations into the socio-pragmatic properties (and their profiling potential) of dramatic interjections (e.g. oh, ah and ha). The audience posed some very insightful and thought-provoking questions.

My talk also covered my early-stage editorial work on Behn’s correspondence sent from Antwerp in 1666, during her assignment as a royalist spy. On my wanderings around the old parts of the city, it was fun to try to envisage what Behn’s own experiences might have been, and what areas in the old town she may have made time to visit. Prof Petre and his team took me to the site of the inn, the Rosa Noble, which served as Behn’s accommodation during her mission, and for which she ended up in serious debt to her Antwerp landlord. Today it’s a main thoroughfare, but in Behn’s day it was a canal connecting the city to the vast Scheldt river.

I also made sure to visit the Plantin-Moretus Printing House Museum (as recommended on Twitter). This incredible building, which is the site of one of the oldest and most successful printing businesses in western Europe, dates back to the sixteenth century. The route through the museum takes you around the various rooms of the printing business, including the print room, the correction room, and the front-of-house book shop. On a glorious spring day, this was a wonderful place to explore (including the internal courtyard garden), and I heartily recommend it should you have a spare hour or so, regardless of your bibliophilic qualities.

It seems fair to say that Behn’s time in Antwerp, given her extensive and repetitive complaints about her financial difficulties, was certainly not as enjoyable as mine. I hope to return just as soon as a suitable (Behn-related) excuse arises.

Looking at the small words

The Quantitative Big Picture

Computational approaches to authorial style and attribution are predominantly quantitative: the various techniques rely on the ability of computers to identify patterns in large datasets (e.g. frequencies of words across the plays of multiple authors), and for these patterns to align with the authorial style of one author or another. In our investigation of Behn’s authorial style, Alan Hogarth and I have been using various exploratory testing methods (Zeta, t-tests, Principal Components Analysis) to establish how distinct Behn is as a dramatic writer when compared to her contemporaries (other genres will follow shortly). The results suggest that Behn does, indeed, have a linguistic profile that stands in contrast to her contemporaries. Whilst a future blog post will talk more about these high-level patterns and comparisons, the present discussion will introduce some of the specific words that characterise Behn’s authorial style, on the basis of their contrastive frequencies in her works versus those of contemporary playwrights. This work introduces a qualitative dimension to the investigation, with the objective being to better understand how and why some words appear as marker-words for one author, rather than another.
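To give a feel for how a contrastive measure such as Zeta picks out marker-words, here is a minimal sketch. It assumes Craig's variant of Zeta, in which a word's score is the proportion of the target author's text segments that contain it plus the proportion of the comparison segments that lack it; the post does not say which variant or settings we used, and the segments below are invented toy data rather than anything from the corpus.

```python
def zeta_scores(base_segments, counter_segments):
    """Score each word from 0 to 2: high values mark words that are
    consistently present in the base set and absent from the counter set."""
    base_sets = [set(segment) for segment in base_segments]
    counter_sets = [set(segment) for segment in counter_segments]
    vocabulary = set().union(*base_sets, *counter_sets)
    scores = {}
    for word in vocabulary:
        # Proportion of base segments in which the word occurs at least once.
        present_in_base = sum(word in s for s in base_sets) / len(base_sets)
        # Proportion of counter segments in which the word never occurs.
        absent_in_counter = sum(word not in s for s in counter_sets) / len(counter_sets)
        scores[word] = present_in_base + absent_in_counter
    return scores


# Hypothetical tokenised segments (assumption: real segments would be
# equal-sized chunks of several thousand words each).
base = [["oh", "thou", "love"], ["oh", "ha", "thee"]]
counter = [["the", "of", "love"], ["the", "oh", "wit"]]

scores = zeta_scores(base, counter)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, scores[word])
```

Because the measure counts presence or absence per segment rather than raw frequency, it tends to favour words an author uses steadily over words that spike in a single play.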

Figure 1: PCA results for 33 Restoration comedies using 100 most frequent function words

Figure 1 shows a Principal Components Analysis of the 100 most frequent function words in a corpus of 33 comedies written by five Restoration dramatists, including Behn. This statistical sorting hat organises the dependent variables (the plays) according to the most dominant continuities or trends of the results for the independent variables (the words). The right-hand side of the line highlights the area where most of Behn’s comedies cluster, largely distinct from the comedies of Dryden, D’Urfey, Shadwell and Ravenscroft. The plays of D’Urfey (green crosses) intersect with some of Behn’s works, suggesting stylistic similarities; the plays of Dryden and Ravenscroft, by comparison, are much more distinct. Looking at the linguistic features that underlie this distribution of plays helps to explain these points of overlap between authors.

Figure 2 shows the same distribution of plays, now with the words underlying their organisation. Within the Behn area of the chart, we find second-person pronoun forms (thou, thee, thy), connectives (and, so) and interjections (oh, ha). The rest of this post will discuss the functionality and distribution of interjections in Behn’s plays, compared with those of her contemporaries, and consider what this information may suggest about her approach to dramatic writing.


Figure 2: PCA results for 33 Restoration comedies using 100 most frequent function words, showing distributional relationship between word frequencies and play texts.
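For readers curious about what sits behind charts like Figures 1 and 2, the following is a rough sketch of the general technique, not our actual pipeline. It assumes word rates per 1,000 tokens, z-scored columns, and a plain power iteration to find the first principal component; the four "texts" and the four-word vocabulary are invented. The loadings correspond to the word positions in Figure 2 and the scores to the play positions in Figure 1.

```python
import math
import re
from collections import Counter


def rates_per_thousand(text, vocabulary):
    """Frequency of each vocabulary word per 1,000 tokens of one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [1000 * counts[word] / len(tokens) for word in vocabulary]


def standardise(rows):
    """Z-score every column (word) across the rows (texts)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Sample standard deviation; fall back to 1.0 for a constant column.
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1)) or 1.0
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def first_component(z, iterations=500):
    """Loadings of the first principal component, found by power iteration."""
    n, p = len(z), len(z[0])
    cov = [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] + [0.0] * (p - 1)  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Invented toy "texts" (assumption), two in each of two contrasting styles.
VOCABULARY = ["oh", "thou", "the", "of"]
TEXTS = {
    "a1": "oh thou art the man oh thou knave",
    "a2": "oh oh thou speak of love",
    "b1": "the king of the city of the sea",
    "b2": "oh the end of the tale of the war",
}

rows = [rates_per_thousand(text, VOCABULARY) for text in TEXTS.values()]
z = standardise(rows)
loadings = first_component(z)                      # one weight per word
scores = [sum(x * w for x, w in zip(row, loadings)) for row in z]  # one per text

for label, score in zip(TEXTS, scores):
    print(label, round(score, 2))
```

The sign of a component is arbitrary, so what matters is which texts fall on the same side, not whether their scores are positive or negative. In practice a library routine would replace the hand-rolled power iteration.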


Interjections are powerful small words. They prototypically signal the emotional attitude of the speaker towards their situation, and this affective function has implications for their use in dramatic language. They can be grouped into two types: Primary forms, which function solely as interjections and generally have an atypical phonological and graphological structure (e.g. ugh), and Secondary forms, which are words transferred from other word classes, including euphemistic and taboo words (e.g. damn, blimey).

Interjections have been discussed by various scholars of the history of the language. Corpus-based studies show an increase in forms and frequencies of use in drama and other literary genres over the early modern period. This is thought to reflect a cultural shift towards the subjective expression of emotion, which was conveniently encapsulated through the use of interjection forms (see the work of Irma Taavitsainen (1995, 1997) and Culpeper and Kytö (2010)).

Whilst interjections can convey a speaker’s state of mind (e.g. oh may signal a speaker’s shock and surprise), they also have other pragmatic meanings. Interjections can be conative; that is, they instigate a hearer’s reaction (verbal or behavioural) to the speaker’s utterance. They can also organise the discourse, signalling the end of a turn, for example.

In the PCA results, Behn’s authorial style was differentiated from that of the four comparison authors on the basis of the frequencies of oh and ha. These interjections are the two most-frequent forms in the comedy corpus as a whole (i.e. the most common forms in the 33 comedies combined). So why are they particularly significant in Behn’s plays? Is she using them differently when compared with the four male dramatists?


Oh has an undisputed position as the top interjection in the English language. It is the most frequent form in the extant literary canon for the early modern period, and retains this position in corpus-based analyses of present-day English (for discussion of these distributional trends, see Taavitsainen 1995; Norrick 2009). In Behn’s drama it has three main functions:

  • speaker-oriented: to signal primarily negative emotions and comprehension
  • addressee-oriented: to attract attention and direct a proposition, often combined with a vocative e.g. ‘oh Alphonso’
  • audience-oriented: to introduce an expository frame that licenses the narration of events on stage e.g. ‘oh I am weary of this’

These functions are comparable with previous surveys of early modern English drama (e.g. Culpeper and Kytö 2010). Indeed, Behn’s use of oh does not appear particularly distinctive when compared to her contemporaries: statistically, the frequency of oh in Behn’s comedies is not significantly different to that of the four male writers (Student’s t-test, p > 0.05). This suggests that, quantitatively, oh is a relatively consistent feature across Restoration comedy, which fits its status as the prototypical interjection in English. Its placement in the PCA chart therefore suggests that these pragmatic functions intersect with other stylistic choices associated with the 100 most frequent function words, such as the second-person pronouns and connectives, and that it is this co-textual environment which distinguishes oh in Behn’s linguistic style from that of other authors. We plan on looking further into this feature using corpus linguistic techniques, such as collocate analysis.
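As an illustration of the kind of comparison reported above, here is a small sketch of a two-sample Student's t statistic with pooled variance. The per-play rates are made up for the example (they are not our data), and the function returns only the statistic and its degrees of freedom; turning these into a p-value needs the t distribution, which a statistics library such as SciPy (`scipy.stats.ttest_ind`) provides.

```python
import math


def students_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Sums of squared deviations from each group's own mean.
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    df = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / df
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Hypothetical rates of "oh" per 1,000 words, one value per play (assumption).
target_author = [5.1, 4.8, 6.0, 5.5]
comparison_authors = [4.9, 5.7, 5.2, 6.1, 4.6]

t_statistic, degrees_of_freedom = students_t(target_author, comparison_authors)
print(round(t_statistic, 3), degrees_of_freedom)
```

A t statistic close to zero, relative to the critical value for the given degrees of freedom, is what a non-significant result like that for oh looks like.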


Ha is the second most frequent interjection in the 33-comedy corpus overall. In previous studies of early modern texts (e.g. Culpeper and Kytö 2010), ha shows a functional range similar to that of oh, being used for speaker-, addressee- and textual-oriented functions, including the signalling of laughter in play texts. However, in the comedy corpus, ha shows significant differences in its distribution and use at an authorial level, particularly when compared with oh. Figure 3 shows the frequency of ha for each play for each author. Of these, D’Urfey uses ha more frequently than the other authors (p < 0.05). Behn’s usage contrasts significantly with Dryden’s, as does Shadwell’s. Therefore, the frequency profile for ha suggests that D’Urfey is a high-frequency user; Behn, Shadwell and Ravenscroft are mid-frequency users; and Dryden uses ha the least.

Figure 3: Frequency of HA in comedies of Behn and four contemporary writers.

This quantitative picture is very different from that of oh. A qualitative perspective sheds some light on what underpins these frequencies. The five plays representing Dryden include only two examples of ha, both of which occur in his adaptation of Molière’s Amphitryon (1668; Dryden 1690). Both are used for a conative function to elicit a response from the hearer, found in heated exchanges discussing sexual misdemeanours (arising from comic misunderstandings):

Amphitryon: Made haste to Bed: Ha, was it not so? Go on –  (Aside) And stab me with each Syllable thou speak’st (p.28)

Yet it seems that, for Dryden, the other functions associated with ha in the period were achieved through other methods, or were not required.

Conversely, the high frequency of ha in D’Urfey’s plays correlates with a broader functional base. The interjection is used for addressee-oriented functions, similar to Dryden, but also for speaker-oriented expression, such as signalling frustration. Its most prominent function in D’Urfey’s plays, however, is to signal laughter. The forms occur in strings of three to five forms (five being the longest string found in the five authors). In the next stage of analysis, these laughter strings will be categorised as a form distinct from other uses of ha, to see how that affects the distribution of the interjection as a marker of authorial style.
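Separating laughter strings from single uses of ha could be sketched along the following lines. This is only an illustration of the idea: the regular expression, the function name and the treatment of punctuation between forms are my assumptions, and the two test sentences are the Dryden and Behn quotations given in this post.

```python
import re

# A run is one "ha" followed by any further "ha" forms separated only by
# whitespace or light punctuation (assumption about how strings are printed).
HA_RUN = re.compile(r"\bha\b(?:[\s,!.;-]+ha\b)*", re.IGNORECASE)
HA_FORM = re.compile(r"\bha\b", re.IGNORECASE)


def ha_run_lengths(text):
    """Number of 'ha' forms in each uninterrupted run, in order of appearance."""
    return [len(HA_FORM.findall(match.group(0))) for match in HA_RUN.finditer(text)]


dryden = "Made haste to Bed: Ha, was it not so? Go on"
behn = "Ha, ha, ha – I laugh to think how thou art fitted with a Lover"

print(ha_run_lengths(dryden))  # a single form, so a candidate conative use
print(ha_run_lengths(behn))    # a multi-form run, so a candidate laughter string
```

Runs of length one would then go forward for functional classification by hand, while longer runs could be counted separately as laughter; where exactly to draw that line is a decision the sketch leaves open.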

Behn’s profile is similar to D’Urfey’s in her use of ha for speaker-oriented functions (e.g. comprehension, perception) and to mark laughter (strings of three and four forms). However, Behn does not make extensive use of the addressee-oriented function seen in the plays of Dryden and D’Urfey, suggesting her repertoire did not include, or require, the interjection for this purpose. This finding has a potentially diagnostic application when investigating her dramatic dubia.

Valeria: Ha, ha, ha – I laugh to think how thou art fitted with a Lover, a fellow that I warrant loves every new Face he sees. (The Rover, p.30)


Interjections and Authorial Style

The functional similarities and differences identified in relation to oh and ha provide an important perspective on the identification of these items as marker words in the quantitative analysis of Behn’s dramatic style, and that of her contemporaries. In this post, I’ve discussed only a few examples of the rich dataset that represents the interjectional forms and functions found in the comedy corpus. In the next steps, we will develop these findings to help inform our investigation of the dubia plays; in particular, the forms and functions of interjections (especially oh) provide an instructive perspective on the suspected editorial interference of Charles Gildon within Behn’s posthumously published play The Younger Brother (1696). We also plan on exploring interjections across genres, to establish the extent to which authors show continuities in their stylistic choices when working within different generic conventions, and the ramifications this may have on investigating authorship in mixed-genre datasets.


  • Culpeper, Jonathan, and Merja Kytö. 2010. Early Modern English Dialogues: Spoken Interaction as Writing. Studies in English Language. Cambridge, UK ; New York: Cambridge University Press.
  • Norrick, Neal R. 2009. ‘Interjections as Pragmatic Markers’. Journal of Pragmatics 41 (5): 866–91.
  • Taavitsainen, Irma. 1995. ‘Interjections in Early Modern English: From Imitation of Spoken to Conventions of Written Language’. In Historical Pragmatics: Pragmatic Developments in the History of English, edited by Andreas H Jucker, 439–66. Amsterdam; Philadelphia: John Benjamins Publishing Company.
  • ———. 1997. ‘Genre Conventions: Personal Affect in Fiction and Non-Fiction in Early Modern English’. In English in Transition: Corpus-Based Studies in Linguistic Variation and Genre Styles, edited by Matti Rissanen, Merja Kytö, and Kirsi Heikkonen. Topics in English Linguistics 23. Berlin ; New York: Mouton de Gruyter.

First Steps: learning the art of computational stylistics

Over the last thirty years or so, the analytic potential of computers has been increasingly applied to language, and the language of literature. The field of computational stylistics has flourished across research communities, albeit until recently focussed primarily on English literary texts. The objective of many scholars working with the quantitative processing power of computers has been authorship; specifically, to establish the legitimacy of the purported author of a text or texts. The best-known area of (ongoing and often heated) investigation relates to Shakespeare.

However, computational stylistic work isn’t solely about authorship attribution. As just one example, Craig and Hirsch’s recent (2017) publication testifies to the potential directions and avenues of quantitative approaches to early modern literature. And, despite the remit of the attribution team on the Aphra Behn project being precisely that (i.e. evaluating the traditional attributions of texts to Behn, however shaky and speculative), our initial work on Behn’s language has not considered authorship attribution at all. Instead, we have focussed on establishing the stylistic markers of her writing, specifically her dramatic writing, and tracing their evolution over the course of Behn’s career and lifetime. Our article ‘Style and Chronology: A Stylometric Investigation of Aphra Behn’s Dramatic Style and the Dating of The Young King’, which appears in the most recent issue of Language and Literature, provides the first published report on the style and attribution analysis undertaken on the Aphra Behn (E-ABIDA) project.

The article documents our preliminary investigative work, which seeks to get a handle on Behn’s dramatic style. Her drama is copious, and presents some particularly acute challenges relating to dubiously attributed texts; challenges we will discuss in more detail in a future post. Significantly, from the perspective of computational stylistic methods, Behn’s dramatic works span a 20-year period, and the distribution of genres (tragedy, comedy, tragicomedy) is uneven across this time-frame. Whilst studies suggest that the authorial signal more often than not shines through the other facets of style (like genre and time-period) (Burrows and Craig 2012), the aforementioned properties of Behn’s literary data mean that it is desirable to obtain as much information as we can about what typifies Behn’s language in her plays. Our recent investigation considers whether Behn’s work has a chronological “signal”; that is, can the computational analyses differentiate her plays according to the time period in which they were written – the ‘stylochronometric’ part of the study. In this blog post, I discuss the main reasons for starting with this chronological focus, and offer some of our initial experiences of negotiating Behn’s language through this quantitative lens.

Theoretical Insights

Scholarship on Shakespeare and other writers has observed, explored and debated the chronological developments across literary careers (and Behn is no exception: the fast-changing political and economic context of Restoration theatre meant that Behn had to be reactive (or anticipative) to sustain her career as a playwright). What is interesting, from a linguistic and cultural perspective, is the direction of chronological developments, be they early modern or from other literary periods in history, identified in individual authorial practices, on the one hand, and their intersection with cultural trends, on the other. Rybicki (2016) makes an important point when he observes how these two levels (micro- and macro-) overlap, with lifespan changes woven through cross-generational trajectories of change. We might consider this process analogous with a flock of birds; no one bird controls the whole flock, but instead each makes minute adjustments (anticipative and reactive) to ensure the continuation of their flight. The relationship between individual and social process in language change (literary and otherwise) is one that fascinates me, and this is an area to which I believe computational stylistics can really contribute.

In our analysis of Behn, we focus on Behn’s changing style within the framework of her lifespan – looking at what develops and how. Since this was an early study, we were not in a position to consider the relationship between Behn and her Restoration colleagues. However, we have since prepared extensive comparative corpora and hopefully we will be able to contextualise the changing stylistic preferences of one writer with those of many others. The ongoing project, Mind-Bending Grammars at the University of Antwerp, provides a valuable resource with similar objectives, albeit focussing on changes to morphosyntactic (grammatical) features. Excitingly, their investigation – which uses EEBO-TCP as its corpus – includes the grammar of Aphra Behn. Some of their early findings suggest she was among the more progressive individuals in the acquisition of innovative grammatical structures. These developments of quantitative methods and techniques, which can pull together and scrutinise the intersection of literary, linguistic and cultural change, present valuable opportunities for how we theorise and understand temporal developments in language and style.

Practical Applications

Identifying the chronological style of Behn also has more practical applications. By identifying key markers of Behn’s style that can be associated with different temporal periods, we are in a better position to judge the linguistic results produced for the works of questionable origin. Within the drama corpus, the majority of the dubia are dated to the late 1670s, placing them – we know now – at an apparent transitional period between Behn’s early (e.g. Forc’d Marriage, Dutch Lover) and mid-period (e.g. The Rover) plays. It will be interesting to establish the extent to which the developments identified in her dramatic works are apparent in her other writing such as her poetry and her fictional prose. Studies (e.g. van Hulle and Kestemont 2016) have shown that writers working in multiple languages do not necessarily show synchronous change. How this relates to writers, like Behn, working in different genres remains to be seen.

Reading the Numbers

The investigation of Behn’s chronological style has also taught us a valuable lesson in what computational stylistics can, and cannot, provide by way of answers. Whilst the statistical findings are empirically robust, the results of exploratory tests such as Principal Components Analysis require careful interpretation by a human eye (for a gentle introduction, see this ‘PCA 4 dummies’ guide). The stylochronometric investigation sought to test the veracity of Behn’s claim in the dedicatory epistle, published with the play in 1682, that The Young King was in fact the earliest of her dramatic works. The computational analyses therefore had two objectives: firstly, did Behn’s dramatic works convey a sufficiently clear temporal signal in terms of their stylistic properties and secondly, where did The Young King position itself when analysed using the temporal signal criteria.

Using exploratory statistical methods, such as PCA and cluster analysis, Behn’s dramatic works were found to have a chronological signal – on the basis that the works organised themselves into near-perfect date order using the linguistic criteria (e.g. most frequent words). This meant that the computational stylistic investigations had the potential to shed new light on Behn’s statement concerning the dating of The Young King. In an ideal world – the stylistic equivalent of a laboratory experiment, with clean surfaces and no confounding factors – the results would have provided a clear-cut indicator of the temporal markers of The Young King and its chronological position in Behn’s dramatic outputs. However, any study working with historical data has to confront the less-than-desirable make-up of the extant evidence; an example, perhaps, of what Labov (1994: 11) famously called the ‘bad data’ problem in the field of historical sociolinguistics. In our case, Behn’s dramatic output disproportionately favours comedies (12 firmly attributed). Tragedies (1) and tragi-comedies (3) are far less frequent. This is not an issue in itself, necessarily, until you wish to compare texts representing different chronological sub-periods and the genre signal starts to interfere with the temporal and authorial elements of the results.

Making sense of the results for The Young King analysis was therefore challenging. The extensive combination of tests – PCA, cluster analysis, Zeta – provided rich and overlapping results for the stylistic similarities between Behn’s plays, but it was difficult to assess the cause of those similarities. For example, the cluster analysis based on the 300 most frequent words in Behn’s drama (reproduced from Evans 2018) shows how The Young King clusters most closely with Abdelazar – Behn’s only tragedy.

Cluster Analysis of Behn’s plays and The Young King (300 Most Frequent Words)

The other clusters on the graph show how Behn’s plays group into temporal periods: early (up to 1673), middle (1677-1682) and late (1682-1690). Is The Young King grouped with Abdelazar because Behn was stretching the truth regarding the early dating of The Young King? Abdelazar was first performed in 1676-7. Or is it because of genre similarities? PCA analyses, such as the one shown here, offer a similarly complex picture: The Young King and Abdelazar again group together, but whether this is because of chronology and/or genre is not clear.


PCA of Behn’s plays and The Young King; 300 most frequent words.
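The idea behind asking which play a text sits closest to can be sketched as follows. The post does not state the distance measure or linkage method behind the cluster diagram, so this sketch assumes Burrows's Delta (the mean absolute difference between z-scored word frequencies), a common choice in stylometry, and reports only each text's nearest neighbour rather than building a full tree. The labels and frequencies are hypothetical.

```python
import math


def z_scores(rows):
    """Standardise each column (word) across all rows (texts)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # Sample standard deviation; fall back to 1.0 for a constant column.
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1)) or 1.0
           for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def delta(a, b):
    """Mean absolute difference between two z-scored frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def nearest_neighbours(labels, rows):
    """Map every text to the label of the text with the smallest Delta."""
    z = z_scores(rows)
    result = {}
    for i, label in enumerate(labels):
        distances = [(delta(z[i], z[j]), labels[j])
                     for j in range(len(labels)) if j != i]
        result[label] = min(distances)[1]
    return result


# Hypothetical rates per 1,000 words for two words in four texts (assumption).
labels = ["early_1", "early_2", "late_1", "late_2"]
rows = [[10, 2], [11, 3], [2, 10], [3, 11]]

neighbours = nearest_neighbours(labels, rows)
for label in labels:
    print(label, "->", neighbours[label])
```

A nearest-neighbour table of this kind says that two texts are similar but not why, which is exactly the interpretive problem with The Young King and Abdelazar.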

Our current thinking is that, if the similarity between The Young King and Abdelazar is wholly attributable to genre, then we would expect to find an equivalent proximity between The Young King and Forc’d Marriage – which is not only a tragicomedy but also one of Behn’s earliest plays, thus carrying a double-whammy of stylistic signals (genre, time-period) if Behn’s dating claim for The Young King is accurate. This association does not emerge in any of the tests we conducted. On this basis, it seems more likely that the version of The Young King we have today (published in 1682) was at least heavily revised by Behn later in her career, leaving stylistic traces more typical of her mid-career writings. If our interpretation is accurate, then this provides a new perspective on Behn’s understanding of the likely reception of her work by Restoration audiences, and some of the strategies required to maintain a successful literary career.

Next Steps

One of the most important lessons of our early investigations into Behn’s style is that, for the most part, the results of computational stylistics are indicators, not pronouncements. There is an art, a humanity, to the reading and understanding of quantitative statistical trends – to recognising and appreciating the nuances in why Behn’s language use looks as it does, whether that’s pronouns, verb choices, or sequences of letter-forms or words.

As we go forward with the attribution analysis proper, our investigations will strive to take these early findings on board, and incorporate them into our methodological decisions (i.e. what texts to compare/contrast; how to organise different texts as “representatives” of a particular time-period or genre), our theoretical perspectives (lifespan literary change is a complex but definite and quantitatively chartable phenomenon) and our understanding of an author’s position within wider changes within the linguistic and cultural systems.

Watch this space for more reports on the attribution analyses.


Mind-Bending Grammars Project; led by Peter Petre, University of Antwerp

PCA 4 Dummies


Burrows, John, and Hugh Craig. 2012. ‘Authors and Characters’. English Studies 93 (3): 292–309.

Craig, D. H., and Brett Greatley-Hirsch. 2017. Style, Computers, and Early Modern Drama: Beyond Authorship. Cambridge ; New York, NY: Cambridge University Press.

Evans, Mel. 2018. ‘Style and Chronology: A Stylometric Investigation of Aphra Behn’s Dramatic Style and the Dating of The Young King’. Language and Literature, May.

Labov, William. 1994. Principles of Linguistic Change. Vol. 1: Internal Factors. Language in Society 20. Oxford: Blackwell.

Rybicki, Jan. 2016. ‘Vive La Différence : Tracing the (Authorial) Gender Signal by Multivariate Analysis of Word Frequencies’. Digital Scholarship in the Humanities 31 (4): 746–61.