Regression Trees in Stylometry

paper
Authorship
  1. Constantina Stamou

    Department of Computing and Information Systems - University of Luton

Work text

Shakespeare's undated plays and Plato's undated dialogues remain among the problematic cases in stylometry. Their true chronology is not yet known, and probably never will be, since there is insufficient external evidence to support any assumptions. Nevertheless, researchers have attempted to identify various stylistic markers and to test them using a number of traditional as well as computer-based approaches, with considerable success. Recent examples include the dating of Carlo Sigonio's Consolatio (Forsyth et al., 1999) and of Sophocles' Trachiniae (Craik & Kaferly, 1987), the sequence of composition of the plays of Christopher Marlowe (Ule, 1982), the development of Euripides' trimeters (Laan, 1995; Gooijer & Laan, 2001), and the dating of the Ars Poetica by Horace (Frischer, 1991).

Along similar lines, the present experiment aims to assess the discriminatory power of a number of lexical, semantic, syntactic, entropic, and phonetic predictor variables when used to detect non-linear patterns for the purposes of dating in a securely dated collection of texts by four authors, namely Edna St Vincent Millay, Edgar Allan Poe, Christina Rossetti, and William Butler Yeats. The main difference from previous approaches is the use of a non-parametric method, namely regression trees, as an alternative to linear regression. Regression trees are used when there is an apparent need to detect non-linear patterns in the data. They are known to produce accurate results, although they tend to be less accurate when the data have a good linear structure (Breiman et al., 1984: 264). It is hoped that modern methods such as regression trees will be more effective in identifying accurate predictors.

A regression tree is a tree-like structure that represents a decision process for regression purposes. Consisting of branches (links) and leaves (nodes), a regression tree seeks to identify the attributes of the objects under examination that function as the best discriminators. The aim is to produce a simple yet reliable model of the relationship between one or more categorical or continuous independent variables and a continuous dependent variable. Regression trees are useful when there is a need for accurate predictors or for an understanding of the types of variables, and of the interactions among them, that underlie the relevant experiments. They provide simple descriptions of the conditions under which cases are categorized (Breiman et al., 1984: 6). This is possible because tree-based methods are non-parametric and can therefore handle mixed data types and non-linear relationships.

Permitting one or more variables at each decision node, a regression tree assigns objects to predetermined ordered response groups. When a regression tree is being developed, the data for each case travel down the tree from the root node, passing through successive splitting nodes, until a terminal node is reached. If a case satisfies the condition set at a node, it follows the left branch towards the left node; otherwise it follows the right branch. Each terminal node contains a constant, which is the average value of the response variable over all cases that reach the node. To use a tree with new data, the observations for each case are dropped down the tree in the same fashion as during construction until a terminal node is reached. The predicted value is the sample average of the dependent variable for that terminal node.
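The traversal just described can be illustrated with a minimal sketch. The Node structure, the two variables, the thresholds, and the years below are all invented for illustration and are not taken from the experiment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a binary regression tree (hypothetical structure)."""
    feature: Optional[int] = None      # index of the variable tested here
    threshold: Optional[float] = None  # the constant c in "x[feature] <= c"
    left: Optional["Node"] = None      # branch taken when the condition holds
    right: Optional["Node"] = None     # branch taken otherwise
    value: Optional[float] = None      # mean response; set on terminal nodes only


def predict(node: Node, case: Sequence[float]) -> float:
    """Drop one case down the tree and return the terminal-node average."""
    while node.value is None:
        if case[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Invented example: variable 0 = mean word length, variable 1 = type/token
# ratio; terminal values stand for the mean composition year in each leaf.
tree = Node(
    feature=0, threshold=4.2,
    left=Node(value=1895.0),
    right=Node(
        feature=1, threshold=0.55,
        left=Node(value=1910.0),
        right=Node(value=1928.0),
    ),
)

early = predict(tree, [4.0, 0.60])  # satisfies the root condition: left leaf
late = predict(tree, [4.5, 0.70])   # fails both conditions: right-most leaf
```

Every case that ends in the same terminal node receives the same predicted date, which is why the fitted model is a step function rather than a line.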

To construct a predictor tree, the following criteria need to be established: a set of binary questions of the form 'is variable j of instance i less than or equal to c?' for all reasonable c values; a splitting rule for each intermediate node; a condition that determines when a node becomes terminal; and a rule for assigning the predicted value of the response variable to the terminal nodes (Breiman et al., 1984: 28-29, 229).
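The four criteria can be made concrete with a small sketch. It assumes a least-squares splitting rule in the style of CART and a minimum node size as the stopping condition; neither choice is claimed to match the settings used in this experiment, and the function names and data are hypothetical.

```python
from statistics import mean


def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(rows, ys):
    """Search every question 'is variable j <= c?' and return the best one.

    Returns (j, c) minimising the summed squared error of the two child
    nodes, or None when no question improves on the unsplit node.
    """
    best, best_cost = None, sse(ys)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate c values: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            c = (lo + hi) / 2
            left = [y for row, y in zip(rows, ys) if row[j] <= c]
            right = [y for row, y in zip(rows, ys) if row[j] > c]
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best, best_cost = (j, c), cost
    return best


def grow(rows, ys, min_split=4):
    """Grow a tree as nested dicts; terminal nodes hold the mean response."""
    # Stopping condition (assumed): too few cases, or no useful question.
    split = best_split(rows, ys) if len(ys) >= min_split else None
    if split is None:
        return {"value": mean(ys)}
    j, c = split
    left = [(r, y) for r, y in zip(rows, ys) if r[j] <= c]
    right = [(r, y) for r, y in zip(rows, ys) if r[j] > c]
    return {
        "var": j, "c": c,
        "left": grow([r for r, _ in left], [y for _, y in left], min_split),
        "right": grow([r for r, _ in right], [y for _, y in right], min_split),
    }


# Invented data: one stylistic marker per text and its composition year.
rows = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
years = [1900.0, 1901.0, 1902.0, 1920.0, 1921.0, 1922.0]
fitted = grow(rows, years, min_split=4)
```

With these data the only split is at c = 6.5, and each child, holding three cases, falls below the minimum size and becomes terminal.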

GUIDE (Generalized, Unbiased Interaction Detection and Estimation), which was developed by Loh (2002), is a regression tree algorithm based on the previously mentioned criteria. This algorithm can handle categorical predictors and has negligible selection bias. GUIDE fits the sample mean of the dependent variable at each node, computes the residuals between the observed values at each node and the sample mean, and divides the cases into two groups according to the signs of the residuals. To produce the tree, GUIDE uses the minimal cost-complexity method of CART (Breiman et al., 1984) with V-fold cross-validation. GUIDE was selected for this experiment because it is unbiased, accurate, and readily obtainable.
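The residual-sign step can be sketched as follows. Only the grouping by sign is described above; the cross-tabulation against a predictor cut at its median and the Pearson chi-square statistic are simplifying assumptions of this sketch, intended to suggest how such groups might be used to compare candidate variables. This is not GUIDE's actual implementation, and the data are invented.

```python
from statistics import mean, median


def residual_signs(ys):
    """Fit the node mean and label each case by the sign of its residual.

    Cases exactly at the mean are labelled "-" (a convention of this sketch).
    """
    node_mean = mean(ys)
    return ["+" if y - node_mean > 0 else "-" for y in ys]


def sign_table(xs, signs):
    """2x2 counts: residual sign by predictor at-or-below / above its median."""
    cut = median(xs)
    table = {"+": [0, 0], "-": [0, 0]}
    for x, s in zip(xs, signs):
        table[s][0 if x <= cut else 1] += 1
    return table


def chi_square(table):
    """Pearson chi-square statistic for the 2x2 table of counts."""
    rows = [table["+"], table["-"]]
    total = sum(map(sum, rows))
    stat = 0.0
    for r in rows:
        for k in range(2):
            expected = sum(r) * (rows[0][k] + rows[1][k]) / total
            if expected > 0:
                stat += (r[k] - expected) ** 2 / expected
    return stat


# Invented node: six texts with their years and two candidate markers.
years = [1900.0, 1902.0, 1904.0, 1920.0, 1922.0, 1924.0]
marker_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # rises with the year
marker_b = [1.0, 6.0, 2.0, 5.0, 3.0, 4.0]  # largely unrelated to the year

signs = residual_signs(years)
chi_a = chi_square(sign_table(marker_a, signs))
chi_b = chi_square(sign_table(marker_b, signs))
```

Because each variable is assessed through a contingency table rather than through an exhaustive search over split points, a variable with many distinct values gains no automatic advantage, which is the intuition behind the low selection bias.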

Subsequently, GUIDE was tested on samples of poetry and personal correspondence by the four authors, using stylistic markers shown in previous studies to produce accurate results. The stylistic markers were selected according to three criteria: that they were automatically quantifiable, that they had been used by other researchers for similar projects, and that they were of linguistic validity. No distinction was made in terms of whether the markers had been used for authorship, genre, chronology, or stylistic studies in general, since it has already been shown that markers that work in one case do not necessarily work in others (Holmes, 1998: 111; Rudman, 2000).

To examine whether the observed differences in the accuracy of the trees produced by GUIDE are genuine or attributable to the natural variability among populations and samples, four four-way ANOVAs were performed. The objective was to tease out the main effects and all two-way interactions for each factor, and to assess whether there is a significant difference in the use of the different factors. The dependent variables were the Total Number of Variables in the final trees, Mean Square Error (MSE), Median Square Error (MedSE), and Scaled Mean Square Error (MSEscaled), the last three being variations of the accuracy measure MSE produced by GUIDE for the trees. The factors were Genre, Author, Order, and Variable Type.
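The three accuracy measures can be illustrated with a short sketch. The exact scaling behind MSEscaled is not specified here, so the sketch assumes that the MSE is divided by the error of always guessing the mean observed date; that definition, the function name, and the data are assumptions made for illustration.

```python
from statistics import mean, median


def error_measures(observed, predicted):
    """Return (MSE, MedSE, scaled MSE) for one tree's predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    mse = mean(squared)
    medse = median(squared)
    # Assumed baseline: the error of always guessing the mean observed date.
    centre = mean(observed)
    baseline = mean([(o - centre) ** 2 for o in observed])
    return mse, medse, mse / baseline


# Invented composition years and the dates a tree might assign to them.
observed = [1900.0, 1910.0, 1920.0, 1930.0]
predicted = [1902.0, 1909.0, 1920.0, 1925.0]
mse, medse, scaled = error_measures(observed, predicted)
```

Under this assumption a scaled value below 1 indicates an improvement over guesswork, and the measure is comparable across authors whose dated texts span different numbers of years.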

For the first ANOVA, which used the total number of variables as the dependent variable, there was no significant interaction and no significant main effect for any of the factors. It appears that the total number of variables which GUIDE interpreted as important is unpredictable among the different authors. The second ANOVA tested MSE against the four factors. Only one significant interaction was identified, that between author and genre, together with significant main effects for the same two factors. These results were almost identical to those of the third ANOVA, which used MedSE as the dependent variable. This leads to the conclusion that different authors appear to produce trees of variable predictive accuracy according to the genre they write in; however, it is not possible at this stage to identify which genre is the most difficult to assign. The fourth ANOVA examined the scaled error against the four factors to assess the size of the improvement over guesswork rather than just the raw 'success'. There was no significant interaction or main effect for the relative error, which implies that no tree consisting of different types of variables for different authors appears to be bigger or more accurate than any other. It would seem that none of these factors produces consistently different results. However, the interaction effects and main effects detected could be due to the restricted range of Yeats's letters, which span twenty-eight years, compared with the poems, which span fifty-four years. This unbalanced productive period (which is due to the availability of the texts) could explain why the significant effects disappear when MSEscaled is used.

To summarize, this study has led to the conclusion that authors and genres differ inconsistently, and that no consistent results have been obtained across the analyses so far. This suggests that using personal correspondence to predict the dates of poems is a precarious idea. It also suggests that not all authors can be dated; the accuracy of dating will therefore depend on the individual, which renders the idea of chronological 'fingerprinting' impossible, since not all authors would have such a 'fingerprint'. However, one possible explanation for such an outcome is the small text size used. A study on similar grounds but with aggregated datasets is therefore currently taking place as an attempt to verify the present findings, and its results will be presented at the conference.

References

1. Breiman, L., Friedman, J.H., Olshen, R.A. & C.J. Stone (1984) Classification and Regression Trees. Belmont, California: Wadsworth International Group.
2. Craik, E.M. & D.H.A. Kaferly (1987) 'The Computer and Sophocles' Trachiniae' Literary and Linguistic Computing 2(2): 86-97.
3. Forsyth, R.S., Holmes, D.I. & E.K. Tse (1999) 'Cicero, Sigonio and Burrows: Investigating the Authenticity of the Consolatio' Literary and Linguistic Computing 14(3): 375-400.
4. Frischer, B. (1991) 'The Date of the Poem' in Shifting Paradigms: New Approaches to Horace's Ars Poetica. Atlanta, GA: Scholars Press.
5. Gooijer, J.G. de & N.M. Laan (2001) 'Change-Point Analysis: Elision in Euripides' Orestes' Computers and the Humanities 35: 167-191.
6. Holmes, D.I. (1998) 'The Evolution of Stylometry in Humanities Scholarship' Literary and Linguistic Computing 13(3): 111-117.
7. Laan, N.M. (1995) 'Stylometry and Method: the Case of Euripides' Literary and Linguistic Computing 10: 271-278.
8. Loh, W-Y. (2002) 'Regression Trees with Unbiased Variable Selection and Interaction Detection' Statistica Sinica 12: 361-386.
9. Rudman, J. (2000) 'The Style-Marker Mapping Project: A Rationale and Progress Report' ALLC/ACH2000, July 2000.
10. Ule, L. (1982) 'Recent Progress in Computer Methods of Authorship Determination' ALLC Bulletin 10(3): 73-89.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2004

Hosted at Göteborg University (Gothenburg)

Gothenburg, Sweden

June 11, 2004 - June 16, 2004

105 works by 152 authors indexed

Series: ACH/ICCH (24), ALLC/EADH (31), ACH/ALLC (16)

Organizers: ACH, ALLC

Tags
  • Keywords: None
  • Language: English
  • Topics: None