Rules against the Machine: Building Bridges from Text to Metadata

poster / demo / art installation
Authorship
  1. 1. José Calvo Tello

    Universität Würzburg (Julius Maximilian University of Wurzburg)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Introduction
Digital literary studies advance in their research, requiring more specific metadata about literary phenomena: narrator (Hoover 2004), characters (Kastorp et al. 2015), place and period, etcetera. This metadata can be used to explain results in tasks like authorship attribution or genre detection, or to evaluate digital methods (Calvo Tello 2017). What could be the most efficient way to start annotating this information in corpora of thousand of texts in languages, genres and historical periods for which many NLP tools are not trained for? In this proposal, the aim is to identify specific literary metadata about entire texts with methods that are either language-independent or easily adaptable for humanists.

Two Ways from Text to Metadata
The two approaches to classify unlabeled samples applied here are rule-based classification and supervised machine learning. In rule-based classification (Witten et al. 2011), domain experts define formalised rules that correctly classify the samples. For example a rule based on a single token can be defined for each class to predict whether a text is written in third person (83% of the corpus) or first person using tokens for the two values are the Spanish words
dije ('I said') and
dijo (‘he said’), and the rule:

if
dijo appears 90% more than
dije, the novel is written in third person

if
dijo appears less, in first person

The results of applying this rule can be presented as a confusion matrix:

Fig 1. Confusion Matrix of rule-based results about narrator
For supervised methods (Müller and Guido 2016; VanderPlas 2016), we need labeled samples to train and evaluate the method. In the following table, the different classifiers and document-representations achieve different accuracy scores:

raw
relative
tfidf
zscores

SVC
0.90
0.83
0.83
0.88

KNN
0.83
0.88
0.81
0.81

RF
0.88
0.88
0.86
0.90

DT
0.84
0.83
0.84
0.82

LR
0.88
0.83
0.83
0.17

BN
0.72
0.72
0.72
0.82

GN
0.72
0.80
0.80
0.81

Fig 2. Accuracy (F1-score) for narrator

Corpus and Metadata
The data is part of the
Corpus of Spanish Novels of the Silver Age (1880-1939) (used in Calvo Tello et al. 2017), with 350 novels in XML-TEI by 58 authors. Each text has been annotated manually with metadata and its degree of certainty has been assigned. 262 texts with either high or medium certainty have been used to create a gold-standard with the following classes:

protagonist.gender
protagonist.age
protagonist.socLevel
setting.type
setting.continent
setting.country
setting.name
narrator
representation
time.period
end

Modelisation and Methods
The scripts have been written in Python (available on GitHub)
<https://github.com/cligs/projects2018/tree/master/text2metadata-dh>. The features have been represented as different document models (Kestemont et al. 2016):

raw frequencies
relative frequencies
tf-idf
z-scores

Different classify algorithms (cross validation, 10 folds) and amount of Most Frequent Words have been evaluated. For each class a single token was used to represent each class value and a ratio was assigned for the default class value (see repository in GitHub for rules). Both approaches were compared to a “most populated class” baseline, quite high in many cases.

Results
The results of both approaches are as following:

Class
F1 baseline
F1 Rule
F1 Cross Mean
F1 Cross Std
Algorithm
Model
MFW
Winner

end
0.60
0.54
0.60
0.02
LR
tfidf
100
Baseline

narrator
0.83
0.80
0.91
0.04
RF
tfidf
1000
ML

protagonist.age
0.55
0.25
0.55
0.01
LR
tfidf
100
Baseline

protagonist.gender
0.80
0.68
0.80
0.01
BN
tfidf
100
Baseline

protagonist.socLevel
0.63
0.49
0.64
0.07
SVC
zscores
5000
Baseline

representation
0.88
0.80
0.88
0.01
LR
tfidf
100
Baseline

setting.continent
0.95
0.94
0.96
0.01
SVC
zscores
5000
Baseline

setting.continent.binar
0.95
0.95
0.95
0.19
LR
zscores
500
Baseline

setting.country
0.93
0.38
0.94
0.01
SVC
zscores
1000
Baseline

setting.country.binar
0.87
0.47
0.88
0.03
SVC
zscores
1000
Baseline

setting.name
0.64
0.85
0.71
0.02
SVC
zscores
1000
Rule

setting.type
0.48
0.46
0.71
0.05
SVC
zscores
5000
ML

time.period
0.95
0.95
0.97
0.01
BN
zscores
5000
Baseline

Fig 3. Results
In many cases the baselines are higher than the results of both approaches. The rule outperformed the baseline in the case of name of the setting with very good results. In two cases (narrator and setting's type), Machine Learning is the most successful approach and its F1 is statistically higher than the baseline (one sample t-test, ɑ = 5%). The algorithms Supported Vector Machines, Logistic Regression and Random Forest are most successful, while tf-idf and speacilly z-scores got the best results, the last one a data representation “highly uncommon in other applications” different from stylometry (Kestemont et al, 2016).

Conclusions
In this proposal I have used simple rules and simple features in order to detect relatively complex literary metadata in many cases with high baselines. While Machine Learning showed a statistically significant improvement in detection for two classes (type of setting and narrator), rules worked better for the name of the setting. This is a promising point to continue researching in order to annotate the rest of the corpus.

Bibliography

Calvo Tello, J. (2017). What does Delta see inside the Author?: Evaluating Stylometric Clusters with Literary Metadata. III Congreso de La Sociedad Internacional Humanidades Digitales Hispánicas: Sociedades, Políticas, Saberes. Málaga: HDH, pp. 153–61 <
>.

Calvo Tello, J., Schlör, D., Henny-Krahmer, U. and Schöch, C. (2017). Neutralising the Authorial Signal in Delta by Penalization: Stylometric Clustering of Genre in Spanish Novels. Montréal: ADHO, pp. 181–83 <
>.

Hoover, D. L. (2004). Testing Burrows’s Delta. Literary and Linguistic Computing, 19(4): 453–75.

Kastorp, F., Kestemont, M., Schöch, C. and Bosch, A. Van den (2015). The Love Equation: Computational Modeling of Romantic Relationships in French Classical Drama. Sixth International Workshop on Computational Models of Narrative. Atlanta, GA, USA. <https://zenodo.org/record/18343>.

Kestemont, M., Stover, J., Koppel, M., Karsdorp, F. and Daelemans, W. (2016). Authenticating the writings of Julius Caesar. Expert Systems with Applications, 63: 86–96 <http://dx.doi.org/10.1016/j.eswa.2016.06.029>.

Müller, A. C. and Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientist. Beijing: O’Reilly.

VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. First edition. Beijing Boston Farnham: O’Reilly.

Witten, I., Frank, E. and Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition. San Francisco: Morgan Kaufmann.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO / EHD - 2018
"Puentes/Bridges"

Hosted at El Colegio de México, Universidad Nacional Autónoma de México (UNAM) (National Autonomous University of Mexico)

Mexico City, Mexico

June 26, 2018 - June 29, 2018

340 works by 859 authors indexed

Conference website: https://dh2018.adho.org/

Series: ADHO (13), EHD (4)

Organizers: ADHO