Crossed Semantic Analysis of Literary Texts with DeSeRT

paper, specified "long paper"
Authorship
  1. Jean-Gabriel Ganascia

    LIP6, University Pierre and Marie Curie, Labex OBVIL - Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

  2. Chiara Mainardi

    Labex OBVIL - Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University), Université Pierre et Marie Curie (UPMC) (University of Pierre and Marie Curie)

Work text

Introduction
Comparative research on a large literary corpus is often lengthy and demanding. With the digitization of texts, scholars have begun to use computers for this work. Semantic information is essential to a proper understanding of texts and is an extremely important factor in cross-textual analysis. This is why we have developed DeSeRT, a new semantic search engine built especially for the Digital Humanities. A French version of DeSeRT is freely available online at http://obvil-dev.paris-sorbonne.fr/desert/.

This paper presents some of the results obtained from a corpus of 17th- and 18th-century texts characteristic of the debates about theater in the classical age. It is organized as follows: a first section briefly describes the search engine and the kinds of investigation it allows, a second section introduces the corpus, a third section shows some of the results obtained, and a conclusion closes the paper.

The DeSeRT Search Engine
DeSeRT has been designed to identify and compare rewriting, paraphrasing or reformulation. It is based on an idea, already developed (Barron-Cedeño et al., 2013; Ferrero and Simac-Lejeune, 2015), according to which, even if the reformulations cannot be reduced to paraphrases, they retain the meaning of original texts by using either the same words or words of similar meaning. As a consequence, the detection of co-occurrences of a few semantically equivalent lemmas in small blocks of texts is sufficient to capture the equivalent meaning and, therefore, to identify the reformulation of the same ideas.
This is implemented in four steps:

  1. Dividing texts into small, partially overlapping blocks of words. Typically each block contains 300 words and two consecutive blocks overlap by one half, but both the block size and the proportion of overlap can vary.

  2. Extracting the meaningful lemmas using a POS tagger. This step makes it possible to exclude some syntactic categories, such as prepositions or articles, and to obtain the lemma associated with each word. The current implementation uses TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/), but this could easily be changed.

  3. Indexing each block with the lemmas obtained at step 2. Without going into implementation details, let us remark that the index is stored using the SQLite (http://sqlite.org/) database management system.

  4. Retrieving blocks that contain many identical or semantically equivalent lemmas. This fourth module exploits a dictionary of synonyms to recover blocks that have similar meanings. It is also possible to use a thesaurus to restrict the search to a set of predefined terms, but this is not required. The proximity between two blocks of text is based on an Okapi similarity measure (Spärck Jones et al., 2000), whose evaluation is greatly simplified by the fact that all blocks have the same size.
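The four steps above can be sketched as follows. This is a minimal illustration in Python, not DeSeRT's actual code: the stop-word set is a hypothetical stand-in for the POS-based filtering and lemmatization that DeSeRT delegates to TreeTagger, the table layout and all function names are assumptions, and step 4 is reduced to counting shared lemmas, without synonyms or Okapi scoring.

```python
import re
import sqlite3

# Hypothetical stop list standing in for the POS-based exclusion of
# prepositions, articles, etc. (DeSeRT uses TreeTagger for this step).
STOPWORDS = {"le", "la", "les", "de", "du", "des", "un", "une", "et", "est"}


def split_into_blocks(words, size=300, overlap=0.5):
    """Step 1: cut a word list into fixed-size, partially overlapping blocks."""
    step = max(1, int(size * (1 - overlap)))
    blocks = []
    for start in range(0, len(words), step):
        blocks.append(words[start:start + size])
        if start + size >= len(words):  # the last block reached the end
            break
    return blocks


def extract_lemmas(block):
    """Step 2 stand-in: lowercase and drop stop words (no real lemmatization)."""
    return {w.lower() for w in block if w.lower() not in STOPWORDS}


def build_index(texts, size=300, overlap=0.5):
    """Step 3: store one (text, block, lemma) row per lemma in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idx (text_id TEXT, block_no INTEGER, lemma TEXT)")
    for text_id, raw in texts.items():
        words = re.findall(r"\w+", raw)
        for block_no, block in enumerate(split_into_blocks(words, size, overlap)):
            rows = [(text_id, block_no, lemma) for lemma in sorted(extract_lemmas(block))]
            conn.executemany("INSERT INTO idx VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def blocks_sharing(conn, lemmas, min_shared=2):
    """Step 4 (simplified): blocks containing at least min_shared query lemmas."""
    marks = ",".join("?" for _ in lemmas)
    query = (
        "SELECT text_id, block_no, COUNT(DISTINCT lemma) AS shared "
        f"FROM idx WHERE lemma IN ({marks}) "
        "GROUP BY text_id, block_no "
        "HAVING COUNT(DISTINCT lemma) >= ? "
        "ORDER BY shared DESC"
    )
    return conn.execute(query, (*lemmas, min_shared)).fetchall()


# Toy usage with very small blocks (4 words, half overlapping).
toy_texts = {
    "t1": "le théâtre est une école de débauche",
    "t2": "école publique de débauche et de vice",
}
toy_index = build_index(toy_texts, size=4, overlap=0.5)
toy_hits = blocks_sharing(toy_index, ["école", "débauche"], min_shared=2)
```

Storing one row per (block, lemma) pair keeps the retrieval step a single grouped query; a real index would presumably also store block offsets so that matching passages can be displayed.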

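More precisely, the Okapi similarity measure of a block $B$ with a set of terms $T = \{t_1, \dots, t_n\}$ is given by the following formula:

$$\mathrm{score}(B, T) = \sum_{i=1}^{n} \mathrm{IDF}(t_i) \cdot \frac{f(t_i, B)\,(k + 1)}{f(t_i, B) + k \left(1 - b + b\,\frac{|B|}{avdl}\right)}$$

where $f(t_i, B)$ is the frequency of the term $t_i$ in the block $B$, $|B|$ is the size of block $B$, $k$ and $b$ are two free parameters chosen in advance (usually $k \in [1.2, 2]$ and $b = 0.75$), $avdl$ is the average document description length in the collection, and $\mathrm{IDF}(t_i)$ is the inverse document frequency of term $t_i$, which in the standard Okapi formulation is given by:

$$\mathrm{IDF}(t_i) = \log \frac{N - n(t_i) + 0.5}{n(t_i) + 0.5}$$

with $N$ the total number of blocks in the collection and $n(t_i)$ the number of blocks containing $t_i$.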

It appears that, when all blocks have the same size and the frequency of a term in a block is assumed to be at most 1 (an approximation justified by the small size of the blocks), this formula can be simplified. It then becomes equivalent to the information-theoretic measure of the terms in the block, i.e.

$$\mathrm{score}(B, T) = -\sum_{t_i \in T \cap B} \log \Pr(t_i)$$

where $\Pr(t_i)$ is the probability of the term $t_i$ in the overall corpus.
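The simplification can be checked numerically. The sketch below assumes the standard Okapi BM25 form of the score and of the IDF, represents each block as a plain list of lemmas, and estimates Pr(t) as the fraction of blocks containing t; the function names, the toy blocks and that estimate are illustrative assumptions, not details taken from DeSeRT.

```python
import math


def idf(term, blocks):
    """Standard Okapi IDF: log((N - n + 0.5) / (n + 0.5))."""
    n_blocks = len(blocks)
    n_containing = sum(1 for block in blocks if term in block)
    return math.log((n_blocks - n_containing + 0.5) / (n_containing + 0.5))


def okapi_score(block, terms, blocks, k=1.2, b=0.75):
    """Full Okapi score of a block against a set of terms."""
    avdl = sum(len(other) for other in blocks) / len(blocks)
    score = 0.0
    for term in terms:
        f = block.count(term)  # frequency of the term in the block
        norm = k * (1 - b + b * len(block) / avdl)
        score += idf(term, blocks) * f * (k + 1) / (f + norm)
    return score


def simplified_score(block, terms, blocks):
    """With equal block sizes and f <= 1, each present term contributes its IDF."""
    return sum(idf(term, blocks) for term in terms if term in block)


def information_score(block, terms, blocks):
    """Information-theoretic reading: minus the sum of log Pr(t) over present terms."""
    total = 0.0
    for term in terms:
        if term in block:
            pr = sum(1 for other in blocks if term in other) / len(blocks)
            total -= math.log(pr)
    return total


# Toy collection: five blocks of identical size, no lemma repeated in a block.
blocks = [
    ["théâtre", "idolâtrie", "spectacle", "diable"],
    ["comédie", "acteur", "pureté", "religion"],
    ["théâtre", "acteur", "école", "débauche"],
    ["peste", "chair", "spectacle", "renoncer"],
    ["roi", "cour", "ville", "peuple"],
]
terms = ["idolâtrie", "spectacle"]

full = okapi_score(blocks[0], terms, blocks)
reduced = simplified_score(blocks[0], terms, blocks)
info = information_score(blocks[0], terms, blocks)
```

When |B| equals avdl and f is 0 or 1, the length normalization equals k, so the fraction (k + 1) / (1 + k) cancels and `full` agrees with `reduced` up to rounding. The value `info` differs slightly from both because of the 0.5 smoothing in the IDF; the two get closer as a term becomes rare relative to the number of blocks.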

Using this score, it is possible to measure the similarity between two blocks, or between a block and a set $T$ of terms $t_i$. As a consequence, DeSeRT can be used in different ways. Queries may be expressed as words or as concepts, i.e. words that are expanded using the dictionary of synonyms. It is also possible to compare any text (or file) to a corpus: DeSeRT then detects the corpus blocks whose meaning is similar to that of blocks of the given text, which makes it possible, for instance, to follow the arguments in a dispute or the anecdotes and commonplaces that are reused. Lastly, it is also possible (but not mandatory) to add a thesaurus or an ontology to focus the search on a given semantic field.
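The difference between querying by words and querying by concepts can be illustrated with a short Python sketch. The synonym dictionary below is a hypothetical toy; DeSeRT's actual dictionary, its format and the function names used here are not described in this paper and are assumptions made for the example.

```python
# Hypothetical toy dictionary of synonyms (illustration only).
SYNONYMS = {
    "théâtre": {"comédie", "spectacle"},
    "diable": {"démon", "satan"},
}


def expand_concept(word, synonyms=SYNONYMS):
    """A concept is the query word together with its synonyms."""
    return {word} | synonyms.get(word, set())


def matched_words(block_lemmas, query_words):
    """Word query: only the query words themselves are looked up."""
    block = set(block_lemmas)
    return {w for w in query_words if w in block}


def matched_concepts(block_lemmas, query_words, synonyms=SYNONYMS):
    """Concept query: a word matches if it or one of its synonyms occurs."""
    block = set(block_lemmas)
    return {w for w in query_words if expand_concept(w, synonyms) & block}


block = ["comédie", "satan", "roi"]
query = ["théâtre", "diable", "idolâtrie"]
by_words = matched_words(block, query)
by_concepts = matched_concepts(block, query)
```

In this toy case the word query matches nothing, while the concept query recovers "théâtre" and "diable" through their synonyms, which is the behavior needed to detect reformulations rather than literal reuse.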

Note that many tools derived from plagiarism-detection techniques already exist to identify paraphrases, reuses and borrowings, i.e. sequences of words that are approximately identical (e.g. Ganascia et al., 2014; Horton et al., 2010). However, these techniques are unable to spot reformulations of the same ideas or allusions to previous texts. DeSeRT has been designed to overcome these limitations.

The Hatred of Theater
The project Haine du Théâtre (“The Hatred of Theater” in French; http://obvil.paris-sorbonne.fr/projets/la-haine-du-theatre) aims to analyze theater disputes in Europe using scientific approaches and critical editions of polemical texts. The team’s work focuses on uncovering the circumstances of theater controversies and the arguments used in them across Europe: not only in France, but also in England, Spain, Italy and the Germanic area, from the last decades of the 16th century to the beginning of the 19th century.

The corpus of the project collects many texts written in French during the 17th and 18th centuries. The purpose of this project is to explore the gray areas of theater controversies in order to outline a global overview and to discover where and how the polemics began, their chronological discrepancies, the links between them, and their contemporary resurgences. The Haine du Théâtre collection currently comprises 27 texts.

Applying DeSeRT to the Hatred of Theater Corpus

Discovery of reuses
Querying the 27 texts of the Haine du Théâtre corpus, we found much reuse of similar passages and texts. DeSeRT is useful not only for detecting parts of texts that deal with the same concept, but also for finding borrowings.

For example, comparing two texts, the Défense of Voisin (1671) and the Traité de la Comédie of Nicole (1667), we immediately discover that the Traité has been included in Voisin's Défense, a very long text, not once but twice. The first time, Voisin presents it as a re-publication; then he reuses phrases similar to Nicole's in different passages, sprinkling them throughout his Défense.

The keywords of this correspondence are very well detected by DeSeRT, as can be seen in the example below.

Furthermore, continuing the analysis of the text we discover that in these two texts the actor is frequently associated with the idea of purity in religion.

Reformulations of Ideas
DeSeRT also shows in detail the parts of the corpus that are similar or that develop the same ideas. This may be done either on demand, for texts supplied by the user, or over the whole corpus, which automates the process.
For instance, we have found many topics common not only to the texts by Nicole and Voisin, but also to two others, e.g. the theme of idolatry as the “mother of all spectacles” in (Aubignac, 1666) and (Conti, 1666).
Note that, in the following figures, we have grayed the identical passages with the MEDITE system, which spots only strict homologies and ignores many words that DeSeRT treats as identical because they correspond to the same lemma or to two synonymous lemmas. As a consequence, the gray zones considerably underestimate the semantic proximity detected by DeSeRT.

Secondly, as shown in the following figures, the topic “renouncing the Devil” (Renoncer au Diable in French) is regularly present in the texts written by Aubignac, Conti and Voisin (D’Aubignac, 1666: 59, 62, 65, 72, 73, 74, 79, 217; Conti, 1666: 88, 105, 120, 144, 173, 182, 184; Voisin, 1671: 59, 86, 88, 97, 113, 114, 124, 165, 205, 212, 228, 407, 427, 433, 451, 463, 481), while it appears only once in (Nicole, 1667: 477).

Further results from DeSeRT show that many other expressions are common to the four authors, such as the description of the theatre as a “flesh of pestilence” (chair de pestilence) or as a “school of debauchery”.

Conclusion
To conclude, DeSeRT allows the discovery of reformulations and similar phrases, as well as related topics and passages in a corpus. It can be used on any corpus made up of plain-text (.txt) files. As we have briefly shown in this paper, this search engine enables users to identify crucial passages of a specific corpus according to two types of detection: the discovery of reuses, such as plagiarism or hidden rewritings, and the reformulation of ideas, which can be given manually by the user (as words or concepts) or extracted automatically by the DeSeRT search engine.

Bibliography

Barron-Cedeño, A., et al. (2013). Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. In Association for Computational Linguistics, vol. 39: 917–47.

Conti, Prince de, A. de B. (1666). Traité de la Comédie et des spectacles, Louis Billaine, Paris.

D'Aubignac, Abbé, F.H. (1666). Dissertation sur la condamnation des théâtres, N. Pépingué, Paris.

Ferrero, J. and Simac-Lejeune, A. (2015). Détection automatique de reformulations - Correspondance de concepts appliquée à la détection du plagiat, 15ème conférence internationale sur l'extraction et la gestion des connaissances (EGC 2015), Luxembourg.

Ganascia, J.G., Glaudes, P. and Del Lungo, A. (2014). Automatic detection of reuses and citations in literary texts, Literary and Linguistic Computing, 2014, doi: 10.1093/llc/fqu020

Horton, R., Olsen, M., and Roe, G. (2010). Something Borrowed: Sequence Alignment and the Identification of Similar Passages in Large Text Collections, Digital Studies/ Le champ numérique 2(1). Available at: http://www.digitalstudies.org/ojs/index.php/digital_studies/article/view/190/235 (last access 7 November 2013).

Nicole, P. (1667). De la Comédie, Adolphe Beyers, Liege.

Spärck Jones, K., Walker, S. and Robertson, S. E. (2000). A probabilistic model of information retrieval: Development and comparative experiments: Part 1. Information Processing and Management 36 (6): 779–808.

Voisin, Abbé, J. (1671). Défense du traité de Mgr le Prince de Conti touchant la comédie et les spectacles ou la réfutation d'un livre intitulé Dissertation sur la condamnation des théâtres, Coignard, Paris.


Conference Info

Complete

ADHO - 2016
"Digital Identities: the Past and the Future"

Hosted at Jagiellonian University, Pedagogical University of Krakow

Kraków, Poland

July 11, 2016 - July 16, 2016

454 works by 1072 authors indexed

Conference website: https://dh2016.adho.org/

Series: ADHO (11)

Organizers: ADHO