Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie

Authorship
  1. Russell Horton

    Digital Library Development Center - University of Chicago

  2. Mark Olsen

    ARTFL Project - University of Chicago

  3. Robert Morrissey

    ARTFL Project - University of Chicago

  4. Glenn Roe

    ARTFL Project - University of Chicago

  5. Robert Voyer

    ARTFL Project - University of Chicago

Work text

One of the crowning achievements of the 18th century
Enlightenment was the Encyclopédie ou Dictionnaire
raisonné des sciences, des arts et des métiers, par une Société
de Gens de lettres, edited by Diderot and d'Alembert. Published
in Paris between 1751 and 1772, in 17 volumes of text and 11
volumes of plates, it contains 74,000 articles written by more
than 140 contributors.1 The Encyclopédie was a massive
reference work for the arts and sciences, as well as a machine
de guerre which served to propagate Enlightenment ideas. The
impact of the Encyclopédie was enormous. Through its attempt
to classify learning and to open all domains of human activity
to its readers, the Encyclopédie gave expression to many of the
most important intellectual and social developments of its time.
The scale and ambition of the Encyclopédie inspired its editors
to adopt three distinct modes of organization which, taken
together, Diderot described as encyclopedic: dictionary,
hierarchical classification, and the renvois (cross-references).
The interaction of these three modes has led modern
commentators to describe the Encyclopédie as an "ancestor of
hypertext" and to depict Diderot as "l'internaute d'hier"2.
D'Alembert underscores the importance of the organization of
knowledge in the Discours Préliminaire:
As an Encyclopedia, it is to set forth the order and connection of
the parts of human knowledge. As a Reasoned Dictionary of the
Sciences, Arts, and Trades, it is to contain the general principles
that form the basis of each science and each art ... and the most
essential facts that make up the body and substance of each.3
Of the three modes of organization, the dictionary mode
(organization of entries in alphabetical order) is certainly the
simplest and the most arbitrary. The second mode of
organization is classification, wherein each dictionary entry is
assigned to a "class of knowledge," placing it within the "order"
of human understanding, as depicted in the Système Figuré des
connaissances humaines. Modeled after Bacon's classification
of knowledge and Enlightenment theories of epistemology, all
understanding is founded upon memory, reason, or imagination,
with numerous categories and sub-categories branching out
from these three faculties.4 However, simply placing an entry
into this hierarchy of knowledge was insufficient to indicate
the interconnections of knowledge. Thus, Diderot created an
extensive system of renvois, the third mode of organization,
providing a lattice of interconnections between individual leaves
of the tree as well as between classes of knowledge.5
The central role of the classification system in the intellectual
objectives of the Encyclopédie editors is indicated by the extent
to which it has been discussed and debated by both
contemporaneous scholars and later researchers. The editors
were remarkably diligent in assigning classes of knowledge to
each article and sub-article. Of the 73,840 main and sub articles,
55,227 were assigned classes of knowledge. The editors were,
however, somewhat less diligent in maintaining a precisely
controlled list. Thus the classifications as found in the text are
an amalgam of abbreviations, conflations, and even entries that
are not found on the Système Figuré. We have recently
completed orthographic normalization of the classes of
knowledge assigned to each article,6 resulting in some 54,289
articles with 2,600 normalized classes of knowledge. The twenty
most frequent classifications by number of articles are:
5513 Géographie
4794 Géographie moderne
3084 Géographie ancienne
2396 Jurisprudence
2304 Grammaire
1894 Marine
1483 Commerce
1277 Histoire naturelle. Botanique
1194 Histoire moderne
1115 Mythologie
1069 Histoire naturelle
889 Histoire ancienne
796 Medecine
730 Architecture
689 Jardinage
682 Littérature
627 Maréchallerie
614 Botanique
558 Histoire ecclésiastique
517 Théologie
Like the Système Figuré, these classifications are a reflection
of how knowledge was ordered and classified in the 18th
century. Given the assumptions that ontologies are historically
contingent and that the Encyclopédie is by far the most
consistent and coherent representation of the structure of 18th
century knowledge in French, this paper reports the results of
our current experiments using machine learning and data mining
techniques to understand and exploit this unique resource. Our
initial objectives are three-fold. First, we plan to examine the
relationship of the classifications to the content of the articles
using machine learning techniques to identify feature sets that
characterize classes of knowledge in the 55,000 articles
classified by the editors of the Encyclopédie. Secondly, we will
apply these feature sets to the 19,500 articles for which we do
not have a class of knowledge and evaluate the accuracy of
classification by randomly selecting articles with known authors
which scholars will then inspect. Most contributors to the
Encyclopédie worked on fairly specific domains -- Rousseau
contributed almost exclusively on music, for example -- so we can use
authorship as one control for judging the accuracy of
classification. Similarly, the cross-references will also serve as
an evaluation control, since 50% of the renvois link to articles
within the same class of knowledge. Finally, we plan to apply
these feature sets to the unclassified "plate legends" in an effort
to determine accuracy by examining the degree to which
classification of the plate legends reflects the relationship of
particular plate legends to particular articles.
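The cross-reference control described above could be computed along
the following lines. This is a minimal sketch, not the authors' code:
the function name same_class_share, the (source, target) pair layout
for the renvois, and the toy headwords and labels are all assumptions
made for illustration. It reports the share of cross-references whose
two endpoints carry the same class of knowledge, a figure that could
then be compared with the roughly 50% observed for editor-assigned
classes.

```python
from typing import Dict, Iterable, Optional, Tuple


def same_class_share(
    renvois: Iterable[Tuple[str, str]],
    labels: Dict[str, str],
) -> Optional[float]:
    """Share of cross-references linking two articles of the same class.

    Pairs with an unlabeled endpoint are skipped. Returns None when no
    pair can be scored.
    """
    scored = 0
    same = 0
    for source, target in renvois:
        if source not in labels or target not in labels:
            continue  # cannot judge a pair with a missing class
        scored += 1
        if labels[source] == labels[target]:
            same += 1
    return same / scored if scored else None


# Toy data (hypothetical): headword -> assigned or predicted class.
toy_labels = {
    "ASOPE": "Géographie ancienne",
    "THRACE": "Géographie ancienne",
    "FIEVRE": "Medecine",
    "JUPITER": "Mythologie",
}
toy_renvois = [("ASOPE", "THRACE"), ("FIEVRE", "JUPITER"), ("ASOPE", "INCONNU")]

share = same_class_share(toy_renvois, toy_labels)
```

Running the same function once on editor-assigned classes and once on
machine-assigned classes would give two comparable figures; a large
drop for the machine-assigned set would suggest poor classification.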
For our initial experiments, we have extracted the text from all
articles that are more than 100 words in length, and which are
categorized within one of the 50 most frequent normalized
classifications. Explicit markers of class of knowledge, present
at the beginnings of these articles, were removed to ensure that
they do not provide facile criteria for classification. The texts
are tokenized and lemmatized7, and frequencies of words and
lemmas are computed both globally and for each article. Words
and lemmas with more than 100 occurrences in the entire
Encyclopédie were used as attributes, and vectors for each
article were generated from the number of occurrences of each
attribute in that article.
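The attribute selection and vector construction just described can be
illustrated with a short sketch. It assumes the articles are already
tokenized (or lemmatized) into lists of strings; the function names
build_vocabulary and vectorize and the toy corpus are hypothetical,
and the threshold is lowered from 100 to 2 so that the toy data yields
a non-empty vocabulary.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(articles: Dict[str, List[str]], min_count: int) -> List[str]:
    """Keep tokens occurring more than min_count times in the whole corpus."""
    totals: Counter = Counter()
    for tokens in articles.values():
        totals.update(tokens)
    return sorted(tok for tok, n in totals.items() if n > min_count)


def vectorize(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary attribute occurs in one article."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocabulary]


# Toy corpus (hypothetical); the paper uses a threshold of 100.
toy_articles = {
    "A": ["maladie", "humeurs", "maladie"],
    "B": ["maladie", "dieu", "dieu"],
    "C": ["dieu", "temple"],
}
vocabulary = build_vocabulary(toy_articles, min_count=2)
vectors = {name: vectorize(toks, vocabulary) for name, toks in toy_articles.items()}
```

Each article becomes a fixed-length list of raw counts, one position
per retained word or lemma; rare tokens such as "humeurs" in the toy
corpus fall below the threshold and are dropped.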
We are using the SMO implementation of a support vector
machine in the Weka8 data mining engine for initial
experimentation on smaller data samples, and an SVM-Light9
classifier for larger datasets. While support vector learning
algorithms are very effective for classification problems10, we
are also evaluating several other data metrics and machine
learning techniques, including information gain statistics and
J48 decision tree classification as implemented in Weka, to
examine the most salient features that are used in the
classification process and to test the effectiveness of various
feature set selections.11
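Moving from count vectors to an SVM-Light run requires writing the
data in that tool's sparse text format, in which each line holds a
target value followed by index:value pairs with 1-based, increasing
feature indices. The sketch below shows one way this could be done;
the helper name to_svmlight_line is hypothetical and nothing here is
taken from the authors' pipeline.

```python
from typing import Sequence


def to_svmlight_line(label: int, vector: Sequence[int]) -> str:
    """Format one count vector as a sparse SVM-Light training line.

    Zero-valued features are omitted; feature indices start at 1 and
    are emitted in increasing order.
    """
    if label not in (1, -1):
        raise ValueError("binary labels must be +1 or -1")
    pairs = [f"{i}:{v}" for i, v in enumerate(vector, start=1) if v != 0]
    sign = "+1" if label == 1 else "-1"
    return " ".join([sign] + pairs)


# Toy vectors (hypothetical): +1 for the target class, -1 otherwise.
lines = [
    to_svmlight_line(1, [0, 2, 0, 1]),
    to_svmlight_line(-1, [3, 0]),
]
```

Because most articles use only a small fraction of the attribute
vocabulary, the sparse layout keeps the training file far smaller
than a dense matrix would be.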
Results from preliminary experimentation indicate that SVM
classifiers applied to the articles of the Encyclopédie are very
effective at distinguishing articles from different classes. We
examined 936 unlemmatized articles in our sample dataset
belonging to the classes Medecine (499) and Mythologie (437).
The Weka SMO classifier using default options with 10-fold
cross-validation correctly recognized 98.29% of the articles
(920/936). Under the same parameters, the Weka J48 tree
classifier achieved slightly lower performance (91.66%
accuracy). The decision tree showed a clear split on medical
content words, such as maladie, humeurs, inflammation, and
so on. Such strong performance may be due to the fact that one
would not expect to find similar vocabularies in articles dealing
with medicine and mythology. We achieved similar
performance by classifying 2,448 articles equally divided
between modern and ancient geography. The SMO classifier
achieved 100% accuracy on the training data and 92.2% accuracy
under cross-validation. Inspection of the most important features in
both J48 tree and InfoGain measures shows a strong preference
for classical authors (Pline, Ptolomée) and places (Gaule,
Thrace), and the strings "l", "lib" and "liv", which correspond
to citations of classical authors (e.g. Pline, l. IV. c. xvj.).12
Distance and location terms (lieues, long., latit.) are strongly
correlated with modern geography. Furthermore, the function
words "selon" and "dit", which are far more prevalent in ancient
geography articles as the authors were citing classical
descriptions, are given high InfoGain scores.
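To make the InfoGain ranking concrete, the sketch below computes
information gain for a single term, defined as the class entropy
minus the entropy remaining once the articles are split on that term.
It is a simplification and an assumption on our part: the term is
reduced to presence or absence in each article, whereas Weka applies
its own handling to numeric attributes. The function names and toy
labels are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(present: Sequence[bool], labels: Sequence[str]) -> float:
    """Entropy reduction obtained by splitting articles on term presence."""
    total = len(labels)
    if total == 0:
        return 0.0
    with_term = [lab for has, lab in zip(present, labels) if has]
    without_term = [lab for has, lab in zip(present, labels) if not has]
    remainder = (len(with_term) / total) * entropy(with_term) + (
        len(without_term) / total
    ) * entropy(without_term)
    return entropy(labels) - remainder


# Toy data (hypothetical): four articles, two per class.
toy_classes = ["ancienne", "ancienne", "moderne", "moderne"]
gain_selon = information_gain([True, True, False, False], toy_classes)
gain_ville = information_gain([True, False, True, False], toy_classes)
```

In the toy data a term that appears only in the ancient geography
articles separates the classes perfectly, while a term spread evenly
across both classes tells the classifier nothing.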
We anticipate assigning classes of knowledge to articles that
were not originally classified by the editors iteratively by
comparing all unknown articles to specific classes of knowledge
rather than trying to classify all unknown articles en masse. To
test this approach, we assembled two sets of articles each
containing 1,209 instances. The first set contained articles
categorized by the editors as belonging to ancient geography,
while the second set was constituted by selecting, for each
article in the first set, an article as close as possible in length
but belonging to a different class of knowledge. Using SMO
training, we achieved 97.8% accuracy with standard 10-fold
cross-validation. Again, the most heavily weighted features
were terms denoting classical authors and place names, along
with a greater preponderance of more general geographic terms,
comporting nicely with a reasonable human's intuitive
understanding of what makes a document on ancient geography
distinct from other documents. We further validated our
results by running another experiment identical to the first
except that each article was randomly labeled as either ancient
geography or not ancient geography, irrespective of its true
classification. The principle of Random Falsifiability states that
if random labels can be learned with the same ease (for SVM,
'ease' can be defined as the proportion of support vectors
required13) as true class labels, the method must be rejected as
unreliable. After 10-fold cross-validation, SMO achieved a
mere 50.2895% accuracy on the classification, barely surpassing
random chance. That our method cannot learn the random labels
at all suggests that our success in discrimination is in fact based
on inherent differences between the two classes and not merely
a greedy model's exploitation of arbitrary patterns in the data
distribution.
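The two data preparation steps in this experiment, building a
length-matched control set and assigning random labels, could be
sketched as follows. The greedy matching strategy, the rule that a
control article is used only once, and the names
length_matched_controls and random_labels are assumptions made for
illustration; the text does not specify how ties or reuse were
handled.

```python
import random
from typing import Dict, List


def length_matched_controls(target_lengths: List[int], pool: Dict[str, int]) -> List[str]:
    """For each target length, pick the unused pool article closest in length.

    pool maps a headword from another class to its length in words.
    Ties are broken alphabetically so the result is deterministic.
    """
    available = dict(pool)
    chosen = []
    for length in target_lengths:
        if not available:
            raise ValueError("pool is smaller than the target set")
        best = min(available, key=lambda name: (abs(available[name] - length), name))
        chosen.append(best)
        del available[best]
    return chosen


def random_labels(n: int, seed: int) -> List[int]:
    """Assign +1 or -1 to each article at random, ignoring its true class."""
    rng = random.Random(seed)
    return [rng.choice((1, -1)) for _ in range(n)]


# Toy data (hypothetical lengths in words).
controls = length_matched_controls([120, 300], {"FIEVRE": 310, "JUPITER": 118, "ANCRE": 125})
shuffled = random_labels(6, seed=0)
```

Matching on length removes article size as a trivial cue, and the
random labels give a baseline: a classifier that scores well on them
would be fitting noise rather than real differences between classes.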
The SMO model derived from comparing ancient geography
to a random selection of articles in other classes allows us to
test classification on a set of unclassified articles. To do this,
we assembled 5,000 randomly selected articles containing more
than 100 words for which classification was unknown and with
attributed authorship. We then applied the ancient geography
SMO model to this set in an effort to identify articles pertaining
to this category. The recall of this experiment was far too high.
In the future we intend to implement a classifier that reports a
numeric score rather than a simple binary categorization. There
were, however, within the results a number of correctly
classified articles such as the river ASOPE and the articles
GARAMANTES and Ionique Transmigration. Many of the
misclassified articles, such as ADONIES, ou FESTES
ADONIENNES and Danse astronomique, pertain to classical
history, mythology and other related fields. In addition to
implementing a ranking classifier, we will also investigate
moving up the tree of knowledge in order to use a coarser
classification scheme; e.g., rather than remaining at the leaves
of "ancient" and "modern" geography, we would use the branch
of geography itself as a general category.
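A ranking classifier of the kind proposed here might look like the
sketch below, which assumes a linear model whose per-term weights and
bias have already been learned. The weights, bias, threshold, term
counts, and function names are invented for illustration and are not
results from our experiments; only the headwords come from the
discussion above.

```python
from typing import Dict, List, Tuple


def decision_score(weights: Dict[str, float], bias: float, counts: Dict[str, int]) -> float:
    """Signed score of a linear model: bias plus weighted term counts."""
    return bias + sum(weights.get(term, 0.0) * n for term, n in counts.items())


def rank_articles(
    weights: Dict[str, float],
    bias: float,
    articles: Dict[str, Dict[str, int]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return articles scoring at or above threshold, best first."""
    scored = [(name, decision_score(weights, bias, counts)) for name, counts in articles.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical weights and term counts, chosen only to show the mechanics.
toy_weights = {"pline": 1.0, "lib": 0.5, "lieues": -1.0}
toy_bias = -0.5
toy_counts = {
    "ASOPE": {"pline": 2, "lib": 1},
    "ADONIES": {"pline": 1},
    "PARIS": {"lieues": 3},
}
strict = rank_articles(toy_weights, toy_bias, toy_counts, threshold=1.0)
lenient = rank_articles(toy_weights, toy_bias, toy_counts, threshold=0.0)
```

Raising the threshold trades coverage for confidence: borderline
articles from neighboring fields drop out, leaving scholars a shorter
ranked list to inspect.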
The impressive performance of machine learning algorithms
suggests that the editors of the Encyclopédie were quite
judicious in their assignments of classifications, a claim which
will be tested further in the full paper. Examination of the
features most effective in classification tasks will establish a
sort of thesaurus which will give scholars a better understanding
of the organization of knowledge during the Enlightenment.
Furthermore, we believe that the creation of well-verified
training sets on this large corpus will allow us to test the degree
to which we may profitably apply what the systems have learned
to articles and plate legends which were not classified at the
time, using the contemporary ontologies. If this series of
experiments is successful, we would anticipate using the training
sets from the classifications in the Encyclopédie to attempt to
classify passages in other 18th century French documents.
1. The ARTFL implementation of the Encyclopédie is discussed in
Robert Morrissey, Jack Iverson and Mark Olsen, "Présentation:
L'Encyclopédie Electronique", in Robert Morrissey and Philippe
Roger, eds., L'Encyclopédie du réseau au livre et du livre au
réseau (Paris: Champion, 2001): 17-27, and Leonid Andreev,
Jack Iverson and Mark Olsen, "Re-engineering a War Machine:
ARTFL's Encyclopédie", Literary & Linguistic Computing 14.1
(1999): 11-28.
2. Eric Brian, "L'ancêtre de l'hypertexte", Les Cahiers de Science et
Vie 47 (Oct. 1998): 28-38.
3. English translation cited in Nelly Hoyt and Thomas Cassirer's
"Introduction" to Encyclopedia (1965): xxiii (our emphasis).
4. For various representations of the Système Figuré and the Editors'
description, see <http://www.lib.uchicago.edu/efts/ARTFL/projects/encyc/texts/>
and <http://artfl.uchicago.edu/cactus/>.
5. Blanchard and Olsen examined the structure of the renvois
generating a "mappemonde" of the cross-references and node level
classes of knowledge. See Gilles Blanchard and Mark Olsen, "Le
système de renvois dans l'Encyclopédie: une cartographie de la
structure des connaissances au XVIIIème siècle", Recherches sur
Diderot et sur l'Encyclopédie 31-32 (April 2002): 45-70.
6. This project was accomplished in collaboration with Professor
Dena Goodman at the University of Michigan.
7. <http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/>
8. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques 2nd ed. (Morgan Kaufmann, 2005)
and <http://www.cs.waikato.ac.nz/ml/weka/>
9. SVM-Light: <http://svmlight.joachims.org/>
See T. Joachims, "Making Large-Scale SVM Learning Practical",
Advances in Kernel Methods - Support Vector Learning, B.
Schölkopf, C. Burges and A. Smola, eds. (MIT Press, 1999).
Note that we are using a parallel implementation. See
<http://www.dm.unife.it/gpdt/>, G. Zanghirati, L. Zanni,
"A Parallel Solver for Large Quadratic Programs in Training
Support Vector Machines", Parallel Computing 29 (2003):
535-551 and L. Zanni, T. Serafini, G. Zanghirati, "Parallel Software
for Training Large Scale Support Vector Machines on
Multiprocessor Systems", JMLR 7 (July 2006): 1467-1492.
10. S. Dumais, et al., "Inductive learning algorithms and
representations for text categorization", CIKM-98, 1998.
11. See the discussion of conformity and uniformity in Chih-Ming Chen,
et al., "A Hierarchical Neural Network Document Classifier with
Linguistic Feature Selection", Applied Intelligence 3 (December
2005).
12. We checked these using the PhiloLogic build of the Encyclopédie
(<http://www.lib.uchicago.edu/efts/ARTFL/projects/encyc/>), suggesting the importance of
checking text mining results with full text analysis systems.
13. A. Ruiz and P.E. López-de-Teruel, "Random Falsifiability and
Support Vector Machines"
(<http://learn98.tsc.uc3m.es/~learn98/papers/abstracts/paper013/abstract.html>).


Conference Info

Complete

ADHO - 2007

Hosted at University of Illinois, Urbana-Champaign

Urbana-Champaign, Illinois, United States

June 2, 2007 - June 8, 2007

106 works by 213 authors indexed

Series: ADHO (2)

Organizers: ADHO
