Annotation en mode collaboratif au service de l'étude de documents anciens dans un contexte numérique (Annotations in Collaborative Mode for Ancient Documents Study in Digital Environment)

paper
Authorship
  1. Ana Stulic

    AMERIBER - Bordeaux 3 University

  2. Soufiane Rouissi

    CEMIC - GRESIC - Bordeaux 3 University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

The changes brought about by digital technology invite us to revisit the traditional approach to the edition of texts as it is practised in the humanities as a research tool. Our proposal addresses the problem of the electronic edition of Judeo-Spanish documents in Hebrew script, to which transliteration and transcription techniques are applied. We study the possibilities of building a collaborative platform on which the documents can be annotated and organized as a digital reference repository serving the researchers of this specific field. Beyond the description necessary for identifying the documents (the Dublin Core descriptive standard), our objective is to encourage annotation (in a free form, complementary to the meta-description) of the deposited documents for purposes of study and discussion. The source documents (image files, for example) are transformed by transcription or transliteration into digital documents which become research tools. Annotation (carried out by a researcher or domain expert) serves both to meta-describe the source and the "derived" documents (for identifying and locating the resources belonging to the digital repository) and to open them up for discussion (interpreting, commenting, refuting, translating, accepting, and so on).
A first study allowed us to propose a conceptual model defining the data structures necessary for this double objective of identification and annotation (Rouissi, Stulic 2005).
We take the view that the successful deployment of a socio-technical system suited to the needs of the researchers of a scientific community depends on a pertinent definition of its functionalities. To define them in this specific context, which concerns a relatively small and geographically dispersed scientific community, we have planned a survey of the researchers concerned. This survey aims to confront our initial hypotheses (the conceptual model defining the annotations) with the practices reported by these academics. Its results should allow us to identify the needs and expectations of the researchers questioned, so as to determine directions for developing and deploying suitable functionalities. The survey should also measure the degree to which information and communication technologies (ICT) are integrated into their current practices (types of technological tools used, individual or collective settings, use of annotation on digital documents, possible participation in collaborative environments, and so on). The answers will be analysed quantitatively (closed questions) and qualitatively (content analysis) so as to let us propose a typology of reported practices and/or needs.
In broad outline, our work is oriented towards the development of a platform accessible entirely through the Web and based on existing solutions, notably ones respecting Open Source principles. But it is the functionalities expected by the members of the community that will shape the concrete technical choices.
We envisage a proprietary format for the model in question, but for reasons of portability it must allow exports based on data-interchange formats. These formats must rely on widely shared schemas respecting a normative approach, such as the Text Encoding Initiative (TEI) for structuring the documents or Dublin Core for describing the working documents.
The platform must encourage annotation in collaborative mode (with various possible levels of participation, from reader to editor) and organize the discussion (which can serve several purposes: transcribing, translating, commenting, proposing, and so on) while characterizing it (according to vocabularies still to be built, not necessarily covered by specifications such as the TEI). The final system will also include the construction of a digital documentary corpus, exploitable through search functionalities (for identifying, selecting, etc.) over the documents. A future user of the system will first be able to run a search (of the selection type) over the digital documents making up the corpus; then, after selecting a document, consult the discussion concerning it and, depending on his or her role (profile), possibly participate by annotating the document and/or reacting to one of the existing annotations. The user will be able to generate a new version containing his or her own annotations and export it in the formats and standards (still to be specified and implemented) that will be offered. Despite the apparent singularity of our context, tied to the processing of Judeo-Spanish texts, our reflection is directed towards a broader approach. In the long run, our goal is to build a digital environment that could be applied in research fields with similar needs around the organization of documentary corpora for study.
References
Rouissi, S. and Stulic, A. (2005). Annotation of Documents for Electronic Edition of Judeo-Spanish Texts: Problems and Solutions. Forthcoming in the proceedings of the conference Lesser Used Languages and Computer Linguistics, Bolzano, EURAC, 27-28 October 2005. Abstract available online: http://www.eurac.edu/NR/rdonlyres/9F93F5B9-95F6-44AC-806D-C58FF69AFD27/8812/ConferenceProgramme2.pdf
Sperberg-McQueen, C.M. and Burnard, L. (eds.) (2005). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium. http://www.tei-c.org/
Strings, Texts and Meaning
Manfred THALLER
Universität zu Köln
From a technical point of view, texts are currently represented in computer systems as linear strings of atomic characters, between which no distinction is made on the technical level. In the markup discussions within the Humanities, this is usually accepted as an immutable fact of technology.
We propose that the handling of Humanities texts could be considerably easier if an engineering model could be created that is built upon a more complex understanding of text.
1. Basic model of “text” proposed
For this we start with the proposal to understand a text as a string of codes, each of which represents “meaning” measurable in a number of ways.
More detailed:
Texts – be they cuneiform, handwritten or printed – consist of information-carrying tokens. These tokens fall into a number of categories, which are differentiated by the degrees of certainty with which they can be used in various operations. The trivial examples are ASCII or Unicode characters. Less trivial are symbolic tokens, such as the (primitive) string representing the term “chrismon”, a bitmap representing a Chrismon (or something similar), etc.
A string made up of such tokens, which represents a text, can be understood to exist in an n-dimensional conceptual universe. Such dimensions, which have different metrics, are for example (see the sketch after this list):
• A dimension whose coordinates have only two possible values (“yes”, “no”), describing whether a token has an additional visible property, such as being underscored.
• Another dimension whose coordinates lie on a metric scale, assigning a colour value which allows similarities to be defined.
• Another dimension describing the position of a token like “Chrismon” within an ontology describing the relationships between Chrismons and other formulaic forms.
• A real number, giving the relative closeness between a bitmap representing a Chrismon and an ideal-typical Chrismon.
If we view such a string from a specific point in the conceptual space – a.k.a. an individual’s research position – many of these dimensions tend to collapse, in the same way as three-dimensional objects collapse their z-value when represented in two-dimensional drawings.
2. Relationship between text, markup and processing
We assume that string processing, on a very low level of engineering, can be implemented in such a way that the low-level programming tools which are used today for the generation of programs handling texts can tackle the implications of this model directly.
This implies, e.g., a low-level function which can compare two strings “sensitive to differences between included symbolic tokens beyond a specified ontological distance” or “insensitive to this”, very much like current implementations of low-level tools can compare two strings as “case sensitive” or “case insensitive”.
Currently, all textual phenomena have to be described with one integrated system of markup, expressing both attributes which can only be observed on the character level, without necessarily being interpretable on the spot, and highly abstract textual structures. The proposed approach would divide textual attributes into two classes: textual attributes in the narrower sense, which can be handled as properties of the strings used to represent the texts, and structural (and other) attributes, which are handled by a software system that assumes the presence of the underlying capabilities of the low-level textual model while itself focusing on a class of higher-level problems: e.g. a database operating upon an abstract content model of a drama, relying upon the underlying string-handling tools for the handling of page references as well as of the critical apparatus.
The latter implies that documents will – seen from today’s perspective – usually be marked up in at least two concurrent ways. Some implications of this will be listed.
3. Possibilities of generalizing the basic model
Our model so far has assumed that information is handled by strings, i.e. by tokens which form one-dimensional sequences. (Non-linear structures are one-dimensional as well in this sense: a path within a graph has a length, measured as the number of nodes through which it passes. It cannot be measured in two dimensions, as the relative location of the nodes within a two-dimensional drawing is just a property of the visualization, not of the structure itself.)
There is no reason, however, why the notion of meaning
represented by an arrangement of tokens carrying
information should not be generalized to two dimensions
(images), three dimensions (3D objects) or four
dimensions (e.g. 3D representations of historical
buildings over time).
A problem arises, however, when one compares some operations on one-dimensional with the same operations on more-dimensional arrangements of information-carrying tokens. A good example is the comparison of “insertion operations” in strings vs. the same operation in images. We conclude by proposing to solve that problem with the notion that a textual string is a representation of an underlying meaning with a specific information density, and usually transfers only part of the meaning originally available, just as a digital image represents only part of the visual information available in the original.
This in turn leads to the notion that not only the handling of information-carrying tokens can be generalized from the one-dimensional to the more-dimensional case, but the properties of markup languages can be as well.
4. Concluding remark
While the generalisation of the model quoted above is presented in Paris for the first time, the idea of a specialised data type for the representation of Humanities texts goes back to the early nineties (Thaller 1992, Thaller 1993). Various intermediate work has never been published; an experimental implementation, focusing on the interaction between texts and databases administering the structure embedded in the text, does exist, however, and is used in the production-level system accessible via http://www.ceec.uni-koeln.de (Thaller 2004). More recently a project has started at the chair of the author to implement a datatype “extended string” as a series of MA theses in Humanities Computer Science. The first of these (Neumann 2006) provides a core implementation of the most basic concepts as a class augmenting Qt and fully integrated into that library.
References
Neumann, J. (2006). Ein allgemeiner Datentyp für die implizite Bereitstellung komplexer Texteigenschaften in darauf aufbauender Software. Unpublished MA thesis, University of Cologne. Accessible via: http://www.hki.uni-koeln.de/studium/MA/index.html
Thaller, M. (1992). “The Processing of Manuscripts”, in: Manfred Thaller (ed.), Images and Manuscripts in Historical Computing. Scripta Mercaturae (= Halbgraue Reihe zur Historischen Fachinformatik A 14).
Thaller, M. (1993). “Historical Information Science: Is there such a Thing? New Comments on an Old Idea”, in: Tito Orlandi (ed.), Seminario discipline umanistiche e informatica. Il problema dell’integrazione (= Contributi del Centro Linceo Interdisciplinare ‘Beniamino Segre’ 87).
Thaller, M. (2004). “Texts, Databases, Kleio: A Note on the Architecture of Computer Systems for the Humanities”, in: Dino Buzzetti, Giuliano Pancaldi, Harold Short (eds.), Digital Tools for the History of Ideas (= Office for Humanities Communication Series 17), pp. 49-76.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC

Tags
  • Keywords: None
  • Language: French
  • Topics: None