Strings, Texts and Meaning

paper
Authorship
  1. Manfred Thaller

    Universität zu Köln (University of Cologne)

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

From a technical point of view, texts are currently
represented in computer systems as linear strings of
atomic characters, between which no distinction is
made on the technical level. In the markup discussions
within the Humanities, this is usually accepted as an
immutable fact of technology.
We propose that the handling of Humanities texts
could be made considerably easier if an engineering model
were created that is built upon a more complex
understanding of text.
1. Basic model of “text” proposed
For this we start with the proposal to understand a
text as a string of codes, each of which represents
“meaning” measurable in a number of ways.
In more detail:
Texts – be they cuneiform, handwritten or printed –
consist of information-carrying tokens. These tokens fall
into a number of categories, which are differentiated by the
degrees of certainty with which they can be used in various
operations. The trivial examples are ASCII or Unicode
characters. Less trivial are symbolic tokens, such as the
(primitive) string representing the term “chrismon”, a
bitmap representing a Chrismon (or something similar), etc.
A string made up of such tokens, which represents a
text, can be understood to exist in an n-dimensional
conceptual universe. Such dimensions, which have
different metrics, are, e.g.:
• A dimension whose coordinates take only two
possible values (“yes”, “no”), describing
whether a token has an additional visible property,
such as being underscored.
• Another dimension whose coordinates lie on a metric
scale, assigning a colour value which allows
similarities to be defined.
• Another dimension describing the position of a
token like “Chrismon” within an ontology describing
the relationships between Chrismons and other
formulaic forms.
• A real number, giving the relative closeness
between a bitmap representing a Chrismon and an
ideal-typical Chrismon.
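As a hedged illustration only (the paper specifies no implementation at this level, and all names below are invented), such a multi-dimensional token might be sketched in Python as:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """An information-carrying token with values in several
    conceptual dimensions, each with its own metric
    (hypothetical sketch, not the paper's datatype)."""
    glyph: str                    # base character or symbol name
    underscored: bool = False     # binary dimension ("yes"/"no")
    colour: tuple = (0, 0, 0)     # metric dimension (RGB), allows similarity
    ontology_class: str = ""      # position within an ontology
    closeness: float = 0.0        # e.g. similarity of a bitmap to an ideal type

def colour_distance(a: Token, b: Token) -> float:
    """Euclidean distance in the colour dimension."""
    return sum((x - y) ** 2 for x, y in zip(a.colour, b.colour)) ** 0.5

def collapse(tokens, keep=("glyph", "underscored")):
    """Viewing the string from a research position that ignores some
    dimensions 'collapses' them, like dropping the z-value of a
    3D object in a two-dimensional drawing."""
    return [tuple(getattr(t, attr) for attr in keep) for t in tokens]
```

Here `colour_distance` stands in for one of the many per-dimension metrics the model envisages; each dimension would carry its own notion of similarity.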
If we view such a string from a specific point in the
conceptual space – i.e. from an individual’s research
position – many of these dimensions tend to collapse,
in the same way as three-dimensional objects collapse their
z-value when represented in two-dimensional drawings.
2. Relationship between text, markup
and processing
We assume that string processing, on a very low
level of engineering, can be implemented in
such a way that the low-level programming tools
used today for the generation of programs handling
texts can tackle the implications of this model directly.
This implies, e.g., a low-level function which can
compare two strings “sensitive to differences between
included symbolic tokens beyond a specified ontological
distance” or “insensitive to this”, very much like
current implementations of low-level tools can compare
two strings as “case sensitive” or “case
insensitive”.
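A minimal sketch of what such a comparison could look like, assuming a toy ontology represented as a parent map with distance measured as steps to the nearest common ancestor (all names and the ontology itself are invented for illustration):

```python
# Toy ontology: each symbolic token names its parent class (hypothetical).
ONTOLOGY = {
    "chrismon_a": "chrismon",
    "chrismon_b": "chrismon",
    "chrismon": "invocatio",
    "invocatio": "formula",
}

def ancestors(term):
    """Chain from a term up to its topmost class in the toy ontology."""
    chain = [term]
    while term in ONTOLOGY:
        term = ONTOLOGY[term]
        chain.append(term)
    return chain

def ontological_distance(a, b):
    """Steps up to the nearest common ancestor (inf if unrelated)."""
    ca, cb = ancestors(a), ancestors(b)
    common = [t for t in ca if t in cb]
    if not common:
        return float("inf")
    return ca.index(common[0]) + cb.index(common[0])

def equal_strings(s, t, max_distance=0):
    """Compare two token sequences; symbolic tokens within max_distance
    in the ontology count as equal, much as a 'case insensitive'
    comparison treats 'a' and 'A' as equal on the character level."""
    if len(s) != len(t):
        return False
    return all(a == b or ontological_distance(a, b) <= max_distance
               for a, b in zip(s, t))
```

With `max_distance=0` this degenerates to an exact comparison; raising it makes the comparison “insensitive” to near-synonymous symbolic tokens, mirroring the case-sensitive/insensitive switch of current tools.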
Currently all textual phenomena have to be
described with one integrated system of markup,
which expresses both attributes that can only be observed
on the character level, without necessarily being
interpretable on the spot, and highly abstract textual
structures. The proposed approach would divide textual
attributes into two classes: textual attributes in the
narrower sense, which can be handled as properties of the
strings used to represent the texts, and structural (and
other) attributes, which are handled by a software system
that assumes the presence of the underlying capabilities of
the low-level textual model while focusing itself on a
class of higher-level problems – e.g. a database operating
upon an abstract content model of a drama, relying on
the underlying string-handling tools for page references
as well as the critical apparatus.
The latter implies that documents will – seen from today’s
perspective – usually be marked up in at least two
concurrent ways. Some implications of this will be
listed.
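One way to picture such concurrent markup is as two stand-off layers over the same string: character-level attributes carried with the string itself, and abstract structure handled by a higher-level system. The following sketch is purely illustrative; the sample text, layer names, and range convention are all invented here:

```python
# Hypothetical two-layer stand-off markup over one string.
text = "In nomine domini"

# Layer 1: character-level attributes, properties of the string itself.
char_attrs = [
    {"range": (3, 9), "underscored": True},   # "nomine" is underscored
]

# Layer 2: abstract textual structure, handled by a higher-level system.
structure = [
    {"range": (0, 16), "type": "invocatio"},
]

def attributes_at(pos):
    """Collect the attributes from both concurrent layers that
    apply at one character position."""
    hits = {}
    for layer in (char_attrs, structure):
        for a in layer:
            lo, hi = a["range"]
            if lo <= pos < hi:
                hits.update({k: v for k, v in a.items() if k != "range"})
    return hits
```

The point of the sketch is only that the two layers can be maintained and queried independently, without being forced into one integrated markup hierarchy.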
3. Possibilities of generalizing the basic
model
Our model so far has assumed that information
is handled by strings, i.e. by tokens which form
one-dimensional sequences. (Non-linear structures are
one-dimensional as well in this sense: a path within a
graph has a length, measured as the number of nodes
through which it passes. It cannot be measured in two
dimensions, as the relative location of the nodes within
a two-dimensional drawing is just a property of the
visualization, not of the structure itself.)
There is no reason, however, why the notion of meaning
represented by an arrangement of information-carrying
tokens should not be generalized to two dimensions
(images), three dimensions (3D objects) or four
dimensions (e.g. 3D representations of historical
buildings over time).
A problem arises, however, when one compares
some operations on one-dimensional with the same
operations on more-dimensional arrangements of
information-carrying tokens. A good example is the
comparison of “insertion operations” in strings vs.
the same operation in images.
We propose to solve that problem through
the notion that a textual string is a representation of an
underlying meaning with a specific information density,
which usually transfers only part of the meaning
originally available, just as a digital image represents
only part of the visual information available in the
original.
This in turn leads to the notion that not only the handling
of information-carrying tokens can be generalized from
the one- to the more-dimensional case, but the properties
of markup languages can as well.
4. Concluding remark
While the generalisation of the model quoted
above is presented in Paris for the first time, the
idea of a specialised data type for the representation of
Humanities text goes back to the early nineties (Thaller
1992, Thaller 1993). Various intermediate work has never
been published; an experimental implementation,
focusing on the interaction between texts and databases
administering the structure embedded in the text, does
exist, however, and is used in the production-level
system accessible via http://www.ceec.uni-koeln.de
(Thaller 2004). More recently a project has started at the
chair of the author to implement a datatype “extended
string” as a series of MA theses in Humanities Computer
Science. The first of these (Neumann 2006) provides a
core implementation of the most basic concepts as a class
augmenting Qt and fully integrated into that library.
References
Neumann, J. (2006). Ein allgemeiner Datentyp für
die implizite Bereitstellung komplexer Texteigenschaften
in darauf aufbauender Software. Unpubl.
MA thesis, University of Cologne. Accessible via:
http://www.hki.uni-koeln.de/studium/MA/index.html
Thaller, M. (1992). “The Processing of Manuscripts”,
in: Manfred Thaller (Ed.) Images and
Manuscripts in Historical Computing, Scripta
Mercaturae (=Halbgraue Reihe zur Historischen
Fachinformatik A 14).
Thaller, M. (1993). “Historical Information Science:
Is there such a Thing? New Comments on an Old
Idea”, in: Tito Orlandi (Ed.): Seminario discipline
umanistiche e informatica. Il problema dell’
integrazione (= Contributi del Centro Linceo
Interdisciplinare ‘Beniamino Segre’ 87).
Thaller, M. (2004). “Texts, Databases, Kleio: A Note
on the Architecture of Computer Systems for the
Humanities”, in: Dino Buzzetti, Giuliano Pancaldi,
Harold Short (Eds.): Digital Tools for the History
of Ideas (= Office for Humanities Communication
Series 17) 2004, 49-76.


Conference Info

Complete

ACH/ALLC / ACH/ICCH / ADHO / ALLC/EADH - 2006

Hosted at Université Paris-Sorbonne, Paris IV (Paris-Sorbonne University)

Paris, France

July 5, 2006 - July 9, 2006

151 works by 245 authors indexed

The effort to establish ADHO began in Tuebingen, at the ALLC/ACH conference in 2002: a Steering Committee was appointed at the ALLC/ACH meeting in 2004, in Gothenburg, Sweden. At the 2005 meeting in Victoria, the executive committees of the ACH and ALLC approved the governance and conference protocols and nominated their first representatives to the ‘official’ ADHO Steering Committee and various ADHO standing committees. The 2006 conference was the first Digital Humanities conference.

Conference website: http://www.allc-ach2006.colloques.paris-sorbonne.fr/

Series: ACH/ICCH (26), ACH/ALLC (18), ALLC/EADH (33), ADHO (1)

Organizers: ACH, ADHO, ALLC

Tags
  • Language: English