Investigating word cooccurrence selection with extracted sub-networks of the Gospels

poster / demo / art installation
Authorship
  1. 1. Maki Miyake

    University of Osaka

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.

Graph representation and the techniques of graph theory
and network analysis offer very effective ways of detecting
and investigating the intricate patterns of connectivity that
exist within large-scale linguistic knowledge resources. In this
presentation, we apply a graph-clustering technique for data
processing that utilizes a clustering-coeffi cient threshold in
creating sub-networks for the Gospels of the New Testament.
Specially, this study discusses some graph clustering results
from the perspectives of optimal clustering and data sizes
with a view to constructing an optimal semantic network
by employing the hierarchical graph clustering algorithm of
Recurrent Markov Clustering (RMCL). The corpus used in
this study is a Greek version of the Gospels (Nestle-Aland,
1979) for which two semantic networks are created based
on network features. The network structures are investigated
from the perspective of constructing appropriate semantic
networks that capture the relationships between words and
concepts.
In the natural language processing of texts, word pairs are
usually computed by the windowing method (Takayama,
Flournoy, Kaufmann, & Peters, 1998). The windowing method
provides a relatively simple representation of similarity levels
that is suitable for clustering. The technique involves moving
a certain sized window over a text to extract all fi xedsized
word grams (Vechthomova, Roberston, & Jones, 2003).
Word pairings are then made by combining all extracted
words. In the present study, co-occurrence data is computed
with two window sizes that refl ect syntactic and semantic
considerations. The fi rst size (WS1) is set at 1 for the nearest
co-occurrence, while the second size (WS10) is set to 10 to
collect words and no stemming process was taken. Accordingly,
8,361 word occurrences were identifi ed. An adjacency matrix
for the graphs was created from the word co-occurrence
data for a particular range of the texts. Using the clustering
coeffi cient as a threshold, the network was reduced into 18
sub-networks which were created at 0.1 increments of the
clustering coeffi cient value (from 0.1 to 0.9).
In Figure 1, the degree distributions for the two networks for
all occurrence nodes are plotted along log-log coordinates.
The window size of 1 shows the best fi t to a power law
distribution (r=1.7). The average degree value of 15.5 (2%) for
the complete semantic network of 8,361 nodes clearly indicates
that the network exhibits a pattern of sparse connectivity; in
other words, that it possesses the characteristics of a scalefree
network according to Barabasi and Albert (1999). In
contrast, the window size of 10 has a bell-shaped distribution
and its average degree value of 106.4 (13%) indicates that
the network exhibits a much greater level of connectivity
compared to the window size of 1.
Figure.1 Degree distributions
The second network feature is the clustering coeffi cient which
represents the inter-connective strength between neighboring
nodes in a graph. Following Watts and Strogatz (1998), the
clustering coeffi cient of the node (n) can be defi ned as:
(number of links among n’s neighbors)/(N(n)*(N)-1)/2), where
N(n) denotes the set of n’s neighbors. The coeffi cient assumes
values between 0 and 1.
Moreover, Ravasz and Barabasi (2003) advocate a similar notion
of clustering coeffi cient dependence on node degree, based on
a hierarchical model of scaling laws. The results of scaling C(k)
with k for the two networks are presented in Figure 2. As the
two networks conform well to a power law, we can conclude
that they both possess intrinsic hierarchies.
Figure.2 Clustering coeffi cient distributions
The graph clustering method applied in this study is Recurrent
Markov Clustering (RMCL), recently proposed by Jung, Miyake, and Akama (2006), as an improvement to Van Dongen’s (2000)
Markov Clustering (MCL) which is widely recognized as an
effective method for the detection of patterns and clusters
within large and sparsely connected data structures (Dorow,
Widdows, Ling, Eckmann, Sergi & Moses (2005), Steyvers &
Tenenbaum (2005)). The fi rst step in the MCL consists
of sustaining a random walk across a graph. The recurrent
process is essentially achieved by incorporating feedback data
about the states of overlapping clusters prior to the fi nal MCL
output stage. The reverse tracing procedure is a key feature of
RMCL making it possible to generate a virtual adjacency matrix
for non-overlapping clusters based on the convergent state
yielded by the MCL process. The resultant condensed matrix
represents a simpler graph that can highlight the conceptual
structures that underlie similar words.
Figure 3 presents cluster sizes as a function of the clustering
coeffi cient for the two window sizes of 1 and 10. The terms
for the input data (WS1-data, WS10-data) refer to the two
initial adjacency matrices. Focusing on the trend in cluster
sizes for the window size of 10 (WS10), the fact that cluster
sizes remain relatively constant regardless of size of the data
indicates that as the window size increases, clusters are less
dependent on the clustering coeffi cient.
Figure.3 MCL results
In order to optimize the co-occurrence data, we employ
Newman and Girvan’s (2004) notion of modularity, which is
particularly useful in assessing the quality of divisions within
a network. Modularity Q indicates differences in edge
distributions between a graph of meaningful partitions and a
random graph under the same vertices conditions (numbers
and sum of their degrees). Accordingly, Miyake and Joyce
(2007) have proposed the combination of the RMCL clustering
algorithm and this modularity index in order to optimize the
infl ation parameter within the clustering stages of the RMCL
process.
Moreover, to consider the recall rates of nodes, the F measure
is also employed to optimize the selection of the most
appropriate co-occurrence data, because the precision rate, P,
always depends on a trade-off relationship between modularity
Q and the recall rate R.
Figure 4 plots modularity Q and the F measure as a function
of clustering coeffi cients for the two window sizes of 1 and
10. The results indicate that there are no peaks in the plot of
Q values. This fi nding suggests that as the value of r decreases,
the value of Q increases. In the case of WS1, there are
differences in the peaks for the two measures of Q and F, for
while Q peaks at 0.6, the value of F peaks at 0.4. In this study,
we regard the peak in the F measure as a recall rate according
to the size of the data. As the results for WS10 show no peaks
in the Q value, we also take the peak value for F which is 0.7.
In this way, the number of node is about 5,000 for the two
selected sub-networks, which are almost the same size as the
networks.
Figure 4.
In conclusion, two network representations of the Gospels
were created for different window sizes of text, and the
networks were analyzed for the basic features of degree and
clustering coeffi cient distributions in order to investigate the
presence of scale-free patterns of connectivity and hierarchical
structures. This study also reported on the application of
a hierarchical graph clustering technique to sub-networks
created with different clustering-coeffi cient thresholds. In
terms of constructing an optimal semantic network, the
combination of the modularity measurement and the F
measure is demonstrated to be effective in controlling the
sizes of clusters. Bibliography
Barabási, A.L., and R. Albert (1999), Emergence of Scaling in
Random Networks. Science, 86, pp.509-512.
Dorow, B., D. Widdows, K. Ling, J. Eckmann, D. Sergi, and
E. Moses (2005), Using Curvature and Markov Clustering
in Graphs for Lexical Acquisition and Word Sense
Discrimination. MEANING-2005, 2nd Workshop organized by
the MEANING Project, February 3rd-4th 2005, Trento, Italy.
Trento, Italy.
Jung, J., Miyake, M., Akama, H.(2006), Recurrent Markov
Cluster (RMCL) Algorithm for the Refi nement of the
Semantic Network, In Proceedings of the 5th International
Conference on Language Resources and Evaluation (LREC2006),
pp.1428-1432.
Nestle-Aland (1987), Novum Testmentum Graece. 26th
edition. Stuttgart: German Bible Society, 1987.
Miyake, M., Joyce, T (2007), Mapping out a Semantic Network
of Japanese Word Associations through a Combination of
Recurrent Markov Clustering and Modularity. In Proceedings
of the 3rd Language & Technology Conference (L&TC’07),
pp.114-118.
Steyvers, M., and J. B. Tenenbaum (2005), The Large-Scale
Structure of Semantic Networks: Statistical Analyses and a
Model of Semantic Growth. Cognitive Science 29.1, pp.41-78.
Takayama, Y., R. Flournoy, S. Kaufmann, and S. Peters.
Information Mapping: Concept-based Information Retrieval
Based on Word Associations. http://www-csli.stanford.edu/
semlab-hold/infomap/theory.html, 1998.
van Dongen, Stijn Marinus (2000), “Graph Clustering by Flow
Simulation”. PhD Thesis. University of Utrecht. http://igiturarchive.
library.uu.nl/dissertations/1895620/inhoud.htm
Vechthomova, O., S. Robertson, and S. Jones (2003), Query
Expansion With Long-Span Collocate. Information Retrieval 6,
pp.251-273.
Watts, D., and S. Strogatz (1998), Collective Dynamics of
‘Small-World’ Networks. Nature 393, pp.440-442.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

Complete

ADHO - 2008

Hosted at University of Oulu

Oulu, Finland

June 25, 2008 - June 29, 2008

135 works by 231 authors indexed

Conference website: http://www.ekl.oulu.fi/dh2008/

Series: ADHO (3)

Organizers: ADHO

Tags
  • Keywords: None
  • Language: English
  • Topics: None