Visualizing Japanese Language Change During the Past Century

poster / demo / art installation
Authorship
  1. Bor Hodošček

    Linguistics - University of Osaka

Work text

Bor Hodošček

Osaka University, Japan
bor@lang.osaka-u.ac.jp

2014-12-19T13:50:00Z

Paul Arthur, University of Western Sydney

Locked Bag 1797
Penrith NSW 2751
Australia

Paper

Poster

diachronic corpus
Japanese language change
genre
register
cooccurrence networks

corpora and corpus activities
metadata
stylistics and stylometry
linguistics
genre-specific studies: prose
poetry
drama
networks
relationships
graphs
data mining / text mining
English

This study introduces an online system for the visualization and analysis of over a century (1874–2008) of Japanese language change. A comprehensive account of register variation in contemporary Japanese has recently become possible with the public release of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a 100-million-word corpus covering a wide variety of written Japanese, collected and curated by the National Institute for Japanese Language and Linguistics. In addition, a growing number of publicly released corpora recording various genres of modern (Meiji-era) Japanese writing have paved the way for more comprehensive diachronic analysis of Japanese, that is, analysis of how the language has developed over time. Resources for English are further along: they include large book digitization projects such as the Google Books corpus (Michel et al., 2011) as well as more curated historical corpora such as the Corpus of Historical American English (COHA) and the register-balanced Corpus of Contemporary American English (COCA) (Davies, 2010; 2011). By comparison, the resources and research tools available for investigating diachronic change and register variation in Japanese fall short in two respects: balanced representation of registers over time, and unified, sophisticated search interfaces.
We combine corpus metadata and annotations with textual features to model language change over time and across registers, drawing on the following six corpora:
• The Balanced Corpus of Contemporary Written Japanese (c. 1975–2008).
• The Sun corpus (c. 1895–1925).
• The Meiroku Zasshi corpus (c. 1874–1875).
• The Kindai Josei Zasshi corpus (c. 1894–1925).
• The Kokumin no Tomo corpus (c. 1887–1888).
• A subset of the Aozora Bunko (c. 1890s–).
All text is first converted into a unified structured format that preserves structural information (paragraphs, headings, titles, lists, etc.) and, where available, other information (spoken text, quotations, etc.) from the different plain-text or XML encodings of the corpora. Next, we segment sentences into morpheme tokens using the morphological analyzer MeCab together with, depending on the time period, the modern or contemporary version of the UniDic morphological dictionary. A distinctive property of both UniDic variants is that they group word tokens under lemmas that cover the many orthographic variants observed in Japanese writing. Taking triplets of lemma, word orthography, and part of speech (POS) as the basic unit, we construct co-occurrence networks between all words occurring in the same sentence or paragraph.
The co-occurrence network is built so that we can generate sub-networks matching a metadata query, such as a year and an NDC (Nippon Decimal Classification) code, and then compare them with other sub-networks. The query and visualization interface therefore provides a timeline for choosing time-based subsets of the corpora, as well as a visual means of selecting on other metadata, both categorical (gender, media type, etc.) and hierarchical (NDC, topic, etc.). This allows the user either to narrow the investigation of language change to a single register within a chosen time period, or to focus on differences between registers by comparing two or more of them within a set time period.
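The network construction and metadata-based sub-network selection described above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the names `Token`, `Unit`, `subnetwork`, and `lemma_edges`, the metadata keys `year` and `ndc`, and the three toy sentences are all assumptions made for illustration. The tokens are written out by hand here, whereas in the described pipeline they would come from MeCab with the appropriate UniDic dictionary.

```python
from __future__ import annotations

from collections import Counter
from itertools import combinations
from typing import Callable, Iterable, NamedTuple


class Token(NamedTuple):
    lemma: str  # dictionary lemma shared by orthographic variants
    orth: str   # surface orthography as written
    pos: str    # part of speech


class Unit(NamedTuple):
    tokens: tuple[Token, ...]  # one sentence or paragraph, already tokenized
    meta: dict                 # hypothetical metadata, e.g. {"year": 1895, "ndc": "9"}


def subnetwork(units: Iterable[Unit], select: Callable[[dict], bool]) -> Counter:
    """Count co-occurring token pairs in the units whose metadata passes `select`."""
    edges: Counter = Counter()
    for unit in units:
        if select(unit.meta):
            # Each distinct triplet counts once per unit; sorting keeps pair order stable.
            nodes = sorted(set(unit.tokens))
            edges.update(combinations(nodes, 2))
    return edges


def lemma_edges(edges: Counter) -> set[frozenset[str]]:
    """Project triplet edges onto lemma pairs, merging orthographic variants."""
    return {frozenset((a.lemma, b.lemma)) for a, b in edges if a.lemma != b.lemma}


# Toy data (invented): the same lemma written with an older and a newer orthography.
units = [
    Unit((Token("彼", "彼", "代名詞"), Token("言う", "云ふ", "動詞")), {"year": 1895, "ndc": "9"}),
    Unit((Token("彼", "彼", "代名詞"), Token("言う", "言う", "動詞")), {"year": 2005, "ndc": "9"}),
    Unit((Token("私", "私", "代名詞"), Token("言う", "言う", "動詞")), {"year": 2005, "ndc": "9"}),
]

early = subnetwork(units, lambda m: m["year"] < 1926 and m["ndc"] == "9")
recent = subnetwork(units, lambda m: m["year"] >= 1975 and m["ndc"] == "9")

shared = lemma_edges(early) & lemma_edges(recent)        # lemma pairs found in both periods
only_recent = lemma_edges(recent) - lemma_edges(early)   # lemma pairs found only in the later period
```

In this toy example the two periods share no edge at the triplet level, because the verb is spelled differently, but they do share the edge between the lemmas 彼 and 言う once orthographic variants are merged. This is the kind of comparison that lemma-based grouping makes possible; a real comparison would presumably use normalized frequencies rather than raw set differences.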

Bibliography

Davies, M. (2010). The Corpus of Historical American English: 400 Million Words, 1810–2009. http://corpus.byu.edu/coha/ (accessed 1 November 2014).

Davies, M. (2011). Google Books (American English) Corpus (155 Billion Words, 1810–2009). http://googlebooks.byu.edu/ (accessed 1 November 2014).

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., . . . Orwant, J., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014): 176–82.