CAT: A Jurilinguistic Application of Automatic Speech to Text Transcription

Benjamin K. T'sou; K. K. Sin; Samuel W.K. Chan; Tom B. Y. Lai; Lawrence Y.L. Cheung; K. T. Ko; Gary K.K. Chan

Authorship

1. Benjamin K. T'sou, City University of Hong Kong
2. K. K. Sin, City University of Hong Kong
3. Samuel W.K. Chan, City University of Hong Kong
4. Tom B. Y. Lai, City University of Hong Kong
5. Lawrence Y.L. Cheung, City University of Hong Kong
6. K. T. Ko, City University of Hong Kong
7. Gary K.K. Chan, City University of Hong Kong

Work text

1. Introduction

British rule made English the only official language in Hong Kong's legal domain for over a century. It was not until the reversion of sovereignty to China in 1997 that Chinese also came to enjoy official status in the Judiciary of Hong Kong. The production and retention of verbatim records of court proceedings is a cornerstone of the Common Law system, and facilities for producing such records are vital to the successful retention of that system in Hong Kong under the "One Country, Two Systems" principle, which brought about the creation of the Hong Kong Special Administrative Region of China. Until recently, records of court proceedings had been kept only in English. Legal bilingualism has therefore created an urgent need for a Chinese Computer-Aided Transcription (CAT) system, on a par with the existing English CAT system, to produce and maintain legally tenable records of court proceedings conducted in Cantonese, the predominant Chinese dialect in Hong Kong (T'sou 1993, Sin and T'sou 1994, Lun et al. 1995). The existing monolingual English CAT system has to be adapted to produce such records. Furthermore, since English will continue to be used frequently in court alongside Cantonese, the eventual Cantonese CAT must operate in parallel with the English CAT so that the existing contingent of court stenographers can switch easily from one to the other. This paper discusses the Jurilinguistic Engineering undertaken to develop a Cantonese CAT system, with special reference to the conversion of phonetically-based stenograph code into Chinese text and to other enhancement features.

2. Computer-Aided Transcription (CAT)

CAT is divided into three stages. First, the stenographer encodes speech into a sequence of phonetically-based shorthand codes, or stenograph codes, which are recorded on a stenograph machine. Second, the Automatic Transcription System (ATS) recovers the original text {c1, . . . , cn} from the sequence of stenograph codes {s1, . . . , sn}. Finally, a post-editing step corrects typing and transcription errors.

There are two major constraints in the development of the Cantonese CAT system. First, there are many homophonous characters, which makes the conversion of phonetically-based stenograph code into Chinese characters difficult. Cantonese (like Mandarin Chinese) is basically a monosyllabic language, and each logograph represents one syllable, so homonymy is a persistent problem in the language. Second, the design of the Cantonese CAT system must capitalize on the existing equipment and on the stenographers' skills in English stenography so that they can switch easily from one environment to the other. The user interface, including keyboard design and input method, should be consistent across the two CAT systems.

3. Ambiguity Resolution - Bigram Model

ATS converts a sequence of stenograph codes {s1, . . . , sk} into a sequence of characters {c1, . . . , ck}. The challenge of the conversion lies in the one-to-many relationship between a stenograph code si and the homophonous characters ci that si can encode. This is the homocode problem in the conversion from phonetic to textual representation. To resolve the ambiguity, we apply the bigram model (Bahl and Mercer 1976, Rabiner 1989, Waibel and Lee 1990, Charniak 1993), which has been used extensively in natural language modelling. The conversion procedure determines the most probable character sequence {c1, . . . , ck} for the input stenograph code sequence {s1, . . . , sk}. In terms of conditional probability, the quantity in (1) is to be maximized.

(1) P(c_1, \ldots, c_k \mid s_1, \ldots, s_k)

where {c1, . . . , ck } denotes a sequence of k characters, and

{s1, . . . , sk} denotes a sequence of k input shorthand codes.

By making some approximation assumptions, the maximization of (1) is recast as the maximization of (2).

(2) \prod_{i=1}^{k} P(c_i \mid c_{i-1}) \, P(s_i \mid c_i)
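
The paper does not spell out its approximation assumptions. One standard route from (1) to (2), offered here only as a sketch, applies Bayes' rule, drops the denominator (which does not depend on the characters), and then assumes that each code depends only on its own character and each character only on its predecessor. The symbol c_0 is an assumed sentence-start marker that does not appear in the original text.

```latex
\begin{align*}
\hat{c}_{1..k}
  &= \arg\max_{c_1,\ldots,c_k} P(c_1,\ldots,c_k \mid s_1,\ldots,s_k) \\
  &= \arg\max_{c_1,\ldots,c_k}
     \frac{P(s_1,\ldots,s_k \mid c_1,\ldots,c_k)\, P(c_1,\ldots,c_k)}
          {P(s_1,\ldots,s_k)} \\
  &= \arg\max_{c_1,\ldots,c_k}
     P(s_1,\ldots,s_k \mid c_1,\ldots,c_k)\, P(c_1,\ldots,c_k) \\
  &\approx \arg\max_{c_1,\ldots,c_k}
     \prod_{i=1}^{k} P(c_i \mid c_{i-1})\, P(s_i \mid c_i)
\end{align*}
```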

P(si | ci) and the bigram probability P(ci | ci-1) can be readily computed from the training data set. The Viterbi algorithm (Viterbi 1967) is implemented to efficiently compute the maximum value of (2).
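
The maximization of (2) can be sketched as a small Viterbi decoder. This is an illustration under stated assumptions, not the authors' implementation: the function name, the `<s>` start symbol, the probability floor for unseen events, the Jyutping-like strings standing in for stenograph codes, and all probabilities in the toy tables are hypothetical.

```python
import math

def viterbi_decode(codes, candidates, emission, bigram, start="<s>", floor=1e-9):
    """Return the most probable character sequence for a list of codes.

    candidates[s]      -> characters that code s may stand for
    emission[(s, c)]   -> P(s | c)
    bigram[(prev, c)]  -> P(c | prev)
    Unseen events get the (assumed) probability `floor`.
    """
    if not codes:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[c] = (log-probability of the best path ending in c, that path)
    best = {start: (0.0, [])}
    for s in codes:
        nxt = {}
        for c in candidates[s]:
            # Choose the best predecessor for character c.
            score, path = max(
                ((p_score + logp(bigram, (prev, c)), p_path)
                 for prev, (p_score, p_path) in best.items()),
                key=lambda t: t[0],
            )
            nxt[c] = (score + logp(emission, (s, c)), path + [c])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

# Toy tables (invented numbers): two codes, two homophonous candidates each.
candidates = {"faat3": ["法", "發"], "gun1": ["官", "觀"]}
emission = {("faat3", "法"): 1.0, ("faat3", "發"): 1.0,
            ("gun1", "官"): 1.0, ("gun1", "觀"): 1.0}
bigram = {("<s>", "法"): 0.4, ("<s>", "發"): 0.6,
          ("法", "官"): 0.5, ("法", "觀"): 0.01,
          ("發", "官"): 0.01, ("發", "觀"): 0.05}

decoded = viterbi_decode(["faat3", "gun1"], candidates, emission, bigram)
```

In this toy case the first character taken in isolation would be the more frequent one, but the following code makes the other candidate the better overall choice, which is the effect the bigram model is meant to capture. Working in log space avoids numerical underflow on long code sequences.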

To evaluate the accuracy of ATS, we conducted transcription tests on two prototypes. The first, CAT2, implements the bigram model for conversion as described above. The second, CAT0, is a baseline built for comparison: for each stenograph code si, it selects the character ci with the highest P(si | ci) value.
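
A context-free baseline of this kind can be sketched in a few lines. The function name, the Jyutping-like strings standing in for stenograph codes, and the probabilities are assumptions made for illustration only.

```python
def baseline_decode(codes, candidates, emission):
    """For each code independently, pick the candidate c with the largest P(s | c)."""
    return [
        max(candidates[s], key=lambda c: emission.get((s, c), 0.0))
        for s in codes
    ]

# Toy tables (invented numbers).
candidates = {"faat3": ["法", "發"], "gun1": ["官", "觀"]}
emission = {("faat3", "法"): 0.9, ("faat3", "發"): 1.0,
            ("gun1", "官"): 1.0, ("gun1", "觀"): 0.7}

baseline = baseline_decode(["faat3", "gun1"], candidates, emission)
```

Because each code is resolved without reference to its neighbours, a given code always yields the same character, whatever the surrounding context.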

We compiled a training corpus of about 0.85 million characters of authentic court proceedings to obtain the conditional probabilities necessary for computation, and used a testing corpus of about 0.15 million characters. After training, CAT0 and CAT2 achieved accuracies of about 78% and 92% respectively. The bigram statistical model thus substantially improves ambiguity resolution.
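
One simple way to obtain such probabilities is relative-frequency counting over a corpus in which every character is paired with its stenograph code. The sketch below assumes that alignment is available and uses unsmoothed maximum-likelihood estimates; the paper does not say how its probabilities were estimated or smoothed, and the function name, the `<s>` start symbol, and the toy data are hypothetical.

```python
from collections import Counter

def estimate_tables(sentences, start="<s>"):
    """Estimate P(c | prev) and P(s | c) by relative frequency.

    sentences: list of sentences, each a list of (code, character) pairs.
    Returns (bigram, emission) dictionaries keyed by (prev, c) and (s, c).
    """
    char_count = Counter()   # how often each character occurs
    prev_count = Counter()   # how often each character (or start) is a left context
    pair_count = Counter()   # (prev, c) co-occurrences
    emit_count = Counter()   # (s, c) co-occurrences
    for sentence in sentences:
        prev = start
        for code, ch in sentence:
            char_count[ch] += 1
            emit_count[(code, ch)] += 1
            pair_count[(prev, ch)] += 1
            prev_count[prev] += 1
            prev = ch
    bigram = {(p, c): n / prev_count[p] for (p, c), n in pair_count.items()}
    emission = {(s, c): n / char_count[c] for (s, c), n in emit_count.items()}
    return bigram, emission

# Toy aligned corpus (invented).
corpus = [
    [("faat3", "法"), ("gun1", "官")],
    [("faat3", "法"), ("gun1", "觀")],
    [("faat3", "發")],
]
bigram, emission = estimate_tables(corpus)
```

With a corpus of realistic size, some form of smoothing for unseen character pairs would normally be needed; that choice is left open here.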

4. Enhancement Facilities

A consistent user interface for both Cantonese CAT and English CAT must be provided so that stenographers can easily operate in both Chinese and English. Two important features were offered to maintain consistency and to improve transcription efficiency.

4.1 "Arbitrary" (as defined in Glassbrenner and Sonntag (1986))

We have been assuming throughout that one keystroke corresponds to one syllable for the sake of simplicity. Nonetheless, such a requirement is not obligatory. A stenograph code may well represent a string of characters of any length. Although the English CAT system is syllable-based, it builds in functions to associate a unique user-defined stenograph code, or an "arbitrary", with a frequently used phrase or expression instead of a single syllable. "Arbitrary" is a critical feature for fast online recording as keystrokes can be significantly reduced.

Two requirements must be met in incorporating "arbitraries". First, a stenographer may not use "arbitraries" consistently, even within the same recording session: an expression may be recorded using an "arbitrary" or entered as a series of syllable-based stenograph codes, and the system must tolerate such variation. Second, the system must be flexible enough to allow the stenographer to invent new ad hoc codes at input time to speed up recording; the CAT system must therefore accept an "arbitrary" that is used before it is defined.

While "arbitraries" in the English CAT system are merely additional entries in the conversion dictionary, incorporating "arbitraries" into our conversion model is more complicated. A macro design is introduced which enables "arbitraries" to be fully integrated into the syllable-based scheme and our statistically-based ATS module. This is achieved by allowing the stenographer to define in an "arbitrary dictionary" an ad hoc macro for a stenograph code sequence plus its corresponding character string. The input is pre-processed so that the ad hoc code will be expanded into a sequence of syllable-based stenograph codes as defined. Subsequently, the expanded code will be subject to the statistically-based conversion. While the stenographer uses ad hoc codes at input time, the conversion procedure operates using the syllable-based code at transcription time.
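
The pre-processing step can be sketched as a simple macro expansion. The dictionary layout, the function name, and the example codes are assumptions for illustration; the paper does not describe the actual data format of the "arbitrary dictionary".

```python
def expand_arbitraries(codes, arbitrary_dict):
    """Replace each ad hoc code with its syllable-based code sequence.

    arbitrary_dict maps an ad hoc code to a dict with two keys:
      "codes": the syllable-based stenograph codes it stands for
      "text":  the corresponding character string (not used in this sketch)
    Codes without an entry are passed through unchanged.
    """
    expanded = []
    for code in codes:
        entry = arbitrary_dict.get(code)
        if entry is None:
            expanded.append(code)
        else:
            expanded.extend(entry["codes"])
    return expanded

# Hypothetical entry: the ad hoc code "FG" stands for a two-syllable term.
arbitrary_dict = {"FG": {"codes": ["faat3", "gun1"], "text": "法官"}}

with_macro = expand_arbitraries(["FG", "gun1"], arbitrary_dict)
without_macro = expand_arbitraries(["faat3", "gun1", "gun1"], arbitrary_dict)
```

Both inputs yield the same syllable-based stream, so the statistical converter sees no difference between an expression keyed with an "arbitrary" and one keyed syllable by syllable. Since expansion happens at transcription time, an entry added to the dictionary after the recording session would still apply, which is one way to meet the second requirement.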

4.2 Domain-Specific Transcription

In English CAT, automatic transcription is supported by special "Job dictionaries" containing specialized vocabularies, which can be activated dynamically depending on the type of case recorded. Different case types have their own legal terms and lexical usage. For instance, the chemical vocabulary of drug-trafficking offences is unlikely to be found in fraud or traffic offences. Integrating all vocabularies into a single training corpus for the Chinese CAT system may obscure the co-occurrence probabilities of some characters. To make the bigram probabilities more reliable, we exploit this domain-specificity of lexical items. Another test was conducted using transcripts pertaining to traffic offences, with a training corpus of about 0.85 million characters and a testing corpus of about 0.15 million characters. The system achieved a higher transcription accuracy of about 95%, compared with about 92% for CAT2 trained on the general corpus. With this feature, the stenographer can choose a particular domain to work with before automatic transcription. At present, the available categories include Assault, Civil, Robbery, and Traffic.
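
Domain selection can be pictured as choosing one set of probability tables, trained on a single case type, before conversion starts. The sketch below is hypothetical: the class, the function, and the empty placeholder tables are not from the paper, and only the four category names come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class DomainModel:
    """Probability tables estimated from one case type's transcripts."""
    bigram: dict = field(default_factory=dict)     # (prev, c) -> P(c | prev)
    emission: dict = field(default_factory=dict)   # (s, c)    -> P(s | c)

def select_model(models, domain):
    """Return the tables for the chosen domain, or fail with a clear message."""
    try:
        return models[domain]
    except KeyError:
        available = ", ".join(sorted(models))
        raise ValueError(
            f"unknown domain {domain!r}; available: {available}"
        ) from None

# Placeholder registry; real tables would be trained per domain.
models = {name: DomainModel() for name in ("Assault", "Civil", "Robbery", "Traffic")}
traffic = select_model(models, "Traffic")
```

The selected tables would then be passed to the converter in place of tables trained on the pooled corpus.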

5. Conclusion

To summarize, the bigram statistical model has been applied to resolve ambiguity in stenograph code to Chinese conversion. Supplemented with the ad hoc codes and domain-specific transcription, the Cantonese CAT system offers a user-friendly environment that matches the English CAT. The resultant high transcription accuracy makes it viable to implement a Cantonese stenograph system on phonologically-based machines which are designed for English, and which can now accommodate both English and Cantonese as dictated by circumstances.

Acknowledgement

Support for the research reported here is provided mainly through the Research Grants Council of Hong Kong under Competitive Earmarked Research Grant CERG 9040326.

References

Bahl, L. R. and Mercer, R. L. (1976). "Part of Speech Assignment by a Statistical Algorithm." IEEE International Symposium on Information Theory, Ronneby, Sweden, June 1976.
Charniak, E. (1993). Statistical Language Learning. MIT Press, Cambridge, MA.
Glassbrenner, M. and Sonntag, G. A. (1986). Computer-Compatible Stenograph Theory. 2 vols. Stenograph Corporation, Illinois.
Linguistic Society of Hong Kong (1997). Yueyu Pinyin Zibiao (Cantonese Jyutping Transliteration Word List). Linguistic Society of HK, Hong Kong.
Lun, S., Sin, K. K., T'sou, B. K. and Cheng, T. A. (1997). "Diannao Fuzhu Yueyu Suji Fangan." (The Cantonese Shorthand System for Computer-Aided Transcription) (in Chinese). In B. H. Zhan (ed), Proceedings of the 5th International Conference on Cantonese and Other Yue Dialects. Jinan University Press, Guangzhou. pp. 217-227.
Rabiner, L. R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE. Reprinted in Waibel and Lee (1990).
Sin, K. K. and T'sou, B. K. (1994). "Hong Kong Courtroom Language: Some Issues on Linguistics and Language Technology." Paper presented at the Third International Conference on Chinese Linguistics, Hong Kong.
T'sou, B. K. (1993). "Some Issues on Law and Language in the Hong Kong Special Administrative Region (HKSAR) of China." In K. Prinsloo et al. (eds), Language, Law and Equality: Proceedings of the 3rd International Conference of the International Academy of Language Law (IALL). University of South Africa, Pretoria. pp. 314-331.
Viterbi, A. J. (1967). "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm." IEEE Transactions on Information Theory 13: 260-269.
Waibel, A. and Lee, K. F. (eds) (1990). Readings in Speech Recognition. Morgan Kaufmann, San Mateo, CA.

Full text license: This text is republished here with permission from the original rights holder.

Conference Info

In review

ACH/ALLC / ACH/ICCH / ALLC/EADH - 2000

Hosted at University of Glasgow

Glasgow, Scotland, United Kingdom

July 21, 2000 - July 25, 2000

Conference website: https://web.archive.org/web/20190421230852/https://www.arts.gla.ac.uk/allcach2k/

Series: ALLC/EADH (27), ACH/ICCH (20), ACH/ALLC (12)

Organizers: ACH, ALLC