Studies and Documents
Vocabularies in Chinese
A Taeko and E Bruce Brooks
estimated publication date: 2010

A Taeko Brooks

This is a more legible reprint of a 1976 mimeographed pamphlet which was never properly distributed at the time, but which remains invaluable for teachers and students of Chinese and for analysts of Chinese texts.

1. The practical ceiling on how fast Chinese may be learned is widely agreed to be imposed by the Chinese characters: they are difficult for beginners to memorize, and the rate at which they may be acquired by one new to the language is therefore the key limiting factor. Economy in the assignment of vocabulary is thus one important basis for classroom efficiency.

2. It follows from the nature of all languages, which consist of a mixture of common and rare words, that the best strategy will be to concentrate from the beginning on the common characters: the ones that permit the most reading per character learned. This fact did not escape the Chinese educational theorists of the previous century, and various studies of written Chinese usage were made at that time. The "modern" portion of this manual conflates three different lists of characters in order of observed frequency, and thus has greater generality than any of those lists by itself could offer. The resulting list is given in sections, each containing words of roughly equal frequency. For the traditional language, word counts were taken from concordances available at the time, and combined in a similar way to give a balanced result, not overly influenced by any one text or type of text. This list is called "literary" rather than "classical" because it was based not solely on pre-Han texts, but also on selected literary and especially poetic writings of the Six Dynasties and Tang. The literary and modern lists were in turn combined into a "composite" list, which is the one to follow where the aim is to maximize the benefit for those studying both languages in parallel. These three lists make up the manual.

3. In an introductory essay, the sources of the three lists are given, and their classroom application is explained. It emerges that something like 850 characters will enable the reading of 89% of an average text, which is not quite the level of consecutive reading. That level is reached with a coverage of 95% of an average text, requiring another 750 characters. A near-independent level is reached with a further 650 characters, for a cumulative vocabulary of 2,250 characters; the coverage at that level is 98% of an average text. Beyond that lies relatively independent reading, with manageable use of the dictionary. If we construe the three levels as three years of nonintensive language study, we note that the second year adds only 6% to the cumulative frequency, and the third year only another 3%. Thus rapidly do the less common characters decline in usefulness. Somewhere in the third year one reaches the point where the text being read, and not the general frequency profile of the language at large, determines what it is most efficient to memorize next. The implied course structure is rather familiar in actual language programs, but it is here given a theoretical explanation based on the nature of the wordstock of Chinese.
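The arithmetic behind these three levels can be laid out in a few lines of Python. This is only an illustrative sketch: the character counts and coverage percentages are the ones quoted in the paragraph above, while the names `LEVELS` and `coverage_table` and the "gain per 100 characters" measure are assumptions introduced here for illustration, not anything taken from the manual.

```python
# Figures quoted in the paragraph above:
# (level, characters added at that level, cumulative text coverage in percent).
LEVELS = [
    ("first", 850, 89),
    ("second", 750, 95),
    ("third", 650, 98),
]


def coverage_table(levels):
    """Return one row per level with running totals and marginal gains.

    The "gain_per_100" field (percentage points of coverage gained per
    100 characters learned) is a hypothetical efficiency measure added
    for illustration only.
    """
    rows = []
    total_chars = 0
    previous_coverage = 0
    for name, added, coverage in levels:
        total_chars += added
        gain = coverage - previous_coverage
        rows.append({
            "level": name,
            "added": added,
            "total": total_chars,
            "coverage": coverage,
            "gain": gain,
            "gain_per_100": round(100 * gain / added, 2),
        })
        previous_coverage = coverage
    return rows


if __name__ == "__main__":
    for row in coverage_table(LEVELS):
        print(
            f"{row['level']:>6}: +{row['added']} chars "
            f"(total {row['total']}), coverage {row['coverage']}%, "
            f"gain {row['gain']} points, "
            f"{row['gain_per_100']} points per 100 chars"
        )
```

On these figures the first 850 characters return roughly ten points of coverage per hundred characters learned, while the second and third levels return less than one point per hundred, which is the steep decline in usefulness that the paragraph describes.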

4. Recent study of several classical texts has shown that a core vocabulary of approximately 2,500 characters will suffice for complete comprehension not only of Jya Yi's Syin Shu, but also of the Jan-gwo Tsv, which is more than twice as long. Vocabulary does not grow in any regular way as a function of text size; it depends on the difficulty of the writing in the text.

5. Annotated texts for self-study, such as Donald Wagner's book on Han Shu 00 (Publisher date), necessarily take for granted the student's prior acquisition of basic reading competence. The lists here presented will more closely define what this means in terms of specific characters, and will permit those preparing such materials to know what to take for granted, and what to annotate. The production of such materials should be stimulated by this greater understanding of the abilities of their probable readers.

6. The implications for linguistics and also for text analysis can here be only briefly explored. One result of interest is that the key datum, the equation behind the cumulative frequency curve, is not a hyperbola, with its implication of a "least effort" principle, as has been supposed by Zipf and Mandelbrot. It is rather a modified hyperbola, whose exact shape is complicated by the fact that early choices (such as in cell division or word definition) affect later choices. The modified equation of such curves gives a better fit than does the hyperbola for such applications as the sizes of cities (Zipf's examples fit well except at the tails of the distribution, but it is precisely the tails of the distribution that indicate whether a given curve is well fitted to the data) and the size of cows at various ages (a cow is not a colony of bacteria in a Petri dish; it has a backbone, and that fact interferes with its growth pattern). The cumulative frequency equation for any real language can be given by supplying the appropriate constant in the equation here provided. No two of those curves are the same, but they are all of the same family, and they all reflect the same forces at work.

This book is essential for every Chinese teacher and for everyone in charge of a Chinese language program, whether the language emphasis is on modern or classical or both. It is equally essential for students working under teachers who have not absorbed its lessons, and who continue to assign characters in random or promiscuous order. Students who know where to concentrate their available learning time will best survive such regimes. And the analyst of Chinese textual dynamics, as well as the practitioner of stylometrics, will find in these pages a fundamental guide to their procedures.


17 May 2007