The Global Lexicostatistical Database: General Description

 

I. Introduction

II. Basic structure

   II.1. Composition of a Level 1 Database

   II.2. Composition of a Level 2 / Level 3 Database

III. Methodological aspects

   III.1. Choosing synonyms

   III.2. Choosing the protoform

   III.3. Dealing with borrowings

 

I. Introduction.

 

The Global Lexicostatistical Database (GLD) is an ongoing linguistic project, initiated by the Moscow school of comparative linguistics and carried out with the support of the Evolution of Human Language (EHL) program at the Santa Fe Institute and the Tower of Babel program, launched in 1998 by Sergei Starostin.

The direct aim of the GLD project is to compile, formalize, and provide public access to standardized basic lexicon wordlists of as many world languages and their dialects as possible, ranging from well-known and expertly studied to poorly documented and analyzed, and also including reconstructed «proto-lists» for numerous common ancestors of modern-day languages.

Since, in accordance with the methodological foundations of comparative-historical linguistics (at least, the particular branch of it represented by the Moscow school), it is first and foremost the basic lexicon that serves as the key to demonstrating genetic relationship, the database will be of the greatest use to those interested in historical-comparative linguistics and issues of genetic classification of the world's languages. However, formally structured lists of comparative lexical data may also be important for specialists in phonetic and semantic typology, sociolinguistics, basic philosophy of language, and various other fields of linguistic and philological research too numerous to mention.

 


 

II. Basic structure.

 

Formally speaking, the GLD does not constitute one single database. Rather, one must think of it as a hierarchical system of wordlists, organized from bottom to top. Such a structuring not only makes it easier to work with overwhelmingly huge quantities of data, but is also in strict accordance with the basic conception of the language family tree, in which multiple common ancestors each give rise to a number of descendant languages, which can, in their turn, be traced back to their common invariant by means of historical linguistics.

Level 1 consists of a series of relatively small databases, each of which contains wordlists for a closely related, non-controversial group of languages whose approximately estimated age does not exceed 3,000 years (see the Glottochronology section below for explication) and is rounded off with a reconstruction of the proto-wordlist for their common ancestor. Typical examples of such databases are: Germanic, Turkic, Polynesian, North Khoisan etc. We reserve the general name group for such taxonomic entities.

Level 2 is occupied by databases that contain only proto-wordlists which are known to belong or, at least, suspected of belonging to related proto-languages. The status of such proto-languages among the linguistic community is generally non-controversial, and their approximate date of disintegration does not generally exceed 6,000 years. The databases are, once again, accompanied by reconstructions of proto-wordlists for the common ancestral language. Typical examples include: Indo-European, Uralic, Austronesian, North Caucasian etc. For these entities, the name family is reserved.

Level 3 consists of databases that compare proto-wordlists for several families, under the assumption that a very deep genetic relation may exist between these families. Since such ultra-deep genetic connections are frequently placed under serious doubt (particularly among specialists convinced that neither the comparative method nor any of its substitutes may yield true positives at a time depth that exceeds 6,000 ~ 8,000 years), the creation and analysis of hypothetical proto-wordlists for such deep taxa is a necessary prerequisite to confirming their historic reality. Typical examples include: Nostratic, Sino-Caucasian, Afro-Asiatic, Niger-Congo, etc. For these entities, we reserve the name macrofamily.

 

II.1. Composition of a Level 1 Database.

 

This section describes and comments upon the general structure and acceptable standards for a typical Level 1 Database — 100-wordlists for relatively «young» language groups, accompanied by a reconstructed proto-wordlist.

 

The list of fields in each such database is as follows (fields that may not necessarily be present are entered in square brackets): Word, Ln1, Ln1Num, [Ln1EtNum], Ln1Notes... Lnn, LnnNum, [LnnEtNum], LnnNotes, [PLn], [PLnNum], [PLnNotes]. Each field has a name, determined by the regular standards of the Starling software (no non-standard symbols), and an alias deciphering the name, as in the example below:

 

Field type | Field name | Field alias | Example (from database: North Khoisan)
Word | WORD | Word | cold
Ln1 | JUH | Zhuǀ'hoan | ǂàʔú
Ln1Num | JUHNUM | Zhuǀ'hoan number | 1
Ln1EtNum | JUHETNUM | Zhuǀ'hoan etymology | 261
Ln1Notes | JUH_NOTES | Zhuǀ'hoan notes | Dickens 1994: 300.
Ln2 | AUE | ǁKxauǁen | ǂxiː
Ln2Num | AUENUM | ǁKxauǁen number | 2
Ln2EtNum | AUEETNUM | ǁKxauǁen etymology | 412
Ln2Notes | AUE_NOTES | ǁKxauǁen notes | Bleek 1929: 29; Bleek 1956: 680. Alternately transcribed as ǂxẽ in [Bleek 1956: 679]. A possible synonym is ǀau 'to be cold, bare' [Bleek 1956: 303]; however, in the English-ǁKxauǁen vocabulary of [Bleek 1929] only the first root is adduced.
Ln3 | EKK | Ekoka !Xung | ǃǃàò ~ ǃǃàʔō
Ln3Num | EKKNUM | Ekoka !Xung number | 1
Ln3EtNum | EKKETNUM | Ekoka !Xung etymology | 261
Ln3Notes | EKK_NOTES | Ekoka !Xung notes | König & Heine 2008: 89. Quoted as ǂàʔō in [Heikkinen 1986: 23]. Polysemy: 'cold / cool / good, well'.
PLn | NKH | Proto North Khoisan | *ǂàʔū
PLnNum | NKHNUM | Protolanguage number | 1
PLnNotes | NKH_NOTES | Protolanguage notes | Distribution: Preserved (mostly) in the Northern and Central clusters. Replacements: Southern cluster: *ǂxãĩ, possibly reflecting a rare semantic development {'to tremble' > 'to be cold'}. Reconstruction shape: Correspondences are regular and trivial.
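
The field layout shown in the example can be sketched in a few lines of Python. This is only an illustration: the helper `level1_fields` is hypothetical (it is not part of Starling), and the naming pattern (`NUM`, `ETNUM`, `_NOTES` appended to the three-letter code) is inferred from the North Khoisan example rather than from any formal specification.

```python
def level1_fields(lang_codes, proto_code=None, with_etymology=True):
    """Return the ordered field names of a Level 1 database (hypothetical helper)."""
    fields = ["WORD"]
    for code in lang_codes:
        fields += [code, f"{code}NUM"]
        if with_etymology:
            # [LnEtNum] is optional: only present when the wordlist is
            # linked to an etymological database.
            fields.append(f"{code}ETNUM")
        fields.append(f"{code}_NOTES")
    if proto_code:
        # [PLn], [PLnNum], [PLnNotes] are optional as well.
        fields += [proto_code, f"{proto_code}NUM", f"{proto_code}_NOTES"]
    return fields


fields = level1_fields(["JUH", "AUE", "EKK"], proto_code="NKH")
print(fields)
```

For the three languages of the example plus the protolanguage, this yields sixteen field names, starting with WORD and ending with NKH_NOTES.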

 

Notes on particular fields:

 

A. Word. Standardized listing of the Swadesh elements, completely identical for all Level 1 databases. The listing includes all of the elements on Swadesh's original 100-wordlist, plus 10 additional items off the original 200-wordlist reserved for special correctional purposes. If the word is followed by a partial synonym in parentheses, this indicates a correction / specification of the original Swadesh meaning: e. g. 'claw (nail)', 'walk (go)', 'warm (hot)' means that the basic meaning of each entry is aligned with the English words '(finger)nail', 'to go', 'hot' rather than with the English words 'claw', 'walk', 'warm'.

Since a single English lexical equivalent can frequently be rendered by a whole number of synonyms in any other given language, the meanings of all Swadesh elements have been expressed and explicated in a relatively precise, restrictive manner, and the wordlists should, wherever possible, be constructed according to these uniform specifications. General information on the principles of synonym elimination can be accessed by clicking on the link to the explanatory paper by A. Kassian, G. Starostin, A. Dybo, and V. Chernov at the top of the page («The Swadesh wordlist. An attempt at semantic specification»).

For personal pronouns, synonymity between at least two stems is assumed to be very common in languages across the world. Such cases presuppose the creation of two different records, with the Word field marked as 'I1', 'I2', 'thou1', 'thou2', 'we1', 'we2' respectively. In the rare cases where other words still show synonymous variants (see notes on «transit synonymy» below), the record is also doubled, but the item in the Word field remains without a numeric index. (Indexed and non-indexed entries receive different treatment in the Web interface.)

 

B. Data: Ln1, Ln2... Lnn. These fields contain the actual data from all the languages in the group for which it has been possible to collect and annotate it. The input is regulated as follows:

 

a) Language names: each field is provided with a three-letter name (invisible on the Web interface, but determining the database structure) and a full «alias» name for convenience. The same double nomination is also valid for each individual database as a whole. The unique formal identification of each wordlist is a six-letter entry, e. g.: NKH_JUH («alias» North Khoisan → Zhuǀ'hoan) = the Swadesh wordlist for the Zhuǀ'hoan language of the North Khoisan group, etc.

 

This approach, which allows for different languages in different families to be encoded with the same three-letter abbreviation, is different from the principle assumed in the Ethnologue, where each language is issued its own unique three-letter code. One reason is mere convenience: the Ethnologue system frequently results in a three-letter code bearing little or no mnemonic resemblance to the designated language. A more important reason is that wordlists are frequently provided and catalogued from multiple different dialects of the same language, each of which has to have its own unique identification (for the Ethnologue, which does not list actual language data, this problem does not exist). Thus, the maximum number of combinations allowed by the 26 letters of the Latin alphabet (26³ = 17,576) may simply turn out to be insufficient in the long run.
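
The arithmetic behind this choice can be checked with a short sketch. The function `wordlist_id` is a hypothetical helper, and the assumption that codes consist of exactly three uppercase Latin letters is taken from the NKH_JUH example only.

```python
import string


def wordlist_id(group_code, lang_code):
    """Build a six-letter wordlist identifier such as NKH_JUH (hypothetical helper)."""
    for code in (group_code, lang_code):
        # Assumption: every code is exactly three uppercase Latin letters.
        if len(code) != 3 or any(ch not in string.ascii_uppercase for ch in code):
            raise ValueError(f"not a three-letter code: {code!r}")
    return f"{group_code}_{lang_code}"


# A single global three-letter namespace (the Ethnologue principle).
codes_per_namespace = len(string.ascii_uppercase) ** 3

# Group code plus language code: each group has its own namespace of language codes.
group_language_pairs = codes_per_namespace ** 2

print(wordlist_id("NKH", "JUH"), codes_per_namespace, group_language_pairs)
```

Because the same language code may be reused under different group codes, the number of distinct identifiers is the square of the single-namespace limit.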

 

b) Transcription: all of the entries are transliterated into the exact same unified transcriptional system that, for the most part, follows the IPA, but also includes convenience-oriented modifications agreed upon by the principal contributors to the Database. The system can be checked at any time by clicking the «Transcription system» link at the top of the page.

The decision to unify transcriptional systems is justified by several factors, such as reader convenience (most users will probably only be familiar with idiosyncratic transcriptional conventions for a small part of the data) and easier application of automatic algorithms for data analysis. In many cases, however, it makes sense to also provide the «specific» transcription for a given language, especially in cases where there exists a long-term orthographic tradition (e. g. English); in a few, it also makes sense to provide data on the original non-Latin-based graphic system (e. g. Chinese characters). In all such cases, the GLD transliteration is followed by the traditional spelling, enclosed in curly brackets: e. g. British English šɔ:t {short}, Beijing Chinese śīŋ {} etc.

 

c) Grammatical aspects of form presentation: for morphologically complex languages, nouns, adjectives, verbs etc., if possible, are presented in the same form (e. g. nominative singular; active infinitive; 1st person singular present tense, etc.) that may be indicated in the general description section of the wordlist. If, for some reason, this is not possible (e. g. the forms are extracted from texts rather than dictionaries), the grammatical characteristics of the attested form should be indicated in the Language Notes section.

If the word is segmentable on the synchronic level or based on easily explainable historical considerations, prefixes are separated from the root with a = sign, suffixes with a hyphen (e. g. Russian u=mer-ˈety 'to die'). If only the bare stem is given (this is not generally recommended, but sometimes the data present no choice), it is always followed by a hyphen.

 

Special note on the marking of compound forms that consist of two or more lexical roots: in all such cases, the lexicostatistical treatment in the GLD calls for the demarcation of the morpheme bearing the principal lexical meaning (this is usually performed on the basis of internal and external comparison). All the other root morphemes are to be treated and designated in the same way as grammatical prefixes and suffixes. E. g.: Beijing Chinese yǜe-lyàŋ {月亮} 'moon' (the second lexical root, lyàŋ 'light', is treated in a suffixal manner).

Infixes, if identifiable, are placed in square brackets: Thao k[m]an 'to eat' etc.

Deviating forms that display important vowel or consonant alternations or even suppletion are generally relegated to the comments section, but in a few cases it may be deemed necessary to put several morphological variants in the same spot. In such cases, they are separated with a slash (e. g. ǁAni / 'thou') and the difference itself is explained in the comments (in this case, masculine vs. feminine form).

 

d) Additional notation: interchangeable variants of the same word attested in the same source without an explicit explanation as to the reasons (e. g. a mix of several dialectal forms) are listed one after the other, separated with a tilde (~), preferably in order of frequency of usage, if any such thing can be established from the source.

Use of regular parentheses is also allowed to denote «facultative» elements that may or may not be encountered in the informant's speech, for phonetic or morphological reasons.

Questionable items (e. g. ones for which there is good reason to surmise an error on the transcriber's part, or ones presented with a slightly different meaning from the required one, but for which there is also good reason to surmise that they may express the required meaning as well) are followed by the # sign. All such cases are explicated in the notes section.
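
The notational conventions of this section (= for prefixes, hyphen for suffixes, square brackets for infixes, tilde for variants, curly brackets for traditional spelling, # for questionable items) lend themselves to mechanical parsing. The following is a minimal sketch under stated assumptions: the names `Variant`, `Entry`, `parse_variant` and `parse_entry` are hypothetical, the # sign is assumed to come last in the field, and slash-separated morphological variants and parenthesized facultative elements are not handled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    prefixes: List[str]
    root: str
    suffixes: List[str]
    infixes: List[str]


@dataclass
class Entry:
    variants: List[Variant]
    orthography: Optional[str]
    questionable: bool


def parse_variant(text):
    """Split one transcribed form into prefixes, root, suffixes and infixes."""
    infixes = re.findall(r"\[(.*?)\]", text)
    bare = re.sub(r"\[.*?\]", "", text)           # root without bracketed infixes
    *prefixes, rest = bare.split("=")             # prefixes end with '='
    root, *suffixes = rest.split("-")             # suffixes start with '-'
    suffixes = [s for s in suffixes if s]         # a bare stem 'mer-' has no suffix
    return Variant(prefixes, root, suffixes, infixes)


def parse_entry(text):
    """Parse a data field such as 'u=mer-ˈety' or 'šɔ:t {short}'."""
    text = text.strip()
    questionable = text.endswith("#")             # assumption: '#' comes last
    text = text.rstrip("#").strip()
    match = re.search(r"\{(.*?)\}\s*$", text)     # traditional spelling, if any
    orthography = match.group(1) if match else None
    if match:
        text = text[:match.start()].strip()
    variants = [parse_variant(v.strip()) for v in text.split("~")]
    return Entry(variants, orthography, questionable)


print(parse_entry("u=mer-ˈety"))
print(parse_entry("k[m]an"))
```

Treating the non-principal root of a compound as a suffix, as in yǜe-lyàŋ, falls out of the same splitting rule: the first hyphen-separated element is taken as the root.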

 

C. Numeration: [Ln1Num]. In this field each word is assigned a numeric value ranging from 1 to infinity, with etymologically cognate words receiving the same number and etymologically divergent ones receiving different numbers in succession. In the Web interface of the database, the cognation number is displayed as a small superscript index (1, 2... n) to the right of the corresponding word. See the General Rules For Cognate Scoring section for details.

If the word is not attested, this is marked with a negative index in the database (any negative number will do, but the usual index is -1); in the Web interface, the corresponding slot remains completely empty. Transparent borrowings on the wordlist (in accordance with Sergei Starostin's revisions of the lexicostatistical procedure) are technically equated with lack of attestation and marked with the same negative index (since the slot is still occupied by the borrowing, this time it actually shows up on the Web interface as well).
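
As an illustration of how cognation numbers and negative indices interact, the sketch below computes the share of matching items between two wordlists while skipping every slot that carries a negative index in either list. The function `shared_percentage` is hypothetical, and the simple percentage is an assumption made for the example; the actual scoring rules are those of the General Rules For Cognate Scoring section.

```python
def shared_percentage(nums_a, nums_b):
    """Percentage of matching cognation numbers over the comparable slots.

    nums_a, nums_b: cognation numbers of two languages, aligned by Swadesh item.
    Slots with a negative index (unattested word or borrowing) are skipped.
    Returns None when no slot is comparable.
    """
    compared = 0
    matches = 0
    for a, b in zip(nums_a, nums_b):
        if a < 0 or b < 0:
            continue                  # excluded from the calculation
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return None
    return 100.0 * matches / compared


# Hypothetical five-item lists: two slots are excluded, two of the remaining three match.
lang_a = [1, 1, -1, 2, 1]
lang_b = [1, 2, 1, 2, -1]
print(shared_percentage(lang_a, lang_b))
```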

 

D. Etymology: Ln1EtNum. If any of the entries in a given wordlist are linked to one of the etymological databases hosted on the Tower of Babel server, they are provided with a second number that is identical with the number of the corresponding etymology in the related etymological database. The numbers are not displayed on the Web interface. Instead, the user can choose one of two options: (a) disregard the connection to the databases by unchecking the «View entries with hyperlinks» option at the top of the page, or (b) make use of it by seeing the entries that lead to Tower of Babel's etymological databases displayed as hyperlinks.

Technically, two numbers are somewhat superfluous (in the original StarLing database set, there is only one numeration for both purposes), but, since many of the prepared wordlists lack any connection to etymological databases, it would still be necessary to employ two different ways of numeration depending on the wordlist (from 1 to infinity for those that are not linked to etymological databases; coinciding with etymological database numbers for those that are). Besides, the wordlists should also be able to function as completely autonomous entities.

E. Notes: Ln1Notes. This field always begins with the indication of the primary source(s) for the attested form, in the standard format: Author Year: Page number (e. g.: Doke 1925: 153). The rest of the field is less formalized and may include any additional information about the corresponding entry that the compiler thinks necessary. Welcome types of information include the following:

 

— additional variants of the same word in complementary data sources (dictionaries / wordlists of the same language / dialect by different authors), with precise references to the source as well (a possible basic formula is "Quoted as {alt. variant} in {Source}"). Note: if there are relatively full wordlists or dictionaries of two closely related dialects of the same language, it is advisable to have a different wordlist for each;

— quasi- and secondary synonyms, denoted as such and, preferably, with a brief explanation of why these particular words are ineligible or less likely to occupy the appropriate «slot» in the wordlist. This is particularly important if the main source is a large dictionary that lists several synonyms for each Swadesh item, indicating their semantic nuances or providing syntactic contexts. Note: if several synonyms are available with no information as to their practical usage in the language, the procedure recommends that the main synonym be the one supported by external data (e. g. the same root is attested in the primary word for the same meaning in closely related languages);

— morphological information (especially information that may be useful for historical purposes, such as various paradigmatic forms with morphophonological alternations, etc.);

— considerations on the reliability of the entry, especially where the word in question is marked with # (for instance, the compiler may have certain reasons to doubt the correctness of the source material, or that of the supplied meaning).

 

F. Protoform information: [PLn], [PLnNum], [PLnNotes]. These fields describe the protoform from which at least one form attested in the language group (usually several cognate forms) is descended.

The first field [PLn] only contains one asterisked reconstruction (or two or more phonetically different variants of the same reconstructed etymon, separated by a tilde ~, if the variants cannot be securely traced back to a single phonetic invariant). If there are alternate etyma strongly eligible for the «Proto-Root» position (listed in the Notes section), the main one should be marked with the # sign («uncertainty»).

Reconstructions may either be taken from an already published source (provided it is reliable from the point of view of sound comparative-historical methodology) or produced by the compiler of the list as preliminary approximations. In the former case, the field [PLnNotes] should also contain all the necessary references, down to page numbers.

 

The second field [PLnNum] contains the cognation index which is, naturally, the same as the number assigned to the attested descendants of the proto-root.

 

The third field [PLnNotes] provides the necessary information to back up the reconstruction, such as:

 

Distribution: Notes on how well the root is represented across the group (e. g. «preserved in all / the majority of daughter languages / dialects», etc.). If two or more «candidates» for the «proto-slot» have been identified, this section should contain the justification behind the ultimate choice;

Replacements: Notes on entries that are seen as innovations in comparison with the proto-etymon: their presumed forms and meanings in the protolanguage (if reconstructible), and, for borrowings, the sources of borrowing (if known);

Reconstruction shape: Notes on the phonetic peculiarities of the reconstruction for the main proto-root (degree of regularity of correspondences; justification of the approximate shape if the reconstruction is preliminary, not based on a thorough scrutiny of the correspondences);

Semantics and structure: Notes on the semantics of the main proto-root in the proto-language (e. g. indication of polysemy, if detectable), as well as on the internal morphological structure if the «root» is actually a complex stem. May involve elements of internal reconstruction if necessary.

If the field contains relevant notes on protolanguage polysemy or semantic change from protolanguage meaning to descendant language meaning, it is recommended to quote them in formulaic notation, e. g.: {'head' > 'hair of head'} (for semantic change), {'head' & 'head hair'} (for polysemy). Such formalization will facilitate the construction of a general database on polysemy and semantic change in the basic lexicon domain in the future.
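
One benefit of the formulaic notation is that it can be harvested automatically from free-text notes. The sketch below shows one possible way to do so with a regular expression; the function name `extract_semantic_notes` is hypothetical, and the pattern assumes straight single quotes and meanings that do not themselves contain an apostrophe.

```python
import re

# {'meaning A' > 'meaning B'} marks semantic change,
# {'meaning A' & 'meaning B'} marks polysemy.
FORMULA = re.compile(r"\{\s*'([^']*)'\s*([>&])\s*'([^']*)'\s*\}")


def extract_semantic_notes(notes):
    """Return (kind, left meaning, right meaning) tuples found in a notes field."""
    results = []
    for left, operator, right in FORMULA.findall(notes):
        kind = "change" if operator == ">" else "polysemy"
        results.append((kind, left, right))
    return results


notes = ("Replacements: Southern cluster: *ǂxãĩ, possibly reflecting a rare "
         "semantic development {'to tremble' > 'to be cold'}.")
print(extract_semantic_notes(notes))
```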

 

II.2. Composition of a Level 2 / Level 3 Database.

 

Since there are no crucial differences between an attested language and a reconstructed protolanguage, there are no significant structural differences between a Level 1 and a Level 2 / 3 database either. The following minor notes should be made (more to follow in the future).

Language names: The three-letter code common for all the languages in the group becomes the designation for the reconstructed proto-language. E. g., NKH will now mean «Proto-North Khoisan» and function as the name of the corresponding field in a common «Khoisan» database (complete designation may be NKH_KHO).

Notes: This field may or may not include information on how the reconstruction was arrived at and any of its phonetic, semantic, or distributional problems, since most of this information would merely replicate the information already presented in the Level 1 Database. A bibliographical reference, however, is necessary (provided one exists).

 


 

III. Methodological aspects.

 

This section describes some of the fundamental (and at the same time technical) issues that arise during the construction of the GLD and need to be treated in a systematic and unified manner. It is intended both for contributors to the database and users whose interest in the GLD goes beyond superficial curiosity.

 

As of now, three major problems have been identified: [a] dealing with synonymous equivalents within the wordlist; [b] choosing the most likely candidate for a given item on the proto-wordlist; [c] resolving the issue of borrowed items on the wordlist. The sections below, without going into a lot of detail, present quasi-algorithmic ways of eliminating these problems.

 

III.1. Choosing synonyms.

 

The basic premise employed in the construction of the GLD is that no two words on the 100-wordlist can function as completely and ubiquitously interchangeable synonyms within one and the same language (dialect). The rather vague common notion of «synonymity» is understood here as representing one of the following phenomena:

quasi-synonymity: two or more words have very similar meanings that are nevertheless somewhat different in their definitions ('tree' / 'wood'; 'feather' / 'quill'), or may vary in stylistic usage ('mouth' / 'trap'; 'dog' / 'hound') or syntactic behavior (German 'wissen' / 'kennen'). The same group also includes suppletive stems, e. g. 'I' / 'me';

pseudo-synonymity: two or more words that are assigned the same meaning in one or more existing data sources, even though the meanings or peculiarities of usage are actually different, due to negligence or lack of time on the part of the compiler of the source;

transit-synonymity: two and only two words that really have the same meaning and are generally interchangeable, except that one is the «old» word and the other is the «new» one, caught in the process of inevitably replacing the «old» one ('stone' > 'rock'; the most obvious examples are the ones attested in languages with a long written history, e. g. Chinese, Greek, etc.).

 

All three types of phenomena are usually identifiable:

 

— quasi-synonymity is understood based on careful perusal of available dictionaries and textual corpora;

— if detailed dictionaries and text examples are lacking, quasi-synonymity may be defined as pseudo-synonymity (the difference in meaning is impossible to establish);

— transit-synonymity is established based on available historical information or comparative-historical evidence.

 

The basic ways of dealing with these three types are as follows:

 

a) Quasi-synonyms, based on as much available evidence as possible, are judged in accordance with the specific definitions and syntactic contexts provided in the paper mentioned above (A. Kassian, G. Starostin, A. Dybo, V. Chernov, «The Swadesh wordlist. An attempt at semantic specification»). In the rare, but possible, case where the article does not offer a clear solution of the problem, the problematic words may be treated as transit-synonyms (see below), but the authors should be informed of the situation so that necessary improvements may be made to make the standard more rigorous.

Rejected («ineligible») quasi-synonyms may be listed in the Notes section, together with remarks on the reasons for their ineligibility, but this is not an obligatory demand if these reasons are fairly simple and incontestable.

 

b) Pseudo-synonyms: Only one pseudo-synonym should be entered in the main field, but all the other ones should necessarily be listed in the Notes section together with a clear description of the reasons why one was given precedence over the others. Such reasons may include, in descending order of importance:

statistical frequency: if the source lists two or more lexemes without specifying the difference in meaning, usually the more «basic» one is the lexeme that crops up more frequently in accompanying texts, syntactic examples (in grammars), etc.;

external support: if the words are given in vocabulary lists, with no textual / syntactic contexts to help understand their usage, the regular technical procedure is to choose the word that finds lexicostatistical (same form, same meaning) or, at least, etymological (same form, different meaning) parallels in closely related languages;

lack of significance: if the difference between the several pseudo-synonyms cannot be told and there is no etymological information attached to any of them, this means that choosing one over the other will affect neither the proto-language reconstructions nor the calculations, and so it does not actually matter which one is listed as «primary» and which ones are listed as «secondary» items in the Notes section.
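
The three reasons listed above, applied in descending order of importance, can be sketched as a small decision routine. Everything in it is an assumption made for illustration: the class `PseudoSynonym`, the function names, the numeric encoding of external support (2 for a lexicostatistical parallel, 1 for an etymological one, 0 for none), and the rule that a criterion decides only when one candidate is strictly ahead of all others.

```python
from dataclasses import dataclass


@dataclass
class PseudoSynonym:
    form: str
    text_frequency: int = 0     # occurrences in texts / examples, 0 if unknown
    external_support: int = 0   # 2 = lexicostatistical, 1 = etymological, 0 = none


def unique_best(candidates, key):
    """Return the candidate strictly ahead on `key`, or None on a tie."""
    ranked = sorted(candidates, key=key, reverse=True)
    if len(ranked) == 1 or key(ranked[0]) > key(ranked[1]):
        return ranked[0]
    return None


def pick_primary(candidates):
    """Return (primary form, reason to record in the Notes section)."""
    criteria = (
        ("statistical frequency", lambda c: c.text_frequency),
        ("external support", lambda c: c.external_support),
    )
    for reason, key in criteria:
        best = unique_best(candidates, key)
        if best is not None:
            return best.form, reason
    # Nothing distinguishes the candidates: the choice has no consequences.
    return candidates[0].form, "lack of significance"


# Hypothetical forms, for illustration only.
print(pick_primary([PseudoSynonym("tala", 5), PseudoSynonym("kumi", 2, 2)]))
```

Whatever the outcome, the remaining candidates still belong in the Notes section together with the stated reason.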

 

c) Transit-synonymity: In those relatively rare cases where it can really be established, transit-synonyms are the only synonyms that should be really listed as true synonyms (besides certain allowed cases of suppletive stems, see below); a new record is inserted in the Starling database file where the second synonym is entered into the corresponding field and is assigned another number.

The Notes section should, in all such cases, clearly state which synonym is considered to be the old one and which one is its more recent «ongoing» replacement.

 

Suppletivism. Most cases of paradigmatic suppletivism should be resolved in favour of one stem only (although the alternate stems may and, in fact, should be mentioned in the Notes section); the list of preferred forms (e. g. singular subject / object verbal stems rather than plural ones, etc.) may be found in the abovementioned paper. The following several cases of suppletivism are, however, so pervasive in languages around the world that treating them as synonyms should be acceptable for GLD standards:

direct / indirect stems for personal pronouns, such as 'I' / 'me';

exclusive / inclusive stems for the 1st person pl. pronoun 'we';

perfective / imperfective markers of negation ('not'), complementarily distributed across the verbal paradigm (note, however, that the prohibitive negation marker is completely ineligible for the wordlist, representing a significantly different meaning).

Compound stems. The basic rules for treating compound stems are formulated in «The Swadesh wordlist. An attempt at semantic specification». Considerations about singling out the «primary» morpheme in a compound stem should be laid out in the Notes section.

 

III.2. Choosing the protoform

 

Filling in the «Protolanguage» field is a responsible procedure that, even in cases when a solid etymological dictionary for the group / family in question is available, should not be reduced to merely copying the corresponding reconstruction from the dictionary. The procedure of choosing the most appropriate proto-stem is generally outlined in G. Starostin's article «Preliminary lexicostatistics as a basis for language classification», included on the site. The main guidelines, in condensed form, are as follows:

 

For languages l_1, l_2, l_3 ... l_n that constitute language group L (all of them descended from proto-language *L), the Swadesh item protoform *I, represented in said languages by forms i_1, i_2, i_3 ... i_n, is established in the following way:

(1) If, etymologically, i_1 = i_2 = i_3 = ... = i_n, the protoform *I is obviously the same root as all of its descendants, and is assigned the exact same number;

(2) If i_1 = i_x, i. e. two (or more) languages share the same root, whereas all other languages have individual, non-corresponding entries, AND languages l_1 and l_x do not form a single node on the lexicostatistical tree, the root shared by l_1 and l_x is also defined as the protoform *I, the basic equivalent of the corresponding Swadesh item in proto-language *L.

Situations (1) and (2) are defined as non-competitive, i. e., cases where there is a clear distributional bias in favour of only one «candidate» for the proto-list. The remaining situations, more complicated in terms of possible solutions, are defined as competitive, when two or more of the attested roots may qualify for proto-list status with comparable chances. They include cases when:

(3) i_1 ≠ i_2 ≠ i_3 ≠ ... ≠ i_n, i. e. all languages in group L have different roots for the same notion;

(4) i_1 = i_x, i. e. two (or more) languages share the same root, whereas all other languages have individual, non-corresponding entries, BUT languages l_1 and l_x form a single node on the lexicostatistical tree (meaning that this particular item may have been innovated, as a lexical replacement, in Proto-l_1-l_x, not going back to *L);

(5) i_1 = i_x & i_2 = i_y, with languages l_1 and l_x forming one node on the tree and languages l_2 and l_y forming another one. (The number of node-forming languages may certainly exceed two). The existence of such pairs / triplets / quadruplets etc. means that, without additional evidence, each of them is equally qualified for proto-list status;

(6) i_1 = i_x & i_2 = i_y, with languages l_1 and l_y forming one node on the tree and languages l_2 and l_x forming another one. This is the trickiest possible situation, a case of so-called «semantic criss-crossing», and is generally explained in two ways: (a) synonymity in the proto-language or (b) independent semantic innovation in two (or more) branches of the group. The refined lexicostatistical procedure requires that, if an instance of (b) is identified, the respective culprits be scored differently from each other, despite their common etymological origin, since the coinciding semantics in such cases does not imply a common ancestral Swadesh item, but rather two independently occurring processes of identical semantic change.
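
The six situations can be told apart mechanically once every language has a cognation number and a position on the tree. The sketch below is a simplified illustration, not the procedure used by the GLD: the function `classify_item` is hypothetical, the tree is reduced to a flat mapping from each language to the low-level node it belongs to, and mixed distributions that fit none of the six descriptions return None.

```python
from collections import defaultdict


def classify_item(roots, node_of):
    """Return the situation number (1-6) for one Swadesh item, or None.

    roots:   language -> cognation number of its form for this item
    node_of: language -> label of the low-level tree node it belongs to
    """
    languages_by_root = defaultdict(set)
    for language, number in roots.items():
        languages_by_root[number].add(language)

    if len(languages_by_root) == 1:
        return 1                                   # all forms are cognate
    shared = [langs for langs in languages_by_root.values() if len(langs) > 1]
    if not shared:
        return 3                                   # every language has its own root

    # Does each shared root cross the boundary between nodes?
    crosses = [len({node_of[lang] for lang in langs}) > 1 for langs in shared]
    if len(shared) == 1:
        return 2 if crosses[0] else 4
    if not any(crosses):
        return 5                                   # each root confined to one node
    if all(crosses):
        return 6                                   # semantic criss-crossing
    return None                                    # mixed case, not covered above


# Hypothetical group of four languages in two nodes.
nodes = {"A": "n1", "B": "n1", "C": "n2", "D": "n2"}
print(classify_item({"A": 1, "B": 2, "C": 1, "D": 2}, nodes))
```

In the usage line, root 1 is shared by A and C and root 2 by B and D, so both roots cut across both nodes, which is the criss-crossing pattern of situation (6).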

 

            Competitive variants *I1, *I2... *In are weighted against each other in the following way:

 

[A] Comparison of etymological parallels. If the form *I1 is always attested in the Swa­desh meaning, whereas the form *I2 is in some languages attested in the Swadesh meaning and in others is attested in a different meaning that is typologically known to shift to the Swadesh mea­ning (e. g., *I1 always = 'eat', whereas *I2 = 'eat ~ chew'), *I1 must be given precedence.

[B] Internal analysis. If the form *I1 is morphologically simple, whereas *I2 is easily ana­lyzable as a stem derived from a root with a different meaning (e. g. 'star' vs. 'sky-eye'), *I1 must be given precedence (NOTE: only if *I1 and *I2 are truly competitive variants. If nine languages out of ten have 'sky-eye' in the meaning 'star', and the tenth language does not constitute a sepa­rate branch that split off before all the others, there is no competition).

[C] External parallels. If, upon the application of parameters [A] and [B], *I1 and *I2 are still weighed as equals, it is permissible to fill the proto-list with the form that constitutes a better match for its closest external relative(s) on the higher level of subgrouping.

 

IMPORTANT NOTE: [C] is a very tricky parameter, inappropriate use of which may lead to a vicious circle in establishing relationships. Making a decision based on external parallels is only permissible if (a) the relationship is, in a completely non-controversial way, already established by non-lexicostatistical means; or (b) the higher-level relationship has already been suggested by lexicostatistical means without relying on such ambiguous cases.

 

If none of the listed criteria is applicable, or if two or more criteria contradict one another and cancel each other out, the proto-form slot should, in theory, remain empty. Nevertheless, for technical reasons it is advisable to fill in the position anyway, even if the competition is on a 50-50 basis, so that the corresponding slot is not left with a negative number (which would skew the calculation results on higher levels). In such cases, any root from the competition pool may be selected at random, and the field should be marked with a # sign, with the other «candidates» properly listed in the Notes section.
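The weighing procedure above can be sketched in code. The following Python fragment is only an illustration under stated assumptions: the class Candidate, its boolean fields, and the function choose_protoform are hypothetical names, and the yes/no inputs to [A], [B] and [C] stand in for judgments that in practice are made by the compiler of the list. A direct contradiction between [A] and [B] is treated here as going straight to the random choice, which is one possible reading of the text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # All fields are hypothetical simplifications of a linguist's judgment.
    root: str
    always_swadesh_meaning: bool   # input to criterion [A]
    morphologically_simple: bool   # input to criterion [B]
    external_match: bool = False   # input to criterion [C]


def _decisive(pool, test):
    """Return the candidates passing `test`, or None if it does not split the pool."""
    passed = [c for c in pool if test(c)]
    return passed if 0 < len(passed) < len(pool) else None


def choose_protoform(candidates, external_allowed=False, rng=None):
    """Return (root, marked_with_hash, other_candidates_for_notes)."""
    rng = rng or random.Random()
    everyone = list(candidates)
    by_a = _decisive(everyone, lambda c: c.always_swadesh_meaning)
    by_b = _decisive(everyone, lambda c: c.morphologically_simple)

    if by_a and by_b:
        shortlist = [c for c in by_a if c in by_b]
        if not shortlist:
            # [A] and [B] point to different forms: assumed to cancel out.
            shortlist = everyone
            external_allowed = False
    else:
        shortlist = by_a or by_b or everyone

    # [C] is consulted only for a remaining tie, and only when permitted.
    if len(shortlist) > 1 and external_allowed:
        shortlist = _decisive(shortlist, lambda c: c.external_match) or shortlist

    if len(shortlist) == 1:
        return shortlist[0].root, False, []
    # Unresolved competition: random pick, to be marked with "#".
    pick = rng.choice(shortlist)
    return pick.root, True, [c.root for c in shortlist if c is not pick]
```

For the 'eat' vs. 'eat ~ chew' example, choose_protoform([Candidate("eat", True, True), Candidate("chew", False, True)]) returns ("eat", False, []). When the second element of the result is True, the slot would carry the # sign and the third element lists the roots to be recorded in the Notes section.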

 

III.3. Dealing with borrowings

 

Items on the list that have been identified as borrowings are marked with negative numbers, which excludes them from lexicostatistical calculations; these measure only internally triggered lexical change, not externally triggered change.
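The effect of negative numbers on a pairwise comparison can be illustrated with a minimal Python sketch. The data layout (a dictionary from Swadesh meaning to an integer cognation index) and the function name shared_cognate_ratio are assumptions made for the example, as is the decision to skip a slot when either of the two lists has a borrowing in it; the actual GLD software may organize this differently.

```python
def shared_cognate_ratio(list_a, list_b):
    """Share of matching cognation indexes among comparable slots.

    list_a, list_b: dicts mapping a Swadesh meaning to an integer cognation
    index (hypothetical layout). A negative index marks a borrowing.
    Returns None when no slot can be compared.
    """
    compared = 0
    matched = 0
    for meaning, index_a in list_a.items():
        index_b = list_b.get(meaning)
        if index_b is None:
            continue  # slot not filled in the second list
        if index_a < 0 or index_b < 0:
            continue  # borrowing on either side: slot is left out (assumption)
        compared += 1
        if index_a == index_b:
            matched += 1
    return matched / compared if compared else None


# Usage: 'fire' is a borrowing in the first list and is ignored,
# so two of the three remaining slots match.
lang_1 = {"eat": 1, "star": 2, "fire": -1, "water": 3}
lang_2 = {"eat": 1, "star": 4, "fire": 1, "water": 3}
ratio = shared_cognate_ratio(lang_1, lang_2)
```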

In many cases, particularly those of relatively shallow families with well-studied histories, identifying borrowings is easy. In other cases, the procedure is more difficult, involving contextual analysis, such as the establishment of a «borrowing triangle»: if the number of look-alikes between A and B is significantly higher than between A and C, but significantly lower than between B and C, this suggests that similarities between A and B may be due to contact (i. e. A and B are either unrelated or related on a much higher level than B and C).
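The «borrowing triangle» condition reduces to two inequalities and can be written down as a short Python check. The text does not define what counts as «significantly» higher or lower, so the margin parameter below is a purely hypothetical threshold, and the function name is likewise invented for the illustration.

```python
def borrowing_triangle(ab, ac, bc, margin=10):
    """Check the pattern AC << AB << BC on look-alike counts.

    ab, ac, bc: numbers of look-alikes between the respective language pairs.
    margin: hypothetical minimum difference treated as 'significant'.
    True suggests that A-B similarities may be due to contact.
    """
    return (ab - ac) >= margin and (bc - ab) >= margin


# Usage with made-up counts: A-B is well above A-C but well below B-C.
suspicious = borrowing_triangle(ab=30, ac=10, bc=60)
```

A True result is only a hint that calls for closer inspection of the individual A-B matches, not a verdict on any particular word.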

In a large number of cases it remains completely unclear if a certain word, identified as a lexical replacement, is of the «original» stock or represents a borrowing from an unknown source. Such situations should be dealt with according to the following steps:

 

(1) If the word has a transparent, non-controversial etymology within the analyzed language group, it is obviously not a borrowing and is scored positive;

(2) If the word has no etymology whatsoever, and its form is highly unusual for a «native» word (e. g., it contains one or more phonemes atypical for that language), it may tentatively be scored as a borrowing;

(3) In all other cases, it is better to mark the form as an internal replacement rather than a borrowing. This is because, strictly speaking, it is not entirely wise to always omit all borrowings from the calculations: some words may penetrate the language in a different meaning and acquire the Swadesh meaning only later in their history. Situations in which Swadesh items get borrowed in Swadesh meanings normally arise in cases of «massive lexical bombardment», i. e. they involve multiple near-simultaneous borrowings from a single source, and such cases are usually easy to identify if data from surrounding languages are available.

 

If, on the other hand, we are dealing with an individual potential borrowing whose source is unidentifiable, there are at least three possible scenarios: (a) it entered the language immediately in the Swadesh meaning; (b) it is not a borrowing at all; (c) it was originally borrowed in a non-Swadesh meaning and only later switched to the Swadesh meaning. Since, out of these three, only situation (a) unconditionally requires marking the form with a negative cognation index, it is relatively safe to count the respective item as a non-borrowing by default. Nevertheless, such cases may be marked in the Notes section with the comment «Possibly borrowed from an unknown source».
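Steps (1)-(3), together with the default treatment of unidentifiable sources, amount to a small decision procedure. The Python sketch below is one possible rendering of it; the function name, the boolean arguments and the wording of the first note are assumptions introduced for the example, and each argument summarizes a judgment that has to be made by the researcher.

```python
def score_replacement(internal_etymology, any_etymology, atypical_form,
                      suspected_loan=False):
    """Return (sign, note) for a word identified as a lexical replacement.

    sign: +1 keeps the cognation index positive, -1 marks a borrowing.
    note: optional text for the Notes section, or None.
    All four arguments are hypothetical yes/no summaries of expert judgment.
    """
    # Step (1): transparent etymology inside the group, so not a borrowing.
    if internal_etymology:
        return 1, None
    # Step (2): no etymology at all and an atypical shape.
    if not any_etymology and atypical_form:
        return -1, "Tentatively scored as a borrowing"
    # Step (3): default to internal replacement; flag the doubt if there is one.
    note = "Possibly borrowed from an unknown source" if suspected_loan else None
    return 1, note
```

For instance, score_replacement(False, False, False, suspected_loan=True) returns (1, "Possibly borrowed from an unknown source"): the item still counts in the calculations, but the doubt is recorded.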