The Global Lexicostatistical Database

The Global Lexicostatistical Database: General Description

I. Introduction

II. Basic structure

II.1. Composition of a Level 1 Database

II.2. Composition of a Level 2 / Level 3 Database

III. Methodological aspects

III.1. Choosing synonyms

III.2. Choosing the protoform

III.3. Dealing with borrowings

I. Introduction.

The Global Lexicostatistical Database (GLD) is an ongoing linguistic project, initiated by the Moscow school of comparative linguistics and carried out with the support of the Evolution of Human Language (EHL) program at the Santa Fe Institute and the Tower of Babel program, launched in 1998 by Sergei Starostin.

The direct aim of the GLD project is to compile, formalize, and provide public access to standardized basic lexicon wordlists of as many world languages and their dialects as possible, ranging from well-known and expertly studied to poorly documented and analyzed, and also including reconstructed «proto-lists» for numerous common ancestors of modern day languages.

Since, in accordance with the methodological foundations of comparative-historical linguistics (at least, the particular branch of it represented by the Moscow school), it is first and foremost the basic lexicon that serves as the key to demonstrating genetic relationship, the database will be of the greatest use to those interested in historical-comparative linguistics and issues of genetic classification of the world's languages. However, formally structured lists of comparative lexical data may also be important for specialists in phonetic and semantic typology, sociolinguistics, basic philosophy of language, and various other fields of linguistic and philological research too numerous to mention.

II. Basic structure.

Formally speaking, the GLD does not constitute one single database. Rather, one must think of it as a hierarchical system of wordlists, organized from bottom to top. Such a structuring not only makes it easier to work with overwhelmingly huge quantities of data, but is also in strict accordance with the basic conception of the language family tree, in which multiple common ancestors each give rise to a number of descendant languages, which can, in their turn, be traced back to their common invariant by means of historical linguistics.

Level 1 consists of a series of relatively small databases, each of which contains wordlists for a closely related, non-controversial group of languages whose approximately estimated age does not exceed 3,000 years (see the Glottochronology section below for explication) and is rounded off with a reconstruction of the proto-wordlist for their common ancestor. Typical examples of such databases are: Germanic, Turkic, Polynesian, North Khoisan etc. We reserve the general name group for such taxonomic entities.

Level 2 is occupied by databases that contain only proto-wordlists which are known to belong or, at least, suspected of belonging to related proto-languages. The status of such proto-languages among the linguistic community is generally non-controversial, and their approximate date of disintegration does not generally exceed 6,000 years. The databases are, once again, accompanied by reconstructions of proto-wordlists for the common ancestral language. Typical examples include: Indo-European, Uralic, Austronesian, North Caucasian etc. For these entities, the name family is reserved.

Level 3 consists of databases that compare proto-wordlists for several families, under the assumption that a very deep genetic relation may exist between these families. Since such ultra-deep genetic connections are frequently placed under serious doubt (particularly among specialists convinced that neither the comparative method nor any of its substitutes may yield true positives at a time depth that exceeds 6,000 ~ 8,000 years), the creation and analysis of hypothetical proto-wordlists for such deep taxa is a necessary prerequisite to confirming their historic reality. Typical examples include: Nostratic, Sino-Caucasian, Afro-Asiatic, Niger-Congo, etc. For these entities, we reserve the name macrofamily.
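The three-level terminology can be summarized in a small sketch. It assumes that the approximate age thresholds quoted above (3,000 and 6,000 years) are used as hard cut-offs, which is a simplification: the text treats them as approximate and also takes into account how controversial the taxon is. The function name `taxon_rank` is hypothetical.

```python
def taxon_rank(age_years):
    """Map an estimated age of a taxon to the GLD name for it.

    Assumption: the approximate limits from the text are treated as
    strict boundaries, which the GLD itself does not claim.
    """
    if age_years <= 3000:
        return "group"        # Level 1, e.g. Germanic, Turkic
    if age_years <= 6000:
        return "family"       # Level 2, e.g. Indo-European, Uralic
    return "macrofamily"      # Level 3, e.g. Nostratic, Sino-Caucasian


rank = taxon_rank(2500)
```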

II.1. Composition of a Level 1 Database.

This section describes and comments upon the general structure and acceptable standards for a typical Level 1 Database — 100-wordlists for relatively «young» language groups, accompanied by a reconstructed proto-wordlist.

The list of fields in each such database is as follows (fields that may not necessarily be present are entered in square brackets): Word, Ln₁, Ln₁Num, [Ln₁EtNum], Ln₁Notes... Ln_n, Ln_nNum, [Ln_nEtNum], Ln_nNotes, [PLn], [PLnNum], [PLnNotes]. Each field has a name, determined by the regular standards of the Starling software (no non-standard symbols), and an alias deciphering the name, as in the example below:

Field type	Field name	Field alias	Example (from database: North Khoisan)

Word	WORD	Word	cold
Ln₁	JUH	Zhuǀ'hoan	ǂàʔú
Ln₁Num	JUHNUM	Zhu|'hoan number	1
Ln₁EtNum	JUHETNUM	Zhu|'hoan etymology	261
Ln₁Notes	JUH_NOTES	Zhu|'hoan notes	Dickens 1994: 300.
Ln₂	AUE	ǁKxauǁen	ǂxiː
Ln₂Num	AUENUM	ǁKxauǁen number	2
Ln₂EtNum	AUEETNUM	ǁKxauǁen etymology	412
Ln₂Notes	AUE_NOTES	ǁKxauǁen notes	Bleek 1929: 29; Bleek 1956: 680. Alternately transcribed as ǂxẽ in [Bleek 1956: 679]. A possible synonym is |au 'to be cold, bare' [Bleek 1956: 303]; however, in the English-ǁKxauǁen vocabulary of [Bleek 1929] only the first root is adduced.
Ln₃	EKK	Ekoka !Xung	ǃǃàò ~ ǃǃàʔō
Ln₃Num	EKKNUM	Ekoka !Xung number	1
Ln₃EtNum	EKKETNUM	Ekoka !Xung etymology	261
Ln₃Notes	EKK_NOTES	Ekoka !Xung notes	König & Heine 2008: 89. Quoted as ǂàʔō in [Heikkinen 1986: 23]. Polysemy: 'cold / cool / good, well'.
PLn	NKH	Proto North Khoisan	*ǂàʔū
PLnNum	NKHNUM	Protolanguage number	1
PLnNotes	NKH_NOTES	Protolanguage notes	Distribution: Preserved (mostly) in the Northern and Central clusters. Replacements: Southern cluster: *ǂxãĩ, possibly reflecting a rare semantic development {'to tremble' > 'to be cold'}. Reconstruction shape: Correspondences are regular and trivial.
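The naming pattern visible in the table (code, code + NUM, code + ETNUM, code + _NOTES) can be sketched as follows. This is only an illustration inferred from the North Khoisan example; the function `level1_field_names` and its parameters are hypothetical and are not part of the Starling software.

```python
def level1_field_names(language_codes, proto_code=None, with_etymology=True):
    """Build the list of field names for a Level 1 database.

    Assumption: names follow the pattern seen in the example table
    (JUH, JUHNUM, JUHETNUM, JUH_NOTES); optional fields can be left out.
    """
    fields = ["WORD"]
    for code in language_codes:
        fields.append(code)                 # the word itself
        fields.append(code + "NUM")         # cognation index
        if with_etymology:
            fields.append(code + "ETNUM")   # optional etymology link
        fields.append(code + "_NOTES")      # sources and comments
    if proto_code is not None:
        # The protolanguage has no etymology-number field in the table.
        fields += [proto_code, proto_code + "NUM", proto_code + "_NOTES"]
    return fields


fields = level1_field_names(["JUH", "AUE", "EKK"], proto_code="NKH")
```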

Notes on particular fields:

A. Word. Standardized listing of the Swadesh elements, completely identical for all Level 1 databases. The listing includes all of the elements on Swadesh's original 100-wordlist, plus 10 additional items off the original 200-wordlist reserved for special correctional purposes. If the word is followed by a partial synonym in parentheses, this indicates a correction / specification of the original Swadesh meaning: e. g. 'claw (nail)', 'walk (go)', 'warm (hot)' means that the basic meaning of each entry is rather aligned with the English words '(finger)nail', 'to go', 'hot' than with the English words 'claw', 'walk', 'warm'.

Since a single English lexical equivalent can frequently be rendered by a whole range of synonyms in any other given language, the meanings of all Swadesh elements have been expressed and explicated in a relatively precise, restrictive manner, and the wordlists should, wherever possible, be constructed according to these uniform specifications. General information on the principles of synonym elimination can be accessed by clicking on the link to the explanatory paper by A. Kassian, G. Starostin, A. Dybo, and V. Chernov at the top of the page («The Swadesh wordlist. An attempt at semantic specification»).

For personal pronouns, where synonymity between at least two stems is assumed to be very common in languages across the world, such cases presuppose the creation of two different records, with the Word field marked as 'I₁', 'I₂', 'thou₁', 'thou₂', 'we₁', 'we₂' respectively. In the rare cases where other words still show synonymous variants (see notes on «transit synonymy» below), the record is also doubled, but the item in the Word field remains without a numeric index. (Indexed and non-indexed entries receive different treatment in the Web interface).

B. Data: Ln₁, Ln₂... Ln_n. These fields contain the actual data from all the languages in the group for which it has been possible to collect and annotate it. The input is regulated as follows:

a) Language names: each field is provided with a three-letter name (invisible on the Web interface, but determining the database structure) and a full «alias» name for convenience. The same double nomination is also valid for each individual database as a whole. The unique formal identification of each wordlist is a six-letter entry, e. g.: NKH_JUH («alias» North Khoisan → Zhu|'hoan) = the Swadesh wordlist for the Zhu|'hoan language of the North Khoisan group, etc.

This approach, which allows different languages in different families to be encoded with the same three-letter abbreviation, differs from the principle adopted in the Ethnologue, where each language is issued its own unique three-letter code. One reason for this is mere convenience: the Ethnologue system frequently results in a three-letter code that bears little or no mnemonic resemblance to the language it designates. A more important reason is that wordlists are frequently provided and catalogued for multiple dialects of the same language, each of which must have its own unique identification. For the Ethnologue, which does not list actual language data, this problem does not exist; for the GLD, the maximum number of combinations allowed by the 26 letters of the Latin alphabet (26³ = 17576) may simply turn out to be insufficient in the long run.
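The arithmetic behind this choice, and the composition of a wordlist identifier such as NKH_JUH, can be illustrated with a short sketch. The helper `wordlist_id` and its validation rules are assumptions made for the example, not a description of the actual database software.

```python
import string

# Number of distinct three-letter codes over the Latin alphabet: 26 ** 3.
CODE_SPACE = len(string.ascii_uppercase) ** 3


def wordlist_id(group_code, language_code):
    """Join a group code and a language code into one identifier.

    Assumption: both codes are exactly three ASCII letters.
    """
    for code in (group_code, language_code):
        if len(code) != 3 or not code.isascii() or not code.isalpha():
            raise ValueError(f"not a three-letter code: {code!r}")
    return f"{group_code.upper()}_{language_code.upper()}"


juh_id = wordlist_id("NKH", "JUH")
```

Because a language code only has to be unique within its group, the space available to the combined identifier is CODE_SPACE squared rather than CODE_SPACE.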

b) Transcription: all of the entries are transliterated into the exact same unified transcriptional system that, for the most part, follows the IPA, but also includes convenience-oriented modifications agreed upon by the principal contributors to the Database. The system can be checked at any time by clicking the «Transcription system» link at the top of the page.

The decision to unify transcriptional systems is justified by several factors, such as reader convenience (most users will probably only be familiar with idiosyncratic transcriptional conventions for a small part of the data) and easier application of automatic algorithms for data analysis. In many cases, however, it makes sense to also provide the «specific» transcription for a given language, especially in cases where there exists a long-term orthographic tradition (e. g. English); in a few, it also makes sense to provide data on the original non-Latin-based graphic system (e. g. Chinese characters). In all such cases, the GLD transliteration is followed by the traditional spelling, enclosed in curly brackets: e. g. British English šɔ:t {short}, Beijing Chinese śīŋ {星} etc.

c) Grammatical aspects of form presentation: for morphologically complex languages, nouns, adjectives, verbs etc., if possible, are presented in the same form (e. g. nominative singular; active infinitive; 1st person singular present tense, etc.) that may be indicated in the general description section of the wordlist. If, for some reason, this is not possible (e. g. the forms are extracted from texts rather than dictionaries), the grammatical characteristics of the attested form should be indicated in the Language Notes section.

If the word is segmentable on the synchronic level or based on easily explainable historical considerations, prefixes are separated from the root with a = sign, suffixes with a hyphen (e. g. Russian u=mer-ˈet^y 'to die'). If only the bare stem is given (this is not generally recommendable, but sometimes the data present no choice), it is always followed by a hyphen.

Special note on the marking of compound forms that consist of two or more lexical roots: in all such cases, the lexicostatistical treatment in the GLD calls for the demarcation of the morpheme bearing the principal lexical meaning (this is usually performed on the basis of internal and external comparison). All the other root morphemes are to be treated and designated in the same way as grammatical prefixes and suffixes. E. g.: Beijing Chinese yǜe-l^yàŋ {月亮} 'moon' (the second lexical root, l^yàŋ 'light', is treated in a suffixal manner).

Infixes, if identifiable, are placed in square brackets: Thao k[m]an 'to eat' etc.

Deviating forms that display important vowel or consonant alternations or even suppletion are generally relegated to the comments section, but in a few cases it may be deemed necessary to put several morphological variants in the same spot. In such cases, they are separated with a slash (e. g. ǁAni cá / há 'thou') and the difference itself is explained in the comments (in this case — masculine vs. feminine form).

d) Additional notation: interchangeable variants of the same word attested in the same source without an explicit explanation as to the reasons (e. g. a mix of several dialectal forms) are listed one after the other, separated with a tilde (~), preferably in order of frequency of usage, if any such thing can be established from the source.

Regular parentheses may also be used to denote «facultative» elements that may or may not be encountered in the informant's speech, for phonetic or morphological reasons.

Questionable items — e. g. ones for which there is good reason to surmise an error on the transcriber's part, or ones presented with a slightly different meaning from the required one, but for which there is also a good reason to surmise that they may express the required meanings as well, etc. — are followed by the # sign. (All such cases are explicated in the notes section).
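The notation conventions above lend themselves to simple mechanical parsing. The sketch below covers only part of them: the trailing # mark, tilde-separated variants, traditional spelling in curly brackets, and the = and hyphen morpheme boundaries. Slashes, parentheses, and infix brackets are left untouched. The function names and the returned structures are hypothetical.

```python
import re

# One variant: a transcribed form, optionally followed by {traditional spelling}.
VARIANT_RE = re.compile(r"(?P<form>[^{}]+?)\s*(?:\{(?P<traditional>[^{}]*)\})?")


def parse_entry(cell):
    """Split a data cell into variants and a 'questionable' flag."""
    text = cell.strip()
    questionable = text.endswith("#")
    if questionable:
        text = text[:-1].rstrip()
    variants = []
    for chunk in text.split("~"):
        match = VARIANT_RE.fullmatch(chunk.strip())
        if match is None:
            raise ValueError(f"unrecognized variant: {chunk!r}")
        variants.append((match["form"], match["traditional"]))
    return {"variants": variants, "questionable": questionable}


def segment(form):
    """Return (prefixes, root, suffixes) using '=' and '-' as boundaries."""
    *prefixes, rest = form.split("=")
    root, *suffixes = rest.split("-")
    # A bare stem such as "mer-" leaves an empty trailing piece; drop it.
    return prefixes, root, [s for s in suffixes if s]


entry = parse_entry("ǃǃàò ~ ǃǃàʔō")
parts = segment("u=mer-ˈet^y")
```

With the Russian example from the text, `segment` yields the prefix u, the root mer, and one suffix; with the Beijing Chinese compound it yields the root yǜe and treats the second lexical root as a suffix, as the notation prescribes.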

C. Numeration: [Ln₁Num]. In this field each word is assigned a numeric value ranging from 1 to infinity, with etymologically cognate words receiving the same number and etymologically divergent ones receiving different numbers in succession. In the Web interface of the database, the cognation number is displayed as a small superscript index (¹, ²... ⁿ) to the right of the corresponding word. See the General Rules For Cognate Scoring section for details.

If the word is not attested, this is marked with a negative index in the database (any negative number will do, but the usual index is -1); in the Web interface, the corresponding slot remains completely empty. Transparent borrowings on the wordlist (in accordance with Sergei Starostin's revisions of the lexicostatistical procedure) are technically equated with lack of attestation and marked with the same negative index (since the slot is still occupied by the borrowing, this time it actually shows up on the Web interface as well).
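How the cognation indexes might feed a pairwise comparison can be sketched as follows. The measure used here (the share of matching indexes among items that carry a positive index in both lists) is the conventional lexicostatistical ratio and is an assumption of this example; the rules actually applied in the GLD are those of the General Rules For Cognate Scoring section. The function name is hypothetical.

```python
def shared_cognate_share(nums_a, nums_b):
    """Share of matching cognation indexes between two aligned wordlists.

    Items with a non-positive index in either list (unattested words
    or borrowings) are skipped. Returns None if nothing is comparable.
    """
    compared = 0
    matches = 0
    for a, b in zip(nums_a, nums_b):
        if a <= 0 or b <= 0:
            continue  # missing or borrowed: excluded from the count
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return None
    return matches / compared


share = shared_cognate_share([1, 1, 2, -1, 3], [1, 2, 2, 4, -1])
```

In this toy input three items are comparable and two of them match, so the share is two thirds.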

D. Etymology: Ln₁EtNum. If any of the entries in a given wordlist are linked to one of the etymological databases hosted on the Tower of Babel server, they are provided with a second number that is identical with the number of the corresponding etymology in the related etymological database. The numbers are not displayed on the Web interface. Instead, the user can choose one of two options: (a) disregard the connection to the databases by unchecking the «View entries with hyperlinks» option at the top of the page, or (b) make use of it by seeing the entries that lead to Tower of Babel's etymological databases displayed as hyperlinks.

Technically, two numbers are somewhat superfluous — in the original StarLing database set, there is only one numeration for both purposes — but, since many of the prepared wordlists lack any connection to etymological databases, it would still be necessary to employ two different ways of numeration depending on the wordlist (from 1 to infinity for those that are not linked to etymological databases; coinciding with etymological database numbers for those that are). Besides, the wordlists should also be able to function as completely autonomous entities.

E. Notes: Ln₁Notes. This field always begins with the indication of the primary source(s) for the attested form, in the standard format: Author Year: Page number (e. g.: Doke 1925: 153). The rest of the field is less formalized and may include any additional information about the corresponding entry that the compiler thinks necessary. Welcome types of information include the following:

— additional variants of the same word in complementary data sources (dictionaries / wordlists of the same language / dialect by different authors), with precise references to the source as well (a possible basic formula is "Quoted as {alt. variant} in {Source}"). Note: if there are relatively full wordlists or dictionaries of two closely related dialects of the same language, it is advisable to have a different wordlist for each;

— quasi- and secondary synonyms, denoted as such and, preferably, with a brief explanation of why these particular words are ineligible or less likely to occupy the appropriate «slot» in the wordlist. This is particularly important if the main source is a large dictionary that lists several synonyms for each Swadesh item, indicating their semantic nuances or providing syntactic contexts. Note: if several synonyms are available with no information as to their practical usage in the language, the procedure recommends that the main synonym be the one supported by external data (e. g. the same root is attested in the primary word for the same meaning in closely related languages);

— morphological information (especially one that may be useful for historical purposes, such as various paradigmatic forms with morphophonological alternations, etc.);

— considerations on the reliability of the entry, especially where the word in question is marked with # (for instance, the compiler may have certain reasons to doubt the correctness of the source material, or that of the supplied meaning).
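Because every Notes field opens with its primary source(s) in the Author Year: Page format, the leading references can be extracted mechanically. The sketch below assumes that the source block ends at the first full stop and that multiple sources are separated by semicolons, as in the North Khoisan examples; author names containing a full stop would need a more careful pattern. The names `SOURCE_RE` and `leading_sources` are hypothetical.

```python
import re

# Author(s), a four-digit year, a colon, and a page number.
SOURCE_RE = re.compile(
    r"(?P<author>[^\d:;\[\]]+?)\s+(?P<year>\d{4}):\s*(?P<page>\d+)"
)


def leading_sources(note):
    """Return (author, year, page) tuples from the start of a Notes field."""
    head = note.split(".", 1)[0]  # assumption: sources end at the first "."
    return [
        (m["author"].strip(), int(m["year"]), int(m["page"]))
        for m in SOURCE_RE.finditer(head)
    ]


sources = leading_sources(
    "Bleek 1929: 29; Bleek 1956: 680. Alternately transcribed as ǂxẽ."
)
```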

F. Protoform information: [PLn], [PLnNum], [PLnNotes]. These fields describe the protoform from which at least one form (usually several cognate forms) attested in the language group is descended.

The first field [PLn] only contains one asterisked reconstruction (or two or more phonetically different variants of the same reconstructed etymon, separated by a tilde ~, if the variants cannot be securely traced back to a single phonetic invariant). If there are alternate etyma strongly eligible for the «Proto-Root» position (listed in the Notes section), the main one should be marked with the # sign («uncertainty»).

Reconstructions may either be taken from an already published source (provided it is reliable from the point of view of sound comparative-historical methodology) or produced by the compiler of the list as preliminary approximations. In the former case, the field [PLnNotes] should also contain all the necessary references, down to page numbers.

The second field [PLnNum] contains the cognation index which is, naturally, the same as the number assigned to the attested descendants of the proto-root.

The third field [PLnNotes] provides the necessary information to back up the reconstruction, such as:

— Distribution: Notes on how well the root is represented across the group (e. g. «preserved in all / the majority of daughter languages / dialects», etc.). If two or more «candidates» for the «proto-slot» have been identified, this section should contain the justification behind the ultimate choice;

— Replacements: Notes on entries that are seen as innovations in comparison with the proto-etymon: their presumed forms and meanings in the protolanguage (if reconstructible), for borrowings — the sources of borrowing (if known);

— Reconstruction shape: Notes on the phonetic peculiarities of the reconstruction for the main proto-root (degree of regularity of correspondences; justification of the approximate shape if the reconstruction is preliminary, not based on a thorough scrutiny of the correspondences);

— Semantics and structure: Notes on the semantics of the main proto-root in the proto-language (e. g. indication of polysemy, if detectable), as well as on the internal morphological structure if the «root» is actually a complex stem. May involve elements of internal reconstruction if necessary.

If the field contains relevant notes on protolanguage polysemy or semantic change from protolanguage meaning to descendant language meaning, it is recommendable to quote them in formulaic notation, e. g.: {'head' > 'hair of head'} (for semantic change), {'head' & 'head hair'} (for polysemy). Such formalization will facilitate the construction of a general database on polysemy and semantic change in the basic lexicon domain in the future.
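The formulaic notation is regular enough to be harvested automatically, which is what a future database of polysemy and semantic change would require. The following sketch assumes that meanings are quoted with straight apostrophes and contain no apostrophes themselves; the names `FORMULA_RE` and `semantic_formulas` are hypothetical.

```python
import re

# {'meaning' > 'meaning'} for semantic change, {'meaning' & 'meaning'} for polysemy.
FORMULA_RE = re.compile(r"\{\s*'([^']*)'\s*([>&])\s*'([^']*)'\s*\}")

KINDS = {">": "change", "&": "polysemy"}


def semantic_formulas(note):
    """Return (kind, left meaning, right meaning) for each formula in a note."""
    return [
        (KINDS[op], left, right)
        for left, op, right in FORMULA_RE.findall(note)
    ]


found = semantic_formulas(
    "Replacements: Southern cluster: *ǂxãĩ, possibly reflecting a rare "
    "semantic development {'to tremble' > 'to be cold'}."
)
```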

II.2. Composition of a Level 2 / Level 3 Database.

Since there are no crucial differences between an attested language and a reconstructed protolanguage, there are no significant structural differences between a Level 1 and a Level 2 / 3 database either. The following minor notes should be made (more to follow in the future).

Language names: The three-letter code common for all the languages in the group becomes the designation for the reconstructed proto-language. E. g., NKH will now mean «Proto-North Khoisan» and function as the name of the corresponding field in a common «Khoisan» database (complete designation may be NKH_KHO).

Notes: This field may or may not include information on how the reconstruction was arrived at and any of its phonetic, semantic, or distributional problems, since most of this information would merely replicate the information already presented in the Level 1 Database. A bibliographical reference, however, is necessary (provided one exists).

III. Methodological aspects.

This section describes some of the fundamental (and at the same time technical) issues that arise during the construction of the GLD and need to be treated in a systematic and unified manner. It is intended both for contributors to the database and users whose interest in the GLD goes beyond superficial curiosity.

As of now, three major problems have been identified: [a] dealing with synonymous equivalents within the wordlist; [b] choosing the most likely candidate for a given item on the proto-wordlist; [c] resolving the issue of borrowed items on the wordlist. The sections below, without going into a lot of detail, present quasi-algorithmic ways of eliminating these problems.

III.1. Choosing synonyms.

The basic premise employed in the construction of the GLD is that no two words on the 100-wordlist can function as completely and ubiquitously interchangeable synonyms within one and the same language (dialect). The rather vague common notion of «synonymity» is understood therein as representing one of the following phenomena:

— quasi-synonymity: two or more words have very similar meanings that are nevertheless somewhat different in their definitions ('tree' / 'wood'; 'feather' / 'quill'), or may vary in stylistic usage ('mouth' / 'trap'; 'dog' / 'hound') or syntactic behavior (German 'wissen' / 'kennen'). The same group also includes suppletive stems, e. g. 'I' / 'me';

— pseudo-synonymity: two or more words that are assigned the same meaning in one or more existing data sources, even though the meanings or peculiarities of usage are actually different, due to negligence or lack of time on the part of the compiler of the source;

— transit-synonymity: two and only two words that really have the same meaning and are generally interchangeable, except that one is the «old» word and the other is the «new» one, caught in the process of inevitably replacing the «old» one ('stone' > 'rock'; the most obvious examples are the ones attested in languages with a long written history, e. g. Chinese, Greek, etc.).

All three types of phenomena are usually identifiable:

— quasi-synonymity is understood based on careful perusal of available dictionaries and textual corpora;

— if detailed dictionaries and text examples are lacking, quasi-synonymity may be defined as pseudo-synonymity (the difference in meaning is impossible to establish);

— transit-synonymity is established, based on available historical information or comparative-historical evidence.

The basic ways of dealing with these three types are as follows:

a) Quasi-synonyms, based on as much available evidence as possible, are judged in accordance with the specific definitions and syntactic contexts provided in the paper mentioned above (A. Kassian, G. Starostin, A. Dybo, V. Chernov, «The Swadesh wordlist. An attempt at semantic specification»). In the rare, but possible, case where the article does not offer a clear solution of the problem, the problematic words may be treated as transit-synonyms (see below), but the authors should be informed of the situation so that necessary improvements may be made to make the standard more rigorous.

Rejected («ineligible») quasi-synonyms may be listed in the Notes section, together with remarks on the reasons for their ineligibility, but this is not obligatory if these reasons are fairly simple and incontestable.

b) Pseudo-synonyms: Only one pseudo-synonym should be entered in the main field, but all the other ones should necessarily be listed in the Notes section together with a clear description of the reasons why one was given precedence over the others. Such reasons may include, in descending order of importance:

— statistical frequency: if the source lists two or more lexemes without specifying the difference in meaning, usually the more «basic» one is the lexeme that crops up more frequently in accompanying texts, syntactic examples (in grammars), etc.;

— external support: if the words are given in vocabulary lists, with no textual / syntactic contexts to help understand their usage, the regular technical procedure is to choose the word that finds lexicostatistical (same form, same meaning) or, at least, etymological (same form, different meaning) parallels in closely related languages;

— lack of significance: if the difference between the several pseudo-synonyms cannot be told and there is no etymological information attached to any of them, this means that choosing one over the other will affect neither the proto-language reconstructions nor the calculations, and so it does not actually matter which one is listed as «primary» and which ones are listed as «secondary» items in the Notes section.

c) Transit-synonymity: In those relatively rare cases where it can really be established, transit-synonyms are the only synonyms that should really be listed as true synonyms (besides certain allowed cases of suppletive stems, see below); a new record is inserted in the Starling database file, where the second synonym is entered into the corresponding field and assigned another number.

The Notes section should, in all such cases, clearly state which synonym is considered to be the old one and which one is its more recent «ongoing» replacement.

Suppletivism. Most cases of paradigmatic suppletivism should be resolved in favour of one stem only (although the alternate stems may and, in fact, should be mentioned in the Notes section); the list of preferred forms (e. g. singular subject / object verbal stems rather than plural ones, etc.) may be found in the abovementioned paper. The following several cases of suppletivism are, however, so pervasive in languages around the world that treating them as synonyms should be acceptable for GLD standards:

— direct / indirect stems for personal pronouns, such as 'I' / 'me';

— exclusive / inclusive stems for the 1st person pl. pronoun 'we';

— perfective / imperfective markers of negation ('not'), complementarily distributed across the verbal paradigm (note, however, that the prohibitive negation marker is completely ineligible for the wordlist, representing a significantly different meaning).

Compound stems. The basic rules for treating compound stems are formulated in «The Swadesh wordlist. An attempt at semantic specification». Considerations about singling out the «primary» morpheme in a compound stem should be laid out in the Notes section.

III.2. Choosing the protoform.

Filling in the «Protolanguage» field is a responsible procedure that, even in cases when a solid etymological dictionary for the group / family in question is available, should not be reduced to merely copying the corresponding reconstruction from the dictionary. The procedure of choosing the most appropriate proto-stem is generally outlined in G. Starostin's article «Preliminary lexicostatistics as a basis for language classification», included on the site. The main guidelines, in condensed form, are as follows:

For languages l₁, l₂, l₃... l_n that constitute language group L (all of them descended from proto-language *L), the Swadesh item protoform *I, represented in said languages by forms i₁, i₂, i₃... i_n, is established in the following way:

(1) If, etymologically, i₁ = i₂ = i₃ = i_n, the protoform *I is obviously the same root as all of its descendants, and is assigned the exact same number;

(2) If i₁ = i_x, i. e. two (or more) languages share the same root, whereas all other languages have individual, non-corresponding entries, AND languages l₁ and l_x do not form a single node on the lexicostatistical tree, the shared root is likewise taken as the protoform *I, i. e. as the basic equivalent of the corresponding Swadesh item in proto-language *L.

Situations (1) and (2) are defined as non-competitive, i. e., cases where there is a clear distributional bias in favour of only one «candidate» for the proto-list. The remaining situations, more complicated in terms of possible solutions, are defined as competitive, when two or more of the attested roots may qualify for proto-list status with comparable chances. They include cases when:

(3) i₁ ≠ i₂ ≠ i₃ ≠ i_n, i. e. all languages in group L have different roots for the same notion;

(4) i₁ = i_x, i. e. two (or more) languages share the same root, whereas all other languages have individual, non-corresponding entries, BUT languages l₁ and l_x form a single node on the lexicostatistical tree (meaning that this particular item may have been innovated, as a lexical replacement, in Proto-l₁-l_x, not going back to *L);

(5) i₁ = i_x & i₂ = i_y, with languages l₁ and l_x forming one node on the tree and languages l₂ and l_y forming another one. (The number of node-forming languages may certainly exceed two). The existence of such pairs / triplets / quadruplets etc. means that, without additional evidence, each of them is equally qualified for proto-list status;

(6) i₁ = i_x & i₂ = i_y, with languages l₁ and l_y forming one node on the tree and languages l₂ and l_x forming another one. This is the trickiest possible situation, a case of so-called «semantic criss-crossing», and is generally explained in two ways: (a) synonymity in the proto-language or (b) independent semantic innovation in two (or more) branches of the group. The refined lexicostatistical procedure requires that, if an instance of (b) is identified, the respective culprits be scored differently from each other, despite their common etymological origin, since the coinciding semantics in such cases does not imply a common ancestral Swadesh item, but rather two independently occurring processes of identical semantic change.
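Situations (1) to (6) can be approximated by a single distributional test. The sketch below rests on a simplifying assumption that is not stated in the text: a root counts as a candidate for the proto-list only if it is attested in at least two primary branches of the tree, and an item is non-competitive only if exactly one root passes that test. The function name and the data layout (a mapping from language to cognation index, plus a list of primary branches) are hypothetical.

```python
from collections import defaultdict


def classify_item(cognate_numbers, primary_branches):
    """Classify one Swadesh item as non-competitive or competitive.

    cognate_numbers: language -> cognation index (non-positive = skipped).
    primary_branches: list of sets of languages, one set per top-level node.
    """
    classes = defaultdict(set)
    for language, number in cognate_numbers.items():
        if number > 0:
            classes[number].add(language)

    # Roots attested in two or more primary branches.
    spanning = [
        number
        for number, languages in classes.items()
        if sum(1 for branch in primary_branches if languages & branch) >= 2
    ]
    if len(spanning) == 1:
        return "non-competitive", spanning[0]
    return "competitive", None


branches = [{"A", "B"}, {"C", "D"}]
case_2 = classify_item({"A": 1, "B": 2, "C": 1, "D": 3}, branches)
case_4 = classify_item({"A": 1, "B": 1, "C": 2, "D": 3}, branches)
```

Under this simplification, a root shared by A and C (different branches) is selected, whereas a root shared only by A and B (one node) leaves the item competitive, to be resolved by further criteria.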

Competitive variants *I₁, *I₂... *I_n are weighted against each other in the following way:

[A] Comparison of etymological parallels. If the form *I₁ is always attested in the Swadesh meaning, whereas the form *I₂ is in some languages attested in the Swadesh meaning and in others is attested in a different meaning that is typologically known to shift to the Swadesh meaning (e. g., *I₁ always = 'eat', whereas *I₂ = 'eat ~ chew'), *I₁ must be given precedence.

[B] Internal analysis. If the form *I₁ is morphologically simple, whereas *I₂ is easily analyzable as a stem derived from a root with a different meaning (e. g. 'star' vs. 'sky-eye'), *I₁ must be given precedence (NOTE: only if *I₁ and *I₂ are truly competitive variants. If nine languages out of ten have 'sky-eye' in the meaning 'star', and the tenth language does not constitute a separate branch that split off before all the others, there is no competition).

[C] External parallels. If, upon the application of parameters [A] and [B], *I₁ and *I₂ are still weighted as equals, it is permissible to fill the proto-list with the form that constitutes a better match for its closest external relative(s) on the higher level of subgrouping.

IMPORTANT NOTE: [C] is a very tricky parameter, inappropriate use of which may lead to a vicious circle in establishing relationships. Making a decision based on external parallels is only permissible if (a) the relationship is, in a completely non-controversial way, already established by non-lexicostatistical means; or (b) the higher-level relationship has already been suggested by lexicostatistical means without relying on such ambiguous cases.

If none of the listed criteria is applicable, or if two or more criteria contradict one another and cancel each other out, the proto-form slot should, in theory, remain empty. Nevertheless, for technical reasons it is advisable to fill in the position anyway, even if the competition is on a 50-50 basis, so that the corresponding slot is not left with a negative number (which would skew the calculation results on higher levels). In such cases, any root from the competition pool may be selected at random, and the field should be marked with a # sign, with the other «candidates» properly listed in the Notes section.
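The weighting by criteria [A] to [C], together with the random fallback, might be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` fields, the function `choose_protoform`, and the idea of giving one point per satisfied criterion (so that a contradiction between [A] and [B] produces a tie) are hypothetical simplifications, not part of the GLD software. Criterion [C] is consulted only when the caller explicitly declares it permissible.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    root: str
    stable_meaning: bool          # [A] always attested in the Swadesh meaning
    morphologically_simple: bool  # [B] not a transparent derivative
    external_match: bool = False  # [C] matches the closest external relative


def choose_protoform(candidates, external_allowed=False, rng=None):
    """Return (root, marker, notes) for one proto-list slot."""
    rng = rng or random.Random()
    # [A] and [B]: one point each (assumed scheme); contradictions tie.
    scores = {c.root: int(c.stable_meaning) + int(c.morphologically_simple)
              for c in candidates}
    best = max(scores.values())
    pool = [c for c in candidates if scores[c.root] == best]
    # [C]: only if the higher-level relationship is independently established.
    if len(pool) > 1 and external_allowed:
        matched = [c for c in pool if c.external_match]
        if matched:
            pool = matched
    if len(pool) == 1:
        return pool[0].root, "", []
    # Unresolved competition: random pick, '#' mark, rivals go to the Notes.
    chosen = rng.choice(pool)
    notes = [c.root for c in pool if c is not chosen]
    return chosen.root, "#", notes


clear = choose_protoform([Candidate("*I1", True, True),
                          Candidate("*I2", False, True)])
tied = choose_protoform([Candidate("*I3", True, False),
                         Candidate("*I4", False, True)],
                        rng=random.Random(0))
```

Here `clear` resolves to the first candidate without a mark, whereas `tied` carries the # mark and lists the rejected rival in its notes.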

III.3. Dealing with borrowings

Items on the list that have been identified as borrowings are marked with negative numbers, which excludes them from lexicostatistical calculations: these measure only internally triggered lexical change, not externally triggered change.
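The effect of negative marking on the calculations can be shown with a minimal sketch. It assumes, for illustration only, that a wordlist is a mapping from Swadesh item to a single cognation index and that synonyms are ignored; the function name `shared_share` is hypothetical.

```python
def shared_share(list_a, list_b):
    """Share of matching items between two wordlists, borrowings excluded.

    Each wordlist maps a Swadesh item to its cognation index (assumed
    layout). An item is skipped if it is missing from either list or
    carries a negative index (a borrowing) in either list.
    """
    compared = 0
    matches = 0
    for item, index_a in list_a.items():
        index_b = list_b.get(item)
        if index_b is None or index_a < 0 or index_b < 0:
            continue
        compared += 1
        if index_a == index_b:
            matches += 1
    return matches / compared if compared else None


lang_a = {"eat": 1, "star": 2, "sky": -1, "eye": 1}
lang_b = {"eat": 1, "star": 3, "sky": -1, "eye": 1}
share = shared_share(lang_a, lang_b)  # 'sky' is left out: 2 matches of 3
```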

In many cases, particularly those of relatively shallow families with well-studied histories, identifying borrowings is easy. In other cases, the procedure is more difficult, involving contextual analysis, such as the establishment of a «borrowing triangle»: if the number of look-alikes between A and B is significantly higher than between A and C, but significantly lower than between B and C, this suggests that similarities between A and B may be due to contact (i. e. A and B are either unrelated or related on a much higher level than B and C).
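The «borrowing triangle» test can be expressed as a simple comparison of look-alike counts. The sketch below is hedged: the text does not define what counts as «significantly» higher or lower, so the `margin` parameter and its default value are purely hypothetical, as is the function name.

```python
def suggests_contact(ab, ac, bc, margin=10):
    """Check the «borrowing triangle» pattern for languages A, B and C.

    ab, ac, bc are counts of look-alikes between the respective pairs.
    'margin' is an assumed stand-in for a «significant» difference.
    Returns True if A-B similarities exceed A-C by the margin while
    falling short of B-C by the margin, hinting at contact between A and B.
    """
    return (ab - ac >= margin) and (bc - ab >= margin)


contact_likely = suggests_contact(ab=30, ac=10, bc=60)
no_pattern = suggests_contact(ab=30, ac=28, bc=60)
```

A positive result is only a hint that calls for closer contextual analysis, not a verdict on any individual word.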

In a large number of cases it remains completely unclear if a certain word, identified as a lexical replacement, is of the «original» stock or represents a borrowing from an unknown source. Such situations should be dealt with according to the following steps:

(1) If the word has a transparent, non-controversial etymology within the analyzed language group, it is obviously not a borrowing and is scored positive;

(2) If the word has no etymology whatsoever, and its form is highly unusual for a «native» word (e. g., it contains one or more phonemes that are atypical for that language), it may tentatively be scored as a borrowing;

(3) In all other cases, it is better to mark the form as an internal replacement rather than a borrowing. This is because, strictly speaking, it is not always wise to omit all borrowings from the calculations: some words may enter the language in a different meaning and acquire the Swadesh meaning only later in their history. Situations in which Swadesh items are borrowed directly in their Swadesh meanings normally arise in cases of «massive lexical bombardment», i. e. they involve multiple near-simultaneous borrowings from a single source, and such cases are usually easy to identify if data from surrounding languages are available.

If, on the other hand, we deal with an individual potential borrowing whose source is unidentifiable, there are at least three possible scenarios: (a) it entered the language immediately in the Swadesh meaning; (b) it is not a borrowing at all; (c) it was originally borrowed in a non-Swadesh meaning and only later switched to the Swadesh meaning. Since, out of these three, only situation (a) unconditionally requires marking the form with a negative cognation index, it is relatively safe to count the respective item as a non-borrowing by default. Nevertheless, such cases may be marked in the Notes section with the comment «Possibly borrowed from an unknown source».
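Steps (1) to (3), including the default treatment of potential borrowings from an unidentifiable source, can be summarized in a short decision function. This is a sketch under assumptions: the three boolean inputs and the function name are hypothetical, and they stand for judgments that in practice must be made by the compiler of the list.

```python
UNKNOWN_SOURCE_NOTE = "Possibly borrowed from an unknown source"


def score_unclear_replacement(has_group_etymology, has_any_etymology,
                              atypical_form):
    """Return (sign, note) for a replacement of unclear origin.

    sign is +1 if the item counts in the calculations and -1 if it is
    scored as a borrowing (negative cognation index).
    """
    # Step (1): transparent etymology inside the group, so not a borrowing.
    if has_group_etymology:
        return 1, ""
    # Step (2): no etymology at all plus an atypical form.
    if not has_any_etymology and atypical_form:
        return -1, "Tentatively scored as a borrowing"
    # Step (3): default to internal replacement, with an optional note.
    return 1, UNKNOWN_SOURCE_NOTE


native = score_unclear_replacement(True, True, False)
suspect = score_unclear_replacement(False, False, True)
default = score_unclear_replacement(False, False, False)
```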