The Global Lexicostatistical Database: General Description
I. Introduction
II. Basic structure
II.1. Composition of a Level 1 Database
II.2. Composition of a Level 2 Database
III.1. Choosing synonyms
III.2. Choosing the protoform
III.3. Dealing with borrowings
I. Introduction
The Global Lexicostatistical Database (GLD) is
an ongoing linguistic project, initiated by the
The direct aim of the GLD project is to
compile, formalize, and provide public access to standardized basic lexicon
wordlists of as many world languages and their dialects as possible, ranging
from well-known and expertly studied to poorly documented and analyzed, and
also including reconstructed «proto-lists» for numerous common ancestors of
modern day languages.
Since, in accordance with the methodological
foundations of comparative-historical linguistics (at least, the particular
branch of it represented by the Moscow school), it is first and foremost the basic
lexicon that serves as the key to demonstrating genetic relationship, the
database will be of the greatest use to those interested in
historical-comparative linguistics and issues of genetic classification of the
world's languages. However, formally structured lists of comparative lexical
data may also be important for specialists in phonetic and semantic typology,
sociolinguistics, basic philosophy of language, and various other fields of
linguistic and philological research too numerous to mention.
II. Basic structure
Formally speaking, the GLD does not constitute
one single database. Rather, one must think of it as a hierarchical system of
wordlists, organized from bottom to top. Such a structuring not only makes it
easier to work with overwhelmingly huge quantities of data, but is also in
strict accordance with the basic conception of the language family tree,
in which multiple common ancestors each give rise to a number of descendant
languages, which can, in their turn, be traced back to their common invariant
by means of historical linguistics.
Level 1 consists of a series of relatively small databases, each of
which contains wordlists for a closely related, non-controversial group of
languages whose approximately estimated age does not exceed 3,000 years (see
the Glottochronology section below for explication) and is rounded off
with a reconstruction of the proto-wordlist for their common ancestor.
Typical examples of such databases are: Germanic, Turkic, Polynesian, North
Khoisan etc. We reserve the general name group for such taxonomic
entities.
Level 2 is occupied by databases that contain only proto-wordlists which are
known to belong or, at least, suspected of belonging to related
proto-languages. The status of such proto-languages among the linguistic community
is generally non-controversial, and their approximate date of disintegration
does not generally exceed 6,000 years. The databases are, once again,
accompanied by reconstructions of proto-wordlists for the common ancestral language.
Typical examples include: Indo-European, Uralic, Austronesian, North
Caucasian etc. For these entities, the name family is reserved.
Level 3 consists of databases that compare proto-wordlists for several
families, under the assumption that a very deep genetic relation may exist
between these families. Since such ultra-deep genetic connections are
frequently placed under serious doubt (particularly among specialists convinced
that neither the comparative method nor any of its substitutes may yield true
positives at a time depth that exceeds 6,000 ~ 8,000 years), the creation and
analysis of hypothetical proto-wordlists for such deep taxa is a necessary
prerequisite to confirming their historic reality. Typical examples include:
Nostratic, Sino-Caucasian, Afro-Asiatic, Niger-Congo, etc. For these entities,
we reserve the name macrofamily.
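The age thresholds quoted above can be summarized in a small lookup. This is only a sketch: the function name taxon_level is hypothetical, and it looks at estimated age alone, whereas the text also takes into account how controversial the grouping is.

```python
def taxon_level(age_years: int) -> tuple[int, str]:
    """Map an estimated age of disintegration to a GLD level and taxon name.

    Thresholds follow the text: up to 3,000 years for a group,
    up to 6,000 years for a family, anything deeper for a macrofamily.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years <= 3000:
        return 1, "group"
    if age_years <= 6000:
        return 2, "family"
    return 3, "macrofamily"


for age in (2000, 5500, 12000):
    print(age, taxon_level(age))
```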
II.1. Composition of a Level 1 Database.
This section describes and comments upon the
general structure and acceptable standards for a typical Level 1 Database —
100-wordlists for relatively «young» language groups, accompanied by a
reconstructed proto-wordlist.
The list of fields in each such database is as
follows (fields that may not necessarily be present are entered in square
brackets): Word, Ln1, Ln1Num, [Ln1EtNum],
Ln1Notes... Lnn, LnnNum,
[LnnEtNum], LnnNotes, [PLn], [PLnNum],
[PLnNotes]. Each field has a name, determined by the regular standards
of the Starling software (no non-standard symbols), and an alias deciphering
the name, as in the example below:
Field type | Field name | Field alias | Example (from database: North Khoisan)
Word | WORD | Word | cold
Ln1 | JUH | Zhuǀ'hoan | ǂàʔú
Ln1Num | JUHNUM | Zhuǀ'hoan number | 1
Ln1EtNum | JUHETNUM | Zhuǀ'hoan etymology | 261
Ln1Notes | JUH_NOTES | Zhuǀ'hoan notes | Dickens 1994: 300.
Ln2 | AUE | ǁKxauǁen | ǂxiː
Ln2Num | AUENUM | ǁKxauǁen number | 2
Ln2EtNum | AUEETNUM | ǁKxauǁen etymology | 412
Ln2Notes | AUE_NOTES | ǁKxauǁen notes | Bleek 1929: 29; Bleek 1956: 680. Alternately transcribed as ǂxẽ in [Bleek 1956: 679]. A possible synonym is ǀau 'to be cold, bare' [Bleek 1956: 303]; however, in the English-ǁKxauǁen vocabulary of [Bleek 1929] only the first root is adduced.
Ln3 | EKK | Ekoka !Xung | ǃǃàò ~ ǃǃàʔō
Ln3Num | EKKNUM | Ekoka !Xung number | 1
Ln3EtNum | EKKETNUM | Ekoka !Xung etymology | 261
Ln3Notes | EKK_NOTES | Ekoka !Xung notes | König & Heine 2008: 89. Quoted as ǂàʔō in [Heikkinen 1986: 23]. Polysemy: 'cold / cool / good, well'.
PLn | NKH | Proto North Khoisan | *ǂàʔū
PLnNum | NKHNUM | Protolanguage number | 1
PLnNotes | NKH_NOTES | Protolanguage notes | Distribution: Preserved (mostly) in the Northern and Central clusters. Replacements: Southern cluster: *ǂxãĩ, possibly reflecting a rare semantic development {'to tremble' > 'to be cold'}. Reconstruction shape: Correspondences are regular and trivial.
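The naming pattern visible in the example (JUH, JUHNUM, JUHETNUM, JUH_NOTES) can be generated mechanically from a three-letter code. The sketch below is an illustration only: the helper names language_fields and database_fields are hypothetical and are not part of the Starling software.

```python
def language_fields(code: str, with_etymology: bool = True) -> list[str]:
    """Return the field names for one language, following the example table."""
    code = code.upper()
    fields = [code, f"{code}NUM"]
    if with_etymology:
        # The etymology number is optional (square brackets in the field list).
        fields.append(f"{code}ETNUM")
    fields.append(f"{code}_NOTES")
    return fields


def database_fields(language_codes, proto_code=None):
    """Return the full field list of a Level 1 database."""
    fields = ["WORD"]
    for code in language_codes:
        fields.extend(language_fields(code))
    if proto_code is not None:
        # The protolanguage has no etymology number field in the example.
        fields.extend(language_fields(proto_code, with_etymology=False))
    return fields


print(database_fields(["JUH", "AUE", "EKK"], proto_code="NKH"))
```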
Notes on particular fields:
A. Word.
Standardized listing of the Swadesh elements, completely identical for all
Level 1 databases. The listing includes all of the elements on Swadesh's
original 100-wordlist, plus 10 additional items off the original 200-wordlist
reserved for special correctional purposes. If the word is followed by a partial
synonym in parentheses, this indicates a correction / specification of the original
Swadesh meaning: e. g. 'claw (nail)', 'walk (go)', 'warm (hot)' mean that the basic
meaning of each entry is aligned with the English words
'(finger)nail', 'to go', 'hot' rather than with the English words 'claw', 'walk',
'warm'.
Since a single English lexical equivalent frequently
corresponds to several synonyms in any other given
language, the meanings of all Swadesh elements have been expressed and explicated
in a relatively precise, restrictive manner, and the wordlists — wherever
possible — should be constructed according to these uniform specifications.
General information on the principles of synonym elimination can be accessed
by clicking on the link to the explanatory paper by A. Kassian, G. Starostin,
A. Dybo, and V. Chernov at the top of the page («The Swadesh wordlist. An
attempt at semantic specification»).
For personal pronouns, synonymity between
at least two stems is assumed to be very common in languages across the world.
Such cases require the creation of two different records, with the Word field marked as 'I1', 'I2',
'thou1', 'thou2', 'we1', 'we2'
respectively. In the rare cases where other words still show synonymous
variants (see notes on «transit synonymy» below), the record is also doubled,
but the item in the Word field
remains without a numeric index. (Indexed and non-indexed entries receive
different treatment in the Web interface.)
B. Data: Ln1,
Ln2... Lnn. These fields
contain the actual data from all the languages in the group for which it has
been possible to collect and annotate it. The input is regulated as follows:
a) Language names: each field is provided with
a three-letter name (invisible on the Web interface, but determining the
database structure) and a full «alias» name for convenience. The same double
nomination is also valid for each individual database as a whole. The unique
formal identification of each wordlist is a six-letter entry, e. g.: NKH_JUH
(«alias» North Khoisan → Zhu|'hoan) = the Swadesh wordlist for the
Zhu|'hoan language of the North Khoisan group, etc.
This
approach, which allows for different languages in different families to be
encoded with the same three-letter abbreviation, is different from the
principle assumed in the Ethnologue, where each language is issued its own
unique three-letter code. One major reason for this, beyond mere convenience (the
Ethnologue system frequently results in the three-letter code bearing little or
no mnemonic resemblance to the designated language in question), is that
wordlists are frequently provided and catalogued from multiple different
dialects of the same language, each of which has to have its unique
identification (for Ethnologue, which does not list actual language data, this
problem does not exist), and thus the maximum number of combinations allowed
by the 26 letters of the Latin alphabet (26³ = 17576) may simply turn
out to be insufficient in the long run.
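The identification scheme and the arithmetic behind it can be checked with a few lines. This is a minimal sketch; the function name wordlist_id and the validation rule (exactly three ASCII letters per code) are assumptions made for illustration.

```python
import string


def wordlist_id(group_code: str, language_code: str) -> str:
    """Build an identifier such as NKH_JUH from two three-letter codes."""
    for code in (group_code, language_code):
        if len(code) != 3 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"expected a three-letter Latin code, got {code!r}")
    return f"{group_code.upper()}_{language_code.upper()}"


# Number of distinct three-letter codes over a 26-letter alphabet.
CODES_PER_NAMESPACE = len(string.ascii_uppercase) ** 3

print(wordlist_id("nkh", "juh"))
print(CODES_PER_NAMESPACE)
```

Because language codes only need to be unique within their group, the number of possible group and language pairs is the square of that figure.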
b) Transcription: all of the entries are
transliterated into the exact same unified transcriptional system that, for the
most part, follows the IPA, but also includes convenience-oriented modifications
agreed upon by the principal contributors to the Database. The system can be
checked at any time by clicking the «Transcription system» link at the top of
the page.
The
decision to unify transcriptional systems is justified by several factors, such
as reader convenience (most users will probably only be familiar with
idiosyncratic transcriptional conventions for a small part of the data) and
easier application of automatic algorithms for data analysis. In many cases,
however, it makes sense to also provide the «specific» transcription for a
given language, especially in cases where there exists a long-term
orthographic tradition (e. g. English); in a few, it also makes sense to
provide data on the original non-Latin-based graphic system (e. g. Chinese characters).
In all such cases, the GLD transliteration is followed by the traditional spelling,
enclosed in curly brackets: e. g. British English šɔ:t {short},
Beijing Chinese śīŋ {星} etc.
c) Grammatical aspects of form presentation:
for morphologically complex languages, nouns, adjectives, verbs etc., if
possible, are presented in the same form (e. g. nominative singular; active infinitive;
1st person singular present tense, etc.) that may be indicated in the general
description section of the wordlist. If, for some reason, this is not possible
(e. g. the forms are extracted from texts rather than dictionaries), the
grammatical characteristics of the attested form should be indicated in the Language Notes section.
If the word is segmentable on the synchronic
level or based on easily explainable historical considerations, prefixes are
separated from the root with a = sign, suffixes with a hyphen (e. g. Russian u=mer-ˈety
'to die'). If only the bare stem is given (this is not generally
recommended, but sometimes the data leave no choice), it is always followed
by a hyphen.
Special
note on the marking of compound forms that consist of two or more lexical
roots: in all such cases, the lexicostatistical treatment in the GLD calls
for the demarcation of the morpheme bearing the principal lexical meaning (this
is usually performed on the basis of internal and external comparison). All
the other root morphemes are to be treated and designated in the same way as
grammatical prefixes and suffixes. E. g.: Beijing Chinese yǜe-lyàŋ {月亮} 'moon' (the second lexical
root, lyàŋ 'light', is treated in a suffixal
manner).
Infixes,
if identifiable, are placed in square brackets: Thao k[m]an 'to eat'
etc.
Deviating
forms that display important vowel or consonant alternations or even suppletion
are generally relegated to the comments section, but in a few cases it may be
deemed necessary to put several morphological variants in the same spot. In
such cases, they are separated with a slash (e. g. ǁAni cá /
há 'thou') and the difference itself is explained in the
comments (in this case — masculine vs. feminine form).
d) Additional notation: interchangeable
variants of the same word attested in the same source without an explicit
explanation as to the reasons (e. g. a mix of several dialectal forms) are
listed one after the other, separated with a tilde (~), preferably in order of
frequency of usage, if any such thing can be established from the source.
Use of regular parentheses is also allowed to
denote «facultative» elements that may or may not be encountered in the
informant's speech, for phonetical or morphological reasons.
Questionable items — e. g. ones for which there
is good reason to surmise an error on the transcriber's part, or ones
presented with a slightly different meaning from the required one, but for
which there is also a good reason to surmise that they may express the required
meanings as well, etc. — are followed by the # sign. (All such cases are
explicated in the notes section).
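The notational conventions above are regular enough to be read by a program. The sketch below is only an illustration under stated assumptions: the names Entry, parse_entry and segment are hypothetical, the # sign is accepted anywhere in the field, and only the tilde, curly-bracket, = and hyphen conventions are handled (slashes, parentheses and infix brackets are left untouched).

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Entry:
    variants: list[str]        # interchangeable variants, separated by "~"
    traditional: str | None    # traditional spelling given in curly brackets
    questionable: bool         # True if the entry carries the "#" sign


def parse_entry(raw: str) -> Entry:
    """Split a data field into variants, traditional spelling and "#" flag."""
    questionable = "#" in raw
    text = raw.replace("#", " ").strip()
    traditional = None
    match = re.search(r"\{([^}]*)\}\s*$", text)
    if match:
        traditional = match.group(1)
        text = text[:match.start()].strip()
    variants = [v.strip() for v in text.split("~") if v.strip()]
    return Entry(variants, traditional, questionable)


def segment(form: str) -> tuple[list[str], str, list[str]]:
    """Split one form into prefixes (before "="), stem, and suffixes (after "-")."""
    *prefixes, rest = form.split("=")
    stem, *suffixes = rest.split("-")
    # A bare stem ends in a hyphen, which yields an empty suffix; drop it.
    return prefixes, stem, [s for s in suffixes if s]


print(parse_entry("šɔ:t {short}"))
print(segment("u=mer-ˈety"))
```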
C. Numeration: [Ln1Num].
In this field each word is assigned a numeric value ranging from 1 to infinity,
with etymologically cognate words receiving the same number and etymologically
divergent ones receiving different numbers in succession. In the Web interface
of the database, the cognation number is displayed as a small superscript
index (1, 2... n) to the right of the
corresponding word. See the General
Rules For Cognate Scoring section for details.
If the
word is not attested, this is marked with a negative index in the database (any
negative number will do, but the usual index is -1); in the Web
interface, the corresponding slot remains completely empty. Transparent
borrowings on the wordlist (in accordance with Sergei Starostin's revisions of
the lexicostatistical procedure) are technically equated with lack of
attestation and marked with the same negative index (since the slot is still
occupied by the borrowing, this time it actually shows up on the Web
interface as well).
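To show how these indices behave in a pairwise comparison, here is a minimal sketch. The exact calculation used by the GLD is not described in this section, so the function below simply computes the plain share of matching indices among items where both languages have a positive index; the name shared_cognate_share and the sample data are hypothetical.

```python
def shared_cognate_share(nums_a: dict, nums_b: dict):
    """Share of items with identical cognation indices in two wordlists.

    Items with a negative index in either list (unattested words or
    transparent borrowings) are skipped, as the text prescribes.
    Returns None when no item is comparable.
    """
    comparable = [
        word for word in nums_a
        if word in nums_b and nums_a[word] > 0 and nums_b[word] > 0
    ]
    if not comparable:
        return None
    matches = sum(1 for word in comparable if nums_a[word] == nums_b[word])
    return matches / len(comparable)


# Hypothetical index values for two languages of one group.
lang_a = {"cold": 1, "stone": 1, "dog": -1, "eye": 2}
lang_b = {"cold": 1, "stone": 2, "dog": 1, "eye": 2}
print(shared_cognate_share(lang_a, lang_b))  # 'dog' is skipped
```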
D. Etymology: Ln1EtNum.
If any of the entries in a given wordlist are linked to one of the etymological
databases hosted on the Tower of Babel server, they are provided with a second
number that is identical with the number of the corresponding etymology in
the related etymological database. The numbers are not displayed on the Web
interface. Instead, the user can choose one of two options: (a) disregard the
connection to the databases by unchecking the «View entries with hyperlinks»
option at the top of the page, or (b) make use of it by seeing the entries that
lead to Tower of Babel's etymological databases displayed as hyperlinks.
Technically,
two numbers are somewhat superfluous — in the original StarLing database set,
there is only one numeration for both purposes — but, since many of the
prepared wordlists lack any connection to etymological databases, it would
still be necessary to employ two different ways of numeration depending on the
wordlist (from 1 to infinity for those that are not linked to etymological
databases; coinciding with etymological database numbers for those that are).
Besides, the wordlists should also be able to function as completely
autonomous entities.
E. Notes: Ln1Notes.
This field always begins with the indication of the primary source(s) for the
attested form, in the standard format: Author — Year : Page number (e. g.: Doke
1925: 153). The rest of the field is less formalized and may include any
additional information about the corresponding entry that the compiler
thinks necessary. Welcome types of information include the following:
— additional variants of
the same word in complementary data sources (dictionaries / wordlists of
the same language / dialect by different authors), with precise references to
the source as well (a possible basic formula is "Quoted as {alt. variant} in {Source}"). Note: if there are relatively full
wordlists or dictionaries of two closely related dialects of the same
language, it is advisable to have a different wordlist for each;
— quasi- and secondary
synonyms, denoted as such and, preferably, with a brief explanation of why
these particular words are ineligible or less likely to occupy the appropriate
«slot» in the wordlist. This is particularly important if the main source is a
large dictionary that lists several synonyms for each Swadesh item, indicating
their semantic nuances or providing syntactic contexts. Note: if
several synonyms are available with no information as to their practical usage
in the language, the procedure recommends that the main synonym be the one
supported by external data (e. g. the same root is attested in the primary word
for the same meaning in closely related languages);
— morphological
information (especially information that may be useful for historical purposes, such as
various paradigmatic forms with morphophonological alternations, etc.);
— considerations on the
reliability of the entry, especially where the word in question is marked with
# (for instance, the compiler may have certain reasons to doubt the correctness
of the source material, or that of the supplied meaning).
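Since source references in the Notes field follow a fixed Author Year: Page pattern, they can be picked out automatically. The regular expression below is a rough sketch, not a validated parser: the name extract_references is hypothetical, and page ranges or authors with unusual name shapes are not handled.

```python
import re

# One or more author names joined by "&", a four-digit year, a colon, a page.
REFERENCE = re.compile(
    r"([^\W\d_][\w'-]*(?:\s*&\s*[^\W\d_][\w'-]*)*)\s+(\d{4}):\s*(\d+)"
)


def extract_references(notes: str) -> list[tuple[str, int, int]]:
    """Return (author, year, page) for every reference found in a note."""
    found = []
    for author, year, page in REFERENCE.findall(notes):
        # Collapse line breaks or repeated spaces inside the author string.
        found.append((" ".join(author.split()), int(year), int(page)))
    return found


note = "König & Heine 2008: 89. Quoted as X in [Heikkinen 1986: 23]."
print(extract_references(note))
```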
F. Protoform information:
[PLn], [PLnNum], [PLnNotes]. These fields describe the protoform from which at least
one form attested in the language group (usually several cognate forms)
is descended.
The first field [PLn] only contains one
asterisked reconstruction (or two or more phonetically different variants of
the same reconstructed etymon, separated by a tilde ~, if the variants cannot
be securely traced back to a single phonetic invariant). If there are alternate etyma strongly eligible for
the «Proto-Root» position (listed in the Notes section), the main one should be
marked with the # sign («uncertainty»).
Reconstructions may either be taken from an
already published source (provided it is reliable from the point of view of
sound comparative-historical methodology) or produced by the compiler of the
list as preliminary approximations. In the former case, the field [PLnNotes]
should also contain all the necessary references, down to page numbers.
The second field [PLnNum] contains the cognation index which is, naturally, the same
as the number assigned to the attested descendants of the proto-root.
The third field [PLnNotes] provides the necessary information to back up the reconstruction, such as:
— Distribution:
Notes on how well the root is represented across the group (e. g. «preserved
in all / the majority of daughter languages / dialects», etc.). If two or more
«candidates» for the «proto-slot» have been identified, this section should
contain the justification behind the ultimate choice;
— Replacements:
Notes on entries that are seen as innovations in comparison with the proto-etymon:
their presumed forms and meanings in the protolanguage (if reconstructible),
for borrowings — the sources of borrowing (if known);
— Reconstruction
shape: Notes on the phonetic peculiarities of the reconstruction for the
main proto-root (degree of regularity of correspondences; justification of the
approximate shape if the reconstruction is preliminary, not based on a thorough
scrutiny of the correspondences);
— Semantics and
structure: Notes on the semantics of the main proto-root in the
proto-language (e. g. indication of polysemy, if detectable), as well as on the
internal morphological structure if the «root» is actually a complex stem. May
involve elements of internal reconstruction if necessary.
If the field contains relevant notes on
protolanguage polysemy or semantic change from protolanguage meaning to
descendant language meaning, it is advisable to quote them in formulaic
notation, e. g.: {'head' > 'hair of head'} (for semantic change), {'head'
& 'head hair'} (for polysemy). Such formalization will facilitate the
construction of a general database on polysemy and semantic change in the basic
lexicon domain in the future.
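The formulaic notation lends itself to automatic extraction, which is presumably what would make such a database feasible. The sketch below assumes that meanings are always wrapped in plain single quotes and contain no apostrophes themselves; the function name extract_semantics is hypothetical.

```python
import re

# {'old meaning' > 'new meaning'} marks semantic change.
CHANGE = re.compile(r"\{'([^']+)'\s*>\s*'([^']+)'\}")
# {'meaning A' & 'meaning B'} marks polysemy.
POLYSEMY = re.compile(r"\{'([^']+)'\s*&\s*'([^']+)'\}")


def extract_semantics(notes: str) -> dict:
    """Collect all semantic-change and polysemy formulas from a notes field."""
    return {
        "changes": CHANGE.findall(notes),
        "polysemy": POLYSEMY.findall(notes),
    }


notes = ("Southern cluster: possibly reflecting a rare semantic development "
         "{'to tremble' > 'to be cold'}.")
print(extract_semantics(notes))
```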
II.2. Composition of a Level 2 / Level 3 Database.
Since there are no crucial differences between an
attested language and a reconstructed protolanguage, there are no significant
structural differences between a Level 1 and a Level 2 / 3 database either.
The following minor notes should be made (more to follow in the future).
Language
names: The three-letter code
common for all the languages in the group becomes the designation for the
reconstructed proto-language. E. g., NKH will now mean «Proto-North Khoisan»
and function as the name of the corresponding field in a common «Khoisan»
database (complete designation may be NKH_KHO).
Notes: This field may or may not include information on how the reconstruction was arrived at and any of its phonetic, semantic, or distributional problems, since most of this information would merely replicate the information already presented in the Level 1 Database. A bibliographical reference, however, is necessary (provided one exists).
This section describes some of the fundamental
(and at the same time technical) issues that arise during the construction of
the GLD and need to be treated in a systematic and unified manner. It is
intended both for contributors to the database and users whose interest in the
GLD goes beyond superficial curiosity.
As of now, three major problems have been
identified: [a] dealing with synonymous equivalents within the wordlist; [b]
choosing the most likely candidate for a given item on the proto-wordlist; [c]
resolving the issue of borrowed items on the wordlist. The sections below,
without going into a lot of detail, present quasi-algorithmic ways of
eliminating these problems.
III.1. Choosing synonyms
The basic premise employed in the construction
of the GLD is that no two words on the
100-wordlist can function as completely and ubiquitously interchangeable
synonyms within one and the same language (dialect). The rather vague
common notion of «synonymity» is understood therein as representing one of the
following phenomena:
— quasi-synonymity:
two or more words have very similar meanings that are nevertheless somewhat
different in their definitions ('tree' / 'wood'; 'feather' / 'quill'), or may
vary in stylistic usage ('mouth' / 'trap'; 'dog' / 'hound') or syntactic
behavior (German 'wissen' / 'kennen'). The same group also includes suppletive
stems, e. g. 'I' / 'me';
— pseudo-synonymity:
two or more words that are assigned the same meaning in one or more existing
data sources, even though the meanings or peculiarities of usage are actually
different, due to negligence or lack of time on the part of the compiler of the
source;
— transit-synonymity:
two and only two words that really have the same meaning and are generally
interchangeable, except that one is the «old» word and the other is the «new»
one, caught in the process of inevitably replacing the «old» one ('stone' >
'rock'; the most obvious examples are the ones attested in languages with a
long written history, e. g. Chinese, Greek, etc.).
All three types of phenomena are usually identifiable:
— quasi-synonymity is
understood based on careful perusal of available dictionaries and textual
corpora;
— if detailed
dictionaries and text examples are lacking, quasi-synonymity may be defined as
pseudo-synonymity (the difference in meaning is impossible to establish);
— transit-synonymity is
established, based on available historical information or comparative-historical
evidence.
The basic ways of dealing with these three
types are as follows:
a) Quasi-synonyms,
based on as much available evidence as possible, are judged in accordance with
the specific definitions and syntactic contexts provided in the paper mentioned
above (A. Kassian, G. Starostin, A. Dybo, V. Chernov, «The Swadesh wordlist. An
attempt at semantic specification»). In the rare, but possible, case where the
article does not offer a clear solution of the problem, the problematic words
may be treated as transit-synonyms (see below), but the authors should be
informed of the situation so that necessary improvements may be made to make
the standard more rigorous.
Rejected («ineligible») quasi-synonyms may be
listed in the Notes section, together with remarks on the reasons of their
ineligibility, but this is not an obligatory demand if these reasons are fairly
simple and incontestable.
b) Pseudo-synonyms:
Only one pseudo-synonym should be entered in the main field, but all the other
ones should necessarily be listed in the Notes section together with a clear
description of the reasons why one was given precedence over the others. Such
reasons may include, in descending order of importance:
— statistical
frequency: if the source lists two or more lexemes without specifying the
difference in meaning, usually the more «basic» one is the lexeme that crops
up more frequently in accompanying texts, syntactic examples (in grammars),
etc.;
— external support:
if the words are given in vocabulary lists, with no textual / syntactic
contexts to help understand their usage, the regular technical procedure is to
choose the word that finds lexicostatistical (same form, same meaning) or,
at least, etymological (same form, different meaning) parallels in closely
related languages;
— lack of
significance: if the difference between the several pseudo-synonyms cannot
be determined and there is no etymological
information attached to any of them, this means that choosing one over the
other will affect neither the proto-language reconstructions nor the
calculations, and so it does not actually matter which one is listed as
«primary» and which ones are listed as «secondary» items in the Notes section.
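The three criteria, applied in descending order of importance, can be sketched as a simple tie-breaking routine. Everything here is illustrative: the Candidate class, its numeric fields, and the encoding of external support (2 for a lexicostatistical parallel, 1 for an etymological one, 0 for none) are assumptions, not part of the GLD standard.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    form: str
    text_frequency: int = 0     # occurrences in texts and syntactic examples
    external_support: int = 0   # 2 = lexicostatistical, 1 = etymological, 0 = none


def choose_primary(candidates):
    """Pick the pseudo-synonym for the main field and name the deciding criterion."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    if len(candidates) == 1:
        return candidates[0], "only candidate"
    top_freq = max(c.text_frequency for c in candidates)
    tied = [c for c in candidates if c.text_frequency == top_freq]
    if len(tied) == 1:
        return tied[0], "statistical frequency"
    top_support = max(c.external_support for c in tied)
    supported = [c for c in tied if c.external_support == top_support]
    if len(supported) == 1:
        return supported[0], "external support"
    # Nothing distinguishes the remaining candidates: the choice is arbitrary.
    return supported[0], "lack of significance"


pool = [Candidate("form-a"), Candidate("form-b", external_support=2)]
print(choose_primary(pool))
```

Whichever criterion decides, the rejected candidates would still be listed in the Notes section together with the reason for the choice.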
c) Transit-synonymity:
In those relatively rare cases where it can really be established, transit-synonyms
are the only synonyms that should be really listed as true synonyms (besides
certain allowed cases of suppletive stems, see below); a new record is inserted
in the Starling database file where the second synonym is entered into the
corresponding field and is assigned another number.
The Notes section should, in all such cases,
clearly state which synonym is considered to be the old one and which one is
its more recent «ongoing» replacement.
Suppletivism. Most cases of paradigmatic suppletivism
should be resolved in favour of one stem only (although the alternate stems may
and, in fact, should be mentioned in the Notes section); the list of preferred
forms (e. g. singular subject / object verbal stems rather than plural ones,
etc.) may be found in the abovementioned paper. The following several cases of
suppletivism are, however, so pervasive in languages around the world that
treating them as synonyms should be acceptable for GLD standards:
— direct / indirect stems for personal pronouns, such as 'I' / 'me';
— exclusive / inclusive stems for the 1st person pl. pronoun 'we';
— perfective / imperfective markers of negation ('not'), complementarily
distributed across the verbal paradigm (note, however, that the prohibitive negation marker is
completely ineligible for the wordlist, representing a significantly different
meaning).
Compound
stems. The basic rules for
treating compound stems are formulated in «The Swadesh wordlist. An attempt at
semantic specification». Considerations about singling out the «primary»
morpheme in a compound stem should be laid out in the Notes section.
III.2. Choosing the protoform
Filling in the «Protolanguage» field is a
procedure that demands care: even in cases when a solid etymological dictionary
for the group / family in question is available, it should not be reduced to
merely copying the corresponding reconstruction from the dictionary. The procedure
of choosing the most appropriate proto-stem is generally outlined in G.
Starostin's article «Preliminary lexicostatistics as a basis for language
classification», included on the site. The main guidelines, in condensed form,
are as follows:
For languages l1, l2,
l3... ln that constitute language group L (all of them descended from proto-language *L), the Swadesh item protoform *I,
represented in said languages by forms i1,
i2, i3... in,
is established in the following way:
(1) If, etymologically, i1 = i2 = i3 = ... = in,
the protoform *I is obviously the same root as all of
its descendants, and is assigned the exact same number;
(2) If i1 = ix, i. e. two (or more) languages share the same root,
whereas all other languages have individual, non-corresponding entries, AND
languages l1 and lx do not form a single node
on the lexicostatistical tree, the protoform *I is also defined as the basic equivalent of the corresponding
Swadesh item in proto-language *L.
Situations (1) and (2) are defined as non-competitive, i. e., cases where
there is a clear distributional bias in favour of only one «candidate» for the
proto-list. The remaining situations, more complicated in terms of possible
solutions, are defined as competitive,
when two or more of the attested roots may qualify for proto-list status with
comparable chances. They include cases when:
(3) i1 ≠ i2 ≠ i3 ≠ ... ≠ in, i. e. all languages in
group L have different roots for the same notion;
(4) i1 = ix,
i. e. two (or more) languages share the same root, whereas all other languages
have individual, non-corresponding entries, BUT languages l1 and lx
form a single node on the lexicostatistical tree (meaning that this particular
item may have been innovated, as a lexical replacement, in Proto-l1-lx, not going back to *L);
(5) i1 = ix
& i2 = iy, with languages l1 and lx forming one node on the tree and languages l2 and ly forming another one. (The number of node-forming
languages may certainly exceed two). The existence of such pairs / triplets / quadruplets
etc. means that, without additional evidence, each of them is equally qualified
for proto-list status;
(6) i1 = ix
& i2 = iy, with languages l1 and ly forming one node on the tree and languages l2 and lx forming another one. This is the trickiest possible
situation, a case of so-called «semantic criss-crossing», and is generally
explained in two ways: (a) synonymity in the proto-language or (b) independent
semantic innovation in two (or more) branches of the group. The refined lexicostatistical
procedure requires that, if an instance of (b) is identified, the respective
culprits be scored differently from each other, despite their common
etymological origin, since the coinciding semantics in such cases does not
imply a common ancestral Swadesh item, but rather two independently occurring
processes of identical semantic change.
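The distinction between non-competitive and competitive situations can be approximated in code. The sketch below rests on a simplifying assumption that is not in the text: every language is labelled with one primary branch of the lexicostatistical tree, and a root counts as a candidate for the proto-list only if it is attested in at least two different branches. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def proto_candidates(roots: dict, branch_of: dict) -> list:
    """Return the cognation indices attested in two or more primary branches.

    roots maps language -> cognation index; branch_of maps language -> branch.
    Negative indices (missing words, borrowings) are ignored.
    """
    branches_by_root = defaultdict(set)
    for language, root in roots.items():
        if root > 0:
            branches_by_root[root].add(branch_of[language])
    return sorted(r for r, b in branches_by_root.items() if len(b) >= 2)


def classify(roots: dict, branch_of: dict):
    """Label an item as 'non-competitive' (one clear candidate) or 'competitive'."""
    candidates = proto_candidates(roots, branch_of)
    label = "non-competitive" if len(candidates) == 1 else "competitive"
    return label, candidates


branches = {"l1": "A", "l2": "A", "l3": "B", "l4": "C"}
print(classify({"l1": 1, "l2": 2, "l3": 1, "l4": 3}, branches))  # situation (2)
print(classify({"l1": 1, "l2": 1, "l3": 2, "l4": 3}, branches))  # situation (4)
```

Under this approximation, situations (3), (4) and (5) yield no cross-branch candidate, while the criss-crossing situation (6) yields two; all of them come out as competitive.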
Competitive
variants *I1, *I2... *In are weighted against each other in the following
way:
[A] Comparison of
etymological parallels. If the form *I1
is always attested in the Swadesh meaning, whereas the form *I2 is in some languages
attested in the Swadesh meaning and in others is attested in a different
meaning that is typologically known to shift to the Swadesh meaning (e. g., *I1 always = 'eat', whereas *I2 = 'eat ~ chew'), *I1 must be given precedence.
[B] Internal analysis.
If the form *I1 is
morphologically simple, whereas *I2
is easily analyzable as a stem derived from a root with a different meaning
(e. g. 'star' vs. 'sky-eye'), *I1
must be given precedence (NOTE: only if *I1
and *I2 are truly
competitive variants. If nine languages out of ten have 'sky-eye' in the
meaning 'star', and the tenth language does not constitute a separate branch
that split off before all the others, there is no competition).
[C] External parallels.
If, upon the application of parameters [A] and [B], *I1 and *I2
are still weighted as equals, it is permissible to fill the proto-list with the
form that constitutes a better match for its closest external relative(s) on
the higher level of subgrouping.
IMPORTANT NOTE: [C] is a very tricky parameter,
inappropriate use of which may lead to a vicious circle in establishing
relationships. Making a decision based on external parallels is only
permissible if (a) the relationship is, in a completely non-controversial way,
already established by non-lexicostatistical means; or (b) the higher-level
relationship has already been suggested by lexicostatistical means without relying on such ambiguous
cases.
If none of the listed criteria are applicable,
or if two or more criteria contradict one another, crossing out each other's
significance, the proto-form slot should, in theory, remain empty.
Nevertheless, for technical reasons it is advisable to still fill in the
position, even if the competition is on a 50-50 basis, so that the
corresponding slot does not remain with a negative number (skewing the
calculation results on higher levels). In such cases, any root out of the
competition pool may be selected at random, and the field should be marked with
a # sign, with the other «candidates» properly listed in the Notes section.
III.3. Dealing with borrowings
Items on the list that have been identified as
borrowings are marked with negative numbers, which excludes them from
lexicostatistical calculations that measure only internally triggered lexical
change, not externally triggered change.
In many cases, particularly those of relatively
shallow families with well-studied histories, identifying borrowings is easy.
In other cases, the procedure is more difficult, involving contextual
analysis, such as the establishment of a «borrowing triangle»: if the number of
look-alikes between A and B is significantly higher than between A and C, but
significantly lower than between B and C, this suggests that similarities
between A and B may be due to contact (i. e. A and B are either unrelated or
related on a much higher level than B and C).
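The «borrowing triangle» test can be written as a single comparison. The text does not quantify «significantly», so the factor parameter below is a purely hypothetical threshold, as is the function name.

```python
def contact_suspected(ab: int, ac: int, bc: int, factor: float = 1.5) -> bool:
    """Check the borrowing-triangle pattern for look-alike counts.

    ab, ac, bc are the numbers of look-alikes between A and B, A and C,
    and B and C. The pattern holds when A-B clearly exceeds A-C while
    B-C clearly exceeds A-B. 'factor' stands in for 'significantly'.
    """
    return ab > factor * ac and bc > factor * ab


print(contact_suspected(ab=30, ac=10, bc=60))  # pattern present
print(contact_suspected(ab=30, ac=28, bc=60))  # A-B not clearly above A-C
```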
In a large number of cases it remains
completely unclear if a certain word, identified as a lexical replacement, is
of the «original» stock or represents a borrowing from an unknown source. Such
situations should be dealt with according to the following steps:
(1) If the word has a
transparent, non-controversial, etymology within the analyzed language group,
it is obviously not a borrowing and is scored positive;
(2) If the word has no
etymology whatsoever, and its form is highly unusual for a «native» word — e.
g., contains one or more phonemes, atypical for that language — it may
tentatively be scored as a borrowing;
(3) In all other cases,
it is better to mark the form as an
internal replacement rather than a borrowing. This is because, strictly
speaking, it is not entirely wise to always omit all borrowings from the
calculations: some words may penetrate the language in a different meaning and
acquire the Swadesh meaning only later in their history. Borrowing of
Swadesh items directly in their Swadesh meanings normally happens in cases of
«massive lexical bombardment», i. e. it presupposes multiple near-simultaneous
borrowings from a single source, and such cases are usually easy to identify if
data from surrounding languages are available.
If, on the other hand, we deal with an
individual potential borrowing whose source is unidentifiable, there are at
least three possible scenarios: (a) it entered the language immediately in the
Swadesh meaning; (b) it is not a borrowing at all; (c) it was originally
borrowed in a non-Swadesh meaning and only later switched to the Swadesh
meaning. Since, out of these three, only situation (a) unconditionally requires
marking the form with a negative cognation index, it is relatively safe to
count the respective item as a non-borrowing by default. Nevertheless, such
cases may be marked in the Notes section with the comment «Possibly borrowed
from an unknown source».
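The three steps for words of unclear origin amount to a short decision procedure. The sketch below is one possible reading of them; the function name, the boolean inputs, and the returned labels are all hypothetical.

```python
def score_replacement(has_group_etymology: bool,
                      has_any_etymology: bool,
                      atypical_phonology: bool) -> tuple[str, bool]:
    """Decide how to score a lexical replacement of unclear origin.

    Returns (label, negative_index); negative_index is True when the item
    should be excluded from calculations as a borrowing.
    """
    if has_group_etymology:
        # Step (1): transparent etymology inside the group, scored positive.
        return "inherited", False
    if not has_any_etymology and atypical_phonology:
        # Step (2): no etymology and an unusual shape, tentatively a borrowing.
        return "tentative borrowing", True
    # Step (3): default to an internal replacement; the Notes field may add
    # the comment "Possibly borrowed from an unknown source".
    return "internal replacement", False


print(score_replacement(False, False, True))
print(score_replacement(False, False, False))
```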