The Global Lexicostatistical Database: Specifics

GLD SPECIFICS

Several websites currently host Swadesh and «quasi-Swadesh» wordlists in one form or another. Some of the most prominent examples are the Wiktionary Swadesh word list collection, the Rosetta Project, the Austronesian Basic Vocabulary Database, Isidore Dyen's Comparative Indo-European Database (now revised and updated with extra data and features as The Indo-European Lexical Cognacy Database), and The Automated Similarity Judgement Program.

The GLD strives to take into account both the obvious advantages and the observed flaws of these resources, as well as the experience gained from more than two decades of computerized lexicostatistical studies by various members of the Moscow school of comparative linguistics. The aim is a new, updated standard that increases the reliability and transparency of the data and, at the same time, allows researchers to try out new approaches and ideas concerning their manual and automatic analysis.

The principal specific features of the GLD, which, put together, set it apart from most other similar ventures, are as follows:

1. All of the data are compiled or, at least, thoroughly fact-checked by professional researchers with a solid background in general comparative-historical linguistics and, as a minimum requirement, a working knowledge of the material.

2. All of the data are accompanied by annotations that, at a minimum, contain direct source references right down to the page number, so that any single entry may be easily verified by anyone with access to the respective sources.

3. All of the data are recoded from the original sources into a single unified transcription system (UTS), based on the IPA with slight modifications (details may be found here). For some languages with established written traditions, the original orthographies are included alongside the recodings. This makes it easy for users to compare data from languages they are unfamiliar with, and also facilitates automatic analysis.

4. All of the data, except where the languages have not been studied in sufficient detail, are morphologically segmented, in order to facilitate manual and automatic analysis procedures and reduce the scope for errors of etymological judgement.

5. Specifically for the needs of the GLD, an updated and explicated list of Swadesh meanings has been introduced (details may be found here), facilitating a correct and uniform selection of the appropriate synonym for languages with sufficient data coverage.

6. All of the data are presented in at least three formats: (a) an online database, available for browsing or querying (including the possibility to search through multiple databases at once); (b) a print-ready, non-editable PDF version; (c) an editable Microsoft Excel table with all the data at the user's disposal (to display the Excel files correctly, it is necessary to download and install Starling Serif, the default Unicode font for the GLD).

7. As new data are added, the existing collections will gradually be integrated into a hierarchical structure capable of functioning as a unified basis for genetic classification. The GLD intends to go far beyond mere collection, recoding, and annotation of raw data, incorporating powerful historical tools for analysis of said data as well.
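
Features 2 to 4 above (page-level source references, a unified transcription, and morphological segmentation) suggest a simple record layout. The following Python sketch is illustrative only: the `Entry` class, its field names, the use of `-` as a morpheme boundary marker, the `cognate_class` field, and the sample forms and sources are all assumptions invented for this example, not the GLD's actual schema or data. It shows how entries annotated in this way could feed a basic lexicostatistical comparison, namely the share of matching items between two wordlists.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """One wordlist entry (hypothetical layout, not the GLD schema)."""
    language: str
    meaning: str        # Swadesh meaning, e.g. "water"
    form: str           # unified transcription; "-" marks morpheme boundaries (assumed)
    source: str         # bibliographic reference for the form
    page: int           # page number within that source
    cognate_class: int  # cognacy judgement id, comparable only within one meaning

    def morphemes(self) -> List[str]:
        """Split the segmented form into its morphemes."""
        return self.form.split("-")


def shared_cognate_ratio(a: List[Entry], b: List[Entry]) -> Optional[float]:
    """Share of meanings attested in both lists whose cognate classes match.

    Returns None when the two lists have no meaning in common.
    """
    classes_b = {e.meaning: e.cognate_class for e in b}
    common = [e for e in a if e.meaning in classes_b]
    if not common:
        return None
    matches = sum(1 for e in common if classes_b[e.meaning] == e.cognate_class)
    return matches / len(common)


# Invented placeholder data: language names, forms, and sources are made up.
list_a = [
    Entry("Lang A", "water", "ta-ka", "Source A", 12, 1),
    Entry("Lang A", "fire", "mi", "Source A", 40, 1),
    Entry("Lang A", "stone", "ku-ru", "Source A", 77, 1),
]
list_b = [
    Entry("Lang B", "water", "da-ga", "Source B", 5, 1),
    Entry("Lang B", "fire", "so", "Source B", 9, 2),
    Entry("Lang B", "stone", "gu-ru", "Source B", 31, 1),
    Entry("Lang B", "eye", "ni", "Source B", 33, 1),
]

ratio = shared_cognate_ratio(list_a, list_b)
print(list_a[0].morphemes(), ratio)
```

Keeping the source and page on every entry means any single cognacy judgement can be traced back to its reference, and storing the segmented form lets a comparison be restricted to roots rather than whole word forms if desired.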
