Global Lexicostatistical Database: Specifics
As of today, multiple websites, in one form or another, host various numbers of
Swadesh and «quasi-Swadesh» wordlists; some of the most prominent examples
include the Wiktionary Swadesh word list collection, the Rosetta Project, the
Austronesian Basic Vocabulary Database, Isidore Dyen's Comparative Indo-European
Database (now also revised and updated with extra data and features as The
Indo-European Lexical Cognacy Database), and The Automated Similarity Judgement
Program.
The GLD strives to
take into account both the obvious advantages and the observed flaws of these
resources, as well as the experience gained from more than two decades of
computerized lexicostatistical studies by various members of the Moscow school
of comparative linguistics, in order to come up with a new, updated standard
that will, at the same time, increase the reliability and transparent character
of the data and allow researchers to try out new approaches and ideas
concerning its manual and automatic analysis.
The principal specific
features of the GLD, which, put together, set it apart from most other similar
ventures, are as follows:
1. All of the data are computerized or, at least, thoroughly fact-checked by
professional researchers with a solid background in general
comparative-historical linguistics and, as a minimum requirement, a working
knowledge of the material.
2. All of the data are accompanied by annotations that, at a minimum, contain
direct source references right down to the page number, so that any single
entry may be easily verified by anyone with access to the respective sources.
3. All of the data are transliterated from the original sources into a single
unified transcription system (UTS), based on the IPA with slight modifications
(details may be found here). For some languages with established written
traditions, the original orthographies are included alongside the recodings.
This makes it easy for users to compare data from languages they are unfamiliar
with, and also facilitates the application of various automatic analysis
algorithms.
4. All of the data, except for cases where the languages
have not been studied in sufficient detail, are morphologically segmented, in order to facilitate manual and
automatic analysis procedures and decrease the basis for potential errors of
5. Specifically for the needs of the GLD, an updated and explicated list of
Swadesh meanings has been introduced (details may be found here), facilitating
a correct and uniform selection of the appropriate synonym for languages with
sufficient data coverage.
6. All of the data are presented in at least three formats:
(a) an online database, available for browsing or querying (including the
possibility to search through multiple databases at once); (b) a print-ready,
non-editable PDF version; (c) an editable Microsoft Excel table with all the
data at the potential user's disposal (to view the Excel files properly, it is
necessary to download and install Starling Serif, the default Unicode font for
the GLD).
7. With the gradual addition of new data, the existing collections will be
integrated into a hierarchical structure capable of functioning as a unified
basis for genetic classification. The GLD intends to go far beyond the mere
collection, recoding, and annotation of raw data, incorporating powerful
historical tools for the analysis of said data as well.
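
To make the features listed above more concrete, the following Python sketch
shows one way an annotated, morphologically segmented wordlist entry could be
represented, and how a basic lexicostatistical percentage of shared cognates
between two wordlists could be computed from such entries. It is only an
illustration under stated assumptions: the record layout, the field names, the
use of "-" as a morpheme boundary, and the numeric cognate classes are all
hypothetical and do not reproduce the actual GLD data format; the forms and
sources are invented placeholders, not real language data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """One wordlist slot for one language (hypothetical layout, not the GLD schema)."""
    meaning: str          # Swadesh meaning, e.g. "water"
    form: str             # unified transcription; "-" marks morpheme boundaries (assumption)
    cognate_class: int    # forms sharing a number are judged cognate (assumption)
    source: str           # bibliographic reference for the form
    page: Optional[int]   # page number in the source, if known

    def morphemes(self) -> List[str]:
        """Split the segmented form into its morphemes."""
        return self.form.split("-")


def shared_percentage(a: Dict[str, Entry], b: Dict[str, Entry]) -> float:
    """Percentage of meanings attested in both lists whose forms share a cognate class."""
    common = [m for m in a if m in b]
    if not common:
        raise ValueError("the two wordlists have no meanings in common")
    matches = sum(1 for m in common if a[m].cognate_class == b[m].cognate_class)
    return 100.0 * matches / len(common)


# Invented placeholder data: neither the forms nor the sources are real.
lang_a = {
    "water": Entry("water", "wa-ta", 1, "Hypothetical Source A", 12),
    "fire": Entry("fire", "pi-ra", 1, "Hypothetical Source A", 15),
    "stone": Entry("stone", "ka-mo", 1, "Hypothetical Source A", 20),
}
lang_b = {
    "water": Entry("water", "wa-do", 1, "Hypothetical Source B", 7),
    "fire": Entry("fire", "su-li", 2, "Hypothetical Source B", 9),
}

print(lang_a["water"].morphemes())
print(shared_percentage(lang_a, lang_b))
```

Only meanings attested in both lists enter the comparison, so gaps in coverage
(such as "stone" in the second list above) do not count as mismatches; other
treatments of missing data are equally possible.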
© 2011-2016 George Starostin (site design,
data input coordination)
© 2011-2016 Phil Krylov (programming, technical support)