Text Annotation Guidelines
Version 1e
1. Generalities
Please provide sample texts with annotation. There are two basic formats of representation: horizontal (for human reading) and vertical (for machine reading).
The horizontal representation of the sample text(s) is included in the Analytical Report, at the end. In the vertical representation, each text should be submitted as a separate XLSX file (or in a similar spreadsheet format). In our experience, it works best to prepare the texts in the horizontal format first and then convert them into the vertical format.
If only one text is annotated for a given language, the XLSX file should have the same name as the DOCX file for the Analytical Report (everything in lowercase letters):
Glottolog code-language name-AR author’s name
E.g.: soni1259-east_soninke-vydrin.xlsx
If more than one text is annotated, add a word or two from the title of the text at the end of the file name, e.g., for Bambara:
bamb1269-bambara-vydrin-drum
bamb1269-bambara-vydrin-na_magosa
bamb1269-bambara-vydrin-Niger_riverbank
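The naming scheme above can be assembled mechanically. The following is a minimal sketch, not part of the guidelines themselves: the function name `file_stem` and its parameters are hypothetical, and the Glottocode check (four lowercase alphanumeric characters followed by four digits) is an assumption about the code format. Lowercasing is applied throughout, following the "everything in lowercase letters" rule.

```python
import re

# Assumed shape of a Glottocode: 4 lowercase alphanumerics + 4 digits (e.g. soni1259).
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def file_stem(glottocode, language, author, text_name=None):
    """Build the file name stem shared by the DOCX and XLSX files (hypothetical helper)."""
    if not GLOTTOCODE.fullmatch(glottocode):
        raise ValueError(f"unexpected Glottocode: {glottocode!r}")
    parts = [glottocode, language, author]
    if text_name:  # only needed when several texts are annotated
        parts.append(text_name)
    # Everything in lowercase; spaces inside a part become underscores.
    return "-".join("_".join(p.lower().split()) for p in parts)


print(file_stem("soni1259", "East Soninke", "Vydrin") + ".xlsx")
print(file_stem("bamb1269", "Bambara", "Vydrin", "na magosa") + ".xlsx")
```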
For each text, provide the metadata. The metadata are included in the same Excel file as the text, on a separate sheet (this sheet must be named metadata), in two columns:
Field | Data |
Reference | |
Author/speaker name | |
Date of recording/publication | |
Genre/register of the text | |
URL | |
URL is provided for the texts from the Internet; Reference is for published texts (otherwise, these lines are left vacant).
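As an illustration only, the two-column metadata sheet can be modelled as a list of (Field, Data) rows before it is typed or exported into the spreadsheet. The helper name `metadata_rows` is hypothetical and the sample values are placeholders. The field names are those of the table above, and Reference and URL may stay empty as described; treating the other three fields as mandatory is an assumption of this sketch.

```python
FIELDS = [
    "Reference",
    "Author/speaker name",
    "Date of recording/publication",
    "Genre/register of the text",
    "URL",
]
OPTIONAL = {"Reference", "URL"}  # left vacant when not applicable


def metadata_rows(data):
    """Return the rows of the 'metadata' sheet, header row first (hypothetical helper)."""
    unknown = set(data) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    # Assumption: every field except Reference and URL must be filled in.
    missing = [f for f in FIELDS if f not in OPTIONAL and not data.get(f)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return [("Field", "Data")] + [(f, data.get(f, "")) for f in FIELDS]


# Placeholder values, for illustration only.
rows = metadata_rows({
    "Author/speaker name": "speaker name here",
    "Date of recording/publication": "recording date here",
    "Genre/register of the text": "genre here",
})
for field, value in rows:
    print(field, "|", value)
```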
2. Sample size
The total size of the sample should be somewhere between 100 (the bare minimum) and 3000 syllables. Ideally, it should contain several texts of several hundred syllables each.
If you provide more than one text, please try to diversify the genres or, at least, use texts from different authors/speakers.
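A rough way to check the size limits automatically is to count syllables in the annotation column. The sketch below is an illustration under assumptions, not an official tool: it presumes the boundary conventions of §3.2, with every word-internal syllable boundary written as a dot or subsumed by a foot bar, and toneme labels in uppercase Latin letters. The names `count_syllables` and `sample_size_ok` are hypothetical.

```python
import re

MIN_SYLLABLES, MAX_SYLLABLES = 100, 3000  # limits stated in this section


def count_syllables(word):
    """Count syllables in one annotated prosodic word (one cell of column B)."""
    bare = re.sub(r"\[[A-Z]+ ?", "", word)   # drop tonal span openers such as "[H "
    bare = re.sub(r"[\]<>{}]", "", bare)     # drop closing and domain brackets
    pieces = re.split(r"[.|]", bare)         # dots and foot bars end a syllable
    # Ignore pieces that hold no segmental material (only boundary symbols).
    return sum(1 for p in pieces if p.strip("-_,` "))


def sample_size_ok(words):
    """True if the total syllable count lies within the recommended range."""
    total = sum(count_syllables(w) for w in words)
    return MIN_SYLLABLES <= total <= MAX_SYLLABLES


print(count_syllables("[H lo,o][L .ka][H .tʃi][L -,n]"))
print(count_syllables("[H jú.gu|-ya|-ra]"))
```

Words annotated without explicit dots between tonal spans would be undercounted by this sketch, so the result is best treated as a lower bound.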
3. Conventions for the annotation
3.1. Presentation of the annotated text
The text is represented in two formats, horizontal (primarily for human reading) and vertical (in an Excel or ODS spreadsheet, for machine reading).
The horizontal representation is included in the Analytical Report, at the end. The text is represented in four lines:
– phonological transcription (tonemes are indicated at the underlying level), without boundaries (for human reading);
– an annotation with all boundaries indicated (this line is also intended for machine reading). In this line, punctuation is omitted (otherwise, dots and commas will be mistakenly counted as the designations for syllabic and moraic boundaries respectively);
– glosses (according to the Leipzig Glossing Rules; please note that the conventional glosses should be in CAPITALS, not in *small caps!);
– free translation.
If a language has an established written tradition (especially if the traditional writing system contains tonal notation), a supplementary fifth line for orthography can be added. Likewise, it is possible to add a line with the transcription used by the source, if this is different from the phonological transcription.
The text in the horizontal format should be represented in the form of tables with transparent (invisible) borders; each sentence in a separate table, each prosodic word in a separate cell. The free translation line follows separately after each table.
In the vertical representation, the free translation line (and the traditional orthography line, if any) are omitted. Phonological transcription, annotation and glosses are given in columns A, B and C of the spreadsheet respectively.
To convert a vertical table to a horizontal one (or vice versa) in Excel, use the option Formulas > Insert a function > Lookup and reference > TRANSPOSE. After the transposition, select the transposed fragment and do “Copy – Paste, option Values” (the text is then pasted elsewhere without any link to the TRANSPOSE formula).
An even easier procedure:
– select the lines in the Word file (provided that the examples are represented in the tables format);
– copy-paste the selection into some free space in the Excel file;
– select it in Excel and press “copy” (CTRL + C);
– go to the desired position in column A and right-click a free cell; among the options, find “Paste Special” and choose the pasting option “Transpose”.
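For those who prepare the data with a script rather than by hand, the same conversion is a plain transposition of rows and columns. This is a minimal sketch assuming the table is already available as a list of rows; the function name `transpose` is ours, and short rows are padded with empty cells so that no word is dropped.

```python
def transpose(rows):
    """Swap rows and columns of a table given as a list of lists of strings."""
    if not rows:
        return []
    width = max(len(r) for r in rows)
    # Pad ragged rows so that zip() does not silently truncate the table.
    padded = [list(r) + [""] * (width - len(r)) for r in rows]
    return [list(column) for column in zip(*padded)]


# Three horizontal lines: transcription, annotation, glosses.
horizontal = [
    ["dàgá", "búgu", "à"],
    ["[L dà][H gá]", "[H búg]", "[L à]"],
    ["go", "exit", "3SG"],
]
for row in transpose(horizontal):  # one prosodic word per row, columns A, B, C
    print(" | ".join(row))
```

Applying `transpose` twice to a rectangular table returns the original, so the same helper converts in both directions.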
3.2. Types of boundaries to be indicated in the text
- Tonal span boundaries: square brackets. After the left bracket, a letter (or a combination of letters) indicating the toneme is provided (followed by space), e.g.:
kòotá
[L kòo][H tá]
day
Note: If the author of an AR needs square brackets for phonetic transcription, it creates no problem, because in the tonal span marking, the opening square bracket is necessarily followed by a letter index for the toneme.
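The distinction drawn in the note can be expressed as a pattern: a tonal span opens with a bracket, a toneme label and a space, whereas a phonetic transcription does not. The sketch below is illustrative; it assumes that toneme labels consist of uppercase Latin letters (H, L, HL and the like), and the names `TONAL_SPAN` and `tonal_spans` are hypothetical. Spans written without a space after the label are not covered.

```python
import re

# Assumption: a toneme label is one or more uppercase ASCII letters followed by a space.
TONAL_SPAN = re.compile(r"\[([A-Z]+) ([^\[\]]*)\]")


def tonal_spans(annotation):
    """Return (toneme, content) pairs for every tonal span in an annotation string."""
    return [(m.group(1), m.group(2)) for m in TONAL_SPAN.finditer(annotation)]


print(tonal_spans("[L kòo][H tá]"))
print(tonal_spans("[kòotá]"))  # phonetic transcription: no toneme label, no span found
```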
- Syllable boundaries: dots, e.g.: cáq.qá
- For moraic languages, boundaries of morae within a syllable: commas, e.g.: má,a.là.mí,i.
- Foot boundaries (for the languages where the foot is relevant): a vertical bar, e.g.: fá.la|-tɔ. In some languages, words are not exhaustively subdivided into prosodic feet, and some leftovers (i.e. extra-metrical syllables) may remain. In this case, the vertical bar designating a foot boundary is accompanied by an underscore on the side of the extra-metrical syllable. For example, in Lhasa Tibetan: [H tshúŋ.tsuŋ]|_po ‘small’ (po is an extra-metrical syllable).
- Morphemic boundaries (if they cannot be traced, e.g. because of fusion, don’t bother, just ignore them): hyphens.
- Prosodic word boundaries: each prosodic word is given in a separate cell. If you need to cite an annotated fragment in the AR, use spaces.
In cases when segmentation into prosodic words is tricky, the author is expected to come up with some simple working solution (it is desirable to summarize it in §2.2.3 of the AR). In a typological project like this, we cannot delve too deeply into the problem of wordhood. For languages with established orthography, it is acceptable to take orthographic words as the basis, unless word segmentation in writing is too blatantly irrelevant.
There are no dedicated symbols for clitic boundaries. Clitics are to be treated as either morphemes or separate prosodic words, at the author’s discretion, depending on whether they are closer to the former or the latter in their properties. It is possible that some clitic boundaries in a given language will be treated as morpheme boundaries and some as prosodic word boundaries.
- Boundaries of tonal phrases (if relevant): curly brackets, { }
- Boundaries of domains of grammatical tonal morphemes: broken brackets. E.g. (Soninke):
<[L ŋà.lì.màa][H .mí-n]>
If the domain of a grammatical tonal morpheme coincides with a tonal span, the broken brackets embrace the square brackets, e.g.:
yàagu-nu
<[L yàa.gù.-nù]>
be.ashamed-GER\L
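Because square, curly and broken brackets can embrace one another, a quick consistency check before submission is to verify that they are properly paired. The following is a minimal sketch under the assumption that brackets must nest strictly; the function name `brackets_balanced` is hypothetical. It takes a sequence of cells rather than a single string, since a tonal span may run over several prosodic words, as in the Bambara example of §4.

```python
PAIRS = {"]": "[", ">": "<", "}": "{"}
OPENERS = set(PAIRS.values())


def brackets_balanced(cells):
    """Check the nesting of [ ], < > and { } across a sequence of annotation cells."""
    stack = []
    for cell in cells:
        for ch in cell:
            if ch in OPENERS:
                stack.append(ch)
            elif ch in PAIRS:
                # A closing bracket must match the most recently opened one.
                if not stack or stack.pop() != PAIRS[ch]:
                    return False
    return not stack  # nothing may remain open at the end


print(brackets_balanced(["<[L yàa.gù.-nù]>"]))
print(brackets_balanced(["[H ń", "ye]"]))  # one span over two cells
print(brackets_balanced(["[H ń"]))         # span never closed
```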
In the annotation line, no punctuation marks should be used in their conventional way (otherwise, they will be automatically counted as boundaries of syllables or morae).
Mora, syllable, foot and prosodic word constitute a hierarchy; a morpheme and a word constitute another hierarchy. Tonal span is outside these hierarchies.
Within one hierarchy, if boundaries of a lower unit coincide with boundaries of a higher unit, only the latter are indicated. E.g. in Hausa: dóːlè dó,o.lè (rather than *dó,o,.lè); in Bambara: bàn|fu.la, rather than *bàn.|fu.la; jú.gu|-ya|-ra rather than *jú.gu.|-ya.|-ra or *|jú.gu|-ya|-ra|.
If a boundary of a mora, a syllable or a foot coincides with a boundary of a morpheme, both should be indicated, e.g.: sɔ̀.rɔ-|la.
As for the tonal span boundaries, they should go together with all other boundary markers. For example, Hausa lóːkàtʃîn [H lo,o][L .ka][H .tʃi][L -,n]; Bambara: sàngó [L sàn][H |gó]
When two or more different boundaries coincide, the relative ordering of respective designations is up to the annotator.
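The rule that a higher boundary in the mora, syllable, foot and prosodic word hierarchy replaces a coinciding lower one lends itself to an automatic check. The sketch below is one possible reading of that rule, not an official validator: it strips tonal span and domain brackets, hyphens and underscores (which belong to other hierarchies or carry no boundary of their own), and then reports adjacent prosodic markers as well as markers at the word edge. The names `bare` and `redundant_boundaries` are hypothetical, and toneme labels are assumed to be uppercase Latin letters.

```python
import re


def bare(annotation):
    """Remove everything that is not part of the prosodic hierarchy marking."""
    s = re.sub(r"\[[A-Z]+ ?", "", annotation)  # tonal span openers such as "[H "
    return re.sub(r"[\]<>{}\-_]", "", s)       # closers, domain brackets, hyphens


def redundant_boundaries(word):
    """List the suspicious boundary sequences found in one annotated prosodic word."""
    s = bare(word)
    issues = re.findall(r"[,.|]{2,}", s)  # e.g. ".|" where "|" alone is expected
    if s and s[0] in ",.|":
        issues.append("leading " + s[0])
    if s and s[-1] in ",.|":
        issues.append("trailing " + s[-1])
    return issues


print(redundant_boundaries("jú.gu|-ya|-ra"))
print(redundant_boundaries("jú.gu.|-ya.|-ra"))
print(redundant_boundaries("|jú.gu|-ya|-ra|"))
```

A morpheme boundary next to a prosodic boundary, as in sɔ̀.rɔ-|la, is not reported, since both are to be indicated.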
3.3. Two tonemes on one TBU
– without morphemic boundary in-between:
gbɛ̃̂
[H gbɛ̃́][L-]
- an additive non-segmental morpheme sharing one TBU with a lexical morpheme:
kó`
[H kó-][L]
matter\ART
3.4. Other conventions
In languages with stress, the stress position is marked by an asterisk preceding the stressed syllable.
Syntactic annotation in Universal Dependencies is optional; it can be provided for languages which already have a UD sample and model.
4. Examples
A Soninke phrase, horizontal format:
kòotá | yògó | , | dèbé-n` | ŋàlimaamí-n` | tàaxa-llenmá-n` |
[L kòo][H tá] | [L yò][H gó] | [L dè][H bé-n] | <[L ŋà.lì.màa][H mí-n]> | <[L tàa.xà-l.lèn][H má-n]> | |
day | certain | village-DEF | imam\STCONSTR-DEF | sit-RECP\STCONSTR-DEF |
dàgá | búgu | à | `yí. |
[L dà][H gá] | [H búg] | [L à] | [H yí] |
go | exit | 3SG | for |
‘One day, a neighbour of an imam of a village went to visit him.’
The same Soninke phrase in the vertical format, where each word (and each punctuation symbol) appears in a separate line.
Kòotá | [L kòo][H tá] | day |
yògó, | [L yò][H gó] | certain |
dèbé-n` | [L dè][H bé-n] | village-DEF |
ŋàlimaamí-n` | <[L ŋà.lì.màa][H mí-n]> | imam\STCONSTR-DEF |
tàaxa-llenmá-n` | <[L tàa.xà-l.lèn][H má-n]> | sit-RECP\STCONSTR-DEF |
dàgá | [L dà][H gá] | go |
búgu | [H búg] | exit |
à | [L à] | 3SG |
`yí. | [H yí] | for |
A Bambara example, two phrases, in the horizontal format:
Ń | ye | à | dá | fálatɔnin` | dɔ́ | lá. | . | |
[H ń | ye] | [L à] | [H dá] | [H fá.la|-tɔ|-nin-]<[L `]> | [H dɔ́] | [H lá] | ||
1SG | PFV.TR | 3SG | put | orphan\ART | INDEF | at |
‘I am going to speak about an orphan.’
Fálatɔnin` | dɔ́ | , | à | mɔ̀ko` | júguyara | fɔ́ |
[H fá.la|-tɔ|-nin-]<[L `]> | [H dɔ́] | [L à] | [L mɔ̀|-ko-]<[L `]> | [H jú.gu|-ya|-ra] | [H fɔ́] | |
orphan\ART | INDEF | 3SG | education\ART | harden-PFV.INTR | till |
k’ à | dàmatɛ̀mɛ | . |
[L k’-à] | [L dà|-ma|-][L tɛ̀.mɛ] | |
INF-3SG | exceed |
‘An orphan, his upbringing was extremely difficult.’
The same Bambara phrases in the vertical format:
Ń | [H Ń | 1SG |
ye | ye] | PFV.TR |
à | [L à] | 3SG |
dá | [H dá] | put |
fálatɔnin` | [H fá.la|-tɔ|-nin-]<[L]> | orphan\ART |
dɔ́ | [H dɔ́] | INDEF |
lá | [H lá] | at |
fálatɔnin` | [H fá.la|-tɔ|-nin-]<[L]> | orphan\ART |
dɔ́ | [H dɔ́] | INDEF |
à | [L à] | 3SG |
mɔ̀ko` | [L mɔ̀|-ko-]<[L]> | education\ART |
júguyara | [H jú.gu|-ya|-ra] | harden-PFV.INTR |
fɔ́ | [H fɔ́] | till |
k’à | [L k-à] | INF-3SG |
dàmatɛ̀mɛ | [L dà|-ma|-][L tɛ̀.mɛ] | exceed |