UD Thai TUD (Thai Universal Dependency Treebank) is a treebank of 3,627 syntactic trees from the Thai National Corpus and Wikipedia, annotated in Universal Dependencies, covering diverse text types and topics across various domains.
The UD Thai TUD treebank was created to provide a broad-coverage syntactic resource for the Thai language under the Universal Dependencies (UD) framework. Text was randomly sampled from two major sources: the Thai National Corpus and the November 2020 dump of Thai Wikipedia. To ensure diversity, 5,000 paragraphs were selected from various document types—news articles, Wikipedia entries, essays, advertisements, interviews, and stories—covering a wide range of topics such as politics, crime, entertainment, sports, history, religion, culture, and science. After annotation and rigorous quality control, 3,627 well-formed dependency trees were retained in the final dataset.
All paragraphs were tokenized using the newmm tokenizer from the PyThaiNLP library, then annotated on the Datasaur platform. A team of 10 annotators with linguistics backgrounds was trained to: (1) correct tokenization errors, (2) assign Universal POS (UPOS) tags, (3) identify dependency arcs, and (4) label relations (DEPREL) without subtypes. The LEMMA field was excluded due to the lack of inflectional morphology in Thai.
Following pilot annotations and manual review, two annotators demonstrating the highest accuracy were selected to complete the remaining data. Agreement was evaluated on 20 double-annotated sentences (399 tokens), achieving Cohen’s Kappa scores of 0.92 (UPOS) and 0.84 (DEPREL), and UAS/LAS scores of 0.85 and 0.78. The annotated data was then converted into CoNLL-U format and split into individual trees based on dependency structure. Trees with incomplete labels, multiple roots, or structural errors were corrected, with additional quality assurance performed via manual inspection of 50 randomly selected trees.
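The agreement figures above can be illustrated with a minimal sketch. It assumes both annotators worked on identical tokenization and that each token is represented as a `(head, deprel)` pair; the function names and the toy data are hypothetical and are not taken from the TUD evaluation scripts.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def attachment_scores(tokens_a, tokens_b):
    """Return (UAS, LAS) for two annotations given as (head, deprel) pairs."""
    if len(tokens_a) != len(tokens_b) or not tokens_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(tokens_a)
    uas = sum(a[0] == b[0] for a, b in zip(tokens_a, tokens_b)) / n  # heads only
    las = sum(a == b for a, b in zip(tokens_a, tokens_b)) / n  # head and label
    return uas, las


# Toy data (invented for illustration, not from TUD).
ann_a = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
ann_b = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

kappa = cohen_kappa([d for _, d in ann_a], [d for _, d in ann_b])
uas, las = attachment_scores(ann_a, ann_b)
print(round(kappa, 3), uas, las)  # 0.692 0.75 0.5
```

Kappa corrects raw label agreement for chance, while UAS and LAS here simply treat one annotation as the reference and count matching heads (UAS) or matching head-and-label pairs (LAS).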
The treebank consists of randomly shuffled sentences sampled from the Thai National Corpus (TNC) and the November 2020 dump of Thai Wikipedia, rather than complete documents. Each filename encodes the source document and the portion of the document from which the sentences were extracted:
- Wikipedia trees: filenames follow the format `wiki_<wgArticleID>`.
- TNC trees: filenames follow the format `[tnc/][Original TNC filename][Part of the document]`.
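As a rough sketch, the two naming patterns could be told apart by prefix. The helper `source_of` and the example names are hypothetical, and treating every name without the `wiki_` prefix as TNC is an assumption made for this illustration.

```python
def source_of(name):
    """Guess the source corpus of a tree from its filename.

    Returns a (source, remainder) pair. Assumption: any name that does not
    start with "wiki_" comes from the Thai National Corpus.
    """
    if name.startswith("wiki_"):
        # The remainder is the Wikipedia article ID.
        return "wikipedia", name[len("wiki_"):]
    # The "tnc/" prefix is stripped if present; the rest is left unparsed.
    remainder = name[len("tnc/"):] if name.startswith("tnc/") else name
    return "tnc", remainder


# Hypothetical names, for illustration only.
print(source_of("wiki_12345"))
print(source_of("tnc/EXAMPLE_1"))
```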
The final treebank was split into training, development, and test sets in an 8:1:1 ratio. It consists of syntactically complete trees rather than full documents. Filenames identify the source paragraph of each tree and indicate whether it came from the Thai National Corpus or Wikipedia. Sentence IDs in the final treebank, however, do not encode genre or domain metadata.
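An 8:1:1 split over shuffled trees can be sketched as follows. The function name, the fixed seed, and the rounding rule (one tenth each for dev and test, rounded down, with the rest going to train) are assumptions for illustration; the exact procedure used for TUD is not specified here.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/dev/test at roughly 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_dev = len(items) // 10
    n_test = len(items) // 10
    n_train = len(items) - n_dev - n_test  # train absorbs the remainder
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_8_1_1(range(3627))
print(len(train), len(dev), len(test))  # 2903 362 362
```

The sizes printed follow from this particular rounding rule and need not match the released split sizes.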
| UPOS | Train | Dev | Test | UPOS | Train | Dev | Test |
|---|---|---|---|---|---|---|---|
| NOUN | 18777 | 2270 | 2310 | CCONJ | 2063 | 239 | 270 |
| VERB | 14881 | 1802 | 1867 | ADJ | 1575 | 223 | 197 |
| ADP | 4517 | 530 | 560 | PART | 1366 | 156 | 169 |
| ADV | 4498 | 557 | 521 | NUM | 1161 | 165 | 118 |
| AUX | 3424 | 401 | 421 | DET | 1140 | 137 | 144 |
| PRON | 2796 | 322 | 350 | PUNCT | 871 | 104 | 125 |
| SCONJ | 2438 | 321 | 335 | SYM | 16 | 1 | 1 |
| PROPN | 2488 | 293 | 295 | | | | |
Table 1. UPOS Distribution in Each Split of TUD
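Counts like those in Table 1 could be recomputed from a CoNLL-U file with a short script. The sketch below is a minimal stdlib-only reader, not the tool used to build the table; `count_upos` and the sample sentence are invented for illustration, with LEMMA left as `_` as described above.

```python
from collections import Counter


def count_upos(conllu_text):
    """Count UPOS tags over the word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        counts[cols[3]] += 1  # column 4 is UPOS
    return counts


# Invented three-token sample (not taken from TUD).
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = ฉันกินข้าว",
    "\t".join(["1", "ฉัน", "_", "PRON", "_", "_", "2", "nsubj", "_", "SpaceAfter=No"]),
    "\t".join(["2", "กิน", "_", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", "ข้าว", "_", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "",
])

print(count_upos(SAMPLE))
```

To apply it to a real split, read the file with `open(path, encoding="utf-8").read()` and pass the text to `count_upos`.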
TUD was developed as part of the paper "The Thai Universal Dependency Treebank", published in Transactions of the Association for Computational Linguistics (TACL). We thank the reviewers and the action editor of the paper for their constructive feedback, which contributed to significant improvements. We also gratefully acknowledge all annotators for their effort and dedication throughout the annotation process.
This work was supported by the National Research Foundation, Singapore under its AI Singapore Programme, and by the National Science Research and Innovation Fund (NSRF) through the Program Management Unit for Human Resources & Institutional Development, Research, and Innovation [grant number B0SF640234].
If you use TUD in your project or publication, please cite as follows:
BibTeX
@article{Sriwirote-etal-2024-TUD,
title={The Thai Universal Dependency Treebank},
author={Panyut Sriwirote and Wei Qi Leong and
Charin Polpanumas and Santhawat Thanyawong and
William Chandra Tjhi and Wirote Aroonmanakun and
Attapol T. Rutherford},
journal={Transactions of the Association for Computational Linguistics},
year={in press},
publisher={MIT Press Direct}
}
- 2025-11-15 v2.17
- Initial release in Universal Dependencies.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.17
License: CC BY-SA 4.0
Includes text: yes
Parallel: no
Genre: wiki news fiction nonfiction academic legal
Lemmas: not available
UPOS: manual native
XPOS: not available
Features: not available
Relations: manual native
Contributors: Sriwirote, Panyut; Leong, Wei Qi; Polpanumas, Charin; Thanyawong, Santhawat; Tjhi, William Chandra; Aroonmanakun, Wirote; Rutherford, Attapol T.; Jiamsundutsadee, Ratanon; Maitreenukul, Punyanuch
Contributing: here
Contact: attapol.t@chula.ac.th, punyanuch.maitree@gmail.com
===============================================================================