Conference papers

ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus

Abstract : Part-of-speech (POS) tagging is a classical natural language processing (NLP) task. Although many tools and corpora have been proposed, especially for the most widely spoken languages, they suffer from limitations concerning their license, the size of their tagset, or their reliance on approaches that are no longer state of the art. In this article, we propose ANTILLES, an extended version of an existing French corpus (UD French-GSD) with an original set of labels derived from morphological features (gender, number, tense, etc.). This extended version includes 65 labels, compared with 16 in the initial version. We also implemented several POS taggers for French built from this corpus, incorporating recent state-of-the-art advances in this area. The corpus and the POS tagging tools are fully open and freely available.
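
The idea of enriching a coarse tagset with morphological features can be sketched roughly as follows. This is a minimal illustration only, assuming CoNLL-U style inputs (a UPOS tag and a FEATS string, as used in UD French-GSD). The function names and the `UPOS_Value_Value` naming scheme are hypothetical and are not the actual ANTILLES labels, which are defined in the paper.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Parse a CoNLL-U FEATS string such as 'Gender=Fem|Number=Sing'."""
    if feats in ("", "_"):  # '_' marks an empty FEATS column in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def enriched_label(upos: str, feats: str,
                   keys=("Gender", "Number", "Tense")) -> str:
    """Append selected morphological values to the coarse UPOS tag.

    Hypothetical naming scheme for illustration; not the ANTILLES tagset.
    """
    values = parse_feats(feats)
    suffix = [values[k] for k in keys if k in values]
    return "_".join([upos, *suffix])


if __name__ == "__main__":
    print(enriched_label("NOUN", "Gender=Fem|Number=Sing"))  # NOUN_Fem_Sing
    print(enriched_label("ADP", "_"))                        # ADP
```

Tokens without any of the selected features keep their coarse tag, so the enriched tagset is always at least as fine-grained as the original one.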

https://hal.archives-ouvertes.fr/hal-03696042
Contributor : Yanis Labrak
Submitted on : Monday, June 20, 2022 - 12:00:48 PM
Last modification on : Friday, August 5, 2022 - 2:54:52 PM

File

ANTILLES_A_freNch_linguisTIcaL...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03696042, version 1

Citation

Yanis Labrak, Richard Dufour. ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus. 25th International Conference on Text, Speech and Dialogue (TSD), Sep 2022, Brno, Czech Republic. ⟨hal-03696042⟩
