ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus

Yanis Labrak; Richard Dufour

Conference paper, Lecture Notes in Artificial Intelligence, Year: 2022


Yanis Labrak

Role: Author
PersonId : 750624
IdHAL : yanis-labrak
ORCID : 0000-0003-1072-3862

Laboratoire Informatique d'Avignon

Richard Dufour

Role: Author
PersonId : 178348
IdHAL : richard-dufour
ORCID : 0000-0003-1203-9108

Traitement Automatique du Langage Naturel

Abstract

Part-of-speech (POS) tagging is a classical natural language processing (NLP) task. Many tools and corpora have been proposed, especially for the most widely spoken languages, but they are limited by their licenses, the size of their tagsets, or their reliance on approaches that are no longer state of the art. In this article, we propose ANTILLES, an extended version of an existing French corpus (UD French-GSD) with an original set of labels derived from morphological characteristics (gender, number, tense, etc.). This extended version comprises 65 labels, compared with 16 in the initial version. We also implemented several POS tagging tools for French built from this corpus, incorporating recent state-of-the-art advances in this area. The corpus and the POS tagging tools are fully open and freely available.
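To illustrate the idea of deriving finer-grained labels from morphological characteristics, here is a minimal sketch. It assumes the input is a token line in the CoNLL-U format used by Universal Dependencies treebanks such as UD French-GSD, where column 4 holds the universal POS tag and column 6 the morphological features. The function names, the choice of features, and the underscore-joined label format are hypothetical and are not the actual ANTILLES tagset.

```python
from typing import Dict, Sequence


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a CoNLL-U FEATS string such as 'Gender=Fem|Number=Plur'."""
    if not feats or feats == "_":  # '_' marks an empty field in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def enriched_label(upos: str, feats: Dict[str, str],
                   keys: Sequence[str] = ("Gender", "Number")) -> str:
    """Hypothetical scheme: append selected feature values to the UPOS tag."""
    values = [feats[k] for k in keys if k in feats]
    return "_".join([upos, *values])


def label_conllu_line(line: str) -> str:
    """Build an enriched label from one tab-separated CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    return enriched_label(cols[3], parse_feats(cols[5]))


# Illustrative token line (made up for this sketch, not taken from the corpus)
example = "2\tmaisons\tmaison\tNOUN\t_\tGender=Fem|Number=Plur\t1\tobj\t_\t_"
print(label_conllu_line(example))  # NOUN_Fem_Plur
```

Tokens without the selected features keep their plain universal tag, so a coarse tagset is refined only where morphology provides extra information.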

Keywords

Part-of-speech corpus, POS tagging, Open tools, Word embeddings, Bi-LSTM, CRF, Transformers

Domains

Artificial intelligence [cs.AI]; Computer science [cs]; Document and text processing

Main file

ANTILLES_A_freNch_linguisTIcaLLy_Enriched_part_of_Speech_corpus.pdf (504.81 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03696042

Submitted on: Monday, 20 June 2022, 12:00:48

Last modified on: Tuesday, 9 May 2023, 13:42:05

Long-term archiving on: Thursday, 22 September 2022, 10:58:20

Dates and versions

hal-03696042, version 1 (20-06-2022)

hal-03696042, version 2 (16-11-2022)

Identifiers

HAL Id: hal-03696042, version 1

Cite

Yanis Labrak, Richard Dufour. ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus. 25th International Conference on Text, Speech and Dialogue (TSD), Sep 2022, Brno, Czech Republic. ⟨hal-03696042v1⟩

163 views

153 downloads
