La question de la normalisation des écrits scolaires pour leur traitement automatique. Le cas de l’omission de mots [The question of normalizing school writings for automatic processing: the case of word omission]

Corpus 26 (26) (2025)

Abstract

This paper addresses the noise caused by word omissions in a corpus of school writings, with the aim of facilitating their subsequent automatic processing. Although a normalization step can make these texts easier to process, some passages remain difficult to interpret, particularly where the writer has omitted words. We propose three automatic or semi-automatic solutions to this problem. The first uses a "mask" token of the form xxx. The second is a semi-automatic approach in which each morpho-syntactic category proposed during normalization is replaced by the corresponding "prototypical word." The third uses the FlauBERT language model to "reconstruct" the most probable token in the text. The three methods are evaluated quantitatively; the results of the third method, which proved the most effective in the context of our research, are also presented qualitatively.
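
The three strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `<OMIT:CATEGORY>` marker format, the `PROTOTYPES` table and its values, and all function names are assumptions introduced here for illustration only. The third function shows only the shape of a model-based fill: it takes any callable `predict(left, right)` and does not load FlauBERT itself.

```python
import re
from typing import Callable

# ASSUMPTION: the normalization step marks an omitted word with its
# morpho-syntactic category, e.g. "<OMIT:DET>". This marker format is
# hypothetical and is not taken from the paper.
OMISSION = re.compile(r"<OMIT:([A-Z]+)>")

# ASSUMPTION: illustrative prototypical words per category; the paper's
# actual choices are not given in the abstract.
PROTOTYPES = {"DET": "le", "NOUN": "chose", "VERB": "fait", "PREP": "de"}

MASK = "xxx"  # the mask token named in the abstract


def fill_with_mask(text: str) -> str:
    """Method 1: replace every omission marker with the mask token."""
    return OMISSION.sub(MASK, text)


def fill_with_prototype(text: str) -> str:
    """Method 2: replace each marker with the prototypical word of its
    category, falling back to the mask token for unknown categories."""
    return OMISSION.sub(lambda m: PROTOTYPES.get(m.group(1), MASK), text)


def fill_with_model(text: str, predict: Callable[[str, str], str]) -> str:
    """Method 3 (shape only): fill markers left to right with the token
    returned by `predict(left_context, right_context)`. In practice
    `predict` would wrap a masked language model such as FlauBERT."""
    while True:
        match = OMISSION.search(text)
        if match is None:
            return text
        left, right = text[:match.start()], text[match.end():]
        text = left + predict(left, right) + right


sentence = "il mange <OMIT:DET> pomme"
print(fill_with_mask(sentence))       # il mange xxx pomme
print(fill_with_prototype(sentence))  # il mange le pomme
# A dummy predictor stands in for the language model here.
print(fill_with_model(sentence, lambda left, right: "une"))  # il mange une pomme
```

Passing the predictor as a callable keeps the sketch runnable without downloading a model, and makes the contrast visible: methods 1 and 2 ignore context, while method 3 chooses a token from the surrounding text.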

Added to PP
2025-01-28
