The Relationship Between Word Length and Average Information Content in Japanese

Cognitive Science 47 (6):e13302 (2023)

Abstract

Piantadosi, Tily, and Gibson analyzed a large‐scale web‐scraped corpus (the Google 1T dataset) and reported that word length is independently predicted by average information content (surprisal) calculated by a 2‐ to 4‐gram model (hereafter, longer‐span surprisal) across 11 Indo‐European languages: Czech, Dutch, English, French, German, Italian, Polish, Spanish, Portuguese, Romanian, and Swedish. However, a recent article by Meylan and Griffiths emphasized the importance of preprocessing in studies of large‐scale corpora and reanalyzed the same datasets. After their preprocessing, the results of Piantadosi et al. were not replicated for Czech, Romanian, and Swedish. Additionally, a German‐specific study by Koplenig, Kupietz, and Wolfer, which applied the preprocessing suggested by Meylan and Griffiths to a large‐scale but less noisy database, showed that a strict analysis did not replicate the result of Piantadosi et al. for that language. Together, these three studies contribute evidence from 11 Indo‐European languages and one Afro‐Asiatic language, Hebrew, to this debate. However, evidence from other language families is lacking. This study provides evidence from Japanese based on strict preprocessing of Google's web‐scraped database. The results show that Japanese word length can be predicted independently by 2‐ to 4‐gram surprisal.
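
To make the notion of average information content concrete, the following is a minimal sketch, not the authors' pipeline. It assumes an unsmoothed maximum-likelihood n-gram model over an already tokenized corpus; the function name `average_surprisal` and the toy token list are hypothetical. For each word, it averages the surprisal, -log2 P(word | preceding n-1 tokens), over all of that word's occurrences.

```python
import math
from collections import Counter, defaultdict


def average_surprisal(tokens, n=2):
    """Average surprisal (in bits) per word type under an unsmoothed n-gram model.

    Assumption: probabilities are plain relative frequencies, with no
    smoothing and no frequency cutoffs. Real corpus studies need both
    careful preprocessing and far larger data.
    """
    context_counts = Counter()
    ngram_counts = Counter()
    # Count each (context, word) pair, where context is the previous n-1 tokens.
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        context_counts[context] += 1
        ngram_counts[(context, tokens[i])] += 1

    totals = defaultdict(float)  # summed surprisal per word
    occurrences = Counter()      # number of scored occurrences per word
    for (context, word), count in ngram_counts.items():
        p = count / context_counts[context]  # P(word | context)
        totals[word] += count * -math.log2(p)
        occurrences[word] += count

    return {word: totals[word] / occurrences[word] for word in totals}


# Toy usage with a hypothetical six-token corpus and a bigram model.
scores = average_surprisal("a b a c a b".split(), n=2)
```

In a study of this kind, each word's length would then be related to its score in `scores`, for example with a rank correlation, while controlling for simpler predictors such as word frequency.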

Analytics

Added to PP
2023-06-14