Disclaimer: this post is an analysis of an analysis — specifically, of the February 2019 Common Crawl snapshot as reported in a paper from Facebook Research (link). Interpretations should be digested with a grain of analytical skepticism 😉.

A while back I became interested in the Common Crawl project. For those of you who don't know, Common Crawl is an effort to archive time-series snapshots of the whole internet. While the project aims to be as comprehensive as possible, taking a snapshot of the entire internet is an impossible task (let alone taking repeated snapshots of the complete internet corpus for the last 11 years, since 2011). Instead, the project takes selective snapshots of a wide range of websites.

Since my native tongue is not English, I have wondered for a while how much of the archived content is non-English. While I haven't gotten around to analyzing the data myself yet, this paper ( https://aclanthology.org/2020...
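To make the question concrete, here is a minimal sketch of the kind of check I had in mind: classifying text snippets as English vs. non-English and reporting the non-English fraction. The stopword list and the 0.1 threshold are my own arbitrary choices for illustration — real studies (including the paper above) use proper language-identification models, not this heuristic.

```python
# Crude English-vs-other classifier based on stopword frequency.
# The stopword set and the 0.1 threshold are illustrative assumptions.
ENGLISH_STOPWORDS = {"the", "of", "and", "to", "in", "is", "that", "it", "for", "was"}

def looks_english(text: str, threshold: float = 0.1) -> bool:
    """Return True if at least `threshold` of the tokens are common English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold

def non_english_fraction(snippets) -> float:
    """Fraction of snippets the heuristic classifies as non-English."""
    if not snippets:
        return 0.0
    non_english = sum(1 for s in snippets if not looks_english(s))
    return non_english / len(snippets)

snippets = [
    "the quick brown fox jumps over the lazy dog",
    "el rápido zorro marrón salta sobre el perro perezoso",
    "der schnelle braune Fuchs springt über den faulen Hund",
]
print(non_english_fraction(snippets))  # 2 of 3 snippets lack English stopwords
```

On the toy input above the heuristic flags the Spanish and German snippets as non-English, giving a fraction of 2/3. Scaling this idea up to a real crawl would mean streaming extracted text (Common Crawl's WET files) through a trained language identifier instead.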