Disclaimer: this post is an analysis of an analysis — specifically, of the February 2019 Common Crawl snapshot as reported in a paper from Facebook Research (link). Interpretations should be digested with a grain of analytical skepticism 😉.

A while back I became interested in the Common Crawl project. For those of you who don't know, Common Crawl is an effort to archive time-series snapshots of the whole internet. While the project aims to be as comprehensive as possible, taking a snapshot of the entire internet is an impossible task (let alone taking repeated snapshots of the complete internet corpus for the last 11 years, since 2011). Instead, the project takes selective snapshots of a wide range of websites.

Since my native tongue is not English, I have wondered for a while how much of the archived content is non-English. While I haven't gotten around to analyzing the data myself yet, this paper ( https://aclanthology.org/2020...
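To make the question concrete, here is a minimal sketch of the kind of check I had in mind: classifying text snippets as English vs. non-English and reporting the non-English fraction. The stopword list and the 0.1 threshold are my own arbitrary choices for illustration — real studies (including the paper above) use proper language-identification models, not this heuristic.

```python
# Crude English-vs-other classifier based on stopword frequency.
# The stopword set and the 0.1 threshold are illustrative assumptions.
ENGLISH_STOPWORDS = {"the", "of", "and", "to", "in", "is", "that", "it", "for", "was"}

def looks_english(text: str, threshold: float = 0.1) -> bool:
    """Return True if at least `threshold` of the tokens are common English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold

def non_english_fraction(snippets) -> float:
    """Fraction of snippets the heuristic classifies as non-English."""
    if not snippets:
        return 0.0
    non_english = sum(1 for s in snippets if not looks_english(s))
    return non_english / len(snippets)

snippets = [
    "the quick brown fox jumps over the lazy dog",
    "el rápido zorro marrón salta sobre el perro perezoso",
    "der schnelle braune Fuchs springt über den faulen Hund",
]
print(non_english_fraction(snippets))  # 2 of 3 snippets lack English stopwords
```

On the toy input above the heuristic flags the Spanish and German snippets as non-English, giving a fraction of 2/3. Scaling this idea up to a real crawl would mean streaming extracted text (Common Crawl's WET files) through a trained language identifier instead.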