
Languages of Internet: Musings on the Feb 2019 Common Crawl Dataset

 

Disclaimer: This is an analysis of an analysis. It is based on the summary of the Feb 2019 Common Crawl snapshot reported in a paper from Facebook Research; link. Interpretations should be taken with a grain of analytical skepticism 😉.

 

A while back I became interested in the Common Crawl project. For those of you who do not know, the Common Crawl project is an effort to archive time-series snapshots of the whole internet. While the project wants to be as comprehensive as possible, taking a snapshot of the whole internet is an impossible task (let alone taking time-series snapshots of the complete internet corpus for the last 11 years, since 2011). Instead, the Common Crawl project takes selective snapshots of a wide range of websites.

Since my native tongue is not English, I have been wondering for a while how much of the archived content is non-English. Although I haven’t been able to analyze the raw data myself yet, this paper (https://aclanthology.org/2020.lrec-1.494/) gave me enough data to perform a rudimentary analysis of the non-English content in the project. I needed to do some cleaning of the data after reading in the tabular summary (Table 3 of the article). I also grabbed (i) a mapping from language codes to full language names and (ii) the number of speakers of each language from the corresponding Wikipedia articles.

Let’s go through different aspects of the dataset. First, the plot of total speakers against the number of documents in the Feb 2019 snapshot shows that some languages have more representation than others. Russian, German, Japanese, Italian, and Polish stand out here.

 

Another way of visualizing the same concept is to plot the ratio of documents to speakers for each language in the dataset. This plot shows similar trends.
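As a rough sketch of how such a ratio could be computed (this is not the code from the repository): the `LanguageStats` class, the `documents_per_speaker` helper, and the two-letter codes and counts below are all made-up placeholders for illustration, not values from the paper or from Wikipedia.

```python
from dataclasses import dataclass


@dataclass
class LanguageStats:
    """One row of a hypothetical cleaned table: language code, documents, speakers."""
    code: str
    documents: int
    speakers: int


def documents_per_speaker(rows):
    """Return {language code: documents / speakers}, sorted from highest to lowest ratio.

    Rows with a missing or zero speaker count are skipped to avoid dividing by zero.
    """
    ratios = {}
    for row in rows:
        if row.speakers <= 0:
            continue
        ratios[row.code] = row.documents / row.speakers
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))


# Placeholder numbers only; substitute the real document and speaker counts.
rows = [
    LanguageStats("aa", documents=1_000, speakers=10_000),
    LanguageStats("bb", documents=500, speakers=1_000),
    LanguageStats("cc", documents=200, speakers=0),  # no speaker data, skipped
]
ratios = documents_per_speaker(rows)
for code, ratio in ratios.items():
    print(f"{code}: {ratio:.3f} documents per speaker")
```

Sorting by the ratio makes over-represented languages (relative to their speaker counts) appear first, which is the same comparison the plot is meant to convey.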

Next, let’s check how many sentences there are per document for individual languages. Here, we see that Japanese, Chinese, Tagalog, and Korean (60-70 sentences/document) have more sentences per document than other languages. Urdu has the fewest sentences per document (~17).


Next, let’s inspect the size of each document. Plotting size/document shows an interesting trend, with Chinese, Japanese, Russian, and Burmese leading.

At face value, this data would suggest that Chinese, Japanese, and Burmese documents are richer in content (since they are larger). I was trying to make sense of why content in these languages would be richer than content in other languages. Then it came to my mind (obviously after consulting Google) that Unicode text does not take the same amount of storage for every language. In an encoding such as UTF-8, basic Latin characters take one byte each, so English is the most compact, while characters from other scripts take two or more bytes each. See links: here and here. So, I normalized the sizes for individual languages to Latin byte equivalents based on crude correction factors. The normalized data looks like this:
Perhaps this data makes more sense, with English and French leading the normalized per-document sizes. I don’t know the exact reasons for this, but my guesses would be:

1. Many academic research articles are published in these two languages.

2. These two languages are the first or second official languages of many countries. Thus, many official documents (communications, law gazettes, etc.) are released in them. These official documents are typically large and are therefore likely to inflate the (normalized) size per document.
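Returning to the byte-size correction discussed earlier, here is a minimal sketch of the idea, assuming the text is stored as UTF-8. The helper names `utf8_bytes_per_char` and `normalize_size`, the sample strings, and the choice of bytes-per-character as the correction factor are my illustrative assumptions; the crude factors actually used for the plot may differ.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


def normalize_size(size_bytes, factor):
    """Convert a raw byte size to a 'Latin byte equivalent' by dividing by the
    script's bytes-per-character factor (an assumed, crude correction)."""
    return size_bytes / factor


# Short sample strings, used only to show how byte cost varies by script.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}
bytes_per_char = {lang: utf8_bytes_per_char(s) for lang, s in samples.items()}

for lang, factor in bytes_per_char.items():
    print(f"{lang}: {factor:.1f} bytes per character")

# A hypothetical 3000-byte Japanese document, expressed in Latin byte equivalents.
normalized = normalize_size(3000, bytes_per_char["Japanese"])
```

Note that this only corrects for encoding overhead. It does not account for how much meaning a single character carries in each writing system, so the normalized sizes are still a crude comparison.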

Concluding remarks:

1. The Common Crawl dataset is enriched in English-language content (Duh!).

2. French, Spanish, German, Russian, Polish, and Italian content is also significantly represented (both in the number of archived documents and in normalized language corpus sizes).

3. Japanese, Chinese, Korean, and Tagalog have a relatively high number of sentences per document. I don’t know whether this is due to sampling bias or is a property intrinsic to the individual languages.

Cheers!

 

GitHub repository: https://github.com/Mahdi-Moosa/Musings_On_The_Feb_2019_Common_Crawl_Dataset
