Common Crawl

From Algolit

Type: Dataset
Technique: scraping
Developed by: The Common Crawl Foundation, California, US

Common Crawl is a registered non-profit organization founded by Gil Elbaz. Its goal is to democratize access to web information by producing and maintaining an open repository of web crawl data that anyone can access and analyze.

Common Crawl completes four crawls a year. Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012. The crawl of September 2017 contains 3.01 billion web pages and over 250 TiB of uncompressed content, or about 75% of the Internet.

The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
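
As an illustration of what respecting robots.txt means in practice, the sketch below uses Python's standard `urllib.robotparser` module to decide whether a URL may be fetched. The robots.txt rules, the example.org URLs and the user-agent name `ExampleBot` are invented for this example; this is not Common Crawl's own crawler code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.org/index.html",
        "https://example.org/private/notes.html",
    ):
        print(url, may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

A polite crawler would run a check like this before every request and skip any URL for which it returns False.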

Common Crawl datasets are used to create pretrained word embedding datasets such as GloVe (see The GloVe Reader). word2vec is another widely used pretrained word embedding dataset; it is based on Google News texts.

Maison du Livre's website in the Common Crawl Index:

{"urlkey": "be,lamaisondulivre)/", "timestamp": "20170921193906", "filename": "crawl-data/CC-MAIN-2017-39/segments/1505818687837.85/warc/CC-MAIN-20170921191047-20170921211047-00095.warc.gz", "mime-detected": "application/xhtml+xml", "status": "200", "mime": "text/html", "digest": "KDTUFUFZASPU7DXCJRQN62DHWGXGUZIX", "length": "5082", "offset": "491381827", "url": ""}
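
To show how such an index record can be read, here is a minimal Python sketch that parses the record above (reduced to the fields it uses). It assumes that `offset` and `length` describe a byte range inside the compressed WARC file named in `filename`, that `timestamp` is in UTC, and that `urlkey` lists the host name parts in reverse order. The helper names are hypothetical, and the sketch builds an HTTP Range header without making any request.

```python
import json
from datetime import datetime, timezone

# Index record for lamaisondulivre.be, copied from the line above
# (only the fields used in this sketch).
RECORD = json.loads(
    '{"urlkey": "be,lamaisondulivre)/", "timestamp": "20170921193906", '
    '"filename": "crawl-data/CC-MAIN-2017-39/segments/1505818687837.85/warc/'
    'CC-MAIN-20170921191047-20170921211047-00095.warc.gz", '
    '"status": "200", "length": "5082", "offset": "491381827"}'
)


def capture_time(record: dict) -> datetime:
    """Parse the 14-digit timestamp (assumed to be UTC) into a datetime."""
    parsed = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def byte_range_header(record: dict) -> str:
    """Build an HTTP Range header value covering this record's bytes."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range end is inclusive
    return f"bytes={start}-{end}"


def host_from_urlkey(urlkey: str) -> str:
    """Turn a reversed host key such as 'be,lamaisondulivre)/' into a host name."""
    reversed_host = urlkey.split(")", 1)[0]
    return ".".join(reversed(reversed_host.split(",")))


if __name__ == "__main__":
    print(host_from_urlkey(RECORD["urlkey"]))
    print(capture_time(RECORD).isoformat())
    print(byte_range_header(RECORD))
```

Requesting only that byte range from the file named in `filename` would return this single capture rather than the whole archive file.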

Constant's website in the Common Crawl Index:

{"urlkey": "org,constantvzw)/", "timestamp": "20170920232443", "filename": "crawl-data/CC-MAIN-2017-39/segments/1505818687582.7/crawldiagnostics/CC-MAIN-20170920232245-20170921012245-00322.warc.gz", "mime-detected": "text/html", "status": "302", "mime": "text/html", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "547", "offset": "10063605", "url": ""}
{"urlkey": "org,constantvzw)/", "timestamp": "20170921101437", "filename": "crawl-data/CC-MAIN-2017-39/segments/1505818687740.4/crawldiagnostics/CC-MAIN-20170921101029-20170921121029-00322.warc.gz", "mime-detected": "text/html", "status": "302", "mime": "text/html", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "548", "offset": "10050808", "url": ""}
{"urlkey": "org,constantvzw)/", "timestamp": "20170925145800", "filename": "crawl-data/CC-MAIN-2017-39/segments/1505818691977.66/crawldiagnostics/CC-MAIN-20170925145232-20170925165232-00347.warc.gz", "mime-detected": "text/html", "status": "302", "mime": "text/html", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "541", "offset": "1503578", "url": ""}
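
The three records above can be compared with a short Python sketch that counts captures per HTTP status, counts distinct content digests, and reads the folder name that follows the segment ID in `filename`. The records are copied from the lines above and reduced to the fields used; the function names are hypothetical.

```python
import json
from collections import Counter

# The three index lines for constantvzw.org, reduced to the fields used here.
INDEX_LINES = [
    '{"urlkey": "org,constantvzw)/", "timestamp": "20170920232443", "filename": "crawl-data/CC-MAIN-2017-39/segments/1505818687582.7/crawldiagnostics/CC-MAIN-20170920232245-20170921012245-00322.warc.gz", "status": "302", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"}',
    '{"urlkey": "org,constantvzw)/", "timestamp": "20170921101437", "filename": "crawl-data/CC-MAIN-2017-39/segments/1505818687740.4/crawldiagnostics/CC-MAIN-20170921101029-20170921121029-00322.warc.gz", "status": "302", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"}',
    '{"urlkey": "org,constantvzw)/", "timestamp": "20170925145800", "filename": "crawl-data/CC-MAIN-2017-39/segments/1505818691977.66/crawldiagnostics/CC-MAIN-20170925145232-20170925165232-00347.warc.gz", "status": "302", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"}',
]


def storage_folder(filename: str) -> str:
    """Return the path component that follows the segment ID."""
    parts = filename.split("/")
    return parts[parts.index("segments") + 2]


def summarise(lines: list) -> dict:
    """Summarise a list of JSON index lines for one URL."""
    records = [json.loads(line) for line in lines]
    return {
        "captures": len(records),
        "statuses": Counter(r["status"] for r in records),
        "distinct_digests": len({r["digest"] for r in records}),
        "folders": sorted({storage_folder(r["filename"]) for r in records}),
    }


if __name__ == "__main__":
    print(summarise(INDEX_LINES))
```

For these records the summary shows three captures that all returned status 302 (an HTTP redirect) with an identical digest, all stored under `crawldiagnostics`, whereas the Maison du Livre capture with status 200 is stored under `warc`.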