URL Search FAQ

The URL Search tool allows you to search the URL index of the Common Crawl corpus. You can search for any URL, URL prefix, subdomain, or top-level domain. The results are presented as an alphabetically ordered list with an approximate count of the number of matches, and are also available for download as a JSON file. Queries must be entered in the form 'domain.tld' (for example, wikipedia.org).

What can I search for?

Looking For Data From                       Search Query
en.wikipedia.org, es.wikipedia.org, ...     wikipedia.org
fr.wikipedia.org                            fr.wikipedia.org
all pages with .org TLD                     org
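
For illustration, the matching behavior the examples above imply can be sketched as a suffix match on dot-delimited host labels: a query matches a hostname when the query's labels equal the trailing labels of the host. This is a minimal sketch of that rule, not the tool's actual implementation, and the function name matches_query is hypothetical.

    def matches_query(hostname: str, query: str) -> bool:
        """Illustrative suffix match on dot-delimited labels.

        Mirrors the examples in the table above; this is an assumption
        about the matching rule, not Common Crawl's implementation.
        """
        host_labels = hostname.lower().split(".")
        query_labels = query.lower().split(".")
        # Match when the query's labels equal the trailing labels of the host.
        return host_labels[-len(query_labels):] == query_labels

    # 'wikipedia.org' matches every subdomain of wikipedia.org:
    assert matches_query("en.wikipedia.org", "wikipedia.org")
    assert matches_query("es.wikipedia.org", "wikipedia.org")
    # 'fr.wikipedia.org' matches only that subdomain:
    assert matches_query("fr.wikipedia.org", "fr.wikipedia.org")
    assert not matches_query("en.wikipedia.org", "fr.wikipedia.org")
    # 'org' matches any .org host:
    assert matches_query("commoncrawl.org", "org")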

Where can I find out more about the URL Index?
This GitHub page has more information about the URL Index.

How can I download the search results as a JSON file?
In the upper right of the screen you will see a button labeled “download as: JSON”. Click it to download a JSON file of your results. Be patient; the download can take a long time.

Why would I want a JSON file of the results?
If you download the JSON file, you can feed it to your code so that your job runs only over the files that matched your search rather than over the entire Common Crawl corpus.
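
As a minimal sketch of that workflow, the snippet below loads the downloaded results and collects the set of corpus files to hand to a job. The record key "filename" and the file name urlsearch_results.json are assumptions for illustration; inspect a downloaded file for the actual schema before relying on it.

    import json

    # Restrict a job to the files found by URL Search instead of the
    # whole corpus. "filename" is an assumed key in each result record.
    with open("urlsearch_results.json") as f:
        results = json.load(f)

    # Deduplicate: many matching pages can live in the same corpus file.
    files_to_process = sorted({record["filename"] for record in results})

    for path in files_to_process:
        print(path)  # hand only these paths to your processing job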

Why do most of my queries report the count as “over 100,000”?
Counting a large number of URLs takes a fair bit of time. We figured that when you want to know how many pages in the corpus match your query, it is better to get an answer to you quickly, so we count matches up to 100,000 and then stop. If there are more than 100,000 pages matching your query, “a lot” is a good enough answer for all practical purposes.
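
The capped count described above amounts to stopping as soon as the total is known to exceed the threshold. Here is a minimal sketch of that idea; the cap of 100,000 comes from the answer above, while the function itself is purely illustrative.

    def capped_count(matches, cap=100_000):
        """Count items, stopping once the total is known to exceed cap.

        Returns (count, exact): exact is False when the real total
        exceeds the cap, in which case count is just the cap itself.
        """
        count = 0
        for _ in matches:
            count += 1
            if count > cap:
                return cap, False  # report "over 100,000" instead
        return count, True

    count, exact = capped_count(range(250_000))
    print(count if exact else f"over {count:,}")  # -> "over 100,000"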

Have more questions? Feel free to email us at info@commoncrawl.org