Marginalia Search Engine Software Documentation

2.5 - WARC export

It is possible to configure the crawler to export its crawl data in WARC format, in addition to the native parquet format. This is toggled in the node configuration, available from Index Nodes -> Node N -> Configuration.

[Screenshot: The node configuration panel, showing the `Keep WARC files during crawling` option]

If the option `Keep WARC files during crawling` is enabled, the crawler will retain a WARC record of the crawl.

Output Files

The WARC data is stored in the same location as the rest of the crawl data, in a directory named warc; that is, something like (install-dir)/index-n/storage/crawl_data[...]/warc

Each crawl will have one .idx file and one or more .warc.gz files. The .idx file is a plain text file containing offsets into the .warc.gz files, for ranges of data corresponding to particular domains.

$ ls warc/

The index file is line-based; each line contains the following space-separated fields:

  • WARC File Name
  • Domain Name
  • Start Offset
  • Data Length

For example:

$ cat warc/marginalia-crawl-2024-02-09T16-12-01.674824867.idx
marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz 0 259
marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz 259 19425
marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz 19684 10018
marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz 29702 116852
marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz 146554 644091

In this example, the crawled data corresponding to the fourth entry will be found in the file marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz, in the byte range [29702, 29702 + 116852).
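The index layout can be sketched with a small parser. This is an illustrative helper, not part of the Marginalia codebase; it takes the first field as the WARC file name and the last two as the start offset and data length, so it also tolerates lines where the domain column is absent.

```python
# Illustrative sketch: parse one line of the .idx file.
# First field: WARC file name; last two fields: start offset and data length.
# (A domain name column may appear between them.)

def parse_idx_line(line: str):
    fields = line.split()
    return fields[0], int(fields[-2]), int(fields[-1])

warc_file, start, length = parse_idx_line(
    "marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz 29702 116852")

# The byte range for this entry is [start, start + length);
# its end offset is the start offset of the next entry in the index.
end_exclusive = start + length
```

Note that the exclusive end offset computed here, 146554, is exactly the start offset of the following line in the example index above.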

This is designed to play well with HTTP range queries, a trick that is also used by the Common Crawl project.
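For remote access, a start offset and length from the index translate directly into an HTTP Range header. A hypothetical sketch (the helper name is an assumption, not an API of the project); note that HTTP byte ranges are inclusive on both ends, so the last byte is start + length - 1:

```python
def range_header(start: int, length: int) -> str:
    # HTTP byte ranges (RFC 9110) are inclusive on both ends,
    # so the final byte of the range is start + length - 1.
    return f"bytes={start}-{start + length - 1}"

# For the fourth entry in the example index:
header = range_header(29702, 116852)  # "bytes=29702-146553"
```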
For local access, dd may be used to extract individual domain ranges.

$ dd if=marginalia-crawl-2024-02-09T16-12-01.674824867--0000.warc.gz \
     ibs=1 skip=29702 count=116852 | zcat

Content-Length: 114
Content-Type: application/http;msgtype=request
WARC-Concurrent-To: <urn:uuid:209b4919-d9f6-4f3d-b044-e6dbcadd0f2b>
WARC-Date: 2024-02-09T16:12:02Z
WARC-Record-ID: <urn:uuid:dca93c94-605c-4a76-99fc-8334454ec901>
WARC-Type: request

GET /robots.txt HTTP/1.1
Accept-Encoding: gzip
User-Agent: Mozilla/5.0 (compatible)
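The reason a raw byte range can be decompressed on its own is that each domain's slice of the warc.gz is an independent gzip member. The following sketch demonstrates the seek-and-decompress pattern with a synthetic two-member file standing in for a real crawl archive; the file contents are made up for illustration:

```python
import gzip
import os
import tempfile

# Two independent gzip members, concatenated into one .gz file,
# mirroring how per-domain ranges are laid out in the crawl's warc.gz.
member_a = gzip.compress(b"records for domain A")
member_b = gzip.compress(b"records for domain B")

with tempfile.NamedTemporaryFile(delete=False, suffix=".warc.gz") as f:
    f.write(member_a + member_b)
    path = f.name

# The offsets an .idx line would record for the second member:
start, length = len(member_a), len(member_b)

# Extract just that byte range and decompress it in isolation.
with open(path, "rb") as f:
    f.seek(start)
    blob = f.read(length)

extracted = gzip.decompress(blob).decode()  # "records for domain B"
os.unlink(path)
```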


Per the recommendations in the WARC specification, a new WARC file is created when the current file reaches a size of about 1 GB.
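The rotation scheme suggested by the `--0000` suffix in the example file names could be sketched as below. This is a hypothetical illustration based on the naming seen above, not the crawler's actual code, and the exact threshold is an assumption:

```python
# Hypothetical sketch of size-based WARC file rotation, assuming the
# zero-padded "--NNNN" suffix seen in the example file names.
MAX_SIZE = 1_000_000_000  # ~1 GB, per the WARC spec recommendation

def warc_name(base: str, seq: int) -> str:
    # e.g. "marginalia-crawl-...--0000.warc.gz"
    return f"{base}--{seq:04d}.warc.gz"

def next_file(base: str, seq: int, current_size: int, record_size: int):
    # Roll over to a new file once the current one would exceed the limit.
    if current_size + record_size > MAX_SIZE:
        return warc_name(base, seq + 1), seq + 1, 0
    return warc_name(base, seq), seq, current_size
```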

Note: You may see repeated requests for the root document in the WARC data. This behavior is an expected part of the crawler’s initial discovery process, where the URL schema is probed and sitemaps and RSS feeds are discovered.