Marginalia Search Engine Software Documentation

4.2 Data Files

The data/ directory contains the following files:

File                      Description
adblock.txt               Adblock rules
asn-data-raw-table        CIDR->ASN data
asn-used-autnums          ASN->AS registry data
IP2LOCATION-LITE-DB1.CSV  IP2Location data
atags.parquet             Anchor tags

adblock.txt is used to detect ads and other problematic content in the crawled documents.

The ASN files (asn-data-raw-table and asn-used-autnums) are sourced from APNIC and are used to map IP addresses to the corresponding CIDR block and autonomous system.
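The CIDR->ASN mapping can be illustrated with a small longest-prefix lookup. This is only a sketch, not the Marginalia implementation: it assumes that each line of asn-data-raw-table holds a CIDR prefix and an AS number separated by whitespace. The function names are hypothetical, and the sample rows use documentation address ranges and private-use AS numbers rather than real data.

```python
import ipaddress

def parse_prefix_table(lines):
    """Parse 'prefix<whitespace>asn' lines into (network, asn) pairs.

    The two-column layout is an assumption about asn-data-raw-table.
    """
    table = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        network = ipaddress.ip_network(parts[0], strict=False)
        table.append((network, int(parts[1])))
    return table

def lookup_asn(table, ip):
    """Return (network, asn) for the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for network, asn in table:
        if addr.version != network.version or addr not in network:
            continue
        # Prefer the longest (most specific) matching prefix
        if best is None or network.prefixlen > best[0].prefixlen:
            best = (network, asn)
    return best

# Made-up sample rows: documentation prefixes and private-use AS numbers
sample = [
    "192.0.2.0/24\t64512",
    "192.0.0.0/16\t64513",
    "198.51.100.0/24\t64514",
]
table = parse_prefix_table(sample)
match = lookup_asn(table, "192.0.2.10")
```

A linear scan keeps the sketch short. For a full routing table, a prefix trie or a sorted interval index would be a more realistic choice.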

The IP2Location data is from https://lite.ip2location.com/ and is available under CC-BY-SA 4.0.
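As a rough illustration of how such a file can be queried, the sketch below looks up a country code by binary search over IP ranges. It assumes that IP2LOCATION-LITE-DB1.CSV has four quoted columns (range start and range end as integers, country code, country name). Verify that layout against the downloaded file. The function names are hypothetical and the sample rows are made up.

```python
import bisect
import csv
import io
import ipaddress

def ip_int(ip):
    """Convert a dotted-quad IPv4 address to its integer form."""
    return int(ipaddress.IPv4Address(ip))

def load_ranges(fileobj):
    """Read (ip_from, ip_to, code, name) rows; the column layout is an assumption."""
    rows = []
    for row in csv.reader(fileobj):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        rows.append((int(row[0]), int(row[1]), row[2], row[3]))
    rows.sort()
    starts = [r[0] for r in rows]  # range starts, used for binary search
    return starts, rows

def lookup_country(index, ip):
    """Return the country code of the range containing ip, or None."""
    starts, rows = index
    n = ip_int(ip)
    i = bisect.bisect_right(starts, n) - 1  # last range starting at or before n
    if i >= 0 and rows[i][0] <= n <= rows[i][1]:
        return rows[i][2]
    return None

# Made-up sample rows using documentation address ranges and placeholder codes
sample_csv = (
    f'"{ip_int("192.0.2.0")}","{ip_int("192.0.2.255")}","XA","Example A"\n'
    f'"{ip_int("198.51.100.0")}","{ip_int("198.51.100.255")}","XB","Example B"\n'
)
index = load_ranges(io.StringIO(sample_csv))
code = lookup_country(index, "192.0.2.77")
```

Sorting the ranges once and searching with bisect keeps each lookup logarithmic in the number of rows.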

The atags.parquet file is available from https://downloads.marginalia.nu, and contains anchor texts. It’s also possible to use the Marginalia software to export your own anchor text data.