4.1 Model Files

The model/ directory contains the following files:

File	Description
English.DICT	RDRPosTagger dictionary
English.RDR	RDRPosTagger model
lid.176.ftz	fasttext language identification model
opennlp-sentence.bin	OpenNLP sentence detector model
opennlp-tokens.bin	OpenNLP tokenizer model
tfreq-new-algo3.bin	Marginalia term frequency model
ngrams.bin	Marginalia n-grams model
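
As a quick sanity check before starting the system, the file list above can be compared against what is actually on disk. The following is a minimal sketch, not part of Marginalia itself: the helper name `missing_model_files` is hypothetical, and the directory to check is whatever path your installation uses for model/.

```python
from pathlib import Path

# File names copied from the table above.
EXPECTED_MODEL_FILES = [
    "English.DICT",
    "English.RDR",
    "lid.176.ftz",
    "opennlp-sentence.bin",
    "opennlp-tokens.bin",
    "tfreq-new-algo3.bin",
    "ngrams.bin",
]


def missing_model_files(model_dir):
    """Return the expected file names that are not regular files in model_dir.

    Hypothetical helper for illustration; it only checks presence,
    not whether the files are valid models.
    """
    base = Path(model_dir)
    return [name for name in EXPECTED_MODEL_FILES if not (base / name).is_file()]
```

Calling `missing_model_files("model")` from the installation root returns an empty list when every file in the table is present, and otherwise the names that still need to be downloaded or generated.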

The RDRPosTagger models are used for fast part-of-speech tagging. These and additional models are available at the RDRPOSTagger git repository.

The fasttext language identification model is used to identify the language of a document. See the fasttext documentation for more information.

The OpenNLP models are used for sentence detection and tokenization. See the OpenNLP documentation for more information.

tfreq-new-algo3.bin is a Marginalia-specific term frequency model. It contains hashed terms and their frequencies, and is used for term frequency calculations in the search engine. The file can be generated from crawl data in the control interface: go to Node N -> Actions -> Export From Crawl Data and select ‘Extract term frequency data’. It is also available for download from the Marginalia Search Downloads website.
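
To illustrate the general idea of storing hashed terms with their frequencies, here is a small in-memory sketch. It does not read or write tfreq-new-algo3.bin and does not reflect its actual binary format. The hash function (BLAKE2b truncated to 64 bits), the whitespace tokenization, and the function names are all assumptions chosen for the example.

```python
import hashlib
from collections import Counter


def term_hash(term):
    """Stable 64-bit hash of a term.

    Assumption: BLAKE2b is used here only because it is deterministic
    across runs; it is not claimed to be the hash Marginalia uses.
    """
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def count_hashed_terms(documents):
    """Count term occurrences across documents, keyed by term hash."""
    counts = Counter()
    for text in documents:
        # Naive lowercase/whitespace tokenization, for illustration only.
        for term in text.lower().split():
            counts[term_hash(term)] += 1
    return counts


def frequency(counts, term):
    """Look up a term's count by hashing it first; unseen terms give 0."""
    return counts.get(term_hash(term), 0)
```

Keying the table by hash rather than by the term string keeps every entry a fixed size, at the cost of not being able to recover the original terms from the table.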

ngrams.bin is a binary index of n-grams used in query processing. There is currently no way of generating this file from crawl data; it is only available for download from the Marginalia Search Downloads website.