4.1 Model Files
The following files are stored in the model/ directory:
File | Description |
---|---|
English.DICT | RDRPosTagger dictionary |
English.RDR | RDRPosTagger model |
lid.176.ftz | fastText language identification model |
opennlp-sentence.bin | OpenNLP sentence detector model |
opennlp-tokens.bin | OpenNLP tokenizer model |
tfreq-new-algo3.bin | Marginalia term frequency model |
ngrams.bin | Marginalia n-grams model |
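Before starting the services, it can be useful to confirm that the model directory is complete. The following is a minimal sketch in Python and is not part of Marginalia itself. The file names come from the table above, while the helper name `missing_model_files` and the `model` path are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED_MODEL_FILES = [
    "English.DICT",
    "English.RDR",
    "lid.176.ftz",
    "opennlp-sentence.bin",
    "opennlp-tokens.bin",
    "tfreq-new-algo3.bin",
    "ngrams.bin",
]


def missing_model_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    base = Path(model_dir)
    return [name for name in EXPECTED_MODEL_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    # "model" is a hypothetical path; point it at your own model directory.
    for name in missing_model_files("model"):
        print(f"missing: {name}")
```

An empty result means every file listed in the table was found. The check only looks at file names, not at whether the contents are valid models.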
The RDRPosTagger models are used for fast part-of-speech tagging. These models, along with additional ones, are available from the RDRPOSTagger git repository.
The fastText language identification model is used to identify the language of a document. See the fastText documentation for more information.
The OpenNLP models are used for sentence detection and tokenization. See the OpenNLP documentation for more information.
tfreq-new-algo3.bin is a Marginalia-specific term frequency model. It contains hashed terms and their frequencies. The file can be generated from crawl data in the control interface: under Node N -> Actions -> Export From Crawl Data, select ‘Extract term frequency data’. The model is used for term frequency calculations in the search engine. It’s also available for download from the Marginalia Search Downloads website.
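To illustrate what "hashed terms and their frequencies" means, here is a small conceptual sketch. It is not Marginalia's implementation or on-disk format. The hash function (the first 64 bits of SHA-256), the in-memory counter, the choice to count documents rather than occurrences, and the names `term_hash` and `count_terms` are all assumptions chosen for illustration. The only point is that frequencies are keyed by a fixed-size hash of the term rather than by the term string itself.

```python
import hashlib
from collections import Counter


def term_hash(term):
    """Map a term to a 64-bit integer (illustrative hash, not Marginalia's)."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def count_terms(documents):
    """Count, per hashed term, the number of documents that contain it."""
    freqs = Counter()
    for doc in documents:
        # Use a set so each term is counted at most once per document.
        freqs.update({term_hash(t) for t in doc.split()})
    return freqs


docs = ["the quick fox", "the lazy dog", "the fox"]
freqs = count_terms(docs)
print(freqs[term_hash("the")])  # 3
print(freqs[term_hash("fox")])  # 2
print(freqs[term_hash("cat")])  # 0
```

Because only hashes are stored, a lookup needs to hash the query term the same way, and the original term strings cannot be read back from the table.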
ngrams.bin is a binary index of n-grams. It’s used in query processing. There is currently no way of generating this file from crawl data; it’s only available for download from the Marginalia Search Downloads website.