Spam Assassin

The Spam Assassin public mail corpus.

Usage

var corpus = require( '@stdlib/datasets/spam-assassin' );

corpus()

Returns the Spam Assassin public mail corpus.

var data = corpus();
// returns [{...},{...},...]

Each array element has the following fields:

  • id: message id (relative to message group)
  • group: message group
  • checksum: object containing checksum info
  • text: message text (including headers)
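Because text includes the headers, a common first step is separating the header block from the body. The following is a minimal sketch, assuming the first blank line marks the boundary (the usual e-mail convention); splitMessage and the sample record are hypothetical and not part of this package.

```javascript
/**
* Splits a raw message into a header block and a body.
*
* Assumption: the first blank line separates headers from body.
*/
function splitMessage( text ) {
    var match = /\r?\n\r?\n/.exec( text );
    if ( match === null ) {
        // No blank line found; treat the entire text as headers:
        return {
            'headers': text,
            'body': ''
        };
    }
    return {
        'headers': text.slice( 0, match.index ),
        'body': text.slice( match.index + match[ 0 ].length )
    };
}

// Hypothetical record shaped like a corpus element (values are made up):
var msg = {
    'id': '00001',
    'group': 'easy-ham-1',
    'text': 'From: a@example.com\nSubject: Hello\n\nBody line 1\nBody line 2\n'
};

var parts = splitMessage( msg.text );
console.log( parts.headers );
console.log( parts.body );
```

With the package installed, the same helper could be applied to data[ i ].text for each element returned by corpus().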

The message group may be one of the following:

  • easy-ham-1: easier-to-detect non-spam e-mails (2500 messages)
  • easy-ham-2: easier-to-detect non-spam e-mails collected at a later date (1400 messages)
  • hard-ham-1: harder-to-detect non-spam e-mails (250 messages)
  • spam-1: spam e-mails (500 messages)
  • spam-2: spam e-mails collected at a later date (1396 messages)
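To illustrate how the group field might be used, the sketch below tallies messages per group. countByGroup is a hypothetical helper, and the sample array is made-up data that only mimics the shape of the corpus elements.

```javascript
/**
* Counts the number of messages in each group.
*/
function countByGroup( data ) {
    var counts = {};
    var g;
    var i;
    for ( i = 0; i < data.length; i++ ) {
        g = data[ i ].group;
        counts[ g ] = ( counts[ g ] || 0 ) + 1;
    }
    return counts;
}

// Made-up records carrying only the `group` field:
var sample = [
    { 'group': 'easy-ham-1' },
    { 'group': 'spam-1' },
    { 'group': 'spam-1' },
    { 'group': 'hard-ham-1' }
];

var counts = countByGroup( sample );
console.log( counts );
```

Passing the array returned by corpus() instead of sample should yield one tally per group listed above.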

The checksum object contains the following fields:

  • type: checksum type (e.g., MD5)
  • value: checksum value
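One possible use of the checksum object is an integrity check. The sketch below rests on assumptions this README does not confirm: that the digest is computed over the text field, that value is hex-encoded, and that the lowercased type is a valid Node.js hash algorithm name. verifyChecksum and the sample record are hypothetical; a mismatch on real data may simply mean one of these assumptions does not hold.

```javascript
var crypto = require( 'crypto' );

/**
* Recomputes a digest over `msg.text` and compares it to `msg.checksum.value`.
*
* Assumptions (unconfirmed): the digest covers `text` and is hex-encoded.
*/
function verifyChecksum( msg ) {
    var algo = msg.checksum.type.toLowerCase();
    var hash = crypto.createHash( algo ).update( msg.text, 'utf8' ).digest( 'hex' );
    return hash === msg.checksum.value.toLowerCase();
}

// Made-up record; the value is the MD5 digest of the string 'hello':
var msg = {
    'text': 'hello',
    'checksum': {
        'type': 'MD5',
        'value': '5d41402abc4b2a76b9719d911017c592'
    }
};

console.log( verifyChecksum( msg ) );
// => true
```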

Examples

var corpus = require( '@stdlib/datasets/spam-assassin' );

var data;
var i;

data = corpus();
for ( i = 0; i < data.length; i++ ) {
    console.log( 'Character Count: %d', data[ i ].text.length );
}

CLI

Usage

Usage: spam-assassin [options]

Options:

  -h,    --help                Print this message.
  -V,    --version             Print the package version.
         --format fmt          Output format: 'txt' or 'ndjson'.

Notes

  • The CLI supports two output formats: plain text (txt) and newline-delimited JSON (ndjson). The default output format is txt.
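Output captured with --format ndjson could be read back line by line. The sketch below assumes each non-empty line is a single JSON-serialized record with the fields described earlier; parseNDJSON is a hypothetical helper and the sample string is made up.

```javascript
/**
* Parses newline-delimited JSON into an array of objects.
*
* Assumption: each non-empty line is one complete JSON value.
*/
function parseNDJSON( str ) {
    var lines = str.split( /\r?\n/ );
    var out = [];
    var i;
    for ( i = 0; i < lines.length; i++ ) {
        // Skip blank lines, such as the one following a trailing newline:
        if ( lines[ i ].trim() !== '' ) {
            out.push( JSON.parse( lines[ i ] ) );
        }
    }
    return out;
}

// Made-up NDJSON input with two records:
var sample = '{"id":"00001","group":"spam-1"}\n{"id":"00002","group":"spam-1"}\n';

var records = parseNDJSON( sample );
console.log( records.length );
// => 2
```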

Examples

$ spam-assassin

License

The data files (databases) are licensed under an Open Data Commons Public Domain Dedication & License 1.0 and their contents are licensed under Creative Commons Zero v1.0 Universal. The software is licensed under Apache License, Version 2.0.