xan is a command line tool that can be used to process CSV files directly from the shell.
It is written in Rust to be as fast as possible and to use as little memory as possible, and it can easily handle large CSV files (gigabytes). It leverages a novel SIMD CSV parser and is also able to parallelize some computations (through multithreading) so that tasks complete as fast as your hardware allows.
It can easily preview, filter, slice, aggregate, sort and join CSV files, and exposes a large collection of composable commands that can be chained together to perform a wide variety of typical tasks.
xan also offers its own expression language so you can perform complex tasks that cannot be done with the simpler commands alone. This minimalistic language is tailored to CSV data and is much faster to evaluate than typical dynamically-typed languages such as Python, Lua or JavaScript.
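For instance (a small illustration borrowing an expression from the quick tour below, assuming a CSV file with name and foundation_year columns), an expression can derive a new column on the fly:
xan map 'fmt("{} ({})", name, foundation_year) as label' file.csv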
Note that this tool was originally a fork of BurntSushi's xsv, but it has since been almost entirely rewritten to fit SciencesPo's médialab use cases, rooted in web data collection and analysis geared towards social sciences (you might think CSV is outdated by now, but read our love letter to the format before judging too quickly).
xan therefore goes beyond typical data manipulation and exposes utilities related to lexicometry, graph theory and even scraping.
Beyond CSV data, xan is able to process a large variety of CSV-adjacent data formats from many different disciplines, such as web archiving (.cdx) or bioinformatics (.vcf, .gtf, .sam, .bed etc.). xan is also able to convert to & from many data formats such as JSON, Excel files, numpy arrays etc. using xan to and xan from. See this section for more detail.
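For instance, newline-delimited JSON can be converted to CSV like so (file names here are just examples):
xan from -f ndjson records.ndjson -o records.csv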
Finally, xan can be used to display CSV files in the terminal, for easy exploration, and can even be used to draw basic data visualisations:
- How to install
- Quick tour
- Available commands
- General flags and IO model
- Expression language reference
- Cookbook
- News
- How to cite?
- Frequently Asked Questions
xan can be installed using cargo (it usually comes with Rust):
cargo install xan --locked
You can also tweak the build flags to make sure the Rust compiler is able to leverage all your CPU's features:
CARGO_BUILD_RUSTFLAGS='-C target-cpu=native' cargo install xan --locked
You can also install the latest dev version thusly:
cargo install --git https://github.com/medialab/xan --locked
xan can be installed using Scoop on Windows:
scoop bucket add extras
scoop install xan
xan can be installed with Homebrew on macOS thusly:
brew install xan
On Arch Linux, you can install xan from the extra repository using pacman:
sudo pacman -S xan
On NetBSD, a package is available from the official repositories. To install xan simply run:
pkgin install xan
xan is packaged for Nix and is available in Nixpkgs as of the 25.05 release. To install it, you may add it to your environment.systemPackages as pkgs.xan or use nix-shell to enter an ephemeral shell:
nix-shell -p xan
xan can be installed on Linux, macOS and Windows using the Pixi package manager:
pixi global install xan
Pre-built binaries can be found attached to every GitHub release.
Currently supported targets include:
- x86_64-apple-darwin
- x86_64-unknown-linux-gnu
- x86_64-unknown-linux-musl
- x86_64-pc-windows-msvc
- aarch64-apple-darwin
- aarch64-unknown-linux-gnu
ppc64le targets are not built by the CI yet but prebuilt binaries can still be found in the conda-forge package's files if you need them.
Feel free to open a PR to improve the CI by adding relevant targets.
Note that xan also exposes handy automatic completions for command and header/column names that you can install through the xan completions command.
Run the following command to understand how to install those completions:
xan completions -h
# With zsh you might also need to add this to your initialization to make
# sure Bash compatibility is loaded:
autoload -Uz bashcompinit && bashcompinit
Let's learn about the most commonly used xan commands by exploring a corpus of French medias:
curl -LO https://github.com/medialab/corpora/raw/master/polarisation/medias.csv
xan headers medias.csv
0 webentity_id
1 name
2 prefixes
3 home_page
4 start_pages
5 indegree
6 hyphe_creation_timestamp
7 hyphe_last_modification_timestamp
8 outreach
9 foundation_year
10 batch
11 edito
12 parody
13 origin
14 digital_native
15 mediacloud_ids
16 wheel_category
17 wheel_subcategory
18 has_paywall
19 inactive
xan count medias.csv
478
xan view medias.csv
Displaying 5/20 cols from 10 first rows of medias.csv
┌───┬───────────────┬───────────────┬────────────┬───┬─────────────┬──────────┐
│ - │ name │ prefixes │ home_page │ … │ has_paywall │ inactive │
├───┼───────────────┼───────────────┼────────────┼───┼─────────────┼──────────┤
│ 0 │ Acrimed.org │ http://acrim… │ http://ww… │ … │ false │ <empty> │
│ 1 │ 24matins.fr │ http://24mat… │ https://w… │ … │ false │ <empty> │
│ 2 │ Actumag.info │ http://actum… │ https://a… │ … │ false │ <empty> │
│ 3 │ 2012un-Nouve… │ http://2012u… │ http://ww… │ … │ false │ <empty> │
│ 4 │ 24heuresactu… │ http://24heu… │ http://24… │ … │ false │ <empty> │
│ 5 │ AgoraVox │ http://agora… │ http://ww… │ … │ false │ <empty> │
│ 6 │ Al-Kanz.org │ http://al-ka… │ https://w… │ … │ false │ <empty> │
│ 7 │ Alalumieredu… │ http://alalu… │ http://al… │ … │ false │ <empty> │
│ 8 │ Allodocteurs… │ http://allod… │ https://w… │ … │ false │ <empty> │
│ 9 │ Alterinfo.net │ http://alter… │ http://ww… │ … │ <empty> │ true │
│ … │ … │ … │ … │ … │ … │ … │
└───┴───────────────┴───────────────┴────────────┴───┴─────────────┴──────────┘
On unix, don't hesitate to use the -p flag to automagically forward the full output to an appropriate pager and skim through all the columns.
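For instance (a small illustration of the -p flag mentioned just above):
xan view -p medias.csv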
# NOTE: drop -c to avoid truncating the values
xan flatten -c medias.csv
Row n°0
───────────────────────────────────────────────────────────────────────────────
webentity_id 1
name Acrimed.org
prefixes http://acrimed.org|http://acrimed69.blogspot…
home_page http://www.acrimed.org
start_pages http://acrimed.org|http://acrimed69.blogspot…
indegree 61
hyphe_creation_timestamp 1560347020330
hyphe_last_modification_timestamp 1560526005389
outreach nationale
foundation_year 2002
batch 1
edito media
parody false
origin france
digital_native true
mediacloud_ids 258269
wheel_category Opinion Journalism
wheel_subcategory Left Wing
has_paywall false
inactive <empty>
Row n°1
───────────────────────────────────────────────────────────────────────────────
webentity_id 2
...
xan search -s outreach internationale medias.csv | xan view
Displaying 4/20 cols from 10 first rows of <stdin>
┌───┬──────────────┬────────────────────┬───┬─────────────┬──────────┐
│ - │ webentity_id │ name │ … │ has_paywall │ inactive │
├───┼──────────────┼────────────────────┼───┼─────────────┼──────────┤
│ 0 │ 25 │ Businessinsider.fr │ … │ false │ <empty> │
│ 1 │ 59 │ Europe-Israel.org │ … │ false │ <empty> │
│ 2 │ 66 │ France 24 │ … │ false │ <empty> │
│ 3 │ 220 │ RFI │ … │ false │ <empty> │
│ 4 │ 231 │ fr.Sott.net │ … │ false │ <empty> │
│ 5 │ 246 │ Voltairenet.org │ … │ true │ <empty> │
│ 6 │ 254 │ Afp.com /fr │ … │ false │ <empty> │
│ 7 │ 265 │ Euronews FR │ … │ false │ <empty> │
│ 8 │ 333 │ Arte.tv │ … │ false │ <empty> │
│ 9 │ 341 │ I24News.tv │ … │ false │ <empty> │
│ … │ … │ … │ … │ … │ … │
└───┴──────────────┴────────────────────┴───┴─────────────┴──────────┘
xan select foundation_year,name medias.csv | xan view
Displaying 2 cols from 10 first rows of <stdin>
┌───┬─────────────────┬───────────────────────────────────────┐
│ - │ foundation_year │ name │
├───┼─────────────────┼───────────────────────────────────────┤
│ 0 │ 2002 │ Acrimed.org │
│ 1 │ 2006 │ 24matins.fr │
│ 2 │ 2013 │ Actumag.info │
│ 3 │ 2012 │ 2012un-Nouveau-Paradigme.com │
│ 4 │ 2010 │ 24heuresactu.com │
│ 5 │ 2005 │ AgoraVox │
│ 6 │ 2008 │ Al-Kanz.org │
│ 7 │ 2012 │ Alalumieredunouveaumonde.blogspot.com │
│ 8 │ 2005 │ Allodocteurs.fr │
│ 9 │ 2005 │ Alterinfo.net │
│ … │ … │ … │
└───┴─────────────────┴───────────────────────────────────────┘
xan sort -s foundation_year medias.csv | xan view -s name,foundation_year
Displaying 2 cols from 10 first rows of <stdin>
┌───┬────────────────────────────────────┬─────────────────┐
│ - │ name │ foundation_year │
├───┼────────────────────────────────────┼─────────────────┤
│ 0 │ Le Monde Numérique (Ouest France) │ <empty> │
│ 1 │ Le Figaro │ 1826 │
│ 2 │ Le journal de Saône-et-Loire │ 1826 │
│ 3 │ L'Indépendant │ 1846 │
│ 4 │ Le Progrès │ 1859 │
│ 5 │ La Dépêche du Midi │ 1870 │
│ 6 │ Le Pélerin │ 1873 │
│ 7 │ Dernières Nouvelles d'Alsace (DNA) │ 1877 │
│ 8 │ La Croix │ 1883 │
│ 9 │ Le Chasseur Francais │ 1885 │
│ … │ … │ … │
└───┴────────────────────────────────────┴─────────────────┘
# Some medias of our corpus have the same ids on mediacloud.org
xan dedup -s mediacloud_ids medias.csv | xan count && xan count medias.csv
457
478
Deduplicating can also be done while sorting:
xan sort -s mediacloud_ids -u medias.csv
xan frequency -s edito medias.csv | xan view
Displaying 3 cols from 5 rows of <stdin>
┌───┬───────┬────────────┬───────┐
│ - │ field │ value │ count │
├───┼───────┼────────────┼───────┤
│ 0 │ edito │ media │ 423 │
│ 1 │ edito │ individu │ 30 │
│ 2 │ edito │ plateforme │ 14 │
│ 3 │ edito │ agrégateur │ 10 │
│ 4 │ edito │ agence │ 1 │
└───┴───────┴────────────┴───────┘
xan frequency -s edito medias.csv | xan hist
Histogram for edito (bars: 5, sum: 478, max: 423):
media |423 88.49%|━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━|
individu | 30 6.28%|━━━╸ |
plateforme | 14 2.93%|━╸ |
agrégateur | 10 2.09%|━╸ |
agence | 1 0.21%|╸ |
xan stats -s indegree,edito medias.csv | xan transpose | xan view -I
Displaying 2 cols from 14 rows of <stdin>
┌─────────────┬───────────────────┬────────────┐
│ field │ indegree │ edito │
├─────────────┼───────────────────┼────────────┤
│ count │ 463 │ 478 │
│ count_empty │ 15 │ 0 │
│ type │ int │ string │
│ types │ int|empty │ string │
│ sum │ 25987 │ <empty> │
│ mean │ 56.12742980561554 │ <empty> │
│ variance │ 4234.530197929737 │ <empty> │
│ stddev │ 65.07326792108829 │ <empty> │
│ min │ 0 │ <empty> │
│ max │ 424 │ <empty> │
│ lex_first │ 0 │ agence │
│ lex_last │ 99 │ plateforme │
│ min_length │ 0 │ 5 │
│ max_length │ 3 │ 11 │
└─────────────┴───────────────────┴────────────┘
xan filter 'batch > 1' medias.csv | xan count
130
To access the expression language's cheatsheet, run xan help cheatsheet. To display the full list of available functions, run xan help functions.
xan map 'fmt("{} ({})", name, foundation_year) as key' medias.csv | xan select key | xan slice -l 10
key
Acrimed.org (2002)
24matins.fr (2006)
Actumag.info (2013)
2012un-Nouveau-Paradigme.com (2012)
24heuresactu.com (2010)
AgoraVox (2005)
Al-Kanz.org (2008)
Alalumieredunouveaumonde.blogspot.com (2012)
Allodocteurs.fr (2005)
Alterinfo.net (2005)
To access the expression language's cheatsheet, run xan help cheatsheet. To display the full list of available functions, run xan help functions.
xan transform name 'split(name, ".") | first | upper' medias.csv | xan select name | xan slice -l 10
name
ACRIMED
24MATINS
ACTUMAG
2012UN-NOUVEAU-PARADIGME
24HEURESACTU
AGORAVOX
AL-KANZ
ALALUMIEREDUNOUVEAUMONDE
ALLODOCTEURS
ALTERINFO
To access the expression language's cheatsheet, run xan help cheatsheet. To display the full list of available functions, run xan help functions.
xan agg 'sum(indegree) as total_indegree, mean(indegree) as mean_indegree' medias.csv | xan view -I
Displaying 1 col from 1 rows of <stdin>
┌────────────────┬───────────────────┐
│ total_indegree │ mean_indegree │
├────────────────┼───────────────────┤
│ 25987 │ 56.12742980561554 │
└────────────────┴───────────────────┘
To access the expression language's cheatsheet, run xan help cheatsheet. To display the full list of available functions, run xan help functions. Finally, to display the list of available aggregation functions, run xan help aggs.
xan groupby edito 'sum(indegree) as indegree' medias.csv | xan view -I
Displaying 1 col from 5 rows of <stdin>
┌────────────┬──────────┐
│ edito │ indegree │
├────────────┼──────────┤
│ agence │ 50 │
│ agrégateur │ 459 │
│ plateforme │ 658 │
│ media │ 24161 │
│ individu │ 659 │
└────────────┴──────────┘
To access the expression language's cheatsheet, run xan help cheatsheet. To display the full list of available functions, run xan help functions. Finally, to display the list of available aggregation functions, run xan help aggs.
- help: Get help regarding the expression language
Explore & visualize
- count (c): Count rows in file
- headers (h): Show header names
- view (v): Preview a CSV file in a human-friendly way
- flatten: Display a flattened version of each row of a file
- hist: Print a histogram with rows of CSV file as bars
- plot: Draw a scatter plot or line chart
- heatmap: Draw a heatmap of a CSV matrix
- progress: Display a progress bar while reading CSV data
Search & filter
- search: Search for (or replace) patterns in CSV data
- grep: Coarse but fast filtering of CSV data
- filter: Only keep some CSV rows based on an evaluated expression
- head: First rows of CSV file
- tail: Last rows of CSV file
- slice: Slice rows of CSV file
- top: Find top rows of a CSV file according to some column
- sample: Randomly sample CSV data
Sort & deduplicate
Aggregate
- frequency (freq): Show frequency tables
- groupby: Aggregate data by groups of a CSV file
- stats: Compute basic statistics
- agg: Aggregate data from CSV file
- bins: Dispatch numeric columns into bins
- window: Compute window aggregations (cumsum, rolling mean, lag etc.)
Combine multiple CSV files
- cat: Concatenate by row or column
- join: Join CSV files
- fuzzy-join: Join a CSV file with another containing patterns (e.g. regexes)
- merge: Merge multiple similar already sorted CSV files
Add, transform, drop and move columns
- select: Select columns from a CSV file
- drop: Drop columns from a CSV file
- map: Create a new column by evaluating an expression on each CSV row
- transform: Transform a column by evaluating an expression on each CSV row
- enum: Enumerate CSV file by prepending an index column
- flatmap: Emit one row per value yielded by an expression evaluated for each CSV row
- fill: Fill empty cells
- blank: Blank down contiguous identical cell values
Format, convert & recombobulate
- behead: Drop header from CSV file
- rename: Rename columns of a CSV file
- input: Read unusually formatted CSV data
- fixlengths: Make all rows have the same length
- fmt: Format CSV output (change field delimiter)
- explode: Explode rows based on some column separator
- implode: Collapse consecutive identical rows based on a diverging column
- from: Convert a variety of formats to CSV
- to: Convert a CSV file to a variety of data formats
- scrape: Scrape HTML into CSV data
- reverse: Reverse rows of CSV data
- transpose (t): Transpose CSV file
- pivot: Split distinct values of a column into their own columns
- unpivot: Stack multiple columns into fewer columns
Split a CSV file into multiple
Parallelization
- parallel (p): Map-reduce-like parallel computation
Generate CSV files
- range: Create a CSV file from a numerical range
Lexicometry & fuzzy matching
- tokenize: Tokenize a text column
- vocab: Build a vocabulary over tokenized documents
- cluster: Cluster CSV data to find near-duplicates
Matrix & network-related commands
Debug
- eval: Evaluate/debug a single expression
If you ever feel lost, each command has a -h/--help flag that will print the related documentation.
If you need help about the expression language, check out the help command itself:
# Help about help ;)
xan help --help
All xan commands expect a "standard" CSV file, i.e. comma-delimited, with proper double-quote escaping. That said, xan is also perfectly able to infer the delimiter from typical file extensions such as .tsv, .tab, .psv, .ssv or .scsv.
If you need to process a file with a custom delimiter, you can either use the xan input command or use the -d/--delimiter flag available with all commands.
If you need to output a custom CSV dialect (e.g. using ; delimiters), feel free to use the xan fmt command.
Finally, while most xan commands won't even need to decode the file's bytes, some might still need to. In that case, xan expects correctly formatted UTF-8 text. Please use iconv or other utilities to convert other encodings such as latin1 ahead of xan.
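For instance, a hypothetical latin1-encoded file can be re-encoded on the fly:
iconv -f latin1 -t utf-8 legacy.csv | xan view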
Even if it is good practice to name your columns, some CSV files simply don't have headers. Most commands are able to deal with those files if you pass the -n/--no-headers flag.
Note that this flag always relates to the input, not the output. If for some reason you want to drop a CSV output's header row, use the xan behead command.
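For instance (headerless.csv being a hypothetical file without a header row):
xan view -n headerless.csv
# Conversely, to drop the header row from an output:
xan behead medias.csv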
By default, all commands will try to read from stdin when the file path is not specified. This makes piping easy and comfortable as it respects typical unix standards. Some commands may have multiple inputs (xan join, for instance), in which case stdin is usually specifiable using the - character:
# First file given to join will be read from stdin
cat file1.csv | xan join col1 - col2 file2.csv
Note that the command will also warn you when stdin cannot be read, in case you forgot to indicate the file's path.
By default, all commands will print their output to stdout (note that this output is usually buffered for performance reasons).
In addition, all commands expose a -o/--output flag that can be used to specify where to write the output. This can be useful if you do not want to or cannot use > (typically in some Windows shells). In that case, passing - as the output path will also mean forwarding to stdout, which can be handy when scripting.
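For instance (the output file name is just an example):
xan sort -s foundation_year medias.csv -o sorted.csv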
xan is able to process a large variety of CSV-adjacent data formats out of the box:
- .csv files will be understood as comma-separated.
- .tsv & .tab files will be understood as tab-separated.
- .scsv & .ssv files will be understood as semicolon-separated.
- .psv files will be understood as pipe-separated.
- .cdx files (an index file format related to web archives) will be understood as space-separated and will have their magic bytes dropped.
- .ndjson & .jsonl files will be understood as tab-separated, headless and null-byte-quoted, so you can easily use them with xan commands (e.g. parsing or wrangling JSON data using the expression language to aggregate, even in parallel). If you need a more thorough conversion of newline-delimited JSON data, check out the xan from -f ndjson command instead.
- .vcf files (Variant Call Format) from bioinformatics are supported out of the box. They will be stripped of their header data and considered as tab-delimited.
- .gtf & .gff2 files (Gene Transfer Format) from bioinformatics are supported out of the box. They will be stripped of their header data and considered as headless & tab-delimited.
- .sam files (Sequence Alignment Map) from bioinformatics are supported out of the box. They will be stripped of their header data and considered as headless & tab-delimited.
- .bed files (Browser Extensible Data) from bioinformatics are supported out of the box. They will be stripped of their header data and considered as headless & tab-delimited.
Note that more exotic delimiters can always be handled using the ubiquitous -d, --delimiter flag.
Some additional formats (e.g. .gff, .gff3) are also supported but must first be normalized using the xan input command because their cells must be trimmed or because they have comment lines to be skipped.
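For instance, a hypothetical pipe-delimited file that does not use the .psv extension can still be read by giving the delimiter explicitly:
xan count -d '|' export.dat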
Note also that UTF-8 BOMs are always stripped from the data when processed.
xan is able to read gzipped files (having a .gz extension). It is also able to leverage .gzi indices (usually created through bgzip) when seeking is necessary (constant time reversing, parallelization etc.).
xan is also able to read files compressed with Zstandard (having a .zst extension).
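For instance (file names here are just examples):
xan count big.csv.gz
xan view big.csv.zst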
Some xan commands print ANSI colors in the terminal by default, typically view, flatten, etc.
All those commands have a standard --color=(auto|always|never) flag to tweak the colouring behavior if you need it (note that colors are not printed when commands are piped, by default).
They also respect typical environment variables related to ANSI colouring, such as NO_COLOR, CLICOLOR & CLICOLOR_FORCE, as documented here.
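For instance, colors can be forced when piping to a pager that understands ANSI escape codes (a small illustration):
xan view --color=always medias.csv | less -R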
- Cheatsheet
- Comprehensive list of functions & operators
- Comprehensive list of aggregation functions
- Comprehensive list of window aggregation functions
- Scraping DSL
- Merging frequency tables, three ways
- Parsing and visualizing dates with xan
- Joining files by URL prefixes
- Miscellaneous
For news about the tool's evolution, feel free to read:
xan is published on Zenodo as 10.5281/zenodo.15310200.
You can cite it thusly:
Guillaume Plique, Béatrice Mazoyer, Laura Miguel, César Pichon, Anna Charles, & Julien Pontoire. (2025). xan, the CSV magician. (0.50.0). Zenodo. https://doi.org/10.5281/zenodo.15310200
Rotate your screen ;)