Datamash, a tool for common data-related operations in Unix shells
Having used PowerShell and its rich set of cmdlets, e.g., Where-Object, ForEach-Object, and Group-Object, I find the bash shell limiting. Most of the capabilities offered by PowerShell cmdlets have to be realized in bash either by combining other commands or by writing scripts. Specifically, I am annoyed by the lack of pipeline-friendly commands for basic data-related operations, e.g., summing the numbers in a file, grouping values in a file, or getting the frequency of words in a file.
This changed today when I learned about GNU’s datamash, a command that performs basic numeric, textual, and statistical operations on textual input data files. Now, instead of summing a sequence of numbers with
(seq 10 | tr '\n' '+' ; echo '0' ) | bc
I can just do :)
seq 10 | datamash sum 1
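To make the comparison concrete, here is a small sketch that runs both pipelines side by side and then asks datamash for several summaries of the same column in one call. It assumes that datamash and bc are installed and on the PATH; the choice of seq 10 as input is only for illustration.

```shell
#!/bin/sh
# Assumes GNU datamash and bc are installed.

# Old way: turn "1\n2\n...\n10\n" into "1+2+...+10+0" and let bc evaluate it.
(seq 10 | tr '\n' '+'; echo 0) | bc

# datamash way: sum the values in column 1.
seq 10 | datamash sum 1

# Several operations on the same column in one call.
# The results come back on one line, separated by tabs.
seq 10 | datamash sum 1 min 1 max 1 mean 1
```

Both of the first two commands print 55. The last one prints the sum, minimum, maximum, and mean of the numbers 1 through 10 as four tab-separated fields.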
For my purposes, I found datamash’s support for the following operations most useful.
- Basic summarization operations on both numeric data (e.g., sum, min, max) and textual data (e.g., count, first, last).
- Basic statistical operations, e.g., mean, median, q1, q3, iqr, and mode.
- Basic data transformation operations such as groupby and transpose.
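The operations above can be sketched with a few one-liners. The sample data is made up for illustration, and the sketch assumes datamash is installed and that the input is tab-separated, which is datamash's default field separator.

```shell
#!/bin/sh
# Assumes GNU datamash is installed. All input data below is hypothetical.

# Grouping: -g 1 groups by column 1, and -s sorts the input first so that
# equal keys are adjacent. Per group, count and sum the values in column 2.
printf 'a\t1\nb\t2\na\t3\nb\t5\n' | datamash -s -g 1 count 2 sum 2

# Statistics: median, first and third quartiles, and interquartile range
# of the numbers 1..9.
seq 9 | datamash median 1 q1 1 q3 1 iqr 1

# Transformation: transpose swaps rows and columns.
printf '1\t2\t3\n4\t5\t6\n' | datamash transpose

# Word frequency: put one word per line, then group by the word and count.
printf 'to be or not to be\n' | tr -s ' ' '\n' | datamash -s -g 1 count 1
```

The grouping example reports a count of 2 for each key, with sums of 4 for a and 7 for b. The word frequency example is the bash counterpart of what I would do with Group-Object in PowerShell.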
While I have used csvkit in the past for some of the above operations, I suspect that I will be using datamash to perform these operations in the future.
If you crunch data, then you should try out datamash.