
qsv: Blazing-fast CSV data-wrangling toolkit


Table of Contents
qsv logo: Hi-ho "Quicksilver" away!
qsv (pronounced "Quicksilver") is a command line program
for querying, indexing, slicing, analyzing, filtering, enriching,
transforming, sorting, validating & joining CSV files.
Commands are simple, fast & composable.

* Commands
* Installation Options
* Whirlwind Tour / Notebooks
* Cookbook
* FAQ
* Performance Tuning
* 👉 Benchmarks 🚀
* Environment Variables
* Feature Flags
* Goals/Non-goals
* Testing
* NYC School of Data 2022 slides
* Sponsor

Try it out at qsv.dathere.com!

Command Description
apply ✨🚀🧠🤖🔣
Apply a series of string, date, math & currency transformations to the given CSV column/s. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex, language & name gender detection).
applydp 🚀🔣 CKAN
applydp is a slimmed-down version of apply with only Datapusher+ relevant subcommands/operations (qsvdp binary variant only).
behead Drop headers from a CSV.
cat 🗄️
Concatenate CSV files by row or by column.
count 📇🏎️🐻‍❄️
Count the rows in a CSV file. (11.87 seconds for a 15gb, 27m row NYC 311 dataset without an index. Instantaneous with an index.) If the polars feature is enabled, uses Polars' multithreaded, mem-mapped CSV reader for fast counts even without an index.
datefmt 🚀
Formats recognized date fields (19 formats recognized) to a specified date format using strftime date format specifiers.
dedup 🤯🚀
Remove duplicate rows (see also the extdedup, extsort, sort & sortcheck commands).
describegpt 🌐🤖
Infer extended metadata about a CSV using a GPT model from OpenAI's API.
diff 🚀
Find the difference between two CSVs with ludicrous speed!
e.g. compare two CSVs with 1M rows x 9 columns in under 600ms!
enum Add a new column enumerating rows with incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value.
excel 🚀
Exports a specified Excel/ODS sheet to a CSV file.
exclude 📇
Removes a set of CSV data from another set based on the specified columns.
explode 🔣
Explode rows into multiple ones by splitting a column value based on the given separator.
extdedup
Remove duplicate rows from an arbitrarily large CSV/text file using a memory-mapped, on-disk hash table. Unlike the dedup command, this command does not load the entire file into memory nor does it sort the deduped file.
extsort 🚀
Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm.
fetch ✨🧠🌐
Fetches data from web services for every row using HTTP GET. Comes with HTTP/2 adaptive flow control, jql JSON query language support, dynamic throttling (RateLimit) & caching, with optional persistent caching using Redis or a disk-cache.
fetchpost ✨🧠🌐
Similar to fetch, but uses HTTP POST. (HTTP GET vs POST methods)
fill Fill empty values.
fixlengths Force a CSV to have same-length records by either padding or truncating them.
flatten A flattened view of CSV records. Useful for viewing one record at a time.
e.g. qsv slice -i 5 data.csv | qsv flatten.
fmt Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
foreach ✨
Loop over a CSV to execute shell commands. (not available on Windows)
frequency 📇😣🏎️
Build frequency tables of each column. Uses multithreading to go faster if an index is present.
geocode ✨🧠🌐🚀🔣
Geocodes a location against an updatable local copy of the Geonames cities database. With caching and multi-threading, it geocodes up to 360,000 records/sec!
headers 🗄️
Show the headers of a CSV. Or show the intersection of all headers between many CSV files.
index Create an index (📇) for a CSV. This is very quick (even the 15gb, 28m row NYC 311 dataset takes all of 14 seconds to index) & provides constant time indexing/random access into the CSV. With an index, count, sample & slice work instantaneously; random access mode is enabled in luau; and multithreading (🏎️) is enabled for the frequency, split, stats, schema & tojsonl commands.
input Read CSV data with special commenting, quoting, trimming, line-skipping & non-UTF8 encoding handling rules. Typically used to "normalize" a CSV for further processing with other qsv commands.
join Inner, outer, right, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast.
joinp ✨🚀🐻‍❄️
Inner, outer, cross, anti, semi & asof joins using the Pola.rs engine. Unlike the join command, joinp can process files larger than RAM, is multithreaded, has join key validation, pre-join filtering, supports asof joins (which is particularly useful for time series data) & its output doesn't have duplicate columns. However, joinp doesn't have an --ignore-case option & it doesn't support right outer joins.
jsonl 🚀🔣
Convert newline-delimited JSON (JSONL/NDJSON) to CSV. See the tojsonl command to convert CSV to JSONL.

luau 👑 ✨📇🌐🔣 CKAN
Create multiple new computed columns, filter rows, compute aggregations and build complex data pipelines by executing a Luau 0.616 expression/script for every row of a CSV file (sequential mode), or using random access with an index (random access mode).
Can process a single Luau expression or full-fledged data-wrangling scripts using lookup tables with discrete BEGIN, MAIN and END sections.
It is not just another qsv command; it is qsv's Domain-Specific Language (DSL) with numerous qsv-specific helper functions to build production data pipelines.
partition Partition a CSV based on a column value.
pseudo 🔣
Pseudonymise the values of the given column by replacing them with incremental identifiers.
py ✨🔣
Create a new computed column or filter rows by evaluating a Python expression on every row of a CSV file. Python's f-strings are particularly useful for extended formatting, with the ability to evaluate Python expressions as well.
rename Rename the columns of a CSV efficiently.
replace Replace CSV data using a regex. Applies the regex to each field individually.
reverse 📇🤯
Reverse order of rows in a CSV. Unlike the sort --reverse command, it preserves the order of rows with the same key. If an index is present, it works with constant memory. Otherwise, it will load all the data into memory.
safenames
CKAN
Modify headers of a CSV to only have "safe" names - guaranteed "database-ready"/"CKAN-ready" names.
sample 📇🌐🏎️
Randomly draw rows (with optional seed) from a CSV using reservoir sampling, using memory proportional to the sample size. If an index is present, it uses random indexing with constant memory.
schema 📇😣🏎️
Infer schema from CSV data, replete with data type & domain/range validation & output in JSON Schema format. Uses multithreading to go faster if an index is present. See the validate command to use the generated JSON Schema to validate if similar CSVs comply with the schema.
search Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows.
searchset Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows.
select Select, re-order, duplicate or drop columns.
slice 📇🏎️
Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
snappy 🚀🌐
Does streaming compression/decompression of the input using Google's Snappy framing format (more info).
sniff 🌐 CKAN
Quickly sniff & infer CSV metadata (delimiter, header row, preamble rows, quote character, flexible, is_utf8, average record length, number of records, content length & estimated number of records if sniffing a CSV on a URL, number of fields, field names & data types). It is also a general mime type detector.
sort 🚀🤯
Sorts CSV data in alphabetical (with case-insensitive option), numerical, reverse, unique or random (with optional seed) order (see also the extsort & sortcheck commands).
sortcheck 📇
Check if a CSV is sorted. With the --json option, also retrieve record count, sort breaks & duplicate count.
split 📇🏎️
Split one CSV file into many CSV files. It can split by number of rows, number of chunks or file size. Uses multithreading to go faster if an index is present when splitting by rows or chunks.
sqlp ✨🚀🐻‍❄️🗄️
Run Polars SQL queries against several CSVs - converting queries to blazing-fast LazyFrame expressions, processing larger-than-memory CSV files.
stats 📇🤯🏎️
Compute summary statistics (sum, min/max/range, min/max length, mean, stddev, variance, nullcount, sparsity, quartiles, IQR, lower/upper fences, skewness, median, mode/s, antimode/s & cardinality) & make GUARANTEED data type inferences (Null, String, Float, Integer, Date, DateTime, Boolean) for each column in a CSV.
Uses multithreading to go faster if an index is present (with an index, it can compile "streaming" stats on NYC's 311 data (15gb, 28m rows) in less than 7.3 seconds).
table 🤯
Show aligned output of a CSV using elastic tabstops. To interactively view CSV files, qsv pairs well with csvlens.
to ✨🚀🗄️
Convert CSV files to PostgreSQL, SQLite, XLSX, Parquet and Data Package.
tojsonl 📇😣🚀🔣
Smartly converts CSV to newline-delimited JSON (JSONL/NDJSON). By scanning the CSV first, it "smartly" infers the appropriate JSON data type for each column. See the jsonl command to convert JSONL to CSV.
transpose 🤯
Transpose rows/columns of a CSV.
validate 📇🚀🌐
Validate CSV data blazingly fast using JSON Schema Validation & put invalid records into a separate file with an accompanying detailed validation error report file (e.g. up to 930,000 rows/second using NYC's 311 schema generated by the schema command).
If no JSON schema file is provided, validates if a CSV conforms to the RFC 4180 standard and is UTF-8 encoded.
Performance metrics compiled on an M2 Pro 12-core Mac Mini with 32gb RAM

✨: enabled by a feature flag.
📇: uses an index when available.
🤯: loads entire CSV into memory, though dedup, stats & transpose have "streaming" modes as well.
😣: uses additional memory proportional to the cardinality of the columns in the CSV.
🧠: expensive operations are memoized with available inter-session Redis/Disk caching for fetch commands.
🗄️: extended input support.
🐻‍❄️: command powered by the polars 0.39.1 engine.
🤖: command uses Natural Language Processing & General AI techniques.
🏎️: multithreaded and/or faster when an index (📇) is available.
🚀: multithreaded even without an index.
CKAN: has CKAN-aware integration options.
🌐: has web-aware options.
🔣: requires UTF-8 encoded input.
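
To get a feel for how these commands compose, here are a few representative invocations (a hedged sketch - the file name nyc311.csv is just a placeholder):

# create an index to enable multithreading & instantaneous counts
qsv index nyc311.csv

# count the rows
qsv count nyc311.csv

# compute summary statistics for every column
qsv stats nyc311.csv --output nyc311-stats.csv

# inspect the first record, one field per line
qsv slice -i 0 nyc311.csv | qsv flatten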

Installation Options

Option 1: Download Prebuilt Binaries

Full-featured prebuilt binary variants of the latest qsv version for Linux, macOS & Windows are available for download, including binaries compiled with Rust Nightly (more info).

These prebuilt binaries are also built with CPU optimizations enabled for x86_64 (e.g. SSE4.2, AVX2, AVX512, etc. on Intel and AMD processors) and Apple Silicon processors (ARM64 SIMD NEON) for even more performance gains.

For Windows, an MSI Installer wrapping the x86_64-pc-windows-msvc build is also available for download.

For macOS, "ad-hoc" signatures are used to sign our binaries, so you will need to set appropriate Gatekeeper security settings or run the following command to remove the quarantine attribute from qsv before you run it for the first time:

# replace qsv with qsvlite or qsvdp if you installed those binary variants
xattr -d com.apple.quarantine qsv

Verifying the Integrity of the Prebuilt Binaries Zip Archives

All prebuilt binary zip archives are signed with zipsign and can be verified with the public key qsv-zipsign-public.key. To verify the integrity of the downloaded zip archives:

# if you don't have zipsign installed yet
cargo install zipsign

# verify the integrity of the downloaded prebuilt binary zip archive
# after downloading the zip archive and the qsv-zipsign-public.key file.
# replace <PREBUILT-BINARY-ARCHIVE.zip> with the name of the downloaded zip archive
# e.g. zipsign verify zip qsv-0.118.0-aarch64-apple-darwin.zip qsv-zipsign-public.key
zipsign verify zip <PREBUILT-BINARY-ARCHIVE.zip> qsv-zipsign-public.key

Option 2: Package Managers & Distributions

qsv is also distributed by several package managers and distros. Check each package manager's/distro's documentation for installation instructions.


Note that the qsv builds provided by these package managers/distros often do not enable all features (Homebrew, for instance, only enables the apply and luau features).

To find out what features are enabled in a package/distro's qsv, run qsv --version (more info).

Option 3: Install with Rust

If you have Rust installed, you can also install from source using Rust's cargo command¹:

cargo install qsv --locked --features all_features

The binary will be installed in ~/.cargo/bin.

To install different variants and enable optional features, use cargo install with the --features option (see Feature Flags for more info):

# to install qsv with all features enabled
cargo install qsv --locked --bin qsv --features feature_capable,apply,luau,fetch,foreach,python,to,self_update,polars
# or shorthand
cargo install qsv --locked --bin qsv -F all_features

# or enable only the apply and polars features
cargo install qsv --locked --bin qsv -F feature_capable,apply,polars

# or to install qsvlite
cargo install qsv --locked --bin qsvlite -F lite

# or to install qsvdp
cargo install qsv --locked --bin qsvdp -F datapusher_plus,luau,polars

Option 4: Compile from Source

Compiling from source also works similarly¹:

git clone https://github.com/jqnatividad/qsv.git
cd qsv
cargo build --release --locked --bin qsv --features all_features

The compiled binary will end up in ./target/release/.

To compile different variants and enable optional features:

# to compile qsv with all features enabled
cargo build --release --locked --bin qsv --features feature_capable,apply,luau,fetch,foreach,python,to,self_update,polars
# shorthand
cargo build --release --locked --bin qsv -F all_features

# or build qsv with only the fetch and foreach features enabled
cargo build --release --locked --bin qsv -F feature_capable,fetch,foreach

# for qsvlite
cargo build --release --locked --bin qsvlite -F lite

# for qsvdp
cargo build --release --locked --bin qsvdp -F datapusher_plus,luau,polars

NOTE: To build with Rust nightly, see Nightly Release Builds.

Variants

There are four binary variants of qsv:

  • qsv - feature-capable (✨), with the prebuilt binaries enabling all applicable features except Python²
  • qsvpy - same as qsv but with the Python feature enabled. Three subvariants are available - qsvpy310, qsvpy311 & qsvpy312 - which are compiled with Python 3.10, 3.11 & 3.12 respectively.
  • qsvlite - all features disabled (~13% of the size of qsv)
  • qsvdp - optimized for use with DataPusher+ with only DataPusher+ relevant commands; an embedded luau interpreter; applydp, a slimmed-down version of the apply feature; the --progressbar option disabled; and the self-update only checking for new releases, requiring an explicit --update (~12% of the size of qsv).

Regular Expression Syntax

The --select option and several commands (apply, applydp, datefmt, exclude, fetchpost, replace, schema, search, searchset, select, sqlp & stats) allow the user to specify regular expressions. We use the regex crate to parse, compile and execute these expressions.³

Its syntax can be found here and "is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched."

If you want to test your regular expressions, regex101 supports the syntax used by the regex crate. Just select the "Rust" flavor.
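
For example, a hedged sketch of regex usage (data.csv and the zip column are placeholder names):

# show only rows where any field matches a US ZIP code pattern
qsv search '^\d{5}(-\d{4})?$' data.csv

# apply the regex only to the zip column
qsv search --select zip '^\d{5}$' data.csv

# case-insensitively replace "N/A" fields with an empty string
qsv replace '(?i)^N/A$' '' data.csv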

File formats

qsv recognizes UTF-8/ASCII encoded CSV (.csv) & TSV files (.tsv & .tab). CSV files are assumed to have "," (comma) as a delimiter, and TSV files, "\t" (tab). The delimiter is a single ASCII character that can be set either by the --delimiter command-line option or the QSV_DEFAULT_DELIMITER environment variable, or automatically detected when QSV_SNIFF_DELIMITER is set.

When using the --output option, qsv will UTF-8 encode the file & automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv & .tab files.
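
For instance (a minimal sketch - file names are placeholders), converting the delimiter is just a matter of the output extension:

# read a comma-delimited file & write a tab-delimited one;
# the .tsv extension tells qsv to use tabs in the output
qsv input data.csv --output data.tsv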

JSONL/NDJSON files are also recognized & converted to/from CSV with the jsonl and tojsonl commands respectively.

The fetch & fetchpost commands also produce JSONL files when invoked without the --new-column option, & TSV files with the --report option.

The excel, safenames, sniff, sortcheck & validate commands produce JSON files with their JSON options, following the JSON API 1.1 specification, so they can return detailed machine-friendly metadata that can be used by other systems.

The schema command produces a JSON Schema Validation (Draft 7) file with the ".schema.json" file extension, which can be used with the validate command to validate other CSV files with an identical schema.
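
A typical workflow (a sketch; data.csv is a placeholder):

# infer a JSON Schema from a representative CSV;
# this writes data.csv.schema.json
qsv schema data.csv

# validate another CSV with the same layout against that schema
qsv validate new-data.csv data.csv.schema.json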

The excel command recognizes Excel & Open Document Spreadsheet (ODS) files (.xls, .xlsx, .xlsm, .xlsb & .ods files).

Speaking of Excel, if you're having trouble opening qsv-generated CSV files in Excel, set the QSV_OUTPUT_BOM environment variable to add a Byte Order Mark to the beginning of the generated CSV file. This is a workaround for Excel's UTF-8 encoding detection bug.

The to command converts CSVs to .xlsx, Parquet & Data Package files, and populates PostgreSQL and SQLite databases.

The sqlp command returns query results in CSV, JSON, JSONL, Parquet, Apache Arrow IPC & Apache AVRO formats. Polars SQL also supports reading external files directly in various formats with its read_csv, read_ndjson, read_parquet & read_ipc table functions.

The sniff command can also detect the mime type of any file with the --no-infer or --just-mime options, be it local or remote (http and https schemes supported). It can detect more than 130 file formats, including MS Office/Open Document files, JSON, XML, PDF, PNG, JPEG and specialized geospatial formats like GPX, GML, KML, TML, TMX, TSX, TTML. Click here for a complete list.
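
For example (a sketch with placeholder paths):

# infer CSV metadata - delimiter, header row, field names & types, etc.
qsv sniff data.csv

# only detect the mime type of a local or remote file
qsv sniff --just-mime https://example.com/unknown-file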

Extended Input Support

The cat, headers, sqlp & to commands have extended input support (🗄️). If the input is - or empty, the command will try to use stdin as input. If not, it will check if the input is a directory, and if so, add all the files in the directory as input files.

If it's a file, it will first check if it has an .infile-list extension. If it does, it will load the text file and parse each line as an input file path. This is a much faster and more convenient way to process a large number of input files, without having to pass them all as separate command-line arguments. Further, the file paths can be anywhere in the file system, even on separate volumes. If an input file path is not fully qualified, it will be treated as relative to the current working directory. Empty lines and lines starting with # are ignored. Invalid file paths will be logged as warnings and skipped.

For both directory and .infile-list input, snappy compressed files with a .sz extension will be automatically decompressed.

Finally, if it's just a regular file, it will be treated as a regular input file.
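
A sketch of both forms (the directory and .infile-list paths are placeholders):

# concatenate every file in a directory by row
qsv cat rows ./daily-exports

# or list input paths (one per line; # starts a comment)
# in a .infile-list file and pass that instead
qsv cat rows batch.infile-list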

Snappy Compression/Decompression

qsv supports automatic compression/decompression using the Snappy frame format. Snappy was chosen instead of more popular compression formats like gzip because it was designed for high-performance streaming compression & decompression (up to 2.58 gb/sec compression, 0.89 gb/sec decompression).

For all commands except the index, extdedup & extsort commands, if the input file has an ".sz" extension, qsv will automatically do streaming decompression as it reads it. Further, if the input file has an extended CSV/TSV ".sz" extension (e.g. nyc311.csv.sz/nyc311.tsv.sz/nyc311.tab.sz), qsv will also use the file extension to determine the delimiter to use.

Similarly, if the --output file has an ".sz" extension, qsv will automatically do streaming compression as it writes it. If the output file has an extended CSV/TSV ".sz" extension, qsv will also use the file extension to determine the delimiter to use.

Note however that compressed files cannot be indexed, so index-accelerated commands (frequency, schema, split, stats, tojsonl) will not be multithreaded. Random access is also disabled without an index, so slice will not be instantaneous and luau's random-access mode will not be available.

There is also a dedicated snappy command with four subcommands for direct snappy file operations: a multithreaded compress subcommand (4-5x faster than the built-in, single-threaded auto-compression); a decompress subcommand with detailed compression metadata; a check subcommand to quickly inspect if a file has a Snappy header; and a validate subcommand to confirm if a Snappy file is valid.

The snappy command can be used to compress/decompress ANY file, not just CSV/TSV files.
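
A sketch of direct snappy operations (file names are placeholders; the flags assume qsv's usual --output convention):

# multithreaded compression to the Snappy frame format
qsv snappy compress nyc311.csv --output nyc311.csv.sz

# streaming decompression
qsv snappy decompress nyc311.csv.sz --output nyc311.csv

# quickly check whether a file has a Snappy header
qsv snappy check nyc311.csv.sz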

Using the snappy command, we can compress NYC's 311 data (15gb, 28m rows) to 4.95 gb in 5.77 seconds with the multithreaded compress subcommand - 2.58 gb/sec with a 0.33 (3.01:1) compression ratio. With snappy decompress, we can roundtrip decompress the same file in 16.71 seconds - 0.89 gb/sec.

Compare that to zip 3.0, which compressed the same file to 2.9 gb in 248.3 seconds on the same machine - 43x slower at 0.06 gb/sec with a 0.19 (5.17:1) compression ratio - for just an additional 14% (2.45 gb) of saved space. zip also took 4.3x longer to roundtrip decompress the same file in 72 seconds - 0.20 gb/sec.

RFC 4180 CSV Standard

qsv follows the RFC 4180 CSV standard. However, in real life, CSV formats vary significantly & qsv is actually not strictly compliant with the specification so it can process "real-world" CSV files. qsv leverages the awesome Rust CSV crate to read/write CSV files.

Click here to find out more about how qsv conforms to the standard using this crate.

When dealing with "atypical" CSV files, you can use the input command to normalize them to be RFC 4180-compliant.
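
For example (a minimal sketch with placeholder file names):

# normalize an "atypical" CSV - handling odd quoting, comments,
# preamble rows & encoding issues per input's options -
# into an RFC 4180-compliant, UTF-8 encoded file
qsv input messy.csv --output clean.csv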

UTF-8 Encoding

qsv requires UTF-8 encoded input (of which ASCII is a subset).

Should you need to re-encode CSV/TSV files, you can use the input command to "lossy save" to UTF-8 - replacing invalid UTF-8 sequences with � (U+FFFD REPLACEMENT CHARACTER).

Alternatively, if you want to truly transcode to UTF-8, there are several utilities like iconv that you can use to do so on Linux/macOS & Windows.
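
For instance, with iconv (a sketch; the source encoding here is an assumption you'd replace with your file's actual encoding):

# transcode a Windows-1252 encoded CSV to UTF-8
iconv -f WINDOWS-1252 -t UTF-8 legacy.csv > legacy-utf8.csv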

Windows Powershell and Windows Excel Usage Note

Unlike other modern operating systems, Microsoft Windows' default encoding is UTF16-LE. This will cause problems when redirecting qsv's output to a CSV file in Powershell & trying to open it with Excel - everything will be in the first column, as the UTF16-LE encoded CSV file will not be properly recognized by Excel.

# the following command will produce a UTF16-LE encoded CSV file on Windows
qsv stats wcp.csv > wcpstats.csv

Which is weird, since you'd think Microsoft's own Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file on Windows, use the --output option instead:

# so instead of redirecting stdout to a file on Windows
qsv stats wcp.csv > wcpstats.csv

# do this instead, so it will be properly UTF-8 encoded
qsv stats wcp.csv --output wcpstats.csv

Alternatively, qsv can add a Byte Order Mark (BOM) to the beginning of a CSV to indicate it's UTF-8 encoded. You can do this by setting the QSV_OUTPUT_BOM environment variable to 1.

This will allow Excel on Windows to properly recognize the CSV file as UTF-8 encoded.
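
In PowerShell, that looks like this (a sketch, reusing wcp.csv from above):

# set the variable for the current session
$env:QSV_OUTPUT_BOM = "1"
qsv stats wcp.csv --output wcpstats.csv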

Note that this is not a problem with Excel on macOS, as macOS (like most other *nixes) uses UTF-8 as its default encoding.

Nor is it a problem with qsv output files produced on other operating systems, as Excel on Windows can properly recognize UTF-8 encoded CSV files.

Interpreters

For complex data-wrangling tasks, you can use Luau and Python scripts.

Luau is recommended over Python for complex data-wrangling tasks as it is faster, more memory-efficient, has no external dependencies and, as qsv's DSL, has several data-wrangling helper functions.

See Luau vs Python for more info.
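
To illustrate (a hedged sketch - it assumes a data.csv with numeric price and qty columns):

# compute a new column with Luau (columns are addressable by name)
qsv luau map total "price * qty" data.csv

# the equivalent with the python feature enabled
qsv py map total "float(price) * float(qty)" data.csv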

Memory Management

qsv supports three memory allocators - mimalloc (default), jemalloc and the standard allocator.
See Memory Allocator for more info.

It also has Out-of-Memory prevention, with two modes - NORMAL (default) & CONSERVATIVE.
See Out-of-Memory Prevention for more info.

Environment Variables & dotenv file support

qsv supports an extensive list of environment variables and supports .env files to set them.

For details, see Environment Variables and the dotenv.template.yaml file.
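
For example (a sketch; the delimiter value and file names are placeholders):

# set an environment variable for a single invocation
QSV_DEFAULT_DELIMITER=';' qsv headers data.csv

# or persist settings in a .env file in the working directory
echo "QSV_DEFAULT_DELIMITER=;" >> .env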

Feature Flags

qsv has several feature flags that can be used to enable/disable optional features.

See Features for more info.

Minimum Supported Rust Version

qsv's MSRV policy is to require the latest stable Rust version that is supported by Homebrew. However, if the latest Rust stable has been released more than a week ago and Homebrew has not yet updated its Rust formula, qsv will go ahead and require the latest stable Rust version.

Goals / Non-Goals

QuickSilver's goals, in priority order, are to be:

  • As Fast as Possible - To do so, it has frequent releases, an aggressive MSRV policy, takes advantage of CPU features, employs various caching strategies, uses HTTP/2, and is multithreaded when possible and when it makes sense. See Performance for more info.
  • Able to Process Very Large Files - Most qsv commands are streaming, using constant memory, and can process arbitrarily large CSV files. For those commands that require loading the entire CSV into memory (denoted by 🤯), qsv has Out-of-Memory prevention, batch processing strategies and "ext"ernal commands that use the disk to process larger-than-memory files. See Memory Management for more info.
  • A Complete Data-Wrangling Toolkit - qsv aims to be a comprehensive data-wrangling toolkit that you can use for quick analysis and investigations, but is also robust enough for production data pipelines. Its many commands are targeted towards common data-wrangling tasks and can be combined/composed into complex data-wrangling scripts with its Luau-based DSL.
    Luau will also serve as the backbone of a whole library of qsv recipes - reusable scripts for common tasks (e.g. street-level geocoding, removing PII, data enrichment, etc.) that prompt for easily modifiable parameters.
  • Composable/Interoperable - qsv is designed to be composable, with a focus on interoperability with other common CLI tools like 'awk', 'xargs', 'ripgrep', 'sed', etc., and with well known ETL/ELT tools like Airbyte, Airflow, Pentaho Kettle, etc. Its commands can be combined with other tools via pipes (see the pipeline sketch after this list), and it supports other common file formats like JSON/JSONL, Parquet, Arrow IPC, Avro, Excel, ODS, PostgreSQL, SQLite, etc. See File Formats for more info.
  • As Portable as Possible - qsv is designed to be portable, with installers on several platforms with an integrated self-update mechanism. In preference order, it supports Linux, macOS and Windows. See Installation Options for more info.
  • As Easy to Use as Possible - qsv is designed to be easy to use. As easy to use, that is, as command line interfaces go 🤷. Its commands have numerous options but have sensible defaults. The usage text is written for a data analyst audience, not developers; and there are numerous examples in the usage text, with the tests doubling as examples as well. In the future, it will also have a Graphical User Interface (GUI).
  • As Secure as Possible - qsv is designed to be secure. It has no external runtime dependencies, is written in Rust, and its codebase is automatically audited for security vulnerabilities with automated DevSkim, "cargo audit" and Codacy Github Actions workflows.
    It uses the latest stable Rust version, with an aggressive MSRV policy and the latest version of all its dependencies. It has an extensive test suite with ~1,300 tests, including several property tests which randomly generate parameters for oft-used commands. It also has a Security Policy.
    Its prebuilt binary archives are zipsigned, so you can verify their integrity. Its self-update mechanism automatically verifies the integrity of the prebuilt binaries archive before applying an update.
  • As Easy to Contribute to as Possible - qsv is designed to be easy to contribute to, with a focus on maintainability. Its architecture allows the easy addition of self-contained commands gated by feature flags, the source code is heavily commented, the usage text is embedded, and there are helper functions that make it easy to create tests. See Features and Contributing for more info.
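
A minimal composable pipeline, as referenced above (a sketch with placeholder file names):

# merge two yearly extracts, drop duplicate rows,
# then summarize & pretty-print the statistics
qsv cat rows 2023.csv 2024.csv | qsv dedup | qsv stats | qsv table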

QuickSilver's non-goals are to be:

  • As Small as Possible - qsv is designed to be small, but not at the expense of performance, features, composability, portability, usability, security or maintainability. However, we do have a qsvlite variant that is ~13% of the size of qsv and a qsvdp variant that is ~12% of the size of qsv. Those variants, however, have reduced functionality. Further, several commands are gated behind feature flags, so you can compile qsv with only the features you need.
  • Multi-lingual - qsv's usage text and messages are English-only. There are no plans to support other languages. This does not mean it can only process English input files.
    It can process well-formed CSVs in any language so long as they're UTF-8 encoded. Further, it supports alternate delimiters/separators other than comma; the apply whatlang operation detects 69 languages; and its apply thousands, currency and eudex operations support different languages and country conventions for number, currency and date parsing/formatting.
    Finally, though the default Geonames index of the geocode command is English-only, the index can be rebuilt with the geocode index-update subcommand with the --languages option to return place names in multiple languages (with support for 253 languages).

Testing

qsv has ~1,350 tests in the tests directory. Each command has its own test suite in a separate file with the convention test_<COMMAND>.rs. Apart from preventing regressions, the tests also serve as good illustrative examples, and are often linked from the usage text of each corresponding command.

To test each binary variant:

# to test qsv
cargo test --features all_features

# to test qsvlite
cargo test --features lite
# to test all tests with "stats" in the name with qsvlite
cargo test stats --features lite

# to test qsvdp
cargo test --features datapusher_plus,luau,polars

# to test a specific command
# here we test only stats and use the
# t alias for test and the -F shortcut for --features
cargo t stats -F all_features

# to test a specific command with a specific feature
# here we test only luau command with the luau feature
cargo t luau -F feature_capable,luau

# to test the count command with multiple features
cargo t count -F feature_capable,luau,polars

# to test using an alternate allocator
# other than the default mimalloc allocator
cargo t --no-default-features -F all_features,jemallocator

License

Dual-licensed under MIT or the UNLICENSE.


Origins

Quicksilver (qsv) is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. On top of xsv's 20 commands, it adds numerous new features; 37 additional commands; 4 apply subcommands & 36 operations; 5 to subcommands; 3 cat subcommands; 7 geocode subcommands & 4 index operations; and 4 snappy subcommands. See FAQ for more details.

Sponsor

qsv was made possible by datHere:
Standards-based, best-of-breed, open source solutions
to make your Data Useful, Usable & Used.

Naming Collision

This project is unrelated to Intel's Quick Sync Video.

Footnotes

  1. Of course, you'll also need a linker & a C compiler. Linux users should generally install GCC or Clang, according to their distribution's documentation. For example, if you use Ubuntu, you can install the build-essential package. On macOS, you can get a C compiler by running $ xcode-select --install. For Windows, this means installing Visual Studio 2022. When prompted for workloads, include "Desktop Development with C++", the Windows 10 or 11 SDK & the English language pack, along with any other language packs you require.

  2. The foreach feature is not available on Windows. The luau feature is enabled by default on the prebuilt binaries if the platform supports it.

  3. This is the same regex engine used by ripgrep - the blazingly fast grep replacement that powers Visual Studio Code's magical "Find in Files" feature.

qsv's People

Contributors

a5dur, alerque, burntsushi, dependabot[bot], dimagog, emk, ericsoroos, josephfrazier, jqnatividad, kbd, kerollmops, kianmeng, kindly, kper, kyegupov, lalaithion, ldcasillas-progreso, lemmingavalanche, lukaskalbertodt, mchesser, mhuang74, minhajuddin2510, mintyplanet, pjsier, ronohm, rzmk, scpike, sd2k, tafia, yomguithereal


qsv's Issues

RUSTSEC-2020-0036: failure is officially deprecated/unmaintained


Details
Status unmaintained
Package failure
Version 0.1.8
URL rust-lang-deprecated/failure#347
Date 2020-05-02

The failure crate is officially end-of-life: it has been marked as deprecated
by the former maintainer, who has announced that there will be no updates or
maintenance work on it going forward.

The following are some suggested actively developed alternatives to switch to:

See advisory page for additional details.

"Heavy-duty" configurable `geocode` command

qsv bundles reverse-geocoder - a "lightweight" static, nearest city geonames geocoder.

But for real, street-level geocoding, we need a configurable geocoder that can use the user's geocoder backend of choice.

For the initial implementation of a heavy-weight geocoder, we'll start in order of implementation:

  • pelias (because it's open-source, and users can stand up their own customizable pelias geocoder instance; no ToS prohibiting caching results, etc.)
  • google geocoder

Other geocoder backends in the backlog:

This geocoder will be its own qsv command - geocode - unlike the current lightweight one, which is just one of many apply operations.

Add `validate` command

Checks a CSV against a jsonschema file.

The jsonschema file can be located on the filesystem or at a URL.

Apply rustfmt

With the release of 0.16.1, all the major pending pull requests from xsv have been merged into qsv.

IMHO, we can now apply rustfmt to the whole project and standardize the code to rustfmt standards.

This will put us in a better position to directly accept PRs.

Feature Request: deduplicate columns/extract unique columns

Cross reference BurntSushi/xsv#283

We can use qsv dedup or the Unix command line tools sort and uniq to remove duplicate rows in a plain text table, but I find myself wanting to do something similar with duplicated columns.

For example, after doing qsv join ... there will be at least one pair of duplicated columns (the values used for the join).

I am hoping for something like a column based version of the row based qsv dedup command (see #26).

I suspect I could workaround this via the qsv transpose command (see #3).

Create `schema` command

stats does a great job of not only getting descriptive stats about a CSV, but also inferring each column's data type.
frequency compiles a frequency table.

The schema command will use the output of stats, and optionally frequency (to specify the valid range of a field), to create a json schema file that can be used with the validate command (#46) to validate a CSV against the generated schema.

With the combo addition of schema and validate, qsv can be used in a more bullet-proof automated data pipeline that can fail gracefully when there are data quality issues:

  • use schema to create a json schema from a representative CSV file for a feed
  • adjust the schema to fine-tune the validation rules
  • use validate at the beginning of a data pipeline and fail gracefully when validate fails
  • for extra large files, use sample to validate against a sample
  • or alternatively, partition the CSV to break down the pipeline into smaller jobs

RUSTSEC-2020-0071: Potential segfault in the time crate


Details
Package time
Version 0.1.43
URL time-rs/time#293
Date 2020-11-18
Patched versions >=0.2.23
Unaffected versions =0.2.0, =0.2.1, =0.2.2, =0.2.3, =0.2.4, =0.2.5, =0.2.6

Impact

Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.

The affected functions from time 0.2.7 through 0.2.22 are:

  • time::UtcOffset::local_offset_at
  • time::UtcOffset::try_local_offset_at
  • time::UtcOffset::current_local_offset
  • time::UtcOffset::try_current_local_offset
  • time::OffsetDateTime::now_local
  • time::OffsetDateTime::try_now_local

The affected functions in time 0.1 (all versions) are:

  • at
  • at_utc

Non-Unix targets (including Windows and wasm) are unaffected.

Patches

Pending a proper fix, the internal method that determines the local offset has been modified to always return None on the affected operating systems. This has the effect of returning an Err on the try_* methods and UTC on the non-try_* methods.

Users and library authors with time in their dependency tree should perform cargo update, which will pull in the updated, unaffected code.

Users of time 0.1 do not have a patch and should upgrade to an unaffected version: time 0.2.23 or greater, or the 0.3 series.

Workarounds

No workarounds are known.

References

time-rs/time#293

See advisory page for additional details.

Remove py command

As it makes building qsv more difficult, with its various version and platform/architecture dependencies.

Lua should be more than good enough: it has no external dependencies as it's meant to be embeddable, and you can even call Lua scripts.

Closes #55.

Package qsv in conda-forge (as done for xsv)

I currently use xsv installed from conda via the conda-forge community package collection,

https://anaconda.org/conda-forge/xsv
https://github.com/conda-forge/xsv-feedstock/tree/master/recipe

I would like to do the same for qsv since I increasingly find myself wanting to use functionality only available in this more up to date fork (thank you!).

I am not familiar with Rust, but it ought to be straightforward to package qsv with an almost cut-and-paste copy of that recipe, so I am willing to attempt this, having previously contributed recipes for other packages to conda-forge.

RUSTSEC-2020-0159: Potential segfault in `localtime_r` invocations


Details
Package chrono
Version 0.4.19
URL chronotope/chrono#499
Date 2020-11-10

Impact

Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.

Workarounds

No workarounds are known.

References

See advisory page for additional details.

Add "normalize" command

To normalize data inside the CSV:

  • convert dates to ISO-8601 format
  • optionally add additional date-based columns to the CSV
    • weekday
    • week number
    • year
    • month
    • day
    • hour
    • minute
    • second
    • timezone
  • convert null fields of specified columns to a specified value (e.g. "N/A", "None", "0", "Not specified", etc.)
