A
unique(1)
pipeline filter
Because I don't want to change the name to something lame, this package is not published on crates.io. To install, run the following command.
cargo install --git https://github.com/archer884/unique
In all likelihood, your computer already has a command that does this, and the odds are you can probably get by with either of them. Read on to discover if unique
can be useful to you.
PowerShell (pwsh
) offers the Select-Object
cmdlet which, called with the -Unique
flag, does almost exactly what unique
does. I imagine that Select-Object
is using the CLR's GetHashCode()
mechanism, but I am far too lazy to dig into this to find out. This command operates on object streams rather than text streams, which makes it more versatile than unique
. For working with text streams, however, you can expect unique
to offer slightly better performance characteristics due to the innate inefficiency of string handling in pwsh
and in CLR languages generally.
Of course, pwsh
is also available on macOS, but it isn't installed by default. Considerations are identical.
If instead you use a shell like bash
, your built-in option is uniq(1)
, which is more efficient than both Select-Object
and unique
, but which will also fail to filter non-consecutive repetitions. As a demonstration, try the following command using the test file in this repository: cat ./resource/test-file.txt | uniq
. The string "One" will appear twice in the output. This is by design: uniq
is written for maximum efficiency and, as such, does not retain the full text being filtered in memory, which is necessary to discover non-consecutive repetitions.
unique
differs slightly from both of these commands.
In comparison to the CLR-based Select-Object
, unique
offers more efficient operation on text streams.
This is because unique
reads the entirety of stdin at once before printing only unique lines. Only a single input buffer is ever allocated, with the resulting individual lines being represented in memory as slices of that buffer rather than as discrete strings, as would be necessary in the CLR.
Note: Yes, I know that the CLR now has
Span<T>
. Whether or not this technology has been incorporated intoSelect-Object
is not known to me. Feel free to look it up and let me know. :)
Allowing for the necessary reduction in sheer efficiency, unique
offers an advantage over the built-in command uniq
in that it will faithfully exclude repeated lines even when those repetitions are not back to back. The primary difference you'll see as a user is that your pipelines can be shorter (uniq
must often be combined with sort
to address this shortcoming) and you can expect your input and output strings to appear in the same order (because, obviously, you didn't have to sort them). Compare the following pipelines:
cat foo.txt | sort | uniq
cat foo.txt | unique
You will save five entire characters. Your children will thank you.