Giter VIP home page Giter VIP logo

counter-rs's Introduction

counter

Counter counts recurrent elements of iterables. It is based on the Python implementation.

The struct Counter is the entry-point type for this module.

Math Underpinnings

Mathematically, a Counter implements a hash-based version of a multiset, or bag. This is simply an extension of the notion of a set to the idea that we care not only about whether an entity exists within the set, but the number of occurrences within the set. Normal set operations such as intersection, union, etc. are of course still supported.

Cargo Features

  • serde implements serde::Serialize and serde::Deserialize for Counter.

Examples

Just count an iterable

use counter::Counter;
let char_counts = "barefoot".chars().collect::<Counter<_>>();
let counts_counts = char_counts.values().collect::<Counter<_>>();

Update a count

let mut counts = "aaa".chars().collect::<Counter<_>>();
counts[&'a'] += 1;
counts[&'b'] += 1;
let mut counts = "able babble table babble rabble table able fable scrabble"
    .split_whitespace().collect::<Counter<_>>();
// add or subtract an iterable of the same type
counts += "cain and abel fable table cable".split_whitespace();
// or add or subtract from another Counter of the same type
let other_counts = "scrabble cabbie fable babble"
    .split_whitespace().collect::<Counter<_>>();
let difference = counts - other_counts;

Extend a Counter with another Counter:

let mut counter = "abbccc".chars().collect::<Counter<_>>();
let another = "bccddd".chars().collect::<Counter<_>>();
counter.extend(&another);
let expect = [('a', 1), ('b', 3), ('c', 5), ('d', 3)].iter()
    .cloned().collect::<HashMap<_, _>>();
assert_eq!(counter.into_map(), expect);

Get items with keys

let counts = "aaa".chars().collect::<Counter<_>>();
assert_eq!(counts[&'a'], 3);
assert_eq!(counts[&'b'], 0);

Get the most common items

most_common_ordered() uses the natural ordering of keys which are Ord.

let by_common = "eaddbbccc".chars().collect::<Counter<_>>().most_common_ordered();
let expected = vec![('c', 3), ('b', 2), ('d', 2), ('a', 1), ('e', 1)];
assert!(by_common == expected);

k_most_common_ordered() takes an argument k of type usize and returns the top k most common items. This is functionally equivalent to calling most_common_ordered() and then truncating the result to length k. However, if k is smaller than the length of the counter then k_most_common_ordered() can be more efficient, often much more so.

let by_common = "eaddbbccc".chars().collect::<Counter<_>>().k_most_common_ordered(2);
let expected = vec![('c', 3), ('b', 2)];
assert!(by_common == expected);

Get the most common items using your own ordering

For example, here we break ties reverse alphabetically.

let counter = "eaddbbccc".chars().collect::<Counter<_>>();
let by_common = counter.most_common_tiebreaker(|&a, &b| b.cmp(&a));
let expected = vec![('c', 3), ('d', 2), ('b', 2), ('e', 1), ('a', 1)];
assert!(by_common == expected);

Test counters against another

Counters are multi-sets and so can be sub- or supersets of each other.

A counter is a subset of another if for all its elements, the other counter has an equal or higher count. Test for this with is_subset():

let counter = "aaabb".chars().collect::<Counter<_>>();
let superset = "aaabbbc".chars().collect::<Counter<_>>();
let not_a_superset = "aaae".chars().collect::<Counter<_>>();
assert!(counter.is_subset(&superset));
assert!(!counter.is_subset(&not_a_superset));

Testing for a superset is the inverse, is_superset() is true if the counter can contain another counter in its entirety:

let counter = "aaabbbc".chars().collect::<Counter<_>>();
let subset = "aabbb".chars().collect::<Counter<_>>();
let not_a_subset = "aaae".chars().collect::<Counter<_>>();
assert!(counter.is_superset(&subset));
assert!(!counter.is_superset(&not_a_subset));

These relationships continue to work when using a signed integer type for the counter: all values in the subset must be equal or lower to the values in the superset. Negative values are interpreted as 'missing' those values, and the subset would need to miss those same elements, or be short more, to still be a subset:

let mut subset = "aaabb".chars().collect::<Counter<_, i8>>();
subset.insert('e', -2);  // short 2 'e's
subset.insert('f', -1);  // and 1 'f'
let mut superset = "aaaabbb".chars().collect::<Counter<_, i8>>();
superset.insert('e', -1);  // short 1 'e'
assert!(subset.is_subset(&superset));
assert!(superset.is_superset(&subset));

Counter intersection and union

You can intersect two counters, giving you the minimal counts of their combined elements using the & bitwise and operator, and produce their union with the maximum counts using | bitwise or:

let a = "aaabb".chars().collect::<Counter<_>>();
let b = "aabbbbe".chars().collect::<Counter<_>>();

let intersection = a & b;
let expected_intersection = "aabb".chars().collect::<Counter<_>>();
assert_eq!(intersection, expected_intersection);

let c = "aaabb".chars().collect::<Counter<_>>();
let d = "aabbbbe".chars().collect::<Counter<_>>();

let union = c | d;
let expected_union = "aaabbbbe".chars().collect::<Counter<_>>();
assert_eq!(union, expected_union)

The in-place &= and |= operations are also supported.

Treat it like a HashMap

Counter<T, N> implements Deref<Target=HashMap<T, N>> and DerefMut<Target=HashMap<T, N>>, which means that you can perform any operations on it which are valid for a HashMap.

let mut counter = "aa-bb-cc".chars().collect::<Counter<_>>();
counter.remove(&'-');
assert!(counter == "aabbcc".chars().collect::<Counter<_>>());

Note that Counter<T, N> itself implements Index. Counter::index returns a reference to a Zero::zero value for missing keys.

let counter = "aaa".chars().collect::<Counter<_>>();
assert_eq!(counter[&'b'], 0);
// panics
// assert_eq!((*counter)[&'b'], 0);

Advanced Usage

Count any iterable which is Hash + Eq

You can't use the most_common* functions unless T is also Clone, but simple counting works fine on a minimal data type.

#[derive(Debug, Hash, PartialEq, Eq)]
struct Inty {
    i: usize,
}

impl Inty {
    pub fn new(i: usize) -> Inty {
        Inty { i: i }
    }
}

// <https://en.wikipedia.org/wiki/867-5309/Jenny>
let intys = vec![
    Inty::new(8),
    Inty::new(0),
    Inty::new(0),
    Inty::new(8),
    Inty::new(6),
    Inty::new(7),
    Inty::new(5),
    Inty::new(3),
    Inty::new(0),
    Inty::new(9),
];

let inty_counts = intys.iter().collect::<Counter<_>>();
println!("{:?}", inty_counts);
// {Inty { i: 8 }: 2, Inty { i: 0 }: 3, Inty { i: 9 }: 1, Inty { i: 3 }: 1,
//  Inty { i: 7 }: 1, Inty { i: 6 }: 1, Inty { i: 5 }: 1}
assert!(inty_counts.get(&Inty { i: 8 }) == Some(&2));
assert!(inty_counts.get(&Inty { i: 0 }) == Some(&3));
assert!(inty_counts.get(&Inty { i: 6 }) == Some(&1));

Use your own type for the count

Sometimes usize just isn't enough. If you find yourself overflowing your machine's native size, you can use your own type. Here, we use an i8, but you can use most numeric types, including bignums, as necessary.

let counter: Counter<_, i8> = "abbccc".chars().collect();
let expected: HashMap<char, i8> = [('a', 1), ('b', 2), ('c', 3)].iter().cloned().collect();
assert!(counter.into_map() == expected);

License: MIT

counter-rs's People

Contributors

chris-ha458 avatar clint-white avatar coriolinus avatar dpathakj avatar jonasbb avatar lukaskalbertodt avatar matthewmcintire-savantx avatar mjpieters avatar nw0 avatar qryxip avatar samueltardieu avatar swolchok avatar tinbryn avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

counter-rs's Issues

Suggestion: add 'multiset' and 'bag' tags to repository, README and Cargo.toml

A counter is a multiset, AKA a bag. The latter term is very generic, but still something people search for when they look for the datatype. However, the counter project makes no mention of these terms, in the README, the GitHub tags or the Cargo.toml keywords list.

There are currently no Rust projects with the multiset tag on GitHub, and the multiset projects on Crates.io I looked at are not nearly as flexible and versatile as counter is. E.g. the hashbag library doesn't offer intersections, unions, subset or superset functionality nor does it offer syntactic sugar support the way counter does.

However, anyone looking for a multiset will find those other projects and not this one.

Counter doesn't derive Debug

(probably also want to derive Default while you're at it.) It's possible to work around this with some ugly use of the Deref trait, but, well, it's ugly and inconvenient when debugging.

min count + into

Hi,

I find myself occasionally implementing a min_count() method after I use most_common(). For example If I want to have the 100K most common strings in a text, and exclude from the output elements that appeared under m times (remain with potentially less than 100K).

Is it something that you would want to add here?
If adding an into() impl, you can use them both sequentially (and potentially compose with other functionalities).

Not sure if this is something you would want, but checking.
For example something like (just a draft):

Usage:

let counter: Counter<char, usize> = Counter::init("abcdefgggg".chars());
let by_common: Counter<char, usize> = counter.k_most_common_ordered(3).into();
let by_min_and_common = by_common.min_count(2);
let expected = vec![('g', 4)];
assert!(by_min_and_common == expected);

into + min_count :

impl<T, N> Into<Counter<T, N>> for Vec<(T, N)>
where
    T: Hash + Eq + Clone,
    N: Clone + Zero + AddAssign
{
    /// Converts a Vector of (T, N) into a Counter<T,N>
    /// 
    /// ```rust
    /// # use counter::Counter;
    /// let counter = Counter::init("abb".chars());
    /// let expected = vec![('a', 1), ('b', 2)].into();
    /// assert_eq!(counter, expected);
    /// ```
    fn into(self) -> Counter<T, N> {
        self.iter().cloned().collect::<Counter<T, N>>()
    }
}


impl<T, N> Counter<T, N>
where
T: Hash + Eq + Clone,
N: Clone + Eq + PartialOrd<usize>
{
    /// Creates a vector of `(elem, frequency)` pairs, for all pairs where frequency is at least k.
    /// 
    /// ```rust
    /// # use counter::Counter;
    /// let counter: Counter<_> = "abababa".chars().collect();
    /// let by_min = counter.min_count(4);
    /// let expected = vec![('a', 4)];
    /// assert_eq!(by_min, expected);
    /// ```
    /// 
    pub fn min_count(&self, k: usize) -> Vec<(T, N)> {
        let items = self
        .map
        .iter()
        .filter(|(_, count)| **count >= k)
        .map(|(key, count)| (key.clone(), count.clone()))
        .collect::<Vec<_>>();
        items
    }
}

Most common should take a number

Python implementation allows taking most common up to N items.

Rust implementation from this crate really take the most common.

Type would probably need to change to:

impl<T, N> Counter<T, N>
where
    T: Hash + Eq + Clone,
    N: Clone + Ord,
{
    pub fn most_common(&self, n: usize) -> Vec<(T, N)> {

Or maybe there could be another function for this, like nth_most_common? But I personally think it the current one can just be Counter::sort_by(Ordering) -> Vec<(T, N)> or something similar.

Similarly, is there a need to clone everything? I think the N could use Copy + Ord instead since count can usually be copied instead of cloned.

Suggestion: make Counter generic over the hashing algorithm

It would be useful for Counter to be generic over the hashing algorithm, in the same way that HashMap allows alternate hashing algorithms to be specified. Currently the performance of Counter is limited by the fact that it can only use the default hashing algorithm, which is significantly slower than available alternatives in many common use-cases (albeit resistant to DoS attacks, but that's not a concern in some applications).

Use defaultmap crate

I just released a defaultmap crate. I think this lib could definitely make use of it. Are you open to a PR that changes this lib to use it?

is_subset / is_superset methods?

The Python Counter() object overrides the >= and <= operators to implement superset / subset tests, just like Python set objects use the same operations. Counter A is a superset of Counter B if for all elements in both counters, the count for that element in A is equal to or greater than the count in B. The subset relationship is the same but replace 'greater' for 'smaller'. When two Counter objects are equal, the superset and subset tests are also true, in either direction. Missing elements in either Counter are treated as having a count of 0.

The Rust equivalent are the Set.is_superset() and Set.is_subset() methods.

It doesn't like counter-rs implements these. Is there any interest in implementing them?

(Python sets and Counters also have the concept of strict supersets and subsets, which is the same conditions as for supersets and subsets but the two sets / counters can't be equal; in Rust you'd just use a.is_subset(b) && a != b and a.is_superset(b) && a != b).

Add into_hashmap function

fn into_hashmap(self) -> HashMap<T, usize> { self.map }

Just makes it easier to extract the data once counting is complete.

Make Counter generic over the count type?

I'm porting some Python code that uses Python Counters to Rust using this crate. It turns out that the code I'm porting relies on Python's unlimited integer precision. (The Rust behavior is to panic on overflow in debug mode and silently wrap in release builds. I had extra fun because I foolishly didn't think to test debug builds for a while.) It might be nice if Counter had a second generic parameter (defaulted to usize) so that consumers could plug in an unlimited precision type (like, say, BigInt from the num crate) if needed.

Refactoring & Optional data concurrency

I was thinking about incorporating rayon in certain sections of the code behind a feature flag.
For instance
pub fn most_common_tiebreaker<F>
could be changed as follows

pub fn most_common_tiebreaker<F>(&self, mut tiebreaker: F) -> Vec<(T, N)>
where
    F: FnMut(&T, &T) -> ::std::cmp::Ordering + Send + Sync,
{
    let mut items = self
        .map
        .iter() // i'm not sure if changing this into par_iter() would be more useful or needless complication
        .map(|(key, count)| (key.clone(), count.clone()))
        .collect::<Vec<_>>();
    #[cfg(feature = "use_rayon")]
    {
        use rayon::prelude::*;
        items.par_sort_unstable_by(|(a_item, a_count), (b_item, b_count)| {
            b_count
                .cmp(a_count)
                .then_with(|| tiebreaker(a_item, b_item))
        });
    }
    #[cfg(not(feature = "use_rayon"))]
    {
        items.sort_unstable_by(|(a_item, a_count), (b_item, b_count)| {
            b_count
                .cmp(a_count)
                .then_with(|| tiebreaker(a_item, b_item))
        });
    }
    items
}

Similar feature flag based dependencies would be set in Cargo.toml as well.
Overall, this would be integrated in a way that preserves default behavior unless a feature flag is set (likely as rayon)

If you are open to this, I would first begin by initiating a bit of refactoring first.
Most if not all of the code is situated inside lib.rs which is now pushing 2k loc.
I think the unit tests can be extracted into /src/lib/tests.rs as a separate file.
This is in contrast to /tests/integration_tests.rs which would be for integration tests.
Of course, if any specific test is identified to be more suitable for integration tests I would extract them and organise them as such.

What do you feel about each?

  • Refactoring the code, especially for the tests.
  • Adding rayon (behind a feature flag to preserve default behavior)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.