pbeshai / tidy

Tidy up your data with JavaScript, inspired by dplyr and the tidyverse

Home Page: https://pbeshai.github.io/tidy

License: MIT License

JavaScript 8.24% TypeScript 83.74% CSS 4.17% MDX 3.86%

data wrangling dplyr tidyverse

tidy's Introduction

tidy.js

Tidy up your data with JavaScript! Inspired by dplyr and the tidyverse, tidy.js attempts to bring the ergonomics of data manipulation from R to JavaScript (and TypeScript). The primary goals of the project are:

Readable code. Tidy.js prioritizes making your data transformations readable, so future you and your teammates can get up and running quickly.
Standard transformation verbs. Tidy.js is built using battle-tested verbs from the R community that can handle any data wrangling need.
Work with plain JS objects. No wrapper classes needed — all tidy.js needs is an array of plain old-fashioned JS objects to get started. Simple in, simple out.

Secondarily, this project aims to provide acceptable types for the functions provided.

Related work

Be sure to check out a very similar project, Arquero, from UW Data.

Getting started

To start using tidy, your best bet is to install from npm:

npm install @tidyjs/tidy
# or
yarn add @tidyjs/tidy

Then import the functions you need:

import { tidy, mutate, arrange, desc } from '@tidyjs/tidy'

Note if you're just trying tidy in a browser, you can use the UMD version hosted on jsdelivr (codesandbox example):

<script src="https://d3js.org/d3-array.v2.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tidyjs/tidy/dist/umd/tidy.min.js"></script>
<script>
  const { tidy, mutate, arrange, desc } = Tidy;
  // ...
</script>

And use them on an array of objects:

const data = [
  { a: 1, b: 10 }, 
  { a: 3, b: 12 }, 
  { a: 2, b: 10 }
]

const results = tidy(
  data, 
  mutate({ ab: d => d.a * d.b }),
  arrange(desc('ab'))
)

The output is:

[
  { a: 3, b: 12, ab: 36 },
  { a: 2, b: 10, ab: 20 },
  { a: 1, b: 10, ab: 10 }
]

All tidy.js code is wrapped in a tidy flow via the tidy() function. The first argument is the array of data, followed by the transformation verbs to run on the data. The actual functions passed to tidy() can be anything so long as they fit the form:

(items: object[]) => object[]

For example, the following is valid:

tidy(
  data, 
  items => items.filter((d, i) => i % 2 === 0),
  arrange(desc('value'))
)

All tidy verbs fit this style, with the exception of exports from groupBy, discussed below.
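
To make the shape of a flow concrete, here is a minimal sketch of how a sequence of (items) => items functions can be applied in order. The names runFlow, addProduct and sortDescBy are hypothetical and exist only for illustration. This is not tidy's actual implementation, only the general pattern, assuming each step receives the previous step's output.

```javascript
// Hypothetical stand-in for a tidy-style flow: apply each step to the
// output of the previous one. Not tidy's real implementation.
function runFlow(items, ...steps) {
  return steps.reduce((current, step) => step(current), items);
}

// Two hand-written steps that fit the (items) => items form.
const addProduct = (items) => items.map((d) => ({ ...d, ab: d.a * d.b }));
const sortDescBy = (key) => (items) =>
  [...items].sort((x, y) => y[key] - x[key]);

const flowData = [
  { a: 1, b: 10 },
  { a: 3, b: 12 },
  { a: 2, b: 10 },
];

const flowResult = runFlow(flowData, addProduct, sortDescBy('ab'));
console.log(flowResult);
```

Because every step has the same signature, hand-written functions like these can sit next to library verbs in the same flow.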

Grouping data with groupBy

Besides manipulating flat lists of data, tidy provides facilities for wrangling grouped data via the groupBy() function.

import { tidy, summarize, sum, groupBy } from '@tidyjs/tidy'

const data = [
  { key: 'group1', value: 10 }, 
  { key: 'group2', value: 9 }, 
  { key: 'group1', value: 7 }
]

tidy(
  data,
  groupBy('key', [
    summarize({ total: sum('value') })
  ])
)

The output is:

[
  { "key": "group1", "total": 17 },
  { "key": "group2", "total": 9 }
]

The groupBy() function works similarly to tidy() in that it takes a flow of functions as its second argument (wrapped in an array). Things get really fun when you use groupBy's third argument for exporting the grouped data into different shapes.
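
As a rough mental model of "a flow per group", the sketch below groups items by one key, runs the steps on each group separately, then re-attaches the key and flattens the results. The names groupByFlow and totalOf are hypothetical, and prepending the group key is an assumption made for this illustration. It is not how tidy is implemented internally.

```javascript
// Hypothetical sketch of "group, run a flow per group, then ungroup".
// Assumes the group key is prepended to each resulting item.
function groupByFlow(items, key, steps) {
  // Bucket items by key value, preserving first-seen order.
  const groups = new Map();
  for (const item of items) {
    const k = item[key];
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(item);
  }

  const output = [];
  for (const [k, groupItems] of groups) {
    // Run the flow on this group's items only.
    const result = steps.reduce((current, step) => step(current), groupItems);
    // Re-attach the group key and flatten back into one list.
    for (const row of result) output.push({ [key]: k, ...row });
  }
  return output;
}

// A summarize-like step: collapses a group to a single row.
const totalOf = (field) => (items) => [
  { total: items.reduce((acc, d) => acc + d[field], 0) },
];

const grouped = groupByFlow(
  [
    { key: 'group1', value: 10 },
    { key: 'group2', value: 9 },
    { key: 'group1', value: 7 },
  ],
  'key',
  [totalOf('value')]
);
console.log(grouped);
```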

For example, exporting data as a nested object, we can use groupBy.object() as the third argument to groupBy().

const data = [
  { g: 'a', h: 'x', value: 5 },
  { g: 'a', h: 'y', value: 15 },
  { g: 'b', h: 'x', value: 10 },
  { g: 'b', h: 'x', value: 20 },
  { g: 'b', h: 'y', value: 30 },
]

tidy(
  data,
  groupBy(
    ['g', 'h'], 
    [
      mutate({ key: d => `${d.g}${d.h}` })
    ], 
    groupBy.object() // <-- specify the export
  )
);

The output is:

{
  "a": {
    "x": [{"g": "a", "h": "x", "value": 5, "key": "ax"}],
    "y": [{"g": "a", "h": "y", "value": 15, "key": "ay"}]
  },
  "b": {
    "x": [
      {"g": "b", "h": "x", "value": 10, "key": "bx"},
      {"g": "b", "h": "x", "value": 20, "key": "bx"}
    ],
    "y": [{"g": "b", "h": "y", "value": 30, "key": "by"}]
  }
}

Or alternatively as { key, values } entries-objects via groupBy.entriesObject():

tidy(data,
  groupBy(
    ['g', 'h'], 
    [
      mutate({ key: d => `${d.g}${d.h}` })
    ], 
    groupBy.entriesObject() // <-- specify the export
  )
);

The output is:

[
  {
    "key": "a",
    "values": [
      {"key": "x", "values": [{"g": "a", "h": "x", "value": 5, "key": "ax"}]},
      {"key": "y", "values": [{"g": "a", "h": "y", "value": 15, "key": "ay"}]}
    ]
  },
  {
    "key": "b",
    "values": [
      {
        "key": "x",
        "values": [
          {"g": "b", "h": "x", "value": 10, "key": "bx"},
          {"g": "b", "h": "x", "value": 20, "key": "bx"}
        ]
      },
      {"key": "y", "values": [{"g": "b", "h": "y", "value": 30, "key": "by"}]}
    ]
  }
]

It's common to be left with a single leaf in a groupBy set, especially after running summarize(). To prevent your exported data having its values wrapped in an array, you can pass the single option to it.

tidy(data,
  groupBy(['g', 'h'], [
    summarize({ total: sum('value') })
  ], groupBy.object({ single: true }))
);

The output is:

{
  "a": {
    "x": {"total": 5, "g": "a", "h": "x"},
    "y": {"total": 15, "g": "a", "h": "y"}
  },
  "b": {
    "x": {"total": 30, "g": "b", "h": "x"},
    "y": {"total": 30, "g": "b", "h": "y"}
  }
}
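
To illustrate the difference the single option makes, here is a small plain-JavaScript sketch of a nested-object export. The function toNestedObject is hypothetical, and treating single as "keep the first item of each leaf" is an assumption for this example rather than a statement about tidy's internals.

```javascript
// Hypothetical sketch of a nested-object export, not tidy's implementation.
// With `single: true` each leaf holds one item instead of an array.
const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

function toNestedObject(items, keys, { single = false } = {}) {
  const root = {};
  const lastKey = keys[keys.length - 1];
  for (const item of items) {
    let node = root;
    // Walk (and create) one level per grouping key, except the last.
    for (const key of keys.slice(0, -1)) {
      if (!hasOwn(node, item[key])) node[item[key]] = {};
      node = node[item[key]];
    }
    const leaf = item[lastKey];
    if (single) {
      // Keep only the first item seen for this leaf.
      if (!hasOwn(node, leaf)) node[leaf] = item;
    } else {
      if (!hasOwn(node, leaf)) node[leaf] = [];
      node[leaf].push(item);
    }
  }
  return root;
}

const summarized = [
  { g: 'a', h: 'x', total: 5 },
  { g: 'a', h: 'y', total: 15 },
  { g: 'b', h: 'x', total: 30 },
  { g: 'b', h: 'y', total: 30 },
];

const nestedArrays = toNestedObject(summarized, ['g', 'h']);
const nestedSingle = toNestedObject(summarized, ['g', 'h'], { single: true });
console.log(nestedArrays.a.x); // an array with one item
console.log(nestedSingle.a.x); // the item itself
```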

Visit the API reference docs to learn more about how each function works and all the options they take. Be sure to check out the levels export, which can let you mix-and-match different export types based on the depth of the data. For quick reference, other available groupBy exports include:

groupBy.entries()
groupBy.entriesObject()
groupBy.grouped()
groupBy.levels()
groupBy.object()
groupBy.keys()
groupBy.map()
groupBy.values()

Developing

clone the repo:

git clone git@github.com:pbeshai/tidy.git

install dependencies:

yarn

initialize lerna:

lerna bootstrap

build tidy:

yarn run build

test all of tidy:

yarn run test

test:watch a single package

yarn workspace @tidyjs/tidy test:watch

Conventional commits

This library uses conventional commits, following the Angular convention. Prefixes are:

build: Changes that affect the build system or external dependencies (example scopes: yarn, npm)
ci: Changes to our CI configuration files and scripts (e.g. CircleCI)
chore
docs: Documentation only changes
feat: A new feature
fix: A bug fix
perf: A code change that improves performance
refactor: A code change that neither fixes a bug nor adds a feature
revert
style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
test: Adding missing tests or correcting existing tests

Docs website

start the local site:

yarn start:web

build the site:

yarn build:web

deploy the site via github-pages:

USE_SSH=true GIT_USER=pbeshai yarn workspace @tidyjs/tidy-website deploy

Ideally we can automate this via github actions one day!

Shout out to Netflix

I want to give a big shout out to Netflix, my current employer, for giving me the opportunity to work on this project and to open source it. It's a great place to work and if you enjoy tinkering with data-related things, I'd strongly recommend checking out our analytics department. – Peter Beshai

tidy's Issues

Feature Request: add a boolean flag to turn on/off debug

Feature Request

It would be useful to be able to turn debug on or off as needed. For example, I want to suppress debug output in production but show it on staging.

For example:

const _DEBUG = false;
tidy(data, debug("info label", {limit: 15, show: _DEBUG})  );
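
Since a tidy flow accepts any function of the form (items) => items, one possible workaround that needs no new option is to swap the debug step for an identity step when the flag is off. The sketch below is self-contained: when, passThrough and debugStep are hypothetical names, and debugStep only stands in for a debug-style verb so the example runs without the library.

```javascript
// Identity step: returns items unchanged, so it is safe in any flow.
const passThrough = (items) => items;

// Hypothetical helper: include `step` only when `enabled` is true.
const when = (enabled, step) => (enabled ? step : passThrough);

// Stand-in for a debug-style step that logs and returns items untouched.
const logged = [];
const debugStep = (label) => (items) => {
  logged.push(`${label}: ${items.length} items`);
  return items;
};

const DEBUG_ENABLED = false;
const debugRows = [{ a: 1 }, { a: 2 }];

// With the flag off, the debug step is replaced by the identity step.
const debugOut = when(DEBUG_ENABLED, debugStep('info label'))(debugRows);
console.log(debugOut === debugRows, logged.length);
```

In a real flow the same idea would presumably read tidy(data, when(_DEBUG, debug("info label", { limit: 15 }))), assuming debug behaves like any other step.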

Update docs to reflect keys being prepended now

From the changes made in #34, keys are now prepended on objects in groupBy but the docs still show them being appended – groupBy.entries() for example:

const data = [
  { str: 'a', ing: 'x', foo: 'G', value: 1 },
  { str: 'b', ing: 'x', foo: 'H', value: 100 },
  { str: 'b', ing: 'x', foo: 'K', value: 200 },
  { str: 'a', ing: 'y', foo: 'G', value: 2 },
  { str: 'a', ing: 'y', foo: 'H', value: 3 },
  { str: 'a', ing: 'y', foo: 'K', value: 4 },
  { str: 'b', ing: 'y', foo: 'G', value: 300 },
  { str: 'b', ing: 'z', foo: 'H', value: 400 },
  { str: 'a', ing: 'z', foo: 'K', value: 5 },
  { str: 'a', ing: 'z', foo: 'G', value: 6 },
]

tidy(
  data,
  groupBy('str', [
    summarize({ total: sum('value') })
  ], groupBy.entries())
)
// output:
[
  ["a", [{"total": 21, "str": "a"}]], 
  ["b", [{"total": 1000, "str": "b"}]]
]

But putting that into the playground gives me:

[
  ["a", [{"str": "a", "total": 21}]], 
  ["b", [{"str": "b", "total": 1000}]]
]

Type error when using groupBy after mutate

See the code example here: https://codesandbox.io/s/divine-cache-x3rzjj?file=/src/App.tsx

Basically doing this:

tidy(data, mutate({ ab: (d) => d.a * d.b }), groupBy("ab"));

Results in a TypeScript error.

Note: It doesn't matter which argument is passed to groupBy, the error persists either way.

Specify which open-source license

This library is super exciting and possibly very helpful for some projects I work on!

However, I can't see from your documentation which open-source license you're using, which could prevent me from using this library at work.

Could you add more info on which open-source license you've decided to use?

Set column index for mutations

Does anyone else run into the problem where you want to add a new column via mutate but you don't want that new key to be added at the very end? One common example is I have a bunch of state fips codes that are my first column and I add the full state name via a lookup with a mutate call. I'd like to have that full state name then be the second column so my spreadsheet is easier to read. I could do a select / pick call but I'd have to write out all of my columns and that is a bit verbose.

Perhaps mutate could be supplied an index and it inserts the key at that index? Open to other workarounds people have found for this...
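
One possible workaround, sketched below, is a small custom step placed after mutate that rebuilds each object with the new key at the desired position. moveKeyTo and moveColumn are hypothetical helpers, not part of tidy, and the approach relies on JavaScript preserving insertion order for non-integer string keys.

```javascript
// Hypothetical helper: rebuild an object with `key` moved to position `index`.
// Relies on JS preserving insertion order for non-integer string keys.
function moveKeyTo(obj, key, index) {
  const entries = Object.entries(obj).filter(([k]) => k !== key);
  entries.splice(index, 0, [key, obj[key]]);
  return Object.fromEntries(entries);
}

// An (items) => items step that can be dropped into a flow after mutate.
const moveColumn = (key, index) => (items) =>
  items.map((d) => moveKeyTo(d, key, index));

// Placeholder row: `name` was added last, as mutate would do.
const stateRows = [{ fips: '36', population: 100, name: 'New York' }];
const reordered = moveColumn('name', 1)(stateRows);
console.log(Object.keys(reordered[0]));
```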

leftJoin followed by select everything has unexpected dropped columns

Example:

tidy(
  [{ a: 123, b: 345 }, { a: 452, b: 999}],
  leftJoin([{ a: 452, c: 456 }], { by: 'a' }),
  select(T.everything())
)
// output:
[{"a":123,"b":345},{"a":452,"b":999}]

// expected output: 
[{"a":123,"b":345,"c":undefined},{"a":452,"b":999,"c":456}]

Make types dumber

It gets very annoying when the type inference is wrong. Perhaps we can mitigate this by stopping trying to be clever with keyof and other crazy generics and just make whatever comes out of tidy be opaque. At the end of tidy flows, users can cast their outputs to their expected types. Currently you have to fight the type system and guess where to override things and it is just crazy.

GroupBy fails with Dates

Needs to use valueOf when caching keys. Could switch to explicitly using internmap. Brought up in #46
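
The underlying problem can be reproduced with a plain Map, independent of tidy: two Date objects for the same instant are distinct keys unless they are first converted to a primitive. The sketch below only demonstrates that JavaScript behavior; how tidy caches keys internally is not shown here.

```javascript
// Two Date objects for the same instant are different object references.
const d1 = new Date('2021-01-01T00:00:00Z');
const d2 = new Date('2021-01-01T00:00:00Z');

// Keyed by the Date object itself: the two dates land in separate groups.
const byReference = new Map();
for (const d of [d1, d2]) {
  byReference.set(d, (byReference.get(d) || 0) + 1);
}

// Keyed by valueOf() (epoch milliseconds): they share one group.
const byValue = new Map();
for (const d of [d1, d2]) {
  const k = d.valueOf();
  byValue.set(k, (byValue.get(k) || 0) + 1);
}

console.log(byReference.size, byValue.size); // 2 1
```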

Allow type guards for filter

A common use case for me is filtering possibly null values before mutating data. For example like this:

import { filter, mutate, tidy } from "@tidyjs/tidy";

type Data = { a: number | null };
const data: Array<Data> = [{ a: null }, { a: 1 }, { a: null }, { a: 2 }];

tidy(
  data,
  filter((x) => x.a !== null),
  mutate({
    b: (x) => 2 * x.a
  })
);

Of course TypeScript does not know that after filtering x.a cannot be null, so it complains: Object is possibly 'null'.
This is fully expected and I love that tidyjs has such great type inference (I hope you do not make it dumber as suggested in #14).

A solution to my problem would be a type guard (thanks to @phryneas for the suggestion):

tidy(
    data,
    filter((x): x is Data & { a: number } => x.a !== null),
    mutate({
        b: (x) => 2 * x.a
    })
);

However this requires an additional declaration for the filter function that allows for type guards (it can be added without interference with the existing functionality). I currently do this:

declare module "@tidyjs/tidy" {
  function filter<T extends object, O extends T>(
    filterFn: (item: T, index: number, array: T[]) => item is O
  ): TidyFn<T, O>;
}

I'll send a pull request that adds this declaration to the library itself.

bug: first() and last() fail when there are no items

first() and last() try running on an undefined element when there are no items, so they should be updated to just return undefined in that case.
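
A guarded version might look like the sketch below. safeFirst, safeLast and readValue are hypothetical names, and accepting either a property name or an accessor is an assumption for this example, not a description of the library's signatures.

```javascript
// Hypothetical guarded summarizers; `key` may be a property name or accessor.
const readValue = (key) => (typeof key === 'function' ? key : (d) => d[key]);

// Return undefined for an empty list instead of reading from a missing item.
const safeFirst = (key) => (items) =>
  items.length ? readValue(key)(items[0]) : undefined;

const safeLast = (key) => (items) =>
  items.length ? readValue(key)(items[items.length - 1]) : undefined;

console.log(safeFirst('a')([]), safeLast('a')([{ a: 1 }, { a: 2 }]));
// undefined 2
```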

Mutate should pass the index

e.g. mutate({ foo: (d, i) => i })

Typescript type error?

First of all thanks for this nice library, it's filling an important gap in the JS world.

When using typescript I'm getting the following error, any idea? It seems an issue with the grouped column parameter.

         groupBy('day', [
            summarize({ total_cost: mean('total_cost'),
                        total_revenue: mean('total_revenue'),
                        margin: mean('margin') })
         ]))

No overload matches this call.
  The last overload gave the following error.
    Argument of type 'string' is not assignable to parameter of type 'GK<object>'.ts(2769)
tidy.d.ts(1006, 18): The last overload is declared here.

How do I select iris as input in the playground?

Thanks so much - this library is awesome!

The very handy playground includes iris, but it's not clear how to use it. The input table is visible but output says Error: ReferenceError: iris is not defined.

Wonder if worth preserving state in e.g. URL hash string so examples could be shared this way?

Thoughts on slice()

In my view slice feels very javascripty. dplyr's implementation allows for multiple methods:

slice(iris, c(1, 51, 101)) // single argument
slice(iris, 1, 51, 101) // multiple arguments
slice(iris, -1, -51, -101) // negation

Am I right that slice in tidy.js is basically the same as JavaScript's?

x = [1,2,3,4,5,6,7,8,9,10];
tidy(x, Tidy.slice(2,5))
(3) [3, 4, 5]
tidy(x, Tidy.slice([2,5]))   // ?
(10) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tidy(x, Tidy.slice(-5))
(5) [6, 7, 8, 9, 10]

I feel javascript is crying out for an elegant implementation like dplyr's. Wonder if there might be appetite for a slice2 function to plug the gap?
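
For discussion, a dplyr-flavoured variant could be sketched as a custom step like the one below. sliceRows is a hypothetical name, and its rules are choices made for this sketch: positions are 1-based, all-positive positions select rows, all-negative positions drop rows, and mixing the two throws.

```javascript
// Hypothetical `sliceRows`: positions are 1-based, as in dplyr.
// All-positive positions select rows; all-negative positions drop rows.
const sliceRows = (...positions) => (items) => {
  // Accept both sliceRows(1, 5) and sliceRows([1, 5]).
  const flat = positions.flat();
  const allPositive = flat.every((p) => p > 0);
  const allNegative = flat.every((p) => p < 0);
  if (!allPositive && !allNegative) {
    throw new Error('positions must be all positive or all negative');
  }
  if (allPositive) {
    // Select the listed rows, skipping positions past the end.
    return flat.filter((p) => p <= items.length).map((p) => items[p - 1]);
  }
  // Drop the listed rows.
  const dropped = new Set(flat.map((p) => -p));
  return items.filter((_, i) => !dropped.has(i + 1));
};

const xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
console.log(sliceRows([1, 5, 10])(xs));   // [1, 5, 10]
console.log(sliceRows(1, 5, 10)(xs));     // [1, 5, 10]
console.log(sliceRows(-1, -5, -10)(xs));  // [2, 3, 4, 6, 7, 8, 9]
```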

Bug: select() fails when no keys are selected

Currently if you do select([]) or select([contains('foo')]) and contains returns no results, an error occurs. Instead, let's just have select do nothing if no keys are passed – the input data is output untouched.

complete, is not parsing the original data

As seen here.
T.tidy(
  aaplMissing,
  T.select(["Date", "Close"]),
  T.complete({ Date: T.fullSeqDate('Date', 'day', 1) }, { Close: 0 })
)

The values in Close are not parsed to the new object.

Feature request: playground encodes the state / contents in the URL

To make the playground more useful, we should be able to share URLs by making the content and inputs saved to the URL.

Requested in #6

Clarity on dependencies?

I'm building a static app (no Node) and planned to run a local version of tidy.min.js (plus its d3 dependency), so I cloned the file at <script src="https://www.unpkg.com/@tidyjs/tidy/dist/umd/tidy.min.js"></script> and call locally. But encountered the error:

DevTools failed to load SourceMap: Could not load content for http://localhost:8888/js/tidy.min.js.map: 
HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE

And it turns out there is a file at https://www.unpkg.com/@tidyjs/[email protected]/dist/umd/tidy.min.js.map which specifies a number of .ts files e.g:

Is tidy.min.js.map an actual dependency? I'm not a javascript dev by background so may be breaking some best practice by cloning as I have.

New summarizer: nWhere?

Would you be interested in a PR adding a new summarizer that I've found helpful? It's the same as n but allows you to add a condition and only count items that meet that condition. So if you want to know how many elements in your groupBy had a certain value, you can get that.

Simple JS implementation (Would convert to TypeScript)

export default function nWhere(conditional) {
  return function nWhereFn (list) {
    return list.filter(conditional).length;
  }
}

Usage

const data = [
  { str: 'foo', value: 3 },
  { str: 'foo', value: 1 },
  { str: 'bar', value: 3 },
  { str: 'bar', value: 1 },
  { str: 'bar', value: 7 },
];

tidy(data, summarize({
  foos: nWhere(d => d.str === 'foo'),
  bars: nWhere(d => d.str === 'bar'),
  count: n(),
}))
// output:
[{ foos: 2, bars: 3, count: 5 }]

Bug: groupBy exports do not respect addGroupKeys

if you do for instance groupBy.object({ addGroupKeys: false }), addGroupKeys is ignored.

valuesFillMap for pivotWider() not being applied

hello, i have started using tidy and am finding it awesomely helpful, and am appreciating the many examples in the documentation. thank you!!

i think i have found a bug with valuesFillMap for pivotWider(), where it seems that the map is just ignored.

test case

tidy(
  [
    {n:2, c:'a', e:1001},{n:3, c:'a', e:1002},{n:7, c:'x', e:1003},
    {n:4, c:'b', e:1001},{n:2, c:'r', e:1002},{n:9, c:'y', e:1003},
    {n:6, c:'c', e:1001},{n:1, c:'z', e:1002},{n:1, c:'z', e:1003},
  ],
  pivotWider({
    namesFrom: 'c',
    valuesFrom: 'n',
    valuesFillMap: { a:0, b:0, c:0, r:0, s:0, t:0, x:0, y:0, z:0 },
  }),
)

results

[
  {a: 2, x: undefined, b: 4, r: undefined, y: undefined, c: 6, z: undefined, e: 1001},
  {a: 3, x: undefined, b: undefined, r: 2, y: undefined, c: undefined, z: 1, e: 1002},
  {a: undefined, x: 7, b: undefined, r: undefined, y: 9, c: undefined, z: 1, e: 1003},
]

map values do not seem to be applied; widened keys are given value undefined
map keys do not seem to be applied; s and t are missing from results

expected results

[
  {a: 2, x: 0, b: 4, r: 0, y: 0, c: 6, z: 0, s: 0, t: 0, e: 1001},
  {a: 3, x: 0, b: 0, r: 2, y: 0, c: 0, z: 1, s: 0, t: 0, e: 1002},
  {a: 0, x: 7, b: 0, r: 0, y: 9, c: 0, z: 1, s: 0, t: 0, e: 1003},
]

map values applied to widened keys (as 0)
map keys applied; every row has all columns (a, b, c, r, s, t, x, y, z, and e)

import fail

import { tidy, mutate, arrange, desc } from "@tidyjs/tidy" doesn't work anymore.

arrange should accept accessors

it is convenient to say d => d.foo.bar for arrange instead of requiring functions to be comparators ((a,b)=>number)
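
Until accessors are accepted directly, a small adapter can turn an accessor into a comparator. ascendingBy below is a hypothetical helper shown with Array.prototype.sort so the example runs on its own. Passing the resulting comparator to arrange should work on the assumption, taken from this issue, that arrange accepts (a, b) => number functions.

```javascript
// Hypothetical adapter: turn an accessor into an ascending comparator.
const ascendingBy = (accessor) => (a, b) => {
  const x = accessor(a);
  const y = accessor(b);
  return x < y ? -1 : x > y ? 1 : 0;
};

const nestedRows = [
  { foo: { bar: 3 } },
  { foo: { bar: 1 } },
  { foo: { bar: 2 } },
];

// Sort a copy so the input array is left untouched.
const sortedRows = [...nestedRows].sort(ascendingBy((d) => d.foo.bar));
console.log(sortedRows.map((d) => d.foo.bar)); // [1, 2, 3]
```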

Bug: mapEntry not called on entriesObject

Originally this was because all entriesObject does is supply a mapEntry for the entries export, but it's weird that it just doesn't work. So let's just use it if it is provided.

Should groupBy's addGroupKeys be on by default?

Currently groupBy automatically adds group keys back to objects after each function in the flow. This was primarily done to mitigate the fact that summarize (a very common groupBy operation) removes them. There has been some discussion in #34 around whether or not this should be default behavior.

It's a pretty big breaking change to switch to not adding them by default, so I'm not sure it will be worth it. However, it would improve the performance of groupBy and perhaps it is easier to reason about by not adding them (users can always explicitly keep them around when summarizing via first, e.g. summarize({ cyl: first('cyl') })).

An in-between would be to not add them in except for certain exports? It's likely when exporting by entries or object you don't want them added back in. I'm not sure what to do here, and am open to any ideas.

Feedback on unexpected behavior in groupBy

Hi, Thanks for making this library. I've wanted it to exist for years. I've been trying to incorporate into some of my projects and I came across something surprising today.

Here's a REPL reproduction: https://svelte.dev/repl/3d3126f8ea994d3d866427cab0642e3b?version=3.38.2

I wanted to group a list of states based on a value, in this case count. I was getting a really weird output, though. After poking around a bit, I discovered that I fixed the problem by setting { addGroupKeys: false } in the group export. The issue seems to be that because my input data is a list of strings, the default behavior of adding a key onto the element is converting it to an object.

I'm not sure what the best solution is for it – maybe a change in the docs, a console warning or possibly a change in the default behavior so that it doesn't mutate the data by default. My expectation was definitely that it wouldn't mutate the original object, for what it's worth.

New Architecture

I looked around at the code and saw that tidy calls the functions passed one by one on the whole array.
I think that switching to iterators and applying at least some of the functions per cell would speed up run time significantly.
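
To make the proposal concrete, here is a minimal sketch of what lazy, per-item steps could look like using generators. lazyMap, lazyFilter and lazyFlow are illustrative names, not an existing tidy API, and no performance claim is being measured here.

```javascript
// Lazy per-item steps built on generators (illustrative names).
const lazyMap = (fn) =>
  function* (source) {
    for (const item of source) yield fn(item);
  };

const lazyFilter = (predicate) =>
  function* (source) {
    for (const item of source) if (predicate(item)) yield item;
  };

// Chain the steps; nothing runs until the result is consumed.
function lazyFlow(items, ...steps) {
  return steps.reduce((source, step) => step(source), items);
}

const lazyResult = [
  ...lazyFlow(
    [1, 2, 3, 4],
    lazyFilter((n) => n % 2 === 0),
    lazyMap((n) => n * 10)
  ),
];
console.log(lazyResult); // [20, 40]
```

Steps that need every item at once, such as sorting or summarizing, would still have to collect the whole input, so only some verbs could stream this way.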

nested objects?

lodash pick allows nested picks: https://masteringjs.io/tutorials/lodash/pick

is this not supported in tidy?
