techascent / tech.ml.dataset
A Clojure high performance data processing system
License: Eclipse Public License 1.0
The current rolling window system is a fixed-window system independent of the data.
An upgrade to it would be to allow windows to be dependent upon the data, so that you could have, say, a 2-second rolling-window average.
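A rough sketch of the idea in plain Clojure, assuming a seq of {:time <millis> :value <number>} samples (all names here are hypothetical):
;; Data-dependent window: average every sample whose :time falls within
;; window-ms of the current sample's :time.
(defn rolling-window-average
  [window-ms samples]
  (for [{:keys [time]} samples
        :let [in-window (filter #(<= (- time window-ms) (:time %) time)
                                samples)]]
    (/ (reduce + (map :value in-window))
       (double (count in-window)))))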
The code below raises an exception. The docstring states: "For booleans, true=1 false=0".
(let [d (ds/->dataset [{:a true} {:a true} {:a false}])]
  (col/to-double-array (d :a)))
(let [arr (into-array Double (repeat 10 1.0))]
  (tech.ml.dataset/name-values-seq->dataset [["double" arr]]))

(let [arr (into-array Boolean (repeat 10 true))]
  (tech.ml.dataset/name-values-seq->dataset [["boolean" arr]]))
(ds-col/new-column (ds-col/column-name first-col)
                   column-values
                   (ds-col/metadata first-col))
This shows up when building a left-join dataset. I have a function in my branch, empty-column, that leverages new-column to create a column with all indices marked as missing for the datatype. This is a naive rough draft, but it works.
When I go to implement left-join, the results look correct, except the missing data for each column isn't propagated, since I leverage ds-concat above, which uses the arity of new-column that ignores missing data. I "think" this is not correct in the implementation of ds-concat, although I could be missing something about the guarantees concat is supposed to confer. There's no docstring at the moment, so I'm just projecting my expectation onto it.
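For reference, a rough sketch of the empty-column idea, assuming the [name data metadata missing] arity of new-column alluded to above (the nil placeholder data is an assumption of this sketch):
(defn empty-column
  [col-name n-rows]
  ;; every index is marked missing; the data itself is placeholder nils
  (ds-col/new-column col-name (repeat n-rows nil) {} (range n-rows)))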
When I try to drop rows with missing values and all rows in the dataset have missing values, I get an exception.
;; works (there is one row without missing values)
(let [ds (ds/->dataset [{:a 1 :b 1} {:b 2}])]
  (ds/drop-rows ds (ds/missing ds)))
;; => _unnamed [1 2]:
;; | :a | :b |
;; |----+----|
;; |  1 |  1 |

;; all rows have missing values -> exception
(let [ds (ds/->dataset [{:a 1} {:b 2}])]
  (ds/drop-rows ds (ds/missing ds)))
1. Unhandled java.util.NoSuchElementException
Empty RoaringArray
RoaringArray.java: 998 org.roaringbitmap.RoaringArray/assertNonEmpty
RoaringArray.java: 978 org.roaringbitmap.RoaringArray/first
RoaringBitmap.java: 2745 org.roaringbitmap.RoaringBitmap/first
bitmap.clj: 88 tech.v2.datatype.bitmap/eval23542/fn
protocols.clj: 306 tech.v2.datatype.protocols/eval15210/fn/G
bitmap.clj: 217 tech.v2.datatype.bitmap/bitmap->efficient-random-access-reader
bitmap.clj: 210 tech.v2.datatype.bitmap/bitmap->efficient-random-access-reader
column.clj: 134 tech.ml.dataset.impl.column/->efficient-reader
column.clj: 130 tech.ml.dataset.impl.column/->efficient-reader
column.clj: 313 tech.ml.dataset.impl.column.Column/select
column.clj: 117 tech.ml.dataset.column/select
column.clj: 114 tech.ml.dataset.column/select
dataset.clj: 111 tech.ml.dataset.impl.dataset.Dataset/fn
core.clj: 2753 clojure.core/map/fn
LazySeq.java: 42 clojure.lang.LazySeq/sval
LazySeq.java: 51 clojure.lang.LazySeq/seq
RT.java: 535 clojure.lang.RT/seq
RT.java: 650 clojure.lang.RT/countFrom
RT.java: 643 clojure.lang.RT/count
column.clj: 321 tech.ml.dataset.column/ensure-column-seq
column.clj: 302 tech.ml.dataset.column/ensure-column-seq
dataset.clj: 191 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 181 tech.ml.dataset.impl.dataset/new-dataset
dataset.clj: 113 tech.ml.dataset.impl.dataset.Dataset/select
base.clj: 203 tech.ml.dataset.base/select
base.clj: 191 tech.ml.dataset.base/select
base.clj: 257 tech.ml.dataset.base/drop-rows
base.clj: 251 tech.ml.dataset.base/drop-rows
The current pipeline pathway has a lot of magic. Building pipelines requires quite a bit of quoting, and especially when pipelines are mixed with actual values, unquoting the right thing at the right time is a bit of a PITA.
Working to reduce the magic involved in this, while still keeping the pipeline as data, would help new people get used to it and also help the understandability of the codebase.
The following code makes cider unresponsive (presumably because (range) is an infinite sequence). Interrupting the REPL doesn't work; it gives "sync nrepl request timed out (op close)".
(-> (ds/->dataset [{:a 1 :b 2}])
    (ds/update-column :b #(tech.ml.dataset.column/set-missing % (range))))
When I shut down the JVM process I received an out-of-memory error in the REPL.
After calling dfn/max or dfn/min on a column I get the following exception:
1. Unhandled java.lang.Exception
Failed to find op: :min
builtin_op_providers.clj: 30 tech.v2.datatype.builtin-op-providers/get-op
builtin_op_providers.clj: 26 tech.v2.datatype.builtin-op-providers/get-op
Hi,
Thx for this library! I'm using it for data science in Clojure and am experimenting with pymc3 and libpython-clj :)
I think I've stumbled upon a bug.
I read that :object type columns are implemented now. This works great when I use vectors as the values of a column; however, when I try to use tech.v2.tensor as the values of a column I get an error. Is something going wrong with parsing?
Reproducible example:
;; works
(ds/->dataset [{:a [0.4935 0.5552]} {:a [0.4935 0.5552]}])
;; => _unnamed [2 1]:
;; | :a |
;; |-----------------|
;; | [0.4935 0.5552] |
;; | [0.4935 0.5552] |
;; doesn't work
(ds/->dataset [{:a (tech.v2.tensor/->tensor [0.4935 0.5552])}
               {:a (tech.v2.tensor/->tensor [0.4935 0.5552])}])
;; Execution error (ClassCastException) at tech.ml.dataset.parse.mapseq$map$reify$reify$reify__52074/doubleValue (mapseq.clj:39).
;; tech.v2.tensor.impl.Tensor cannot be cast to java.lang.Number
I'm using the newest versions of libpython-clj and datatype:
{:deps {clj-python/libpython-clj    {:mvn/version "1.38"}
        techascent/tech.ml.dataset {:mvn/version "2.0-beta-11"}
        techascent/tech.datatype   {:mvn/version "5.0-beta-2"}}}
Some of the API functions are dataset-first, some are dataset-last. Regardless, if the dataset is first it should always be first, with options and additional arguments following; the inverse is true for dataset-last functionality. Filter, for instance, breaks this, which makes the options to filter irritating to use.
The format "yyyyMMdd" is included in tech.ml.dataset.parse.datetime/date-parser-patterns, so I assumed that means it parses automatically, but it doesn't:
(ds/select-rows (ds/->dataset "https://covidtracking.com/api/v1/states/daily.csv"
                              {:column-whitelist ["date"]})
                (range 3))
;; => https://covidtracking.com/api/v1/states/daily.csv [3 1]:
;; | date |
;; |----------|
;; | 20200422 |
;; | 20200422 |
;; | 20200422 |
(ds/select-rows (ds/->dataset "https://covidtracking.com/api/v1/states/daily.csv"
                              {:column-whitelist ["date"]
                               :parser-fn {"date" [:local-date "yyyyMMdd"]}})
                (range 3))
;; => https://covidtracking.com/api/v1/states/daily.csv [3 1]:
;; | date |
;; |------------|
;; | 2020-04-22 |
;; | 2020-04-22 |
;; | 2020-04-22 |
Reasonable extension :-).
We have two ways of API creation: name-values-seq->dataset and ->dataset (or ->>dataset). Options are passed differently: the former takes them at a variadic position, the latter as a second argument.
BTW, there should be one function (you can infer whether something is a map or a sequence of length-2 sequences).
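A sketch of that shape-inference idea (options handling elided; the dispatch test is an assumption of this sketch):
(defn dataset
  [data]
  (if (or (map? data)
          (and (sequential? data)
               (every? #(and (sequential? %) (= 2 (count %))) data)))
    (ds/name-values-seq->dataset data)
    (ds/->dataset data)))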
Aggregation on a dataset and on groups (after group-by) should result in a dataset. As @keesterbrugge shows, it can be done with some Clojure operations. I wonder if there are more optimal versions.
(defn aggregate
  ([agg-fns-map ds]
   (aggregate {} agg-fns-map ds))
  ([m agg-fns-map ds]
   (into m (map (fn [[k agg-fn]]
                  [k (agg-fn ds)])
                agg-fns-map))))

(def aggregate->dataset (comp ds/->dataset vector aggregate))

(defn group-by-columns-and-aggregate
  [gr-colls agg-fns-map ds]
  (->> (ds/group-by identity ds gr-colls)
       (map (fn [[group-idx group-ds]]
              (aggregate group-idx agg-fns-map group-ds)))
       ds/->dataset))
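A hypothetical usage, assuming dfn/mean from tech.v2.datatype.functional and the symbol/date/price stocks dataset shown elsewhere on this page:
(require '[tech.v2.datatype.functional :as dfn])

(group-by-columns-and-aggregate ["symbol"]
                                {:mean-price #(dfn/mean (% "price"))}
                                stocks)
;; => one row per symbol: the group key plus a :mean-price column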
Results can be found here: https://github.com/genmeblog/techtest/blob/master/src/techtest/core.clj#L642
Referenced in pull request #25. I think the behavior for left and right joins would be to include the fields from both lhs and rhs. The difference between the two would be in the missing fields and missing values on either the lhs fields or rhs fields in the resulting joined table. Uncertain if - given the design only returns indices - this is a pedantic difference. Any join operation will then return all the fields of both tables. The test predicates in #25 can be altered trivially if I'm mistaken.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/duplicated
Pandas seems strictly better to me, as you get a choice of whether you want the first or the last occurrence.
Continuation of #44.
Using a parallel map-reduce algorithm of just iterating every row and building a hashtable of hash codes to index array lists seems like a good way of doing this. Then potentially there is a fast combine method for the hashtables that results in the set of indexes to either add or remove. That is sort of like (group-by-index identity dataset) followed by iterating the indexes and taking first, last, butfirst, or butlast.
I guess that also implies an algorithm to get the hash value of a given row, which in and of itself is a thing.
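A minimal single-threaded sketch of the keep-first variant, using full row values as keys rather than hash codes, and assuming ds/mapseq-reader and ds/select-rows behave as elsewhere in the API:
(defn drop-duplicates
  [ds]
  (let [rows (ds/mapseq-reader ds)            ;; seq of row maps
        keep-idx (->> (map-indexed vector rows)
                      (reduce (fn [seen [idx row]]
                                (if (contains? seen row)
                                  seen
                                  (assoc seen row idx)))
                              {})
                      vals
                      sort)]
    (ds/select-rows ds keep-idx)))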
Naive attempt:
tech.ml.dataset.vega> ds
#<tech.ml.dataset.generic_columnar_dataset.GenericColumnarDataset@401f8598 [560 3]:
| symbol | date | price |
|--------+------------+--------|
| MSFT | Jan 1 2000 | 39.810 |
| MSFT | Feb 1 2000 | 36.350 |
| MSFT | Mar 1 2000 | 43.220 |
| MSFT | Apr 1 2000 | 28.370 |
| MSFT | May 1 2000 | 25.450 |
| MSFT | Jun 1 2000 | 32.540 |
| MSFT | Jul 1 2000 | 28.400 |
| MSFT | Aug 1 2000 | 28.400 |
| MSFT | Sep 1 2000 | 24.530 |
| MSFT | Oct 1 2000 | 28.020 |
| MSFT | Nov 1 2000 | 23.340 |
| MSFT | Dec 1 2000 | 17.650 |
| MSFT | Jan 1 2001 | 24.840 |
| MSFT | Feb 1 2001 | 24.000 |
| MSFT | Mar 1 2001 | 22.250 |
| MSFT | Apr 1 2001 | 27.560 |
| MSFT | May 1 2001 | 28.140 |
| MSFT | Jun 1 2001 | 29.700 |
| MSFT | Jul 1 2001 | 26.930 |
| MSFT | Aug 1 2001 | 23.210 |
| MSFT | Sep 1 2001 | 20.820 |
| MSFT | Oct 1 2001 | 23.650 |
| MSFT | Nov 1 2001 | 26.120 |
| MSFT | Dec 1 2001 | 26.950 |
| MSFT | Jan 1 2002 | 25.920 |
>
tech.ml.dataset.vega> (.getTime (.parse (java.text.SimpleDateFormat. "MMM dd yyyy") "Oct 1 2001"))
1001916000000
tech.ml.dataset.vega> (ds/update-column ds "date"
                        (fn [col]
                          (->> col
                               (map (fn [v]
                                      (.getTime (.parse (java.text.SimpleDateFormat. "MMM dd yyyy") v)))))))
Execution error (IllegalArgumentException) at tech.libs.tablesaw.tablesaw-column/make-empty-column (tablesaw_column.clj:446).
No matching clause: :object
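A possible workaround, using the :parser-fn option demonstrated elsewhere on this page to parse the dates at load time (the file name and the "MMM d yyyy" pattern for this data are assumptions):
(ds/->dataset "stocks.csv" {:parser-fn {"date" [:local-date "MMM d yyyy"]}})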
There is a use case where we want to filter based on some column calculations. These calculations can't be done by a simple map.
For example: we need to filter by a (moving) average, or by R's rank. The flow looks like this:
Here is a concrete case with rank: https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj#L554
I can think of a solution: a filter (filter-by, maybe?) which takes a sequence and selects only the rows corresponding to the result of a predicate.
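A minimal sketch of the proposed filter-by (name and semantics hypothetical); values is any sequence aligned with the dataset's rows:
(defn filter-by
  [pred values ds]
  (ds/select-rows ds (keep-indexed (fn [idx v] (when (pred v) idx))
                                   values)))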
Requiring the dataset namespace is very, very slow; it takes 25s on my machine:
user> (time (require '[tech.ml.dataset :as ds]))
"Elapsed time: 25646.0355 msecs"
nil
I found a page full of datasets, but the one I've chosen is zipped. Can you add the zip format as well? (Fortunately, it's supported by the Java SDK.)
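A possible workaround in the meantime, assuming the zip contains a single CSV entry and that ->dataset accepts an input stream with a :file-type option:
(import '[java.util.zip ZipInputStream])

(with-open [zis (ZipInputStream. (java.io.FileInputStream. "data.zip"))]
  (.getNextEntry zis) ;; advance the stream to the first entry in the archive
  (ds/->dataset zis {:file-type :csv}))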
I would like to have a function for reshaping. Currently I have only one need: columns -> rows.
input:
| a | b | c |
|---|---|---|
| 1 | 2 | 3 |
| 4 | 5 | 6 |
result:
| column | value |
|--------|-------|
| a      | 1     |
| a      | 4     |
| b      | 2     |
| b      | 5     |
| c      | 3     |
| c      | 6     |
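A minimal sketch of that columns -> rows reshape (the name is hypothetical):
(defn columns->rows
  [ds]
  (ds/->dataset
   (for [col-name (ds/column-names ds)
         value    (seq (ds col-name))]
     {:column col-name :value value})))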
It would be great to have the possibility to sort a dataset by several columns, each with a given order. It's fairly easy to sort by columns with keyword names in ascending order:
(->> (ds/->dataset [{:a 1 :b 2} {:a 11 :b -2} {:a 1 :b -1}])
     (ds/sort-by (juxt :a :b)))
;; => _unnamed [3 2]:
;; | :a | :b |
;; |----+----|
;; | 1 | -1 |
;; | 1 | 2 |
;; | 11 | -2 |
But things are getting more complicated when:
I think the enhancement should add sort-by-columns with the same arguments as sort-by-column. Also, compare-fn could be either a function (as it is now) or a seq of orders, each one of :asc or :desc.
Above example (imagined):
(->> (ds/->dataset [{:a 1 :b 2} {:a 11 :b -2} {:a 1 :b -1}])
     (ds/sort-by-columns [:a :b] [:asc :desc]))
;; => _unnamed [3 2]:
;; | :a | :b |
;; |----+----|
;; | 1 | 2 |
;; | 1 | -1 |
;; | 11 | -2 |
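A rough sketch of the imagined sort-by-columns on top of the existing sort-by, assuming its three-arity [key-fn compare-fn dataset] form:
(defn sort-by-columns
  [col-names orders ds]
  (let [dirs (map #(if (= :desc %) -1 1) orders)]
    (ds/sort-by (fn [row] (mapv row col-names))
                (fn [a b]
                  ;; first non-zero per-column comparison, sign-flipped for :desc
                  (or (first (drop-while zero?
                                         (map (fn [x y dir] (* dir (compare x y)))
                                              a b dirs)))
                      0))
                ds)))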
This has been suggested several times by several people and it is just getting to be time.
We would like to get these things out of the transform:
Dependency analysis - which columns are dependent upon which inputs, thus making it easier to trim out columns when we aren't going to infer or train on them.
Variable promotion - be able to declaratively specify variables that get automagically promoted to a higher level. This level could be a gridsearch level or it could be a UI level.
Execution of the graph produces both a new graph and a new dataset. In this way we preserve the ability to auto-produce the inference process from the training process.
We intend to model the nodes of the graph as maps with ids with an edge list similar to cortex.
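A sketch of what that graph-as-data shape might look like (purely illustrative; all keys here are hypothetical):
{:nodes {:raw      {:id :raw      :op :input}
         :mean-a   {:id :mean-a   :op :mean     :inputs [:raw]}
         :centered {:id :centered :op :subtract :inputs [:raw :mean-a]}}
 :edges [[:raw :mean-a] [:raw :centered] [:mean-a :centered]]}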
tech.ip-addresses> (println (slurp "./test/data/ip-addresses.csv"))
name,ip
Harold,10.0.0.1
Google,172.217.1.206
nil
tech.ip-addresses> (ds/->dataset "./test/data/ip-addresses.csv")
#<tech.ml.dataset.impl.dataset.Dataset@3b60f35c ./test/data/ip-addresses.csv [2 2]:
| name | ip |
|--------+-----------------|
| Harold | PT10H0.001S |
| Google | PT175H37M1.206S |
>
What is the meaning behind a categorical column?
My understanding is that a column is categorical when its content is discrete. It can be inferred from the type, but usually it should be inferred from the content, or the user should mark such a column as categorical. Not every String column is categorical, and there are some int64 columns which are categorical.
What is interesting for me is to have the ability to get the sequence of categories without calling unique on the column. Maybe we should have an operation, set-as-categorical, on a dataset and column name, which adds a meta tag with a (lazy) list of categories.
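A sketch of the proposed set-as-categorical (name hypothetical), assuming tech.ml.dataset.column/metadata and set-metadata behave as elsewhere on this page:
(require '[tech.ml.dataset.column :as ds-col])

(defn set-as-categorical
  [ds col-name]
  (ds/update-column ds col-name
                    (fn [col]
                      ;; store a lazily realized category list in the metadata
                      (ds-col/set-metadata col
                                           (assoc (ds-col/metadata col)
                                                  :categorical? true
                                                  :categories (delay (vec (distinct (seq col)))))))))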
In tech.ml.dataset.impl.column:
(defn make-container
  ([dtype n-elems]
   (case dtype
     :string (make-string-table n-elems "")
     :text (let [list-data (ArrayList.)]
             (dotimes [iter n-elems]
               (.add list-data ""))
             list-data) ;; this should be added, otherwise the let returns nil, not the ArrayList
     (dtype/make-container :list dtype n-elems)))
  ([dtype]
   (make-container dtype 0)))
Is there any way to get the list of keys from a group-by operation without the actual grouping? I can do group-by and then call keys, but I'm not sure that's an efficient way.
Tablesaw supports this. It would just take a bit of time to get in there and figure out the caveats.
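For the common single-column case, the distinct values of that column give the same key set without building the groups; a hedged sketch against the stocks dataset shown below:
(distinct (seq (ds "symbol")))
;; => the same keys a group-by-column on "symbol" would produce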
Working on a subcolumn implementation as a proof of concept; got almost there and noticed (after copying over the implementation of TableSawColumn and testing) that col-proto/get-column-value is abstract here. Is there a reason for this? It looks like the API function of the same name just delegates to the protocol, so I think not.
We have some variants for column(s) selection: column and columns return columnar data types, while select-columns returns a dataset. That's ok. I would add:
select-column - to create a dataset with one column
select-columns:
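The proposed select-column would be a one-liner over the existing API:
(defn select-column
  [ds col-name]
  (ds/select-columns ds [col-name]))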
What is the best way if I want to create my own aggregate function?
I know that I can always reduce on columns, but then I escape from the datatype flow.
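One way to stay in the datatype flow is to compose dfn reductions instead of clojure.core/reduce; a hedged example (the weighted-mean aggregate itself is hypothetical):
(require '[tech.v2.datatype.functional :as dfn])

(defn weighted-mean
  [value-col weight-col]
  (/ (dfn/sum (dfn/* value-col weight-col))
     (dfn/sum weight-col)))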
;; integer
(as-> (ds/->dataset [{:a 1 :b 2}]) $
  (ds/update-column $ :b #(tech.ml.dataset.column/set-missing % [0]))
  (ds/new-column $ :a-b (dfn/- ($ :a) ($ :b))))
;; => Error printing return value (ArithmeticException) at clojure.lang.Numbers/throwIntOverflow (Numbers.java:1576).
;; integer overflow
;; float
(as-> (ds/->dataset [{:a 1.0 :b 2.0}]) $
  (ds/update-column $ :b #(tech.ml.dataset.column/set-missing % [0]))
  (ds/new-column $ :a-b (dfn/- ($ :a) ($ :b)))
  (ds/descriptive-stats $))
;; => _unnamed: descriptive-stats [3 9]:
;; | :col-name | :datatype | :n-valid | :n-missing | :mean | :min | :max | :standard-deviation | :skew |
;; |-----------+-----------+----------+------------+-------+-------+-------+---------------------+-------|
;; | :a | :float64 | 1 | 0 | 1,000 | 1,000 | 1,000 | 0,000 | NAN |
;; | :a-b | :float64 | 1 | 0 | NAN | NAN | NAN | 0,000 | NAN |
;; | :b | :float64 | 0 | 1 | NAN | NAN | NAN | NAN | NAN |
> (ds/ds-concat nil ds)
Execution error (IllegalArgumentException) at tech.ml.protocols.dataset/eval36872$fn$G (dataset.clj:7).
No implementation of method: :columns of protocol: #'tech.ml.protocols.dataset/PColumnarDataset found for class: nil
This could just result in ds instead of an error.
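A thin wrapper gives the suggested behaviour in the meantime (name hypothetical; at least one non-nil dataset assumed):
(defn concat-some
  [& datasets]
  (apply ds/ds-concat (remove nil? datasets)))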
Just documenting here what I found regarding the possible memory leak due to cached sequences in tech.ml.dataset.parse/csv->columns.
Retaining the head of the invoked iterator-seq on the iterable produced by the univocity CSV parser should result in e.g. large text files being fully realized and cached until after the dataset is constructed and the function returns. In theory, this would trivially blow the heap (e.g. 1GB of string references could expand to significantly more in memory, etc.).
I tested with the corrected test data, using a filesize of 1GB with the legacy (leaky) implementation. I then tested with the proposed fix that uses an atom to clear the reference later, here.
For a baseline file of ~1.3GB, the dataset is able to load in both cases in about the same time. More importantly, the memory footprint and GC pressure are fairly consistent, with the freed implementation having a bit less (around 1.5GB max used). It looks like univocity's CSV parser is returning many flyweight objects for the rows in its iterable, rather than actual arrays of strings, since the memory footprint is surprisingly low. So the typically destructive caching behavior doesn't impact us at this scale.
To test this further, I pumped the row count for the synthetic data, and generated a 13.4 GB dataset to load. On a 4GB heap, we do see a difference in the memory profile, but again not drastic. Both implementations peak out around 3GB eventually, with the leaking version running into GC issues earlier. Consequently, the non-leaking version is around 5% faster to load the dataset. Both datasets load on a 4GB heap though, which is actually fairly impressive.
In all, I think this is a minor issue, since univocity's parsing implementation appears to be efficient, thus mitigating the effects of the leak. For implementation's sake, it's probably fine to leave it as is, unless people run into problems in the far future. In the extreme, say someone is trying to load a file that vastly outstrips available memory (say 100GB), it may be a problem, but at that point there are likely smarter options for processing. For laptop use cases, the existing solution appears to scale fine.
Implementing a left-join example, I realized that having a flyweight empty column that projects, in O(1) space, an arbitrary range of missing values would be very useful for these kinds of joins. Right now we have to manually create a missing set (e.g. via new-column) full of all the entries. We should be able to reify a set implementation that is just a countable thing that returns a range for reader/seq and checks whether indices fall within the range for containment.
The only tricky bit is promotion if we allow items to be added to the set: either coerce it to the full missing set as before, or work on a sparse scheme to allow "present" values, sort of like we use missing values for the column stuff.
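A rough sketch of the flyweight idea using plain Clojure interfaces (the real protocol surface would differ):
(defn missing-range
  "O(1)-space stand-in for the set #{0 1 ... (dec n)}."
  [n]
  (reify
    clojure.lang.Counted
    (count [_] (int n))
    clojure.lang.Seqable
    (seq [_] (seq (range n)))
    clojure.lang.IFn
    (invoke [_ idx]
      ;; containment check: in-range indices are "present" in the set
      (when (and (integer? idx) (<= 0 idx) (< idx n))
        idx))))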
Is this proper behaviour?
(first (ds/name-values-seq->dataset {:t (repeatedly 10 #(rand-nth [1 2 3 4]))}))
;; => #tech.ml.dataset.column<float64>[10]
:t
[1.000, 3.000, 4.000, 3.000, 4.000, 4.000, 1.000, 2.000, 1.000, 2.000, ]
When I export a dataset from R which has row names, the resulting CSV looks like this:
"","Rural Male","Rural Female","Urban Male","Urban Female"
"50-54",11.7,8.7,15.4,8.4
"55-59",18.1,11.7,24.3,13.6
"60-64",26.9,20.3,37,19.3
"65-69",41,30.9,54.6,35.1
"70-74",66,54.3,71.1,50
The CSV contains an empty string as the first column's name (the row names). In such a case the import fails. I think it would be great to assign some dummy name in that case.
Could you consider an option to add reading data from URL?
It would be nice to have:
(ds/->dataset "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
Which (on beta-22) throws a Missing config value: :tech-io-cache-local exception.
Workaround:
(ds/->dataset (.openStream (java.net.URL. "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")))
I can imagine such helper functions:
first n rows (equivalent to (select-rows ds (range n)))
last n rows
n random rows (with repetitions or not)
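Sketches of these helpers (names hypothetical; ds/row-count and ds/select-rows assumed available):
(defn head-rows [n ds]
  (ds/select-rows ds (range (min n (ds/row-count ds)))))

(defn tail-rows [n ds]
  (let [rc (ds/row-count ds)]
    (ds/select-rows ds (range (max 0 (- rc n)) rc))))

(defn sample-rows [n ds] ;; with repetitions
  (ds/select-rows ds (repeatedly n #(rand-int (ds/row-count ds)))))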
Hi @cnuernber,
Am I wrong in expecting that
(require '[tech.ml.dataset :as ds])

(= (ds/->dataset [{:a 0 :tensor (tech.v2.tensor/->tensor [1 2 3])}
                  {:a 1 :tensor (tech.v2.tensor/->tensor [4 5 6])}])
   (ds/name-values-seq->dataset {:a [0 1]
                                 :tensor (tech.v2.tensor/->tensor [[1 2 3]
                                                                   [4 5 6]])}))
Instead, I get back the following error:
Execution error (ExceptionInfo) at tech.ml.dataset.impl.dataset/name-values-seq->dataset (dataset.clj:211).
Different sized columns detected: clojure.lang.LazySeq@405
Take a look at this dataset: https://github.com/generateme/cljplot/blob/master/data/chem97.json
When I load it, :gcsescore gets the object type instead of float64.
(chem97 :gcsescore)
;; => #tech.ml.dataset.column<object>[31022]
:gcsescore
[6.625, 7.625, 7.250, 7.500, 6.444, 7.750, 6.750, 6.909, 6.375, 7.750, 7.857, 7.333, 7.750, 7.700, 6.300, 7.300, 6.636, 7.272, 7.200, 6.454, ...]
(reduce + (chem97 :gcsescore))
;; => 194994.49800000832
->flyweight isn't very descriptive, although it is fun. ->seq-of-maps would be a clearer name.
Optimizing it such that it creates a record on the fly, so that all maps share the same keys and structure, would be a solid move forward.
semantic-csv does this here using create-struct.
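The create-struct idea in miniature, independent of the ds API (keys and rows here are made up for illustration):
(let [ks [:a :b]
      row-struct (apply create-struct ks)   ;; one shared key layout
      rows [[1 2] [3 4]]]
  (map #(apply struct row-struct %) rows))
;; => ({:a 1, :b 2} {:a 3, :b 4})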
Went to implement a naive left-join and a corresponding test, working off the ww3 sample.
Using the following data:
lhs
[{"CustomerID" 1,
"CustomerName" "Alfreds Futterkiste",
"ContactName" "Maria Anders",
"Address" "Obere Str. 57",
"City" "Berlin",
"PostalCode" 12209,
"Country" "Germany"}
{"CustomerID" 2,
"CustomerName" "Ana Trujillo Emparedados y helados",
"ContactName" "Ana Trujillo",
"Address" "Avda. de la Constitución 2222",
"City" "México D.F.",
"PostalCode" 5021,
"Country" "Mexico"}
{"CustomerID" 3,
"CustomerName" "Antonio Moreno Taquería",
"ContactName" "Antonio Moreno",
"Address" "Mataderos 2312",
"City" "México D.F.",
"PostalCode" 5023,
"Country" "Mexico"}]
rhs
[{"OrderID" 10308,
"CustomerID" 2,
"EmployeeID" 7,
"OrderDate" "1996-09-18",
"ShipperID" 3}
{"OrderID" 10309,
"CustomerID" 37,
"EmployeeID" 3,
"OrderDate" "1996-09-19",
"ShipperID" 1}
{"OrderID" 10310,
"CustomerID" 77,
"EmployeeID" 8,
"OrderDate" "1996-09-20",
"ShipperID" 2}]
The result of join-by-column is a map of :join-table and :rhs-missing, not something that can immediately be used by the ds protocols and API. The stock join-by-column completes, and the results apparently print successfully.
{:join-table join-table [1 11]:
| CustomerID | CustomerName | ContactName | Address | City | PostalCode | Country | OrderID | EmployeeID | OrderDate | ShipperID |
|------------+------------------------------------+--------------+-------------------------------+-------------+------------+---------+---------+------------+------------+-----------|
| 2 | Ana Trujillo Emparedados y helados | Ana Trujillo | Avda. de la Constitución 2222 | México D.F. | 5021 | Mexico | 10308 | 7 | 1996-09-18 | 3 |
, :rhs-missing-indexes [1 2]}
If I try to apply the ds functions to the :join-table result, e.g. columns, I get an error:
[Error printing return value (IndexOutOfBoundsException) at it.unimi.dsi.fastutil.longs.LongArrayList/getLong (LongArrayList.java:267).
Index (1) is greater than or equal to list size (1)
user>
Seems like maybe this is an off-by-one error somewhere (looks like we should be getting 0 instead of 1 as an index).
Maybe the :rhs-missing information could be elevated to metadata on the join-table, so that the type of join-by-column is String|Keyword|Number -> PColumnarDataset -> PColumnarDataset -> PColumnarDataset?
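A sketch of that elevation, assuming datasets support Clojure metadata and using the :rhs-missing-indexes key from the printed result above (lhs-ds and rhs-ds are placeholders):
(let [{:keys [join-table rhs-missing-indexes]}
      (join-by-column "CustomerID" lhs-ds rhs-ds)]
  (vary-meta join-table assoc :rhs-missing-indexes rhs-missing-indexes))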
Set up Stripe trial periods.
Doing dataset -> vega is good and fun and has already proved very useful.
It occurs to me that there's nothing really dataset-specific in the operations, however. I could imagine splitting the vega/viz portion out to work on (probably?) sequences of values, and then calling into that with thin wrappers that extract the sequences of values from datasets - to keep the code easy to use from ds.
The upside would be that if some data doesn't happen to be in a dataset, we could avoid the extra step of turning it into a dataset just for the purposes of visualization.
This is the manual implementation that ended up generating the aforementioned issues elsewhere.
Looks like there's a more elegant way to join; the left-join test may be useful going forward, though.
I don't see a function to discard some rows by index.
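For what it's worth, ds/drop-rows (used earlier on this page) appears to cover this:
(ds/drop-rows (ds/->dataset [{:a 1} {:a 2} {:a 3}]) [0 2])
;; => a one-row dataset containing only the {:a 2} row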