Comments (7)
There is an inefficient version of this that returns a mapseq for charts:
https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/ml/dataset.clj#L123
Some links to similar types of things in pandas and data.table:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
https://rdrr.io/cran/data.table/man/transpose.html
from tech.ml.dataset.
I sketched this solution:
(require '[tech.ml.dataset :as ds]
         '[tech.ml.dataset.column :as col])

;; Stack the named columns into one long [:column :value] dataset.
(defn transpose [ds col-names-seq]
  (reduce ds/concat
          (map (fn [col-name]
                 (let [data (ds col-name)]
                   (ds/new-dataset
                    [(col/new-column :column (repeat (count data) col-name))
                     (col/set-name data :value)])))
               col-names-seq)))
(-> [{:a 1 :b 2 :c 3} {:a 4 :b 5 :c 6}]
    (ds/->dataset)
    (transpose [:a :b :c]))
;; => null [6 2]:
;; | :column | :value |
;; |---------+--------|
;; |      :a |      1 |
;; |      :a |      4 |
;; |      :b |      2 |
;; |      :b |      5 |
;; |      :c |      3 |
;; |      :c |      6 |
That is a great formulation of the actual answer, much more to the point and more efficient than what I had previously. The result would be space efficient and, with a small bit of effort, generally efficient if concat recognized that the datasets all have the same number of rows. What I had previously can be described in these terms. That means index generation is a quot instead of a scan of a list of lengths.
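The quot remark can be sketched in plain Clojure. This is only an illustration under the assumption that every concatenated piece has the same row count; `locate-row` and `col-names` are hypothetical names, not part of tech.ml.dataset.

```clojure
;; Hypothetical helper, not a tech.ml.dataset function.
;; Assumes every source piece has exactly n-rows rows.
(defn locate-row
  "Maps a global row index in the concatenated result to
  [source-index local-row-index]."
  [n-rows global-idx]
  [(quot global-idx n-rows) (rem global-idx n-rows)])

(def col-names [:a :b :c])

;; Row 3 (zero-based) of the 6-row result, with 2 rows per source column.
(let [[src local] (locate-row 2 3)]
  [(col-names src) local])
;; => [:b 1]
```

With unequal lengths the same lookup would need a scan (or binary search) over cumulative lengths, which is the cost the equal-length case avoids.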
The only question left: is transpose the correct name? Numpy (and tech.datatype) transpose is an in-place remapping of several dimensions, so you would expect a shape of [n-cols n-rows] after a transpose of [n-rows n-cols] by [1 0]. This returns an object of shape [2 (* n-rows n-cols)]. Likewise, reshape has a specific meaning that is different.
Maybe columnwise-concat?
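To make the naming distinction concrete, here is a small sketch on plain nested vectors rather than datasets. `transpose-rows` and `columnwise-pairs` are hypothetical names used only for illustration.

```clojure
(def rows [[1 2 3] [4 5 6]]) ; shape [2 3], i.e. [n-rows n-cols]

;; A true transpose swaps the two dimensions.
(defn transpose-rows [rows]
  (apply mapv vector rows))

(transpose-rows rows)
;; => [[1 4] [2 5] [3 6]]   shape [3 2]

;; The operation discussed here instead stacks the columns end to end,
;; tagging each value with the name of the column it came from.
(defn columnwise-pairs [col-names rows]
  (vec (for [[j col-name] (map-indexed vector col-names)
             row rows]
         [col-name (nth row j)])))

(columnwise-pairs [:a :b :c] rows)
;; => [[:a 1] [:a 4] [:b 2] [:b 5] [:c 3] [:c 6]]
```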
columnwise-concat: yes, perfect. Transpose is not the proper name; I named it without too much thinking.
I will also try to check and analyse other reshaping methods (if we really need fancier ways of reshaping).
The tidyverse has a similar concept called pivot_longer. Here is the documentation, with examples: https://tidyr.tidyverse.org/reference/pivot_longer.html
The main difference is that you can choose which columns to "transpose" on. If I adapt your code example to include a column :d that we do not "transpose" on, it would look something like the following:
(-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
    (ds/->dataset)
    (pivot-longer [:a :b :c]))
;; =>
;; | :column | :value | :d |
;; |---------+--------+----|
;; |      :a |      1 |  1 |
;; |      :a |      4 |  2 |
;; |      :b |      2 |  1 |
;; |      :b |      5 |  2 |
;; |      :c |      3 |  1 |
;; |      :c |      6 |  2 |
This is a function that is often used to get a dataset into "tidy" format. I think this would be useful. The current implementation of transpose drops the :d column:
(-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
    (ds/->dataset)
    (transpose [:a :b :c]))
;; => null [6 2]:
;; | :column | :value |
;; |---------+--------|
;; |      :a |      1 |
;; |      :a |      4 |
;; |      :b |      2 |
;; |      :b |      5 |
;; |      :c |      3 |
;; |      :c |      6 |
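As a rough sketch of the behaviour being asked for, the same reshaping can be written over a plain sequence of maps. This is not the library's implementation; `pivot-longer-maps` is a hypothetical name, and the sketch assumes the kept columns are not themselves called :column or :value.

```clojure
;; Hypothetical helper operating on plain maps, not on a dataset.
(defn pivot-longer-maps
  "For each name in col-names and each row, emits one map holding the
  column name under :column, its value under :value, and every other
  key of the row unchanged."
  [rows col-names]
  (for [col-name col-names
        row rows]
    (-> (apply dissoc row col-names)
        (assoc :column col-name
               :value (get row col-name)))))

(def long-rows
  (pivot-longer-maps [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
                     [:a :b :c]))

(first long-rows)
;; => {:d 1, :column :a, :value 1}
```

The rows come out in the same column-major order as the table above, and :d is repeated once per pivoted column rather than dropped.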
I fixed this one and mistyped 57 instead of 47 in my changelist.