Comments (6)
Welcome in! Thanks for the feedback.
I'd like to go to each nil, take its row and find the row which both has a non-nil in the same column and is closest in terms of Cartesian distance (across all features/columns) and then copy that value over.
Yes, the first way I'd try this is with `row-map`.
Thinking about it for a few seconds, the nearest queries may be complicated by the presence of the `nil`s, but I suspect it's manageable.
I don't know of any name for the procedure you're describing here. Do you know if it has a name?
Definitely interested to see what you come up with. Report back when you try it.
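For readers following along, here is a minimal sketch of the procedure quoted above, written in plain Clojure over a vector of row maps rather than against the dataset API. It resembles what is often called nearest-neighbour imputation. All function names are hypothetical, and the handling of `nil`s inside the distance (only columns populated in both rows are compared) is an assumption, not something the thread settles.

```clojure
(defn distance
  "Euclidean distance between two row maps over the columns `ks`.
  Assumption: only columns where both rows have a value are compared.
  Returns nil when the rows share no populated column."
  [ks row-a row-b]
  (let [diffs (for [k ks
                    :let [a (get row-a k)
                          b (get row-b k)]
                    :when (and (some? a) (some? b))]
                (- a b))]
    (when (seq diffs)
      (Math/sqrt (double (reduce + (map #(* % %) diffs)))))))

(defn nearest-donor
  "Among `rows`, the row closest to `row` that has a value in column `k`,
  or nil when no comparable row exists."
  [ks rows row k]
  (->> rows
       (filter #(some? (get % k)))
       (keep (fn [cand]
               (when-let [d (distance ks row cand)]
                 [d cand])))
       (sort-by first)
       first
       second))

(defn impute-nearest
  "Fills each nil with the value from the nearest row that has that column.
  Distances always use the original rows, never partially filled ones."
  [rows]
  (let [ks (distinct (mapcat keys rows))]
    (mapv (fn [row]
            (reduce (fn [r k]
                      (if (some? (get r k))
                        r
                        (if-let [donor (nearest-donor ks rows row k)]
                          (assoc r k (get donor k))
                          r)))
                    row
                    ks))
          rows)))

(def rows [{:a 1.0 :b 2.0 :c 10.0}
           {:a 1.1 :b nil :c 10.5}
           {:a 5.0 :b 9.0 :c 50.0}])

;; The second row is closest to the first, so it borrows :b from it.
(:b (second (impute-nearest rows))) ;; => 2.0
```

The per-row step is the part that would presumably move into a `row-map` call when working on an actual dataset. This sketch is quadratic in the number of rows, which is fine for an experiment but not for large data.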
from tech.ml.dataset.
Looks like a fun musical space you're working in. It's very cool that you were able to test the method you developed using the library.
I still suspect this method probably has a name and has been thought through quite thoroughly. Chris' original intuition that this sort of operation might best be done in a dense tensor space still strikes me as correct.
Sorry, I haven't been able to refactor this into a lil library or something directly useable :(
There's no need to apologize. You're of course right that this is the next step in the real work to be done here. What gets done in that step might suggest a way to generalize the API to be more conducive to these kinds of experiments (especially if that work is done with a little bit of an eye toward that potential generalization).
I think the program you linked sort of proves that the initial idea proposed here (changing the way `replace-missing` works) isn't that important: the experiment was run relatively easily with the existing API. And if it were someone's job at some point to run, say, a hundred such experiments, then the generalization would become super-obvious essentially immediately.
Your contribution is valued.
Generally `replace-missing` works on only one column at a time (even if you select multiple columns, it will go one by one). As I described on Reddit, this function provides the functionality found in Pandas/dplyr, so it should be enough for most cases. And yes, this function is not extensible in the way you've described.
Anyway, here is a possible approach.
- To get the indexes of rows with missing values, call `tech.v3.dataset.column/missing` on a column. Missing values are kept in a RoaringBitmap structure. It's a sequence with some nice features. For example, `.previousAbsentValue` or `.nextAbsentValue` will give you the index of a non-missing row.
- With the above you can find the value matching your criteria.
- [undocumented] `replace-missing` with the `:value` strategy accepts a map of index-value pairs, so you can supply the full set of replacement values yourself.
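;; aliases assumed by the snippet below
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset.column :as col])

(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2 nil nil nil nil nil 4 nil 11 nil nil]
                       :b [2 2 2 nil nil nil nil nil nil 13 nil 3 4 5 5]}))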
DSm2
;; => _unnamed [15 2]:
;; | :a | :b |
;; |-----:|---:|
;; | | 2 |
;; | | 2 |
;; | | 2 |
;; | 1.0 | |
;; | 2.0 | |
;; | | |
;; | | |
;; | | |
;; | | |
;; | | 13 |
;; | 4.0 | |
;; | | 3 |
;; | 11.0 | 4 |
;; | | 5 |
;; | | 5 |
;; indexes of missing values
(col/missing (DSm2 :a)) ;; => {0,1,2,5,6,7,8,9,11,13,14}
(col/missing (DSm2 :b)) ;; => {3,4,5,6,7,8,10}
(class (col/missing (DSm2 :a))) ;; => org.roaringbitmap.RoaringBitmap
;; index of the nearest non-missing value in column `:a` starting from 0
(.nextAbsentValue (col/missing (DSm2 :a)) 0) ;; => 3
;; there is no previous non-missing
(.previousAbsentValue (col/missing (DSm2 :a)) 0) ;; => -1
;; replace some missing values by hand
(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
;; => _unnamed [15 2]:
;; | :a | :b |
;; |--------:|---:|
;; | 100.0 | 2 |
;; | -100.0 | 2 |
;; | | 2 |
;; | 1.0 | |
;; | 2.0 | |
;; | | |
;; | | |
;; | | |
;; | | |
;; | | 13 |
;; | 4.0 | |
;; | | 3 |
;; | 11.0 | 4 |
;; | | 5 |
;; | -1000.0 | 5 |
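The hand-written map in the last call can also be computed. Below is a rough sketch in plain Clojure that builds such an index-to-value map for column `:a`, using "closest non-missing row by index" as a stand-in criterion. The helper names are hypothetical, and `missing-indexes` only imitates what `col/missing` returns; swap in whatever criterion you actually need for `pick`.

```clojure
;; Values of column :a from the example above, as a plain vector.
(def col-a [nil nil nil 1.0 2.0 nil nil nil nil nil 4.0 nil 11.0 nil nil])

(defn missing-indexes
  "Indexes of nil entries; a plain-Clojure stand-in for `col/missing`."
  [values]
  (keep-indexed (fn [i v] (when (nil? v) i)) values))

(defn nearest-present
  "Value at the non-nil position closest to index `i`.
  Ties go to the earlier index; returns nil if every value is missing."
  [values i]
  (some->> (range (count values))
           (filter #(some? (nth values %)))
           (sort-by #(Math/abs (long (- % i))))
           first
           (nth values)))

(defn replacement-map
  "Map of missing index -> replacement value, where `pick` is any
  function of the values and a missing index."
  [values pick]
  (into {}
        (for [i (missing-indexes values)
              :let [v (pick values i)]
              :when (some? v)]
          [i v])))

(def replacements (replacement-map col-a nearest-present))

(get replacements 0)  ;; => 1.0
(get replacements 14) ;; => 11.0
```

A map of this shape is what the `:value` strategy received in the call above, so something like `(tc/replace-missing DSm2 :a :value replacements)` should apply it, going by the description of the undocumented behaviour.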
@kxygk - have you solved your issues with the dataset library? Another approach would be to transform the dataset into a tensor and then create a compute tensor that does the right thing when the x,y location has a missing value.
Thanks for checking in!
I did manage to accomplish it with `row-map`, though from a purely ML perspective it unfortunately didn't improve the validation error (that could be a peculiarity of my dataset, though; it was for an ML class).
Sorry for the radio silence. I'm going to try to clean up the code a bit and put it up in a repo in the next few days. I also want to take a closer look at the other suggestions here :)
I put up a very simple stripped down demo here: https://github.com/kxygk/caulk
This is just in case anyone is curious how I did it. Sorry, I haven't been able to refactor this into a lil library or something directly useable :(
If I have a bit more free time towards the end of the summer I may revisit this.
I think the core issue is still kinda unaddressed, so it's a bummer this issue has been closed. The `replace-missing` interface only provides a fixed set of cookie-cutter methods instead of a generic interface for replacing missing values. That said, a preset list of functions for hole-filling could also be useful, but then I'd expect something more akin to the `tech.v3.dataset.column-filters` namespace. A function with predefined key flags feels a lil oldschool.