I'm just getting started with using tech.ml.dataset -


Maybe a more generic `replace-missing` interface? (tech.ml.dataset issue, closed)

kxygk commented on July 4, 2024

Maybe a more generic `replace-missing` interface?

from tech.ml.dataset.

Comments (6)

harold commented on July 4, 2024 1

Welcome in! Thanks for the feedback.

> I'd like to go to each nil, take its row and find the row which both has a non-nil in the same column and is closest in terms of Cartesian distance (across all features/columns) and then copy that value over.

Yes, the first way I'd try this is with row-map.

Thinking about it for a few seconds, the nearest queries may be complicated by the presence of the nils, but I suspect it's manageable.
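To make the idea concrete, here is a minimal sketch of the procedure on plain Clojure maps rather than on the dataset API; the same per-row logic could presumably be handed to row-map. The function names `shared-distance` and `fill-from-nearest` are hypothetical. One way to cope with the nils during the nearest query, and this is an assumption rather than anything settled in the thread, is to measure distance only over the columns where both rows have a value and to average the squared differences so that rows sharing different numbers of columns stay comparable. The procedure resembles what is usually called nearest-neighbour imputation (k-NN with k = 1).

```clojure
(defn shared-distance
  "Mean squared difference over the keys where both rows have a value.
  Returns nil when the two rows share no non-nil keys.
  (Assumption: features missing in either row are simply skipped.)"
  [row-a row-b ks]
  (let [diffs (for [k ks
                    :let [a (get row-a k)
                          b (get row-b k)]
                    :when (and (some? a) (some? b))]
                (let [d (- (double a) (double b))]
                  (* d d)))]
    (when (seq diffs)
      (/ (reduce + diffs) (count diffs)))))

(defn fill-from-nearest
  "For every nil cell, copy the value from the closest row that has a
  value in that column. Distances always use the original rows, so an
  already-filled cell never influences a later fill. Cells without any
  usable donor row stay nil."
  [rows]
  (let [ks (vec (distinct (mapcat keys rows)))]
    (mapv (fn [row]
            (reduce
             (fn [acc k]
               (if (some? (get row k))
                 acc
                 (let [donors (for [other rows
                                    :when (some? (get other k))
                                    :let [d (shared-distance row other (remove #{k} ks))]
                                    :when d]
                                [d (get other k)])]
                   (if (seq donors)
                     (assoc acc k (second (apply min-key first donors)))
                     acc))))
             row
             ks))
          rows)))

;; Toy data: the middle row is missing :b and sits close to the first row.
(def rows
  [{:a 1.0 :b 10.0 :c 5.0}
   {:a 1.1 :b nil  :c 5.2}
   {:a 9.0 :b 90.0 :c 1.0}])

(fill-from-nearest rows)
;; the middle row receives :b 10.0 from the first row
```

This is quadratic in the number of rows, which is fine for a classroom-sized dataset but would need a spatial index or a dense-array formulation for anything large.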

I don't know of any name for the procedure you're describing here. Do you know if it has a name?

Definitely interested to see what you come up with. Report back when you try it.


harold commented on July 4, 2024 1

Looks like a fun musical space you're working in. It's very cool that you were able to test the method you developed using the library.

I still suspect this method has a name and has been thought through quite thoroughly. Chris' original intuition that this sort of operation might best be done in a dense tensor space still strikes me as correct.

> Sorry, I haven't been able to refactor this into a lil library or something directly useable :(

There's no need to apologize, you're of course right that this is the next step in the real work to be done here. What gets done in that step might suggest a way to generalize the api to be more conducive to these kinds of experiments (especially if that work is done with a little bit of an eye toward that potential generalization).

I think the program you linked more or less shows that the initial idea proposed here (changing the way replace-missing works) isn't that important: the experiment was run relatively easily with the existing API. And if it were someone's job at some point to run, say, a hundred such experiments, then the right generalization would become obvious almost immediately.

Your contribution is valued.


genmeblog commented on July 4, 2024

Generally, replace-missing works on only one column at a time (even if you select multiple columns, it goes through them one by one). As I described on Reddit, this function provides the functionality found in Pandas/dplyr, so it should be enough for most cases. And yes, this function is not extensible in the way you've described.

Anyway, here is a possible approach.

1. To get the indexes of rows with missing values, call tech.v3.dataset.column/missing on a column. Missing values are kept in a RoaringBitmap structure, a set of indexes with some nice features. For example, .previousAbsentValue or .nextAbsentValue will give you the index of a non-missing row.
2. With the above you can find the value matching your criteria.
3. [undocumented] replace-missing with the :value strategy accepts a map of index-value pairs, which lets you replace a whole set of values at once.

```clojure
;; Aliases assumed from the calls below: `tc` is tablecloth.api,
;; `col` is tech.v3.dataset.column.
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset.column :as col])

(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2  nil nil nil nil  nil 4   nil  11 nil nil]
                       :b [2   2   2 nil nil nil nil nil nil 13   nil   3  4  5 5]}))

DSm2
;; => _unnamed [15 2]:
;;    |   :a | :b |
;;    |-----:|---:|
;;    |      |  2 |
;;    |      |  2 |
;;    |      |  2 |
;;    |  1.0 |    |
;;    |  2.0 |    |
;;    |      |    |
;;    |      |    |
;;    |      |    |
;;    |      |    |
;;    |      | 13 |
;;    |  4.0 |    |
;;    |      |  3 |
;;    | 11.0 |  4 |
;;    |      |  5 |
;;    |      |  5 |

;; indexes of missing values
(col/missing (DSm2 :a)) ;; => {0,1,2,5,6,7,8,9,11,13,14}
(col/missing (DSm2 :b)) ;; => {3,4,5,6,7,8,10}

(class (col/missing (DSm2 :a))) ;; => org.roaringbitmap.RoaringBitmap

;; index of the nearest non-missing value in column `:a` starting from 0
(.nextAbsentValue (col/missing (DSm2 :a)) 0) ;; => 3
;; there is no previous non-missing
(.previousAbsentValue (col/missing (DSm2 :a)) 0) ;; => -1

;; replace some missing values by hand
(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
;; => _unnamed [15 2]:
;;    |      :a | :b |
;;    |--------:|---:|
;;    |   100.0 |  2 |
;;    |  -100.0 |  2 |
;;    |         |  2 |
;;    |     1.0 |    |
;;    |     2.0 |    |
;;    |         |    |
;;    |         |    |
;;    |         |    |
;;    |         |    |
;;    |         | 13 |
;;    |     4.0 |    |
;;    |         |  3 |
;;    |    11.0 |  4 |
;;    |         |  5 |
;;    | -1000.0 |  5 |
```
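The three steps can be tied together by computing the index-value map programmatically instead of typing it by hand. The following is a rough sketch on a plain Clojure vector, with nil standing in for a missing value; `nearest-fill-map` is a hypothetical helper, not part of any library, and the strategy it implements (take the closest non-missing value in the same column, preferring the earlier index on a tie) is only one example of a custom criterion.

```clojure
(defn nearest-fill-map
  "Returns a map of missing index -> value found at the closest
  non-missing index of the same column. On a tie the earlier index wins.
  Returns an empty map when the column has no values at all."
  [column]
  (let [present (keep-indexed (fn [i v] (when (some? v) i)) column)]
    (into {}
          (for [[i v] (map-indexed vector column)
                :when (and (nil? v) (seq present))]
            ;; sort candidate donors by [distance, index] and take the first
            (let [donor (first (sort-by (juxt #(Math/abs (long (- % i))) identity)
                                        present))]
              [i (nth column donor)])))))

;; The values of column :a from the example above, as a plain vector.
(def a-values [nil nil nil 1.0 2 nil nil nil nil nil 4 nil 11 nil nil])

(nearest-fill-map a-values)
;; => {0 1.0, 1 1.0, 2 1.0, 5 2, 6 2, 7 2, 8 4, 9 4, 11 4, 13 11, 14 11}
```

A map of this shape is what the `:value` strategy receives in the hand-written call above, `(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})`. Getting the column values out of the dataset in a form where missing entries show up as nil is left out here.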


cnuernber commented on July 4, 2024

@kxygk - have you solved your issues with the dataset library? Another approach would be to transform the dataset into a tensor and then create a compute tensor that does the right thing when the x,y location has a missing value.
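As an illustration of the second idea only, here is a sketch in plain Clojure that does not use the tensor namespace at all. The dense data is a row-major vector of vectors, and `filled-view` (a hypothetical name) returns a function of the `y,x` location that computes its answer on demand, which is roughly the role a compute tensor would play. Falling back to the column mean is purely an assumption chosen to keep the example short; any other rule could sit in its place.

```clojure
(defn column-means
  "Mean of the non-nil entries of each column of a row-major matrix.
  A column with no values at all gets nil."
  [matrix]
  (let [n-cols (count (first matrix))]
    (mapv (fn [x]
            (let [present (keep #(nth % x) matrix)]
              (when (seq present)
                (/ (reduce + present) (double (count present))))))
          (range n-cols))))

(defn filled-view
  "Returns a function of [y x] that reads the matrix lazily and
  substitutes the column mean wherever the stored cell is nil.
  Nothing is copied; the original matrix is left untouched."
  [matrix]
  (let [means (column-means matrix)]
    (fn [y x]
      (let [v (get-in matrix [y x])]
        (if (some? v) v (nth means x))))))

(def m [[1.0 nil]
        [3.0 4.0]
        [nil 6.0]])

(def cell (filled-view m))

(cell 0 0) ;; => 1.0  (stored value)
(cell 2 0) ;; => 2.0  (mean of 1.0 and 3.0)
(cell 0 1) ;; => 5.0  (mean of 4.0 and 6.0)
```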


kxygk commented on July 4, 2024

Thanks for checking in!

I did manage to accomplish it with `row-map`, though from a purely ML perspective it unfortunately didn't improve the validation error (that could be a peculiarity of my dataset, though; it was for an ML class).

Sorry for the radio silence. I'm going to try to clean up the code a bit and put it up in a repo in the next few days. I also want to take a closer look at the other suggestions here :)


kxygk commented on July 4, 2024

I put up a very simple stripped down demo here: https://github.com/kxygk/caulk

This is just in case anyone is curious how I did it. Sorry, I haven't been able to refactor this into a lil library or something directly useable :(

If I have a bit more free time towards the end of the summer I may revisit this.

I think the core issue is still kinda unaddressed, so it's a bummer this issue has been closed. The replace-missing interface only provides a fixed set of cookie-cutter methods instead of a generic interface for replacing missing values. That said, a preset list of functions for hole-filling could also be useful, but then I'd expect something more akin to the tech.v3.dataset.column-filters namespace. A function with predefined key flags feels a lil old-school.
