harold commented on July 4, 2024

Welcome in! Thanks for the feedback.

> I'd like to go to each nil, take its row and find the row which both has a non-nil in the same column and is closest in terms of Cartesian distance (across all features/columns) and then copy that value over.

Yes, the first way I'd try this is with row-map.

Thinking about it for a few seconds, the nearest queries may be complicated by the presence of the nils, but I suspect it's manageable.
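For anyone following along, here is a minimal sketch of the procedure quoted above, written against plain Clojure maps rather than the dataset API. The helper names (`distance`, `nearest-donor`, `fill-row`) are hypothetical, and one assumption is made to cope with the nils: the distance between two rows is computed only over the columns where both rows have a value. That choice favours rows sharing fewer columns, so a real experiment may want to normalise it.

```clojure
;; Sketch only: rows are plain maps, nil marks a missing value.
;; `distance`, `nearest-donor` and `fill-row` are hypothetical helpers,
;; not part of tech.ml.dataset.

(defn distance
  "Euclidean distance between rows `a` and `b`, using only the columns
  where both rows have a value. Returns nil when they share no column."
  [cols a b]
  (let [diffs (for [c cols
                    :let [x (get a c) y (get b c)]
                    :when (and (some? x) (some? y))]
                (- x y))]
    (when (seq diffs)
      (Math/sqrt (reduce + (map #(* % %) diffs))))))

(defn nearest-donor
  "Row closest to `row` among the rows that have a non-nil value in `col`."
  [cols rows row col]
  (let [candidates (for [r rows
                         :when (some? (get r col))
                         :let [d (distance cols row r)]
                         :when d]
                     [d r])]
    (when (seq candidates)
      (second (apply min-key first candidates)))))

(defn fill-row
  "Copy each missing value in `row` from its nearest donor row.
  A value stays nil when no donor exists."
  [cols rows row]
  (reduce (fn [acc c]
            (if (some? (get row c))
              acc
              (if-let [donor (nearest-donor cols rows row c)]
                (assoc acc c (get donor c))
                acc)))
          row
          cols))

(def rows [{:a 1.0 :b 2.0 :c 10.0}
           {:a 1.1 :b nil :c 11.0}
           {:a 9.0 :b 8.0 :c 90.0}])

;; The second row takes :b from the first row, its nearest neighbour.
(mapv #(fill-row [:a :b :c] rows %) rows)
```

Since `fill-row` is a function of a single row, it has the shape that a row-map style call expects, though wiring it into the dataset library is not shown here.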

I don't know of any name for the procedure you're describing here. Do you know if it has a name?

Definitely interested to see what you come up with. Report back when you try it.

from tech.ml.dataset.

harold commented on July 4, 2024

Looks like a fun musical space you're working in. It's very cool that you were able to test the method you developed using the library.

I still suspect this method has a name and has been thought through quite thoroughly. Chris' original intuition that this sort of operation might best be done in a dense tensor space still strikes me as correct.

> Sorry, I haven't been able to refactor this into a lil library or something directly useable :(

There's no need to apologize, you're of course right that this is the next step in the real work to be done here. What gets done in that step might suggest a way to generalize the api to be more conducive to these kinds of experiments (especially if that work is done with a little bit of an eye toward that potential generalization).

I think the program you linked more or less proves that the initial idea proposed here (changing the way replace-missing works) isn't that important: the experiment was run relatively easily with the existing api. And if it were someone's job at some point to run, say, a hundred such experiments, then the generalization would become obvious almost immediately.

Your contribution is valued.

genmeblog commented on July 4, 2024

Generally, replace-missing works on only one column at a time (even if you select multiple columns, it goes through them one by one). As I described on Reddit, this function provides the same functionality as Pandas/dplyr, so it should be enough for most cases. And yes, this function is not extensible in the way you've described.

Anyway, here is the possible approach.

  1. To get the indexes of rows with missing values, call tech.v3.dataset.column/missing on a column. Missing values are kept in a RoaringBitmap, a compressed set of row indexes with some nice features. For example, .nextAbsentValue and .previousAbsentValue give you the index of the nearest non-missing row at or after (respectively at or before) a given index.
  2. With the above you can find the value matching your criteria.
  3. [undocumented] replace-missing with the :value strategy accepts a map of row index to replacement value, so any set of missing values can be replaced in one call.
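
;; aliases assumed by the snippet below
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset.column :as col])

(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2  nil nil nil nil  nil 4   nil  11 nil nil]
                     :b [2   2   2 nil nil nil nil nil nil 13   nil   3  4  5 5]}))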

DSm2
;; => _unnamed [15 2]:
;;    |   :a | :b |
;;    |-----:|---:|
;;    |      |  2 |
;;    |      |  2 |
;;    |      |  2 |
;;    |  1.0 |    |
;;    |  2.0 |    |
;;    |      |    |
;;    |      |    |
;;    |      |    |
;;    |      |    |
;;    |      | 13 |
;;    |  4.0 |    |
;;    |      |  3 |
;;    | 11.0 |  4 |
;;    |      |  5 |
;;    |      |  5 |

;; indexes of missing values
(col/missing (DSm2 :a)) ;; => {0,1,2,5,6,7,8,9,11,13,14}
(col/missing (DSm2 :b)) ;; => {3,4,5,6,7,8,10}

(class (col/missing (DSm2 :a))) ;; => org.roaringbitmap.RoaringBitmap

;; index of the nearest non-missing value in column `:a` starting from 0
(.nextAbsentValue (col/missing (DSm2 :a)) 0) ;; => 3
;; there is no previous non-missing
(.previousAbsentValue (col/missing (DSm2 :a)) 0) ;; => -1

;; replace some missing values by hand
(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
;; => _unnamed [15 2]:
;;    |      :a | :b |
;;    |--------:|---:|
;;    |   100.0 |  2 |
;;    |  -100.0 |  2 |
;;    |         |  2 |
;;    |     1.0 |    |
;;    |     2.0 |    |
;;    |         |    |
;;    |         |    |
;;    |         |    |
;;    |         |    |
;;    |         | 13 |
;;    |     4.0 |    |
;;    |         |  3 |
;;    |    11.0 |  4 |
;;    |         |  5 |
;;    | -1000.0 |  5 |
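To tie the three steps together, here is a minimal sketch that builds the index-to-value map for one column automatically instead of by hand. It works on a plain vector so it runs without the library, and `nearest-index-fill` is a hypothetical helper name. Note that it picks the nearest non-missing value by row index (ties go to the earlier row), not by distance across features as in the original request.

```clojure
;; Sketch only: `nearest-index-fill` is a hypothetical helper, not a library function.

(defn nearest-index-fill
  "Given a column as a vector with nils, return a map from each missing
  index to the value at the nearest non-missing index. On a tie the
  earlier index wins. Returns {} when the column has no values at all."
  [values]
  (let [present (keep-indexed (fn [i v] (when (some? v) i)) values)]
    (if (empty? present)
      {}
      (into {}
            (for [[i v] (map-indexed vector values)
                  :when (nil? v)]
              ;; sort candidates by [distance index] so ties are deterministic
              (let [j (first (sort-by (fn [j] [(Math/abs (long (- j i))) j])
                                      present))]
                [i (nth values j)]))))))

(def column-a
  [nil nil nil 1.0 2.0 nil nil nil nil nil 4.0 nil 11.0 nil nil])

(def fill-map (nearest-index-fill column-a))

(get fill-map 0)  ;; => 1.0
(get fill-map 7)  ;; => 2.0  (rows 4 and 10 are equally near, earlier wins)
(get fill-map 14) ;; => 11.0

;; The map has the index -> value shape described in step 3, so presumably
;; it could be passed on like this (not run here):
;; (tc/replace-missing DSm2 :a :value fill-map)
```

Swapping the body of the `let` for a different selection rule is where a custom criterion, such as distance across other columns, would go.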

cnuernber commented on July 4, 2024

@kxygk - have you solved your issues with the dataset library? Another approach would be to transform the dataset into a tensor and then create a compute tensor that does the right thing when the x,y location has a missing value.

kxygk commented on July 4, 2024

Thanks for checking in!

I did manage to accomplish it with row-map, though from a purely ML perspective it unfortunately didn't improve the validation error (that could be a peculiarity of my dataset, though; it was for an ML class).

Sorry for the radio silence. I'm going to try to clean up the code a bit and put it up in a repo in the next few days. I also want to take a closer look at the other suggestions here :)

kxygk commented on July 4, 2024

I put up a very simple stripped down demo here: https://github.com/kxygk/caulk

This is just in case anyone is curious how I did it. Sorry, I haven't been able to refactor this into a lil library or something directly useable :(

If I have a bit more free time towards the end of the summer I may revisit this

I think the core issue is still kinda unaddressed, so it's a bummer this issue has been closed. The replace-missing interface only provides a fixed set of cookie-cutter methods instead of a generic interface for replacing missing values. That said, a preset list of functions for hole-filling could also be useful, but then I'd expect something more akin to the tech.v3.dataset.column-filters namespace. A function with predefined key flags feels a lil oldschool.
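To make the distinction concrete, here is a rough sketch of what a generic interface could look like, as opposed to a fixed set of keyword strategies. Everything here is hypothetical (`fill-missing`, `constant`, `previous-value`) and works on a plain vector; it is not how the library is implemented.

```clojure
;; Sketch only: a filler that takes the strategy as a function.
;; All names here are hypothetical, not part of tech.ml.dataset.

(defn fill-missing
  "Return `values` with every nil replaced by `(strategy values idx)`.
  `strategy` always sees the original, unfilled vector."
  [values strategy]
  (into []
        (map-indexed (fn [idx v] (if (nil? v) (strategy values idx) v)))
        values))

;; Two interchangeable strategies.
(defn constant
  "Strategy that fills every hole with `x`."
  [x]
  (fn [_values _idx] x))

(defn previous-value
  "Strategy that uses the last non-nil value before `idx`, or nil if none."
  [values idx]
  (last (remove nil? (subvec values 0 idx))))

(fill-missing [nil 1 nil nil 4] (constant 0))   ;; => [0 1 0 0 4]
(fill-missing [nil 1 nil nil 4] previous-value) ;; => [nil 1 1 1 4]
```

A preset list of hole-filling functions would then just be a namespace of such strategies, and a custom one would be any function with the same two arguments.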
