tech.datatype's People

Contributors

alanmarazzi, behrica, cnuernber, daslu, harold, rschmukler

tech.datatype's Issues

operator as sequential

Can you mark each op/reader as Sequential?

Why I need this: an op (like dfn/+) returns a reified object that is seqable (because it is Iterable).

When I test whether something is seqable, I also catch String and Map (and maybe other things). sequential? is a narrower test: it simply says that the object is a list (in the common sense) of simple values.
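
The difference between the two predicates can be checked with clojure.core alone. The following is a minimal sketch, not tech.datatype code: iterable-only and marked-sequential are hypothetical stand-ins for a reified op result, and the second one additionally implements the clojure.lang.Sequential marker interface, which is what the request amounts to.

```clojure
;; Hypothetical stand-in for an op result: Iterable, but not marked Sequential.
(def iterable-only
  (reify Iterable
    (iterator [_] (.iterator [1 2 3]))))

;; Same shape, additionally tagged with the clojure.lang.Sequential marker.
(def marked-sequential
  (reify
    Iterable
    (iterator [_] (.iterator [1 2 3]))
    clojure.lang.Sequential))

;; seqable? is broad: strings, maps and any Iterable pass.
(seqable? "abc")                  ;; => true
(seqable? {:a 1})                 ;; => true
(seqable? iterable-only)          ;; => true

;; sequential? is narrow: only objects carrying the marker pass.
(sequential? "abc")               ;; => false
(sequential? {:a 1})              ;; => false
(sequential? iterable-only)       ;; => false
(sequential? marked-sequential)   ;; => true
```

Because Sequential has no methods, tagging a reified object with it costs nothing at runtime; it only changes what sequential? reports.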

Support for "reindexing" datetime column?

Hi, me again, I managed to hit this repo as well, haha!

I am playing with a datetime column in my dataset and I was wondering if you already plan to support "reindexing", by detecting the missing dates in a "date range", and possibly interpolating/filling them?

It sounds complicated so I would understand if it is something you are not interested in supporting!

Thank you once again for your amazing work!!
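
To make the request concrete, here is a rough sketch of the "detect the missing dates" half using only java.time. It assumes a daily frequency and plain LocalDate values; date-range and missing-dates are hypothetical helper names, not functions from tech.datatype or tech.ml.dataset, and filling or interpolating the gaps is left out.

```clojure
(import '(java.time LocalDate))

(defn date-range
  "Every day from start to end, inclusive."
  [^LocalDate start ^LocalDate end]
  (->> (iterate #(.plusDays ^LocalDate % 1) start)
       (take-while #(not (.isAfter ^LocalDate % end)))))

(defn missing-dates
  "Days between the earliest and latest date that do not occur in dates."
  [dates]
  (if (empty? dates)
    ()
    (let [sorted  (sort dates)      ; LocalDate is Comparable
          present (set dates)]
      (remove present (date-range (first sorted) (last sorted))))))

(missing-dates [(LocalDate/of 2020 1 1)
                (LocalDate/of 2020 1 2)
                (LocalDate/of 2020 1 5)])
;; => the two missing days, 2020-01-03 and 2020-01-04
```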

packed datetime types fall into ints

Consider the following example. Datetime values are stored as packed-local-date-time. A value taken out with the get-value function is returned as an :int64.

When I try to convert it to milliseconds (this is my goal), I get an exception.

(def ds (ds/->dataset {:dt [(java.time.LocalDateTime/of 2020 01 01 11 22 33)
                            (java.time.LocalDateTime/of 2020 10 01 01 01 01)]}))

(ds :dt)
;; => #tech.ml.dataset.column<packed-local-date-time>[2]
;;    :dt
;;    [2020-01-01T11:22:33, 2020-10-01T01:01:01, ]

;; Expecting datetime
(dtype/get-value (ds :dt) 0)
;; => 568580556948144360

;; expecting :packed-local-date-time
(dtype/get-datatype (dtype/get-value (ds :dt) 0))
;; => :int64

(dt/to-milliseconds (dtype/get-value (ds :dt) 0) :packed-local-date-time)
;; Exception
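
For comparison, the target value can be computed from the unpacked LocalDateTime with plain java.time, without going through the packed representation. This is only a sketch of the expected result, not the library's conversion path; it assumes the local date-time should be interpreted as UTC, and local-date-time->epoch-millis is a hypothetical helper name.

```clojure
(import '(java.time LocalDateTime ZoneOffset))

(defn local-date-time->epoch-millis
  "Milliseconds since the epoch, treating ldt as a UTC wall-clock time (assumption)."
  [^LocalDateTime ldt]
  (-> ldt
      (.toInstant ZoneOffset/UTC)
      (.toEpochMilli)))

(local-date-time->epoch-millis (LocalDateTime/of 2020 1 1 11 22 33))
;; => 1577877753000
```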

reader can have any type

Why is it allowed?

(dtype/->reader [1 2 3] :random-type)
;; => [1 2 3]

(col/new-column :what-is-the-type? (dtype/->reader [1 2 3] :random-type))
;; => #tech.ml.dataset.column<random-type>[3]
;;    :what-is-the-type?
;;    [1, 2, 3, ]

I would expect validation of provided types.
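
The kind of check meant here could look roughly like the following. It is a sketch only: known-datatypes is an illustrative set written for this example, not the library's actual datatype registry, and validate-datatype is a hypothetical function.

```clojure
;; Illustrative set only; the real library knows more datatypes than this.
(def known-datatypes
  #{:int8 :int16 :int32 :int64 :float32 :float64 :boolean :string :object})

(defn validate-datatype
  "Returns dtype if it is known, otherwise throws ex-info."
  [dtype]
  (if (contains? known-datatypes dtype)
    dtype
    (throw (ex-info (str "Unknown datatype: " dtype)
                    {:datatype dtype}))))

(validate-datatype :int64)          ;; => :int64
;; (validate-datatype :random-type) ;; throws ExceptionInfo
```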

dfn/round doesn't work on generic lists

(dfn/round [1.2 2.5 3.6 7.8])

fails with an error because the datatype of that sequence is :object, and round doesn't have an overload for :object (only for float and double).
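
Until such an overload exists, one possible workaround is to coerce each element to a double before rounding. The sketch below uses only clojure.core and java.lang.Math; round-all is a hypothetical name, and Math/round returns longs and rounds halves up, which may or may not match what dfn/round does on typed input.

```clojure
(defn round-all
  "Rounds every element after coercing it to a double."
  [xs]
  (mapv #(Math/round (double %)) xs))

(round-all [1.2 2.5 3.6 7.8])
;; => [1 3 4 8]
```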

Sort fails on readers

fathym-forecast.validate> (sort (vec (dtype/->reader [3 1 2])))
(1 2 3)
fathym-forecast.validate> (sort (dtype/->reader [3 1 2]))
Execution error (AbstractMethodError) at tech.v2.datatype.base$eval17710$fn$reify__17714/toArray (base.cljc:-1).
Method tech/v2/datatype/base$eval17710$fn$reify__17714.toArray()[Ljava/lang/Object; is abstract

This is because the readers don't have a default implementation of toArray. This is an easy fix with default interface methods.
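
The effect of having an inherited toArray can be seen with a plain java.util.AbstractList proxy, which only needs get and size and picks up toArray from its superclass. This is an illustration of the principle rather than the default-method fix itself; list-view is a hypothetical helper, not part of tech.datatype.

```clojure
(defn list-view
  "Read-only java.util.List of n elements whose i-th element is (f i)."
  [f n]
  (proxy [java.util.AbstractList] []
    (get [idx] (f idx))
    (size [] n)))

;; sort copies the collection through toArray, which AbstractList
;; implements in terms of size and get, so this works:
(sort (list-view [3 1 2] 3))
;; => (1 2 3)
```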

Support for outlier using quartiles

Check out:
https://pupli.net/2018/05/14/detect-outlier-values-in-java-using-boxplot-formula-and-apache-commons-math-library/

(defn quartiles
  [l]
  (let [vs (vec (sort l))
        n (count vs)]
    ;; These aren't exactly right, but are good enough for today.
    [(first vs)
     (nth vs (int (* 0.25 n)))
     (nth vs (int (* 0.5 n)))
     (nth vs (int (* 0.75 n)))
     (last vs)]))

(defn quartiles->inter-quartile-range
  [[_ q1 _ q3 _]]
  (- q3 q1))

(defn outlier?
  [qs v]
  (let [iqr (quartiles->inter-quartile-range qs)
        [_ q1 _ q3 _] qs]
    (or (< v (- q1 (* 1.5 iqr)))
        (> v (+ q3 (* 1.5 iqr))))))
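
As the comment in quartiles says, the index-based picks are approximate. One common refinement is linear interpolation between neighbouring order statistics (the scheme NumPy uses by default). The sketch below assumes a non-empty collection of numbers; quantile and tukey-fences are hypothetical names introduced for this example.

```clojure
(defn quantile
  "Quantile p (0.0 to 1.0) of an already sorted, non-empty vector,
  using linear interpolation between the two nearest elements."
  [sorted-vs p]
  (let [n    (count sorted-vs)
        pos  (* (double p) (dec n))        ; fractional index
        lo   (long (Math/floor pos))
        hi   (min (dec n) (inc lo))
        frac (- pos lo)]
    (+ (* (- 1.0 frac) (nth sorted-vs lo))
       (* frac (nth sorted-vs hi)))))

(defn tukey-fences
  "Lower and upper outlier bounds: q1 - 1.5*iqr and q3 + 1.5*iqr."
  [vs]
  (let [sorted (vec (sort vs))
        q1     (quantile sorted 0.25)
        q3     (quantile sorted 0.75)
        iqr    (- q3 q1)]
    [(- q1 (* 1.5 iqr)) (+ q3 (* 1.5 iqr))]))

(tukey-fences [1 2 3 4 5 6 7 8 100])
;; => [-3.0 13.0], so 100 is flagged as an outlier
```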

[color-gradients] Idea for a higher-level interface.

Had some thoughts about color-gradients interfaces today.

At the moment, I'm using gradient-name->gradient-line, and then making an int32-reader off the returned gradient, and then using .read2d on the reader to get the r/g/b byte values.

This is a little verbose and error-prone (reading off the end of the reader throws, for instance).

An interface like this might be nice:

(for [p [0.0 0.2 0.4 0.6 0.8 1.0]]
  (let [[b g r] (color-gradients/lookup :red-green p)]
    (println [r g b])))
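
A rough sketch of how such a lookup could behave, including clamping so that p outside 0.0 to 1.0 never reads off the end. Everything here is an assumption made for illustration: the gradient is modelled as a plain vector of [b g r] triples rather than the library's int32 reader, and this lookup takes the gradient itself instead of a gradient name.

```clojure
(defn lookup
  "Returns the [b g r] triple at relative position p, clamping p to [0.0, 1.0]."
  [gradient p]
  (let [n   (count gradient)
        p   (-> (double p) (max 0.0) (min 1.0))
        idx (min (dec n) (long (* p n)))]   ; p = 1.0 maps to the last entry
    (nth gradient idx)))

;; Hypothetical three-step gradient from red to green, stored as [b g r].
(def red-green [[0 0 255] [0 128 128] [0 255 0]])

(lookup red-green 0.0)   ;; => [0 0 255]
(lookup red-green 1.0)   ;; => [0 255 0]
(lookup red-green 1.5)   ;; => [0 255 0], clamped instead of throwing
```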

Date/time support

Date time support is going to look like the numpy date/time support:

https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

Initial RFC:

  • Two new storage datatypes, :datetime64/32 and :timeinterval64/32, along with object datatypes corresponding to the Java 8 java.time objects: :date, :datetime, and :timeinterval. These will be stored in containers that know the base time quantization value: year, month, day, hour, second, or millisecond. These will natively form long readers/writers, but object readers/writers of type :date, :datetime, and time will be possible.

  • Type promotion: any integer will be promoted to the smallest quantization value involved in the operation. The quantization value of the result will be the smallest quantization value across all terms in the operation.

  • This more or less matches the tablesaw data layout for date/times:
    https://github.com/jtablesaw/tablesaw/tree/master/core/src/main/java/tech/tablesaw/columns
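
The promotion rule in the second bullet can be sketched as follows. This is an illustration of the idea, not the RFC's implementation: unit->millis, finest-unit and convert are hypothetical names, and year and month are left out because they have no fixed length in milliseconds.

```clojure
;; Fixed-length quantization units only (assumption for this sketch).
(def unit->millis
  {:millisecond 1
   :second      1000
   :minute      60000
   :hour        3600000
   :day         86400000})

(defn finest-unit
  "The smallest quantization unit among the operands."
  [units]
  (apply min-key unit->millis units))

(defn convert
  "Re-expresses value, given in unit from, in the finer (or equal) unit to."
  [value from to]
  (* value (quot (unit->millis from) (unit->millis to))))

;; Adding 2 hours and 30 seconds: both terms are promoted to seconds.
(let [unit (finest-unit [:hour :second])]
  [unit (+ (convert 2 :hour unit) (convert 30 :second unit))])
;; => [:second 7230]
```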

Support clojure.lang.APersistentVector.

The first step is to generically create a persistent vector implementation from any List definition that happens to provide just the subset that readers actually implement. This supports all the readers through a proxy that, most importantly, has the same memory characteristics as the base reader, and it is very little code for the win.

These offer key advantages over generic readers, specifically correct hash and equality semantics compared with generic lists of types. This is a great step forward, as APersistentVector also has a minimal, high-quality java.util.List implementation and as such will work out of the box with a wider range of existing software than readers do with their extremely partial subset. It may also have the benefit that persistent vectors and readers will hash to the same bin in a hashtable.

The one caveat is that, at this level, the binding should be loose and easy to fix between upstream software versions.
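
The hash and equality point can be seen with a generic java.util.List from the JDK, used here as a stand-in for a reader (an assumption made for illustration; readers themselves are not involved in this sketch). Clojure considers the list equal to a vector with the same contents, yet clojure.core/hash disagrees between the two, which is what breaks lookups in hash maps and sets.

```clojure
(def v  [1 2 3])
(def al (java.util.ArrayList. [1 2 3]))   ; generic java.util.List

;; Equal by value...
(= v al)                       ;; => true

;; ...but hashed differently: the vector uses Clojure's collection hash,
;; the ArrayList falls back to java.util.List's hashCode.
(= (hash v) (hash al))         ;; => false

;; A real persistent vector built from the same elements hashes like v.
(= (hash v) (hash (vec al)))   ;; => true
```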

dfn/reduce-+ NPE, depending on data size

$ clj -Sdeps '{:deps {techascent/tech.datatype {:mvn/version "4.65"}}}'
Clojure 1.10.1
user=> (require '[tech.v2.datatype.functional :as dfn])
nil
user=> (-> (range 10) dfn/reduce-+)
Execution error (IllegalArgumentException) at tech.v2.datatype.operation-provider/eval15184$fn (operation_provider.clj:79).
No method in multimethod 'half-dispatch-reduce-op' for dispatch value: :iterable
user=> ;; Seem to be missing tech.v2.datatype. Let us require it.
user=> (require '[tech.v2.datatype :as dtype])
nil
user=> (-> (range 10) dfn/reduce-+)
45
user=> ;; Now it works. Also in the reader version:
user=> (-> (range 10) double-array dfn/reduce-+)
45.0
user=> (-> (range 10) into-array dfn/reduce-+)
45
user=> ;; What about more data?
user=> (-> (range 1000) dfn/reduce-+)
499500
user=> (-> (range 1000) double-array dfn/reduce-+)
499500.0
user=> (-> (range 1000) into-array dfn/reduce-+)
Execution error (NullPointerException) at tech.v2.datatype.binary_op$reify$reify__8062/op (binary_op.clj:462).
null
user=> ;; Now this is confusing.
user=> ;; What about a different container, with an explicit type?
user=> (->> (range 1000) into-array (dtype/make-container :java-array :double) dfn/reduce-+)
Execution error (NullPointerException) at tech.v2.datatype.binary_op$reify$reify__8062/op (binary_op.clj:462).
null
user=> ;; Still NPE.
user=> 
