Comments (21)

risenW avatar risenW commented on July 20, 2024 1

Yea, I think so too. I found this as well:

https://stackoverflow.com/questions/56649680/tensorflow-vs-tensorflow-js-different-results-for-floating-point-arithmetic-comp

Let me know what you come up with.

from danfojs.

risenW avatar risenW commented on July 20, 2024 1

Looks interesting. What do you mean by extending features like align data?

Also to be sure, are you proposing we abstract the isna function to generic or an internal function that can be called by the isna function from both Series and Dataframe?

If we have to abstract it, then we have to use a different name, something like __isna(), and this can return values as an array which will be constructed in the Dataframe or Series depending on the caller.

For example, in generic we can have:

    /**
     * Return a same-sized array of booleans indicating whether the values are missing.
     * NaN and undefined values map to true; everything else (including strings) maps to false.
     * @param {boolean} is_series true when this.values is 1-D, false when it is an array of rows
     * @return {Array}
     */
    __isna(is_series = true) {
        // `val == NaN` is always false, so isNaN() does the actual check.
        // Strings are excluded because isNaN("abc") is also true.
        const is_missing = val => isNaN(val) && typeof val != "string"
        if (is_series) {
            return this.values.map(is_missing)
        }
        // DataFrame case: apply the same test to every value in every row
        return this.values.map(row => row.map(is_missing))
    }

and from DataFrame or Series we can call __isna() in the isna() function. Is this what you intend?
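
To make the idea concrete, here is a minimal sketch of how the two isna() methods could delegate to one shared helper. The classes below are stripped-down stand-ins written only for illustration (the real danfojs classes also carry index, columns and dtypes), so treat the names and constructors as assumptions.

```javascript
// Stand-in base class for illustration only; not the actual danfojs generic module.
class NDframe {
  constructor(values) {
    this.values = values;
  }

  // Shared low-level helper: returns plain arrays, never a Series or DataFrame.
  __isna(is_series = true) {
    const isMissing = val => typeof val !== "string" && isNaN(val);
    return is_series
      ? this.values.map(isMissing)
      : this.values.map(row => row.map(isMissing));
  }
}

class Series extends NDframe {
  // The caller decides which structure to rebuild from the raw array.
  isna() {
    return new Series(this.__isna(true));
  }
}

class DataFrame extends NDframe {
  isna() {
    return new DataFrame(this.__isna(false));
  }
}

const s = new Series([1, NaN, "a", undefined]).isna();
const df = new DataFrame([[1, NaN], ["x", 2]]).isna();
console.log(s.values, df.values);
```

The helper only knows about raw values, so the NaN test lives in one place while each class keeps control of how the result is wrapped.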

UPDATE:
Check this abstraction I did here

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

Is there any update in this feature?

risenW avatar risenW commented on July 20, 2024

No one is actively working on this currently. Would you like to work on it?

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

Yes, I would like to. When you talk about TF-based, do you mean using TF as support to perform the calculations with the Tensor methods it provides?

risenW avatar risenW commented on July 20, 2024

Yes, that's exactly what I meant. Alright, I'll assign you to this. Thanks!

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

I suppose this could be a generic method that DataFrame and Series inherit from the generic module, which would check the type of structure and apply the correct way to calculate the correlation, right?

risenW avatar risenW commented on July 20, 2024

The generic module is used for more low-level methods, so it should be separate. Also, the way you compute DataFrame corr is slightly different from Series: I believe one is pairwise (DataFrame) and the other is computed against another Series.

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

Yes, you are right. I did some research yesterday, and at the moment I have implemented the Pearson method to calculate the corr in Series. I want to refactor the math functions and cumulative operations currently used in the code base to use TF built-in methods to gain performance on large datasets. Currently they use the math.js library.

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

For example the std function:

  std() {
        if (this.dtypes[0] == "string") {
            throw Error("dtype error: String data type does not support std operation")
        }

        let values = []
        this.values.forEach(val => {
            if (!(isNaN(val) && typeof val != 'string')) {
                values.push(val)
            }
        })
        let std_val = std(values) //using math.js
        return std_val

    }

Can be changed to

std() {
  if (this.dtypes[0] == "string") {
    throw Error("dtype error: String data type does not support std operation")
  }

  // Keep only the non-NaN values
  let values = []

  this.values.forEach(val => {
    if (!(isNaN(val) && typeof val != 'string')) {
      values.push(val)
    }
  });

  // tf is assumed to be imported from @tensorflow/tfjs
  let tensor = tf.tensor1d(values, this.dtypes[0]);

  return parseFloat(tf.moments(tensor).variance.sqrt().arraySync());
}

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

After doing some research, I found that when TF creates a tensor from an array of values, it may introduce precision errors in float data, so the results of std, variance, and other functions diverge from the expected ones. The error is around 6.35% of the actual value.
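
One possible contributor to a gap of this size, offered as an assumption rather than a confirmed diagnosis: tf.moments returns the population variance (it divides by n), while math.js std uses the unbiased estimator (it divides by n - 1) by default. The plain JavaScript sketch below uses hypothetical helpers (mean, populationStd, sampleStd) and made-up data to show how far apart the two can be on a small array.

```javascript
// Hypothetical helpers for illustration; none of them is part of danfojs.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Divides by n: the square root of the mean of squared deviations.
function populationStd(values) {
  const m = mean(values);
  const sumSq = values.reduce((sum, v) => sum + (v - m) ** 2, 0);
  return Math.sqrt(sumSq / values.length);
}

// Divides by n - 1: the unbiased (sample) estimator.
function sampleStd(values) {
  const m = mean(values);
  const sumSq = values.reduce((sum, v) => sum + (v - m) ** 2, 0);
  return Math.sqrt(sumSq / (values.length - 1));
}

const data = [2, 4, 4, 4, 5, 5, 7, 9]; // made-up sample
const pop = populationStd(data);
const samp = sampleStd(data);

// The ratio depends only on n: samp / pop equals sqrt(n / (n - 1)).
const ratio = samp / pop;
console.log(pop, samp, ratio);
```

With n = 8 the two results differ by about 6.9%, and the difference shrinks as n grows, so it may be worth checking the normalization before attributing the whole error to float precision.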

risenW avatar risenW commented on July 20, 2024

Yes, I noticed that as well. Did you try rounding the values down? It seems TFJS increases the precision of floats, and that leads to the high error rates.

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

No, I didn't. I went back to the current implementation, but tomorrow I will try one more time. Yes, TFJS increases the precision, but it also depends on the processor / GPU / browser running the library, so I think it is a compatibility issue.
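
For the float precision part, a small sketch that may help: TensorFlow.js tensors default to float32, while JavaScript numbers are 64-bit doubles, so a value read back from a tensor can show extra trailing digits. Math.fround is a standard JavaScript function that rounds to the nearest float32 and reproduces the effect without TF. The helper name toFloat32Array is hypothetical.

```javascript
// Round each value to the nearest 32-bit float, as storing it in a float32 buffer would.
function toFloat32Array(values) {
  return values.map(v => Math.fround(v));
}

const original = [0.1, 0.2, 0.3];
const rounded = toFloat32Array(original);

// Relative error introduced by the float32 round trip, per element.
const relErrors = original.map((v, i) => Math.abs(rounded[i] - v) / v);
console.log(rounded, relErrors);
```

The relative error from this rounding alone stays below 1e-7 per value, which suggests float32 storage by itself is unlikely to explain a gap of several percent.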

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

While implementing the corr methods, I noticed that the DataFrame and Series classes each have their own isna() method.

Series ->

  isna() {
        let new_arr = []
        this.values.map(val => {
            // eslint-disable-next-line use-isnan
            if (val == NaN) {
                new_arr.push(true)
            } else if (isNaN(val) && typeof val != "string") {
                new_arr.push(true)
            } else {
                new_arr.push(false)
            }
        })
        let sf = new Series(new_arr, { index: this.index, columns: this.column_names, dtypes: ["boolean"] })
        return sf
    }

DataFrame ->

  isna() {
        let new_row_data = []
        let row_data = this.values;
        let columns = this.column_names;

        row_data.map(arr => {
            let temp_arr = []
            arr.map(val => {
                // eslint-disable-next-line use-isnan
                if (val == NaN) {
                    temp_arr.push(true)
                } else if (isNaN(val) && typeof val != "string") {
                    temp_arr.push(true)
                } else {
                    temp_arr.push(false)
                }
            })
            new_row_data.push(temp_arr)
        })

        return new DataFrame(new_row_data, { columns: columns, index: this.index })
    }

I think it's a good idea to move this to the generic NDFrame class so that it can support features like aligning data, something pandas does to generalize the operations of both modules.

def isna(self) -> "DataFrame":
        result = self._constructor(self._data.isna(func=isna))
        return result.__finalize__(self, method="isna")

What do you think about that?

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

Correct, I think it is a good idea to abstract the method into the generic module as you propose.

I meant that if you have two Series or DataFrames of different sizes and want to compute the corr function, pandas first applies df.align(df2), which aligns the data to the smaller object (that is, it clears the excess data on the other object) before applying the respective corr function. At the moment I have this:

       if (kwargs["min_periods"] === undefined || kwargs["min_periods"] === 0) {
            kwargs["min_periods"] = 1;
        }

        if (this.size < kwargs["min_periods"]) {
            return NaN;
        }

        if (kwargs["min_periods"] < 0 || kwargs["min_periods"] > this.size) {
            throw new Error(`Value Error: min_periods need to be in range of [0, ${this.size}]`);
        }

        if (other !== undefined) {
          let [ left, right ] = this.__align_data(other, { "join": "outer", "axis": 0, "inplace": false})
          let valid_index = utils.__bit_wise_nanarray(left.isna().values, right.isna().values)

          if (valid_index.length !== 0) {
            left = left.iloc(valid_index)
            right = right.iloc(valid_index)
          }

          if (left.__check_series_op_compactibility(right)) {
            let f = this.__get_corr_function(kwargs["method"]);
            return f(left, right);
          }
        }
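
The align-then-filter step above can be sketched on plain arrays. This is a simplified illustration, not the danfojs implementation: dropMissingPairs and pearson are hypothetical helpers, and alignment here is positional (the longer array's excess is ignored), which is an assumption made to keep the example short.

```javascript
// Keep only positions where both arrays hold a usable number.
function dropMissingPairs(left, right) {
  const n = Math.min(left.length, right.length); // ignore the excess of the longer array
  const isValid = v => typeof v === "number" && !Number.isNaN(v);
  const l = [];
  const r = [];
  for (let i = 0; i < n; i++) {
    if (isValid(left[i]) && isValid(right[i])) {
      l.push(left[i]);
      r.push(right[i]);
    }
  }
  return [l, r];
}

// Pearson correlation over the pairs that survive the filter.
function pearson(left, right, minPeriods = 1) {
  const [x, y] = dropMissingPairs(left, right);
  if (x.length === 0 || x.length < minPeriods) return NaN;
  const mx = x.reduce((a, b) => a + b, 0) / x.length;
  const my = y.reduce((a, b) => a + b, 0) / y.length;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < x.length; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// The NaN at position 2 and the extra fifth value on the left are both dropped.
const r = pearson([1, 2, NaN, 4, 5], [2, 4, 6, 8]);
console.log(r);
```

Filtering both sides with the same mask keeps the pairs in step, which is the point of combining the two isna() results before calling iloc.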

steveoni avatar steveoni commented on July 20, 2024

Is corr calculating the correlation within DataFrame columns or between two DataFrames (or is it calculating both)?

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

At the moment I'm calculating the correlation within DataFrame columns. I have Pearson and Kendall tau-b working now. See #26

steveoni avatar steveoni commented on July 20, 2024

Ok. that's cool.
Great job 👍

github-actions avatar github-actions commented on July 20, 2024

Stale issue message

JhennerTigreros avatar JhennerTigreros commented on July 20, 2024

It's been a while since I could work on this. @steveoni or @risenW, any update on this implementation? Now I can return to tackle this issue 💪🏽

risenW avatar risenW commented on July 20, 2024

Update on this? We have released the TS version, so you can update this issue
