I'm finding some unexpected behaviour in 1.4.1 that did not occur in 1.4.0.

Returning NaN for simple multiple linear regression case in 1.4.1 (sciruby/statsample)

Comments (17)

agarie commented on August 26, 2024

Thanks, I'll look into it.

from statsample.

agarie commented on August 26, 2024

In 1.4.1:

>> require 'statsample'
>> Statsample::VERSION
=> "1.4.1"

>> @a=[27.0, 12.0, 16.0, 25.0].to_vector(:scale)
>> @b=[10.0, 15.0, 19.0, 2.0].to_vector(:scale)
>> @y=[1, 1, 1, 1].to_vector(:scale)

>> ds={'a'=>@a,'b'=>@b,'y'=>@y}.to_dataset

>> lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y')

>> lr.r
=> NaN

>> lr.r2
=> NaN

>> lr.coeffs.each do |k, v|
 |   puts "#{k}: #{v}"
 | end
a: NaN
b: NaN
=> {"a"=>NaN, "b"=>NaN}

And in 1.4.0:

>> gem "statsample", "=1.4.0"
>> require 'statsample'
>> Statsample::VERSION
=> "1.4.0"

>> @a=[27.0, 12.0, 16.0, 25.0].to_vector(:scale)
>> @b=[10.0, 15.0, 19.0, 2.0].to_vector(:scale)
>> @y=[1, 1, 1, 1].to_vector(:scale)

>> ds={'a'=>@a,'b'=>@b,'y'=>@y}.to_dataset

>> lr=Statsample::Regression::Multiple::RubyEngine.new(ds,'y')

>> lr.r
=> NaN

>> lr.r2
=> NaN

>> lr.coeffs.each do |k, v|
 |   puts "#{k}: #{v}"
 | end
a: NaN
b: NaN
=> {"a"=>NaN, "b"=>NaN}

I just downloaded 1.4.1 and 1.4.0 from rubygems and I got the same result regardless of the version used. Will have to look at how Regression::Multiple is implemented to have a better idea of what is happening.

Thanks for creating a test, it'll be useful. Please post here if you find anything that can help. :)

from statsample.

einpaule commented on August 26, 2024

I tried to simplify as much as possible while still seeing the bad behaviour, but I forgot to check whether the simplified version behaved correctly in the previous version. I'll retrace my steps and find something simple where the error occurs only in 1.4.1 but not in 1.4.0.

from statsample.

agarie commented on August 26, 2024

How did you find out about this behavior? Can you host the original code in a gist (if possible), so we can work from there?

I made some stylistic changes to lib/statsample.rb and lib/statsample/{dataset,matrix,reliability}.rb, but I might have introduced a bug. Or there were changes in Claudio's repository that were introduced after the 1.4.0 release... anyway, I'll keep looking into it. :)

from statsample.

einpaule commented on August 26, 2024

Here's what I could reproduce:

With 1.4.0

rvm gemset create test140
rvm gemset use test140
gem install 'statsample' -v 1.4.0
irb

2.1.2 :001 > require 'statsample'
 => true 
2.1.2 :002 > Statsample::VERSION
 => "1.4.0" 
2.1.2 :003 > 
2.1.2 :004 >   regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
2.1.2 :005 >       dataset_inputs = {
2.1.2 :006 >           '4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
2.1.2 :007 >           '5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
2.1.2 :008 >           '6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
2.1.2 :009?>       }
2.1.2 :010?>     ds = dataset(dataset_inputs)
2.1.2 :011?>     lr(ds, '6')
2.1.2 :012?>   end
 => #<Statsample::Analysis::Suite:0x000000038b72e0 @block=#<Proc:0x000000038b73a8@(irb):4>, @name=Statsample::Regression::Multiple, @attached=[], @output=#<IO:<STDOUT>>> 
2.1.2 :013 > 
2.1.2 :014 >   results = regression_analysis.run
 => #<Statsample::Regression::Multiple::RubyEngine:0x000000038bcfd8 @matrix_cor=Matrix[[1.0, 0.009923807720864935, 0.0], [0.009923807720864935, 1.0, 0.0], [0.0, 0.0, 1.0]], @matrix_cov=Matrix[[1.0, 0.009923807720864935, 0.0], [0.009923807720864935, 1.0, 0.0], [0.0, 0.0, 1.0]], @no_covariance=true, @y_var="6", @fields=["4", "5"], @n_predictors=2, @predictors_n=2, @matrix_x=Matrix[[1.0, 0.009923807720864935], [0.009923807720864935, 1.0]], @matrix_x_cov=Matrix[[1.0, 0.009923807720864935], [0.009923807720864935, 1.0]], @matrix_y=Matrix[[0.0], [0.0]], @matrix_y_cov=Matrix[[0.0], [0.0]], @y_sd=4.25288070139903e-17, @x_sd={"4"=>7.540110136434694, "5"=>7.2631689494604075}, @cases=24, @x_mean={"4"=>17.625, "5"=>14.166666666666666}, @y_mean=0.07999200000000005, @name="Multiple reggresion of 4,5 on 6", @digits=3, @coeffs_stan=[0.0, 0.0], @coeffs=[0.0, 0.0], @valid_cases=24, @total_cases=24, @ds=#<Statsample::Dataset:29747360 @name=Dataset 1 @fields=[4,5,6] cases=24, @dy=Vector(type:scale, n:24)[0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001], @ds_valid=#<Statsample::Dataset:30012060 @name=Dataset 1 @fields=[4,5,6] cases=24, @ds_indep=#<Statsample::Dataset:30010980 @name=Dataset 1 @fields=[4,5] cases=24, @dep_columns=[[27.0, 12.0, 16.0, 25.0, 0.0, 13.0, 14.0, 28.0, 1.0, 18.0, 24.0, 19.0, 7.0, 27.0, 17.0, 17.0, 16.0, 24.0, 21.0, 22.0, 16.0, 24.0, 22.0, 13.0], [10.0, 15.0, 19.0, 2.0, 20.0, 13.0, 7.0, 24.0, 5.0, 17.0, 16.0, 29.0, 15.0, 20.0, 23.0, 11.0, 14.0, 4.0, 19.0, 19.0, 3.0, 9.0, 6.0, 20.0]]> 
2.1.2 :015 > results.r
 => 0.0 
2.1.2 :016 > results.r2
 => 0.0

With 1.4.1

rvm gemset create test141
rvm gemset use test141
gem install 'statsample' -v 1.4.1
irb

2.1.2 :001 > require 'statsample'
 => true 
2.1.2 :002 > Statsample::VERSION
 => "1.4.1" 
2.1.2 :003 > 
2.1.2 :004 >   regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
2.1.2 :005 >       dataset_inputs = {
2.1.2 :006 >           '4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
2.1.2 :007 >           '5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
2.1.2 :008 >           '6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
2.1.2 :009?>       }
2.1.2 :010?>     ds = dataset(dataset_inputs)
2.1.2 :011?>     lr(ds, '6')
2.1.2 :012?>   end
 => #<Statsample::Analysis::Suite:0x00000002197ce0 @block=#<Proc:0x00000002197da8@(irb):4>, @name=Statsample::Regression::Multiple, @attached=[], @output=#<IO:<STDOUT>>> 
2.1.2 :013 > 
2.1.2 :014 >   results = regression_analysis.run
 => #<Statsample::Regression::Multiple::RubyEngine:0x000000021915c0 @matrix_cor=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @matrix_cov=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @no_covariance=true, @y_var="6", @fields=["4", "5"], @n_predictors=2, @predictors_n=2, @matrix_x=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_x_cov=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_y=Matrix[[NaN], [NaN]], @matrix_y_cov=Matrix[[NaN], [NaN]], @y_sd=0.0, @x_sd={"4"=>7.540110136434694, "5"=>7.2631689494604075}, @cases=24, @x_mean={"4"=>17.625, "5"=>14.166666666666666}, @y_mean=0.07999200000000001, @name="Multiple reggresion of 4,5 on 6", @digits=3, @coeffs_stan=[NaN, NaN], @coeffs=[NaN, NaN], @valid_cases=24, @total_cases=24, @ds=#<Statsample::Dataset:17599380 @name=Dataset 1 @fields=[4,5,6] cases=24, @dy=Vector(type:scale, n:24)[0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001], @ds_valid=#<Statsample::Dataset:17258600 @name=Dataset 1 @fields=[4,5,6] cases=24, @ds_indep=#<Statsample::Dataset:17257420 @name=Dataset 1 @fields=[4,5] cases=24, @dep_columns=[[27.0, 12.0, 16.0, 25.0, 0.0, 13.0, 14.0, 28.0, 1.0, 18.0, 24.0, 19.0, 7.0, 27.0, 17.0, 17.0, 16.0, 24.0, 21.0, 22.0, 16.0, 24.0, 22.0, 13.0], [10.0, 15.0, 19.0, 2.0, 20.0, 13.0, 7.0, 24.0, 5.0, 17.0, 16.0, 29.0, 15.0, 20.0, 23.0, 11.0, 14.0, 4.0, 19.0, 19.0, 3.0, 9.0, 6.0, 20.0]]> 
2.1.2 :015 > results.r
 => NaN 
2.1.2 :016 > results.r2
 => NaN

from statsample.

einpaule commented on August 26, 2024

For your convenience, here's a version of the code to copy and paste into irb:

require 'statsample'
Statsample::VERSION

regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
  dataset_inputs = {
    '4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
    '5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
    '6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
  }
  ds = dataset(dataset_inputs)
  lr(ds, '6')
end

results = regression_analysis.run
results.r
results.r2

from statsample.

einpaule commented on August 26, 2024

Another thing that might not be clearly visible from the above is that the dependent variable is always the same value, so the coefficients are resolved to be 0.0 and the constant to the value of the dependent variable.

If one value of the dependent variable is changed, it works in both 1.4.1 and 1.4.0.

from statsample.

agarie commented on August 26, 2024

Running Ruby 2.1.2p95:

>> gem "statsample", "= 1.4.0"
=> true
>> require "statsample"
=> true
>> Statsample::VERSION
=> "1.4.0"
>> regression_analysis = Statsample::Analysis.store(Statsample::Regression::Multiple) do
?>   dataset_inputs = {
?>     '4' => [27.0,12.0,16.0,25.0,0.0,13.0,14.0,28.0,1.0,18.0,24.0,19.0,7.0,27.0,17.0,17.0,16.0,24.0,21.0,22.0,16.0,24.0,22.0,13.0].to_vector(:scale),
?>     '5' => [10.0,15.0,19.0,2.0,20.0,13.0,7.0,24.0,5.0,17.0,16.0,29.0,15.0,20.0,23.0,11.0,14.0,4.0,19.0,19.0,3.0,9.0,6.0,20.0].to_vector(:scale),
?>     '6' => [0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001].to_vector(:scale)
>>   }
>>   ds = dataset(dataset_inputs)
>>   lr(ds, '6')
>> end
=> #<Statsample::Analysis::Suite:0x007fb9a13fb800 @block=#<Proc:0x007fb9a13fb8c8@(irb):5>, @name=Statsample::Regression::Multiple, @attached=[], @output=#<IO:<STDOUT>>>
>>
?> results = regression_analysis.run
=> #<Statsample::Regression::Multiple::RubyEngine:0x007fb9a13c9c60 @matrix_cor=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @matrix_cov=Matrix[[1.0, 0.009923807720864888, NaN], [0.009923807720864888, 1.0, NaN], [NaN, NaN, 1.0]], @no_covariance=true, @y_var="6", @fields=["4", "5"], @n_predictors=2, @predictors_n=2, @matrix_x=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_x_cov=Matrix[[1.0, 0.009923807720864888], [0.009923807720864888, 1.0]], @matrix_y=Matrix[[NaN], [NaN]], @matrix_y_cov=Matrix[[NaN], [NaN]], @y_sd=0.0, @x_sd={"4"=>7.540110136434694, "5"=>7.2631689494604075}, @cases=24, @x_mean={"4"=>17.625, "5"=>14.166666666666666}, @y_mean=0.07999200000000001, @name="Multiple reggresion of 4,5 on 6", @digits=3, @coeffs_stan=[NaN, NaN], @coeffs=[NaN, NaN], @valid_cases=24, @total_cases=24, @ds=#<Statsample::Dataset:70217625390900 @name=Dataset 1 @fields=[4,5,6] cases=24, @dy=Vector(type:scale, n:24)[0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001,0.07999200000000001], @ds_valid=#<Statsample::Dataset:70217624644040 @name=Dataset 1 @fields=[4,5,6] cases=24, @ds_indep=#<Statsample::Dataset:70217624642940 @name=Dataset 1 @fields=[4,5] cases=24, @dep_columns=[[27.0, 12.0, 16.0, 25.0, 0.0, 13.0, 14.0, 28.0, 1.0, 18.0, 24.0, 19.0, 7.0, 27.0, 17.0, 17.0, 16.0, 24.0, 21.0, 22.0, 16.0, 24.0, 22.0, 13.0], [10.0, 15.0, 19.0, 2.0, 20.0, 13.0, 7.0, 24.0, 5.0, 17.0, 16.0, 29.0, 15.0, 20.0, 23.0, 11.0, 14.0, 4.0, 19.0, 19.0, 3.0, 9.0, 6.0, 20.0]]>
>> results.r
=> NaN
>> results.r2
=> NaN

I get the same result with Ruby 2.1.5.

Well, there's obviously something else different between our systems: I can't make it work with Statsample 1.4.0 or 1.4.1, both installed via rubygems. What system are you using? I'm on Mac OS X 10.10.2, with rb-gsl 1.16.0.4.

It appears the line responsible for generating those NaNs is statsample/regression/multiple/rubyengine.rb#L20. I'll have more time tonight to look into this issue. Thanks for your help! :)

from statsample.

einpaule commented on August 26, 2024

Hi again,

Sorry, I didn't get notified of your reply (maybe because of the DDoS on GitHub?).

I'm running on:

Ubuntu 14.04 (kernel 3.13.0-48-generic)
ruby-2.1.2

According to Wikipedia, R^2 is not defined for the case we have (where all the values of the dependent variable are equal). So NaN is actually acceptable for this case after all. The reason I was getting a different value could be differences in rounding (?).
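One possible reading of the two inspect outputs above, sketched in plain Ruby without statsample: the 1.4.0 run reports @y_mean=0.07999200000000005 and @y_sd=4.25288070139903e-17, while the 1.4.1 run reports @y_mean=0.07999200000000001 and @y_sd=0.0. If the computed mean is off by a few units in the last place, the standard deviation comes out tiny but nonzero, and a zero numerator divided by it gives 0.0 instead of NaN. The helper name `sample_sd` is made up for illustration, and the final division is a simplification of the real correlation computation, not the library's code.

```ruby
# Illustrative only: sample_sd is a made-up helper, not a statsample method.
def sample_sd(values, mean)
  sum_sq = values.map { |v| (v - mean)**2 }.reduce(0.0, :+)
  Math.sqrt(sum_sq / (values.size - 1))
end

y = [0.07999200000000001] * 24

# Means copied from the @y_mean fields in the two inspect outputs.
sd_rounded = sample_sd(y, 0.07999200000000005) # mean as reported under 1.4.0
sd_exact   = sample_sd(y, 0.07999200000000001) # mean as reported under 1.4.1

puts sd_rounded        # tiny but nonzero
puts sd_exact          # 0.0
puts 0.0 / sd_rounded  # 0.0
puts 0.0 / sd_exact    # NaN
```

If this reading is right, the 0.0 I saw in 1.4.0 was an accident of rounding rather than a deliberate handling of the constant case.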

So I would be OK to close this issue if you do not want to hunt down the difference further. If you did, I'd be happy to continue trying to narrow down the case in which it worked for me in 1.4.0.

from statsample.

agarie commented on August 26, 2024

No problem, that DDoS was very problematic indeed. Anyway, I thought about it and yeah, R^2 doesn't make sense in this context. Can you link to the passage that says it is undefined in this situation?

And I don't know if closing it is really a good idea. Maybe we should print a warning or raise an exception if we get to this situation? I certainly need to add this case to the documentation at the very least.

from statsample.

einpaule commented on August 26, 2024

Sure:
http://en.wikipedia.org/wiki/Coefficient_of_determination#Interpretation

At the end of the second paragraph it says that neither formula is defined for the case that y_1 = ... = y_n = average of y values.
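To make that concrete, here is a small plain-Ruby sketch (no statsample needed) of the usual definition R^2 = 1 - SS_res / SS_tot. The method name `r_squared` is only illustrative. With a constant y, both sums of squares are zero, so the ratio is 0.0/0.0 and floating-point arithmetic returns NaN.

```ruby
# Illustrative helper, not part of statsample.
def r_squared(y, fitted)
  mean   = y.reduce(0.0, :+) / y.size
  ss_tot = y.map { |v| (v - mean)**2 }.reduce(0.0, :+)
  ss_res = y.zip(fitted).map { |yi, fi| (yi - fi)**2 }.reduce(0.0, :+)
  1.0 - ss_res / ss_tot
end

constant_y  = [1.0, 1.0, 1.0, 1.0]
r2_constant = r_squared(constant_y, constant_y) # 0.0 / 0.0 inside, so NaN

varying_y  = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(varying_y, varying_y)    # 1.0 - 0.0 / 5.0

puts r2_constant # NaN
puts r2_perfect  # 1.0
```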

from statsample.

einpaule commented on August 26, 2024

I actually gave it a bit more thought and it came to me that just because my application flow went to R^2 first, I never actually checked the coefficients and the constant.

Although R^2 is fine as NaN (running the calculation as described in Wikipedia gave -Infinity in Matlab, for example), the constant should equal y and the coefficients should be 0. The result, in my mind, should not be an exception. [edit:] I get back NaN for the coefficients and for the constant in all cases in which I get back NaN as the R^2.[/edit]

I think I might have hit upon a corner case in 1.4.0 that for some reason worked. If I replace the 0.07999200000000001 in the dependent variable with 1, it fails in both versions. I think that what we should really be looking at is the behaviour of multiple linear regression when the dependent variable is constant.

The case in which the dependent variable is unaffected by any of the explanatory variables is a special case which is obviously simplistic.

Perhaps a good solution is to first check whether all the elements in the dependent vector are identical. If they are, the solution is trivial: coefficients are 0 and the constant is equal to the value of any dependent variable entry.
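A minimal sketch of that check in plain Ruby, assuming exact equality between entries. The method name `constant_response_fit` and the shape of the returned hash are hypothetical and not part of statsample.

```ruby
# Hypothetical helper: returns the trivial fit when every entry of y is
# identical, or nil so the caller can fall through to the normal regression.
def constant_response_fit(y, predictor_names)
  return nil unless y.uniq.size == 1

  {
    constant: y.first,
    coeffs: predictor_names.each_with_object({}) { |name, h| h[name] = 0.0 }
  }
end

fit     = constant_response_fit([1, 1, 1, 1], %w[a b])
no_fit  = constant_response_fit([1, 2, 1, 1], %w[a b])

p fit    # {:constant=>1, :coeffs=>{"a"=>0.0, "b"=>0.0}} (hash formatting varies by Ruby version)
p no_fit # nil
```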

from statsample.

einpaule commented on August 26, 2024

Hi again,

I've updated my fork with separate tests for GslEngine and RubyEngine. The issue is confined to RubyEngine; GslEngine works as expected (according to the new tests).

I am uncertain about creating a pull request, because I haven't gotten many of the other tests in the suite to pass.

from statsample.

einpaule commented on August 26, 2024

Created a pull request to clbustos:
clbustos#42

from statsample.

agarie commented on August 26, 2024

Sorry for my absence. I finally got some time to work on open source.

Can you send your pull request to this repository? We want to concentrate our efforts in the SciRuby forks so it's easier for someone to find our projects.

from statsample.

agarie commented on August 26, 2024

And thanks for your explanation.

> Perhaps a good solution is to first check whether all the elements in the dependent vector are identical. If they are, the solution is trivial: coefficients are 0 and the constant is equal to the value of any dependent variable entry.

This seems like a good solution indeed. I'm thinking about how to check that the elements are identical when we're working with floating-point precision. I'll update this issue this week.
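One way this could look, as a rough sketch rather than a decision: compare the spread of the values against a relative tolerance. The method name `nearly_constant?` and the default tolerance of 1e-12 are assumptions picked for illustration.

```ruby
# Hypothetical helper: true when max and min differ by no more than
# rel_tol times the largest magnitude in the vector.
def nearly_constant?(values, rel_tol = 1e-12)
  return false if values.empty?

  min, max = values.minmax
  scale = [min.abs, max.abs].max
  (max - min) <= rel_tol * scale
end

same_value   = nearly_constant?([0.07999200000000001] * 24)
tiny_jitter  = nearly_constant?([1.0, 1.0 + 1e-15])
real_spread  = nearly_constant?([1.0, 1.1])

puts same_value  # true
puts tiny_jitter # true
puts real_spread # false
```

A relative tolerance keeps the check independent of the scale of y; whether 1e-12 is the right threshold is an open question.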

from statsample.

agisga commented on August 26, 2024

I just ran a regression model with the same data in R, since it's nowadays the standard in statistical computing. They seem to have the same solution as the one proposed by einpaule.

R also estimates the slope coefficients as 0 and the intercept (constant) as the value of y. All the computed statistics (like R^2, F-statistic, t, etc.) are NaN or NA, and warnings of the following kind are produced:

Warning message:
In summary.lm(mod) : essentially perfect fit: summary may be unreliable

Warning message:
In anova.lm(mod) :
  ANOVA F-tests on an essentially perfect fit are unreliable

from statsample.
