Comments (6)

mcaceresb commented on May 23, 2024

I asked about this inconsistency in egen here, if you want to read more about it. For now, I'll stick to upgrading variable types.

from stata-gtools.

mcaceresb commented on May 23, 2024

Right. I put that in place in case no type was specified, but I suppose if the user wants to specify a different type then I should allow for that. I'll fix it over the weekend. Thanks!

mcaceresb commented on May 23, 2024

Actually, I had misunderstood this comment. Due to the way the plugin API works, there's no way to do this efficiently, and even then it's not obvious to me what the default behavior should be. For now, I'll just upgrade the type to one that is guaranteed to be safe. egen does not always do this, by the way. Consider:

clear
set obs `=2^24 + 10'
g long x = _n
egen group = group(x)
format %21.0gc x group
l in `=_N - 10' / `=_N'
qui gdistinct x group
matrix list r(distinct)

You can see that egen does not give group a type wide enough for this many observations (its default is float), so group ends up with fewer distinct levels than x.
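
The effect comes down to float precision. The following is a minimal sketch that is not part of the original example: it uses only built-in Stata commands, and the variable names exact, asfloat, and aslong are made up for illustration. A float has a 24-bit significand, so 2^24 + 1 cannot be stored exactly, while a long holds it without loss.

```stata
* Sketch: integers just past 2^24 collide when stored as float.
clear
set obs 3
gen double exact   = 2^24 + _n - 1   // 16777216, 16777217, 16777218
gen float  asfloat = exact           // rounded to the nearest float
gen long   aslong  = exact           // stored exactly
format %21.0gc exact asfloat aslong
list

* Observation 2 (16777217) rounds down to 2^24 as a float...
assert asfloat[2] == 16777216
* ...but survives intact as a long.
assert aslong[2] == 16777217
```

Two different ids that round to the same float value are indistinguishable afterwards, which is how distinct groups can end up merged.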

sergiocorreia commented on May 23, 2024

Hi Mauricio,

Just to be clear, do you plan on upgrading the user-specified type in order to fit the number of levels?

I would argue that we should follow two rules, in order of importance:

  1. As much as possible, upgrade type to avoid losing information. In particular, if group() returns incorrect results, that would be both hard to detect and would create serious problems down the line (e.g. if I get the mean by group, but I used float as default so now I treat two groups as one).
  2. If possible, downgrade type to save memory. So if the user doesn't specify the type of the new variable, and it fits in -byte-, then use byte and not -float- or anything else.
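
The two rules can be sketched together as a small helper. This is only an illustration under stated assumptions: smallest_type is a hypothetical program, not something gtools or egen provides, and it takes the number of groups as already known. The cutoffs are Stata's documented maxima for byte (100), int (32,740), and long (2,147,483,620).

```stata
* Hypothetical helper (not part of gtools or egen): return the smallest
* integer storage type that can hold group ids 1..nlevels exactly.
capture program drop smallest_type
program define smallest_type, rclass
    args nlevels
    if (`nlevels' <= 100) {
        local t byte
    }
    else if (`nlevels' <= 32740) {
        local t int
    }
    else if (`nlevels' <= 2147483620) {
        local t long
    }
    else {
        local t double
    }
    return local type `t'
end

* Usage: 3e7 groups need a long; float is never selected.
smallest_type 3e7
display "`r(type)'"
```

Note that float never appears in the ladder: every integer a float can hold exactly also fits in a long, which takes the same four bytes.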

Also, even if you don't like the ideas above, I would argue that the default type for egen group should never be float, because it is always dominated by -long-:

clear all
set obs `=3e7'
gen long i = _n // long supports up to 2bn

gegen long g1 = group(i)
cap noi assert i == g1 // works

gegen g2 = group(i)
cap noi assert i == g2 // fails b/c g2 is float

gegen byte g3 = group(i)
cap noi assert i == g3 // fails b/c g3 is byte

mcaceresb commented on May 23, 2024

"gegen, group" now forces a type if there might be loss of information. In other words, I follow the first rule but not always the second because, as I mentioned, the second is not always possible to implement efficiently.

I think you're using an old version of gtools for your example. In gtools 0.12.5 I don't get the issues you mention, and with gegen byte g3 = group(i) you should see the message "(warning: user-requested type 'byte' upgraded to 'long')" (unless I didn't upload to GitHub correctly, in which case do let me know).

sergiocorreia commented on May 23, 2024

Cool! The first rule is the key one, because we can always call compress for the 2nd one!!
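
A minimal sketch of that workflow, assuming a small toy dataset and using the built-in egen rather than gegen so that it runs without gtools installed: create the group id with a safe type first, then let compress shrink it.

```stata
* Sketch: create the id with a safe type, then downgrade with compress.
clear
set obs 50
gen long i = ceil(_n / 10)       // 5 distinct values
egen long g = group(i)           // safe but wider than needed

compress g                       // demotes g without losing information

* g only holds the values 1..5, so it now fits in a byte.
local t : type g
assert "`t'" == "byte"
```

compress only changes a variable's type when every stored value is preserved exactly, so it cannot undo the protection given by the first rule.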
