Comments (14)
Can you give me desc and sum output for id1 to id6?
from stata-gtools.
Running out of memory should not cause a segfault. If all the IDs are integers, I suspect this is a bug in the bijection code. To check, you can run
gen str1 dummy = " "
gegen double loanid = group(id1 id2 id3 id4 id5 id6 dummy), missing verbose
I'll try to hop on a server tomorrow and check whether I can reproduce it (but it will work better with the output of desc and sum).
from stata-gtools.
Probably won't be able to until tomorrow or Monday (Veterans Day), but I'll try to send you something reproducible.
BTW, in case it's worth it on your side, I added a shortcut to -fegen- for when the dataset is already sorted by the variables, in which case it's faster to just do it from Stata (lines 111-112).
from stata-gtools.
I don't get the speedups you mention (Stata/MP 14.1 with 8 cores), and I'm not sure why. You also said in issue 21 that it's faster for you to do that type of thing. But even after a sort, "by `varlist'" is not very fast on my end. Can you point me to some benchmarks? Maybe then I can work out when it would be faster to switch.
A reproducible example would be best, but even variable descriptions would help. I doubt it's memory; I have this run, which used at most ~20GiB. Here is an example where the bijection would overflow:
. set rmsg on
r; t=0.00 7:57:51
. clear
r; t=0.01 7:57:54
. set obs 142686929
number of observations (_N) was 0, now 142,686,929
r; t=3.09 7:57:57
. gen long id1 = int(runiform() * 10000)
r; t=13.74 7:58:11
. gen long id2 = int(rnormal() * 10000)
r; t=27.45 7:58:39
. gen long id3 = int(runiform() * 10000)
r; t=16.32 7:58:55
. gen long id4 = int(rnormal() * 10000)
r; t=32.42 7:59:28
. gen long id5 = int(runiform() * 10000)
r; t=20.07 7:59:48
. gen long id6 = int(rnormal() * 10000)
r; t=31.13 8:00:19
. gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)
Plugin setup; .02 seconds
Generated targets; 8.951 seconds
Parsed by variables; .002 seconds
Plugin step 1: Read in by variables; 20.210 seconds.
Bijection OK with all integers (i.e. no extended miss val)? No.
Values OK but range (1,277,352,040,000,000,000 * 112,824) too large; falling back on hash.
Plugin step 2.1: Determined hashing strategy; 6.820 seconds.
Plugin step 2.3: Hashed variables (128-bit); 8.210 seconds.
Radix sort on hash (16-bits at a time)
Plugin step 2.4: Sorted integer-only hash; 38.300 seconds.
Plugin step 2: Hashed by variables; 53.330 seconds.
Found 1 64-bit hash collision(s). Fell back on 128-bit hash.
N = 142,686,929; 142,686,929 balanced groups of size 1
Plugin step 3: Set up panel; 2.710 seconds.
Plugin step 4.1: Checked for hash collisions; 8.130 seconds.
There were no hash collisions: 6 variables, 142,686,929 obs, 142,686,929 groups
Plugin step 4.2: Keep only one row per group; 10.120 seconds.
Plugin step 4.3: Sorted groups in memory; 72.550 seconds.
Plugin step 4: Created indexed array with sorted by vars; 91.900 seconds.
Plugin step 5: Copied back encoding to Stata; 61.750 seconds.
Plugin runtime; 230.6 seconds
Total runtime; 240.5 seconds
r; t=240.54 8:04:20
Here is one where it wouldn't:
. set rmsg on
r; t=0.00 8:05:28
. clear
r; t=0.01 8:05:28
. set obs 142686929
number of observations (_N) was 0, now 142,686,929
r; t=3.13 8:05:31
. gen long id1 = int(runiform() * 100)
r; t=14.24 8:05:46
. gen long id2 = int(rnormal() * 100)
r; t=27.44 8:06:13
. gen long id3 = int(runiform() * 100)
r; t=16.52 8:06:30
. gen long id4 = int(rnormal() * 100)
r; t=31.03 8:07:01
. gen long id5 = int(runiform() * 100)
r; t=20.06 8:07:21
. gen long id6 = int(rnormal() * 100)
r; t=31.18 8:07:53
. gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)
Plugin setup; .02 seconds
Generated targets; 8.941 seconds
Parsed by variables; .003 seconds
Plugin step 1: Read in by variables; 20.380 seconds.
Bijection OK with all integers (i.e. no extended miss val)? Yes.
Plugin step 2.1: Determined hashing strategy; 6.770 seconds.
Plugin step 2.3: Bijected integers to natural numbers; 3.360 seconds.
Radix sort on hash (16-bits at a time)
Plugin step 2.4: Sorted integer-only hash; 27.610 seconds.
Plugin step 2: Hashed by variables; 37.740 seconds.
N = 142,686,929; 142,686,686 unbalanced groups of sizes 1 to 2
Plugin step 3: Set up panel; 2.090 seconds.
Plugin step 4.2: Keep only one row per group; 9.410 seconds.
Plugin step 4: Created indexed array with sorted by vars; 10.550 seconds.
Plugin step 5: Copied back encoding to Stata; 21.510 seconds.
Plugin runtime; 92.82 seconds
Total runtime; 102.4 seconds
r; t=102.49 8:09:35
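The "range ... too large; falling back on hash" message in the first run comes from the overflow guard on the integer bijection. The idea can be sketched in C (illustrative only; names and details are assumptions, not the actual gtools source): treat the by-variables as digits of a mixed-radix number, but only after checking that the product of their ranges fits in 64 bits.

```c
#include <stdint.h>

/* Illustrative sketch of an integer bijection with an overflow guard;
 * names and details are assumptions, not the actual gtools source. */

/* Can prod(ranges) be represented in a uint64_t? If not, the bijection
 * would overflow and we must fall back on hashing. */
int bijection_ok(const uint64_t *ranges, int k) {
    uint64_t total = 1;
    for (int i = 0; i < k; i++) {
        if (ranges[i] == 0 || total > UINT64_MAX / ranges[i])
            return 0;   /* overflow (or empty range): fall back on hash */
        total *= ranges[i];
    }
    return 1;
}

/* Mixed-radix encoding: offsets[i] = value[i] - min[i], in [0, ranges[i]).
 * Distinct tuples map to distinct indices in [0, prod(ranges)). */
uint64_t biject(const uint64_t *offsets, const uint64_t *ranges, int k) {
    uint64_t index = 0;
    for (int i = 0; i < k; i++)
        index = index * ranges[i] + offsets[i];
    return index;
}
```

In the first run above, the combined range 1,277,352,040,000,000,000 * 112,824 is far beyond 2^64 (about 1.8e19), so the check fails and the plugin falls back on the hash; in the second run the ranges are small enough to biject.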
from stata-gtools.
About the overflow: I couldn't replicate it (I managed to build a very similar dataset with the same number of obs., but the program ran successfully). The only things I can think of are either a server-specific memory error (unlikely) or something very specific to the data at that point in time. For instance, the variables used to create the group were unique identifiers of the dataset, so there were 142 million different values of the ID.
About the speedup, this is a simple example I ran on my personal laptop (4 cores), but to be honest the gain doesn't seem very large:
clear
set obs 20000000
gen double id1 = ceil(runiform()*100) * 1000
gen double id2 = ceil(runiform()*1000) * 10000
gen double id3 = ceil(runiform()*10000) * 10000
sort id1 id2 id3
timer clear
cap drop id
timer on 1
by id1 id2 id3: gen byte _ = (_n == 1)
qui gen double id = sum(_)
drop _
timer off 1
mata: hash1(st_data(., "id"))
cap drop id
timer on 2
by id1 id2 id3: gen double id = (_n == 1)
qui replace id = sum(id)
timer off 2
mata: hash1(st_data(., "id"))
cap drop id
timer on 3
gegen double id = group(id1 id2 id3)
timer off 3
mata: hash1(st_data(., "id"))
timer list
And the timing results:
. timer list
1: 3.31 / 1 = 3.3070
2: 2.82 / 1 = 2.8160
3: 3.46 / 1 = 3.4560
So it seems there are some savings but not much...
from stata-gtools.
Here is the benchmark on my computer (Stata/IC):
1: 23.16 / 1 = 23.1550
2: 22.59 / 1 = 22.5950
3: 2.86 / 1 = 2.8560
but on a server (Stata/MP 8 cores):
1: 3.40 / 1 = 3.3980
2: 2.38 / 1 = 2.3800
3: 4.40 / 1 = 4.4030
So the sorted -by- approach is faster in Stata/MP with many cores (almost 2x) but roughly 10x slower in IC (which is my use case). One option would be to switch if, say, Stata is using >= 4 cores, but I'd rather do something like bypassing the hash when the data is already sorted.
Reading the by variables (1s), checking whether the doubles are really integers (0.7s; they are, but the bijection requires too large an index), and hashing the data (1.9s) take 3.6 seconds overall, which is most of the runtime. I can't speed up the first step, but I can skip the second step and the hash if the data is sorted. That's 2.6s to cut, which is plenty, so I'll have to figure out how to generate my panel setup from the sorted set of variables to bring gegen back to par in this scenario.
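Generating the panel setup directly from sorted data amounts to a single linear scan that records where each new group starts. A hypothetical sketch in C (the plugin's language; this is an illustration, not the actual gtools code):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: given data already sorted by k by-variables
 * (row-major, n rows of k doubles each), record the row where each
 * group starts. Returns the number of groups J; starts[0..J-1] hold
 * the group start rows. The comparison is bitwise, which suffices
 * for exact double equality. */
size_t panelsetup_sorted(const double *data, size_t n, int k, size_t *starts) {
    if (n == 0) return 0;
    size_t j = 0;
    starts[j++] = 0;                      /* first row opens the first group */
    for (size_t i = 1; i < n; i++) {
        /* a new group begins whenever any by-variable changes */
        if (memcmp(data + i * (size_t)k, data + (i - 1) * (size_t)k,
                   (size_t)k * sizeof(double)) != 0)
            starts[j++] = i;
    }
    return j;
}
```

This skips both the integer check and the hash entirely, which is where the 2.6s mentioned above would come from.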
from stata-gtools.
Here's a run with all 6 IDs being unique:
clear
set obs 142686929
gen rsort = .
forvalues i = 1 / 6 {
    replace rsort = runiform() * 100
    sort rsort
    gen long id`i' = _n
}
replace rsort = runiform() * 100
sort rsort
gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)
This gives:
Plugin setup; 0 seconds
Generated targets; 8.373 seconds
Parsed by variables; .005 seconds
Plugin step 1: Read in by variables; 17.830 seconds.
Bijection OK with all integers (i.e. no extended miss val)? No.
Values OK but range (20,359,559,707,451,041 * 142,686,929) too large; falling back on hash.
Plugin step 2.1: Determined hashing strategy; 6.190 seconds.
Plugin step 2.3: Hashed variables (128-bit); 6.480 seconds.
Radix sort on hash (16-bits at a time)
Plugin step 2.4: Sorted integer-only hash; 36.160 seconds.
Plugin step 2: Hashed by variables; 48.840 seconds.
Plugin step 3.1: Created group index; 1.360 seconds.
N = 142,686,929; 142,686,929 balanced groups of size 1
Plugin step 3.2: Normalized group index and Stata index; 0.720 seconds.
Plugin step 3: Set up panel; 2.080 seconds.
Plugin step 4.1: Checked for hash collisions; 9.830 seconds.
There were no hash collisions: 6 variables, 142,686,929 obs, 142,686,929 groups
Plugin step 4.2: Keep only one row per group; 10.720 seconds.
Plugin step 4.3: Sorted groups in memory; 62.050 seconds.
Plugin step 4: Created indexed array with sorted by vars; 83.790 seconds.
Plugin step 5.1: Generated encoding in Stata order; 27.520 seconds.
Plugin step 5.2: Copied back encoding to Stata; 2.060 seconds.
Plugin step 5: Copied back encoding to Stata; 29.590 seconds.
Plugin runtime; 182.2 seconds
Total runtime; 190.6 seconds
from stata-gtools.
Re: speedup, I now set up the panel directly and the runtime goes down considerably. It's still marginally slower, but I think I can call it a day. I will run tests and commit later in the week.
1: 3.62 / 1 = 3.6210
2: 2.66 / 1 = 2.6600
3: 2.84 / 1 = 2.8420
from stata-gtools.
Cool! Yes, I agree that should be enough.
I'm still bugged about those crashes I was having last week, but I can't replicate them, so I think you might want to close the issue, at least for now.
from stata-gtools.
I am bugged as well, but I cannot replicate them either. On my end it handles that many observations just fine... Maybe it was a case where it just hit the bijection limit? That is, it was just under the limit, so it passed the check I made, but when it came down to it, it overflowed? That might make sense if you were using a version of gtools < 0.9.0, but from 0.9.0 onward I changed the limit to be the largest unsigned integer over 2.
from stata-gtools.
One annoyance with the speedup is that it is actually slower when all the variables are integers and the bijection will not overflow. So:
- The "speedup" is slower when all inputs are integers and the bijection is OK.
- The speedup is faster when the inputs are not integers and the hash must be used.
- The speedup is faster when the inputs are integers but the bijection might overflow.
In other words, the "speedup" is not always faster and not always slower. I'm thinking of adding a hash() option that switches hash methods, so the user can choose:
- 0: Default. Uses the panelsetup if sorted, bijection if all integers, hash otherwise.
- 1: Always hashes. Uses bijection if all integers, hash otherwise.
- 2: Always uses spookyhash.
In all cases, if the data is already sorted then the hash would not be re-sorted.
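The proposed hash() tiers could be dispatched along these lines (a sketch in C with made-up names, just to pin down the semantics of the option values above; this is not the gtools API):

```c
/* Sketch of the proposed hash() option semantics; names are
 * illustrative, not the gtools API. */
typedef enum {
    STRAT_PANELSETUP,   /* data already sorted: set up the panel directly */
    STRAT_BIJECTION,    /* all-integer inputs with a safe range */
    STRAT_SPOOKY        /* general-purpose 128-bit spookyhash */
} strategy;

strategy choose_strategy(int hash_opt, int sorted,
                         int all_integers, int bijection_ok) {
    if (hash_opt == 2)                    /* hash(2): always spookyhash */
        return STRAT_SPOOKY;
    if (hash_opt == 0 && sorted)          /* hash(0): use the sort order if present */
        return STRAT_PANELSETUP;
    if (all_integers && bijection_ok)     /* hash(0)/hash(1): biject when safe */
        return STRAT_BIJECTION;
    return STRAT_SPOOKY;
}
```

The sketch leaves out the re-sort detail: under any option, an already-sorted hash would simply not be re-sorted.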
from stata-gtools.
Makes sense. One thing that would be good as some sort of "endgame" is to expose the C functions through thin wrappers and do everything through those. That's in a way what I did with ftools, and some of the reasons are i) it's easier for others to extend, ii) it's easier to run extensive tests against a smaller set of C functions, and iii) it's easier to improve later on.
Is that feasible, or maybe even already there? Like, what would I need if I wanted to accelerate reghdfe with those functions (or any data-intensive package that uses categories)?
from stata-gtools.
Re: reghdfe. What do you need from ftools? If you tell me, maybe the functionality is already present; right now I'm not sure what you need.
Re: Wrappers. I don't think they will be as useful as your wrappers.
I've already made work easier for myself by making every .ado file a wrapper for _gtools_internal.ado, which is a massive function that handles everything and sets it up for C. Hence C assumes that a bunch of scalars, matrices, and all the relevant variables will always exist and have specific shapes. This is because C cannot create variables, matrices, or anything else that doesn't already exist in Stata.
So I could write a number of wrappers for various parts of the function, but executing them in sequence would be much slower than a single full function run. Further, they couldn't be thin wrappers for C, because C cannot interact with Stata outside a relatively restricted set of functions (more specifically, I cannot keep objects I create in C available after the plugin has finished executing).
However, I could write thin wrappers for _gtools_internal.ado. To pull the curtain back a little, all the functions (except gisid, which executes steps 1-4 but differs afterward) share a common sequence:
1. Read the data
2. Determine hashing strategy (this includes an "is sorted" check)
3. Hash
4. Sort the hash (keeping track of how it maps to Stata row numbers)
5. Panel setup
6. Check for collisions
7. Sort the groups (with index)
8. Map the sorted group index to the sorted hash index
9. Function-specific stuff
Steps 2, 3, 6, and 7 require that a copy of the data be available to C in memory. Saving the results of steps 3, 4, 5, or 6-8 would require creating variables in Stata in addition to allocating memory in C. There is also an inefficiency throughout in casting doubles to and from 64-bit integers. Here is what I could write:
- Is the data sorted? Executes step 1 and checks if sorted. Gives yes or no.
- Is the bijection OK? Executes step 1 and checks whether it can biject. Gives yes or no.
- Hash. Creates 3 variables from Stata and executes steps 1-3. The first two variables need to be doubles and store either the bijection (and an empty column) or the two parts of the spookyhash. The third variable can be long or double and is the index of the hash into the Stata observations (in case the user passes [if] [in] or drops missing rows).
- Hash sort. Either creates 3 variables from Stata and executes steps 1-4, or picks up from the hash step above and executes step 4. It sorts the hash and stores the sorted hash along with the index.
- Panel setup. Either creates 2 variables from Stata and executes steps 1-5, or creates 1 variable, picks up from the hash sort step above, and executes step 5. This step creates the index to Stata observations, if it does not already exist, and stores in the first J observations the start points of the grouped data.
- Check for collisions + sort groups + map sorted group index to hash index. This can pick up from the panel setup step by creating one extra variable (which will be the sort order of the groups); it would re-read the observations into memory, check for collisions, sort them, and store the sort index. It can also do steps 1-8 directly after creating 3 variables from Stata.
Is that useful? Running any one of those would be fast, but running a sequence of them would be really slow. Of course, if the user could somehow interact with the C results directly then it would not be a problem, but that is not possible with the current plugin interface.
from stata-gtools.
I will close this issue, as you suggest, and we can continue the discussion of additions to gtools (including an API, integration with reghdfe, etc.) in issue #30.
from stata-gtools.