mcaceresb / stata-gtools
Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins
Home Page: https://gtools.readthedocs.io
License: MIT License
This line causes an error if the current path has spaces.
clear
set obs 5
gen y = 1
gen x = _n
replace x = . in 2
replace x = .a in 3
replace x = .b in 4
gcollapse (sum) y, by(x)
Works just fine. However,
clear
set obs 5
gen y = 1
gen long x = _n
replace x = . in 2
replace x = .a in 3
replace x = .b in 4
gcollapse (sum) y, by(x)
gcollapse thinks ".", ".a", and ".b" are the same group. This happens throughout gtools.
When wild is specified, the user cannot use existing variable names.
. sysuse auto, clear
(1978 Automobile Data)
. gcollapse mpg = price, wild
mpg already defined
r(110);
I get the following error when trying to use gtools' -gcollapse- command:
file env_set_macosx.plugin not found
(error occurred while loading gcollapse.ado)
r(601);
Simple example reproducing the error:
. ado uninstall gtools
package gtools from https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/build
^
(package uninstalled)
. net install gtools, from(https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/build/)
checking gtools consistency and verifying not already installed...
installing into /Users/USER/Library/Application Support/Stata/ado/plus/...
installation complete.
. sysuse auto
(1978 Automobile Data)
. gcollapse (mean) price, by(make)
file env_set_macosx.plugin not found
(error occurred while loading gcollapse.ado)
r(601);
Just a quick question: is there a reason why we can't use strings as targets in collapse?
EG:
sysuse auto
gcollapse (first) make
Returns:
. sysuse auto
(1978 Automobile Data)
. gcollapse (first) make
Source make must be numeric.
r(198);
In principle, some target variables (first, last, firstnm, lastnm) could be strings.
To parallelize bootstraps with the ado parallel, I have included gegen and regress in a small program, which returns the invalid syntax error 198.
When replacing gegen with egen, the program runs fine.
clear all
sysuse auto , clear
cap prog drop pargegen
prog define pargegen, rclass
version 13
syntax varlist [if]
marksample touse
gegen test = sum(price)
reg `varlist' if `touse'
drop test
end
pargegen price weight foreign rep78
Setting trace on I get the following error log:
gtools_log.txt
My suspicion is that the local level, which is defined by regress, is somehow altered by gegen. According to the Stata manual, level(95) is the default option, but in line 3883 of the log file 95,0 shows up. Even when I specify the regress option level(95) explicitly, the same error is shown.
It might be good to have a comprehensive list of the things that could go wrong, so users know where to look. For instance
When writing ftools, I realized that I could have contract for free if I just stopped in the middle of the collapse code and returned the counts of each level. Thus, the freq option allows you to get the same results as contract.
Is there something like that with gtools? (Running collapse (count) is not ideal because it requires an additional variable, it requires it to be nonmissing, etc.)
This mostly applies for older versions of Stata (e.g. Stata 12), but in general it's risky to have lines such as
local xyz = substr(..)
Because the equal sign truncates the local. A good explanation of why is here: http://www.stata-journal.com/sjpdf.html?articlenum=pr0045
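A minimal sketch of the difference, using a made-up macro (long_list). With the equal sign, the right-hand side is evaluated as a string expression, which older Stata versions cap in length; the extended macro function works on the macro text directly. The exact cap depends on the Stata version, so the comments are illustrative rather than definitive.

```stata
* Build a macro well over 244 characters (hypothetical contents)
local long_list
forvalues i = 1/100 {
    local long_list `long_list' var`i'
}

* Risky in older Stata: evaluated as a string expression, may be truncated
local risky = subinstr("`long_list'", "var", "x", .)

* Safer: extended macro function, no expression evaluation
local safe : subinstr local long_list "var" "x", all

* Compare how many words survived in each version
di "words via = : `: word count `risky''"
di "words via : : `: word count `safe''"
```

On recent Stata versions both counts should agree; on versions with a short string-expression limit, the first may come up short.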
When using the -gcollapse- command, the program repeatedly failed in the middle of execution with the following error:
'1.149e+08' invalid observation number
I attach the relevant part of the trace output from Stata below. Note that "1.149e+08" is the number of observations in my dataset, but I suspect that somehow this gets passed in the wrong format, causing this error:
----------------------------------------------------------------------------------- end gcollapse.gtools_timer ---
- }
- }
- if ( "`merge'" == "" ) {
= if ( "" == "" ) {
- qui {
- if ( `=scalar(__gtools_J) > 0' ) keep in 1 / `:di scalar(__gtools_J)'
= if ( 1 ) keep in 1 / 1.149e+08
'1.149e+08' invalid observation number
else if ( `=scalar(__gtools_J) == 0' ) drop if 1
else if ( `=scalar(__gtools_J) < 0' ) {
di as err "The plugin returned a negative number of groups."
di as err `"This is a bug. Please report to {browse "`website_url'":`website_disp'}"'
}
ds *
}
if ( `=_N' == 0 ) di as txt "(no observations)"
local memvars `r(varlist)'
local keepvars `by' `gtools_targets'
local dropme `:list memvars - keepvars'
if ( "`dropme'" != "" ) mata: st_dropvar((`:di subinstr(`""`dropme'""', " ", `"", ""', .)'))
if ( (`=_N > 0') & (`=scalar(__gtools_k_extra)' > 0) & ( `used_io' | ("`forceio'" == "forceio") ) ) {
qui mata: st_addvar(__gtools_addtypes, __gtools_addvars, 1)
gtools_timer info 97 `"Added extra targets after collapse"', prints(`benchmark')
local __gtools_iovars: list gtools_targets - gtools_uniq_vars
mata: __gtools_iovars = (`:di subinstr(`""`__gtools_iovars'""', " ", `"", ""', .)')
if ( `debug_io_read_method' == 0 ) {
cap `noi' `plugin_call' `__gtools_iovars', collapse read `"`__gtools_file'"'
if ( _rc != 0 ) exit _rc
}
else {
local nrow = `=scalar(__gtools_J)'
local ncol = `=scalar(__gtools_k_extra)'
mata: __gtools_data = gtools_get_collapsed (`"`__gtools_file'"', `nrow', `ncol')
mata: st_store(., __gtools_iovars, __gtools_data)
}
gtools_timer info 97 `"Read extra targets from disk"', prints(`benchmark')
}
local order = 0
qui ds *
local varorder `r(varlist)'
local varsort `by' `gtools_targets'
foreach varo in `varorder' {
gettoken svar varsort: varsort
if ("`varo'" != "`vars'") local order = 1
}
if ( `order' ) order `by' `gtools_targets'
forvalues k = 1 / `:list sizeof gtools_targets' {
mata: st_varlabel("`:word `k' of `gtools_targets''", __gtools_labels[`k'])
}
if ( "`unsorted'" == "" ) sort `by'
}
-------------------------------------------------------------------------------------------------- end gcollapse ---
r(198); t=103.95 11:28:02
end of do-file
r(198); t=7348.34 11:28:02
r(198); t=7348.34 11:28:02
end of do-file
r(198); t=0.00 11:28:02
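The trace suggests the group count is expanded through display's default format, which can switch to scientific notation for large values. The sketch below shows one way to avoid that with an explicit format; the scalar name ngroups and its value are stand-ins, not the names used inside gtools.

```stata
* Hypothetical stand-in for the group count returned by the plugin
scalar ngroups = 114900000

* Default display format; large values may come back in scientific notation
local sci : di scalar(ngroups)

* Explicit fixed format with no decimals, trimmed of padding
local full : di %21.0f scalar(ngroups)
local full = trim("`full'")

di "default format: `sci'"
di "fixed format:   `full'"
```

With the fixed format, a line such as keep in 1 / `full' receives a plain integer.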
Please refer to the attached pictures (sorry, I have difficulties reproducing the data). In my data there is a variable "oobbu" whose maximum value is 3231, as indicated in the picture. After gcollapse (firstnm), the maximum of "oobbu" becomes 2.14e+09, which is wrong.
This issue seems to be solved if I first save the data in memory, then run "use XXX, clear", and then run gcollapse again (the second picture).
When running a gcollapse or gegen command, Stata quits immediately with a Windows error message. The original message is in German (translated here), but maybe it helps anyhow.
Problem event name: APPCRASH
Application name: StataMP-64.exe
Application version: 521.13.1.222
Application timestamp: 5567795b
Fault module name: env_set_windows.plugin
Fault module version: 0.0.0.0
Fault module timestamp: 5946e14a
Exception code: c0000005
Exception offset: 0000000000001505
OS version: 6.1.7601.2.1.0.256.48
Locale ID: 1031
Additional information 1: 86b5
Additional information 2: 86b57a8af37a99527b6b1746d2e86099
Additional information 3: 41fb
Additional information 4: 41fb44b28c1de6cadb0770927f4e2baf
With collapse, min, max, first, last, firstnm, and lastnm all preserve Stata's extended missing values. However, gcollapse returns the system missing value (.) for all of them.
. sysuse auto, clear
(1978 Automobile Data)
. replace price = .a
(74 real changes made, 74 to missing)
. gcollapse (first) price, by(foreign)
. l
+------------------+
| foreign price |
|------------------|
1. | Domestic . |
2. | Foreign . |
+------------------+
However, collapse gives
+------------------+
| foreign price |
|------------------|
1. | Domestic .a |
2. | Foreign .a |
+------------------+
Further, extended missing values are not correctly parsed by glevelsof. Consider:
clear
set obs 5
gen x = _n
replace x = . in 2
replace x = .a in 3
replace x = .b in 4
glevelsof x
While "." is excluded, both ".a" and ".b" appear via their internal representation (rather than ".a" and ".b").
I discovered that -gcollapse- produces an error when collapsing a variable into a target name that is identical to a variable name that exists pre-collapse. It is easy enough to circumvent, but might be fixed without much trouble? This caused me a bit of a headache trying to troubleshoot, since that behavior of -gcollapse- was not obvious to me.
See minimal working example as follows:
. gen price2=price
. gcollapse price=price2, by(make)
variable price already defined
r(110);
. gen price2=price
. collapse price=price2, by(make)
A bug in Stata causes gtools to exit with error if the user has more than 2^31-1 observations in memory. See this bug report.
I contacted StataCorp about it and they replied:
The SPI can work with datasets containing up to 2^31-1 observations. Our development group is looking into modifying future versions of the SPI to allow more observations.
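A minimal sketch of a guard based on the limit quoted above. The macro name spi_max, the message text, and the exit code are illustrative choices, not what gtools actually does.

```stata
* Largest observation count the SPI can address, per StataCorp's reply
local spi_max = 2^31 - 1

* Stop early with a readable message instead of failing inside the plugin
if (_N > `spi_max') {
    di as err "too many observations for the plugin interface (" _N " > `spi_max')"
    exit 198
}
```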
Just a minor thing but might be useful:
Sometimes I forget which specific commands are available in gtools, so it would be useful to have such a list within -gtools.sthlp-. Thus, if anyone runs h gtools, they can then click on the specific command they are interested in.
https://gtools.readthedocs.io/en/latest/usage/gunique/index.html#stored-results
r(unique), not r(nunique), yields the number of groups. In the Stata help file this is correct.
Hi
Thanks for the great work on gtools. It seems to me there may be a bug in hashsort. Please refer to the code below:
××××××××××××××××××××××××××××××
clear *
set obs 100
gen id1=round(_n,5)
gen id2=round(_n,15)
hashsort id1 -id2
by id1: gen value=_n //caused error
gsort id1 -id2
by id1: gen value=_n //caused no error
××××××××××××××××××××××××××××××
Grateful for advice.
DW
I'm running a fairly complex gegen on a fairly large dataset:
. count
142,686,929
. sort id4 id3 id5 id1 // sorted but not exactly by what i want to gegen
. gegen double loanid = group(id1 id2 id3 id4 id5 id6), missing verbose
Immediately after that I get a segfault, even when using a server with lots of memory (64 GB). This is on Stata 14 MP/6 on Linux.
I would usually try to provide a minimum working example that replicates this error, but even when logging or when running with set trace on, I can't catch what is going on.
Is it possible that there is not enough memory to copy all the variables and that causes the segfault?
Thanks!
Sergio
I get the message "Plugin step 4: ..." even if verbose options are off.
Hi Mauricio,
I just installed the gtools package on Stata 14.1 running on Win Server 2012 R2 but I got the following error which seems to be the same reported here #15.
. gcollapse (mean) price, by(foreign)
file gtools_windows_legacy.plugin not found
(error occurred while loading gcollapse.ado)
. which gtools
c:\ado\plus\g\gtools.ado
*! version 0.6.16 13Sep2017 Mauricio Caceres Bravo, [email protected]
*! Program for managing the gtools package installation
Do you have any hint?
Many thanks,
Federico
Medians appear to be calculated wrong in at least this simple case:
clear all
set obs 3
gen x = 1
replace x = 3 in 2
replace x = 5 in 3
egen med_egen = pctile(x), p(50)
gegen med_gegen = pctile(x), p(50)
list
The same problem also emerges in gcollapse:
clear all
set obs 3
gen x = 1
replace x = 3 in 2
replace x = 5 in 3
tempfile t1
save `t1'
collapse (p50) x
list
use `t1', clear
gcollapse (p50) x
list
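For reference, the median of 1, 3, and 5 is 3. A quick cross-check using only built-in commands (no gtools) gives the value that gegen and gcollapse would be expected to match:

```stata
clear
set obs 3
gen x = 2 * _n - 1        // values 1, 3, 5

* Built-in reference values for the median
summarize x, detail
local med_sum = r(p50)

_pctile x, p(50)
local med_pct = r(r1)

di "summarize: `med_sum'   _pctile: `med_pct'"
```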
Hi Mauricio,
I just updated gtools and got an error that left me a bit confused (nothing else has changed on my system):
. gcollapse (sum) price, by(turn)
file gtools_windows_legacy.plugin not found
I'm running Windows 10 and Stata 14.2. The string reported by which gtools is this:
*! version 0.6.16 13Sep2017 Mauricio Caceres Bravo, [email protected]
A major enhancement to the plugin would be to sort the groups internally in C. This would afford a major speed-up when processing a large number of groups (vs sorting in Stata) and allow gegen to be used as an adequate replacement for egen.
In particular, one issue raised in #4 is that gegen does not produce IDs in the order that the groups would be sorted. For instance,
sysuse auto, clear
egen id1 = group(turn trunk)
fegen id2 = group(turn trunk)
gegen id3 = group(turn trunk)
assert id1==id2
assert id1==id3
Instead, gegen produces IDs in the order the groups appear. While this is by design, it is not the behavior of egen. Sorting groups internally would allow solving this issue as well.
I observe the following behavior of gegen on WIN 7 and Linux:
. sysuse auto
(1978 Automobile Data)
. gegen mean=mean(mpg)
-gegen- only provides support for by-able egen functions
r(198);
. by foreign: gegen mean=mean(mpg)
-gegen- only provides support for by-able egen functions
r(198);
This also happens with other gegen functions.
Hi!
I noticed that "gtools, dependencies" threw an error at me for not finding the spookyhash.dll in my ado folder despite it being there.
After inspecting gtools.ado, I found that the issue was caused by not enclosing the file paths in additional quotation marks. If your ado-path contains spaces (as in my case), this will cause problems. I fixed it by enclosing the file paths on lines 25, 26, and 65 in quotes.
No idea if that solution is Windows-specific, but maybe it's worth including in the next update.
Thanks a lot for writing this great plugin, this is extremely useful!
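A small illustration of why the quoting matters, using a made-up directory. Without quotes, a path containing a space is split into two tokens; compound double quotes keep it together. This is only a sketch of the general pattern, not the actual lines in gtools.ado.

```stata
* Hypothetical ado directory containing a space
local dir C:/Users/Some User/ado/plus

* Unquoted, the path splits into two tokens
local n_unquoted : word count `dir'

* Compound double quotes keep it as a single token
local n_quoted : word count `"`dir'"'

di "unquoted tokens: `n_unquoted'   quoted tokens: `n_quoted'"

* The same quoting applies when the path is passed to a command
capture confirm file `"`dir'/spookyhash.dll"'
```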
A few issues with gegen:
sysuse auto, clear
egen id1 = group(turn trunk)
fegen id2 = group(turn trunk)
gegen id3 = group(turn trunk)
assert id1==id2
assert id1==id3
The last command fails as the IDs are not the same
gegen id = group(turn) , wrongoption
Should give a warning that an incorrect option was used.
Hi, when running gcollapse with a by variable of type "strL", the results seem to be wrong. Reproduced as follows:
clear *
set obs 4
g strL name = strofreal(round(_n,2))
g id = round(_n,2)
g value = _n
gcollapse (max) value = value , by(name id)
currently the default seems to place missing values first.
When winsorizing a variable with missing values, gstats winsor fills in those missing values instead of leaving them missing.
clear
set obs 10
gen y = _n if !inrange(_n, 4, 6)
winsor2 y, cut(1 99) suffix(_w)
gstats winsor y, cut(10 90) suffix(_g)
li
Results:
+----------------+
| y y_w y_g |
|----------------|
1. | 1 1 1 |
2. | 2 2 2 |
3. | 3 3 3 |
4. | . . 10 |
5. | . . 10 |
|----------------|
6. | . . 10 |
7. | 7 7 7 |
8. | 8 8 8 |
9. | 9 9 9 |
10. | 10 10 10 |
+----------------+
Happy holidays!
-Sergio
Hi
Are there plans to support gcollapse (rawsum), or more generally to allow certain operations to be weighted whilst others are not? I frequently run into situations where I'd like to calculate a weighted mean but an unweighted sum, which at the moment means I'm limited to using collapse in those situations (or spending extra time coding workarounds).
By the way, thanks for this great suite of tools, I use it regularly and it has made significant improvements to the runtime of some of my programs.
It seems I can't abbreviate variables in gcontract:
. sysuse auto
(1978 Automobile Data)
. gcontract head
Malformed call: 'head'
Syntax: [+|-]varname [[+|-]varname ...]
r(111);
. contract head
Native Stata -collapse- does this too
sysuse auto, clear
des rep
collapse (sum) rep, by(foreign)
des rep
compress
-fcollapse- keeps integers as is
sysuse auto, clear
des rep
fcollapse (sum) rep, by(foreign)
des rep
-gcollapse- replicates the behavior of the native Stata -collapse-
sysuse auto, clear
des rep
gcollapse (sum) rep, by(foreign)
des rep
compress
On large datasets, the extra time taken by -compress- at the end to convert the variables back to int may reduce the speed gains of -gcollapse- vs -fcollapse-. Is there any way to replicate the behavior of -fcollapse- in this setting?
The maximum number of variables that can be passed to gtools commands is the maximum value of matsize in the user's system.
disp c(matsize)
Will show the user how many variables can be used with gtools. This limitation is due to the way Stata's C API is designed. I use matrices to pass various necessary information to and from C, so this limitation will almost certainly not change unless the Stata C API changes.
So, for example, the following fails in Stata/IC:
. clear
. set obs 10
obs was 0, now 10
. forvalues i = 1/800 {
2. gen x`i' = _n
3. }
. * This is fine
. gisid x*
. gen x801 = 10
. * But now this fails
. gisid x*
# variables > matsize (801 > 800). Tried to run
set matsize 801
but the command failed. Try setting matsize manually.
r(908);
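A rough sketch of checking the variable count against matsize before calling a gtools command. It uses only built-in commands; the variable names are made up, and c(matsize) may behave differently in newer Stata versions where matsize is no longer a user setting.

```stata
clear
set obs 10
forvalues i = 1/5 {
    gen x`i' = _n
}

* Expand the wildcard and count the variables it matches
unab vars : x*
local k : word count `vars'

* Compare against the current matsize before calling a gtools command
if (`k' > c(matsize)) {
    di as err "`k' variables exceed matsize (" c(matsize) ")"
}
else {
    di as txt "`k' variables fit within matsize (" c(matsize) ")"
}
```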
A lengthy discussion on improvements and additions to gtools started in issue #28, but it is more appropriate to have a separate thread for it. The main idea currently being discussed is a gtools API, which would consist of various wrappers to the core functionality of gtools.
I am not sure the Stata portion of the API will be as useful as the ftools analogue due to the way in which the Stata Plugin Interface works (which is what I have to use to interact with Stata via C). However, it might be useful in ways I have not considered, hence this thread (and I am also thinking of creating a C library based on this plugin, which would be useful for people who aim to write C plugins in the future).
Feel free to make any suggestions or comments on what you would like to see in a gtools API here, as well as any other features and suggestions that you don't think merit their own thread. This issue will remain open past version 1.0, since an API won't make it to that release.
Example code:
clear
input long id1 int id2
1225800 179
1226197 162
1245415 167
1245415 204
1249196 158
1246805 226
1247361 189
1248872 203
1249196 158
end
which gtools
gisid id1 id2
isid id1 id2
I get this output:
which gtools
/fire/home/m1sac03/ado/plus/g/gtools.ado
*! version 1.0.5 20Sep2018 Mauricio Caceres Bravo, [email protected]
*! Program for managing the gtools package installation
. gisid id1 id2
. isid id1 id2
variables id1 id2 do not uniquely identify the observations
r(459);
I might be doing something wrong, but it's probably a bug. Also, running on Stata 14.2/Linux
The default Stata total() allows for an expression between the parentheses, which is more convenient than if-conditions (e.g. you want to take the total over certain observations, but list the value in all observations). The help file for gegen suggests this is also possible with gtools, but instead I get
sum(varlist) must call a numeric variable list.
when asking
bysort iri_key: gegen firstYearRev = total(dollar * inFirstYear)
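One possible workaround, sketched on the auto dataset rather than the data above: evaluate the expression into a variable first, then total that variable. The variable names are made up, and the block falls back to egen when gtools is not installed.

```stata
sysuse auto, clear

* Use gegen when gtools is installed; otherwise fall back to egen
capture which gegen
local cmd = cond(_rc == 0, "gegen", "egen")

* Evaluate the expression first, then total a plain variable
gen double price_foreign = price * foreign
bysort rep78: `cmd' total_var = total(price_foreign)

* Reference: Stata's egen accepts the expression directly
bysort rep78: egen total_exp = total(price * foreign)
```

The two totals should agree, at the cost of one extra variable.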
The following need to be typed as double when weights are involved:
total
sum
nansum
rawsum
rawnansum
count
nmissing
Describe the bug
Gtools uses timers 97 to 99 for benchmarks. If the user or another program opened those timers they will be cleared by gtools.
Code Sample
timer clear
timer on 99
timer on 98
timer on 97
timer list
clear
set obs 1
gen x = 1
gcollapse *
timer list
The second timer list prints nothing.
Version info
If the dataset is already sorted by the variables (as it often is), we can take advantage of this.
For instance, in the case below, it was 3x faster (0.8s vs 2.5s) to generate ok, run assert, and drop it than to run gisid (2.5s). The other alternatives were much slower, of course (12s for fisid and 1m for isid):
cls
clear
set obs 10000000
gen long id = (_n / 10)
bys id: gen double t = _n * 1000
sort id t
set rmsg on
*isid id t
fisid id t
gisid id t
by id t: gen byte ok = _n == 1
assert ok
drop ok
A quick way to implement this in an ado would be to add something like this at the beginning of the ado:
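if ("`: sortedby'" == "`varlist'") {
    tempvar ok
    * Flag the first observation within each combination of the ID variables
    by `varlist': gen byte `ok' = _n == 1
    * Any unflagged observation is a duplicate
    cou if !`ok'
    local notunique = r(N)
    drop `ok'
    if (`notunique') {
        di "not unique"
        exit 459
    }
}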
That said, I'm not entirely sure it will be widely useful. If t is instead a byte variable, most of the advantage is gone. Also, it would need to support if/in.
gcollapse will give an error when there are too many by variables or targets. The number of targets and by variables is limited by matsize:
clear
set matsize 100
set obs 10
forvalues i = 1/101 {
gen x`i' = 10
}
gen zz = runiform()
gcollapse zz, by(x*)
gcollapse x*, by(zz)
Both commands above will fail with error code 908. However, there is a point where increasing matsize will not help with the number of targets:
clear
set matsize 400
set obs 10
forvalues i = 1/300 {
gen x`i' = 10
}
gen zz = runiform()
gcollapse zz, by(x*)
gcollapse x*, by(zz)
The first command will succeed but the second will fail with error code 3000 (too many tokens). This is a problem with lines 253-255, 314, 385, 487, 514, 570, 579, and 621 using the regular subinstr function rather than the extended macro function :subinstr. A previous commit had switched to using :subinstr for all locals, but these lines use the function to create a mata object.
NOTE: The matsize problem may be a very fundamental limitation. Make sure to create a warning if it cannot be bypassed.
Since I do not have a Mac and I have been unable to install OSX on a virtual machine, I have not been able to compile an OSX version of the plugin. This is currently on hold. Macs are expensive.
If you happen to have a Mac and are willing to contribute to gtools, please comment here and I will give you instructions on how to compile.
hi, it seems there is a bug - please refer to the following:
clear *
set obs 100
gen aaaxp=1
gen aaa=1
gcollapse (mean) aaaxp=aaaxp (max) aaax=aaa //error: variable aaaxp not found
clear *
set obs 100
gen aaaxp=1
gen aaa=1
collapse (mean) aaaxp=aaaxp (max) aaax=aaa //no error
sysuse auto, clear
egen t1 = tag(rep)
gegen t2 = tag(rep)
li rep t? if t1!=t2
assert t1==t2
The assertion fails because gegen tags missing values of rep, while egen doesn't.
From the help files: "The result will be 1 if the observation is tagged and never missing, and 0 otherwise", so I think MVs should have a tag of zero unless the missing option is included.
In some large datasets I often want to do collapse (sum) var1-var1000 ..., but if a variable is always missing within a group, keep the sum as missing instead of zero:
clear
input byte(id x)
1 .
1 .
2 4
2 6
end
preserve
collapse (sum) x, by(id) fast
list
restore
preserve
fcollapse (nansum) x, by(id) fast
list
restore
Is there a way to do something like this with gcollapse? Other statistics (mean, etc.) already do this; it's only sum that causes problems.
Thanks!
Sergio
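Until something like nansum is available, one workaround is to collapse a count of nonmissing values alongside the sum and blank out the sum where the count is zero. The sketch below uses built-in collapse and a made-up helper name (n_x); whether gcollapse accepts the same call unchanged is an assumption.

```stata
clear
input byte(id x)
1 .
1 .
2 4
2 6
end

* Sum plus a count of nonmissing values per group
collapse (sum) x (count) n_x = x, by(id)

* Restore missing where the group had no nonmissing values
replace x = . if n_x == 0
drop n_x
list
```

With many variables this needs one helper count per target, which is the main drawback compared with a dedicated statistic.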
What would you like gtools to add or change (and why)?
When using -gcontract- with a variable which is not defined, the error is uninformative:
. sysuse auto, clear
(1978 Automobile Data)
. gcontract color
Malformed call: 'color'
Syntax: [+|-]varname [[+|-]varname ...]
r(111);
A specific message would be more useful.
Please include a specific suggestion
Return Stata's typical variable color not found
Hi, in the following code, egen group and gegen group give different results. egen group seems to switch automatically to a larger data type when the number of groups exceeds the limit of the specified data type.
clear
set obs 1000
g x = _n
egen byte group = group(x)
gegen byte group2 = group(x)
sysuse auto, clear
collapse price, by(foreign)
display "`:format price'"
The formats are maintained. However
sysuse auto, clear
gcollapse price, by(foreign)
display "`:format price'"
collapse preserves the format but gcollapse does not.
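As a stopgap, the format can be saved before collapsing and reapplied afterwards. A minimal sketch, which falls back to collapse when gtools is not installed; the macro names are illustrative.

```stata
sysuse auto, clear

* Remember the display format before collapsing
local fmt : format price

* Use gcollapse when gtools is installed; otherwise fall back to collapse
capture which gcollapse
local cmd = cond(_rc == 0, "gcollapse", "collapse")
`cmd' price, by(foreign)

* Reapply the saved format to the collapsed variable
format price `fmt'
display "`: format price'"
```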
When I run
sysuse auto
gcollapse mpg, by(foreign) verbose
The collapse works, but I'm told that
note: failed to load multi-threaded version; using fallback
even though I'm running it with Stata-MP on multicore Linux (and WIN 7, but this is pointed out in TODO). Not sure if this is a bug or a lack of understanding on my side.
Hi guys,
I'm trying to run gtools on a Windows Server 2016 with Stata 15, but I'm getting the following error:
Stata MP 15 6 users output:
gcollapse y, by(x)
Could not load plugin: c:\ado\plus\g\gtools_windows.plugin
(error occurred while loading _gtools_internal.ado)
Any tips on how to solve it?
Thanks a lot for this amazing command!