mcaceresb / stata-gtools

Faster implementation of Stata's collapse, reshape, xtile, egen, isid, and more using C plugins

Home Page: https://gtools.readthedocs.io

License: MIT License

Makefile 0.16% Python 0.58% C 30.21% Stata 69.05% TeX 0.01%

stata collapse reshape xtile percentile egen hash gtools spookyhash

stata-gtools's Introduction

Faster Stata for big data. This package uses C plugins and hashes to provide massive speed improvements to common Stata commands, including: reshape, collapse, xtile, tabstat, isid, egen, pctile, winsor, contract, levelsof, duplicates, unique/distinct, and more.

Faster Stata for Big Data

This package provides a fast implementation of various Stata commands using hashes and C plugins. The syntax and purpose are largely analogous to their Stata counterparts; for example, you can replace collapse with gcollapse, reshape with greshape, and so on. For a comprehensive list of differences (including some extra features!) see the remarks below; for details and examples see the official project page.

Quickstart

ssc install gtools
gtools, upgrade

Some quick benchmarks:

NOTE: Stata 17 introduced massive speed improvements to sort and collapse. In the MP version, in particular with many cores available, the native collapse can be up to twice as fast. (YMMV; overall native collapses could still be slower in some use cases.) gcollapse remains faster in SE and older Stata versions.

Gtools commands with a Stata equivalent

Function	Replaces	Speedup (IC / MP)	Unsupported	Extras
gcollapse	collapse	-0.5 to 2 (Stata 17+); 4 to 100 (Stata 16 and earlier)		Quantiles, merge, labels, nunique, etc.
greshape	reshape	4 to 20 / 4 to 15	"advanced syntax"	`fast`, spread/gather (tidyr equiv)
gegen	egen	9 to 26 / 4 to 9 (+,.)	labels	Weights, quantiles, nunique, etc.
gcontract	contract	5 to 7 / 2.5 to 4
gisid	isid	8 to 30 / 4 to 14	`using`, `sort`	`if`, `in`
glevelsof	levelsof	3 to 13 / 2 to 7		Multiple variables, arbitrary levels
gduplicates	duplicates	8 to 16 / 3 to 10
gquantiles	xtile	10 to 30 / 13 to 25 (-)		`by()`, various (see usage)
	pctile	13 to 38 / 3 to 5 (-)		Ibid.
	_pctile	25 to 40 / 3 to 5		Ibid.
gstats tab	tabstat	10 to 50 / 5 to 30 (-)	See remarks	various (see usage)
gstats sum	sum, detail	10 to 20 / 5 to 10	See remarks	various (see usage)

(+) The upper end of the speed improvements is for quantiles (e.g. median, iqr, p90) and few groups. Weights have not been benchmarked.

(.) Only gegen group was benchmarked rigorously.

(-) Benchmarks computed 10 quantiles. When computing a large number of quantiles (e.g. thousands) pctile and xtile are prohibitively slow due to the way they are written; in that case gquantiles is hundreds or thousands of times faster, but this is an edge case.

Extra commands

Function	Similar (SSC/SJ)	Speedup (IC / MP)	Notes
fasterxtile	fastxtile	20 to 30 / 2.5 to 3.5	Allows `by()`
	egenmisc (SSC) (-)	8 to 25 / 2.5 to 6
	astile (SSC) (-)	8 to 12 / 3.5 to 6
gstats hdfe		(.)	Allows weights, `by()`
gstats winsor	winsor2	10 to 40 / 10 to 20	Allows weights
gunique	unique	4 to 26 / 4 to 12
gdistinct	distinct	4 to 26 / 4 to 12	Also saves results in matrix
gtop (gtoplevelsof)	groups, select()	(+)	See table notes (+)
gstats range	rangestat	10 to 20 / 10 to 20	Allows weights; no flex stats
gstats transform			Various statistical functions

(-) fastxtile from egenmisc and astile were benchmarked against gquantiles, xtile (fasterxtile) using by().

(+) While similar to the user command 'groups' with the 'select' option, gtoplevelsof does not really have an equivalent. It is several dozen times faster than 'groups, select', but that command was not written with the goal of gleaning the most common levels of a varlist. Rather, it has a plethora of features and that one is somewhat incidental. As such, the benchmark is not equivalent and gtoplevelsof does not attempt to implement the features of 'groups'

(.) Other than the dated 'hdfe' command, I do not know of a stata command that residualizes variables from a set of fixed effects. The 'hdfe' command, as far as I can tell, morphed into the 'reghdfe' package; the latter, however, is a fully-functioning regression command, while 'gstats hdfe' only residualizes a set of variables.

Regression models

WARNING: Regression models are in beta and are only intended as utilities to compute coefficients and standard errors. I do not recommend their use in production; various post-estimation commands and statistics are not available. (See gstats hdfe for residualizing variables net of fixed effects.)

Function	Model	Similar
gregress	OLS	`regress`, `reghdfe`
givregress	2SLS	`ivregress 2sls`, `ivreghdfe`
gglm	IRLS	`logit`, `poisson`, `ppmlhdfe`

All commands allow the user to optionally add:

absorb() for high-dimensional fixed effects absorptions.
cluster() for clustering (multiple covariates assume clusters are nested).
by() for regressions by group.
weights for weighted versions. Unlike other weights, fweights are assumed to refer to the number of observations.

Linear regression is computed via OLS (or WLS), IV regression is computed via two-stage least squares (2SLS), and GLM (poisson or logit) regression is computed via iteratively reweighted least squares (IRLS). See the TODO section for planned features, or the Missing Features section in the documentation for what is missing before the first non-beta release.

Extra features

Several commands offer additional features on top of the massive speedup. See the remarks section below for an overview; for details and examples, see each command's help page.

In addition, several commands take gsort-style input, that is

[+|-]varname [[+|-]varname ...]

This does not affect the results in most cases, just the sort order. Commands that take this type of input include:

gcollapse
gcontract
gegen
glevelsof
gtop (gtoplevelsof)
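
As a small illustration of the signed notation, the sketch below uses the auto dataset that ships with Stata. It assumes that glevelsof takes the signed list as its main varlist and that gcollapse takes it inside by(); check each command's help file if your version behaves differently.

```stata
sysuse auto, clear

* A leading minus returns the levels of rep78 in descending order
glevelsof -rep78, local(desc_levels)
display "`desc_levels'"

* The same notation inside by(): the collapsed groups come back
* ordered by foreign, descending
gcollapse (mean) mean_price = price, by(-foreign)
list
```

The group means themselves are the same as with by(foreign); only the order of the output rows changes.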

Ftools

The commands here are also faster than the commands provided by ftools; further, gtools commands take a mix of string and numeric variables, which is a limitation of ftools. (Note I could not get several parts of ftools working on the Linux server where I have access to Stata/MP; hence the IC benchmarks.)

Gtools	Ftools	Speedup (IC)
gcollapse	fcollapse	2-9
gegen	fegen	2.5-4 (+)
gisid	fisid	4-14
glevelsof	flevelsof	1.5-13
hashsort	fsort	2.5-4

(+) Only egen group was benchmarked rigorously.

Limitations

strL variables are only partially supported on Stata 14 and above; gcollapse, gcontract, and greshape do not support strL variables.
Due to a Stata bug, gtools cannot support more than 2^31-1 (2.1 billion) observations. See this issue
Due to limitations in the Stata Plugin Interface, gtools can only handle as many variables as the largest matsize in the user's Stata version. For MP this is more than 10,000 variables but in IC this is only 800. See this issue.
Gtools uses compiled C code to achieve its massive increases in speed. This has two side-effects users might notice: First, it is sometimes not possible to break the program's execution. While this is already true for at least some parts of most Stata commands, there are fewer opportunities to break Gtools commands relative to their Stata counterparts.

Second, the Stata GUI might appear frozen when running Gtools commands. If the system then runs out of RAM (memory), it could look like Stata has crashed (it may show a "(Not Responding)" message on Windows or it may darken on *nix systems). However, the program has not crashed; it is merely trying to swap memory. To check this is the case, the user can monitor disk activity or monitor their system's pagefile or swap space directly.

Acknowledgements

The OSX version of gtools was implemented with invaluable help from @fbelotti in issue 11.
Gtools was largely inspired by Sergio Correia's (@sergiocorreia) excellent ftools package. Further, several improvements and bug fixes have come from @sergiocorreia's helpful comments.
With the exception of greshape, every gtools command has been written almost entirely from scratch (and even greshape is mostly new code). However, gtools commands typically mimic the functionality of existing Stata commands, including community-contributed programs, meaning many of the ideas and options are based on them (see the respective help files for details). gtools commands based on community-contributed programs include:
- gstats winsor, based on winsor2 by Lian (Arlion) Yujun
- gunique, based on unique by Michael Hills and Tony Brady.
- gdistinct, based on distinct by Gary Longton and Nicholas J. Cox.

Installation

I only have access to Stata 13.1, so that is the minimum supported version. You can install gtools from Stata via SSC:

ssc install gtools
gtools, upgrade

By default this syncs to the master branch, which is stable. To install the latest version directly, type:

local github "https://raw.githubusercontent.com"
net install gtools, from(`github'/mcaceresb/stata-gtools/master/build/)

Examples

The syntax is generally analogous to the standard commands (see the corresponding help files for full syntax and options):

sysuse auto, clear

* gstats {hdfe|residualize} varlist [if] [in] [weight], [absorb(varlist) options]
gstats hdfe hdfe_price = price, absorb(foreign rep78)
gstats residualize price mpg, absorb(foreign rep78) prefix(res_)

* gstats {sum|tab} varlist [if] [in] [weight], [by(varlist) options]
gstats sum price [pw = gear_ratio / 4]
gstats tab price mpg, by(foreign) matasave

* gquantiles [newvarname =] exp [if] [in] [weight], {_pctile|xtile|pctile} [options]
gquantiles 2 * price, _pctile nq(10)
gquantiles p10 = 2 * price, pctile nq(10)
gquantiles x10 = 2 * price, xtile nq(10) by(rep78)
fasterxtile xx = log(price) [w = weight], cutpoints(p10) by(foreign)

* gstats winsor varlist [if] [in] [weight], [by(varlist) cuts(# #) options]
gstats winsor price gear_ratio mpg, cuts(5 95) s(_w1)
gstats winsor price gear_ratio mpg, cuts(5 95) by(foreign) s(_w2)
drop *_w?

* hashsort varlist, [options]
hashsort -make
hashsort foreign -rep78, benchmark verbose mlast

* gegen target  = stat(source) [if] [in] [weight], by(varlist) [options]
gegen tag   = tag(foreign)
gegen group = tag(-price make)
gegen p2_5  = pctile(price) [w = weight], by(foreign) p(2.5)

* gisid varlist [if] [in], [options]
gisid make, missok
gisid price in 1 / 2

* gduplicates varlist [if] [in], [options gtools(gtools_options)]
gduplicates report foreign
gduplicates report rep78 if foreign, gtools(bench(3))

* glevelsof varlist [if] [in], [options]
glevelsof rep78, local(levels) sep(" | ")
glevelsof foreign mpg if price < 4000, loc(lvl) sep(" | ") colsep(", ")
glevelsof foreign mpg in 10 / 70, gen(uniq_) nolocal

* gtop varlist [if] [in] [weight], [options]
* gtoplevelsof varlist [if] [in] [weight], [options]
gtoplevelsof foreign rep78
gtop foreign rep78 [w = weight], ntop(5) missrow groupmiss pctfmt(%6.4g) colmax(3)

* gregress depvar indepvars [if] [in] [weight], [by(varlist) options]
gregress price mpg rep78, mata(coefs) prefix(b(_b_) se(_se_))
gregress price mpg [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)

* givregress depvar (endog = instruments) exog [if] [in] [weight], [by(varlist) options]
givregress price (mpg = gear_ratio) rep78, mata(coefs) prefix(b(_b_) se(_se_)) replace
givregress price (mpg = gear_ratio) [fw = rep78], by(foreign) absorb(rep78 headroom) cluster(rep78)

* gglm depvar indepvars [if] [in] [weight], family(...) [by(varlist) options]
gglm price mpg rep78, family(poisson) mata(coefs) prefix(b(_b_) se(_se_)) replace
gglm price mpg [fw = trunk], family(poisson) by(foreign) absorb(rep78 headroom) cluster(rep78)

gglm foreign price rep78 [fw = trunk], family(binomial) absorb(headroom) mata(coefs)
gglm foreign price if rep78 > 2, family(binomial) by(rep78) prefix(b(_b_) se(_se_)) replace

* gcollapse (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]
gen h1 = headroom
gen h2 = headroom
local lbl labelformat(#stat:pretty# #sourcelabel#)

gcollapse (mean) mean = price (median) p50 = gear_ratio, by(make) merge v `lbl'
disp "`:var label mean', `:var label p50'"
gcollapse (iqr) irq? = h? (nunique) turn (p97.5) mpg, by(foreign rep78) bench(2) wild

* gcontract varlist [if] [in] [fweight], [options]
gcontract foreign [fw = turn], freq(f) percent(p)

* greshape wide varlist,    i(i) j(j) [options]
* greshape long prefixlist, i(i) [j(j) string options]
*
* greshape spread varlist, j(j) [options]
* greshape gather varlist, j(j) value(value) [options]

gen j = _n
greshape wide f p, i(foreign) j(j)
greshape long f p, i(foreign) j(j)

greshape spread f p, j(j)
greshape gather f? p?, j(j) value(fp)

* gstats transform (stat) out = src [(stat) out = src ...] [if] [in] [weight], by(varlist) [options]
* gstats range  (stat) out = src [...] [if] [in] [weight], by(varlist) [options]
* gstats moving (stat) out = src [...] [if] [in] [weight], by(varlist) [options]

sysuse auto, clear
gstats transform (normalize) price (demean) price (range mean -sd sd) price, auto
gstats range  (mean) mean_r = price (sd) sd_r = price, interval(-10 10 mpg)
gstats moving (mean) mean_m = price (sd) sd_m = price, by(foreign) window(-5 5)

See the FAQs or the respective documentation for a list of supported gcollapse and gegen functions.

Remarks

Functions available with gegen, gcollapse, gstats tab

gcollapse supports every collapse function, including their weighted versions. In addition, weights can be selectively applied via rawstat(), and several additional statistics are allowed, including nunique, select#, and so on.
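
A minimal sketch of selective weighting follows. It assumes that rawstat() takes the names of the targets that should ignore the weights; the target names w_price and raw_mpg are made up for the example.

```stata
sysuse auto, clear

* Weighted mean of price, but an unweighted mean of mpg:
* rawstat() lists the targets that should ignore the weights.
gcollapse (mean) w_price = price (mean) raw_mpg = mpg [aw = weight], ///
    by(foreign) rawstat(raw_mpg)
list
```

Here raw_mpg should match what an unweighted collapse (mean) mpg, by(foreign) reports, while w_price uses the analytic weights.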

gegen technically does not support all of egen, but whenever a function that is not supported is requested, gegen hashes the data and calls egen grouping by the hash, which is often faster (gegen only supports weights for internal functions, since egen does not normally allow weights).

Hence both should be able to replicate all of the functionality of their Stata counterparts. Last, gstats tab allows every statistic allowed by tabstat as well as any statistic allowed by gcollapse; the syntax for the statistics specified via statistics() is the same as in tabstat.

The following are implemented internally in C:

Function	gcollapse	gegen	gstats tab
tag		X
group		X
total		X
count	X	X	X
nunique	X	X	X
nmissing	X	X (+)	X
sum	X	X	X
nansum	X	X	X
rawsum	X		X
rawnansum	X		X
mean	X	X	X
geomean	X	X	X
median	X	X	X
percentiles	X	X	X
iqr	X	X	X
sd	X	X	X
variance	X	X (+)	X
cv	X	X	X
max	X	X	X
min	X	X	X
range	X	X	X
select	X	X	X
rawselect	X		X
percent	X	X	X
first	X	X (+)	X
last	X	X (+)	X
firstnm	X	X (+)	X
lastnm	X	X (+)	X
semean	X	X (+)	X
sebinomial	X	X	X
sepoisson	X	X	X
skewness	X	X	X
kurtosis	X	X	X
gini	X	X	X
gini dropneg	X	X	X
gini keepneg	X	X	X

(+) indicates the function has the same or a very similar name to a function in the "egenmore" package, but the function was independently implemented and is hence analogous to its gcollapse counterpart, not necessarily the function in egenmore.

The percentile syntax mimics that of collapse and egen, with the addition that quantiles are also supported. That is,

gcollapse (p#) target = var [target = var ...] , by(varlist)
gegen target = pctile(var), by(varlist) p(#)

where # is a "percentile" with arbitrary decimal places (e.g. 2.5 or 97.5). gtools also supports selecting the #th smallest or largest value:

gcollapse (select#) target = var [(select-#) target = var ...] , by(varlist)
gegen target = select(var), by(varlist) n(#)
gegen target = select(var), by(varlist) n(-#)
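
The two syntaxes above can be tried on the auto dataset. This is only a sketch; the target names are arbitrary. Since select1 is the smallest value and select-1 the largest, the collapsed results can be checked against min and max.

```stata
sysuse auto, clear

* Row-level targets: a fractional percentile, and the 2nd smallest and
* 2nd largest price within each group
gegen p97_5       = pctile(price), by(foreign) p(97.5)
gegen second      = select(price), by(foreign) n(2)
gegen second_last = select(price), by(foreign) n(-2)

* Collapsed targets: select1 should agree with min, select-1 with max
gcollapse (p2.5) p2_5 = price (select1) lo = price (min) mn = price ///
          (select-1) hi = price (max) mx = price, by(foreign)
assert lo == mn & hi == mx
```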

In addition, the following are allowed in gegen as wrappers to other gtools functions (stat is any stat available to gcollapse, except percent, nunique):

Function	calls
xtile	fasterxtile
standardize	gstats transform
normalize	gstats transform
demean	gstats transform
demedian	gstats transform
moving_stat	gstats transform
range_stat	gstats transform
cumsum	gstats transform
shift	gstats transform
rank	gstats transform
winsor	gstats winsor
winsorize	gstats winsor

Last, when gegen calls a function that is not implemented internally by gtools, it will hash the by variables and call egen with by set to an id based on the hash. That is, if fcn is not one of the functions above,

gegen outvar = fcn(varlist) [if] [in], by(byvars)

would be the same as

hashsort byvars, group(id) sortgroup
egen outvar = fcn(varlist) [if] [in], by(id)

but preserving the original sort order. In case an egen option might conflict with a gtools option, the user can pass gtools_capture(fcn_options) to gegen.
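
To see the fallback in action, the sketch below uses egen's mad() (median absolute deviation), which is not in the table of internally implemented functions, and compares gegen against the manual hashsort plus egen route described above. The variable names are arbitrary.

```stata
sysuse auto, clear

* mad() is not implemented internally, so gegen hands the work to egen
gegen mad_g = mad(price), by(foreign)

* The manual equivalent: hash the by variables, then call egen by the id
hashsort foreign, group(id) sortgroup
egen mad_h = mad(price), by(id)

assert mad_g == mad_h
```

The only visible difference is that the manual route leaves the data sorted by the group id, whereas gegen keeps the original sort order.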

Differences and Extras

Differences from collapse

String variables are not allowed for first, last, min, max, etc. (see issue 25)
New functions: nunique, nmissing, cv, variance, select#, select-#, range, gini
rawstat allows selectively applying weights.
rawselect ignores weights for select (analogously to rawsum).
Option wild allows bulk-rename. E.g. gcollapse mean_x* = x*, wild
gcollapse (nansum) and gcollapse (rawnansum) outputs a missing value for sums if all inputs are missing (instead of 0).
gcollapse, merge merges the collapsed data set back into memory. This is much faster than collapsing a dataset, saving, and merging after. However, Stata's merge ..., update functionality is not implemented, only replace. (If the targets exist the function will throw an error without replace).
gcollapse, labelformat allows specifying the output label using placeholders.
gcollapse, sumcheck keeps integer types with sum if the sum will not overflow.
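
The difference between sum and nansum is easiest to see with a group whose inputs are all missing. A small sketch on made-up data:

```stata
clear
set obs 4
gen g = mod(_n, 2)
gen x = cond(g == 1, ., _n)   // x is missing everywhere in group g == 1

gcollapse (sum) s = x (nansum) ns = x, by(g)
list

* In the all-missing group, sum gives 0 while nansum gives missing
assert s == 0 & missing(ns) if g == 1
* With non-missing inputs the two agree
assert s == ns if g == 0
```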

Differences from reshape

Allows an arbitrary number of variables in i() and j()
Several options allow turning off error checks for faster execution, including: fast (similar to fast in gcollapse), unsorted (do not sort the output), nodupcheck (allow duplicates in i), nomisscheck (allow missing values and/or leading blanks in j), or nochecks (all of the above).
Subcommands gather and spread implement the equivalent commands from R's tidyr package.
At the moment, j(name [values]) is not supported. All values of j are used.
"reshape mode" is not supported. Reshape variables are not saved as part of the current dataset's characteristics, meaning the user cannot type reshape wide and reshape long without further arguments to reverse the reshape. This syntax is very cumbersome and difficult to support; greshape re-wrote much of the code base and had to dispense with this functionality.
For that same reason, "advanced" syntax is not supported, including the subcommands: clear, error, query, i, j, xij, and xi.
@ syntax can be modified via match()
dropmiss allows dropping missing observations when reshaping from wide to long (via long or gather).

Differences from regression models

gregress, givregress, and gglm do not aim to replicate the entire table of estimation results, nor the entire suite of post-estimation results and tests, that regress (reghdfe), ivregress 2sls (ivreghdfe), poisson (ppmlhdfe), or logit make available. At the moment, they are considered beta software and only coefficients and standard errors are computed.

Results are saved either to mata (default) or copied to variables in the dataset in memory.
by() and absorb() are allowed and can be combined.
givregress does a small sample adjustment (small) automatically.
givregress does not exit with error if covariates are collinear with the dependent variable.
If the givregress model is not identified, standard errors and coefficients are set to missing instead of exiting with error.
gglm runs with option robust automatically.
If there are no non-linear covariates (i.e. all observations are numerically zero) then the coefficients and standard errors are both set to missing.

Differences from xtile, pctile, and _pctile

Adds support for by() (including weights)
Does not ignore altdef with xtile (see this Statalist thread)
Category frequencies can also be requested via binfreq[()].
xtile, pctile, and _pctile can be combined via xtile(newvar) and pctile(newvar)
There is no limit to nquantiles() for xtile
Quantiles can be requested via percentiles() (or quantiles()), cutquantiles(), or quantmatrix() for xtile as well as pctile.
Cutoffs can be requested via cutquantiles(), cutoffs(), or cutmatrix() for xtile as well as pctile.
The user has control over the behavior of cutpoints() and cutquantiles(). They obey if in with option cutifin, they can be group-specific with option cutby, and they can be de-duplicated via dedup.
Fixes numerical precision issues with pctile, altdef (e.g. see this Statalist thread, which is a very minor thing so Stata and fellow users maintain it's not an issue, but I think it is because Stata/MP gives what I think is the correct answer whereas IC and SE do not).
Fixes a possible issue with the weights implementation in _pctile; see this thread.

Differences from egen

group label options are not supported
weights are supported for internally implemented functions.
New functions: nunique, nmissing, cv, variance, select#, select-#, range
gegen upgrades the type of the target variable if it is not specified by the user. This means that if the sources are double then the output will be double. All sums are double. group creates a long or a double. And so on. egen will default to the system type, which could cause a loss of precision on some functions.
For internally supported functions, you can specify a varlist as the source, not just a single variable. Observations will be pooled by row in that case.
While gegen is much faster for tag, group, and summary stats, most egen functions are not implemented internally, meaning for arbitrary gegen calls this is a wrapper for hashsort and egen.
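
The type upgrade described above can be inspected directly. In this sketch no type is specified for any target, so gegen chooses one; the variable names are arbitrary.

```stata
sysuse auto, clear
gen double dprice = price

gegen mean_d = mean(dprice), by(foreign)   // double source
gegen tot    = total(price), by(foreign)   // a sum
gegen id     = group(foreign rep78)        // group ids

* Storage types chosen by gegen
describe mean_d tot id
```

Per the list above, mean_d and tot should be stored as double and id as long or double.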

Differences from tabstat

Multiple groups are allowed.
Saving the output is done via mata instead of r(). No matrices are saved in r() and option save is not allowed. However, option matasave saves the output and by() info in GstatsOutput (the object can be named via matasave(name)). See mata GstatsOutput.desc() after gstats tab, matasave for details.
GstatsOutput provides helpers for extracting rows, columns, and levels.
Options casewise, longstub are not supported.
Option nototal is on by default; total is planned for a future release.
Option pooled pools the source variables into one.
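
A short sketch of the matasave workflow, using only the options named above; the object name tabout is arbitrary.

```stata
sysuse auto, clear

* Several by() variables are allowed; results go to mata, not r()
gstats tab price mpg, by(foreign rep78) statistics(mean sd p50) matasave
mata GstatsOutput.desc()

* The same, but with a named output object
gstats tab price mpg, by(foreign) statistics(mean sd p50) matasave(tabout)
mata tabout.desc()
```

The desc() call lists what the object contains, which is the quickest way to find the helpers for extracting rows, columns, and levels.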

Differences from summarize, detail

The behavior of summarize and summarize, meanonly can be recovered via options nodetail and meanonly. These two options are mainly for use with by()
Option matasave saves output and by() info in GstatsOutput, a mata class object (the object can be named via matasave(name)). See mata GstatsOutput.desc() after gstats sum, matasave for details.
Option noprint saves the results but omits printing output.
Option tab prints statistics in the style of tabstat
Option pooled pools the source variables and computes summary stats as if it was a single variable.
pweights are allowed.
Largest and smallest observations are weighted.
rolling:, statsby:, and by: are not allowed. To use by pass the option by()
display options are not supported.
Factor and time series variables are not allowed.

Differences from levelsof

It can take a varlist and not just a varname; in that case it prints all unique combinations of the varlist. The user can specify column and row separators.
It can deduplicate an arbitrary number of levels and store the results in a new variable list or replace the old variable list via gen(prefix) and gen(replace), respectively. If the user runs up against the maximum macro variable length, add option nolocal.

Differences from isid

No support for using. The C plugin API does not allow loading a Stata dataset from disk.
Option sort is not available.
It can also check IDs with if and in conditions.

Differences from gsort

hashsort behaves as if mfirst was passed. To recover the default behavior of gsort pass option mlast.
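
For example, rep78 has missing values in the auto dataset, so the two behaviors can be compared side by side:

```stata
sysuse auto, clear

* Default: missing values of rep78 lead a descending sort
hashsort -rep78
list make rep78 in 1/6

* mlast puts them at the end, as gsort does by default
hashsort -rep78, mlast
list make rep78 in 1/6
```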

Differences from duplicates

gduplicates does not sort examples or list by default. This massively enhances performance but it might be harder to read. Pass option sort (sorted) to mimic duplicates behavior and sort the list.

Differences from rangestat

Note that gstats range is an alias for gstats transform that assumes all the stats requested are range statistics. However, it can be called in conjunction with any other transform via (range stat ...). It was not intended to be a replacement of rangestat but it can replicate some of its functionality.
flex_stats (reg, corr, cov) are not allowed (see gregress).
Intervals are of the form interval(low high [keyvar]); if keyvar is missing then it is taken to be the source variable.
Variables are not allowed in place of low or high. Instead they must be #[stat] where # is a number and stat is an optional summary statistic; e.g. interval(-sd 0.5sd x).
Separate interval and interval variables can be specified for each target; e.g. gstats range (mean -3 3) x (mean -2 . time) y ....
All statistics allowed by gstats tab are allowed by gstats range (except nunique or percent).
Options casewise, describe, and local are not allowed.

Hashing and Sorting

There are two key insights to the massive speedups of Gtools:

Hashing the data and sorting a hash is a lot faster than sorting the data to then process it by group. Sorting a hash can be achieved in linear O(N) time, whereas the best general-purpose sorts take O(N log(N)) time. Sorting the groups would then be achievable in O(J log(J)) time (with J groups). Hence the speed improvements are largest when N / J is largest.
Compiled C code is much faster than Stata commands. While it is true that many of Stata's underpinnings are compiled code, several operations are written in ado files without much thought given to optimization. If you're working with tens of thousands of observations you might barely notice (and the difference between 5 seconds and 0.5 seconds might not be particularly important). However, with tens of millions or hundreds of millions of rows, the difference between half a day and an hour can matter quite a lot.
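
A rough way to see both effects on your own machine is to time collapse against gcollapse on simulated data with many observations and few groups. This is only a sketch: the sample size, the number of groups, and the seed are arbitrary, and timings will vary with the Stata version and flavor (see the note on Stata 17 above).

```stata
clear
set seed 42
set obs 1000000
gen long g   = ceil(runiform() * 100)   // J = 100 groups, so N / J is large
gen double x = rnormal()

timer clear

preserve
timer on 1
collapse (mean) x, by(g)
timer off 1
restore

preserve
timer on 2
gcollapse (mean) x, by(g)
timer off 2
restore

timer list   // timer 1: collapse, timer 2: gcollapse
```

Raising the number of groups toward the number of observations shrinks N / J and should narrow the gap.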

Stata Sorting

It should be noted that Stata's sorting mechanism is hard to improve upon because of the overhead involved in sorting. We have implemented a hash-based sorting command, hashsort, which should be faster than Stata's sort for groups, but not necessarily otherwise:

Function	Replaces	Speedup (IC / MP)	Unsupported	Extras
hashsort	sort	2.5 to 4 / .8 to 1.3		Group (hash) sorting
	gsort	2 to 18 / 1 to 6	`mfirst` (see `mlast`)	Sorts are stable

The overhead involves copying the by variables, hashing, sorting the hash, sorting the groups, copying a sort index back to Stata, and having Stata do the final swaps. The plugin runs fast, but the copy overhead plus the Stata swaps often make the function slower than Stata's native sort.

The reason that the other functions are faster is because they don't deal with all that overhead. By contrast, Stata's gsort is not efficient. To sort data, you need to make pair-wise comparisons. For real numbers, this is just a > b. However, a generic comparison function can be written as compare(a, b) > 0. This is true if a is greater than b and false otherwise. To invert the sort order, one need only use compare(b, a) > 0, which is what gtools does internally.

However, Stata creates a variable that is the inverse of the sort variable. This is equivalent, but the overhead makes it slower than hashsort.
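
The two routes can be written out by hand. The first half of this sketch mimics the inverted-variable approach attributed to gsort above (it is an illustration, not gsort's actual code); the second half lets hashsort handle the minus sign.

```stata
sysuse auto, clear

* Inverted-copy approach: build a negated key and sort on it
gen double neg_price = -price
sort neg_price
assert price[_n] >= price[_n + 1] if _n < _N

* hashsort takes the minus sign directly; no extra variable is needed
sysuse auto, clear
hashsort -price
assert price[_n] >= price[_n + 1] if _n < _N
```

Both leave price in descending order; the first pays for creating and storing an extra variable.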

TODO

Planned features:

These are options/features/improvements I would like to add, but I don't have an ETA for them (i.e. they are a wishlist because I am either not sure how to implement them or because writing the code will take a long time). Roughly in order of likelihood:

About

Hi! I'm Mauricio Caceres; I made gtools after some of my Stata jobs were taking literally days to run because of repeat calls to egen, collapse, and similar on data with over 100M rows. Feedback and comments are welcome! I hope you find this package as useful as I do.

Along those lines, here are some other Stata projects I like:

ftools: The main inspiration for gtools. Not as fast, but it has a rich feature set; its mata API in particular is excellent.
reghdfe: The fastest way to run a regression with multiple fixed effects (as far as I know).
ivreghdfe: A combination of ivreg2 and reghdfe.
stata_kernel: A Stata kernel for Jupyter; extremely useful for interacting with Stata.
stata-cowsay: Productivity-boosting cowsay functionality in Stata.

License

Gtools is MIT-licensed. ./lib/spookyhash and ./src/plugin/common/quicksort.c belong to their respective authors and are BSD-licensed. Also see gtools, licenses.

stata-gtools's Issues

Links from help file

Just a minor thing but might be useful:

Sometimes I forget which specific commands are available in gtools, so it would be useful to have such a list within -gtools.sthlp-. That way, anyone who runs h gtools can click on the specific command they are interested in.

segfault with gegen group()

I'm running a fairly complex gegen on a fairly large dataset:

. count
  142,686,929
. sort id4 id3 id5 id1 // sorted but not exactly by what i want to gegen
. gegen double loanid = group(id1 id2 id3 id4 id5 id6), missing verbose

Immediately after that I get a segfault, even when using a server with lots of memory (64gb). This is on Stata 14 MP/6 on Linux.

I would usually try to provide a minimum working example that replicates this error but even when logging or when running with set trace on I can't catch what is going on.

Is it possible that there is not enough memory to copy all the variables and that causes the segfault?

Thanks!
Sergio

Could not load plugin: c:\ado\plus\g\gtools_windows.plugin

Hi guys,

I'm trying to run gtools on a Windows Server 2016 with Stata 15, but I'm getting the following error:

Stata MP 15 6 users output:
gcollapse y, by(x)
Could not load plugin: c:\ado\plus\g\gtools_windows.plugin
(error occurred while loading _gtools_internal.ado)

Any tips on how to solve it?

Thanks a lot for this amazing command!

Using gegen and reg within program returns error 198

To parallelize bootstraps with the ado parallel I have included gegen and regress in a small program which returns the invalid syntax error 198.
When replacing gegen with egen the program runs fine.

clear all
sysuse auto , clear

cap prog drop pargegen
prog define pargegen, rclass
	version 13
	syntax varlist [if]
	marksample touse
	gegen test = sum(price)
	reg `varlist' if `touse'
	drop test
end

pargegen price weight foreign rep78

Setting trace on I get the following error log:
gtools_log.txt

My suspicion is that the local level, which is defined by regress, is somehow altered by gegen. According to the stata manual level(95) is the default option but in line 3883 of the log file 95,0 shows up. Even when I define the regress option , level(95) explicitly, the same error is shown.

gegen total does not parse "exp" correctly

The default Stata total() allows for an expression between the parentheses, which is more convenient than if-conditions (e.g. you want to take the total over certain observations, but list the value in all observations). The help file for gegen suggests this is also possible with gtools, but instead I get
sum(varlist) must call a numeric variable list.

when asking

bysort iri_key: gegen firstYearRev = total(dollar * inFirstYear)

`gegen group` misc issues

A few issues with gegen:

It doesn't seem to respect the order of the data. EG:

sysuse auto, clear
egen id1 = group(turn trunk)
fegen id2 = group(turn trunk)
gegen id3 = group(turn trunk)

assert id1==id2
assert id1==id3

The last command fails as the IDs are not the same

It doesn't warn for incorrect options:

gegen id = group(turn) , wrongoption

Should give a warning that an incorrect option was used.

Error when wild targets are existing variables

When wild is specified, the user cannot use existing variable names.

. sysuse auto, clear
(1978 Automobile Data)

. gcollapse mpg = price, wild
mpg already defined
r(110);

Variable formats are not maintained with `gcollapse`

sysuse auto, clear
collapse price, by(foreign)
display "`:format price'"

The formats are maintained. However

sysuse auto, clear
gcollapse price, by(foreign)
display "`:format price'"

collapse preserves the format but gcollapse does not.

gegen tag treatment of missing

sysuse auto, clear
egen t1 = tag(rep)
gegen t2 = tag(rep)
li rep t? if t1!=t2
assert t1==t2

The assertion fails because gegen tags missing values of rep, while egen doesn't.

From the help files: "The result will be 1 if the observation is tagged and never missing, and 0 otherwise", so I think MVs should have a tag of zero unless the missing option is included.
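The behavior requested above can be sketched outside Stata. The following Python illustration is not gtools or egen code; the function name `tag_first` and the use of `None` for Stata's missing value are assumptions made for the example. It tags the first occurrence of each value and leaves missing values untagged unless asked otherwise.

```python
# Illustrative sketch only: None stands in for Stata's missing value.
_MISSING = object()  # sentinel so missing values form their own group

def tag_first(values, include_missing=False):
    """Return 1 for the first occurrence of each distinct value, else 0.

    Missing values are never tagged unless include_missing is True.
    """
    seen = set()
    tags = []
    for v in values:
        if v is None and not include_missing:
            tags.append(0)
            continue
        key = _MISSING if v is None else v
        tags.append(0 if key in seen else 1)
        seen.add(key)
    return tags

rep = [3, 3, None, 4, None, 3, 5]
print(tag_first(rep))                        # [1, 0, 0, 1, 0, 0, 1]
print(tag_first(rep, include_missing=True))  # [1, 0, 1, 1, 0, 0, 1]
```

The first call mirrors what the issue expects by default; the second mirrors what a missing option would do.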

gtools does not group strL variables correctly

Hi, when running gcollapse by groups where one of the by variables is of type strL, the results seem to be wrong. Reproduced as follows:


clear *
set obs 4
g strL name = strofreal(round(_n,2)) 
g id = round(_n,2)
g value = _n
gcollapse (max) value = value , by(name id)

a bug in gcollapse

Hi, it seems there is a bug. Please refer to the following:

clear *
set obs 100
gen aaaxp=1
gen aaa=1
gcollapse (mean) aaaxp=aaaxp (max) aaax=aaa //error: variable aaaxp not found

clear *
set obs 100
gen aaaxp=1
gen aaa=1
collapse (mean) aaaxp=aaaxp (max) aaax=aaa //no error

issue specifying data type in gegen group

Hi, in the following code, egen group and gegen group give different results. egen group seems to automatically resort to a larger data type when the number of groups exceeds the limit of the specified data type.

clear
set obs 1000
g x = _n
egen byte group = group(x)
gegen byte group2 = group(x)
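The type upgrade described above can be sketched as a lookup over the largest nonmissing value each Stata integer type holds (100 for byte, 32740 for int, 2147483620 for long). The Python below is only an illustration of that idea; `smallest_type_for` is a hypothetical helper, not part of egen or gegen.

```python
# Largest nonmissing value each Stata integer storage type can hold.
STATA_INT_MAX = [("byte", 100), ("int", 32740), ("long", 2147483620)]

def smallest_type_for(n_groups, requested="byte"):
    """Return the requested type if it can hold n_groups, else the next
    larger integer type that can; fall back to double if none fits."""
    names = [name for name, _ in STATA_INT_MAX]
    start = names.index(requested)
    for name, limit in STATA_INT_MAX[start:]:
        if n_groups <= limit:
            return name
    return "double"

print(smallest_type_for(1000, requested="byte"))  # int
```

With 1,000 groups, as in the example above, a requested byte would have to become an int.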

Typo in gunique docs

https://gtools.readthedocs.io/en/latest/usage/gunique/index.html#stored-results
r(unique), not r(nunique) yields the number of groups. In the Stata help file this is correct.

The plugin does not work on OSX

Since I do not have a Mac and I have been unable to install OSX on a virtual machine, I have not been able to compile an OSX version of the plugin. This is currently on hold. Macs are expensive.

If you happen to have a Mac and are willing to contribute to gtools, please comment here and I will give you instructions on how to compile.

invalid observation number

When trying to use the -gcollapse- command I came across the following error repeatedly: the program would fail in the middle of execution, citing the following error:

'1.149e+08' invalid observation number

I attach the relevant part of the trace output from Stata below. Note that "1.149e+08" is the number of observations in my dataset, but I suspect that somehow this gets passed in the wrong format, causing this error:

    ----------------------------------------------------------------------------------- end gcollapse.gtools_timer ---
    - }
    - }
    - if ( "`merge'" == "" ) {
    = if ( "" == "" ) {
    - qui {
    - if ( `=scalar(__gtools_J) > 0' ) keep in 1 / `:di scalar(__gtools_J)'
    = if ( 1 ) keep in 1 / 1.149e+08
'1.149e+08' invalid observation number
      else if ( `=scalar(__gtools_J) == 0' ) drop if 1
      else if ( `=scalar(__gtools_J) < 0' ) {
      di as err "The plugin returned a negative number of groups."
      di as err `"This is a bug. Please report to {browse "`website_url'":`website_disp'}"'
      }
      ds *
      }
      if ( `=_N' == 0 ) di as txt "(no observations)"
      local memvars `r(varlist)'
      local keepvars `by' `gtools_targets'
      local dropme `:list memvars - keepvars'
      if ( "`dropme'" != "" ) mata: st_dropvar((`:di subinstr(`""`dropme'""', " ", `"", ""', .)'))
      if ( (`=_N > 0') & (`=scalar(__gtools_k_extra)' > 0) & ( `used_io' | ("`forceio'" == "forceio") ) ) {
      qui mata: st_addvar(__gtools_addtypes, __gtools_addvars, 1)
      gtools_timer info 97 `"Added extra targets after collapse"', prints(`benchmark')
      local __gtools_iovars: list gtools_targets - gtools_uniq_vars
      mata: __gtools_iovars = (`:di subinstr(`""`__gtools_iovars'""', " ", `"", ""', .)')
      if ( `debug_io_read_method' == 0 ) {
      cap `noi' `plugin_call' `__gtools_iovars', collapse read `"`__gtools_file'"'
      if ( _rc != 0 ) exit _rc
      }
      else {
      local nrow = `=scalar(__gtools_J)'
      local ncol = `=scalar(__gtools_k_extra)'
      mata: __gtools_data = gtools_get_collapsed (`"`__gtools_file'"', `nrow', `ncol')
      mata: st_store(., __gtools_iovars, __gtools_data)
      }
      gtools_timer info 97 `"Read extra targets from disk"', prints(`benchmark')
      }
      local order = 0
      qui ds *
      local varorder `r(varlist)'
      local varsort `by' `gtools_targets'
      foreach varo in `varorder' {
      gettoken svar varsort: varsort
      if ("`varo'" != "`vars'") local order = 1
      }
      if ( `order' ) order `by' `gtools_targets'
      forvalues k = 1 / `:list sizeof gtools_targets' {
      mata: st_varlabel("`:word `k' of `gtools_targets''", __gtools_labels[`k'])
      }
      if ( "`unsorted'" == "" ) sort `by'
      }
    -------------------------------------------------------------------------------------------------- end gcollapse ---
r(198); t=103.95 11:28:02

end of do-file
r(198); t=7348.34 11:28:02
r(198); t=7348.34 11:28:02

end of do-file
r(198); t=0.00 11:28:02
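The trace above fails at keep in 1 / 1.149e+08: the group count was rendered in a general numeric format, and scientific notation is not a valid observation number. The Python sketch below illustrates that failure mode in general terms only. It is not gtools code, the count 114900000 is a hypothetical stand-in for the real number of groups, and `parse_observation_number` is an invented helper that mimics a parser accepting plain integers only.

```python
n_groups = 114_900_000  # hypothetical group count, for illustration

def parse_observation_number(text):
    """Accept only plain non-negative integers, as an observation range does."""
    if not text.isdigit():
        raise ValueError(f"'{text}' invalid observation number")
    return int(text)

general = "%.4g" % n_groups  # general format with 4 significant digits
integer = "%d" % n_groups    # integer format keeps every digit

print(general)  # 1.149e+08
print(integer)  # 114900000
print(parse_observation_number(integer))  # 114900000
```

Passing `general` to the parser raises an error, while the integer rendering parses cleanly, which suggests formatting the count as an integer before it is spliced into the command.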

-gcollapse- recasts ints to doubles

Native Stata -collapse- does this too

sysuse auto, clear
des rep
collapse (sum) rep, by(foreign)
des rep
compress

-fcollapse- keeps integers as is

sysuse auto, clear
des rep
fcollapse (sum) rep, by(foreign)
des rep

-gcollapse- replicates the behavior of the native Stata -collapse-

sysuse auto, clear
des rep
gcollapse (sum) rep, by(foreign)
des rep
compress

On large datasets, the extra time taken by -compress- at the end to convert the variables back to int may reduce the speed gains of -gcollapse- vs -fcollapse-. Is there any way to replicate the behavior of -fcollapse- in this setting?

gcollapse: Support for rawsum / selective weights in general

Are there plans to support gcollapse (rawsum), or more generally to allow certain operations to be weighted whilst others are not? I frequently run into situations where I'd like to calculate a weighted mean but an unweighted sum, which at the moment means I'm limited to using collapse in those situations (or have to spend extra time coding workarounds).

By the way, thanks for this great suite of tools, I use it regularly and it has made significant improvements to the runtime of some of my programs.

is it possible to place missing values last when doing a descending hashsort?

Currently the default seems to place missing values first.

Error with more than 2^31-1 observations

A bug in Stata causes gtools to exit with error if the user has more than 2^31-1 observations in memory. See this bug report.

I contacted StataCorp about it and they replied:

The SPI can work with datasets containing up to 2^31-1 observations. Our development group is looking into modifying future versions of the SPI to allow more observations.

Extended missing values are not preserved

Min, max, first, last, firstnm, lastnm all preserve Stata's extended missing values. However, gcollapse treats them all as missing.

. sysuse auto, clear
(1978 Automobile Data)

. replace price = .a
(74 real changes made, 74 to missing)

. gcollapse (first) price, by(foreign)

. l

     +------------------+
     |  foreign   price |
     |------------------|
  1. | Domestic       . |
  2. |  Foreign       . |
     +------------------+

However, collapse gives

     +------------------+
     |  foreign   price |
     |------------------|
  1. | Domestic      .a |
  2. |  Foreign      .a |
     +------------------+

Further, extended values are not correctly parsed by glevelsof. Consider:

clear
set obs 5
gen x = _n
replace x = .  in 2
replace x = .a in 3
replace x = .b in 4
glevelsof x

While "." is excluded, both ".a" and ".b" appear via their internal representation (rather than ".a" and ".b").
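To show what "internal representation" means here, the sketch below maps a double back to a missing-value label. It assumes the layout given in Stata's documentation of numeric storage, where "." is 2^1023 and each of ".a" through ".z" adds a further 2^1011. That layout is an assumption worth checking against the manual, and `missing_label` is a hypothetical helper, not a gtools function.

```python
BASE = 2.0 ** 1023  # assumed value of "." stored as a double
STEP = 2.0 ** 1011  # assumed spacing between consecutive missing codes

def missing_label(x):
    """Return ".", ".a", ..., ".z" for a missing code, or None otherwise."""
    if x < BASE:
        return None  # ordinary nonmissing number
    k = round((x - BASE) / STEP)
    if k == 0:
        return "."
    if 1 <= k <= 26:
        return "." + chr(ord("a") + k - 1)
    raise ValueError("not a recognized missing-value code")

print(missing_label(BASE + STEP))      # .a
print(missing_label(BASE + 2 * STEP))  # .b
```

A levels listing that prints the raw doubles instead of these labels would show very large numbers in place of ".a" and ".b".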

Issue calculating medians in gegen pctile and gcollapse w/ small groups

Medians appear to be calculated wrong in at least this simple case:

clear all
set obs 3
gen x = 1
replace x = 3 in 2
replace x = 5 in 3
egen med_egen = pctile(x), p(50)
gegen med_gegen = pctile(x), p(50)
list

The same problem also emerges in gcollapse:

clear all
set obs 3
gen x = 1
replace x = 3 in 2
replace x = 5 in 3
tempfile t1
save `t1'
collapse (p50) x
list
use `t1', clear
gcollapse (p50) x
list
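As a reference point for the expected answer, here is a Python sketch of the default percentile rule documented for Stata's pctile (without altdef): with P = N*p/100, average the P-th and (P+1)-th ordered values when P is an integer, otherwise take the value at position ceil(P). The function name `default_pctile` is made up for this illustration, and it ignores weights and missing values.

```python
import math

def default_pctile(values, p):
    """Percentile under the default (non-altdef) rule, unweighted."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(values)
    pos = len(xs) * p / 100
    if pos == int(pos):
        i = int(pos)                   # 1-based position P
        return (xs[i - 1] + xs[i]) / 2  # average of x[P] and x[P+1]
    return xs[math.ceil(pos) - 1]      # value at position ceil(P)

print(default_pctile([1, 3, 5], 50))     # 3
print(default_pctile([1, 3, 5, 7], 50))  # 4.0
```

For the three-observation example above, the median under this rule is 3.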

"gtools, dependencies" not working with spaces in filepath

Hi!

I noticed that "gtools, dependencies" threw an error at me for not finding the spookyhash.dll in my ado folder despite it being there.

After inspection of gtools.ado I found that the issue was caused by not enclosing the filepaths in additional quotation marks. If your ado-path contains spaces (like in my case) then this will cause problems. I fixed it by enclosing the filepaths on lines 25, 26 and 65 in quotes.

No idea if that solution is Windows-specific, but maybe it's worth including in the next update.

Thanks a lot for writing this great plugin, this is extremely useful!

nansum option in gcollapse?

In some large datasets I often want to do collapse (sum) var1-var1000 ..., but if a variable is always missing within a group, keep the sum as missing instead of zero:

clear
input byte(id x)
1 .
1 .
2 4
2 6
end

preserve
collapse (sum) x, by(id) fast
list
restore

preserve
fcollapse (nansum) x, by(id) fast
list
restore

Is there a way to do something like this with gcollapse? Other statistics (mean, etc.) already do this, it's only sum that causes problems.

Thanks!
Sergio
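The difference between the two sum conventions in this request can be sketched in a few lines of Python. This is an illustration only, with `None` standing in for Stata's missing value and `sum_by_group` a hypothetical helper; it does not describe how gcollapse or fcollapse are implemented.

```python
def sum_by_group(rows, nansum=False):
    """Sum x within each id. With nansum=True, a group whose values are
    all missing gets a missing sum instead of 0."""
    totals, counts = {}, {}
    for key, x in rows:
        totals.setdefault(key, 0)
        counts.setdefault(key, 0)
        if x is not None:
            totals[key] += x
            counts[key] += 1
    if nansum:
        return {k: (totals[k] if counts[k] else None) for k in totals}
    return totals

rows = [(1, None), (1, None), (2, 4), (2, 6)]  # the data from the example
print(sum_by_group(rows))               # {1: 0, 2: 10}
print(sum_by_group(rows, nansum=True))  # {1: None, 2: 10}
```

Tracking the count of nonmissing values per group is the only extra state the nansum variant needs.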

Is there an equivalent to `contract`?

When writing ftools, I realized that I could have contract for free if I just stopped at the middle of the collapse code and returned the counts of each level. Thus, the freq option allows you to get the same results as contract.

Is there something like that with gtools? (Running collapse (count) is not ideal because it requires an additional variable, it requires it to be nonmissing, etc.)
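What contract produces is a frequency per combination of levels, which needs no source variable beyond the by variables themselves. A minimal Python sketch of that output follows; the helper name `contract` and the sample rows are invented for illustration.

```python
from collections import Counter

def contract(rows):
    """Return (levels, frequency) pairs, sorted by the levels."""
    return sorted(Counter(rows).items())

# Hypothetical (foreign, rep78) pairs
rows = [(0, 3), (0, 3), (1, 4), (0, 5), (1, 4), (1, 4)]
for levels, freq in contract(rows):
    print(levels, freq)
```

This is the same grouping step a collapse performs, stopped once the group sizes are known.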

Extended missing values are grouped with missing values when all by variables are byte, int, or long

clear
set obs 5
gen y = 1
gen x = _n
replace x = .  in 2
replace x = .a in 3
replace x = .b in 4

gcollapse (sum) y, by(x)

Works just fine. However,

clear
set obs 5
gen y = 1
gen long x = _n
replace x = .  in 2
replace x = .a in 3
replace x = .b in 4

gcollapse (sum) y, by(x)

This time gcollapse treats ".", ".a", and ".b" as the same group. This happens throughout gtools.

WIN: Instant crash of Stata

When running a gcollapse or gegen command, Stata quits immediately with a Windows error message. The original report is in German (field names translated below), but maybe it helps anyhow.
Problem Event Name: APPCRASH
Application Name: StataMP-64.exe
Application Version: 521.13.1.222
Application Timestamp: 5567795b
Fault Module Name: env_set_windows.plugin
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 5946e14a
Exception Code: c0000005
Exception Offset: 0000000000001505
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1031
Additional Information 1: 86b5
Additional Information 2: 86b57a8af37a99527b6b1746d2e86099
Additional Information 3: 41fb
Additional Information 4: 41fb44b28c1de6cadb0770927f4e2baf

`gcontract` does not allow abbreviations?

It seems I can't abbreviate variables in gcontract:

. sysuse auto
(1978 Automobile Data)

. gcontract head
Malformed call: 'head'
Syntax: [+|-]varname [[+|-]varname ...]
r(111);

. contract head

Gtools commands are limited by matsize

The maximum number of variables that can be passed to gtools commands is the maximum value of matsize in the user's system.

disp c(matsize)

Will show the user how many variables can be used with gtools. This limitation is due to the way Stata's C API is designed. I use matrices to pass various necessary information to and from C, so this limitation will almost certainly not change unless the Stata C API changes.

So, for example, the following fails in Stata/IC:

. clear

. set obs 10
obs was 0, now 10

. forvalues i = 1/800 {
  2.     gen x`i' = _n
  3. }
 
. * This is fine

. gisid x*

. gen x801 = 10
 
. * But now this fails

. gisid x*

# variables > matsize (801 > 800). Tried to run

    set matsize 801

but the command failed. Try setting matsize manually.
r(908);

Bug in `gisid`

Example code:

clear
input long id1 int id2
1225800 179
1226197 162
1245415 167
1245415 204
1249196 158
1246805 226
1247361 189
1248872 203
1249196 158
end

which gtools
gisid id1 id2
isid id1 id2

I get this output:

 which gtools
/fire/home/m1sac03/ado/plus/g/gtools.ado
*! version 1.0.5 20Sep2018 Mauricio Caceres Bravo, [email protected]
*! Program for managing the gtools package installation

. gisid id1 id2

. isid id1 id2
variables id1 id2 do not uniquely identify the observations
r(459);

I might be doing something wrong, but it's probably a bug. Also, running on Stata 14.2/Linux
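For reference, the nine rows in the example can be checked independently of Stata. The short Python sketch below (`duplicated_keys` is a hypothetical helper) lists any key that occurs more than once.

```python
from collections import Counter

# The (id1, id2) pairs from the example above
rows = [
    (1225800, 179), (1226197, 162), (1245415, 167),
    (1245415, 204), (1249196, 158), (1246805, 226),
    (1247361, 189), (1248872, 203), (1249196, 158),
]

def duplicated_keys(rows):
    """Return the keys that occur more than once."""
    return [key for key, n in Counter(rows).items() if n > 1]

print(duplicated_keys(rows))  # [(1249196, 158)]
```

The pair (1249196, 158) appears in rows 5 and 9, so isid is right to report that the variables do not uniquely identify the observations.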

Error: Plugin not found

I get the following error when trying to use gtools' -gcollapse- command:
file env_set_macosx.plugin not found
(error occurred while loading gcollapse.ado)
r(601);

Simple example re-producing the error:
. ado uninstall gtools

package gtools from https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/build
^

(package uninstalled)

. net install gtools, from(https://raw.githubusercontent.com/mcaceresb/stata-gtools/master/build/)
checking gtools consistency and verifying not already installed...
installing into /Users/USER/Library/Application Support/Stata/ado/plus/...
installation complete.

. sysuse auto
(1978 Automobile Data)

. gcollapse (mean) price, by(make)
file env_set_macosx.plugin not found
(error occurred while loading gcollapse.ado)
r(601);

Collapsing strings

Just a quick question: is there a reason why we can't use strings as targets in collapse?

EG:

sysuse auto
gcollapse (first) make

Returns:

. sysuse auto
(1978 Automobile Data)

. gcollapse (first) make
Source make must be numeric.
r(198);

In principle, the targets of some statistics (first, last, firstnm, lastnm) could be strings.

(possible suggestion) Speed up gisid if the dataset is already sorted

If the dataset is already sorted by the variables (as it often is), we can take advantage of this.

For instance, in the case below, generating ok, running assert, and dropping it was 3x faster (0.8s) than running gisid (2.5s). The other alternatives were much slower, of course (12s for fisid and 1m for isid):

cls
clear
set obs 10000000
gen long id = (_n / 10)
bys id: gen double t = _n * 1000
sort id t

set rmsg on

*isid id t
fisid id t
gisid id t

by id t: gen byte ok = _n == 1
assert ok
drop ok

A quick way to implement this in an ado would be to add something like this at the beginning of the ado:

if ("`: sortedby'" == "`varlist'") {
	tempvar ok
	by `varlist': gen byte `ok' = _n == 1
	cou if !`ok'
	drop `ok'
	if (r(N)) {
		di "not unique"
		exit 459
	}
}

That said, I'm not entirely sure it will be widely useful. If t is instead a byte variable, most of the advantage is gone. Also, it would need to support if/in.
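The idea behind the suggestion is that in sorted data any duplicate keys sit next to each other, so one pass comparing neighbors is enough. Here is a Python sketch of that shortcut. The function name and the `already_sorted` flag are assumptions for illustration; the flag plays the role of the sort order Stata records for the dataset, and the fallback uses a set rather than anything gisid actually does.

```python
def uniquely_identifies(rows, already_sorted=False):
    """Return True if no key in rows is repeated.

    When the caller knows rows are sorted by the key, only adjacent
    rows are compared; otherwise fall back to a set-based check.
    """
    rows = list(rows)
    if already_sorted:
        return all(a != b for a, b in zip(rows, rows[1:]))
    return len(set(rows)) == len(rows)

sorted_rows = [(1, 1000.0), (1, 2000.0), (2, 1000.0), (2, 2000.0)]
print(uniquely_identifies(sorted_rows, already_sorted=True))        # True
print(uniquely_identifies(sorted_rows + [(2, 2000.0)],
                          already_sorted=True))                     # False
```

The adjacent comparison is only valid when the sort order really matches the key, which is why the flag must come from trusted metadata.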

variable X already defined

I discovered that -gcollapse- produces an error when collapsing a variable into a target name that is identical to a variable name that exists pre-collapse. It is easy enough to circumvent, but might be fixed without much trouble? This caused me a bit of a headache trying to troubleshoot, since that behavior of -gcollapse- was not obvious to me.

See minimal working example as follows:

use -gcollapse- with error:
. sysuse auto, clear
(1978 Automobile Data)

. gen price2=price

. gcollapse price=price2, by(make)
variable price already defined
r(110);

use -collapse- without error:
. sysuse auto, clear
(1978 Automobile Data)

. gen price2=price

. collapse price=price2, by(make)

Running `gtools, dependencies` on Linux has an unquoted -cd- command

This line causes an error if the current path has spaces.

Possible issues with equal signs in locals

This mostly applies for older versions of Stata (e.g. Stata 12), but in general it's risky to have lines such as

local xyz = substr(..)

Because the equal sign truncates the local. A good explanation of why is here: http://www.stata-journal.com/sjpdf.html?articlenum=pr0045

Gcontract indicating that variable was not found

What would you like gtools to add or change (and why)?
When using -gcontract- with a variable which is not defined, the error is uninformative:

. sysuse auto, clear
(1978 Automobile Data)

. gcontract color
Malformed call: 'color'
Syntax: [+|-]varname [[+|-]varname ...]
r(111);

A specific message would be more useful.

Please include a specific suggestion
Return Stata's typical variable color not found

-gegen- only provides support for by-able egen functions

I observe the following behavior of gegen on WIN 7 and Linux:


. sysuse auto
(1978 Automobile Data)

. gegen mean=mean(mpg)
-gegen- only provides support for by-able egen functions
r(198);

. by foreign: gegen mean=mean(mpg)
-gegen- only provides support for by-able egen functions
r(198);

This also happens with other gegen functions.

`gisid` shows debugging message

I get the message "Plugin step 4: ..." even if verbose options are off.

Type limits incorrectly parsed with weights.

The following need to be typed as double when weights are involved:

total
sum
nansum
rawsum
rawnansum
count
nmissing

file gtools_windows_legacy.plugin not found

Hi Mauricio,

I just installed the gtools package on Stata 14.1 running on Win Server 2012 R2 but I got the following error which seems to be the same reported here #15.

. gcollapse (mean) price, by(foreign)
file gtools_windows_legacy.plugin not found
(error occurred while loading gcollapse.ado)

. which gtools
c:\ado\plus\g\gtools.ado
*! version 0.6.16 13Sep2017 Mauricio Caceres Bravo, [email protected]
*! Program for managing the gtools package installation

Do you have any hint?

Many thanks,
Federico

Why does gcollapse fail to load multi-threaded version; using fallback

When I run

sysuse auto
gcollapse mpg, by(foreign) verbose

The collapse works but I'm told that
note: failed to load multi-threaded version; using fallback, even though I'm running it with Stata-MP on multicore Linux (and WIN 7, but this is pointed out in TODO). Not sure if this is a bug or a lack of understanding on my side.

file gtools_windows_legacy.plugin not found

Hi Mauricio,

I just updated gtools and got an error that left me a bit confused (nothing else has changed on my system):

. gcollapse (sum) price, by(turn)
file gtools_windows_legacy.plugin not found

I'm running Windows 10 and Stata 14.2. The string reported by which gtools is this:

*! version 0.6.16 13Sep2017 Mauricio Caceres Bravo, [email protected]

`gcollapse` fails with a large number of targets or by variables

gcollapse will give an error when there are too many by variables or targets. The number of targets and by variables are limited by matsize:

clear
set matsize 100
set obs 10
forvalues i = 1/101 {
    gen x`i' = 10
}
gen zz = runiform()
gcollapse zz, by(x*)
gcollapse x*, by(zz)

Both commands above will fail with error code 908. However, there is a point where increasing matsize will not help with the number of targets:

clear
set matsize 400
set obs 10
forvalues i = 1/300 {
    gen x`i' = 10
}
gen zz = runiform()
gcollapse zz, by(x*)
gcollapse x*, by(zz)

The first command will succeed but the second will fail with error code 3000 (too many tokens). This is a problem with lines 253-255, 314, 385, 487, 514, 570, 579, and 621 using the regular subinstr function rather than the extended macro function :subinstr. A previous commit had switched to using :subinstr for all locals, but these lines use the function to create a mata object.

NOTE: The matsize problem may be a very fundamental limitation. Make sure to create a warning if it cannot be bypassed.

Gtools clears some timers without warning

Describe the bug

Gtools uses timers 97 to 99 for benchmarks. If the user or another program opened those timers they will be cleared by gtools.

Code Sample

timer clear
timer on 99
timer on 98
timer on 97
timer list
clear
set obs 1
gen x = 1
gcollapse *
timer list

The second timer list prints nothing.

Version info

OS: All
Version: 1.1.2

bug in gcollapse?

Please refer to the attached pictures (sorry, I have difficulty reproducing the data). In my data there is a variable "oobbu" whose maximum value is 3231, as indicated in the picture. After gcollapse (firstnm), the maximum of "oobbu" becomes 2.14e+09, which is wrong.

This issue seems to be solved if I save the data in memory first, followed by "use XXX, clear" , and then do the gcollapse again (the second picture).

Improvements and additions to gtools

A lengthy discussion on improvements and additions to gtools started in issue #28, but it is more appropriate to have a separate thread for it. The main idea currently being discussed is a gtools API, which would consist of various wrappers to the core functionality of gtools.

I am not sure the Stata portion of the API will be as useful as the ftools analogue due to the way in which the Stata Plugin Interface works (which is what I have to use to interact with Stata via C). However, it might be useful in ways I have not considered, hence this thread. I am also thinking of creating a C library based on this plugin, which would be useful for people who aim to write C plugins in the future.

Feel free to make any suggestions or comments on what you would like to see in a gtools API here, as well as any other features and suggestions that you don't think merit their own thread. This issue will remain open past version 1.0, since an API won't make it to that release.

Sort group variables internally in C

A major enhancement to the plugin would be to sort the groups internally in C. This would afford a major speed-up when processing a large number of groups (vs sorting in Stata) and allow gegen to be used as an adequate replacement for egen.

In particular, one issue raised in #4 is that gegen does not produce IDs in the order that the groups would be sorted. For instance,

sysuse auto, clear
egen id1 = group(turn trunk)
fegen id2 = group(turn trunk)
gegen id3 = group(turn trunk)

assert id1==id2
assert id1==id3

Instead, gegen produces IDs in the order the groups appear. While this is by design, it is not the behavior of egen. Sorting groups internally would allow solving this issue as well.
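The two numbering conventions can be contrasted with a small Python sketch. It is an illustration only: both helper names and the sample (turn, trunk) pairs are made up, and neither function reflects the actual C implementation.

```python
def ids_by_appearance(keys):
    """Number groups in the order they are first seen."""
    ids = {}
    return [ids.setdefault(k, len(ids) + 1) for k in keys]

def ids_by_sorted_order(keys):
    """Number groups by the rank of their key in sorted order."""
    rank = {k: i for i, k in enumerate(sorted(set(keys)), start=1)}
    return [rank[k] for k in keys]

keys = [(40, 11), (33, 16), (40, 11), (31, 9)]  # hypothetical pairs
print(ids_by_appearance(keys))    # [1, 2, 1, 3]
print(ids_by_sorted_order(keys))  # [3, 2, 3, 1]
```

Both assign one ID per group, so they agree on group membership and differ only in the labels, which is why an assert on equal IDs fails.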

List possible gotchas

It might be good to have a comprehensive list of the things that could go wrong, so users know where to look. For instance:

SpookyHash 128 has a low collision chance if the inputs are uniformly distributed. But if the inputs follow patterns, that might not be true. For instance, this repo mentions that Spooky128 is "weak (collisions with 4bit diff)". Thus, even if the chance of a collision is low for random inputs, it might not be so for certain patterns. Moreover, even if the chance of any single user hitting a collision is low, the chance that some user among many encounters one might be higher.
Are there any overflow risks remaining? I saw some code about detecting whether an int. can overflow to double, but usually numerical software has a lot of checks for corner cases (e.g. if there are negative numbers).
Are there numerical precision issues? For instance, Mata does sum() computations as quads, and I think other software might also do so.
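To put a number on the first point, the usual birthday-style bound for an ideal hash with uniformly distributed outputs is n(n-1)/2 divided by 2^bits. The Python sketch below computes it. It says nothing about patterned inputs, which is exactly the caveat raised above, and `collision_bound` is a hypothetical helper.

```python
def collision_bound(n_keys, hash_bits=128):
    """Upper bound on the probability of any collision among n_keys
    distinct inputs, assuming an ideal uniformly distributed hash."""
    pairs = n_keys * (n_keys - 1) // 2  # number of distinct pairs
    return pairs / 2 ** hash_bits

# One billion distinct groups hashed to 128 bits
print(collision_bound(10 ** 9) < 1e-20)  # True
```

Multiplying this bound by the number of independent datasets processed gives a rough bound across many users, again only under the uniformity assumption.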

(develop branch) gtools winsor gives incorrect results on missing values

When winsorizing a variable with missing values, gstats winsor fills in those missing values instead of leaving them empty.

clear
set obs 10
gen y = _n if !inrange(_n, 4, 6)
winsor2 y, cut(1 99) suffix(_w)
gstats winsor y, cut(10 90) suffix(_g)
li

Results:

     +----------------+
     |  y   y_w   y_g |
     |----------------|
  1. |  1     1     1 |
  2. |  2     2     2 |
  3. |  3     3     3 |
  4. |  .     .    10 |
  5. |  .     .    10 |
     |----------------|
  6. |  .     .    10 |
  7. |  7     7     7 |
  8. |  8     8     8 |
  9. |  9     9     9 |
 10. | 10    10    10 |
     +----------------+

Happy holidays!
-Sergio
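The expected behavior, where missing values pass through untouched, can be sketched as follows. This Python illustration takes the lower and upper cutoffs as given rather than computing them from percentiles; `winsorize` is a hypothetical helper and `None` stands in for Stata's missing value.

```python
def winsorize(values, low, high):
    """Clip nonmissing values to [low, high]; leave missing values as is."""
    return [None if v is None else min(max(v, low), high) for v in values]

y = [1, 2, 3, None, None, None, 7, 8, 9, 10]
print(winsorize(y, low=2, high=9))
# [2, 2, 3, None, None, None, 7, 8, 9, 9]
```

Only the nonmissing values should enter the percentile computation and the clipping step.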

a bug in hashsort?

Thanks for the great work on gtools. It seems to me there may be a bug in hashsort. Please refer to the code below:

clear *
set obs 100
gen id1=round(_n,5)
gen id2=round(_n,15)

hashsort id1 -id2
by id1: gen value=_n //caused error

gsort id1 -id2
by id1: gen value=_n //caused no error