Comments (2)
@kylebarron I haven't run any benchmarks, as I was focused more on correctness. However, I believe it is reasonably fast, and I routinely use it to process hundreds of GB of data.
I have been involved with maintaining the pandas readers for SAS and Stata (although I was not the primary author of either). I have used the SAS reader much more than the Stata reader, and the SAS reader here (in Go) is clearly much faster than the one in Python/pandas. The Stata dta file format is more amenable to vectorized processing, which makes a big difference in Python, so the advantage of using Go might be smaller for Stata files than for SAS files (SAS7BDAT is not at all friendly to vectorization).
The column-oriented data structure used here is modeled on Bcolz (https://github.com/Blosc/bcolz), though not interchangeable with it. I believe this is much simpler than Parquet. In any case, I evolved this into a different columnar container called Dstream (https://github.com/kshedden/dstream) which I actively develop and maintain. The container here is mainly for the internal use of the Stata and SAS readers in this package.
Regarding concurrency, the readers work through the SAS/Stata files in chunks, and each chunk has its own backing memory, so you can process one chunk while reading the next. The reading itself is not concurrent (I don't think doing this would help as it is IO-bound).
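To illustrate the chunk-at-a-time pattern described above, here is a minimal sketch in Go. It does not use the datareader API: the `chunk` type and the `readChunks` and `sumChunks` functions are hypothetical names invented for this example, and an in-memory slice stands in for the file. It assumes a positive chunk size. The point is only that, because each chunk owns its backing memory, a consumer can work on one chunk while a single sequential reader fetches the next.

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for one block of rows. Each chunk
// owns its backing slice, so it stays valid after the next read.
type chunk struct {
	values []float64
}

// readChunks simulates a sequential (non-concurrent) reader. It emits
// the data in chunks of at most size values, allocating fresh memory
// for every chunk. size is assumed to be positive.
func readChunks(data []float64, size int) <-chan chunk {
	out := make(chan chunk, 1) // buffer of 1: read one chunk ahead
	go func() {
		defer close(out)
		for start := 0; start < len(data); start += size {
			end := start + size
			if end > len(data) {
				end = len(data)
			}
			buf := make([]float64, end-start) // this chunk's own memory
			copy(buf, data[start:end])
			out <- chunk{values: buf}
		}
	}()
	return out
}

// sumChunks processes chunks as they arrive, while the producer
// goroutine may already be reading the next one. It returns the sum
// of all values and the number of chunks seen.
func sumChunks(chunks <-chan chunk) (total float64, n int) {
	for c := range chunks {
		for _, v := range c.values {
			total += v
		}
		n++
	}
	return total, n
}

func main() {
	data := []float64{1, 2, 3, 4, 5, 6, 7}
	total, n := sumChunks(readChunks(data, 3))
	fmt.Println(total, n)
}
```

With a real file, the producer goroutine would call the reader's chunk-reading method instead of slicing `data`. The reading stays single-threaded, and only the processing overlaps with it.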
Thanks for all that information!