Comments (2)

kshedden commented on July 3, 2024

@kylebarron I haven't run any benchmarks, as I was focused more on correctness. However, I believe it is reasonably fast, and I routinely use it to process hundreds of GBs of data.

I have been involved with maintaining the pandas readers for SAS and Stata (although I was not the primary author for either). I have used the SAS reader much more than the Stata reader, and it is clearly much faster here (in Go) than it is in Python/Pandas. The Stata dta file format is more amenable to vectorized processing, which makes a big difference in Python, so the advantage of using Go might be less for Stata files compared to SAS files (SAS7BDAT is not at all friendly to vectorization).

The column-oriented data structure used here is modeled on Bcolz (https://github.com/Blosc/bcolz), though not interchangeable with it. I believe this is much simpler than Parquet. In any case, I evolved this into a different columnar container called Dstream (https://github.com/kshedden/dstream) which I actively develop and maintain. The container here is mainly for the internal use of the Stata and SAS readers in this package.

Regarding concurrency, the readers work through the SAS/Stata files in chunks, and each chunk has its own backing memory, so you can process one chunk while reading the next. The reading itself is not concurrent (I don't think doing this would help as it is IO-bound).
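To make the chunk overlap concrete, here is a minimal sketch of the pattern, not the actual datareader API. `Chunk`, `ChunkSource`, `sliceSource`, and `SumPipelined` are hypothetical names invented for illustration, and an in-memory slice stands in for a SAS/Stata file. One goroutine does the (sequential) reading while the caller processes the previous chunk, which is safe only because each chunk has its own backing memory.

```go
package main

// Chunk is a hypothetical stand-in for one block of rows from a reader.
// Each chunk owns its backing slice, so it stays valid after the next read.
type Chunk struct {
	Values []float64
}

// ChunkSource is a hypothetical interface, not part of datareader.
type ChunkSource interface {
	// Next returns the next chunk, or ok == false once the data is exhausted.
	Next() (c Chunk, ok bool)
}

// sliceSource serves chunks from an in-memory slice in place of a file.
// size must be positive.
type sliceSource struct {
	data []float64
	size int
}

func (s *sliceSource) Next() (Chunk, bool) {
	if len(s.data) == 0 {
		return Chunk{}, false
	}
	n := s.size
	if n > len(s.data) {
		n = len(s.data)
	}
	// Copy into fresh memory so every chunk has its own backing array.
	vals := make([]float64, n)
	copy(vals, s.data[:n])
	s.data = s.data[n:]
	return Chunk{Values: vals}, true
}

// SumPipelined reads chunks one at a time in a separate goroutine and
// sums them in the calling goroutine. Reading itself stays sequential.
func SumPipelined(src ChunkSource) float64 {
	// A buffer of one lets the reader stay one chunk ahead of the consumer.
	chunks := make(chan Chunk, 1)
	go func() {
		defer close(chunks)
		for {
			c, ok := src.Next()
			if !ok {
				return
			}
			chunks <- c
		}
	}()

	total := 0.0
	for c := range chunks {
		for _, v := range c.Values {
			total += v
		}
	}
	return total
}
```

For example, `SumPipelined(&sliceSource{data: []float64{1, 2, 3, 4, 5}, size: 2})` walks through three chunks and returns 15. With a real reader, the gain depends on how expensive the per-chunk processing is relative to the IO.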

from datareader.

kylebarron commented on July 3, 2024

Thanks for all that information!

