Comments (7)
I think that's because WeakRefString objects can only be used with NullableArray at the moment, so lots of small strings need to be allocated when nullable=false. Does the dataset contain string columns? If you have a small number of categories, using a CategoricalArray would save a lot of memory (not sure it's supported yet).
from csv.jl.
@nalimilan Thanks, yes, that's correct.
I'm still not entirely sure why this should explain the disparity though as:
- maximum length of any of the small strings is 14 UInt8s (so 14 bytes?)
- minimum size of a WeakRefString appears to be 24 bytes (3 Int64s?)
I would have thought that the file should be small enough to fit into RAM using actual strings as:
nrow = 9999999
strlen = 14
colsize_mb = (nrow*strlen*sizeof(UInt8))/1e6
#139.999986
There are 6 cols and I have 4GB RAM, so I'd expect a String version to take up roughly 840MB or less.
I'd expect a WeakRefString version of the CSV to be similar. When I end my julia process after reading in with WeakRefString, about 1GB RAM gets released (some cols are Int64), which is roughly consistent with the above numbers.
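The estimate above can be extended to compare both layouts side by side. This is only a back-of-envelope sketch: `estimate_mb` is a hypothetical helper, the 14-byte figure is the maximum string length (so it is an upper bound on character data), the 24-byte figure is the observed WeakRefString size quoted above, and all 6 columns are treated as string columns even though some are Int64.

```julia
# Back-of-envelope memory estimate in megabytes (1 MB = 1e6 bytes).
# `estimate_mb` is a hypothetical helper, not part of CSV.jl.
estimate_mb(nrow, bytes_per_elem, ncols) = nrow * bytes_per_elem * ncols / 1e6

nrow = 9_999_999
ncols = 6   # assumption: every column is a string column (an upper bound)

payload_mb = estimate_mb(nrow, 14, ncols)   # raw character data only
weakref_mb = estimate_mb(nrow, 24, ncols)   # 24 bytes per WeakRefString element

println("character data:  ", payload_mb, " MB")
println("WeakRefString:   ", weakref_mb, " MB")
```

Both totals stay well under 4GB, which is why per-string overhead, rather than the character data itself, is the thing to look at.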
In Julia 0.5, String objects have a significant overhead due to their Array field (this will be much better in 0.6). I don't remember what the exact value is, but for short strings like yours it's a lot.
Thanks for following up. Investigating this, the container size of a String seems to be 8 bytes.
julia> N = 10000;
julia> v = @timed ["hello" for i in 1:N]
(String["hello","hello","hello","hello","hello","hello","hello","hello","hello","hello" … "hello","hello","hello","hello","hello","hello","hello","hello","hello","hello"],0.045251508,1375438,0.0,Base.GC_Diff(1375438,1,0,12641,0,0,0,0,0))
julia> v[3] #mem allocated
1375438
julia> tots = sizeof(v[1]) + length(v[1])*sizeof(v[1][1]) #size of the containers + size of the data contained?
130000
julia> tots/N #cost per string
13.0
So if my strings are each 14 bytes + 8-byte container = 22 bytes, this should still be smaller than a 24-byte WeakRefString. Unless I'm misunderstanding the info given back by @timed.
So I'm still not sure the String overhead is sufficient to explain the nullable frame fitting into 1GB memory, but the typed version taking up 4GB + 4GB swap + ???.
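One way to cross-check the per-string cost without relying on @timed is to measure what a vector of strings retains, using `Base.summarysize`. The sketch below assumes a current Julia 1.x session rather than 0.5, so the overhead it reports will not match the 0.5 numbers discussed here. It builds distinct 14-byte strings, since repeating one literal could let elements share storage.

```julia
# Assumes Julia 1.x; String internals differ from Julia 0.5.
N = 10_000

# Build N distinct 14-byte strings: "id" followed by a 12-digit, zero-padded index.
v = [string("id", lpad(i, 12, '0')) for i in 1:N]

payload_bytes  = sum(sizeof, v)         # character data only
pointer_bytes  = sizeof(v)              # one pointer-sized slot per element
retained_bytes = Base.summarysize(v)    # the array plus every string it references

# Whatever is left after payload and pointers is per-object overhead.
overhead_per_string = (retained_bytes - payload_bytes - pointer_bytes) / N

println("payload per string:  ", payload_bytes / N, " bytes")
println("pointer per string:  ", pointer_bytes / N, " bytes")
println("overhead per string: ", overhead_per_string, " bytes")
```

Unlike the allocation count from @timed, which also includes temporary allocations made while building the array, this reports only what stays reachable from `v`.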
hey @pearcemc, thanks for opening an issue! Is there any chance you can share the file you're using? Could you also share your system info? I know there have been a few platform-specific issues before.
Sure, it's the page_views_sample.csv from here.
My laptop /proc/meminfo looks like:
ubuntu@ubuntu-UX21E:/db/outbrain$ cat /proc/meminfo | head
MemTotal: 3946968 kB
MemFree: 413704 kB
Buffers: 61688 kB
Cached: 1299912 kB
SwapCached: 130404 kB
Active: 2008912 kB
Inactive: 1269984 kB
Active(anon): 1748120 kB
Inactive(anon): 993240 kB
Active(file): 260792 kB
I'm on Julia 0.5.0.
This should be fixed on master since we're now relying on plain Vector{Union{T, Null}} for non-String columns, and WeakRefStringArray for String arrays, which will be memory efficient.