Comments (10)
There shouldn't be any cases where a Vector{WeakRefString{UInt8}}
is returned from parsing a CSV file. Careful attention is paid to ensure that if the user requests a non-NullableVector, then a Vector{String}
is returned. Please let me know if you have cases where you are seeing otherwise, but I've tried to be careful in testing that WeakRefStrings are never used without NullableVectors.
from csv.jl.
@quinnj Could you elaborate a bit more on this? I don't understand why the problem goes away if you use NullableVector. For example, when I simply do a CSV.read("data.csv")
, I get back a DataFrame
and one column might have elements of type Nullable{WeakRefString{UInt8}}
. To what memory are those instances pointing, and who is keeping that alive?
from csv.jl.
The NullableArray itself holds a reference to the memory that it's WeakRefStrings are pointing to. See the member field of NullableArray here.
from csv.jl.
Ah, ok. What about this:
df = CSV.read("data.csv")
s = df[1,1]
df = nothing
Now s
is of type Nullable{WeakRefString{UInt8}}
. Wouldn't the original memory disappear at this point, once the DataFrame
that was holding the NullableArray
is gone?
from csv.jl.
Yes.
from csv.jl.
I almost think weakrefstrings=false
should be the default in that case for the various read
functions, and then folks can opt in if they understand that design and need the speed. I think it would also just make the whole thing more approachable for novice users, right now I see them quite confused when they see their first WeakRefString
in a simple situation where they are just reading a file. Would that be an option?
from csv.jl.
So far, we've gotten many more perf-related issues filed than "what is a weakrefstring" type of issues. I'm inclined to leave it as is, due to the # of issues like "why is CSV.read is 5x slower than...." issues.
from csv.jl.
I have to admit that I'm quite worried about this... I think this is a recipe for incredibly subtle bugs. For example the code above, where the original buffer is freed, does actually not crash or show any weird behavior at first, probably because that memory is just not being reclaimed yet. So folks can happily write code like that, everything works, until all of a sudden it doesn't, because the memory manager actually used that area of memory for something else. This might also introduce serious security concerns, right? I'm not an expert for that, but I do know that pointers to wrong memory locations can be a huge problem.
I guess my gut feeling here is that CSV is a package that will be used by a lot of "normal" users, and I don't think they should be exposed to such a subtle and difficult memory management story.
from csv.jl.
I agree with @davidanthoff. To me the name WeakRefString
would mean a pointer to a string with a weak reference to the memory, not a pointer that requires an external weak reference to the memory. It seems that returning such a type by default is a "fail faulty" design.
from csv.jl.
I tend to agree that safety should win over speed by default. The option to use WeakRefString
could be advertised prominently in the docs. It will be easier to point people to that option when they complain about speed than to explain them why they get subtle issues by default.
BTW, since most CSV string data is categorical, would an option to convert the columns to CategoricalArray
improve performance over creating a Vector{String}
? That could be another, safer alternative.
from csv.jl.
Related Issues (20)
- Add Zenodo badge to README HOT 6
- Segfault on Julia 1.9 on Intel Sapphire Rapids during precompilation
- `bufsize` of `write` is defined to be length of row but actually cells
- can not read the csv with large cells written by itself HOT 1
- Formatting broken on Examples page in documentation HOT 2
- CSV.jl fails to precompile on Ubuntu server, v0.10.5 and up. HOT 2
- Error on CSV.read attempt HOT 4
- `emptyvalue` keyword option
- CSV.Chunks splits file into uneven chunks
- CSV.jl errors on nightly
- Incorrect results for `argmax` with multithreaded parsing HOT 1
- CSV is failing PkgEval HOT 4
- Error when combining single row with multiple row CSV file into a DataFrame with pooling on. HOT 1
- `Date` types should not be inferred from column
- CSV is broken in nightly julia
- 1.12.0-DEV.317 ERROR: LoadError: TypeError: in typeassert, expected Tuple{Vector{UInt8}, Int64, Int64, Union{Nothing, String}}, got a value of type Tuple{Memory{UInt8}, Int64, Int64, Nothing}
- Error when passing as `source` a vector with fewer unique elements than files.
- CSV.File failing for gzipped file in Julia 1.11-rc
- CSV.write() with append=true allocating a lot of memory
- Fail to read multiple-line string with multiple tasks
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from csv.jl.