Comments (7)
This should be fixed in version 0.63. Are you on that version yet? I ran your code on 0.63 and it works, but I can reproduce your error on the previous version.
Here's what's going on.
Before the latest version, Token was supposed to be a weak reference. The underlying data is a C array of structs. This data is owned by the Tokens object. Once the Tokens object passes out of scope, the underlying C data is due for collection. If you're still holding references to Token objects, and you access a property that's proxied to that C data, all bets are off.
The recent update tries to check whether any Token objects will outlive the Tokens object. If so, it gives them a copy of the C data. I didn't document any of this properly, and I'm sorry that this seems to have wasted some of your time. I hope to have less broken release processes soon.
But, aside from this: what you're doing is supposed to be a rare edge case. My intention is for users to maintain a reference to the Tokens object, and access the Token objects through it. Is there a reason you want to create your own list? It might indicate a weakness in my API.
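The usage I have in mind can be sketched roughly as below. This is only an illustration, assuming the Tokens object can be iterated and indexed by position. `FakeToken`, `FakeTokens` and `find_verbs` are hypothetical stand-ins so the example runs without spaCy; with the real library, `doc` would be the result of `nlp(text)`.

```python
from collections import namedtuple

# Hypothetical stand-ins for spaCy's Token and Tokens, for illustration only.
FakeToken = namedtuple("FakeToken", ["lower_", "pos_", "dep_"])


class FakeTokens:
    """Owns the token data, much as the real Tokens object owns the C array."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    def __iter__(self):
        return iter(self._tokens)

    def __getitem__(self, i):
        return self._tokens[i]


def find_verbs(doc):
    """Return positions into `doc` rather than the Token objects themselves."""
    return [i for i, tok in enumerate(doc) if tok.pos_ == "VERB"]


doc = FakeTokens([
    FakeToken("she", "PRON", "nsubj"),
    FakeToken("reads", "VERB", "ROOT"),
    FakeToken("books", "NOUN", "dobj"),
])
verb_positions = find_verbs(doc)

# Keep `doc` alive for as long as the positions are in use,
# and go back through it whenever a token is needed.
verbs = [doc[i].lower_ for i in verb_positions]
```

Returning positions keeps ownership in one place: the container stays the only long-lived object, and tokens are looked up through it on demand.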
from spacy.
I am running this on version 0.63; sorry I didn't mention that. Strange that it works on your machine, but not mine. But now that I understand you have to keep a reference to the Tokens object, I can do what I wanted to earlier without the error. It might be good to look into this a bit more to make sure you really are preventing those tokens from losing the necessary data; I'm positive I'm on 0.63 (I even checked tokens.pxd and it has this change: "cdef int take_ownership_of_c_data(self) except -1").
The reason I was doing that is that I have a large set of tweets, each of which I was turning into a Tokens object by calling nlp(tweet_text). I didn't know you needed to keep a reference to it, so I was just passing it into a function that does the analysis I want. In this case I was using the dependency parse to extract Subject, Verb, Object structures. In that function, I was finding the subject, verb, and object tokens and appending them to another list that I return at the end, so that I could examine the returned tokens further down the line. I figured it might be useful to have all of the information, like tok.pos_, tok.dep_, etc., which is why I was adding the token itself to the list. To work around the error, I simply appended the tok.lower_ string representation instead. So far that is fine, but I may want more info later.
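If more than the lowercase string is needed later, one option is to copy the fields out into a plain Python record before the Tokens object goes away. Below is a minimal sketch of that idea; `TokenInfo` and `snapshot` are hypothetical names, only the attributes mentioned above (`lower_`, `pos_`, `dep_`) are read, and a `SimpleNamespace` stands in for a real token so the example runs without spaCy.

```python
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical record type: plain Python strings, no link back to the C data.
TokenInfo = namedtuple("TokenInfo", ["lower", "pos", "dep"])


def snapshot(tok):
    """Copy the fields of interest out of a token-like object."""
    return TokenInfo(lower=tok.lower_, pos=tok.pos_, dep=tok.dep_)


# Stand-in for a real token, for illustration only.
tok = SimpleNamespace(lower_="reads", pos_="VERB", dep_="ROOT")
info = snapshot(tok)
```

The record stays valid no matter what happens to the original token, at the cost of copying the strings once.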
Don't worry about the bugs, I understand this is the early stage of development. I do want to commend you on your work so far, though. The dependency parsing works pretty darn well, even on informal text from the internet. And it is by far the fastest and easiest to use in Python. I've tried TurboParser and TweeboParser and both of those are very difficult to work with in Python. It's so simple to traverse the tree using rights and lefts as well. Really well done!
Working on this. I think I have it replicating on my server, but not on my laptop. Memory errors like this are difficult, because tests can pass accidentally, depending on whether the memory was overwritten.
Okay, try v0.65. This is working on my laptop, server, and on Travis.
Please watch out for, and report, memory leaks in the new implementation. There's a reference cycle between Tokens and Token. I've run the parser over lots of documents, but the problems might only arise when the Token objects are used in some non-trivial way.
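One rough way to check whether a reference cycle is being reclaimed is to hold a weak reference to the container, drop the strong references, and run the garbage collector. The sketch below uses pure-Python stand-ins (`Container` and `Item` are hypothetical, not the real Tokens and Token classes, which are Cython extension types and may behave differently), so it only shows the shape of such a check.

```python
import gc
import weakref


class Item:
    """Hypothetical stand-in for Token: points back at its container."""

    def __init__(self, parent, i):
        self.parent = parent
        self.i = i


class Container:
    """Hypothetical stand-in for Tokens: holds items that refer back to it."""

    def __init__(self, n):
        self.items = [Item(self, i) for i in range(n)]


def cycle_is_collected():
    container = Container(3)
    probe = weakref.ref(container)  # does not keep the container alive
    del container                   # only the cycle refers to it now
    gc.collect()                    # ask the cycle collector to run
    return probe() is None          # True if the container was reclaimed


collected = cycle_is_collected()
```

If a check like this reports that the object is still alive after collection, something else is holding a reference, which is the kind of leak worth reporting.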
It seems to be working on my end now with v0.65!
Great!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.