Comments (17)
Looks like a source file is in an unexpected text encoding and has bytes that fail to decode using cp1254, which Wikipedia tells me is a Windows text encoding primarily used for Turkish(!)
Can you verify that the file is not in that encoding? Also, if the content of that file is not private, could you share it? I could try to turn it into a test case to make sure this bug does not come back after it is fixed.
from seagoat.
I don’t know for sure which file it is; I was just guessing on the basis of CP-1254 being used for Turkish, and one known file with “turkish” in the name.
If you could add the filename to the exception log somehow, I could show you which file. But of course, it’d be best if the code gracefully handled any encoding, even broken ones.
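A minimal sketch of both requests, not SeaGOAT's actual code: the helper names `read_lines_reporting_path` and `read_lines_leniently` are hypothetical, and UTF-8 as the default encoding is an assumption. The first helper attaches the filename to the decode error; the second never raises on undecodable bytes.

```python
from pathlib import Path


def read_lines_reporting_path(path, encoding="utf-8"):
    """Read a text file; on a decode failure, name the file in the error."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode(encoding).splitlines()
    except UnicodeDecodeError as error:
        # Re-raise with the filename attached so the log identifies the culprit.
        raise ValueError(f"Could not decode {path} as {encoding}: {error}") from error


def read_lines_leniently(path, encoding="utf-8"):
    """Never raise on bad bytes: each undecodable byte becomes U+FFFD."""
    raw = Path(path).read_bytes()
    return raw.decode(encoding, errors="replace").splitlines()
```

The lenient variant trades fidelity for robustness: a broken byte shows up as a replacement character instead of stopping the whole analysis.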
from seagoat.
This is the repo: https://github.com/couchbase/couchbase-lite-C
from seagoat.
I am running into the same error on a private repo (ofc with different byte and position values): UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2129: character maps to <undefined>. Happy to dig into logs and whatnot a bit later - haven't even dug up where those are yet.
from seagoat.
Oh it looks like maybe #240 combined with #231 might address this? I ran into this both when trying to run from source and when running from the pipx version yesterday, so I think #240 was insufficient to resolve completely.
from seagoat.
I wonder if #240 actually caused this problem
from seagoat.
The two of you are running into the same problem, so I think it must be something common.
from seagoat.
I quickly "fake analyzed" this repo (meaning I did not actually create the vector embeddings), and it seems able to read all the files, at least on Linux 🤔 But it seems this problem actually happened for you when showing the results, which is weird: if it fails when showing the results, it should already have failed at the beginning, when the files were analyzed 🤔
from seagoat.
I apologize - I didn't read carefully enough and am not familiar enough with the project to notice - I am running into the same base exception but through a different stack trace: mine is in the analyze phase.
Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/home/cori/code/SeaGOAT/seagoat/queue/base_queue.py", line 76, in _worker_function
    task = self._task_queue.get(timeout=1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cori/code/SeaGOAT/seagoat/queue/base_queue.py", line 81, in _worker_function
    self.handle_maintenance(context)
  File "/home/cori/code/SeaGOAT/seagoat/queue/task_queue.py", line 50, in handle_maintenance
    remaining_chunks_to_analyze = context["seagoat_engine"].analyze_codebase(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/engine.py", line 84, in analyze_codebase
    return self._create_vector_embeddings(minimum_chunks_to_analyze)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/engine.py", line 106, in _create_vector_embeddings
    for chunk in file.get_chunks():
                 ^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/file.py", line 81, in get_chunks
    lines = self._get_file_lines()
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/file.py", line 36, in _get_file_lines
    for i, line in enumerate(source_code_file.read().splitlines())
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/encodings/cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2129: character maps to <undefined>
I can split this into a different issue.
from seagoat.
sounds good
from seagoat.
Correction: I first thought both errors involved the same encoding, which would have been suspicious, but that is not true. They are different ones: cp1254 vs cp1252.
from seagoat.
it could be related to this chardet/chardet#148
from seagoat.
using a script that randomly generates byte sequences that are valid utf-8, I managed to find some examples that cause the same error:
b'\xc3\x8f>WP\x1b'
b'\x1emh\xdf\x90F'
b'\xe8\xb2\x81;N0uA'
b'\x19|\x14G?<\x0c\\\x17\xeb\xbd\x8f>|'
These are valid UTF-8, but chardet does not detect them as such. They are also examples where an exception is raised when trying to read them using the encoding that chardet detected. I think this is the cause; let me write some test cases to reproduce it.
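One way to guard against this kind of misdetection is to try strict UTF-8 before trusting any detector. This is a sketch under stated assumptions, not necessarily the fix that was applied: `decode_preferring_utf8` is a hypothetical helper, and `cp1252` merely stands in for whatever encoding the detector reported.

```python
SAMPLES = [
    b"\xc3\x8f>WP\x1b",
    b"\x1emh\xdf\x90F",
    b"\xe8\xb2\x81;N0uA",
    b"\x19|\x14G?<\x0c\\\x17\xeb\xbd\x8f>|",
]


def decode_preferring_utf8(raw, detected_encoding):
    """Try strict UTF-8 first; only then fall back to the detected encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps a wrong guess from raising an exception.
        return raw.decode(detected_encoding, errors="replace")


for sample in SAMPLES:
    # "cp1252" is an assumed detector result, used only for illustration.
    text = decode_preferring_utf8(sample, detected_encoding="cp1252")
    try:
        sample.decode("cp1252")
        strict_cp1252_ok = True
    except UnicodeDecodeError:
        strict_cp1252_ok = False
    print(repr(text), "| strict cp1252 decode succeeded:", strict_cp1252_ok)
```

Because strict UTF-8 decoding rejects most byte sequences produced by other encodings, a successful UTF-8 decode is a strong signal, which is why it is tried before the guess.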
from seagoat.
should be fixed now, feel free to reopen if you face this error again, or anything similar
from seagoat.
some combination of these fixes also resolved my possibly-tangential issue; thanks!
from seagoat.
I've just encountered a similar problem, here's the full stacktrace https://pastebin.com/rBCh91Ld
from seagoat.
hi @ChrisB85 !
Thank you for posting the stack trace. Unfortunately based on the stacktrace I am not sure what the exact issue is.
Could you please:
- Share the repo itself if it is open source
- If that is not possible, try to make a new repo with a minimal reproduction
It could also be useful if you could describe what operating system you use and what character encoding your files use
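For the encoding question, a rough sketch like the following could help narrow things down. The helper `find_non_utf8_files` is hypothetical and not part of SeaGOAT; it lists files under a directory that are not valid UTF-8. It makes no attempt to skip binary files or the `.git` folder, so expect some noise.

```python
from pathlib import Path


def find_non_utf8_files(root):
    """Return (path, error message) pairs for files under root that are not valid UTF-8."""
    problems = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as error:
            # The message includes the offending byte and its position.
            problems.append((path, str(error)))
    return problems
```

Calling `find_non_utf8_files(".")` from the repository root and printing each pair gives the offending byte and position per file, which can then be compared with the values in the stack trace.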
from seagoat.