Comments (17)

kantord avatar kantord commented on August 28, 2024

Looks like a source file is in an unexpected text encoding and has bytes that fail to decode using cp1254, which Wikipedia tells me is a Windows text encoding primarily used for Turkish(!)

Can you verify that the file is not in that encoding? And by the way, if the content of that file is not private, could you share it? Maybe I could convert it into a test case to make sure this bug does not come back after it is fixed.

from seagoat.

snej avatar snej commented on August 28, 2024

I don’t know for sure which file it is; I was just guessing on the basis of CP-1254 being used for Turkish, and one known file with “turkish” in the name.

If you could add the filename to the exception log somehow, I could show you which file. But of course, it’d be best if the code gracefully handled any encoding, even broken ones.
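
A minimal sketch of both requests, assuming two hypothetical helpers, `read_source_lines` and `read_source_lines_strict` (neither is SeaGOAT's actual code): the first tries UTF-8 and then falls back to a lossy decode instead of raising, the second re-raises a decode failure with the file path in the message.

```python
from pathlib import Path


def read_source_lines(path, encodings=("utf-8",)):
    """Hypothetical helper: return the file's lines without raising on bad bytes."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding).splitlines()
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: undecodable bytes become U+FFFD instead of an exception.
    return raw.decode("utf-8", errors="replace").splitlines()


def read_source_lines_strict(path, encoding):
    """Hypothetical strict variant: re-raise with the file name in the message."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode(encoding).splitlines()
    except UnicodeDecodeError as error:
        raise ValueError(f"could not decode {path} as {encoding}: {error}") from error
```

The lossy fallback trades exact content for robustness, which is probably acceptable for search indexing; the strict variant is the one that would make a log like the one above point at a specific file.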

snej avatar snej commented on August 28, 2024

This is the repo: https://github.com/couchbase/couchbase-lite-C

cori avatar cori commented on August 28, 2024

I am running into the same error on a private repo, ofc with different byte and position values: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2129: character maps to <undefined>. Happy to dig into logs and whatnot a bit later; I haven't even found where those are yet.

cori avatar cori commented on August 28, 2024

Oh it looks like maybe #240 combined with #231 might address this? I ran into this both when trying to run from source and when running from the pipx version yesterday, so I think #240 was insufficient to resolve completely.

kantord avatar kantord commented on August 28, 2024

Oh it looks like maybe #240 combined with #231 might address this? I ran into this both when trying to run from source and when running from the pipx version yesterday, so I think #240 was insufficient to resolve completely.

I wonder if #240 actually caused this problem

kantord avatar kantord commented on August 28, 2024

The two of you are running into the same problem, so I think it must be something common.

kantord avatar kantord commented on August 28, 2024

This is the repo: https://github.com/couchbase/couchbase-lite-C

I have quickly "fake analyzed" this repo (meaning I did not actually create the vector embeddings), and it seems able to read all the files, at least on Linux 🤔. But it looks like the problem happened for you when showing the results, which is weird: if it fails when showing the results, it should already have failed at the beginning, when the files were analyzed 🤔

cori avatar cori commented on August 28, 2024

I apologize - I didn't read carefully enough and am not familiar enough with the project to notice - I am running into the same base exception, but through a different stack trace. Mine is in the analyze phase:

Exception in thread Thread-1 (_worker_function):
Traceback (most recent call last):
  File "/home/cori/code/SeaGOAT/seagoat/queue/base_queue.py", line 76, in _worker_function
    task = self._task_queue.get(timeout=1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cori/code/SeaGOAT/seagoat/queue/base_queue.py", line 81, in _worker_function
    self.handle_maintenance(context)
  File "/home/cori/code/SeaGOAT/seagoat/queue/task_queue.py", line 50, in handle_maintenance
    remaining_chunks_to_analyze = context["seagoat_engine"].analyze_codebase(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/engine.py", line 84, in analyze_codebase
    return self._create_vector_embeddings(minimum_chunks_to_analyze)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/engine.py", line 106, in _create_vector_embeddings
    for chunk in file.get_chunks():
                 ^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/file.py", line 81, in get_chunks
    lines = self._get_file_lines()
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cori/code/SeaGOAT/seagoat/file.py", line 36, in _get_file_lines
    for i, line in enumerate(source_code_file.read().splitlines())
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/encodings/cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2129: character maps to <undefined>

I can split this into a different issue.

kantord avatar kantord commented on August 28, 2024

I can split this into a different issue.

Sounds good.

kantord avatar kantord commented on August 28, 2024

Correction to my earlier thought that both errors involve the same encoding (which would be suspicious): that is not true. They are different encodings, cp1254 vs cp1252.

kantord avatar kantord commented on August 28, 2024

It could be related to chardet/chardet#148.

kantord avatar kantord commented on August 28, 2024

Using a script that randomly generates byte sequences that are valid UTF-8, I managed to find some examples that cause the same error:

b'\xc3\x8f>WP\x1b'
b'\x1emh\xdf\x90F'
b'\xe8\xb2\x81;N0uA'
b'\x19|\x14G?<\x0c\\\x17\xeb\xbd\x8f>|'

These are valid UTF-8, but chardet does not detect them as such. They are also examples where an exception is raised when trying to decode them with the encoding that chardet detected. I think this is the cause; let me write some test cases to reproduce it.
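
A small standard-library sketch of the second half of that claim. It does not call chardet; it only checks that each sample above decodes as UTF-8 but fails under the two Windows code pages seen in this thread. The helper names `decodes_as` and `summarize` are made up for illustration.

```python
EXAMPLES = [
    b"\xc3\x8f>WP\x1b",
    b"\x1emh\xdf\x90F",
    b"\xe8\xb2\x81;N0uA",
    b"\x19|\x14G?<\x0c\\\x17\xeb\xbd\x8f>|",
]


def decodes_as(data, encoding):
    """Return True if `data` decodes under `encoding` without an error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def summarize(samples, encodings=("utf-8", "cp1252", "cp1254")):
    """Map each sample to the encodings that decode it successfully."""
    return {
        sample: [enc for enc in encodings if decodes_as(sample, enc)]
        for sample in samples
    }


if __name__ == "__main__":
    for sample, ok in summarize(EXAMPLES).items():
        print(sample, "->", ok)
```

Each sample contains a byte (0x81, 0x8f or 0x90) that Python's cp1252 and cp1254 tables leave undefined, so a detector that guesses either code page for such input leads straight to the `character maps to <undefined>` error.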

kantord avatar kantord commented on August 28, 2024

This should be fixed now. Feel free to reopen if you face this error again, or anything similar.

cori avatar cori commented on August 28, 2024

Some combination of these fixes also resolved my possibly-tangential issue; thanks!

ChrisB85 avatar ChrisB85 commented on August 28, 2024

I've just encountered a similar problem, here's the full stacktrace https://pastebin.com/rBCh91Ld

kantord avatar kantord commented on August 28, 2024

I've just encountered a similar problem, here's the full stacktrace https://pastebin.com/rBCh91Ld

Hi @ChrisB85!

Thank you for posting the stack trace. Unfortunately, based on the stack trace alone, I am not sure what the exact issue is.

Could you please:

  • Share the repo itself if it is open source
  • If that is not possible, try to make a new repo with a minimal reproduction

It would also be useful if you could describe which operating system you use and which character encoding your files use.
