Comments (7)
The issues you're encountering with CodeSplitter in your LlamaIndex setup seem to stem from two main areas: the initialization of CodeSplitter and the processing of Go language code.
- Initialization Error: The TypeError suggests a mismatch in the expected arguments for the CodeSplitter constructor. Ensure all required parameters are provided and match the expected types. The error indicates that the constructor is not receiving the expected number of arguments, which could be due to missing parameters in your call.
- Processing Go Language Code: The ValueError indicates a problem with parsing Go language code. This could be due to the tree_sitter_languages package not supporting Go or a misconfiguration in your setup. Ensure that:
  - The tree_sitter_languages package is correctly installed and supports Go.
  - You have versions of tree_sitter and tree_sitter_languages that include Go language support. If Go is not supported, you may need to manually load a Go language grammar for tree-sitter.
For both issues, verify that you're using compatible versions of llama-index, tree-sitter, and tree_sitter_languages. If the problem persists, consider manually initializing a tree-sitter Parser for Go and passing it to the CodeSplitter constructor, ensuring you have a Go language grammar available for tree-sitter.
Lastly, ensure your error handling is robust, especially for unsupported languages or parsing errors, so these situations are handled gracefully.
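One way to handle the parsing failure gracefully is to catch the ValueError and fall back to a simpler splitter. This is a minimal sketch, assuming the primary splitter is any callable that raises ValueError on input it cannot parse (as CodeSplitter.split_text does in the traceback reported in this issue). The names split_with_fallback and split_by_lines are hypothetical helpers, not part of LlamaIndex.

```python
from typing import Callable, List


def split_by_lines(text: str, chunk_lines: int = 40) -> List[str]:
    """Naive fallback: cut the text into fixed-size groups of lines."""
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + chunk_lines])
        for i in range(0, len(lines), chunk_lines)
    ]


def split_with_fallback(
    text: str,
    primary: Callable[[str], List[str]],
    fallback: Callable[[str], List[str]] = split_by_lines,
) -> List[str]:
    """Try the primary splitter; use the fallback if it raises ValueError."""
    try:
        return primary(text)
    except ValueError:
        # Assumption: the primary splitter signals "could not parse"
        # with ValueError, as seen in the traceback in this issue.
        return fallback(text)
```

In practice the primary callable could be something like CodeSplitter(language="go", ...).split_text, so files that fail to parse still get chunked instead of aborting the whole pipeline run.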
Details
Sources
- llama-index-core/llama_index/core/node_parser/text/code.py
- llama-index-legacy/llama_index/legacy/node_parser/text/code.py
from llama_index.
@framsouza yea this seems like an issue with the tree-sitter version. I don't actually know what the solution is yet, but a band-aid fix was pip install -U "tree-sitter<0.22.0"
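To check whether that pin is relevant to your environment, you can print the installed versions first. This is a rough sketch using only the standard library. The package list and the helper names installed_versions and is_below are illustrative assumptions; the 0.22 threshold is taken from the pin above.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust to what you actually installed.
PACKAGES = ["llama-index-core", "tree-sitter", "tree-sitter-languages"]


def installed_versions(packages):
    """Map each distribution name to its version string, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def is_below(version_string, limit):
    """Compare the leading numeric components against a tuple such as (0, 22)."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts) < limit


if __name__ == "__main__":
    versions = installed_versions(PACKAGES)
    for name, ver in versions.items():
        print(f"{name}: {ver or 'not installed'}")
    ts = versions["tree-sitter"]
    if ts is not None and not is_below(ts, (0, 22)):
        print("tree-sitter is 0.22 or newer; the pin above may help.")
```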
@framsouza Seems like the if statement at the top is failing
- Create a fresh Python environment
- Install llama_index (my version is 0.10.37)
This is my code:
from pathlib import Path

import tree_sitter_java as tsjava
from llama_index.core.node_parser import CodeSplitter
from llama_index.readers.file.flat.base import FlatReader
from tree_sitter import Language, Parser

CODEBASE_DIR = "your code"
# Build the parser from the standalone Java grammar package
JAVA_LANGUAGE = Language(tsjava.language())
parser = Parser(JAVA_LANGUAGE)
language = "java"
documents = FlatReader().load_data(Path(CODEBASE_DIR))
# Pass the parser explicitly instead of letting CodeSplitter look it up
splitter = CodeSplitter(
    parser=parser,
    language=language,
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500,  # max chars per chunk
)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))
it works!
Hello @framsouza! 👋 I'm Dosu, a bot here to lend a hand with bugs, answer questions, and guide you on your journey to becoming a contributor, all while we wait for a human maintainer to chime in. Delighted to meet you! I'm diving into your issue and will get back to you with a full answer shortly.
hey @logan-markewich, I just gave it a try:
tree-sitter 0.21.3
tree-sitter-go 0.21.0
tree-sitter-languages 1.10.2
llama-index 0.10.36
llama-index-core 0.10.36
pipeline = IngestionPipeline(
    transformations=[
        CodeSplitter(language="go", chunk_lines=20, chunk_lines_overlap=5),
        Settings.embed_model,
    ],
    vector_store=get_es_vector_store(),
)
got a different error,
Data loaded from local directory.
Starting the pipeline...
Parsing nodes: 0%| | 0/2002 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/Users/framsouza/git-assistant/index.py", line 166, in <module>
main()
File "/Users/framsouza/git-assistant/index.py", line 159, in main
pipeline.run(show_progress=True, documents=documents)
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/ingestion/pipeline.py", line 734, in run
nodes = run_transformations(
^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/ingestion/pipeline.py", line 124, in run_transformations
nodes = transform(nodes, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 127, in __call__
return self.get_nodes_from_documents(nodes, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 76, in get_nodes_from_documents
nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 145, in _parse_nodes
splits = self.split_text(node.get_content())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/text/code.py", line 161, in split_text
raise ValueError(f"Could not parse code with language {self.language}.")
ValueError: Could not parse code with language go.
I can see Go in the list of supported languages.
This is the code. I'm moving from SentenceSplitter to CodeSplitter since I'm ingesting source code.