Text Split Explorer

Many of the most important LLM applications involve connecting LLMs to external sources of data. A prerequisite is to ingest that data into a format LLMs can easily connect to, which most of the time means ingesting it into a vectorstore. That, in turn, requires first splitting the original text into smaller chunks.

While this may seem trivial, it is a nuanced and often overlooked step. When splitting text, you want each chunk to contain cohesive information; for example, you don't want to split in the middle of a sentence. What "cohesive information" means can also differ depending on the text type. For example, Markdown has section delimiters (##), so you may want to keep each section together, while for Python code you may want to keep each class and method together (if possible).
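One way to picture "cohesive" splitting is to try coarse boundaries first (paragraphs), and fall back to finer ones (sentences, then words) only when a piece is still too large. The following is a minimal, illustrative sketch of that idea in plain Python. It is not the code used by this repo or by LangChain; the function name `split_text`, the `chunk_size` parameter, and the separator list are all assumptions made for the example.

```python
from typing import List


def split_text(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: separators are tried in order, coarsest first.
    Separators that fall on a chunk boundary are dropped in this sketch.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(split_text(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph. It has two sentences.\n\n"
    "A much longer paragraph that will not fit in one chunk. "
    "It needs sentence-level splitting."
)
# Paragraph breaks first, then sentence ends, then single spaces.
chunks = split_text(text, chunk_size=60, separators=["\n\n", ". ", " "])
for chunk in chunks:
    print(repr(chunk))
```

In this example the short first paragraph stays whole, while the longer second paragraph is split at its sentence boundary rather than mid-sentence. For Markdown, a separator such as `"\n## "` could be placed first in the list so that sections are kept together where possible.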

This repo (and the associated Streamlit app) is designed to help you explore different types of text splitting. You can adjust different parameters and choose different types of splitters. By pasting a text file, you can apply the splitter to that text and see the resulting splits. You are also shown a code snippet that you can copy and use in your application.
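To give a feel for how adjusting parameters changes the resulting splits, here is a small sketch in plain Python. The text above does not name the parameters the app exposes, so `chunk_size` and `chunk_overlap` are assumed here as typical examples, and `window_split` is a hypothetical helper rather than anything from this repo.

```python
from typing import List


def window_split(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores sentence and
    paragraph boundaries entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "abcdefghij"
no_overlap = window_split(sample, chunk_size=4, chunk_overlap=0)
with_overlap = window_split(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

Comparing the two printed lists side by side is a small-scale version of what the app lets you do interactively: change a setting, rerun the splitter, and inspect the chunks.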

Hosted App

To use the hosted app, head to https://langchain-text-splitter.streamlit.app/

Running locally

To run locally, first set up the environment by cloning the repo and running:

pip install -r requirements.txt

Then, run the Streamlit app with:

streamlit run splitter.py

Contributors

hwchase17, rlancemartin
