This Python script converts PDF, DOCX, and other documents into paragraphs so that they can then be turned into embedding vectors. It is particularly useful when text has to be extracted from a variety of documents and processed further.
- Supports text extraction from PDF and DOCX files.
- Allows custom title patterns for segmenting text.
- Saves extracted text into a CSV file for further processing.
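As a rough illustration of the CSV output step, the sketch below writes a list of paragraphs to CSV using only the standard library. The function name `paragraphs_to_csv` and the column names `index` and `paragraph` are assumptions made for this example; the tool's actual column layout may differ.

```python
import csv
import io


def paragraphs_to_csv(paragraphs, stream):
    """Write one paragraph per row to an open text stream."""
    writer = csv.writer(stream)
    # Hypothetical header; the real tool may use different column names.
    writer.writerow(["index", "paragraph"])
    for i, text in enumerate(paragraphs):
        writer.writerow([i, text])


# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
paragraphs_to_csv(["First paragraph.", "Second, with a comma."], buffer)
print(buffer.getvalue())
```

The `csv` module quotes fields that contain commas or line breaks, so paragraph text can be read back unchanged with `csv.reader`.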
Before using the document2paragraph tool, ensure that you have Python 3 installed. Follow these steps to install the necessary dependencies:
git clone https://github.com/LiuYuWei/document2paragraph.git
cd document2paragraph
pip install -r requirements.txt
To use the document2paragraph tool, follow these steps:
- Place your PDF or DOCX files in an appropriate directory.
- Execute the script with the following command:
python main.py <document_file_path> --pattern <split_pattern> --folder <output_folder>
For example:
python main.py example.pdf --pattern "(\s*[一二三四五六七八九十]{1,3}\、)" --folder result
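The `--pattern` value in the example command is a regular expression that matches Chinese-numeral section markers such as `一、` or `十二、`. The following is a minimal sketch of how such a pattern can segment text with `re.split`. It is not taken from the tool's source, the helper name `split_by_title` is hypothetical, and it assumes the pattern contains exactly one capturing group.

```python
import re

# Same pattern as in the command-line example: optional whitespace,
# one to three Chinese numerals, then the enumeration comma "、".
TITLE_PATTERN = r"(\s*[一二三四五六七八九十]{1,3}\、)"


def split_by_title(text, pattern=TITLE_PATTERN):
    """Split text into paragraphs, each starting at a matched title marker.

    Hypothetical helper; assumes `pattern` has exactly one capturing group.
    """
    # With a capturing group, re.split keeps the matched markers in the result.
    parts = re.split(pattern, text)
    paragraphs = []
    # parts[0] is whatever precedes the first marker.
    preamble = parts[0].strip()
    if preamble:
        paragraphs.append(preamble)
    # After that, parts alternate: marker, body, marker, body, ...
    for marker, body in zip(parts[1::2], parts[2::2]):
        paragraphs.append((marker + body).strip())
    return paragraphs


sample = "Intro 一、Purpose text 二、Scope more text"
for paragraph in split_by_title(sample):
    print(paragraph)
```

Keeping the marker attached to the text that follows it means each paragraph retains its own section number, which can help when the paragraphs are later embedded and retrieved individually.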
- To run the Streamlit web UI instead, use the following commands:
git clone https://github.com/LiuYuWei/document2paragraph.git
cd document2paragraph
streamlit run streamlit.py
This project is licensed under the MIT License - see the LICENSE file for details.