
unstructured-io / unstructured


Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Home Page: https://www.unstructured.io/

License: Apache License 2.0

Makefile 0.07% HTML 87.79% XSLT 0.01% Python 10.55% Shell 0.86% Dockerfile 0.01% Rich Text Format 0.73%
deep-learning document-parsing machine-learning nlp ocr information-retrieval data-pipelines ml preprocessing pdf-to-text

unstructured's Introduction

Open-Source Pre-Processing Tools for Unstructured Data

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured's modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient at transforming unstructured data into structured outputs.

API Announcement!

We are thrilled to announce our newly launched Unstructured API, which provides the capabilities of unstructured as an API. Check out the unstructured-api GitHub repository to start making API calls. You'll also find instructions on how to host your own version of the API.

While access to the hosted Unstructured API will remain free, API keys are required to make requests. To prevent disruption, get yours here and start using it today!

🚀 Beta Feature: Chipper Model

We are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To start using the Chipper model in your API requests, pass the hi_res_model_name=chipper parameter. Please refer to the documentation here.

As the Chipper model is in beta, we welcome feedback and suggestions. If you are interested in testing the Chipper model, we encourage you to connect with us in our Slack community.

✴️ Quick Start

There are several ways to use the unstructured library:

Run the library in a container

The following instructions are intended to help you get up and running using Docker to interact with unstructured. See here if you don't already have Docker installed on your machine.

NOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. docker pull should download the corresponding image for your architecture, but you can specify with --platform (e.g. --platform linux/amd64) if needed.

We build Docker images for all pushes to main. We tag each image with the corresponding short commit hash (e.g. fbc7a69) and the application version (e.g. 0.5.5-dev1). We also tag the most recent image with latest. To leverage this, docker pull from our image repository.

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest

Once pulled, you can create a container from this image and shell into it.

# create the container
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest

# this will drop you into a bash shell where the Docker image is running
docker exec -it unstructured bash

You can also build your own Docker image.

If you only plan on parsing one type of data, you can speed up building the image by commenting out the packages/requirements needed for other data types. See the Dockerfile to determine which lines are necessary for your use case.

make docker-build

# this will drop you into a bash shell where the Docker image is running
make docker-start-bash

Once in the running container, you can try things directly in the Python interpreter's interactive mode.

# this will drop you into a python console so you can run the below partition functions
python3

>>> from unstructured.partition.pdf import partition_pdf
>>> elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf")

>>> from unstructured.partition.text import partition_text
>>> elements = partition_text(filename="example-docs/fake-text.txt")

Installing the library

Use the following instructions to get up and running with unstructured and test your installation.

  • Install the Python SDK to support all document types with pip install "unstructured[all-docs]"

    • For plain text files, HTML, XML, JSON and Emails that do not require any extra dependencies, you can run pip install unstructured
    • To process other doc types, you can install the extras required for those documents, such as pip install "unstructured[docx,pptx]"
  • Install the following system dependencies if they are not already available on your system (an example install command for Debian/Ubuntu follows this list). Depending on what document types you're parsing, you may not need all of these.

    • libmagic-dev (filetype detection)
    • poppler-utils (images and PDFs)
    • tesseract-ocr (images and PDFs, install tesseract-lang for additional language support)
    • libreoffice (MS Office docs)
    • pandoc (EPUBs, RTFs and Open Office docs). Please note that to handle RTF files, you need version 2.14.2 or newer. Running either make install-pandoc or ./scripts/install-pandoc.sh will install the correct version for you.
  • For suggestions on how to install on Windows and to learn about dependencies for other features, see the installation documentation here.
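
As an illustration, on Debian/Ubuntu the system dependencies listed above can typically be installed in one shot (package names and the package manager will differ on other platforms):

sudo apt-get install libmagic-dev poppler-utils tesseract-ocr libreoffice pandoc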

At this point, you should be able to run the following code:

from unstructured.partition.auto import partition

elements = partition(filename="example-docs/eml/fake-email.eml")
print("\n\n".join([str(el) for el in elements]))

Installation Instructions for Local Development

The following instructions are intended to help you get up and running with unstructured locally if you are planning to contribute to the project.

  • Using pyenv to manage virtualenvs is recommended but not necessary

    • Mac install instructions. See here for more detailed instructions.
      • brew install pyenv-virtualenv
      • pyenv install 3.10
    • Linux instructions are available here.
  • Create a virtualenv to work in and activate it, e.g. for one named unstructured:

    pyenv virtualenv 3.10 unstructured
    pyenv activate unstructured

  • Run make install

  • Optional:

    • To install models and dependencies for processing images and PDFs locally, run make install-local-inference.
    • For processing image files, tesseract is required. See here for installation instructions.
    • For processing PDF files, tesseract and poppler are required. The pdf2image docs have instructions on installing poppler across various platforms.

Additionally, if you're planning to contribute to unstructured, we provide an optional pre-commit configuration file to ensure your code matches the formatting and linting standards used in unstructured. If you'd prefer not to have code changes auto-tidied before every commit, you can use make check to see whether any linting or formatting changes should be applied, and make tidy to apply them.

If you're using the optional pre-commit hooks, you'll just need to install them with pre-commit install, since the pre-commit package is installed as part of make install mentioned above. If you later decide against using pre-commit, you can remove the hooks with pre-commit uninstall.

In addition to developing in your local OS, we also provide a helper to use Docker for a development environment:

make docker-start-dev

This starts a docker container with your local repo mounted to /mnt/local_unstructured. This docker image allows you to develop without worrying about your OS's compatibility with the repo and its dependencies.

👏 Quick Tour

Documentation

This README overviews how to install, use and develop the library. For more comprehensive documentation, visit https://unstructured-io.github.io/unstructured/ .

Concepts Guide

The unstructured library includes core functionality for partitioning, chunking, cleaning, and staging raw documents for NLP tasks. You can see a complete list of available functions and how to use them from the Core Functionality documentation.

In general, these functions fall into several categories:

  • Partitioning functions break raw documents into standard, structured elements.
  • Cleaning functions remove unwanted text from documents, such as boilerplate and sentence fragments.
  • Staging functions format data for downstream tasks, such as ML inference and data labeling.
  • Chunking functions split documents into smaller sections for use in RAG apps and similarity search.
  • Embedding encoder classes provide an interface for easily converting preprocessed text to vectors.
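
As a rough sketch of how these pieces fit together (module paths below assume a recent unstructured release with the [all-docs] extras; chunking and staging helpers may differ in your version):

from functools import partial

from unstructured.partition.auto import partition
from unstructured.cleaners.core import clean
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import convert_to_dict

# Partition: break the raw document into structured elements
elements = partition(filename="example-docs/layout-parser-paper-fast.pdf")

# Clean: strip extra whitespace and stray bullets from each element's text
cleaner = partial(clean, extra_whitespace=True, bullets=True)
for element in elements:
    element.apply(cleaner)

# Chunk: group elements into sections sized for RAG and similarity search
chunks = chunk_by_title(elements)

# Stage: convert the elements to plain dictionaries for downstream tasks
element_dicts = convert_to_dict(chunks)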

The Connectors 🔗 in unstructured serve as vital links between the pre-processing pipeline and various data storage platforms. They allow for the batch processing of documents across various sources, including cloud services, repositories, and local directories. Each connector is tailored to a specific platform, such as Azure, Google Drive, or GitHub, and comes with unique commands and dependencies. To see the list of connectors available in the unstructured library, please check out the Connectors GitHub folder and documentation.

PDF Document Parsing Example

The following examples show how to get started with the unstructured library. You can parse over a dozen document types with one line of code! Use this Colab notebook to run the example below.

The easiest way to parse a document in unstructured is to use the partition function. If you use partition, unstructured will detect the file type and route it to the appropriate file-specific partitioning function. If you are using the partition function, you may need to install additional dependencies via pip install unstructured[local-inference]. Ensure you first install libmagic using the instructions outlined here. partition will always apply the default arguments; if you need advanced features, use a document-specific partitioning function.

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

Run print("\n\n".join([str(el) for el in elements])) to get a string representation of the output, which looks like:


LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen 1 ( (cid:0) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and
Weining Li 5

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural
networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation.
However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy
reuse of important innovations by a wide audience. Though there have been ongoing efforts to improve reusability and
simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none
of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA
is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper
introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications.
The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models
for layout detection, character recognition, and many other document processing tasks. To promote extensibility,
LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization
pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in
real-word use cases. The library is publicly available at https://layout-parser.github.io

Keywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library ·
Toolkit.

Introduction

Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks
including document image classification [11,

See the partitioning section in our documentation for a full list of options and instructions on how to use file-specific partitioning functions.

💂‍♂️ Security Policy

See our security policy for information on how to report security vulnerabilities.

🐛 Reporting Bugs

Encountered a bug? Please create a new GitHub issue and use our bug report template to describe the problem. To help us diagnose the issue, use the python scripts/collect_env.py command to gather your system's environment information and include it in your report. Your assistance helps us continuously improve our software - thank you!

📚 Learn more

  • Company Website: Unstructured.io product and company info
  • Documentation: Full API documentation
  • Batch Processing: Ingesting batches of documents through Unstructured

📈 Analytics

We’ve partnered with Scarf (https://scarf.sh) to collect anonymized user statistics to understand which features our community is using and how to prioritize product decision-making in the future. To learn more about how we collect and use this data, please read our Privacy Policy. To opt out of this data collection, you can set the environment variable SCARF_NO_ANALYTICS=true before running any unstructured commands.

unstructured's People

Contributors

ahmetmeleq, alvarobartt, amanda103, asymness, awalker4, badgarnet, benjats07, christinestraub, coniferish, cragwolfe, dependabot[bot], jackretterer, klaijan, kravetsmic, laverdes, mallorih, mthwrobinson, natygyoon, newelh, potter-potter, qued, rbiseck3, ron-unstructured, rvztz, ryannikolaidis, scanny, shreyanid, tabossert, tomaarsen, yuming-long


unstructured's Issues

Create a data connector for Azure Blob Storage

Create a data connector that pulls data from Azure Blob Storage

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single object or a container.
  • The connector is able to process the entire container.
  • For private files, the connector requires access credentials.
  • For now, it is OK to process only doc types that partition.auto() is capable of processing.

Bonus points: the ability to filter by document type.

Create a data connector for Slack

Create a data connector that:

  • fetches posts from a specific Slack channel, with the option to add a data filter.
  • stores the extracted data locally as text files (at least temporarily for processing).
  • processes them using unstructured.partition.auto.
See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to pull all conversations in a channel or take in a date range and extract conversations from those dates.
  • The connector takes in credentials for private channels.
  • The connector is able to process a single text file.
  • The connector is able to process all text files extracted from Slack, recursively.
  • For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing.
  • Bonus points: the ability to filter by document type.

Update the auto.partition() function to recognize Unstructured JSON

The auto partition() function should recognize json-serialized Unstructured ISD documents, which are likely recognized as .TXT files (needs to be confirmed). See recalibrating-risk-report.pdf.json for example json, but note that different json elements may have different metadata fields defined, whereas the top-level fields type and text are always defined.

In this case, the staging brick https://unstructured-io.github.io/unstructured/bricks.html#isd-to-elements should be used to instantiate the Unstructured elements after loading the json.
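
A minimal sketch of that flow, assuming the serialized ISD file already exists on disk (the staging helper name follows the docs linked above; exact signatures may differ between versions):

import json

from unstructured.staging.base import isd_to_elements

with open("recalibrating-risk-report.pdf.json") as f:
    isd = json.load(f)

# Rebuild Unstructured document elements from the ISD-formatted dictionaries
elements = isd_to_elements(isd)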

Motivation

Applications downstream of Unstructured would benefit from processing a mix of already-processed structured data and unstructured data with no additional code or config changes.

Example

from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_from_json

# write json output
elements = partition("example-docs/fake-text.txt")
elements_to_json(elements, filename="fake-text.json", indent=2)

After the serialization step above, fake-text.json looks like:

$ head -12 fake-text.json 
[
  {
    "element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
    "coordinates": null,
    "text": "This is a test document to use for unit tests.",
    "type": "NarrativeText",
    "metadata": {
      "filename": "example-docs/fake-text.txt"
    }
  },
  {
    "element_id": "a9d4657034aa3fdb5177f1325e912362",

Which may be converted back to elements:

# Verify that elements_from_json is the inverse operation:
elements2 = elements_from_json(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements2[i]

The thing to add in this issue is to update the auto partition function to also detect json-structured elements:

elements3 = partition(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements3[i]

In addition, the partition() function should still be able to construct elements if the element_id or coordinates fields are missing, or if any metadata fields are missing.

Definition of Done

  • auto partition() function successfully loads ISD json.
  • Many permutations of serialized ISD json are tested, i.e. across different types with different metadata schemas.

Add filetype check based on file extension if `libmagic` isn't available

Currently we use libmagic to determine the filetype for input files. However, magic requires the libmagic library, which not all users have available. The goal of this issue is to add a fallback that determines the filetype based on the file extension if libmagic is not available. If libmagic is available, detect_filetype should continue to use magic, as it does currently.
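
A minimal sketch of what such a fallback could look like; the FileType enum and mapping here are illustrative stand-ins for whatever unstructured uses internally, not the actual implementation:

import os
from enum import Enum

class FileType(Enum):  # stand-in for unstructured's internal filetype enum
    PDF = "pdf"
    DOCX = "docx"
    TXT = "txt"
    UNKNOWN = "unknown"

# Illustrative extension-to-filetype mapping, used only when libmagic is missing
EXT_TO_FILETYPE = {".pdf": FileType.PDF, ".docx": FileType.DOCX, ".txt": FileType.TXT}

def detect_filetype_from_extension(filename: str) -> FileType:
    """Fallback filetype detection based on the file extension."""
    _, ext = os.path.splitext(filename)
    return EXT_TO_FILETYPE.get(ext.lower(), FileType.UNKNOWN)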

Create a data connector for Discord

Create a data connector that:

  • fetches data from one or more discord channels
  • stores the files locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process all files from Discord data loader API.
  • The connector can accept any credentials, if necessary.
  • For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing.
  • Bonus points: the ability to filter by document type.

Add Colab Notebook

Would be nice to have Colab notebook with examples shown in the readme here. Happy to contribute this, since I already did it to play with the repo the other day.

Lists from LayoutParser should be split into `ListItem`s

Summary

Currently if you run code such as:

from unstructured.partition.pdf import partition_pdf

filename = "example-docs/layout-parser-paper.pdf"
elements = partition_pdf(filename=filename)

List elements are a single Text blob that looks like:

'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3) 2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage 3. Comprehensive tools for efficient document image data annotation and model tuning to support different levels of customization 4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'

We should split up the list so each element in the list is its own ListItem element.

Add text categories as metadata

The goal of this issue is to add text categories as metadata to document elements to enable users to restrict search to specific sections (i.e. the experiments section in research papers).

Create a data connector for Google Drive

Create a data connector that pulls documents from Google Drive, stores them locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single document.
  • The connector is able to process all documents in a Google Drive folder, recursively.
  • For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing. Google Drive documents should be converted to PDF or Word Doc for processing (unless there is a better way).
  • Bonus points: the ability to filter by document type.

More meaningful warnings for unknown filetypes in `detect_filetype`

Currently, the detect_filetype function in unstructured.file_utils.filetype emits the following warning if it detects an unknown MIME type. This isn't especially helpful because you still don't know what the file type is. The goal of this issue is to update the warning to print out the file extension or filename so it's more obvious which file caused the issue.

MIME type was inode/x-empty. This file type is not currently supported in unstructured.
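
A hedged sketch of a more informative message; the logger and helper name are illustrative rather than the actual fix:

import logging

logger = logging.getLogger(__name__)

def warn_unsupported_filetype(filename: str, mime_type: str) -> None:
    # Include the filename (and therefore its extension) so users can tell which file failed
    logger.warning(
        f"MIME type was {mime_type} for {filename!r}. "
        "This file type is not currently supported in unstructured."
    )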

Tests pass `label-studio-sdk==0.0.17` but not with `label-studio-sdk==0.0.18`

Summary

Currently, tests pass with label-studio-sdk==0.0.17 but not with label-studio-sdk==0.0.18. The goal of this issue is to figure out why and implement a fix.

Steps to reproduce

Run PYTHONPATH=. pytest test_unstructured/staging/test_label_studio.py. This will result in the following error message.

self = <vcr.patch.VCRRequestsHTTPConnectiontest_unstructured/vcr_fixtures/cassettes/label_studio_upload.yaml object at 0x1066f7160>
_ = False, kwargs = {'buffering': True}

    def getresponse(self, _=False, **kwargs):
        """Retrieve the response"""
        # Check to see if the cassette has a response for this request. If so,
        # then return it
        if self.cassette.can_play_response_for(self._vcr_request):
            log.info("Playing response for {} from cassette".format(self._vcr_request))
            response = self.cassette.play_response(self._vcr_request)
            return VCRHTTPResponse(response)
        else:
            if self.cassette.write_protected and self.cassette.filter_request(self._vcr_request):
>               raise CannotOverwriteExistingCassetteException(
                    cassette=self.cassette, failed_request=self._vcr_request
                )
E               vcr.errors.CannotOverwriteExistingCassetteException: Can't overwrite existing cassette ('test_unstructured/vcr_fixtures/cassettes/label_studio_upload.yaml') in your current record mode (<RecordMode.ONCE: 'once'>).
E               No match for the request (<Request (GET) http://localhost:8080/api/projects/95>) was found.
E               Found 1 similar requests with 0 different matcher(s) :
E
E               1 - (<Request (GET) http://localhost:8080/api/projects/95>).
E               Matchers succeeded : ['method', 'scheme', 'host', 'port', 'path', 'query']
E               Matchers failed :

../../.pyenv/versions/unstructured/lib/python3.8/site-packages/vcr/stubs/__init__.py:231: CannotOverwriteExistingCassetteException
-------------------------------------------- Captured log call --------------------------------------------
DEBUG    urllib3.util.retry:retry.py:351 Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG    urllib3.util.retry:retry.py:351 Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG    urllib3.connectionpool:connectionpool.py:228 Starting new HTTP connection (1): localhost:8080
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /health HTTP/1.1" 200 None
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /api/projects?page_size=10000000 HTTP/1.1" 200 None
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /api/projects/95 HTTP/1.1" 200 None
========================================= short test summary info =========================================
FAILED test_unstructured/staging/test_label_studio.py::test_upload_label_studio_data_with_sdk - vcr.errors.CannotOverwriteExistingCassetteException: Can't overwrite existing cassette ('test_unstruct...
====================================== 1 failed, 17 passed in 0.27s =======================================


`ZeroDivisionError` on capitalization check if there are no tokens

Summary

The exceeds_cap_ratio function in text_type.py raises a ZeroDivisionError if the length of the tokens is zero. Instead, it should return False.

Reproduce

from unstructured.partition.text_type import exceeds_cap_ratio

exceeds_cap_ratio("")
ZeroDivisionError                         Traceback (most recent call last)
Cell In [9], line 1
----> 1 exceeds_cap_ratio("")

File ~/repos/unstructured/unstructured/partition/text_type.py:123, in exceeds_cap_ratio(text, threshold)
    121 tokens = word_tokenize(text)
    122 capitalized = sum([word.istitle() or word.isupper() for word in tokens])
--> 123 ratio = capitalized / len(tokens)
    124 return ratio > threshold

ZeroDivisionError: division by zero
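
A minimal sketch of the proposed fix, guarding against an empty token list before dividing (the threshold default and tokenizer import are illustrative; the real function lives in unstructured/partition/text_type.py):

from nltk.tokenize import word_tokenize  # stand-in for unstructured's internal tokenizer wrapper

def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    tokens = word_tokenize(text)
    if len(tokens) == 0:
        # Nothing to measure, so treat the text as not exceeding the ratio
        return False
    capitalized = sum([word.istitle() or word.isupper() for word in tokens])
    return capitalized / len(tokens) > threshold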

Add email header information to metadata

The partition_email function has the ability to parse out header information from emails, like to:, from: and subject:. The goal of this issue is to add this to the unstructured metadata when partitioning emails.

Document level language detection before processing

There is some rules-based logic in unstructured that is language dependent. For example, is_possible_narrative_text checks for verbs and the presence of valid words from a dictionary. However, these checks are specific to English-language docs. The goal of this issue is to add a document-level language check to determine the language of a document. We already use langdetect for this in translate_text.
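
A minimal sketch of such a document-level check using langdetect (the helper name is illustrative):

from langdetect import detect

def document_language(text: str) -> str:
    # Returns an ISO 639-1 code such as "en" or "ru"
    return detect(text)

print(document_language("Привет, это тестовый документ."))  # "ru"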

Create a data connector for Substack

Create a data connector that:

  • fetches one or more Substack posts
  • stores them locally as HTML files (at least temporarily for processing)
  • processes them in a way specific to Substack, i.e. filters out Substack boilerplate elements after calling partition_html

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single document or multiple documents (multiple URL's may be passed to --substack-url in main.py).
  • Note that unlike the S3IngestDoc, the subclass of BaseIngestDoc created for this task (probably named SubstackInstanceDoc) would need to override the process_file method to handle additional cleaning logic. I.e., the 2nd bullet in the checklist.

Create a data connector for Sharepoint

Create a data connector that:

  • fetches data from Sharepoint.
  • stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process using the Sharepoint API [https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/get-to-know-the-sharepoint-rest-service].
  • The connector is able to process a single document.
  • The connector is able to process documents from a folder, with the option to process a folder recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Create a new data connector for Obsidian

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Obsidian using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents from the Obsidian Vault through unstructured.partition.auto.

Sync `detectron2` versions in docs

There are a few sets of detectron2 install instructions and make targets. Some of them reference different versions. The goal of this issue is to make them all the same so all of the docs are synced.

File detection doesn't work properly when using `partition` in hot reload mode in `ipython`

Currently, if you are using the partition brick in hot-reload mode in ipython, an error related to file detection occurs if you make a change to partition.py.

Steps to reproduce

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

Make a change to partition.py and run the same code again. You'll get:

ValueError: Invalid file /Users/mrobinson/data/sitreps/downsampled-data/docx/01_SOCN_DOWNREP_06-01.docx. File type not support in partition.

But if you run:

from unstructured.file_utils.filetype import detect_filetype

detect_filetype("example-docs/layout-parser-paper.pdf")

You'll see that the filetype is correct. This may be related to how Enums are handled during hot reloading. The workaround is to exit iPython and restart.

Create a data connector for processing the biomedical literature

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

Create a data connector for processing social media sites

Create a data connector that:

  • fetches data from social media sites such as Twitter or Facebook.
  • stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.
  • indicate the social media site you are working on by commenting on the issue.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process posts using the API of the media site.
  • The connector is able to process a single post.
  • The connector is able to process several posts.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Create a data connector for Reddit

Create a data connector that:

  • fetches one or more Reddit messages from a Subreddit or Redditor
  • stores them locally as markdown files (at least temporarily for processing)
  • processes the files the standard way, now that there is support for them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

For inspiration on downloading Reddit content, see:
https://github.com/emptycrown/llama-hub/blob/main/loader_hub/reddit/base.py

Additional References

Definition of Done

  • The checklist has been completed.
  • The connector is able to process all documents for a subreddit or user, at least up to a large, well-defined limit
  • May constrain messages processed by date range

Get rid of `UserWarning` on the first call to `partition`

Currently, we get a UserWarning the first time you call partition_pdf or partition_image. The goal of this issue is to add handling so that this UserWarning no longer appears.

from unstructured.partition.auto import partition

partition("example-docs/layout-parser-paper-fast.pdf")
/home/ec2-user/anaconda3/envs/unstructured/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
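
One possible way to silence this particular warning (a sketch using the standard warnings module; not necessarily the fix the maintainers will choose):

import warnings

# Filter the torch.meshgrid indexing UserWarning emitted on the first partition call
warnings.filterwarnings(
    "ignore",
    message="torch.meshgrid: in an upcoming release",
    category=UserWarning,
)

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper-fast.pdf")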

Create a data connector for Apache Kafka

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Apache Kafka using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

partition_html incorrect encoding

Describe the bug
The partition_html function incorrectly handles Cyrillic (and probably other non-Latin) text.
This happens when an HTML document does not have the correct encoding specified in a meta tag.

To Reproduce

from unstructured.partition.html import partition_html
text = '<html><body><p>Привет</p></body></html>'
parts = partition_html(text=text)
print(parts[0].text)

Prints Ð\x9fÑ\x80ивеÑ\x82

Expected behavior

from unstructured.partition.html import partition_html
text = '<html><body><p>Привет</p></body></html>'
parts = partition_html(text=text)
print(parts[0].text)

Prints Привет

Desktop (please complete the following information):

  • OS: mac
  • Python version: 3.9.9

Additional context
As a possible solution, I would suggest adding the ability to provide your own parser to the partition_html function.
This way we could use lxml.etree.HTMLParser initialized with the desired encoding argument.
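
A sketch of what that suggestion could look like; the parser keyword argument does not exist today and is shown purely as a proposal:

from lxml.etree import HTMLParser

from unstructured.partition.html import partition_html

text = '<html><body><p>Привет</p></body></html>'

# Hypothetical: hand partition_html a parser configured with the right encoding
parts = partition_html(text=text, parser=HTMLParser(encoding="utf-8"))
print(parts[0].text)  # expected: Привет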

feat/Add support for epub format

Is your feature request related to a problem? Please describe.
One of the leading open formats for books is EPUB. Thus, when running algorithms on books I often have to find a way to process them.

Describe the solution you'd like
An EPUB is essentially compressed HTML. Since unstructured already supports HTML, it would be very straightforward to add support for EPUB.

Describe alternatives you've considered
I am currently using pandoc to convert my EPUB data to HTML. This is straightforward but adds a special case to my pipelines.
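
A sketch of that pandoc workaround driven from Python (file paths are placeholders, and pandoc must already be on the PATH):

import subprocess

from unstructured.partition.html import partition_html

# Convert the EPUB to HTML with pandoc, then partition the HTML as usual
subprocess.run(["pandoc", "book.epub", "-o", "book.html"], check=True)
elements = partition_html(filename="book.html")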

Compatibility with Python 3.7

So, related to #56, Colab's environment uses Python 3.7. Since the package uses typing.Final, it is unfortunately unusable on Python 3.7, as that feature is only supported on Python >= 3.8.

Would it be possible to not use this typing feature so the package works on Python < 3.8?
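
One possible compatibility approach (a sketch, not a commitment from the maintainers) is to fall back to typing_extensions on older interpreters:

import sys

if sys.version_info >= (3, 8):
    from typing import Final
else:
    from typing_extensions import Final  # backport of Final for Python 3.7

EXAMPLE_CONSTANT: Final = 1  # illustrative constant, not from the codebase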

Create a data connector for OneDrive

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from OneDrive using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Add file information to document metadata

For several document types, we have helper functions that extract document metadata like "author" and "last modified date". For images files, we can also extract EXIF metadata when it's available. The goal of this issue is to include any information that would be of interest to downstream users in the unstructured metadata.

Create a data connector for RSS Feeds

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from RSS feeds.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Create a data connector for WHO data

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from the WHO data source using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Create a data connector for GitHub repos Part 1 (all supported filetypes but no source code)

Create a data connector that pulls data from a GitHub repo.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single repository and, recursively, all the documents in that repo.
  • Should ignore code.
  • Should process only doc types supported in partition.auto, including markdown files.

Bonus points: the ability to filter by document type.

Optional encoding kwarg for `partition_{html, email, text}`

The goal of this issue is to add an optional encoding kwarg that allows users to pass in the encoding to the bricks that process plain text files (partition_html, partition_email, and partition_text). Currently, if you want to use a specific encoding, you need to read in the file yourself and pass it in as text, as shown below.

from unstructured.partition.html import partition_html

with open("example-docs/example-10k.html", "r", encoding="utf-8") as f:  # encoding is passed here instead
    text = f.read()
elements = partition_html(text=text)
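
With the requested kwarg, the same thing would look roughly like this (the encoding argument is the feature being proposed here, not something that exists yet):

from unstructured.partition.html import partition_html

# Proposed: let partition_html open and decode the file itself
elements = partition_html(filename="example-docs/example-10k.html", encoding="utf-8")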

`FigureCaption`, `Text`, and metadata get lost when you serialize and deserialize a list of elements

Describe the bug
As described in this comment, we currently lose FigureCaption, Text, and metadata when we serialize and deserialize a list of elements.

To Reproduce
Start with a list of elements that contain FigureCaption, Text, and metadata then run the following. You'll see that some elements and their metadata are lost:

import json

from unstructured.staging.base import convert_to_isd, isd_to_elements

with open("elements.json", "w") as f:
    json.dump(convert_to_isd(elements), f)

with open("elements.json", "r") as f:
    elements = isd_to_elements(json.load(f))

Add Python 3.9 and Python 3.10 to the CI test job

Currently, we're only testing against python3.8 in .github/workflows/ci.yml. The goal of this issue is to add python3.9 and python3.10 to make sure that we're compatible with later versions of Python.

feat/write_elements

Hi,

Is there any way to write List[Element] data into a file and load it back later, in order to avoid partitioning the data each time?
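
One way to do this with the existing staging helpers (the same elements_to_json / elements_from_json pair shown earlier on this page; behavior may vary by version):

from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_from_json

elements = partition(filename="example-docs/fake-text.txt")

# Write the elements to disk once...
elements_to_json(elements, filename="fake-text.json")

# ...and reload them later without re-partitioning
elements = elements_from_json(filename="fake-text.json")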

`partition_html` is returning javascript code from some HTML documents

Currently, the partition_html function is returning JavaScript code from some HTML documents. The goal of this issue is to update our partitioning logic so that this JavaScript code doesn't come through in the example document.

Steps to reproduce

import requests
from unstructured.partition.html import partition_html

url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-december-13"
r = requests.get(url)
elements = partition_html(text=r.text)
print("\n\n".join([str(el) for el in elements[:5]]))

You should see the following javascript code in elements[1].text

'(function(d){\n  var js, id = \'facebook-jssdk\'; if (d.getElementById(id)) {return;}\n  js = d.createElement(\'script\'); js.id = id; js.async = true;\n  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";\n  d.getElementsByTagName(\'head\')[0].appendChild(js);\n}(document));'

Create a Data Connector for Storj DCS (Decentralized Cloud Storage)

Create a data connector that pulls data from Storj DCS

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single distributed object or a bucket.
  • The connector is able to process the entire bucket.
  • The connector ingests a Storj Access Grant.
  • For now, it is OK to process only doc types that partition.auto() is capable of processing.

`partition_pdf` should return `unstructured` document `Element` objects

Currently, if you run elements = partition_pdf("example-docs/layout-parser-paper.pdf", url=None) and look at elements[0], the type is LayoutElement, which is a type from unstructured-inference. Instead, we should return standard unstructured document elements for consistency with the other partition functions. As part of this, we may consider whether to include extra information such as coordinates as optional attributes in Element.

`partition_email` outputs `UnicodeDecodeError` when trying to parse email with an image attachment

Currently there is a bug in partition_email that results in a UnicodeDecodeError when parsing emails that have an image attachment.

Steps to reproduce

Run the following from the root directory of the repo.

from unstructured.partition.email import partition_email

filename = "example-docs/fake-email-attachment.eml"
elements = partition_email(filename=filename)

The error should look like:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In [3], line 1
----> 1 elements = partition_email(filename=filename)

File ~/unstructured/unstructured/partition/email.py:207, in partition_email(filename, file, text, content_source, include_headers)
    205     for element in elements:
    206         if isinstance(element, Text):
--> 207             element.apply(replace_mime_encodings)
    208 elif content_source == "text/plain":
    209     elements = partition_text(text=content)

File ~/unstructured/unstructured/documents/elements.py:44, in Text.apply(self, *cleaners)
     42 cleaned_text = self.text
     43 for cleaner in cleaners:
---> 44     cleaned_text = cleaner(cleaned_text)
     46 if not isinstance(cleaned_text, str):
     47     raise ValueError("Cleaner produced a non-string output.")

File ~/unstructured/unstructured/cleaners/core.py:117, in replace_mime_encodings(text)
    110 def replace_mime_encodings(text: str) -> str:
    111     """Replaces MIME encodings with their UTF-8 equivalent characters.
    112 
    113     Example
    114     -------
    115     5 w=E2=80-99s -> 5 w’s
    116     """
--> 117     return quopri.decodestring(text.encode()).decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte

Create a new data connector for Notion

Create a data connector that:

  • fetches data from Notion.
  • stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.
  • inspiration for processing is available here

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Notion using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Too many ListItems generated for some PDFs

Describe the bug
PDF parsing results in too many ListItem chunks.

To Reproduce
Run test-ingest script that currently exists in CI. https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/test-ingest.sh

Expected behavior
Many fewer chunks.

Additional context
Here is one example: https://github.com/Unstructured-IO/unstructured/blob/3c1b089/test_unstructured_ingest/expected-structured-output/s3-small-batch/small-pdf-set/2023-Jan-economic-outlook.pdf.json#L16 but there are more throughout that sample document.

Create a data connector for JIRA

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from JIRA using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.
