unstructured-io / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Home Page: https://www.unstructured.io/

License: Apache License 2.0

Makefile 0.07% HTML 87.58% XSLT 0.01% Python 10.75% Shell 0.86% Dockerfile 0.01% Rich Text Format 0.73%

deep-learning document-parsing machine-learning nlp ocr information-retrieval data-pipelines ml preprocessing pdf-to-text

unstructured's Issues

Document level language detection before processing

There is some rules-based logic in unstructured that is language dependent. For example, is_possible_narrative_text checks for verbs and the presence of valid words from a dictionary. However, these checks are specific to English-language documents. The goal of this issue is to add a document-level language check to determine the language of a document. We already use langdetect for this in translate_text.

Create a data connector for Substack

Create a data connector that:

fetches one or more Substack posts
stores them locally as html files (at least temporarily for processing)
processes them in a way specific to Substack, i.e. filters out Substack boilerplate elements after calling partition_html

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process a single document or multiple documents (multiple URLs may be passed to --substack-url in main.py).
Note that unlike the S3IngestDoc, the subclass of BaseIngestDoc created for this task (probably named SubstackInstanceDoc) would need to override the process_file method to handle additional cleaning logic. I.e., the 2nd bullet in the checklist.

Create a data connector for Sharepoint

Create a data connector that:

fetches data from Sharepoint.
stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process documents using the Sharepoint API [https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/get-to-know-the-sharepoint-rest-service].
The connector is able to process a single document.
The connector is able to process documents from a folder, with the option to process a folder recursively.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

Too many ListItem's generated for some PDF's

Describe the bug
PDF parsing results in too many ListItem chunks.

To Reproduce
Run test-ingest script that currently exists in CI. https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/test-ingest.sh

Expected behavior
Many fewer chunks.

Additional context
Here is one example: https://github.com/Unstructured-IO/unstructured/blob/3c1b089/test_unstructured_ingest/expected-structured-output/s3-small-batch/small-pdf-set/2023-Jan-economic-outlook.pdf.json#L16 but there are more throughout that sample document.

Create a data connector for OneDrive

Create a data connector that:

Fetches data from OneDrive.
Stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process entries from OneDrive using the API.
The connector is able to process a single entry.
The connector is able to process several entries, with an option to process them recursively.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

ModuleNotFoundError: No module named 'unstructured.documents.pdf'

Hello,

Following the documentation for parsing a pdf as found here: https://unstructured-io.github.io/unstructured/examples.html#pdf-parsing

It seems that the import statement:
from unstructured.documents.pdf import PDFDocument
results in a not found error.

Indeed, checking unstructured/unstructured/documents, I can't seem to find anything relevant for PDF parsing.

Thank you

Create a data connector for Reddit

Create a data connector that:

fetches one or more Reddit messages from a Subreddit or Redditor
stores them locally as markdown files (at least temporarily for processing)
processes the files the standard way, since there is support for markdown through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

For inspiration on downloading Reddit content, see:
https://github.com/emptycrown/llama-hub/blob/main/loader_hub/reddit/base.py

Additional References

Definition of Done

The checklist has been completed.
The connector is able to process all documents for a subreddit or user, at least up to a large well defined limit
May constrain messages processed by date range

Add text categories as metadata

The goal of this issue is to add text categories as metadata to document elements to enable users to restrict search to specific sections (e.g. the experiments section in research papers).

`partition_email` outputs `UnicodeDecodeError` when trying to parse email with an image attachment

Currently there is a bug in partition_email that results in a UnicodeDecodeError when parsing emails that have an image attachment.

Steps to reproduce

Run the following from the root directory of the repo.

from unstructured.partition.email import partition_email

filename = "example-docs/fake-email-attachment.eml"
elements = partition_email(filename=filename)

The error should look like:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In [3], line 1
----> 1 elements = partition_email(filename=filename)

File ~/unstructured/unstructured/partition/email.py:207, in partition_email(filename, file, text, content_source, include_headers)
    205     for element in elements:
    206         if isinstance(element, Text):
--> 207             element.apply(replace_mime_encodings)
    208 elif content_source == "text/plain":
    209     elements = partition_text(text=content)

File ~/unstructured/unstructured/documents/elements.py:44, in Text.apply(self, *cleaners)
     42 cleaned_text = self.text
     43 for cleaner in cleaners:
---> 44     cleaned_text = cleaner(cleaned_text)
     46 if not isinstance(cleaned_text, str):
     47     raise ValueError("Cleaner produced a non-string output.")

File ~/unstructured/unstructured/cleaners/core.py:117, in replace_mime_encodings(text)
    110 def replace_mime_encodings(text: str) -> str:
    111     """Replaces MIME encodings with their UTF-8 equivalent characters.
    112 
    113     Example
    114     -------
    115     5 w=E2=80-99s -> 5 w’s
    116     """
--> 117     return quopri.decodestring(text.encode()).decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte
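
One way to avoid the crash, sketched here with only the standard library, is to decode with errors="replace" so that bytes which are not valid UTF-8 become U+FFFD instead of raising. The name replace_mime_encodings_lenient is hypothetical, and this is not necessarily the fix the maintainers would choose.

```python
import quopri


def replace_mime_encodings_lenient(text: str) -> str:
    """Decode quoted-printable sequences, tolerating invalid UTF-8.

    Hypothetical variant of replace_mime_encodings: bytes that do not form
    valid UTF-8 are replaced with U+FFFD instead of raising UnicodeDecodeError.
    """
    decoded_bytes = quopri.decodestring(text.encode("utf-8"))
    return decoded_bytes.decode("utf-8", errors="replace")
```

The trade-off is that damaged characters are silently replaced, so a caller that needs exact text may prefer to skip non-text attachments before cleaning.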

Add email header information to metadata

The partition_email function has the ability to parse out header information from emails, like to:, from: and subject:. The goal of this issue is to add this to the unstructured metadata when partitioning emails.
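
As a rough illustration of the header fields involved, the standard library email package can pull them out of a raw message. The function name and the metadata keys (sent_to, sent_from, subject) are assumptions made for this sketch, not the names used by unstructured.

```python
from email import message_from_string


def extract_header_metadata(raw_email: str) -> dict:
    """Return the To, From, and Subject headers as a plain dict."""
    message = message_from_string(raw_email)
    return {
        "sent_to": message.get("To"),
        "sent_from": message.get("From"),
        "subject": message.get("Subject"),
    }


raw = "From: alice@example.com\nTo: bob@example.com\nSubject: Hello\n\nBody text.\n"
metadata = extract_header_metadata(raw)
```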

feat/Add support for epub format

Is your feature request related to a problem? Please describe.
One of the leading open formats for books is epub. Thus, when running algorithms on books I often have to find a way to process them.

Describe the solution you'd like
An epub is essentially a compressed archive of html files. As unstructured already supports html, it should be fairly straightforward to add support for epub.

Describe alternatives you've considered
I am currently using pandoc to convert my epub data to html. This is straightforward but adds a special case to my pipelines.
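
To illustrate the idea that an epub is a ZIP archive of HTML files, here is a minimal standard-library sketch that collects the (X)HTML members so they could be handed to an HTML partitioner. The function name read_epub_html is hypothetical, and the sketch assumes UTF-8 content and ignores the reading order declared in the package document.

```python
import zipfile


def read_epub_html(epub_file) -> dict:
    """Map each (X)HTML member of an EPUB archive to its decoded text.

    Assumes UTF-8 content and ignores the reading order defined in the
    package document, so this is only a rough starting point.
    """
    html_suffixes = (".html", ".xhtml", ".htm")
    pages = {}
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith(html_suffixes):
                pages[name] = archive.read(name).decode("utf-8")
    return pages
```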

Create a data connector for WHO data

Create a data connector that:

Fetches World Health Organization's data.
Stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process entries from the World Health Organization using the API.
The connector is able to process a single entry.
The connector is able to process several entries, with an option to process them recursively.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

`ZeroDivisionError` on capitalization check if there are no tokens

Summary

The exceeds_cap_ratio function in text_type.py raises a ZeroDivisionError if the length of the tokens is zero. Instead it should return False.

Reproduce

from unstructured.partition.text_type import exceeds_cap_ratio

exceeds_cap_ratio("")

ZeroDivisionError                         Traceback (most recent call last)
Cell In [9], line 1
----> 1 exceeds_cap_ratio("")

File ~/repos/unstructured/unstructured/partition/text_type.py:123, in exceeds_cap_ratio(text, threshold)
    121 tokens = word_tokenize(text)
    122 capitalized = sum([word.istitle() or word.isupper() for word in tokens])
--> 123 ratio = capitalized / len(tokens)
    124 return ratio > threshold

ZeroDivisionError: division by zero
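
A guard for the empty case could look like the sketch below. To keep it self-contained, whitespace splitting stands in for word_tokenize and the default threshold of 0.3 is an assumption; only the early return is the point.

```python
def exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
    """Return True if the share of capitalized tokens exceeds the threshold.

    Whitespace splitting stands in for word_tokenize, and the default
    threshold is an assumption made for this sketch.
    """
    tokens = text.split()
    if not tokens:
        # Nothing to measure, so the text cannot exceed the ratio.
        return False
    capitalized = sum(word.istitle() or word.isupper() for word in tokens)
    return capitalized / len(tokens) > threshold
```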

Sync `detectron2` versions in docs

There are a few sets of detectron2 install instructions and make targets. Some of them reference different versions. The goal of this issue is to make them all the same so all of the docs are synced.

Create a data connector for Slack

Create a data connector that:

fetches posts from a specific slack channel with the option to add a data filter.
stores the extracted data locally as text files (at least temporarily for processing)
processes them using unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to pull all conversations in a channel or take in a date range and extract conversations from those dates.
The connector takes in credentials for private channels.
The connector is able to process a single text file.
The connector is able to process all text files extracted from slack, recursively.
For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing.
Bonus points: the ability to filter by document type.

Download `nltk` models if they're not available so people don't have to run `make install-nltk-models`

Currently you need to run make install-nltk-models to download the NLTK pre-reqs in unstructured. We can make this easier by downloading those models if they're not available when they're called in the relevant modules.

Tests pass `label-studio-sdk==0.0.17` but not with `label-studio-sdk==0.0.18`

Summary

Currently tests pass with label-studio-sdk==0.0.17 but not with label-studio-sdk==0.0.18. The goal of this issue is to figure out why and implement a fix.

Steps to reproduce

Run PYTHONPATH=. pytest test_unstructured/staging/test_label_studio.py. This will result in the following error message.

self = <vcr.patch.VCRRequestsHTTPConnectiontest_unstructured/vcr_fixtures/cassettes/label_studio_upload.yaml object at 0x1066f7160>
_ = False, kwargs = {'buffering': True}

    def getresponse(self, _=False, **kwargs):
        """Retrieve the response"""
        # Check to see if the cassette has a response for this request. If so,
        # then return it
        if self.cassette.can_play_response_for(self._vcr_request):
            log.info("Playing response for {} from cassette".format(self._vcr_request))
            response = self.cassette.play_response(self._vcr_request)
            return VCRHTTPResponse(response)
        else:
            if self.cassette.write_protected and self.cassette.filter_request(self._vcr_request):
>               raise CannotOverwriteExistingCassetteException(
                    cassette=self.cassette, failed_request=self._vcr_request
                )
E               vcr.errors.CannotOverwriteExistingCassetteException: Can't overwrite existing cassette ('test_unstructured/vcr_fixtures/cassettes/label_studio_upload.yaml') in your current record mode (<RecordMode.ONCE: 'once'>).
E               No match for the request (<Request (GET) http://localhost:8080/api/projects/95>) was found.
E               Found 1 similar requests with 0 different matcher(s) :
E
E               1 - (<Request (GET) http://localhost:8080/api/projects/95>).
E               Matchers succeeded : ['method', 'scheme', 'host', 'port', 'path', 'query']
E               Matchers failed :

../../.pyenv/versions/unstructured/lib/python3.8/site-packages/vcr/stubs/__init__.py:231: CannotOverwriteExistingCassetteException
-------------------------------------------- Captured log call --------------------------------------------
DEBUG    urllib3.util.retry:retry.py:351 Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG    urllib3.util.retry:retry.py:351 Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG    urllib3.connectionpool:connectionpool.py:228 Starting new HTTP connection (1): localhost:8080
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /health HTTP/1.1" 200 None
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /api/projects?page_size=10000000 HTTP/1.1" 200 None
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /api/projects/95 HTTP/1.1" 200 None
========================================= short test summary info =========================================
FAILED test_unstructured/staging/test_label_studio.py::test_upload_label_studio_data_with_sdk - vcr.errors.CannotOverwriteExistingCassetteException: Can't overwrite existing cassette ('test_unstruct...
====================================== 1 failed, 17 passed in 0.27s =======================================

References

kevin1024/vcrpy#533

Optional encoding kwarg for `partition_{html, email, text}`

The goal of this issue is to add an optional encoding kwarg that allows users to pass in the encoding to bricks that process plain text files (partition_html, partition_email, and partition_text). Currently, if you want to use a specific encoding, you need to read in the file yourself and pass it in as text, as shown below.

from unstructured.partition.html import partition_html

# Read the file yourself so that you control the encoding.
with open("example-docs/example-10k.html", "r", encoding="utf-8") as f:
    text = f.read()
elements = partition_html(text=text)
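
The proposed kwarg could be threaded through a small reading helper like the one sketched here. The helper name read_document_text and its signature are hypothetical; the sketch only shows how an encoding argument with a UTF-8 default might sit alongside the existing filename and text inputs.

```python
from typing import Optional


def read_document_text(
    filename: Optional[str] = None,
    text: Optional[str] = None,
    encoding: str = "utf-8",
) -> str:
    """Return the document text, reading the file with the given encoding."""
    if (filename is None) == (text is None):
        raise ValueError("Exactly one of filename or text must be specified.")
    if text is not None:
        return text
    with open(filename, "r", encoding=encoding) as f:
        return f.read()
```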

Create a Data Connector for Storj DCS (Decentralized Cloud Storage)

Create a data connector that pulls data from Storj DCS

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process a single distributed object or a bucket.
The connector is able to process the entire bucket.
The connector ingests a Storj Access Grant.
For now, it is OK to process only doc types that partition.auto() is capable of processing.

Lists from LayoutParser should be split into `ListItem`s

Summary

Currently if you run code such as:

from unstructured.partition.pdf import partition_pdf

filename = "example-docs/layout-parser-paper.pdf"
elements = partition_pdf(filename=filename)

List elements are a single Text blob that looks like:

'1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3) 2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the oﬀ-the-shelf usage 3. Comprehensive tools for eﬃcient document image data annotation and model tuning to support diﬀerent levels of customization 4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'

We should split up the list so each element in the list is its own ListItem element.
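
A simple way to approximate this split is a regular expression that breaks the text wherever whitespace is followed by a marker such as "2. ". The function name split_numbered_list is hypothetical, and the pattern is a heuristic: a sentence that happens to contain a number followed by a period and a space would also be split.

```python
import re
from typing import List

# Whitespace that is immediately followed by a marker such as "2. ".
LIST_MARKER_BOUNDARY = re.compile(r"\s+(?=\d+\.\s)")


def split_numbered_list(text: str) -> List[str]:
    """Split a flattened numbered list into one string per item."""
    return [item for item in LIST_MARKER_BOUNDARY.split(text.strip()) if item]


items = split_numbered_list(
    "1. A toolkit (Section 3) 2. A model zoo 3. Annotation tools"
)
```

Each resulting string could then be wrapped in its own ListItem element.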

Install `python-magic-bin` instead of `python-magic` for Windows

Currently Windows users have difficulty with file detection because Windows needs python-magic-bin installed instead of python-magic. The goal of this issue is to see if we can install python-magic-bin instead of python-magic if the user's OS is Windows.

See this comment for details.

References:

https://github.com/ahupp/python-magic#windows
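
One possible approach is to use environment markers in the dependency list so that pip picks the right package per platform. The excerpt below is a hypothetical setup.py fragment, not the project's actual packaging configuration.

```python
# Hypothetical excerpt from a setup.py; the surrounding setup() call is omitted.
install_requires = [
    'python-magic; platform_system != "Windows"',
    'python-magic-bin; platform_system == "Windows"',
]
```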

Create a data connector for Github repos Part 1 (all supported filetypes but no source code)

Create a data connector that pulls data from a GitHub repo.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process a single repository and, recursively, all the documents in a repo.
Should ignore code.
Should process only doc types supported in partition.auto, including markdown files.

Bonus points: the ability to filter by document type.

partition_html incorrect encoding

Describe the bug
The partition_html function incorrectly handles Cyrillic (and probably other non-Latin) text.
This happens when an html document does not have the correct encoding specified in a meta tag.

To Reproduce

from unstructured.partition.html import partition_html
text = '<html><body><p>Привет</p></body></html>'
parts = partition_html(text=text)
print(parts[0].text)

Prints Ð\x9fÑ\x80Ð¸Ð²ÐµÑ\x82

Expected behavior

from unstructured.partition.html import partition_html
text = '<html><body><p>Привет</p></body></html>'
parts = partition_html(text=text)
print(parts[0].text)

Prints Привет

Desktop (please complete the following information):

OS: mac
Python version: 3.9.9

Additional context
As a possible solution, I would suggest adding the ability to provide your own parser to the partition_html function.
This way we could use lxml.etree.HTMLParser initialized with the desired encoding argument.
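
The garbled output shown above is consistent with UTF-8 bytes being decoded as Latin-1, although the report does not confirm that this is what happens inside partition_html. The following standard-library sketch reproduces the same string and shows that reversing the two steps recovers the original text.

```python
original = "Привет"

# Encode as UTF-8, then decode the same bytes as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the two steps recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
```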

Create a data connector for Apache Kafka

Create a data connector that:

Fetches data from Apache Kafka.
Stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process entries from Apache Kafka using the API.
The connector is able to process a single entry.
The connector is able to process several entries, with an option to process them recursively.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

Updates the auto.partition() function to recognize Unstructured json

The auto partition() function should recognize json-serialized Unstructured ISD documents, which are likely recognized as .TXT files (needs to be confirmed). See recalibrating-risk-report.pdf.json for example json, but note that different json elements may have different metadata fields defined, whereas the top-level fields type and text are always defined.

In this case, the staging brick https://unstructured-io.github.io/unstructured/bricks.html#isd-to-elements should be used to instantiate the Unstructured elements after loading the json.

Motivation

Applications downstream of Unstructured would benefit from processing a mix of already-processed structured data and unstructured data with no additional code or config changes.

Example

from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_from_json

# write json output
elements = partition("example-docs/fake-text.txt")
elements_to_json(elements, filename="fake-text.json", indent=2)

After the serialization step above, fake-text.json looks like:

$ head -12 fake-text.json 
[
  {
    "element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
    "coordinates": null,
    "text": "This is a test document to use for unit tests.",
    "type": "NarrativeText",
    "metadata": {
      "filename": "example-docs/fake-text.txt"
    }
  },
  {
    "element_id": "a9d4657034aa3fdb5177f1325e912362",

Which may be converted back to elements:

# Verify that elements_from_json is the inverse operation:
elements2 = elements_from_json(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements2[i]

The thing to add in this issue is to update the auto partition function to also detect json structured elements:

elements3 = partition(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements3[i]

In addition, the partition() function should still be able to construct elements if the element_id or coordinates fields are missing, or any metadata fields are missing.

Definition of Done

auto partition() function successfully loads ISD json.
Many permutations of serialized ISD json are tested, i.e. across different types with different metadata schemas.
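
Detection could start from a check like the one sketched below, which relies only on the statement above that the top-level fields type and text are always defined. The function name is_isd_json is hypothetical, and how such a check would be wired into filetype detection is left open.

```python
import json


def is_isd_json(text: str) -> bool:
    """Return True if text parses as a JSON list of ISD-like elements.

    Only the "type" and "text" keys are required; element_id, coordinates,
    and metadata may be absent. An empty list is accepted.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list):
        return False
    return all(
        isinstance(item, dict) and "type" in item and "text" in item
        for item in data
    )
```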

Get rid of `UserWarning` on the first call to `partition`

Currently, we get a UserWarning the first time you call partition_pdf or partition_image. The goal of this issue is to add handling so that this UserWarning no longer appears.

from unstructured.partition.auto import partition

partition("example-docs/layout-parser-paper-fast.pdf")

/home/ec2-user/anaconda3/envs/unstructured/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
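
One option is to filter this specific warning around the call that triggers it, using the standard warnings module. In the sketch below, suppress_meshgrid_warning is a hypothetical helper and noisy_call is a stand-in that emits the same message, so the example runs without torch installed.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def suppress_meshgrid_warning():
    """Temporarily ignore the torch.meshgrid UserWarning."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message=r"torch\.meshgrid", category=UserWarning
        )
        yield


def noisy_call() -> str:
    # Stand-in for the model call that triggers the warning.
    warnings.warn("torch.meshgrid: in an upcoming release, ...", UserWarning)
    return "ok"


with suppress_meshgrid_warning():
    result = noisy_call()
```

Filtering by message keeps other UserWarnings visible, which is usually safer than ignoring the whole category.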

Create a data connector for Google Drive

Create a data connector that pulls documents from Google Drive, stores them locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process a single document.
The connector is able to process all documents in a Google Drive folder, recursively.
For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing. Google Drive documents should be converted to PDF or Word Doc for processing (unless there is a better way).
Bonus points: the ability to filter by document type.

File detection doesn't work properly when using `partition` in hot reload mode in `ipython`

Currently if you are using the partition brick in hot reload mode in ipython, an error related to file detection occurs if you make a change to partition.py.

Steps to reproduce

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

Make a change to partition.py and run the same code again. You'll get:

ValueError: Invalid file /Users/mrobinson/data/sitreps/downsampled-data/docx/01_SOCN_DOWNREP_06-01.docx. File type not support in partition.

But if you run:

from unstructured.file_utils.filetype import detect_filetype

detect_filetype(filename="example-docs/layout-parser-paper.pdf")

You'll see that the filetype is correct. This may be related to how Enums are handled during hot reloading. The workaround is to exit iPython and restart.

Create a data connector for Azure Blob Storage

Create a data connector that pulls data from Azure Blob Storage

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process a single object or a container.
The connector is able to process the entire container.
For private files, it requires access credentials.
For now, it is OK to process only doc types that partition.auto() is capable of processing.

Bonus points: the ability to filter by document type.

Add Colab Notebook

It would be nice to have a Colab notebook with the examples shown in the readme here. Happy to contribute this, since I already did it to play with the repo the other day.

Create a data connector for RSS Feeds

Create a data connector that:

Fetches data from RSS Feeds.
Stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process entries from RSS feeds.
The connector is able to process a single entry.
The connector is able to process several entries, with an option to process them recursively.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

Create a new data connector for Notion

Create a data connector that:

fetches data from Notion.
stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.
inspiration for processing is available here

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process entries from Notion using the API.
The connector is able to process a single entry.
The connector is able to process several entries, with an option to process them recursively.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

Add tests to get all unstructured modules above 90% coverage

Currently the email.py, pdf.py and email_elements.py files in https://github.com/Unstructured-IO/unstructured have below 90% coverage. The goal of this issue is to add additional tests for those modules to get the coverage above 90%.

Definition of Done

All modules in unstructured have >90% coverage.

References:

Unstructured-IO/community#43

`partition_pdf` should return `unstructured` document `Element` objects

Currently, if you run elements = partition_pdf("example-docs/layout-parser-paper.pdf", url=None) and look at elements[0], the type is LayoutElement, which is a type from unstructured-inference. Instead, we should return standard unstructured document elements for consistency with the other partition functions. As part of this, we may consider whether to include extra information such as coordinates as optional attributes in Element.

Add filetype check based on file extension if `libmagic` isn't available

Currently we use libmagic to determine the filetype for input files. However, magic requires the libmagic library, which not all users have available. The goal of this issue is to add a fallback that determines the filetype based on the file extension if libmagic is not available. If libmagic is available, detect_filetype should continue to use magic, as it does currently.
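
The extension-based fallback could be as simple as a lookup table. In this sketch the function name, the mapping, and the string labels are all assumptions; the real detect_filetype would presumably return whatever filetype values it already uses.

```python
import os
from typing import Optional

# Illustrative subset; the mapping and the string labels are assumptions.
EXTENSION_TO_FILETYPE = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".eml": "eml",
    ".txt": "txt",
}


def filetype_from_extension(filename: str) -> Optional[str]:
    """Guess the file type from the extension, or return None if unknown."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_FILETYPE.get(extension.lower())
```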

Enable users to translate extracted coordinates to PDF User Space Coordinates

The goal of this issue is to provide the ability for users to translate the coordinates attribute in document elements to PDF User Space coordinates when working with partition_pdf. This would enable users to more easily visualize document extractions when dealing with partition_pdf.
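
The issue does not say which coordinate system the coordinates attribute uses. Assuming, purely for illustration, pixel coordinates with a top-left origin taken from a page image rendered at a known DPI, the conversion to default PDF user space (72 units per inch, y measured upward from the bottom of the page) is a scale plus a vertical flip. The function name and the default DPI of 200 are hypothetical.

```python
from typing import Tuple


def pixel_to_pdf_user_space(
    x_px: float,
    y_px: float,
    page_height_pt: float,
    dpi: float = 200.0,
) -> Tuple[float, float]:
    """Convert a top-left-origin pixel coordinate to PDF user space points."""
    # Default PDF user space has 72 units per inch.
    x_pt = x_px * 72.0 / dpi
    # Flip the vertical axis: PDF user space measures y upward from the bottom.
    y_pt = page_height_pt - y_px * 72.0 / dpi
    return x_pt, y_pt
```

For a US Letter page (792 points tall) rendered at 200 DPI, the pixel (200, 400) lands one inch from the left edge and two inches below the top edge.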

Custom color schemes for the `sphinx` documentation

Currently, we’re using a default theme for our sphinx documentation. The goal of this issue is to update the sphinx documentation to match the Unstructured color scheme from unstructured.io website.

Create a data connector for processing social media sites

Create a data connector that:

fetches data from social media sites such as Twitter or Facebook.
stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.
indicate the social media site you are working on by commenting on the issue.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process posts using the API of the media site.
The connector is able to process a single post.
The connector is able to process several posts.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

Create a data connector for processing the biomedical literature

Create a data connector that:

fetches PDF files from PMC Open Access Subset.
stores the files locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process all files from the PMC Open Access Subset FTP directory.
The connector is able to process a document using the Individual article download.
The connector is able to process a document using the PDF download.
The connector can accept any credentials, if necessary.
The connector should be capable of processing the PDF documents through unstructured.partition.auto.

Add Python 3.9 and Python 3.10 to the CI test job

Currently, we're only testing against python3.8 in .github/workflows/ci.yml. The goal of this issue is to add python3.9 and python3.10 to make sure that we're compatible with later versions of Python.

Create a data connector for JIRA

Create a data connector that:

Fetches data from JIRA.
Stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process entries from JIRA using the API.
The connector is able to process a single entry.
The connector is able to process several entries, with an option to process them recursively.
The connector can accept any credentials, if necessary.
The connector should be capable of processing documents through unstructured.partition.auto.

`partition_html` is returning javascript code from some HTML documents

Currently, the partition_html function is returning javascript code in some html documents. The goal of this issue is to update our partitioning logic so that this javascript code doesn't come through in the example document.

Steps to reproduce

import requests
from unstructured.partition.html import partition_html

url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-december-13"
r = requests.get(url)
elements = partition_html(text=r.text)
print("\n\n".join([str(el) for el in elements[:5]]))

You should see the following javascript code in elements[1].text

'(function(d){\n  var js, id = \'facebook-jssdk\'; if (d.getElementById(id)) {return;}\n  js = d.createElement(\'script\'); js.id = id; js.async = true;\n  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";\n  d.getElementsByTagName(\'head\')[0].appendChild(js);\n}(document));'
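
To show the kind of filtering involved, here is a standard-library sketch that collects text while skipping anything inside script and style tags. It uses html.parser rather than the parser that partition_html relies on, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> List[str]:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```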

feat/write_elements

Hi,

Is there any way to write List[Element] data to a file and load it back later, to avoid partitioning the data each time?

Compatibility with Python 3.7

Related to #56: Colab's environment uses Python 3.7. Because the package uses typing.Final, which is only available on Python >= 3.8, the package is unfortunately unusable on Python 3.7.

Would it be possible to not use this typing feature so the package works on Python < 3.8?
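An alternative to removing the annotation would be a version-guarded import. The sketch below assumes the typing_extensions backport would be installed on Python 3.7; MAX_ELEMENTS is a hypothetical constant used only to show the annotation.

```python
import sys

if sys.version_info >= (3, 8):
    from typing import Final
else:
    # Assumption: the typing_extensions backport is installed on Python 3.7.
    from typing_extensions import Final

# Hypothetical constant, shown only to demonstrate the annotation.
MAX_ELEMENTS: Final = 100
```

The trade-off is an extra dependency on older interpreters in exchange for keeping the type annotation.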

`FigureCaption`, `Text`, and metadata get lost when you serialize and deserialize a list of elements

Describe the bug
As described in this comment, we currently lose FigureCaption, Text, and metadata when we serialize and deserialize a list of elements.

To Reproduce
Start with a list of elements that contains FigureCaption, Text, and metadata, then run the following. You'll see that some elements and their metadata are lost:

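import json

from unstructured.staging.base import convert_to_isd, isd_to_elements

# Round trip: write `elements` to JSON, then read them back
with open("elements.json", "w") as f:
    json.dump(convert_to_isd(elements), f)

with open("elements.json", "r") as f:
    elements = isd_to_elements(json.load(f))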

Create a data connector for Discord

Create a data connector that:

fetches data from one or more discord channels
stores the files locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process all files from Discord data loader API.
The connector can accept any credentials, if necessary.
For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing.
Bonus points: the ability to filter by document type.

Create a new data connector for Obsidian

Create a data connector that:

fetches data from Obsidian.
stores the content locally (at least temporarily for processing), and runs it through unstructured.partition.auto.
most files are Markdown, which unstructured.partition.auto already supports.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

The checklist has been completed.
The connector is able to process entries from Obsidian using the API.
The connector is able to process a single entry.
The connector is able to process several entries.
The connector can accept any credentials, if necessary.
The connector should be able to process documents from the Obsidian Vault through unstructured.partition.auto.

Add file information to document metadata

For several document types, we have helper functions that extract document metadata like "author" and "last modified date". For image files, we can also extract EXIF metadata when it's available. The goal of this issue is to include any information that would be of interest to downstream users in the unstructured metadata.

ISD dictionaries are not JSON serializable if the filename has a POSIX path

Currently if a filename has a path like "../../my-file.txt", the ISD dictionary is not JSON serializable. The goal of this issue is to only include the filename and not the full path in the metadata so that the ISD dictionary is JSON serializable.
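A minimal sketch of the failure and the proposed fix, assuming the problem comes from a pathlib path object ending up in the metadata (the issue does not spell this out). The metadata dictionaries here are hypothetical stand-ins for the ISD metadata.

```python
import json
from pathlib import Path

# Hypothetical metadata holding a path object instead of a plain string.
metadata = {"filename": Path("../../my-file.txt")}

try:
    json.dumps(metadata)
    serializable = True
except TypeError:
    # json cannot encode pathlib path objects.
    serializable = False

# Keeping only the final path component as a string avoids the error.
fixed = {"filename": Path("../../my-file.txt").name}
encoded = json.dumps(fixed)
```

Storing only the file name also avoids leaking local directory structure into the serialized output.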

More meaningful warnings for unknown filetypes in `detect_filetype`

Currently the detect_filetype function in unstructured.file_utils.filetype emits the following warning if it detects an unknown MIME type. This isn't especially helpful because you still don't know what the file type is. The goal of this issue is to update the warning to print out the file extension or filename so it's more obvious what filetype caused the issue.

MIME type was inode/x-empty. This file type is not currently supported in unstructured.
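As a rough sketch of what a more informative message could look like, the helper below appends the file name and extension to the existing text. unknown_filetype_message is a hypothetical function, not part of unstructured.file_utils.filetype, and the exact wording is an assumption.

```python
import os


def unknown_filetype_message(mime_type, filename=None):
    """Build a warning message for an unsupported MIME type.

    Hypothetical helper for illustration; the wording is an assumption.
    """
    message = (
        f"MIME type was {mime_type}. "
        "This file type is not currently supported in unstructured."
    )
    if filename:
        # Include the file name and extension so the culprit is easy to find.
        _, extension = os.path.splitext(filename)
        detail = f"extension {extension!r}" if extension else "no extension"
        message += f" The file was {os.path.basename(filename)!r} ({detail})."
    return message


example = unknown_filetype_message("inode/x-empty", "/tmp/empty.xyz")
```

The returned string could then be passed to the logger in place of the current message.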
