unstructured-io / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Home Page: https://www.unstructured.io/

License: Apache License 2.0

Makefile 0.07% HTML 87.58% XSLT 0.01% Python 10.75% Shell 0.86% Dockerfile 0.01% Rich Text Format 0.73%
deep-learning document-parsing machine-learning nlp ocr information-retrieval data-pipelines ml preprocessing pdf-to-text

unstructured's Issues

Document level language detection before processing

There is some rule-based logic in unstructured that is language dependent. For example, is_possible_narrative_text checks for verbs and the presence of valid words from a dictionary. However, these checks are specific to English-language documents. The goal of this issue is to add a document-level language check that determines the language of a document. We already use langdetect for this in translate_text.
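A minimal sketch of how such a document-level check might gate the English-only rules. The function names and the `detector` parameter are hypothetical and not part of unstructured; the default path assumes `langdetect.detect`, which returns ISO 639-1 codes such as "en".

```python
from typing import Callable, Optional


def detect_document_language(
    text: str, detector: Optional[Callable[[str], str]] = None
) -> str:
    """Return a language code for the whole document."""
    if detector is None:
        # Imported lazily so the sketch can be exercised with a stub detector.
        from langdetect import detect

        detector = detect
    return detector(text)


def should_apply_english_rules(
    text: str, detector: Optional[Callable[[str], str]] = None
) -> bool:
    """True when the English-specific checks are expected to be meaningful."""
    return detect_document_language(text, detector) == "en"
```

Passing a stub as `detector` keeps the gate testable without installing langdetect.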

Create a data connector for Substack

Create a data connector that:

  • fetches one or more Substack posts
  • stores them locally as HTML files (at least temporarily for processing)
  • processes them in a way specific to Substack, i.e. filters out Substack boilerplate elements after calling partition_html

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single document or multiple documents (multiple URLs may be passed to --substack-url in main.py).
  • Note that unlike the S3IngestDoc, the subclass of BaseIngestDoc created for this task (probably named SubstackInstanceDoc) would need to override the process_file method to handle additional cleaning logic. I.e., the 2nd bullet in the checklist.

Create a data connector for Sharepoint

Create a data connector that:

  • fetches data from Sharepoint.
  • stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process using the Sharepoint API [https://learn.microsoft.com/en-us/sharepoint/dev/sp-add-ins/get-to-know-the-sharepoint-rest-service].
  • The connector is able to process a single document.
  • The connector is able to process documents from a folder, with the option to process a folder recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Too many ListItems generated for some PDFs

Describe the bug
PDF parsing results in too many ListItem chunks.

To Reproduce
Run test-ingest script that currently exists in CI. https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/test-ingest.sh

Expected behavior
Many fewer chunks.

Additional context
Here is one example: https://github.com/Unstructured-IO/unstructured/blob/3c1b089/test_unstructured_ingest/expected-structured-output/s3-small-batch/small-pdf-set/2023-Jan-economic-outlook.pdf.json#L16 but there are more throughout that sample document.

Create a data connector for OneDrive

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Notion using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Create a data connector for Reddit

Create a data connector that:

  • fetches one or more Reddit messages from a Subreddit or Redditor
  • stores them locally as markdown files (at least temporarily for processing)
  • processes the files the standard way, since unstructured.partition.auto supports them.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

For inspiration on downloading Reddit content, see:
https://github.com/emptycrown/llama-hub/blob/main/loader_hub/reddit/base.py

Additional References

Definition of Done

  • The checklist has been completed.
  • The connector is able to process all documents for a subreddit or user, at least up to a large well defined limit
  • May constrain messages processed by date range

Add text categories as metadata

The goal of this issue is to add text categories as metadata to document elements to enable users to restrict search to specific sections (i.e. the experiments section in research papers).

`partition_email` outputs `UnicodeDecodeError` when trying to parse email with an image attachment

Currently there is a bug in partition_email that results in a UnicodeDecodeError when parsing emails that have an image attachment.

Steps to reproduce

Run the following from the root directory of the repo.

from unstructured.partition.email import partition_email

filename = "example-docs/fake-email-attachment.eml"
elements = partition_email(filename=filename)

The error should look like:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In [3], line 1
----> 1 elements = partition_email(filename=filename)

File ~/unstructured/unstructured/partition/email.py:207, in partition_email(filename, file, text, content_source, include_headers)
    205     for element in elements:
    206         if isinstance(element, Text):
--> 207             element.apply(replace_mime_encodings)
    208 elif content_source == "text/plain":
    209     elements = partition_text(text=content)

File ~/unstructured/unstructured/documents/elements.py:44, in Text.apply(self, *cleaners)
     42 cleaned_text = self.text
     43 for cleaner in cleaners:
---> 44     cleaned_text = cleaner(cleaned_text)
     46 if not isinstance(cleaned_text, str):
     47     raise ValueError("Cleaner produced a non-string output.")

File ~/unstructured/unstructured/cleaners/core.py:117, in replace_mime_encodings(text)
    110 def replace_mime_encodings(text: str) -> str:
    111     """Replaces MIME encodings with their UTF-8 equivalent characters.
    112 
    113     Example
    114     -------
    115     5 w=E2=80-99s -> 5 w’s
    116     """
--> 117     return quopri.decodestring(text.encode()).decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 6: invalid continuation byte
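The traceback shows the failure arising when the quoted-printable output is not valid UTF-8. One possible guard is sketched below; `replace_mime_encodings_safe` is a hypothetical name, and returning the input unchanged on failure is an assumption. The real fix may instead be to avoid running the cleaner on attachment payloads.

```python
import quopri


def replace_mime_encodings_safe(text: str) -> str:
    """Decode quoted-printable sequences, tolerating undecodable bytes."""
    decoded = quopri.decodestring(text.encode("utf-8"))
    try:
        return decoded.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical fallback: leave the text untouched rather than crash.
        return text
```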

Add email header information to metadata

The partition_email function has the ability to parse out header information from emails, like to:, from: and subject:. The goal of this issue is to add this to the unstructured metadata when partitioning emails.
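As a rough illustration, the headers can be read with the standard library's email parser. The function name and the metadata keys (`sent_to`, `sent_from`, `subject`) are hypothetical and are not the unstructured metadata schema.

```python
from email import message_from_string
from typing import Dict, Optional


def email_header_metadata(raw_email: str) -> Dict[str, Optional[str]]:
    """Collect a few headers into a metadata-style dictionary."""
    message = message_from_string(raw_email)
    return {
        "sent_to": message.get("To"),
        "sent_from": message.get("From"),
        "subject": message.get("Subject"),
    }
```

Headers that are absent come back as None, so downstream code can tell "missing" apart from "empty".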

feat/Add support for epub format

Is your feature request related to a problem? Please describe.
One of the leading open formats for books is epub. Thus, when running algorithms on books I often have to find a way to process them.

Describe the solution you'd like
An epub is essentially compressed HTML. As unstructured already supports HTML, it would be very straightforward to add support for epub.

Describe alternatives you've considered
I am currently using pandoc to convert my epub data to html. This is straightforward but adds a special case to my pipelines.
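To illustrate the "compressed HTML" point: an epub is a ZIP archive whose chapters are (X)HTML files, so the standard library can pull them out. This is only a sketch. `read_epub_html` is a hypothetical helper, it assumes UTF-8 content, and it ignores the reading order that the package manifest defines.

```python
import io
import zipfile
from typing import BinaryIO, List, Tuple, Union

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def read_epub_html(source: Union[str, BinaryIO]) -> List[Tuple[str, str]]:
    """Return (member name, HTML text) pairs for each HTML file in the archive."""
    documents = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(HTML_SUFFIXES):
                documents.append((name, archive.read(name).decode("utf-8")))
    return documents


# Build a tiny in-memory archive to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr(
        "OEBPS/chapter1.xhtml", "<html><body><p>Chapter one.</p></body></html>"
    )
buffer.seek(0)
chapters = read_epub_html(buffer)
```

Each extracted HTML string could then be handed to an HTML partitioner.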

Create a data connector for WHO data

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Notion using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

`ZeroDivisionError` on capitalization check if there are no tokens

Summary

The exceeds_cap_ratio function in text_type.py raises a ZeroDivisionError if there are no tokens. Instead, it should return False.

Reproduce

from unstructured.partition.text_type import exceeds_cap_ratio

exceeds_cap_ratio("")
ZeroDivisionError                         Traceback (most recent call last)
Cell In [9], line 1
----> 1 exceeds_cap_ratio("")

File ~/repos/unstructured/unstructured/partition/text_type.py:123, in exceeds_cap_ratio(text, threshold)
    121 tokens = word_tokenize(text)
    122 capitalized = sum([word.istitle() or word.isupper() for word in tokens])
--> 123 ratio = capitalized / len(tokens)
    124 return ratio > threshold

ZeroDivisionError: division by zero
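A minimal sketch of the guard described in the summary. To stay self-contained it uses `str.split` as a stand-in for `word_tokenize`, and the default `threshold` of 0.5 is an assumption, not the library's value.

```python
def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """Check whether the share of capitalized tokens exceeds the threshold."""
    tokens = text.split()  # stand-in for nltk's word_tokenize
    if not tokens:
        # No tokens means there is nothing to measure, so return False
        # instead of dividing by zero.
        return False
    capitalized = sum(word.istitle() or word.isupper() for word in tokens)
    return capitalized / len(tokens) > threshold
```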

Sync `detectron2` versions in docs

There are a few sets of detectron2 install instructions and make targets. Some of them reference different versions. The goal of this issue is to make them all the same so all of the docs are synced.

Create a data connector for Slack

Create a data connector that:

  • fetches posts from a specific Slack channel with the option to add a data filter.
  • stores the extracted data locally as text files (at least temporarily for processing)
  • processes them using unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to pull all conversations in a channel or take in a date range and extract conversations from those dates.
  • The connector takes in credentials for private channels.
  • The connector is able to process a single text file.
  • The connector is able to process all text files extracted from Slack, recursively.
  • For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing.
  • Bonus points: the ability to filter by document type.

Tests pass `label-studio-sdk==0.0.17` but not with `label-studio-sdk==0.0.18`

Summary

Currently tests pass with label-studio-sdk==0.0.17 but not with label-studio-sdk==0.0.18. The goal of this issue is to figure out why and implement a fix.

Steps to reproduce

Run PYTHONPATH=. pytest test_unstructured/staging/test_label_studio.py. This will result in the following error message.

self = <vcr.patch.VCRRequestsHTTPConnectiontest_unstructured/vcr_fixtures/cassettes/label_studio_upload.yaml object at 0x1066f7160>
_ = False, kwargs = {'buffering': True}

    def getresponse(self, _=False, **kwargs):
        """Retrieve the response"""
        # Check to see if the cassette has a response for this request. If so,
        # then return it
        if self.cassette.can_play_response_for(self._vcr_request):
            log.info("Playing response for {} from cassette".format(self._vcr_request))
            response = self.cassette.play_response(self._vcr_request)
            return VCRHTTPResponse(response)
        else:
            if self.cassette.write_protected and self.cassette.filter_request(self._vcr_request):
>               raise CannotOverwriteExistingCassetteException(
                    cassette=self.cassette, failed_request=self._vcr_request
                )
E               vcr.errors.CannotOverwriteExistingCassetteException: Can't overwrite existing cassette ('test_unstructured/vcr_fixtures/cassettes/label_studio_upload.yaml') in your current record mode (<RecordMode.ONCE: 'once'>).
E               No match for the request (<Request (GET) http://localhost:8080/api/projects/95>) was found.
E               Found 1 similar requests with 0 different matcher(s) :
E
E               1 - (<Request (GET) http://localhost:8080/api/projects/95>).
E               Matchers succeeded : ['method', 'scheme', 'host', 'port', 'path', 'query']
E               Matchers failed :

../../.pyenv/versions/unstructured/lib/python3.8/site-packages/vcr/stubs/__init__.py:231: CannotOverwriteExistingCassetteException
-------------------------------------------- Captured log call --------------------------------------------
DEBUG    urllib3.util.retry:retry.py:351 Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG    urllib3.util.retry:retry.py:351 Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG    urllib3.connectionpool:connectionpool.py:228 Starting new HTTP connection (1): localhost:8080
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /health HTTP/1.1" 200 None
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /api/projects?page_size=10000000 HTTP/1.1" 200 None
DEBUG    urllib3.connectionpool:connectionpool.py:456 http://localhost:8080 "GET /api/projects/95 HTTP/1.1" 200 None
========================================= short test summary info =========================================
FAILED test_unstructured/staging/test_label_studio.py::test_upload_label_studio_data_with_sdk - vcr.errors.CannotOverwriteExistingCassetteException: Can't overwrite existing cassette ('test_unstruct...
====================================== 1 failed, 17 passed in 0.27s =======================================

References

Optional encoding kwarg for `partition_{html, email, text}`

The goal of this issue is to add an optional encoding kwarg that allows users to pass in the encoding to bricks that process plain text files (partition_html, partition_email, and partition_text). Currently, if you want to use a specific encoding, you need to read in the file yourself and pass it in as text, as shown below.

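from unstructured.partition.html import partition_html

# Read the file with an explicit encoding, then pass the contents in as text.
with open("example-docs/example-10k.html", "r", encoding="utf-8") as f:
    text = f.read()
elements = partition_html(text=text)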
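One way the kwarg could be threaded through is a shared helper that the plain-text bricks call before parsing. This is a sketch under assumptions: `read_text` is a hypothetical name, and defaulting to UTF-8 is a design choice rather than the library's behavior.

```python
from typing import Optional


def read_text(
    filename: Optional[str] = None,
    text: Optional[str] = None,
    encoding: str = "utf-8",
) -> str:
    """Return document text, reading from disk with the requested encoding."""
    if (filename is None) == (text is None):
        raise ValueError("Exactly one of filename or text must be specified.")
    if text is not None:
        return text
    with open(filename, "r", encoding=encoding) as f:
        return f.read()
```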

Create a Data Connector for Storj DCS (Decentralized Cloud Storage)

Create a data connector that pulls data from Storj DCS

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single distributed object or a bucket.
  • The connector is able to process the entire bucket.
  • The connector ingests a Storj Access Grant.
  • For now, it is OK to process only doc types that partition.auto() is capable of processing.

Lists from LayoutParser should be split into `ListItem`s

Summary

Currently if you run code such as:

from unstructured.partition.pdf import partition_pdf

filename = "example-docs/layout-parser-paper.pdf"
elements = partition_pdf(filename=filename)

List elements are a single Text blob that looks like:

'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3) 2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage 3. Comprehensive tools for efficient document image data annotation and model tuning to support different levels of customization 4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'

We should split up the list so each element in the list is its own ListItem element.
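A rough sketch of the splitting step, assuming items are numbered as "1.", "2." and so on, as in the blob above. `split_numbered_list` is a hypothetical helper. The pattern also splits at any sentence that ends in a number (for example "see version 2. Then"), so a real implementation would need stricter checks.

```python
import re
from typing import List

# Split on whitespace that is immediately followed by "<digits>. ".
_ITEM_BOUNDARY = re.compile(r"\s+(?=\d+\.\s)")


def split_numbered_list(text: str) -> List[str]:
    """Split a flattened numbered list into one string per item."""
    return [part.strip() for part in _ITEM_BOUNDARY.split(text) if part.strip()]
```

Each returned string could then be wrapped in its own ListItem element.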

Create a data connector for GitHub repos Part 1 (all supported filetypes but no source code)

Create a data connector that pulls data from a GitHub repo.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single repository and, recursively, all documents in the repo.
  • Should ignore code.
  • Should process only doc types supported in partition.auto, including markdown files.

Bonus points: the ability to filter by document type.

partition_html incorrect encoding

Describe the bug
partition_html function incorrectly handles cyrillic (and probably other non-latin) text.
This happens when an HTML document does not have the correct encoding specified in a meta tag.

To Reproduce

from unstructured.partition.html import partition_html
text = '<html><body><p>Привет</p></body></html>'
parts = partition_html(text=text)
print(parts[0].text)

Prints Ð\x9fÑ\x80ивеÑ\x82

Expected behavior

from unstructured.partition.html import partition_html
text = '<html><body><p>Привет</p></body></html>'
parts = partition_html(text=text)
print(parts[0].text)

Prints Привет

Desktop (please complete the following information):

  • OS: mac
  • Python version: 3.9.9

Additional context
As a possible solution, I would suggest adding the ability to pass your own parser to the partition_html function.
This way we could use lxml.etree.HTMLParser initialized with the desired encoding argument.
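The garbled output is consistent with UTF-8 bytes being decoded as Latin-1. That diagnosis is an assumption about the cause, not something confirmed in the report. The sketch below reproduces the effect with the standard library and shows that it is reversible; `repair_latin1_mojibake` is a hypothetical helper.

```python
def repair_latin1_mojibake(garbled: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up by reversing the two steps."""
    return garbled.encode("latin-1").decode("utf-8")


original = "Привет"
# Simulate the suspected bug: UTF-8 bytes interpreted as Latin-1 characters.
garbled = original.encode("utf-8").decode("latin-1")
repaired = repair_latin1_mojibake(garbled)
```

Repairing after the fact only helps when the wrong decoding was Latin-1; supplying the correct encoding to the parser up front avoids the problem entirely.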

Create a data connector for Apache Kafka

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Notion using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Updates the auto.partition() function to recognize Unstructured json

The auto partition() function should recognize json-serialized Unstructured ISD documents, which are likely recognized as .TXT files (needs to be confirmed). See recalibrating-risk-report.pdf.json for example json, but note that different json elements may have different metadata fields defined, whereas the top-level fields type and text are always defined.

In this case, the staging brick https://unstructured-io.github.io/unstructured/bricks.html#isd-to-elements should be used to instantiate the Unstructured elements after loading the json.

Motivation

Applications downstream of Unstructured would benefit from processing a mix of already-processed structured data and unstructured data with no additional code or config changes.

Example

from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_from_json

# write json output
elements = partition("example-docs/fake-text.txt")
elements_to_json(elements, filename="fake-text.json", indent=2)

after the serialization step above, fake-text.json looks like:

$ head -12 fake-text.json 
[
  {
    "element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
    "coordinates": null,
    "text": "This is a test document to use for unit tests.",
    "type": "NarrativeText",
    "metadata": {
      "filename": "example-docs/fake-text.txt"
    }
  },
  {
    "element_id": "a9d4657034aa3fdb5177f1325e912362",

Which may be converted back to elements:

# Verify that elements_from_json is the inverse operation:
elements2 = elements_from_json(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements2[i]

The thing to add in this issue is to update the auto partition function to also detect json structured elements:

elements3 = partition(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements3[i]

In addition, the partition() function should still be able to construct elements if the element_id or coordinates fields are missing, or any metadata fields are missing.

Definition of Done

  • auto partition() function successfully loads ISD json.
  • Many permutations of serialized ISD json are tested, i.e. across different types with different metadata schemas.

Get rid of `UserWarning` on the first call to `partition`

Currently, we get a UserWarning the first time you call partition_pdf or partition_image. The goal of this issue is to add handling so that this UserWarning no longer appears.

from unstructured.partition.auto import partition

partition("example-docs/layout-parser-paper-fast.pdf")
/home/ec2-user/anaconda3/envs/unstructured/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
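One possible way to handle this is a scoped filter from the standard `warnings` module, sketched below. `suppress_meshgrid_warning` and `noisy_call` are hypothetical names, and matching on the message prefix is an assumption about how narrowly the filter should apply.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def suppress_meshgrid_warning():
    """Ignore only the torch.meshgrid UserWarning inside the block."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings(
            "ignore", message=r"torch\.meshgrid", category=UserWarning
        )
        yield


def noisy_call() -> str:
    """Stand-in for a model call that emits the warning."""
    warnings.warn("torch.meshgrid: in an upcoming release, ...", UserWarning)
    return "done"
```

Because the filter lives inside `catch_warnings`, other warnings raised elsewhere are unaffected once the block exits.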

Create a data connector for Google Drive

Create a data connector that pulls documents from Google Drive, stores them locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single document.
  • The connector is able to process all documents in a Google Drive folder, recursively.
  • For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing. Google Drive documents should be converted to PDF or Word Doc for processing (unless there is a better way).
  • Bonus points: the ability to filter by document type.

File detection doesn't work properly when using `partition` in hot reload mode in `ipython`

Currently, if you are using the partition brick in hot reload mode in ipython, an error related to file detection occurs if you make a change to partition.py.

Steps to reproduce

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper.pdf")

Make a change to partition.py and run the same code again. You'll get:

ValueError: Invalid file /Users/mrobinson/data/sitreps/downsampled-data/docx/01_SOCN_DOWNREP_06-01.docx. File type not support in partition.

But if you run:

from unstructured.file_utils.filetype import detect_filetype

detect_filetype("example-docs/layout-parser-paper.pdf")

You'll see that the filetype is correct. This may be related to how Enums are handled during hot reloading. The workaround is to exit iPython and restart.

Create a data connector for Azure Blob Storage

Create a data connector that pulls data from Azure Blob Storage

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process a single object or a container.
  • The connector is able to process the entire container.
  • For private files, the connector requires access credentials.
  • For now, it is OK to process only doc types that partition.auto() is capable of processing.

Bonus points: the ability to filter by document type.

Add Colab Notebook

It would be nice to have a Colab notebook with the examples shown in the readme here. Happy to contribute this, since I already did it to play with the repo the other day.

Create a data connector for RSS Feeds

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Notion using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Create a new data connector for Notion

Create a data connector that:

  • fetches data from Notion.
  • stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.
  • inspiration for processing is available here

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Notion using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

`partition_pdf` should return `unstructured` document `Element` objects

Currently, if you run elements = partition_pdf("example-docs/layout-parser-paper.pdf", url=None) and look at elements[0], the type is LayoutElement, which is a type from unstructured-inference. Instead, we should return standard unstructured document elements for consistency with the other partition functions. As part of this, we may consider whether to include extra information such as coordinates as optional attributes in Element.

Add filetype check based on file extension if `libmagic` isn't available

Currently we use libmagic to determine the filetype for input files. However, magic requires the libmagic library, which not all users have available. The goal of this issue is to add a fallback that determines the filetype based on the file extension if libmagic is not available. If libmagic is available, detect_filetype should continue to use magic, as it does currently.
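A sketch of the fallback, assuming the python-magic package (whose import fails with ImportError when the libmagic shared library is missing). The mapping is deliberately incomplete, and `mime_type_from_extension` and `detect_mime_type` are hypothetical names rather than the library's detect_filetype.

```python
from pathlib import Path
from typing import Optional

try:
    import magic  # python-magic; requires the libmagic shared library

    LIBMAGIC_AVAILABLE = True
except ImportError:
    LIBMAGIC_AVAILABLE = False

# Hypothetical, deliberately incomplete mapping.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".txt": "text/plain",
    ".eml": "message/rfc822",
}


def mime_type_from_extension(filename: str) -> Optional[str]:
    """Guess the MIME type from the file extension alone."""
    return EXTENSION_TO_MIME.get(Path(filename).suffix.lower())


def detect_mime_type(filename: str) -> Optional[str]:
    """Prefer libmagic when it is available; otherwise use the extension."""
    if LIBMAGIC_AVAILABLE:
        return magic.from_file(filename, mime=True)
    return mime_type_from_extension(filename)
```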

Create a data connector for processing social media sites

Create a data connector that:

  • fetches data from social media sites such as Twitter or Facebook.
  • stores the content locally (at least temporarily for processing), and runs them through unstructured.partition.auto.
  • indicate the social media site you are working on by commenting on the issue.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process posts using the API of the media site.
  • The connector is able to process a single post.
  • The connector is able to process several posts.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

Create a data connector for processing the biomedical literature

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

Add Python 3.9 and Python 3.10 to the CI test job

Currently, we're only testing against python3.8 in .github/workflows/ci.yml. The goal of this issue is to add python3.9 and python3.10 to make sure that we're compatible with later versions of Python.

Create a data connector for JIRA

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from JIRA using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries, with an option to process them recursively.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents through unstructured.partition.auto.

`partition_html` is returning javascript code from some HTML documents

Currently, the partition_html function is returning javascript code in some html documents. The goal of this issue is to update our partitioning logic so that this javascript code doesn't come through in the example document.

Steps to reproduce

import requests
from unstructured.partition.html import partition_html

url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-december-13"
r = requests.get(url)
elements = partition_html(text=r.text)
print("\n\n".join([str(el) for el in elements[:5]]))

You should see the following javascript code in elements[1].text

'(function(d){\n  var js, id = \'facebook-jssdk\'; if (d.getElementById(id)) {return;}\n  js = d.createElement(\'script\'); js.id = id; js.async = true;\n  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1";\n  d.getElementsByTagName(\'head\')[0].appendChild(js);\n}(document));'
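To illustrate the idea of excluding script content, here is a standard-library sketch that collects text while skipping anything inside script or style tags. It is not how partition_html is implemented; `VisibleTextExtractor` and `visible_text` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> List[str]:
    """Return the non-empty text chunks that are not script or style content."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```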

feat/write_elements

Hi,

Is there any way to write List[Element] data to a file and load it back later, to avoid partitioning the data each time?

Compatibility with Python 3.7

Related to #56: Colab's environment uses Python 3.7. Because the package uses typing.Final, which is only available on Python >= 3.8, the package is unfortunately unusable on Python 3.7.

Would it be possible to not use this typing feature so the package works on Python < 3.8?
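One commonly used alternative to dropping the annotation is a guarded import that falls back to the typing_extensions backport. This is a sketch, not the project's chosen fix; it assumes typing_extensions is installed on Python 3.7, and `DEFAULT_THRESHOLD` is a hypothetical constant.

```python
try:
    from typing import Final  # available on Python >= 3.8
except ImportError:
    # Backport package; an extra dependency on Python 3.7.
    from typing_extensions import Final

# Hypothetical constant, only here to show the annotation in use.
DEFAULT_THRESHOLD: Final = 0.5
```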

`FigureCaption`, `Text`, and metadata get lost when you serialize and deserialize a list of elements

Describe the bug
As described in this comment, we currently lose FigureCaption, Text, and metadata when we serialize and deserialize a list of elements.

To Reproduce
Start with a list of elements that contain FigureCaption, Text, and metadata then run the following. You'll see that some elements and their metadata are lost:

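import json

from unstructured.staging.base import convert_to_isd, isd_to_elements

# Serialize the elements to ISD json.
with open("elements.json", "w") as f:
    json.dump(convert_to_isd(elements), f)

# Load them back; some element types and metadata are lost at this step.
with open("elements.json", "r") as f:
    elements = isd_to_elements(json.load(f))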
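For comparison, a lossless round trip needs two things: the type name written on the way out and a type lookup on the way back in. The sketch below uses simplified stand-in classes, not the real unstructured elements, and `to_isd`, `from_isd`, and `TYPE_TO_CLASS` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Text:
    """Simplified stand-in for an element with text and metadata."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class NarrativeText(Text):
    """Stand-in subclass."""


@dataclass
class FigureCaption(Text):
    """Stand-in subclass."""


# Every serializable type must be registered, including the Text base class.
TYPE_TO_CLASS = {cls.__name__: cls for cls in (Text, NarrativeText, FigureCaption)}


def to_isd(elements: List[Text]) -> List[Dict[str, Any]]:
    """Write out the type name and metadata along with the text."""
    return [
        {"type": type(el).__name__, "text": el.text, "metadata": el.metadata}
        for el in elements
    ]


def from_isd(isd: List[Dict[str, Any]]) -> List[Text]:
    """Rebuild elements, falling back to Text for unknown type names."""
    elements = []
    for item in isd:
        cls = TYPE_TO_CLASS.get(item["type"], Text)
        elements.append(cls(text=item["text"], metadata=item.get("metadata", {})))
    return elements
```

Falling back to `Text` for unknown type names keeps such elements in the output instead of dropping them.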

Create a data connector for Discord

Create a data connector that:

  • fetches data from one or more discord channels
  • stores the files locally (at least temporarily for processing), and runs them through unstructured.partition.auto.

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process all files from Discord data loader API.
  • The connector can accept any credentials, if necessary.
  • For now, it is OK to process only doc types that unstructured.partition.auto is capable of processing.
  • Bonus points: the ability to filter by document type.

Create a new data connector for Obsidian

Create a data connector that:

See Adding Data Connectors for details on how to get started. Make sure to include a link to this issue when submitting a PR.

Definition of Done

  • The checklist has been completed.
  • The connector is able to process entries from Obsidian using the API.
  • The connector is able to process a single entry.
  • The connector is able to process several entries.
  • The connector can accept any credentials, if necessary.
  • The connector should be capable of processing documents from the Obsidian Vault through unstructured.partition.auto.

Add file information to document metadata

For several document types, we have helper functions that extract document metadata like "author" and "last modified date". For image files, we can also extract EXIF metadata when it's available. The goal of this issue is to include any information that would be of interest to downstream users in the unstructured metadata.

More meaningful warnings for unknown filetypes in `detect_filetype`

Currently the detect_filetype function in unstructured.file_utils.filetype emits the following warning if it detects an unknown MIME type. This isn't especially helpful because you still don't know what the file type is. The goal of this issue is to update the warning to print out the file extension or filename so it's more obvious what filetype caused the issue.

MIME type was inode/x-empty. This file type is not currently supported in unstructured.
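A sketch of what a more informative message could look like. The wording after the original sentence and the helper name `unsupported_filetype_message` are assumptions.

```python
import os
from typing import Optional


def unsupported_filetype_message(mime_type: str, filename: Optional[str] = None) -> str:
    """Build a warning that names the file and extension when they are known."""
    message = (
        f"MIME type was {mime_type}. "
        "This file type is not currently supported in unstructured."
    )
    if filename:
        extension = os.path.splitext(filename)[1] or "(no extension)"
        message += f" File: {filename}; extension: {extension}."
    return message
```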
