The auto partition() function should recognize json-serialized Unstructured ISD documents, which are likely detected as .TXT files (needs to be confirmed). See recalibrating-risk-report.pdf.json for example json, but note that different json elements may define different metadata fields, whereas the top-level fields type and text are always defined.
In this case, the staging brick https://unstructured-io.github.io/unstructured/bricks.html#isd-to-elements should be used to instantiate the Unstructured elements after loading the json.
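The detection step could be sketched roughly as follows. This uses only the standard library, and is_isd_json is a hypothetical helper name, not an existing Unstructured function. It assumes a file counts as ISD json when it parses as a non-empty list of objects that each carry the type and text keys; how partition() would actually hook this in is left open.

```python
import json
from typing import Any


def is_isd_json(raw: str) -> bool:
    """Return True if raw parses as a non-empty JSON list of objects
    that each have 'type' and 'text' keys.

    Hypothetical helper for illustration; not an Unstructured API.
    """
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError:
        # Plain text (or malformed json) is not ISD.
        return False
    # Assumption: an empty list is ambiguous, so it is not treated as ISD.
    if not isinstance(data, list) or not data:
        return False
    return all(
        isinstance(item, dict) and "type" in item and "text" in item
        for item in data
    )
```

If the check passes, the parsed list would presumably be handed to the isd_to_elements staging brick mentioned above.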
Motivation
Applications downstream of Unstructured would benefit from processing a mix of already-processed structured data and unstructured data with no additional code or config changes.
Example
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_from_json
# write json output
elements = partition("example-docs/fake-text.txt")
elements_to_json(elements, filename="fake-text.json", indent=2)
After the serialization step above, fake-text.json looks like:
$ head -12 fake-text.json
[
  {
    "element_id": "1df8eeb8be847c3a1a7411e3be3e0396",
    "coordinates": null,
    "text": "This is a test document to use for unit tests.",
    "type": "NarrativeText",
    "metadata": {
      "filename": "example-docs/fake-text.txt"
    }
  },
  {
    "element_id": "a9d4657034aa3fdb5177f1325e912362",
Which may be converted back to elements:
# Verify that elements_from_json is the inverse operation:
elements2 = elements_from_json(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements2[i]
The change requested in this issue is to update the auto partition() function so that it also detects json-serialized elements:
elements3 = partition(filename="fake-text.json")
for i in range(len(elements)):
    assert elements[i] == elements3[i]
In addition, the partition() function should still be able to construct elements if the element_id or coordinates fields are missing, or if any metadata fields are missing.
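A minimal sketch of that tolerance, assuming records are normalized before elements are built. The helper name fill_optional_fields and the chosen defaults (None for element_id and coordinates, an empty dict for metadata) are assumptions for illustration, not Unstructured behavior.

```python
from typing import Any, Dict


def fill_optional_fields(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an ISD record with missing optional fields defaulted.

    Hypothetical helper; the default values are assumptions.
    """
    # 'type' and 'text' are the only fields treated as required.
    for required in ("type", "text"):
        if required not in record:
            raise ValueError(f"ISD record is missing required field {required!r}")
    normalized = dict(record)  # shallow copy, so the input is not mutated
    normalized.setdefault("element_id", None)
    normalized.setdefault("coordinates", None)
    normalized.setdefault("metadata", {})
    return normalized
```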
Definition of Done
- auto partition() function successfully loads ISD json.
- Many permutations of serialized ISD json are tested, i.e. across different types with different metadata schemas.
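One way to produce such permutations for tests is sketched below. The generator name isd_permutations is hypothetical, and this version only varies the element type and which optional top-level fields are present; varying the keys inside metadata would need a further step.

```python
from itertools import combinations, product
from typing import Any, Dict, Iterator, List

# Fields that the issue says may be absent from a serialized record.
OPTIONAL_FIELDS = ("element_id", "coordinates", "metadata")


def isd_permutations(
    element_types: List[str], full_record: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    """Yield one record per (type, subset of optional fields dropped)."""
    drop_sets = [
        subset
        for size in range(len(OPTIONAL_FIELDS) + 1)
        for subset in combinations(OPTIONAL_FIELDS, size)
    ]
    for element_type, dropped in product(element_types, drop_sets):
        record = {k: v for k, v in full_record.items() if k not in dropped}
        record["type"] = element_type
        yield record
```

Each yielded record could be written to a json file and passed through partition() in a parametrized test.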