OCR - Azure AI Document Intelligence

This custom step uses the Azure AI Document Intelligence service to perform different types of OCR on files stored on the SAS file system.
👉 What is Azure AI Document Intelligence?

✨ Features

✅ Text Extraction (words / lines / paragraphs / pages / document)
✅ Form Extraction (key-value pairs)
✅ Query Extraction (extraction of specified keys)
✅ Table Extraction
✅ Local Container Support

💻 User Interface

Settings tab

Standalone mode Flow mode
Azure Settings tab

Azure connection tab
Advanced settings tab
About tab

👩‍💻 Usage

Note: This step works great with the Create Listings of Directory - CLOD custom step to create the input file-list based on a folder of documents.

📺 Tutorial (👇Click Thumbnail👇)

Feature Matrix

	File Formats			OCR Processing	
Extraction	PDF	Image¹	URL	Azure²	Local Container³ ⁴
Text	✅	✅	✅	✅	✅
Form	✅	✅	✅	✅	✅
Query	✅	✅	✅	✅	
Table	✅	✅	✅	✅	✅

¹ JPEG/JPG, PNG, BMP, TIFF, HEIF | ² API-Version 2023-10-31-preview (4.0) | ³ API-Version 2022-08-31 (3.0)
⁴ Only supports General Document Model / Container

Test data

Pro Tip: Take a photo with your smartphone, take a screenshot of a document, or export a PowerPoint slide as an image or PDF.

📋 Requirements

Tested on SAS Viya version Stable 2024.01

🐍 Python

Python 3

📦 Packages

numpy
pandas

🤖 Azure AI Document Intelligence Resource

To use this step, the endpoint and key of an Azure AI Document Intelligence resource are needed.
👉 Create a Document Intelligence resource

⚙️ Settings

General

Parameter	Required	Description
OCR Type	Yes	Defines the type of Optical Character Recognition (OCR) to use
Input Mode	Yes	Indicates if processing a list of files or a single file
Input Type	Yes	Indicates if local documents or document URLs are used as input
File Path	No*	The file path for processing a single file
Input Table	No†	The name of the table containing file paths/URLs for batch processing
Path Column	No†	The column in the input table that contains the file path/URL

* Required if Input Mode is set to "single".
† Required if Input Mode is set to "batch".
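
For batch mode, the input table only needs one column holding the file paths or URLs. The following is a minimal sketch of how such a list could be prepared in Python, as an alternative to the CLOD custom step. The function name `build_file_list` and the column name `file_path` are hypothetical; the extension set follows the Feature Matrix footnote.

```python
from pathlib import Path

# Extensions taken from the Feature Matrix (PDF plus the listed image formats).
SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".heif"}


def build_file_list(folder, path_column="file_path"):
    """Return one row (dict) per supported document found directly in `folder`."""
    paths = sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
    # `path_column` is the name you would later select as Path Column in the step.
    return [{path_column: path} for path in paths]
```

The rows can be turned into a table with `pandas.DataFrame(rows)` and then loaded into a SAS library as the Input Table.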

Text Settings

Parameter	Required	Description
Granularity	Yes	Defines the granularity of the text output, which also determines the available output columns ('role' only for paragraphs, 'confidence' only for words and pages). word: one row per word, includes a confidence value; line: one text line per row; paragraph: blocks of text, can include the 'role' of a given paragraph (heading, etc.); page: everything on one page; document: everything in the document.
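
To illustrate how the granularity changes the shape of the output, here is a rough sketch that flattens a result dictionary into rows. It assumes the raw result follows the `pages` / `words` / `lines` / `paragraphs` / `content` layout of the Document Intelligence analyze result; the function name, the output column names, and the sample data are hypothetical and not taken from the step's code. Page-level confidence is left out because the source does not say how it is derived.

```python
# Hypothetical, heavily shortened analyze result for a one-page document.
SAMPLE_RESULT = {
    "content": "Invoice\nTotal 42",
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "Invoice", "confidence": 0.99},
                {"content": "Total", "confidence": 0.98},
                {"content": "42", "confidence": 0.95},
            ],
            "lines": [{"content": "Invoice"}, {"content": "Total 42"}],
        }
    ],
    "paragraphs": [
        {"content": "Invoice", "role": "title"},
        {"content": "Total 42"},
    ],
}


def flatten_text(analyze_result, granularity):
    """Return one output row (dict) per word, line, paragraph, page or document."""
    pages = analyze_result.get("pages", [])
    if granularity == "word":
        return [
            {"page": p["pageNumber"], "text": w["content"], "confidence": w["confidence"]}
            for p in pages
            for w in p.get("words", [])
        ]
    if granularity == "line":
        return [
            {"page": p["pageNumber"], "text": line["content"]}
            for p in pages
            for line in p.get("lines", [])
        ]
    if granularity == "paragraph":
        # 'role' is only present for some paragraphs (e.g. titles, headings).
        return [
            {"text": para["content"], "role": para.get("role")}
            for para in analyze_result.get("paragraphs", [])
        ]
    if granularity == "page":
        return [
            {
                "page": p["pageNumber"],
                "text": "\n".join(line["content"] for line in p.get("lines", [])),
            }
            for p in pages
        ]
    if granularity == "document":
        return [{"text": analyze_result.get("content", "")}]
    raise ValueError(f"Unknown granularity: {granularity!r}")
```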

Query Settings

Parameter	Required	Description
Query Fields	Yes	List of keys that are used as queries in the extraction process.
Exclude Metadata	No	If set to 'yes', all meta information from the extraction will be ignored, and the output will only contain a column per key and a row per file.
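
The effect of Exclude Metadata can be pictured as a pivot from long to wide form. The sketch below assumes the extraction yields one row per file and key with hypothetical column names `file`, `key`, `value`, and `confidence`; the step's actual column names may differ.

```python
def pivot_query_results(rows, query_fields):
    """Reshape (file, key, value) rows into one row per file and one column per key."""
    wide = {}
    for row in rows:
        # Start every file with all requested keys set to None so columns stay aligned.
        record = wide.setdefault(
            row["file"], {"file": row["file"], **{key: None for key in query_fields}}
        )
        if row["key"] in query_fields:
            record[row["key"]] = row["value"]
    return list(wide.values())


# Hypothetical long-form output for two files and the keys "Total" and "Date".
long_rows = [
    {"file": "a.pdf", "key": "Total", "value": "42", "confidence": 0.97},
    {"file": "a.pdf", "key": "Date", "value": "2024-01-08", "confidence": 0.91},
    {"file": "b.pdf", "key": "Total", "value": "17", "confidence": 0.88},
]
wide_rows = pivot_query_results(long_rows, ["Total", "Date"])
```

Keys that were not found for a file stay empty (`None`), so every file still produces exactly one row.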

Table Settings

Parameter	Required	Description
Table Output Format	Yes	Defines the output format for table extraction. map: outputs (col_id, row_id, value) rows for later reconstruction; reference: outputs a row per table with a UUID as reference, with the table itself stored in the defined library; table: outputs one table through standard output, supports only one table and one file.
Table Output Library	No*	Defines the output library for extracted tables.
Select Tables	No†	Defines whether a single table per document is selected.
Table Selection Method	No	Defines the method used to select the table that is extracted per document. index: uses the index to select the extracted table; size: selects the table with the most cells.
Table Index	No‡	Table index to extract.

* Only available if Table Output Format is set to "reference".
† Defaults to true when Table Output Format is "table".
‡ Required if Table Selection Method is set to "index"
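
As an illustration of the "map" format and the "size" selection method, the sketch below works on table dictionaries shaped like the `tables` entries of the Document Intelligence analyze result (`cells` with `rowIndex`, `columnIndex`, `content`). That input layout and all function names are assumptions; only the `(col_id, row_id, value)` triple comes from the settings above.

```python
def select_largest_table(tables):
    """'size' method: pick the table with the most cells (first one wins on ties)."""
    return max(tables, key=lambda table: len(table["cells"]))


def table_to_map(table):
    """'map' format: one (col_id, row_id, value) row per cell."""
    return [
        {"col_id": cell["columnIndex"], "row_id": cell["rowIndex"], "value": cell["content"]}
        for cell in table["cells"]
    ]


def map_to_grid(map_rows):
    """Rebuild a row-major grid from 'map' rows; missing cells become empty strings."""
    if not map_rows:
        return []
    n_rows = max(r["row_id"] for r in map_rows) + 1
    n_cols = max(r["col_id"] for r in map_rows) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for r in map_rows:
        grid[r["row_id"]][r["col_id"]] = r["value"]
    return grid


# Hypothetical extraction result with two tables of different size.
tables = [
    {"cells": [{"rowIndex": 0, "columnIndex": 0, "content": "only cell"}]},
    {
        "cells": [
            {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
            {"rowIndex": 0, "columnIndex": 1, "content": "Price"},
            {"rowIndex": 1, "columnIndex": 0, "content": "Pen"},
            {"rowIndex": 1, "columnIndex": 1, "content": "2.50"},
        ]
    },
]
grid = map_to_grid(table_to_map(select_largest_table(tables)))
```

The first grid row can then serve as the header, for example with `pandas.DataFrame(grid[1:], columns=grid[0])`.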

🔐 Azure

Parameter	Required	Description
Endpoint URL	Yes	AI Document Intelligence Resource Endpoint
Key	Yes	Secret Key
Local Container	No	Whether or not to use a locally deployed Document Intelligence container. Please make sure to deploy the `General Document` container.
Container Endpoint	No*	URL and Port of the locally deployed container.

* Required if Local Container is set to True.

👉Where to find resource key and endpoint
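
To show how the endpoint, key, and local container settings fit together, here is a hedged sketch that only assembles an analyze request for a document URL without sending it. It assumes the REST routes `documentintelligence` (API version 2023-10-31-preview) and `formrecognizer` (API version 2022-08-31) with the `Ocp-Apim-Subscription-Key` header; the function name, the example endpoint, and the rule that query fields are rejected for containers (inferred from the Feature Matrix) are assumptions, not the step's actual implementation.

```python
from urllib.parse import urlencode

# (route, api-version) per target, following the Feature Matrix footnotes.
API_ROUTES = {
    "azure": ("documentintelligence", "2023-10-31-preview"),
    "container": ("formrecognizer", "2022-08-31"),
}


def build_analyze_request(endpoint, key, model_id, document_url,
                          use_container=False, query_fields=None):
    """Assemble URL, headers and JSON body for an analyze call (nothing is sent)."""
    route, api_version = API_ROUTES["container" if use_container else "azure"]
    params = {"api-version": api_version}
    if query_fields:
        if use_container:
            # Assumption: the Feature Matrix shows no check mark for Query on containers.
            raise ValueError("Query extraction is not available for local containers")
        params["features"] = "queryFields"
        params["queryFields"] = ",".join(query_fields)
    url = f"{endpoint.rstrip('/')}/{route}/documentModels/{model_id}:analyze?{urlencode(params)}"
    headers = {"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"}
    return {"url": url, "headers": headers, "json": {"urlSource": document_url}}


# Hypothetical endpoint and key; replace with the values of your own resource.
request = build_analyze_request(
    "https://my-resource.cognitiveservices.azure.com/",
    "<secret-key>",
    "prebuilt-layout",
    "https://example.com/invoice.pdf",
)
```

The returned dictionary could be passed to an HTTP client of your choice; the analyze call is asynchronous, so the result has to be polled afterwards.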

🧙‍♂️ Advanced

Parameter	Required	Description
Force Language	No	Option to force Document Intelligence to use only a specific language for OCR. Note: Languages are detected automatically by default.
Timeout†	No	How many seconds to wait for the OCR process to finish for a document before timing out.
Number of Retries	No	How many retry attempts are made before a document is skipped.
Seconds between retries	No	How many seconds to wait between retry attempts.
Number of Threads	No	How many Python threads will be used to process all files.
Save as JSON	No	Whether to save the raw output as JSON (one file per document)
Output Folder	No*	Folder for the JSON files.

† Note: Make sure to set this high enough if your documents are excessively large.
* Required if Save as JSON is set to true.
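
The interplay of retries, wait time, threads, and JSON output can be sketched as follows. This is an illustration under stated assumptions, not the step's code: `analyze` stands for any callable that runs OCR on one path and enforces the timeout itself, and all function and parameter names are hypothetical.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_with_retries(path, analyze, retries=3, wait_seconds=0.0):
    """Call analyze(path); retry on any error and return None if the document is skipped."""
    for attempt in range(retries + 1):  # one initial attempt plus `retries` retries
        try:
            return analyze(path)
        except Exception:
            if attempt < retries:
                time.sleep(wait_seconds)
    return None


def process_all(paths, analyze, threads=4, retries=3, wait_seconds=0.0):
    """Process all paths with a thread pool and return {path: result or None}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(
            lambda p: process_with_retries(p, analyze, retries, wait_seconds), paths
        )
        return dict(zip(paths, results))


def save_raw_json(result, source_path, output_folder):
    """Write one JSON file per document, named after the source file."""
    target = Path(output_folder) / (Path(source_path).stem + ".json")
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(result, handle, ensure_ascii=False, indent=2)
    return target
```

Threads suit this workload because most of the time is spent waiting on the OCR service rather than computing locally.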

📚 Documentation

📝 Change Log

Version 1.0 (08JAN2024)
- Initial version

sundareshsankaran / azure---extract-tables-via-ocr
