atbasu / document-content-extractor
Python program that uses OpenAI APIs to parse user-specified content from text files.
There was an error in processing this file:
Unexpected Exception: Traceback (most recent call last):
File "/Users/atbasu/Documents/document-content-extractor/content_extractor.py", line 463, in extract_content_async
prompts, results = loop.run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/content_extractor.py", line 406, in process_prompts
responses = await asyncio.gather(*tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/content_extractor.py", line 312, in process_chunk
response = await session.post(
^^^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/venv/lib/python3.11/site-packages/aiohttp/client.py", line 560, in _request
await resp.start(conn)
File "/Users/atbasu/Documents/document-content-extractor/venv/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 899, in start
message, payload = await protocol.read() # type: ignore[union-attr]
^^^^^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/venv/lib/python3.11/site-packages/aiohttp/streams.py", line 616, in read
await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
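One possible mitigation, as a minimal sketch: retry the request a few times with exponential backoff when the server drops the connection. The helper name `retry_async` and its parameters are made up for illustration and are not part of content_extractor.py; the sketch only assumes that the failing call can be wrapped in a zero-argument callable.

```python
import asyncio
import random


async def retry_async(make_call, retry_on, attempts=3, base_delay=0.5):
    """Await make_call() and retry on the given exception types.

    make_call: zero-argument callable returning a fresh awaitable each time.
    retry_on:  exception class (or tuple of classes) treated as transient.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff with a little jitter between attempts.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

In process_chunk this might be called as `await retry_async(lambda: session.post(url, json=payload), aiohttp.ServerDisconnectedError)`, where `url` and `payload` stand in for whatever the real call passes. Since the requests are fired together through asyncio.gather, limiting concurrency may also be worth trying, but that is a guess from the traceback alone.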
E.g. the fields "pocPhone" and "pocEmail" aren't being extracted even when the required field is set to True.
((venv) (base) atbasu@x86_64-apple-darwin13 document-content-extractor % python3.11 document_content_extractor.py
Enter name of file : 1.pdf
Enter Open AI api key: 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
Processing sample forms/1.pdf provided using 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 and parser configuration from closewise_parser_configurator.json
Field Name | Field Value
+----------------------------|--------------------------------------------------+
SignerName | Vincent Torres
Email | [email protected]
PhoneNumber | (803) 203-2710
IsTheSignerAForeignNational? | No
Language | N/A
Timezone | Eastern Daylight Time (EDT)
PropertyAddress | 1357 Shimmer Light Circle
AppointmentAddress | 1357 Shimmer Light Circle
AppointmentDateAndTime | 05/13/22 at 8:00 am (EDT)
FileNumber | SC22117033
OrderOnBehalfOf | BNT
SigningType | Notary confirmation
ProductType | Refi (E DOCS)
CompanyFee | $200
Client | Vincent Torres and Dianna Sager
AgentName | Carolina Attorney Networ
AgentFee | $200
WitnessNumber | 2
UploadFiles | No
InternalNotes | Please call the signer
ExternalNotes | Call the borrower or buyers ASAP and confirm the following information: TIME
InstructionType | Special Instructions
ScanBacksNeeded | Yes
PersonOfContactEmail | [email protected]
+----------------------------|--------------------------------------------------+
Are any of the values incorrect? (y/n): n
Traceback (most recent call last):
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 235, in <module>
main()
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 207, in main
f"{file_path.split('/')[-1]}_corrections_{run_id}.json",
^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'
(venv) (base) atbasu@x86_64-apple-darwin13 document-content-extractor %
This should not occur.
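The traceback shows `file_path` being None when the corrections file name is built. Why it is None is not visible from the log, so the following is only a defensive sketch; `corrections_filename` is a hypothetical helper, not a function from the repository.

```python
import os


def corrections_filename(file_path, run_id):
    """Build the corrections file name, or return None if no path is known."""
    if not file_path:
        return None  # caller decides whether to skip the write or abort
    # basename handles the platform's path separators, unlike split('/').
    return f"{os.path.basename(file_path)}_corrections_{run_id}.json"
```

Since the user answered "n" here, it may also make sense to skip writing a corrections file entirely when there are no corrections.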
Enter name of file : 1.pfg
Enter Open AI api key: sdfs
Processing sample forms/1.pfg provided using sdfs and parser configuration from closewise_parser_configurator.json
Traceback (most recent call last):
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 235, in <module>
main()
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 182, in main
result = upload_and_process_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 28, in upload_and_process_document
text = read_document(file_path)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/utils.py", line 9, in read_document
with open(file_path, 'rb') as file:
^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'sample forms/1.pfg'
This situation should be handled gracefully.
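A minimal sketch of one way to handle this: check that the file exists before any processing starts and print a readable message instead of a traceback. The helper name `resolve_input_file` is an assumption; the default `base_dir` is taken from the "sample forms/" prefix in the log above.

```python
import os


def resolve_input_file(name, base_dir="sample forms"):
    """Return the path to an existing input file, or None with a message."""
    path = os.path.join(base_dir, name)
    if not os.path.isfile(path):
        print(f"File not found: {path}. Please check the name and try again.")
        return None
    return path
```

The caller could loop on the "Enter name of file" prompt until this returns a path, so a typo like 1.pfg just leads to a new prompt.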
Traceback (most recent call last):
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 261, in <module>
main()
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 237, in main
metrics_file = write_metrics(
^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 66, in write_metrics
logger.info("Writing metrics to CSV")
^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'info'
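Here `logger` is None by the time write_metrics uses it. A small sketch of a fallback, assuming the logger is passed in as an argument or held in a variable that may never be set; `get_logger` and `write_metrics_header` are illustrative names only.

```python
import logging


def get_logger(logger=None, name="document_content_extractor"):
    """Return the given logger, or a named fallback when it is None."""
    return logger if logger is not None else logging.getLogger(name)


def write_metrics_header(logger=None):
    # Hypothetical stand-in for the first lines of write_metrics().
    logger = get_logger(logger)
    logger.info("Writing metrics to CSV")
    return logger
```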
Invalid input. Please enter either y, n.
Invalid input. Please en^C
Traceback (most recent call last):
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 260, in <module>
main()
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 227, in main
corrections = check_for_errors(json_result, dce_logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 118, in check_for_errors
print('Invalid input. Please enter either y, n.')
KeyboardInterrupt
This happens when we enter invalid input at the prompt:
'Are any of the values incorrect? (y/n): '
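The repeated "Invalid input" lines suggest the answer is read once and then re-checked in a loop, though that is a guess from the output. A sketch of a prompt that reads inside the loop and exits cleanly on Ctrl+C; `ask_yes_no` and its `read` parameter are hypothetical, not the existing check_for_errors code.

```python
def ask_yes_no(question, read=input):
    """Ask until the answer is y or n; return True for y, False for n.

    Returns None if the user cancels with Ctrl+C or closes stdin.
    """
    while True:
        try:
            # Read inside the loop so every invalid answer gets a new prompt.
            answer = read(question).strip().lower()
        except (KeyboardInterrupt, EOFError):
            print("\nCancelled.")
            return None
        if answer in ("y", "n"):
            return answer == "y"
        print("Invalid input. Please enter either y or n.")
```

The `read` parameter defaults to the built-in input and is there only so the prompt can be exercised without a terminal.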
Are any of the values incorrect? (y/n): y
Enter a coma separated list of fields that need to be corrected? POCPhone
What should the correct value be for POCPhone?N/A
Why is this the correct value POCPhone?no phone number for the POC is listed, (929) 270-5392 is the number of the borrower.
Are there any other fields that need to be corrected? (y/n): n
CRITICAL:utils:An error occurred while checking for any incorrect extractions: 'POCPhone'
Traceback (most recent call last):
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 302, in main
corrections, correction_prompt = check_for_errors(
^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 187, in check_for_errors
correction_prompt = generate_correction_prompt(corrections, result)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 127, in generate_correction_prompt
returned_values += f"{key} : {result[key]}\n"
~~~~~~^^^^^
KeyError: 'POCPhone'
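The KeyError looks like a case mismatch: the user typed POCPhone, while other output in this tracker shows the key as pocPhone. Assuming that is the cause, a case-insensitive lookup could be sketched like this; both function names are illustrative and the second is only a stand-in for the loop in generate_correction_prompt.

```python
def resolve_field_name(user_input, result):
    """Map what the user typed to an actual key in result, ignoring case."""
    wanted = user_input.strip().casefold()
    for key in result:
        if key.casefold() == wanted:
            return key
    return None


def returned_values_text(corrections, result):
    # Hypothetical stand-in for the loop in generate_correction_prompt().
    lines = []
    for name in corrections:
        key = resolve_field_name(name, result)
        value = result[key] if key is not None else "<not extracted>"
        lines.append(f"{name} : {value}")
    return "\n".join(lines) + "\n"
```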
For example, 28.pdf is read as the following text:
〷⼲㔯㈰㈲‱〺〴⁐協 Order Date:Order Number: 08-02476888-CExpedited Order
呉䵉体Ⱐ䥎䌮
㈳〱⁗⸠偌䅎传偋坙
STE 215
PLANO, TX 75075
Phone: (818) 706-6400
䱅塉乇呏丬⁓䌠ⴠ㈹〷㈭㤷㈹噥湤潲⁃桡牧攺 $350.00
噥湤潲⁎漺 〰〲〳㔶
噥湤潲⁎慭攺 䍁剏䱉乁⁁呔佒久夠久呗 ORK LLC
噥湤潲⁁ ddress: ㄱ㔠䑒䥆呗 住䐠䑒
Vendor Phone: (803)520-2048 Vendor Fax: (803)520-8041
䅳獩杮浥湴⁔祰攺 Reverse Closing
䅳獩杮敤⁄慴攺 〹⼲〯㈰㈲‱㌺ㄹ⁐協
Projected Closing Date: 〹⼲㈯㈰㈲‱㌺㌰
Home Phone: (803)646-1118
䅤摲敳猺 242 TRIPLE H FARM LANE, AIKEN, SC - 29803
HAYMANS\W ANDA A.
Home Phone: (803)646-1118䡁奍䅎卜䵉䱅匠䍕剔䥓
Borrower:Borrower:
偲潰敲瑹⁁ ddress:
County:㈴㈠呒䥐䱅⁈⁆䅒䴠䱎
䅉䭅丬⁓䌠ⴠ㈹㠰㌭㠲㔴
AIKEN
REFINANCE
Closing Location: 242 TRIPLE H FARM LANE, AIKEN, SC - 29803Purchase:
MUTUAL OF OMAHA MORTGAGE INC.
4429973
$451,000.00卵扳捲楢敲⁎慭攺
䱯慮⁎畭扥爺
䱯慮⁁ 浯畮琺
奯畲杲敥搠晥攠景爠瑨楳牤敲猠㔰⸰〮⁙潵畳琠捡汬湥映潵爠牥灲敳敮瑡瑩癥猠灲楯爠瑯潩湧湹 additional work that will
楮捲敡獥⁹ 潵爠晥攮⁕湡畴桯物穥搠睯牫⁷楬氠湯琠扥⁰慩搠景爮
䅤ditional Fee: Reason: 䅰灲潶敤⁂示Please Email: [email protected] or Text 803-646-1118 for communication she does
not receive phone calls at her house. **** Reverse Loan Signing****
PLEASE CONTACT THE BORROWER(S) TO CONFIRM THE SIGNING TIME AND LOCATION. If you are in
a state with Witness requirements please confirm with borrowers that they will have the
湥捥獳慲礠睩瑮敳獥猠灲敳敮琮
䉵獩湥獳潵牳㨠‸慭⁴漠㔺〰灭⁃協
䙏删䅎夠兕䕓呉低区
Arwen Hebert 972-384-4639 / After Hours 214-907-5249 ** If no answer you can try
瑥硴楮朠慳⁷敬氮
䅭祥⁃潲牡摯‹㜲ⴳ㠴ⴴ㘲㤠⼠䅦瑥爠䡯畲猠㐶㤭㤹㔭㐲㘵
䑡牲敬氠坩湲潷‹㜲ⴳ㠴ⴴ㘱㠠⼠䅦瑥爠䡯畲猠㠱㠭㌰㤭㔰ㄴ
Jordan Tomenga 972-384-4622 / After Hours 972- 523-9746COMMENTS:
1COMMENTS:
偬敡獥敡搠慬氠楮獴牵捴楯湳慲敦 ully before completing the signing appointm ent.
Due to the COVID- 19 pandemic, Timios w ill REQUIRE the NSA 䡥慬瑨⁓捲敥湩湧
Form as w ell as the Borrow 敲⁈敡汴栠卣牥敮楮朠䙯牭⁴漠扥楬汥搠潵琠捯浰汥瑥汹 .
䅬so, w 攠慲攠慳歩湧畲潴慲楥猠瑯⁰牯瑥捴⁴桥浳敬癥猠慮搠潵爠扯牲潷 敲猠批
睥aring gloves and masks to all Face to Face closings
AS A剅䵉乄䕒Ⱐ呈䔠坏剄ₑ乏呁 剙⁐啂䱉䎒⁍啓吠䉅⁗剉呔䕎⁏啔⁁ FTER THE NA 䵅
低⁕十 偁TRIOT A CT FORM, NO EXCEPTIONS
䉯牲潷敲⡳⤠獨潵汤攠捯湴慣瑥搠瑯潮晩牭灰潩湴浥湴慴攬⁴業攠慮搠汯捡瑩潮⸠
All docum ents should be printed and returned in the order they are attached. Do not shuff le pages in the package.
䑯捵浥湴猠獨潵汤潴攠捵瑯晦 ⸠⁍潳琠晩汥猠慲攠異汯慤敤猠浩硥搠潲楧楮慬猬†睨敮渠摯畢琠灲楮琠潮敧慬楺攠灡灥爮⁙潵
will be contacted by a Timios representative when documents are ready to be printed.
䅬氠獩杮敤潣畭敮瑳畳琠扥牯灰敤⁷楴桩渠㐠桯畲猠潦⁴桥汯獩湧⁵獩湧⁴桥潵物敲慢敬瑴慣桥搮†䅦瑥爠桯畲猠捬潳楮杳
浵獴攠摲潰灥搠批 湯潮⁴桥數琠扵獩湥獳慹⸠⁁晴敲潣畭敮瑳慶攠扥敮牯灰敤Ⱐ⁰汥慳攠汯杩渠瑯⁷睷業楯献捯洠瑯
confirm the signing has been completed.
⨪⩐汥慳攠畳攠愠䙅䐠䕘⽕偓⁓桩灰楮朠䕮癥汯灥⁷桥渠牥瑵牮楮朠摯捵浥湴猪⨪
Failure to comply 睩瑨⁴桥湳瑲畣瑩潮猠慢潶攠捯畬搠牥獵汴渠愠牥摵捴楯渠潲⁷慩癥爠潦⁹ 潵爠獩杮楮朠晥攮†
䍬潳楮朠噥湤潲⁌潧楮⁉湳瑲畣瑩潮猺
ⴠ䝯⁴漠睷眮瑩浩潳潭湤漠瑯⁁捣敳猠奯畲⁁捣潵湴
ⴠ啳敲⁉䐠楳⁹潵爠癥湤潲畭扥爠⡄传乏吠䕎呅删偒䕃䕄䥎䜠婅剏匩
Please direct all billing and pay ment inquiries to accountspay [email protected]
Do not change the name of the fields.
But there are no foreign characters in the PDF.
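The odd characters look like pairs of ASCII bytes that were decoded as single 16-bit characters. For instance, the character in front of "ditional Fee" would be U+4164 under that reading, and 0x41 0x64 is "Ad". Assuming that diagnosis is right, a post-processing sketch can split such characters back into two; `unpack_byte_pairs` is a made-up name, and the proper fix probably belongs in the PDF text extraction step rather than here.

```python
def unpack_byte_pairs(text):
    """Expand characters that look like two ASCII bytes fused into one.

    A character is expanded only if its code point is above 0xFF and both
    of its bytes are printable ASCII (0x20-0x7E); everything else is kept.
    """
    out = []
    for ch in text:
        code = ord(ch)
        high, low = code >> 8, code & 0xFF
        if 0xFF < code <= 0xFFFF and 0x20 <= high <= 0x7E and 0x20 <= low <= 0x7E:
            out.append(chr(high) + chr(low))
        else:
            out.append(ch)
    return "".join(out)
```

This would also mangle genuine CJK text, so it only makes sense for documents that are expected to be plain English.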
Getting the following error while running the command:
Processing sample forms/01-01711646.pdf provided using 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 and parser configuration from closewise_parser_configurator.json
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 980, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1112, in create_connection
transport, protocol = await self._create_connection_transport(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1145, in _create_connection_transport
await waiter
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/sslproto.py", line 574, in _on_handshake_complete
raise handshake_exc
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/sslproto.py", line 556, in _do_handshake
self._sslobj.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 979, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/akhileshthapliyal/document-content-extractor/document-content-extractor.py", line 244, in <module>
main()
File "/Users/akhileshthapliyal/document-content-extractor/document-content-extractor.py", line 207, in main
file_path, api_key, json_result, config, usage, result_file = upload_and_process_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/akhileshthapliyal/document-content-extractor/document-content-extractor.py", line 60, in upload_and_process_document
result = extract_content_async(
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/akhileshthapliyal/document-content-extractor/content_extractor.py", line 294, in extract_content_async
prompts, results = loop.run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/akhileshthapliyal/document-content-extractor/content_extractor.py", line 211, in process_prompts
responses = await asyncio.gather(*tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/akhileshthapliyal/document-content-extractor/content_extractor.py", line 74, in process_chunk
response = await session.post(
^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/client.py", line 536, in _request
conn = await self._connector.connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 540, in connect
proto = await self._create_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 901, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
raise last_exc
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 982, in _wrap_create_connection
raise ClientConnectorCertificateError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host api.openai.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)')]
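"unable to get local issuer certificate" usually means Python cannot find a CA bundle. The paths above point to a python.org framework install on macOS, where running the bundled "Install Certificates.command" script is the usual remedy. As an alternative sketch, the client can be given an explicit SSL context; `build_ssl_context` is a hypothetical helper and certifi is an optional third-party package that may not be installed.

```python
import ssl


def build_ssl_context():
    """Return a verifying SSL context, preferring certifi's CA bundle.

    certifi is optional; if it is not installed, the system default
    trust store is used instead.
    """
    try:
        import certifi
    except ImportError:
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())
```

With aiohttp the context can be passed as `aiohttp.TCPConnector(ssl=build_ssl_context())` and that connector handed to `aiohttp.ClientSession(connector=...)`. Certificate verification stays enabled either way.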
I get the following error:
Field Name | Field Value
+----------------------------|--------------------------------------------------+
borrowerName_1 | Eric Robert Matthews
borrowerEmail_1 |
borrowerCellPhone_1 | (310) 418-3470
borrowerName_2 |
borrowerEmail_2 |
borrowerCellPhone_2 |
isTheSignerAForeignNational? | No
language |
timezone |
propertyAddress_Line | 915 Via Colinas
propertyAddress_City | Westlake Village
propertyAddress_State | CA
propertyAddress_Zip | 92660
closingAddress_Line | 915 VIA COLINAS
closingAddress_City | WESTLAKE VILLAGE
closingAddress_State | CA
closingAddress_Zip | 91362
appointmentDateTime | 10/12/2022 10:00 AM
fileNumber | OPA-4747851
orderOnBehalfOf | DEANNA GOOD
signingType | DISBURSEMENT ONLY
closingType | Home Equity Loan
lender | Title365
titleCompany | Vendor Name
companyFee | $90.00
agentFee | N/A
witnessNumber | N/A
uploadFiles | N/A
isScanBackNeeded | N/A
pocName | DEANNA GOOD
pocPhone | (310) 418-3470
pocEmail | N/A
+----------------------------|--------------------------------------------------+
Are any of the values incorrect? (y/n): y
Enter a coma separated list of fields that need to be corrected? propertyAddress_Zip, titleCompany, pocPhone
propertyAddress_Zip is an invalid field_name. Please enter a valid field name.
titleCompany is an invalid field_name. Please enter a valid field name.
pocPhone is an invalid field_name. Please enter a valid field name.
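All three names appear in the table above and are still rejected, including the first one, which has no leading space. Whitespace alone does not explain it, so the check may be comparing against a different list of names; that is an assumption, since the validation code is not shown. A sketch that validates against the keys actually displayed to the user, with `parse_field_list` as an illustrative name:

```python
def parse_field_list(raw, valid_names):
    """Split a comma-separated answer into (matched, unknown) field names.

    Matching ignores surrounding whitespace and letter case; matched names
    are returned exactly as they appear in valid_names.
    """
    lookup = {name.casefold(): name for name in valid_names}
    matched, unknown = [], []
    for part in raw.split(","):
        token = part.strip()
        if not token:
            continue  # tolerate trailing or doubled commas
        name = lookup.get(token.casefold())
        if name is not None:
            matched.append(name)
        else:
            unknown.append(token)
    return matched, unknown
```

Passing the keys of the extracted result as `valid_names` keeps the accepted names in sync with what the table prints.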
When I don't pass the logging level, it logs everything.
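A sketch of one way to give the level a quieter default when none is passed. How the program currently reads the level is not shown, so `configure_logging` and the WARNING default are assumptions.

```python
import logging


def configure_logging(level=None, default="WARNING"):
    """Configure root logging, falling back to default when level is unset."""
    name = (level or default).upper()
    numeric = getattr(logging, name, None)
    if not isinstance(numeric, int):
        raise ValueError(f"Unknown logging level: {level!r}")
    # force=True (Python 3.8+) replaces any handlers configured earlier.
    logging.basicConfig(level=numeric, force=True)
    return numeric
```

Calling `configure_logging()` with no argument then limits output to warnings and above, while `configure_logging("debug")` restores the verbose behaviour.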