document-content-extractor's People

Contributors

atbasu, sanjaykrmallick

document-content-extractor's Issues

unhandled server disconnect error

There was an error in processing this file:
Unexpected Exception: Traceback (most recent call last):
  File "/Users/atbasu/Documents/document-content-extractor/content_extractor.py", line 463, in extract_content_async
    prompts, results = loop.run_until_complete(
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/content_extractor.py", line 406, in process_prompts
    responses = await asyncio.gather(*tasks)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/content_extractor.py", line 312, in process_chunk
    response = await session.post(
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/venv/lib/python3.11/site-packages/aiohttp/client.py", line 560, in _request
    await resp.start(conn)
  File "/Users/atbasu/Documents/document-content-extractor/venv/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 899, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/venv/lib/python3.11/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
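
One possible way to handle this is to retry the request a few times before giving up. The sketch below is not the project's code: `call_with_retries`, its parameters, and the backoff values are hypothetical, and it assumes the disconnect is transient.

```python
import asyncio
import random


async def call_with_retries(make_call, retry_on, attempts=3, base_delay=0.5):
    """Await make_call() and retry when it raises one of the retry_on exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff with a little jitter between attempts.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

In `process_chunk` this could wrap the existing call, for example `await call_with_retries(lambda: session.post(...), retry_on=(aiohttp.ServerDisconnectedError,))`, with the original arguments to `session.post` left unchanged.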

can't write corrections file

((venv) (base) atbasu@x86_64-apple-darwin13 document-content-extractor % python3.11 document_content_extractor.py
Enter name of file : 1.pdf
Enter Open AI api key: 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
Processing sample forms/1.pdf provided using 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 and parser configuration from closewise_parser_configurator.json
Field Name                   | Field Value
+----------------------------|--------------------------------------------------+
SignerName                   | Vincent Torres
Email                        | [email protected]
PhoneNumber                  | (803) 203-2710
IsTheSignerAForeignNational? | No
Language                     | N/A
Timezone                     | Eastern Daylight Time (EDT)
PropertyAddress              | 1357 Shimmer Light Circle
AppointmentAddress           | 1357 Shimmer Light Circle
AppointmentDateAndTime       | 05/13/22 at 8:00 am (EDT)
FileNumber                   | SC22117033
OrderOnBehalfOf              | BNT
SigningType                  | Notary confirmation
ProductType                  | Refi (E DOCS)
CompanyFee                   | $200
Client                       | Vincent Torres and Dianna Sager
AgentName                    | Carolina Attorney Networ
AgentFee                     | $200
WitnessNumber                | 2
UploadFiles                  | No
InternalNotes                | Please call the signer
ExternalNotes                | Call the borrower or buyers ASAP and confirm the following information: TIME
InstructionType              | Special Instructions
ScanBacksNeeded              | Yes
PersonOfContactEmail         | [email protected]
+----------------------------|--------------------------------------------------+
Are any of the values incorrect? (y/n): n
Traceback (most recent call last):
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 235, in <module>
    main()
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 207, in main
    f"{file_path.split('/')[-1]}_corrections_{run_id}.json",
       ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'
(venv) (base) atbasu@x86_64-apple-darwin13 document-content-extractor % 

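This should not occur: file_path is None at the point where the corrections file name is built, even though no corrections were requested.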
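
A minimal sketch of a guard around the file-name construction shown in the traceback. The helper name `corrections_filename` and the `unknown_document` fallback are hypothetical; the real fix may instead be to make sure `file_path` is never `None` at this point.

```python
import os


def corrections_filename(file_path, run_id):
    """Build '<document>_corrections_<run_id>.json', tolerating a missing path."""
    if not file_path:
        # Hypothetical fallback so a missing path cannot crash the run.
        base = "unknown_document"
    else:
        base = os.path.basename(file_path)
    return f"{base}_corrections_{run_id}.json"
```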

when file name doesn't exist program crashes

Enter name of file : 1.pfg
Enter Open AI api key: sdfs
Processing sample forms/1.pfg provided using sdfs and parser configuration from closewise_parser_configurator.json
Traceback (most recent call last):
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 235, in <module>
    main()
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 182, in main
    result = upload_and_process_document(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 28, in upload_and_process_document
    text = read_document(file_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/utils.py", line 9, in read_document
    with open(file_path, 'rb') as file:
         ^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'sample forms/1.pfg'

This situation should be handled gracefully.
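
One way to handle it gracefully is to check that the file exists before processing and to ask again if it does not. This is only a sketch: `resolve_document` and `prompt_for_document` are hypothetical helpers, and the `sample forms` default folder is taken from the output above.

```python
import os


def resolve_document(file_name, folder="sample forms"):
    """Return the path to file_name inside folder, or None if it does not exist."""
    path = os.path.join(folder, file_name)
    return path if os.path.isfile(path) else None


def prompt_for_document(input_fn=input, folder="sample forms", max_tries=3):
    """Ask for a file name until it resolves to an existing file."""
    for _ in range(max_tries):
        name = input_fn("Enter name of file : ").strip()
        path = resolve_document(name, folder)
        if path is not None:
            return path
        print(f"No such file: {os.path.join(folder, name)!r}. Please try again.")
    return None  # the caller can exit cleanly instead of crashing
```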

Error when writing metrics file

Traceback (most recent call last):
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 261, in <module>
    main()
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 237, in main
    metrics_file = write_metrics(
                   ^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 66, in write_metrics
    logger.info("Writing metrics to CSV")
    ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'info'
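
The traceback shows `logger` is `None` inside `write_metrics`. A small defensive sketch, assuming the logger is passed in as an argument or set up elsewhere; `ensure_logger` and the default logger name are hypothetical.

```python
import logging


def ensure_logger(logger=None, name="document_content_extractor"):
    """Return logger unchanged, or a named fallback logger when it is None."""
    return logger if logger is not None else logging.getLogger(name)
```

Calling `ensure_logger(logger).info("Writing metrics to CSV")` would then work whether or not a logger was supplied.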

invalid input when entering corrections creates infinite loop

Invalid input. Please enter either y, n.
Invalid input. Please en^C
Traceback (most recent call last):
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 260, in <module>
    main()
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 227, in main
    corrections = check_for_errors(json_result, dce_logger)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 118, in check_for_errors
    print('Invalid input. Please enter either y, n.')
KeyboardInterrupt

This happens when we enter an invalid answer to the question:
'Are any of the values incorrect? (y/n): '
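
A loop like this usually means the answer is read once and never re-read after an invalid entry, though that is a guess about `check_for_errors`. A sketch of a prompt that reads the answer again on every pass; `ask_yes_no`, `input_fn`, and `max_tries` are hypothetical names.

```python
def ask_yes_no(prompt, input_fn=input, max_tries=5):
    """Return True for 'y', False for 'n'; re-read the answer after each invalid entry."""
    for _ in range(max_tries):
        answer = input_fn(prompt).strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        print("Invalid input. Please enter either y or n.")
    raise ValueError("too many invalid answers")
```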

Reporting errors in extraction results in an exception

Are any of the values incorrect? (y/n): y
Enter a coma separated list of fields that need to be corrected? POCPhone
What should the correct value be for POCPhone?N/A
Why is this the correct value POCPhone?no phone number for the POC is listed, (929) 270-5392 is the number of the borrower.
Are there any other fields that need to be corrected? (y/n): n
CRITICAL:utils:An error occurred while checking for any incorrect extractions: 'POCPhone'

Traceback (most recent call last):
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 302, in main
    corrections, correction_prompt = check_for_errors(
                                     ^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 187, in check_for_errors
    correction_prompt = generate_correction_prompt(corrections, result)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/atbasu/Documents/document-content-extractor/document_content_extractor.py", line 127, in generate_correction_prompt
    returned_values += f"{key} : {result[key]}\n"
                                  ~~~~~~^^^^^
KeyError: 'POCPhone'
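
The `KeyError` comes from indexing `result` with a field name that is not among the extracted keys. A sketch of a tolerant lookup, assuming `corrections` is keyed by field name; `format_returned_values` and the `<not extracted>` placeholder are hypothetical.

```python
def format_returned_values(corrections, result, missing="<not extracted>"):
    """List the extracted value for each corrected field without raising KeyError."""
    lines = []
    for key in corrections:
        # dict.get avoids the KeyError when the field was never extracted.
        lines.append(f"{key} : {result.get(key, missing)}")
    return "\n".join(lines) + ("\n" if lines else "")
```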

Some PDF files are read with non-English characters.

For example, 28.pdf is read as the following text:
〷⼲㔯㈰㈲‱〺〴⁐協 Order Date:Order Number: 08-02476888-CExpedited Order
呉䵉体Ⱐ䥎䌮
㈳〱⁗⸠偌䅎传偋坙
STE 215
PLANO, TX 75075
Phone: (818) 706-6400
䱅塉乇呏丬⁓䌠ⴠ㈹〷㈭㤷㈹噥湤潲⁃桡牧攺 $350.00
噥湤潲⁎漺 〰〲〳㔶
噥湤潲⁎慭攺 䍁剏䱉乁⁁呔佒久夠久呗 ORK LLC
噥湤潲⁁ ddress: ㄱ㔠䑒䥆呗 住䐠䑒
Vendor Phone: (803)520-2048 Vendor Fax: (803)520-8041
䅳獩杮浥湴⁔祰攺 Reverse Closing
䅳獩杮敤⁄慴攺 〹⼲〯㈰㈲‱㌺ㄹ⁐協
Projected Closing Date: 〹⼲㈯㈰㈲‱㌺㌰
Home Phone: (803)646-1118
䅤摲敳猺 242 TRIPLE H FARM LANE, AIKEN, SC - 29803
HAYMANS\W ANDA A.
Home Phone: (803)646-1118䡁奍䅎卜䵉䱅匠䍕剔䥓
Borrower:Borrower:
偲潰敲瑹⁁ ddress:
County:㈴㈠呒䥐䱅⁈⁆䅒䴠䱎
䅉䭅丬⁓䌠ⴠ㈹㠰㌭㠲㔴
AIKEN
REFINANCE
Closing Location: 242 TRIPLE H FARM LANE, AIKEN, SC - 29803Purchase:
MUTUAL OF OMAHA MORTGAGE INC.
4429973
$451,000.00卵扳捲楢敲⁎慭攺
䱯慮⁎畭扥爺
䱯慮⁁ 浯畮琺
奯畲⁡杲敥搠晥攠景爠瑨楳牤敲⁩猠␳㔰⸰〮⁙潵畳琠捡汬湥映潵爠牥灲敳敮瑡瑩癥猠灲楯爠瑯⁤潩湧⁡湹 additional work that will
楮捲敡獥⁹ 潵爠晥攮⁕湡畴桯物穥搠睯牫⁷楬氠湯琠扥⁰慩搠景爮
䅤ditional Fee: Reason: 䅰灲潶敤⁂示Please Email: [email protected] or Text 803-646-1118 for communication she does
not receive phone calls at her house. **** Reverse Loan Signing****
PLEASE CONTACT THE BORROWER(S) TO CONFIRM THE SIGNING TIME AND LOCATION. If you are in
a state with Witness requirements please confirm with borrowers that they will have the
湥捥獳慲礠睩瑮敳獥猠灲敳敮琮
䉵獩湥獳⁨潵牳㨠‸慭⁴漠㔺〰灭⁃協
䙏删䅎夠兕䕓呉低区
Arwen Hebert 972-384-4639 / After Hours 214-907-5249 ** If no answer you can try
瑥硴楮朠慳⁷敬氮
䅭祥⁃潲牡摯‹㜲ⴳ㠴ⴴ㘲㤠⼠䅦瑥爠䡯畲猠㐶㤭㤹㔭㐲㘵
䑡牲敬氠坩湲潷‹㜲ⴳ㠴ⴴ㘱㠠⼠䅦瑥爠䡯畲猠㠱㠭㌰㤭㔰ㄴ
Jordan Tomenga 972-384-4622 / After Hours 972- 523-9746COMMENTS:
1COMMENTS:
偬敡獥⁲敡搠慬氠楮獴牵捴楯湳⁣慲敦 ully before completing the signing appointm ent.
Due to the COVID- 19 pandemic, Timios w ill REQUIRE the NSA 䡥慬瑨⁓捲敥湩湧
Form as w ell as the Borrow 敲⁈敡汴栠卣牥敮楮朠䙯牭⁴漠扥⁦楬汥搠潵琠捯浰汥瑥汹 .
䅬so, w 攠慲攠慳歩湧畲潴慲楥猠瑯⁰牯瑥捴⁴桥浳敬癥猠慮搠潵爠扯牲潷 敲猠批
睥aring gloves and masks to all Face to Face closings
AS A剅䵉乄䕒Ⱐ呈䔠坏剄ₑ乏呁 剙⁐啂䱉䎒⁍啓吠䉅⁗剉呔䕎⁏啔⁁ FTER THE NA 䵅
低⁕十 偁TRIOT A CT FORM, NO EXCEPTIONS

䉯牲潷敲⡳⤠獨潵汤⁢攠捯湴慣瑥搠瑯⁣潮晩牭⁡灰潩湴浥湴⁤慴攬⁴業攠慮搠汯捡瑩潮⸠
All docum ents should be printed and returned in the order they are attached. Do not shuff le pages in the package.
䑯捵浥湴猠獨潵汤潴⁢攠捵瑯晦 ⸠⁍潳琠晩汥猠慲攠異汯慤敤⁡猠浩硥搠潲楧楮慬猬†睨敮⁩渠摯畢琠灲楮琠潮敧慬⁳楺攠灡灥爮⁙潵
will be contacted by a Timios representative when documents are ready to be printed.
䅬氠獩杮敤⁤潣畭敮瑳畳琠扥⁤牯灰敤⁷楴桩渠㐠桯畲猠潦⁴桥⁣汯獩湧⁵獩湧⁴桥⁣潵物敲慢敬⁡瑴慣桥搮†䅦瑥爠桯畲猠捬潳楮杳
浵獴⁢攠摲潰灥搠批 湯潮⁴桥數琠扵獩湥獳⁤慹⸠⁁晴敲⁤潣畭敮瑳⁨慶攠扥敮⁤牯灰敤Ⱐ⁰汥慳攠汯杩渠瑯⁷睷⹴業楯献捯洠瑯
confirm the signing has been completed.
⨪⩐汥慳攠畳攠愠䙅䐠䕘⽕偓⁓桩灰楮朠䕮癥汯灥⁷桥渠牥瑵牮楮朠摯捵浥湴猪⨪
Failure to comply 睩瑨⁴桥⁩湳瑲畣瑩潮猠慢潶攠捯畬搠牥獵汴⁩渠愠牥摵捴楯渠潲⁷慩癥爠潦⁹ 潵爠獩杮楮朠晥攮†
䍬潳楮朠噥湤潲⁌潧楮⁉湳瑲畣瑩潮猺
ⴠ䝯⁴漠睷眮瑩浩潳⹣潭⁡湤⁧漠瑯⁁捣敳猠奯畲⁁捣潵湴
ⴠ啳敲⁉䐠楳⁹潵爠癥湤潲畭扥爠⡄传乏吠䕎呅删偒䕃䕄䥎䜠婅剏匩
Please direct all billing and pay ment inquiries to accountspay [email protected]
Do not change the name of the fields.

But there are no foreign characters in the PDF.
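
The garbled runs look consistent with pairs of ASCII bytes being decoded as single UTF-16 big-endian characters: for instance, the characters in front of "No:" in the dump split back into the bytes for "Vendor". The sketch below is a diagnostic under that assumption, not a fix for the PDF reader; `repair_utf16_pairs` is a hypothetical helper, and it would also mangle genuine CJK text whose two bytes both happen to be printable ASCII.

```python
def _printable(byte):
    """True for printable ASCII bytes (space through tilde)."""
    return 0x20 <= byte <= 0x7E


def repair_utf16_pairs(text):
    """Split characters whose two bytes are both printable ASCII back into two letters."""
    out = []
    for ch in text:
        code = ord(ch)
        hi, lo = code >> 8, code & 0xFF
        if 0xFF < code <= 0xFFFF and _printable(hi) and _printable(lo):
            out.append(chr(hi) + chr(lo))
        else:
            out.append(ch)  # ordinary characters pass through unchanged
    return "".join(out)


# The escapes below are the code points of the garbled run before the vendor number.
print(repair_utf16_pairs("\u5665\u6e64\u6f72\u204e\u6f3a"))
```

If the output reads as plain English, the problem is probably in how the extracted bytes are decoded rather than in the PDF content itself.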

Not able to run the app

Getting the following error while running the command:

Processing sample forms/01-01711646.pdf provided using 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 and parser configuration from closewise_parser_configurator.json
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 980, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore[return-value]  # noqa
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1112, in create_connection
    transport, protocol = await self._create_connection_transport(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1145, in _create_connection_transport
    await waiter
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/sslproto.py", line 574, in _on_handshake_complete
    raise handshake_exc
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/sslproto.py", line 556, in _do_handshake
    self._sslobj.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 979, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/akhileshthapliyal/document-content-extractor/document-content-extractor.py", line 244, in <module>
    main()
  File "/Users/akhileshthapliyal/document-content-extractor/document-content-extractor.py", line 207, in main
    file_path, api_key, json_result, config, usage, result_file = upload_and_process_document(
                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akhileshthapliyal/document-content-extractor/document-content-extractor.py", line 60, in upload_and_process_document
    result = extract_content_async(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akhileshthapliyal/document-content-extractor/content_extractor.py", line 294, in extract_content_async
    prompts, results = loop.run_until_complete(
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/akhileshthapliyal/document-content-extractor/content_extractor.py", line 211, in process_prompts
    responses = await asyncio.gather(*tasks)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/akhileshthapliyal/document-content-extractor/content_extractor.py", line 74, in process_chunk
    response = await session.post(
               ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/client.py", line 536, in _request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 540, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 901, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
    raise last_exc
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 982, in _wrap_create_connection
    raise ClientConnectorCertificateError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host api.openai.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)')]
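
"unable to get local issuer certificate" generally means this Python installation cannot find a trusted CA bundle. With python.org installers on macOS, running the bundled "Install Certificates.command" is a common remedy. The sketch below only uses the standard library to show where Python looks for certificates and to build a verifying context from an explicit bundle; `describe_ca_paths` and `build_ssl_context` are hypothetical helpers.

```python
import ssl


def describe_ca_paths():
    """Report where this Python build looks for trusted CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        "openssl_cafile": paths.openssl_cafile,
    }


def build_ssl_context(cafile=None):
    """Create a verifying SSL context, optionally from an explicit CA bundle."""
    return ssl.create_default_context(cafile=cafile)


print(describe_ca_paths())
```

If the `certifi` package is installed, `build_ssl_context(certifi.where())` gives a context that can be passed to aiohttp as `aiohttp.TCPConnector(ssl=ctx)`. Whether that fits this app depends on how the session is created in `content_extractor.py`.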

When reporting corrections in the app it fails to recognize the keys

I get the following error:

Field Name                   | Field Value
+----------------------------|--------------------------------------------------+
borrowerName_1               | Eric Robert Matthews
borrowerEmail_1              | 
borrowerCellPhone_1          | (310) 418-3470
borrowerName_2               | 
borrowerEmail_2              | 
borrowerCellPhone_2          | 
isTheSignerAForeignNational? | No
language                     | 
timezone                     | 
propertyAddress_Line         | 915 Via Colinas
propertyAddress_City         | Westlake Village
propertyAddress_State        | CA
propertyAddress_Zip          | 92660
closingAddress_Line          | 915 VIA COLINAS
closingAddress_City          | WESTLAKE VILLAGE
closingAddress_State         | CA
closingAddress_Zip           | 91362
appointmentDateTime          | 10/12/2022 10:00 AM
fileNumber                   | OPA-4747851
orderOnBehalfOf              | DEANNA GOOD
signingType                  | DISBURSEMENT ONLY
closingType                  | Home Equity Loan
lender                       | Title365
titleCompany                 | Vendor Name
companyFee                   | $90.00
agentFee                     | N/A
witnessNumber                | N/A
uploadFiles                  | N/A
isScanBackNeeded             | N/A
pocName                      | DEANNA GOOD
pocPhone                     | (310) 418-3470
pocEmail                     | N/A
+----------------------------|--------------------------------------------------+
Are any of the values incorrect? (y/n): y
Enter a coma separated list of fields that need to be corrected? propertyAddress_Zip, titleCompany, pocPhone
propertyAddress_Zip is an invalid field_name. Please enter a valid field name.
titleCompany is an invalid field_name. Please enter a valid field name.
pocPhone is an invalid field_name. Please enter a valid field name.
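
The cause is not visible from this output. The entered names match the table exactly, so the check may be comparing against a different set of keys or against differently cased names. A sketch of a more forgiving parser, assuming the valid names are the keys of the extraction result; `parse_field_list` is a hypothetical helper.

```python
def parse_field_list(raw, valid_fields):
    """Split a comma-separated answer and match names against valid_fields.

    Returns (matched, unknown); matching ignores case and surrounding spaces.
    """
    lookup = {name.lower(): name for name in valid_fields}
    matched, unknown = [], []
    for part in raw.split(","):
        name = part.strip()
        if not name:
            continue  # skip empty entries such as a trailing comma
        canonical = lookup.get(name.lower())
        if canonical is None:
            unknown.append(name)
        else:
            matched.append(canonical)
    return matched, unknown
```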
