lambda-text-extractor's Issues

Received error when trying to parse .jpg file.

The following is the error response:

{"errorMessage": "local variable 'textractor_results' referenced before assignment", "errorType": "UnboundLocalError", "stackTrace": [["/var/task/main.py", 128, "handle", "payload['results']['textractor'] = textractor_results"]]}
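An UnboundLocalError like this usually means the variable is only assigned inside a branch that was not taken for this input. The following is a minimal sketch of a guard, not the project's actual main.py: the function name, the extension check, and the placeholder result are all hypothetical.

```python
def handle_sketch(document_name):
    """Hypothetical handler showing the guard; not the project's main.py."""
    payload = {"results": {}}
    # Initialise before branching so the name always exists.
    textractor_results = None
    if document_name.lower().endswith(".pdf"):
        # Stand-in for whatever extractor would normally run.
        textractor_results = {"text": "extracted text placeholder"}
    if textractor_results is None:
        # Report the unsupported input instead of raising UnboundLocalError.
        payload["error"] = "no extractor ran for " + document_name
    else:
        payload["results"]["textractor"] = textractor_results
    return payload
```

With a guard of this shape, an unsupported file would produce an explicit error field in the response rather than a stack trace.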

Failing to Extract Text on Lambda

Hi, I've just deployed the new version of your code, but I'm getting errors. In particular, when I try to run the example given on the Readme:

aws lambda invoke --function-name textractor_simple --payload '{"document_uri": "https://mozilla.github.io/pdf.js/web/compressed.tracemonkey-pldi-09.pdf", "temp_uri_prefix": "s3://text-extractor/", "text_uri": "s3://text-extractor/tracemonkey.txt"}' -

I get a:

{
    "StatusCode": 200
}

There are no errors on the Lambda, but when I look at the extracted text file, it has 0 bytes, and CloudWatch says this:

[ERROR] 2017-11-09T20:32:36.918Z 1e9cea26-c58d-11e7-9503-b7e3017ab9c2 Subprocess ['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt'] returned 127:
Traceback (most recent call last):
  File "/var/task/utils.py", line 8, in get_subprocess_output
    output = subprocess.check_output(cmdline, **kwargs)
  File "/var/lang/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/var/lang/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/var/task/bin/pdftotext', '-layout', '-nopgbrk', '-eol', 'unix', '/tmp/tmp8xi8qzza.pdf', '/tmp/tmpim1oc76s.txt']' returned non-zero exit status 127.

And I'm a little puzzled. I believe the pdftotext binary should be in the bin/ directory of the function. Maybe the libraries are having a problem? Is it working for you?

Thanks!
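Exit status 127 often means the binary could not be started properly, for example because a shared library it needs is missing, but the log above does not include the binary's stderr, so the cause is not visible. The following sketch captures stderr together with stdout so the message is available for logging. The helper name is hypothetical, and the command run here is a stand-in that imitates a failing binary, not pdftotext itself.

```python
import subprocess
import sys

def run_and_capture(cmdline):
    """Run cmdline; return (exit_status, combined stdout/stderr text)."""
    proc = subprocess.run(
        cmdline,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the captured output
    )
    return proc.returncode, proc.stdout.decode("utf-8", errors="replace")

# Stand-in command that writes to stderr and exits with status 127.
status, output = run_and_capture(
    [sys.executable, "-c", "import sys; sys.stderr.write('lib missing'); sys.exit(127)"]
)
```

Logging the captured output alongside the exit status would show whether the loader is complaining about a specific library.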

Source Bucket Lambda Trigger

Currently, this is set up to run through a manual invoke.

What would be the best steps to use a source bucket and a destination bucket?
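One possible approach is a small function, triggered by S3 object-created events on the source bucket, that builds the same payload used for the manual invoke. The following is a sketch only. It assumes that document_uri accepts s3:// URIs, which this page does not confirm, and the function name, the dest_bucket parameter, and the tmp/ prefix are hypothetical.

```python
from urllib.parse import unquote_plus

def payloads_from_s3_event(event, dest_bucket):
    """Build one extractor payload per object in an S3 event notification."""
    payloads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        payloads.append({
            "document_uri": "s3://{}/{}".format(bucket, key),  # assumption: s3:// is accepted
            "temp_uri_prefix": "s3://{}/tmp/".format(dest_bucket),  # hypothetical prefix
            "text_uri": "s3://{}/{}.txt".format(dest_bucket, key),
        })
    return payloads
```

Each payload could then be passed to the extractor function in place of the hand-written JSON from the manual invoke.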

Antiword and UnRTF failing

Hi,

First of all, I'd like to thank you for your awesome repo!

However, while testing it, I ran into some errors. The PDF extractor lambda works well, but the office extractor lambda failed with both an RTF and a DOC file.

These are the messages:

For UnRTF:
"reason": "Exception while executing ['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']: Command '['/var/task/bin/unrtf', '-P', '/var/task/lib/unrtf', '--text', u'/tmp/intelllex_dZROnq.rtf']' returned non-zero exit status 1 (output=No config directories. Searched: /var/task/lib/unrtf\n)"

For Antiword:
"reason": "Exception while executing ['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']: Command '['/var/task/bin/antiword', '-t', '-w', '0', '-m', 'UTF-8', u'/tmp/intelllex_pLX1jK.doc']' returned non-zero exit status 1 (output=I can't find the name of your HOME directory\nI can't open your mapping file (UTF-8.txt)\nIt is not in '/.antiword' nor in '/usr/share/antiword'.\n\tName: antiword\n\tPurpose: Display MS-Word files\n\tAuthor: (C) 1998-2005 Adri van Os\n\tVersion: 0.37 (21 Oct 2005)\n\tStatus: GNU General Public License\n\tUsage: antiword [switches] wordfile1 [wordfile2 ...]\n\tSwitches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]\n\t\t-f formatted text output\n\t\t-t text output (default)\n\t\t-a <paper size name> Adobe PDF output\n\t\t-p <paper size name> PostScript output\n\t\t paper size like: a4, letter or legal\n\t\t-x <dtd> XML output\n\t\t like: db (DocBook)\n\t\t-m <mapping> character mapping file\n\t\t-w <width> in characters of text output\n\t\t-i <level> image level (PostScript only)\n\t\t-L use landscape mode (PostScript only)\n\t\t-r Show removed text\n\t\t-s Show hidden (by Word) text\n)"

Do you know what the reason might be? I just used apex deploy from a cloned version of your repo, with my IAM role. From what I can see, it looks like it is looking for a lib folder where there is instead a lib-linux_x86 folder, although I'm not sure and it might have nothing to do with it.

Any pointers would be very welcome. I can do more testing if you point me in the right direction.

Thanks!

Santiago.
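Both messages point at paths that the tools could not find: unrtf searched /var/task/lib/unrtf, and antiword reported no HOME directory and no mapping file. The following sketch offers two small helpers for narrowing this down: one lists which expected paths are absent in the deployed package, and one builds a subprocess environment with HOME set. Both helper names are hypothetical, and whether setting HOME actually resolves the antiword error is an untested assumption.

```python
import os

def missing_paths(paths):
    """Return the subset of paths that do not exist on this filesystem."""
    return [p for p in paths if not os.path.exists(p)]

def env_with_home(home_dir, base_env=None):
    """Copy an environment mapping and set HOME for a child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOME"] = home_dir
    return env
```

For instance, logging missing_paths(["/var/task/lib/unrtf", "/var/task/bin/antiword"]) from inside the function would confirm or rule out the lib versus lib-linux_x86 guess before changing anything else.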
