receiptmanager / receipt-parser-legacy

A supermarket receipt parser written in Python using tesseract OCR

Home Page: https://tech.trivago.com/2015/10/06/python_receipt_parser/

License: Apache License 2.0

Python 97.11% Makefile 1.96% Dockerfile 0.93%

receipt-parser receipt ocr invoice home-assistant supermarket

receipt-parser-legacy's Introduction

A fuzzy receipt parser written in Python

This is a fuzzy receipt parser written in Python. It extracts information like the shop, the date, and the total from scanned receipts. It can work as a standalone script or as part of our iOS and Android application.

Dependencies

The receipt-parser-core library depends on ImageMagick. Please install ImageMagick with your favorite package manager.

Usage

To convert all images from the data/img/ folder to text using tesseract and parse the resulting text files, run

make run

Docker

A Dockerfile is available with all dependencies needed to run the program.
To build the image, run

make docker-build

To run it on the sample files, try

make docker-run

By default, running the image will execute the make run command. To use with your own images, run the following:

docker run -v <path_to_input_images>:/app/data/img mre0/receipt_parser

History

This project started as a hackathon idea. Read more about it on the trivago tech blog. Also read the comments on Hacker News. There's also a talk about the project. The library is now available on PyPI.

receipt-parser-legacy's Issues

Fix date verification

In #10, @kiwita88 discovered that there is currently no date verification.
This means that invalid dates like 32.08.2016 will not raise errors. We should fix that by creating proper date objects from the parsed string, similar to this snippet:

from dateutil.parser import parse

a = "2012-10-09T19:00:55Z"

b = parse(a)

print(b.weekday())
# 1 (equal to a Tuesday)
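
As a minimal sketch of the verification idea, the check below uses the standard library's `datetime.strptime` rather than `dateutil`. The function name `parse_receipt_date` and the `DD.MM.YYYY` format are assumptions made for illustration; they are not taken from the project's code.

```python
from datetime import datetime


def parse_receipt_date(date_str, fmt="%d.%m.%Y"):
    """Return a datetime.date, or None if the string is not a real calendar date.

    Hypothetical helper: the name and the default format are assumptions.
    """
    try:
        return datetime.strptime(date_str, fmt).date()
    except ValueError:
        # Raised for impossible dates such as day 32 or 31 February.
        return None


print(parse_receipt_date("31.08.2016"))  # a valid date
print(parse_receipt_date("32.08.2016"))  # rejected, so None is returned
```

Returning `None` instead of raising keeps one bad receipt from aborting a whole batch; whether that is the right policy for this project is a design choice left open here.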

Docker image not working in Ubuntu 22.04

Issue: Docker image not working in Ubuntu 22.04

Expected behaviour: Running "make docker-run" completes without error and processes the sample images.
Actual behaviour: Running make docker-run spits out error "make": executable file not found in $PATH: unknown.

Environment: Ubuntu 22.04
Python: 3.7.9 (from docker official images)
Docker version 20.10.16, build aa7e414

Additional info: make docker-build completed without errors after ensuring dockerfile and pyproject.toml pointed to python version 3.7.9

Attaching screenshot of exact error:

Consider moving to Jazzband (a team of Python maintainers)

@mre, thanks for this lib! It seems like a great contender to be a project under https://jazzband.co, a collaborative group of Python maintainers of which I'm a member.

See https://jazzband.co/about/guidelines

If we can get #12 merged and spruce up the documentation a bit, this should be an acceptable project for Jazzband to manage. Let me know if you'd like any help getting it fixed up to that point.

add API endpoints

This tool would become incredibly powerful if it had API endpoints to which we could send image files or Base64-encoded data and then get the Base64 results back.

The links below show what I have in mind. We would not even need the web page if we had the API.

https://ocr-example.herokuapp.com/

https://github.com/otiai10/ocrserver/wiki/API-Endpoints

Fix failing unit test for market name

In #23 (comment), @kiwita88 found the likely reason why our unit test for the market name 'p e n ny' fails. We should fix that.
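
The linked comment is not reproduced here, so the actual cause is not known from this text. As one possible approach only, the sketch below collapses the whitespace that OCR inserts between letters before doing a fuzzy match with the standard library's `difflib`. The market list, the function name `guess_market`, and the cutoff are all assumptions for illustration.

```python
import difflib

# Illustrative list only; the project's real market names live in its config.
KNOWN_MARKETS = ["penny", "rewe", "real"]


def guess_market(line, cutoff=0.6):
    """Return the best-matching known market for an OCR'd line, or None."""
    # "p e n ny" -> "penny": drop every whitespace character before matching.
    collapsed = "".join(line.lower().split())
    matches = difflib.get_close_matches(collapsed, KNOWN_MARKETS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(guess_market("p e n ny"))
```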

Linking Receipt Parser with Open Food Facts

hi @mre
We're building an open worldwide database of food products. You give it a barcode, or a product name, and you get detailed information about the product.
It's a crowdsourced database fed by smartphone apps.

It would be cool to create a database of supermarket receipts and be able to look them up on Open Food Facts (how many calories did I buy today, or even how much did the average calorie cost).

Unclear documentation

What does this project actually do?

I read the README, the blog post, and the Hacker News comments, and I still have no idea what this actually outputs or how I would even use it. The most relevant thing in the README is "To convert all images from the data/img/ folder to text using tesseract and parse the resulting text files, run ..."

Parse the resulting text files into what?

Is this only meant to be used from the command line, or is this a Python library? The only reason I think it might be meant to be used as a Python library is that I looked at the tests.

Parsing date fails with unsanitized input

Using the included images:

❯ LANG=C make run
poetry run python parser/importer.py
Found the following images in /home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img
['IMG0007.jpg', 'IMG0003.jpg', 'IMG0001.jpg', 'IMG0004.jpg', 'IMG0008.jpg', 'IMG0006.jpg']
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0007.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 233 diacritics
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0003.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 8 diacritics
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0001.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0004.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 62 diacritics
Image too small to scale!! (2x36 vs min width of 3)
Line cannot be recognized!!
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0008.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0006.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
poetry run
Text, Market, Date, Sum
2 real
 1.0 Real
data/txt/IMG0004.jpg.out.txt.txt Real None 9.31
rewe
 1.0 REWE
data/txt/IMG0001.jpg.out.txt.txt REWE 04.12.2014 0.99
dm dm-drogerie markt
 0.8 Drogerie
data/txt/IMG0008.jpg.out.txt.txt Drogerie 11.12.2014 5.85
penny h-milch
 1.0 Penny
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/__init__.py", line 6, in main
    stats = ocr_receipts(config, receipt_files)
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/parse.py", line 124, in ocr_receipts
    receipt = Receipt(config, receipt.readlines())
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 40, in __init__
    self.parse()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 62, in parse
    self.date = self.parse_date()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 94, in parse_date
    dateutil.parser.parse(date_str)
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 06.06. 2015
make: *** [Makefile:7: parse] Error 1

Notice the space in the date: "06.06. 2015".
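
One way to handle this, sketched under assumptions, is to strip whitespace from the matched date fragment before parsing it. The helper names below are hypothetical and the `DD.MM.YYYY` format is assumed from the sample output above; this is not the project's actual fix.

```python
import re
from datetime import datetime


def normalize_date_string(raw):
    """Remove all whitespace from an OCR'd date fragment such as '06.06. 2015'."""
    return re.sub(r"\s+", "", raw)


def parse_date_or_none(raw, fmt="%d.%m.%Y"):
    """Parse a sanitized date string; return None instead of raising on bad input."""
    cleaned = normalize_date_string(raw)
    try:
        return datetime.strptime(cleaned, fmt).date()
    except ValueError:
        return None


print(parse_date_or_none("06.06. 2015"))
```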

Update README.md with Windows Commands

Hi All,
Is anyone able to add commands for running this program on Windows? The make command is only supported on Linux.

Dependabot couldn't authenticate with https://pypi.python.org/simple/

Dependabot couldn't authenticate with https://pypi.python.org/simple/.

You can provide authentication details in your Dependabot dashboard by clicking into the account menu (in the top right) and selecting 'Config variables'.

View the update logs.

Add some unit tests

Right now, we only have some "integration tests", where we take a receipt and we pass it to parser.py. It would be better to write some proper unit tests for the different functionality, like the date parser.

make run

I keep getting the following error when I try to execute make run:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/rsorage/workspaces/majoris/receipt-parser/parser/__main__.py", line 1, in <module>
    from parser import parser
  File "parser/parser.py", line 25, in <module>
    from parser.objectview import ObjectView
ImportError: No module named objectview
Makefile:7: recipe for target 'parse' failed
make: *** [parse] Error 1

Explanation about sum_format and date_format?

Hi
Thanks for your tutorial; it is indeed a nice heads-up. I was reading config.yml and could not understand how sum_format and date_format work. Can you explain them a little? Based on that, I will add some more fields to the parser.

Thanks in advance
Sagar
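
The thread does not include an answer, so the following is only a sketch of how such settings could work, assuming they are regular expressions applied to each OCR'd line. The patterns, the `SUM_KEYS` list, and the function names are hypothetical and are not copied from the project's config.yml.

```python
import re

# Hypothetical patterns, NOT copied from the project's config.yml.
DATE_FORMAT = r"\d{2}\.\d{2}\.\d{4}"  # e.g. 04.12.2014
SUM_FORMAT = r"\d+[.,]\d{2}"          # e.g. 0,99 or 15.95
SUM_KEYS = ["summe", "total"]         # words assumed to mark the total line


def find_date(lines):
    """Return the first substring that looks like a date, or None."""
    for line in lines:
        match = re.search(DATE_FORMAT, line)
        if match:
            return match.group(0)
    return None


def find_sum(lines):
    """Return the amount on the first line containing a sum keyword, or None."""
    for line in lines:
        if any(key in line.lower() for key in SUM_KEYS):
            match = re.search(SUM_FORMAT, line)
            if match:
                # Normalize the decimal comma to a dot.
                return match.group(0).replace(",", ".")
    return None


receipt = ["REWE Markt", "Datum 04.12.2014", "SUMME EUR 0,99"]
print(find_date(receipt), find_sum(receipt))
```

Under this assumption, adding a new field would mean adding another pattern and a matching lookup function.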

make docker-run not working

Hi people,

today I'm evaluating your project, but when I run make docker-run it gives me an error:

docker run -v `pwd`/data/img:/app/data/img mre0/receipt-parser
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/app/parser/__init__.py", line 5, in main
    receipt_files = get_files_in_folder(config.receipts_path)
  File "/app/parser/parse.py", line 57, in get_files_in_folder
    files = os.listdir(os.path.join(BASE_PATH,folder))  # list content of folder
FileNotFoundError: [Errno 2] No such file or directory: '/app/data/txt'
make: *** [Makefile:26: docker-run] Error 1

Next, I tried with the suggested command:

$ docker run -v "$(pwd)/data/img:/usr/src/app/data/img" mre0/receipt-parser
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/app/parser/__init__.py", line 5, in main
    receipt_files = get_files_in_folder(config.receipts_path)
  File "/app/parser/parse.py", line 57, in get_files_in_folder
    files = os.listdir(os.path.join(BASE_PATH,folder))  # list content of folder
FileNotFoundError: [Errno 2] No such file or directory: '/app/data/txt'

then, with

$ docker run -v "$(pwd)/data/img:/usr/src/app/data/img" -v "$(pwd)/data/txt:/app/data/txt:rw" mre0/receipt-parser
Text, Market, Date, Sum
1587323551,0,0,0,0,

Then, I erased all images in data/img, and it gave more or less the same result, the first number changes, and that's it.

Am I doing something wrong?

Regards.

convert: no decode delegate for this image format `JPEG' @ error/constitute.c/ReadImage/508.

I receive this error when trying to run the program, in addition to many following errors.

Solutions that didn't work

brew unlink jpeg and then brew link jpeg
brew install jpeg; brew link jpeg

brew uninstall imagemagick jpeg libtiff jasper; brew install imagemagick

make run
pipenv run python parser/importer.py
/Usr/.local/share/virtualenvs/receipt-parser-master-rh6oEQUF/bin/python
('Found the following images in', '/Usr/Downloads/receipt-parser-master/data/img')
['IMG0008.jpg', 'IMG0003.jpg', 'IMG0001.jpg', 'IMG0004.jpg', 'IMG0007.jpg', 'IMG0006.jpg']
('Running', "convert -rotate ' 90' '/Usr/Downloads/receipt-parser-master/data/img/IMG0008.jpg' '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0008.jpg'")
convert: no decode delegate for this image format `JPEG' @ error/constitute.c/ReadImage/508.
convert: no images defined `/Usr/Downloads/receipt-parser-master/data/tmp/IMG0008.jpg' @ error/convert.c/ConvertImageCommand/3235.
('Running', "convert -auto-level -sharpen 0x4.0 -contrast '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0008.jpg' '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0008.jpg'")
convert: unable to open image '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0008.jpg': No such file or directory @ error/blob.c/OpenBlob/2695.
convert: no decode delegate for this image format `JPG' @ error/constitute.c/ReadImage/508.
convert: no images defined `/Usr/Downloads/receipt-parser-master/data/tmp/IMG0008.jpg' @ error/convert.c/ConvertImageCommand/3235.
('Running', "tesseract -l deu '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0008.jpg' '/Usr/Downloads/receipt-parser-master/data/txt/IMG0008.jpg.out.txt'")
sh: tesseract: command not found
('Running', "convert -rotate ' 90' '/Usr/Downloads/receipt-parser-master/data/img/IMG0003.jpg' '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0003.jpg'")
convert: no decode delegate for this image format `JPEG' @ error/constitute.c/ReadImage/508.
convert: no images defined `/Usr/Downloads/receipt-parser-master/data/tmp/IMG0003.jpg' @ error/convert.c/ConvertImageCommand/3235.
('Running', "convert -auto-level -sharpen 0x4.0 -contrast '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0003.jpg' '/Usr/Downloads/receipt-parser-master/data/tmp/IMG0003.jpg'")

OCR support for single articles

First of all, the script works very well, thanks for that.
Is it possible to read individual articles from the receipt via OCR?

This would be very nice!

Vertical receipt OCR outputs garbage

Vertical receipts are rotated, and the OCR outputs garbage.

Image:

OCR:

—
oO
x
4 J SL a; Jg —J
DOT nom CD oo Or GICO oOO QD+rOoO ADn2O n RDDLDLDDNnD m RANnrnO oo 18)
BODOOOORGHSNAOIOODOH-DNOGOHD Dr OO DR ND OSJON NM DL OODOVOODODDPDDIDIODPDNGCSOO DON DXDNN DM O1
DOSIOOONDDSIOSOSJIIIINDOOITOOGIIDRB IS OGISIND OD TIAISIOOOOO OO m Hmm OO AG OIH HH MD MIND OO N O0 (Od
BDH-OCONDONDOD DO SIGIHHADODOOSNS- OIOCO OO DAGSI- ALDODDIDODTIODHTDODOOOGIGIGIO OT DD OD OD OO SS SSH O-m OO co
= OOo TI III DODOOTITSNSITITS-TOr-DDONDONSNIDOSIDLGIDODAROODTONT-DBODDOIOIGIDOCGIITDODDDOOON CID O1 O
€ TOT UV oO <TD... zz vo u> UVOoy  TUUDODU{VTUOVDUUODUIVUDUDU do UV OO DVD U P>;> > UUUD ZZ T2Z3ET
=& 9 CC — 7 Doo9 - OO 2 — << Do — OO —- < u () —- —- 09 — 00T) u u u Do — —- 7 TODD CO cc CC - -— - a -+--. 09
>t rDOo00%o08 - D35 oo - TS C To 5 JO I Too VO Oo CE DIDI PCOOO Oo Oo oT oo oo OL cc. cLO CE TC OO HH OO D ED ER
+5 ı ıı OO -O9 I 2—-00—-0 010-1 31 _I) JS5=s ı 1 1 ı  Lı DD ı ı __ID2IS ID 1 11V VD ı 1 I DDIDdDd A
> VS ZU u un OD AT — NIOD oo DD — ı Aw Ävy SIuvzZzu3 dd -— m << o vo oO DB. DI DD nn Tv DD zz OO VO > JH A or ©
xu_n ° cc: 0709 adornd-er I DDr DIO TO TA DO OHIO DCCOoO PP OT D OD — — ——— DO AT CC CHI IT ID oıo ” oo —
ca o N O00ZRFI 9 TO AO DD DoVOODOO —-O TI VO TO —- Oo cc rt 3 3 o- NN 3 — —- DD — DIVE rTITITO DD > bs yon ttnmdr-
<D a rt oda än a aoe esnn ı — 3; rt0 O0 © 02 X Zu 9) ONNCTOTOTDOTDOONDTODDI OO N Hd —_- —-
SE DV —-- Io TD ı DZ TUI TO dm I Dom IT D + I ON DOO—OD D-O0—929 000 oo Do DT.) = Or ntnro DD Dx x —- I
—- ge TITII-- oO -— oo TI) cc oo DD od —-D 9 rHtQO co ı DTIT7’WVOo ı TI DR RTITI RAR TID zCı ı O9 rHrtrtrt TI I — —ı
<D DO vv. AI DD I DI Oo u CZ 5 c [em A —- 0 c 7’ DpoD pp I ı VO CC DD U DI — — og TIororoIT, ı  ı 1ı 433m m =OoO
nm Tao 1) Jj -- oo x ch ı1 DI +07 vV or > O0 —-—- O7VOvVyvVv C vv DDD THIOTıboıvy om ovyovpgogogc mh —-
<D x — >. DD -5-...:. 7905 -00O0CcC > I a re 3 5 _—o0— —- co oe» OT AOODOTDT TC — 2x Dıxı — oO
=> MO > E 00 OX—-0O0d0 5 — OD +» © D N “D ON 07 an ze ı @ —NNI Do (o DD —.2Z2Z “oO
— Io O2 — J1O3 >= nn IT I 73 > —— D< _ —: + — ho DD. I —- _— . NNN -—- —.- Om Orm-
>t D-%0 oT OO ı © _- om + ua Q — oO © . — —— 2 no oO ge) {on =——r v0. %.
€ OD —- >D- Or De Tr &D oc —D- Ed “& DO 0 — —— vo © OOMNCOIND O0]
= DV A Ar on © . = 23 — 0 N — aA EC DD re = = a  e B oo 01
- Orts. os O-HOrs Hm DOHr HH Or OH Hs Or Or OO OO HH OO Hmm Dr OO HH. OOrHmdHe0o C
SORTITGTONSOOHDOCNOOROODSFENIGSODOOSITCOLAONSJTDTDOGTIDOALII OTTO ONSINNSIOIINSNSNSNS2OOGHEOImW u
DISOOOITOSOOSGOTASDTD TAI OCOODOCOOOOCOODSOCOODVOCOODODOOGOCOCOOOCODODOOOCOOCOSOSOGCOOGOOVDTOOCOOOO DO AO
VU>>r>>>>>- >> >> oT py,>->>>UVU>>>r>->>>r>r;,>>r>>->,D>->r>r>r>->->->,>,>r,>r,>r,>r,>-r>r,>>r>,>r>,>>->-UDOD ur

Any ideas ?

Running the code on Windows (and tried Mac)

Sorry if I am missing something, but I got errors after running the following commands.
On Windows:
I opened the project in VSCode, ran pipenv install, and finally ran the parser, which gave the following:
PS C:\Users\lobvi02\Downloads\OCR\receipt-parser> & python c:/Users//Downloads/OCR/receipt-parser/parser.py
Text, Market, Date, Sum
1539958502,0,0,0,0,

On Mac:
As instructed, I ran make run and got
make: *** No rule to make target 'Importer', needed by 'run'.  Stop.

I am new to Python and so have my questions.

Add continuous integration

Would be nice to use Travis CI or Drone.io to run some tests on every change.
This would help a lot with refactoring the tool in the future without breaking it.
Accepting PRs for this. Just comment here if you like to work on it. I can provide support if you like.

[APPLICATION] Receipt parser front end

Hey,
thanks for the awesome project. I decided to create a front end, which is written in Dart and completely open source. It uses the receipt parser library.

Application https://github.com/ReceiptManager/Application
Server https://github.com/ReceiptManager/Server

Glad to read your feedback.

UnicodeEncodeError while reading image data

Env:

Windows 10, Python 3.9.

Issue:

Got the UnicodeEncodeError while processing example tesseract image data with make run.
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 57: character maps to <undefined>

Solution:

Specify UTF-8 encoding when opening the output file for writing:
out = open(output_file, "w", encoding='utf-8')
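
A slightly fuller sketch of the same idea, with hypothetical helper names, uses a `with` block so the file is always closed and passes the encoding on both the write and the read side. Without an explicit encoding, Python on Windows typically falls back to a legacy code page, which is where the 'charmap' error comes from.

```python
import os
import tempfile


def write_ocr_text(path, text):
    """Write OCR output as UTF-8 regardless of the platform's default encoding."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


def read_ocr_text(path):
    """Read the text back using the same explicit encoding."""
    with open(path, encoding="utf-8") as src:
        return src.read()


# Round trip with a character ('\xfc', i.e. "ü") like the one in the error message.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.out.txt")
write_ocr_text(sample_path, "T\xfcte 0,10")
print(read_ocr_text(sample_path))
```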

How to Run This code

When I Run Command: make run
Then it shows me a message : make: Nothing to be done for 'run'.

Refactor code

Although the receipt-parser works, it's really not too well structured. Might be better for testability (see #4) to clean up the parsers and make the code a bit more object-oriented overall.
Would be happy for anyone who wants to tackle this. Providing mentorship if needed.

make docker-run

Hi,

When I run "make docker-run", I get the following error:
"Removing tmp folder
pipenv run python -m parser
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/src/app/parser/__main__.py", line 11, in <module>
    main()
  File "/usr/src/app/parser/__main__.py", line 6, in main
    stats = parser.ocr_receipts(config, receipt_files)
  File "/usr/src/app/parser/parser.py", line 125, in ocr_receipts
    receipt = Receipt(config, receipt.readlines())
  File "/usr/src/app/parser/receipt.py", line 40, in __init__
    self.parse()
  File "/usr/src/app/parser/receipt.py", line 62, in parse
    self.date = self.parse_date()
  File "/usr/src/app/parser/receipt.py", line 94, in parse_date
    dateutil.parser.parse(date_str)
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 06.06. 2015
Text, Market, Date, Sum
rewe
1.0 REWE
data/txt/IMG0001.jpg.out.txt.txt REWE 04.12.2014 0.99
kaiser's tengelmanrı gmbh
0.8 Kaiser's
data/txt/IMG0006.jpg.out.txt.txt Kaiser's 31.08.2015 15.95
dm dm-drogerie markt
0.8 Drogerie
data/txt/IMG0008.jpg.out.txt.txt Drogerie 11.12.2014 5.85
penny h-milch
1.0 Penny
make: *** [Makefile:7: parse] Error 1
Makefile:22: recipe for target 'docker-run' failed
make: *** [docker-run] Error 2
"
Any ideas what the problem could be?

Add pipenv

I'm a big fan of pipenv. It would be nice to use it for this project.
Accepting PRs for this. If you need support, just add a comment here. 😃

Switch to ImageMagick auto-orient instead of a hard coded 90 degree rotation
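
A minimal sketch of what this could look like, assuming the preprocessing step shells out to ImageMagick's `convert` as the logs in other issues suggest. `-auto-orient` is an ImageMagick option that rotates an image according to its EXIF orientation tag; the function names here are hypothetical.

```python
import subprocess


def build_auto_orient_command(src, dst):
    """Build an ImageMagick call that rotates by EXIF orientation, not a fixed angle."""
    return ["convert", src, "-auto-orient", dst]


def auto_orient(src, dst):
    """Run the command; raises CalledProcessError if convert exits with an error."""
    subprocess.run(build_auto_orient_command(src, dst), check=True)


print(build_auto_orient_command("data/img/IMG0001.jpg", "data/tmp/IMG0001.jpg"))
```

Note that this only helps when the image carries an EXIF orientation tag; scans without that metadata would be left unrotated.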

No image found

Hey,

When I try running OCR, it does not find any images in my specified folder. Why?

Did I miss something ?

Thanks for your help.

Support for PDF receipts

Not sure if this use case is shared among others: I use Scanbot to scan my receipts as multi-page PDFs. It would be great if this tool could work on these PDFs.

Scanbot does a sort of OCR itself, but it doesn't seem to be that good, in the sense that it adds too much noise: a receipt contains so much text, and I'm only interested in the articles, price per article, to see price evolution across multiple weeks.