pdfparser's People

Contributors

karmapenny

pdfparser's Issues

Handle crypto block panics

One of the crypto packages panics when it is given malformed data to decrypt. Recover from the panic and continue parsing the rest of the PDF.
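
A minimal sketch of the recover step, assuming the decrypt call can be wrapped in a function value. The names safeDecrypt and decrypt are hypothetical and are not taken from pdfparser; decrypt stands in for whichever crypto call panics.

```go
package main

import "fmt"

// safeDecrypt calls decrypt and turns any panic into an ordinary error,
// so the caller can skip the damaged object and keep parsing.
// decrypt is a hypothetical stand-in for the crypto call that panics.
func safeDecrypt(decrypt func([]byte) []byte, data []byte) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out = nil
			err = fmt.Errorf("decrypt panicked: %v", r)
		}
	}()
	return decrypt(data), nil
}
```

The caller can then treat a non-nil error like any other per-object failure: log it, leave the stream undecrypted, and move on to the next object.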

More Filters

Add support for additional filters

ASCIIHexDecode
ASCII85Decode
RunLengthDecode
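
A rough sketch of two of these filters, following the rules in the PDF specification as I understand them: ASCIIHexDecode skips whitespace, stops at `>`, and pads an odd final digit with 0; RunLengthDecode copies n+1 literal bytes for a length byte n below 128, repeats the next byte 257-n times for n above 128, and stops at 128. The function names are illustrative, not pdfparser's.

```go
package main

import (
	"errors"
	"fmt"
)

// asciiHexDecode decodes ASCIIHexDecode data. Whitespace is ignored,
// '>' marks end of data, and a trailing odd digit is padded with 0.
func asciiHexDecode(src []byte) ([]byte, error) {
	out := make([]byte, 0, len(src)/2)
	var hi byte
	haveHi := false
	for _, c := range src {
		var v byte
		switch {
		case c == '>':
			if haveHi {
				out = append(out, hi<<4)
			}
			return out, nil
		case c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == 0:
			continue
		case c >= '0' && c <= '9':
			v = c - '0'
		case c >= 'a' && c <= 'f':
			v = c - 'a' + 10
		case c >= 'A' && c <= 'F':
			v = c - 'A' + 10
		default:
			return nil, fmt.Errorf("invalid hex character %q", c)
		}
		if haveHi {
			out = append(out, hi<<4|v)
			haveHi = false
		} else {
			hi, haveHi = v, true
		}
	}
	// No '>' seen: be lenient and return what was decoded.
	if haveHi {
		out = append(out, hi<<4)
	}
	return out, nil
}

// runLengthDecode decodes RunLengthDecode data.
func runLengthDecode(src []byte) ([]byte, error) {
	var out []byte
	for i := 0; i < len(src); {
		n := int(src[i])
		i++
		switch {
		case n == 128: // end of data
			return out, nil
		case n < 128: // copy the next n+1 bytes literally
			if i+n+1 > len(src) {
				return nil, errors.New("truncated literal run")
			}
			out = append(out, src[i:i+n+1]...)
			i += n + 1
		default: // repeat the next byte 257-n times
			if i >= len(src) {
				return nil, errors.New("truncated repeat run")
			}
			for j := 0; j < 257-n; j++ {
				out = append(out, src[i])
			}
			i++
		}
	}
	return out, nil
}
```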

Custom ASCII85Decode

Go's base85 (ASCII85) decoder seems to fail on some PDFs, so write an in-house version that follows the PDF specification.
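
One possible shape for such a decoder, as a sketch only. It assumes the rules from the PDF specification: characters `!` through `u` carry base-85 digits, `z` stands for four zero bytes, whitespace is ignored, `~>` ends the data, and a final partial group of n characters yields n-1 bytes. The name ascii85Decode is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ascii85Decode decodes ASCII85Decode data up to the "~>" marker.
func ascii85Decode(src []byte) ([]byte, error) {
	var out []byte
	var group [5]byte
	n := 0

	// flush converts the first count digits of group into count-1 bytes.
	// Missing digits are padded with 84 (the digit value of 'u').
	flush := func(count int) error {
		var v uint64
		for i := 0; i < 5; i++ {
			d := byte(84)
			if i < count {
				d = group[i]
			}
			v = v*85 + uint64(d)
		}
		if v > 0xFFFFFFFF {
			return errors.New("ascii85 group overflows 32 bits")
		}
		b := [4]byte{byte(v >> 24), byte(v >> 16), byte(v >> 8), byte(v)}
		out = append(out, b[:count-1]...)
		return nil
	}

loop:
	for _, c := range src {
		switch {
		case c == '~': // start of the "~>" end-of-data marker
			break loop
		case c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == 0:
			continue
		case c == 'z':
			if n != 0 {
				return nil, errors.New("'z' inside an ascii85 group")
			}
			out = append(out, 0, 0, 0, 0)
		case c >= '!' && c <= 'u':
			group[n] = c - '!'
			n++
			if n == 5 {
				if err := flush(5); err != nil {
					return nil, err
				}
				n = 0
			}
		default:
			return nil, fmt.Errorf("invalid ascii85 character %q", c)
		}
	}
	if n == 1 {
		return nil, errors.New("dangling single ascii85 character")
	}
	if n > 1 {
		if err := flush(n); err != nil {
			return nil, err
		}
	}
	return out, nil
}
```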

Text extraction

Content streams can obscure text quite a bit. Look into extracting the text from these streams, including text set in CID-encoded fonts, such as this example:

1 0 obj
<</MediaBox [0 0 612.8264285714283 792.8264285714283]/Group <</CS /DeviceRGB/I true/S /Transparency>>/Contents 2 0 R/Type /Page/Parent 4 0 R/Resources 11 0 R>>
endobj

2 0 obj
<</Length 3 0 R/Filter /FlateDecode>>
stream
0.1 w
q 0 0 612 792 re
W* n
q 0 0 0 rg
BT
56.8 724 Td /F1 12 Tf[<01>-2<02>1<03>2<03>2<04>-7<05>17<06>85<040703>2<08>]TJ
ET
Q
Q 
endstream
endobj

8 0 obj
<</Filter /FlateDecode/Length 260>>
stream
/CIDInit/ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo<<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName/Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
8 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0057>
<07> <0072>
<08> <0064>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj

9 0 obj
<</Type /Font/Subtype /TrueType/BaseFont /BAAAAA+LiberationSerif/FirstChar 0/LastChar 8/Widths [777 722 443 277 500 250 943 333 500]/FontDescriptor 7 0 R/ToUnicode 8 0 R>>
endobj

10 0 obj
<</F1 9 0 R>>
endobj

11 0 obj
<</Font 10 0 R/ProcSet [/PDF /Text]>>
endobj

Object 2 uses the character map in object 8 (referenced through the /ToUnicode entry of the font in object 9) to spell out "Hello World".
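
As a sketch of how that lookup could work for this example: collect the bfchar pairs from the ToUnicode CMap, then map each byte of the hex strings in the TJ array through them. It assumes one-byte codes, a single bfchar section, destinations that are one UTF-16 code unit, and even-length hex strings, all of which hold for the example but not in general. parseBfChars and decodeTJ are hypothetical names.

```go
package main

import (
	"encoding/hex"
	"regexp"
	"strconv"
	"strings"
)

var (
	// pairRe matches a "<src> <dst>" pair inside a bfchar section.
	pairRe = regexp.MustCompile(`<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>`)
	// hexStrRe matches a hex string operand such as <040703>.
	hexStrRe = regexp.MustCompile(`<([0-9A-Fa-f]*)>`)
)

// parseBfChars reads the first beginbfchar...endbfchar section of a
// decoded ToUnicode CMap. Assumption: one-byte source codes.
func parseBfChars(cmap string) map[byte]rune {
	m := map[byte]rune{}
	start := strings.Index(cmap, "beginbfchar")
	end := strings.Index(cmap, "endbfchar")
	if start < 0 || end < start {
		return m
	}
	for _, p := range pairRe.FindAllStringSubmatch(cmap[start:end], -1) {
		src, err1 := strconv.ParseUint(p[1], 16, 8)
		dst, err2 := strconv.ParseUint(p[2], 16, 32)
		if err1 != nil || err2 != nil {
			continue // skip multi-byte codes this sketch does not handle
		}
		m[byte(src)] = rune(dst)
	}
	return m
}

// decodeTJ maps every byte of every hex string in a TJ array through m.
// The numbers between strings are kerning adjustments and are ignored.
func decodeTJ(array string, m map[byte]rune) string {
	var sb strings.Builder
	for _, s := range hexStrRe.FindAllStringSubmatch(array, -1) {
		raw, err := hex.DecodeString(s[1])
		if err != nil {
			continue // odd-length strings are not handled here
		}
		for _, code := range raw {
			if r, ok := m[code]; ok {
				sb.WriteRune(r)
			}
		}
	}
	return sb.String()
}
```

A fuller version would also need bfrange sections, multi-byte codes from the codespace ranges, and literal `(...)` strings.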

Page Contents Array

The Contents entry in a page object can also be an array of streams rather than a single stream. Add support for arrays in this case.
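
One way to handle both shapes is to normalize Contents to a list of references and then join the decoded streams. The sketch below assumes a hypothetical Ref type and a parsed value that is either a Ref or a []Ref; pdfparser's real object model may look different. The streams are joined with a newline so that tokens from adjacent streams do not run together.

```go
package main

import "bytes"

// Ref is a hypothetical indirect reference such as "2 0 R".
type Ref struct {
	Num, Gen int
}

// contentRefs normalizes a page's Contents value to a slice of
// references. Assumption: the parser yields a Ref or a []Ref.
func contentRefs(contents interface{}) []Ref {
	switch v := contents.(type) {
	case Ref:
		return []Ref{v}
	case []Ref:
		return v
	default:
		return nil
	}
}

// joinContent concatenates already-decoded content streams, separated
// by a newline so operators from adjacent streams stay distinct.
func joinContent(streams [][]byte) []byte {
	return bytes.Join(streams, []byte("\n"))
}
```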

Better xref repair

If something goes wrong, attempt to locate the xref table, or at least the Root dictionary.

Support all BitsPerComponent values

Currently only 8 bits per component is supported, because other values are rare and things get much trickier when samples are not byte-aligned.
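
For illustration, the non-byte-aligned case can be handled with a small bit accumulator. This sketch assumes the valid values are 1, 2, 4, 8, and 16, that samples are packed most significant bit first, and that the input is a single row; real image data pads each row to a byte boundary, so it would be called once per row. unpackSamples is a hypothetical name.

```go
package main

import "fmt"

// unpackSamples splits packed sample data into one value per sample.
// Samples are read most significant bit first.
func unpackSamples(data []byte, bits int) ([]uint16, error) {
	switch bits {
	case 1, 2, 4, 8, 16:
	default:
		return nil, fmt.Errorf("unsupported BitsPerComponent %d", bits)
	}
	mask := uint32(1)<<uint(bits) - 1
	out := make([]uint16, 0, len(data)*8/bits)
	var acc uint32 // bits read but not yet emitted
	nbits := 0     // number of valid bits in acc
	for _, b := range data {
		acc = acc<<8 | uint32(b)
		nbits += 8
		for nbits >= bits {
			nbits -= bits
			out = append(out, uint16(acc>>uint(nbits)&mask))
		}
		// Keep only the bits that have not been emitted yet.
		acc &= uint32(1)<<uint(nbits) - 1
	}
	return out, nil
}
```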

Compressed object support

An entry in an xref stream can have type 2, which means the object is compressed.

That object is then stored inside another object's stream (an object stream) for some god-awful reason.
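
A sketch of pulling the individual objects out of an already-decoded object stream. It relies on my reading of the PDF specification: the stream dictionary gives N (the object count) and First (the byte offset of the first object), and the data begins with N pairs of integers, each an object number followed by an offset relative to First. It also assumes the offsets are in increasing order. splitObjectStream is a hypothetical name.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// splitObjectStream returns the raw bytes of each object in a decoded
// object stream, keyed by object number. n and first come from the
// stream dictionary's N and First entries.
func splitObjectStream(decoded []byte, n, first int) (map[int][]byte, error) {
	if n < 0 || first < 0 || first > len(decoded) {
		return nil, errors.New("invalid N or First")
	}
	fields := strings.Fields(string(decoded[:first]))
	if len(fields) < 2*n {
		return nil, errors.New("object stream header is too short")
	}
	nums := make([]int, n)
	offs := make([]int, n)
	for i := 0; i < n; i++ {
		var err error
		if nums[i], err = strconv.Atoi(fields[2*i]); err != nil {
			return nil, err
		}
		if offs[i], err = strconv.Atoi(fields[2*i+1]); err != nil {
			return nil, err
		}
	}
	body := decoded[first:]
	objs := make(map[int][]byte, n)
	for i := 0; i < n; i++ {
		// Each object runs up to the start of the next one.
		end := len(body)
		if i+1 < n {
			end = offs[i+1]
		}
		if offs[i] < 0 || offs[i] > end || end > len(body) {
			return nil, fmt.Errorf("bad offset for object %d", nums[i])
		}
		objs[nums[i]] = bytes.TrimSpace(body[offs[i]:end])
	}
	return objs, nil
}
```

For a type 2 xref entry, the second field names the object stream to decode and the third field is the index of the object within it.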

Handle base URI

A base URI can be declared in the catalog (the Base entry of its URI dictionary) and is prepended to relative URIs.
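
A small sketch using the standard net/url package. Note that URL.ResolveReference applies RFC 3986 reference resolution rather than plain string prepending, so absolute URIs pass through unchanged and a relative path replaces the last path segment of the base. resolveURI is a hypothetical helper name.

```go
package main

import "net/url"

// resolveURI resolves ref against the document's base URI.
// Absolute references are returned unchanged.
func resolveURI(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}
```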

Handle empty streams

Stumbled upon a PDF with empty streams in it. Unsure whether this would occur in a legitimate PDF, but we should handle it nonetheless.

Repair XRef table when missing

Sometimes a PDF does not include its xref table.

Repair the xref table by finding the offsets of all strings that look like [0-9]+\s+[0-9]+\s+obj
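
The scan described above could be sketched as follows. The xrefEntry type and rebuildXref name are hypothetical. Later definitions of the same object number overwrite earlier ones, on the assumption that later copies come from incremental updates. The pattern can also match bytes inside stream data, so a real repair pass would want to validate each hit.

```go
package main

import (
	"regexp"
	"strconv"
)

// objHeaderRe matches an object header such as "12 0 obj".
var objHeaderRe = regexp.MustCompile(`(\d+)\s+(\d+)\s+obj\b`)

// xrefEntry is a hypothetical rebuilt cross-reference entry.
type xrefEntry struct {
	Offset int64 // byte offset of the object header in the file
	Gen    int   // generation number
}

// rebuildXref scans the whole file for object headers and records the
// offset of each one, keyed by object number.
func rebuildXref(data []byte) map[int]xrefEntry {
	table := map[int]xrefEntry{}
	for _, m := range objHeaderRe.FindAllSubmatchIndex(data, -1) {
		num, err1 := strconv.Atoi(string(data[m[2]:m[3]]))
		gen, err2 := strconv.Atoi(string(data[m[4]:m[5]]))
		if err1 != nil || err2 != nil {
			continue // number too large to be a plausible object header
		}
		table[num] = xrefEntry{Offset: int64(m[0]), Gen: gen}
	}
	return table
}
```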

Assert startxref at end of file

All PDFs should end with the line %%EOF.

Assert that this is true and read the startxref offset from the second-to-last line.

Repair the xref if this assertion fails.
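
A sketch of that check, assuming the whole file is in memory. It looks only at the last 1024 bytes (an arbitrary window chosen for this example), tolerates trailing whitespace after %%EOF, and returns an error that the caller can use as the signal to fall back to xref repair. findStartXref is a hypothetical name.

```go
package main

import (
	"bytes"
	"errors"
	"strconv"
)

// findStartXref asserts that the file ends with %%EOF and returns the
// offset given after the last startxref keyword.
func findStartXref(data []byte) (int64, error) {
	tail := data
	if len(tail) > 1024 {
		tail = tail[len(tail)-1024:]
	}
	tail = bytes.TrimRight(tail, " \t\r\n\x00")
	if !bytes.HasSuffix(tail, []byte("%%EOF")) {
		return 0, errors.New("file does not end with %%EOF")
	}
	idx := bytes.LastIndex(tail, []byte("startxref"))
	if idx < 0 {
		return 0, errors.New("startxref keyword not found")
	}
	// Whatever sits between "startxref" and "%%EOF" must be one integer.
	between := tail[idx+len("startxref") : len(tail)-len("%%EOF")]
	fields := bytes.Fields(between)
	if len(fields) != 1 {
		return 0, errors.New("malformed startxref section")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}
```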

Stream length mismatch

Sometimes the length specified in the stream dictionary is intentionally less than the actual length of the stream. Read until the endstream marker and then compare sizes.
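
One way this could look, as a sketch: trust the declared length when endstream really follows it, otherwise scan forward for the marker and report the mismatch. The names are hypothetical, start is assumed to be the offset of the first data byte after the stream keyword and its end-of-line, and binary data that happens to contain the bytes "endstream" would fool the fallback scan. An empty stream falls out of the same logic with a declared length of 0.

```go
package main

import (
	"bytes"
	"errors"
)

var endstreamMarker = []byte("endstream")

// streamBody returns the stream data starting at start, and whether the
// declared Length disagreed with what was found before endstream.
func streamBody(data []byte, start, declared int) ([]byte, bool, error) {
	if start < 0 || start > len(data) {
		return nil, false, errors.New("stream start out of range")
	}
	// Fast path: the declared length is right if endstream follows it.
	if end := start + declared; declared >= 0 && end <= len(data) {
		rest := bytes.TrimLeft(data[end:], "\r\n")
		if bytes.HasPrefix(rest, endstreamMarker) {
			return data[start:end], false, nil
		}
	}
	// Fallback: read up to the endstream marker and compare sizes.
	rel := bytes.Index(data[start:], endstreamMarker)
	if rel < 0 {
		return nil, false, errors.New("endstream marker not found")
	}
	body := data[start : start+rel]
	// Drop the single end-of-line that precedes endstream, if present.
	body = bytes.TrimSuffix(body, []byte("\n"))
	body = bytes.TrimSuffix(body, []byte("\r"))
	return body, len(body) != declared, nil
}
```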
