karmapenny / pdfparser
PDF Parser is a command-line tool and Go library for analyzing PDF files.
License: Other
One of the crypto packages panics when it gets bad data to decrypt. Make sure we recover from the panic and continue parsing the rest of the PDF.
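A minimal sketch of the recover-and-continue idea, assuming the decrypt step can be wrapped in a function value. `safeDecrypt` and `cbcDecrypt` are hypothetical names, not part of pdfparser; AES-CBC is used only because `CryptBlocks` is known to panic on input that is not a whole number of blocks.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// safeDecrypt (hypothetical name) runs decrypt and turns a panic into an error
// so the caller can skip the bad object and keep parsing.
func safeDecrypt(decrypt func([]byte) []byte, data []byte) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out, err = nil, fmt.Errorf("decrypt panicked: %v", r)
		}
	}()
	return decrypt(data), nil
}

// cbcDecrypt returns an AES-CBC decrypt function. CryptBlocks panics when the
// input is not a multiple of the block size, which stands in for "bad data".
func cbcDecrypt(key, iv []byte) func([]byte) []byte {
	return func(data []byte) []byte {
		block, err := aes.NewCipher(key)
		if err != nil {
			panic(err)
		}
		out := make([]byte, len(data))
		cipher.NewCBCDecrypter(block, iv).CryptBlocks(out, data)
		return out
	}
}

func main() {
	dec := cbcDecrypt(make([]byte, 16), make([]byte, aes.BlockSize))
	if _, err := safeDecrypt(dec, []byte("short")); err != nil {
		fmt.Println("recovered:", err)
	}
}
```

With this shape the parser can log the error for the affected stream or string and move on to the next object.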
file specifications can be for URIs
Add support for additional filters
ASCIIHexDecode
ASCII85Decode
RunLengthDecode
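Of the three filters listed, RunLengthDecode is the simplest to illustrate. The sketch below follows the run-length rules in the PDF specification: a length byte of 0 to 127 copies the next length+1 bytes, 129 to 255 repeats the next byte 257-length times, and 128 marks end of data. The function name `runLengthDecode` is an assumption, not an existing pdfparser function.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// runLengthDecode (hypothetical name) decodes PDF RunLengthDecode data.
func runLengthDecode(src []byte) ([]byte, error) {
	var out []byte
	for i := 0; i < len(src); {
		n := int(src[i])
		i++
		switch {
		case n == 128: // end-of-data marker
			return out, nil
		case n < 128: // literal run of n+1 bytes
			if i+n+1 > len(src) {
				return out, errors.New("runlength: truncated literal run")
			}
			out = append(out, src[i:i+n+1]...)
			i += n + 1
		default: // repeat the next byte 257-n times
			if i >= len(src) {
				return out, errors.New("runlength: truncated repeat run")
			}
			out = append(out, bytes.Repeat(src[i:i+1], 257-n)...)
			i++
		}
	}
	// No EOD marker seen; tolerate it and return what was decoded.
	return out, nil
}

func main() {
	out, err := runLengthDecode([]byte{2, 'a', 'b', 'c', 254, 'x', 128})
	fmt.Printf("%q %v\n", out, err)
}
```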
add urls from URL NameTreeMap
Add support for the standard decryption filter with a blank password
The Go base85 decoder seems to fail with some PDFs, so make an in-house version that follows the PDF specification.
Extract embedded files to a directory.
Content streams can obscure text quite a bit. Look into extracting the text value from these streams, including CID-encoded fonts that look like this:
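A sketch of what an in-house decoder might look like, assuming the PDF rules for ASCII85Decode: whitespace is ignored, `z` stands for four zero bytes, `~>` ends the data, and a final partial group of n characters yields n-1 bytes. The names `ascii85Decode` and `groupToBytes` are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
)

// groupToBytes converts five base-85 digits into four big-endian bytes.
func groupToBytes(g [5]uint64) ([4]byte, error) {
	var v uint64
	for _, d := range g {
		v = v*85 + d
	}
	if v > 0xFFFFFFFF {
		return [4]byte{}, errors.New("ascii85: group overflows 32 bits")
	}
	return [4]byte{byte(v >> 24), byte(v >> 16), byte(v >> 8), byte(v)}, nil
}

// ascii85Decode (hypothetical name) decodes PDF-style ASCII85 data.
func ascii85Decode(src []byte) ([]byte, error) {
	var out []byte
	var g [5]uint64
	n := 0
loop:
	for _, c := range src {
		switch {
		case c == '~': // start of the "~>" end-of-data marker
			break loop
		case c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == 0:
			// whitespace is ignored
		case c == 'z' && n == 0: // shorthand for four zero bytes
			out = append(out, 0, 0, 0, 0)
		case c >= '!' && c <= 'u':
			g[n] = uint64(c - '!')
			n++
			if n == 5 {
				b, err := groupToBytes(g)
				if err != nil {
					return out, err
				}
				out = append(out, b[:]...)
				n = 0
			}
		default:
			return out, fmt.Errorf("ascii85: invalid character %q", c)
		}
	}
	if n == 1 {
		return out, errors.New("ascii85: dangling single character")
	}
	if n > 1 { // pad the partial group with 'u' and keep n-1 bytes
		for i := n; i < 5; i++ {
			g[i] = 84
		}
		b, err := groupToBytes(g)
		if err != nil {
			return out, err
		}
		out = append(out, b[:n-1]...)
	}
	return out, nil
}

func main() {
	out, err := ascii85Decode([]byte("9jqo^~>"))
	fmt.Printf("%q %v\n", out, err)
}
```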
1 0 obj
<</MediaBox [0 0 612.8264285714283 792.8264285714283]/Group <</CS /DeviceRGB/I true/S /Transparency>>/Contents 2 0 R/Type /Page/Parent 4 0 R/Resources 11 0 R>>
endobj
2 0 obj
<</Length 3 0 R/Filter /FlateDecode>>
stream
0.1 w
q 0 0 612 792 re
W* n
q 0 0 0 rg
BT
56.8 724 Td /F1 12 Tf[<01>-2<02>1<03>2<03>2<04>-7<05>17<06>85<040703>2<08>]TJ
ET
Q
Q
endstream
endobj
8 0 obj
<</Filter /FlateDecode/Length 260>>
stream
/CIDInit/ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo<<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName/Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
8 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0057>
<07> <0072>
<08> <0064>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj
9 0 obj
<</Type /Font/Subtype /TrueType/BaseFont /BAAAAA+LiberationSerif/FirstChar 0/LastChar 8/Widths [777 722 443 277 500 250 943 333 500]/FontDescriptor 7 0 R/ToUnicode 8 0 R>>
endobj
10 0 obj
<</F1 9 0 R>>
endobj
11 0 obj
<</Font 10 0 R/ProcSet [/PDF /Text]>>
endobj
Object 2 uses the character map in object 8 to spell out "Hello World".
The Contents field in a page object can actually be an array of streams. Add support for arrays in this case.
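The mapping can be sketched roughly as follows, assuming the CMap and content stream are already decompressed, that only `bfchar` entries are present, and that codes are one byte wide (as the `<00> <FF>` codespace range above suggests). `parseBfChars` and `decodeTJ` are hypothetical helpers; `bfrange` and multi-byte codes are not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"unicode/utf16"
)

var (
	sectionRe = regexp.MustCompile(`(?s)beginbfchar(.*?)endbfchar`)
	pairRe    = regexp.MustCompile(`<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>`)
	hexStrRe  = regexp.MustCompile(`<([0-9A-Fa-f]*)>`)
)

// utf16HexToString decodes a hex-encoded UTF-16BE destination value.
func utf16HexToString(h string) string {
	var units []uint16
	for i := 0; i+4 <= len(h); i += 4 {
		u, err := strconv.ParseUint(h[i:i+4], 16, 16)
		if err != nil {
			continue
		}
		units = append(units, uint16(u))
	}
	return string(utf16.Decode(units))
}

// parseBfChars collects code -> text pairs from the bfchar sections only,
// so the codespace range line is not mistaken for a mapping.
func parseBfChars(cmap string) map[string]string {
	m := map[string]string{}
	for _, sec := range sectionRe.FindAllStringSubmatch(cmap, -1) {
		for _, p := range pairRe.FindAllStringSubmatch(sec[1], -1) {
			m[strings.ToUpper(p[1])] = utf16HexToString(p[2])
		}
	}
	return m
}

// decodeTJ maps every one-byte code in the hex strings of a TJ array.
func decodeTJ(tj string, m map[string]string) string {
	var sb strings.Builder
	for _, hs := range hexStrRe.FindAllStringSubmatch(tj, -1) {
		h := strings.ToUpper(hs[1])
		for i := 0; i+2 <= len(h); i += 2 {
			sb.WriteString(m[h[i:i+2]])
		}
	}
	return sb.String()
}

func main() {
	cmap := `1 begincodespacerange
<00> <FF>
endcodespacerange
8 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0057>
<07> <0072>
<08> <0064>
endbfchar`
	tj := `[<01>-2<02>1<03>2<03>2<04>-7<05>17<06>85<040703>2<08>]TJ`
	fmt.Println(decodeTJ(tj, parseBfChars(cmap))) // prints "Hello World"
}
```

The numbers between the hex strings in the TJ array are kerning adjustments and are skipped here.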
2fdc09faab1a53a1eec970796544163ccd3eb6500d943797a39c77459f8c054a.pdf
if something goes wrong, make an attempt to locate the xref or at least the Root dictionary
add support for this filter
Currently only 8 bits per component is supported, because other values are rare and things get much trickier when using non-byte values.
extract urls for hyperlinks into their own file
<> throws invalid hex character error
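A sketch of a more tolerant hex string decoder, assuming the PDF rules that `<>` is a valid empty string, whitespace inside the brackets is ignored, and an odd number of digits is treated as if a trailing `0` were present. `decodeHexString` is a hypothetical name.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// decodeHexString (hypothetical name) decodes a PDF hex string token such as <48656C6C6F>.
func decodeHexString(tok string) ([]byte, error) {
	if len(tok) < 2 || tok[0] != '<' || tok[len(tok)-1] != '>' {
		return nil, errors.New("hex string: missing angle brackets")
	}
	var digits []byte
	for _, c := range []byte(tok[1 : len(tok)-1]) {
		switch c {
		case ' ', '\t', '\r', '\n', '\f', 0:
			continue // whitespace is ignored
		}
		digits = append(digits, c)
	}
	if len(digits)%2 == 1 {
		digits = append(digits, '0') // odd length: assume a trailing zero
	}
	out := make([]byte, hex.DecodedLen(len(digits)))
	if _, err := hex.Decode(out, digits); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	for _, tok := range []string{"<>", "<48 65>", "<901FA>"} {
		b, err := decodeHexString(tok)
		fmt.Printf("%s -> % X (err=%v)\n", tok, b, err)
	}
}
```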
locate xref streams prior to loading xref in case startxref offset is broken
xref streams can set type to 2, which means compressed object.
That object is then stored inside another object's stream for some god awful reason
Adobe added an encryption version, detailed in their supplement here
Report all malformation issues and abnormalities to a separate file
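A sketch of reading xref stream rows, assuming the stream is already decoded and the three field widths come from the `/W` array. For type 2 rows the second field is the object number of the containing object stream and the third is the index inside it. `xrefEntry` and `parseXrefStream` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// xrefEntry is one row of a cross-reference stream.
type xrefEntry struct {
	Type   uint64 // 0 = free, 1 = uncompressed, 2 = compressed
	Field2 uint64 // type 1: byte offset; type 2: object stream number
	Field3 uint64 // type 1: generation; type 2: index within the object stream
}

// readBE reads a big-endian unsigned integer of arbitrary width.
func readBE(b []byte) uint64 {
	var v uint64
	for _, x := range b {
		v = v<<8 | uint64(x)
	}
	return v
}

// parseXrefStream (hypothetical name) splits decoded data into rows using /W.
func parseXrefStream(data []byte, w [3]int) ([]xrefEntry, error) {
	rowLen := w[0] + w[1] + w[2]
	if rowLen <= 0 || len(data)%rowLen != 0 {
		return nil, errors.New("xref stream: data does not match /W widths")
	}
	var entries []xrefEntry
	for off := 0; off < len(data); off += rowLen {
		row := data[off : off+rowLen]
		e := xrefEntry{Type: 1} // the type defaults to 1 when W[0] is 0
		if w[0] > 0 {
			e.Type = readBE(row[:w[0]])
		}
		e.Field2 = readBE(row[w[0] : w[0]+w[1]])
		e.Field3 = readBE(row[w[0]+w[1]:])
		entries = append(entries, e)
	}
	return entries, nil
}

func main() {
	data := []byte{1, 0x00, 0x0F, 0, 2, 0x00, 0x05, 3}
	entries, err := parseXrefStream(data, [3]int{1, 2, 1})
	fmt.Println(entries, err)
}
```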
a base URI can be declared in the catalog and is prepended to relative URIs
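One way to apply the base, sketched with Go's `net/url`. This uses standard reference resolution rather than plain string concatenation, which is an assumption about the desired behaviour: absolute URIs pass through unchanged and `../` segments are collapsed. `resolveURI` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURI (hypothetical name) resolves ref against the catalog's base URI.
func resolveURI(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	for _, ref := range []string{"page.html", "https://other.example/x"} {
		u, err := resolveURI("http://example.com/docs/", ref)
		fmt.Println(u, err)
	}
}
```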
add support for this filter
add support for this filter
add support for this filter
dump all filenames to manifest, not just embedded ones
Stumbled upon a PDF with empty streams in it. Unsure if this would occur in a legit PDF or not, but we should handle it nonetheless.
dump all javascript into a separate file
Sometimes a PDF's xref table is not included
repair the xref table by finding all offsets to strings that look like [0-9]+\s+[0-9]+\s+obj
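The scan could look roughly like this, using the pattern above with capture groups added. `objRef` and `scanObjects` are hypothetical names. Later matches overwrite earlier ones, on the assumption that the last definition in the file wins, and matches inside stream data are not filtered out.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// objRe is the pattern from the note above, with capture groups added.
var objRe = regexp.MustCompile(`([0-9]+)\s+([0-9]+)\s+obj`)

// objRef identifies an indirect object by number and generation.
type objRef struct{ Num, Gen int }

// scanObjects (hypothetical name) rebuilds object offsets by scanning the file.
func scanObjects(pdf []byte) map[objRef]int64 {
	offsets := map[objRef]int64{}
	for _, m := range objRe.FindAllSubmatchIndex(pdf, -1) {
		num, err1 := strconv.Atoi(string(pdf[m[2]:m[3]]))
		gen, err2 := strconv.Atoi(string(pdf[m[4]:m[5]]))
		if err1 != nil || err2 != nil {
			continue // number too large to be a real object id
		}
		offsets[objRef{num, gen}] = int64(m[0])
	}
	return offsets
}

func main() {
	pdf := []byte("%PDF-1.4\n1 0 obj\n<<>>\nendobj\n2 0 obj\n<<>>\nendobj\n")
	fmt.Println(scanObjects(pdf))
}
```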
all pdfs should end with the line %%EOF
assert this is true and get the start xref offset from the second to last line
repair xref if this assertion is not true
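A sketch of the check, with one deviation from the note above: instead of counting lines from the end, it looks for the last `startxref` keyword before the trailing `%%EOF`, which also copes with mixed line endings. `findStartXref` is a hypothetical name; an error return is the signal to fall back to xref repair.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// findStartXref (hypothetical name) checks for a trailing %%EOF and returns
// the offset that follows the last startxref keyword.
func findStartXref(pdf []byte) (int64, error) {
	trimmed := bytes.TrimRight(pdf, " \t\r\n\x00")
	if !bytes.HasSuffix(trimmed, []byte("%%EOF")) {
		return 0, errors.New("file does not end with %%EOF")
	}
	body := trimmed[:len(trimmed)-len("%%EOF")]
	i := bytes.LastIndex(body, []byte("startxref"))
	if i < 0 {
		return 0, errors.New("startxref keyword not found")
	}
	field := strings.TrimSpace(string(body[i+len("startxref"):]))
	return strconv.ParseInt(field, 10, 64)
}

func main() {
	off, err := findStartXref([]byte("trailer\n<<>>\nstartxref\n116\n%%EOF\n"))
	fmt.Println(off, err)
}
```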
The pdf specification specifically prohibits this but we have at least one example like this.
rework the TIFF predictor for LZW and Flate decode to apply to the same color component rather than byte
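For the 8 bits per component case, a per-component version might look like the sketch below: each byte is added to the byte `colors` positions earlier in the same row, so red accumulates with red, green with green, and so on. `undoTIFFPredictor8` is a hypothetical name, and other bit depths are not covered.

```go
package main

import (
	"errors"
	"fmt"
)

// undoTIFFPredictor8 (hypothetical name) reverses TIFF predictor 2 for
// 8-bit samples. columns is pixels per row, colors is components per pixel.
func undoTIFFPredictor8(data []byte, columns, colors int) ([]byte, error) {
	rowLen := columns * colors
	if columns <= 0 || colors <= 0 || len(data)%rowLen != 0 {
		return nil, errors.New("tiff predictor: data is not a whole number of rows")
	}
	out := make([]byte, len(data))
	copy(out, data)
	for row := 0; row < len(out); row += rowLen {
		// The first pixel of each row is stored as-is; every later sample is
		// a difference from the same component of the pixel to its left.
		for i := colors; i < rowLen; i++ {
			out[row+i] += out[row+i-colors]
		}
	}
	return out, nil
}

func main() {
	in := []byte{10, 20, 30, 1, 2, 3, 5, 5, 5, 250, 0, 1}
	out, err := undoTIFFPredictor8(in, 2, 3)
	fmt.Println(out, err)
}
```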
Sometimes the length specified in the stream dictionary is intentionally less than the actual length of the stream. Need to read until the endstream marker and then compare sizes.
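A rough sketch of the comparison, assuming `start` is the offset of the first data byte after the `stream` keyword's line ending. `readStream` is a hypothetical name. It takes the first `endstream` it finds, so binary data that happens to contain that keyword would be cut short.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// readStream (hypothetical name) reads stream data up to the endstream marker
// and reports whether its size matches the declared /Length.
func readStream(buf []byte, start, declared int) (data []byte, matches bool, err error) {
	if start < 0 || start > len(buf) {
		return nil, false, errors.New("stream: start offset out of range")
	}
	end := bytes.Index(buf[start:], []byte("endstream"))
	if end < 0 {
		return nil, false, errors.New("stream: no endstream marker")
	}
	data = buf[start : start+end]
	// The end-of-line before endstream is not part of the stream data.
	data = bytes.TrimSuffix(data, []byte("\n"))
	data = bytes.TrimSuffix(data, []byte("\r"))
	return data, len(data) == declared, nil
}

func main() {
	buf := []byte("stream\nHELLO WORLD\nendstream")
	data, ok, err := readStream(buf, 7, 5)
	fmt.Printf("%q matches=%v err=%v\n", data, ok, err)
}
```

A mismatch is worth recording alongside the other abnormalities rather than treating it as a fatal error.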