karmapenny / pdfparser

11.0 1.0 6.0 214 KB

PDF Parser is a command line tool and Go library for analyzing PDF files.

License: Other

Go 100.00%

pdf golang go security-tools malware-analyzer

pdfparser's People

asch513 krayzpipes dolanor-galaxy cristina-gabriela adamjaso

pdfparser's Issues

handle crypto block panics

One of the crypto packages panics when it is given bad data to decrypt. Make sure we recover from the panic and continue parsing the rest of the PDF.
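
A minimal sketch of the recovery pattern, not the project's actual code. The names `decryptCBC` and `safeDecrypt` are hypothetical, and AES-CBC is used only because `CryptBlocks` in `crypto/cipher` is known to panic when the input is not a whole number of blocks.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// decryptCBC stands in for a stream decryption step (hypothetical helper).
// CryptBlocks panics if len(data) is not a multiple of the block size.
func decryptCBC(key, iv, data []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	out := make([]byte, len(data))
	cipher.NewCBCDecrypter(block, iv).CryptBlocks(out, data)
	return out
}

// safeDecrypt turns any panic raised while decrypting into an ordinary error
// so the caller can skip the damaged object and keep parsing.
func safeDecrypt(key, iv, data []byte) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out = nil
			err = fmt.Errorf("decrypt panicked: %v", r)
		}
	}()
	return decryptCBC(key, iv, data), nil
}

func main() {
	key := make([]byte, 16)
	iv := make([]byte, 16)
	if _, err := safeDecrypt(key, iv, []byte("short")); err != nil {
		fmt.Println("skipping object:", err)
	}
	fmt.Println("parsing continues")
}
```

Using named return values lets the deferred function replace the result after the panic has unwound the call.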

handle URI in file specification

File specifications can refer to URIs as well as files, so handle that case.

More Filters

Add support for additional filters

ASCIIHexDecode
ASCII85Decode
RunLengthDecode
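
As an illustration of two of these filters, here is a minimal sketch following the filter descriptions in the PDF specification. The function names are hypothetical and are not taken from the pdfparser code base.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// asciiHexDecode decodes ASCIIHexDecode data: pairs of hex digits, white space
// ignored, '>' marks end of data, and a trailing odd digit is padded with 0.
func asciiHexDecode(src []byte) ([]byte, error) {
	var out []byte
	var hi byte
	haveHi := false
	for _, c := range src {
		var v byte
		switch {
		case c == '>':
			if haveHi {
				out = append(out, hi<<4)
			}
			return out, nil
		case c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ':
			continue
		case c >= '0' && c <= '9':
			v = c - '0'
		case c >= 'a' && c <= 'f':
			v = c - 'a' + 10
		case c >= 'A' && c <= 'F':
			v = c - 'A' + 10
		default:
			return nil, fmt.Errorf("asciihex: invalid byte %#x", c)
		}
		if haveHi {
			out = append(out, hi<<4|v)
			haveHi = false
		} else {
			hi, haveHi = v, true
		}
	}
	if haveHi {
		out = append(out, hi<<4)
	}
	return out, nil
}

// runLengthDecode decodes RunLengthDecode data: a length byte n of 0..127
// copies the next n+1 bytes, 129..255 repeats the next byte 257-n times,
// and 128 marks end of data.
func runLengthDecode(src []byte) ([]byte, error) {
	var out []byte
	for i := 0; i < len(src); {
		n := int(src[i])
		i++
		switch {
		case n == 128:
			return out, nil
		case n < 128:
			if i+n+1 > len(src) {
				return nil, errors.New("runlength: truncated literal run")
			}
			out = append(out, src[i:i+n+1]...)
			i += n + 1
		default:
			if i >= len(src) {
				return nil, errors.New("runlength: truncated repeat run")
			}
			out = append(out, bytes.Repeat(src[i:i+1], 257-n)...)
			i++
		}
	}
	return out, nil
}

func main() {
	h, _ := asciiHexDecode([]byte("48 65 6C6c 6F>"))
	r, _ := runLengthDecode([]byte{2, 'a', 'b', 'c', 254, 'x', 128})
	fmt.Printf("%s %s\n", h, r)
}
```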

URL NameTreeMap

Add URLs from the URL NameTreeMap.

Standard passwordless Decryption support

Add support for the standard decryption filter when the password is blank.
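
For context, a hedged sketch of how the file encryption key can be derived for the standard security handler at revisions 2 and 3, following the key computation algorithm in the PDF 1.7 specification with an empty user password. The names `padPassword` and `fileKey` are hypothetical. Verifying the key against the /U entry and the revision 4 /EncryptMetadata case are not covered, and the /O, /P and /ID values in `main` are placeholders.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"errors"
	"fmt"
)

// passwordPad is the 32-byte padding string defined by the PDF specification.
var passwordPad = []byte{
	0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
	0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
	0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
	0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A,
}

// padPassword truncates or pads a password to exactly 32 bytes.
// A blank password yields the padding string itself.
func padPassword(pw []byte) []byte {
	if len(pw) > 32 {
		pw = pw[:32]
	}
	out := append([]byte{}, pw...)
	return append(out, passwordPad[:32-len(pw)]...)
}

// fileKey derives the encryption key from the password, the /O string, the /P
// permissions, the first /ID element, the /R revision and the key length in bytes.
func fileKey(pw, o []byte, p int32, id []byte, rev, keyLen int) ([]byte, error) {
	if rev < 3 {
		keyLen = 5 // revision 2 always uses 40-bit keys
	}
	if keyLen < 5 || keyLen > 16 {
		return nil, errors.New("key length must be 5..16 bytes")
	}
	h := md5.New()
	h.Write(padPassword(pw))
	h.Write(o)
	var pb [4]byte
	binary.LittleEndian.PutUint32(pb[:], uint32(p)) // /P as 4 bytes, low-order first
	h.Write(pb[:])
	h.Write(id)
	sum := h.Sum(nil)
	if rev >= 3 {
		// Revision 3 and later rehash the first keyLen bytes 50 times.
		for i := 0; i < 50; i++ {
			s := md5.Sum(sum[:keyLen])
			sum = s[:]
		}
	}
	return sum[:keyLen], nil
}

func main() {
	// Placeholder inputs; real values come from the /Encrypt dictionary and trailer.
	key, err := fileKey(nil, make([]byte, 32), -44, []byte("file-id"), 3, 16)
	fmt.Printf("%x %v\n", key, err)
}
```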

Custom ASCII85Decode

The Go base85 decoder seems to fail on some PDFs, so write an in-house version that follows the PDF specification.
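
A minimal sketch of such a decoder, assuming the rules in the PDF specification: white space is ignored, `z` stands for four zero bytes, `~>` ends the data, and a short final group is padded with `u`. The function names are hypothetical. One possible cause of the failures, if the decoder in question is the standard library's `encoding/ascii85`, is that its `Decode` expects the `~>` wrapper to have been stripped by the caller.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// decodeGroup converts five base-85 digits (values 0..84) into four bytes.
func decodeGroup(g []byte) ([]byte, error) {
	var v uint64
	for _, d := range g {
		v = v*85 + uint64(d)
	}
	if v > math.MaxUint32 {
		return nil, errors.New("ascii85: group value overflows 32 bits")
	}
	return []byte{byte(v >> 24), byte(v >> 16), byte(v >> 8), byte(v)}, nil
}

// ascii85Decode decodes ASCII85Decode stream data up to the "~>" marker
// or the end of the input, whichever comes first.
func ascii85Decode(src []byte) ([]byte, error) {
	var out []byte
	var group []byte // pending digits of the current 5-character group
loop:
	for _, c := range src {
		switch {
		case c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ':
			continue
		case c == '~':
			break loop // start of the end-of-data marker
		case c == 'z' && len(group) == 0:
			out = append(out, 0, 0, 0, 0)
		case c >= '!' && c <= 'u':
			group = append(group, c-'!')
			if len(group) == 5 {
				b, err := decodeGroup(group)
				if err != nil {
					return nil, err
				}
				out = append(out, b...)
				group = group[:0]
			}
		default:
			return nil, fmt.Errorf("ascii85: invalid byte %#x", c)
		}
	}
	n := len(group)
	if n == 0 {
		return out, nil
	}
	if n == 1 {
		return nil, errors.New("ascii85: final group has a single character")
	}
	// A final group of n characters encodes n-1 bytes; pad with 'u' (digit 84).
	for len(group) < 5 {
		group = append(group, 84)
	}
	b, err := decodeGroup(group)
	if err != nil {
		return nil, err
	}
	return append(out, b[:n-1]...), nil
}

func main() {
	out, err := ascii85Decode([]byte("87cUR D]i,\"\nEbo7~>"))
	fmt.Printf("%q %v\n", out, err)
}
```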

embedded file extraction

Extract embedded files to a directory.
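
One way this could look, as a sketch only. The helpers `safeName` and `saveEmbedded` are hypothetical. Because the file name comes from an untrusted PDF, the sketch assumes it should be reduced to its last path element before writing, so a name such as `../../etc/passwd` cannot escape the output directory.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"path/filepath"
	"strings"
)

// safeName reduces an embedded file name to its final path element,
// treating both '/' and '\' as separators.
func safeName(name string) string {
	base := path.Base(strings.ReplaceAll(name, "\\", "/"))
	if base == "." || base == ".." || base == "/" {
		return "unnamed"
	}
	return base
}

// saveEmbedded writes data into dir under a sanitized name and returns the path.
func saveEmbedded(dir, name string, data []byte) (string, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	p := filepath.Join(dir, safeName(name))
	return p, os.WriteFile(p, data, 0o600)
}

func main() {
	dir, err := os.MkdirTemp("", "pdfparser-embedded")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)
	p, err := saveEmbedded(dir, "../../evil.exe", []byte("payload"))
	fmt.Println(p, err)
}
```

Files with the same sanitized name overwrite each other in this sketch; a real extractor would likely need a collision strategy.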

Support additional filters

CCITTFaxDecode
JBIG2Decode
DCTDecode
JPXDecode
Crypt

Text extraction

Content streams can obscure text quite a bit. Look into extracting the text values from these streams, including text drawn with CID-encoded fonts, as in this example:

1 0 obj
<</MediaBox [0 0 612.8264285714283 792.8264285714283]/Group <</CS /DeviceRGB/I true/S /Transparency>>/Contents 2 0 R/Type /Page/Parent 4 0 R/Resources 11 0 R>>
endobj

2 0 obj
<</Length 3 0 R/Filter /FlateDecode>>
stream
0.1 w
q 0 0 612 792 re
W* n
q 0 0 0 rg
BT
56.8 724 Td /F1 12 Tf[<01>-2<02>1<03>2<03>2<04>-7<05>17<06>85<040703>2<08>]TJ
ET
Q
Q 
endstream
endobj

8 0 obj
<</Filter /FlateDecode/Length 260>>
stream
/CIDInit/ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo<<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName/Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
8 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0057>
<07> <0072>
<08> <0064>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj

9 0 obj
<</Type /Font/Subtype /TrueType/BaseFont /BAAAAA+LiberationSerif/FirstChar 0/LastChar 8/Widths [777 722 443 277 500 250 943 333 500]/FontDescriptor 7 0 R/ToUnicode 8 0 R>>
endobj

10 0 obj
<</F1 9 0 R>>
endobj

11 0 obj
<</Font 10 0 R/ProcSet [/PDF /Text]>>
endobj

Object 2 uses the character map in object 8 (referenced by the /ToUnicode entry of the font in object 9) to spell out "Hello World".
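
The lookup can be sketched as follows. This is a simplified illustration and not the project's implementation: `parseBfChar` and `decodeTJ` are hypothetical names, only one-byte codes and `bfchar` entries are handled (no `bfrange`), literal strings in parentheses are ignored, and the kerning numbers in the TJ array are skipped.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"unicode/utf16"
)

var (
	bfcharSection = regexp.MustCompile(`(?s)beginbfchar(.*?)endbfchar`)
	hexPair       = regexp.MustCompile(`<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>`)
	hexString     = regexp.MustCompile(`<([0-9A-Fa-f]+)>`)
)

// parseBfChar maps one-byte character codes to text using the bfchar
// sections of a ToUnicode CMap. Destinations are read as UTF-16BE.
func parseBfChar(cmap string) map[byte]string {
	m := make(map[byte]string)
	for _, sec := range bfcharSection.FindAllStringSubmatch(cmap, -1) {
		for _, p := range hexPair.FindAllStringSubmatch(sec[1], -1) {
			if len(p[1]) != 2 {
				continue // multi-byte codes are outside this sketch
			}
			code, err := strconv.ParseUint(p[1], 16, 8)
			if err != nil {
				continue
			}
			var units []uint16
			for i := 0; i+4 <= len(p[2]); i += 4 {
				u, _ := strconv.ParseUint(p[2][i:i+4], 16, 16)
				units = append(units, uint16(u))
			}
			m[byte(code)] = string(utf16.Decode(units))
		}
	}
	return m
}

// decodeTJ translates every hex string in a TJ operand through the map.
// Codes without a mapping become U+FFFD.
func decodeTJ(operand string, m map[byte]string) string {
	var sb strings.Builder
	for _, h := range hexString.FindAllStringSubmatch(operand, -1) {
		raw, err := hex.DecodeString(h[1])
		if err != nil {
			continue // odd number of hex digits; not handled here
		}
		for _, b := range raw {
			if s, ok := m[b]; ok {
				sb.WriteString(s)
			} else {
				sb.WriteRune('\uFFFD')
			}
		}
	}
	return sb.String()
}

func main() {
	cmap := "8 beginbfchar\n<01> <0048>\n<02> <0065>\n<03> <006C>\n<04> <006F>\n" +
		"<05> <0020>\n<06> <0057>\n<07> <0072>\n<08> <0064>\nendbfchar"
	tj := "[<01>-2<02>1<03>2<03>2<04>-7<05>17<06>85<040703>2<08>]"
	fmt.Println(decodeTJ(tj, parseBfChar(cmap)))
}
```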