pdf's People

Contributors

functionary, josharian, leighmcculloch, odeke-em, rsc

pdf's Issues

Text position for certain font might not work

I'm documenting this issue so that others with similar goals know it exists.
When text in certain fonts is decoded, it is processed character by character, but every one of those characters is reported with the same position coordinates. See the screenshot below.
I might dig into it and try to fix it. We'll see.
[Screenshot of text position error]

trouble decoding PDF with space at end of line

I received a PDF from clippercard.com that has "%PDF-1.3 " as the first line - note the space character after the "3". This causes rsc.io/pdf to fail to parse the PDF file. The PDF file in question has a number of space characters at the end of lines which cause the PDF library to alternately return these errors, depending on which space characters you fix:

not a PDF file: invalid header
malformed PDF: cross-reference table not found: ref
malformed PDF file: missing final startxref

Chrome, Mozilla Firefox and Preview.app have no problem displaying the PDF in question.
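
As a rough illustration of the more lenient behaviour those viewers seem to have, a header check can strip trailing spaces and tabs before comparing. This is only a sketch: `looksLikePDFHeader` is a hypothetical helper, not part of rsc.io/pdf, and it assumes a PDF 1.0 through 1.7 header on the first line.

```go
package main

import "bytes"

// looksLikePDFHeader reports whether the first line of data is a PDF
// version header such as "%PDF-1.3", ignoring trailing spaces and tabs.
// Hypothetical helper for illustration; not part of rsc.io/pdf.
func looksLikePDFHeader(data []byte) bool {
	// Isolate the first line, accepting \n, \r\n, or \r line endings.
	line := data
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		line = data[:i]
	}
	// Drop the trailing whitespace that a strict check would reject.
	line = bytes.TrimRight(line, " \t")

	const prefix = "%PDF-1."
	if !bytes.HasPrefix(line, []byte(prefix)) || len(line) != len(prefix)+1 {
		return false
	}
	// Assumption: only minor versions 0 through 7 are accepted.
	minor := line[len(prefix)]
	return minor >= '0' && minor <= '7'
}
```

The same trimming idea would also have to be applied to the `xref` and `startxref` lines to cover the other two errors listed above.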

panic on some PDFs + suspect memory leak

I have the following Go program that uses this library:

package main

import (
	"fmt"
	"os"
	"strconv"
	"rsc.io/pdf"
)

func main() {
	if len(os.Args) < 2 || os.Args[1] == "-h" || os.Args[1] == "--help" {
		fmt.Println("usage: pdfpage file.pdf [pnum]")
		os.Exit(1)
	}
	reader, err := pdf.Open(os.Args[1])
	if err != nil {
		fmt.Println(err)
		os.Exit(2)
	}
	if len(os.Args) == 3 {
		var pnum int
		var err error
		if pnum, err = strconv.Atoi(os.Args[2]); err != nil {
			pnum = 1
		}
		fmt.Printf("PAGE %d\n", pnum)
		printPage(reader, pnum)
	} else {
		for pnum := 1; pnum <= reader.NumPage(); pnum++ {
			fmt.Printf("PAGE %d\n", pnum)
			printPage(reader, pnum)
			fmt.Println("")
		}
	}
}

func printPage(reader *pdf.Reader, pnum int) {
	page := reader.Page(pnum)
	if page.V.IsNull() {
		fmt.Printf("failed to read page %d\n", pnum)
		os.Exit(3)
	}
	for _, chunk := range page.Content().Text {
		fmt.Printf("x=%06.2f y=%06.2f w=%06.2f %q %s %.1fpt\n",
			chunk.X, chunk.Y, chunk.W, chunk.S, chunk.Font,
			chunk.FontSize)
	}
}

This builds and runs fine and for many PDFs gives the expected output (although it is rather slow).
However I have a few PDFs which produce a panic:

PAGE 1
panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
rsc.io/pdf.(*buffer).errorf(0xc4200d3948, 0x507f70, 0x27, 0xc4200d36d0, 0x2, 0x2)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:82 +0x74
rsc.io/pdf.(*buffer).reload(0xc4200d3948, 0x8)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:95 +0x193
rsc.io/pdf.(*buffer).readByte(0xc4200d3948, 0x599da0)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:71 +0x69
rsc.io/pdf.(*buffer).readToken(0xc4200d3948, 0xc42000aca0, 0x1000)
	/home/mark/app/go/src/rsc.io/pdf/lex.go:135 +0x4a
rsc.io/pdf.Interpret(0xc42006e060, 0x37, 0x4d78a0, 0xc42000ab60, 0xc4200d3b08)
	/home/mark/app/go/src/rsc.io/pdf/ps.go:64 +0x1c6
rsc.io/pdf.Page.Content(0xc42006e060, 0x37, 0x4db2e0, 0xc420014810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/mark/app/go/src/rsc.io/pdf/page.go:613 +0x326
main.printPage(0xc42006e060, 0x1)
	/home/mark/app/go/src/pdfpage2/main.go:47 +0xa8
main.main()
	/home/mark/app/go/src/pdfpage2/main.go:35 +0x25d

I also have a 647 page PDF for which the program outputs the first 22 pages, then outputs PAGE 23 and then just sits there eating memory and using ~25% CPU. That particular page has some Japanese characters but I don't know if they are Unicode text or paths.
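
Until the panic itself is fixed, one possible workaround on the caller's side is to turn it into an ordinary error with `recover`, so that one bad page does not end the whole run. The sketch below is generic Go and makes no assumption about the library; `safely` is a hypothetical helper name.

```go
package main

import "fmt"

// safely runs fn and converts any panic it raises into an error,
// so the caller can report the problem and move on.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}
```

In the program above, the call `printPage(reader, pnum)` could be wrapped as `safely(func() { printPage(reader, pnum) })`, with the returned error printed before continuing to the next page. This only covers the panic; it does nothing for the page that hangs and consumes memory.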

How to avoid loops when traversing the graph?

Greetings,

Given a PDF file that has a loop (e.g., pages found via "Kids" have "Parent" entries pointing back), how do I traverse the graph without getting stuck?

I tried saving the Values that I have visited, but they are not == to the new Values.
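
For reference, the usual way around this is to key the visited set on a stable identity instead of on the value itself. In a PDF, an indirect object is identified by its object number and generation number, so that pair is a natural key. The sketch below uses hypothetical `objID` and `node` types as stand-ins; whether rsc.io/pdf exposes such an identifier is not something this thread establishes.

```go
package main

// objID identifies an indirect PDF object by object and generation number.
type objID struct {
	Num, Gen uint32
}

// node is a hypothetical stand-in for a dictionary in the PDF object graph.
type node struct {
	ID     objID
	Parent *node
	Kids   []*node
}

// walk follows both Kids and Parent links, visits every reachable node
// exactly once, and returns the IDs in visit order. The seen set is keyed
// by objID, so a node reached again through a back link is skipped.
func walk(root *node) []objID {
	seen := make(map[objID]bool)
	var order []objID

	var visit func(n *node)
	visit = func(n *node) {
		if n == nil || seen[n.ID] {
			return
		}
		seen[n.ID] = true
		order = append(order, n.ID)
		for _, k := range n.Kids {
			visit(k)
		}
		visit(n.Parent)
	}

	visit(root)
	return order
}
```

Because the key is a small comparable struct, it works as a map key even when the values wrapping the objects are not equal under `==`.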

go get error: x509: certificate signed by unknown authority

hi,

It seems rsc.io/pdf cannot be retrieved with go get (short of using the -insecure flag):

$ go get -v -u rsc.io/pdf
Fetching https://rsc.io/pdf?go-get=1
https fetch failed: Get https://rsc.io/pdf?go-get=1: x509: certificate signed by unknown authority
package rsc.io/pdf: unrecognized import path "rsc.io/pdf" (https fetch: Get https://rsc.io/pdf?go-get=1: x509: certificate signed by unknown authority)

could this be fixed?

-s

Skipping spaces?

Correct me if I am using this library incorrectly, but the text (string) output I get for a PDF page does not include any spaces between characters.

func ParsePDF() (text string) {
	fileName := "./someFolder/testPDF.pdf"
	reader, err := pdf.Open(fileName)
	if err != nil {
		// log the error
		return "" // reader is unusable if Open failed
	}
	// Pages are numbered from 1; a null page value marks the end.
	for pageNum := 1; ; pageNum++ {
		page := reader.Page(pageNum)
		if page.V.IsNull() {
			break
		}
		// Concatenate the text chunks exactly as the library reports them.
		for _, v := range page.Content().Text {
			text += v.S
		}
	}
	return text
}

When I call this function, which wraps the library's code, the resulting text has the correct characters but no space characters. I believe this is related to: https://github.com/rsc/pdf/blob/master/page.go#L422

Is there a particular reason spaces are being ignored? Am I just using the library incorrectly?
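
One possible workaround, if the chunks carry usable positions, is to rebuild the spacing from the geometry: start a new line when the baseline moves, and insert a space when the horizontal gap to the previous chunk is wide. The sketch below is a heuristic, not the library's behaviour. The `chunk` type is a local stand-in with the X, Y, W, S and FontSize fields used elsewhere in these issues, and the thresholds (0.2 and 0.5 of the font size) are assumptions that would need tuning.

```go
package main

import (
	"math"
	"strings"
)

// chunk is a local stand-in for a positioned piece of text.
type chunk struct {
	X, Y, W  float64
	FontSize float64
	S        string
}

// joinWithSpaces concatenates chunks in order. It inserts a newline when
// the baseline moves by more than half the font size, and a space when the
// horizontal gap to the previous chunk exceeds a fifth of the font size.
func joinWithSpaces(chunks []chunk) string {
	var b strings.Builder
	for i, c := range chunks {
		if i > 0 {
			prev := chunks[i-1]
			gap := c.X - (prev.X + prev.W)
			switch {
			case math.Abs(c.Y-prev.Y) > 0.5*c.FontSize:
				b.WriteByte('\n')
			case gap > 0.2*c.FontSize:
				b.WriteByte(' ')
			}
		}
		b.WriteString(c.S)
	}
	return b.String()
}
```

In `ParsePDF`, the inner loop would copy each text element's fields into a `chunk` and call `joinWithSpaces` once per page instead of appending `v.S` directly.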

Decryption with PKCS

I came across a PDF with PKCS protection, which I had to decrypt with a PFX certificate. I hope you can add a feature to decode such files.

Plans for the future

Hi @rsc. First off: Cool library, thanks for making it!

Do you have any plans to support more PDF versions, or improve on any of the bugs listed on godoc? I would love to have a proper library in Go for parsing PDF documents instead of having to rely on Python's PDFMiner.

I would also love to help out, but I am not sure where to start. I do not know anything about the black magic that seems to be the inner workings of PDF documents.
