pdf2word's People

Contributors

ofchenyuan, simpleapples

pdf2word's Issues

Ran in a virtual environment, but no output file appeared

Running "python3 main.py" or "python main.py" prints the following (正在处理 means "processing", 完成 means "done"):

正在处理:  aa.pdf
完成

Although the run reports success as shown above, I cannot find any Word file in the words folder.
I also edited config.cfg as follows:

[default]
pdf_folder=/home/ffz/Projects/pdfs
word_folder=/home/ffz/Projects/words
max_worker=5
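
One way to narrow this down is to confirm which folders the script actually reads and what ends up in the output folder. The sketch below parses a config.cfg laid out like the one above with the standard configparser module and lists the contents of word_folder. The helper name inspect_config is made up for illustration and is not part of pdf2word.

```python
import configparser
import os


def inspect_config(cfg_path):
    """Return (pdf_folder, word_folder, output_files) for a pdf2word-style config.

    Hypothetical helper for debugging; not part of pdf2word.
    """
    parser = configparser.ConfigParser()
    # read() returns the list of files it parsed; an empty list means not found.
    if not parser.read(cfg_path, encoding="utf-8"):
        raise FileNotFoundError(cfg_path)
    section = parser["default"]
    pdf_folder = section["pdf_folder"]
    word_folder = section["word_folder"]
    # List the output folder if it exists; otherwise report no files.
    outputs = sorted(os.listdir(word_folder)) if os.path.isdir(word_folder) else []
    return pdf_folder, word_folder, outputs
```

Calling inspect_config("config.cfg") from the project directory shows whether word_folder points where you expect and whether anything was written there.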

PDF contains images

The PDF contains images, but after converting it to Word the images are missing from the Word document. How can this be handled?

Suggestions for a few changes

In the main function:

1. Avoid printing a pile of missing-font warnings in the output

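import logging
# Note: each Logger sets its own propagate attribute when it is created, so this class-level assignment probably has no effect.
logging.Logger.propagate = False
# Raise the root logger's level so WARNING messages are no longer printed.
logging.getLogger().setLevel(logging.ERROR)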

2. After the PDF is converted to text, apply at least some regex cleanup. This is what I added:

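import re

def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    # Join a line break between two word characters with a space.
    content = re.compile(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])').sub(r'\1 \2', content)
    # Rejoin a word hyphenated across a line break.
    content0 = re.compile(r'(-)\n([0-9a-zA-Z_])').sub(r'\2', content)
    # Drop line breaks that are surrounded by spaces.
    content1 = re.compile(r' \n ').sub(r'', content0)
    # Remove line breaks that do not follow a period.
    content_2 = re.compile(r'([^.])\n').sub(r'\1', content1)
    # Strip (cid:N) markers with one or two digits.
    content_compile = re.compile(r'\(cid:\d{1,2}\)').sub(r'', content_2)
    save_text_to_word(content_compile, word_file_path)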
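
To see what these substitutions do without running a full conversion, the same patterns can be pulled into a standalone function and tried on a short string. This is only a sketch: clean_extracted_text is a hypothetical name, and it leaves out read_from_pdf and save_text_to_word so it can run on its own.

```python
import re

# The same five patterns as in the suggestion above, compiled once.
_JOIN_WORDS = re.compile(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])')
_DEHYPHENATE = re.compile(r'(-)\n([0-9a-zA-Z_])')
_SPACE_NEWLINE = re.compile(r' \n ')
_MID_SENTENCE_BREAK = re.compile(r'([^.])\n')
_CID_MARKER = re.compile(r'\(cid:\d{1,2}\)')


def clean_extracted_text(content):
    """Apply the cleanup steps in the same order as the suggestion."""
    content = _JOIN_WORDS.sub(r'\1 \2', content)       # "often\nbreaks" -> "often breaks"
    content = _DEHYPHENATE.sub(r'\2', content)         # "ex-\ntraction" -> "extraction"
    content = _SPACE_NEWLINE.sub('', content)          # drop breaks padded by spaces
    content = _MID_SENTENCE_BREAK.sub(r'\1', content)  # keep only breaks after "."
    return _CID_MARKER.sub('', content)                # strip 1-2 digit cid markers


sample = "PDF text ex-\ntraction often\nbreaks lines.\nNext (cid:3)line"
print(clean_extracted_text(sample))
```

Note that the last pattern only matches markers with one or two digits, so a marker such as (cid:1050) passes through unchanged.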

The generated doc contains only cid:1050

λ python main.py
正在处理:  4月报销.pdf
WARNING:root:UniGB-UCS2-H
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 1050
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 2264
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 4409
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 4532
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 3493
WARNING:pdfminer.converter:undefined: <PDFCIDFont: basefont='AdobeKaitiStd-Regular', cidcoding='Adobe-GB1'>, 1480

The generated doc:
[image]
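
The (cid:N) strings are what pdfminer writes for a glyph it could not map to a Unicode character, which matches the "undefined" warnings in the log above. A quick way to judge how much of the output is affected is to count the markers and look at what is left once they are removed. The sketch below does that; cid_report is a hypothetical helper name, and unlike the two-digit pattern suggested in another issue it matches any number of digits, since the ids here have four.

```python
import re

# Matches a pdfminer unmapped-glyph marker such as "(cid:1050)".
CID_MARKER = re.compile(r'\(cid:(\d+)\)')


def cid_report(text):
    """Return (marker_count, remaining_text) for text extracted from a PDF."""
    count = len(CID_MARKER.findall(text))
    remaining = CID_MARKER.sub('', text).strip()
    return count, remaining
```

If remaining_text comes back empty, stripping the markers will not help, because none of the characters were mapped in the first place.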

Error when installing from the requirements txt file

Could not find a version that satisfies the requirement pdfminer3k==1.3.1 (from -r requirements.txt (line 3)) (from versions: 1.3.2, 1.3.3, 1.3.4)
No matching distribution found for pdfminer3k==1.3.1 (from -r requirements.txt (line 3))

Can't convert professional literature

正在处理: Opportunities for using Navy marine mammals to explore associations between organochlorine contaminants and unfavorable effects on reproduction.pdf
WARNING:root:Cannot locate objid=67
WARNING:root:Wrong type: 0 required: <class 'dict'>
WARNING:root:Catalog not found!

Importing process_pdf fails on Python 3.7

ImportError: cannot import name 'process_pdf' from 'pdfminer.pdfinterp' (/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py)
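
The traceback shows that a pdfminer package is installed but its pdfinterp module has no process_pdf. A likely cause, which is an assumption and not confirmed in this thread, is that a different pdfminer distribution is installed than the pdfminer3k pinned in requirements.txt. The sketch below only reports which situation applies; check_pdfminer and its return values are made up for illustration.

```python
import importlib
import importlib.util


def check_pdfminer():
    """Classify the installed pdfminer package.

    Returns 'missing' when no pdfminer package can be found,
    'no_process_pdf' when pdfminer.pdfinterp lacks process_pdf
    (or cannot be imported), and 'ok' otherwise.
    Hypothetical helper; not part of pdf2word or pdfminer.
    """
    if importlib.util.find_spec("pdfminer") is None:
        return "missing"
    try:
        pdfinterp = importlib.import_module("pdfminer.pdfinterp")
    except ImportError:
        return "no_process_pdf"
    return "ok" if hasattr(pdfinterp, "process_pdf") else "no_process_pdf"


print(check_pdfminer())
```

A result of "missing" corresponds to the "No module named 'pdfminer'" error reported in another issue on this page.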

pdfminer3k version error

Python version 3.8.5, pip version 20.1.1.
Installing the dependencies fails with:

ERROR: Could not find a version that satisfies the requirement pdfminer3k==1.3.1 (from -r requirements.txt (line 3)) (from versions: 1.3.2, 1.3.3, 1.3.4)
ERROR: No matching distribution found for pdfminer3k==1.3.1 (from -r requirements.txt (line 3))

After changing pdfminer3k==1.3.1 to pdfminer3k==1.3.4, the install succeeds and the program runs successfully.
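
The same edit can be scripted if requirements.txt has to be patched in several checkouts. This is a minimal sketch under the assumption that the pin sits on its own line in the form package==version; repin is a hypothetical helper.

```python
import re


def repin(requirements_text, package, version):
    """Return requirements text with `package==<old>` changed to `package==version`."""
    # Anchor at line start so only the pin for this exact package is touched.
    pattern = re.compile(r'^(%s)==\S+' % re.escape(package), flags=re.MULTILINE)
    return pattern.sub(r'\g<1>==%s' % version, requirements_text)
```

For example, repin(text, "pdfminer3k", "1.3.4") rewrites only the pdfminer3k line and leaves every other requirement as it was.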

Error at runtime

Running "python main.py" raises this error: ImportError: No module named 'pdfminer'
How should I fix it?

Problem converting Chinese

The PDF contains Chinese, and the conversion fails:
PDFDocEncoding = ''.join( chr(x) for x in (
ValueError: chr() arg not in range(256)
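
The message "chr() arg not in range(256)" is what Python 2 raises when chr() receives a value above 255, whereas Python 3's chr() accepts any code point up to 0x10FFFF. That suggests the script was started with a Python 2 interpreter, although the thread does not confirm it. A minimal guard, assuming it is placed near the top of main.py (require_python3 is a hypothetical name):

```python
import sys


def require_python3(version_info=sys.version_info):
    """Raise RuntimeError when the interpreter is older than Python 3."""
    if version_info[0] < 3:
        raise RuntimeError(
            "Python 3 is required, found %d.%d" % (version_info[0], version_info[1])
        )
    return True


require_python3()
```

With the guard in place, running under Python 2 fails immediately with a clear message instead of partway through the conversion.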

Images are gone, formulas are garbled

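Hi, I converted a few technical papers from arXiv and found that the images are gone and some formulas come out garbled, so these are presumably not supported. Could this part be improved?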

Runs successfully, but the layout is messed up

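I converted a PDF laid out in three columns. After conversion the text has to be split into paragraphs by hand, but most of the words are correct. Thanks a lot Y(^_^)Y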
