python-fan / pdf2word


Multithreaded PDF-to-Word conversion in 60 lines of code

License: MIT License

Python 100.00%

python converter pdf

pdf2word's Introduction

pdf2word

Multiprocess PDF-to-Word conversion in ~~60~~ 40 lines of code

The new version is built on https://github.com/dothinking/pdf2docx

Usage

Clone or download the project to your machine

git clone git@github.com:simpleapples/pdf2word.git

cd pdf2word
python3 -m venv venv

# Linux / macOS
source venv/bin/activate

# Windows
venv\Scripts\activate

# Python < 3.10
pip install -r requirements.txt

# Python 3.10 or later
pip install -r requirements_3_10.txt

Edit config.cfg to set the folders that hold the PDF and Word files, and the number of worker processes that run at the same time
Run python main.py
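
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual main.py: the function names (load_config, build_tasks, convert_one, convert_all) are hypothetical, the [default] section and its pdf_folder, word_folder and max_worker keys are taken from the config.cfg quoted in one of the issues on this page, and the conversion call assumes the Converter class provided by pdf2docx.

```python
import configparser
import os
from concurrent.futures import ProcessPoolExecutor


def load_config(path):
    """Read the folders and worker count from the [default] section."""
    # interpolation=None so that '%' in a folder path is taken literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path, encoding="utf-8")
    section = parser["default"]
    return section["pdf_folder"], section["word_folder"], section.getint("max_worker")


def build_tasks(pdf_folder, word_folder):
    """Pair every PDF in pdf_folder with a .docx path in word_folder."""
    tasks = []
    for name in sorted(os.listdir(pdf_folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".pdf":
            tasks.append((os.path.join(pdf_folder, name),
                          os.path.join(word_folder, stem + ".docx")))
    return tasks


def convert_one(pdf_path, docx_path):
    """Convert a single file; runs inside a worker process."""
    # Imported here so the helpers above work without pdf2docx installed
    from pdf2docx import Converter
    converter = Converter(pdf_path)
    try:
        converter.convert(docx_path)
    finally:
        converter.close()
    return docx_path


def convert_all(config_path):
    """Convert every PDF listed by the config, max_worker files at a time."""
    pdf_folder, word_folder, max_worker = load_config(config_path)
    os.makedirs(word_folder, exist_ok=True)
    with ProcessPoolExecutor(max_workers=max_worker) as pool:
        futures = [pool.submit(convert_one, pdf, docx)
                   for pdf, docx in build_tasks(pdf_folder, word_folder)]
        return [future.result() for future in futures]
```

To try it, call convert_all("config.cfg") from a script. On Windows the call should sit under an if __name__ == "__main__": guard, because worker processes re-import the module.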

Fixing the ModuleNotFoundError: No module named '_tkinter' error

macOS

Install Homebrew

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install tkinter with Homebrew

brew install python-tk

Linux

Using Ubuntu as an example

sudo apt install python3-tk
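
To confirm whether the active interpreter can see the module before or after installing the package, a quick check like the one below may help. It is only a sketch: the function name tkinter_available is made up for illustration, and it uses the standard library's importlib.util.find_spec to look for _tkinter without importing it.

```python
import importlib.util


def tkinter_available():
    """Return True if the _tkinter extension module can be located."""
    return importlib.util.find_spec("_tkinter") is not None


if __name__ == "__main__":
    if tkinter_available():
        print("_tkinter found")
    else:
        print("_tkinter missing: install python-tk (Homebrew) or python3-tk (apt)")
```

Run it with the same interpreter that lives inside the venv, since a system-wide Python may give a different answer.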

Stars are welcome

Python私房菜

License

Released under the MIT open-source license


pdf2word's Issues

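Images are missing and formulas are garbled

Hi, I converted a few technical papers from arXiv and found that the images were gone and some of the formulas came out garbled, presumably because they are not supported. Could this part be improved?

Ran it in a virtual environment, but no file was produced....

Running "python3 main.py" or "python main.py" prints the following ("Processing: aa.pdf", then "Done"):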

正在处理:  aa.pdf
完成

Even though the output looks like this, I cannot find any Word file in the words folder...
I also edited config.cfg as follows:

[default]
pdf_folder=/home/ffz/Projects/pdfs
word_folder=/home/ffz/Projects/words
max_worker=5
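
One way to narrow a report like this down is to compare the two configured folders directly. The sketch below is only an illustration: missing_outputs is a hypothetical helper, and it assumes each output file keeps the PDF's base name with a .docx extension, which the source does not confirm.

```python
import os


def missing_outputs(pdf_folder, word_folder, out_ext=".docx"):
    """List PDFs in pdf_folder that have no matching file in word_folder."""
    if not os.path.isdir(pdf_folder):
        raise FileNotFoundError("pdf_folder does not exist: " + pdf_folder)
    # A missing word_folder simply means nothing has been produced yet
    produced = set(os.listdir(word_folder)) if os.path.isdir(word_folder) else set()
    missing = []
    for name in sorted(os.listdir(pdf_folder)):
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".pdf" and stem + out_ext not in produced:
            missing.append(name)
    return missing
```

Calling missing_outputs("/home/ffz/Projects/pdfs", "/home/ffz/Projects/words") with the paths from the config above would list the PDFs that still lack an output file.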

Problem converting Chinese text

The PDF contains Chinese text and the conversion fails:
PDFDocEncoding = ''.join( chr(x) for x in (
ValueError: chr() arg not in range(256)

WARNING:root:Cannot locate objid=58

A few suggested changes

In the main function:

1. Stop the output from being flooded with font-not-found warnings

import logging
logging.Logger.propagate = False
logging.getLogger().setLevel(logging.ERROR)

2. After the PDF is converted to text, apply at least some regex cleanup. Here is what I added:

import re

def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    # Join lines that break between two word characters, keeping a space
    content = re.sub(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2', content)
    # Rejoin words that were hyphenated across a line break
    content = re.sub(r'(-)\n([0-9a-zA-Z_])', r'\2', content)
    # Drop line breaks that are surrounded by spaces
    content = re.sub(r' \n ', '', content)
    # Drop line breaks that do not follow a period
    content = re.sub(r'([^.])\n', r'\1', content)
    # Strip (cid:NN) placeholders left by unmapped glyphs
    content = re.sub(r'\(cid:\d{1,2}\)', '', content)
    save_text_to_word(content, word_file_path)
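
To see what these substitutions do in practice, the sketch below applies the same five patterns to a short sample. The clean_text name and the sample string are invented for illustration; the patterns themselves are the ones from the suggestion above.

```python
import re


def clean_text(content):
    """Apply the five cleanup substitutions in the suggested order."""
    content = re.sub(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2', content)
    content = re.sub(r'(-)\n([0-9a-zA-Z_])', r'\2', content)
    content = re.sub(r' \n ', '', content)
    content = re.sub(r'([^.])\n', r'\1', content)
    content = re.sub(r'\(cid:\d{1,2}\)', '', content)
    return content


sample = "multi-\nline text\nwraps here.\nNext(cid:12) para"
print(repr(clean_text(sample)))
# 'multiline text wraps here.\nNext para'
```

Only the line break after the period survives, so sentences that end a line are treated as paragraph boundaries. That is a heuristic and will misfire on lines that end mid-sentence with an abbreviation.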

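I am on Windows and also get the cannot import name process_pdf error

Windows does not seem to support source, so I skipped the source command and the lines after it, edited the config, and ran the program directly. That produced the error in the title.

cannot import name process_pdf error

process_pdf cannot be imported. How can this be fixed?

Importing process_pdf fails on Python 3.7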

ImportError: cannot import name 'process_pdf' from 'pdfminer.pdfinterp' (/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py)
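
A plausible cause, though the source does not confirm it, is that a different pdfminer distribution is installed than the pdfminer3k pinned in requirements.txt: several packages install a module named pdfminer, and not all of them provide process_pdf. The sketch below lists which of the common candidates are present. The function name and the candidate list are assumptions made for illustration.

```python
from importlib import metadata


def installed_pdfminer_variants(candidates=("pdfminer3k", "pdfminer.six", "pdfminer")):
    """Map each installed candidate distribution to its version string."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # this distribution is not installed
    return found


if __name__ == "__main__":
    print(installed_pdfminer_variants())
```

If more than one variant shows up, or only one other than pdfminer3k, reinstalling the dependencies inside a fresh venv is a reasonable first thing to try.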

Error when running

Running python main.py raises this error: ImportError: No module named 'pdfminer'
How can this be fixed?

can't convert the professional literature

正在处理: Opportunities for using Navy marine mammals to explore associations between organochlorine contaminants and unfavorable effects on reproduction.pdf
WARNING:root:Cannot locate objid=67
WARNING:root:Wrong type: 0 required: <class 'dict'>
WARNING:root:Catalog not found!

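Runs successfully, but the layout is broken

I converted a PDF with a three-column layout. After conversion the paragraphs had to be split by hand, but most of the words came out correctly. Thanks a lot Y(^_^)Y

PDF contains images

The PDF contains images, but after conversion the Word document has none. How can this be handled?

Error when installing from requirements.txt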

Could not find a version that satisfies the requirement pdfminer3k==1.3.1 (from -r requirements.txt (line 3)) (from versions: 1.3.2, 1.3.3, 1.3.4)
No matching distribution found for pdfminer3k==1.3.1 (from -r requirements.txt (line 3))

pdfminer3k version error

Python 3.8.5, pip 20.1.1.
Installing the dependencies fails with:

ERROR: Could not find a version that satisfies the requirement pdfminer3k==1.3.1 (from -r requirements.txt (line 3)) (from versions: 1.3.2, 1.3.3, 1.3.4)
ERROR: No matching distribution found for pdfminer3k==1.3.1 (from -r requirements.txt (line 3))

After changing pdfminer3k==1.3.1 to pdfminer3k==1.3.4, the installation succeeds and the program runs.

Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?

What causes this error?