The confectionary from pszemraj

Add the ability to build PDF without paragraph segmentation

(Aside: add a progress bar for downloading word2vec models.)

Context

A low-commitment way to try out the repo.
It is also faster.

Expected result

A switch or argument that lets the user skip paragraph segmentation.
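One way such a switch could look is sketched below with argparse. The flag name `--no-split`, the `input_dir` destination, and the `prepare_text` helper are hypothetical names chosen for illustration, not part of confectionary; the segmenter is passed in as a plain callable so the sketch does not assume anything about the package's internals.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small CLI parser with an opt-out flag for segmentation."""
    parser = argparse.ArgumentParser(
        description="Build a PDF from a directory of text files."
    )
    # Input directory; the dest name is illustrative.
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="directory containing .txt files")
    # Hypothetical flag: when present, paragraph segmentation is skipped.
    parser.add_argument("--no-split", action="store_true",
                        help="skip paragraph segmentation")
    return parser


def prepare_text(text: str, no_split: bool, segmenter) -> str:
    """Return text unchanged when no_split is set, otherwise run the segmenter."""
    return text if no_split else segmenter(text)
```

With `store_true`, segmentation stays on by default, so existing invocations would behave as before and only users who pass the flag skip the step.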

Current result

Paragraph segmentation is currently mandatory; there is no way to skip it.

UnicodeEncodeError: 'latin-1' codec

Files whose names contain characters outside the latin-1 range cause the latin-1 codec used by the package to raise an error.
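The failure can be checked without building a PDF, since it comes down to a plain `str.encode("latin-1")` call. The helper below is a minimal sketch (the name `latin1_failures` is made up for illustration) that reports which names would fail and which character is the first offender.

```python
def latin1_failures(names):
    """Return (name, first_offending_character) for names latin-1 cannot encode."""
    failures = []
    for name in names:
        try:
            name.encode("latin-1")
        except UnicodeEncodeError as err:
            # err.start is the index of the first character that failed to encode
            failures.append((name, name[err.start]))
    return failures
```

Running this over the file stems in a directory before building would flag the problematic titles up front.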

Context

The package needs to handle files with special characters in their names.

Examples of weird char names:

 'SUMM OCR Pałczyński et al. - 2022 - Study of the Few-S .txt',
 'SUMM OCR Refinetti, Goldt - 2022 - The dynamics of repr .txt',
 'SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt',
 'SUMM OCR Serhal et al. - 2022 - Overview on prediction, .txt',
 'SUMM OCR Somani et al. - 2021 - Deep learning and the e .txt',
 'SUMM OCR Somepalli et al. - 2021 - SAINT Improved Neura .txt',
 'SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG .txt',

Process

The user passes the path to a directory of text files whose names contain special characters.
The files are loaded.
When confectionary tries to write a chapter name, it errors out.

Expected result

Filenames are cleaned to remove special characters before being written as chapter names.
The original file names on disk are left intact.

Current result

The code fails to run:

(pdf) C:\Users\peter\code-dev-22\misc-repos\text2pdf>python confectionary\text2pdf.py -i "G:\My Drive\ETHZ-2022-S\ml-healthcare\ml4hc-p1-papers\LED_large_5e_[batch=2048]_[nbeams=20]_[max_l=512]\NSC + SBD" -kw "ML4HC Paper Summaries - Project 1 - LED-L-5e"
18 files found matching extension .txt

# entries is 18, < title thresh 39
will use one page for TOC

Building Chapters in PDF file:   0%|                                                            | 0/18 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 354, in <module>
    _finished_pdf_loc = dir_to_pdf(
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 255, in dir_to_pdf
    pdf.print_chapter(filepath=str(textfile.resolve()), num=i, title=out_name)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 300, in print_chapter
    self.chapter_title(num, title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 207, in chapter_title
    self.start_section(total_title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 4040, in start_section
    self.multi_cell(w=self.epw, h=self.font_size, txt=name, ln=1)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2375, in multi_cell
    txt = self.normalize_text(txt)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2945, in normalize_text
    return txt.encode(self.core_fonts_encoding).decode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0131' in position 33: ordinal not in range(256)

Possible Fix

Use clean() from the clean-text package to sanitize chapter titles before they are written.
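As an alternative illustration that needs no extra dependency, the sketch below uses only the standard library rather than clean-text. The function name `to_latin1_safe` and the `?` placeholder are assumptions made for this example. It keeps characters latin-1 can already encode, strips accents from the rest where Unicode defines a decomposition, and substitutes the placeholder otherwise.

```python
import unicodedata


def to_latin1_safe(text: str, placeholder: str = "?") -> str:
    """Return a copy of text in which every character is encodable as latin-1."""
    out = []
    for ch in text:
        try:
            ch.encode("latin-1")
            out.append(ch)  # already representable, keep as-is
            continue
        except UnicodeEncodeError:
            pass
        # Split into base letter + combining marks, then drop the marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        try:
            base.encode("latin-1")
        except UnicodeEncodeError:
            base = ""  # no latin-1 equivalent after decomposition
        out.append(base or placeholder)
    return "".join(out)
```

Letters such as `ł` and `ı` have no Unicode decomposition, so they fall back to the placeholder; a transliteration step would be needed to map them to `l` and `i`. Only the title string handed to the PDF writer is changed, so the files on disk keep their original names.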
