confectionary's People

Contributors

jonathanlehner, pszemraj

confectionary's Issues

Add the ability to build PDF without paragraph segmentation

Add the ability to build a PDF without paragraph segmentation.

(Aside: add a progress bar for downloading word2vec models.)

Context

  • low-commitment way to try out the repo
  • faster

Expected result

Provide a switch or argument that skips paragraph segmentation.

Current result

Paragraph segmentation is currently mandatory; there is no way to skip it.
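
One way the requested switch could look, as a minimal sketch with `argparse`. The `--no-segmentation` flag, the `--input-dir` long option, and the `prepare_text` and `placeholder_segment` helpers are hypothetical names chosen for illustration; only the short `-i` option appears in the command shown elsewhere in this tracker, and the real segmentation step is not reproduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small CLI parser with an opt-out for paragraph segmentation."""
    parser = argparse.ArgumentParser(description="Sketch of a text-to-PDF CLI")
    parser.add_argument("-i", "--input-dir", required=True,
                        help="directory containing the text files")
    # Hypothetical flag: not part of the existing CLI.
    parser.add_argument("--no-segmentation", action="store_true",
                        help="skip paragraph segmentation and use the text as-is")
    return parser


def placeholder_segment(text: str) -> str:
    """Stand-in only: puts a blank line between non-empty lines."""
    return "\n\n".join(line.strip() for line in text.splitlines() if line.strip())


def prepare_text(text: str, segment: bool = True) -> str:
    """Return the text unchanged when segmentation is switched off."""
    if not segment:
        return text
    return placeholder_segment(text)


if __name__ == "__main__":
    args = build_parser().parse_args()
    sample = "first line\nsecond line"
    print(prepare_text(sample, segment=not args.no_segmentation))
```

Making segmentation the default and the flag an opt-out keeps existing invocations unchanged while giving new users the faster path.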

UnicodeEncodeError: 'latin-1' codec

Files whose names contain characters outside Latin-1 cause the latin-1 codec used by the package to raise a UnicodeEncodeError.

Context

  • need to be able to handle files with special characters in the name

Examples of file names with unusual characters:

 'SUMM OCR Pałczyński et al. - 2022 - Study of the Few-S .txt',
 'SUMM OCR Refinetti, Goldt - 2022 - The dynamics of repr .txt',
 'SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt',
 'SUMM OCR Serhal et al. - 2022 - Overview on prediction, .txt',
 'SUMM OCR Somani et al. - 2021 - Deep learning and the e .txt',
 'SUMM OCR Somepalli et al. - 2021 - SAINT Improved Neura .txt',
 'SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG .txt',

Process

  1. The user passes the path to a directory of text files whose names contain special characters.
  2. The files are loaded.
  3. When confectionary tries to write a chapter name, it errors out.

Expected result

  • filenames are cleaned to remove special chars before writing to chapter name.
  • original file names are left intact

Current result

The code fails with a UnicodeEncodeError:

(pdf) C:\Users\peter\code-dev-22\misc-repos\text2pdf>python confectionary\text2pdf.py -i "G:\My Drive\ETHZ-2022-S\ml-healthcare\ml4hc-p1-papers\LED_large_5e_[batch=2048]_[nbeams=20]_[max_l=512]\NSC + SBD" -kw "ML4HC Paper Summaries - Project 1 - LED-L-5e"
18 files found matching extension .txt

# entries is 18, < title thresh 39
will use one page for TOC

Building Chapters in PDF file:   0%|                                                            | 0/18 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 354, in <module>
    _finished_pdf_loc = dir_to_pdf(
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 255, in dir_to_pdf
    pdf.print_chapter(filepath=str(textfile.resolve()), num=i, title=out_name)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 300, in print_chapter
    self.chapter_title(num, title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 207, in chapter_title
    self.start_section(total_title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 4040, in start_section
    self.multi_cell(w=self.epw, h=self.font_size, txt=name, ln=1)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2375, in multi_cell
    txt = self.normalize_text(txt)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2945, in normalize_text
    return txt.encode(self.core_fonts_encoding).decode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0131' in position 33: ordinal not in range(256)
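
The last frame of the traceback encodes the chapter title as latin-1. A minimal reproduction of that failure, independent of fpdf and confectionary, might look like the sketch below. The helper name `first_unencodable` is hypothetical, and the sample string is a fragment of one of the file names listed earlier, not the exact title that was passed at run time.

```python
from typing import Optional, Tuple


def first_unencodable(text: str, encoding: str = "latin-1") -> Optional[Tuple[int, str]]:
    """Return (index, character) of the first character that cannot be
    encoded, or None when the whole string encodes cleanly."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        # err.start is the index of the first offending character.
        return err.start, text[err.start]
    return None


if __name__ == "__main__":
    # The dotless i (U+0131) has no latin-1 code point.
    print(first_unencodable("Sercan, Arık, Pfister"))
    # A plain ASCII name encodes without error.
    print(first_unencodable("Serhal et al."))
```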

Possible Fix

  • use clean() from the clean-text package
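
The fix above proposes `clean()` from the clean-text package. As a dependency-free illustration of the same idea, the sketch below swaps in the standard library's `unicodedata` module instead; it is not the clean-text API and not the package's actual implementation. The function name `latin1_safe_title` and the small fallback map are assumptions made for this example. Only the chapter title is transformed, so the original file names stay intact.

```python
import unicodedata

# Assumed fallback map for letters that have no Unicode decomposition
# and would otherwise survive normalization unchanged.
_FALLBACKS = str.maketrans({"ł": "l", "Ł": "L", "ı": "i"})


def latin1_safe_title(name: str) -> str:
    """Return a copy of `name` that can always be encoded as latin-1."""
    # Replace letters that NFKD cannot split into base + accent.
    text = name.translate(_FALLBACKS)
    # Split accented letters into a base letter plus combining marks.
    text = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Anything still outside latin-1 becomes "?" instead of raising.
    return text.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    original = "Śmigiel, Pałczyński, Ledziński - 2021 - ECG"
    print(latin1_safe_title(original))
    print(original)  # the input string itself is unchanged
```

Note that this approach also strips accents from characters that latin-1 could represent (for example `é` becomes `e`); a gentler variant would only transform characters that fail to encode.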
