phfaist / pylatexenc

Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion

Home Page: https://pylatexenc.readthedocs.io

License: MIT License

Python 99.38% TeX 0.01% JavaScript 0.61%

python latex parser encoding unicode

pylatexenc's Issues

$x$ → 𝑥 or using chars from the unicode block "Mathematical Alphanumeric Symbols" for math

As a possible future enhancement, would it be possible to use the characters of the unicode block "Mathematical Alphanumeric Symbols" when encoding math symbols to text?
So that "$x$" would become "𝑥" (U+1D465 MATHEMATICAL ITALIC SMALL X).
Maybe with an option to enable/disable the use of this block?
Also here.
Thanks :-)
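A minimal sketch of the mapping this request describes, using only the standard library. The function name `to_math_italic` is hypothetical and is not part of pylatexenc. It relies on the italic letters of the "Mathematical Alphanumeric Symbols" block being laid out contiguously, with one exception: the slot for italic small h is reserved, and U+210E PLANCK CONSTANT is used instead.

```python
def to_math_italic(text):
    """Map ASCII letters to their mathematical italic counterparts.

    Hypothetical helper for illustration; other characters pass through.
    """
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            # MATHEMATICAL ITALIC CAPITAL A is U+1D434
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif ch == "h":
            # The italic small h slot is reserved; Unicode uses U+210E
            out.append("\u210E")
        elif "a" <= ch <= "z":
            # MATHEMATICAL ITALIC SMALL A is U+1D44E
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)
```

An opt-in switch, as suggested above, would matter in practice because many fonts and terminals do not render this block.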

LatexWalker infinite loop bug

There is a bug in LatexWalker that is triggered when a LaTeX string contains \begin or \end macros that are not followed by the name of the environment they refer to.

Issue

The bug causes the while loop in LatexWalker.get_latex_nodes to keep raising LatexWalkerParseError exceptions, which are ignored forever when the self.tolerant_parsing option is True (the default).

How to reproduce?

from pylatexenc.latex2text import LatexNodes2Text

content = r'''
    \begin
    \item {1.} Example text 1
    \item {2.} Example text 2
    \end
'''

conversor = LatexNodes2Text()
text = conversor.latex_to_text(content)

Proposal

It is up to the maintainer of this package to consider this as a "major" error and raise an exception to the user, or to log and ignore these special parsing errors.

My proposal is to treat this as a special type of parsing error: log a warning and continue parsing. I will create a Pull Request with the proposal.

Question about equality

A question about the equality of two LaTeX expressions. Perhaps this is naïve, but if I have two expressions that display the same way, should there be a way to check that they are equal?

a = r"\bar{x}_{y}^{z}"
b = r"\bar{x}^{z}_{y}"

def tex_eq(a, b):
    # equality check (to be implemented)
    ...

assert tex_eq(a,b) is True
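One narrow way to approach this particular pair is to normalize the order of superscripts and subscripts before comparing strings. The sketch below is an assumption-laden illustration, not a pylatexenc feature: `normalize_scripts` is a hypothetical name, it handles only script arguments without nested braces, and it does not address any other source of visually equivalent LaTeX.

```python
import re

# A script argument: a flat braced group, a control word, or a single character.
_ARG = r"(\{[^{}]*\}|\\[A-Za-z]+|[^\s\\{}])"
_SUP_THEN_SUB = re.compile(r"\^\s*" + _ARG + r"\s*_\s*" + _ARG)


def normalize_scripts(tex):
    """Rewrite '^{z}_{y}' as '_{y}^{z}' so that script order is canonical."""
    return _SUP_THEN_SUB.sub(
        lambda m: "_" + m.group(2) + "^" + m.group(1), tex
    )


def tex_eq(a, b):
    """Compare two LaTeX strings after normalizing script order only."""
    return normalize_scripts(a) == normalize_scripts(b)
```

With this definition the two expressions above compare equal, while expressions that differ in content still compare unequal.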

Pos of data node is None for trailing newline

Hello devs,

When I use LatexWalker on a string with a trailing newline after an environment, that newline is parsed into a LatexCharsNode with pos = None. Is this intentional?

I'm using version 2.1 (installed from pip); a minimal input that displays this behavior would be "\begin{table}\n\end{table}\n".

Also, thanks for the great work! I'm writing a custom LaTeX formatter and this library really helps!

double quotes are not converted back to double quotes

from pylatexenc.latexencode import unicode_to_latex
from pylatexenc.latex2text import LatexNodes2Text
LatexNodes2Text().latex_to_text(unicode_to_latex("\""))=="\""

I think the error is in unicode_to_latex, because it converts " to ''. I don't really have an idea on how to fix this.

\le, \ge, \leqslant and \geqslant produce incorrect symbols

I'm not sure whether it is intentional, but currently \le and \ge are rendered as < and >. However, according to their LaTeX meaning, they should be rendered as ≤ and ≥. Also, \leqslant and \geqslant can (and probably should) be rendered as ⩽ and ⩾.
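For reference, the code points being requested can be written down as a small table. This is only a sketch: `RELATION_SYMBOLS` and `relation_to_unicode` are hypothetical names, and the table is not the library's own replacement map.

```python
# Hypothetical lookup table for illustration; not pylatexenc's internal map.
RELATION_SYMBOLS = {
    r"\le": "\u2264",        # LESS-THAN OR EQUAL TO
    r"\leq": "\u2264",
    r"\ge": "\u2265",        # GREATER-THAN OR EQUAL TO
    r"\geq": "\u2265",
    r"\leqslant": "\u2A7D",  # LESS-THAN OR SLANTED EQUAL TO
    r"\geqslant": "\u2A7E",  # GREATER-THAN OR SLANTED EQUAL TO
}


def relation_to_unicode(macro):
    """Return the Unicode character for a relation macro, or None if unknown."""
    return RELATION_SYMBOLS.get(macro)
```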

Update print statements to python 3 print() function

Another syntax error on Python 3.6:

  File "/Users/rasmus/Dev/repos-others/pylatexenc/pylatexenc/latex2text.py", line 667
    print "Please type some latex text (Ctrl+D twice to stop) ..."
                                                                 ^
SyntaxError: invalid syntax

To fix, convert print statements to use the print(...) function. Add from __future__ import print_function at the top to retain Python 2 compatibility.

Python 3 support: Define unicode() function for Python 3+

The unicode() function is not defined for python 3+ and will cause a NameError.

To fix, either:
a. Define a no-op unicode(str) function for python 3+, or:
b. Use from builtins import str and replace unicode(s) with str(s) (as outlined in http://python-future.org/compatible_idioms.html#unicode)
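Option (a) can be sketched in a few lines. This is an illustration of the idea under the assumption that the module only needs `unicode` as a text constructor; `to_text` is a hypothetical helper.

```python
import sys

if sys.version_info[0] >= 3:
    # On Python 3 every str is already Unicode text, so alias the name.
    unicode = str


def to_text(value):
    """Convert a value to text on both Python 2 and Python 3."""
    return unicode(value)
```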

Spacing issue with diacritical marks

Thanks for publishing and maintaining this package.

While using your package, I found some issues with the spacing in pylatexenc.latex2text.LatexNodes2Text. In LaTeX, typing \"{o} \L u yields ö Łu, while LatexNodes2Text yields öŁ u. Note the difference in whitespace.

Here is a minimal working example.

>>> from pylatexenc.latex2text import LatexNodes2Text
>>> encoder = LatexNodes2Text()
>>> encoder.latex_to_text(r'\"{o} \L u')
'öŁ u'

I have tested different combinations of the single glyph symbol \L, the combined letter \"o, and u using the code published here. An F in the first column indicates a wrong result.

 :      tex code                encoded output
F:      \"{o} \"{o} \"{o}       ööö
F:      \"{o} \"{o} {\"o}       ööö
F:      \"{o} \"{o} \L          ööŁ
F:      \"{o} \"{o} {\L}        ööŁ
F:      \"{o} \"{o} u           öö u
F:      \"{o} {\"o} {\"o}       ööö
F:      \"{o} {\"o} \L          ööŁ
F:      \"{o} {\"o} {\L}        ööŁ
F:      \"{o} {\"o} u           öö u
F:      \"{o} \L \L             öŁŁ
F:      \"{o} \L {\L}           öŁŁ
F:      \"{o} \L u              öŁ u
F:      \"{o} {\L} {\L}         öŁŁ
F:      \"{o} {\L} u            öŁ u
.:      \"{o} u u               ö u u
F:      {\"o} {\"o} {\"o}       ööö
F:      {\"o} {\"o} \L          ööŁ
F:      {\"o} {\"o} {\L}        ööŁ
F:      {\"o} {\"o} u           öö u
F:      {\"o} \L \L             öŁŁ
F:      {\"o} \L {\L}           öŁŁ
F:      {\"o} \L u              öŁ u
F:      {\"o} {\L} {\L}         öŁŁ
F:      {\"o} {\L} u            öŁ u
.:      {\"o} u u               ö u u
.:      \L \L \L                ŁŁŁ
.:      \L \L {\L}              ŁŁŁ
F:      \L \L u                 ŁŁ u
F:      \L {\L} {\L}            ŁŁŁ
.:      \L {\L} u               ŁŁ u
F:      \L u u                  Ł u u
F:      {\L} {\L} {\L}          ŁŁŁ
F:      {\L} {\L} u             ŁŁ u
.:      {\L} u u                Ł u u
.:      u u u                   u u u

If you could point me in the right direction on where to fix this issue, I am happy to contribute.

latex_to_text cannot convert \sqrt[n]{x} while keeping the root index n

As the title says.

from pylatexenc.latex2text import LatexNodes2Text
latex = r"""
... (\sqrt[25]{10-2.56})-1=8.36%
... """
latex = latex.replace('%', '/100')
text = LatexNodes2Text().latex_to_text(latex)
print(text)

√(10-2.56)-1=8.36/100

latex_to_text ignores the number 25 in "\sqrt[25]".
It converts "\sqrt[25]" to "√", which is wrong.
I understand that this happens because Unicode has no general n-th root character, only "√".
So how can I convert LaTeX code like "\sqrt[n]{x}" to "x**(1/n)"?
Maybe "Define replacement texts" can solve this, but I wanted you to know about the issue.
Your project is great; I wish it and you all the best. Thanks.
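One possible workaround is to rewrite n-th roots in the LaTeX source before handing it to latex_to_text. The sketch below uses only the standard library; `rewrite_nth_roots` is a hypothetical name, and the pattern assumes that neither the index nor the radicand contains nested braces or brackets.

```python
import re

# \sqrt[n]{x} with a flat index and a flat radicand (no nested braces).
_NTH_ROOT = re.compile(r"\\sqrt\s*\[([^\]]*)\]\s*\{([^{}]*)\}")


def rewrite_nth_roots(latex):
    """Replace \\sqrt[n]{x} with the plain-text power form (x)**(1/n)."""
    return _NTH_ROOT.sub(
        lambda m: "(" + m.group(2) + ")**(1/" + m.group(1) + ")", latex
    )
```

Plain \sqrt{x} without an index is left untouched, so it would still go through the normal conversion.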

Wrong reference to nodelist_to_latex func.

Issue

The apply_simplify_repl function within the LatexNodes2Text object references a non-existing function with the name self.nodelist_to_latex.

The issue only appears when parsing a document with the "%" character, as the only references to that non-existing function are within an if "%" in simplify_repl block.

Proposal

There is already a nodelist_to_latex function defined on the latexwalker/__init__.py file.

A simple fix would be to replace:

self.nodelist_to_latex

By:

latexwalker.nodelist_to_latex

Testing

I have tried to parse this ArXiV paper by calling LatexNodes2Text.latex_to_text function before and after the fix.

Where before it crashed, now it does not.

Allow for exception when converting

It would be good to have an exception list when doing the conversion. For example I would like to keep \ref as \ref in order to put label markers in prior to latex. Right now that gets printed just as \ref.

LatexWalker does not correctly parse `$$` in `$a$$b$`

Dear Philippe
Thanks for your work!
The lib helps me a lot.
I have a question about using the LatexWalker module to parse a LaTeX string: it treats all of the text as a LatexMathNode. For example, the following LaTeX code is part of an article:

\author[$\dagger$$\ddagger$1]{Qinyuan REN} \author[1]{Ping LI} %\author[$\dagger$1]{Ping Li} \affil[1]{State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China} \affil[2]{Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576, Singapore} \affil[3]{Zhejiang University of Science and Technology, Hangzhou 310023, China}

I don't know whether this is reasonable, but it causes trouble for me when parsing LaTeX code.
Thanks in advance!

latex2text prints spurious new line in the end

Running

echo foo | latex2text

gives two lines, the second of which is empty.

Quotation marks

Thanks for the great package!

I am facing an issue with the way quotation marks are parsed using unicode_to_latex.

See the following example:

>>> from pylatexenc.latexencode import unicode_to_latex
>>> print(unicode_to_latex('Hello "world".'))
Hello ''world''.

Whereas I would expect to get

Hello ``world''.

because in LaTeX left quotation marks are represented by backticks (per https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/QuotDash.html for example).

Is this an issue that can easily be solved?
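Until the encoder distinguishes opening from closing quotes, one workaround is to convert straight double quotes before any other processing. This is a heuristic sketch with a hypothetical function name: a quote counts as opening when it is at the start of the text or follows whitespace or an opening bracket, and as closing otherwise.

```python
import re

# A straight quote at the start, or after whitespace or an opening bracket.
_OPENING_QUOTE = re.compile(r'(^|[\s(\[{])"')


def straight_to_latex_quotes(text):
    """Turn "..." into ``...'' using a simple position heuristic."""
    text = _OPENING_QUOTE.sub(lambda m: m.group(1) + "``", text)
    # Every remaining straight quote is treated as a closing quote.
    return text.replace('"', "''")
```

The heuristic cannot tell apart unusual cases such as a quote used as an inch mark, so it is only suitable for ordinary prose.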

Super and Subscript encoding gets removed but content not displayed accordingly

I have read through this issue: https://github.com/phfaist/pylatexenc/issues/36 which states that you don't support super- and subscripts, for understandable reasons.

However, it would be nice to have the possibility to keep the encoding complete.

With LatexNodes2Text().latex_to_text(), a string like:

"This is H$_{2}$O"

Gets converted to:

"This is H_2O"

This is less safe to match and replace using, for example, a regex.

Different Item replacement for Environments

Hello everyone.

First of all: nice work!
The lib really helps me a lot.

But I'm currently facing the issue that for the generated latex code:

\begin{enumerate}% 
	\item% 
		\begin{itemize}%
			\item% 
				BLA BLA 1.1 
			\item% 
				BLA BLA 1.2 
		\end{itemize}% 
	\item
		\begin{itemize}% 
			\item% 
				Word = BLA BLA 2.1 
			\item% 
				Word = BLA BLA 2.2 
		\end{itemize}% 
	\item%
		BLABLA 3 
\end{enumerate}

I'm not able to distinguish the "item" for enumerations and for itemize.

What I desire is something like:

1. 
   - BLA BLA 1.1
   - BLA BLA 1.2
2.
   - BLA BLA 2.1
   - BLA BLA 2.2
3.
   BLA BLA 3

I tried to fiddle with the default_context_db, or to post-process and replace the generated " * \n" strings, but I believe in a well-structured framework like this, there would be a better solution.

Thanks in advance!
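The desired output can be produced by tracking a stack of open list environments, so that each \item knows whether it belongs to an enumerate or an itemize. The sketch below is standalone and uses only the standard library; it does not use pylatexenc's node tree or context database, `render_lists` is a hypothetical name, and it understands only these two environments.

```python
import re

# Tokens the sketch understands; everything else is treated as item text.
_TOKEN = re.compile(
    r"\\begin\{(enumerate|itemize)\}|\\end\{(enumerate|itemize)\}|\\item\b"
)
_COMMENT = re.compile(r"(?<!\\)%.*")


def _append_text(lines, text):
    """Attach whitespace-collapsed text to the most recent output line."""
    text = " ".join(text.split())
    if not text:
        return
    if lines:
        lines[-1] += " " + text
    else:
        lines.append(text)


def render_lists(latex):
    """Render nested enumerate/itemize environments as plain text."""
    # Drop LaTeX comments line by line before tokenizing.
    source = "\n".join(_COMMENT.sub("", line) for line in latex.splitlines())
    stack = []  # one [environment name, item counter] pair per open list
    lines = []
    pos = 0
    for match in _TOKEN.finditer(source):
        _append_text(lines, source[pos:match.start()])
        pos = match.end()
        if match.group(1):        # \begin{...}
            stack.append([match.group(1), 0])
        elif match.group(2):      # \end{...}
            if stack:
                stack.pop()
        elif stack:               # \item inside a list
            stack[-1][1] += 1
            kind, count = stack[-1]
            marker = "%d." % count if kind == "enumerate" else "-"
            lines.append("   " * (len(stack) - 1) + marker)
    _append_text(lines, source[pos:])
    return "\n".join(lines)
```

The same stack idea should carry over to a walk over LatexWalker nodes, where the enclosing environment node is known when an \item macro is visited.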

Feature request - option in utf8tolatex to maintain capitalisation for BibTeX

It would be great to have an option to keep custom capitalisation for bibtex.

For example

TCP: The Capitalisation Example

would be encoded as

{TCP}: The Capitalisation Example

For now, I am using code borrowed from https://openreview-py.readthedocs.io/en/latest/_modules/tools.html#get_bibtex in combination with utf8tolatex:

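import re

from pylatexenc.latexencode import utf8tolatex

def capitalize_title(title):
    # Wrap runs of two or more capitals in braces so BibTeX keeps their case.
    capitalization_regex = re.compile('[A-Z]{2,}')
    words = re.split(r'(\W)', title)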
    for idx, word in enumerate(words):
        m = capitalization_regex.search(word)
        if m:
            new_word = '{' + word[m.start():m.end()] + '}'
            words[idx] = words[idx].replace(word[m.start():m.end()], new_word)
    return ''.join(words)

bibtex_title = capitalize_title(utf8tolatex(orig_title))

Greek letters \iota and \epsilon in unicode_to_latex()

Thank you very much for this project! It is really helpful.

Could you please fix a couple of bugs related to the Greek letters?

For ι (Greek small letter iota, U+03B9), function unicode_to_latex() gives \ensuremath{\i}. However, this is incorrect. It should be \ensuremath{\iota} instead.

For ϵ (Greek lunate epsilon symbol, U+03F5), function unicode_to_latex() gives nothing but it should be \ensuremath{\epsilon}.

Source: Table 188 in The Comprehensive LaTeX Symbol List (PDF).

Caret not in utf8latexmap file

When using the software I found that the caret (^) character was not being escaped by the utf8tolatex function. It seems it is not in the _utf8latexmap file.

To "fix" this I monkey patched the utf82latex array with:

0x005E: r'\textasciicircum'

Not sure that was the correct symbol to use, but it fixed my issue.

Missing space in latex_to_text

Hi there,
When processing BibTeX files, I am seeing cases where latex_to_text incorrectly removes some spaces. Here is an example:

Python 3.5.2 (default, Oct 7 2020, 17:19:02)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from pylatexenc.latex2text import LatexNodes2Text
>>> s = "{B}uilding an {E}fficient {C}omponent for {OCL} {E}valuation"
>>> LatexNodes2Text().latex_to_text(s)
'Building an Efficient Component for OCLEvaluation'

Note that "OCLEvaluation" should be "OCL Evaluation"

I can provide more examples if needed

Throw error/exception for non-implemented latex command?

Dear Philippe,
Thanks for your work!
I wonder if latex_to_text should throw an exception if it faces a latex command that it can not convert? For example, the following Latex code is a matrix:
\left (\begin{array}{llll}25 & 31 & 17 & 43\\75 & 94 & 53 & 132\\75 & 94 & 54 & 134\\25 & 32 & 20 & 48\end{array}\right )
and all I get is
< a r r a y >
but I'd like to catch a conversion error in order to deal with it.

Is there a way to do that?
Thanks in advance!

missing braces after macro

Hi,

the Polish ł is not properly protected with the option
replacement_latex_protection='braces-after-macro'

In [1]: from pylatexenc.latexencode import unicode_to_latex                                                                                   

In [2]: unicode_to_latex('Jabłoński', replacement_latex_protection='braces-after-macro')                                                      
Out[2]: "Jab\\lo\\'nski"

it should be "Jab\\l{}o\\'nski". This is with pylatexenc==2.7

Thanks for this very useful package
T.
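The rule the report expects can be stated as: when a replacement ends in a control word (a backslash followed only by letters), append an empty group so the next letter is not absorbed into the macro name. The sketch below illustrates that rule only; `protect_with_braces` is a hypothetical name and this is not the library's implementation.

```python
import re

# A replacement that ends in a control word such as \l or \alpha.
_ENDS_IN_CONTROL_WORD = re.compile(r"\\[A-Za-z]+$")


def protect_with_braces(replacement):
    """Append {} when the replacement ends in a letters-only macro name."""
    if _ENDS_IN_CONTROL_WORD.search(replacement):
        return replacement + "{}"
    return replacement
```

Applied to the pieces of the example, \l becomes \l{} while \'n is left alone, which gives Jab\l{}o\'nski.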

Argument parsers should be given the name of the encountered macro, in order to handle unknown macros

I'm trying to have pylatexenc emit a warning when it finds an unknown macro.

I define an arguments parser that does nothing and emits a warning. Then I define a MacroSpec that uses this parser and finally I register it with the walker's context using set_unknown_macro_spec(). See the code below.

I gently ask if this is the correct way to go.
I suspect that I'm missing something, because I get a warning at the end of the math mode (at the second "$" in the second example in the code below).

I would also like to emit the name of the unknown macro, but this is for later :)

"""Emit a waning on unknown macros - proof of concept."""

from pylatexenc import macrospec, latexwalker, latex2text
import logging


class DoNothingArgumentsParser(macrospec.MacroStandardArgsParser):
    """An argument parser that does nothing and emits a warning."""

    def parse_args(self, w, pos, parsing_state=None):
        """Override the parse_args method to emit the warning."""
        logging.warning("Unknown macro XXX at %s",
                        pos)
        return super().parse_args(w, pos, parsing_state=None)


walker_context = latexwalker.get_default_latex_context_db()
unknown_macro_spec = macrospec.MacroSpec(
    "",  # anything would do?
    args_parser=DoNothingArgumentsParser()
)
walker_context.set_unknown_macro_spec(unknown_macro_spec)

# first example
output = latex2text.LatexNodes2Text().latex_to_text(
    r"""\unknown""", latex_context=walker_context)
print(output)

print("===")

# second example
output = latex2text.LatexNodes2Text().latex_to_text(
    r"""start
$\mu $
\foo
\foobar
""", latex_context=walker_context)
print(output)

# Output:
# WARNING:root:Unknown macro XXX at 8
#
# ===
# WARNING:root:Unknown macro XXX at 11
# WARNING:root:Unknown macro XXX at 18
# WARNING:root:Unknown macro XXX at 26
# start
# μ
#

Creating a conda-forge feedstock

Hello devs!

I am in the process of creating a conda-forge/pylatexenc-feedstock for this library. Thought I would mention my intentions. Feel free to close, or discuss concerns etc.

Cheers!

Python 3 support: Basestring = str for python 3+

NameError in latexwalker.do_read():

/Users/rasmus/Dev/repos-others/pylatexenc/pylatexenc/latexwalker.py in do_read(nodelist, p)
    922                         (nodeoptarg, p.pos) = getoptarg(p.pos);
    923 
--> 924                     if (isinstance(mac.numargs, basestring)):
    925                         # specific argument specification
    926                         for arg in mac.numargs:

NameError: name 'basestring' is not defined

To fix for python 3+, define basestring = str at the top of the file if sys.version_info.major > 2.
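A sketch of the proposed shim follows. The helper `is_argspec_string` is hypothetical and only stands in for the isinstance check in the traceback above.

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3 has no basestring; str covers all text strings.
    basestring = str


def is_argspec_string(numargs):
    """Return True when the argument specification is given as a string."""
    return isinstance(numargs, basestring)
```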

ur"Expected exactly (...)" string gives SyntaxError: invalid syntax

Got a syntax error when importing the pylatexenc.latex2text module (Python 3.6, Mac, Anaconda3 distribution):

import pylatexenc.latex2text

  File "/Users/rasmus/anaconda/envs/tts/lib/python3.6/site-packages/pylatexenc/latex2text.py", line 501
    logger.warning(ur"Expected exactly one argument for '\input' ! Got = %r", n.nodeargs)
                                                                           ^
SyntaxError: invalid syntax

Python 3 does not support the ur prefix. To fix, use either a u or an r string, but not both.

Actual content for array, pmatrix, and matrix

Firstly, let me just say that I love this library; it has helped me and my team do amazing things. I don't know if it's already in the works, but I would really appreciate it if you could include actual translations for the array, pmatrix, and matrix environments, or update the documentation to explain how to do so. I just really need the content of the environment to be displayed, as opposed to < p m a t r i x > or < a r r a y >. If this issue is already being worked on, feel free to let me know. If not, do you think that could be implemented?

latex2text is removing the macro name from the LaTeX text

from pylatexenc.latex2text import LatexNodes2Text
encoder = LatexNodes2Text(keep_inline_math=True, keep_comments=True)
print(encoder.latex_to_text(r'Global well-posedness for the mass-critical stochastic nonlinear Schr\"{o}dinger equation on $\mathbb{R}$: small initial data'))
I used the code above to decode LaTeX, and I only want the LaTeX accent commands to be converted to plain text.

However, latex2text also removes the \mathbb macro from the output.
Output:
Global well-posedness for the mass-critical stochastic nonlinear Schrödinger equation on $R$: small initial data

Expected output:
Global well-posedness for the mass-critical stochastic nonlinear Schrödinger equation on $\mathbb{R}$: small initial data

LatexWalker.get_latex_nodes returns wrong "len" (sometimes)

Working on the function that you proposed on #48, I came across a strange behavior: sometimes the len returned by the get_latex_nodes function is not the same as the len of the parsed node.
In the MWE below, the first node is a MacroNode and the pos and len reported by the function are the same as the attributes of the node itself, but for the second node (a CharsNode, the newline char), the function reports a len different from that of the node.

import pylatexenc.latexwalker

doc_content = r"""\emph{A}
\emph{B}"""
print(f"DOC: {doc_content}")

lw = pylatexenc.latexwalker.LatexWalker(
    doc_content,
    tolerant_parsing=False,
)

pos = 0
(tmp_list, npos, nlen) = lw.get_latex_nodes(pos, read_max_nodes=1)
node = tmp_list[0]
print(f"Node at pos {pos}: node len {node.len} == returned len {nlen}; OK")

pos = 8
(tmp_list, npos, nlen) = lw.get_latex_nodes(pos, read_max_nodes=1)
node = tmp_list[0]
print(f"Node at pos {pos}: node len {node.len} != returned len {nlen}; BAD")

By the way, with some minor tweaks, the function you proposed in #48 works well, thanks again 👍

missing spec for \href{}{} in latexwalker

I think the spec for the macro \href might be missing from pylatexenc/latexwalker/_defaultspecs.py
The following code gives an error: it expects a node with two arguments, but gets a node with an empty argnlist.

from pylatexenc.latex2text import LatexNodes2Text
latex = r"\href{a}{b}"
print(LatexNodes2Text().latex_to_text(latex))

behavior when keep_inline_math=True

[furutaka@Furutaka-3 ~]$ conda list|grep pylatexenc
pylatexenc 1.2
[furutaka@Furutaka-3 ~]$ python
Python 2.7.14 |Anaconda, Inc.| (default, Dec 7 2017, 17:05:42)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pylatexenc.latex2text import LatexNodes2Text
>>> latex=r"""$\gamma$"""
>>> print LatexNodes2Text(keep_inline_math=True).latex_to_text(latex)
$γ$

Is this the intended behavior of the function?

I expected $\gamma$ ...

Kazuyoshi

Python 3: Avoid sys.exc_value, use sys.exc_info() instead

latex2text.py:688 uses sys.exc_value, which has been deprecated since version 1.5 and is no longer available in Python 3. Use exc_info()[1] instead. (exc_value is actually defined in the line just above, so I imagine the continued use of sys.exc_value is unintentional?)
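The replacement pattern looks like this on both Python 2 and Python 3. `describe_failure` is a hypothetical helper used only to show the call.

```python
import sys


def describe_failure(func):
    """Call func and describe the exception it raises, if any."""
    try:
        func()
    except Exception:
        # sys.exc_info() returns (type, value, traceback) for the
        # exception currently being handled.
        exc_type, exc_value, _traceback = sys.exc_info()
        return "%s: %s" % (exc_type.__name__, exc_value)
    return None
```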

Powers

>>> latex = 'e^{2x}'
>>> LatexNodes2Text().latex_to_text(latex)
'e^2x'

How can I get e^(2x)?

Thanks.
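One possible workaround is to add the parentheses to the LaTeX source before conversion, so that the grouping survives once the braces are dropped. This sketch assumes flat braced arguments (no nested braces) and a hypothetical function name; whether the converted text then reads e^(2x) depends on how the converter handles the parentheses, which has not been verified here.

```python
import re

# A ^ or _ followed by a flat braced group of two or more characters.
_BRACED_SCRIPT = re.compile(r"([\^_])\{([^{}]{2,})\}")


def parenthesize_scripts(latex):
    """Wrap multi-character script arguments in parentheses."""
    return _BRACED_SCRIPT.sub(
        lambda m: m.group(1) + "{(" + m.group(2) + ")}", latex
    )
```

Single-character arguments such as ^{2} are left unchanged, since dropping their braces is not ambiguous.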

LatexGroupNode is parsed incorrectly

`w = LatexWalker(r"""
... \documentclass[twoside]{article}
...\usepackage{amsmath,amssymb}
...\usepackage{amsthm}

...\usepackage{graphicx,epsfig,amscd,mathrsfs,multirow,bm}

...\usepackage{cite}%引用宏包

...\input xy
...\xyoption{all}
...\input scrload.tex
...\allowdisplaybreaks[4]%多行公式换页从0到4，表示分页的坚决程度,例如0表示能不分页就不分页,4表示强制分页。

%------------------------ Page Format --------------------------
...\textwidth=147truemm
...\textheight=210truemm
...\headsep=4truemm
...\topmargin= 0pt
...\oddsidemargin=0pt
...\evensidemargin=0pt
...\parindent=16pt
...\setcounter{page}{415}

...\footskip=0pt
...\renewcommand{\baselinestretch}{1.06}
...\renewcommand{\arraystretch}{1.2}
...\catcode`@=11
...\long\def@makefntext#1{\noindent #1}
...\newskip\tabcentering \tabcentering=1000pt plus 1000pt minus 1000pt
...\def\REF#1{\par\hangindent\parindent\indent\llap{#1\enspace}\ignorespaces} %reference format
...\def\MCH#1#2{\setbox0=\hbox{\raise#1\hbox{#2}}\smash{\box0}}% move char
...\def\sub#1{\par\vspace{\baselineskip}\noindent #1\vspace*{\baselineskip}\rm\par}
...\def\CR{\cr\noalign{\vspace{1mm} \hrule \vspace{1mm}}}
...\def@evenfoot{}\def@oddfoot{}

...\def@evenhead{\hbox to\textwidth{\small\rm\thepage \hfill
...{\it Yu-e BAO, Na LI and Linfen ZHANG}}} % authors name (the given name is before the surname, and use "and" to separate two authors)%

...\def@oddhead{\hbox to \textwidth{\small{\it
...Differentiability of interval valued function and its application in interval valued programming
...} \hfill\thepage}} % Capitalize the first letter in abbreviate title%

...%format defination
...\def\scr{\mathscr}
...\def\SUB#1{\vskip .2in\leftline{\large\bf #1}\vskip .1in}
...\def\subsec#1{\vskip 2mm\leftline{#1}\vskip 1mm}
...\def\th#1{\vskip 1mm\noindent{\bf #1}\quad}
...\def\thn#1{\noindent{\bf #1}\quad}
...\def\proofn{\noindent{\it Proof}\quad}
...\def\proof{\vskip 1mm\noindent{\it Proof}\quad}
...\def\vsp{\vskip {1mm}}
...%end format defination

...\font\tenbf=cmb10 scaled \magstep0
...\font\tenrm=cmr10 scaled \magstep0
...\def\ESEC#1#2#3{\vskip.2in \begin{center} \tenbf #1\[.1in]
...\small #2\
...\footnotesize $($#3$)$\[.1in]
...\end{center}}

...\renewcommand{\topfraction}{1}
...\renewcommand{\bottomfraction}{1}
...\renewcommand{\textfraction}{0}
...\renewcommand{\floatpagefraction}{0}
...\floatsep=0pt
...\textfloatsep=0pt
...\intextsep=0pt
...\catcode`@=12

...\def\bc{\begin{center}}
...\def\ec{\end{center}}
...\def\no{\noindent}
...\def\hang{\hangindent\parindent}
...\def\textindent#1{\indent\llap{\qquad #1\ \ \enspace}\ignorespaces}
...\def\ref{\par\hang\textindent}

...\def\d#1#2{\frac{\displaystyle #1}{\displaystyle #2}}
...\def\f#1#2{\frac{#1}{#2}}

...\def\d{{\rm d}}

...%\usepackage{lineno}

...\begin{document}

...%\linenumbers

...\abovedisplayskip=6pt plus 1pt minus 1pt \belowdisplayskip=6pt
...plus 1pt minus 1pt
...%------------------- First Head -----------------------------------------
...\thispagestyle{empty} \vspace*{-1.0truecm} \noindent
...\parbox[b]{7truecm}{\footnotesize\baselineskip=11pt\noindent{\it Journal of Mathematical Research with Applications}\
...Jul., 2020, Vol.,40, No.,4, pp.,415--431\
...DOI:10.3770/j.issn:2095-2651.2019.04.009\
...Http://jmre.dlut.edu.cn} \hfill
...%\parbox[]{6truecm}{\vskip -1.7cm \hfill \begin{tabular}{l}\\hline {\bf Journal of Mathematical}\ {\bf Research and Exposition}\\hline\end{tabular}}
...%\parbox[t]{6truecm}{\vskip -1.7cm \hfill\includegraphics{actmark.eps}}
...%===================Text=============================================
...\vskip 10mm \bc{\Large\bf Differentiability of Interval Valued Function and Its Application in Interval Valued Programming
...\footnotetext{\footnotesize Received February 20, 2019; Accepted April 21, 2020\
...Supported by the National Natural Science Fund of China (Grant No.,11461052) and the Natural Science Foundation of Inner Mongolia (Grant No.,2018MS01010).\
...* Corresponding author\
...E-mail address: [email protected] (Yu-e BAO); [email protected] (Na LI); [email protected] (Linfen ZHANG)
...} } \ec
...%国家自然科学基金(Grant No.11461052),内蒙古自然科学基金(Grant No.2018MS01010).

...\vskip 5mm
...\bc{\bf Yu-e BAO$^*$,\ \ \ Na LI,\ \ \ Linfen ZHANG}\
...{\small\it College of Mathematics and Physics, Inner Mongolia University for Nationalities,\ Inner Mongolia $028043$, P. R. China
...}\ec

...\vskip 1 mm
\end{document}`

The TeX content is given above. The line "\def\bc{\begin{center}}" is parsed incorrectly: "\begin{center}" and everything after it are parsed into a single LatexGroupNode. I am using version v2.8.

LatexWalker does not correctly parse "\newcolumntype{C}{>{$}c<{$}}"

First, let me thank you for your work; it helps me a lot.

I want to report the issue in the title.
For instance:

echo '\newcolumntype{C}{>{$}c<{$}}'  |  latex2text

will issue a parse error (I think because the parser sees a closing group right after an opening math).
I don't know if it is possible to fix, but, if it is not, I wanted to ask if it would be possible to entirely skip the line(s) with parsing errors (maybe by supplying some optional argument) in the hope of not cluttering the rest of the parsing.

Something similar: echo '\newcommand{\be}{\begin{equation}}' |latex2text

My use case is this: I need to scan a tex file, looking for some specific macro (\title, \author,...) and transform their arguments into text. I do not want to use something like latex_text.find("\\title") (and then parse from there) because of comments and "look-alike" macros (e.g. \titleBar, \authorFoo). I could use regexps to find the starting point, but I prefer to navigate the tree of nodes built by LatexWalker.

Conform source to PEP8

Hi, thanks for putting together this nice library!

Would you be open to a pull request making the source conform better to PEP8?

Examples:

All import statements at the top, no empty lines between import statements.
Consistent line spacing between class/function definitions.
Remove semicolons at the end of lines.
Remove unnecessary parenthesis.
Removing spaces before/after brackets and making spacings consistent.
Removing unused imports.

Thanks again,

{\ensuremath{xxx}} gets removed by LatexNodes2Text().latex_to_text(value)

First of all thank you very much for this project. It saved me from a lot of hassle with all the special characters!

I have a small issue with the processing of {\ensuremath{xxx}}:

My input to LatexNodes2Text().latex_to_text() is, for example:

{{Neutral CGM as damped Ly{\ensuremath{\alpha}} absorbers at high redshift}}

What gets returned is:

Neutral CGM as damped Ly absorbers at high redshift

-> The whole ensuremath gets removed

I saw that this case was handled the other way around (encoding unicode) here: https://pylatexenc.readthedocs.io/en/latest/latexencode/

But was unable to resolve this problem.

Unknown latex macros do not include arguments?

Hi, thanks for this useful utility, and sorry for the usage question. I'm finding that the following code

print(LatexNodes2Text().latex_to_text(r"""
\documentclass{article}
\usepackage{times}
\definecolor{gray}
\RequirePackage{fixltx2e}
"""))

will output



gray
fixltx2e

Basically, the arguments to the first two macros get removed, whereas the arguments for the second two don't. I think this may be because the first two macros are built-in LaTeX macros, whereas the latter two are custom ones, so maybe pylatexenc doesn't know how many arguments there should be, so it defaults to assuming "no arguments". If I run this:

def raise_l2t_unknown_latex(n):
    if n.isNodeType(pylatexenc.latexwalker.LatexMacroNode):
        print(n.latex_verbatim())

l2t_db = pylatexenc.latex2text.get_default_latex_context_db()
l2t_db.set_unknown_macro_spec(
    pylatexenc.latex2text.MacroTextSpec("", simplify_repl=raise_l2t_unknown_latex)
)
LatexNodes2Text(latex_context=l2t_db).latex_to_text(r"""
\documentclass{article}
\usepackage{times}
\definecolor{gray}
\RequirePackage{fixltx2e}
""")

then indeed I get

\documentclass{article}
\usepackage{times}
\definecolor
\RequirePackage

Is there any way to make pylatexenc automatically try to grab as many arguments as possible from unknown macros? Thanks a lot!

mathbb symbols in .latex_to_text

Hi, first of all, thanks a lot for this project! I found it very useful to quickly insert unicode characters that correspond to usual mathematical symbols in emails, etc.

Am I correct that there is no support for symbols like $\mathbb R$ in .latex_to_text? I see that appropriate records are included into latexencode/_uni2latexmap.py, e.g.

pylatexenc/latexencode/_uni2latexmap.py:0x1D549: r'\ensuremath{\mathbb{R}}',              # MATHEMATICAL DOUBLE-STRUCK CAPITAL R

but it appears they are not used in latex_to_text:

from pylatexenc.latex2text import LatexNodes2Text
import unicodedata
R = LatexNodes2Text().latex_to_text(r"$\mathbb{R}$")
unicodedata.name(R)
# 'LATIN CAPITAL LETTER R'

Is it possible to add them? Unfortunately, I wasn't able to understand how to add new macros in .latex_to_text.
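The target characters can be looked up by Unicode name with the standard library. One detail matters for any such mapping: some double-struck capitals, R among them, live in the Letterlike Symbols block (ℝ is U+211D) rather than in Mathematical Alphanumeric Symbols. The sketch below tries both naming schemes; `mathbb_char` is a hypothetical name and this is not how latex_to_text is implemented.

```python
import unicodedata


def mathbb_char(letter):
    """Return the double-struck form of a capital ASCII letter."""
    candidates = (
        "DOUBLE-STRUCK CAPITAL " + letter,               # e.g. U+211D for R
        "MATHEMATICAL DOUBLE-STRUCK CAPITAL " + letter,  # e.g. U+1D538 for A
    )
    for name in candidates:
        try:
            return unicodedata.lookup(name)
        except KeyError:
            continue
    raise ValueError("no double-struck form for %r" % letter)
```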

LatexNodes2Text removes whitespace between neighboring braced groups

When converting LaTeX to plain text, the conversion methods of the
LatexNodes2Text class consume whitespace between neighboring
braced groups in the strict_latex_spaces=True mode, deviating from
the LaTeX handling of whitespace in such cases.

For instance, LatexNodes2Text().latex_to_text('{a} {b}') returns
ab instead of a b.

Problems with multiple unicodes

Hello,
I have an HTML page with some math formulas that I want to convert to LaTeX. But for a lot of Unicode characters it returns the error shown in the logs below.
Is there a way to convert even these Unicode characters?

Logs

No known latex representation for character: U+E683 - ‘’

Here is the code I'm using.

```python
# obj is a BeautifulSoup Element
print(f'<{header} style="margin: 0; font-weight: normal; font-size: {value}px;" class="ff5">{unicode_to_latex(obj.text)}')
```

Sorry for bothering you and thanks for helping me.
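U+E683 lies in the Unicode Private Use Area (U+E000 to U+F8FF), so its meaning is defined by the page's font rather than by Unicode, and no standard LaTeX equivalent can exist for it. A sketch of separating such characters out before encoding, using only the standard library and a hypothetical function name:

```python
import unicodedata


def split_private_use(text):
    """Return (text without private-use characters, list of removed ones)."""
    kept = []
    removed = []
    for ch in text:
        # General category "Co" marks private-use code points.
        if unicodedata.category(ch) == "Co":
            removed.append(ch)
        else:
            kept.append(ch)
    return "".join(kept), removed
```

The removed characters could then be mapped by hand, based on what the page's icon or math font actually draws for them.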

Support for common set-theory character, aleph "ℵ"

ℵ (\aleph) is used to denote infinite cardinals in set theory. It is missing from the _utf8latexmap.py map. Its character code is 0x2135.

See: https://en.wikipedia.org/wiki/Cardinal_number for a common use case.

Use absolute imports (python 3 support)

Import error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-7-a40f2e3e409e> in <module>()
      1 import pylatexenc
----> 2 import pylatexenc.latex2text

/Users/rasmus/Dev/repos-others/pylatexenc/pylatexenc/latex2text.py in <module>()
     28 import re
     29 import unicodedata
---> 30 import latexwalker
     31 import logging
     32 

ModuleNotFoundError: No module named 'latexwalker'

To fix: Convert relative import statements to absolute:

from pylatexenc import latexwalker

Add from __future__ import absolute_import to the top of all files to retain python 2 support.

LatexNodes2Text(keep_inline_math=True).latex_to_text(keep_inline_math=True) eats up spaces.

[furutaka@Furutaka-3 automate-refs]$ conda list|grep pylatexenc
pylatexenc 1.2
[furutaka@Furutaka-3 automate-refs]$ python
Python 2.7.14 |Anaconda, Inc.| (default, Dec 7 2017, 17:05:42)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pylatexenc.latex2text import LatexNodes2Text
>>> print (LatexNodes2Text(keep_inline_math=True).latex_to_text(r"$\alpha$ $\beta$ $\gamma$",keep_inline_math=True))
$\alpha$$\beta$$\gamma$
>>> print (LatexNodes2Text(keep_inline_math=True).latex_to_text(r"$\gamma$ detector",keep_inline_math=True))
$\gamma$ detector
>>> print (LatexNodes2Text(keep_inline_math=True).latex_to_text(r"$\gamma$ $\gamma$ coincidence",keep_inline_math=True))
$\gamma$$\gamma$ coincidence

Support for symbols < and >

Although the symbols < and > do not need to be changed, they need to be wrapped in \ensuremath{}.

PS: Your library has been very helpful in a small project I am working on. Thanks!
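To make the request concrete, here is a minimal sketch of the desired output, written as a plain string substitution that does not use pylatexenc; `wrap_angle_brackets` is a hypothetical helper name.

```python
import re

def wrap_angle_brackets(text):
    """Wrap each < or > in \\ensuremath{...} so it is typeset in math mode."""
    # A function is used as the replacement so the backslash is taken literally.
    return re.sub(r"[<>]", lambda m: r"\ensuremath{" + m.group(0) + "}", text)

print(wrap_angle_brackets("a < b > c"))
# a \ensuremath{<} b \ensuremath{>} c
```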

glitch in parsing tex fragments with macros that may have optional argument at the end - End of input while parsing arguments of macro

If I try to parse a TeX fragment containing a macro that may take a form like \foo[a]{b}[c], the parser apparently looks for the trailing optional argument beyond the end of the input.
This is noticeable only with tolerant_parsing=False.

import pylatexenc
import pylatexenc.latexwalker
import pylatexenc.macrospec  # used explicitly below for std_macro

# add the spec of macro \foo that has one mandatory argument between two optional arguments
# add the spec of macro \bar that has a more common form (for comparison)
walker_context = pylatexenc.latexwalker.get_default_latex_context_db()
walker_context.add_context_category(
    'mymacros',
    macros=[
        pylatexenc.macrospec.std_macro('foo', '[{['),
        pylatexenc.macrospec.std_macro('bar', '[{'),
    ],
    prepend=True
)


def parse(latex):
    "helper function"
    lw = pylatexenc.latexwalker.LatexWalker(
        latex,
        latex_context=walker_context,
        tolerant_parsing=False
    )
    lw.get_latex_nodes()


# \bar (common macros, all fine)
latex = r"\bar{test}"
parse(latex)


# \foo (uncommon definition, some glitch)

# if the macro is followed by something, all is well
latex = r"\foo{test} ciao"
parse(latex)

# if the macro is alone, there is an Exception
latex = r"""\foo{test}"""
try:
    parse(latex)
except Exception as e:
    print(f"{latex} → {e}")
else:
    print(f"{latex} → OK")

Crash on latex_to_text with combining character

from pylatexenc.latex2text import LatexNodes2Text
LatexNodes2Text().latex_to_text(r'$1 \not= 2$')

gives

Traceback (most recent call last):
  ...
  File ".../pylatexenc/latex2text/_defaultspecs.py", line 478, in make_accented_char
    nodearg = node.nodeargs[0] if len(node.nodeargs) else latexwalker.LatexCharsNode(chars=' ')
NameError: name 'latexwalker' is not defined

Support for `\newenvironment` wrapping another environment

latex2text fails when parsing a document that contains a \newenvironment command that wraps an existing environment. I have been able to narrow it down to the following minimal example:

latex2text --code '\newenvironment{annotate}{\begin{scope}}{\end{scope}}'

which gives the following output:

INFO:pylatexenc.latexwalker:Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,39)
INFO:pylatexenc.latexwalker:Ignoring parse error (tolerant parsing mode): Unexpected closing environment: 'scope' @(1,41)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2248, in do_read
    mspec.parse_args(w=self, pos=tok.pos + tok.len,
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/macrospec/__init__.py", line 95, in parse_args
    return self.args_parser.parse_args(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/macrospec/_argparsers.py", line 293, in parse_args
    (node, np, nl) = w.get_latex_expression(
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1551, in get_latex_expression
    tok = self.get_token(pos, environments=False, parsing_state=parsing_state)
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1356, in get_token
    raise LatexWalkerEndOfStream(final_space=space)
pylatexenc.latexwalker.LatexWalkerEndOfStream

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/latex2text", line 11, in <module>
    load_entry_point('pylatexenc==2.8', 'console_scripts', 'latex2text')()
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latex2text/__main__.py", line 190, in main
    (nodelist, pos, len_) = lw.get_latex_nodes()
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2351, in get_latex_nodes
    r_endnow = do_read(nodelist, p)
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2251, in do_read
    e = self._exchandle_parse_subexpression(
  File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1862, in _exchandle_parse_subexpression
    e.open_contexts.append(
AttributeError: 'LatexWalkerEndOfStream' object has no attribute 'open_contexts'

By trial and error, I found out that parsing works if I add a custom definition macrospec.std_macro('newenvironment', "*[[{{"), i.e. remove the first { argument from the default *{[[{{.

latex2text: LatexNodes2Text ignores SpecialsTextSpec replacement for single quotes in its context

I expect the following code to replace the ASCII apostrophe in
the given string with the Unicode right single quote, as specified:

from pylatexenc import latex2text, latexwalker

def parse_tex(str):
    return latexwalker.LatexWalker(str).get_latex_nodes()[0]

def render_tex(tex_nodes):
    return render_tex.converter.nodelist_to_text(tex_nodes)

SpecialsTextSpec = latex2text.SpecialsTextSpec
context = latex2text.get_default_latex_context_db()
context.add_context_category(
    'more-nonascii-specials',
    insert_after='nonascii-specials',
    specials=(
        SpecialsTextSpec('`', u'\N{LEFT SINGLE QUOTATION MARK}'),
        SpecialsTextSpec("'", u'\N{RIGHT SINGLE QUOTATION MARK}'),
    ),
)

render_tex.converter = latex2text.LatexNodes2Text(
    latex_context=context,
    strict_latex_spaces=True,
)

print(render_tex(parse_tex("patient's state")))

However, the apostrophe remains unchanged in the output.
Is it a bug, or am I missing something?
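Until the cause is clear, one possible stopgap is to substitute the quotes in the converted text rather than through the context database. This is a sketch of a workaround, not a fix for the behavior described above, and `curl_single_quotes` is a hypothetical helper name.

```python
# Map ASCII quote characters to their typographic counterparts.
QUOTE_TABLE = str.maketrans({
    "`": "\N{LEFT SINGLE QUOTATION MARK}",
    "'": "\N{RIGHT SINGLE QUOTATION MARK}",
})

def curl_single_quotes(text):
    """Replace ` and ' in already-converted plain text."""
    return text.translate(QUOTE_TABLE)

print(curl_single_quotes("patient's state"))
```

It would be applied to the string returned by nodelist_to_text.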
