phfaist / pylatexenc
Simple LaTeX parser providing latex-to-unicode and unicode-to-latex conversion
Home Page: https://pylatexenc.readthedocs.io
License: MIT License
As a possible future enhancement, would it be possible to use the characters of the unicode block "Mathematical Alphanumeric Symbols" when encoding math symbols to text?
So that "$x$" would become "𝑥" (U+1D465 MATHEMATICAL ITALIC SMALL X).
Maybe with an option to enable/disable the use of this block?
Also here.
Thanks :-)
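A minimal sketch of what such an opt-in mapping could look like, independent of pylatexenc. The helper name to_math_italic is hypothetical and not part of the library; the sketch relies only on the layout of the Unicode block (italic capitals start at U+1D434, italic small letters at U+1D44E, and italic small h is U+210E because U+1D455 is reserved).

```python
def to_math_italic(text):
    """Map ASCII letters to Mathematical Italic code points (hypothetical helper)."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            # MATHEMATICAL ITALIC CAPITAL A is U+1D434
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif ch == "h":
            # U+1D455 is reserved; italic small h is PLANCK CONSTANT (U+210E)
            out.append("\u210E")
        elif "a" <= ch <= "z":
            # MATHEMATICAL ITALIC SMALL A is U+1D44E
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            # Digits, spaces and punctuation are left untouched
            out.append(ch)
    return "".join(out)
```

An enable/disable option, as requested above, could simply decide whether this function is applied to the contents of math mode.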
There is a bug in LatexWalker when a given LaTeX string contains \begin or \end macros that are not followed by the environment name they refer to.
The bug makes the while loop in LatexWalker.get_latex_nodes keep raising LatexWalkerParseError exceptions forever, each of which is ignored when the self.tolerant_parsing option is True (the default).
from pylatexenc.latex2text import LatexNodes2Text
content = r'''
\begin
\item {1.} Example text 1
\item {2.} Example text 2
\end
'''
conversor = LatexNodes2Text()
text = conversor.latex_to_text(content)
It is up to the maintainer of this package to consider this as a "major" error and raise an exception to the user, or to log and ignore these special parsing errors.
My proposal is to treat this as a special type of parsing error: log a warning and continue parsing. I will create a Pull Request with the proposal.
Question: equality of two LaTeX expressions. Perhaps this is naïve, but if I have two expressions that display the same way, should there be a way to check that they are equal?
a = r"\bar{x}_{y}^{z}"
b = r"\bar{x}^{z}_{y}"
def tex_eq(a, b):
    ...  # equality check goes here
assert tex_eq(a, b) is True
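One way to approach the example above, sketched with the standard library only, is to normalize both strings so that a subscript always precedes a superscript and then compare the results. This is an assumption-laden sketch, not a pylatexenc feature: the helper normalize_scripts is hypothetical, and it handles only single tokens and brace groups without nesting.

```python
import re

# A script argument: a flat {...} group, a control word, or a single character.
_SCRIPT = r"(?:\{[^{}]*\}|\\[A-Za-z]+|[^\s{}\\^_])"
# A superscript immediately followed by a subscript.
_SUP_THEN_SUB = re.compile(r"\^(" + _SCRIPT + r")_(" + _SCRIPT + r")")


def normalize_scripts(tex):
    """Rewrite '^{a}_{b}' as '_{b}^{a}' so that script order is canonical."""
    return _SUP_THEN_SUB.sub(lambda m: "_" + m.group(2) + "^" + m.group(1), tex)


def tex_eq(a, b):
    """Compare two LaTeX strings up to the order of sub- and superscripts."""
    return normalize_scripts(a) == normalize_scripts(b)
```

Any other source of visual equivalence (optional braces, spacing, synonymous macros) is deliberately left out and would need its own normalization step.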
Hello devs,
When I use LatexWalker on a string with a trailing newline after an environment, that newline is parsed into a LatexCharsNode with pos = None. Is this intentional?
I'm using version 2.1 (installed from pip); a minimal input that displays this behavior would be "\begin{table}\n\end{table}\n".
Also, thanks for the great work! I'm writing a custom LaTeX formatter and this library really helps!
from pylatexenc.latexencode import unicode_to_latex
from pylatexenc.latex2text import LatexNodes2Text
LatexNodes2Text().latex_to_text(unicode_to_latex("\""))=="\""
I think the error is in unicode_to_latex, because it converts " to ''. I don't really have an idea of how to fix this.
I'm not sure whether it is intentional, but currently \le and \ge are rendered as < and >. However, according to their LaTeX meaning, they should be rendered as ≤ and ≥. Also, \leqslant and \geqslant can (and probably should) be rendered as ⩽ and ⩾.
Another syntax error on Python 3.6:
File "/Users/rasmus/Dev/repos-others/pylatexenc/pylatexenc/latex2text.py", line 667
print "Please type some latex text (Ctrl+D twice to stop) ..."
^
SyntaxError: invalid syntax
To fix, convert print statements to print(...) calls. Add from __future__ import print_function at the top to retain Python 2 compatibility.
The unicode() function is not defined in Python 3+ and will cause a NameError.
To fix, either:
a. Define a no-op unicode(str) function for Python 3+, or:
b. Use from builtins import str and replace unicode(s) with str(s) (as outlined in http://python-future.org/compatible_idioms.html#unicode)
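A variant of option (a) can be sketched as an alias that is chosen once at import time, using only the standard library. The names text_type and to_text are illustrative and do not come from pylatexenc.

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3: every str is already Unicode text
    text_type = str
else:
    # Python 2 only; this branch is never evaluated on Python 3
    text_type = unicode  # noqa: F821


def to_text(value):
    """Convert any value to Unicode text on both Python 2 and Python 3."""
    return text_type(value)
```

Calls such as unicode(s) in the module would then become to_text(s), or text_type(s) directly.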
Thanks for publishing and maintaining this package.
While using your package, I found some issues with the spacing in pylatexenc.latex2text.LatexNodes2Text. In LaTeX, typing \"{o} \L u yields ö Łu, while LatexNodes2Text yields öŁ u. Note the difference in whitespace.
Here is a minimal working example.
>>> from pylatexenc.latex2text import LatexNodes2Text
>>> encoder = LatexNodes2Text()
>>> encoder.latex_to_text(r'\"{o} \L u')
'öŁ u'
I have tested different combinations of the single glyph symbol \L, the combined letter \"o, and u using the code published here. An F in the first column indicates a wrong result.
: tex code encoded output
F: \"{o} \"{o} \"{o} ööö
F: \"{o} \"{o} {\"o} ööö
F: \"{o} \"{o} \L ööŁ
F: \"{o} \"{o} {\L} ööŁ
F: \"{o} \"{o} u öö u
F: \"{o} {\"o} {\"o} ööö
F: \"{o} {\"o} \L ööŁ
F: \"{o} {\"o} {\L} ööŁ
F: \"{o} {\"o} u öö u
F: \"{o} \L \L öŁŁ
F: \"{o} \L {\L} öŁŁ
F: \"{o} \L u öŁ u
F: \"{o} {\L} {\L} öŁŁ
F: \"{o} {\L} u öŁ u
.: \"{o} u u ö u u
F: {\"o} {\"o} {\"o} ööö
F: {\"o} {\"o} \L ööŁ
F: {\"o} {\"o} {\L} ööŁ
F: {\"o} {\"o} u öö u
F: {\"o} \L \L öŁŁ
F: {\"o} \L {\L} öŁŁ
F: {\"o} \L u öŁ u
F: {\"o} {\L} {\L} öŁŁ
F: {\"o} {\L} u öŁ u
.: {\"o} u u ö u u
.: \L \L \L ŁŁŁ
.: \L \L {\L} ŁŁŁ
F: \L \L u ŁŁ u
F: \L {\L} {\L} ŁŁŁ
.: \L {\L} u ŁŁ u
F: \L u u Ł u u
F: {\L} {\L} {\L} ŁŁŁ
F: {\L} {\L} u ŁŁ u
.: {\L} u u Ł u u
.: u u u u u u
If you could point me in the right direction on where to fix this issue, I am happy to contribute.
As the title says.
from pylatexenc.latex2text import LatexNodes2Text
latex = r"""
(\sqrt[25]{10-2.56})-1=8.36%
"""
latex = latex.replace('%', '/100')
text = LatexNodes2Text().latex_to_text(latex)
print(text)
√(10-2.56)-1=8.36/100
The latex_to_text function ignores the number 25 in "\sqrt[25]".
It converts "\sqrt[25]" to "√", which is wrong.
I understand that this happens because there is no general n-th root character other than "√".
So how can I convert LaTeX code like "\sqrt[n]{x}" to "x**(1/n)"?
Maybe "Define replacement texts" can solve this, but I wanted you to know about the issue.
Your project is great. I wish it, and you, all the best. Thanks.
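As a stopgap for the "\sqrt[n]{x}" question, the source can be rewritten before it is handed to latex_to_text. The sketch below uses only the standard library; the helper name rewrite_nth_roots is hypothetical, and it assumes that neither the index nor the radicand contains nested brackets or braces.

```python
import re

# \sqrt[n]{x} with a flat index and a flat radicand (no nested [] or {}).
_NTH_ROOT = re.compile(r"\\sqrt\[([^\[\]]+)\]\{([^{}]*)\}")


def rewrite_nth_roots(latex):
    """Replace \\sqrt[n]{x} with the plain-text form (x)**(1/(n))."""
    return _NTH_ROOT.sub(
        lambda m: "(%s)**(1/(%s))" % (m.group(2), m.group(1)),
        latex,
    )
```

A plain \sqrt{x} without an optional index is left untouched, so the usual "√" conversion still applies to it.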
The apply_simplify_repl function within the LatexNodes2Text object references a non-existent function named self.nodelist_to_latex.
The issue only appears when parsing a document containing the "%" character, as the only references to that non-existent function are within an if "%" in simplify_repl block.
There is already a nodelist_to_latex function defined in the latexwalker/__init__.py file.
A simple fix would be to replace:
self.nodelist_to_latex
By:
latexwalker.nodelist_to_latex
I have tried to parse this arXiv paper by calling the LatexNodes2Text.latex_to_text function before and after the fix.
Where before it crashed, now it does not.
It would be good to have an exception list when doing the conversion. For example I would like to keep \ref as \ref in order to put label markers in prior to latex. Right now that gets printed just as \ref.
Dear Philippe
Thanks for your work!
The lib helps me a lot.
I have a question: when I use the LatexWalker module to parse a LaTeX string, it treats all of the text as a LatexMathNode. For example, the following LaTeX code is part of an article:
\author[$\dagger$$\ddagger$1]{Qinyuan REN} \author[1]{Ping LI} %\author[$\dagger$1]{Ping Li} \affil[1]{State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China} \affil[2]{Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576, Singapore} \affil[3]{Zhejiang University of Science and Technology, Hangzhou 310023, China}
I don't know whether this is reasonable, but it causes trouble for me when parsing LaTeX code.
Thanks in advance!
Running
echo foo | latex2text
gives two lines, the second of which is empty.
Thanks for the great package!
I am facing an issue with the way quotation marks are converted by unicode_to_latex.
See the following example:
>>> from pylatexenc.latexencode import unicode_to_latex
>>> print(unicode_to_latex('Hello "world".'))
Hello ''world''.
Whereas I would expect to get
Hello ``world''.
because in LaTeX left quotation marks are represented by backticks (per https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/QuotDash.html for example).
Is this an issue that can easily be solved?
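Until the encoder handles this itself, one possible workaround is to convert the straight double quotes before encoding. The sketch below is a naive, standard-library-only illustration: the helper name texify_double_quotes is hypothetical, and it simply alternates between opening and closing quotes, so it assumes the quotes in the text are balanced and not nested.

```python
def texify_double_quotes(text):
    """Replace straight double quotes with LaTeX `` and '' alternately."""
    out = []
    opening = True  # the next quote we meet is treated as an opening quote
    for ch in text:
        if ch == '"':
            out.append("``" if opening else "''")
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)
```

For the example above, texify_double_quotes('Hello "world".') gives the expected Hello ``world''. string.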
I have read through this issue: https://github.com/phfaist/pylatexenc/issues/36 which states that you don't support super- and subscripts, for understandable reasons.
However, it would be nice to have the possibility to keep the encoding complete.
Using LatexNodes2Text().latex_to_text(), a string like:
"This is H$_{2}$O"
gets converted to:
"This is H_2O"
which is less safe to match and replace using, e.g., a regex.
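For simple chemical formulas like the one above, the subscripts can be converted to Unicode subscript characters before calling latex_to_text. This is a standard-library sketch under stated assumptions: the helper name unicode_subscripts is hypothetical, and it only covers inline math of the exact form $_{...}$ whose content consists of digits and the characters + - = ( ), which are the ones with dedicated Unicode subscript forms.

```python
import re

# Characters that have a dedicated Unicode subscript form (U+2080..U+208E).
_SUBSCRIPTS = str.maketrans("0123456789+-=()", "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎")
# Inline math consisting solely of one braced subscript, e.g. $_{2}$.
_SUBSCRIPT_MATH = re.compile(r"\$_\{([0-9+\-=()]+)\}\$")


def unicode_subscripts(text):
    """Turn $_{2}$-style fragments into Unicode subscript characters."""
    return _SUBSCRIPT_MATH.sub(lambda m: m.group(1).translate(_SUBSCRIPTS), text)
```

Anything that does not match this narrow pattern is passed through unchanged, so the remaining LaTeX can still be handled by the normal conversion.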
Hello everyone.
First of all: nice work!
The lib really helps me a lot.
But I'm currently facing the issue that for the generated latex code:
\begin{enumerate}%
\item%
\begin{itemize}%
\item%
BLA BLA 1.1
\item%
BLA BLA 1.2
\end{itemize}%
\item
\begin{itemize}%
\item%
Word = BLA BLA 2.1
\item%
Word = BLA BLA 2.2
\end{itemize}%
\item%
BLABLA 3
\end{enumerate}
I'm not able to distinguish the "item" for enumerations and for itemize.
What I desire is something like:
1.
- BLA BLA 1.1
- BLA BLA 1.2
2.
- BLA BLA 2.1
- BLA BLA 2.2
3.
BLA BLA 3
I tried to fiddle with the default_context_db, or to post-process and replace the generated " * \n" strings, but I believe in a well-structured framework like this, there would be a better solution.
Thanks in advance!
It would be great to have an option to keep custom capitalisation for bibtex.
For example
TCP: The Capitalisation Example
would be encoded as
{TCP}: The Capitalisation Example
For now, I am using code borrowed from https://openreview-py.readthedocs.io/en/latest/_modules/tools.html#get_bibtex in combination with utf8tolatex:
import re
from pylatexenc.latexencode import utf8tolatex

def capitalize_title(title):
    # Wrap each run of two or more capital letters in braces so BibTeX keeps its case.
    capitalization_regex = re.compile('[A-Z]{2,}')
    words = re.split(r'(\W)', title)
    for idx, word in enumerate(words):
        m = capitalization_regex.search(word)
        if m:
            new_word = '{' + word[m.start():m.end()] + '}'
            words[idx] = words[idx].replace(word[m.start():m.end()], new_word)
    return ''.join(words)

bibtex_title = capitalize_title(utf8tolatex(orig_title))
Thank you very much for this project! It is really helpful.
Could you please fix a couple of bugs related to Greek letters?
For ι (Greek small letter iota, U+03B9), the function unicode_to_latex() gives \ensuremath{\i}. However, this is incorrect. It should be \ensuremath{\iota} instead.
For ϵ (Greek lunate epsilon symbol, U+03F5), the function unicode_to_latex() gives nothing, but it should be \ensuremath{\epsilon}.
Source: Table 188 in The Comprehensive LaTeX Symbol List (PDF).
When using the software I found that the caret (^) character was not being escaped by the utf8tolatex function. It seems it is not in the _utf8latexmap file.
To "fix" this I monkey-patched the utf82latex mapping with:
0x005E: r'\textasciicircum'
Not sure that was the correct symbol to use, but it fixed my issue.
Hi there,
When processing BibTex files, I am seeing cases where latex_to_text incorrectly removes some spaces. Here is an example:
Python 3.5.2 (default, Oct 7 2020, 17:19:02)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pylatexenc.latex2text import LatexNodes2Text
>>> s = "{B}uilding an {E}fficient {C}omponent for {OCL} {E}valuation"
>>> LatexNodes2Text().latex_to_text(s)
'Building an Efficient Component for OCLEvaluation'
Note that "OCLEvaluation" should be "OCL Evaluation".
I can provide more examples if needed.
Dear Philippe,
Thanks for your work!
I wonder if latex_to_text should throw an exception when it encounters a LaTeX command that it cannot convert? For example, the following LaTeX code is a matrix:
\left (\begin{array}{llll}25 & 31 & 17 & 43\\75 & 94 & 53 & 132\\75 & 94 & 54 & 134\\25 & 32 & 20 & 48\end{array}\right )
and all I get is
< a r r a y >
but I'd like to catch a conversion error in order to deal with it.
Is there a way to do that?
Thanks in advance!
Hi,
the Polish ł is not properly protected with the option replacement_latex_protection='braces-after-macro'
In [1]: from pylatexenc.latexencode import unicode_to_latex
In [2]: unicode_to_latex('Jabłoński', replacement_latex_protection='braces-after-macro')
Out[2]: "Jab\\lo\\'nski"
It should be "Jab\\l{}o\\'nski". This is with pylatexenc==2.7
Thanks for this very useful package
T.
I'm trying to have pylatexenc emit a warning when it finds an unknown macro.
I define an arguments parser that does nothing and emits a warning. Then I define a MacroSpec that uses this parser and finally I register it with the walker's context using set_unknown_macro_spec(). See the code below.
I gently ask if this is the correct way to go.
I suspect that I'm missing something, because I get a warning at the end of the math mode (at the second "$" in the second example in the code below).
I would also like to emit the name of the unknown macro, but this is for later :)
"""Emit a warning on unknown macros - proof of concept."""
from pylatexenc import macrospec, latexwalker, latex2text
import logging
class DoNothingArgumentsParser(macrospec.MacroStandardArgsParser):
    """An argument parser that does nothing and emits a warning."""
    def parse_args(self, w, pos, parsing_state=None):
        """Override the parse_args method to emit the warning."""
        logging.warning("Unknown macro XXX at %s",
                        pos)
        return super().parse_args(w, pos, parsing_state=None)
walker_context = latexwalker.get_default_latex_context_db()
unknown_macro_spec = macrospec.MacroSpec(
    "",  # anything would do?
    args_parser=DoNothingArgumentsParser()
)
walker_context.set_unknown_macro_spec(unknown_macro_spec)
# first example
output = latex2text.LatexNodes2Text().latex_to_text(
    r"""\unknown""", latex_context=walker_context)
print(output)
print("===")
# second example
output = latex2text.LatexNodes2Text().latex_to_text(
    r"""start
$\mu $
\foo
\foobar
""", latex_context=walker_context)
print(output)
# Output:
# WARNING:root:Unknown macro XXX at 8
#
# ===
# WARNING:root:Unknown macro XXX at 11
# WARNING:root:Unknown macro XXX at 18
# WARNING:root:Unknown macro XXX at 26
# start
# μ
#
Hello devs!
I am in the process of creating a conda-forge/pylatexenc-feedstock for this library. Thought I would mention my intentions. Feel free to close, or discuss concerns etc.
Cheers!
NameError in latexwalker.do_read():
/Users/rasmus/Dev/repos-others/pylatexenc/pylatexenc/latexwalker.py in do_read(nodelist, p)
922 (nodeoptarg, p.pos) = getoptarg(p.pos);
923
--> 924 if (isinstance(mac.numargs, basestring)):
925 # specific argument specification
926 for arg in mac.numargs:
NameError: name 'basestring' is not defined
To fix for Python 3+, define basestring = str at the top of the file if sys.version_info.major > 2.
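The suggested fix can be sketched as follows, using only the standard library. The helper is_arg_spec_string is a hypothetical stand-in for the isinstance check in do_read() shown in the traceback.

```python
import sys

if sys.version_info.major > 2:
    # Python 3 has no basestring; str covers all text strings
    basestring = str


def is_arg_spec_string(numargs):
    """Return True if numargs is a string argument spec rather than a count."""
    return isinstance(numargs, basestring)
```

On Python 2 the built-in basestring is used unchanged, so the same check works on both versions.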
Got a syntax error when importing the pylatexenc.latex2text module (Python 3.6, Mac, Anaconda3 distribution):
import pylatexenc.latex2text
File "/Users/rasmus/anaconda/envs/tts/lib/python3.6/site-packages/pylatexenc/latex2text.py", line 501
logger.warning(ur"Expected exactly one argument for '\input' ! Got = %r", n.nodeargs)
^
SyntaxError: invalid syntax
Python 3.5+ does not support the ur prefix. To fix, use either a u or an r string but not both.
Firstly let me just say that I love this library, it has helped myself and my team to do amazing things. I don't know if it's already in the works or not but I would really appreciate if you could include actual translations for the array, pmatrix, and matrix environments or could maybe update the documentation and explain how to do so. I just really need the content of the environment to be displayed as opposed to < p m a t r i x > or < a r r a y >. If this issue is already being worked on feel free to let me know. If not do you think that could be implemented?
encoder = LatexNodes2Text(keep_inline_math=True, keep_comments=True)
print encoder.latex_to_text(r'Global well-posedness for the mass-critical stochastic nonlinear Schr\"{o}dinger equation on $\mathbb{R}$: small initial data')
I used the code above to decode LaTeX commands, and I only want the LaTeX accent commands to be converted to plain text,
but the conversion also removes the LaTeX macros inside the inline math from the output.
Output:
Global well-posedness for the mass-critical stochastic nonlinear Schrödinger equation on $R$: small initial data
Expected output:
Global well-posedness for the mass-critical stochastic nonlinear Schrödinger equation on $\mathbb{R}$: small initial data
Working on the function that you proposed in #48, I came across a strange behavior: sometimes the len returned by the get_latex_nodes function is not the same as the len of the parsed node.
In the MWE below, the first node is a MacroNode and the pos and len reported by the function are the same as the attributes of the node itself, but for the second node (a CharsNode, the newline char), the function reports a len different from that of the node.
import pylatexenc.latexwalker
doc_content = r"""\emph{A}
\emph{B}"""
print(f"DOC: {doc_content}")
lw = pylatexenc.latexwalker.LatexWalker(
    doc_content,
    tolerant_parsing=False,
)
pos = 0
(tmp_list, npos, nlen) = lw.get_latex_nodes(pos, read_max_nodes=1)
node = tmp_list[0]
print(f"Node at pos {pos}: node len {node.len} == returned len {nlen}; OK")
pos = 8
(tmp_list, npos, nlen) = lw.get_latex_nodes(pos, read_max_nodes=1)
node = tmp_list[0]
print(f"Node at pos {pos}: node len {node.len} != returned len {nlen}; BAD")
By the way, with some minor tweaks, the function you proposed in #48 works well, thanks again 👍
I think the spec for the macro \href might be missing from pylatexenc/latexwalker/_defaultspecs.py.
The following code gives an error: it expects a node with two arguments, but gets a node with an empty argnlist.
from pylatexenc.latex2text import LatexNodes2Text
latex = r"\href{a}{b}"
print(LatexNodes2Text().latex_to_text(latex))
[furutaka@Furutaka-3 ~]$ conda list|grep pylatexenc
pylatexenc 1.2
[furutaka@Furutaka-3 ~]$ python
Python 2.7.14 |Anaconda, Inc.| (default, Dec 7 2017, 17:05:42)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pylatexenc.latex2text import LatexNodes2Text
>>> latex=r"""$\gamma$"""
>>> print LatexNodes2Text(keep_inline_math=True).latex_to_text(latex)
$γ$
Is this the intended behavior of the function?
I expected $\gamma$...
Kazuyoshi
latex2text.py:688 uses sys.exc_value, which has been deprecated since version 1.5 and is no longer available in Python 3. Use exc_info()[1] instead. (exc_value is actually defined in the line just above, so I imagine the continued use of sys.exc_value is unintentional?)
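The replacement can be sketched as below, using only the standard library. The function describe_failure is a hypothetical wrapper written for illustration; the relevant part is reading the active exception from sys.exc_info()[1], which works on both Python 2 and Python 3.

```python
import sys


def describe_failure(func):
    """Call func and return a short description of the exception it raises, if any."""
    try:
        func()
    except Exception:
        # Portable replacement for the removed sys.exc_value attribute
        exc_value = sys.exc_info()[1]
        return "%s: %s" % (type(exc_value).__name__, exc_value)
    return None
```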
>>> latex = 'e^{2x}'
>>> LatexNodes2Text().latex_to_text(latex)
'e^2x'
How can I get e^(2x)?
Thanks.
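One possible workaround is to add explicit parentheses inside multi-character script groups before conversion, so that they survive when the braces are dropped. This is a standard-library sketch, not a pylatexenc option: the helper name parenthesize_scripts is hypothetical, it only handles brace groups without nesting, and it is an assumption that the converter then renders the result as e^(2x), based on the braces being removed in the output shown above.

```python
import re

# A ^ or _ followed by a flat brace group holding at least two characters.
_SCRIPT_GROUP = re.compile(r"([\^_])\{([^{}]{2,})\}")


def parenthesize_scripts(latex):
    """Rewrite ^{2x} as ^{(2x)} so the grouping stays visible in plain text."""
    return _SCRIPT_GROUP.sub(
        lambda m: m.group(1) + "{(" + m.group(2) + ")}",
        latex,
    )
```

Single-character scripts such as x^{2} are left alone, since they are unambiguous without parentheses.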
`w = LatexWalker(r"""
... \documentclass[twoside]{article}
...\usepackage{amsmath,amssymb}
...\usepackage{amsthm}
...\usepackage{graphicx,epsfig,amscd,mathrsfs,multirow,bm}
...\usepackage{cite}% citation package
...\input xy
...\xyoption{all}
...\input scrload.tex
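...\allowdisplaybreaks[4]% page breaks in multi-line equations, from 0 to 4, indicating how strongly page breaking is enforced; e.g. 0 means avoid a page break whenever possible, 4 means force a page break.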
%------------------------ Page Format --------------------------
...\textwidth=147truemm
...\textheight=210truemm
...\headsep=4truemm
...\topmargin= 0pt
...\oddsidemargin=0pt
...\evensidemargin=0pt
...\parindent=16pt
...\setcounter{page}{415}
...\footskip=0pt
...\renewcommand{\baselinestretch}{1.06}
...\renewcommand{\arraystretch}{1.2}
...\catcode`@=11
...\long\def@makefntext#1{\noindent #1}
...\newskip\tabcentering \tabcentering=1000pt plus 1000pt minus 1000pt
...\def\REF#1{\par\hangindent\parindent\indent\llap{#1\enspace}\ignorespaces} %reference format
...\def\MCH#1#2{\setbox0=\hbox{\raise#1\hbox{#2}}\smash{\box0}}% move char
...\def\sub#1{\par\vspace{\baselineskip}\noindent #1\vspace*{\baselineskip}\rm\par}
...\def\CR{\cr\noalign{\vspace{1mm} \hrule \vspace{1mm}}}
...\def@evenfoot{}\def@oddfoot{}
...\def@evenhead{\hbox to\textwidth{\small\rm\thepage \hfill
...{\it Yu-e BAO, Na LI and Linfen ZHANG}}} % authors name (the given name is before the surname, and use "and" to separate two authors)%
...\def@oddhead{\hbox to \textwidth{\small{\it
...Differentiability of interval valued function and its application in interval valued programming
...} \hfill\thepage}} % Capitalize the first letter in abbreviate title%
...%format defination
...\def\scr{\mathscr}
...\def\SUB#1{\vskip .2in\leftline{\large\bf #1}\vskip .1in}
...\def\subsec#1{\vskip 2mm\leftline{#1}\vskip 1mm}
...\def\th#1{\vskip 1mm\noindent{\bf #1}\quad}
...\def\thn#1{\noindent{\bf #1}\quad}
...\def\proofn{\noindent{\it Proof}\quad}
...\def\proof{\vskip 1mm\noindent{\it Proof}\quad}
...\def\vsp{\vskip {1mm}}
...%end format defination
...\font\tenbf=cmb10 scaled \magstep0
...\font\tenrm=cmr10 scaled \magstep0
...\def\ESEC#1#2#3{\vskip.2in \begin{center} \tenbf #1\[.1in]
...\small #2\
...\footnotesize
...\end{center}}
...\renewcommand{\topfraction}{1}
...\renewcommand{\bottomfraction}{1}
...\renewcommand{\textfraction}{0}
...\renewcommand{\floatpagefraction}{0}
...\floatsep=0pt
...\textfloatsep=0pt
...\intextsep=0pt
...\catcode`@=12
...\def\bc{\begin{center}}
...\def\ec{\end{center}}
...\def\no{\noindent}
...\def\hang{\hangindent\parindent}
...\def\textindent#1{\indent\llap{\qquad #1\ \ \enspace}\ignorespaces}
...\def\ref{\par\hang\textindent}
...\def\d#1#2{\frac{\displaystyle #1}{\displaystyle #2}}
...\def\f#1#2{\frac{#1}{#2}}
...\def\d{{\rm d}}
...%\usepackage{lineno}
...\begin{document}
...%\linenumbers
...\abovedisplayskip=6pt plus 1pt minus 1pt \belowdisplayskip=6pt
...plus 1pt minus 1pt
...%------------------- First Head -----------------------------------------
...\thispagestyle{empty} \vspace*{-1.0truecm} \noindent
...\parbox[b]{7truecm}{\footnotesize\baselineskip=11pt\noindent{\it Journal of Mathematical Research with Applications}\
...Jul., 2020, Vol.,40, No.,4, pp.,415--431\
...DOI:10.3770/j.issn:2095-2651.2019.04.009\
...Http://jmre.dlut.edu.cn} \hfill
...%\parbox[]{6truecm}{\vskip -1.7cm \hfill \begin{tabular}{l}\\hline {\bf Journal of Mathematical}\ {\bf Research and Exposition}\\hline\end{tabular}}
...%\parbox[t]{6truecm}{\vskip -1.7cm \hfill\includegraphics{actmark.eps}}
...%===================Text=============================================
...\vskip 10mm \bc{\Large\bf Differentiability of Interval Valued Function and Its Application in Interval Valued Programming
...\footnotetext{\footnotesize Received February 20, 2019; Accepted April 21, 2020\
...Supported by the National Natural Science Fund of China (Grant No.,11461052) and the Natural Science Foundation of Inner Mongolia (Grant No.,2018MS01010).\
...* Corresponding author\
...E-mail address: [email protected] (Yu-e BAO); [email protected] (Na LI); [email protected] (Linfen ZHANG)
...} } \ec
...% National Natural Science Foundation of China (Grant No.11461052), Natural Science Foundation of Inner Mongolia (Grant No.2018MS01010).
...\vskip 5mm
...\bc{\bf Yu-e BAO$^*$,\ \ \ Na LI,\ \ \ Linfen ZHANG}\
...{\small\it College of Mathematics and Physics, Inner Mongolia University for Nationalities,\ Inner Mongolia
...}\ec
...\vskip 1 mm
\end{document}`
The TeX content is shown above. The line "\def\bc{\begin{center}}" is parsed incorrectly: "\begin{center}" and the content after it are parsed into a LatexGroupNode. The version used is v2.8.
First, let me thank you for your work; it helps me a lot.
I want to report the issue in the title.
For instance:
echo '\newcolumntype{C}{>{$}c<{$}}' | latex2text
will issue a parse error (I think because the parser sees a closing group right after an opening math).
I don't know if it is possible to fix, but, if it is not, I wanted to ask if it would be possible to entirely skip the line(s) with parsing errors (maybe by supplying some optional argument) in the hope of not cluttering the rest of the parsing.
Something similar: echo '\newcommand{\be}{\begin{equation}}' |latex2text
My use case is this: I need to scan a tex file, looking for some specific macros (\title, \author, ...) and transform their arguments into text. I do not want to use something like latex_text.find("\\title") (and then parse from there) because of comments and "look-alike" macros (e.g. \titleBar, \authorFoo). I could use regexps to find the starting point, but I prefer to navigate the tree of nodes built by LatexWalker.
Hi, thanks for putting together this nice library!
Would you be open to a pull request making the source conform better to PEP8?
Examples: import statements at the top, no empty lines between import statements.
Thanks again,
First of all thank you very much for this project. It saved me from a lot of hassle with all the special characters!
I have a small issue with the processing of {\ensuremath{xxx}}:
My input to LatexNodes2Text().latex_to_text() is, e.g.,
{{Neutral CGM as damped Ly{\ensuremath{\alpha}} absorbers at high redshift}}
What gets returned is:
Neutral CGM as damped Ly absorbers at high redshift
-> The whole ensuremath gets removed
I saw that this case was handled the other way around (encoding unicode) here: https://pylatexenc.readthedocs.io/en/latest/latexencode/
But was unable to resolve this problem.
Hi, thanks for this useful utility, and sorry for the usage question. I'm finding that the following code
print(LatexNodes2Text().latex_to_text(r"""
\documentclass{article}
\usepackage{times}
\definecolor{gray}
\RequirePackage{fixltx2e}
"""))
will output
gray
fixltx2e
Basically, the arguments to the first two macros get removed, whereas the arguments for the second two don't. I think this may be because the first two macros are built-in LaTeX macros, whereas the latter two are custom ones, so maybe pylatexenc doesn't know how many arguments there should be, so it defaults to assuming "no arguments". If I run this:
def raise_l2t_unknown_latex(n):
    if n.isNodeType(pylatexenc.latexwalker.LatexMacroNode):
        print(n.latex_verbatim())
l2t_db = pylatexenc.latex2text.get_default_latex_context_db()
l2t_db.set_unknown_macro_spec(
    pylatexenc.latex2text.MacroTextSpec("", simplify_repl=raise_l2t_unknown_latex)
)
LatexNodes2Text(latex_context=l2t_db).latex_to_text(r"""
\documentclass{article}
\usepackage{times}
\definecolor{gray}
\RequirePackage{fixltx2e}
""")
then indeed I get
\documentclass{article}
\usepackage{times}
\definecolor
\RequirePackage
Is there any way to make pylatexenc automatically try to grab as many arguments as possible from unknown macros? Thanks a lot!
Hi, first of all, thanks a lot for this project! I found it very useful to quickly insert unicode characters that correspond to usual mathematical symbols in emails, etc.
Am I correct that there is no support for symbols like ℝ in .latex_to_text? I see that appropriate records are included in latexencode/_uni2latexmap.py, e.g.
pylatexenc/latexencode/_uni2latexmap.py:0x1D549: r'\ensuremath{\mathbb{R}}', # MATHEMATICAL DOUBLE-STRUCK CAPITAL R
but it appears they are not used in latex_to_text:
from pylatexenc.latex2text import LatexNodes2Text
import unicodedata
R = LatexNodes2Text().latex_to_text(r"$\mathbb{R}$")
unicodedata.name(R)
# 'LATIN CAPITAL LETTER R'
Is it possible to add them? Unfortunately, I wasn't able to understand how to add new macros in .latex_to_text.
When converting LaTeX to plain text, the conversion methods of the LatexNodes2Text class consume whitespace between neighboring braced groups in the strict_latex_spaces=True mode, deviating from the LaTeX handling of whitespace in such cases.
For instance, LatexNodes2Text().latex_to_text('{a} {b}') returns ab instead of a b.
Hello,
I have an HTML page with some math formulas that I want to convert into LaTeX. But for a lot of Unicode characters it returns this error:
Is there a way to convert these Unicode characters as well?
Sorry for bothering you and thanks for helping me.
ℵ (\aleph) is used to denote infinite cardinals in set theory. It is missing from the _utf8latexmap.py map. Its character code is 0x2135.
See: https://en.wikipedia.org/wiki/Cardinal_number for a common use case.
Import error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-7-a40f2e3e409e> in <module>()
1 import pylatexenc
----> 2 import pylatexenc.latex2text
/Users/rasmus/Dev/repos-others/pylatexenc/pylatexenc/latex2text.py in <module>()
28 import re
29 import unicodedata
---> 30 import latexwalker
31 import logging
32
ModuleNotFoundError: No module named 'latexwalker'
To fix: convert the implicit relative import statements to absolute ones:
from pylatexenc import latexwalker
Add from __future__ import absolute_import to the top of all files to retain Python 2 support.
[furutaka@Furutaka-3 automate-refs]$ conda list|grep pylatexenc
pylatexenc 1.2
[furutaka@Furutaka-3 automate-refs]$ python
Python 2.7.14 |Anaconda, Inc.| (default, Dec 7 2017, 17:05:42)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pylatexenc.latex2text import LatexNodes2Text
>>> print (LatexNodes2Text(keep_inline_math=True).latex_to_text(r"$\alpha$$\beta$ $\gamma$ ",keep_inline_math=True))
$\alpha$ $\beta$$\gamma$
>>> print (LatexNodes2Text(keep_inline_math=True).latex_to_text(r"$\gamma$ detector",keep_inline_math=True))
$\gamma$ detector
>>> print (LatexNodes2Text(keep_inline_math=True).latex_to_text(r"$\gamma$$\gamma$ coincidence",keep_inline_math=True))
$\gamma$ $\gamma$ coincidence
The symbols < and > do not need to be changed, yet they are encapsulated with \ensuremath{}.
PS: Your library is being very helpful in a small project I am working on. Thanks!
If I try to parse a TeX fragment containing a macro that may take a form like \foo[a]{b}[c], the parser apparently looks for the rightmost optional argument beyond the end of the input.
This is noticeable only with tolerant_parsing=False.
import pylatexenc
import pylatexenc.latexwalker
# add the spec of macro \foo that has one mandatory argument between two optional arguments
# add the spec of macro \bar that has a more common form (for comparison)
walker_context = pylatexenc.latexwalker.get_default_latex_context_db()
walker_context.add_context_category(
    'mymacros',
    macros=[
        pylatexenc.macrospec.std_macro('foo', '[{['),
        pylatexenc.macrospec.std_macro('bar', '[{'),
    ],
    prepend=True
)
def parse(latex):
    "helper function"
    lw = pylatexenc.latexwalker.LatexWalker(
        latex,
        latex_context=walker_context,
        tolerant_parsing=False
    )
    lw.get_latex_nodes()
# \bar (common macros, all fine)
latex = r"\bar{test}"
parse(latex)
# \foo (uncommon definition, some glitch)
# if the macro is followed by something, all is well
latex = r"\foo{test} ciao"
parse(latex)
# if the macro is alone, there is an Exception
latex = r"""\foo{test}"""
try:
    parse(latex)
except Exception as e:
    print(f"{latex} → {e}")
else:
    print(f"{latex} → OK")
from pylatexenc.latex2text import LatexNodes2Text
LatexNodes2Text().latex_to_text(r'$1 \not= 2$')
gives
Traceback (most recent call last):
...
File ".../pylatexenc/latex2text/_defaultspecs.py", line 478, in make_accented_char
nodearg = node.nodeargs[0] if len(node.nodeargs) else latexwalker.LatexCharsNode(chars=' ')
NameError: name 'latexwalker' is not defined
latex2text fails when parsing a document that contains a \newenvironment command that wraps an existing environment. I have been able to narrow it down to the following minimal example:
latex2text --code '\newenvironment{annotate}{\begin{scope}}{\end{scope}}'
which gives the following output:
INFO:pylatexenc.latexwalker:Ignoring parse error (tolerant parsing mode): Unexpected mismatching closing brace: '}' @(1,39)
INFO:pylatexenc.latexwalker:Ignoring parse error (tolerant parsing mode): Unexpected closing environment: 'scope' @(1,41)
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2248, in do_read
mspec.parse_args(w=self, pos=tok.pos + tok.len,
File "/usr/local/lib/python3.8/site-packages/pylatexenc/macrospec/__init__.py", line 95, in parse_args
return self.args_parser.parse_args(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/pylatexenc/macrospec/_argparsers.py", line 293, in parse_args
(node, np, nl) = w.get_latex_expression(
File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1551, in get_latex_expression
tok = self.get_token(pos, environments=False, parsing_state=parsing_state)
File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1356, in get_token
raise LatexWalkerEndOfStream(final_space=space)
pylatexenc.latexwalker.LatexWalkerEndOfStream
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/latex2text", line 11, in <module>
load_entry_point('pylatexenc==2.8', 'console_scripts', 'latex2text')()
File "/usr/local/lib/python3.8/site-packages/pylatexenc/latex2text/__main__.py", line 190, in main
(nodelist, pos, len_) = lw.get_latex_nodes()
File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2351, in get_latex_nodes
r_endnow = do_read(nodelist, p)
File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 2251, in do_read
e = self._exchandle_parse_subexpression(
File "/usr/local/lib/python3.8/site-packages/pylatexenc/latexwalker/__init__.py", line 1862, in _exchandle_parse_subexpression
e.open_contexts.append(
AttributeError: 'LatexWalkerEndOfStream' object has no attribute 'open_contexts'
By trial and error, I found out that parsing works if I add a custom definition macrospec.std_macro('newenvironment', "*[[{{"), i.e. remove the first { argument from the default *{[[{{.
I expect the following code to replace the ASCII apostrophe in the given string with the Unicode right single quote, as specified:
from pylatexenc import latex2text, latexwalker
def parse_tex(str):
    return latexwalker.LatexWalker(str).get_latex_nodes()[0]
def render_tex(tex_nodes):
    return render_tex.converter.nodelist_to_text(tex_nodes)
SpecialsTextSpec = latex2text.SpecialsTextSpec
context = latex2text.get_default_latex_context_db()
context.add_context_category(
    'more-nonascii-specials',
    insert_after='nonascii-specials',
    specials=(
        SpecialsTextSpec('`', u'\N{LEFT SINGLE QUOTATION MARK}'),
        SpecialsTextSpec("'", u'\N{RIGHT SINGLE QUOTATION MARK}'),
    ),
)
render_tex.converter = latex2text.LatexNodes2Text(
    latex_context=context,
    strict_latex_spaces=True,
)
print(render_tex(parse_tex("patient's state")))
However, the apostrophe remains unchanged in the output.
Is it a bug, or am I missing something?