jgm / doclayout
A prettyprinting library designed for laying out plain text documents
License: BSD 3-Clause "New" or "Revised" License
Ignoring ANSI escape sequences would allow laying out marked-up text in the terminal. Escape sequences for text markup all follow the pattern \ESC[([0-9;]*m.
Initial discussion: hslua/hslua-module-doclayout#2
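As a rough illustration of the idea, the sketch below drops sequences of the form ESC, `[`, any run of digits and semicolons, then `m`, before measuring a string. The names `stripSGR` and `visibleLength` are hypothetical and not part of doclayout, and the length here counts one column per character, which is a simplification.

```haskell
import Data.Char (isDigit)

-- | Remove sequences of the form ESC [ <digits and semicolons> m.
-- Anything else, including other kinds of escape sequence, is kept.
-- (Hypothetical helper, not part of doclayout.)
stripSGR :: String -> String
stripSGR ('\ESC' : '[' : rest)
  | ('m' : after) <- dropWhile isParam rest = stripSGR after
  where
    isParam c = isDigit c || c == ';'
stripSGR (c : cs) = c : stripSGR cs
stripSGR [] = []

-- | Length of the text with markup sequences ignored.
-- Assumes one column per remaining character.
visibleLength :: String -> Int
visibleLength = length . stripSGR
```

A layout routine could then measure `visibleLength s` while still emitting the original `s`, so that the markup reaches the terminal but does not affect wrapping.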
Prelude Text.DocLayout Data.List> literal "a’s"
Text 2 "a\8217s"
Should be length 3. I found this after noting a bunch of wrapping-related test failures in pandoc from the new doclayout release.
@Xitian9 can you see the problem? I believe this is due to your changes in real length calculation code.
Some combining characters, namely those with general category Mc (Mark, spacing combining), should actually add to the length of the text, even though they combine. These characters are commonly used in abugidas like Devanagari.
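A minimal sketch of the distinction, using `generalCategory` from `Data.Char` in base. The function names `charWidth` and `textWidth` are hypothetical, and wide characters and emoji are deliberately ignored here, so this is not a full width calculation.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Width contribution of a single character, looking only at
-- its general category. (Hypothetical helper; East Asian wide
-- characters and emoji are not handled in this sketch.)
charWidth :: Char -> Int
charWidth c = case generalCategory c of
  NonSpacingMark       -> 0  -- Mn: combines, takes no column
  EnclosingMark        -> 0  -- Me: combines, takes no column
  SpacingCombiningMark -> 1  -- Mc: combines, but occupies a column
  _                    -> 1

-- | Sum of the per-character widths.
textWidth :: String -> Int
textWidth = sum . map charWidth
```

Under these assumptions, a Latin letter followed by a nonspacing accent (for example U+0301) measures 1, while a Devanagari consonant followed by a spacing vowel sign (for example U+0915 then U+093F) measures 2.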
Emoji are supposed to be displayed as 2 characters wide, apparently since Unicode 9. However, here they are treated as 1 character wide.
Here is a list of emoji in Unicode 14 (https://unicode.org/emoji/charts/full-emoji-list.html). Things can get pretty ugly with zero-width combiners, but we can probably improve on the current situation.
cf. https://bugs.launchpad.net/ubuntu/+source/gnome-terminal/+bug/1665140
https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9
I've been looking into improving the performance of realLength, and reducing our reliance on shortcuts. I have sought advice and benchmarked a few different approaches on stackexchange.
The long and the short of it is that it seems we can do better, but we also have some choices to make. By far the best approach is making a giant unboxed array the size of all unicode points, and performing an array lookup. This will improve performance for all characters (except for ASCII control characters), but involves a significant memory and set-up cost. On my system it requires about 368MiB of memory, and about 150ms set-up cost.
Is this worthwhile? If you're working on ASCII, the set-up cost is paid off after about 150 million lookups, while for text that doesn't hit the shortcuts the payoff comes after only about 6 million lookups. But we would get a large reduction in code complexity, with no more shortcuts needed at all.
There are other improvements that can be made as well, in particular writing the binary search tree directly, allowing it to be specialised for our use case. This would not give as dramatic a speedup, but may allow us to maintain ASCII performance and get away with fewer shortcuts.
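To make the two options concrete, here is a small sketch of both: an unboxed array with one entry per code point, and a hand-written search tree over width ranges. It assumes the `array` package that ships with GHC; the range list is a tiny illustrative subset rather than real Unicode width data, and all names are hypothetical. It stores one byte per code point, so it says nothing about the memory and set-up figures quoted above.

```haskell
import Data.Array.Unboxed (UArray, accumArray, (!))
import Data.Char (ord)
import Data.Word (Word8)

-- | Illustrative, sorted, non-overlapping (lo, hi, width) ranges.
-- NOT a complete table.
sampleRanges :: [(Int, Int, Word8)]
sampleRanges =
  [ (0x0300, 0x036F, 0)    -- combining diacritical marks
  , (0x3041, 0x3096, 2)    -- hiragana
  , (0x1F600, 0x1F64F, 2)  -- emoticons
  ]

-- | Option 1: one entry per code point; unlisted code points get width 1.
widthTable :: UArray Int Word8
widthTable =
  accumArray (\_ w -> w) 1 (0, 0x10FFFF)
    [ (i, w) | (lo, hi, w) <- sampleRanges, i <- [lo .. hi] ]

lookupWidth :: Char -> Int
lookupWidth c = fromIntegral (widthTable ! ord c)

-- | Option 2: a balanced search tree built directly from the ranges.
data RangeTree = Leaf | Node RangeTree (Int, Int, Word8) RangeTree

build :: [(Int, Int, Word8)] -> RangeTree
build [] = Leaf
build xs = Node (build l) x (build r)
  where
    (l, x : r) = splitAt (length xs `div` 2) xs

rangeTree :: RangeTree
rangeTree = build sampleRanges

treeWidth :: Char -> Int
treeWidth c = go rangeTree
  where
    n = ord c
    go Leaf = 1
    go (Node l (lo, hi, w) r)
      | n < lo    = go l
      | n > hi    = go r
      | otherwise = fromIntegral w
```

Both functions return the same widths for the sample data; the array trades memory and set-up time for a constant-time lookup, while the tree needs no set-up but does a few comparisons per character.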
See jgm/pandoc#8711
ghci> render Nothing $ mconcat [Block 71 ["a","","b"]]
*** Exception: renderList encountered [Empty,CarriageReturn,Text 1 "b"]
CallStack (from HasCallStack):
error, called at src/Text/DocLayout.hs:453:21 in doclayout-0.4-inplace:Text.DocLayout
When converting a fenced code block with a leading blank line from Markdown to LaTeX, the blank line doesn't appear in the output. For example,
```

This is the second line.
```
produces
\begin{verbatim}
This is the second line.
\end{verbatim}
This bug appears in version 2.5 and later. However, version 2.2.3.2 works and preserves the leading blank line as is.
The merged-in color support is limited to the 8-color ANSI palette. There should be rendering support and a combinator API for coloring text with the 256-color (indexed) palette and with 24-bit (true) color.
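For reference, terminals that support these palettes commonly select an indexed foreground color with the parameters `38;5;n` and a 24-bit foreground color with `38;2;r;g;b`. The sketch below generates those sequences; the `Color` type and the functions `fgParams`, `sgr` and `withForeground` are hypothetical names for illustration and are not doclayout's API.

```haskell
import Data.List (intercalate)
import Data.Word (Word8)

-- | Hypothetical color type covering the two requested palettes.
data Color
  = Indexed Word8                 -- 256-color palette index
  | TrueColor Word8 Word8 Word8   -- 24-bit red, green, blue

-- | SGR parameters that select a foreground color.
fgParams :: Color -> [Int]
fgParams (Indexed n) = [38, 5, fromIntegral n]
fgParams (TrueColor r g b) =
  [38, 2, fromIntegral r, fromIntegral g, fromIntegral b]

-- | Render a parameter list as an escape sequence.
sgr :: [Int] -> String
sgr ps = "\ESC[" ++ intercalate ";" (map show ps) ++ "m"

-- | Wrap text in a foreground color, then restore the default
-- foreground (parameter 39).
withForeground :: Color -> String -> String
withForeground col s = sgr (fgParams col) ++ s ++ sgr [39]
```

A combinator API in the library would presumably attach such a color to a document node rather than to a raw string; this only shows the byte sequences a renderer would need to emit.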
The inner document of a Styled can be a Concat, but as written, unfoldD won't unfold that document. The ultimate effect, via the definition of offsetOf, is that Styled text will exceed the line length when output because renderList (BreakingSpace : xs) can't correctly measure the offset of a Styled following a BreakingSpace.
It's not readily apparent what the right adaptation is here. Sprinkling cases around like unfoldD (Styled f x) = Styled f <$> unfoldD x and offsetOf (Styled _ x) = offsetOf x works towards addressing the line-breaking issue, but that then breaks how nested styles are flattened when outputting attributed text. That suggests we have to do some sort of further intermediate step, but I'd have to think pretty hard about a good way of doing that.
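One possible shape for such an intermediate step, sketched on a toy document type rather than the library's real one: distribute each style over the Concat inside it, so that every leaf carries the full list of styles enclosing it. Everything here (`Doc`, `Piece`, `unfoldStyled`, `widthUntilBreak`) is a hypothetical stand-in, and whether this fits the actual renderer is an open question.

```haskell
-- | Toy stand-in for a document type; NOT doclayout's Doc.
data Doc
  = Text Int String      -- width and contents
  | BreakingSpace
  | Concat Doc Doc
  | Styled String Doc    -- a style label wrapping an inner document
  deriving (Eq, Show)

-- | A leaf together with every style enclosing it, outermost first.
data Piece = Piece [String] Doc
  deriving (Eq, Show)

-- | Unfold Concat and Styled together, keeping the style nesting
-- as a list on each leaf instead of as tree structure.
unfoldStyled :: Doc -> [Piece]
unfoldStyled = go []
  where
    go fs (Concat a b) = go fs a ++ go fs b
    go fs (Styled f d) = go (fs ++ [f]) d
    go fs leaf         = [Piece fs leaf]

-- | Width of the text up to the next breaking space, measured
-- without the styles getting in the way.
widthUntilBreak :: [Piece] -> Int
widthUntilBreak ps =
  sum [ n | Piece _ (Text n _) <- takeWhile notBreak ps ]
  where
    notBreak (Piece _ BreakingSpace) = False
    notBreak _                       = True
```

In this toy model the offset after a breaking space can be measured on plain leaves, while the per-leaf style list still records the nesting that the attributed-text output needs.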
This is my source markdown.
+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯      | a       |
+---------+---------+---------+
| row3    | ✕      | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+
I got the following result:
<table style="width:42%;">
<colgroup>
<col style="width: 13%" />
<col style="width: 13%" />
<col style="width: 13%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;"></th>
<th style="text-align: center;">column1</th>
<th style="text-align: center;">column2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">row1</td>
<td style="text-align: center;">x</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="even">
<td style="text-align: left;">row2</td>
<td style="text-align: center;">◯ |</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="odd">
<td style="text-align: left;">row3</td>
<td style="text-align: center;">✕ |</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="even">
<td style="text-align: left;">row4</td>
<td style="text-align: center;">あ</td>
<td style="text-align: center;">a</td>
</tr>
</tbody>
</table>
The problem is in the following lines:
<td style="text-align: center;">◯ |</td>
and
<td style="text-align: center;">✕ |</td>
These cells include the | character.
I can modify the source markdown to get the expected result as follows.
+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯       | a       |
+---------+---------+---------+
| row3    | ✕       | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+
However, the table source no longer lines up, which is not pretty.
I think the cause is a misjudgment between half-width and full-width characters: ◯ and ✕ are full-width characters, just like あ.
sudo docker run --rm --mount type=bind,source=$(pwd),destination=/data pandoc/core -o out.html src.md
# pandoc --version
pandoc 2.14.2
Compiled with pandoc-types 1.22, texmath 0.12.3.1, skylighting 0.11,
citeproc 0.5, ipynb 0.1.0.1
User data directory: /root/.local/share/pandoc
Copyright (C) 2006-2021 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.