jgm / doclayout

A prettyprinting library designed for laying out plain text documents

License: BSD 3-Clause "New" or "Revised" License

Haskell 99.54% Makefile 0.46%

doclayout's Issues

Ignore ANSI terminal escape sequences in length calculations

Ignoring ANSI escape sequences would make it possible to lay out marked-up text in the terminal. Escape sequences for text markup all follow the pattern \ESC\[[0-9;]*m.
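
A minimal sketch of the idea, assuming the markup is removed before measuring. The names `stripSGR` and `visibleLength` are hypothetical and not part of doclayout, and plain `length` stands in for a width-aware measure:

```haskell
import Data.Char (isDigit)

-- | Drop escape sequences of the form ESC [ <digits and semicolons> m.
-- Anything that does not match this exact shape is left untouched.
stripSGR :: String -> String
stripSGR ('\ESC' : '[' : rest)
  | ('m' : after) <- dropWhile isParam rest = stripSGR after
  where
    isParam c = isDigit c || c == ';'
stripSGR (c : cs) = c : stripSGR cs
stripSGR [] = []

-- | Number of characters left once the markup is stripped.
-- Assumption: a real implementation would sum character widths instead.
visibleLength :: String -> Int
visibleLength = length . stripSGR
```

Loaded in GHCi, `visibleLength "\ESC[1;31mbold red\ESC[0m"` counts only the eight characters of `bold red`. Other escape sequences, such as cursor movement, are deliberately not handled here.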

Initial discussion: hslua/hslua-module-doclayout#2

Wrong length for curly apostrophe

Prelude Text.DocLayout Data.List> literal "a’s"
Text 2 "a\8217s"

Should be length 3. I found this after noting a bunch of wrapping-related test failures in pandoc from the new doclayout release.

@Xitian9 can you see the problem? I believe this is due to your changes in real length calculation code.

Spacing combining characters should still increase width

Some combining characters, namely those with Unicode general category Mc (Mark, spacing combining), should add to the length of the text even though they combine. These characters are commonly used in abugidas such as Devanagari.
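
As a rough illustration, the distinction can be made with `generalCategory` from `Data.Char`. The functions `markAwareWidth` and `markAwareLength` are hypothetical names. The sketch assumes every non-mark character has width 1, which ignores wide characters and emoji:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Width contribution of a character, based only on its general category.
-- Assumption: anything that is not a mark counts as width 1.
markAwareWidth :: Char -> Int
markAwareWidth c = case generalCategory c of
  NonSpacingMark       -> 0  -- Mn: stacks on the base character
  EnclosingMark        -> 0  -- Me: drawn around the base character
  SpacingCombiningMark -> 1  -- Mc: occupies its own column
  _                    -> 1

-- | Sum of the per-character widths.
markAwareLength :: String -> Int
markAwareLength = sum . map markAwareWidth
```

For example, DEVANAGARI LETTER KA followed by DEVANAGARI VOWEL SIGN AA (`"\x0915\x093E"`) measures 2 under this scheme, while `e` followed by COMBINING ACUTE ACCENT (`"e\x0301"`) measures 1.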

charWidth gives incorrect result for emoji

Emoji are supposed to be displayed as 2 characters wide, apparently since Unicode 9. However, here they are treated as 1 character wide.

Here is a list of emoji in Unicode 14 (https://unicode.org/emoji/charts/full-emoji-list.html). Things can get pretty ugly with zero-width combiners, but we can probably improve on the current situation.

cf. https://bugs.launchpad.net/ubuntu/+source/gnome-terminal/+bug/1665140
https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9
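
A deliberately small sketch of what an emoji-aware width function might look like. `emojiAwareWidth` is a hypothetical name, and only the Emoticons block (U+1F600 to U+1F64F) is covered. A real fix would need the full Unicode data tables, and summing per-character widths still over-counts joined sequences:

```haskell
import Data.Char (ord)

-- | Width of a single character, with a few emoji-related special cases.
-- Assumption: only the Emoticons block is treated as wide in this sketch.
emojiAwareWidth :: Char -> Int
emojiAwareWidth c
  | cp == 0x200D = 0                    -- ZERO WIDTH JOINER
  | cp == 0xFE0F = 0                    -- VARIATION SELECTOR-16
  | cp >= 0x1F600 && cp <= 0x1F64F = 2  -- Emoticons block
  | otherwise = 1
  where
    cp = ord c
```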

Performance could be improved when determining character width

I've been looking into improving the performance of realLength, and reducing our reliance on shortcuts. I have sought advice and benchmarked a few different approaches on stackexchange.

The long and the short of it is that we can do better, but we also have some choices to make. By far the fastest approach is to build a giant unboxed array with one entry per Unicode code point and perform an array lookup. This improves performance for all characters (except ASCII control characters), but carries a significant memory and set-up cost: on my system it requires about 368MiB of memory and about 150ms to set up.

Is this worthwhile? If you're working on ASCII, the set-up cost is paid off after about 150 million lookups, while for text that does not hit the shortcuts the payoff comes after only about 6 million lookups. We would also get a large reduction in code complexity, with no more shortcuts needed at all.

There are other improvements that can be made as well, in particular writing the binary search tree directly, allowing it to be specialised for our use case. This would not give as dramatic a speedup, but may allow us to maintain ASCII performance and get away with fewer shortcuts.
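
The array-lookup approach can be sketched as follows, using `UArray` from the `array` package. The names `widthRanges`, `widthTable` and `tableWidth` are hypothetical, and the range list is a short illustrative excerpt rather than a complete width table:

```haskell
import Data.Array.Unboxed (UArray, accumArray, (!))
import Data.Char (ord)
import Data.Int (Int8)

-- | Illustrative excerpt of (first code point, last code point, width).
-- Assumption: a real table would be generated from the Unicode data files.
widthRanges :: [(Int, Int, Int8)]
widthRanges =
  [ (0x0300, 0x036F, 0)    -- combining diacritical marks
  , (0x3041, 0x3096, 2)    -- hiragana letters
  , (0x4E00, 0x9FFF, 2)    -- CJK unified ideographs
  , (0x1F600, 0x1F64F, 2)  -- emoticons
  ]

-- | One entry per code point; code points not listed default to width 1.
widthTable :: UArray Int Int8
widthTable =
  accumArray (\_ w -> w) 1 (0, 0x10FFFF)
    [ (cp, w) | (lo, hi, w) <- widthRanges, cp <- [lo .. hi] ]

-- | Width of a character by direct array lookup, with no shortcuts.
tableWidth :: Char -> Int
tableWidth c = fromIntegral (widthTable ! ord c)
```

The table is built lazily on first use, which is where the set-up cost is paid. This sketch stores one `Int8` per code point, so its memory use will differ from the figures measured in the benchmark described above.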

Rendering bug with certain inputs

See jgm/pandoc#8711

ghci> render Nothing $ mconcat [Block 71 ["a","","b"]]
*** Exception: renderList encountered [Empty,CarriageReturn,Text 1 "b"]
CallStack (from HasCallStack):
  error, called at src/Text/DocLayout.hs:453:21 in doclayout-0.4-inplace:Text.DocLayout

Recent versions of pandoc swallow the leading blank line of fenced code blocks when converting from markdown to latex

When converting a fenced code block with a leading blank line from markdown to latex, this one line doesn't appear in the output. For example,

```

This is the second line.
```

produces

\begin{verbatim}
This is the second line.
\end{verbatim}

This bug appears in version 2.5 and later. However, version 2.2.3.2 works and preserves the leading blank line as is.

Support indexed and 24 bit colors

The merged-in color support is limited to the 8-color ANSI palette. There should be rendering support and a combinator API for coloring text with the 256-color (indexed) palette and with 24-bit (true) color.
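
For reference, the escape sequences involved are `ESC[38;5;<n>m` for an indexed foreground color and `ESC[38;2;<r>;<g>;<b>m` for a true-color foreground (with 48 in place of 38 for the background). The helpers below are a hypothetical sketch of how they could be generated, not a proposal for the actual combinator API:

```haskell
import Data.List (intercalate)
import Data.Word (Word8)

-- | Build an SGR escape sequence from its numeric parameters.
sgr :: [Int] -> String
sgr params = "\ESC[" ++ intercalate ";" (map show params) ++ "m"

-- | Foreground color from the 256-color indexed palette.
fgIndexed :: Word8 -> String
fgIndexed n = sgr [38, 5, fromIntegral n]

-- | Foreground color as 24-bit RGB.
fgRGB :: Word8 -> Word8 -> Word8 -> String
fgRGB r g b = sgr [38, 2, fromIntegral r, fromIntegral g, fromIntegral b]

-- | Reset all attributes.
resetStyle :: String
resetStyle = sgr [0]

-- | Wrap text in a start sequence and a reset.
colored :: String -> String -> String
colored start txt = start ++ txt ++ resetStyle
```

Using `Word8` for the arguments keeps every component in the range 0 to 255 by construction.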

`Styled` documents interact poorly with line breaking.

The inner document of a Styled can be a Concat, but as written, unfoldD won't unfold that document. The ultimate effect, via the definition of offsetOf, is that Styled text will exceed the line length when output because renderList (BreakingSpace : xs) can't correctly measure the offset of a Styled following a BreakingSpace.

It's not readily apparent what the right adaptation is here. Sprinkling cases around like unfoldD (Styled f x) = Styled f <$> unfoldD x and offsetOf (Styled _ x) = offsetOf x goes some way towards addressing the line-breaking issue, but it then breaks how nested styles are flattened when outputting attributed text. That suggests we need some sort of further intermediate step, but I'd have to think pretty hard about a good way of doing that.
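
To make the two extra cases concrete, here is a toy model. The `Doc` type below is a simplified stand-in and not doclayout's real definition; only the names `unfoldD`, `offsetOf`, `Styled`, `Concat` and `BreakingSpace` are taken from the description above, and the style is represented by a plain string:

```haskell
-- Toy stand-in for the document type; NOT doclayout's real definition.
data Doc
  = Empty
  | Text Int String    -- cached width and contents
  | BreakingSpace
  | Concat Doc Doc
  | Styled String Doc  -- style name (an assumption) and inner document
  deriving (Eq, Show)

-- | Flatten nested concatenations, pushing a style down onto each piece.
unfoldD :: Doc -> [Doc]
unfoldD Empty = []
unfoldD (Concat x y) = unfoldD x ++ unfoldD y
unfoldD (Styled f x) = Styled f <$> unfoldD x
unfoldD x = [x]

-- | Width of a single unfolded piece; anything else counts as 0.
offsetOf :: Doc -> Int
offsetOf (Text n _) = n
offsetOf BreakingSpace = 1
offsetOf (Styled _ x) = offsetOf x
offsetOf _ = 0
```

In this toy, a `Styled` wrapping a `Concat` measures 0 before unfolding but the unfolded pieces sum to the expected width. The breaking space also ends up wrapped in `Styled`, so a renderer that matches on a bare `BreakingSpace` would no longer see it, which hints at why an intermediate step may be needed.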

Wrong character width in full-width symbol

This is my source markdown.

+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯      | a       |
+---------+---------+---------+
| row3    | ✕      | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+

I got the following result:

<table style="width:42%;">
<colgroup>
<col style="width: 13%" />
<col style="width: 13%" />
<col style="width: 13%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;"></th>
<th style="text-align: center;">column1</th>
<th style="text-align: center;">column2</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">row1</td>
<td style="text-align: center;">x</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="even">
<td style="text-align: left;">row2</td>
<td style="text-align: center;">◯ |</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="odd">
<td style="text-align: left;">row3</td>
<td style="text-align: center;">✕ |</td>
<td style="text-align: center;">a</td>
</tr>
<tr class="even">
<td style="text-align: left;">row4</td>
<td style="text-align: center;">あ</td>
<td style="text-align: center;">a</td>
</tr>
</tbody>
</table>

The problem is in the following lines:

<td style="text-align: center;">◯ |</td>

and

<td style="text-align: center;">✕ |</td>

These cells include the | character.

I can modify the source markdown to get the expected result as follows.

+---------+---------+---------+
|         | column1 | column2 |
+:========+:=======:+:=======:+
| row1    | x       | a       |
+---------+---------+---------+
| row2    | ◯       | a       |
+---------+---------+---------+
| row3    | ✕       | a       |
+---------+---------+---------+
| row4    | あ      | a       |
+---------+---------+---------+

However, that is not pretty.

I think half-width and full-width characters are being misjudged.
◯ and ✕ are full-width characters, just like あ.

Command line

sudo docker run --rm --mount type=bind,source=$(pwd),destination=/data pandoc/core -o out.html src.md

Version

# pandoc --version
pandoc 2.14.2
Compiled with pandoc-types 1.22, texmath 0.12.3.1, skylighting 0.11,
citeproc 0.5, ipynb 0.1.0.1
User data directory: /root/.local/share/pandoc
Copyright (C) 2006-2021 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
