
wcwidth's Introduction


Introduction

This library is mainly for CLI programs that carefully produce output for terminals, or that pretend to be a terminal emulator.

Problem Statement: The printable length of most strings is equal to the number of cells they occupy on the screen (1 character : 1 cell). However, there are categories of characters that occupy 2 cells (full-wide), and others that occupy 0 cells (zero-width).

Solution: POSIX.1-2001 and POSIX.1-2008 conforming systems provide the wcwidth(3) and wcswidth(3) C functions, which this python module's functions precisely copy. These functions return the number of cells a unicode string is expected to occupy.

Installation

The stable version of this package is maintained on PyPI. Install using pip:

pip install wcwidth

Example

Problem: given the following phrase (Japanese),

>>> text = u'コンニチハ'

Python incorrectly uses the string length of 5 codepoints rather than the printable length of 10 cells, so that when using the rjust function, the output length is wrong:

>>> print(len('コンニチハ'))
5

>>> print('コンニチハ'.rjust(20, '_'))
_______________コンニチハ

By defining our own "rjust" function that uses wcwidth, we can correct this:

>>> def wc_rjust(text, length, padding=' '):
...    from wcwidth import wcswidth
...    return padding * max(0, (length - wcswidth(text))) + text
...

Our Solution uses wcswidth to determine the string length correctly:

>>> from wcwidth import wcswidth
>>> print(wcswidth('コンニチハ'))
10

>>> print(wc_rjust('コンニチハ', 20, '_'))
__________コンニチハ

Choosing a Version

Export an environment variable, UNICODE_VERSION. This should be done by terminal emulators or those developers experimenting with authoring one of their own, from shell:

$ export UNICODE_VERSION=13.0

If unspecified, the latest version is used. If your Terminal Emulator does not export this variable, you can use the jquast/ucs-detect utility to automatically detect and export it to your shell.
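As a rough illustration of how a program might honor this variable, here is a minimal sketch. The helper names `parse_version` and `resolve_unicode_version` are hypothetical and are not part of wcwidth; the "newest known version not exceeding the request" fallback is also an assumption made for this example. Only the "latest when unspecified" behavior comes from the text above.

```python
import os


def parse_version(text):
    """Turn '13.0' or '6.3.0' into a tuple of ints, padded to three parts."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def resolve_unicode_version(known_versions, environ=None):
    """Pick a table version (hypothetical helper, not the library's code).

    Returns the newest known version when UNICODE_VERSION is unset,
    otherwise the newest known version that does not exceed the request.
    """
    environ = os.environ if environ is None else environ
    ordered = sorted(known_versions, key=parse_version)
    requested = environ.get("UNICODE_VERSION")
    if not requested:
        return ordered[-1]
    wanted = parse_version(requested)
    candidates = [v for v in ordered if parse_version(v) <= wanted]
    # Assumption: fall back to the oldest table if the request predates all of them.
    return candidates[-1] if candidates else ordered[0]


KNOWN = ["4.1.0", "9.0.0", "13.0.0", "15.1.0"]  # sample list for illustration
print(resolve_unicode_version(KNOWN, {"UNICODE_VERSION": "13.0"}))
```

Passing a plain dict in place of `os.environ` keeps the sketch easy to try without changing the shell environment.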

wcwidth, wcswidth

Use function wcwidth() to determine the length of a single unicode character, and wcswidth() to determine the length of a string of unicode characters.

Briefly, return values of function wcwidth() are:

-1: Indeterminate (not printable).

0: Does not advance the cursor, such as NULL or Combining.

2: Characters of category East Asian Wide (W) or East Asian Full-width (F) which are displayed using two terminal cells.

1: All others.

Function wcswidth() simply returns the sum of the wcwidth() values of each character in a string, or -1 if any character in the string measures -1.
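The relationship between the two functions can be sketched with the standard library alone. This is only an approximation under stated assumptions: `approx_wcwidth` and `approx_wcswidth` are hypothetical stand-ins that consult `unicodedata` rather than this library's generated tables, so their results will differ from the real functions for many characters.

```python
import unicodedata


def approx_wcwidth(ch):
    """Approximate per-character width: -1, 0, 1 or 2 (not the library's tables)."""
    if ch == "\0":
        return 0                      # NULL does not advance the cursor
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1                     # control characters are indeterminate
    if category in ("Mn", "Me"):
        return 0                      # non-spacing and enclosing combining marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # East Asian Wide or Full-width
    return 1                          # all others


def approx_wcswidth(text):
    """Sum the per-character widths, or return -1 as soon as any character is -1."""
    total = 0
    for ch in text:
        width = approx_wcwidth(ch)
        if width < 0:
            return -1
        total += width
    return total


print(approx_wcswidth("コンニチハ"))
print(approx_wcswidth("abc\n"))
```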

Full API Documentation at https://wcwidth.readthedocs.org

Developing

Install wcwidth in editable mode:

pip install -e .

Execute unit tests using tox:

tox -e py27,py35,py36,py37,py38,py39,py310,py311,py312

Updating Unicode Version

Regenerate python code tables from latest Unicode Specification data files:

tox -e update

The script is located at bin/update-tables.py and requires Python 3.9 or later. It is recommended, but not necessary, to run this script with the newest Python, because the newest Python has the latest unicodedata for generating comments.

Building Documentation

This project is using sphinx 4.5 to build documentation:

tox -e sphinx

The output will be in docs/_build/html/.

Updating Requirements

This project is using pip-tools to manage requirements.

To upgrade requirements for updating unicode version, run:

tox -e update_requirements_update

To upgrade requirements for testing, run:

tox -e update_requirements37,update_requirements39

To upgrade requirements for building documentation, run:

tox -e update_requirements_docs

Utilities

Supplementary tools for browsing and testing terminals for wide unicode characters are found in the bin/ folder of this project's source code. First run pip install -r requirements-develop.txt from this project's main folder. For example, an interactive browser for testing:

python ./bin/wcwidth-browser.py


History

0.2.13 2024-01-06
  • Bugfix zero-width support for Hangul Jamo (Korean)
0.2.12 2023-11-21
  • re-release to remove .pyi file misplaced in wheel files Issue #101.
0.2.11 2023-11-20
  • Include tests files in the source distribution (PR #98, PR #100).
0.2.10 2023-11-13
  • Bugfix accounting of some kinds of emoji sequences using U+FE0F Variation Selector 16 (PR #97).
  • Updated Specification.
0.2.9 2023-10-30
  • Bugfix zero-width characters used in Emoji ZWJ sequences, Balinese, Jamo, Devanagari, Tamil, Kannada and others (PR #91).
  • Updated to include Specification of character measurements.
0.2.8 2023-09-30
  • Include requirements files in the source distribution (PR #82).
0.2.7 2023-09-28
  • Updated tables to include Unicode Specification 15.1.0.
  • Include bin, docs, and tox.ini in the source distribution
0.2.6 2023-01-14
  • Updated tables to include Unicode Specification 14.0.0 and 15.0.0.
  • Changed developer tools to use pip-compile, and to use jinja2 templates for code generation in bin/update-tables.py to prepare for possible compiler optimization release.
0.2.1 .. 0.2.5 2020-06-23
  • Repository changes to update tests and packaging issues, and begin tagging repository with matching release versions.
0.2.0 2020-06-01
  • Enhancement: Unicode version may be selected by exporting the Environment variable UNICODE_VERSION, such as 13.0, or 6.3.0. See the jquast/ucs-detect CLI utility for automatic detection.
  • Enhancement: API Documentation is published to readthedocs.org.
  • Updated tables for all Unicode Specifications with files published in a programmatically consumable format, versions 4.1.0 through 13.0
0.1.9 2020-03-22
  • Performance optimization by Avram Lubkin, PR #35.
  • Updated tables to Unicode Specification 13.0.0.
0.1.8 2020-01-01
  • Updated tables to Unicode Specification 12.0.0. (PR #30).
0.1.7 2016-07-01
  • Updated tables to Unicode Specification 9.0.0. (PR #18).
0.1.6 2016-01-08 Production/Stable
  • LICENSE file now included with distribution.
0.1.5 2015-09-13 Alpha
  • Bugfix: Resolution of the "combining character width" issue: many characters that previously returned -1 now (correctly) return 0. Resolved by Philip Craig via PR #11.
  • Deprecated: The module path wcwidth.table_comb is no longer available; it has been superseded by module path wcwidth.table_zero.
0.1.4 2014-11-20 Pre-Alpha
0.1.3 2014-10-29 Pre-Alpha
0.1.2 2014-10-28 Pre-Alpha
0.1.1 2014-05-14 Pre-Alpha
  • Initial release to pypi, Based on Unicode Specification 6.3.0

This code was originally derived directly from C code of the same name, whose latest version is available at https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c:

* Markus Kuhn -- 2007-05-26 (Unicode 5.0)
*
* Permission to use, copy, modify, and distribute this software
* for any purpose and without fee is hereby granted. The author
* disclaims all warranties with regard to this software.

wcwidth's People

Contributors

avylove, bwagner, dependabot[bot], fale, galaxysnail, hugovk, jquast, jwodder, lgtm-migrator, lmontopo, mgorny, msabramo, philipc, rasmuswl, s-t-e-v-e-n-k, thomasballinger

wcwidth's Issues

`python setup.py update` no longer works

On a fresh checkout:

$ python3 setup.py update
running update
data/ created.
retrieving http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt.
data/EastAsianWidth.txt saved.
parsing data/EastAsianWidth.txt ..
Traceback (most recent call last):
  File "setup.py", line 252, in <module>
    main()
  File "setup.py", line 248, in main
    cmdclass={'update': SetupUpdate},
  File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "setup.py", line 64, in run
    self.do_east_asian()
  File "setup.py", line 72, in do_east_asian
    properties=(u'W', u'F',)
  File "setup.py", line 127, in _parse_east_asian
    uline = line.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
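The traceback points at line.decode('ascii') meeting a non-ASCII byte. A minimal sketch of the failure and of one likely remedy, assuming the data file is UTF-8 encoded; the sample bytes below are made up for illustration and are not the real contents of EastAsianWidth.txt:

```python
# Made-up sample line: byte 0xc2 sits at position 2, as in the traceback above.
raw_line = b"# \xc2\xa9 example comment line\n"

try:
    raw_line.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Same class of error as reported: 0xc2 is outside range(128).
    ascii_ok = False

# Decoding as UTF-8 succeeds (assumption: the file is UTF-8 encoded).
decoded = raw_line.decode("utf-8")
print(ascii_ok, repr(decoded))
```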

Wrong width for some Cyrillic characters

The width for these are not correct:

  • COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
  • COMBINING CYRILLIC HUNDRED MILLIONS SIGN

Test case:

import unicodedata
import math
import wcwidth

charnames = [u'COMBINING CYRILLIC HUNDRED THOUSANDS SIGN',
             u'COMBINING CYRILLIC HUNDRED MILLIONS SIGN',
             u'CIRCLED LATIN CAPITAL LETTER A',
             u'KANGXI RADICAL BIRD']

for charname in charnames:
    print(u'Character: {}'.format(charname))
    c = unicodedata.lookup(charname)
    cwidth = wcwidth.wcwidth(c)
    spacing = int(math.floor(2 - cwidth)) * u' '
    print(u"123456789")
    print(u"123{}{}6789".format(c, spacing))
    print(u"Char width: {}".format(cwidth, spacing))

This is printed with Python 2.7 and wcwidth 0.1.4:

Character: COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
123456789
123҈ 6789
Char width: 1
Character: COMBINING CYRILLIC HUNDRED MILLIONS SIGN
123456789
123꙱ 6789
Char width: 1
Character: CIRCLED LATIN CAPITAL LETTER A
123456789
123Ⓐ 6789
Char width: 1
Character: KANGXI RADICAL BIRD
123456789
123⿃6789
Char width: 2

Related: dbcli/mycli#149

Upload wheel to PyPI

Please could you upload a wheel for 0.2.6 and future releases?

I recommend using https://pypi.org/project/build/

$ python -m pip install -U build
...
$ python -m build
...
Successfully built wcwidth-0.2.6.tar.gz and wcwidth-0.2.6-py2.py3-none-any.whl
$ ls dist
wcwidth-0.2.6-py2.py3-none-any.whl wcwidth-0.2.6.tar.gz

Bad release wheel

ERROR: wcwidth has an invalid wheel, multiple .dist-info directories found: wcwidth-0.2.0.dist-info, wcwidth-0.2.1.dist-info

Variation selectors are not correctly handled

Variation selectors (U+FE0E, U+FE0F) can change column widths of some preceding characters. For example, U+270F (✏) is a single-column glyph by itself, but with a succeeding U+FE0F it occupies 2 columns as shown in the snapshot below.

image

python3 package?

I haven't tried to use wcwidth yet, but I built and installed it without issues.
With pip3 I get:

Collecting wcswidth
  Could not find a version that satisfies the requirement wcswidth (from versions: )
No matching distribution found for wcswidth

Would it be possible to create a pip3 version so that we don't have to build from source?

Drop UNICODE_VERSION ?

From the work and results of ucs-detect, https://ucs-detect.readthedocs.io/results.html

I have discovered that terminals do not support a single version of Unicode. At this time, very few support a single version of the specification completely.

  • For specific types, like wide characters, support may vary at any version. For example, GNOME Terminal, https://ucs-detect.readthedocs.io/sw_results/GNOMETerminal.html#gnometerminal supports 93% of characters unique to version 15.0, and 90% of characters unique to version 14.
  • It may not be immediately obvious, because "Language Support" is a bit of a proxy for "Zero-Width support": combining characters are best tested with the characters they are expected to combine with. But a terminal's support for combining characters, or the tables used in its code, does not necessarily match its latest wide table. In fact, most terminals only update their wide tables to meet the popular demand for emoji support.
  • And of course, though ZWJ and VS-16 came out at roughly Unicode version 8 and 9, very few terminals that support the wide tables of Unicode 9 or higher also support ZWJ and VS-16; see this specific part of the table:

image

Because of those results, I think it's perfectly fine to drop support for UNICODE_VERSION. I very much doubt it is used, or useful to anyone when it is, because it cannot correctly describe the terminal's support to wcwidth.

Is it a useful idea?

I was interested whether terminal emulator authors would have feedback about UNICODE_VERSION, and whether they would consider exporting it. I have not received any feedback.

However, with tools like 'ucs-detect', we can programmatically determine, with black-box testing, which wide and zero-width characters are supported and whether ZWJ and VS-16 are supported, right down to exactly which ones. By expressing this as a delta from expected terminal support, using ranges of codepoints, it may be possible to describe it with a complex environment variable.

Just spitballing an idea of what it might look like,

UNICODE_SUPPORT="zero[8.0:!category:Mc,Mn,!1001-1002,!1003],wide[15.1:!zwj,!vs16,!9009-9010]"

wc might be an empty string

wcwidth/wcwidth/wcwidth.py

Lines 104 to 182 in c71459e

def wcwidth(wc):
    r"""
    Given one unicode character, return its printable length on a terminal.
    The wcwidth() function returns 0 if the wc argument has no printable effect
    on a terminal (such as NUL '\0'), -1 if wc is not printable, or has an
    indeterminate effect on the terminal, such as a control character.
    Otherwise, the number of column positions the character occupies on a
    graphic terminal (1 or 2) is returned.
    The following have a column width of -1:
    - C0 control characters (U+001 through U+01F).
    - C1 control characters and DEL (U+07F through U+0A0).
    The following have a column width of 0:
    - Non-spacing and enclosing combining characters (general
      category code Mn or Me in the Unicode database).
    - NULL (U+0000, 0).
    - COMBINING GRAPHEME JOINER (U+034F).
    - ZERO WIDTH SPACE (U+200B) through
      RIGHT-TO-LEFT MARK (U+200F).
    - LINE SEPERATOR (U+2028) and
      PARAGRAPH SEPERATOR (U+2029).
    - LEFT-TO-RIGHT EMBEDDING (U+202A) through
      RIGHT-TO-LEFT OVERRIDE (U+202E).
    - WORD JOINER (U+2060) through
      INVISIBLE SEPARATOR (U+2063).
    The following have a column width of 1:
    - SOFT HYPHEN (U+00AD) has a column width of 1.
    - All remaining characters (including all printable
      ISO 8859-1 and WGL4 characters, Unicode control characters,
      etc.) have a column width of 1.
    The following have a column width of 2:
    - Spacing characters in the East Asian Wide (W) or East Asian
      Full-width (F) category as defined in Unicode Technical
      Report #11 have a column width of 2.
    """
    # pylint: disable=C0103
    #         Invalid argument name "wc"
    ucs = ord(wc)
    # NOTE: created by hand, there isn't anything identifiable other than
    # general Cf category code to identify these, and some characters in Cf
    # category code are of non-zero width.
    # pylint: disable=too-many-boolean-expressions
    #          Too many boolean expressions in if statement (7/5)
    if (ucs == 0 or
            ucs == 0x034F or
            0x200B <= ucs <= 0x200F or
            ucs == 0x2028 or
            ucs == 0x2029 or
            0x202A <= ucs <= 0x202E or
            0x2060 <= ucs <= 0x2063):
        return 0
    # C0/C1 control characters
    if ucs < 32 or 0x07F <= ucs < 0x0A0:
        return -1
    # combining characters with zero width
    if _bisearch(ucs, ZERO_WIDTH):
        return 0
    return 1 + _bisearch(ucs, WIDE_EASTASIAN)

If wc is an empty string, a TypeError: ord() expected a character... exception will be raised.

It would be better, I think, to have a statement if len(wc) == 0: return 0 before ord(wc).
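The suggested guard might look like the sketch below. Both names are hypothetical: `guarded_width` wraps whatever measuring function is passed in, and `toy_measure` is only a placeholder that, like the real function, calls ord() on its argument.

```python
def guarded_width(wc, measure):
    """Return 0 for an empty string, otherwise defer to measure(wc)."""
    if len(wc) == 0:
        return 0
    return measure(wc)


def toy_measure(wc):
    # Placeholder for the real table lookup; it only needs to call ord().
    return -1 if ord(wc) < 32 else 1


print(guarded_width("", toy_measure))   # the guard short-circuits before ord()
print(guarded_width("a", toy_measure))
```

Whether an empty string should measure 0 or raise is a design choice; the sketch only shows the behavior proposed in this issue.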

Multi-codepoint emojis

Hi,

Can wcwidth help me with multi-codepoint emojis?

For instance, here I want to get the cell width for a "woman_mechanic_dark_skin_tone" emoji, which renders in the terminal as 2 cells, but wcswidth reports a width of 6 because it is adding up all the modifiers.

>>> s="👩\U0001F3FF\u200d🔧"
>>> print(repr(s))
'👩🏿\u200d🔧'
>>> from wcwidth import wcswidth
>>> wcswidth(s)
6
>>> print(s+"\n--")
👩🏿‍🔧
--

I've found support for these kind of emojis to be inconsistent across terminals, so maybe this is a lost cause, but is there some kind of standard for these emoji modifiers?
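To see where the 6 comes from, the sequence can be broken into its codepoints with the standard library. This is only a sketch: it reads Python's bundled `unicodedata` rather than wcwidth's own tables, and it assumes a Python whose Unicode data marks these emoji as East Asian Wide.

```python
import unicodedata

# woman + dark skin tone modifier + ZERO WIDTH JOINER + wrench
s = "\U0001F469\U0001F3FF\u200d\U0001F527"

rows = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.east_asian_width(ch))
    for ch in s
]
for codepoint, category, eaw in rows:
    print(codepoint, category, eaw)

# Three of the four codepoints are Wide, so a per-codepoint sum gives 2 + 2 + 0 + 2.
wide = sum(1 for _, _, eaw in rows if eaw in ("W", "F"))
print(wide * 2)
```

A terminal that renders the whole sequence as one glyph shows 2 cells, which is why summing per codepoint overshoots.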

Dropping python 2 and 3.5 and earlier?

Thoughts,

image

I am a strong believer in backporting and so on. Having worked in many restrictive environments, I can sympathize with the small share of the tens of thousands of daily wcwidth downloads that are performed by python 2.7. And, though they should be version-pinning when working with such legacy software versions, there are cases where they are not.

The reason I held onto python 2 support so long was that it wasn't too difficult: we had found simple 2/3 switches that work in the codebase, and the tests have provided nearly 100% coverage.

However, I think a majority of the bugs in wcswidth calculations have been resolved, and python 2.7 users can now benefit from this. I also cannot expect that any python 2.7 users will want to make use of the new APIs discussed in open issues. For that reason, I am a proponent of dropping support for older python releases.

0.2.11 wheels include typing info

While I actually appreciate this because I use typing and it allows me to drop a #type: ignore marker when importing the package, based on #71 not being merged, I assume they actually shouldn't be there.

The sdist does not have typing info leading me to think the wheel may have been created on a different branch?

Also, the wheel looks to have a spurious .DS_Store file, similar to what's mentioned here: pypa/wheel#297

I haven't checked the CI yet, so i don't know if you're doing sdist/wheel builds and uploads there, my guess is no, but maybe they can be triggered on tags.

wcwidth should have a "C Extension"

Previously discussed,

I'm open to any specific solution. Plenty were discussed in the past.

Why compile?

  • As wcwidth is likely a "hot path" to intensive applications (like pymux, a terminal emulator)
  • and, the _bisearch() and code tables are basic "if/else" statements and large arrays of integers.
  • pre-compiling machine code and providing build hooks could very significantly improve the performance of any dependent applications.
  • and, with our jinja2 framework in bin/update-tables.py, it would be very easy to systematically generate these "code tables" for any type language

My only suggested requirement is that wcwidth install without error on minority operating systems/environments that can't build or fetch a matching pre-built package: that those systems should succeed to install anyway and continue to use the pure-python implementation.

I think using just the basic C language is a fine choice, our use of the language and build would be the most basic and supportable across all kinds of systems and I know C well enough so I don't mind that at all.

Python-like languages like Cython are also very "inclusive" for outside developers to dissect and contribute to, as they are very likely to be python developers, whereas using Rust or something to create a foreign function interface might be very alienating.

wc_rjust() doesn't work for non-printables

Firstly, this isn't a problem for me, I just wanted to let you know about it.

Using wc_rjust() from the readme, if text contains any non-printable characters, the result is longer than length, which should never happen.

For example, '\n' is non-printable:

>>> wc_rjust('\n', 2, '.')
'...\n'

For reference, the width-naive version:

>>> '\n'.rjust(2, '.')
'.\n'

The problem is because of the math here:

max(0, (length - wcswidth(text)))

If wcswidth(text) is negative, the max is length + 1.
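The arithmetic can be checked directly. In this small sketch the value -1 is written in by hand to stand for what wcswidth() returns for text containing a non-printable character, as described above:

```python
length = 2
width = -1  # stand-in for wcswidth('\n'), per the behavior described above

padding_count = max(0, length - width)  # 2 - (-1) = 3, one more than length
result = "." * padding_count + "\n"
print(repr(result))
```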

The simple solution is to just add a note in the readme warning about this situation, but if you wanted, you could expand the function to raise an error:

>>> def wc_rjust(text, length, padding=' '):
...    from wcwidth import wcswidth
...    width = wcswidth(text)
...    if width < 0:
...        raise ValueError('text contains non-printable characters')
...    return padding * max(0, (length - width)) + text
...
>>> wc_rjust('\n', 2, '.')
Traceback (most recent call last):
  ...
ValueError: text contains non-printable characters

Ultimately, it seems like the problem is using -1 as a sentinel return value instead of raising an error, but it looks like that's inherited from the C function, so fixing that would be a lot of work.

wcswidth incorrect for heart emoji, ❤️ ("\u2764\ufe0f")

Hello,

The wcswidth function seems to be incorrectly calculating the width of the heart "❤️" ("\u2764\ufe0f") emoji. An example:

>>> from wcwidth import wcswidth
>>> wcswidth("❤️")
1
>>> wcswidth("💞")
2
>>> wcswidth("💘")
2

The heart emoji occupies 2 cells and should be returning 2 as per the other examples above.
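One way to account for this is to let U+FE0F widen a preceding narrow character. The sketch below is a simplification under stated assumptions: `naive_width` and `width_with_vs16` are hypothetical names, widths come from `unicodedata` rather than this library's tables, and every narrow base followed by U+FE0F is widened, whereas a careful implementation would only do so for bases listed in the emoji variation sequences data.

```python
import unicodedata

VS16 = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation


def naive_width(ch):
    """Width of one codepoint from unicodedata alone (approximation)."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def width_with_vs16(text):
    """Sum widths, promoting a narrow base character to 2 cells when VS-16 follows."""
    total = 0
    last_base = 0  # width of the most recent non-zero-width character
    for ch in text:
        if ch == VS16:
            if last_base == 1:
                total += 1      # widen the preceding narrow character
                last_base = 2   # do not widen it twice
            continue
        width = naive_width(ch)
        total += width
        if width:
            last_base = width
    return total


print(width_with_vs16("\u2764"))        # heart alone
print(width_with_vs16("\u2764\ufe0f"))  # heart with VS-16
```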

Some emoji have incorrect width

Hi, thanks for your work on this project. It's been invaluable!

According to this document emoji presentation sequences should be treated as "East Asian Wide".

[UTS51] emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.

When wcwidth reads in the EastAsianWidth.txt file, it discards all the emoji presentation sequences it finds, rather than treating them as being wide (since it discards everything without W or F properties).

The full list of 353 emojis affected is available at:
https://unicode.org/emoji/charts/emoji-variants.html

wcwidth will report all of the emoji in the above list as having width 1 instead of width 2.

I would be happy to PR this, but I'm not sure the master branch is clean - I noticed some walrus operators etc. despite my understanding being that this project is 2.7 compatible

sdist is missing emoji files

The sdist package at PyPI is missing both emoji-variation-sequences.txt and emoji-zwj-sequences.txt files. Without both files testing of wcwidth fails. Please add missing files to sdist. Thank you.

Should wcwidth provide rjust, ljust, center and textwrap?

In the readme, we talk about the example of writing our own custom rjust function. If the most common use case is text alignment, let's just offer it as a public API. They would be the same name and signature as python's,

And text-wrapping interface,

I have previously added support of these (with terminal sequences) in blessed, https://github.com/jquast/blessed/blob/a34c6b1869b4dd467c6d1ab6895872bb72db7e0f/blessed/sequences.py#L141-L239
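A rough sketch of what such helpers could look like, mirroring the argument order of str.ljust, str.rjust and str.center. The `cell_width` function here is a hypothetical standard-library stand-in for wcswidth(), and the choice to put the extra cell on the right when centering is an assumption of this example rather than a statement about Python's or this library's behavior.

```python
import unicodedata


def cell_width(text):
    """Stand-in for wcswidth(): Wide/Full-width count 2, combining marks count 0."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


def wc_ljust(text, width, fillchar=" "):
    return text + fillchar * max(0, width - cell_width(text))


def wc_rjust(text, width, fillchar=" "):
    return fillchar * max(0, width - cell_width(text)) + text


def wc_center(text, width, fillchar=" "):
    total = max(0, width - cell_width(text))
    left = total // 2  # any odd leftover cell goes on the right in this sketch
    return fillchar * left + text + fillchar * (total - left)


print(wc_ljust("コンニチハ", 12, "_"))
print(wc_rjust("コンニチハ", 12, "_"))
print(wc_center("コンニチハ", 13, "_"))
```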

`Default_Ignorable_Code_Point`s should all be zero-width

From https://www.unicode.org/faq/unsup_char.html#3:

All default-ignorable characters should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.

However, this library incorrectly considers some of them, for example U+3164 HANGUL FILLER, to have non-zero width.

(There is one exception, where this library is correct in assigning a non-zero width to a Default_Ignorable_Code_Point: U+115F HANGUL CHOSEONG FILLER is meant to be combined with other Hangul jamo to form a width-2 syllable block, so it should be assigned width 2 even though it has no display on its own.)

wrong width for U+00AD

Hi, I was looking at your wcwidth library for comparison, since in the utf8proc library we are also implementing a similar feature (see JuliaStrings/utf8proc#2). The first disagreement that I came across between your implementation and ours was for U+00AD (soft hyphen), where you seem to give 1

>>> from wcwidth import wcwidth
>>> wcwidth(unichr(173))
1

and we give zero (a soft hyphen is used for line breaking, but is ordinarily not printed). In general, we return 0 for most characters in category Cf (formatting control characters). The wcwidth function on MacOS 10.10.2 also returns -1 (not printable) for this code point.
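For reference, the general category mentioned here can be confirmed from Python's own Unicode database. A minimal Python 3 check, where chr() replaces the Python 2 unichr() used above:

```python
import unicodedata

soft_hyphen = chr(0xAD)  # same codepoint as unichr(173) in the session above

name = unicodedata.name(soft_hyphen)
category = unicodedata.category(soft_hyphen)
print(name, category)
```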

Am I calling your implementation incorrectly? This is for git master of wcwidth.

Using DerivedCombiningClass.txt to determine width is inappropriate

DerivedCombiningClass.txt contains the Canonical_Combining_Class field from UnicodeData.txt (see http://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values). This field is intended to be used for the collation algorithm.

wcwidth.py is currently assuming that characters are zero width combining characters if and only if they have a non-zero combining class. I think this is an invalid assumption. For example, characters that are enclosing marks (General Category = Me) all have a zero combining class, but they are also zero width combining characters.

I'm not sure what the standard way to determine zero width combining characters is. One possibility is to check for a General Category of Mn or Me, but I don't know if there are any exceptions to this. Also note that there are combining characters that do have a width (category Mc).
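The mismatch between combining class and general category is easy to observe with the standard library. This sketch looks up three sample characters chosen for illustration and records each one's general category and canonical combining class:

```python
import unicodedata

names = [
    "COMBINING ACUTE ACCENT",                     # non-spacing mark
    "COMBINING CYRILLIC HUNDRED THOUSANDS SIGN",  # enclosing mark
    "DEVANAGARI SIGN VISARGA",                    # spacing combining mark
]

facts = {}
for name in names:
    ch = unicodedata.lookup(name)
    # (general category, canonical combining class)
    facts[name] = (unicodedata.category(ch), unicodedata.combining(ch))
    print(name, facts[name])
```

The enclosing mark has combining class 0 yet belongs to category Me, so a test of "non-zero combining class" alone would miss it.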

Package hash does not match for latest release 0.2.6

Hi,

Got below hash error today. Do you have an idea what could have caused it?

Thanks!

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    wcwidth==0.2.6 from https://files.pythonhosted.org/packages/20/f4/c0584a25144ce20bfcf1aecd041768b8c762c1eb0aa77502a3f0baa83f11/wcwidth-0.2.6-py2.py3-none-any.whl (from -r requirements/develop.txt (line 1712)):
        Expected sha256 a5220780a404dbe3353789870978e472cfe477761f06ee55077256e509b156d0
             Got        795b138f6875577cd91bba52baf9e445cd5118fd32723b460e30a0af30ea230e

What would wcwidth look like if it were built-in to Python?

Like P1868R2, "🦄 width: clarifying units of width and precision in std::format", Published Proposal, 2020-02-11 https://fmt.dev/papers/p1868.html

Why can't Python just do the right thing? For example, here it gets it wrong,

>>> print(f'|{"\u231a":x<5s}|\n'
...       f'|{"watch":x<5s}|\n')
|⌚xxxx|
|watch|

This emoji is measured as a width of 1, but it is actually a width of 2, causing rjust() to format it wrong. It also fails to account correctly when zero-width, ZWJ, and variation selectors are used. Python fails to get this measurement "right" for any kind of display device at all, but I think it goes without saying that the only purpose of this function is for monospace character displays such as terminals.

I believe the Built-in format string alignment functions, str.rjust, str.ljust, str.center, and textwrap.wrap should measure these unicode characters for their printable width, and not just the "number of codepoints".

The built-in REPL also gets this wrong in the readline-like library input. It becomes impossible to edit strings containing these characters, the cursor position and the result of input is unpredictable and disorienting.

IPython, which uses wcwidth, does a better job and should fare better with #91 closed, but it should not be required to use a large project like IPython as a REPL as a solution.

It would be good to experiment with the source code of Python, to see which parts of the codebase need changing. See #93 for the basic high-level functions

And, it would be better to draft and submit a PEP.

special character width problem

image

import pandas as pd
from tabulate import tabulate
from io import StringIO
import wcwidth


csv_str = ''',0,1,2
0,a)  将净利润调节为经营活动现金流量,,
1,,2019 年度,2018 年度
2,净利润,"63,205,243","55,350,200"
3,加:资产减值损失,"(73,370)","10,465,899"
4,信用减值损失,"3,611,595",—
5,固定资产折旧,"6,543,253","6,530,713"
6,投资性房地产折旧,"1,718,108","1,514,560"
7,无形资产摊销,"447,821","440,444"
8,长期待摊费用摊销,"338,210","189,875"
9,处置固定资产、无形资产和其他长,,
10,期资产的收益,"(568,141)","(175,112)"
11,财务费用,"10,179,757","12,568,535"
12,公允价值变动损失,"484,752","368,343"
13,投资收益,"(4,212,538)","(5,646,311)"
14,递延所得税资产的增加,"(3,083,170)","(2,511,075)"
15,递延所得税负债的增加/(减少),"107,125","(644,386)"
16,存货的增加,"(70,420,830)","(96,125,732)"
17,受限资金的(增加)/减少,"(1,387,681)","280,759"
18,经营性应收项目的增加,"(125,133,902)","(108,365,569)"
19,经营性应付项目的增加,"83,217,173","135,881,219"
'''

csv_io = StringIO(csv_str)

df = pd.read_csv(csv_io, index_col=0)

print(df)

df_tabulated = tabulate(df, headers='keys', tablefmt='psql', showindex=False)

print(df_tabulated)

Devanagari's zero-width characters are not accounted for properly

I am trying to tabulate entries containing Devanagari characters using python-tabulate. The library uses wcwidth to calculate the visible length of a string, apparently on line 768 here.

I had opened an issue in astanin/python-tabulate#68 a while ago. The dev directed me to also open an issue here, so here I am. I will quote myself directly from the issue:


This is how it renders

Name            Score
------------  -------
राष्ट्र परीक्षण    19.25
Test             0

versus

Name               Score
---------------  -------
Devanagari here    19.25
Test                0

How it should render:

Name            Score
------------  -------
राष्ट्र परीक्षण         19.25
Test             0

Should we measure terminal sequences?

I believe the fundamental reason the POSIX C API returns -1 for all C0 and C1 control codes is as if to say, "this is a terminal emulator's job to parse, not mine". Under that view, it is an error to pass a string containing terminal sequences: a terminal emulator should partition the string and handle any cursor movements or attribute changes itself, rather than sending the ESC sequence to wcwidth.

Should wcwidth measure terminal sequences, or should we leave this up to other libraries?

I think it could only do more help and be otherwise harmless.

The current situation for developers

  • They don't even want to have to use wcswidth() in the first place!
  • They would rather use print(f'{emoji_val:<30s}') for text alignment!
  • They don't care about why this first line works perfectly, and the second gets it wildly wrong:
print(term.red + wc_rjust(emoji_val, 30))
print(wc_rjust(term.red + emoji_val, 30))

Wouldn't it be nice if both approaches were correct?

On '\b',

I noticed this Ruby library measures -1 for '\b', https://github.com/particle-iot/ruby-unicode-display-width#how-this-library-handles-widths -- It is the only such sequence that is measured this way by that library.

  • string\b is ambiguous, but any non-error value would be preferred. See #96 about returning 5:
    • it has occupied 6 characters on the screen
    • but the cursor is positioned at the 5th cell
    • how does the developer wish us to measure this?

Only if in #79, we decide for a new function with a new signature, can we then allow return value of -1 for a single character, '\b', to be interpreted as a non-error. This function or signature change of wcswidth would also correctly return '0' for other, immeasurable control codes.

But why stop at '\b' ?? Why not also parse the CSI code patterns for moving cursor left and right?

On '\t',

It might not be immediately obvious, but tab cannot be safely measured. But I do like this Ruby library's approach of a user-provided parameter table. This would allow us to interpret tab as 0 by default, while allowing any developer who really wishes to hint at the distance to the next tabstop to do so, though that is unlikely.

On CSI

Control Sequence Introducer (CSI) sequences are terminal sequences beginning with '\x1b[' and require some advanced parsing mechanisms to discover the "end" of such sequences.

I have taken an approach in the "blessed" library to dynamically generate terminal sequences from termcap and to mixin a few custom ones, to programmatically create a regular expression to match terminal sequences in two categories,

I think this code could be simplified, and also changed from dynamic runtime to static definitions of regular expressions of common terminal sequences labeled or grouped by their measured effect.

https://github.com/jquast/blessed/blob/a34c6b1869b4dd467c6d1ab6895872bb72db7e0f/blessed/sequences.py#L57C8-L84
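
As a rough idea of what static definitions grouped by measured effect could look like, here is a minimal sketch. It is not the blessed implementation linked above. The names are hypothetical, only three groups are shown, and every printable character is assumed to occupy one cell. The generic pattern follows the ECMA-48 layout of a CSI sequence: parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F, then one final byte 0x40-0x7E.

```python
import re

# Hypothetical static table, grouped by the effect on the measured width.
CSI_CURSOR_RIGHT = re.compile(r"\x1b\[(\d*)C")    # CUF: move right n cells
CSI_CURSOR_LEFT = re.compile(r"\x1b\[(\d*)D")     # CUB: move left n cells
CSI_ANY = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")  # any other CSI: zero width


def _count(match):
    # A missing or zero parameter is treated as 1.
    return max(1, int(match.group(1) or 1))


def measure(text):
    """Final column after writing text, assuming one cell per printable character."""
    column = 0
    pos = 0
    while pos < len(text):
        m = CSI_ANY.match(text, pos)
        if m is None:
            column += 1
            pos += 1
            continue
        seq = m.group()
        right = CSI_CURSOR_RIGHT.fullmatch(seq)
        left = CSI_CURSOR_LEFT.fullmatch(seq)
        if right:
            column += _count(right)
        elif left:
            column = max(0, column - _count(left))
        # Colors and all other CSI sequences leave the column unchanged.
        pos = m.end()
    return column


print(measure("\x1b[31mred\x1b[0m"))  # 3
print(measure("ab\x1b[5Ccd"))         # 9
```

With color sequences measured as zero, a call like wc_rjust(term.red + emoji_val, 30) would no longer depend on where the color is concatenated.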

The import of pkg_resources creates issues packaging this library

Hey, this is a popular and useful project; it's a 2nd-degree transitive dependency in a project I'm using: tabulate -> wcwidth.

Unfortunately some of our slightly dimmer witted customers who refuse to use virtual environments, and live within a hosted Jupyter notebook are getting this kind of error.

 File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/tabulate.py", line 54, in <module>
    import wcwidth  # optional wide-character (CJK) support
  File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/wcwidth/__init__.py", line 12, in <module>
    from .wcwidth import ZERO_WIDTH  # noqa
  File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/wcwidth/wcwidth.py", line 72, in <module>
    import pkg_resources
  File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3260, in <module>
    @_call_aside
  File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3244, in _call_aside
    f(*args, **kwargs)
  File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3273, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/Users/rollokonig-brock/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/pkg_resources/__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/.pyenv/versions/3.7.3/envs/sdk/lib/python3.7/site-packages/pkg_resources/__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'ipython' distribution was not found and is required by mypackage

This happens because importing pkg_resources requires that the dependency structure of the environment be 100% consistent, which is a bit of an unexpected side effect (as we get non-devs messing about with pip).
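
One way to read a distribution's version without importing pkg_resources is `importlib.metadata`, which is in the standard library from Python 3.8 (the traceback above is from Python 3.7, where the `importlib_metadata` backport would be needed instead). This is a minimal sketch of the idea, not necessarily how wcwidth resolves the issue; `package_version` is a hypothetical helper.

```python
from importlib import metadata


def package_version(name, default="unknown"):
    """Look up an installed distribution's version without pkg_resources.

    Only the named distribution's metadata is consulted, so an unrelated
    broken dependency elsewhere in the environment does not raise here.
    """
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return default


print(package_version("no-such-distribution-xyz-12345"))  # unknown
```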

(I'm going to raise this as a bug with the python core team).

Unicode tables version detection

Just noting that iTerm2 has a methodology to determine the version:

Unicode Version

iTerm2 by default uses Unicode 8's width tables. The user can opt to use Unicode 9's tables with a preference (which render emoji more nicely, but requires applications that expect Unicode 9 width tables). Since not all apps will be updated at the same time, you can tell iTerm2 to use a particular set of width tables with:

^[]1337;UnicodeVersion=n^G

Where n is 8 or 9

https://www.iterm2.com/documentation-escape-codes.html
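
For reference, the sequence quoted above can be built in Python as follows. In caret notation ^[ is ESC (0x1b) and ^G is BEL (0x07). The function name is hypothetical, and the accepted values are limited to the two that the quoted documentation mentions.

```python
def iterm2_unicode_version_sequence(version):
    """Build the iTerm2 escape sequence that selects a set of width tables."""
    if version not in (8, 9):
        raise ValueError("the quoted iTerm2 documentation lists only 8 or 9")
    # ESC ] 1337 ; UnicodeVersion=n BEL
    return "\x1b]1337;UnicodeVersion={}\x07".format(version)


print(repr(iterm2_unicode_version_sequence(9)))
# '\x1b]1337;UnicodeVersion=9\x07'
```

Writing the returned string to the terminal (for example with sys.stdout.write) would send the request; printing its repr, as here, only displays it.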

some emojis return width of 1

This is something that I've been struggling with. Consider the following example:

>>> print('\U0001f6cd')
🛍
>>> wcwidth.wcwidth('\U0001f6cd')
1

In this case the display width of the symbol is 2 (as I believe is appropriate per Unicode standard). What is annoying though is that the terminal (default terminal on MacOS 10.13, with the default font SF Mono Regular) also thinks that the character has width 1, and draws the following character on top of this emoji.

I wonder if it is possible to create a special "print" method that knows to insert an extra space after characters like this, ensuring that the output in the terminal is not buggy?

`Prepended_Concatenation_Mark`s should not be zero-width

UAX 44, Prepended_Concatenation_Mark:

A small class of visible format controls, which precede and then span a sequence of other characters, usually digits. These have also been known as "subtending marks", because most of them take a form which visually extends underneath the sequence of following digits.

As they have a visible display before the characters they modify, these should not be considered zero-width; however, this library incorrectly treats them as such.
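
The standard library can show why such characters are easy to sweep into a zero-width table. The list below is only a subset of the Prepended_Concatenation_Mark code points, chosen because they exist in older Unicode versions. Whether a blanket rule on general category Cf is the actual cause in this library is an assumption.

```python
import unicodedata

# Subset of code points with the Prepended_Concatenation_Mark property.
PREPENDED = [0x0600, 0x0601, 0x0602, 0x0603, 0x0604, 0x0605, 0x06DD, 0x070F]

for cp in PREPENDED:
    ch = chr(cp)
    # Each of these reports general category Cf (Format), the same category
    # as genuinely invisible controls such as ZERO WIDTH JOINER.
    print("U+%04X %s %s" % (cp, unicodedata.category(ch),
                            unicodedata.name(ch, "<unnamed>")))


def is_invisible_format(ch):
    """Hypothetical rule: Cf is zero-width unless it is a prepended mark."""
    return unicodedata.category(ch) == "Cf" and ord(ch) not in PREPENDED
```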

Should combining characters return -1 or 0 ?

Thanks, that's just the kind of feedback I was looking for.

I chose to return -1 because that's what libc wcwidth(3) returns on my OSX and travis-ci.org's linux systems

(unless I'm doing it wrong, bin/wcwidth-libc-comparator.py),

Although there are a few cases where libc returns 1 where wcwidth.py returns -1, there aren't any cases of libc returning 0 ..

Matching values:

libc,ours=-1,-1 [--o͔o--] name=COMBINING LEFT ARROWHEAD BELOW val=852 http://codepoints.net/U+354

libc,ours=-1,-1 [--o᷇o--] name=COMBINING ACUTE-MACRON val=7623 http://codepoints.net/U+1DC7

libc 1 vs. wcwidth.py -1:

libc,ours=1,-1 [--o֭o--] name=HEBREW ACCENT DEHI val=1453 http://codepoints.net/U+5AD

libc,ours=1,-1 [--oฺo--] name=THAI CHARACTER PHINTHU val=3642 http://codepoints.net/U+E3A

Anyway you may be right .. I was hoping for some feedback.

My thoughts were: if somebody wants to know the printable width of a string, and it contains a combining character, then until I fully understand the effect of combining characters on a full string as part of wcswidth, I should return -1 as if to say "indeterminate".

Examining a few consumers of wcwidth.c, they often return 0 in such cases, one example:

https://github.com/sickill/libtsm/blob/master/src/tsm_unicode.c#L393

Anyway, feedback appreciated. I'll open a bug, reading http://pubs.opengroup.org/onlinepubs/009696699/functions/wcwidth.html it seems it should return -1 for anything but wide characters and NULL.

jq

On May 5, 2014, at 12:24 AM, wrote:

Hi Mr Quast,

I was wondering if there was a reason behind your choice to return -1 for combining characters when in the original C code it returned 0.

Regards,

Wrong width for Hindi on macOS, but correct width on Linux

I tried using wcwidth to calculate the length of the name for the city of Mumbai in Hindi (बॉम्बे हिंदी)

from wcwidth import wcswidth
wcswidth('बॉम्बे हिंदी')
9

On macOS 10.13.5 using Python 3.6.5, I see a visual width of 5 characters and a calculated width of 9 characters.

On Ubuntu 18.04 using Python 3.6.5, I see a visual width of 9 characters and a calculated width of 9 characters.

Thank you by the way for creating a very useful module!

0.2.5: sphinx warnings and errors

It looks like sphinx shows some issues when rendering the documentation. Despite that, the documentation seems to be generated.

+ /usr/bin/python3 setup.py build_sphinx -b man --build-dir build/sphinx
running build_sphinx
Running Sphinx v4.3.2
making output directory... done
WARNING: html_static_path entry '_static' does not exist
loading intersphinx inventory from https://docs.python.org/3/objects.inv...
building [mo]: targets for 0 po files that are out of date
building [man]: all manpages
updating environment: [new config] 4 added, 0 changed, 0 removed
reading sources... [100%] unicode_version
WARNING: autodoc: failed to import function '_get_package_version' from module 'wcwidth'; the following exception was raised:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/sphinx/util/inspect.py", line 448, in safe_getattr
    return getattr(obj, name, *defargs)
AttributeError: module 'wcwidth' has no attribute '_get_package_version'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/sphinx/ext/autodoc/importer.py", line 110, in import_object
    obj = attrgetter(obj, mangled_name)
  File "/usr/lib/python3.8/site-packages/sphinx/ext/autodoc/__init__.py", line 332, in get_attr
    return autodoc_attrgetter(self.env.app, obj, name, *defargs)
  File "/usr/lib/python3.8/site-packages/sphinx/ext/autodoc/__init__.py", line 2780, in autodoc_attrgetter
    return safe_getattr(obj, name, *defargs)
  File "/usr/lib/python3.8/site-packages/sphinx/util/inspect.py", line 464, in safe_getattr
    raise AttributeError(name) from exc
AttributeError: _get_package_version

/home/tkloczko/rpmbuild/BUILD/wcwidth-0.2.5/wcwidth/wcwidth.py:docstring of wcwidth.wcwidth._wcmatch_version:1: WARNING: duplicate object description of wcwidth._wcmatch_version, other instance in api, use :noindex: for one of them
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
writing... python-wcwidth.3 { intro unicode_version api } done
build succeeded, 3 warnings.

Propose new function, width(control_codes='ignore')

Problem

As for the need for a "width" function: just about every downstream library has some issue with the POSIX wcwidth() and wcswidth() functions, either in C or in this python library.

This is mainly because both functions may return -1, and the return value must be checked, but it often is not.

Although using wcswidth() on a string is the most popular use case, it has the possibility to return -1 by POSIX definition, and Markus Kuhn's 2007 implementation returns -1 for control characters.

The return value is often left unchecked where it is used with sum(), slice(), or screen positioning functions, with surprising results.

Solution

Provide a new function, width, that always returns a "best effort" measurement of distance. Instead of returning -1, it may ignore or measure control codes. If "catching unexpected control codes" is a desired function, we can continue to provide it as an optional keyword argument and, rather than return -1, raise an exception.

Maybe a new keyword argument, control_codes, with default value 'ignore', in similar spirit to 'errors' for https://docs.python.org/3/library/stdtypes.html#bytes.decode,

  • 'ignore': measure all individual control codes and terminal sequences as 0 width.
  • 'strict': raise ValueError on any control codes.
  • 'replace': make best effort to account for horizontal width movement in terminal sequences.
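
A minimal sketch of the proposed signature follows, covering only the 'ignore' and 'strict' modes. It is an illustration under stated assumptions, not the eventual implementation: `_cell_width` is a rough stand-in based on `unicodedata` rather than wcwidth's tables, and control codes are recognised as either a CSI sequence or a single C0/C1 control character.

```python
import re
import unicodedata

# Any CSI sequence, or any single C0/C1 control character.
_CONTROL = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]|[\x00-\x1f\x7f-\x9f]")


def _cell_width(ch):
    # Rough stand-in, NOT wcwidth's tables.
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def width(text, control_codes="ignore"):
    """Best-effort width of text; never returns -1."""
    if control_codes == "replace":
        raise NotImplementedError("'replace' is not sketched here")
    if control_codes not in ("ignore", "strict"):
        raise ValueError("unknown control_codes: %r" % (control_codes,))
    total = 0
    pos = 0
    while pos < len(text):
        m = _CONTROL.match(text, pos)
        if m:
            if control_codes == "strict":
                raise ValueError("control code at index %d" % pos)
            pos = m.end()  # 'ignore': the whole sequence measures as 0
        else:
            total += _cell_width(text[pos])
            pos += 1
    return total


print(width("\x1b[31mコンニチハ\x1b[0m"))  # 10
```

Because the result is always a non-negative integer, it can be passed to sum(), slicing, or padding arithmetic without a separate check.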

Workaround

As a workaround, I have suggested to use wcwidth() directly on each individual character and clip the possible -1 return value to 0, example: https://github.com/jquast/blessed/blob/a34c6b1869b4dd467c6d1ab6895872bb72db7e0f/blessed/sequences.py#L364

This provides the same function as wcswidth, but as a "best guess". However, this method cannot handle coming changes to wcswidth to handle zero width joiner (ZWJ) sequences.
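
The workaround amounts to the following sketch. `clipped_wcswidth` is a hypothetical name, and the per-character function is passed in as a parameter so that the example does not depend on any particular installation.

```python
def clipped_wcswidth(text, wcwidth):
    """Sum per-character widths, treating any -1 result as 0."""
    return sum(max(0, wcwidth(ch)) for ch in text)


# Typical use, assuming the wcwidth package is installed:
#   from wcwidth import wcwidth
#   clipped_wcswidth("abc\x07", wcwidth)
```

Because each character is measured in isolation, any width that depends on neighbouring characters, such as a ZWJ sequence, cannot be accounted for this way.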

housekeeping

ping @jquast, can you please let me know whether you are interested in updating the travis-ci job to maintained versions of python? It would be very helpful to others who are working with current python releases.

Wrong width for emoji chars

As reported in xonsh/xonsh#1569, some (maybe all) emojis are reported as being 2 chars wide, while most (all?) terminals think they are 1 char wide.

The difference in position arises because, while editing a command, xonsh uses wcwidth, but to print the line it just prints and lets the terminal position things.
