Comments (11)
@kilobyte & @lo48576 Thank you very much for the detailed explanations! In order to keep things simple, I'd like to go with the --ascii option that @lo48576 suggested.
Other opinions?
from hexyl.
@kilobyte In the old days, yes, they were full width because they used multiple bytes.
But that is not the only reason they are full width.
Some characters (especially punctuation such as …) should be full width in (at least) a Japanese environment, and they should not be forced to half width for every language in the world.
I don't think this is a defect of terminals, though we have suffered from it.
An appropriate EAW setting is necessary for many users.
EAW (East Asian width)
EAW characters are characters which "should be rendered as full width (double width) in a CJK context, but should be rendered as half width (single width) in other contexts".
This includes Greek characters, for example (like α), but not only alphabetic characters.
Many non-alphabetic symbols are included as well, such as ruled lines (┌), the times symbol (×), the ellipsis (…), and many others.
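As a rough illustration of the classification described above, the sketch below looks up the East_Asian_Width property of the characters just mentioned using Python's standard `unicodedata` module. The helper name `eaw_classes` and the sample list are made up for this example; the class "A" (ambiguous) is what this comment calls an EAW character.

```python
import unicodedata

# Characters mentioned above, plus an ASCII letter and a hiragana for contrast.
SAMPLES = ["α", "┌", "×", "…", "a", "あ"]


def eaw_classes(chars):
    """Map each character to its East_Asian_Width property value.

    Hypothetical helper for illustration; values are short class names
    such as "A" (ambiguous), "Na" (narrow) or "W" (wide).
    """
    return {ch: unicodedata.east_asian_width(ch) for ch in chars}


if __name__ == "__main__":
    for ch, cls in eaw_classes(SAMPLES).items():
        print(f"U+{ord(ch):04X} {ch} {cls}")
```

The four symbols from the text come back as "A", while the plain ASCII letter is "Na" and the hiragana is "W".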
wcwidth
It is hard to detect the context in which EAW characters are used, so locale information (usually specified by $LANG) is normally used instead.
Each locale has a corresponding charmap (in /usr/share/i18n/charmaps/), and the two are connected by the /etc/locale.gen file (on Linux).
wcwidth refers to the charmap database corresponding to the current locale (usually $LANG), and returns the character width.
So, for example, α can be 1 in some locales and environments, but 2 in others.
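A simplified model of that behaviour might look like the sketch below. It is not how glibc implements wcwidth; it only assumes that the locale boils down to a single yes/no choice for ambiguous characters, which is passed in here as the hypothetical parameter `ambiguous_wide`. Combining and control characters are ignored for brevity.

```python
import unicodedata


def cell_width(ch: str, ambiguous_wide: bool) -> int:
    """Return the number of terminal cells for one character.

    Simplified model: "F"/"W" are always 2 cells, "A" depends on the
    context flag, everything else counts as 1 cell.
    """
    cls = unicodedata.east_asian_width(ch)
    if cls in ("F", "W"):
        return 2
    if cls == "A":
        return 2 if ambiguous_wide else 1
    return 1
```

With this model, `cell_width("α", False)` is 1 and `cell_width("α", True)` is 2, matching the "1 in some locales, 2 in others" situation described above.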
full width font and terminal
TR11 recommends that alphabetic characters be rendered as half (single) width, but I think this does not apply to non-alphabetic symbols (typically punctuation marks like …; they should absolutely be rendered as full (double) width in CJK, but they are rendered as half width in non-CJK areas).
The generic UTF-8 locale setting provided by glibc returns half width for EAW characters (because, I think, most users and developers live in non-CJK areas), but this can be changed by users, and doing so is completely legitimate.
So a terminal can use full width for EAW characters even if it has no bug, and we cannot say it is a problem with the terminal.
SSH and wcwidth
The charmap database is consulted by glibc (or something like that), so if users run apps on a server over SSH, the server's locale and charmap database are used.
This may be problematic in some cases, for example: "I use a locale and charmap which specify full width for EAW characters, but my server uses the C locale, which uses half width for EAW characters".
Problem
Hexyl uses some EAW characters (as far as I know: all ruled lines, ×, and •, but there might be more).
They are full (double) width in some environments, but hexyl always treats them as half (single) width, so the layout breaks.
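To make the breakage concrete, here is a small sketch that measures a border line and a data row under both assumptions. The two strings are made-up stand-ins in the spirit of hexyl's output, not copied from it, and `display_width` is a hypothetical helper using the same simplified width model as above.

```python
import unicodedata


def display_width(text: str, ambiguous_wide: bool) -> int:
    """Sum the cell widths of a string under a simplified width model."""
    total = 0
    for ch in text:
        cls = unicodedata.east_asian_width(ch)
        wide = cls in ("F", "W") or (cls == "A" and ambiguous_wide)
        total += 2 if wide else 1
    return total


# Made-up border line and data row, both 19 characters long.
HEADER = "┌" + "─" * 8 + "┬" + "─" * 8 + "┐"
ROW = "│" + " " * 8 + "│" + " " * 8 + "│"

if __name__ == "__main__":
    for wide in (False, True):
        print(wide, display_width(HEADER, wide), display_width(ROW, wide))
```

When ambiguous characters are narrow, both lines take 19 cells and line up. When they are wide, the border takes 38 cells but the row only 22, so the frame no longer matches its contents.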
How to solve
Make symbols customizable
IMHO, the best option is to make some special symbols customizable.
In this case, users can modify config files to use +-| as ruled lines, x instead of ×, and . instead of ・.
This might be useful for EAW users, or users with a poor font or terminal.
(And non-CJK users can keep using the good-looking box drawing characters.)
Add "ASCII-only" option
If customizability is not important, a simple --ascii CLI option or something like that can be added.
This is less flexible, but still useful enough, as with the first option.
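The first two options could be sketched together roughly as follows. Everything here is hypothetical: the symbol names, the two tables, and the `load_symbols` and `top_border` helpers are invented for illustration and do not reflect hexyl's actual code or any existing config format.

```python
# Default symbols using box drawing characters (EAW-ambiguous).
UNICODE_SYMBOLS = {
    "horizontal": "─",
    "vertical": "│",
    "top_left": "┌",
    "top_right": "┐",
    "times": "×",
    "bullet": "•",
}

# ASCII replacements along the lines suggested above: +-| as ruled lines,
# x instead of the times sign, . instead of the bullet.
ASCII_SYMBOLS = {
    "horizontal": "-",
    "vertical": "|",
    "top_left": "+",
    "top_right": "+",
    "times": "x",
    "bullet": ".",
}


def load_symbols(ascii_only: bool, overrides=None) -> dict:
    """Pick a base table, then apply per-symbol user overrides (if any)."""
    symbols = dict(ASCII_SYMBOLS if ascii_only else UNICODE_SYMBOLS)
    symbols.update(overrides or {})
    return symbols


def top_border(symbols: dict, inner_width: int) -> str:
    """Build the top line of a frame from the chosen symbol table."""
    return (
        symbols["top_left"]
        + symbols["horizontal"] * inner_width
        + symbols["top_right"]
    )
```

For example, `top_border(load_symbols(True), 4)` gives `+----+`, while `load_symbols(False, {"times": "x"})` keeps the box drawing characters and swaps only the times sign, which is the kind of flexibility the customizable variant adds over a plain --ascii switch.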
Make hexyl wcwidth-aware
This works for some environments, but won't work as expected for some remote (SSH) environments, as @kilobyte pointed out.
from hexyl.
@12101111 Some legacy Chinese fonts may use full width for U+00B7.
I agree with @sharkdp’s --ascii option. It would be better (on win32) to detect the user's code page with GetACP and turn on ascii under CP 932, 936, 949, 950.
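A minimal sketch of that detection, assuming it is acceptable to call the Win32 `GetACP` function through `ctypes` and to simply skip the check on other platforms. The function names and the decision to return `None` off Windows are choices made for this example only.

```python
import sys

# ANSI code pages listed above: 932 (Japanese), 936 (Simplified Chinese),
# 949 (Korean), 950 (Traditional Chinese).
CJK_ANSI_CODE_PAGES = frozenset({932, 936, 949, 950})


def is_cjk_code_page(code_page: int) -> bool:
    """True if the code page is one of the four CJK ANSI code pages."""
    return code_page in CJK_ANSI_CODE_PAGES


def current_ansi_code_page():
    """Return GetACP() on Windows, or None on any other platform."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; windll only exists on Windows

    return int(ctypes.windll.kernel32.GetACP())


def should_default_to_ascii() -> bool:
    """Hypothetical policy: default to ASCII under a CJK ANSI code page."""
    code_page = current_ansi_code_page()
    return code_page is not None and is_cjk_code_page(code_page)
```

This only covers the local Windows case. As discussed elsewhere in this thread, it says nothing about the terminal on the far end of an SSH session.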
from hexyl.
So EastAsianAmbiguous is still a thing in some terminals? That was about as bad an idea as CJK Unification.
I'd argue that your terminal is broken and needs to be fixed. The Unicode standard says:
In modern practice, most alphabetic characters are rendered by variable-width fonts using narrow characters, even if their encoding in common legacy sets uses multiple bytes.
although when I asked them about improving some EAW settings that made no sense, they refused to take a stance, saying the whole EAW database is obsolete and shouldn't be used anymore. They didn't provide a replacement; I guess it's time to ask for an explicit database. But that'd take many months for a draft, a year for a release, then several years to be actually obeyed by terminals.
In the interim, I'd say a tool like hexyl should avoid using any EastAsianAmbiguous characters: running under a CJK/non-CJK locale doesn't mean anything about the terminal, as a Japanese person ssh-ing to a company server will have EastAsianAmbiguous=N on the machine running hexyl but EastAsianAmbiguous=W in the terminal. And even if you have ssh sending the locale correctly, there's no such option for serial links or e-mail.
And there are fewer than 256 byte values to display, so avoiding such characters is trivial.
from hexyl.
In the interim, I'd say tool like hexyl should avoid using any EastAsianAmbiguous characters
I'm not really familiar with the details here. What is an "EastAsianAmbiguous character"? × is the "Multiplication sign" character in the "Latin-1 Supplement" block of Unicode and • is the "Bullet" character in the "General Punctuation" block of Unicode. What does either of these have to do with East Asian characters?
I understand that your terminal somehow prints these characters with a width of 2. Could we call wcwidth('×') in hexyl and fall back to another character if it returns 2? (See https://docs.rs/wcwidth/1.0.1/wcwidth/fn.char_width.html.)
from hexyl.
Alas, it's not so easy: wcwidth() returns 1 for those. And there's nothing you can do on the machine running hexyl, because the display depends on the receiving side, possibly years after hexyl was run.
I do consider assuming that EastAsianAmbiguous allows width 2 a defect in the terminal: the whole concept comes from a technical detail of some ancient systems that assumed the width of every character is the same as the number of bytes it takes to encode within that particular legacy encoding. So not only is it compatibility with something badly obsolete, it's also ambiguous as to which ancient encoding it's striving to be compatible with. Some of those characters will display as narrow, some as wide, and you have no way to detect that.
There's a database: package "unicode-data", file /usr/share/unicode/EastAsianWidth.txt. Anything marked as "A" is dangerous to use as it may exhibit this problem on some terminals. Anything "N", "Na" and "H" is safely narrow, anything "F" and "W" is wide. Here's the official standard.
But, as you need just a few characters, a solution seems trivial: just avoid anything marked as "A"; there are enough good alternatives to choose from.
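Checking candidates against that database could be sketched as below. The sample text only imitates the `codepoints;class # comment` layout of EastAsianWidth.txt and is not a verbatim excerpt. The helper names are invented, and treating unlisted code points as "N" is a simplification for this example.

```python
# Illustrative lines in the EastAsianWidth.txt layout (not a verbatim copy).
SAMPLE = """\
# range or code point ; class  # comment
0061..007A;Na  # Latin small letters
00D7;A         # multiplication sign
2022;A         # bullet
2500..254B;A   # box drawing range
3041..3096;W   # hiragana range
"""


def parse_east_asian_width(text):
    """Parse 'XXXX;C' and 'XXXX..YYYY;C' lines into (low, high, class)."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        span, cls = (part.strip() for part in data.split(";", 1))
        first, _, last = span.partition("..")
        low = int(first, 16)
        high = int(last, 16) if last else low
        ranges.append((low, high, cls))
    return ranges


def width_class(ch, ranges, default="N"):
    """Look up the class of one character (linear scan, fine for a sketch)."""
    code_point = ord(ch)
    for low, high, cls in ranges:
        if low <= code_point <= high:
            return cls
    return default


def is_safely_narrow(ch, ranges):
    """Apply the rule above: only "N", "Na" and "H" count as safely narrow."""
    return width_class(ch, ranges) in ("N", "Na", "H")
```

With `ranges = parse_east_asian_width(SAMPLE)`, the ASCII letter `x` passes `is_safely_narrow`, while ×, • and ┌ are all rejected as "A".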
from hexyl.
In the output above, there also seems to be a problem with the box drawing characters. I don't think there are reasonably good looking alternatives(?)
from hexyl.
Adding an option makes the program more complicated, and CJK users would always have to turn it on.
Maybe changing U+2022 "Bullet" to U+00B7 "Middle dot" would be better?
Middle dot seems to work in both CJK and Western environments (for me).
I tried most Chinese monospaced fonts and found that × and · are half width, while • is full width.
It seems that × is also full width in the Japanese font. I don't know if there is an alternative to ×.
from hexyl.
@12101111 The problem is not only the bullet and cross sign, but also the ruled lines...
They are usually half width in Western environments, but full width in CJK (at least in almost all Japanese fonts I know).
from hexyl.
@lo48576 Actually, the problem is that the command line application cannot know the font that the console is using and decide how many cells would be used to render such symbols under FAREAST environments :(
@sharkdp Always remember: text is hard.
from hexyl.
+1, it would be good if --ascii were automatically enabled in some environments (in the future).
Then border mode should have three modes, --ascii={never,auto,always}, like --color of many tools (ls, grep, etc...), I think.
For example, hexyl will behave as hexyl --ascii=auto, and hexyl --ascii will behave as hexyl --ascii=always.
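The proposed three-state flag can be sketched with Python's `argparse`, purely to illustrate the semantics. hexyl itself is written in Rust, so this is not its implementation; the program name `hexyl-sketch` and the helper names are made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with a --color-style tri-state --ascii flag."""
    parser = argparse.ArgumentParser(prog="hexyl-sketch")
    parser.add_argument(
        "--ascii",
        nargs="?",            # the value is optional
        choices=["never", "auto", "always"],
        const="always",       # bare `--ascii` means `--ascii=always`
        default="auto",       # no flag at all means `--ascii=auto`
    )
    return parser


def ascii_mode(argv) -> str:
    """Return the effective mode for a list of command-line arguments."""
    return build_parser().parse_args(argv).ascii
```

Here `ascii_mode([])` yields "auto", `ascii_mode(["--ascii"])` yields "always", and `ascii_mode(["--ascii=never"])` yields "never". One design caveat of an optional value: a bare `--ascii` placed directly before a file name would try to take that name as its value, so the `--ascii=VALUE` spelling is the safer form.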
from hexyl.