WeeChat has some issues with the display of Unicode chars and the way to display some of them is different between chat area and bars.
Purpose of this specification is to improve the display of Unicode chars and fix display bugs:
Some other bugs with unicode chars are not covered by this specification:
The unicode debug displayed in the following chapters is the output of command
/debug unicode ${\u1234}
, with the result of the following functions:
strlen
: number of bytes in the stringutf8_strlen
: number of chars in the stringgui_chat_strlen
: number of chars in the string (WeeChat color codes are skipped)wcwidth
: number of columns needed to display the char in the terminalutf8_strlen_screen
: number of columns needed to display the char in the terminalgui_chat_strlen_screen
: number of columns needed to display the string in the terminal (WeeChat color codes are skipped)The result of utf8_char_size_screen
is added before utf8_strlen_screen
because the result is different for non printable chars: -1
instead of 0
.
The command output in next chapters includes this change.
Unicode debug (with option weechat.look.tab_width
set to 4
):
/debug unicode ${chars:${\u0001}-${\u001F}}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
" " (U+0001, 1, 0x01): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0002, 2, 0x02): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0003, 3, 0x03): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0004, 4, 0x04): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0005, 5, 0x05): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0006, 6, 0x06): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0007, 7, 0x07): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0008, 8, 0x08): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0009, 9, 0x09): 1 / 1, 1 / -1, 4, 4, 4
"
" (U+000A, 10, 0x0A): 1 / 1, 1 / -1, 1, 1, 1
" " (U+000B, 11, 0x0B): 1 / 1, 1 / -1, 1, 1, 1
" " (U+000C, 12, 0x0C): 1 / 1, 1 / -1, 1, 1, 1
" " (U+000D, 13, 0x0D): 1 / 1, 1 / -1, 1, 1, 1
" " (U+000E, 14, 0x0E): 1 / 1, 1 / -1, 1, 1, 1
" " (U+000F, 15, 0x0F): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0010, 16, 0x10): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0011, 17, 0x11): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0012, 18, 0x12): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0013, 19, 0x13): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0014, 20, 0x14): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0015, 21, 0x15): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0016, 22, 0x16): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0017, 23, 0x17): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0018, 24, 0x18): 1 / 1, 1 / -1, 1, 1, 1
"" (U+0019, 25, 0x19): 1 / 1, 0 / -1, 1, 1, 0
"" (U+001A, 26, 0x1A): 1 / 1, 0 / -1, 1, 1, 0
"" (U+001B, 27, 0x1B): 1 / 1, 0 / -1, 1, 1, 0
"" (U+001C, 28, 0x1C): 1 / 1, 0 / -1, 1, 1, 0
" " (U+001D, 29, 0x1D): 1 / 1, 1 / -1, 1, 1, 1
" " (U+001E, 30, 0x1E): 1 / 1, 1 / -1, 1, 1, 1
" " (U+001F, 31, 0x1F): 1 / 1, 1 / -1, 1, 1, 1
Current and new behavior (chars other than spaces are displayed with reverse video attribute):
Char | Old: chat | Old: bars | New: chat + bars |
---|---|---|---|
U+0001 (1, start of heading) | 1 space | A |
A |
U+0002 (2, start of text) | 1 space | B |
B |
U+0003 (3, end of text) | 1 space | C |
C |
U+0004 (4, end of transmission) | 1 space | D |
D |
U+0005 (5, enquiry) | 1 space | E |
E |
U+0006 (6, acknowledge) | 1 space | F |
F |
U+0007 (7, bell) | 1 space | G |
G |
U+0008 (8, backspace) | 1 space | H |
H |
U+0009 (9, horizontal tab) | N spaces | I |
N spaces |
U+000A (10, NL line feed, new line) | New line | Item sep. | New line / item sep. |
U+000B (11, vertical tab) | 1 space | K |
K |
U+000C (12, NP form feed, new page) | 1 space | L |
L |
U+000D (13, carriage return) | 1 space | New line | M / New line |
U+000E (14, shift out) | 1 space | N |
N |
U+000F (15, shift in) | 1 space | O |
O |
U+0010 (16, data link escape) | 1 space | P |
P |
U+0011 (17, device control 1) | 1 space | Q |
Q |
U+0012 (18, device control 2) | 1 space | R |
R |
U+0013 (19, device control 3) | 1 space | S |
S |
U+0014 (20, device control 4) | 1 space | T |
T |
U+0015 (21, negative acknowledge) | 1 space | U |
U |
U+0016 (22, synchronous idle) | 1 space | V |
V |
U+0017 (23, end of trans. block) | 1 space | W |
W |
U+0018 (24, cancel) | 1 space | X |
X |
U+0019 (25, end of medium) | Not displayed | Not displayed | Not displayed |
U+001A (26, substitute) | Not displayed | Not displayed | Not displayed |
U+001B (27, escape) | Not displayed | Not displayed | Not displayed |
U+001C (28, file separator) | Not displayed | Not displayed | Not displayed |
U+001D (29, group separator) | 1 space | ] |
] |
U+001E (30, record separator) | 1 space | ^ |
^ |
U+001F (31, unit separator) | 1 space | _ |
_ |
Notes:
weechat.look.tab_width
.weechat.look.tab_width
is used to compute item length
on screen, resulting in display issues.Expected unicode debug after changes:
/debug unicode ${chars:${\u0001}-${\u001F}}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"A" (U+0001, 1, 0x01): 1 / 1, 1 / -1, 1, 1, 1
"B" (U+0002, 2, 0x02): 1 / 1, 1 / -1, 1, 1, 1
"C" (U+0003, 3, 0x03): 1 / 1, 1 / -1, 1, 1, 1
"D" (U+0004, 4, 0x04): 1 / 1, 1 / -1, 1, 1, 1
"E" (U+0005, 5, 0x05): 1 / 1, 1 / -1, 1, 1, 1
"F" (U+0006, 6, 0x06): 1 / 1, 1 / -1, 1, 1, 1
"G" (U+0007, 7, 0x07): 1 / 1, 1 / -1, 1, 1, 1
"H" (U+0008, 8, 0x08): 1 / 1, 1 / -1, 1, 1, 1
" " (U+0009, 9, 0x09): 1 / 1, 1 / -1, 4, 4, 4
"
" (U+000A, 10, 0x0A): 1 / 1, 1 / -1, 1, 1, 1
"K" (U+000B, 11, 0x0B): 1 / 1, 1 / -1, 1, 1, 1
"L" (U+000C, 12, 0x0C): 1 / 1, 1 / -1, 1, 1, 1
"M" (U+000D, 13, 0x0D): 1 / 1, 1 / -1, 1, 1, 1
"N" (U+000E, 14, 0x0E): 1 / 1, 1 / -1, 1, 1, 1
"O" (U+000F, 15, 0x0F): 1 / 1, 1 / -1, 1, 1, 1
"P" (U+0010, 16, 0x10): 1 / 1, 1 / -1, 1, 1, 1
"Q" (U+0011, 17, 0x11): 1 / 1, 1 / -1, 1, 1, 1
"R" (U+0012, 18, 0x12): 1 / 1, 1 / -1, 1, 1, 1
"S" (U+0013, 19, 0x13): 1 / 1, 1 / -1, 1, 1, 1
"T" (U+0014, 20, 0x14): 1 / 1, 1 / -1, 1, 1, 1
"U" (U+0015, 21, 0x15): 1 / 1, 1 / -1, 1, 1, 1
"V" (U+0016, 22, 0x16): 1 / 1, 1 / -1, 1, 1, 1
"W" (U+0017, 23, 0x17): 1 / 1, 1 / -1, 1, 1, 1
"X" (U+0018, 24, 0x18): 1 / 1, 1 / -1, 1, 1, 1
"" (U+0019, 25, 0x19): 1 / 1, 0 / -1, 1, 1, 0
"" (U+001A, 26, 0x1A): 1 / 1, 0 / -1, 1, 1, 0
"" (U+001B, 27, 0x1B): 1 / 1, 0 / -1, 1, 1, 0
"" (U+001C, 28, 0x1C): 1 / 1, 0 / -1, 1, 1, 0
"]" (U+001D, 29, 0x1D): 1 / 1, 1 / -1, 1, 1, 1
"^" (U+001E, 30, 0x1E): 1 / 1, 1 / -1, 1, 1, 1
"_" (U+001F, 31, 0x1F): 1 / 1, 1 / -1, 1, 1, 1
Note: the letters and symbols between double quotes are displayed with reverse video attribute.
Unicode debug:
/debug unicode ${\u007F}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
" " (U+007F, 127, 0x7F): 1 / 1, 1 / -1, 1, 1, 1
Current and new behavior:
Char | Old: chat | Old: bars | New: chat + bars |
---|---|---|---|
U+007F (127, delete) | 1 space | 1 space | Not displayed |
Expected unicode debug after changes:
/debug unicode ${\u007F}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"" (U+007F, 127, 0x7F): 1 / 1, 1 / -1, -1, 0, 0
Unicode debug:
/debug unicode ${\u0092}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
" " (U+0092, 146, 0xC2 0x92): 2 / 1, 1 / -1, 1, 1, 1
Current and new behavior:
Char | Old: chat | Old: bars | New: chat + bars |
---|---|---|---|
U+0092 (146, private use two) | 1 space | 1 space | Not displayed |
Expected unicode debug after changes:
/debug unicode ${\u0092}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"" (U+0092, 146, 0xC2 0x92): 2 / 1, 1 / -1, -1, 0, 0
Unicode debug:
/debug unicode ${\u00AD}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"" (U+00AD, 173, 0xC2 0xAD): 2 / 1, 1 / 1, 1, 1, 1
This char is supposed to be displayed (wcwidth
== 1), but as WeeChat
does not use it to break lines, it must be treated as a special character
and not displayed at all.
Current and new behavior:
Char | Old: chat | Old: bars | New: chat + bars |
---|---|---|---|
U+00AD (173, soft hyphen) | Hyphen | Hyphen | Not displayed |
Note: the hyphen displayed can also be a space or not displayed, according to the terminal and font used.
Expected unicode debug after changes:
/debug unicode ${\u00AD}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"" (U+00AD, 173, 0xC2 0xAD): 2 / 1, 1 / 1, -1, 0, 0
Unicode debug:
/debug unicode ${\u200B}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"" (U+200B, 8203, 0xE2 0x80 0x8B): 3 / 1, 1 / 0, 0, 0, 0
This char is supposed to be displayed (wcwidth
== 0), but in some cases causes
display issues, so it must be treated as a special character and not displayed at all.
For more information, see issue #1770.
Current and new behavior:
Char | Old: chat | Old: bars | New: chat + bars |
---|---|---|---|
U+200B (8203, zero width space) | 1 space | 1 space | Not displayed |
Note: the space may not be displayed or cause display issues, according to the terminal and font used.
Expected unicode debug after changes:
/debug unicode ${\u200B}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"" (U+200B, 8203, 0xE2 0x80 0x8B): 3 / 1, 1 / 0, -1, 0, 0
All other non printable chars (when wcwidth
== -1) must not be displayed.
For example U+0085 (133, next line), unicode debug:
/debug unicode ${\u0085}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
" " (U+0085, 133, 0xC2 0x85): 2 / 1, 1 / -1, 1, 1, 1
Current and new behavior:
Char | Old: chat | Old: bars | New: chat + bars |
---|---|---|---|
U+0085 (133, next line) | Space | Space | Not displayed |
Expected unicode debug after changes:
/debug unicode ${\u0085}
Unicode: "char" (hex codepoint, codepoint, UTF-8 sequence): strlen / utf8_strlen, gui_chat_strlen / wcwidth, utf8_char_size_screen, utf8_strlen_screen, gui_chat_strlen_screen:
"" (U+0085, 133, 0xC2 0x85): 2 / 1, 1 / -1, -1, 0, 0
This function currently considers any non printable char (wcwidth
== -1)
needs one column to be displayed, because we display them as a space.
The new behavior for non printable chars:
weechat.look.tab_width
So the function will return:
-1
for any non printable char (must not be displayed)0
for any printable char that is not visible (for example Combining Diacritical Marks)≥ 1
for any other char that is visibleThis function must return the sum of wcwidth
for all chars in the string.
Any char with wcwidth
== -1 is considered as 0 column on screen.
This function has a bug: when the string has at least two unicode chars and
contains at least one non printable char (wcwidth
== -1), then the result
is smaller than expected. For example utf8_strlen_screen("abc\x01")
returns
1 instead of 4.
Function is removed.
Function is removed.
This function skips WeeChat color codes, this is the only difference with
the function utf8_strlen_screen
.
The changes must be implemented in this order:
utf8_char_size_screen
in command /debug unicode
utf8_strlen_screen
utf8_strncpy