Talk:Unicode

This is the talk page for discussing improvements to the Unicode article.
This is not a forum for general discussion of the article's subject.

Typography: Top-importance

This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Top-importance on the importance scale.

Languages: Low-importance

This article is within the scope of WikiProject Languages, a collaborative effort to improve the coverage of languages on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as Low-importance on the project's importance scale.

Computing: High-importance

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as High-importance on the project's importance scale.

Globalization: unrated

This article is within the scope of WikiProject Globalization, a collaborative effort to improve the coverage of Globalization on Wikipedia. If you would like to participate, you can edit the article attached to this page, or visit the project page, where you can join the project and see a list of open tasks.
This article has not yet received a rating on the project's importance scale.

Text and/or other creative content from this version of Unicode was copied or moved into incubator:Wp/nod/ᩀᩪᨶᩥᨣᩰ᩠ᨯ with this edit. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted as long as the latter page exists.

Vulnerabilities

Two researchers, one from the University of Cambridge and the other affiliated with both Cambridge and the University of Edinburgh, recently released a security advisory asserting that carefully crafted computer source code can be used to introduce vulnerabilities into apparently harmless programs. Some security groups (like the one for the Rust language) are already taking measures and issuing their own security advisories.

I think that is something that affects Unicode as source code is one of the main applications of the standard. What do ye think would be a good way to introduce that to the article?

[1] [2] [3]

Bruno Unna (talk) 12:02, 1 November 2021 (UTC)

Looks like existing Unicode § Issues is the place to go. Indeed, a true Unicode case (U+202E RIGHT-TO-LEFT OVERRIDE). Also consider mentioning at Bidirectional text? -DePiep (talk) 12:29, 1 November 2021 (UTC)
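
For illustration, a minimal Python sketch of how source text could be scanned for the bidirectional control characters discussed above, including U+202E. The function name `find_bidi_controls` and the sample line are hypothetical and not taken from the advisory; this is a sketch of the idea, not the mitigation any particular project adopted.

```python
import unicodedata

# Explicit bidirectional embedding, override, and isolate controls.
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def find_bidi_controls(source: str):
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits

# Hypothetical line of source code hiding a RIGHT-TO-LEFT OVERRIDE.
sample = 'access = "user\u202E" # admin only'
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

A scan like this only flags the characters; deciding whether a given occurrence is legitimate (for example, in a right-to-left string literal) is left to the reviewer.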

References

Hello M.h.gholamii (talk) 19:24, 14 July 2022 (UTC)

ko M.h.gholamii (talk) 19:24, 14 July 2022 (UTC)

question

I have been designing text glyphs for electrical symbols and electronic components. I design them in the FontCreator program with Unicode encoding, but after exporting the font and copying and pasting a symbol I designed into phone apps, it does not work and appears as a question mark. What is the solution? (Note: this topic is important for article development. I want to design symbols for non-electrical shapes too, not only in the field of electricity, and I want them to be text rather than thumbnails.) Mohmad Abdul sahib talk☎ talk 18:15, 18 April 2022 (UTC)

Likely, FontCreator has the appropriate font, containing the electrical symbols, but the receiving program does not. It looks like the font should cover the Unicode block Miscellaneous Technical. That requires downloading a suitable font, but I cannot help any further. -DePiep (talk) 19:52, 14 July 2022 (UTC)
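
As a rough aid for questions like this one, the sketch below checks whether a character falls in the Miscellaneous Technical block (U+2300 to U+23FF) or in a Private Use range. The assumption that the custom symbols were assigned to Private Use code points is a guess, not something the poster stated, and the helper name `block_hint` is hypothetical. Either way, a receiving app shows a placeholder when none of its fonts has a glyph for the code point.

```python
def block_hint(ch: str) -> str:
    """Classify one character by a few well-known code point ranges."""
    cp = ord(ch)
    if 0x2300 <= cp <= 0x23FF:
        return "Miscellaneous Technical"
    # Private Use Area plus the two supplementary private use planes.
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "Private Use"
    return "other"

# U+23DA EARTH GROUND is a standard electrical symbol; U+E000 is private use.
for ch in ["\u23DA", "\uE000", "A"]:
    print(f"U+{ord(ch):04X}: {block_hint(ch)}")
```

Private Use code points have no agreed meaning, so they only display as intended where the designer's own font is installed and selected.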

Version 15 & Wikidata

I am adding new blocks & data to Wikidata now. Assuming no DAB needed here, the pages are:

DePiep (talk) 16:10, 13 September 2022 (UTC)

QID added -DePiep (talk) 16:33, 13 September 2022 (UTC)

more listing -DePiep (talk) 18:02, 13 September 2022 (UTC)

Not much time to complete this list, for me. DePiep (talk) 18:12, 13 September 2022 (UTC)

Note that, as far as I can see, only two content articles require the "(Unicode block)" DAB-specifier, because of name overlap. The other "X (Unicode block)" pages should be redirects to their (unambiguously named) content Block article. See also {{Unicode blocks/overview}}. DePiep.

Recent changes in Unicode

In Unicode pages & talkpages (source)
In ISO 15924 pages & talkpages (source)

List overview · Lists updated: 2022-10-01

By now, most 15.0 changes seem to be processed & updated. See Recent Changes for the current edit history. -DePiep (talk) 11:31, 21 September 2022 (UTC)
As a list of version-15.0-changes needed or done, this list is incomplete. DePiep (talk) 05:36, 24 October 2022 (UTC)

New Taskforce WikiProject Unicode?

A proposal has been opened at WP:COMP § Taskforce WP Unicode – proposal. Please take a look. DePiep (talk) 09:35, 2 October 2022 (UTC)

Code Points

The lead claims that there are currently 149 186 characters in the Standard. That's confusing! Is that actual characters or does it include unprintable code points? I know what a code point is, my point is that the lead shouldn't confuse code points with characters. (I also argue that a "control character" isn't 'really' a character, not a grapheme, but that's a fight for somewhere else.) Writing about Unicode without an early clear explanation of what a code point is, is -I think- awful pedagogy. In fact, I don't think code point - a fundamental aspect of Unicode - is even defined in the article!!!! Wow, just wow.

I also would like someone to verify that Unicode has characters for color. I believe that's wrong/false/misleading. I am aware that certain emoji can be modified by a code point to change some of their color. As far as I know, this is only true with a very small set of code points, and a very very small set of colors (I don't actually know if the colors are well-defined, I'd expect so, but...). These aren't colors, but are color modifiers for those other code points. 174.130.71.156 (talk) 16:00, 13 December 2022 (UTC)

There are no color defining codes in Unicode but there are names of characters that specify a color if displayed on a color device. Searching the word color in the article shows some possibly confusing text about color but nothing outright wrong.

This article leaves a lot to be desired; if you wish to make changes, you should. It's a wiki after all. SchmuckyTheCat (talk) 05:43, 15 December 2022 (UTC)

There are two Variation Selectors (U+FE0E and U+FE0F) which specify whether an Emoji should be ideally displayed in color or black and white, but other than that, there are no color specifications in Unicode. The terms "character" and "code point" are specified in the Unicode Standard, and if you feel that the coverage here is inadequate in conveying the meaning of those terms, I absolutely encourage you to contribute content to better reflect their technical specification. For the record, any code point defined beyond "Not A Character" or "Reserved" is a "character". This means control characters and whitespace are all considered characters in Unicode, just like a letter in an alphabet, a Kanji with On and Kun readings, or a mathematical symbol. Van Isaac, GHTV^cont_WpWS 06:18, 15 December 2022 (UTC)
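
To make the distinction in this thread concrete, here is a small Python sketch using the standard `unicodedata` module. It prints the General Category of a control character, a space, a letter, a variation selector, and a noncharacter, then builds the text-style and emoji-style sequences for U+2764 with U+FE0E and U+FE0F. The choice of sample code points is mine, for illustration only.

```python
import unicodedata

samples = {
    "NULL (control character)": "\u0000",
    "SPACE": "\u0020",
    "LATIN SMALL LETTER A": "a",
    "VARIATION SELECTOR-16": "\uFE0F",
    "U+FFFF (noncharacter)": "\uFFFF",
}

# General Category: Cc = control, Zs = space separator, Ll = lowercase letter,
# Mn = nonspacing mark, Cn = unassigned (noncharacters are reported as Cn).
for label, ch in samples.items():
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {label}")

# The same base code point with a text-style or emoji-style request.
heart_text = "\u2764\uFE0E"   # ask for monochrome text presentation
heart_emoji = "\u2764\uFE0F"  # ask for colorful emoji presentation
print(len(heart_text), len(heart_emoji))  # each sequence is two code points
```

Whether the two heart sequences actually look different depends on the fonts and rendering engine in use.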

Lead is simply wrong.

The offending sentence is: "The Unicode standard defines three and several other encodings exist, all in practice variable-length encodings." (Sure, you could strain to interpret that to mean "all but UTF-32", but let's keep it clear.) It clearly implies all encodings are variable length. Wikipedia's own article on UTF-32 says it is fixed length. (Because it only needs to use 21 of the 32 bits for Unicode code points, it is very inefficient, and rarely used, afaik.) But rarely used is not the same as "doesn't exist", and "all are variable" clearly implies it doesn't exist. I'd have to look again: are there really 3 variable Unicode encodings? I can only think of UTF-8 and UTF-16 (and some others that afaik are not "defined" in the Unicode standard, like GB18030, or that are obsolete, like UTF-7). Replace "all" with "all common encodings" or something similar, and mention UTF-32. 174.130.71.156 (talk) 11:43, 15 December 2022 (UTC)

I think the intended meaning of this was that even if code points are fixed-size, modern Unicode is effectively variable-width, as what the user thinks is a "character" sometimes needs multiple code points. Spitzak (talk) 16:40, 15 December 2022 (UTC)

Yes, Unicode includes both combining characters and precomposed characters, e.g., <U+0061 "a" latin small letter a> <U+0308 "¨" combining diaeresis> is equivalent to <U+00E4 "ä" latin small letter a with diaeresis>. Further, some glyphs exist at multiple code points for historical reasons. There is a discussion of canonical forms in the Unicode standard. --Shmuel (Seymour J.) Metz Username:Chatul (talk) 21:57, 15 December 2022 (UTC)
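
The equivalence described above can be checked with Python's standard `unicodedata.normalize`. This is only a sketch of the a + U+0308 versus U+00E4 case; the variable names are mine.

```python
import unicodedata

decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

# As raw code point sequences the two strings differ...
print(decomposed == precomposed)

# ...but normalizing to a canonical form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", precomposed)  # fully decompose
print(nfc == precomposed, nfd == decomposed)
```

Comparing user-visible text therefore usually means normalizing both sides to the same form first.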

It seems odd to me to describe code points as "fixed size". They're just an abstract number. It's when you encode (or store) the code points that you get variable lengths, at least for UTF-8, UTF-EBCDIC, and UTF-16 as described in the article. I think combining characters are a red herring for this discussion. DRMcCreedy (talk) 23:10, 15 December 2022 (UTC)

The Unicode standard does restrict the number of code points, so describing them as fixed length 21-bit or 32-bit data is reasonable. Spitzak is referring to characters, which indeed are variable length, a separate issue from the length of an encoded code point that does deserve mention. --Shmuel (Seymour J.) Metz Username:Chatul (talk) 17:14, 16 December 2022 (UTC)
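
For reference in this thread, a short Python sketch that measures how many bytes a single code point occupies in UTF-8, UTF-16, and UTF-32. The helper name `encoded_sizes` and the sample characters are my own choices; the little-endian codecs are used so that no byte order mark is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Bytes needed to encode one code point in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# One code point each from different ranges of the code space.
for ch in ["A", "\u00E4", "\u20AC", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 ranges from 1 to 4 bytes and UTF-16 from 2 to 4 bytes across these samples, while UTF-32 is always 4 bytes per code point, which is the fixed-length case the original poster says the lead overlooks.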

Inline mentioning

I object to the reversal by Peter M. Brown, citing WP:ITALICTITLE inappropriately. I'd say that the name, a noun, should not be in italics.

ITALICTITLE refers to the name of a work, i.e. the work itself (play, periodical, book). However, the Unicode standard is a standard, not a book etc., not even its publication. The Standard is an abstraction: the set of rules. It is a proper noun, full stop. The key is that the article title denotes the subject: the standard, not the book. DePiep (talk) 17:04, 21 April 2023 (UTC)

@Peter M. Brown: -DePiep (talk) 10:43, 23 April 2023 (UTC)

Why no section about missing graphemes?

I don't know if it would be manageable, but Unicode clearly does not have all commonly used symbols. A simple example is the very commonly used 'slash marks' used to count. Most reading this will be familiar with the sequence /, //, ///, ////, and ~~////~~, with the crossmark (strike-through) drawn diagonally (top left to bottom right) rather than horizontally. (This is typical in the USA; I understand the European convention is slightly different.) I request the editors to consider the addition of a list of missing (but documented) symbols. 40.142.183.146 (talk) 11:49, 9 June 2023 (UTC)

Unicode's non-inclusion of tally marks is covered in Tally marks § Unicode. I don't think it's a good idea to include it also in this article. That would open the door of listing every proposal that has not yet been accepted. Indefatigable (talk) 15:42, 9 June 2023 (UTC)

I also oppose this idea. The set of unencoded symbols is open-ended and may exceed the number of encoded symbols. There would also be no way to determine which unencoded symbols merit mention. DRMcCreedy (talk) 16:01, 9 June 2023 (UTC)

Proposed new writing systems to be encoded into Unicode 16

Unicode 16 is set to release in September 2024. I think the following (con)scripts definitely need to be encoded:

Chữ Việt Trí - an alphabet invented by Tôn Thất Chương in 2012 for the Vietnamese language. It's still nicer than the Latin-based Quoc Ngu and needs wide recognition, as Shavian and Hangul did.
Add support for Quikscript.
Add extra missing runes from Baconsthorpe and Sedgeford, and Armanen runes.
Possibly add something more.

94.180.80.9 (talk) 07:31, 9 July 2023 (UTC)

Take a look at Unicode's FAQ for Submitting Successful Character and Script Proposals. Wikipedia isn't affiliated with The Unicode Consortium so requests here won't be seen or acted upon by the people who can actually add characters/scripts to the Unicode Standard. DRMcCreedy (talk) 14:39, 9 July 2023 (UTC)

Combining macron and acute in text referencing them separately

@Spitzak: In the text "for example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically," the "e" is followed by two distinct combining characters, but they are rendered at a single location. I inserted a space to cause them to display as two separate characters, and Spitzak reverted the change with the comment "They are supposed to be combined." In context, I don't understand how it makes sense to combine them, since the text refers to them individually. -- Shmuel (Seymour J.) Metz Username:Chatul (talk) 21:40, 16 October 2023 (UTC)

My reading of that sentence is that it's comparing the rendering of the precomposed character with the combining characters so you can see how the two render pretty much side by side. That reading prohibits a space between the combining characters. I suppose you should show the combining characters separated then together if you really want to show the components separately. Something like this, with my additions in green: "For example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an e with a macron (◌̄) and acute accent (◌́), but in practice, their appearance may vary depending upon what rendering engine and fonts are being used to display the characters." DRMcCreedy (talk) 23:01, 16 October 2023 (UTC)

That reading is correct, but I don't agree with the inference; I would agree that it prohibits rendering a space between the two combining characters, but that does not mean that it prohibits a space in the markup that causes the characters to be rendered adjacent to each other with no intervening space. The text "ḗ" renders as ḗ, with the ̄ and ̋ overlaid, while the text "ē ́" renders as ē ́ , with no overlay and no intervening space.

Your suggested text should be acceptable. -- Shmuel (Seymour J.) Metz Username:Chatul (talk) 13:39, 18 October 2023 (UTC)

Great. I've made the updates. DRMcCreedy (talk) 14:29, 18 October 2023 (UTC)

Sorry, I read it too quickly; that still has the original issue unless you remove the "ḗ" and remove the parentheses from the parenthetical note, i.e., "for example, ḗ (precomposed e with macron and acute above) and e followed by the combining macron above and combining acute above should be rendered identically,". Alternatively, "for example, ḗ (precomposed e with macron and acute above) and eōó (e followed by the combining macron above and combining acute above) should be rendered identically,". -- Shmuel (Seymour J.) Metz Username:Chatul (talk) 19:25, 18 October 2023 (UTC)

But that wouldn't allow the reader to see if the two equivalent versions (precomposed and combining) render the same on whatever device they're using. I think that's the point of having both the precomposed and combining in the example in the first place. DRMcCreedy (talk) 20:29, 18 October 2023 (UTC)
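
Since the two forms are hard to tell apart on screen, the Python sketch below lists the code points behind each one and checks that they are canonically equivalent. The helper names `describe` and `canonically_equivalent` are hypothetical, and the strings are written with escapes so that the example does not depend on how this page stores the characters.

```python
import unicodedata

precomposed = "\u1E17"        # single precomposed code point
combining = "e\u0304\u0301"   # e + combining macron + combining acute

def describe(text: str) -> list:
    """List each code point in the string with its character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(describe(precomposed))
print(describe(combining))
print(canonically_equivalent(precomposed, combining))
```

This shows what is stored, not what is drawn; as noted in the thread, the visual result still depends on the rendering engine and fonts.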

Welcome, I want the Kurdistan flag on my keyboard

Welcome, I want the Kurdistan flag on my keyboard 85.94.240.91 (talk) 23:28, 2 November 2023 (UTC)

Unfortunately, the flag of Kurdistan is not presently encoded in Unicode. Remsense聊 23:32, 2 November 2023 (UTC)

Nor will it be added per Unicode's proposal guidelines for flags DRMcCreedy (talk) 00:34, 3 November 2023 (UTC)

Slightly odd hatnote

@Spitzak, I'm also really not sure what you're talking about exactly—Microsoft seems to have the definition of "Unicode" in line with that of the rest of the world.[1] If they use "Unicode" as a shorthand for "UTF-16" sometimes (the way many people use it as a shorthand for "UTF-8"), then the page I just linked seems to do any theoretical disambiguation work, and doesn't really leave us wondering whether they're somehow creating an ambiguity problem for us to solve. Remsense诉 02:28, 8 March 2024 (UTC)

I give up on this, but it is because I was looking at a function called isTextUnicode which returns false for UTF-8. There are a number of other examples where "Unicode" means the 16-bit interface. Spitzak (talk) 06:19, 8 March 2024 (UTC)

There are two separate issues:

Do we have to deal with this?: I believe that we do need to mention the limitations of Unicode support in Windows.
Is a hatnote the best way to deal with it?: I believe that the hatnote is inappropriate and that the text should mention the limitation, probably not in the lead.

Perhaps Unicode#Operating systems could say "In Microsoft Windows, the Unicode support is limited to UTF-16." -- Shmuel (Seymour J.) Metz Username:Chatul (talk) 15:47, 8 March 2024 (UTC)

Except it isn't really limited to UTF-16, especially in modern versions. The problem is they use "Unicode" all over the place to mean "16-bit encoding" and do differentiate it from "8-bit encoding". This explicitly excludes every form of Unicode other than UTF-16 and UCS-2 (it also thus includes other 16-bit encodings that are not Unicode, but this is probably not a big deal). Spitzak (talk) 17:30, 8 March 2024 (UTC)
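
As background for the UTF-16 versus UCS-2 distinction mentioned above, here is a minimal Python sketch of how a code point maps to 16-bit code units. It does not call or model any Windows API; the function name `to_utf16_units` is hypothetical, and the input is assumed to be a single valid (non-surrogate) code point.

```python
def to_utf16_units(ch: str) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    cp = ord(ch)
    if cp < 0x10000:
        # Basic Multilingual Plane: one unit, same value as UCS-2.
        return [cp]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

for ch in ["A", "\u20AC", "\U0001F600"]:
    units = " ".join(f"{u:04X}" for u in to_utf16_units(ch))
    print(f"U+{ord(ch):04X} -> {units}")
```

A code point above U+FFFF needs two units, which UCS-2 cannot express, so a "16-bit" interface is still a variable-length encoding once it is treated as UTF-16.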

Can you work up text that concisely but accurately describes the m$ nomenclature and support for Unicode and the preferred role of UTF-16? -- Shmuel (Seymour J.) Metz Username:Chatul (talk) 18:38, 8 March 2024 (UTC)

I agree with this. Remsense诉 01:41, 9 March 2024 (UTC)

[1] https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/

[2] https://www.trojansource.codes/trojan-source.pdf

[3] https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
