Invisible Unicode character that affects alphabetical sorting - unicode

Is there an invisible (Unicode) character that affects the alphabetical sorting of list entries?
The general question of invisible characters has been addressed here before, and here is a list of invisible characters, most of which seem to be blanks of some kind.
I am looking for an invisible character which can be placed e.g. at the beginning or inside a filename, and then makes that file getting sorted alphabetically on top with respect to other filenames.
I should add that the character I'm looking for should be sorted before a normal SPACE character as well, for example when used in a file name that is sorted in places like Sharepoint, Teams and OneDrive in the browser.

U+0020 SPACE (" ") is invisible and sorts before all visible characters in most orderings. It's been used to order things to the start of lists for decades.

Related

Is there a special character that cannot be typed or copied by user, but can be inserted/read by code into/from text?

I need to have a temporary delimiter, inserted server-side, that cannot possibly exist in content created by user.
The purpose for this is to have prepared content for CSV export, with configurable value delimiter, that will replace this untypeable character client-side, right before the export.
Does such character even exist?
There is no character that cannot possibly exist; however there are many characters (in particular control codes - those lower than decimal 32, excluding cr/lf/tab) that are extremely unlikely to exist in any reasonable text content. This is why escaping is often required in text-based protocols. There is no reserved space of characters that will be escaped in CSV, other than those already used in CSV itself.
Zero-width joiner is a unicode invisible kind of character which exist but do not exist. You can use that! :)

Invisible characters - ASCII

Are there any invisible characters? I have checked Google for invisible characters and ended up with many answers but I'm not sure about those. Can someone on Stack Overflow tell me more about this?
Also I have checked a profile on Facebook and found that the user didn't have any name to his profile? How can this be possible? Is it some database issue? Hacking or something?
When I searched over Internet, I found that 200D is an ASCII value with an invisible character. Is it true?
I just went through the character map to get these.
They are all in Calibri.
Number    Name      HTML Code    Appearance
------    --------------------  ---------    ----------
U+2000    En Quad           " "
U+2001    Em Quad           " "
U+2002    En Space        " "
U+2003    Em Space        " "
U+2004  Three-Per-Em Space      " "
U+2005  Four-Per-Em Space         " "
U+2006 Six-Per-Em Space       " "
U+2007 Figure Space         " "
U+2008 Punctuation Space        " "
U+2009 Thin Space         " "
U+200A Hair Space        " "
U+200B Zero-Width Space ​      "​"
U+200C Zero Width Non-Joiner ‌   "‌"
U+200D Zero Width Joiner ‍      "‍"
U+200E Left-To-Right Mark ‎      "‎"
U+200F Right-To-Left Mark ‏      "‏"
U+202F Narrow No-Break Space        " "
How a character is represented is up to the renderer, but the server may also strip out certain characters before sending the document.
You can also have untitled YouTube videos like https://www.youtube.com/watch?v=dmBvw8uPbrA by using the Unicode character ZERO WIDTH NON-JOINER (U+200C), or ‌ in HTML. The code block below should contain that character:
‌‌
There is actually a truly invisible character: U+FEFF.
This character is called the Byte Order Mark and is related to the Unicode 8 system. It is a really confusing concept that can be explained HERE The Byte Order Mark or BOM for short is an invisible character that doesn't take up any space. You can copy the character bellow between the > and <.
Here is the character:
> <
How to catch this character in action:
Copy the character between the > and <,
Write a line of text, then randomly put your caret in the line of text
Paste the character in the line.
Go to the beginning of the line and press and hold the right arrow key.
You will notice that when your caret gets to the place you pasted the character, it will briefly stop for around half a second. This is becuase the caret is passing over the invisible character. Even though you can't see it doesn't mean it isn't there. The caret still sees that there is a character in that area that you pasted the BOM and will pass through it. Since the BOM is invisble, the caret will look like it has paused for a brief moment. You can past the BOM multiple times in an area and redo the steps above to really show the affect. Good luck!
EDIT: Sadly, Stackoverflow doesn't like the character. Here is an example from w3.org: https://www.w3.org/International/questions/examples/phpbomtest.php
Other answers are correct - whether a character is invisible or not depends on what font you use. This seems to be a pretty good list to me of characters that are truly invisible (not even space). It contains some chars that the other lists are missing.
'\u2060', // Word Joiner
'\u2061', // FUNCTION APPLICATION
'\u2062', // INVISIBLE TIMES
'\u2063', // INVISIBLE SEPARATOR
'\u2064', // INVISIBLE PLUS
'\u2066', // LEFT - TO - RIGHT ISOLATE
'\u2067', // RIGHT - TO - LEFT ISOLATE
'\u2068', // FIRST STRONG ISOLATE
'\u2069', // POP DIRECTIONAL ISOLATE
'\u206A', // INHIBIT SYMMETRIC SWAPPING
'\u206B', // ACTIVATE SYMMETRIC SWAPPING
'\u206C', // INHIBIT ARABIC FORM SHAPING
'\u206D', // ACTIVATE ARABIC FORM SHAPING
'\u206E', // NATIONAL DIGIT SHAPES
'\u206F', // NOMINAL DIGIT SHAPES
'\u200B', // Zero-Width Space
'\u200C', // Zero Width Non-Joiner
'\u200D', // Zero Width Joiner
'\u200E', // Left-To-Right Mark
'\u200F', // Right-To-Left Mark
'\u061C', // Arabic Letter Mark
'\uFEFF', // Byte Order Mark
'\u180E', // Mongolian Vowel Separator
'\u00AD' // soft-hyphen
The question about invisible characters in Unicode deserves a more thorough explanation.
Short answer - there are lots
Here are 134 invisible characters →­؜᠎​‌‍‎‏‪‫‬‭‮⁠⁡⁢⁣⁤⁧⁦⁨⁩𝅳𝅴𝅵𝅶𝅷𝅸𝅹𝅺󠀁󠀠󠀡󠀢󠀣󠀤󠀥󠀦󠀧󠀨󠀩󠀪󠀫󠀬󠀭󠀮󠀯󠀰󠀱󠀲󠀳󠀴󠀵󠀶󠀷󠀸󠀹󠀺󠀻󠀼󠀽󠀾󠀿󠁀󠁁󠁂󠁃󠁄󠁅󠁆󠁇󠁈󠁉󠁊󠁋󠁌󠁍󠁎󠁏󠁐󠁑󠁒󠁓󠁔󠁕󠁖󠁗󠁘󠁙󠁚󠁛󠁜󠁝󠁞󠁟󠁠󠁡󠁢󠁣󠁤󠁥󠁦󠁧󠁨󠁩󠁪󠁫󠁬󠁭󠁮󠁯󠁰󠁱󠁲󠁳󠁴󠁵󠁶󠁷󠁸󠁹󠁺󠁻󠁼󠁽󠁾󠁿← and here is their escaped ASCII representation: U+00AD U+061C U+180E U+200B U+200C U+200D U+200E U+200F U+202A U+202B U+202C U+202D U+202E U+2060 U+2061 U+2062 U+2063 U+2064 U+2067 U+2066 U+2068 U+2069 U+206A U+206B U+206C U+206D U+206E U+206F U+FEFF U+1D173 U+1D174 U+1D175 U+1D176 U+1D177 U+1D178 U+1D179 U+1D17A U+E0001 U+E0020 U+E0021 U+E0022 U+E0023 U+E0024 U+E0025 U+E0026 U+E0027 U+E0028 U+E0029 U+E002A U+E002B U+E002C U+E002D U+E002E U+E002F U+E0030 U+E0031 U+E0032 U+E0033 U+E0034 U+E0035 U+E0036 U+E0037 U+E0038 U+E0039 U+E003A U+E003B U+E003C U+E003D U+E003E U+E003F U+E0040 U+E0041 U+E0042 U+E0043 U+E0044 U+E0045 U+E0046 U+E0047 U+E0048 U+E0049 U+E004A U+E004B U+E004C U+E004D U+E004E U+E004F U+E0050 U+E0051 U+E0052 U+E0053 U+E0054 U+E0055 U+E0056 U+E0057 U+E0058 U+E0059 U+E005A U+E005B U+E005C U+E005D U+E005E U+E005F U+E0060 U+E0061 U+E0062 U+E0063 U+E0064 U+E0065 U+E0066 U+E0067 U+E0068 U+E0069 U+E006A U+E006B U+E006C U+E006D U+E006E U+E006F U+E0070 U+E0071 U+E0072 U+E0073 U+E0074 U+E0075 U+E0076 U+E0077 U+E0078 U+E0079 U+E007A U+E007B U+E007C U+E007D U+E007E U+E007F
Are there more? Yes.
Are there invisible characters in the ASCII range? Depends on the font.
Long answer - ready? set. go!
The Unicode Standard enables anyone to read and write in their own language. To do that, it lists unique code points󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 (U+hex), that are categorized into letters (D,ž,Dž,ʶ,愛,𓂀), symbols (+∊≠,£¥₪,҂˚˟˿), marks (ם֑֟֯ ,ী,◌҉ ), separators ( , , , ,  ), emojis (😊,🙏,👍), and much more. ASCII/Basic Latin is the very beginning of the table and more code points are added every update.
Simply listing unique numbers for characters is not enough. Characters can change their shape or change the sentence depending on the context. To support that, every code point comes with a list of properties . These properties may define the width (AA), its role in the sentence (-“.), its direction (cכ), and much more.
Most invisible characters have the property General_Category=Format (other answers here included Spaces as well). Theis characters have a supporting role to a word/sentence. Here are some examples:
General Punctuation Block -
Invisible characters that are an integral part of some writing systems and emojis. Common ones are Zero width joiner (U+200D), Zero width non joiner (U+200C), Word joiner (U+2060)
Explicit Bidirectional Formatting characters - 12 invisible characters󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 used to enforce different direction constraints on the sentence. Helping present text to more than 300 million speakers of right-to-left languages e.g. Hebrew or Arabic.
Tags - 97 invisible characters that mirror ASCII (just drop the E and you get characters in the ASCII range). These are used as emoji modifiers and digital signatures to prove who copied your text.
This all leads to talk about exploiting invisible characters for homograph attack/visual spoofing. Sometimes it's harmless like invisible names and titles but in lots of cases they are used maliciously. For example U+202E is one invisible character that keeps doing more harm than good for decades!!
Last point, there is another way to make invisible characters using fonts. Fonts are files that store glyphs (pictures of characters), that present the characters' look. If the font does not contain a glyph for a codepoint, a substitute/replacement󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 character is displayed (e.g. �, □). But if the font contains a transparent glyph for a codepoint, then the character is invisible, only when displayed by that font. This is the only way to have invisible characters in the ASCII range (for example can you see →``← U+000C Form Feed).
Hope you find this explanation helpful and may you check strings for invisible characters more often 󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩😉
Yes you can use invisible or blank name on facebook by using some HTML code/symbols.
Method 1:
Copy and paste (ﹺ                         ﹺ) symbols without brackets in your first and last name field.
Method 2:
Click on edit name. Now copy and paste following symbol in first and last name.
ՙՙ ՙՙ
An invisible Character is ​, or U+200b
​

Weird characters in a Microsoft Word document won't export/can't be searched

I have a document which has been sloppily authored. It's a dictionary that contains cyrillic characters. Most of the dictionary is manageable, but I'm stuck with one thing I need help with. Words have accented letters in them and they're mostly formatted properly as a letter with a unicode accent (thus forming a single letter). However there are some very peculiar letters that look similar for example to: a;´ (where "a" is any arbitrary cyrillic letter). You'd expect á in its place. However it wouldn't be a problem per se if only this thing could be exported to, say HTML and manipulated in a text editor. The problem is that Word treats this "thing" as a single character/entity and
when exporting it is COMPLETELY omitted
when copied it can only be pasted into Notepad (which translates it into three separate characters), when being pasted into WordPad it just won't appear at all.
when a search is run in Word it won't find the letter, neither the actual character nor the exactly copied/pasted combination.
the letter will disappear when the document is opened in any other software, such as Libre Office
At this point I'm trying to:
understand what this combination is exactly
run a search/replace operation to find and weed out all of those errors
Here's a sample Word file.
Here's a screenshot of the word/letter in question:
which when typed correctly should appear like "скре́пка".
The 'character' appears to be a Word field of type 'eq' (equation). Here is the field with toggled field codes:
If it is a large document you could try to create a VBA routine that removes the fields and replaces them with corresponding characters.
Assuming that #Anonimista’s analysis is correct, as I think it is, you could fix the file by running some search and replace operations in Word, replacing e.g. ^19eq \o(е;´)^21 by е́ (the latter is Cyrillic letter е followed by combining acute accent U+0301). This is dull because you would need to do this for each vowel separately (and for uppercase vowels too). But I cannot find a way to use wildcards in this context; the codes ^19 and ^21 for start and end of field work only when wildcards are not enabled.

What are the unicode ranges for Hindi accented characters?

I'm trying to gather a Unicode list of all the 'o' like shapes in the Hindi character-set. In fact, a list of any characters (in any language) that makes uses of separate characters to indicate an accent would be better.
I intend to use this unicode-list in a RegExp.
I been trying to edit a list of character-ranges by outputting them in an Input TextField, but editing this text causes weird issues (the keyboard-cursor isn't place on the correct character, selections suddenly dissappear / incorrectly warps... in other words... HINDI HELL!)
I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player textfield. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.
Anyways, all I want is a list of the accents.
An example of a few are in the image below (but I would need ALL accents):
Thanks!
You can find pdf's containing lists of unicode ranges, grouped by language, here: http://unicode.org/charts/
For Hindi, you probably want Devanagari or Devanagari Extended.
Here is the character class for Devanagari combining marks:
[\u901\u902\u903\u93c\u93e\u93f\u940\u941\u942\u943
\u944\u945\u946\u947\u948\u949\u94a\u94b\u94c\u94d
\u951\u952\u953\u954\u962\u963]
This is only the basic Devanagari block (not Devanagari Extended).
If you want the complete set (for all languages), you can do it problematically.
You start from the Unicode date file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by TR-44 (http://unicode.org/reports/tr44/#Property_Definitions)
You can use the Canonical_Combining_Class field (see at http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want.
Can't be more precise, because "accent" a bit vague :-)
You might even have to also look at General_Category to get the filter right (and exclude certain marks, or symbols, or punctuation).
And a script doing this would definitely be better than trying to mess with text editors.
One of the characteristics of combining characters is that they combine :-)
So you might get all kind of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)

Why is this A0 character appearing in my HTML::Element output?

I'm parsing an HTML document with a couple Perl modules: HTML::TreeBuilder and HTML::Element. For some reason whenever the content of a tag is just , which is to be expected, it gets returned by HTML::Element as a strange character I've never seen before:
alt text http://www.freeimagehosting.net/uploads/2acca201ab.jpg
I can't copy the character so can't Google it, couldn't find it in character map, and strangely when I search with a regular expression, \w finds it. When I convert the returned document to ANSI or UTF-8 it disappears altogether. I couldn't find any info on it in the HTML::Element documentation either.
How can I detect and replace this character with something more useful like null and how should I deal with strange characters like this in the future?
The character is "\xa0" (i.e. 160), which is the standard Unicode translation for . (That is, it's Unicode's non-breaking space.) You should be able to remove them with s/\xa0/ /g if you like.
The character is non-breaking space which is what stands for:
In word processing and digital typesetting, a non-breaking space (" ") (also called no-break space, non-breakable space (NBSP), hard space, or fixed space) is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.
In HTML, the common non-breaking space, which is the same width as the ordinary space character, is encoded as   or  . In Unicode, it is encoded as U+00A0.