Invisible characters - ASCII - facebook

Are there any invisible characters? I have checked Google for invisible characters and ended up with many answers but I'm not sure about those. Can someone on Stack Overflow tell me more about this?
Also I have checked a profile on Facebook and found that the user didn't have any name to his profile? How can this be possible? Is it some database issue? Hacking or something?
When I searched the Internet, I found that 200D is supposedly an ASCII value for an invisible character. Is that true?

I just went through the character map to get these.
They are all in Calibri.
Number    Name                         HTML Code    Appearance
------    -------------------------    ---------    ----------
U+2000    En Quad                      &#8192;      " "
U+2001    Em Quad                      &#8193;      " "
U+2002    En Space                     &#8194;      " "
U+2003    Em Space                     &#8195;      " "
U+2004    Three-Per-Em Space           &#8196;      " "
U+2005    Four-Per-Em Space            &#8197;      " "
U+2006    Six-Per-Em Space             &#8198;      " "
U+2007    Figure Space                 &#8199;      " "
U+2008    Punctuation Space            &#8200;      " "
U+2009    Thin Space                   &#8201;      " "
U+200A    Hair Space                   &#8202;      " "
U+200B    Zero-Width Space             &#8203;      "​"
U+200C    Zero Width Non-Joiner        &#8204;      "‌"
U+200D    Zero Width Joiner            &#8205;      "‍"
U+200E    Left-To-Right Mark           &#8206;      "‎"
U+200F    Right-To-Left Mark           &#8207;      "‏"
U+202F    Narrow No-Break Space        &#8239;      " "

How a character is represented is up to the renderer, but the server may also strip out certain characters before sending the document.
You can also have untitled YouTube videos like https://www.youtube.com/watch?v=dmBvw8uPbrA by using the Unicode character ZERO WIDTH NON-JOINER (U+200C), or &zwnj; in HTML. The code block below should contain that character:
‌‌

There is actually a truly invisible character: U+FEFF.
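This character is called the Byte Order Mark (its formal Unicode name is ZERO WIDTH NO-BREAK SPACE). It is tied to the Unicode encodings UTF-8, UTF-16, and UTF-32, where it can appear at the very start of a file or stream to signal the encoding and byte order. It is a somewhat confusing concept. The Byte Order Mark, or BOM for short, is an invisible character that doesn't take up any space. You can copy the character below between the > and <.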
Here is the character:
> <
How to catch this character in action:
Copy the character between the > and <,
Write a line of text, then randomly put your caret in the line of text
Paste the character in the line.
Go to the beginning of the line and press and hold the right arrow key.
You will notice that when your caret gets to the place where you pasted the character, it briefly stops for around half a second. This is because the caret is passing over the invisible character. Just because you can't see it doesn't mean it isn't there. The caret still sees a character where you pasted the BOM and passes through it. Since the BOM is invisible, the caret looks like it has paused for a brief moment. You can paste the BOM multiple times in one spot and redo the steps above to make the effect more obvious. Good luck!
EDIT: Sadly, Stack Overflow doesn't like the character. Here is an example from w3.org: https://www.w3.org/International/questions/examples/phpbomtest.php
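If copying and pasting the character is unreliable, the same thing can be observed from code. The following is only a minimal Python sketch (the variable names are made up for illustration): it shows that U+FEFF counts as a real character in a string, how it is encoded in UTF-8, and that Python's "utf-8-sig" codec drops it when it appears at the very start of the data.

```python
import unicodedata

BOM = "\ufeff"

# The formal Unicode name differs from the nickname "byte order mark".
print(unicodedata.name(BOM))      # ZERO WIDTH NO-BREAK SPACE
print(unicodedata.category(BOM))  # Cf (a format character)

# Pasted into the middle of a string, it is still a character of its own.
text = "ab" + BOM + "cd"
print(len(text))                  # 5
print(text.encode("utf-8"))       # b'ab\xef\xbb\xbfcd'

# At the very start of UTF-8 data, the "utf-8-sig" codec removes it,
# while the plain "utf-8" codec keeps it as the first character.
data = BOM.encode("utf-8") + b"hello"
decoded_plain = data.decode("utf-8")
decoded_sig = data.decode("utf-8-sig")
print(repr(decoded_plain), repr(decoded_sig))
```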

Other answers are correct - whether a character is invisible or not depends on what font you use. This seems to be a pretty good list to me of characters that are truly invisible (not even space). It contains some chars that the other lists are missing.
'\u2060', // Word Joiner
'\u2061', // FUNCTION APPLICATION
'\u2062', // INVISIBLE TIMES
'\u2063', // INVISIBLE SEPARATOR
'\u2064', // INVISIBLE PLUS
'\u2066', // LEFT - TO - RIGHT ISOLATE
'\u2067', // RIGHT - TO - LEFT ISOLATE
'\u2068', // FIRST STRONG ISOLATE
'\u2069', // POP DIRECTIONAL ISOLATE
'\u206A', // INHIBIT SYMMETRIC SWAPPING
'\u206B', // ACTIVATE SYMMETRIC SWAPPING
'\u206C', // INHIBIT ARABIC FORM SHAPING
'\u206D', // ACTIVATE ARABIC FORM SHAPING
'\u206E', // NATIONAL DIGIT SHAPES
'\u206F', // NOMINAL DIGIT SHAPES
'\u200B', // Zero-Width Space
'\u200C', // Zero Width Non-Joiner
'\u200D', // Zero Width Joiner
'\u200E', // Left-To-Right Mark
'\u200F', // Right-To-Left Mark
'\u061C', // Arabic Letter Mark
'\uFEFF', // Byte Order Mark
'\u180E', // Mongolian Vowel Separator
'\u00AD' // soft-hyphen
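As a rough illustration of how such a list might be used, here is a small Python sketch that reports and removes the characters listed above. The helper names find_invisible and strip_invisible are hypothetical, and the set is only as complete as the list itself.

```python
# Code points copied from the list above; nothing else is treated as invisible.
INVISIBLE = {
    "\u2060", "\u2061", "\u2062", "\u2063", "\u2064",
    "\u2066", "\u2067", "\u2068", "\u2069",
    "\u206A", "\u206B", "\u206C", "\u206D", "\u206E", "\u206F",
    "\u200B", "\u200C", "\u200D", "\u200E", "\u200F",
    "\u061C", "\uFEFF", "\u180E", "\u00AD",
}

def find_invisible(text):
    """Return (index, 'U+XXXX') pairs for every listed character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in INVISIBLE]

def strip_invisible(text):
    """Return text with every listed character removed."""
    return "".join(ch for ch in text if ch not in INVISIBLE)

sample = "pass\u200bword\u00ad"
print(find_invisible(sample))   # [(4, 'U+200B'), (9, 'U+00AD')]
print(strip_invisible(sample))  # password
```

Blindly stripping is not always safe: the zero width joiner and non-joiner are meaningful in some scripts and in emoji sequences, so for user-visible text it may be better to report these characters than to delete them.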

The question about invisible characters in Unicode deserves a more thorough explanation.
Short answer - there are lots
Here are 134 invisible characters →­؜᠎​‌‍‎‏‪‫‬‭‮⁠⁡⁢⁣⁤⁧⁦⁨⁩𝅳𝅴𝅵𝅶𝅷𝅸𝅹𝅺󠀁󠀠󠀡󠀢󠀣󠀤󠀥󠀦󠀧󠀨󠀩󠀪󠀫󠀬󠀭󠀮󠀯󠀰󠀱󠀲󠀳󠀴󠀵󠀶󠀷󠀸󠀹󠀺󠀻󠀼󠀽󠀾󠀿󠁀󠁁󠁂󠁃󠁄󠁅󠁆󠁇󠁈󠁉󠁊󠁋󠁌󠁍󠁎󠁏󠁐󠁑󠁒󠁓󠁔󠁕󠁖󠁗󠁘󠁙󠁚󠁛󠁜󠁝󠁞󠁟󠁠󠁡󠁢󠁣󠁤󠁥󠁦󠁧󠁨󠁩󠁪󠁫󠁬󠁭󠁮󠁯󠁰󠁱󠁲󠁳󠁴󠁵󠁶󠁷󠁸󠁹󠁺󠁻󠁼󠁽󠁾󠁿← and here is their escaped ASCII representation: U+00AD U+061C U+180E U+200B U+200C U+200D U+200E U+200F U+202A U+202B U+202C U+202D U+202E U+2060 U+2061 U+2062 U+2063 U+2064 U+2067 U+2066 U+2068 U+2069 U+206A U+206B U+206C U+206D U+206E U+206F U+FEFF U+1D173 U+1D174 U+1D175 U+1D176 U+1D177 U+1D178 U+1D179 U+1D17A U+E0001 U+E0020 U+E0021 U+E0022 U+E0023 U+E0024 U+E0025 U+E0026 U+E0027 U+E0028 U+E0029 U+E002A U+E002B U+E002C U+E002D U+E002E U+E002F U+E0030 U+E0031 U+E0032 U+E0033 U+E0034 U+E0035 U+E0036 U+E0037 U+E0038 U+E0039 U+E003A U+E003B U+E003C U+E003D U+E003E U+E003F U+E0040 U+E0041 U+E0042 U+E0043 U+E0044 U+E0045 U+E0046 U+E0047 U+E0048 U+E0049 U+E004A U+E004B U+E004C U+E004D U+E004E U+E004F U+E0050 U+E0051 U+E0052 U+E0053 U+E0054 U+E0055 U+E0056 U+E0057 U+E0058 U+E0059 U+E005A U+E005B U+E005C U+E005D U+E005E U+E005F U+E0060 U+E0061 U+E0062 U+E0063 U+E0064 U+E0065 U+E0066 U+E0067 U+E0068 U+E0069 U+E006A U+E006B U+E006C U+E006D U+E006E U+E006F U+E0070 U+E0071 U+E0072 U+E0073 U+E0074 U+E0075 U+E0076 U+E0077 U+E0078 U+E0079 U+E007A U+E007B U+E007C U+E007D U+E007E U+E007F
Are there more? Yes.
Are there invisible characters in the ASCII range? Depends on the font.
Long answer - ready? set. go!
The Unicode Standard enables anyone to read and write in their own language. To do that, it lists unique code points󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 (U+hex), that are categorized into letters (D,ž,Dž,ʶ,愛,𓂀), symbols (+∊≠,£¥₪,҂˚˟˿), marks (ם֑֟֯ ,ী,◌҉ ), separators ( , , , ,  ), emojis (😊,🙏,👍), and much more. ASCII/Basic Latin is the very beginning of the table and more code points are added every update.
Simply listing unique numbers for characters is not enough. Characters can change their shape or change the sentence depending on the context. To support that, every code point comes with a list of properties . These properties may define the width (AA), its role in the sentence (-“.), its direction (cכ), and much more.
Most invisible characters have the property General_Category=Format (other answers here included Spaces as well). These characters have a supporting role to a word/sentence. Here are some examples:
General Punctuation Block -
Invisible characters that are an integral part of some writing systems and emojis. Common ones are Zero width joiner (U+200D), Zero width non joiner (U+200C), Word joiner (U+2060)
Explicit Bidirectional Formatting characters - 12 invisible characters󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 used to enforce different direction constraints on the sentence. Helping present text to more than 300 million speakers of right-to-left languages e.g. Hebrew or Arabic.
Tags - 97 invisible characters that mirror ASCII (just drop the E and you get characters in the ASCII range). These are used as emoji modifiers and digital signatures to prove who copied your text.
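Both ideas above can be checked from code. The Python sketch below is only an illustration under stated assumptions: is_format_char, encode_tags and decode_tags are hypothetical helper names, and the tag mapping simply adds or subtracts 0xE0000 for the printable ASCII range, which is what "drop the E" amounts to.

```python
import unicodedata

def is_format_char(ch):
    """True when the character has General_Category=Format (Cf)."""
    return unicodedata.category(ch) == "Cf"

def encode_tags(ascii_text):
    """Map printable ASCII (U+0020..U+007E) to the invisible tag characters."""
    return "".join(
        chr(ord(ch) + 0xE0000) for ch in ascii_text if 0x20 <= ord(ch) <= 0x7E
    )

def decode_tags(text):
    """Recover the ASCII text hidden in tag characters U+E0020..U+E007E."""
    return "".join(
        chr(ord(ch) - 0xE0000) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )

hidden = "visible" + encode_tags("hi there")
print(len(hidden))          # 15 code points, although only 7 are drawn
print(decode_tags(hidden))  # hi there
print(is_format_char("\u200d"), is_format_char(" "))  # True False
```

Running decode_tags over text copied from a web page is one way to check whether it carries such a hidden signature.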
This all leads to talk about exploiting invisible characters for homograph attack/visual spoofing. Sometimes it's harmless like invisible names and titles but in lots of cases they are used maliciously. For example U+202E is one invisible character that keeps doing more harm than good for decades!!
Last point, there is another way to make invisible characters using fonts. Fonts are files that store glyphs (pictures of characters), that present the characters' look. If the font does not contain a glyph for a codepoint, a substitute/replacement󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 character is displayed (e.g. �, □). But if the font contains a transparent glyph for a codepoint, then the character is invisible, only when displayed by that font. This is the only way to have invisible characters in the ASCII range (for example can you see →``← U+000C Form Feed).
Hope you find this explanation helpful and may you check strings for invisible characters more often 󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩😉

Yes you can use invisible or blank name on facebook by using some HTML code/symbols.
Method 1:
Copy and paste (ﹺ                         ﹺ) symbols without brackets in your first and last name field.
Method 2:
Click on edit name. Now copy and paste following symbol in first and last name.
ՙՙ ՙՙ

An invisible character is &#8203;, or U+200B
​

Related

Why does the TextBox character ordering change with FlowDirection RightToLeft

In my UWP app I have a textbox.
I want the user to be able to type Farsi / Persian text (right to left) into the textbox, so I set the FlowDirection property to RightToLeft.
The text can be entered and is displayed correctly:
When I save the text and inspect the property during debugging, I see the same character order as on screen:
The same character order applies for the stored value when viewed with mssql management studio:
When I add a '.' or a '!' at the end of the text, the WPF textbox still displays what I expect,
but the text I get back from the text property puts the exclamation mark at the right side of the string.
It is also stored this way in the sql database:
When loading the database value (with the exclamation point on the right) into the textbox it shows the exclamation point correctly on the left side. There must be some magic happening here that I am not aware of, or maybe the problem is that the debug preview / mssql preview does not support displaying RTL values.
My problem is that this magic does not work in other situations.
When I load the database value and put it in a Microsoft Word document, it seems to do no conversion and places the text in the document exactly as it is in the database, resulting in the exclamation point being shown on the 'wrong' side.
I would like to understand the 'magic' that takes place in displaying / storing these strings, so I can output it correctly in MS Word. And Yes, I have set the paragraph where I output the values in word to RTL.
In Unicode, all characters have directional properties that get used in the Unicode Bidirectional Algorithm for determining how characters are ordered visually. Most characters have a "strong" directional property, but not all. In particular, most punctuation characters are considered directionally neutral.
The visual ordering of neutral characters is determined by the characters that surround them. For example, the exclamation mark ! is neutral; if it occurs between two left-to-right characters, it will be treated as though it also is a left-to-right character. But if it occurs between two right-to-left characters, it will be treated as though it is a right-to-left character.
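The directional property of any character can be looked up directly. As a small Python illustration (the sample characters are my own choice, not taken from the question), unicodedata.bidirectional returns the bidi class: L and R/AL are strong classes, while ON ("other neutral") is what the exclamation mark gets.

```python
import unicodedata

samples = {
    "a": "Latin letter",
    "\u05d0": "Hebrew letter alef",
    "\u0633": "Arabic letter seen",
    "!": "exclamation mark",
    "\u200e": "left-to-right mark",
    "\u200f": "right-to-left mark",
}

# Map each sample character to its bidi class.
classes = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {classes[ch]}")
```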
In your example, though, the exclamation mark occurs at the end of the string. So, it has a strong-direction character on one side, but nothing on the other side. In this case, another factor comes into play, which is that the paragraph as a whole has a base direction.
The Unicode Bidi Algorithm allows two ways that apps can handle the paragraph base direction:
the app can set the base direction explicitly, regardless of the string content in the paragraph; or
the app can let the base direction be derived implicitly from the string: the base direction is determined by the first strong-directional character in the string.
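The second option can be sketched in a few lines of Python. This is a deliberately simplified version, not the full algorithm: the real rules (P2 and P3 in UAX #9) also skip characters inside directional isolates and stop at a paragraph separator, and the function name base_direction is hypothetical.

```python
import unicodedata

def base_direction(text, default="ltr"):
    """Guess the paragraph base direction from the first strong character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":            # strong left-to-right
            return "ltr"
        if bidi in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic, ...)
            return "rtl"
    return default                 # no strong character found

print(base_direction("\u0633\u0644\u0627\u0645!"))  # rtl
print(base_direction("Hello!"))                     # ltr
print(base_direction("!?"))                         # ltr (the default)
```

An Arabic-script string is therefore detected as RTL by a control that derives the direction from content, but a control with a fixed LTR base direction never runs this step, which is why the trailing exclamation mark ends up on the right there.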
In your UWP app, when you set the flow direction to RTL, then the paragraph base direction (for purposes of the Bidi Algorithm) is RTL. With an Arabic-script string that ends with the exclamation mark, the directionality of the exclamation is set to RTL because of the paragraph base direction, and so it appears at the left end of the string. But when you view the control property value in an IDE, the IDE is presenting that property string in a control that has LTR base direction. That is causing the exclamation at the logical end of the string to appear visually at the right end.
Note that apps will often conflate base direction and alignment, though these are really distinct things. In Word, you can set the paragraph base direction in the Paragraph settings dialog, and when you do it will set the alignment to match by default:
But you can override the paragraph alignment to have a RTL base direction with left alignment:
Note that the visual order of the exclamation mark is affected by the paragraph base direction but not by the alignment. The Unicode Bidi Algorithm doesn't pay attention to the alignment.
This article gives a good overview of how the Bidi Algorithm works: https://www.w3.org/International/articles/inline-bidi-markup/uba-basics.
If you want to explore how the Bidi Algorithm works in more detail, you can read the spec, Unicode Standard Annex #9, Unicode Bidirectional Algorithm; and check out this Unicode utility that explains how the rules of the algorithm apply to sample strings you can provide.

What kind of Characters are these? Are these unicode? How to write them? [duplicate]

I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a character is supposed to move horizontally across a line and stay within a certain "container". Obviously the Zalgo text is moving vertically and doesn't seem to be restricted to any space.
Is this a bug/flaw/exploit/hack in Unicode? Are these individual characters with weird properties? "What" is happening here?
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀ơ̯̗̱̘̮͒̄̀̈ͤ̀͡w͓̲͙͖̥͉̹͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮̙̣͓͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸̤͓̞̱̫ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ ̢͈͙͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx͎̬̠͇̌ͤ̓̂̓͐͐́͋͡ț̗̹̝̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤͍͇̰̄͗ͭ̃͗ͮ̐o̢̯̻̰̼͕̾ͣͬ̽̔̍͟ͅr̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬͙̟̮͕ͤ̌͗ͩ̕͡
The text uses combining characters, also known as combining marks. See section 2.11, Combining Characters, in the Unicode Standard (PDF).
In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box of a given height. Combining marks may be rendered above, below, or inside a base character.
So you can easily construct a character sequence, consisting of a base character and “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning of course, and even a monkey could produce it (e.g., given a keyboard with suitable driver).
And you can mix “combining above” and “combining below” marks.
The sample text in the question starts with:
LATIN CAPITAL LETTER H - H
COMBINING LATIN SMALL LETTER T - ͭ
COMBINING GREEK KORONIS - ̓
COMBINING COMMA ABOVE - ̓
COMBINING DOT ABOVE - ̇
Zalgo text works because of combining characters. These are special characters that modify the character that comes before them. For example:
y + &#x0306; = y̆ which actually is
y + &#x0306; = y&#x0306;
Since you can stack them one atop the other you can produce the following:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which actually is:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
The same goes for putting stuff underneath:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
that in fact is:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
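The same stacking can be reproduced programmatically. The Python sketch below is an illustration only: the repeat counts are arbitrary, U+0330 is my assumed choice of a "below" mark, and strip_marks is a hypothetical helper that drops every character whose general category starts with M (mark).

```python
import unicodedata

BREVE = "\u0306"        # COMBINING BREVE, drawn above the base letter
TILDE_BELOW = "\u0330"  # COMBINING TILDE BELOW, drawn below it (assumed choice)

# One base letter followed by many marks: many code points, one visual column.
stacked = "y" + BREVE * 18 + TILDE_BELOW * 15
print(stacked)
print(len(stacked))     # 34

# Marks have a general category starting with "M"; the base letter does not.
print(unicodedata.category("y"), unicodedata.category(BREVE))  # Ll Mn

def strip_marks(text):
    """Remove all combining marks, leaving only the base characters."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))

print(strip_marks(stacked))  # y
```

Note that strip_marks also removes legitimate accents from decomposed text, so it is a blunt tool for cleaning up Zalgo rather than a general-purpose filter.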
In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F.
More about it here
To produce a list of combining diacritical marks you can use the following script (since links keep on dying)
// U+0300..U+036F is 768..879 inclusive, so the loop has to include 879.
for (var i = 0x0300; i <= 0x036F; i++) { console.log(String.fromCodePoint(i) + " &#" + i + ";"); }
Also check em out
Mͣͭͣ̾ Vͣͥͭ͛ͤͮͥͨͥͧ̾

Two different eye emojis?

As far as I knew, there are currently two emojis for eyes. The pair of eyes (U+1F440) with hex code f09f9180 (👀), and a single eye (U+1F441) with hex code f09f9181 (👁).
I now found when using the emojis of the keyboard in my phone that another eye emoji exists, with hex code f09f9181efb88f (👁️).
The gajim messenger on the PC, and the Conversations app on the mobile phone, can display both. The gajim emoji-chooser only contains the short sequence and the Swiftkey-Keyboard Emoji-Chooser only the longer one.
When I copy and paste the emojis i.e. in the Firefox URL address bar, they look the same (blue eye, while the messengers both display them in black). When I Google for the emojis, I only find pages describing the shorter code point.
Firefox renders both emojis the same, but Vivaldi (Chromium based) shows the one with the shorter code point as narrow black and white emoji and the other one as larger brown eye.
When I Google for the hex dump, I find a lot of emojipedia sites for the shorter dump, and nothing useful at all for the longer one.
Is there somewhere any documentation about the additional emoji? Why aren't both emojis available in both emoji choosers?
f0 9f 91 80 is the UTF-8 encoded form of codepoint U+1F440.
f0 9f 91 81 is the UTF-8 encoded form of codepoint U+1F441.
f0 9f 91 81 ef b8 8f is the UTF-8 encoded form of codepoints U+1F441 U+FE0F.
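These decodings are easy to verify. As a quick Python check (the helper name code_points is made up for this example), each hex dump from the question is turned back into bytes, decoded as UTF-8, and listed code point by code point:

```python
import unicodedata

def code_points(hex_dump):
    """Decode a UTF-8 hex dump and return its code points as 'U+XXXX' strings."""
    text = bytes.fromhex(hex_dump).decode("utf-8")
    return [f"U+{ord(ch):04X}" for ch in text]

for dump in ["f09f9180", "f09f9181", "f09f9181efb88f"]:
    print(dump, code_points(dump))

# The extra three bytes ef b8 8f are a separate character of their own.
print(unicodedata.name("\ufe0f"))  # VARIATION SELECTOR-16
```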
U+FE0F is a Variation Selector:
Variation Selectors is a Unicode block containing 16 Variation Selector format characters (designated VS1 through VS16). They are used to specify a specific glyph variant for a Unicode character. They are currently used to specify standardized variation sequences for mathematical symbols, emoji symbols, 'Phags-pa letters, and CJK unified ideographs corresponding to CJK compatibility ideographs. At present only standardized variation sequences with VS1, VS15 and VS16 have been defined.
Where U+FE0F is VARIATION SELECTOR-16:
U+FE0F was added to Unicode in version 3.2 (2002). It belongs to the block Variation Selectors in the Basic Multilingual Plane.
This character is a Nonspacing Mark and inherits its script property from the preceding character.
The glyph is not a composition. It has an Ambiguous East Asian Width. In bidirectional context it acts as Nonspacing Mark and is not mirrored. In text U+FE0F behaves as Combining Mark regarding line breaks. It has type Extend for sentence and Extend for word breaks. The Grapheme Cluster Break is Extend.
This codepoint may change the appearance of the preceding character. If that is a symbol, dingbat or emoji, U+FE0F forces it to be rendered as a colorful image as compared to a monochrome text variant. The Unicode standard defines some standardized variants. See also “Unicode symbol as text or emoji” for a discussion of this codepoint.
In other words, U+FE0F tells VS-aware software to render U+1F441 as a colorful emoji instead of as monochromatic text.
The singular ‘👁’ is used as an emoji, but is defined as being text-style (i.e. black-and-white rather than colourful) by default. This isn’t implemented consistently across all platforms, however, so sometimes the character will also display as emoji style instead. In order to explicitly force one or the other style, the characters U+FE0E and U+FE0F can be appended to 👁 to make it appear as text style (👁︎) or emoji style (👁️) respectively. Because of the inconsistencies I mentioned, some devices and applications automatically add U+FE0F to the character (resulting in the longer code your phone keyboard produced), while others leave the character as-is (leaving just the code for the eye itself).
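To make the two forced styles concrete, here is a short Python sketch that builds both sequences and prints their UTF-8 bytes. The helper strip_variation_selectors is hypothetical; it shows one possible way to compare strings that differ only in these selectors.

```python
EYE = "\U0001F441"

text_style = EYE + "\uFE0E"   # request the monochrome text presentation
emoji_style = EYE + "\uFE0F"  # request the colourful emoji presentation

print(EYE.encode("utf-8").hex())          # f09f9181
print(text_style.encode("utf-8").hex())   # f09f9181efb88e
print(emoji_style.encode("utf-8").hex())  # f09f9181efb88f

def strip_variation_selectors(text):
    """Drop U+FE0E and U+FE0F so both styles compare equal to the bare emoji."""
    return text.replace("\uFE0E", "").replace("\uFE0F", "")

print(strip_variation_selectors(emoji_style) == EYE)  # True
```

How each sequence is actually drawn still depends on the font and application, as described above.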

Why is this A0 character appearing in my HTML::Element output?

I'm parsing an HTML document with a couple of Perl modules: HTML::TreeBuilder and HTML::Element. For some reason, whenever the content of a tag is just &nbsp;, which is to be expected, it gets returned by HTML::Element as a strange character I've never seen before:
alt text http://www.freeimagehosting.net/uploads/2acca201ab.jpg
I can't copy the character so can't Google it, couldn't find it in character map, and strangely when I search with a regular expression, \w finds it. When I convert the returned document to ANSI or UTF-8 it disappears altogether. I couldn't find any info on it in the HTML::Element documentation either.
How can I detect and replace this character with something more useful like null and how should I deal with strange characters like this in the future?
The character is "\xa0" (i.e. 160), which is the standard Unicode translation for &nbsp;. (That is, it's Unicode's non-breaking space.) You should be able to replace them with ordinary spaces using s/\xa0/ /g if you like.
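The same translation can be seen outside Perl. The following Python sketch is an analogue for illustration only, not the HTML::Element behaviour itself: the standard html.unescape function turns the entity into U+00A0, which can then be replaced or detected like any other character.

```python
import html

# The named and the numeric entity both decode to the same character.
nbsp = html.unescape("&nbsp;")
print(repr(nbsp))                           # '\xa0'
print(nbsp == html.unescape("&#160;"))      # True
print(nbsp.isspace())                       # True: it is a kind of whitespace

# Replace it with an ordinary space, like s/\xa0/ /g in Perl.
cell = "price:" + nbsp + "42"
cleaned = cell.replace("\xa0", " ")
print(cleaned)                              # price: 42

# A tag whose only content was &nbsp; is effectively empty after stripping.
is_empty = nbsp.strip() == ""
print(is_empty)                             # True
```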
The character is the non-breaking space, which is what &nbsp; stands for:
In word processing and digital typesetting, a non-breaking space (" ") (also called no-break space, non-breakable space (NBSP), hard space, or fixed space) is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.
In HTML, the common non-breaking space, which is the same width as the ordinary space character, is encoded as &nbsp; or &#160;. In Unicode, it is encoded as U+00A0.