Bad formatting of arabic text when mixed with 1% - unicode

here is the code
<code>
<source> %1 added %2 to the group </source>
<translation type="unfinished"> 1% أضاف %2 للمجموعة</translation>
</code>
Problem is that i want the text to be reversed in translation, so instead of "1% أضاف %2 للمجموعة", i want to have: [للمجموعة ("percent" 2) أضاف("percent" 1)]
For some reason even notepad doesn't allow you to put %1 at the far right of an Arabic sentence because Arabic is right to left, so maybe you can't start a sentence with %1.
Are there any workarounds for this ?

You need to force the writing direction for the numbers and the percentage symbol. All of them are left-to-right characters. This is because the beginning of your string "1%" is written on the left. After that, the Arabic letters are all right-to-left, while the "2%" is left-to-right again, but immediately returns to arabic right-to-left. So the output is perfect correct.
To achieve what you intend, you have to force the right-to-left writing direction at the beginning of your string. Add a RTL character (‏ or U+200F in Unicode) before the first position start of your string and it displays as follows:
‏1% أضاف %2 للمجموعة
Inserting more RTL characters at the crucial positions (directly before the numbers and percentages), you can make it look like this:
‏1‏% أضاف ‏%2 للمجموعة

Related

Check Half width Katakana character in COBOL

I'm working on RedHat6 and using COBOL. I wanna check every single digit of variable, if it's half width --> CONTINUE, Else --> DISPLAY ERROR. Basicly I list all half width characters in WHEN Clause of EVALUATE statement. Like this:
PERFORM VARYING WK-IX FROM 1 BY 1 UNTIL WK-IX > WK-LENGTH
EVALUATE WK-FORMAT-CHK-VALUE(WK-IX:1)
WHEN 'A'
WHEN 'B'
WHEN 'C'
CONTINUE
WHEN OTHER
DISPLAY 'ERROR'
END-EVALUATE
END-PERFORM.
Everything is OK but when compile I have problem with half width katakana character. It said: "The ending quotation mark of the literal is missing. The characters at the end of Area B are assumed to be a literal" with all line check these character:
ツ テ ト ナ ニ ヌ ネ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ヤ ユ ヨ ラ リ ル レ ロ ワ ヲ ン
Although I sure there isn't any line of code miss the ending quotation mark. Like this:
WHEN 'ツ'
WHEN 'テ'
WHEN 'ト'
But these character is OK and I don't know why:
ア イ ウ エ オ カ キ ク ケ コ サ シ ス セ ソ タ チ
Anyone can help me? Please!
Sorry for my bad English!
Because the Katakana character set is considered a multi-byte character set (as mentioned by Bill Woodger), you will need to ensure that the NSYMBOL and DBCS compile options are enabled. After that, you should be able to define the literals like this:
EVALUATE WK-FORMAT-CHK-VALUE(WK-IX:1)
WHEN N'ツ'
WHEN N'テ'
WHEN N'ト'
do something
WHEN OTHER
do something else
END-EVALUATE
the N will tell the program that this is a national character and as such is multi-byte.
Your input to the evaluate clause will also need to be defined as a PIC N rather than a PIC X. A PIC X field will not recognise double byte characters.

Invisible characters - ASCII

Are there any invisible characters? I have checked Google for invisible characters and ended up with many answers but I'm not sure about those. Can someone on Stack Overflow tell me more about this?
Also I have checked a profile on Facebook and found that the user didn't have any name to his profile? How can this be possible? Is it some database issue? Hacking or something?
When I searched over Internet, I found that 200D is an ASCII value with an invisible character. Is it true?
I just went through the character map to get these.
They are all in Calibri.
Number    Name      HTML Code    Appearance
------    --------------------  ---------    ----------
U+2000    En Quad           " "
U+2001    Em Quad           " "
U+2002    En Space        " "
U+2003    Em Space        " "
U+2004  Three-Per-Em Space      " "
U+2005  Four-Per-Em Space         " "
U+2006 Six-Per-Em Space       " "
U+2007 Figure Space         " "
U+2008 Punctuation Space        " "
U+2009 Thin Space         " "
U+200A Hair Space        " "
U+200B Zero-Width Space ​      "​"
U+200C Zero Width Non-Joiner ‌   "‌"
U+200D Zero Width Joiner ‍      "‍"
U+200E Left-To-Right Mark ‎      "‎"
U+200F Right-To-Left Mark ‏      "‏"
U+202F Narrow No-Break Space        " "
How a character is represented is up to the renderer, but the server may also strip out certain characters before sending the document.
You can also have untitled YouTube videos like https://www.youtube.com/watch?v=dmBvw8uPbrA by using the Unicode character ZERO WIDTH NON-JOINER (U+200C), or ‌ in HTML. The code block below should contain that character:
‌‌
There is actually a truly invisible character: U+FEFF.
This character is called the Byte Order Mark and is related to the Unicode 8 system. It is a really confusing concept that can be explained HERE The Byte Order Mark or BOM for short is an invisible character that doesn't take up any space. You can copy the character bellow between the > and <.
Here is the character:
> <
How to catch this character in action:
Copy the character between the > and <,
Write a line of text, then randomly put your caret in the line of text
Paste the character in the line.
Go to the beginning of the line and press and hold the right arrow key.
You will notice that when your caret gets to the place you pasted the character, it will briefly stop for around half a second. This is becuase the caret is passing over the invisible character. Even though you can't see it doesn't mean it isn't there. The caret still sees that there is a character in that area that you pasted the BOM and will pass through it. Since the BOM is invisble, the caret will look like it has paused for a brief moment. You can past the BOM multiple times in an area and redo the steps above to really show the affect. Good luck!
EDIT: Sadly, Stackoverflow doesn't like the character. Here is an example from w3.org: https://www.w3.org/International/questions/examples/phpbomtest.php
Other answers are correct - whether a character is invisible or not depends on what font you use. This seems to be a pretty good list to me of characters that are truly invisible (not even space). It contains some chars that the other lists are missing.
'\u2060', // Word Joiner
'\u2061', // FUNCTION APPLICATION
'\u2062', // INVISIBLE TIMES
'\u2063', // INVISIBLE SEPARATOR
'\u2064', // INVISIBLE PLUS
'\u2066', // LEFT - TO - RIGHT ISOLATE
'\u2067', // RIGHT - TO - LEFT ISOLATE
'\u2068', // FIRST STRONG ISOLATE
'\u2069', // POP DIRECTIONAL ISOLATE
'\u206A', // INHIBIT SYMMETRIC SWAPPING
'\u206B', // ACTIVATE SYMMETRIC SWAPPING
'\u206C', // INHIBIT ARABIC FORM SHAPING
'\u206D', // ACTIVATE ARABIC FORM SHAPING
'\u206E', // NATIONAL DIGIT SHAPES
'\u206F', // NOMINAL DIGIT SHAPES
'\u200B', // Zero-Width Space
'\u200C', // Zero Width Non-Joiner
'\u200D', // Zero Width Joiner
'\u200E', // Left-To-Right Mark
'\u200F', // Right-To-Left Mark
'\u061C', // Arabic Letter Mark
'\uFEFF', // Byte Order Mark
'\u180E', // Mongolian Vowel Separator
'\u00AD' // soft-hyphen
The question about invisible characters in Unicode deserves a more thorough explanation.
Short answer - there are lots
Here are 134 invisible characters →­؜᠎​‌‍‎‏‪‫‬‭‮⁠⁡⁢⁣⁤⁧⁦⁨⁩𝅳𝅴𝅵𝅶𝅷𝅸𝅹𝅺󠀁󠀠󠀡󠀢󠀣󠀤󠀥󠀦󠀧󠀨󠀩󠀪󠀫󠀬󠀭󠀮󠀯󠀰󠀱󠀲󠀳󠀴󠀵󠀶󠀷󠀸󠀹󠀺󠀻󠀼󠀽󠀾󠀿󠁀󠁁󠁂󠁃󠁄󠁅󠁆󠁇󠁈󠁉󠁊󠁋󠁌󠁍󠁎󠁏󠁐󠁑󠁒󠁓󠁔󠁕󠁖󠁗󠁘󠁙󠁚󠁛󠁜󠁝󠁞󠁟󠁠󠁡󠁢󠁣󠁤󠁥󠁦󠁧󠁨󠁩󠁪󠁫󠁬󠁭󠁮󠁯󠁰󠁱󠁲󠁳󠁴󠁵󠁶󠁷󠁸󠁹󠁺󠁻󠁼󠁽󠁾󠁿← and here is their escaped ASCII representation: U+00AD U+061C U+180E U+200B U+200C U+200D U+200E U+200F U+202A U+202B U+202C U+202D U+202E U+2060 U+2061 U+2062 U+2063 U+2064 U+2067 U+2066 U+2068 U+2069 U+206A U+206B U+206C U+206D U+206E U+206F U+FEFF U+1D173 U+1D174 U+1D175 U+1D176 U+1D177 U+1D178 U+1D179 U+1D17A U+E0001 U+E0020 U+E0021 U+E0022 U+E0023 U+E0024 U+E0025 U+E0026 U+E0027 U+E0028 U+E0029 U+E002A U+E002B U+E002C U+E002D U+E002E U+E002F U+E0030 U+E0031 U+E0032 U+E0033 U+E0034 U+E0035 U+E0036 U+E0037 U+E0038 U+E0039 U+E003A U+E003B U+E003C U+E003D U+E003E U+E003F U+E0040 U+E0041 U+E0042 U+E0043 U+E0044 U+E0045 U+E0046 U+E0047 U+E0048 U+E0049 U+E004A U+E004B U+E004C U+E004D U+E004E U+E004F U+E0050 U+E0051 U+E0052 U+E0053 U+E0054 U+E0055 U+E0056 U+E0057 U+E0058 U+E0059 U+E005A U+E005B U+E005C U+E005D U+E005E U+E005F U+E0060 U+E0061 U+E0062 U+E0063 U+E0064 U+E0065 U+E0066 U+E0067 U+E0068 U+E0069 U+E006A U+E006B U+E006C U+E006D U+E006E U+E006F U+E0070 U+E0071 U+E0072 U+E0073 U+E0074 U+E0075 U+E0076 U+E0077 U+E0078 U+E0079 U+E007A U+E007B U+E007C U+E007D U+E007E U+E007F
Are there more? Yes.
Are there invisible characters in the ASCII range? Depends on the font.
Long answer - ready? set. go!
The Unicode Standard enables anyone to read and write in their own language. To do that, it lists unique code points󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 (U+hex), that are categorized into letters (D,ž,Dž,ʶ,愛,𓂀), symbols (+∊≠,£¥₪,҂˚˟˿), marks (ם֑֟֯ ,ী,◌҉ ), separators ( , , , ,  ), emojis (😊,🙏,👍), and much more. ASCII/Basic Latin is the very beginning of the table and more code points are added every update.
Simply listing unique numbers for characters is not enough. Characters can change their shape or change the sentence depending on the context. To support that, every code point comes with a list of properties . These properties may define the width (AA), its role in the sentence (-“.), its direction (cכ), and much more.
Most invisible characters have the property General_Category=Format (other answers here included Spaces as well). Theis characters have a supporting role to a word/sentence. Here are some examples:
General Punctuation Block -
Invisible characters that are an integral part of some writing systems and emojis. Common ones are Zero width joiner (U+200D), Zero width non joiner (U+200C), Word joiner (U+2060)
Explicit Bidirectional Formatting characters - 12 invisible characters󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 used to enforce different direction constraints on the sentence. Helping present text to more than 300 million speakers of right-to-left languages e.g. Hebrew or Arabic.
Tags - 97 invisible characters that mirror ASCII (just drop the E and you get characters in the ASCII range). These are used as emoji modifiers and digital signatures to prove who copied your text.
This all leads to talk about exploiting invisible characters for homograph attack/visual spoofing. Sometimes it's harmless like invisible names and titles but in lots of cases they are used maliciously. For example U+202E is one invisible character that keeps doing more harm than good for decades!!
Last point, there is another way to make invisible characters using fonts. Fonts are files that store glyphs (pictures of characters), that present the characters' look. If the font does not contain a glyph for a codepoint, a substitute/replacement󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩 character is displayed (e.g. �, □). But if the font contains a transparent glyph for a codepoint, then the character is invisible, only when displayed by that font. This is the only way to have invisible characters in the ASCII range (for example can you see →``← U+000C Form Feed).
Hope you find this explanation helpful and may you check strings for invisible characters more often 󠁗󠁲󠁩󠁴󠁴󠁥󠁮󠀠󠁢󠁹󠀠󠁚󠁶󠁩󠀠󠁁󠁺󠁲󠁡󠁮󠀠󠀻󠀩😉
Yes you can use invisible or blank name on facebook by using some HTML code/symbols.
Method 1:
Copy and paste (ﹺ                         ﹺ) symbols without brackets in your first and last name field.
Method 2:
Click on edit name. Now copy and paste following symbol in first and last name.
ՙՙ ՙՙ
An invisible Character is ​, or U+200b
​

Why does Github Flavored Markup only add newlines for lines that start with [\w\<]?

In our site (which is aimed at highly non-technical people), we let them use Markdown when sending emails. That way, they get nice things like bold, italic, etc. Being non-technical, however, they would never get past the “add two lines to make newlines actually work” quirk.
For that reason mainly, we are using a variant of Github Flavored Markdown.
We mainly borrowed this part:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
This works well, but in some cases it doesn’t add the new-lines, and I guess the key to that is the “in very clear cases” part of that comment.
If I interpret it correctly, this is only adding newlines for lines that start with either a word character or a ‘<’.
Does anyone know why that is? Particularly, why ‘<’?
What would be the harm in just adding the two spaces to essentially anything (lines starting with spaces, hyphens, anything)?
'<' character is used at the beginning of a line to quote messages. I guess that is the reason.
The other answer to this question is quite wrong. This has nothing to do with quoting, and the character for markdown quoting is >.
^[\w\<][^\n]*\n+
Let's break the above regex into parts:
^ = anchor start of string.
[\w\<] matches a word character or the start of word boundary. \< is not a literal, but rather a GNU word boundary. See here (do a ctrl+f for \<).
[^\n]* matches any length of non-newline characters
\n matches a new line.
+ is, I believe, a possessive quantifier.
I believe, but am not 100% sure, that this simply is used to set x to a line of text. Then, the heavy work is done with the next line:
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
This says "if x satisfies the regex \n{2} (that is, has two line breaks), leave x as is. Otherwise, strip x and append a newline character.

Meaning of an RTF faulty piece of code

I am working on an RTF file made by someone else on an unknown platform, and everything is interpreted correctly, except some characters, whatever character set I open them from in openoffice. Here is the plain text, after interpretation:
"Même taille que la Terre, même masse, même âgec Vénus a souvent été qualifiée de sœur de la Terre. "
and here is the original ANSI paragraph:
"M\u234\'3fme taille que la Terre, m\u234\'3fme masse, m\u234\'3fme \u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus a souvent \u233\'3ft\u233\'3f qualifi\u233\'3fe de s\u339\'3fur de la Terre."
To zoom in:
"âgec Vénus" becomes "\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"
and finally, what we come up with:
"\uc2 \u61825\'ff\'81\uc1 c"
here \uc2 and \uc1 are to say we are going back and forth between 4-bytes and 2-bytes Unicode encoding.
\u61825 is an unknown Unicode character. Indeed, according to the RTF specification, any UTF character greater than 2^15 should be written in a negative form; negative form with ANSI characters should make the "-" (minus) sign visible to the notepad, am I right? So here already I have something I don't understand, how the RTF writer used by the person who made the rtf file in the first place could have done it. Maybe I missed something in the specification, specific versions, character sets, I don't know. If taken as is, 61825 would correspond to F181 which is in a private area of the Unicode table.
And then, the \'ff\'81 would be some use of the ANSI equivalent field of the whole "specific character" group (whose structure is usually \uN\'XX), to code something that would be 4-byte long. And here again, I could not find:
what is the code page (Windows-1252, ISO-8859-1, other?) being refered to (as in all the other places in the file where a \uN\'XX sequence apears, XX are always 3F, the Windows-1252 code for "?", so it did not give me much information)
what does the \'FF (which looks like some control character inside an escape sequence!) stand for, and then why \'81... Actually, the translation of \u61825 to hex is F181, not FF81...I am lost here!
Finally, what the translated text (in French) would make us expect is the ":" (semicolon): "Same size as Earth, same mass, same age: Venus has often been qualified as Earth's sister". It would make sense. But what rtf writer could imagine such a complicated code for the semicolon?
So again, after 1 hour of search, I open the question to you fellows: does someone recognize this, and could tell me what control word encoding is used, is there a big endian/little endian/2's complement mess here with the 61825, and same with the \'ff\'81, which would assemble as FF81 instead of F181, which itself doesn't mean anything as is...here my question is only to know if there would be a way to find the complete original text back from the bizarre RTF encoding!
what the translated text (in french) would make us expect is the ":" (semicolon
Nearly: it should be the ellipsis. You can see the source text eg here.
The ellipsis should normally be written simply as three periods, but there has traditionally been a separate character representing ellipsis in order better to control their spacing, back before complex text layout algorithms existed that could do automatic glyph replacement. Consequently there exists a Unicode compatibility character U+2026 HORIZONTAL ELLIPSIS to allow round-tripping to legacy encodings such as Windows code page 1252, where it is byte 133.
That, however, is not what has been encoded in your RTF document. That would be too easy.
61825 is an unknown Unicode character.
It's a Private Use Area character, which means it could represent absolutely anything. Word has exported certain common symbol fonts as PUA characters - see this post for the background.
So someone at some point may have used a symbol font where code unit 129 (the 0x81 in U+F181, 61825) maps to something that looks like an ellipsis. Quite what that font is, I have no idea! It doesn't seem to be one of the usual suspects (Symbol, Wingdings, Webdings). You might just have to manually replace U+F181 with U+2026 for now unless you can find out more about the source.

Why is this A0 character appearing in my HTML::Element output?

I'm parsing an HTML document with a couple Perl modules: HTML::TreeBuilder and HTML::Element. For some reason whenever the content of a tag is just , which is to be expected, it gets returned by HTML::Element as a strange character I've never seen before:
alt text http://www.freeimagehosting.net/uploads/2acca201ab.jpg
I can't copy the character so can't Google it, couldn't find it in character map, and strangely when I search with a regular expression, \w finds it. When I convert the returned document to ANSI or UTF-8 it disappears altogether. I couldn't find any info on it in the HTML::Element documentation either.
How can I detect and replace this character with something more useful like null and how should I deal with strange characters like this in the future?
The character is "\xa0" (i.e. 160), which is the standard Unicode translation for . (That is, it's Unicode's non-breaking space.) You should be able to remove them with s/\xa0/ /g if you like.
The character is non-breaking space which is what stands for:
In word processing and digital typesetting, a non-breaking space (" ") (also called no-break space, non-breakable space (NBSP), hard space, or fixed space) is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.
In HTML, the common non-breaking space, which is the same width as the ordinary space character, is encoded as   or  . In Unicode, it is encoded as U+00A0.