Exclamation mark moves from end to start when reading cells with Arabic text with the Google Sheets API

I'm using the Google Sheets API to export cells that contain Arabic text to a JSON file. When a cell contains text that has an exclamation mark at the end of a sentence (so on the left side of the text in an RTL cell), it switches positions in the API response and ends up on the right (incorrect) side in my resulting JSON output:
"levelSelect:titleOne": "قاوم الطعام غير الصحي!"
The word order is correct, but I would expect the exclamation mark to be after the third double quote from the left and not before the fourth one. The punctuation mark placement is correct in the original cell in Google Sheets:
So all the Unicode RTL characters are coming back in the original order, but the one non Unicode character (the exclamation mark) has switched places.
I query Google Sheets with a higher level library, but even if I make a direct API request (using the OAuth playground), I get the same output (FWIW, it makes the request with charset=utf-8 in the Content-type header). The order is already wrong before I write the response to a file.
At this point, I am not sure if this is a bug in the Google Sheets API, or there is something fundamental I don't understand about how Arabic text should be handled. There doesn't seem to be a magic 'format this mostly Unicode string with the non Unicode characters in the right place' conversion function.
I am aware that there are copy and paste issues between different sources where this can happen when you paste text with formatting from the clipboard. If I copy the cell with Cmd-C (on a Mac) and then paste it into some places with just Cmd-V, I get the reversed order, and if I use Shift-Cmd-V, I get the correct order for the exclamation mark. It's as if the API gets confused about which format to send.

Related

Why does the TextBox character ordering change with FlowDirection RightToLeft

In my UWP app I have a textbox.
I want the user to be able to type Farsi / Persian text (right to left) into the textbox so I set the FlowDirection property to RightToLeft.
The text can be entered and is displayed correctly:
When I save the text, and inspect the property during debugging, I see the same character order as on screen:
The same character order applies for the stored value when viewed with mssql management studio:
When I add a '.' or a '!' at the end of the text, the WPF textbox still displays what I expect,
but the text I get back from the text property puts the exclamation mark at the right side of the string.
It is also stored this way in the sql database:
When loading the database value (with the exclamation point on the right) into the textbox it shows the exclamation point correctly on the left side. There must be some magic happening here that I am not aware of, or maybe the problem is that the debug preview / mssql preview does not support displaying RTL values.
My problem is that this magic does not work in other situations.
When I load the database value and put it in a Microsoft Word document, it seems to do no conversion and places the text in the document exactly as it is in the database, resulting in the exclamation point being shown on the 'wrong' side.
I would like to understand the 'magic' that takes place in displaying / storing these strings, so I can output it correctly in MS Word. And Yes, I have set the paragraph where I output the values in word to RTL.
In Unicode, all characters have directional properties that get used in the Unicode Bidirectional Algorithm for determining how characters are ordered visually. Most characters have a "strong" directional property, but not all. In particular, most punctuation characters are considered directionally neutral.
The visual ordering of neutral characters is determined by the characters that surround them. For example, the exclamation mark ! is neutral; if it occurs between two left-to-right characters, it will be treated as though it also is a left-to-right character. But if it occurs between two right-to-left characters, it will be treated as though it is a right-to-left character.
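To see these directional properties for yourself, here is a minimal Python sketch using the standard-library unicodedata module. The sample characters are my own picks for illustration, not taken from the question.

```python
import unicodedata

# Bidi classes of interest: "L" is strong left-to-right, "R" and "AL" are
# strong right-to-left, "ON" is "other neutral" (most punctuation).
samples = {
    "latin a": "a",
    "arabic qaf": "\u0642",
    "hebrew alef": "\u05d0",
    "exclamation mark": "!",
}

bidi_classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}

for name, bidi_class in bidi_classes.items():
    print(f"{name}: {bidi_class}")
```

The exclamation mark reports a neutral class, so its visual position depends entirely on its neighbours and on the paragraph base direction.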
In your example, though, the exclamation mark occurs at the end of the string. So, it has a strong-direction character on one side, but nothing on the other side. In this case, another factor comes into play, which is that the paragraph as a whole has a base direction.
The Unicode Bidi Algorithm allows two ways that apps can handle the paragraph base direction:
the app can set the base direction explicitly, regardless of the string content in the paragraph; or
the app can let the base direction be derived implicitly from the string: the base direction is determined by the first strong-directional character in the string.
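The implicit option can be sketched roughly as below. This is a simplification under my own assumptions: the function name is hypothetical, and the real rules in UAX #9 also skip over text inside directional isolates, which this sketch ignores.

```python
import unicodedata

def first_strong_direction(text, default="ltr"):
    """Guess the paragraph base direction from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    # No strong character found (for example, punctuation only).
    return default

print(first_strong_direction("قاوم الطعام غير الصحي!"))
print(first_strong_direction("Hello!"))
```

A viewer that uses the explicit option instead (for example, a debugger window fixed to LTR) never runs this kind of detection, which is why the same string can look different there.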
In your UWP app, when you set the flow direction to RTL, then the paragraph base direction (for purposes of the Bidi Algorithm) is RTL. With an Arabic-script string that ends with the exclamation mark, the directionality of the exclamation is set to RTL because of the paragraph base direction, and so it appears at the left end of the string. But when you view the control property value in an IDE, the IDE is presenting that property string in a control that has LTR base direction. That is causing the exclamation at the logical end of the string to appear visually at the right end.
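If you cannot control the base direction of the place where the string ends up, one common workaround is to append U+200F RIGHT-TO-LEFT MARK, an invisible strong RTL character, so that the trailing punctuation sits between two strong RTL characters. The helper below is a hypothetical sketch of that idea, not something UWP or Word provides, and you would need to check that the consumer of the string tolerates the extra character.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK: invisible, strong right-to-left

def anchor_trailing_punctuation(text):
    """Append an RLM when text containing RTL letters ends in a non-strong character."""
    if not text:
        return text
    has_rtl = any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
    last_is_strong = unicodedata.bidirectional(text[-1]) in ("L", "R", "AL")
    if has_rtl and not last_is_strong:
        return text + RLM
    return text

fixed = anchor_trailing_punctuation("قاوم الطعام غير الصحي!")
print([f"U+{ord(ch):04X}" for ch in fixed[-2:]])
```

The logical order of the stored string is unchanged; only one invisible character is added at the end.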
Note that apps will often conflate base direction and alignment, though these are really distinct things. In Word, you can set the paragraph base direction in the Paragraph settings dialog, and when you do it will set the alignment to match by default:
But you can override the paragraph alignment to have a RTL base direction with left alignment:
Note that the visual order of the exclamation mark is affected by the paragraph base direction but not by the alignment. The Unicode Bidi Algorithm doesn't pay attention to the alignment.
This article gives a good overview of how the Bidi Algorithm works: https://www.w3.org/International/articles/inline-bidi-markup/uba-basics.
If you want to explore how the Bidi Algorithm works in more detail, you can read the spec, Unicode Standard Annex #9, Unicode Bidirectional Algorithm; and check out this Unicode utility that explains how the rules of the algorithm apply to sample strings you can provide.

Unicode converted text isn't shown properly in MS-Word

In a mapping editor, the display is correct after the legacy to unicode conversion for DEVANAGARI text shown using a unicode font (Arial Unicode MS). However, in MS-WORD, the display isn't as expected for the same unicode text in the unicode font (Arial Unicode MS) or any other Devanagari unicode fonts. The expected sequence of unicodes are provided as per the documentation. The sequence can be seen on the left-hand side table.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the zero_width_joiner? The halant (virama) by itself is enough to get the half-consonant (for some combinations) and in particular, it may be that Word is using the presence of the ZWJ to keep them separate.
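To check whether a ZWJ is actually present in the converted text, something like the following Python sketch can list the code points and strip the joiner. The sample sequence (KA, VIRAMA, ZWJ, SSA) is a made-up example, not the sequence from your table.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def describe(text):
    """List each code point with its Unicode name so invisible characters show up."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

def strip_zwj(text):
    """Remove every zero width joiner from the text."""
    return text.replace(ZWJ, "")

# Hypothetical sample: KA + VIRAMA + ZWJ + SSA
sample = "\u0915\u094d\u200d\u0937"

for line in describe(sample):
    print(line)
print(len(sample), "->", len(strip_zwj(sample)))
```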
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[Aside: the way to tell if it's being treated as a single run is to save the document as an XML file, open it with something like Notepad++, and look at the XML "w:t" element (IIRC) associated with these characters. If they're all in separate w:t elements, it means they're in separate runs. In that case, you might need to copy the text from Word to some other tool (e.g. Notepad++) and then copy it from there and paste it back into Word -- that might cause it to be imported into Word in a single run.]
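The run check can also be scripted instead of read by eye. This is a rough Python sketch that assumes the usual WordprocessingML layout (w:r runs containing w:t text elements); the function name is my own.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in Word documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def run_texts(document_xml):
    """Return the text of each w:r run, joining the w:t elements inside it."""
    root = ET.fromstring(document_xml)
    runs = []
    for run in root.iter(f"{{{W_NS}}}r"):
        runs.append("".join(t.text or "" for t in run.iter(f"{{{W_NS}}}t")))
    return runs
```

For a .docx file, the XML to pass in should be the word/document.xml member, e.g. zipfile.ZipFile(path).read("word/document.xml"). If the characters of one conjunct come back as separate list entries, they are in separate runs.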

Why is this LSEP symbol showing up on Chrome and not Firefox or Edge?

So this web page is rendering with these symbols and they are found throughout this website/application but on no other sites. Can anyone tell me
What this symbol is?
Why it is showing up only in one browser?
That character is U+2028 Line Separator, which is a kind of newline character. Think of it as the Unicode equivalent of HTML’s <br>.
As to why it shows up here: my guess would be that an internal database uses LSEP to not conflict with literal newlines or HTML tags (which might break the database or cause security errors), and either:
The server-side scripts that convert the database to HTML neglected to replace LSEP with <br>
Chrome just breaks standards by displaying LSEP as a printing (visible) character, or
You have a font installed that displays LSEP as a printing character that only Chrome detects. To figure out which font it is, right click on the offending text and click “Inspect”, then switch to the “Computed” tab on the right-hand panel. At the very bottom you should see a section labeled “Rendered Fonts” which will help you locate the offending font.
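If the first cause applies, the server-side fix amounts to something like this Python sketch. The function names are hypothetical, and it assumes the text has already been HTML-escaped before the replacement.

```python
LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"

def find_separators(text):
    """Return (index, code point) pairs for every U+2028 / U+2029 in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in (LINE_SEPARATOR, PARAGRAPH_SEPARATOR)
    ]

def lsep_to_br(text):
    """Replace each line separator with an HTML line break."""
    return text.replace(LINE_SEPARATOR, "<br>")

sample = "first line\u2028second line"
print(find_separators(sample))
print(lsep_to_br(sample))
```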
More information on the line separator, excerpted from the Unicode standard, Chapter 5.8, Newline Guidelines (on p. 12 of this PDF):
Line Separator and Paragraph Separator
A paragraph separator—independent of how it is encoded—is used to indicate a
separation between paragraphs. A line separator indicates where a line break
alone should occur, typically within a paragraph. For example:
This is a paragraph with a line separator at this point,
causing the word “causing” to appear on a different line, but not causing
the typical paragraph indentation, sentence breaking, line spacing, or
change in flush (right, center, or left paragraphs).
For comparison, line separators basically correspond to HTML <BR>, and
paragraph separators to older usage of HTML <P> (modern HTML delimits
paragraphs by enclosing them in <P>...</P>). In word processors, paragraph
separators are usually entered using a keyboard RETURN or ENTER; line
separators are usually entered using a modified RETURN or ENTER, such as
SHIFT-ENTER.
A record separator is used to separate records. For example, when exchanging
tabular data, a common format is to tab-separate the cells and to use a CRLF
at the end of a line of cells. This function is not precisely the same as line
separation, but the same characters are often used.
Traditionally, NLF started out as a line separator (and sometimes record
separator). It is still used as a line separator in simple text editors such as
program editors. As platforms and programs started to handle word processing
with automatic line-wrap, these characters were reinterpreted to stand for
paragraph separators. For example, even such simple programs as the Windows
Notepad program and the Mac SimpleText program interpret their platform’s NLF
as a paragraph separator, not a line separator. Once NLF was reinterpreted to
stand for a paragraph separator, in some cases another control character was
pressed into service as a line separator. For example, vertical tabulation VT
is used in Microsoft Word. However, the choice of character for line separator
is even less standardized than the choice of character for NLF. Many Internet
protocols and a lot of existing text treat NLF as a line separator, so an
implementer cannot simply treat NLF as a paragraph separator in all
circumstances.
Further reading:
Unicode Technical Report #13: Newline Guidelines
General Punctuation (U+2000–U+206F) chart PDF
SE: Why are there so many spaces and line breaks in Unicode?
SO: What is unicode character 2028 (LS / Line Separator) used for?
U+2028 on codepoints.net A misprint here says that U+2028 was added in v. 1.1 of the Unicode standard, which is false — it was added in 1.0
I found that in WordPress the easiest way to remove "L SEP" and "P SEP" characters is to execute these two SQL queries:
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a9'), '')
UPDATE wp_posts SET post_content = REPLACE(post_content, UNHEX('e280a8'), '')
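For anyone wondering where those hex values come from: they are the UTF-8 encodings of U+2029 and U+2028. A short Python check, plus the same clean-up done on a string rather than in SQL (the function name is just for illustration):

```python
LSEP_BYTES = "\u2028".encode("utf-8")  # LINE SEPARATOR
PSEP_BYTES = "\u2029".encode("utf-8")  # PARAGRAPH SEPARATOR

print(LSEP_BYTES.hex(), PSEP_BYTES.hex())  # e280a8 e280a9

def strip_separators(content):
    """Remove both separator characters, mirroring the two REPLACE calls."""
    return content.replace("\u2028", "").replace("\u2029", "")
```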
The javascript way (mentioned in some of the answers) can break some things (in my case some modal windows stopped working).
You can use this tool...
http://www.nousphere.net/cleanspecial.php
...to remove all the special characters that Chrome displays.
Steps:
Paste your HTML and Clean using HTML option.
You can manually delete the characters in the editor on this page and see the result.
Paste back your HTML in file and save :)
I recently ran into this issue, tried a number of fixes but ultimately I had to paste the text into VIM and there was an extra space I had to delete. I tried a number of HTML cleaners but none of them worked, VIM was the key!
9999years' answer is great.
In case you use Symfony with Twig template I would recommend to check for an empty Twig block. In my case it was an empty Twig block with an invisible char inside.
The LSEP char was only displayed on certain device / browser.
On the other I had a blank space above the header and I could not see any invisible char.
I had to inspect the GET request to see that the value 1f18 was before the open html tag.
Once I removed an empty Twig block it was gone.
hope this can help someone one day ...
My problem was similar, it was "PSEP" or "P SEP". Similar issue, an invisible character in my file.
I replaced \x{2029} with a normal space. Fixed. This problem only appeared on Windows Chrome. Not on my Mac.
I agree with @Kapil Bathija - Basically you can copy & paste your HTML code into http://www.nousphere.net/cleanspecial.php and convert it.
Then it will convert the special characters for you - Just remove the spaces in between the words and you will realize you have to press backspace 2x meaning there is an invalid character that can't be translated.
I had the same issue and it worked just fine afterwards.
You can also copy the text, paste it into a HTML editor such as Coda, remove the linebreak, copy it and paste it back into your site.
Video here: https://www.loom.com/share/501498afa7594d95a18382f1188f33ce
Looks like my client pasted HTML into WordPress after initially creating it with MS-Word. Even deleting the &nbsp; and visible spaces did not fix the issue. The extended characters became visible in vi/vim.
If you don't have vi/vim available, try highlighting from 2 chars before the LSEP to 2 chars after the LSEP; delete that chunk, and re-type the correct characters.

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8, I get Question Marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of my HTTP request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
The CSV file is very simple in structure.
There's only one field / one column, which I name TERM, and later use as ${TERM}
I don't really need full CSV because it's only one string per line.
There are no commas or quotes.
It's UTF-8 and when I run the Unix "file" command on the file, it says UTF-8 text.
I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True, different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only. So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when doing variable expansion inside of the text field that I POST, where I reference the ${TERM}, maybe at that point it also maps to question marks, but that the corruption happens at that later point. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
Found the issue, there was another place the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you have to also set Content Encoding to UTF-8
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML, but probably not a good practice to assume that, and maybe HTTP defaults to ISO-Latin1 or something, or even if there's a spec, maybe folks don't follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on, and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request was a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect WITHIN the HTTP phase, there are TWO character transitions, first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
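The two-step transition I suspected in point 4 can be reproduced outside JMeter. This is only a Python illustration of the idea (charset first, percent-encoding second), not JMeter's actual code; ISO-8859-1 stands in for "whatever encoding it thinks you have".

```python
from urllib.parse import quote

term = "新しい"  # last three characters of the sample input

# Step 1 + 2 with the right charset: characters -> UTF-8 bytes -> %XX
encoded_utf8 = quote(term, encoding="utf-8")

# Step 1 with the wrong charset: each character becomes "?" (0x3F),
# and step 2 then faithfully percent-encodes those question marks.
encoded_latin1 = quote(term, encoding="iso-8859-1", errors="replace")

print(encoded_utf8)
print(encoded_latin1)
```

So by the time the % signs appear, the damage from the wrong charset has already been done.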
And the "?" is Java's way of reporting a problem BY default, this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, not the default, and not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
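Here is a small Python sketch of that counting hint. It mimics the two failure modes with Python's own codecs (Java is not involved, and Python marks undecodable bytes with U+FFFD rather than "?"), so treat it as an analogy.

```python
text = "をインストールする場合、新しい"

# Case 1: real characters pushed through a charset that cannot represent
# them. One "?" per character.
as_latin1 = text.encode("iso-8859-1", errors="replace")

# Case 2: the UTF-8 bytes were never understood as characters. Each of the
# three bytes per character is replaced separately.
as_ascii = text.encode("utf-8").decode("ascii", errors="replace")

print(len(text), as_latin1.count(b"?"), len(as_ascii))
```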
Came across this topic when searching for solution to use parameters from csv file that contained some columns written in Hebrew.
I used Excel 2007 to create a 1000 lines data for user registrations. The first and the last names had to be in Hebrew.
I exported the file to "Unicode text" file. It became tab delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++. I could see the Hebrew letters properly. The Notepad++ has the "Encoding" menu item, where you can check the encoding or change it. So I changed the Little Endian to UTF-8.
Then I replaced tabs with commas (just selected the tab and pasted it into the Find box).
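The same two Notepad++ steps could be scripted. A minimal Python sketch, assuming the export starts with a byte order mark and that no field contains a comma or a quote (the function name is made up):

```python
def utf16_tsv_to_utf8_csv(raw):
    """Decode BOM-prefixed UTF-16 tab-delimited bytes, return UTF-8 comma-delimited bytes."""
    text = raw.decode("utf-16")  # the "utf-16" codec reads and drops the BOM
    return text.replace("\t", ",").encode("utf-8")

# Usage with files (paths are placeholders):
# with open("users-unicode.txt", "rb") as src, open("users.csv", "wb") as dst:
#     dst.write(utf16_tsv_to_utf8_csv(src.read()))
```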
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that when taken in pairs (%D7 %A9) corresponded properly to Hebrew letters.
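If you want to verify such a raw value without reading the pairs by hand, a short Python check along these lines decodes it and lists the code points:

```python
from urllib.parse import unquote

raw_value = "%D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94"

# Undo the percent-encoding and interpret the resulting bytes as UTF-8.
decoded = unquote(raw_value, encoding="utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

print(len(decoded), code_points)
```

Every code point landing in the Hebrew block (U+0590 to U+05FF) confirms that the bytes on the wire were valid UTF-8.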
To my mind, the JMeter has a bug and can not properly display the unicode chars. But it sends (POSTs) them out ok.
Hope I am right and hope it will help someone.
You can try to use "SHIFT-JIS" in Content encoding (it's nearby Method selection). Then you should uncheck "Encode?" for parameter that included Japanese.
Hope it works for you.

Diamonds with question marks

I'm getting these little diamonds with question marks in them in my HTML attributes when I present data from my database. I'm using EPiServer and a few custom properties.
This is the information I've gathered,
I save my data as an XML document, since I use custom EPiServer properties which need more than one defined value. This is saved as UTF-8.
It's only attributes in element tags which have this problem, such as align=left becomes align=�left�. There is no " character there, but I get the diamonds anyway.
If I use " outside an element, it works and shows correctly.
Any clues?
This is a problem with your character encoding scheme.
I would recommend reading this article, where (close to the bottom of it), he shows you why you get that little diamond with question marks.
Has the XML been touched by any of the Microsoft Office suite products?
These are notorious for switching vanilla quotes (") x'22' to smartquotes x'93' and x'94' (“”).
Also singlequote (') is often converted from x'27' to x'91' and x'92' pairs (‘’).
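A quick Python sketch of how those bytes turn into diamonds. It assumes the stored bytes are Windows-1252 smart quotes that are later read as UTF-8, which is a guess about your data based on the symptom:

```python
raw = b"align=\x93left\x94"  # smart quotes as single Windows-1252 bytes

# Read as UTF-8, 0x93 and 0x94 are invalid and become U+FFFD (the diamond).
as_utf8 = raw.decode("utf-8", errors="replace")

# Read as Windows-1252, they are the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")

# Map the curly quotes back to plain ASCII quotes.
repaired = as_cp1252.replace("\u201c", '"').replace("\u201d", '"')

print(as_utf8)
print(repaired)
```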