I sometimes see error messages in the log file of my iOS device with at least part of the text containing dashes between each letter, like this:
... ’-t -b-e -c-o-m-p-l-e-t-e-d-. -(-k-C-F-E-r-r-o-r-D-o-m-a-i-n-C-F-N-e-t-w-o-r-k -e-r-r-o-r -2-.-)-" -U-s-e-r-I-n-f-o-=-0-x-1-4-5-f-d-0 -{-k-C-F-G-e-t-A-d-d-r-I-n-f-o-F-a-i-l-u-r-e-K-e-y-=-8-}
I'm just curious as to any meaning this might have (certainly doesn't do much for readability, but I suppose it's easy to spot), or if it's just a random problem. (My device is jailbroken, if that makes a difference).
Update: I was able to format a log message similarly by calling NSLog with a non-ASCII character at the beginning:
NSLog(#"€ Line will be formatted strangely");
I suspect that the message was in UTF-16 (or some other double-byte character set) and was incorrectly interpreted as ASCII when converted to an NSString.
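A rough way to reproduce that kind of interleaving (a Python sketch of the theory, not of what CFNetwork actually does) is to encode ASCII text as UTF-16 and run the bytes back through a single-byte codec:

# Sketch: UTF-16 text decoded byte-by-byte. The NUL after each ASCII
# letter could plausibly be rendered as a dash or dot by a log viewer.
msg = "to be completed."
mangled = msg.encode("utf-16-le").decode("latin-1")
print(repr(mangled))  # 't\x00o\x00 \x00b\x00e\x00 ...'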
I have a .csv list of businesses. The file has some strange characters in it. For example, in this field: Stocktonon-Tees (which should read Stockton-on-Tees), the first hyphen, between "Stockton" and "on", seems to be a character with the value 6 rather than a hyphen, with the value 45. Stack Overflow will probably sanitize this so you can't see it, so here is a pastebin:
http://pastebin.com/NuyyaQy9
Can anyone explain why this could be? Is it some encoding issue that I have missed? Or a corruption in the dataset?
Yes, it's almost certainly an encoding issue. A file just consists of binary data - it's how you interpret that binary data that matters. It sounds like Notepad is guessing at the originally-intended encoding, but whatever else you're using isn't.
Unfortunately you haven't said anything about what software is trying to read the file or what wrote it in the first place - but you should look at what encoding Notepad thinks it is, and work from there.
If it's your code that wrote the file out, and you get to decide the encoding, I'd recommend UTF-8 as a good general purpose, platform-portable encoding.
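For example, in Python (a minimal sketch; the filename and the UTF-8 encoding are assumptions to be replaced with whatever Notepad reports):

import csv

# Open the file with an explicit encoding rather than relying on a guess.
# With the wrong encoding you get visible mojibake or a UnicodeDecodeError
# instead of a silent substitution.
with open("businesses.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f):
        print(row)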
When the device I am communicating with sends binary data, I can recover most of it. However, there always seem to be some bytes missing, replaced by non-standard characters. For instance, one individual output looks like this:
\xc4\xa5\x06\x00.\xb3\x01\x01\x02\x00\x00\x00=\xa9
The period and equals sign should be ordinary hex-escaped bytes like the others (I confirmed this in another application). Other times I get other odd characters such as ')' or 's'. These characters usually occur in the exact same spots (which vary with the command I passed to the device).
How can I fix this problem?
Are you displaying the output using something like this?
print output
If some of your bytes happen to correspond with printable characters, they'll show up as characters. Try this:
print output.encode('hex')
to see hex values for all your bytes.
At first I liked @RichieHindle's answer, but when I tried it the hex bytes were all bunched together.
To get a friendlier output, I use
print ' '.join(x.encode('hex') for x in output)
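For anyone on Python 3, where str.encode('hex') no longer exists, bytes.hex() does the same job (the separator argument needs Python 3.8+):

output = b"\xc4\xa5\x06\x00.\xb3\x01\x01\x02\x00\x00\x00=\xa9"
print(output.hex(" "))  # c4 a5 06 00 2e b3 01 01 02 00 00 00 3d a9
# Note the '.' and '=' from the original dump show up as 2e and 3d.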
<text><![CDATA[øCu·l es tu principal reto, objetivo o problema?]]></text>
While parsing the above tag, my parser is crashing.
How do I parse the CDATA?
The same line appears on Windows like this:
<text><![CDATA[¿Cuál es tu principal reto, objetivo o problema?]]></text>
The parser is crashing because of these special characters.
Why are they converted into special characters on the Mac?
How do I solve this?
Well for one, the string as you post it here looks like something has gone wrong with the encoding. "ø" is not a Spanish character.
Which XML parser are you using? I would guess that somewhere in that string there is a character, possibly hidden, or maybe it's the "ø", that makes your parser crash.
Edit (in response to the OP's comment)
I will try to guess what is happening and hope you can use my guess to resolve what is actually happening. So when you created the xml file you used some editor. This editor used a particular encoding. This means that it transferred the characters on your screen into bytes on your disk using a particular mapping from character into bytes (it encoded the characters into bytes). There are many different encodings, one common encoding is called Latin-1. So let's assume the file was encoded using Latin-1. After creating it, you transferred the file onto another machine where you opened it in a different editor. Now, how does the new editor know the encoding of the file? The answer is that it probably tried to guess the encoding. Now here is where the problem arises: it guessed wrong and interpreted the bytes using an encoding other than Latin-1.
While you have your (garbled) file open in an editor try selecting different encodings from the menu. The one that displays all your special characters correctly is likely to be the one used when the file was created.
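Incidentally, "øCu·l" is exactly what Latin-1 "¿Cuál" looks like when the bytes are read as MacRoman (0xBF is "¿" in Latin-1 but "ø" in MacRoman; 0xE1 is "á" versus "·"), which fits the Mac angle. Here is a small Python sketch for trying candidate encodings programmatically (the filename is hypothetical):

raw = open("question.xml", "rb").read()
for enc in ("utf-8", "latin-1", "cp1252", "mac_roman"):
    try:
        print(enc, "->", raw.decode(enc)[:60])
    except UnicodeDecodeError:
        print(enc, "-> not valid in this encoding")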
Edit 2
But my other question remains: which XML parser are you using?
Edit 3
Ok, so now when you write "crashing", do you actually mean crashing or does it just return? Do you get an error message? If yes, what? Can you do the following:
Remove the funny characters from this line and run your code on the following:
<text><![CDATA[l es tu principal reto, objetivo o problema?]]></text>
Does it still crash?
I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST body of an HTTP request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the data:
- The CSV file is very simple in structure.
- There's only one field / one column, which I name TERM and later use as ${TERM}.
- I don't really need full CSV because it's only one string per line.
- There are no commas or quotes.
- It's UTF-8: when I run the Unix "file" command on it, it says "UTF-8 text".
- I've also verified the UTF-8 encoding in command-line and graphical tools on two machines.
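One more check that doesn't depend on any tool's display is to decode the raw bytes directly; strict UTF-8 decoding raises on the first invalid sequence (a Python sketch):

# If this prints without a UnicodeDecodeError, the file really is valid UTF-8.
raw = open("japanese-searches.csv", "rb").read()
raw.decode("utf-8")
print("valid UTF-8,", len(raw), "bytes")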
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True; the output changed, but was still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried "Allow Quoted Data". The output changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
- Changed formatting and added bullet points for more clarity.
- Clarified that the file is UTF-8, and that I have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps the variable expansion inside the text field that I POST, where I reference ${TERM}, maps to question marks at that later point, so the corruption happens there. If that happened, AND it was also mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
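Outside JMeter, a two-line Python stand-in for that "data scope" is to dump the first record's raw bytes (UTF-8 Japanese should show runs of e3/e4-e9 lead bytes):

with open("japanese-searches.csv", "rb") as f:
    print(f.readline().hex(" "))  # needs Python 3.8+ for the separator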
Found the issue: there was another place where the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you also have to set Content Encoding to UTF-8.
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML. It's probably not a good practice to assume that; HTTP may default to ISO-Latin-1 or something, and even if there's a spec, maybe folks don't follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request is a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs; my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem BY default, this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, not the default, and not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
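That heuristic is easy to demonstrate (a Python sketch; Java's String.getBytes does the same substitution for unmappable characters):

term = "をインストールする"                # 9 characters
print(term.encode("ascii", "replace"))     # b'?????????' -- 9 marks, one per
                                           # character: it was still real Unicode
print(len(term.encode("utf-8")))           # 27 -- 3 bytes per character: the count
                                           # you'd see if the bytes had been mangled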
Came across this topic when searching for a solution to using parameters from a CSV file that contained some columns written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and the last names had to be in Hebrew.
I exported the file as a "Unicode Text" file. It became tab-delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++ and could see the Hebrew letters properly. Notepad++ has an "Encoding" menu item where you can check the encoding or change it, so I changed the Little Endian to UTF-8.
Then I replaced the tabs with commas (I just selected a tab and pasted it into the Find box).
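The same conversion can be scripted instead of done by hand in Notepad++ (a sketch; the filenames are made up). Excel's "Unicode Text" export is UTF-16 LE with a BOM, so Python's "utf-16" codec, which consumes the BOM, is the safe choice:

data = open("users.txt", encoding="utf-16").read()
open("users.csv", "w", encoding="utf-8").write(data.replace("\t", ","))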
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that when taken in pairs (%D7 %A9) corersponded properly to Hebrew letters.
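A quick way to confirm that those pairs decode correctly (a Python sketch):

from urllib.parse import unquote

# unquote percent-decodes as UTF-8 by default, so this prints the
# original Hebrew string from the request parameters.
print(unquote("%D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94"))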
To my mind, JMeter has a bug and cannot properly display the Unicode characters. But it sends (POSTs) them out OK.
Hope I am right and hope it will help someone.
You can try using "SHIFT-JIS" as the Content encoding (it's next to the Method selection). Then you should uncheck "Encode?" for the parameter that includes Japanese.
Hope it works for you.
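For reference, the two encodings produce very different bytes for the same text, which is why the setting matters to the server (a Python sketch):

term = "新しい"
print(term.encode("shift_jis").hex(" "))  # 2 bytes per character in Shift-JIS
print(term.encode("utf-8").hex(" "))      # 3 bytes per character in UTF-8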
Somewhere upstream of me, "something" happened that looks like Unicode mangling. One symptom is that a lowercase u umlaut (ü) gets converted to "ü" (i.e., character FC gets converted to C3 BC). Assuming that I have no control over this upstream process, how can I reverse-engineer what's going on? And if that is possible, can I crank the sausage machine backwards and get the original text back?
(If it helps to understand this case, the text I received was in the form of a MySQL dump. I think somewhere in the dump/transport process it got mangled.)
Your text isn't 'mangled'. It's just in UTF-8. C3 BC is exactly what ü is supposed to be encoded as in UTF-8. Just set whatever software you use to UTF-8 as well, and all the pain will go away. If you can't set your software to Unicode, seriously consider switching to newer software.
I know it's scary at first, but you will have to do that eventually, anyway. My favorite music typesetter switched to Unicode-only input a while ago (they even deliberately removed support for the old 8-bit code pages to get people to switch), and I was upset, thinking that Latin-1 was good enough for me, and it was stupid to break stuff that was working perfectly well... and then I got over it and just set emacs to Unicode buffers, and now I'll never have to think about character encoding again in my life!
First of all, it looks like you've got UTF-8-encoded text that is being interpreted in your expected encoding, probably Latin-1 (that's how the bytes C3 BC come out as "ü").
You could guess this encoding being used by checking that the correct byte sequences are used (and the illegal ones not used, of course). See the Wikipedia article for a reference and look for valid and invalid byte sequences. You can be pretty sure about the encoding if the text starts with a BOM, but that's not required for UTF-8.
To get the text back in your required encoding, several tools are available, GNU recode for one.
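If you do want to crank the sausage machine backwards in code rather than with recode, the fix is just the mis-step applied in reverse (a Python sketch, assuming the mangled text came from UTF-8 bytes being read as Latin-1):

mangled = "ü"
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed)  # ü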