I'm stuck on the following problem:
I have a script that retrieves the title of the Firefox window:
tell application "Firefox"
if the (count of windows) is not 0 then
set window_name to name of front window
end if
end tell
It works well as long as the title contains only English characters, but when the title contains some non-ASCII characters (Cyrillic in my case) it produces UTF-8 garbage. I've analyzed this garbage a bit, and it seems that my Cyrillic characters are converted to UTF-8 without any regard for the code page, i.e. instead of using the Cyrillic code page for the conversion, no code page is used at all, and I end up with UTF-8 text whose characters differ from those in the window title.
My question is: how can I retrieve the window title as UTF-8 directly, without any conversion?
I can achieve this goal using the AX API, but I want to do it with AppleScript, because the AX API needs an option turned on in the system.
UPD:
It works fine in the AppleScript Editor, but I'm compiling and running it from C++ code via OSACompile -> OSAExecute -> OSADisplay.
I don't know the internals of the AppleScript Editor, so maybe it has some inside knowledge about how to encode the characters.
I found the answer while writing the update. Sometimes it is good to ask a question just to understand it better :)
So, for future searchers: if you want the Unicode result of the script execution, pass typeUnicodeText to OSADisplay; the resulting AEDesc will then contain the text as UTF-16LE.
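The underlying issue is how the result bytes get interpreted, not the bytes themselves. As a side illustration (a minimal Java sketch, nothing to do with the OSA calls; the title string is made up), decoding the same UTF-16LE bytes with the right and the wrong charset shows exactly the kind of garbage described above:

import java.nio.charset.StandardCharsets;

public class TitleEncodingDemo {
    public static void main(String[] args) {
        String title = "Заголовок окна";  // hypothetical Cyrillic window title
        byte[] utf16le = title.getBytes(StandardCharsets.UTF_16LE);

        // Interpreting the bytes as what they really are round-trips cleanly:
        System.out.println(new String(utf16le, StandardCharsets.UTF_16LE));

        // Interpreting the same bytes with an unrelated single-byte charset
        // produces garbage, even though the data itself was never broken:
        System.out.println(new String(utf16le, StandardCharsets.ISO_8859_1));
    }
}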
I have a very old program (not a server or something on the internet) that I think uses the ANSI (Windows-1252) encoding.
The problem is that some inputs to this program are written in Arabic.
However, when I try to read the result, the Arabic words come out as very weird characters. For example, the input "نور" is converted to "äæÑ".
The program output should contain a combination of English words and Arabic words.
E.g., it outputs "Name äæÑ" while the correct output should be something like "Name نور".
In general, the English words are correct and readable in both UTF-8 and ANSI. But the Arabic words read, for example, as "���" with UTF-8 and as "äæÑ" with ANSI.
I understand that this is because ANSI doesn't support non-Latin letters.
But what should I do now? How can I convert them back to Arabic?
Note: I know the exact input and the exact output that this program should produce.
Note2: I don't have the source code of this program. I just want to convert the output file of this program to have the correct words or encoding.
I solved this problem now by typing in the terminal:
iconv -f WINDOWS-1256 -t utf8 < my_File.ged > result.ged
I tried to write Java code that does a similar thing, but it didn't really give me the result I wanted.
I had also tried the same terminal command with WINDOWS-1252 instead of WINDOWS-1256, but that didn't work: Windows-1252 is the Western European code page and contains no Arabic letters, while Windows-1256 is the Arabic one. So I guess it is good to try different encodings until one works.
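For anyone who wants to do the same conversion in Java instead of iconv, a minimal sketch could look like this (it uses the file names from the command above and assumes, as here, that the source is Windows-1256):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Reencode {
    public static void main(String[] args) throws Exception {
        // Read the raw bytes, interpret them as Windows-1256, write them back out as UTF-8.
        byte[] raw = Files.readAllBytes(Paths.get("my_File.ged"));
        String text = new String(raw, Charset.forName("windows-1256"));
        Files.write(Paths.get("result.ged"), text.getBytes(StandardCharsets.UTF_8));
    }
}

The key point is the same as with iconv: you must tell the decoder which code page the bytes are in; there is no way to get that right by reading the bytes alone.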
I am using wxMac 2.8 in a non-Unicode build. I try to read a file containing umlauts such as "ü" into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
Ã¼Ã¼Ã¼Ã¤Ã¤Ã¤Ã¶Ã¶Ã¶ÃŸÃŸÃŸ
Note that the character count has doubled - printing the string in gdb displays "\303\274" and similar for each original character. Typing "ü" or similar into the text control directly is no problem. I tried various wxMBConv methods, but the result is always the same. Is there a way to solve this?
Best regards,
If you use anything but 7-bit ASCII, you must use a Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for the "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with a Unicode build, use wxWidgets 2.9 instead, where it will compile -- and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF-8 world, you should:
1) Encode the UTF-8 directly into the string literals using escapes, e.g. "\303\274" or "\xc3\xbc". That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.
I created a file with UTF-8 encoded content (using PHP's fputcsv).
When I open this file in Notepad++, the characters are wrong (Notepad++ opens it with ANSI encoding).
When I select Format -> "Encode in UTF-8" from the menu, everything is fine.
I'm worried that Notepad++ should be able to recognize the encoding somehow, and that maybe something is wrong with my file created with fputcsv. The first byte or something?
Automatically detecting an encoding is not something that can be done accurately. It's pretty much essential that the encoding be specified explicitly. It can be guessed in some cases, but even then not with 100% certainty.
This documentation (Encoding) explains the situation in relation to Notepad++.
They also point out that the difficulty arises especially if the file has not been saved with a Byte Order Mark (BOM).
Given that your file displays correctly once you manually set the encoding, I would say there's nothing wrong with how you are generating and saving the file. The only thing you can check for is whether a BOM is being saved, which might improve the chances of Notepad++ being able to automatically detect the encoding.
It's worth noting that although it may help editors like Notepad++ identify the encoding more accurately, according to The Unicode Standard document, the BOM is not recommended.
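If you do decide to prepend a BOM so that editors like Notepad++ can detect the encoding, the idea is simply to write the three bytes EF BB BF before the UTF-8 content. Here is a minimal sketch in Java for illustration (the file name and content are made up; the PHP side would write the same three bytes before calling fputcsv):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteWithBom {
    public static void main(String[] args) throws Exception {
        try (FileOutputStream out = new FileOutputStream("export.csv")) {
            // UTF-8 byte order mark: EF BB BF
            out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
            try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                w.write("id,name\n");
                w.write("1,Łódź\n");  // non-ASCII content that would otherwise display wrong under ANSI
            }
        }
    }
}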
You can check the lower right corner of the Notepad++ GUI to see the actual encoding that is being used. The problem is not really Notepad++-specific: guessing the right encoding is a hard problem without any real solution, so it is better to let the user decide which encoding is most appropriate in each case.
When you want to reflect the encoding of the text file in a Java program, you have to consider two things: the encoding and the character set. When you open a text file in Notepad++, you see the encoding under the "Encoding" menu. Additionally, look at the character-set menu items: under "Eastern European" you will find "ISO 8859-2", and under "Central European", "Windows-1250". You can set the corresponding encoding in the Java program when you look it up in this table:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
For example, for the Central European character set "Windows-1250", the table suggests the Java encoding "Cp1250". Set that encoding and you will see the characters in the program properly, for example like this:
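A minimal sketch, assuming the file was saved from Notepad++ with the Windows-1250 (Central European) character set (the file name is made up):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ReadCp1250 {
    public static void main(String[] args) throws Exception {
        // "Cp1250" is the Java name for the Windows-1250 (Central European) code page.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), "Cp1250"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}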
I'm using HTML Tidy Online (http://infohound.net/tidy/) to tidy up some very old and messed up HTML file which contains some Hebrew characters. Whenever the page is processed by Tidy the output turns Hebrew characters into gibberish, even after changing encoding methods in the settings. Using different settings, I do manage to get the same output with the Hebrew characters as unicode entities.
I Googled around for a possible solution but found none.
I had a couple ideas in mind, but I'm not sure exactly how to approach them, if at all (maybe someone has a better solution).
I thought maybe I could (after processing the page) scan the page for unicode entities and replace them with the corresponding Hebrew characters (in a systematic way, of course).
Maybe I could take the HTML Tidy source code and modify it to output Hebrew characters appropriately. The problem with this is that I doubt I am knowledgeable enough to even get started on something like this.
I had a similar problem. Document in UTF-8, containing unicode characters. HTML Tidy turned them into HTML entities. This in HTMLTIDY.CFG fixed it:
char-encoding: utf8
input-encoding: utf8
output-encoding: utf8
Hope it helps.
The website http://infohound.net/tidy/ that you are using has a "Char encoding" option at the bottom right. You need to choose utf-8, but first you need to make sure that the page is actually encoded in UTF-8 in your text editor. In Notepad++, for example, you can go to Encoding > Convert to UTF-8 without BOM.
I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of an HTTP request.
If I set JMeter to encode the data, it just puts in the percent sequences for question marks.
About the Data:
- The CSV file is very simple in structure.
- There's only one field / one column, which I name TERM and later use as ${TERM}.
- I don't really need full CSV because it's only one string per line.
- There are no commas or quotes.
- It's UTF-8, and when I run the Unix "file" command on the file, it says "UTF-8 text".
- I've also verified UTF-8 in command-line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters, then I get 15 question marks, so at some point they're being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True; the output was different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call I should use. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
- Changed formatting and added bullet points for more clarity.
- Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when the ${TERM} variable is expanded inside the text field that I POST, it is at that point that it gets mapped to question marks, so the corruption happens at that later stage. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
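On the question above about seeing the numeric values: one way (a sketch, assuming the variable is named TERM as in the CSV config) is to add a JSR223 PostProcessor right after the CSV Data Set Config and log the code points of what JMeter actually read; "vars" and "log" are the standard bindings available in JSR223 elements:

// JSR223 PostProcessor body (Java-style syntax); "TERM" comes from the CSV Data Set Config.
String term = vars.get("TERM");
StringBuilder sb = new StringBuilder();
for (int i = 0; i < term.length(); ) {
    int cp = term.codePointAt(i);
    sb.append(String.format("U+%04X ", cp));
    i += Character.charCount(cp);
}
log.info("TERM code points: " + sb);

If the logged values are real Japanese code points (U+30A4 and so on), the CSV was read correctly and the problem is further downstream; if they are U+003F, the data was already turned into question marks.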
Found the issue: there was another place where UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you also have to set Content Encoding to UTF-8.
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise. Maybe I'm just used to XML, but it's not a good practice to assume that: HTTP/1.1 historically defaults to ISO-8859-1 (Latin-1) for text content, and even where there is a spec, people don't always follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on and let the receiver on the other end deal with them. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request is a POST, I did still try the Encode checkbox. I thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % escapes, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem BY default, this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, not the default, and not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of being 2 or 3 times as many. If the length of the "???" string matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input characters, then Java only ever saw them as bytes or ints, and NEVER as valid code points.
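You can reproduce both failure shapes in a few lines of Java (a sketch of the diagnostic, with a made-up sample string), which makes the "count the question marks" rule concrete:

import java.nio.charset.StandardCharsets;

public class MojibakeDiagnostic {
    public static void main(String[] args) {
        String jp = "インストール";  // 6 characters
        // Case 1: Java had real characters but encoded them to a charset that cannot
        // represent them -> one '?' per character (length stays 6).
        String asLatin1 = new String(jp.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1 + "  length=" + asLatin1.length());
        // Case 2: the UTF-8 bytes were never decoded and each byte became a character
        // -> roughly three garbage characters per original character (length 18).
        String asBytes = new String(jp.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(asBytes + "  length=" + asBytes.length());
    }
}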
I came across this topic when searching for a solution for using parameters from a CSV file that contained some columns written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and the last names had to be in Hebrew.
I exported the file as a "Unicode Text" file. It became tab-delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++ and could see the Hebrew letters properly. Notepad++ has an "Encoding" menu item, where you can check the encoding or change it. So I changed the Little-Endian UTF-16 to UTF-8.
Then I replaced the tabs with commas (I just selected a tab and pasted it into the Find box).
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94, which, when taken in pairs (%D7 %A9), corresponded properly to Hebrew letters.
To my mind, JMeter has a bug and cannot properly display the Unicode characters, but it sends (POSTs) them out OK.
Hope I am right and hope it will help someone.
You can try to use "SHIFT-JIS" in Content encoding (it's nearby Method selection). Then you should uncheck "Encode?" for parameter that included Japanese.
Hope it works you.