May I assume that the WebElement.getText() function returns UTF-8 Strings for every web page, or can it have other encodings?
If there could be other encodings, how can I identify them and convert them to UTF-8?
I don't know of a good way to make getText() return a particular encoding (though if you like reflection, you could hack around and rewrite it).
But in this answer, Selenium web driver and multilanguage, I wrote up a way to re-encode the String.
I don't think so. According to the API, getText() returns a String, and a Java String isn't tied to any particular byte encoding; the encoding question only arises when the page's bytes are decoded or the String is written back out. If you need to know what encoding the page itself used, you will have to find that in the page headers.
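To make that concrete, here is a minimal sketch using the Java bindings; the URL and locator are placeholders, not anything from the original question. Whatever charset the page was served with, the String that getText() returns can always be written out as UTF-8 explicitly:

import java.nio.charset.StandardCharsets;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class GetTextEncoding {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://example.com/"); // placeholder URL

        // getText() returns a Java String; Strings are UTF-16 internally
        // and carry no byte encoding of their own.
        String text = driver.findElement(By.tagName("h1")).getText();

        // An encoding only comes into play when converting to bytes, and
        // here you can always choose UTF-8, whatever the page's charset was.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));

        driver.quit();
    }
}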
I need to get a string from <STDIN>, written in a mixture of Latin and Russian encodings, and convert it into a URL:
$search_url = "http://searchengine.com/search?text=" . uri_escape($query);
But this process goes wrong and produces mojibake (a mixture of weird letters). What can I do in Perl to solve it?
Before you can get started, there are a few things you need to know.
You'll need to know the encoding of your input. "Latin" and "Russian" aren't (character) encodings.
If you're dealing with multiple encodings, you'll need to know what is encoded using which encoding. "It's a mix" isn't good enough.
You'll need to know the encoding the site expects the query to use. This should be the same encoding as the page that contains the search form.
Then, it's just a matter of decoding the input using the correct encoding, and encoding the query using the correct encoding. That's the easy part. Encode provides functions decode and encode to do just that.
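For instance, here is a minimal sketch. It assumes, purely for illustration, that the input happens to be CP1251 (a common Russian encoding) and that the search engine expects UTF-8; substitute whatever encodings you actually establish in the steps above.

use strict;
use warnings;
use Encode qw(decode encode);
use URI::Escape qw(uri_escape);

my $raw = <STDIN>;
chomp $raw;

# Step 1: decode the input bytes (assumed CP1251 here) into a Perl
# character string.
my $query = decode('cp1251', $raw);

# Step 2: encode the characters in the charset the site expects
# (assumed UTF-8 here), then percent-escape the resulting bytes.
my $search_url = "http://searchengine.com/search?text="
               . uri_escape(encode('UTF-8', $query));

print "$search_url\n";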
In Lua, for an iPad Corona project, I'm requesting a UTF-8 server text file (containing Chinese characters) using network.request, but the result, when displayed in the console or in the app, shows up as garbage. Google Chrome, for instance, displays the same UTF-8 page fine, since the server (using PHP) sends the HTTP header 'Content-Type: text/plain; charset=utf-8' (and there's no BOM, byte order mark, either). The garbage I'm seeing in Lua looks similar to what I get when I force Chrome to render the page as ISO-8859-1 via the options menu.
Does anyone have any help or pointers?
If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?
Thanks for any help!
Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO-8859-1. The most likely cause for this is something naive like treating each byte of the string as a Unicode code point.
I'm afraid I don't know Corona, so I can't provide any specific solutions, but I'd suggest looking at what functions it has that involve encodings; there may be a specific function to render a string with a particular encoding, for example.
Can you show the code for your network.request() call?
If you're downloading an HTML page, you should use network.download().
I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. What that means is that if you pass a UTF-8 string to display.newText(...), it should display properly. If you output to the console, however, it will actually print out the raw bytes of the string. And if you try to print the length of the string, it will actually print the number of bytes.
So, in summary, Lua treats all strings as arrays of bytes. It knows nothing about UTF-8. Some Corona API methods, when passed UTF-8 strings, will display the strings correctly.
I had issues when I mixed UTF-8 with plain ASCII characters, which I believe confused Corona (what I mean is that I mixed English characters with Japanese characters; it was still all UTF-8, though). I have a hunch that each character in the string must be the same length in bytes for Corona to display it properly. Try printing out one character at a time, as in the sketch below, to see if that helps. Please feel free to post comments here if you run into trouble. I'd like to figure this issue out myself, too.
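Here is a small sketch of what I mean, in plain Lua (the string literal is just an example): the length operator counts bytes, and you can walk the string one UTF-8 character at a time with a byte-range pattern.

-- Lua strings are byte arrays: # counts bytes, not characters.
local s = "\228\184\173\230\150\135"  -- UTF-8 bytes for two Chinese characters

print(#s)  --> 6, even though there are only 2 characters

-- Walk the string one UTF-8 character at a time: one lead byte
-- followed by any continuation bytes in the 0x80-0xBF range.
for ch in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(ch, #ch)  -- each of these characters is 3 bytes long
end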
There is an encoding issue in a web page: it is showing some garbled special characters in the browser (e.g. in the word "Cinéma"). The content is in an ISO encoding, but the web page is being rendered as UTF-8. Some articles display properly because they are UTF-8 encoded; others, like the one containing "Cinéma", show the encoding issue. The site is written in Perl 5.
Can anyone help me out with this encoding issue? That would be great!
Thanks in advance.
Ensure your Content-Type header, or meta document element, contains the correct encoding information.
A quick and easy way to test whether this is your issue is to ask the browser to render the page as if it had received a specific encoding directive. In Safari this would be View -> Text Encoding, then selecting something appropriate.
I'd hazard a guess that if you tell the browser to use UTF-8, it will render the page correctly.
The only way to solve this properly will be to spend some time reading up on Unicode and UTF-8 and how to handle encoding in Perl (perldoc perluniintro, perldoc perlunitut, perldoc perlunicode, and perldoc perlunifaq, for example).
UTF-8 encoding is a very different concept from the other kinds of encoding programmers encounter (escaping in strings, URL encoding, HTML character entities, etc.): it's about how your code should interpret sequences of bytes as characters.
Without knowing the source of the word containing the special character (an accented 'e'), it's impossible to offer further help. Is it coming from a database? A static HTML page? An HTML template? A string within Perl code?
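That said, here is a minimal sketch of the usual shape of the fix, assuming (and you would need to confirm this) that the article text is stored as ISO-8859-1 while the page is served as UTF-8:

use strict;
use warnings;
use Encode qw(decode encode);

# Assumption for illustration: the stored text is ISO-8859-1 bytes.
my $latin1_bytes = "Cin\xE9ma";   # 0xE9 is 'é' in ISO-8859-1

# Decode the bytes to characters, then re-encode them in the charset
# the page actually declares.
my $characters = decode('ISO-8859-1', $latin1_bytes);
my $utf8_bytes = encode('UTF-8', $characters);

# The declared charset must match what is actually sent.
print "Content-Type: text/html; charset=utf-8\n\n";
print $utf8_bytes;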
On a specific webpage, when I hover over a link, I can see the text as "bishop" but when I copy-and-paste the link to TextPad, it shows up as "%62%69%73%68%6F%70". What kind of code is this, and how can I convert it into text?
Thanks!
URL encoding, I think.
You can decode it here: http://meyerweb.com/eric/tools/dencoder/
Most programming languages will have functions to urlencode/decode too.
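For instance, here is a minimal sketch in Java; each %XX escape is just the hexadecimal value of one byte, so the ASCII letters of "bishop" decode straight back:

import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlCodec {
    public static void main(String[] args) throws Exception {
        // %62 is hex 0x62, the ASCII code for 'b', and so on.
        System.out.println(URLDecoder.decode("%62%69%73%68%6F%70", "UTF-8")); // bishop

        // Encoding escapes reserved and unsafe characters; plain ASCII
        // letters pass through unchanged.
        System.out.println(URLEncoder.encode("a b&c", "UTF-8")); // a+b%26c
    }
}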
This is URL encoding. It is designed to pass characters like <, /, or & through a URL using their ASCII values in hex after a %. However, you can also use it for characters that don't need encoding per se. It makes the URL harder to read, which is sometimes desirable.
URL encoding replaces unsafe characters, including anything outside the ASCII set, with %-escapes.
There's more info about URL encoding on the W3Schools site.
As mentioned by others, this is simply an ASCII-safe representation of the text so that it can be passed around in HTTP easily. If you've ever typed a website URL that has a space in it, you may have noticed the browser convert it to %20; that's the hexadecimal value of the space character in ASCII.
This used to be a way to trick old spam scrapers. One way spammers get email addresses is to scrape the source code of websites for strings matching the pattern "username@company.tld". By %-encoding the username portion, or the whole string, the string would still be readable by humans but would require the scraper to convert it to a literal string before it could be used to send emails. Of course, modern spamming tools account for these sorts of strings.
I support a website written in Tcl which displays data in Traditional Chinese (big5). We then have a Java servlet, using the translation code from mandarintools.com, to translate a page request into Simplified Chinese. The conversion as specified to the translation code is from UTF-8 to UTF-8S; Java is apparently correctly translating the data to UTF-8 as it comes in.
The Java translation code works but is slow, and since the website is written in Tcl, someone on another list suggested I try using that. Unfortunately, Tcl doesn't support UTF-8S and I have been unable to figure out what translation to use in its place. I've tried gb2312, gb2312-raw, gb1988, euc-cn... all result in gibberish. My assumption is that Tcl is also translating to UTF-8 as it comes in, though I have tried converting from big5 first and it doesn't help.
My test code looks like this:
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 $page_body]
ns_write $translated_page_body
I have also tried
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 [encoding convertfrom big5 $page_body]]
ns_write $translated_page_body
But it didn't change anything.
Does anyone out there have enough experience with this to help me figure it out?
FYI, for completeness' sake: I've been told by Tcl experts that you can't do the conversion this way; it has to be done via character replacement (see the sketch below).
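To illustrate what that character-replacement approach looks like, here is a minimal Tcl sketch. The two-entry mapping table is a hypothetical stand-in; a real Traditional-to-Simplified table (like the one the mandarintools.com code uses) has thousands of entries.

# Hypothetical two-entry map: \u9ad4 -> \u4f53 (體 -> 体), \u8a9e -> \u8bed (語 -> 语).
set t2s [list \u9ad4 \u4f53 \u8a9e \u8bed]

set page_body [ns_httpget http://www.mysite.com]

# First decode the Big5 bytes into Tcl's internal Unicode form...
set chars [encoding convertfrom big5 $page_body]

# ...then swap traditional characters for their simplified equivalents.
set simplified [string map $t2s $chars]

# Finally, emit the result in the encoding the response needs.
ns_write [encoding convertto utf-8 $simplified]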
By any chance, are you grabbing your data from Oracle?
If so, see if you can use the CONVERT function to convert from "utf8" to "al32utf8", which is the true UTF-8 standard and which Tcl should work with seamlessly.
If not, well, I guess I'll wait for your comment(s).