The following will demonstrate the error:
catalyst.pl Hello
cd Hello
echo "encoding utf8" >> hello.conf
script/hello_server.pl -r
Then navigate to http://localhost:3000/?q=P%E9rl in your browser and you'll get a 400 Bad Request.
It appears to be Catalyst's _handle_param_unicode_decoding() method that is generating this error. Given how trivial the error is to trigger, it's filling up the error logs, and Google has failed me in finding a fix. I can't stop users from entering query strings like that. How can I work around this?
URLs are supposed to be encoded using UTF-8. From RFC 3986:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set, the data should first be encoded as octets according to the UTF-8 character encoding; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.
P E9 r l is not valid UTF-8: 0xE9 starts a three-byte UTF-8 sequence, but the byte that follows ("r", 0x72) is not a continuation byte.
I believe you were going for Pérl (é is U+00E9)? That would be
$ perl -Mutf8 -MURI::Escape -E'say uri_escape_utf8("Pérl")'
P%C3%A9rl
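The same check works in Python, if that's handier; a quick illustration using the two URLs above:

from urllib.parse import quote, unquote_to_bytes

raw = unquote_to_bytes('P%E9rl')   # b'P\xe9rl'
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)   # 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
print(quote('Pérl'))               # P%C3%A9rl, the valid form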
400 Bad Request is an appropriate error for providing a bad URL. If the user doesn't want to see this error, they should use a valid URL. You could override Catalyst's default error handling behaviour (e.g. to provide a more precise error page) using handle_unicode_encoding_exception().
So there is a method in Catalyst.pm that you can override in your subclass (Hello.pm in the example above) which controls how these errors are handled. If you want to suppress those types of errors you can do so. Take a look at:
https://metacpan.org/source/JJNAPIORK/Catalyst-Runtime-5.90077/lib/Catalyst.pm#L3108
You can override that method if you like.
Alternatively, if you have a proposal for a code change or some sort of configuration option, you can fork the Catalyst GitHub repo and send me a pull request with your ideas:
https://github.com/perl-catalyst/catalyst-runtime
These methods are currently considered somewhat private but I am considering making them fully public.
I am serving files from Google Cloud Storage and some of the filenames contain non-ASCII, UTF-8 encoded characters. For example, volvía.mp3.
If I request volvía.mp3, GCS throws an error.
If I percent encode the filename (í = %C3%AD) as volv%C3%AD.mp3, it still fails.
If I percent encode the filename using the "combining acute accent" = %CC%81 as volvi%CC%81a.mp3, it succeeds.
Any ideas what is going on?
EDIT: The error it throws is an "Access Denied" error: "Anonymous users does not have storage.objects.get access to object." However, this seems to be the error one gets when requesting an object that's not found.
The problem is due to Mac OS's HFS+ file system, which enforces canonical decomposition (NFD) on filenames. This means it normalizes characters such as í into two code points (i + combining acute accent) rather than the single code point used in the "composed" form (i.e., NFC).
GCS treats these two forms as distinct object names, despite the fact that they appear identical.
One solution is to convert NFD filenames to the more common NFC forms (using a utility such as convmv) before uploading to GCS. However, this can't be done on Mac OS because the file system itself enforces NFD.
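To see the two forms side by side, here is a minimal sketch in Python, using the filename from the question; the same normalize('NFC', ...) call is what a script could apply to object names at upload time, even where the local file system stores NFD:

import unicodedata
from urllib.parse import quote

name = 'volvía.mp3'
nfc = unicodedata.normalize('NFC', name)   # í as a single code point
nfd = unicodedata.normalize('NFD', name)   # i + combining acute accent

print(quote(nfc))    # volv%C3%ADa.mp3
print(quote(nfd))    # volvi%CC%81a.mp3
print(nfc == nfd)    # False: GCS sees two distinct object names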
I was not able to reproduce your issue. I uploaded an object named volvía.mp3 and was able to retrieve it as both http://storage.googleapis.com/bucketname/volvía.mp3 and http://storage.googleapis.com/bucketname/volv%C3%ADa.mp3
I suspect that you actually created an object with the "combining acute accent" character instead. How did you upload your object?
I've got a unicode string (s) which I want to write into a file.
In Python 2 I could write:
open('filename', 'w').write(s.encode('utf-8'))
But this fails for Python 3. Apparently, s.encode() returns something of type 'bytes', which the write() function does not accept:
TypeError: must be str, not bytes
Does anyone know how to port the above code to Python 3?
Edit:
Thanks to all of you who proposed using binary mode! Unfortunately, this causes a problem with the \n characters. Is there any way to achieve the same result I had with Python 2 (namely to encode non-ANSI characters in UTF-8 while keeping the OS-specific rendition of \n)?
Thanks!
You do not want to muck around with manually encoding each and every piece of data like that! Simply pass the encoding as an argument to open, like this:
#!/usr/bin/env python3.2
slist = [
    "Ca\N{LATIN SMALL LETTER N WITH TILDE}on City",
    "na\N{LATIN SMALL LETTER I WITH DIAERESIS}vet\N{LATIN SMALL LETTER E WITH ACUTE}",
    "fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade",
    "\N{GREEK SMALL LETTER BETA}-globulin"
]
with open("/tmp/sample.utf8", mode="w", encoding="utf8") as f:
    for s in slist:
        print(s, file=f)
Now if you look at the file you made, you’ll see that it says:
$ cat /tmp/sample.utf8
Cañon City
naïveté
façade
β-globulin
And you can see that those are the right code points this way:
$ uniquote -x /tmp/sample.utf8
Ca\x{F1}on City
na\x{EF}vet\x{E9}
fa\x{E7}ade
\x{3B2}-globulin
See how much easier that is? Let the stream object handle any low-level encoding or decoding for you.
Summary: Don't call encode or decode yourself when all you are doing is using them to process a homogeneous stream that's all of it in the same encoding. That's way too much bother for zero gain. Use the encoding argument just once and for all.
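As for the edit about \n: text mode keeps Python's universal newline handling, so each '\n' you print is still written as the OS-specific line ending; you lose that only in binary mode. A quick way to confirm it (a sketch; the filename is made up):

with open("newline-check.txt", mode="w", encoding="utf8") as f:
    f.write("line one\nline two\n")

# Read the raw bytes back: on Windows each '\n' was written as b'\r\n'.
with open("newline-check.txt", mode="rb") as f:
    print(f.read())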
Open the file in binary mode, that's the least invasive way in terms of changes.
On the other hand, you could set the output file encoding with open() and avoid explicit string encoding altogether.
You might want to read the manual of the open() function.
Open the file in binary mode
open('filename', 'wb').write(s.encode('utf-8'))
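Note that binary mode skips the newline translation mentioned in the question's edit, so if you need OS-specific line endings you have to apply them yourself. A minimal sketch, assuming s is the string from the question and uses '\n' internally:

import os

# Translate '\n' to the platform line separator before encoding.
with open('filename', 'wb') as f:
    f.write(s.replace('\n', os.linesep).encode('utf-8'))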
I have been using ISAPI_Rewrite v2 for URL rewriting for quite a while. The site is in Hebrew, and so are the page URLs.
ISAPI_Rewrite v2 doesn't support Hebrew characters, but I overcame this problem by using the UTF-8 (hex) codes for the Hebrew characters.
Here is an example:
RewriteRule ^/\%D7\%A6\%D7\%95\%D7\%A8_\%D7\%A7\%D7\%A9\%D7\%A8/$ /Contact.aspx [L,I]
RewriteRule ^/\%D7\%A6\%D7\%95\%D7\%A8_\%D7\%A7\%D7\%A9\%D7\%A8$ /Contact.aspx [L,I]
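For reference, those hex codes are simply the UTF-8 bytes of the Hebrew path ("צור_קשר"), percent-encoded. For illustration, they can be generated like this (Python shown as an assumed example; it is not part of the ISAPI setup):

from urllib.parse import quote

print(quote('צור_קשר'))   # %D7%A6%D7%95%D7%A8_%D7%A7%D7%A9%D7%A8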
The problem:
While checking my popular pages in statcounter I came across this url:
http://mysite.com/%u05F6%u05E5%u05F8_%u05F7%u05F9%u05F8
Which is the same URL rule as in my example but in Unicode! And apparently ISAPI_Rewrite v2 doesn't handle these URLs, and the user gets "The page cannot be found".
There are also more complex pages, for example ones that send part of the URL as a query parameter, which is also in Unicode.
I thought of only one solution: make the same rules, this time in Unicode, and deal with the Unicode in the code-behind. But there are two problems with that solution:
The URL is displayed to the user in Unicode and not in Hebrew.
More code in the code-behind which, in my opinion, doesn't need to be there. What I mean is that this scenario can and should be handled before it reaches the code.
Any thoughts?
Thanks.
EDIT:
Maybe this redirection can be accomplished by IIS 6 somehow? Whenever IIS identifies a Unicode URL, it could convert it to UTF-8 and redirect the page.
ISAPI_Rewrite v2 doesnt support Hebrew characters, but I overcome this problem by using UTF-8
IIS in general requires you to use UTF-8 in URLs. There is a fallback to using the default locale-specific (‘ANSI’) encoding when the URL isn't a valid UTF-8 sequence, but that's (a) no use if your server's locale isn't Hebrew (code page 1255), and (b) still not wholly reliable as some cp1255 strings can also be valid UTF-8 sequences. So, yes, for reliability always use the UTF-8 form.
http://mysite.com/%u05F6%u05E5%u05F8_%u05F7%u05F9%u05F8
Which is the same URL rule as in my example but in Unicode!
Not really. The %uxxxx syntax comes from the JavaScript escape() function and is specific to that function's custom form of encoding. It has no relation to standard URL-encoding. The above is not even a valid URL and won't be accepted by some browsers.
You need to find where that link is coming from and fix it to use proper UTF-8-%xx-encoding instead.
In the meantime you might be able to do something with a 404 handler that redirects to the canonical form instead.
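Such a 404 handler essentially just unescapes the %uXXXX sequences and re-encodes the characters as UTF-8. A minimal sketch of the conversion in Python (the helper name is made up; the input mirrors the rule from the question):

import re
from urllib.parse import quote

def fix_percent_u(url):
    # Replace each JavaScript-style %uXXXX with standard UTF-8 %XX triplets.
    return re.sub(
        r'%u([0-9A-Fa-f]{4})',
        lambda m: quote(chr(int(m.group(1), 16)), safe=''),
        url,
    )

print(fix_percent_u('/%u05E6%u05D5%u05E8_%u05E7%u05E9%u05E8'))
# /%D7%A6%D7%95%D7%A8_%D7%A7%D7%A9%D7%A8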
If you use some FastCGI extension behind IIS, you can try to configure FastCGI to use UTF-8 encoding for a particular set of server variables: use the REG_MULTI_SZ registry key FastCGIUtf8ServerVariables and set its value to a list of server variable names.
reg add HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\w3svc\Parameters /v FastCGIUtf8ServerVariables /t REG_MULTI_SZ /d REQUEST_URI\0PATH_INFO
https://www.iis.net/learn/application-frameworks/install-and-configure-php-on-iis/configuring-the-fastcgi-extension-for-iis-60#utf8servervars
We restored from a backup in a different format to a new MySQL structure (which is set up correctly for UTF-8 support). We have weird characters showing in the browser, but we're not sure what they're called, which makes it hard to find a master list of what they translate to.
I have noticed that they do, in fact, correlate to a specific character. For example:
â„¢ always translates to ™
— always translates to —
• always translates to ·
I referenced this post, which got me started, but it is far from a complete list. Either I'm not searching for the correct name, or a "master list" of these bad-to-good conversions doesn't exist as a reference.
Reference:
Detecting utf8 broken characters in MySQL
Also, when searching via a MySQL query, if I search for â, MySQL always treats it as an "a". Is there any way to tweak my MySQL queries so that they do more literal searches? We don't use internationalization much, so I can safely assume any field containing the â character is a problematic entry, which would need to be remedied by the "fixit" script we're building.
Instead of designing a "fixit" script to go through and replace this data, I think it would be better to simply fix the issue directly. It seems like the data was originally stored in a different format than UTF-8 so that when you brought it into the table that was set up for UTF-8, it garbled the text. If you have the opportunity, go back to your original backup to determine the format the data was stored in. If you can't do that, you will probably need to do a bit of trial and error to figure out which format the data is in. However, once you know that, conversion is easy. Read the following article's section on Repairing:
http://www.istognosis.com/en/mysql/35-garbled-data-set-utf8-characters-to-mysql-
Basically you are going to set the column to BINARY and then set it to the original charset. That should make the text appear properly (a good check to know you are using the correct charset). Once that is done, set the column to UTF-8. This will convert the data properly and it will correct the problems you are currently experiencing.
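For reference, the garbling in the question is the classic signature of UTF-8 bytes being read as a single-byte encoding such as cp1252 (Windows Latin-1). A quick Python check, assuming cp1252 was the misread encoding, reverses it:

# UTF-8 bytes for '™' (E2 84 A2) read as cp1252 render as 'â„¢';
# re-encoding as cp1252 and decoding as UTF-8 undoes the damage.
for garbled in ('â„¢', '—'):
    print(garbled, '->', garbled.encode('cp1252').decode('utf-8'))
# â„¢ -> ™
# — -> —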
I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of an HTTP request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
- The CSV file is very simple in structure.
- There's only one field / one column, which I name TERM and later use as ${TERM}.
- I don't really need full CSV because it's only one string per line; there are no commas or quotes.
- It's UTF-8, and when I run the Unix "file" command on the file, it says UTF-8 text.
- I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True; the output was different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when doing variable expansion inside the text field that I POST, where I reference ${TERM}, maybe at that point it also maps to question marks, and the corruption happens at that later point. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
Found the issue: there was another place where the UTF-8 encoding had to be specified.
In the HTTP Request, to the right of the Method, you also have to set Content Encoding to UTF-8.
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise. Maybe I'm just used to XML, but it's probably not a good practice to assume that; HTTP/1.1 historically defaults text content to ISO-8859-1 (Latin-1), and even where there's a spec, folks don't always follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request is a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs; my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's default way of reporting an encoding problem; this started in the Java 1.4.x timeframe. In my own Java code I prefer to make encoding errors throw an exception, but again, that's not the default, and it's not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
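To make that diagnostic concrete (a sketch in Python, though the same distinction holds in Java): character-level replacement yields one "?" per code point, while byte-level mangling of UTF-8 yields about three garbage characters per Japanese character:

term = 'をインストールする'   # 9 Japanese characters

# Character-level replacement: one '?' per code point.
print(term.encode('ascii', errors='replace'))   # b'?????????'

# Byte-level view: each of these characters is 3 bytes in UTF-8.
print(len(term.encode('utf-8')))                # 27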
Came across this topic when searching for a solution for using parameters from a CSV file that contained some columns written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and last names had to be in Hebrew.
I exported the file as a "Unicode Text" file. It became tab-delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++ and could see the Hebrew letters properly. Notepad++ has an "Encoding" menu item where you can check the encoding or change it, so I changed the encoding from UTF-16 LE to UTF-8.
Then I replaced the tabs with commas (just selected a tab and pasted it into the Find box).
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 which, taken in pairs (%D7 %A9), corresponded properly to Hebrew letters.
To my mind, JMeter has a bug and cannot properly display the Unicode characters, but it sends (POSTs) them out OK.
Hope I am right and hope it will help someone.
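Those raw %XX pairs are easy to verify outside JMeter; a quick check in Python, using "שלום" as a stand-in for the question's string:

from urllib.parse import quote, unquote

print(quote('שלום'))                        # %D7%A9%D7%9C%D7%95%D7%9D
print(unquote('%D7%A9%D7%9C%D7%95%D7%9D'))  # שלום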
You can try using "SHIFT-JIS" as the Content encoding (the field is next to the Method selection). Then uncheck "Encode?" for the parameter that includes Japanese.
Hope it works for you.