Decoding URL containing unicode characters

I have the following code in a Mako template:
<a href="#" onclick='getCompanyHTML("${fund.investments[inv_name].name | u}"); return false;'>${inv_name}</a>
This applies url escaping to the name string of an object representing a company. The resulting escaped string is then used in a url. The Mako documentation states that url encoding is provided using urllib.quote_plus(string.encode('utf-8')).
On the server I receive the company name part into the argument investment_name:
def Investment(client, fund_name, investment_name, **kwargs):
    client = urllib.unquote_plus(client)
    fund_name = urllib.unquote_plus(fund_name)
    investment_name = urllib.unquote_plus(investment_name)
I then use investment_name as a key back in to the same dictionary from which it was extracted in the template.
This works fine for all the standard cases, such as spaces, slashes, and single quotes in the company name. However, it fails if the company name contains unicode characters outside of the ascii character set.
For instance, the url for company name "Eptisa Servicios de Ingeniería S.L." is rendered as "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L." When this value arrives back at the server, I'm reversing the url escaping but clearly failing to decode the unicode properly because my attempt to use the result as a dictionary key generates a key error.
I've tried adding unicode decoding in these two forms, without luck:
investment_name = urllib.unquote_plus(investment_name.decode('utf-8'))
investment_name = urllib.unquote_plus(investment_name.encode('raw_unicode_escape').decode('utf-8'))
Can anyone suggest what I must do to "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L." to turn it back into "Eptisa Servicios de Ingeniería S.L."?

Do it in the reverse order: first unquote then .decode('utf-8')
Do not mix bytes and Unicode strings.
Example
import urllib
q = "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L."
b = urllib.unquote_plus(q)
u = b.decode("utf-8")
print u
Note: print u might produce UnicodeEncodeError. To fix it:
print u.encode(character_encoding_your_console_understands)
Or set PYTHONIOENCODING environment variable.
On Unix you could try locale.getpreferredencoding() as character encoding, on Windows see output of chcp
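The example above is Python 2. As a rough sketch of the same idea in Python 3 (assuming the standard urllib.parse module, which the original answer does not use), unquote_plus percent-decodes and then decodes the bytes as UTF-8 in a single call. The explicit bytes step is included only to mirror the "unquote first, then decode" order described above.

```python
from urllib.parse import unquote_plus, unquote_to_bytes

q = "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L."

# One call: percent-decode, treat '+' as a space, decode the bytes as UTF-8.
name = unquote_plus(q)

# The same two steps spelled out: unquote to bytes first, then decode.
# unquote_to_bytes does not translate '+', so that is done by hand here.
raw = unquote_to_bytes(q.replace("+", " "))
name_two_step = raw.decode("utf-8")

print(name)
```

Either result is a str and can be used directly as the dictionary key.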

Related

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original I'm getting some unexpected results as I need to use UTF-7 during decoding and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed and shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation would be JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I first need to undo the Base64 encoding. To do so, System.Convert can be used (which results in a byte array), and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this would look like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL decode the whole thing. As it is a UTF-8 string, I'd expect only to need to run the URL decode without any further parameters. But if you do that, you end up with an accented a that looks like � in a file or ? on the console. To get the actual original string, it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works, but I don't really get why it's necessary, since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one that has the original text, in the same file encoded as UTF-8.
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of the UrlDecode of C#. If you run escape('✓ à la mode') in your browser you will end up with the following string: %u2713%20%E0%20la%20mode. So we have a four-digit %u escape for the check mark and a two-digit %xx escape for the à. If we use this directly in UrlDecode we end up with the same error. My current assumption would be that it's an issue with the encoding of the PowerShell window and pasting characters into it.
Turns out it actually isn't all that strange. It's just that, for what I want to do, it's advantageous to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding. But anyway, as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out, "What is the proper way to URL encode Unicode characters?" would indicate that the %u format was never formally standardized. A newer function to escape characters exists though, which is encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The originally linked Mozilla article about Base64-encoding UTF-8 strings modifies the whole process in a way that allows you to just call the Base64 decode function to get the whole string. This is done by converting the URL-encoded version of the string to bytes.
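To make the difference between the two escape formats concrete, here is a small sketch in Python. Python's urllib.parse.unquote is used only as a stand-in for a percent-decoder and is not the .NET UrlDecode; in particular, it leaves the non-standard %u2713 sequence untouched. The sketch shows that %E0 (from escape) is a Latin-1 code point that a UTF-8 decoder cannot interpret, while the output of encodeURIComponent is percent-encoded UTF-8 and decodes cleanly.

```python
import base64
from urllib.parse import unquote

# Base64 payload from the question; it wraps the output of escape().
escaped = base64.b64decode("JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl").decode("ascii")

# %E0 is the Latin-1 code point of 'à'. As a lone byte it is not valid
# UTF-8, so a UTF-8 percent-decoder substitutes U+FFFD for it.
as_utf8 = unquote(escaped, encoding="utf-8")
as_latin1 = unquote(escaped, encoding="latin-1")

# What encodeURIComponent('✓ à la mode') produces: percent-encoded UTF-8.
component = "%E2%9C%93%20%C3%A0%20la%20mode"
decoded = unquote(component)

print(escaped)
print(decoded)
```

The practical takeaway matches the answer above: produce the string with encodeURIComponent so that the receiving side can decode it as plain UTF-8.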

How do I replace spaces with %20 in PowerShell?

I'm creating a PowerShell script that will assemble an HTTP path from user input. The output has to convert any spaces in the user input to the product specific codes, "%2F".
Here's a sample of the source and the output:
The site URL can be a constant, though a variable would be a better approach for reuse, as used in the program is: /http:%2F%2SPServer/Projects/"
$Company="Company"
$Product="Product"
$Project="The new project"
$SitePath="$SiteUrl/$Company/$Product/$Project"
As output I need:
'/http:%2F%2FSPServer%2FProjects%2FCompany%2FProduct%2FThe%2Fnew%2Fproject'
To replace " " with %20 and / with %2F and so on, do the following:
[uri]::EscapeDataString($SitePath)
The solution of @manojlds converts all odd characters in the supplied string.
If you want to do escaping for URLs only, use
[uri]::EscapeUriString($SitePath)
This will leave, e.g., slashes (/) or equal signs (=) as they are.
Example:
# Returns http%3A%2F%2Ftest.com%3Ftest%3Dmy%20value
[uri]::EscapeDataString("http://test.com?test=my value")
# Returns http://test.com?test=my%20value
[uri]::EscapeUriString("http://test.com?test=my value")
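For comparison, the same distinction can be sketched in Python with urllib.parse.quote. This is only an analogy to the two .NET methods, not a drop-in equivalent: the safe sets chosen below are assumptions picked to reproduce the two outputs shown above.

```python
from urllib.parse import quote, quote_plus

url = "http://test.com?test=my value"

# Escape everything except unreserved characters (data-string style).
as_data = quote(url, safe="")

# Leave common URL delimiters alone and escape only the rest (whole-URL style).
as_url = quote(url, safe=":/?=&")

# Form-style encoding: spaces become '+' rather than %20.
as_form = quote_plus(url)

print(as_data)
print(as_url)
print(as_form)
```

The third variant is worth checking for whenever a form-style encoder is used, since a '+' is not the %20 the question asks for.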
For newer operating systems, the command is changed. I had problems with this in Server 2012 R2 and Windows 10.
[System.Net.WebUtility] is what you should use if you get errors that [System.Web.HttpUtility] is not there.
$Escaped = [System.Net.WebUtility]::UrlEncode($SitePath)
The output transformation you need (spaces to %20, forward slashes to %2F) is called URL encoding. It replaces (escapes) characters that have a special meaning when part of a URL with their hex equivalent preceded by a % sign.
You can use .NET framework classes from within Powershell.
[System.Web.HttpUtility]::UrlEncode($SitePath)
Encodes a URL string. These method overloads can be used to encode the entire URL, including query-string values.
http://msdn.microsoft.com/en-us/library/system.web.httputility.urlencode.aspx

%40 converted into @ on Get

I am passing my variables as follows to url using GET method
http://www.mysite.com/demo.php?sid=123121&email_id=stevemartin144%40gmail.com
and when I print $_GET in demo.php it displays the parameters as follows:
email_id stevemartin144@gmail.com
sid 123121
Instead of the above output, I want the parameters as I passed them:
email_id stevemartin144%40gmail.com
sid 123121
I don't want %40 to be converted into @.
Please suggest a solution for this.
Thanks in advance
"%40" in a URL means "@". If you want a "%" to mean "%", you need to URL encode it to "%25".
URL encoding is just a transport encoding. If you feed in "@", its transport encoded version is "%40", but the recipient will get "@" again. If you want to feed in "%40" and have the recipient receive "%40", you need to URL encode it to "%2540".
If the recipient correctly receives "@" but you want to use the URL encoded version for whatever reason, you can also have the recipient urlencode it again.
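The round trip can be sketched in a few lines of Python (used here purely for illustration; the question itself is about PHP, where the decoding happens before $_GET is populated). Encoding once yields what travels in the URL; encoding twice is what you would send if the recipient should see the literal "%40".

```python
from urllib.parse import quote, unquote

address = "stevemartin144@gmail.com"

once = quote(address, safe="")   # transport form of the raw address
twice = quote(once, safe="")     # transport form of the already-encoded text

# The recipient decodes exactly one layer.
received_once = unquote(once)
received_twice = unquote(twice)

print(once, "->", received_once)
print(twice, "->", received_twice)
```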
Notes:
Online Converter:
Replace each special character with its percent-encoded hexadecimal code.
For a list of character codes, refer to https://unicode-table.com or http://unicodelookup.com
Local Converter using Python:
Reference: conversion of the password "p@s#w:E" to its percent-encoded form is as follows:
@ = %40
$ = %24
# = %23
: = %3A
p@s#w:E = p%40s%23w%3AE
Input:
[root@localhost ~]# python -c "import sys, urllib as enc; print enc.quote_plus(sys.argv[1])" "p@s#w:E"
Output:
p%40s%23w%3AE
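The one-liner above relies on Python 2 (urllib.quote_plus and the print statement). A minimal Python 3 sketch of the same conversion, assuming the standard urllib.parse module, would be:

```python
from urllib.parse import quote_plus

password = "p@s#w:E"

# quote_plus percent-encodes reserved characters such as '@', '#' and ':'.
encoded = quote_plus(password)

print(encoded)
```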

Perl JSON pound sign escaping

I am trying to use a web API of a service written in Perl (OTRS).
The data is sent in JSON format.
One of the string values inside the JSON structure contains a pound sign, which apparently is used as a comment character in JSON.
This results in a parsing error:
unexpected end of string while parsing JSON string
I couldn't find how to escape the character in order to get the string parsed successfully.
The obvious slash escaping results in:
illegal backslash escape sequence in string
Any ideas how to escape it?
Update:
The URL I am trying to use looks something like that (simplified but still causes the error):
http://otrs.server.url/otrs/json.pl?User=username&Password=password&Object=TicketObject&Method=ArticleSend&Data={"Subject":"[Ticket#100000] Test Ticket from OTRS"}
Use URI::Escape:
use URI::Escape;
my $safe = uri_escape($url);
See RFC 1738 for the list of characters which can be unsafe.
The hash symbol, #, has a special meaning in URLs, not in JSON. Your URL is probably getting truncated at the hash before the remote server even sees it:
http://otrs.server.url/otrs/json.pl?User=username&Password=password&Object=TicketObject&Method=ArticleSend&Data={"Subject":"[Ticket
And that means that the remote server gets mangled JSON in Data. The solution is to URL encode your parameters before pasting them together to form your URL; eugene y tells you how to do this.
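As an illustration of "encode each parameter before pasting the URL together", here is a sketch in Python rather than Perl. The host and parameter names are copied from the simplified URL in the question; everything else is standard library. Each value is percent-encoded individually, so the # inside the JSON becomes %23 and can no longer be mistaken for the start of a fragment.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

base = "http://otrs.server.url/otrs/json.pl"
params = {
    "User": "username",
    "Password": "password",
    "Object": "TicketObject",
    "Method": "ArticleSend",
    "Data": json.dumps({"Subject": "[Ticket#100000] Test Ticket from OTRS"}),
}

# urlencode percent-encodes every value, so '#' is sent as %23.
url = base + "?" + urlencode(params)

# Round trip: a receiver that decodes the query recovers the JSON text intact.
received = parse_qs(urlsplit(url).query)["Data"][0]

print(url)
```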

How to detect if a Unicode char is supported by EBCDIC in .NET 4.0?

We have a web site and WinForms application written in .NET 4.0 that allows users to enter any Unicode char (pretty standard).
The problem is that a small amount of our data gets submitted to an old mainframe application. While we were testing, a user entered a name with characters that ended up crashing the mainframe program. The name was BOËNS. The Ë is not supported.
What is the best way to detect if a unicode char is supported by EBCDIC?
I tried using the following regular expression but that restricted some standard special chars (/, _, :) which are fine for the mainframe.
I would prefer to use one method to validate each char, or have a method that you just pass a string to and that returns true or false depending on whether chars not supported by EBCDIC are contained in the string.
First, you would have to get the proper Encoding instance for EBCDIC by calling the static GetEncoding method, which takes the code page id as a parameter.
Because you are encoding (characters to bytes), the fallback that matters is the EncoderFallback, not the DecoderFallback. Set it to the value of the static ExceptionFallback property on the EncoderFallback class. Instances returned by GetEncoding are read-only, so either pass the fallback to the GetEncoding overload that accepts fallback arguments or set it on a Clone of the instance.
Then, in your code, you would loop through each character in your string and call the GetBytes method to encode the character to its byte sequence. If it cannot be encoded, an EncoderFallbackException is thrown; you would just have to wrap each call to GetBytes in a try/catch block to determine which character is in error.
Note, the above is required if you want to know the position of the character that failed. If you don't care about the position, just whether the string as a whole will encode, then you can call the GetBytes overload that takes a string parameter; it will throw the same EncoderFallbackException if a character that cannot be encoded is encountered.
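The per-character loop described above can be sketched as follows. This is in Python, not .NET, and it assumes the mainframe uses EBCDIC code page 037 (Python codec "cp037"); the real code page is not stated in the question and determines which characters pass. The function name is hypothetical.

```python
def unencodable_chars(text, codec="cp037"):
    """Return (index, char) pairs for characters the codec cannot encode.

    "cp037" is an assumed EBCDIC code page; substitute the one the
    mainframe actually uses.
    """
    bad = []
    for index, char in enumerate(text):
        try:
            # Strict error handling raises instead of substituting '?'.
            char.encode(codec)
        except UnicodeEncodeError:
            bad.append((index, char))
    return bad


print(unencodable_chars("Price: 5€"))
print(unencodable_chars("path/to_file:ok"))
```

If the position is not needed, a single text.encode(codec) inside one try/except gives the yes/no answer for the whole string.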
You can escape characters in Regex using the \ . So if you want to match a dot, you can do @"\." . To match /._,:[]- for example: @"[/._,:\-\[\]]" . Now, EBCDIC is 8 bits, but many characters are control characters. Do you have a list of "valid" characters?
I have made this pattern:
string pattern = @"[^a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"' + "]";
It should find "illegal" characters. If IsMatch then there is a problem.
I have used this: http://nemesis.lonestar.org/reference/telecom/codes/ebcdic.html
Note the special handling of the ". I'm using the @ at the beginning of the string to disable \ escape expansion, so I can't escape the closing quote, and so I add it to the pattern at the end.
To test it:
Regex rx = new Regex(pattern);
bool m1 = rx.IsMatch(@"a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"');
bool m2 = rx.IsMatch(@"€a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"');
m1 is false (it's the list of all the "good" characters), m2 is true (to the other list I've added the € symbol)