How to get the same utf-8 encoding as Google for Arabic URLs? - encoding

Google: https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
Encoding with utf-8, I get the below: https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6
How can I get the same URLs as Google's?
In Python, I used the following to UTF-8 encode the Arabic URL:
urllib.parse.quote(url.encode('utf-8'), safe='')
This gives the second encoded URL above, which ends with D8%B6. Google's, however, ends with D8%25B6.
If I copy-paste the Arabic URL from one browser window to another, I get an encoding like mine, not like Google's.

The way I understand your question, you have a URL such as (from an Al Jazeera page in this case):
https://www.aljazeera.net/news/healthmedicine/2019/4/29/%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6
You then want to construct a Google Search Console URL for this page like:
https://search.google.com/search-console/performance/search-analytics?resource_id=sc-domain%3Aaljazeera.net&hl=ar&breakdown=page&page=!https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
So in short, you have a Google Search Console URL and want to add another URL as a query parameter.
Note that the Al Jazeera URL contains many non-ASCII characters that are properly encoded. In your browser's address bar, the URL will likely be displayed as
aljazeera.net/news/healthmedicine/2019/4/29/لحدوث-الحمل-أو-تجنبه-هكذا-تحتسبين-أيام-التبويض
That's not a valid URL but easier to read. When you copy the URL, you get the escaped one with ASCII characters only. That's the one you start with.
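As a minimal sketch of how the readable form and the escaped form relate (assuming Python 3 and its standard urllib.parse module; the variable names are only illustrative), percent-encoding the first word of the Arabic path gives the start of the escaped path shown above:

```python
from urllib.parse import quote, unquote

# First word of the readable path as displayed in the address bar.
readable = "لحدوث"

# quote() encodes the text as UTF-8 and writes each byte as %XX.
escaped = quote(readable)
print(escaped)  # %D9%84%D8%AD%D8%AF%D9%88%D8%AB

# unquote() reverses the step and restores the readable text.
print(unquote(escaped) == readable)  # True
```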
So the steps to create the Search Console URL are:
1. Run the Al Jazeera URL through URL encoding. Most programming languages provide such a function, and there are online services such as https://www.urlencoder.org/
2. Append the result to the base Google Search Console URL (https://search.google.com/search-console/performance/search-analytics?resource_id=sc-domain%3Aaljazeera.net&hl=ar&breakdown=page&page=!).
That's it.
Note that the Search Console base URL has two peculiarities:
The page parameter starts with an exclamation mark, e.g. ...&page=!https%3A...
For a different domain, the URL needs to be changed as the domain name appears a second time in the URL.
Python code:
import urllib.parse
url = "https://www.aljazeera.net/news/healthmedicine/2019/4/29/%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6"
google_base_url = "https://search.google.com/search-console/performance/search-analytics?resource_id=sc-domain%3Aaljazeera.net&hl=ar&breakdown=page&page=!"
# safe='' so that ':' and '/' are encoded too (quote() leaves '/' unescaped by default)
final_url = google_base_url + urllib.parse.quote(url, safe='')
print(final_url)
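Since the domain name appears in the base URL as well, the two steps could be wrapped in a small helper. This is only a sketch: the function name search_console_url and its parameters are hypothetical, and the parameter layout is copied from the example URL above rather than from any documented Google API.

```python
from urllib.parse import quote

def search_console_url(page_url: str, domain: str, lang: str = "ar") -> str:
    """Build a Search Console performance URL filtered to one page.

    Hypothetical helper: the layout mirrors the example URL in this
    answer and is an assumption, not a documented interface.
    """
    base = "https://search.google.com/search-console/performance/search-analytics"
    # The domain appears in resource_id as "sc-domain:<domain>", percent-encoded.
    resource_id = quote("sc-domain:" + domain, safe="")
    # safe="" also encodes ":" and "/"; existing "%" signs become "%25".
    page = "!" + quote(page_url, safe="")
    return f"{base}?resource_id={resource_id}&hl={lang}&breakdown=page&page={page}"

# Shortened, already percent-encoded page URL (hypothetical path).
example = search_console_url("https://www.aljazeera.net/news/%D8%B6", "aljazeera.net")
print(example)
```

The important detail is safe="": without it, quote() leaves the slashes of the page URL unencoded.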
Old answer
URL encoding is a tricky business because of mistakes in the encoding design, peculiarities of web servers, and mostly because several different cases are usually mixed up.
Also note that most browsers do not display a correct URL in the address bar, but rather a partially decoded, easier to read URL.
The main cases to distinguish are:
1. Insert data with non-ASCII characters into the path of a URL (e.g.: https://ttt.com/FANCY_CHARACTERS/...)
2. Add data with non-ASCII characters as a query parameter (e.g.: https://ttt.com/res/f?f=FANCY_CHARACTERS)
Your case seems to be a special version of case 2, namely adding a URL as a query parameter to another URL.
So let's assume you have a valid URL from whatever source. It already contains encoded characters.
https://www.aljazeera.net/news/healthmedicine/2019/4/29/%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6
If you want to add it to another URL, you just need to run it through URL encoding. You don't need to care about Unicode characters as they are already encoded. The URL contains ASCII characters only:
https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
You can now add this URL to another URL, e.g.:
https://fff.com/ttt/qqq?url=https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
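A minimal sketch of this case in Python, assuming the standard urllib.parse module; the inner URL is shortened to a hypothetical path and fff.com is the placeholder host from above. It also shows that the receiving side gets the original URL back after decoding the parameter once:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# A valid, already percent-encoded URL (shortened, hypothetical path).
inner = "https://www.aljazeera.net/news/%D8%A3%D9%88"

# urlencode() percent-encodes the value, so each "%" becomes "%25".
outer = "https://fff.com/ttt/qqq?" + urlencode({"url": inner})
print(outer)

# Decoding the query parameter once yields the inner URL unchanged.
recovered = parse_qs(urlsplit(outer).query)["url"][0]
print(recovered == inner)  # True
```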
Let me know if that's what you wanted to do...

Related

Trouble with Login with PayPal redirect_uri mismatch

I am trying to configure my NetIQ SocialAccess appliance to allow authentication via Login with PayPal using OpenID Connect, but cannot seem to get my Return URL correct. I have seen a recent blog entry stating that the matching would become more strict, and wonder if anyone can tell me whether the difference in these two strings would cause the redirect_uri mismatch error. SocialAccess is adding a header with a redirect_uri string beginning with https%3A rather than https: as configured for my application's Return URL.
"%3A" is the percent-encoded form of the character ":". This means SocialAccess is sending an encoded URL string as your redirect_uri, which leads to a mismatch with what you have set in your app config.
URLs can only be sent over the Internet using the ASCII character-set.
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
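To illustrate why the two strings do not match, here is a small sketch in Python. The return URL is a hypothetical placeholder, and whether PayPal compares the raw or the decoded string is not something this example can show; it only demonstrates that the two forms differ until one of them is decoded.

```python
from urllib.parse import quote, unquote

configured = "https://example.com/callback"  # hypothetical configured Return URL
sent = quote(configured, safe="")             # encoded form, begins with https%3A

print(sent)                         # https%3A%2F%2Fexample.com%2Fcallback
print(sent == configured)           # False: a strict string comparison fails
print(unquote(sent) == configured)  # True: equal once decoded
```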

QNetworkRequest and automatic conversion of percent-encoded characters

I'm trying to download the audio samples from Amazon with the help of QNetworkAccessManager+QNetworkRequest+QNetworkReply. I've got a big problem in processing the redirect from, for example, http://www.amazon.com/gp/dmusic/aws/sampleTrack.html?clientid=Shazam&ASIN=B00DJBQWAE to http://d28julafmv4ekl.cloudfront.net/64%2F30%2F239068457_S64.mp3?Expires=1380627695&Signature=BlaBlaBlaBla&Key-Pair-Id=BlaBlaBla
(Note the percent-encoded path returned from the server.) The problem is that when the redirect target URL is passed to a new QNetworkRequest and the request is sent via QNAM, the %2F characters are automatically converted to slashes. This seems to be correct behavior, BUT the server requires these slashes to remain encoded. Is there any way to disable this conversion?
Btw, QNetworkReply also has similar feature - it returns the redirect url with already converted %xx characters.
You can apply percent-encoding to this URL first. This way, the '%2F' will be encoded to '%252F', and the QNetworkRequest will decode it back to '%2F'.
With this method: https://developer.blackberry.com/native/reference/cascades/qurl.html#toPercentEncoding
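The idea can be sketched outside Qt. The following Python snippet is not Qt code; it only illustrates the principle that one extra round of percent-encoding survives one round of automatic decoding, using the path from the question:

```python
from urllib.parse import quote, unquote

path = "64%2F30%2F239068457_S64.mp3"  # path as returned by the server

# Encode once more: every "%" becomes "%25".
pre_encoded = quote(path, safe="")
print(pre_encoded)  # 64%252F30%252F239068457_S64.mp3

# A single automatic decoding step restores the path with %2F intact.
print(unquote(pre_encoded) == path)  # True
```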

Special characters in WCF web API URLs

I have a web service which uses the WCF Web API to create a RESTful service. This service expects many different values in the URL path, separated by commas. This method works perfectly for simple data, e.g. someone's name or a numeric value. However, I have a field on the client side (a Java-based BlackBerry app) which allows a user to freely type data, including characters such as . or /, which messes up my whole URL.
Even when I replace the characters with their hex values e.g. a / to %2F the problem persists.
Does anyone know a means to either represent these characters in a URL which will be ignored when looking for the address or better yet a means to tell the URL the following characters are to be ignored perhaps in the way quotation marks would work?
You can use a URL-encoding function. URL encoding is the process of converting a string into a valid URL format. Valid URL format means that the URL contains only what is termed alpha | digit | safe | extra | escape characters.

URL with Unicode - ISAPI_Rewrite doesn't recognize it

I have been using ISAPI_Rewrite v2 for URL rewriting for quite a while. The site is in Hebrew, and so are the page URLs.
ISAPI_Rewrite v2 doesn't support Hebrew characters, but I overcame this problem by using UTF-8 (hex) codes for the Hebrew characters.
Here is an example:
RewriteRule ^/\%D7\%A6\%D7\%95\%D7\%A8_\%D7\%A7\%D7\%A9\%D7\%A8/$ /Contact.aspx [L,I]
RewriteRule ^/\%D7\%A6\%D7\%95\%D7\%A8_\%D7\%A7\%D7\%A9\%D7\%A8$ /Contact.aspx [L,I]
The problem:
While checking my popular pages in statcounter I came across this url:
http://mysite.com/%u05F6%u05E5%u05F8_%u05F7%u05F9%u05F8
Which is the same URL rule as in my example but in Unicode! Apparently ISAPI_Rewrite v2 doesn't handle such URLs, and the user gets "The page cannot be found".
There are also more complex pages, for example ones that send part of the URL as a query parameter, which is also in Unicode.
I could think of only one solution: write the same rules again, this time in Unicode, and deal with the Unicode in the code-behind. But there are two problems with this solution:
The URL is shown to the user in Unicode and not in Hebrew.
It adds code to the code-behind which, in my opinion, doesn't need to be there. What I mean is that this scenario can and should be handled before it reaches the code.
Any thoughts?
Thanks.
EDIT:
Maybe this redirection can be accomplished by IIS6 somehow? Whenever IIS identifies a Unicode URL, it would convert it to UTF-8 and redirect the page.
ISAPI_Rewrite v2 doesnt support Hebrew characters, but I overcome this problem by using UTF-8
IIS in general requires you to use UTF-8 in URLs. There is a fallback to using the default locale-specific (‘ANSI’) encoding when the URL isn't a valid UTF-8 sequence, but that's (a) no use if your server's locale isn't Hebrew (code page 1255), and (b) still not wholly reliable as some cp1255 strings can also be valid UTF-8 sequences. So, yes, for reliability always use the UTF-8 form.
http://mysite.com/%u05F6%u05E5%u05F8_%u05F7%u05F9%u05F8
Which is the same URL rule as in my example but in Unicode!
Not really. The %uxxxx syntax comes from the JavaScript escape() function and is specific to that function's custom form of encoding. It has no relation to standard URL-encoding. The above is not even a valid URL and won't be accepted by some browsers.
You need to find where that link is coming from and fix it to use proper UTF-8-%xx-encoding instead.
In the meantime you might be able to do something with a 404 handler that redirects to the canonical form instead.
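The conversion such a handler would need can be sketched as follows. This is written in Python purely for illustration (a real handler on IIS would use whatever language the site runs on), the helper name percent_u_to_utf8 is hypothetical, and the input is an illustrative path rather than the one from the question:

```python
import re
from urllib.parse import quote

_U_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})")

def percent_u_to_utf8(path: str) -> str:
    """Rewrite %uXXXX escapes as UTF-8 %XX sequences.

    Hypothetical helper; it does not handle surrogate pairs
    (characters outside the Basic Multilingual Plane).
    """
    # Turn each %uXXXX into its character, then percent-encode it as UTF-8.
    return _U_ESCAPE.sub(lambda m: quote(chr(int(m.group(1), 16))), path)

canonical = percent_u_to_utf8("/%u05E6%u05D5%u05E8")
print(canonical)  # /%D7%A6%D7%95%D7%A8
```

The handler could then redirect to the converted path, which is the form the existing rewrite rules already match.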
If you use a FastCGI extension behind IIS, you can try configuring FastCGI to use UTF-8 encoding for a particular set of server variables: use the REG_MULTI_SZ registry key FastCGIUtf8ServerVariables and set its value to a list of server variable names.
reg add HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\w3svc\Parameters /v FastCGIUtf8ServerVariables /t REG_MULTI_SZ /d REQUEST_URI\0PATH_INFO
https://www.iis.net/learn/application-frameworks/install-and-configure-php-on-iis/configuring-the-fastcgi-extension-for-iis-60#utf8servervars

What kind of text code is %62%69%73%68%6F%70?

On a specific webpage, when I hover over a link, I can see the text as "bishop" but when I copy-and-paste the link to TextPad, it shows up as "%62%69%73%68%6F%70". What kind of code is this, and how can I convert it into text?
Thanks!
URL encoding, I think.
You can decode it here: http://meyerweb.com/eric/tools/dencoder/
Most programming languages will have functions to urlencode/decode too.
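For example, in Python (a minimal sketch using the standard urllib.parse module):

```python
from urllib.parse import quote, unquote

# Decode the string from the question.
decoded = unquote("%62%69%73%68%6F%70")
print(decoded)  # bishop

# Encoding goes the other way; a space becomes %20.
encoded = quote("a b")
print(encoded)  # a%20b
```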
This is URL encoding. It is designed to pass characters like < / or & through a URL using their ASCII values in hex after a %. However, you can also use this for characters that don't need encoding per se. Makes the URL harder to read, which is sometimes desirable.
URL encoding replaces characters outside the ASCII set.
More info about URL encoding in the w3schools site.
As mentioned by others, this is simply an ASCII representation of the text so that it can be passed around the HTTP object easily. If you've ever noticed typing in a website URL that has a space in it, the browser will usually convert that to %20. That's the hexadecimal value for the "space" character in ASCII.
This used to be a way to trick old spam scrapers. One way spammers get email addresses is to scrape the source code of websites for strings matching the pattern "username@company.tld". By encoding just the username portion or the whole string this way, the address would still be readable by humans, but the scraper would have to convert it to a literal string before it could be used to send emails. Of course, modern-day spamming tools account for these sorts of strings.