Special/accented UTF-8 characters not recognised by PhantomJS

I am currently having an issue with PhantomJS (version 2.1.1/Windows 7) not recognising UTF-8 characters. Before asking this question, I found the following two articles useful for configuring the command prompt:
PhantomJS arguments.js example UTF-8 not working
Unicode characters in Windows command line - how?
As suggested by the second article, I used the command
chcp 65001
to change the code page to UTF-8. I then also set the command prompt's default font to Lucida Console.
To test this had worked, I created the following UTF-8 text file
---------------------------------------------------------
San José
Cañon City
Przecław Lanckoroński
François Gérard Hollande
El Niño
vis-à-vis
---------------------------------------------------------
and then ran the following command to demonstrate whether characters were being recognised and correctly displayed by the command prompt:
type utf8Test.txt
After this worked, I turned my attention to PhantomJS. Following the instructions here, I created the settings JSON file below to ensure that UTF-8 is the input and output character encoding (though according to the official documentation this appears to be the default).
{
    "outputEncoding": "utf8",
    "scriptEncoding": "utf8"
}
I then ran the following JavaScript through PhantomJS using the aforementioned json settings file in the same command prompt window:
console.log("---------------------------------------------------------");
console.log("San José");
console.log("Cañon City");
console.log("Przecław Lanckoroński");
console.log("François Gérard Hollande");
console.log("El Niño");
console.log("vis-à-vis");
console.log("---------------------------------------------------------");
var page = require('webpage').create();
// Display the initial requested URL
page.onResourceRequested = function(requestData, request) {
    if (requestData.id === 1) {
        console.log(requestData.url);
    }
};
// Display any initial requested URL response error
page.onResourceError = function(resourceError) {
    if (resourceError.id === 1) {
        console.log(resourceError.status + " : " + resourceError.statusText);
    }
};
page.open("https://en.wikipedia.org/wiki/San_José", function(status) {
    console.log("---------------------------------------------------------");
    phantom.exit();
});
[Screenshot: output from running this script]
From this I can see that PhantomJS is not able to understand UTF-8 special characters, and furthermore it passes the "unknown" character to websites when provided with a special or accented character as below:
URL passed to PhantomJS:
https://en.wikipedia.org/wiki/San_José
URL passed to remote host:
https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD
----------------------------------------------
%EF%BF%BD
�
instead of:
%C3%A9
é
This causes websites to respond with '400 : Bad Request' errors, and in the case of Wikipedia specifically, requesting the URL https://en.wikipedia.org/wiki/San_Jos%EF%BF%BD results in an error message of:
Bad title - The requested page title contains an invalid UTF-8 sequence.
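For reference, the two encodings compared above can be reproduced with a minimal Python sketch (this is not PhantomJS code). The use of cp1252 as the legacy code page is an assumption for illustration only; the question does not establish which encoding the script was actually read in.

```python
from urllib.parse import quote

# é (U+00E9) is two bytes in UTF-8, so it percent-encodes to %C3%A9.
good = quote("San_José")

# U+FFFD REPLACEMENT CHARACTER is three bytes in UTF-8: %EF%BF%BD.
bad = quote("San_Jos\ufffd")

# One way U+FFFD can arise (assumed cause): bytes written in a legacy
# single-byte code page are decoded as UTF-8 with replacement.
legacy_bytes = "San_José".encode("cp1252")
decoded = legacy_bytes.decode("utf-8", errors="replace")

print(good)
print(bad)
print(quote(decoded))
```

If the three printed values show the second and third matching, the "unknown" character was most likely introduced before the URL was ever encoded.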
So, with all this being said, does anyone know how to remedy this? There are many websites these days that use UTF-8 special/accented characters in their page urls, and it would be great if PhantomJS could be used to access them.
I really appreciate any help or suggestions you can provide me with.

var url = 'https://en.wikipedia.org/wiki/San_José';
page.open(encodeURI(url), function(status) {
console.log("---------------------------------------------------------");
console.log(page.evaluate(function(){ return document.title }));
phantom.exit();
});
Yes, it's garbling those symbols on Windows (on Linux it works beautifully) but at least you will be able to open pages and process them.

Related

How to get the same utf-8 encoding as Google for Arabic URLs?

Google: https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
Encoding with utf-8, I get the below: https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6
How can I get the same URLs as Google's?
In Python I've used the following method to utf-8 encode the Arabic url:
urllib.parse.quote(url.encode('utf-8'), safe='')
This gives the second encoded URL above, which ends with D8%B6. Google's, however, ends with D8%25B6.
If I copy and paste the Arabic URL from one browser window to another, I get a URL encoding similar to mine, not Google's.

The way I understand your question, you have a URL such as (from an Al Jazeera page in this case):
https://www.aljazeera.net/news/healthmedicine/2019/4/29/%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6
You then want to construct a Google Search Console URL for this page like:
https://search.google.com/search-console/performance/search-analytics?resource_id=sc-domain%3Aaljazeera.net&hl=ar&breakdown=page&page=!https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
So in short, you have a Google Search Console URL and want to add another URL as a query parameter.
Note that the Al Jazeera URL contains many non-ASCII characters that are properly encoded. In your browser's address bar, the URL will likely be displayed as
aljazeera.net/news/healthmedicine/2019/4/29/لحدوث-الحمل-أو-تجنبه-هكذا-تحتسبين-أيام-التبويض
That's not a valid URL but easier to read. When you copy the URL, you get the escaped one with ASCII characters only. That's the one you start with.
So the steps to create the Search Console URL are:
Run the Al Jazeera URL through URL encoding. Most programming languages provide such a function, or there are online services like https://www.urlencoder.org/
Append the result to the base Google Search Console URL (https://search.google.com/search-console/performance/search-analytics?resource_id=sc-domain%3Aaljazeera.net&hl=ar&breakdown=page&page=!)
That's it.
Note that the Search Console base URL has two peculiarities:
The page parameter starts with an exclamation mark, e.g. ...&page=!https%3A...
For a different domain, the URL needs to be changed as the domain name appears a second time in the URL.
Python code:
import urllib.parse
url = "https://www.aljazeera.net/news/healthmedicine/2019/4/29/%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6"
google_base_url = "https://search.google.com/search-console/performance/search-analytics?resource_id=sc-domain%3Aaljazeera.net&hl=ar&breakdown=page&page=!"
# safe='' also escapes "/" (quote() leaves it unescaped by default)
final_url = google_base_url + urllib.parse.quote(url, safe='')
print(final_url)
Old answer
URL encoding is a tricky business because of mistakes in the encoding design, peculiarities of web servers, and mostly because several different cases are usually mixed up.
Also note that most browsers do not display a correct URL in the address bar, but rather a partially decoded, easier to read URL.
The main cases to distinguish are:
Insert data with non-ASCII characters into the path of a URL (e.g. https://ttt.com/FANCY_CHARACTERS/...)
Add data with non-ASCII characters as a query parameter (e.g. https://ttt.com/res/f?f=FANCY_CHARACTERS)
Your case seems to be a special version of case 2, namely adding a URL as a query parameter to another URL.
So let's assume you have a valid URL from whatever source. It already contains encoded characters.
https://www.aljazeera.net/news/healthmedicine/2019/4/29/%D9%84%D8%AD%D8%AF%D9%88%D8%AB-%D8%A7%D9%84%D8%AD%D9%85%D9%84-%D8%A3%D9%88-%D8%AA%D8%AC%D9%86%D8%A8%D9%87-%D9%87%D9%83%D8%B0%D8%A7-%D8%AA%D8%AD%D8%AA%D8%B3%D8%A8%D9%8A%D9%86-%D8%A3%D9%8A%D8%A7%D9%85-%D8%A7%D9%84%D8%AA%D8%A8%D9%88%D9%8A%D8%B6
If you want to add it to another URL, you just need to run it through URL encoding. You don't need to care about Unicode characters as they are already encoded. The URL contains ASCII characters only:
https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
You can now add this URL to another URL, e.g.:
https://fff.com/ttt/qqq?url=https%3A%2F%2Fwww.aljazeera.net%2Fnews%2Fhealthmedicine%2F2019%2F4%2F29%2F%25D9%2584%25D8%25AD%25D8%25AF%25D9%2588%25D8%25AB-%25D8%25A7%25D9%2584%25D8%25AD%25D9%2585%25D9%2584-%25D8%25A3%25D9%2588-%25D8%25AA%25D8%25AC%25D9%2586%25D8%25A8%25D9%2587-%25D9%2587%25D9%2583%25D8%25B0%25D8%25A7-%25D8%25AA%25D8%25AD%25D8%25AA%25D8%25B3%25D8%25A8%25D9%258A%25D9%2586-%25D8%25A3%25D9%258A%25D8%25A7%25D9%2585-%25D8%25A7%25D9%2584%25D8%25AA%25D8%25A8%25D9%2588%25D9%258A%25D8%25B6
Let me know if that's what you wanted to do...
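The difference between the two encodings in the question can be shown with a short Python sketch. It uses only the last word of the page slug to keep the output readable; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

slug_word = "التبويض"  # last word of the Al Jazeera page slug

# First pass: Unicode text -> percent-encoded UTF-8 (what the browser copies).
once = quote(slug_word, safe="")

# Second pass: the "%" signs themselves become "%25" (what Google's URL contains).
twice = quote(once, safe="")

print(once)
print(twice)

# Decoding reverses one layer at a time.
assert unquote(twice) == once
assert unquote(once) == slug_word
```

Encoding the raw Arabic text once gives the form ending in D8%B6; encoding the already-encoded URL gives the form ending in D8%25B6.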

I need to remove a specific unicode in my existing subtitle text file

I work on subtitles and I have this Arabic file. When I open it in Notepad, right-click and select SHOW UNICODE CONTROL CHARACTERS, it shows me some weird characters on the left of every line. I tried many ways to remove them but failed. I also tried Notepad++, which failed too.
Notepad ++
SUBTITLE EDIT
EXCEL
WORD
288
00:24:41,960 --> 00:24:43,840
‫أتعلم، قللنا من شأنك فعلاً‬
289
00:24:44,000 --> 00:24:47,120
‫كان علينا تجنيدك لتكون جاسوساً‬
‫مكان (كاي سي)‬
290
00:24:47,280 --> 00:24:51,520
‫لا تعلمون كم أنا سعيد‬
‫لسماع ذلك‬
291
00:24:54,800 --> 00:24:58,160
‫لا تقلق، سيستيقظ نشيطاً غداً‬
292
00:24:58,320 --> 00:25:00,800
‫ولن يتذكر ما حصل‬
‫في الساعات الـ٦‬
The control characters are not showing in the text above. The character is U+202B, which shows as a ¶ sign; after googling it, I think it's called PILCROW.
The issue is that it prevents the subtitles from displaying correctly in the PS4 app.
I need this PILCROW sign to go away. With this website I can see the issue in the file: https://www.soscisurvey.de/tools/view-chars.php

The PILCROW ¶ is used by various software and publishers to show the end of a line in a document. The actual Unicode character does not exist in your file so you can't get rid of it.

The Unicode characters in these lines are 'RIGHT-TO-LEFT EMBEDDING'
(code \u202b) and 'POP DIRECTIONAL FORMATTING' (code \u202c) -
these are used in the text to indicate that the included text should be rendered
right-to-left instead of the occidental left-to-right direction.
Now, these characters are included as hints to the application displaying the text, rather than to actually perform the text reversing, so they can likely be removed without compromising the display of the text itself.
Now, this is a programming Q&A site, but you did not indicate any programming language you are familiar with, at least enough to run a program. So it is hard to give an answer that is suitable for you.
Python can be used to create a small program to filter such characters from a file, but I am not willing to write a full-fledged GUI program or a web app just as an answer here.
A program that works from the command line just to filter out a few characters is another thing, as it is just a few lines of code.
You have to store the following listing as a file named, say, "fixsubtitles.py", and then, in a terminal ("cmd" if you are on Windows), type python3 fixsubtitles.py \path\to\subtitlefile.txt and press Enter.
That is, of course, after installing the Python 3 runtime from http://python.org
(if you are on Mac or Linux, it is usually already installed)
import sys
from pathlib import Path

encoding = "utf-8"
# Translation table that deletes U+202B (RIGHT-TO-LEFT EMBEDDING)
# and U+202C (POP DIRECTIONAL FORMATTING)
remove_set = str.maketrans("", "", "\u202b\u202c")

if len(sys.argv) < 2:
    print("Usage: python3 fixsubtitles.py [filename]", file=sys.stderr)
    sys.exit(1)

path = Path(sys.argv[1])
data = path.read_text(encoding=encoding)
# Note: this overwrites the original file in place
path.write_text(data.translate(remove_set), encoding=encoding)
print("Done")
You may need to adjust the encoding, as Windows does not always use UTF-8 (the files can be in, for example, "cp1256"; if you get a Unicode error when running the program, try using this in place of "utf-8"). You may also need to add more characters to the set of characters to be removed; the tool you linked in the question should show you other such characters, if any. Other than that, the program above should work.
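To see which invisible formatting characters a piece of text contains before deciding what to strip, a sketch along these lines may help. `find_format_chars` is a hypothetical helper name, and restricting the search to Unicode category "Cf" (format characters) is an assumption about which characters matter here.

```python
import unicodedata

def find_format_chars(text):
    """Return sorted (code point, name) pairs for characters in category Cf."""
    found = {ch for ch in text if unicodedata.category(ch) == "Cf"}
    return sorted(
        ("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in found
    )

# One subtitle line wrapped in the embedding/pop pair described above.
sample = "\u202bولن يتذكر ما حصل\u202c"
report = find_format_chars(sample)

for code_point, name in report:
    print(code_point, name)
```

Any character reported here that is not already in the removal set can be appended to the string passed to str.maketrans.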

Beautiful Soup lxml Character Encoding Issue

I'm trying to parse a web page that has non-printable characters on it and write that to a file in python. I'm using Python 2.7 with requests and Beautiful Soup.
I get the page with requests, and parse it with the following-
for option in recon:
    data['opts'] = '/c' + option
    print "Getting: ",
    print option
    r = requests.post(url, data)
    print r.content
    page = bs4.BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
    print page
    tag = page.pre.contents
    print tag[0]
When testing, the print r.content shows the page properly in all its unformatted glory. The page is a .cfm, and the text I'm looking for falls between "pre" tags. After running it through bs, though, bs interprets some of the non-printable text as "br" tags, resulting in tag being a list of 2 items instead of just all the text between the pre tags. Is there a way to either just get the text between the pre tags with requests, or do something differently with bs to get it to not misinterpret the characters?
I've read through the following trying to figure it out, plus requests and beautiful soup docs, but found no luck so far-
Joel on Software - Character Sets
SO utf-8 vs unicode
SO Getting text between tags

Overthought the problem. I just base64 encoded the data before transfer with certutil on windows, removed the first and last line, and then decoded on the far side.
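The same idea can be sketched in Python instead of certutil: base64 turns arbitrary bytes, including non-printable ones, into plain ASCII that an HTML parser will not misread. The sample payload is made up for illustration, and this uses Python's standard base64 module rather than the tool the answer used.

```python
import base64

# Made-up payload containing non-printable and non-UTF-8 bytes.
raw = b"line one\x00\x0b\x1f binary-ish \xff\xfe payload"

# Sender side: the result contains ASCII letters, digits, "+", "/" and "=" only.
encoded = base64.b64encode(raw)

# Receiver side: decoding restores the original bytes exactly.
decoded = base64.b64decode(encoded)

print(encoded.decode("ascii"))
print(decoded == raw)
```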

Flask - handling unicode text with werkzeug?

So I am trying to have a browser download a file with a certain name, which is stored in a database. To prevent filename conflicts the file is saved on disk with a GUID, and when it comes time to actually download it, the filename from the database is supplied for the browser. The name is written in Japanese, and when I display it on the page it comes out fine, so it is stored OK in the database. When I try to actually have the browser download it under that name:
return send_from_directory(app.config['FILE_FOLDER'], name_on_disk,
                           as_attachment=True, attachment_filename=filename)
Flask throws an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-20:
ordinal not in range(128)
The error seems to originate not from my code, but from part of Werkzeug:
/werkzeug/http.py", line 150, in quote_header_value
value = str(value)
Why is this happening? According to their docs, Flask is "100% Unicode"
I actually had this problem before I rewrote my code, and fixed it by modifying numerous things actually in Werkzeug, but I really do not want to have to do this for the deployed app because it is a pain and bad practice.
Python 2.7.6 (default, Nov 26 2013, 12:52:49)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = "[얼티메이트] [131225] TVアニメ「キルラキル」オリジナルサウンドトラック (FLAC).zip"
>>> print repr(filename)
'[\xec\x96\xbc\xed\x8b\xb0\xeb\xa9\x94\xec\x9d\xb4\xed\x8a\xb8] [131225] TV\xe3\x82\xa2\xe3\x83\x8b\xe3\x83\xa1\xe3\x80\x8c\xe3\x82\xad\xe3\x83\xab\xe3\x83\xa9\xe3\x82\xad\xe3\x83\xab\xe3\x80\x8d\xe3\x82\xaa\xe3\x83\xaa\xe3\x82\xb8\xe3\x83\x8a\xe3\x83\xab\xe3\x82\xb5\xe3\x82\xa6\xe3\x83\xb3\xe3\x83\x89\xe3\x83\x88\xe3\x83\xa9\xe3\x83\x83\xe3\x82\xaf (FLAC).zip'
>>>

You should explicitly pass unicode strings (type unicode) when dealing with non-ASCII data. Generally in Flask, bytestrings are assumed to have an ASCII encoding.

I had a similar problem. I originally had this to send the file as attachment:
return send_file(dl_fd,
                 mimetype='application/pdf',
                 as_attachment=True,
                 attachment_filename=filename)
where dl_fd is a file descriptor for my file.
The unicode filename didn't work because the HTTP header doesn't support it. Instead, based on information from this Flask issue and these test cases for RFC 2231, I rewrote the above to encode the filename:
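# Requires: import urllib; from flask import make_response, send_file
response = make_response(send_file(dl_fd,
                                   mimetype='application/pdf'
                                   ))
# filename*=UTF-8''... carries the percent-encoded UTF-8 bytes of the name
response.headers["Content-Disposition"] = \
    "attachment; " \
    "filename*=UTF-8''{quoted_filename}".format(
        quoted_filename=urllib.quote(filename.encode('utf8'))
    )
return response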
Based on the test cases, the above doesn't work with IE8 but works with the other browsers listed. (I personally tested Firefox, Safari and Chrome on Mac)
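As a rough illustration of the header value built above, here is a Python 3 sketch using urllib.parse.quote (the Python 3 counterpart of the urllib.quote call in the answer). `content_disposition` is a hypothetical helper, not part of Flask or Werkzeug, and the short filename is just an example.

```python
from urllib.parse import quote

def content_disposition(filename):
    """Build an attachment header value using the filename*=UTF-8'' form."""
    # quote() encodes the str as UTF-8 and percent-escapes every non-safe byte.
    quoted = quote(filename, safe="")
    return "attachment; filename*=UTF-8''" + quoted

header = content_disposition("キル.zip")
print(header)
```

The resulting value is pure ASCII, which is why it passes through the header machinery that rejected the raw unicode name.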

You should use something like:
@route('/attachment/<int:attachment_id>/<filename>', methods=['GET'])
def attachment(attachment_id, filename):
    attachment_meta = AttachmentMeta(attachment_id, filename)
    if not attachment_meta:
        flask.abort(404)
    return flask.send_from_directory(
        directory=flask.current_app.config['UPLOAD_FOLDER'],
        filename=attachment_meta.filepath,
    )
This way url_for('attachment', attachment_id=1, filename=u'Москва 北京 תֵּל־אָבִיב.pdf') would generate:
/attachment/1/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%20%E5%8C%97%E4%BA%AC%20%D7%AA%D6%B5%D6%BC%D7%9C%D6%BE%D7%90%D6%B8%D7%91%D6%B4%D7%99%D7%91.pdf
Browsers would display or save this file with correct unicode name. Don't use as_attachment=True, as this would not work.

Encoding problems in ASP when using English and Chinese characters

I am having problems with encoding Chinese in an ASP site. The file formats are:
translations.txt - UTF-8 (to store my translations)
test.asp - UTF-8 - (to render the page)
test.asp is reading translations.txt that contains the following data:
Help|ZH|帮助
Home|ZH|首页
The test.asp page splits each line on the pipe delimiter and, if the user has a cookie with ZH, displays this translation; otherwise it reverts to the Key value.
Now, I have tried the following things, which have not worked:
Add a meta tag
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
Set the Response.CharSet = "UTF-8"
Set the Response.ContentType = "text/html"
Set both Session.CodePage and Response.CodePage to 65001 (UTF-8)
I have confirmed that the text in translations.txt is definitely in UTF-8 and has no byte order mark
The browser is picking up that the page is Unicode UTF-8, but the page is displaying gobbledegook.
The Scripting.OpenTextFile(<file>,<create>,<iomode>,<encoding>) method returns the same incorrect text regardless of the Encoding parameter.
Here is a sample of what I want to be displayed in China (ZH):
首页
帮助
But the following is displayed:
é¦–é¡µ
å¸®åŠ©
This occurs in all tested browsers: Google Chrome, IE 7/8, and Firefox 4. The font definitely has a Chinese branch of glyphs. Also, I do have Eastern languages installed.
--
I have tried pasting in the original value into the HTML, which did work (but note this is a hard coded value).
首页
é¦–é¡µ
However, this is odd.
首页 --(in hex)--> E9 A6 96 E9 A1 B5 --(as chars)--> é¦–é¡µ
Any ideas what I am missing?

In order to read the UTF-8 file, you'll probably need to use the ADODB.Stream object. I don't claim to be an expert on character encoding, but this test worked for me:
test.txt (saved as UTF-8 without BOM):
首页
帮助
test.vbs
Option Explicit
Const adTypeText = 2
Const adReadLine = -2
Dim stream : Set stream = CreateObject("ADODB.Stream")
stream.Open
stream.Type = adTypeText
stream.Charset = "UTF-8"
stream.LoadFromFile "test.txt"
Do Until stream.EOS
    WScript.Echo stream.ReadText(adReadLine)
Loop
stream.Close

Whatever part of the process is reading the translations.txt file does not seem to understand that the file is in UTF-8. It looks like it is reading it in as some other encoding. You should specify encoding in whatever process is opening and reading that file. This will be different from the encoding of your web page.
Inserting the byte order mark at the beginning of that file may also be a solution.
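The garbled output in the question is consistent with UTF-8 bytes being decoded as Windows-1252. A short Python sketch reproduces it; the choice of cp1252 is inferred from the characters shown in the question, not from anything the ASP code states.

```python
original = "首页"

# The file on disk holds these UTF-8 bytes: E9 A6 96 E9 A1 B5.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Windows-1252 yields one character per byte.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no byte was dropped along the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(" "))
print(garbled)
print(repaired)
```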

The OpenTextFile method of Scripting.FileSystemObject does not understand UTF-8 at all. It can only read the current OEM encoding or Unicode. As you can see from the number of bytes being used for some character sets, UTF-8 is quite inefficient. I would recommend Unicode for this sort of data.
You should save the file as Unicode (in Windows parlance, that is, UTF-16) and then open it with:
' 1 = ForReading, -1 = TristateTrue (open the file as Unicode)
Dim fso : Set fso = CreateObject("Scripting.FileSystemObject")
Dim stream : Set stream = fso.OpenTextFile(yourFilePath, 1, False, -1)
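Whether UTF-8 or UTF-16 is more compact depends on the mix of characters, which a small Python sketch can check for one line of the translations file. The "utf-16-le" codec is used here as a stand-in for what Windows calls "Unicode" (without the byte order mark); that mapping is an assumption made for the comparison.

```python
line = "Help|ZH|帮助"

utf8_size = len(line.encode("utf-8"))       # 1 byte per ASCII char, 3 per CJK char
utf16_size = len(line.encode("utf-16-le"))  # 2 bytes per char for this text

cjk_only = "帮助"
cjk_utf8_size = len(cjk_only.encode("utf-8"))
cjk_utf16_size = len(cjk_only.encode("utf-16-le"))

print(utf8_size, utf16_size)
print(cjk_utf8_size, cjk_utf16_size)
```

For the Chinese characters alone UTF-16 is smaller, but for a line that mixes ASCII keys with Chinese values the totals can go the other way, so size by itself may not settle the choice.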

Just use the script below at the top of your page
Response.CodePage=65001
Response.CharSet="UTF-8"
