Encoding not present in HTTP header, how to find it in HTML header? (iPhone) - iphone

I'm writing a browser for the iPhone.
I'm using
NSString* storyHTML = @"";
ASIHTTPRequest *request = [ASIHTTPRequest requestWithURL:url];
[request startSynchronous];
to download HTML. The problem is that sometimes the HTTP header specifies no encoding, in which case the above code defaults to ISO Latin 1.
In this case I can read up to the header in the HTML and find the meta tag that specifies the actual encoding, which looks something like this:
<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />
The problem is there are a TON of possible encodings that can be found in the meta tag as seen here: http://www.iana.org/assignments/character-sets
I would need to somehow convert one of those encoding strings into one of the constant encodings found in the NSString class:
enum {
NSASCIIStringEncoding = 1,
NSNEXTSTEPStringEncoding = 2,
NSJapaneseEUCStringEncoding = 3,
NSUTF8StringEncoding = 4,
NSISOLatin1StringEncoding = 5, ...
There must be a class that somehow determines the encoding of HTML for you. Is there a way to look into UIWebView and see how they do it?
It seems like downloading HTML should be easy, what am I missing?
Thanks!

Just going to round-up my comments and add a few final words of advice into an answer.
Comment 1:
From general usage, you can use the ASIHTTPRequest -responseString, otherwise you can use the data itself and use your own logic to figure out what type of encoding (UTF8, UTF16, etc)
Comment 2:
From the ASIHTTP website:
ASIHTTPRequest will attempt to read the text encoding of the received data from the Content-Type header. If it finds a text encoding, it will set responseEncoding to the appropriate NSStringEncoding. If it does not find a text encoding in the header, it will use the value of defaultResponseEncoding (this defaults to NSISOLatin1StringEncoding). When you call [request responseString], ASIHTTPRequest will attempt to create a string from the data it received, using responseEncoding as the source encoding.
Comment 3
See also: Encoding issue with ASIHttpRequest
I would personally recommend taking the response data and just assuming the content can fit into UTF16 (or 8). Of course you could also use a regular expression or HTML parser to grab the <meta> tag inside the <head> element, but if the response is in a weird content-type then you might not be able to find the string @"<head".
I would also use curl from the CLI on your computer to see what content-types ASIHTTPRequest is fetching. If you run a command like
curl -I "http://www.google.com/"
You'll get the following response:
HTTP/1.1 200 OK
Date: Tue, 09 Aug 2011 20:05:00 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
It would appear almost all sites respond correctly with this header, and when they don't I think using UTF8 would be a great bet. Could you comment with the link of the site that was giving you the issue?
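The lookup order discussed above (HTTP header first, then the <meta> tag, then a UTF-8 fallback) can be sketched in a few lines. This is only an illustration in Python rather than Objective-C: the function name sniff_charset, the 2048-byte scan window and the regular expression are my own assumptions, and Python's codec registry stands in for the table that maps IANA names to encoding constants.

```python
import codecs
import re
from typing import Optional

# Matches charset=UTF-8, charset="ISO-8859-1", etc. (assumed pattern, not exhaustive)
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def sniff_charset(content_type: Optional[str], body: bytes, default: str = "utf-8") -> str:
    """Return a Python codec name for an HTML response.

    Checks the Content-Type header, then the start of the document
    (where a <meta> tag would be), and falls back to `default`.
    """
    candidates = []
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match:
            candidates.append(match.group(1))
    # latin-1 maps every byte to a character, so this decode cannot fail
    head = body[:2048].decode("latin-1")
    match = _CHARSET_RE.search(head)
    if match:
        candidates.append(match.group(1))
    for name in candidates:
        try:
            # codecs.lookup resolves aliases such as "ISO-8859-1" or "latin1"
            return codecs.lookup(name).name
        except LookupError:
            continue
    return default


html = b'<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />'
print(sniff_charset(None, html))
```

On iOS itself, Core Foundation has CFStringConvertIANACharSetNameToEncoding and CFStringConvertEncodingToNSStringEncoding, which together should cover the name-to-constant step without a hand-written table.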

Is there a way to look into UIWebView and see how they do it?
There is. UIWebView is a wrapper around WebKit, which is an open source project. You can check out the source code or browse it online.

Related

HTTP Response encoding issue

I am trying to fetch a CSV file from a website (https://www.stocknet.fr/accueil.asp) using a GET request on the https URL. The response I get via Postman looks like this:
Type;Groupe Acc�s;Code;EOTP autoris�s;Familles EOTP autoris�es;Nom;Pr�nom;Adresse Mail;Agences autoris�es;D�p�ts autoris�s;Date cr�ation;Fournisseurs autoris�s;Classes autoris�es;Familles article
But when I access the URL directly, my browser automatically downloads the file, and when I open it on Windows it has the proper encoding:
Type;Groupe Accès;Code;EOTP autorisés;Familles EOTP autorisées;Nom;Prénom;Adresse Mail;Agences autorisées;Dépôts autorisés;Date création;Fournisseurs autorisés;Classes autorisées;Familles article
When I inspect the website HTML, I can see the tag <meta charset="ISO-8859-1" />
I tried using headers as such:
Accept-Charset: ISO-8859-1
Accept-Charset: UTF-8
Content-Type: text/csv; charset=ISO-8859-1
Content-Type: text/csv; charset=UTF-8
Content-Encoding: gzip
Content-Encoding: compress
Content-Encoding: deflate
Content-Encoding: identity
Content-Encoding: br
Nothing seems to return a response with the correct encoding.
Any idea what I am doing wrong? Note that whatever page of the website I try to fetch, I get this wrong encoding. It's not only with the CSV file.
The server is returning content in iso-8859-1 and telling you it's iso-8859-1. You will not convince the server to return anything else. Your web browser contains code to convert encodings. If you want to have the content in a different encoding, you have to convert it yourself.
For ways how to do that, see:
Best way to convert text files between character sets?
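As a rough illustration of "convert it yourself", here is a small Python sketch. The function name latin1_to_utf8 is made up for this example; the byte string is simply the word "Accès" as an ISO-8859-1 server would send it.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    text = raw.decode("iso-8859-1")  # interpret the bytes the way the server encoded them
    return text.encode("utf-8")


raw = b"Acc\xe8s"  # "Accès" in ISO-8859-1

# Decoding as UTF-8 by mistake yields the replacement character seen in the question
print(raw.decode("utf-8", errors="replace"))

# Decoding with the charset the server declared gives the expected text
print(latin1_to_utf8(raw).decode("utf-8"))
```

The replacement characters in the Postman output are what you would expect if the client assumed UTF-8 for bytes that are really ISO-8859-1.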

Gmail API - plaintext word wrapping

When sending emails using the Gmail API, it places hard line breaks in the body at around 78 characters per line. A similar question about this can be found here.
How can I make this stop? I simply want to send plaintext emails through the API without line breaks. The current formatting looks terrible, especially on mobile clients (tested on Gmail and iOS Mail apps).
I've tried the following headers:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Am I missing anything?
EDIT: As per Mr.Rebot's suggestion, I've also tried this with no luck:
Content-Type: mixed/alternative
EDIT 2: Here's the exact format of the message I'm sending (attempted with and without the quoted-printable header):
From: Example Account <example1@example.com>
To: <example2@example.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: This is a test!
Date: Tue, 18 Oct 2016 10:46:57 -GMT-07:00
Here is a long test message that will probably cause some words to wrap in strange places.
I take this full message and Base64-encode it, then POST it to /gmail/v1/users/{my_account}/drafts/send?fields=id with the following JSON body:
{
"id": MSG_ID,
"message": {
"raw": BASE64_DATA
}
}
Are you running the content through a quoted printable encoder and sending the encoded content value along with the header or expecting the API to encode it for you?
Per Wikipedia, if you add soft line breaks (an = as the last character of a line) so that no encoded line is longer than 76 characters, they are removed on decoding, which restores your original text.
UPDATE
Try sending with this content whose message has been quoted-printable encoded (base64 it):
From: Example Account <example1@example.com>
To: <example2@example.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: This is a test!
Date: Tue, 18 Oct 2016 10:46:57 -GMT-07:00
Here is a long test message that will probably cause some words to wrap in =
strange places.
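To see the soft line breaks in action, here is a short Python sketch using the standard-library quopri module. It only demonstrates the encoding round trip on the sample sentence; it does not call the Gmail API.

```python
import quopri

body = (
    "Here is a long test message that will probably cause some words "
    "to wrap in strange places."
).encode("utf-8")

# Encoding inserts "=" + newline (a soft line break) so no line exceeds 76 characters
encoded = quopri.encodestring(body)
print(encoded.decode("ascii"))

# Decoding removes the soft line breaks and restores the original single line
decoded = quopri.decodestring(encoded)
print(decoded == body)
```

If the body is sent already encoded like this together with the Content-Transfer-Encoding: quoted-printable header, a conforming client should rejoin the lines when displaying the message.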
I'm assuming you have a function similar to this:
1. def create_message(sender, to, cc, subject, message_body):
2.     message = MIMEText(message_body, 'html')
3.     message['to'] = to
4.     message['from'] = sender
5.     message['subject'] = subject
6.     message['cc'] = cc
7.     return {'raw': base64.urlsafe_b64encode(message.as_bytes()).decode()}
The one trick that finally worked for me, after all the attempts to modify the header values and payload dict (which is a member of the message object), was to set (line 2):
message = MIMEText(message_body, 'html') <-- add the 'html' as the second parameter of the MIMEText object constructor
The default code supplied by Google for their gmail API only tells you how to send plain text emails, but they hide how they're doing that.
ala...
message = MIMEText(message_body)
I had to look up the python class email.mime.text.MIMEText object.
That's where you'll see this definition of the constructor for the MIMEText object:
class email.mime.text.MIMEText(_text[, _subtype[, _charset]])
We want to explicitly pass it a value to the _subtype. In this case, we want to pass: 'html' as the _subtype.
Now, you won't have anymore unexpected word wrapping applied to your messages by Google, or the Python mime.text.MIMEText object
This exact issue made me crazy for a good couple of hours, and no solution I could find made any difference.
So if anyone else ends up frustrated here, I thought I'd just post my "solution".
Turn your text (what's going to be the body of the email) into simple HTML. I wrapped every paragraph in a simple <p>, and added line-breaks (<br>) where needed (e.g. my signature).
Then, per Andrew's answer, I attached the message body as MIMEText(message_text, _subtype="html"). The plain-text is still not correct AFAIK, but it works and I don't think there's a single actively used email-client out there that doesn't render HTML anymore.
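Putting the two answers together, a minimal Python 3 sketch might look like the following. The helper text_to_html and its paragraph-splitting rule (blank line between paragraphs, single newline becomes <br>) are assumptions of mine, not something the Gmail API requires, and the addresses are placeholders.

```python
import base64
import html
from email.mime.text import MIMEText


def text_to_html(body: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> and turn single newlines into <br>."""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    return "\n".join(
        "<p>" + html.escape(p).replace("\n", "<br>") + "</p>" for p in paragraphs
    )


def create_message(sender: str, to: str, subject: str, body: str) -> dict:
    """Build the {'raw': ...} payload with an HTML body."""
    message = MIMEText(text_to_html(body), "html")  # 'html' is the _subtype argument
    message["to"] = to
    message["from"] = sender
    message["subject"] = subject
    # urlsafe_b64encode needs bytes in Python 3; the JSON payload needs a str
    raw = base64.urlsafe_b64encode(message.as_bytes()).decode("ascii")
    return {"raw": raw}


payload = create_message(
    "example1@example.com",
    "example2@example.com",
    "This is a test!",
    "Hello,\n\nHere is a long test message.\nRegards",
)
print(sorted(payload))
```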

How to design REST API for export endpoint?

I am designing a REST API and am running into a design issue. I have alerts that I'd like the user to be able to export to one of a handful of file formats. So we're already getting into actions/commands with export, which feels like RPC and not REST.
Moreover, I don't want to assume a default file format. Instead, I'd like to require it to be provided. I don't know how to design the API to do that, and I also don't know what response code to return if the required parameter isn't provided.
So here's my first crack at it:
POST /api/alerts/export?format=csv
OR
POST /api/alerts/export/csv
Is this endpoint set up the way you would? And is it set up in the right way to require the file format? And if the required file format isn't provided, what's the correct status code to return?
Thanks.
In fact you should consider HTTP content negotiation (or CONNEG) to do this. This leverages the Accept header (see the HTTP specification: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1) that specifies which is the expected media type for the response.
For example, for CSV, you could have something like that:
GET /api/alerts
Accept: text/csv
If you want to specify additional hints (file name, ...), the server could return the Content-Disposition header (see the HTTP specification: http://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.5.1) in the response, as described below:
GET /api/alerts
Accept: text/csv
HTTP/1.1 200 OK
Content-Disposition: attachment; filename="alerts.csv"
(...)
Hope it helps you,
Thierry
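On the server side, content negotiation boils down to matching the Accept header against the formats you can produce. Below is a simplified Python sketch; the function name pick_media_type is hypothetical, and the parsing only handles the q parameter and simple wildcards, so treat it as an outline rather than a complete implementation of the HTTP rules.

```python
from typing import Optional, Sequence


def pick_media_type(accept: str, supported: Sequence[str]) -> Optional[str]:
    """Return the best supported media type for an Accept header, or None."""
    candidates = []
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        candidates.append((quality, media_type))
    # Highest q first; the sort is stable, so header order breaks ties
    candidates.sort(key=lambda c: c[0], reverse=True)
    for quality, media_type in candidates:
        if quality <= 0:
            continue
        for offered in supported:
            wildcard = media_type.endswith("/*") and offered.startswith(media_type[:-1])
            if media_type in (offered, "*/*") or wildcard:
                return offered
    return None


supported = ["text/csv", "application/json"]
print(pick_media_type("text/csv", supported))
print(pick_media_type("application/xml", supported))
```

When the function returns None, 406 Not Acceptable is the status code HTTP defines for "no representation matches the Accept header". Whether a missing Accept header should be rejected the same way (as the question wants) or treated as "anything goes" is a design choice for your API.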

Postman Chrome: What is the difference between form-data, x-www-form-urlencoded and raw

I am using the Postman Chrome extension for testing a web service.
There are three options available for data input.
I guess the raw is for sending JSON.
What is the difference between the other two, form-data and x-www-form-urlencoded?
These are different Form content types defined by W3C.
If you want to send simple text/ ASCII data, then x-www-form-urlencoded will work. This is the default.
But if you have to send non-ASCII text or large binary data, the form-data is for that.
You can use Raw if you want to send plain text or JSON or any other kind of string. Like the name suggests, Postman sends your raw string data as it is without modifications. The type of data that you are sending can be set by using the content-type header from the drop down.
Binary can be used when you want to attach non-textual data to the request, e.g. a video/audio file, images, or any other binary data file.
Refer to this link for further reading:
Forms in HTML documents
This explains better:
Postman docs
Request body
While constructing requests, you would be dealing with the request body editor a lot. Postman lets you send almost any kind of HTTP request (If you can't send something, let us know!). The body editor is divided into 4 areas and has different controls depending on the body type.
form-data
multipart/form-data is the default encoding a web form uses to transfer data. This simulates filling a form on a website, and submitting it. The form-data editor lets you set key/value pairs (using the key-value editor) for your data. You can attach files to a key as well. Do note that due to restrictions of the HTML5 spec, files are not stored in history or collections. You would have to select the file again at the time of sending a request.
urlencoded
This encoding is the same as the one used in URL parameters. You just need to enter key/value pairs and Postman will encode the keys and values properly. Note that you can not upload files through this encoding mode. There might be some confusion between form-data and urlencoded so make sure to check with your API first.
raw
A raw request can contain anything. Postman doesn't touch the string entered in the raw editor except replacing environment variables. Whatever you put in the text area gets sent with the request. The raw editor lets you set the formatting type along with the correct header that you should send with the raw body. You can set the Content-Type header manually as well. Normally, you would be sending XML or JSON data here.
binary
binary data allows you to send things which you can not enter in Postman. For example, image, audio or video files. You can send text files as well. As mentioned earlier in the form-data section, you would have to reattach a file if you are loading a request through the history or the collection.
UPDATE
As pointed out by VKK, the WHATWG spec says urlencoded is the default encoding type for forms.
The invalid value default for these attributes is the application/x-www-form-urlencoded state. The missing value default for the enctype attribute is also the application/x-www-form-urlencoded state.
Here are some supplemental examples to see the raw text that Postman passes in the request. You can see this by opening the Postman console:
form-data
Header
content-type: multipart/form-data; boundary=--------------------------590299136414163472038474
Body
key1=value1key2=value2
x-www-form-urlencoded
Header
Content-Type: application/x-www-form-urlencoded
Body
key1=value1&key2=value2
Raw text/plain
Header
Content-Type: text/plain
Body
This is some text.
Raw json
Header
Content-Type: application/json
Body
{"key1":"value1","key2":"value2"}
multipart/form-data
Note. Please consult RFC2388 for additional information about file uploads, including backwards compatibility issues, the relationship between "multipart/form-data" and other content types, performance issues, etc.
Please consult the appendix for information about security issues for forms.
The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data.
The content type "multipart/form-data" follows the rules of all multipart MIME data streams as outlined in RFC2045. The definition of "multipart/form-data" is available at the [IANA] registry.
A "multipart/form-data" message contains a series of parts, each representing a successful control. The parts are sent to the processing agent in the same order the corresponding controls appear in the document stream. Part boundaries should not occur in any of the data; how this is done lies outside the scope of this specification.
As with all multipart MIME types, each part has an optional "Content-Type" header that defaults to "text/plain". User agents should supply the "Content-Type" header, accompanied by a "charset" parameter.
application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by `+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by `%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A'). The control names/values are listed in the order they appear in the document. The name is separated from the value by `=' and name/value pairs are separated from each other by `&'.
With application/x-www-form-urlencoded, the body of the HTTP message sent to the server is essentially one giant query string: name/value pairs are separated by the ampersand (&), and names are separated from values by the equals symbol (=). An example of this would be:
MyVariableOne=ValueOne&MyVariableTwo=ValueTwo
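For a concrete view of that escaping, here is a small Python sketch using the standard-library urllib.parse. The field values are invented for the example; note how the space becomes + and the non-ASCII character expands to two percent-escaped UTF-8 bytes.

```python
from urllib.parse import parse_qs, urlencode

fields = {"MyVariableOne": "Value One", "MyVariableTwo": "Accès"}

# urlencode produces the application/x-www-form-urlencoded body
body = urlencode(fields)
print(body)

# parse_qs reverses the escaping; each value comes back as a list
print(parse_qs(body))
```

The expansion of every non-ASCII byte to three characters is the reason this content type is considered inefficient for binary or non-ASCII data.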
Let's keep it simple: it's all about how the HTTP request is made:
1- x-www-form-urlencoded
http request:
GET /getParam1 HTTP/1.1
User-Agent: PostmanRuntime/7.28.4
Accept: */*
Postman-Token: a14f1286-52ae-4871-919d-887b0e273052
Host: localhost:12345
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 55
postParam1Key=postParam1Val&postParam2Key=postParam2Val
2- raw
http request:
GET /getParam1 HTTP/1.1
Content-Type: text/plain
User-Agent: PostmanRuntime/7.28.4
Accept: */*
Postman-Token: e3f7514b-3f87-4354-bcb1-cee67c306fef
Host: localhost:12345
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Length: 73
{
postParam1Key: postParam1Val,
postParam2Key: postParam2Val
}
3- form-data
http request:
GET /getParam1 HTTP/1.1
User-Agent: PostmanRuntime/7.28.4
Accept: */*
Postman-Token: 8e2ce54b-d697-4179-b599-99e20271df90
Host: localhost:12345
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Type: multipart/form-data; boundary=--------------------------140760168634293019785817
Content-Length: 181
----------------------------140760168634293019785817
Content-Disposition: form-data; name="postParam1Key"
postParam1Val
----------------------------140760168634293019785817--
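The form-data request above can be reproduced by hand, which makes the role of the boundary clearer. This is a bare-bones Python sketch for text fields only; the function name encode_multipart is hypothetical, the boundary is just a random hex string, and file parts (which also need a filename and a Content-Type per part) are left out.

```python
import uuid
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str]) -> Tuple[str, bytes]:
    """Return (Content-Type header value, body) for simple text fields."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)  # each part starts with "--" + boundary
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")  # blank line separates part headers from the value
        lines.append(value)
    lines.append("--" + boundary + "--")  # closing delimiter has a trailing "--"
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return "multipart/form-data; boundary=" + boundary, body


content_type, body = encode_multipart({"postParam1Key": "postParam1Val"})
print(content_type)
print(body.decode("utf-8"))
```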

How to encode/decode charset encoding in NodeJS?

I have this code :
request({ url: 'http://www.myurl.com/' }, function(error, response, html) {
if (!error && response.statusCode == 200) {
console.log($('title', html).text());
}
});
But the websites that I'm crawling can have different charsets (utf8, iso-8859-1, etc.). How do I get the charset and always convert the HTML to the right encoding (utf8)?
Thanks and sorry for my english ;)
The website could return the content encoding in the content-type header or the content-type meta tag inside the returned HTML, eg:
<meta http-equiv="Content-Type" content="text/html; charset=latin1"/>
You can use the charset module to automatically check both of these for you. Not all websites or servers will specify an encoding though, so you'll want to fall back to detecting the charset from the data itself. The jschardet module can help you with that.
Once you've worked out the charset you can use the iconv module to do the actual conversion. Here's a full example:
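var request = require('request');
var charset = require('charset');
var jschardet = require('jschardet');
var Iconv = require('iconv').Iconv;

request({url: 'http://www.myurl.com/', encoding: 'binary'}, function(error, response, html) {
  if (error) return console.error(error);
  // Try the Content-Type header and <meta> tag first, then guess from the bytes
  var enc = charset(response.headers, html) || jschardet.detect(html).encoding;
  enc = (enc || 'utf-8').toLowerCase();
  // The body arrived as a 'binary' string, so turn it back into raw bytes before decoding
  var buf = Buffer.from(html, 'binary');
  html = enc === 'utf-8' ? buf.toString('utf-8')
                         : new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE').convert(buf).toString('utf-8');
  console.log($('title', html).text());
});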
First up, you could send an Accept-Charset header to ask websites for a particular charset, although servers are free to ignore it, so you still need to check what comes back.
Once you get a response, you can check the Content-Type header for the charset entry and do appropriate processing.
Another hack (which I've used in the past, in Python though) when the content encoding is unknown is to try decoding with each possible encoding in turn and stick with the first one that doesn't throw an exception.
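That trial-and-error approach might look roughly like this in Python. The function name decode_best and the candidate list are assumptions for the sketch. The order matters: latin-1 accepts any byte sequence, so it has to come last or it will always "win".

```python
from typing import Iterable, Tuple


def decode_best(
    data: bytes,
    candidates: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue
    # Nothing matched: decode as UTF-8 and mark undecodable bytes
    return data.decode("utf-8", errors="replace"), "utf-8"


print(decode_best("Accès".encode("utf-8")))
print(decode_best("Accès".encode("latin-1")))
```

A clean decode only proves the bytes are valid in that encoding, not that the result is the intended text, so a statistical detector is still the safer option for unknown input.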