Rexster + Bulbs: Unicode node property - node created but not found

I am using bulbs and rexster and am trying to store nodes with unicode properties (see example below).
Apparently, creating nodes in the graph works properly - I can see the nodes in the web interface that comes with Rexster (Rexster Dog House) - but retrieving the same node does not work: all I get is None.
Everything works as expected when I create and look for nodes with non-unicode-specific letters in their properties.
E.g., in the following example, a node with name = u'University of Cambridge' would be retrievable as expected.
Rexster version:
[INFO] Application - Rexster version [2.4.0]
Example code:
# -*- coding: utf-8 -*-
from bulbs.rexster import Graph
from bulbs.model import Node
from bulbs.property import String
from bulbs.config import DEBUG
import bulbs

class University(Node):
    element_type = 'university'
    name = String(nullable=False, indexed=True)

g = Graph()
g.add_proxy('university', University)
g.config.set_logger(DEBUG)

name = u'Université de Montréal'
g.university.create(name=name)
print g.university.index.lookup(name=name)
print bulbs.__version__
Gives the following output on the command line:
POST url: http://localhost:8182/graphs/emptygraph/tp/gremlin
POST body: {"params": {"keys": null, "index_name": "university", "data": {"element_type": "university", "name": "Universit\u00e9 de Montr\u00e9al"}}, "script": "def createIndexedVertex = {\n vertex = g.addVertex()\n index = g.idx(index_name)\n for (entry in data.entrySet()) {\n if (entry.value == null) continue;\n vertex.setProperty(entry.key,entry.value)\n if (keys == null || keys.contains(entry.key))\n\tindex.put(entry.key,String.valueOf(entry.value),vertex)\n }\n return vertex\n }\n def transaction = { final Closure closure ->\n try {\n results = closure();\n g.commit();\n return results; \n } catch (e) {\n g.rollback();\n throw e;\n }\n }\n return transaction(createIndexedVertex);"}
GET url: http://localhost:8182/graphs/emptygraph/indices/university?value=Universit%C3%A9+de+Montr%C3%A9al&key=name
GET body: None
None
0.3

Ok, I finally got to the bottom of this.
Since TinkerGraph uses a HashMap for its index, you can see what's being stored in the index by using Gremlin to return the contents of the map.
Here's what's being stored in the TinkerGraph index using your Bulbs g.university.create(name=name) method above...
$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="g.idx(\"university\").index"
{"results":[{"name":{"Université de Montréal":[{"name":"Université de Montréal","element_type":"university","_id":"0","_type":"vertex"}]},"element_type":{"university":[{"name":"Université de Montréal","element_type":"university","_id":"0","_type":"vertex"}]}}],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":3.732632}
All that looks good -- the encodings look right.
To create and index a vertex like the one above, Bulbs uses a custom Gremlin script via an HTTP POST request with a JSON content type.
Here's the problem...
Rexster's index lookup REST endpoint uses URL query params, and Bulbs encodes URL params as UTF-8 byte strings.
To see how Rexster handles URL query params encoded as UTF-8 byte strings, I executed a Gremlin script via a URL query param that simply returns the encoded string...
$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%C3%A9%20de%20Montr%C3%A9al'"
{"results":["Université de Montréal"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":16.59432}
Egad! That's not right. As you can see, that text is mangled.
In a twist of irony, we have Gremlin returning gremlins. That mangled string is what Rexster uses for the key's value in the index lookup, and as we can see, it is not what's stored in TinkerGraph's HashMap index.
Here's what's going on...
This is what the unquoted byte string looks like in Bulbs:
>>> name
u'Universit\xe9 de Montr\xe9al'
>>> bulbs.utils.to_bytes(name)
'Universit\xc3\xa9 de Montr\xc3\xa9al'
'\xc3\xa9' is the UTF-8 encoding of the unicode character u'\xe9' (which can also be specified as u'\u00e9').
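The same relationship is easy to check in Python 3, where the u'' prefix is optional (the Python 2 session above shows the same bytes):

```python
# u'\xe9' and u'\u00e9' are the same character, é.
assert '\xe9' == '\u00e9' == 'é'

# Its UTF-8 encoding is the two-byte sequence 0xC3 0xA9...
assert 'é'.encode('utf-8') == b'\xc3\xa9'

# ...while its latin1 (ISO-8859-1) encoding is the single byte 0xE9.
assert 'é'.encode('latin-1') == b'\xe9'
```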
UTF-8 encodes non-ASCII characters as multi-byte sequences (two bytes for é), and Jersey/Grizzly 1.x (Rexster's app server) has a bug where it doesn't properly handle multi-byte character encodings like UTF-8.
See http://markmail.org/message/w6ipdpkpmyghdx2p
It looks like this is fixed in Jersey/Grizzly 2.0, but switching Rexster from Jersey/Grizzly 1.x to Jersey/Grizzly 2.x is a big ordeal.
Last year TinkerPop decided to switch to Netty instead, and so for the TinkerPop 3 release this summer, Rexster is in the process of morphing into Gremlin Server, which is based on Netty rather than Grizzly.
Until then, here are a few workarounds...
Since Grizzly can't handle multi-byte encodings like UTF-8, client libraries need to encode URL params in latin1 (AKA ISO-8859-1), a single-byte encoding that is Grizzly's default.
Here's the same value encoded as a latin1 byte string...
$ curl http://localhost:8182/graphs/emptygraph/tp/gremlin?script="'Universit%E9%20de%20Montr%E9al'"
{"results":["Université de Montréal"],"success":true,"version":"2.5.0-SNAPSHOT","queryTime":17.765313}
As you can see, using a latin1 encoding works in this case.
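The two percent-encoded forms above can be reproduced with Python's standard library (a Python 3 sketch; quote() comes from urllib.parse):

```python
from urllib.parse import quote

name = 'Université de Montréal'

# UTF-8 bytes percent-encode to the multi-byte %C3%A9 sequences that
# Grizzly 1.x mishandles.
print(quote(name.encode('utf-8')))    # Universit%C3%A9%20de%20Montr%C3%A9al

# latin1 (ISO-8859-1) bytes percent-encode to single-byte %E9 sequences,
# matching Grizzly's default encoding.
print(quote(name.encode('latin-1')))  # Universit%E9%20de%20Montr%E9al
```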
However, for general purposes, it's probably best for client libraries to use a custom Gremlin script via an HTTP POST request with a JSON content type and thus avoid the URL param encoding issue altogether -- this is what Bulbs is going to do, and I'll push the Bulbs update to GitHub later today.
UPDATE: It turns out that even though we cannot change Grizzly's default encoding type, we can specify UTF-8 as the charset in the HTTP request Content-Type header and Grizzly will use it. Bulbs 0.3.29 has been updated to include the UTF-8 charset in its request header, and all tests pass. The update has been pushed to both GitHub and PyPi.
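For clients rolling their own requests, the fix amounts to one extra attribute in the header. A minimal sketch with Python's stdlib (the endpoint is the Rexster URL from the logs above; the script and params are illustrative, not Bulbs' actual Gremlin script):

```python
import json
import urllib.request

# Illustrative payload; any JSON body with non-ASCII text works the same way.
payload = {"script": "return name", "params": {"name": "Université de Montréal"}}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8182/graphs/emptygraph/tp/gremlin",
    data=body,
    # Declaring the charset lets the server decode the body as UTF-8.
    headers={"Content-Type": "application/json; charset=utf-8"},
)
print(req.get_header("Content-type"))  # application/json; charset=utf-8
```

Building the Request object doesn't send it; the point is only where the charset declaration goes.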

Related

Requests fail authorization when query string contains certain characters

I'm making requests to Twitter, using the OAuth1.0 signing process to set the Authorization header. They explain it step-by-step here, which I've followed. It all works, most of the time.
Authorization fails whenever special characters are sent without percent encoding in the query component of the request. For example, ?status=hello%20world! fails, but ?status=hello%20world%21 succeeds. But the change from ! to the percent encoded form %21 is only made in the URL, after the signature is generated.
So I'm confused as to why this fails, because AFAIK that's a legally encoded query string. Only the raw strings ("status", "hello world!") are used for signature generation, and I'd assume the server would remove any percent encoding from the query params and generate its own signature for comparison.
When it comes to building the URL, I let URLComponents do the work, so I don't add percent encoding manually, ex.
var urlComps = URLComponents()
urlComps.scheme = "https"
urlComps.host = host
urlComps.path = path
urlComps.queryItems = [URLQueryItem(name: "status", value: "hello world!")]
urlComps.percentEncodedQuery // "status=hello%20world!"
I wanted to see how Postman handled the same request. I selected OAuth1.0 as the Auth type and plugged in the same credentials. The request succeeded. I checked the Postman console and saw ?status=hello%20world%21; it was percent encoding the !. I updated Postman, because a nice little prompt asked me to. Then I tried the same request; now it was getting an authorization failure, and I saw ?status=hello%20world! in the console; the ! was no longer being percent encoded.
I'm wondering who is at fault here. Perhaps Postman and I are making the same mistake. Perhaps it's with Twitter. Or perhaps there's some proxy along the way that idk, double encodes my !.
The OAuth1.0 spec says this, which I believe is in the context of both client (taking a request that's ready to go and signing it before it's sent), and server (for generating another signature to compare against the one received):
The parameters from the following sources are collected into a
single list of name/value pairs:
The query component of the HTTP request URI as defined by
[RFC3986], Section 3.4. The query component is parsed into a list
of name/value pairs by treating it as an
"application/x-www-form-urlencoded" string, separating the names
and values and decoding them as defined by
[W3C.REC-html40-19980424], Section 17.13.4.
That last reference, here, outlines the encoding for application/x-www-form-urlencoded, and says that space characters should be replaced with +, non-alphanumeric characters should be percent encoded, name separated from value by =, and pairs separated by &.
So, the OAuth1.0 spec says that the query string of the URL needs to be decoded as defined by application/x-www-form-urlencoded. Does that mean that our query string needs to be encoded this way too?
It seems to me that if a request is to be signed using OAuth1.0, the query component of the URL that gets sent must be encoded differently from how it would normally be encoded. That's a pretty significant detail if you ask me, and I haven't seen it explicitly mentioned, even in Twitter's documentation. And evidently the folks at Postman overlooked it too? Unless I'm not supposed to be using URLComponents to build a URL - but that's what it's for, no? Have I understood this correctly?
Note: ?status=hello+world%21 succeeds; it tweets "hello world!"
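For reference, that succeeding form is exactly what Python's stdlib form encoding produces under the application/x-www-form-urlencoded rules quoted above:

```python
from urllib.parse import urlencode

# urlencode() follows application/x-www-form-urlencoded: spaces become '+'
# and '!' is percent-encoded.
print(urlencode({"status": "hello world!"}))  # status=hello+world%21
```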
I ran into a similar issue.
Put the status in the POST body, not the query string.
Percent-encoding:
private encode(str: string) {
    // encodeURIComponent() escapes all characters except: A-Z a-z 0-9 - _ . ! ~ * ' ( )
    // RFC 3986 section 2.3 Unreserved Characters (January 2005): A-Z a-z 0-9 - _ . ~
    return encodeURIComponent(str)
        .replace(/[!'()*]/g, c => "%" + c.charCodeAt(0).toString(16).toUpperCase());
}
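The same strict encoder can be sketched in Python; quote() with an empty safe set already escapes everything outside RFC 3986's unreserved characters:

```python
from urllib.parse import quote

def rfc3986_encode(s):
    # safe='' leaves only the unreserved characters (A-Z a-z 0-9 - _ . ~)
    # unescaped, so '!', "'", '(', ')' and '*' are percent-encoded too.
    return quote(s, safe='')

print(rfc3986_encode("hello world!"))  # hello%20world%21
```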

Play Framework Ning WS API encoding issue with HTML pages

I'm using Play Framework 2.3 and the WS API to download and parse HTML pages. For non-English pages (e.g. Russian, Hebrew), I often get the wrong encoding.
Here's an example:
def test = Action.async { request =>
  WS.url("http://news.walla.co.il/item/2793388").get.map { response =>
    Ok(response.body)
  }
}
This returns the web page's HTML. English characters are received OK, but the Hebrew letters appear as gibberish (not just when rendering - at the internal String level). Like so:
<title>29 ×ר×××× ××פ××ת ×ש×××× ×× ×¤××, ××× ×©×××©× ×שר×××× - ×××××! ××ש×ת</title>
Other articles from the same web-site can appear OK.
Using cURL with the same web-page returns a perfectly fine result, which makes me believe the problem is within the WS API.
Any ideas?
Edit:
I found a solution in this SO question.
Parsing the response as ISO-8859-1 and then converting it to UTF-8 like-so:
Ok(new String(response.body.getBytes("ISO-8859-1") , response.header(CONTENT_ENCODING).getOrElse("UTF-8")))
displays correctly. So I have a working solution, but why isn't this done internally?
OK, here's the solution I ended up using in production:
def responseBody = response.header(CONTENT_TYPE).filter(_.toLowerCase.contains("charset")).fold(new String(response.body.getBytes("ISO-8859-1") , "UTF-8"))(_ => response.body)
Explanation:
If the response includes a "Content-Type" header that also specifies a charset, simply return the response body, since the WS API will use it to decode correctly; otherwise, assume the response is ISO-8859-1 encoded and convert it to UTF-8.
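The repair step itself is language-independent; a minimal Python sketch of the same round-trip (the sample string is illustrative):

```python
# A UTF-8 body wrongly decoded as ISO-8859-1 turns into mojibake like the
# <title> above; re-encoding as ISO-8859-1 recovers the original bytes,
# which can then be decoded as UTF-8.
original = 'שלום עולם'  # what the server actually sent, as UTF-8
mojibake = original.encode('utf-8').decode('iso-8859-1')
repaired = mojibake.encode('iso-8859-1').decode('utf-8')
assert mojibake != original
assert repaired == original
```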

Base64 decoding of MIME email not working (GMail API)

I'm using the Gmail API to retrieve an email's contents. I am getting the following base64 encoded data for the body: http://hastebin.com/ovucoranam.md
But when I run it through a base64 decoder, it either returns an empty string (error) or something that resembles the HTML data but with a bunch of weird characters.
Help?
I'm not sure if you've solved it yet, but GmailGuy is correct. You need to convert the body to the base64 RFC 4648 standard alphabet. The gist is you'll need to replace - with + and _ with /.
I've taken your original input and did the replacement: http://hastebin.com/ukanavudaz
And used base64decode.org to decode it, and it was fine.
You need to use the URL-safe (aka "web-safe") base64 decoding alphabet (see RFC 4648), which it doesn't appear you're doing. Using the standard base64 alphabet may work sometimes but not always (2 of the 64 characters are different).
Docs don't seem to consistently mention this important detail. Here's one where it does though:
https://developers.google.com/gmail/api/guides/drafts
Also, if your particular library doesn't support the "URL safe" alphabet then you can do string substitution on the string first ("-" with "+" and "_" with "/") and then do normal base64 decoding on it.
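Both options can be shown with Python's stdlib base64 module:

```python
import base64

# Sample payload; the leading bytes force '+' characters in the standard
# alphabet, so the URL-safe form differs. The Gmail API returns the
# URL-safe form.
payload = b'\xfb\xef\xbe<h1>Hello?</h1>'
urlsafe = base64.urlsafe_b64encode(payload).decode('ascii')

# Option 1: a decoder that supports the URL-safe alphabet directly.
decoded1 = base64.urlsafe_b64decode(urlsafe)

# Option 2: substitute '-' -> '+' and '_' -> '/', then decode normally.
decoded2 = base64.b64decode(urlsafe.replace('-', '+').replace('_', '/'))

assert decoded1 == decoded2 == payload
```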
I had the same issue decoding the 'data' fields in the message object response from the Gmail API. The Google Ruby API library wasn't decoding the text correctly either. I found I needed to do a url-safe base64 decode:
@data = Base64.urlsafe_decode64(JSON.parse(@result.data.to_json)["payload"]["body"]["data"])
Hope that helps!
Here is an example for Python 2.x and 3.x:
decodedContents = base64.urlsafe_b64decode(payload["body"]["data"].encode('ASCII'))
If you only need to decode for displaying purposes, consider using atob to decode the messages in JavaScript frontend (see ref).
Whilst playing with the API result, I found that once I had drilled down to the body, the available methods included an option to decode it.
val message = mService!!.users().messages().get(user, id).setFormat("full").execute()
println("Message snippet: " + message.snippet)
if (message.payload.mimeType == "text/plain") {
    val body = message.payload.body.decodeData() // getValue("body")
    Log.i("BODY", body.toString(Charset.defaultCharset()))
}
The result:
com.example.quickstart I/BODY: ISOLATE NORMAL: 514471,Fap, South Point Rolleston, 55 Faringdon Boulevard , Rolleston, 30 May 2018 20:59:21
I copied the base64 text to a file (b64.txt), then base64-decoded it using base64 (from coreutils) with the -d option (see http://linux.die.net/man/1/base64), and I got text that was perfectly readable. The command I used was:
cat b64.txt | base64 -d

Grails rest plugin not encoding ampersand

I'm using the Grails rest plugin, and having issues with parameters containing an ampersand. Here is an example of my query:
def query = [
    method: 'artist.getinfo',
    artist: 'Matt & Kim',
    format: 'json'
]
withRest(uri: 'http://ws.audioscrobbler.com/') {
    def resp = get(path: '/2.0/', query: query)
}
I think that the get method should automatically URL encode the parameters in query - it correctly converts spaces to '+'. However, it leaves the ampersand as is, which is incorrect (it should be encoded to %26).
I tried manually encoding the artist name before calling get, but then the rest plugin encodes the percent sign!
I turned on logging for the rest client, so I can see what URLs it's requesting.
Originally: http://ws.audioscrobbler.com/2.0/?method=artist.getinfo&artist=Matt+&+Kim&format=json
If I manually encode the name: http://ws.audioscrobbler.com/2.0/?method=artist.getinfo&artist=Matt+%2526+Kim&format=json
Do I need to set an encoding type? (the last.fm API specifies UTF-8) Is this a bug?
As of version 0.7, the rest plugin is using a version of HTTPBuilder which has issues encoding (and decoding) the ampersand character.
There is a JIRA Issue about this with a suggested workaround (upgrading HTTPBuilder to >= 0.5.2)

httprequest encoding mismatch

I'm using a Google Gears Worker to submit a POST HttpRequest (using var request = google.gears.factory.create('beta.httprequest');)
with a parameter containing the string
"bford%20%24%23%26!%3F%40%20%E5%BE%B3%E5%8A%9B%E5%9F%BA%E5%BD%A6"
but the Django HttpRequest is receiving it as "bford $#&!?# å¾³å\u008a\u009bå\u009fºå½¦"
How do I specify to one or the other of the parties in the transaction to leave it untranslated?
Check HttpRequest.encoding and the DEFAULT_CHARSET setting. Judging by the encoded value, this should be UTF-8 (which is indeed usually the right thing).
You can get the 'untranslated' value (with the %s still in) by looking at the input stream (for POST) or environ QUERY_STRING (for GET) and decoding it manually, but it would be better to fix Django's incorrect string-to-unicode decoding.
As I understand it, Django 1.0 should default to using UTF-8, so I'm not sure why it's not in your case.
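The mismatch is easy to reproduce with Python's urllib: decoding the percent-encoded bytes as UTF-8 gives the intended string, while decoding them as latin1 yields exactly the kind of mojibake Django received:

```python
from urllib.parse import unquote

quoted = "bford%20%24%23%26!%3F%40%20%E5%BE%B3%E5%8A%9B%E5%9F%BA%E5%BD%A6"

# Decoded as UTF-8 (the right thing):
print(unquote(quoted, encoding='utf-8'))    # bford $#&!?@ 徳力基彦

# Decoded as latin1, each UTF-8 byte becomes its own character,
# producing mojibake starting with å¾³...
print(unquote(quoted, encoding='latin-1'))
```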