SharePoint 2013 REST API odata $filter ignores unicode characters such as German umlauts äöü

I'm trying to use SharePoint 2013 REST API (odata) with unicode characters such as umlauts (ä ö ü).
...?$select=Title%2CID&$filter=substringof%28%27hello%20w%F6rld%27%2C%20Title%29&$orderby=ID%20desc&$top=14
^^ should search for "hello w*ö*rld" using substringof('...', Field)
I'm escaping the URL correctly (and also single quotes with double quotes), and filtering works for all kinds of characters (even backslashes and quotes). However, entering ä/ö/ü or any other non-ASCII character has no effect: it is as if those characters were simply filtered out on the server side (I can insert a lot of ääääääs without changing the results).
Any idea how to escape those? I tried the obvious (%ab { \u1234 \xab x1234) without success. Can't find anything on the web or in the specs either.
Thanks for suggestions.
UPDATE - SOLVED
I found that you can use the %uhhhh variant of escaping them:
?$filter=substringof('hello w%u00f6rld', Title)
Of course one must only escape that once (i.e. not the whole thing again), but it seems that's the way to go.
(can't answer my own question now lol)
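
A minimal sketch of the %uhhhh escaping described in the update, in Python. The helper names `escape_filter_value` and `build_filter` are made up for illustration, doubling single quotes is the usual OData convention and is assumed here, and emitting one %uhhhh per UTF-16 code unit for characters outside the Basic Multilingual Plane is also an assumption that was not tested against SharePoint.

```python
from urllib.parse import quote


def escape_filter_value(value: str) -> str:
    """Escape a string literal for use inside an OData $filter expression."""
    doubled = value.replace("'", "''")  # assumption: OData doubles single quotes
    parts = []
    for ch in doubled:
        if ord(ch) < 128:
            # Plain ASCII: ordinary percent-encoding (space -> %20, ' -> %27).
            parts.append(quote(ch, safe=""))
        else:
            # Non-ASCII: one %uhhhh per UTF-16 code unit (assumption for
            # characters that need a surrogate pair).
            raw = ch.encode("utf-16-be")
            for i in range(0, len(raw), 2):
                unit = int.from_bytes(raw[i:i + 2], "big")
                parts.append("%u{:04x}".format(unit))
    return "".join(parts)


def build_filter(term: str, field: str = "Title") -> str:
    """Build the $filter part only; the value is escaped exactly once."""
    return "$filter=substringof('{}',{})".format(escape_filter_value(term), field)


print(build_filter("hello wörld"))
```

The result must not be passed through another URL-encoding step, otherwise the % of %u00f6 would be escaped a second time.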

Related

Issues with special characters in QBO API v3 .NET SDK

I'm using the .NET SDK to import customers and transactions from another system that accepts UTF-8 encoding in their data, and am having a lot of trouble with special characters. Is there a comprehensive list of (a) what characters need to be escaped (like apostrophe), and (b) what characters are simply not allowed in QBO (like colon)? All I can find in the online doc is "use backslash to escape special characters like apostrophe". OK, what about ampersand, em dash, en dash, grave accent, acute accent... you get the idea.
This problem affects both queries and inserts which causes all kinds of problems. For example, if we query a customer by name, and the query fails (maybe due to an invalid character), we try to insert the customer in QBO, which of course also fails, either due to the customer existing or invalid characters. True, we can usually figure out if the query failed due to a bad character vs the record not existing, but we need a design-time solution. Any suggestions?
If you use the query endpoint, URL-encode the query parameters.
For example, for the following query
select * from Customer where DisplayName='Aülleünte'
URL request would be
https://quickbooks.api.intuit.com/v3/company/<realmId>/query?query=select+*+from+Customer+where+DisplayName%3D%27A%C3%BClle%C3%BCnte%27
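
As a rough sketch, the encoded URL above can be reproduced with Python's standard library. `build_query_url` is a hypothetical helper, not part of any Intuit SDK, and `<realmId>` is kept as the placeholder from the example. The `*` is marked as safe only so that the output matches the URL shown above.

```python
from urllib.parse import quote_plus

BASE_URL = "https://quickbooks.api.intuit.com/v3/company/{realm_id}/query"


def build_query_url(realm_id: str, statement: str) -> str:
    """URL-encode the query statement as UTF-8 and append it to the endpoint."""
    # quote_plus turns spaces into '+' and percent-encodes the UTF-8 bytes
    # of every non-ASCII character (ü -> %C3%BC).
    encoded = quote_plus(statement, safe="*", encoding="utf-8")
    return BASE_URL.format(realm_id=realm_id) + "?query=" + encoded


url = build_query_url(
    "<realmId>",
    "select * from Customer where DisplayName='Aülleünte'",
)
print(url)
```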
Note: some QBO text fields (for example, 'Description/Note' in the Customer window) allow control characters to be entered, and these are returned as part of the query response. Because some of those characters are not supported in XML, object deserialization fails or shows a warning.
You should either remove those characters in the UI or use a library or regex in client-side code to remove them programmatically. Ideally this should be handled on the server side.
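
One possible client-side cleanup, sketched in Python: drop every character that falls outside the character ranges XML 1.0 permits. The function name is made up for illustration, and whether silently removing characters is acceptable for your data is a separate decision.

```python
import re

# Everything NOT in the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove control characters that an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)


print(repr(strip_invalid_xml_chars("Note\x00 with\x0b bell\x07")))
```

Tabs, line feeds, carriage returns and ordinary non-ASCII text are left untouched.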
The QBO Global UI definitely supports UTF-8 encoding, but the QBO US UI seems to behave differently when dealing with special characters.
For example, in the QBO US UI, if you enter '你好嗎', it is converted to '}Î' after saving.
Edit
Here is a list of accepted characters:
•Alpha-numeric (A-Z, a-z, 0-9)
•Comma (,)
•Dot or period (.)
•Question mark (?)
•At symbol (@)
•Ampersand (&)
•Exclamation point (!)
•Number/pound sign (#)
•Single quote (')
•Tilde (~)
•Asterisk (*)
•Space ( )
•Underscore (_)
•Minus sign/hyphen (-)
•Semi-colon (;)
•Plus sign (+)
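
For a design-time check, the list above can be turned into a character class. This is only a sketch in Python: it reads the "At symbol" entry as @, the function name is made up, and it has not been verified against what QBO actually accepts.

```python
import re

# Any character outside the accepted list above.
_REJECTED = re.compile(r"[^A-Za-z0-9,.?@&!#'~* _\-;+]")


def rejected_characters(value: str) -> list:
    """Return the characters in value that are not on the accepted list."""
    return _REJECTED.findall(value)


print(rejected_characters("Bob & Jim's #1 Shop"))
print(rejected_characters("Acme: Smith \u2013 Jones"))
```

An empty list means every character is on the list; anything else can be reported or replaced before the query or insert is sent.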

Differentiate properly escaped HTML metacharacters from improperly escaped ones

I'm working on a replacement for a desktop Java app, a single page app written in Scala and Lift.
I have this situation where some of the data in the database has properly used HTML metacharacters, such as Unicode escape sequences for accented characters in non-English names. At the same time, I have other data with improper HTML metacharacters, such as ampersands in the names of organizations.
Good (don't escape): Universita\u0027
Bad (needs escape): Bob & Jim
How do I determine whether or not the data needs to be fixed before I send it to the client?
There are two ways to approach this. One is a function that takes a string and returns the index of any improperly escaped HTML metacharacters (which I can fix myself). Alternately it could be a function that takes a string and returns a string with the improperly escaped metacharacters fixed, and leaves the proper ones alone.
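
Both variants can be sketched with one regular expression that matches an ampersand only when it does not start something shaped like a character reference. The sketch below is in Python rather than Scala, the function names are hypothetical, and it makes two simplifying assumptions: named references are checked by shape only (not against the list of real HTML entity names), and only ampersands are handled, not < or >.

```python
import re

# An "&" that is NOT followed by name; or #digits; or #xhex;
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def bare_ampersand_positions(text: str) -> list:
    """Variant 1: indexes of ampersands that still need escaping."""
    return [m.start() for m in _BARE_AMP.finditer(text)]


def escape_bare_ampersands(text: str) -> str:
    """Variant 2: escape only the bare ampersands, leave the rest alone."""
    return _BARE_AMP.sub("&amp;", text)


print(bare_ampersand_positions("Bob & Jim"))
print(escape_bare_ampersands("Caf&eacute; & Bar"))
```

Strings without any ampersand, such as the Universita\u0027 example, pass through unchanged.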

What Unicode characters are dangerous?

What Unicode characters (more precisely codepoints) are dangerous and should be blacklisted and prohibited for the users to use?
I know that BIDI override characters and the "zero width space" are very prone to make problems, but what others are there?
Thanks
Characters aren’t dangerous: only inappropriate uses of them are.
You might consider reading things like:
Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax
RFC 3454: Preparation of Internationalized Strings (“stringprep”)
It is impossible to guess what you mean by dangerous.
A golden rule in security is to whitelist instead of blacklist: instead of trying to cover all bad characters, it is a much better idea to validate by ensuring the user only uses known good characters.
There are solutions that help you build the large whitelist that is required for international whitelisting. For example, in .NET there is UnicodeCategory.
The idea is that instead of whitelisting thousands of individual characters, the library assigns them into categories like alphanumeric characters, punctuations, control characters, and such.
Tutorial on whitelisting international characters in .NET
Unicode Regex: Categories
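
The same category-based idea, sketched in Python with the standard `unicodedata` module instead of .NET's UnicodeCategory. The set of allowed categories below is an arbitrary example for name-like input, not a recommendation.

```python
import unicodedata

# Example whitelist: letters, combining marks, decimal digits,
# space separators, dashes and basic punctuation.
ALLOWED_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd", "Zs", "Pd", "Po",
}


def disallowed(text: str) -> list:
    """Return (character, category) pairs that fall outside the whitelist."""
    return [
        (ch, unicodedata.category(ch))
        for ch in text
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]


print(disallowed("Jürgen O'Neil-Smith 你好"))
print(disallowed("abc\u202edef"))
```

The BIDI override and the zero width space are both in category Cf and get rejected here. A category whitelist is not sufficient on its own, though: some invisible characters are classified as letters and pass it.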
'HANGUL FILLER' (U+3164)
Since Unicode 1.1 in 1993, there has been an empty, wide character that renders as blank space.
We can't see it, and we can't copy/paste it on its own because we can't select it!
It needs to be generated, for example with the Unix keyboard shortcut CTRL + SHIFT + u followed by 3164.
It can pretty much 💩 up anything: variables, function names, URLs, file names, database entries, blog posts and logins. It can mimic DNS names, invalidate hash strings, allow fake identical accounts, etc.
DEMO 1: Altering variables
The name of the second variable contains a Hangul Filler character, while the console.log call uses the name without it, so the two names do not match:
const normal = "Hello w488ld"
const hijaㅤcked = "Hello w488ld"
console.log(normal)
console.log(hijacked)
DEMO 2: Hijacking URLs
These three URLs lead to xn--stackoverflow-fr16ea.com:
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
See Unicode Security Considerations Report.
It covers various aspects, from spoofing of rendered strings to dangers of processing UTF encodings in unsafe languages.
U+2800 BRAILLE PATTERN BLANK - a Braille character without any "dots". It looks like a regular "space" but is not classified as one.
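
A small sketch for spotting the specific code points named in this thread. The set is deliberately short and is not a complete list of problematic characters; the function name is made up for illustration.

```python
import unicodedata

# Code points mentioned in the answers above, plus the two BIDI overrides.
SUSPICIOUS = {
    "\u3164",  # HANGUL FILLER
    "\u2800",  # BRAILLE PATTERN BLANK
    "\u200b",  # ZERO WIDTH SPACE
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
}


def find_suspicious(text: str) -> list:
    """Return (index, code point, Unicode name) for each suspicious character."""
    return [
        (i, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ch in SUSPICIOUS
    ]


print(find_suspicious("hija\u3164cked"))
```

Running this over identifiers, file names or user names before accepting them makes the otherwise invisible character show up by position and name.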

Converted to Junk Character - When Copy Paste in Text Box

Whenever I copy and paste any of the characters below into a text box,
These are the copied characters (test this in Notepad):
…
”
‘
These are the typed characters:
...
"
'
they are converted to junk characters. How can I block this?
When I type those characters on the keyboard it works, but when I copy and paste them they are converted to junk.
How can I detect and delete all of these characters before processing? Users don't know about this issue.
I want to delete those characters when the user presses the Submit button.
” and ’ are not junk characters. They are perfectly good Unicode characters (U+201D RIGHT DOUBLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK). Modern applications should be capable of dealing with all Unicode characters; if you can't handle the smart quotes you probably also can't handle accents, Greek, Cyrillic, Chinese or any of the other characters users are likely to want to use. You should concentrate on ensuring that your application supports Unicode, rather than trying to fix this one visible symptom.
Pasting ' and " (ASCII straight quote) characters into a text box should not turn them into non-ASCII ‘smart’ quotes. Where they typically tend to come from is Microsoft Word's misguided ‘AutoReplace’ feature, which replaces straight quotes with smart quotes as you type. This is an annoyance, but ultimately it's limited to Office and there's not really much you can do about it. Whilst you can manually replace “ and ” with " by doing a trivial string replacement (and how you do that depends on what language/environment you are talking about), you'll also be removing correct usage of those characters, and you won't be fixing all the other sad broken auto-replacements that MS Office does.
The … single-character ellipsis is a slightly different case, and arguably ‘junk’: to Unicode, U+2026 HORIZONTAL ELLIPSIS is a ‘compatibility character’ which is only intended to round-trip nicely to existing encodings that include it as a separate character. Normally three dot characters should be used instead. You can replace compatibility characters by using Unicode normalisation, in particular Normal Form KC. Again, how you access normalisation is something that depends on your programming language/environment. For example in Python, unicodedata.normalize('NFKC', u'…') gives you u'...'.
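
A slightly fuller Python 3 sketch of the normalisation step, with a made-up wrapper name. It shows that NFKC replaces the ellipsis but leaves smart quotes alone, since they are not compatibility characters.

```python
import unicodedata


def normalize_compat(text: str) -> str:
    """Apply Normal Form KC, replacing compatibility characters."""
    return unicodedata.normalize("NFKC", text)


print(normalize_compat("Wait\u2026"))            # ellipsis becomes three dots
print(normalize_compat("\u201cquoted\u201d"))    # smart quotes are unchanged
```

Note that NFKC also rewrites other compatibility characters, such as ligatures and full-width forms, so it changes more than just the ellipsis.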
Is a VNC client or server running? Try shutting down all VNC servers and clients, then check whether copy and paste works.

Why do many coding websites convert the standard " into the non-standard ”?

This question is about the standard double quote " and the non-standard double quotes “ & ”
Yesterday I searched for some sample Facebook serverfbml code and came upon this:
http://mahmudahsan.wordpress.com/2008/11/22/facebook-fbml-rendering-in-iframe-application/
OK, it had what I wanted, so I copied the code into my project and ran it... bah... lots of errors.
Why? Because the site had turned the standard double quote " inside his script into “ or ”,
and the single quote ' into ’.
This is not the first time I've faced this problem when copying code from the Internet, and I believe many of the code authors didn't expect the site to turn their single/double quotes into strange ones.
Is there any explanation for this strange phenomenon?
Edit: I notice the title converted my " into “ & ” too... let me edit it... oh, and I failed.
At least in the title or in the text, it looks much better to have typographic double quotes (i.e. is more pleasant to the eye). Coding sites should not do this for actual code, i.e. in StackOverflow code that is indented by four spaces. If a double quote in text is converted to typographic, it's fine.
This gets much worse when you paste typographic quotes into a console that tries to display the character and falls back to a standard quote because the console font does not have a typographic quote: it then looks like a standard one, but it isn't. There is not much you can do about it, other than use a code display plugin on your website that does not change code.
The problem is in the underlying blog engine. WordPress does that by default, and there is AFAIK no way to turn it off (without changing the code). Given the fact that there are only relatively few really great blog engines, there may not always be a choice to switch to something "better".
Also in the same category: Fancy dashes, aka. turning - into –
The source shows that the quote character is sometimes ”.
That's the good-looking quote, which will cause problems in a program.
I think either the WordPress text editor or its storage/retrieval converted the ordinary quote into that one.
You can use the replace function in your code editor to replace those characters.
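
If an editor's replace function is not at hand, the same replacement can be scripted. A minimal Python sketch, with a hypothetical function name, that only handles the four typographic quote characters; fancy dashes could be added to the table in the same way.

```python
# Map typographic quotes back to the ASCII quotes that compilers expect.
STRAIGHTEN = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(code: str) -> str:
    """Replace smart quotes in pasted code with straight quotes."""
    return code.translate(STRAIGHTEN)


print(straighten_quotes("echo \u201chello\u201d; $a = \u2018b\u2019;"))
```

This is only appropriate for pasted source code; in ordinary text it would also remove correctly used typographic quotes.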