Blob data: '+' replaced with space (iPhone)

I have an iPhone app that converts an image into NSData and then into a base64-encoded string.
When this string is submitted to the server and stored in its database, every '+' is converted into a space, so the decoder no longer works properly.
I guess the issue is the default encoding of the database table. It is currently Latin; I tried changing it to UTF-8, but the problem still exists.
Is there another encoding I should try? Please help.

Changing the table encoding cannot help, because this has nothing to do with character encoding. The format of POST and GET parameters clashes with base64: in form-encoded data a '+' stands for a space, so an unescaped '+' is turned into a space when the server decodes the request. The variants summary table at http://en.wikipedia.org/wiki/Base64#Variants_summary_table lists alternatives that are designed to make base64 work with URLs and similar contexts.
One of these variants is "Base64 with URL and Filename Safe Alphabet (RFC 4648 'base64url' encoding)", which replaces the + with - and the / with _.
Another alternative is to replace the offending characters +, / and = with their percent-encoded hex representations (%2B, %2F and %3D), but that makes the data unnecessarily longer.
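As a rough illustration of both the problem and the two workarounds, here is a small Python sketch. Python is used only for demonstration (it is neither the iPhone client nor the server from the question), and the form field name "data" is a made-up placeholder.

```python
import base64
from urllib.parse import parse_qs, urlencode

# Three bytes chosen so that the standard base64 form consists only of '+'.
raw = bytes([0xFB, 0xEF, 0xBE])
standard = base64.b64encode(raw).decode("ascii")          # "++++"
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")  # "----"

# Sent unescaped in a form body, every '+' is read back as a space.
broken = parse_qs("data=" + standard)["data"][0]

# Percent-encoding the value (which urlencode does) survives the round trip.
escaped = parse_qs(urlencode({"data": standard}))["data"][0]

# The URL-safe alphabet contains no '+' or '/', so nothing is altered.
safe = parse_qs("data=" + url_safe)["data"][0]
restored = base64.urlsafe_b64decode(safe)

print(repr(broken), repr(escaped), restored == raw)
```

Whichever variant is chosen, the client and the server have to agree on it: data encoded with the URL-safe alphabet must also be decoded with it.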

Is it possible to base64 decode part of a base64 encoded message

I am working on a project where I receive parts of base64-encoded data, but not the whole thing. Is it possible to figure out what that part of the data originally was?
For example, say I base64 encode hello world
It becomes aGVsbG8gd29ybGQ=
But say I am only able to capture sbG8gd29y
Which base64 decodes to ݽ
I am familiar with how the base64 encoding process works. The only approach I can think of is to add data to the front and back of the chunk at random and compare the results against dictionary words, but I am not even 100% sure that the data I am working with contains dictionary words.
Thanks
I just spent a little time using an online converter (http://www.convertstring.com/EncodeDecode/Base64Decode).
If you run your captured section through the converter, you can see that it has an invalid length for a base64-encoded string.
For a captured section to have a valid length you will need to add some extra characters (0-3, depending on the length of the section). A valid base64 string has a length that is exactly divisible by 4.
Pick a character ('a' for example) and run through the possibilities of adding the required number of characters to the section, at the front and at the back. With the added characters the string becomes decodable, and one of the decoded values will be more readable than the others; that is the one that contains the partially decoded data.
E.G:
sbG8gd29yaaa
and
aaasbG8gd29y
decodes to:
����ݽɦ�
and
i��lo wor
You can make a rudimentary programmatic test for readability by counting the number of 'normal' characters within the decoded string (a-z, for example). You will need to decide for yourself what counts as 'normal'; it depends on the expected language of the data and on the context (whether the data is known to be numeric only, for example).
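The procedure above could be sketched in Python roughly as follows. The function names, the filler character 'A' (used here instead of 'a' because it encodes all-zero bits) and the choice of letters, digits and space as 'normal' characters are all assumptions made for illustration.

```python
import base64
import string

# Byte values treated as "normal"; adjust this to the data you expect.
NORMAL = set((string.ascii_letters + string.digits + " ").encode("ascii"))

def candidates(fragment, filler="A"):
    """Yield (offset, decoded bytes) for each of the four possible alignments."""
    for offset in range(4):
        padded = filler * offset + fragment
        # Top up at the end so the total length is a multiple of 4.
        padded += filler * (-len(padded) % 4)
        yield offset, base64.b64decode(padded)

def readability(data):
    """Fraction of bytes that fall into the NORMAL set."""
    return sum(byte in NORMAL for byte in data) / max(len(data), 1)

best_offset, best_bytes = max(
    candidates("sbG8gd29y"), key=lambda item: readability(item[1])
)
print(best_offset, best_bytes)
```

The bytes that overlap the filler characters at either end are not trustworthy, so only the middle of the winning candidate should be read as recovered data.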

Polish name (Wężarów) returned from json service as W\u0119\u017car\u00f3w, renders as Wężarów. Can't figure out encoding/charset.

I'm using DB-IP.com to get city names from IP addresses. Many of these are international cities, with special characters in the names.
As an example, one of these cities is Wężarów in Poland. Checking the JSON response in the console or opening the request URL directly shows that DB-IP returns it as "W\u0119\u017car\u00f3w" with a Content-Type of text/javascript;charset=UTF-8. This is rendered in the browser as Wężarów, and it is also saved in my MySQL database as Wężarów (which I've tried with both utf8 and latin1 encoding).
I'm ok with saving it in the DB as another format, as long as I can convert it back to Wężarów for display in browser. I've tried encoding and decoding to/from several formats, even just to display directly on the screen (ignoring the DB entirely). I'm completely confused on what I need to do here to get it in readable format.
I'm working with Perl; however, if I can figure out what I need to do with the encoding/decoding/charset (I'm currently clueless), I'm sure I can work it out from there.
It looks like the UTF-8 encoded string was interpreted by the browser as if it were Windows-1252. Here's how I deduced it:
% python3
>>> s = "W\u0119\u017car\u00f3w"
>>> b = bytes(s, encoding='utf-8')
>>> b
b'W\xc4\x99\xc5\xbcar\xc3\xb3w'
>>> str(b, encoding='utf-8')
'Wężarów'
>>> str(b, encoding='latin-1')
'WÄ\x99żarów'
>>> str(b, encoding='windows-1252')
'Wężarów'
If you're not good with Python, what I'm doing here is encoding the string "W\u0119\u017car\u00f3w" into UTF-8, yielding the byte sequence 'W\xc4\x99\xc5\xbcar\xc3\xb3w'. Decoding that with UTF-8 yielded 'Wężarów', confirming that this is the correct UTF-8 encoding of the string you want. So I took a guess that the browser is using the wrong encoding to render it, and decoded it using Latin-1. That gave me something very close, so I looked up Latin-1 and noticed that it's named as the basis for Windows-1252. Decoding again as Windows-1252 gives the result you saw.
What's gone wrong here is that the browser can't tell which encoding to use to render the page, and it's guessing wrong. You need to fix this by telling it explicitly to use UTF-8. The W3C has a page that describes how to do that. Essentially, you need to add an HTML <meta> element declaring the charset to the document head. If you also set an HTTP header with the encoding name in it, make sure the two are consistent.
(In Firefox, while you're debugging, you can go to View -> Character Encoding to set the encoding on a page-by-page basis. I assume other browsers have the same feature.)
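If garbled values have already been stored, the same reasoning can be run in reverse. The following Python sketch assumes the stored text is UTF-8 that was misread as Windows-1252 and that no bytes were lost along the way; it fails with an exception if the text contains characters that Windows-1252 cannot encode.

```python
import json

# The JSON escapes decode to the intended name.
name = json.loads('"W\\u0119\\u017car\\u00f3w"')

# The mojibake as seen in the browser, written with escapes so that the
# source file's own encoding cannot interfere.
garbled = "W\u00c4\u2122\u00c5\u00bcar\u00c3\u00b3w"

# Undo the wrong decoding step, then apply the right one.
repaired = garbled.encode("windows-1252").decode("utf-8")

print(name, repaired, name == repaired)
```

This is a repair for existing data only; the lasting fix is still to declare UTF-8 consistently in the page, the HTTP header and the database connection.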

How to auto detect a String encoding?

I have a String which contains values encoded in some way, possibly Base64.
The problem is that I really don't know whether it's actually Base64 (it contains A-Z, a-z, 0-9, + and /), so it could be some other scheme that I'm not familiar with.
Is there a way, or an online site, that takes an encoded input and tells me which scheme was used?
NOTE:
I'm not asking how to know if my String is UTF-8 or iso-8859-1 or something like that.
What I need to know is the scheme with which my String was encoded.
EDIT:
To be more clear,
I need something to get an input like: 23Nzi4lUE4qlc+Pmc3blWMS1Irmgo3i8UTQHhoL7VyzqpEV/i9bDhoiteZ0a7/TqcVSkrXR89V2Yj7tEFDGJx4gvWEBs= this is the encoded String that I have.
The output should be the type of the encoded String and its decoding, like:
Base64 -> "Big yellow fish is swimming in the tube."
Maybe there is some program which takes an input and tries to decode it with a list of encoding types (Base64 etc.). The exact output doesn't really matter, because it's the user's decision whether it's good or not.
This site handles base64 de/encoding.
Since Base64 is just one instance of a class of encoding schemes (specifically, encoding a bit stream as a base-<n> number), you will probably never fare better than testing for a handful of standard encoding schemes.
You can either check the input for well-formedness under each scheme, or try to decode it and see whether an error is thrown, using a web service or your own code.
In (possibly pathological) cases there will be more than one encoding scheme under which a given octet stream decodes successfully.
Where possible, the better practice is to spend that effort on getting the data provider to commit to one (or a few) encodings up front instead of on building the detection, though that won't always be possible.
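The "try to decode and see whether an error is thrown" approach might look like this in Python. The set of schemes and the names DECODERS and try_decoders are choices made for this sketch; it only covers a few variants that the standard library happens to support.

```python
import base64

# Candidate schemes; each decoder raises ValueError (or its subclass
# binascii.Error) when the input is not well-formed for that scheme.
DECODERS = {
    "base64": lambda s: base64.b64decode(s, validate=True),
    "base64url": lambda s: base64.b64decode(s, altchars=b"-_", validate=True),
    "base32": base64.b32decode,
    "base16": base64.b16decode,
}

def try_decoders(text):
    """Return {scheme name: decoded bytes} for every scheme that accepts text."""
    results = {}
    for name, decode in DECODERS.items():
        try:
            results[name] = decode(text)
        except ValueError:
            continue
    return results

results = try_decoders("aGVsbG8gd29ybGQ=")
for name, data in results.items():
    print(name, "->", data)
```

A successful decode only shows that the input is well-formed under that scheme. Here both base64 and base64url accept the sample input, which is the ambiguity mentioned above, so a person or a readability heuristic still has to judge the result.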

How to handle special characters in Tritium

I'm working on a Spanish site using MoovWeb & Tritium but I am having issues with special characters.
Wherever the content contains special characters, they get jumbled and are shown as blocks or question-mark characters. I don't have access to the source code of the original site, so I cannot determine whether the site was using proper HTML-safe alternatives for special characters.
Is there any way to manage or handle special characters in tritium easily instead of doing some kind of find & replace routine?
While developing locally, you will have the source of the original site in the tmp/messages directory of your project. This is the raw response from the origin server, so any special encoding or character bytes will be preserved. If you can determine the bytes that make up the special characters, you can use Tritium's replace() function to change these bytes to HTML-safe alternatives. For example,
replace(/\xe9/, "é")
where \xe9 is the single byte that encodes é in ISO-8859-1 (Latin-1); in UTF-8 the same character is the two-byte sequence \xc3\xa9.
This is assuming that this character was encoded properly. If not, you'll have to isolate the malformed bytes and replace those one by one.
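To find out which bytes need replacing, it can help to inspect the raw response outside Tritium first. The following Python sketch is one possible way to do that; the helper names are made up, and it works on any bytes object, such as the contents of a file read from tmp/messages in binary mode.

```python
import re

def non_ascii_runs(raw):
    """Return the distinct runs of non-ASCII bytes in raw, in order of first appearance."""
    seen = []
    for run in re.findall(rb"[\x80-\xff]+", raw):
        if run not in seen:
            seen.append(run)
    return seen

def looks_like_utf8(raw):
    """True if raw decodes cleanly as UTF-8; False suggests a legacy single-byte encoding."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

sample = b"caf\xe9 ni\xf1o caf\xe9"  # Latin-1 bytes for "café niño café"
print(non_ascii_runs(sample), looks_like_utf8(sample))
```

If the response turns out to be valid UTF-8, the characters are multi-byte sequences and a pattern for a single byte such as \xe9 will not match them.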

base64 encoding: input character

I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, for whom I have tremendous respect, has an article on base64 in which he quotes a specification saying that an error should be thrown if the input contains any character with a code higher than 255:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:
if (/([^\u0000-\u00ff])/.test(text)){
    throw new Error("Can't base64 encode non-ASCII characters.");
}
In a separate part of the article he links to RFC 3548, but I don't see any input requirements there other than:
Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise.
I'm not sure what "base alphabet" means, but perhaps this is what Zakas is referring to. However, "reject the encoding" seems to refer to data that has already been encoded, not to the input (of course, if the input is invalid this will also show up in the encoding, so perhaps the point is moot).
I'm a bit confused about what the standard actually requires.
Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.
Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII text. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.
When someone talks about "base64 encoding a string" they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying a base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding, to get the original text.
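A minimal Python sketch of that two-step process, assuming UTF-8 as the text encoding (any encoding works as long as both sides agree on it):

```python
import base64

text = "\u20ac5 caf\u00e9"  # contains a character above U+00FF

# Step 1: text -> bytes, using an agreed character encoding.
as_bytes = text.encode("utf-8")

# Step 2: bytes -> base64. The result is bytes that are all printable ASCII.
encoded = base64.b64encode(as_bytes).decode("ascii")

# Decoding reverses both steps, in the opposite order.
decoded_text = base64.b64decode(encoded).decode("utf-8")

print(encoded, decoded_text == text)
```

Note that the euro sign has a code point well above 255 and still round-trips, because base64 only ever sees the bytes produced in step 1.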
For me the (first) linked article has a fundamental problem:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters
You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep in mind this difference.
Also, his check for 'ASCII' actually lets through every character from 0x80 to 0xff, none of which is ASCII; ASCII covers only 0x00 to 0x7f.
Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as a byte sequence of the ASCII values of the characters in it - but this is a separate earlier step, nothing strictly to do with the act of base64 encoding.
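The gap between the two checks can be shown with a short Python sketch. The function passes_article_check is a hypothetical Python rendering of the quoted regular expression, not code from the article.

```python
def passes_article_check(text):
    """Mirror of the quoted test: every code point is at most 0xFF."""
    return all(ord(ch) <= 0xFF for ch in text)

samples = ["cafe", "caf\u00e9", "\u20ac5"]

# For each sample: (passes the article's check, is genuinely ASCII)
report = {s: (passes_article_check(s), s.isascii()) for s in samples}
for sample, flags in report.items():
    print(repr(sample), flags)

# Only a genuinely ASCII string maps one-to-one onto its bytes.
ascii_bytes = "cafe".encode("ascii")
```

The middle sample, which contains é (U+00E9), passes the article's check even though it is not ASCII.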
(I should say that I do like his repeated urging for the reader to note that base64 encoding is not in any shape or form encryption)