So I have this JSON data that contains the strings \r (carriage return) and \n (new line). It's from Firebase. The problem is that when I encode the data using json.encode, it adds an escape character, so \r becomes \\r.
I'm sending that data to another server.
json.encode works as expected if I do json.encode({'hello': 'world\r\n'}), but it adds \ when I use it on my other string.
Am I missing something?
Is there some type of encoding to prevent it from adding \?
It seems that the data you received does not contain CR and LF characters but contains their escape sequences (\ followed by r and \ followed by n). Therefore when you encode that to JSON, it will be escaped again.
You could do:
data = data.replaceAll('\\r', '\r').replaceAll('\\n', '\n');
which probably would work most of the time, but it would have the corner case of undesirably replacing occurrences that were explicitly intended to be escaped. (That is, a string '\\n' would be transformed to a sequence \, LF.)
Since the data is already escaped, you could probably unescape it with json.decode. Of course, decoding the data as JSON just to re-encode it as JSON seems a little silly, so if it's already properly encoded JSON, you should ideally pass it through unchanged.
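To make the double escaping visible, here is a minimal sketch in Python (the question uses Dart, but the standard json module behaves the same way for this purpose). The helper name unescape_json_fragment is hypothetical, and wrapping the text in double quotes assumes it is a valid JSON string body, with no unescaped double quotes inside.

```python
import json

# Real CR and LF: two control characters at the end of the string.
real = "world\r\n"
# What the data seems to contain instead: backslash + 'r', backslash + 'n'.
escaped = "world\\r\\n"

print(json.dumps(real))     # control characters are escaped once
print(json.dumps(escaped))  # the existing backslashes are escaped again


def unescape_json_fragment(text):
    """Interpret JSON escape sequences in text by parsing it as a JSON string.

    Assumes text is a valid JSON string body (hypothetical helper).
    """
    return json.loads('"' + text + '"')


restored = unescape_json_fragment(escaped)
print(restored == real)
```

If the incoming data is already a complete, valid JSON document, none of this is needed: it can be forwarded as-is.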
I've got a string, that looks like a Base64 ASCII encoded string:
#2aHR0cDovL2RhdGEwMi1jZG4uZGF0YWxvY2sucnUvZmkybG0vZTAxNGNmZTZhMzE1ZjgyODgyZWUxMmJjOTY5MzQ1MDkvN2ZfVGVzdC5uYS5iZXJlbWVubm9zdC5zMDMuZTAyLldFQi1ETFJpcC4yNUt1em1pY2g\/\/b2xvbG8=uYTEuMTYuMTEuMjIubXA0
If I decode it without any edits, the result is a mess, but if I remove 2 chars from the very beginning (#2), it decodes into a mostly correct string:
http://data02-cdn.datalock.ru/fi2lm/e014cfe6a315f82882ee12bc96934509/7f_Test.na.beremennost.s03.e02.WEB-DLRip.25Kuzmich?
but it is still not complete. The URL should be:
http://data02-cdn.datalock.ru/fi2lm/f03143c36c778262bd9906da5d545f85/7f_Test.na.beremennost.s03.e02.WEB-DLRip.25Kuzmich.a1.16.11.22.mp4
If I remove some more characters from initial string (#2aHR0cDovL2RhdGEwMi1jZG4uZ), I get a corrupted text with correct ending of decoded URL:
][KLKLMMLMYYLLML
LK\K\[Y[LPQ\R^ZXololo.a1.16.11.22.mp4
Is this a regular problem of base64 encoding, or is there some sort of mutation in the encoded string that can be solved?
In my experiments I used base64decode.org
The slashes should not be backslashed and there should not be an equals sign in the middle of the stream.
If I remove the backslashed slashes and the equals sign, I get
http://data02-cdn.datalock.ru/fi2lm/e014cfe6a315f82882ee12bc96934509/7f_Test.na.beremennost.s03.e02.WEB-DLRip.25Kuzmich.16.11.22.184
I don't think there's anything "regular" about this corruption. Lucky they didn't inject valid base64 or a lot of spurious junk.
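The two symptoms named above (characters outside the Base64 alphabet, and an equals sign before the end) can be located mechanically. The following is only a sketch: find_base64_anomalies and the short sample string are hypothetical and are not taken from the question's data.

```python
import re

# Characters allowed in standard Base64 (padding '=' is handled separately).
BASE64_CHAR = re.compile(r"[A-Za-z0-9+/]")


def find_base64_anomalies(text):
    """Return (index, char) pairs that should not occur at that position.

    Flags characters outside the standard alphabet and any '=' that is
    not part of the trailing padding. Hypothetical helper.
    """
    body_length = len(text.rstrip("="))  # everything before trailing padding
    anomalies = []
    for index, char in enumerate(text):
        if char == "=":
            if index < body_length:
                anomalies.append((index, char))
        elif not BASE64_CHAR.fullmatch(char):
            anomalies.append((index, char))
    return anomalies


# Made-up sample: a backslash and a mid-stream '=' injected into Base64 text.
sample = "aGVs\\/bG8=dGVzdA=="
print(find_base64_anomalies(sample))
```

This only reports suspicious positions. Deciding what was injected, and how much of it to remove so the remaining stream realigns on 4-character groups, still needs a human look.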
Using Python 3.4, suppose I have some data from a file, and it is literally the 6 individual characters \ u 0 0 C 0 but I need to convert it to the single unicode character \u00C0. Is there a simple way of doing that conversion? I can't find anything in the Python 3.4 Unicode documentation that seems to provide that kind of conversion, except for a complex way using exec() of an assignment statement which I'd like to avoid if possible.
Thanks.
Well, there is:
>>> b'\\u00C0'.decode('unicode-escape')
'À'
However, the unicode-escape codec is aimed at a particular format of string encoding, the Python string literal. It may produce unexpected results when faced with other escape sequences that are special in Python, such as \xC0, \n, \\ or \U000000C0, and it may not recognise escape sequences from other string literal formats. It may also handle characters outside the Basic Multilingual Plane incorrectly (e.g. JSON would encode U+10000 as the surrogate pair \uD800\uDC00).
So unless your input data really is a Python string literal shorn of its quote delimiters, this isn't the right thing to do and it'll likely produce unwanted results for some of these edge cases. There are lots of formats that use \u to signal Unicode characters; you should try to find out what format it is exactly, and use a decoder for that scheme. For example if the file is JSON, the right thing to do would be to use a JSON parser instead of trying to deal with \u/\n/\\/etc yourself.
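As an illustration of the difference, here is a small sketch that assumes the file's escapes follow JSON rules (an assumption; the question does not say which format the data is in). The helper decode_json_escapes is hypothetical and requires that the text contain no unescaped double quotes.

```python
import json

# The six literal characters \ u 0 0 C 0 from the question.
raw = "\\u00C0"


def decode_json_escapes(text):
    """Decode JSON-style escapes by parsing text as a JSON string body."""
    return json.loads('"' + text + '"')


print(decode_json_escapes(raw))  # the single character U+00C0

# A surrogate pair as JSON would write U+10000.
pair = "\\uD800\\uDC00"
via_json = decode_json_escapes(pair)                       # one code point
via_codec = pair.encode("ascii").decode("unicode-escape")  # two lone surrogates
print(len(via_json), len(via_codec))
```

The JSON parser combines the pair into one character, while the unicode-escape codec leaves two separate surrogate code points, which is the edge case described above.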
I have an iPhone app that converts an image into NSData and then into a base64-encoded string.
When this encoded string is submitted to the server and stored in the server's database, '+' gets converted into a space, so the decoder does not work properly.
I guess the issue is with the default encoding of the table in the database. Currently it's latin; I tried changing it to UTF8, but the problem still exists.
Is there any other encoding I should use? Please help.
Of course - that has nothing to do with encoding. It is the format of the POST and GET parameters which creates a clash with base64. In http://en.wikipedia.org/wiki/Base64#Variants_summary_table you see alternatives which are designed to make base64 work with URLs etc.
One of these variants is "Base64 with URL and Filename Safe Alphabet (RFC 4648 'base64url' encoding)" which replaces the + with - and the / with _.
Another alternative would be to replace the offending characters +, / and = with their respective hex representations in %xx form, but that makes the data unnecessarily longer.
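Both options can be tried out with Python's standard library, used here only as a stand-in for the server side (the question's actual stack is not stated). The two-byte payload and the form field name image are made up for the sketch.

```python
import base64
from urllib.parse import parse_qs, quote

# Two bytes chosen so that standard Base64 produces both '+' and '/'.
payload = b"\xfb\xff"

standard = base64.b64encode(payload).decode("ascii")
url_safe = base64.urlsafe_b64encode(payload).decode("ascii")
print(standard, url_safe)

# Form decoding turns an unescaped '+' into a space, which corrupts the value.
broken = parse_qs("image=" + standard)["image"][0]

# Option 1: the base64url alphabet survives form decoding unchanged.
intact = parse_qs("image=" + url_safe)["image"][0]

# Option 2: percent-encode the standard Base64 text before sending it.
percent = quote(standard, safe="")
round_trip = parse_qs("image=" + percent)["image"][0]

print(repr(broken), intact, percent, round_trip)
```

With base64url the receiver must decode with the matching alphabet (base64.urlsafe_b64decode in Python); with percent-encoding the receiver's normal form decoding restores the original text.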
I am interested in theory on whether Encoding is the same as Escaping? According to Wikipedia
an escape character is a character
which invokes an alternative
interpretation on subsequent
characters in a character sequence.
My current thought is that they are different. Escaping is when you place an escape character in front of one or more metacharacters to mark them as behaving differently than they normally would.
Encoding, on the other hand, is all about transforming data into another form, and upon wanting to read the original content it is decoded back to its original form.
Escaping is a subset of encoding: You only encode certain characters by prefixing a special character instead of transferring (typically all or many) characters to another representation.
Escaping examples:
In an SQL statement: ... WHERE name='O\' Reilly'
In the shell: ls Thirty\ Seconds\ *
Many programming languages: "\"Test\" string" (or """Test""")
Encoding examples:
Replacing < with &lt; when outputting user input in HTML
The character encoding, like UTF-8
Using sequences that do not include the desired character, like \u0061 for a
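A few of the examples above can be reproduced with Python's standard library. This is only an illustration of the listed cases; the sample strings are made up.

```python
import html
import json

# Escaping: only the quote characters get a backslash prefix.
quoted = json.dumps('"Test" string')
print(quoted)

# HTML entity encoding: '<' is replaced by a sequence that does not contain it.
safe_html = html.escape("a < b")
print(safe_html)

# Character encoding: the whole text is mapped to a byte representation.
utf8_bytes = "é".encode("utf-8")
print(utf8_bytes)

# A \u sequence that names the character without containing it.
print("\u0061" == "a")
```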
They're different, and I think you're getting the distinction correctly.
Encoding is when you transform a logical representation of a text ("logical string", e.g. Unicode) into a well-defined sequence of binary digits ("physical string", e.g. ASCII, UTF-8, UTF-16). Escaping uses a special character (typically the backslash: '\') which initiates a different interpretation of the character(s) following it; escaping is necessary when you need to encode a larger number of symbols into a smaller number of distinct (and finite) bit sequences.
They are indeed different.
You pretty much got it right.
I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is, when I open the text file for viewing, I get a lot of hex characters like \x92 and \x93, which stand for single and double quotes, I guess.
I am using DBI->quote function to escape the special chars before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both the tables is latin1.
How do I get rid of those hex characters and get the character to show in the text file?
ISO Latin-1 does not define characters in the range 0x80 to 0x9f, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin1 except that it defines additional characters (including left/right quotes) in this range.
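The difference between the two interpretations is easy to check. The sketch below uses Python rather than Perl, purely to show what the two bytes from the question mean under each character set.

```python
# The two bytes mentioned in the question.
raw = b"\x92\x93"

# Read as Windows-1252, they are typographic quotation marks.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)  # right single quote, left double quote

# Python's latin-1 codec maps them to C1 control code points instead,
# which have no printable glyph.
as_latin1 = raw.decode("latin-1")
print([hex(ord(ch)) for ch in as_latin1])

# Once decoded correctly, the text can be written out as UTF-8.
print(as_cp1252.encode("utf-8"))
```

In Perl, the analogous step would be decoding the bytes as cp1252 before writing them out in the desired output encoding.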
\x92 and \x93 are empty characters in the latin1 character set (see here or here). If you are certain that you are indeed dealing with latin1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)