Converting latin1 to UTF-8 in MongoDB - mongodb

Please help me change the encoding to UTF-8. The encoding is Latin-1 and the fields are stored in Latin-1 format.
I need to change this in a MongoDB database.
How do I change the encoding of a database from Latin-1 to UTF-8? For example:
db.collection.insert(array("title" => utf8_encode("Péter")));

MongoDB supports UTF-8 out of the box, so the problem is not in MongoDB itself but in the data you insert.
A Java String has no encoding attached to it, so changing the encoding is done by converting the string to a byte array and then decoding those bytes with the correct charset.
If you are using Java 7 or later, try this (using the constants from java.nio.charset.StandardCharsets):
byte[] titleValue = myString.getBytes(StandardCharsets.ISO_8859_1);
String value = new String(titleValue, StandardCharsets.UTF_8);
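As a minimal, self-contained sketch of that round-trip (assuming the stored text is UTF-8 bytes that were mis-decoded as Latin-1; the class name and sample string are illustrative, not from the question):

```java
import java.nio.charset.StandardCharsets;

public class FixMojibake {

    // Undo a string whose UTF-8 bytes were mistakenly decoded as Latin-1.
    static String fix(String garbled) {
        // Latin-1 maps every char 0-255 back to the same byte value,
        // so this recovers the original raw bytes losslessly...
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // ...which can then be decoded with the correct charset.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the corruption: "Péter" stored as UTF-8, read back as Latin-1
        String garbled = new String("Péter".getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled);      // prints "PÃ©ter"
        System.out.println(fix(garbled)); // prints "Péter"
    }
}
```

This only works because Latin-1 decodes every byte to a distinct character, so the mis-decoding step loses nothing and can be reversed.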

Related

Non-ISO extended-ASCII CSV giving special character while importing in DB

I am getting a CSV file from an S3 server and inserting it into PostgreSQL using Java.
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
BufferedReader reader = new BufferedReader(
new InputStreamReader(object.getObjectContent())
);
For some of the rows, the value in one column contains the special character �. I tried the encodings UTF-8, UTF-16 and ISO-8859-1 with InputStreamReader, but none of them worked.
When I use the Windows-1252 encoding, the DB still shows some special characters, and when I export the data to CSV it shows the same characters I found in the raw file.
Again, when I open the file in Notepad the character looks fine, but when I open it in Excel the same special character appears.
All the PostgreSQL stuff is quite irrelevant; PostgreSQL can deal with practically any encoding. Check your data with a utility such as enca to determine how it is encoded, and set your PostgreSQL session (client_encoding) to that encoding. If the server is in the same encoding or in some Unicode encoding, it should work fine.
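Once the actual encoding is known (say Windows-1252), it should be passed to InputStreamReader explicitly rather than relying on the platform default. A sketch, with a ByteArrayInputStream standing in for object.getObjectContent() from the question:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ReadWithCharset {

    // Read the first line of a stream, decoding with an explicit charset.
    static String firstLine(InputStream in, String charsetName) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            return reader.readLine();
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "café" as Windows-1252 bytes: é is the single byte 0xE9
        byte[] win1252 = {'c', 'a', 'f', (byte) 0xE9};
        InputStream content = new ByteArrayInputStream(win1252);
        System.out.println(firstLine(content, "windows-1252")); // prints "café"
    }
}
```

Decoding the same bytes as UTF-8 instead would produce the � replacement character the question describes, since 0xE9 is not a valid UTF-8 sequence on its own.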

addPortalMessage requires decode('utf-8')

Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
Given that Titles in Plone are already UTF-8 encoded, that the string is a unicode string, and that the underscore function is handled by i18ndude, I do not see why we specifically need to decode UTF-8. Usually I forget to add it and only remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?
UTF-8 is a representation (an encoding) of Unicode; it is not Unicode itself and not a Python unicode string. In Python 2, we convert back and forth between Python's unicode strings and encoded representations via encode/decode.
Decoding a UTF-8 byte string via utf8string.decode('utf-8') produces a Python unicode string that can be concatenated with other unicode strings.
Python 2 will automatically convert a byte string to unicode when it needs to, using the ASCII codec. That fails if the string contains non-ASCII bytes -- because, for example, it is encoded in UTF-8.

when should I favor a local 1-byte encoding like Windows-1251 over UTF-8?

Local encodings like Windows-1251 use 1 byte per character, while UTF-8 requires 2 bytes per character for Cyrillic (code points above 127), which roughly doubles the file size for Russian text. However, by using UTF-8 I save myself trouble in the future that would otherwise manifest as incorrectly displayed characters. So my question is: when should I favor a local 1-byte encoding like Windows-1251 over UTF-8?
Use UTF-8.
There is no good reason to use win1251 or any other 1-byte encoding.
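A quick check of the size difference the question describes (the 6-letter Russian word is just a sample string):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String russian = "Привет"; // 6 Cyrillic letters

        // One byte per letter in the legacy single-byte encoding
        int win1251Len = russian.getBytes(Charset.forName("windows-1251")).length;
        // Two bytes per Cyrillic letter in UTF-8
        int utf8Len = russian.getBytes(StandardCharsets.UTF_8).length;

        System.out.println(win1251Len); // prints 6
        System.out.println(utf8Len);    // prints 12
    }
}
```

The doubling is real, but it only applies to the Cyrillic text itself; ASCII markup, digits and punctuation stay at 1 byte each in UTF-8.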

ERROR: invalid byte sequence for encoding "UTF8" inserting in pgadmin

I am having issues inserting data that contains the character é into Postgres.
When I insert this character through pgAdmin it is parsed as ETX, while the psql shell parses it as ^C. When I put the query containing the character in a file and pass the file to the psql shell, it gives me an error:
ERROR: invalid byte sequence for encoding "UTF8": 0x82
My Postgres 9.0 database encoding is set to UTF-8.
Please let me know how to deal with these kinds of characters.
Thanks,
Rohit.
PS: I am not sure if the character renders properly here. It is a box-drawing character, represented as code 192 in extended ASCII (CP437) and as U+2514 in Unicode.
The simple solution is to find out what encoding your client is actually using and declare it with SET client_encoding.
For example, this may fix your problem:
SET client_encoding = 'WIN1252';
If you are on Windows with pgadmin, a client encoding of Windows 1252 would be the most likely cause of the problem.

Arabic character base64 encoding issue

I am encoding a string to Base64-encoded data.
Edit: removed irrelevant base64 conversion code
Will there be any problem when I try to encode mixed English and Arabic data? Because we are using:
base64Data = [string dataUsingEncoding:NSASCIIStringEncoding];
I have heard that NSASCIIStringEncoding should not be used with Unicode strings.
Base64 encodes data (raw bytes) and produces ASCII strings, so your real problem lies in converting your string into an encoded byte array.
You could use any encoding that contains both Arabic and English characters, but you have to make sure the recipient of the Base64-encoded message knows and understands that encoding.
UTF-8 is a good place to start.
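For comparison, the same idea in Java (the sample string is illustrative): encode the characters to UTF-8 bytes first, Base64 those bytes, and decode with the same charset on the receiving side:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ArabicBase64 {
    public static void main(String[] args) {
        String mixed = "hello مرحبا";

        // Encode the characters as UTF-8 bytes, then Base64 the raw bytes
        String b64 = Base64.getEncoder()
                           .encodeToString(mixed.getBytes(StandardCharsets.UTF_8));

        // The recipient must reverse both steps with the same charset
        String roundTrip = new String(Base64.getDecoder().decode(b64),
                                      StandardCharsets.UTF_8);

        System.out.println(roundTrip.equals(mixed)); // prints "true"
    }
}
```

An ASCII-only encoding would drop or reject the Arabic letters before Base64 ever ran, which is why the charset choice, not the Base64 step, is what matters here.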