Encoding in Pig

When loading data that contains special characters (for example À, °, and others) with Pig Latin and storing the result in a .txt file, those symbols show up in the output file as � and ï characters. This happens because of the UTF-8 substitution character (U+FFFD).
Is it possible to avoid this somehow, perhaps with some Pig commands, so that the result (in the txt file) contains, for example, À instead of �?

Pig has built-in dynamic invokers that allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs. So you can load the data as UTF-8 encoded strings, decode it, perform all your operations on it, and then store it back as UTF-8. Something like this should work for the first part:
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
The Java code responsible for doing this is:
import java.io.IOException;
import java.net.URLDecoder;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// UDF that URL-decodes its first argument using the charset named by its second argument.
public class UrlDecode extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String encoded = (String) input.get(0);
        String encoding = (String) input.get(1);
        return URLDecoder.decode(encoded, encoding);
    }
}
Now modify this code to go the other way, returning UTF-8 encoded strings from normal strings, and store the result to your text file; a sketch follows. Hope it works.
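For the second part, a sketch of a counterpart UDF, assuming java.net.URLEncoder mirrors URLDecoder (the class name UrlEncode here is illustrative, not part of Pig):
import java.io.IOException;
import java.net.URLEncoder;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Illustrative counterpart: URL-encodes the first argument using the charset named by the second.
public class UrlEncode extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String raw = (String) input.get(0);
        String encoding = (String) input.get(1);
        return URLEncoder.encode(raw, encoding);
    }
}
Or, staying with dynamic invokers, the same effect without a custom UDF:
DEFINE UrlEncode InvokeForString('java.net.URLEncoder.encode', 'String String');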

You are correct: this happens because Text (http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/io/Text.html) converts incoming data (bytes) to UTF-8 automatically. To avoid this you should not work with Text.
That is, you should use the bytearray type instead of chararray (bytearray does not use Text, so no conversion is done). Since you don't show any code, I'll provide an example for illustration:
This is what you (likely) did:
converted_to_utf = LOAD 'strangeEncodingdata' using TextLoader AS (line:chararray);
This is what you want instead:
no_conversion = LOAD 'strangeEncodingdata' using TextLoader AS (line:bytearray);
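To write those untouched bytes back out, a minimal sketch (the output path is illustrative; PigStorage should write bytearray fields as raw bytes, so no conversion happens on the way out either):
STORE no_conversion INTO 'output_noconv' USING PigStorage();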

Related

Spark write text file without ignoring escape (backslash)

I'm trying to write a Dataset to a text file.
Example
datasets
.write
.text(path)
What I intended is to write "some\text" (the String the dataset contains).
For Scala to interpret this String, we have to write the literal like this:
val text: String = "some\\text"
Of course, when testing in Scala, it prints out the correct value ("some\text").
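For instance, a quick check in the Scala REPL (illustrative):
val text: String = "some\\text" // the backslash must be escaped in the source literal
println(text) // prints: some\text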
But when I write this dataset with spark.write, it comes out as "some\\text".
Reading the internal code, I only found an escape option for csv writing.
Is there any way to solve this problem?
Thanks

How to return a bytes value in a CherryPy request body

I have a postgres table that contains a bytea column. This column contains an image.
The SQLAlchemy model has this column defined as a LargeBinary. I've also tried using BLOB, but it didn't change a thing.
I can easily retrieve a value from the database and what I get is a variable of type bytes.
How can I jsonify that bytes value? I need the JSON value so I can return it in the CherryPy request body like so:
data = { 'id_image': image.id_image, 'image': image.value }
I'm assuming you need to show that image in a browser or similar software.
Normally you can use a Data URI when embedding an image as a string into a web page. Modern browsers know how to decode it back.
I'm also assuming you have a PNG image. If your case is different, feel free to change image/png to something that matches your needs.
Here's how you can generate data URI using Python.
This example uses Python 3.6 syntax:
import base64
img_id = image.id_image
img_base64_encoded = base64.b64encode(image.value).decode('ascii')
img_type = 'image/png' # Use some smart image type guessing if applicable
img_data_uri = f'data:{img_type};base64,{img_base64_encoded}'
img_data = {
    'id': image.id_image,
    'data_uri': img_data_uri
}
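If you then need to hand that dict back from a CherryPy handler, a minimal sketch (the class, method, and fetch_image names are made up for illustration; cherrypy.tools.json_out serializes the returned dict to JSON):
import base64
import cherrypy

class ImageApi:
    @cherrypy.expose
    @cherrypy.tools.json_out()  # serialize the returned dict as the JSON response body
    def image(self, id_image):
        image = fetch_image(id_image)  # hypothetical SQLAlchemy lookup
        encoded = base64.b64encode(image.value).decode('ascii')
        return {
            'id': image.id_image,
            'data_uri': f'data:image/png;base64,{encoded}'
        }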

Putting German text in an HBase table

I am trying to update a table by adding a German string, doing the following:
put 'table:data_validation_test','58e1f4200f23e474ca2d7f3a','urlbody:data','Auslöser'
What I get on scanning this table is this:
scan 'table:data_validation_test'
ROW COLUMN+CELL
58e1f4200f23e474ca2d7f3a column=urlbody:data, timestamp=1491215905923, value=Ausl\xC3\xB6ser
58e1f4200f23e474ca2d7f3a column=urlbody:id, timestamp=1491215697534, value=58e1f4200f23e474ca2d7f3a
I can't find a way to set string encoding in HBase. How can I get the string as-is into HBase?
This is just an output issue of the scan command (the same happens with get). In fact, your string is correctly stored.
This happens because ö (\xC3\xB6) is encoded on 2 bytes, and \xC3 and \xB6 cannot be displayed as readable characters. Remember that in HBase, the main type is Array[Byte].
If you try to get your string value using JRuby (inside the HBase shell):
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes
config = HBaseConfiguration.create
htable = HTable.new(config, 'table:data_validation_test')
result = htable.get(Get.new('58e1f4200f23e474ca2d7f3a'.to_java_bytes))
puts Bytes.toString(result.getValue('urlbody'.to_java_bytes, 'data'.to_java_bytes))
Then, your value should be displayed properly.

Camel: UTF-8 Encoding is lost after using Group

I'm using Camel 2.14.1 and splitting a huge XML file with Chinese/Japanese characters using group=10000 within the tokenize tag.
Files are created successfully based on the grouping, but the Chinese/Japanese text is converted to junk characters.
I tried enforcing UTF-8 before the new XML files are created, using "ConvertBodyTo", but the issue persists.
Can someone help me?
I had run into a similar issue while trying to split a csv file using tokenize with grouping.
Sample csv file (with delimiter '|'):
CandidateNumber|CandidateLastName|CandidateFirstName|EducationLevel
CAND123C001|Wells|Jimmy|Bachelor's Degree (±16 years)
CAND123C002|Wells|Tom|Bachelor's Degree (±16 years)
CAND123C003|Wells|James|Bachelor's Degree (±16 years)
CAND123C004|Wells|Tim|Bachelor's Degree (±16 years)
The ± character is corrupted after tokenize with grouping. I was initially under the assumption that the problem was not setting the proper file encoding for the split, but the exchange seems to have the right value for the property CamelCharsetName=ISO-8859-1.
from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n",2,true)).streaming()
.log("body: ${body}");
The same works fine when grouping is not used:
from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
.split(body().tokenize("\n")).streaming()
.log("body: ${body}");
Thanks to this post, it was confirmed that the issue lies in the grouping.
Looking at GroupTokenIterator in the Camel code base, the problem seems to be the way the TypeConverter is used to convert String to InputStream:
// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, data);
...
Note: mandatoryConvertTo() has an overloaded variant that takes the exchange:
<T> T mandatoryConvertTo(Class<T> type, Exchange exchange, Object value)
As the exchange is not passed as an argument, it always falls back to the default charset set via the system property "org.apache.camel.default.charset".
Potential Fix:
// convert to input stream
InputStream is =
camelContext.getTypeConverter().mandatoryConvertTo(InputStream.class, exchange, data);
...
As this fix is in camel-core, another potential option is to use split without grouping and re-group with an AggregationStrategy plus completionSize() and completionTimeout(), as sketched below.
Although it would be great to get this fixed in camel-core.
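A minimal sketch of that workaround, inside a RouteBuilder (the endpoints and completion values are illustrative). The bodies stay Strings, so the charset-sensitive String-to-InputStream conversion never happens:
// requires org.apache.camel.Exchange and org.apache.camel.processor.aggregate.AggregationStrategy
from("file://<dir with csv files>?noop=true&charset=ISO-8859-1")
    .split(body().tokenize("\n")).streaming()
    // re-group the split lines ourselves instead of using tokenize grouping
    .aggregate(constant(true), new AggregationStrategy() {
        @Override
        public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
            if (oldExchange == null) {
                return newExchange;
            }
            String merged = oldExchange.getIn().getBody(String.class)
                + "\n" + newExchange.getIn().getBody(String.class);
            oldExchange.getIn().setBody(merged);
            return oldExchange;
        }
    })
    .completionSize(10000)
    .completionTimeout(5000)
    .log("group: ${body}");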

Insert an image encoded in base64 into a Word document with python-docx?

I use python-docx to generate a Word document. The user wants to create a template (in a description field), and when he writes, for example, %(company_logo)s in the template, I replace this expression with the company's picture retrieved from the database.
As a first step, I retrieved the logo of a company from the database (PostgreSQL) and used this code to replace the expression:
cr.execute("select name, logo_web from res_company where id=%s", [soc_id])
r = cr.fetchone()
if r:
    company_name = r[0]
    logo_company = r[1]
output = cStringIO.StringIO()
doc = docx.Document()
contenu = contenu % {'company_logo': logo_company, 'company_name': company_name}
doc.add_paragraph(contenu)
The output was a Word document containing the base64 code of the image as a string. I decoded this code and tried to add it as a picture with the following code:
logo_company = base64.b64decode(r[1])
doc.add_picture(logo_company)
But I get an error telling me that the argument must be the path to the picture:
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
The documentation here explains that the add_picture() method takes a file as an argument. The file can be in the form of a path, or it can be a file-like object, such as an open file or a StringIO object. It cannot accept a bytestring containing the bytes of the image, which is what you've tried to do.
So you'll need to convert the image bytes into a file-like object, perhaps using StringIO(), and hand the resulting file-like object to add_picture(). That will get it working for you. Something like:
logo_file = StringIO(base64.b64decode(r[1]))
doc.add_picture(logo_file)
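As a side note, if you are on Python 3, the equivalent sketch would use io.BytesIO, since the decoded logo is bytes:
import base64
import io

logo_file = io.BytesIO(base64.b64decode(r[1]))  # r[1] as in the question's fetched row
doc.add_picture(logo_file)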