Putting German text in an HBase table - encoding

I am trying to update a table by adding a German string, like this:
put 'table:data_validation_test', '58e1f4200f23e474ca2d7f3a', 'urlbody:data', 'Auslöser'
What I get on scanning this table is this:
scan 'table:data_validation_test'
ROW COLUMN+CELL
58e1f4200f23e474ca2d7f3a column=urlbody:data, timestamp=1491215905923, value=Ausl\xC3\xB6ser
58e1f4200f23e474ca2d7f3a column=urlbody:id, timestamp=1491215697534, value=58e1f4200f23e474ca2d7f3a
I can't find a way to set the string encoding in HBase. How can I get the string stored in HBase exactly as it is?

This is just an output issue of the scan command (the same happens with get). In fact, your string is correctly stored.
This happens because ö is encoded in UTF-8 as the two bytes \xC3\xB6, and the shell escapes any byte it cannot display as a readable character. Remember that in HBase, everything is ultimately an Array[Byte].
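To make the two-byte encoding concrete, here is a small Scala sketch (plain Scala, not HBase-specific) that prints the UTF-8 bytes of the stored value:
// Print the UTF-8 bytes of the German string
val bytes = "Auslöser".getBytes("UTF-8")
println(bytes.map(b => "\\x%02X".format(b & 0xff)).mkString)
// prints \x41\x75\x73\x6C\xC3\xB6\x73\x65\x72 - "Ausl", then \xC3\xB6 for ö, then "ser"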
If you fetch your string value using JRuby (inside the HBase shell):
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes
config = HBaseConfiguration.create
htable = HTable.new(config, 'table:data_validation_test')
result = htable.get(Get.new('58e1f4200f23e474ca2d7f3a'.to_java_bytes))
# Bytes.toString decodes the stored byte array back to a UTF-8 string
puts Bytes.toString(result.getValue('urlbody'.to_java_bytes, 'data'.to_java_bytes))
Then, your value should be displayed properly.

Related

Spark write text file without ignoring escape (backslash)

I'm trying to write a Dataset to a text file.
Example
datasets
  .write
  .text(path)
What I intend to write is "some\text" (the String the dataset contains).
For Scala to interpret this String, we have to write the value like this:
val text: String = "some\\text"
Of course, when testing in Scala, it prints out the correct value ("some\text").
But when I write this dataset with spark.write, it ends up written as "some\\text".
Reading the internal code, I only found an escape option for CSV writing.
Is there any way to solve this problem?
Thanks
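For reference, a minimal Scala sketch (assuming Spark 2.x, a Dataset[String], and a hypothetical output path) that reproduces the write described above - the escape option mentioned in the question only exists for the CSV writer, so checking the part files written by .text() shows what is really stored:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("backslash-check").master("local[*]").getOrCreate()
import spark.implicits._

// "some\\text" in source code is the runtime string some\text
val ds = Seq("some\\text").toDS()

// Write with the text data source and inspect the part files on disk
ds.write.mode("overwrite").text("/tmp/backslash-check")

spark.stop()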

Change the format of file path which is partitioned by java.sql.Timestamp

We are using Spark as a data processing platform and Scala as the programming language. When we write data to a storage account (ADLS Gen2), we partition the data by a datetime column of type java.sql.Timestamp. We write the data using Spark's dataframe.write operation.
By default, it creates the following path on the storage account and writes Parquet files into it:
Path - __datetime=a/b/c/yyyy-MM-dd HH%3Amm%3Ass
The problem is that it has encoded the : but not the space, and because the URL is not fully encoded, it creates problems for us. Is there a fix for this?
Can I change the format of the column (of type java.sql.Timestamp) so that the output file path looks like one of the following, without any encoding?
__datetime=a/b/c/yyyy-MM-dd-HH-mm-ss
or
__datetime=a/b/c/yyyy_MM_dd_HH_mm_ss
Is it possible to do this within java.sql.Timestamp object and without converting it to a string?
Thanks
You can change the name / type of a dataframe column with a simple select + alias.
The encoding is necessary, though, because file paths cannot contain : characters, but they can contain spaces... It is unclear why you would need full URL encoding.
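For illustration, a minimal Scala sketch of that approach (the sample data, format and output path are hypothetical) - derive a partition-safe string column with date_format and partition on that instead of on the raw java.sql.Timestamp:
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder().appName("partition-format").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data; in practice this is the existing DataFrame with its Timestamp column
val df = Seq((1, Timestamp.valueOf("2021-03-01 10:15:30"))).toDF("id", "datetime")

// Format the timestamp as a string that contains no characters needing URL encoding
val partitioned = df.withColumn("__datetime", date_format(col("datetime"), "yyyy-MM-dd-HH-mm-ss"))

partitioned.write
  .mode("overwrite")
  .partitionBy("__datetime")
  .parquet("/tmp/partition-format-demo") // replace with the ADLS Gen2 path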

How to get the BSON from a ReactiveMongo BSONDocument in Scala?

I have a ReactiveMongo BSONDocument and I want to write it to a file - I know there is the BSON format (http://bsonspec.org/spec.html) and I want to write it according to that spec, but I can't find any method call to do this. I've been able to convert it to an array of bytes, but the problem begins when I convert it to a string, which is UTF-8 by default.
However, the BSON spec requires a 32-bit number at the beginning. Is there a library that can do this for me? If not, how can I join a string representing a 32-bit number and a UTF-8 string without losing the encoding of either?
Here's what I've got in Scala:
import java.nio.charset.Charset
import reactivemongo.bson.BSONDocument
import reactivemongo.bson.buffer.ArrayBSONBuffer

val doc = BSONDocument("data" -> overall)
val buffer = new ArrayBSONBuffer()
BSONDocument.write(doc, buffer)
val bytes = buffer.array
val str = new String(bytes, Charset.forName("UTF-8"))
For reference, I know in Ruby, we can do something like this, but how do I do the same thing with ReactiveMongo?
bson_data = BSON.serialize({data: arr}).to_s
As indicated in the documentation, you can use BSONDocument.pretty(myDoc).
Note that you are using the deprecated BSON API, which is being removed.
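If the goal is the raw BSON bytes on disk (length prefix and all) rather than a printable string, a minimal sketch along the lines of the question's own code (still the deprecated reactivemongo.bson API; the document contents and file name are hypothetical) is to write the buffer's byte array directly instead of decoding it as UTF-8:
import java.nio.file.{Files, Paths}
import reactivemongo.bson.BSONDocument
import reactivemongo.bson.buffer.ArrayBSONBuffer

val doc = BSONDocument("data" -> "example") // stand-in for the real document

val buffer = new ArrayBSONBuffer()
BSONDocument.write(doc, buffer)

// buffer.array already holds spec-compliant BSON, including the leading 32-bit length,
// so write the bytes as-is; no string conversion is needed
Files.write(Paths.get("doc.bson"), buffer.array)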

Why does Open XML API Import Text Formatted Column Cell Rows Differently For Every Row

I am working on an ingestion feature that will take a strongly formatted .xlsx file and import the records to a temp storage table and then process the rows to create db records.
One of the columns is strictly formatted as "Text", but it seems like the Open XML API handles the column's cells differently on a row-by-row basis. Some of the values, while appearing to be numeric, are truly not (which is why we format the column as Text) -
some examples are "211377", "211727.01", "209395.388", "209395.435"
What these values represent is not important, but some values (using the Open XML API v2.5 library) are read in properly as text, whether retrieved from the Shared Strings collection or simply from the InnerXML property, while others get sucked in as numbers with what appears to be appended rounding or precision.
For example, "211377", "211727.01" and "209395.435" all come in exactly as they are in the spreadsheet, but the "209395.388" value is being pulled in as "209395.38800000001" (this happens to other values as well).
There seems to be no rhyme or reason to which values get messed up and which ones import fine. What is really frustrating is that if I use the native Import feature in SQL Server Management Studio and ingest the same spreadsheet into a temp table, this does not happen - so how is it that the SSMS import can handle these values as purely text for all rows but the Open XML API cannot?
To begin the answer, your main problem seems to be this:
"209395.388" value is being pulled in as "209395.38800000001"
Yes, in the .xlsx file the value is stored as 209395.38800000001 instead of 209395.388, and that is how floating-point numbers get stored; nothing is wrong with it. You can simply confirm it with the following code snippet:
string val = "209395.38800000001"; // <= What we extract from Open Xml
Console.WriteLine(double.Parse(val)); // <= Simply pass it to double and print
The output is:
209395.388 // <= yes the expected value
So there is nothing wrong with the value you extract from the .xlsx using the Open XML SDK.
Now to cells: yes, a cell can have a variety of formats - numbers, text, booleans or shared-string text. And you can apply styles to a cell, which format your value to the desired output in Excel (e.g. date-time formats, forced strings, etc.). This is how Excel handles its vast variety of data; it needs this kind of formatting, and the .xlsx file format had to be a little complex to support it all.
My advice is to inspect each extracted value to identify what format it represents (for example, to determine whether it is a number or text) and apply the appropriate kind of parse.
ex:
string val = "209395.38800000001";
Console.WriteLine(float.Parse(val)); // <= float.Parse will produce a different value: 209395.4
Update:
Here is how the value is saved in the internal XML.
Try it yourself:
Make an .xlsx file with the value 209395.388 -> change the extension to .zip -> unzip it -> go to the worksheets folder -> open Sheet1.
You will notice that the value is stored as 209395.38800000001. So there is nothing wrong with the API extracting the stored number; it is your duty to decide what format to apply.
But if you make the whole column Text before adding the data, you will see that the .xlsx holds the data as it is - simply put, as a string.

Encoding in Pig

When loading data that contains particular characters (for example À, °, and others) using Pig Latin and storing the data in a .txt file, you can see that these symbols are displayed in the txt file as � and ï characters. That happens because of the UTF-8 substitution character.
I would like to ask whether it is possible to avoid this somehow, maybe with some Pig commands, so that the result (in the txt file) contains, for example, À instead of �?
In Pig we have built-in dynamic invokers that allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs. So you can load the data as UTF-8 encoded strings, decode it, perform all your operations on it, and then store it back as UTF-8. I guess this should work for the first part:
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
The Java code responsible for doing this is:
import java.io.IOException;
import java.net.URLDecoder;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UrlDecode extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // input: (encoded string, encoding name), e.g. ('...', 'UTF-8')
        String encoded = (String) input.get(0);
        String encoding = (String) input.get(1);
        return URLDecoder.decode(encoded, encoding);
    }
}
Now modify this code to return UTF-8 encoded strings from normal strings and store them in your text file. Hope it works.
You are correct: this happens because Text (http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/io/Text.html) converts incoming data (bytes) to UTF-8 automatically. To avoid this, you should not work with Text.
That said, you should use the bytearray type instead of chararray (bytearray does not use Text, so no conversion is done). Since you don't specify any code, I'll provide an example for illustration.
This is what you (likely) did:
converted_to_utf = LOAD 'strangeEncodingdata' using TextLoader AS (line:chararray);
This is what you wanted to do:
no_conversion = LOAD 'strangeEncodingdata' using TextLoader AS (line:bytearray);