Best way to store hierarchical data in hbase - xml-serialization

I have a hierarchical XML file received from a client and I need to store it in an HBase database. As I am new to HBase, I don't understand how to approach this. Can you please guide me on how to proceed with storing this hierarchical data in HBase?
Thanks in advance

HBase stores data in a column-oriented format. Each row must have a unique row key. Column qualifiers (sub columns) can be created on the fly, but column families (the main columns) must be defined in the table schema.
For example, consider this XML:
<X1>
<X2 name = "uniqueid">1</X2>
<X3>
<X4>value1</X4>
<X5>value2</X5>
<X6>
<X7>value3</X7>
<X8>value4</X8>
</X6>
</X3>
<X7>value5</X7>
</X1>
In this case, the column families would be X3 and X7, and the row ID can be taken from X2.
You can construct an HBase entry equivalent to this using the Java API, like this:
Put p = new Put("1".getBytes()); // the unique row ID taken from X2
p.add("X3".getBytes(), "X4".getBytes(), value1.getBytes());
where the first argument is the column family and the second is the column qualifier (sub column). In HBase 1.0 and later this method is named addColumn.
Deeper nesting can be folded into the qualifier, for example:
p.add("X3".getBytes(), "X6:X7".getBytes(), value3.getBytes());
Then call table.put(p). That's it!
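The mapping described above can be sketched without an HBase client at all. The following is a minimal illustration, assuming the top-level children of the root become column families and any deeper path is joined with ":" to form the qualifier. The function names flatten and xml_to_cells are hypothetical, and the empty qualifier used for the top-level leaf X7 is one possible convention, not the only one.

```python
import xml.etree.ElementTree as ET

XML = """<X1>
  <X2 name="uniqueid">1</X2>
  <X3>
    <X4>value1</X4>
    <X5>value2</X5>
    <X6>
      <X7>value3</X7>
      <X8>value4</X8>
    </X6>
  </X3>
  <X7>value5</X7>
</X1>"""


def flatten(elem, path=()):
    """Yield (path, text) for every leaf element at or below elem."""
    children = list(elem)
    if not children:
        yield path + (elem.tag,), (elem.text or "").strip()
    for child in children:
        yield from flatten(child, path + (elem.tag,))


def xml_to_cells(xml_text, id_tag="X2"):
    """Return (row_key, [(family, qualifier, value), ...]) for one record."""
    root = ET.fromstring(xml_text)
    row_key = root.findtext(id_tag).strip()
    cells = []
    for child in root:
        if child.tag == id_tag:
            continue  # the ID element becomes the row key, not a cell
        for path, value in flatten(child):
            family = path[0]
            # Assumption: nested tags are joined with ":"; a top-level
            # leaf gets an empty qualifier.
            qualifier = ":".join(path[1:])
            cells.append((family, qualifier, value))
    return row_key, cells


row_key, cells = xml_to_cells(XML)
for family, qualifier, value in cells:
    print(row_key, family, qualifier, value)
```

Each resulting tuple corresponds to one p.add(family, qualifier, value) call on a Put created with the row key.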

How to map Data Flow parameters to Sink SQL Table

I need to store/map one or more data flow parameters to my Sink (Azure SQL Table).
I can fetch other data from a REST API and am able to map it to my Sink columns (see below). I also need to generate some UUIDs as key fields and add these to the same table.
I would like my EmployeeId column to contain my Data Flow input parameter, e.g. one named param_test. In addition to this, I need to insert UUIDs into other columns which are not part of my REST input fields.
How do I accomplish that?
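You need to use a derived column transformation and edit its expression to include the parameter. Data flow parameters are referenced with a $ prefix, so the expression for EmployeeId would be $param_test.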
Adding to @Chen Hirsh's answer: use the same derived column transformation, placed after the REST API source, to generate UUID values for the other columns with the uuid() function.
The new columns will then be available in the sink mapping.

Cassandra Alter Column type from Timestamp to Date

Is there any way to alter a Cassandra column from timestamp to date without data loss? For example, '2021-02-25 20:30:00+0000' to '2021-02-25'.
If not, what is the easiest way to migrate this timestamp column to a new date column?
It's impossible to change the type of an existing column, so you need to add a new column with the correct data type and perform a migration. The migration could be done via Spark + Spark Cassandra Connector. This is probably the most flexible solution, and it can even be done on a single machine with Spark running in local master mode (the default). The code could look something like this (try it on test data first):
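import pyspark.sql.functions as F
# `spark` is an existing SparkSession (for example, the one provided by the pyspark shell)
options = {"table": "tbl", "keyspace": "ks"}
# read the primary key and the timestamp, cast it to date, and append it into the new column
spark.read.format("org.apache.spark.sql.cassandra").options(**options).load()\
.select("pk_col1", "pk_col2", F.col("timestamp_col").cast("date").alias("new_name"))\
.write.format("org.apache.spark.sql.cassandra").options(**options).mode("append").save()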
P.S. You could also use DSBulk, for example, but you need enough space to offload the data (although you only need the primary key columns plus your timestamp).
To add to Alex Ott's answer, there are validations in Cassandra that prevent changing the data type of a column. The reason is that SSTables (Cassandra data files) are immutable -- once they are written to disk, they are never modified/edited/updated. They can only be compacted to new SSTables.
Some try to get around it by dropping the column from the table and then adding it back with a new data type. Unlike in a traditional RDBMS, the existing data in the SSTables doesn't get updated, so if you tried to read the old data, you would get a CorruptSSTableException because the CQL type of the data on disk won't match that of the schema.
For this reason, it is no longer possible to drop/recreate columns with the same name (CASSANDRA-14948). If you're interested, I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/8018/. Cheers!
You can use toDate to convert it at query time. For example, table Email has a column date with values in the format 2001-08-29 13:03:35.000000+0000.
SELECT date, toDate(date) AS convert FROM keyspace.Email;
 date                            | convert
---------------------------------+------------
 2001-08-29 13:03:35.000000+0000 | 2001-08-29

How to use Azure Data Factory to migrate a table in a storage account when a column holds many types

I want to use Data Factory to migrate data in a storage account, but the data in the original table has mixed types; for example, a single column holds int, String, and DateTime values.
When I use Data Factory I need to specify the data type, so how can I define a dynamic type and copy the column? All migrated data is parsed to the String type, so how can I keep the value type of each column?
This is my data in the original table: [image not included]
Thanks for your help
In my experience, Data Factory cannot preserve the value types of the columns in the source table. You must specify the data type in the sink dataset.
If you don't set the sink data type, as you have found, the column type defaults to String.
One idea is to copy the data twice, each time copying a different entity column. The sink dataset supports 'Merge' and 'Replace'.
Hope this helps.
I'm not sure I understand the question, but here is my understanding: you want to copy a table, say sourceT1, to SinkT1. If that's the case, you can use the copy activity and then map the columns. Mapping the columns also sets the data type.

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and reading back from that file. However having to store data on an external file and reading back is messy and unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but as far as I understand, it will return the rows as 1 column which is not what I need. I also looked for some 3rd party component in TalendExchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as columns of the output. The solution to do the pivot has two parts: first, a tMap that transposes the value of each input-row in into the corresponding column in the output-row out and second, a tAggregate that groups the map's output-rows according to the brandname.
For the tMap you'd have to fill the columns conditionally like this (example for the output column named "abc"):
out.abc = "abc".equals(in.metric)?in.value:null
In the tAggregate you'd have to group by out.brandname and aggregate each column as sum ignoring nulls.
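To make the two-step logic concrete outside of Talend, here is a small Python sketch of the same idea, using the sample data from the question. It is only an illustration of the tMap and tAggregate steps, not Talend code; the function name pivot and the fixed METRICS list are assumptions made for the example.

```python
import csv
import io

DATA = """brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
"""

# Assumption: the set of metrics is fixed and known in advance.
METRICS = ["abc", "def", "ghi", "xyz"]


def pivot(rows, metrics):
    # Step 1 (the tMap): copy each value into the column named after
    # its metric and leave every other metric column as None.
    mapped = []
    for r in rows:
        out = {"brandname": r["brandname"]}
        for m in metrics:
            out[m] = int(r["value"]) if r["metric"] == m else None
        mapped.append(out)

    # Step 2 (the tAggregate): group by brandname and sum each metric
    # column, ignoring None so untouched columns stay None.
    result = {}
    for out in mapped:
        agg = result.setdefault(out["brandname"], {m: None for m in metrics})
        for m in metrics:
            if out[m] is not None:
                agg[m] = (agg[m] or 0) + out[m]
    return result


rows = list(csv.DictReader(io.StringIO(DATA)))
table = pivot(rows, METRICS)
for brand, cols in table.items():
    print(brand, [cols[m] for m in METRICS])
```

Because each brand has at most one value per metric in the sample, the sum simply carries that value through; with duplicates it would add them up, which matches a sum aggregation.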

How to list all row keys in an hbase table?

How do I list all row keys in an hbase table?
I need to do this using PHP with a REST interface.
If you are listing all of the keys in an HBase table, then you are using the wrong tool. HBase is for large data systems where it is impractical to list all of the keys.
What may be more sensible is to start at a given key and list the next N keys (for values of N less than 10K). There are nice Java interfaces for doing this type of thing with a scan -- setting a start key and/or an end key.
Most HBase functionality is exposed via the Thrift interface. I would suggest looking there.
I have found a way:
http://localhost:8080/tablename/* returns XML data, and I can preg_match it to get the rows.
Better suggestions are welcome.
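As an alternative to regular expressions, the XML can be parsed properly. The sketch below is in Python and works on a hand-written sample, assuming the response is a CellSet of Row elements whose key attribute is base64-encoded, which is how the HBase REST gateway represents row keys in XML. The sample document and the function name row_keys are made up for illustration; the same approach should carry over to PHP with SimpleXML and base64_decode.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical sample of a REST response; the keys decode to "row1" and "row2".
SAMPLE = """<CellSet>
  <Row key="cm93MQ=="><Cell column="Y2Y6YQ==" timestamp="1">djE=</Cell></Row>
  <Row key="cm93Mg=="><Cell column="Y2Y6YQ==" timestamp="2">djI=</Cell></Row>
</CellSet>"""


def row_keys(xml_text):
    """Return the decoded row keys found in a CellSet XML document."""
    root = ET.fromstring(xml_text)
    return [
        base64.b64decode(row.get("key")).decode("utf-8")
        for row in root.iter("Row")
    ]


keys = row_keys(SAMPLE)
print(keys)
```

Decoding as UTF-8 assumes the row keys are text; binary keys would need to be kept as bytes.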
This...
http://localhost:8080/tablename/*/columnfamily:columnid
...will return all values in your table relative to that column in that table, sort of like applying column filter in the scanner.
Also, if you're looking for multiple columns - separate them with a comma.
So: /tablename/*/columnfamily:columnid,columnfamily:columnid2
I don't know what the REST interface is like, but you probably want to filter data out before it reaches the client to avoid large RPC responses. You can do this by adding server-side filters to your scan:
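import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
Scan s = new Scan();
FilterList fl = new FilterList();
// return only the first cell of each row, then skip to the next row
fl.addFilter(new FirstKeyOnlyFilter());
// return only the key, not the value
fl.addFilter(new KeyOnlyFilter());
s.setFilter(fl);
// myTable is an HTable that has already been opened for your table
ResultScanner rs = myTable.getScanner(s);
Result row = rs.next();
while (row != null) {
    byte[] rowKey = row.getRow(); // process the row key here
    row = rs.next();
}
rs.close();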
http://svn.apache.org/repos/asf/hbase/branches/0.90/src/main/java/org/apache/hadoop/hbase/filter/