Loading JSON data to HBase using PySpark

I want to load data into an HBase table using PySpark.
Can someone help with how to load the JSON data into HBase, with ticid as the row key and all other fields going into one column family?
Please find the JSON below.
{
"ticid": "1496",
"ticlocation": "vizag",
"custnum": "222",
"Comments": {
"comment": [{
"commentno": "1",
"desc": "journey",
"passengerseat": {
"intele": "09"
},
"passengerloc": {
"intele": "s15"
}
}, {
"commentno": "5",
"desc": " food",
"passengerseat": {
"intele": "09"
},
"passengerloc": {
"intele": "s15"
}
}, {
"commentno": "12",
"desc": " service",
"passengerseat": {
"intele": "09"
},
"passengerloc": {
"intele": "s15"
}
}]
},
"Rails": {
"Rail": [{
"Traino": "AP1545",
"startcity": "vizag",
"passengerseat": "5"
}, {
"Traino": "AP1555",
"startcity": "HYD",
"passengerseat": "15A"
}]
}
}

I assume that you don't have a single row to load but thousands or millions of rows? I would recommend converting your JSON data to TSV (tab-separated), which is quite easy in Python, and then using the importtsv feature of HBase.
See also
Import TSV file into hbase table
Spark is not a good pattern for HBase bulk loading.
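For illustration, here is a minimal Python sketch of such a JSON-to-TSV conversion. It assumes one ticket document per input line and uses made-up file names (tickets.json, tickets.tsv); which fields you emit, and whether you keep the nested parts as JSON strings or flatten them further, is up to you:
import json

# Turn one ticket record into a single TSV line:
# the row key (ticid) first, then the other top-level fields.
# Nested structures are kept as JSON strings here for simplicity.
def to_tsv_line(record):
    fields = [
        record["ticid"],
        record["ticlocation"],
        record["custnum"],
        json.dumps(record["Comments"]),
        json.dumps(record["Rails"]),
    ]
    return "\t".join(fields)

with open("tickets.json") as src, open("tickets.tsv", "w") as dst:
    for line in src:  # assumes one JSON document per line
        dst.write(to_tsv_line(json.loads(line)) + "\n")
The resulting TSV can then be loaded with HBase's ImportTsv tool, for example hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:ticlocation,cf:custnum,cf:comments,cf:rails <table> <hdfs path>, as described in the linked question.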

Related

Pyspark: Best way to set json strings in dataframe column

I need to create a couple of columns in a DataFrame where I want to build and store a JSON string. Here is one JSON that I need to store in one column; the other JSONs are similar. Can you please help with how to construct and store this JSON string in the column? The values section needs to be filled from the values of other columns within the same DataFrame.
{
"name": "",
"headers": [
{
"name": "A",
"dataType": "number"
},
{
"name": "B",
"dataType": "string"
},
{
"name": "C",
"dataType": "string"
}
],
"values": [
[
2,
"some value",
"some value"
]
]
}
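For what it's worth, one way to approach this in PySpark (a sketch only; the column names a, b, c and the output column payload are made up here) is to assemble the structure with struct/array/lit and serialize it with to_json. Note that Spark arrays must be homogeneous, so the inner values array comes out as an array of strings rather than mixing a number with strings:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2, "some value", "some value")], ["a", "b", "c"])

# The static "headers" part of the JSON.
headers = F.array(
    F.struct(F.lit("A").alias("name"), F.lit("number").alias("dataType")),
    F.struct(F.lit("B").alias("name"), F.lit("string").alias("dataType")),
    F.struct(F.lit("C").alias("name"), F.lit("string").alias("dataType")),
)

# The "values" part is filled from other columns of the same row.
values = F.array(F.array(F.col("a").cast("string"), F.col("b"), F.col("c")))

df = df.withColumn(
    "payload",
    F.to_json(F.struct(
        F.lit("").alias("name"),
        headers.alias("headers"),
        values.alias("values"),
    )),
)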

How to search mongodb collection map JSON

I have the JSON below in MongoDB and would like to write a bson.M filter to get a specific document from the collection.
JSONs in collection:
{
"Id": "3fa85f64",
"Type": "DDD",
"Status": "PRESENT",
"List": [{
"dd": "55",
"cc": "33"
}],
"SeList": {
"comm_1": {
"seId": "comm_1",
"serName": "nmf-comm"
},
"comm_2": {
"seId": "comm_2",
"serName": "aut-comm"
}
}
}
{
"Id": "3fa8556",
"Type": "CCC",
"Status": "PRESENT",
"List": [{
"dd": "22",
"cc": "34"
}],
"SeList": {
"dnn_1": {
"seId": "dnn_1",
"serName": "dnf-comm"
},
"dnn_2": {
"seId": "dnn_2",
"serName": "dn2-comm"
}
}
}
I have written the bson.M filter below to select the first document, but it did not work because I do not know how to handle the map keys in "SeList.serName". The keys comm_1, comm_2, dnn_1, etc. could be any string.
filter := bson.M{"Type": "DDD", "Status": "PRESENT", "SeList.serName": "nmf-comm"} // does not work because "SeList.serName" is not the correct path
I need help with how to select any document based on the example filter above.
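One way around the unknown map keys, sketched below as a plain MongoDB query document (which can be expressed with bson.M/bson.D in Go), is to compare against the values of SeList using $objectToArray inside $expr; this assumes MongoDB 3.6 or later:
{
  "Type": "DDD",
  "Status": "PRESENT",
  "$expr": {
    "$in": [
      "nmf-comm",
      {
        "$map": {
          "input": { "$objectToArray": "$SeList" },
          "as": "s",
          "in": "$$s.v.serName"
        }
      }
    ]
  }
}
Here $objectToArray turns SeList into an array of { k, v } pairs, $map extracts each serName, and $in checks whether the wanted value is among them, so the comm_1/dnn_1 key names never need to be known.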

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest in Druid
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}
},
"dimensionSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
This, however, leads to campaigns being ingested as a single string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your flattenSpec. When this flag is set to true (the default), Druid interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link for using the flattenSpec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
It looks like, since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using the expression string_to_array should do the trick!
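For example, a transformSpec along these lines (a sketch only; it goes inside dataSchema, and the exact delimiter handling of string_to_array should be checked against the Druid docs for your version) would turn the comma-separated string into a multi-value dimension:
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "campaigns",
      "expression": "string_to_array(\"campaigns\", ', ')"
    }
  ]
}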

Parsing Really Messy Nested JSON Strings

I have a series of deeply nested JSON strings in a PySpark DataFrame column. I need to explode and filter based on the contents of these strings and would like to add them as columns. I've tried defining the StructType schema, but each time it keeps returning an empty DataFrame.
I tried using json_tuple to parse it, but there are no common keys to rejoin the DataFrames and the row counts don't match up. I think it might have to do with some null fields.
The sub-fields can be nullable.
Sample JSON
{
"TIME": "datatime",
"SID": "yjhrtr",
"ID": {
"Source": "Person",
"AuthIFO": {
"Prov": "Abc",
"IOI": "123",
"DETAILS": {
"Id": "12345",
"SId": "ABCDE"
}
}
},
"Content": {
"User1": "AB878A",
"UserInfo": "False",
"D": "ghgf64G",
"T": "yjuyjtyfrZ6",
"Tname": "WE ARE THE WORLD",
"ST": null,
"TID": "BPV 1431: 1",
"src": "test",
"OT": "test2",
"OA": "test3",
"OP": "test34
},
"Test": false
}
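A frequent cause of the empty or all-null result is a schema that does not line up with the data, since from_json returns null for any record it cannot parse into the given schema. A minimal sketch, assuming the JSON strings live in a column called raw of an existing DataFrame df (both names made up) and declaring only the fields actually needed:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

# Declare only the fields you need; extra fields in the JSON are ignored
# and missing/null fields simply come back as null.
schema = StructType([
    StructField("TIME", StringType(), True),
    StructField("SID", StringType(), True),
    StructField("Content", StructType([
        StructField("Tname", StringType(), True),
        StructField("TID", StringType(), True),
    ]), True),
    StructField("Test", BooleanType(), True),
])

parsed = (df
          .withColumn("j", F.from_json(F.col("raw"), schema))
          .select("j.TIME", "j.SID", "j.Content.Tname", "j.Content.TID", "j.Test"))
On recent Spark versions, F.schema_of_json applied to one well-formed sample string can also help derive the full schema instead of writing it by hand.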

How to read the nested avro fields for creating streams?

I have the following Avro message in a Kafka topic.
{
"table": {
"string": "Schema.xDEAL"
},
"op_type": {
"string": "Insert"
},
"op_ts": {
"string": "2018-03-16 09:03:25.000462"
},
"current_ts": {
"string": "2018-03-16 10:03:37.778000"
},
"pos": {
"string": "00000000000000010722"
},
"before": null,
"after": {
"row": {
"DEA_PID_DEAL": {
"string": "AAAAAAAA"
},
"DEA_NME_DEAL": {
"string": "MY OGG DEAL"
},
"DEA_NME_ALIAS_NAME": {
"string": "MY OGG DEAL"
},
"DEA_NUM_DEAL_CNTL": {
"string": "4swb6zs4"
}
}
}
}
When I run the following query, it creates the stream with null values.
CREATE STREAM tls_deal (DEA_PID_DEAL VARCHAR, DEA_NME_DEAL varchar, DEA_NME_ALIAS_NAME VARCHAR, DEA_NUM_DEAL_CNTL VARCHAR) WITH (kafka_topic='deal-ogg-topic',value_format='AVRO', key = 'DEA_PID_DEAL');
But when I change the Avro message to the following, it works.
{
"table": {
"string": "Schema.xDEAL"
},
"op_type": {
"string": "Insert"
},
"op_ts": {
"string": "2018-03-16 09:03:25.000462"
},
"current_ts": {
"string": "2018-03-16 10:03:37.778000"
},
"pos": {
"string": "00000000000000010722"
},
"DEA_PID_DEAL": {
"string": "AAAAAAAA"
},
"DEA_NME_DEAL": {
"string": "MY OGG DEAL"
},
"DEA_NME_ALIAS_NAME": {
"string": "MY OGG DEAL"
},
"DEA_NUM_DEAL_CNTL": {
"string": "4swb6zs4"
}
}
Now, if I run the above query, the data is populated.
My question is: if I need to populate the stream from a nested field, how can I handle this?
I am not able to find a solution in the KSQL documentation.
Thanks in advance. I appreciate the help. :)
As Robin states, this is not currently supported (22 Mar 2018 / v0.5). However, it is a tracked feature request. You may want to up-vote or track this GitHub issue in the KSQL repo:
https://github.com/confluentinc/ksql/issues/638
KSQL does not currently (22 Mar 2018 / v0.5) support nested Avro. You can use a Single Message Transform (SMT) to flatten the data as it comes through Kafka Connect. For example, Debezium ships with UnwrapFromEnvelope.
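For reference, enabling such an SMT in a source connector's configuration looks roughly like the snippet below (a sketch only; the transform class shown is the Debezium one mentioned above, renamed ExtractNewRecordState in later Debezium releases, and a GoldenGate-based pipeline would need its own equivalent):
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.UnwrapFromEnvelope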