KSQL streams - Get data from Array of Struct - apache-kafka

My JSON looks like:
{
  "Obj1": {
    "a": "abc",
    "b": "def",
    "c": "ghi"
  },
  "ArrayObj": [
    {
      "key1": "1",
      "Key2": "2",
      "Key3": "3",
    },
    {
      "key1": "4",
      "Key2": "5",
      "Key3": "6",
    },
    {
      "key1": "7",
      "Key2": "8",
      "Key3": "9",
    }
  ]
}
I have written KSQL streams to convert it to Avro and save it to a topic, so that I can push it to a JDBC sink connector:
CREATE STREAM Example1 (ArrayObj ARRAY<STRUCT<key1 VARCHAR, Key2 VARCHAR>>, Obj1 STRUCT<a VARCHAR>)
  WITH (kafka_topic='sample_topic', value_format='JSON');
CREATE STREAM Example_Avro WITH (VALUE_FORMAT='avro') AS
  SELECT e.ArrayObj[0] FROM Example1 e;
In Example_Avro, I can only get the first object in the array.
How can I get the data shown below when I run select * from Example_Avro in KSQL?
a    b    key1  key2  key3
abc  def  1     2     3
abc  def  4     5     6
abc  def  7     8     9

Test data (I removed the invalid trailing commas after the Key3 values):
ksql> PRINT test4;
Format:JSON
1/9/20 7:45:18 PM UTC , NULL , { "Obj1": { "a": "abc", "b": "def", "c": "ghi" }, "ArrayObj": [ { "key1": "1", "Key2": "2", "Key3": "3" }, { "key1": "4", "Key2": "5", "Key3": "6" }, { "key1": "7", "Key2": "8", "Key3": "9" } ] }
Query:
SELECT OBJ1->A AS A,
OBJ1->B AS B,
EXPLODE(ARRAYOBJ)->KEY1 AS KEY1,
EXPLODE(ARRAYOBJ)->KEY2 AS KEY2,
EXPLODE(ARRAYOBJ)->KEY3 AS KEY3
FROM TEST4
EMIT CHANGES;
Result:
+-------+-------+------+-------+-------+
|A |B |KEY1 |KEY2 |KEY3 |
+-------+-------+------+-------+-------+
|abc |def |1 |2 |3 |
|abc |def |4 |5 |6 |
|abc |def |7 |8 |9 |
Tested on ksqlDB 0.6, in which the EXPLODE function was added.
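To land this in an Avro topic for the JDBC sink connector, the same projection can be wrapped in a persistent CREATE STREAM ... AS SELECT. A minimal sketch, where the target stream and topic names are assumptions:
-- a sketch only: stream/topic names are assumptions; EXPLODE needs ksqlDB >= 0.6
CREATE STREAM TEST4_EXPLODED
  WITH (KAFKA_TOPIC='test4_exploded', VALUE_FORMAT='AVRO') AS
  SELECT OBJ1->A AS A,
         OBJ1->B AS B,
         EXPLODE(ARRAYOBJ)->KEY1 AS KEY1,
         EXPLODE(ARRAYOBJ)->KEY2 AS KEY2,
         EXPLODE(ARRAYOBJ)->KEY3 AS KEY3
  FROM TEST4
  EMIT CHANGES;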


Row number with lag condition on multiple columns

I would like to create a row number that is partitioned by ACCOUNT, NAME and TYPE. I tried dense_rank and row_number; however, I need all initial records that contain changes in any of those columns:
df = spark.createDataFrame(
    [
        ('20190910', 'A1', 'Linda', 'b2c'),
        ('20190911', 'A1', 'Tom', 'consultant'),
        ('20190912', 'A1', 'John', 'b2c'),
        ('20190913', 'A1', 'Tom', 'consultant'),
        ('20190914', 'A1', 'Tom', 'consultant'),
        ('20190915', 'A1', 'Linda', 'consultant'),
        ('20190916', 'A1', 'Linda', 'b2c'),
        ('20190917', 'B1', 'John', 'b2c'),
        ('20190916', 'B1', 'John', 'consultant'),
        ('20190910', 'B1', 'Linda', 'b2c'),
        ('20190911', 'B1', 'John', 'b2c'),
        ('20190915', 'C1', 'John', 'consultant'),
        ('20190916', 'C1', 'Linda', 'consultant'),
        ('20190917', 'C1', 'John', 'b2c'),
        ('20190916', 'C1', 'RJohn', 'consultant'),
        ('20190910', 'C1', 'Tom', 'b2c'),
        ('20190911', 'C1', 'John', 'b2c'),
    ],
    ['Event_date', 'account', 'name', 'type']
)
Expected outcome:
+----------+-------+-----+----------+----------+
|Event_date|account|name |type      |row_number|
+----------+-------+-----+----------+----------+
|20190910  |A1     |Linda|b2c       |1         |
|20190911  |A1     |Tom  |consultant|1         |
|20190912  |A1     |John |b2c       |1         |
|20190913  |A1     |Tom  |consultant|2         |
|20190914  |A1     |Tom  |consultant|3         |
|20190915  |A1     |Linda|consultant|1         |
|20190916  |A1     |Linda|b2c       |2         |
|20190917  |B1     |John |b2c       |1         |
|20190916  |B1     |John |consultant|1         |
|20190910  |B1     |Linda|b2c       |2         |
|20190911  |B1     |John |b2c       |3         |
|20190915  |C1     |John |consultant|1         |
|20190916  |C1     |Linda|consultant|1         |
|20190917  |C1     |John |b2c       |1         |
|20190916  |C1     |John |consultant|2         |
|20190910  |C1     |Tom  |b2c       |1         |
|20190911  |C1     |John |b2c       |2         |
+----------+-------+-----+----------+----------+
You could create a Window and partition it by account, name, type and then row_number over it.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("20190910", "A1", "Linda", "b2c"),
        ("20190911", "A1", "Tom", "consultant"),
        ("20190912", "A1", "John", "b2c"),
        ("20190913", "A1", "Tom", "consultant"),
        ("20190914", "A1", "Tom", "consultant"),
        ("20190915", "A1", "Linda", "consultant"),
        ("20190916", "A1", "Linda", "b2c"),
        ("20190917", "B1", "John", "b2c"),
        ("20190916", "B1", "John", "consultant"),
        ("20190910", "B1", "Linda", "b2c"),
        ("20190911", "B1", "John", "b2c"),
        ("20190915", "C1", "John", "consultant"),
        ("20190916", "C1", "Linda", "consultant"),
        ("20190917", "C1", "John", "b2c"),
        ("20190916", "C1", "RJohn", "consultant"),
        ("20190910", "C1", "Tom", "b2c"),
        ("20190911", "C1", "John", "b2c"),
    ],
    ["Event_date", "account", "name", "type"],
)
w = Window.partitionBy("account", "name", "type").orderBy("Event_date")
df = df.withColumn("row_number", F.row_number().over(w)).orderBy("Event_date")
Result:
+----------+-------+-----+----------+----------+
|Event_date|account|name |type |row_number|
+----------+-------+-----+----------+----------+
|20190912 |A1 |John |b2c |1 |
|20190911 |A1 |Tom |consultant|1 |
|20190913 |A1 |Tom |consultant|2 |
|20190914 |A1 |Tom |consultant|3 |
|20190915 |A1 |Linda|consultant|1 |
|20190910 |A1 |Linda|b2c |1 |
|20190916 |A1 |Linda|b2c |2 |
|20190911 |B1 |John |b2c |1 |
|20190916 |B1 |John |consultant|1 |
|20190917 |B1 |John |b2c |2 |
|20190910 |B1 |Linda|b2c |1 |
|20190910 |C1 |Tom |b2c |1 |
|20190915 |C1 |John |consultant|1 |
|20190916 |C1 |RJohn|consultant|1 |
|20190911 |C1 |John |b2c |1 |
|20190916 |C1 |Linda|consultant|1 |
|20190917 |C1 |John |b2c |2 |
+----------+-------+-----+----------+----------+
It's not exactly the same as your expected outcome, since it's ordered by Event_date and account.
Your expected output doesn't seem to be consistent; please check the numbers again, especially for B1, and note the RJohn entry in the input data, which looks like a typo.
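If the intent was instead to restart the counter whenever name or type changes within an account (one reading of the "lag condition" in the title), a sketch of the usual gaps-and-islands approach would be as follows, assuming Event_date gives a total order within each account:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("account").orderBy("Event_date")

# flag rows where name or type differs from the previous row in the account;
# the first row of each account has a NULL lag, so coalesce its flag to 1
df = df.withColumn(
    "changed",
    F.coalesce(
        ((F.lag("name").over(w) != F.col("name"))
         | (F.lag("type").over(w) != F.col("type"))).cast("int"),
        F.lit(1),
    ),
)

# a running sum of the flags labels each island of identical (name, type)
df = df.withColumn("grp", F.sum("changed").over(w))

w2 = Window.partitionBy("account", "grp").orderBy("Event_date")
df = df.withColumn("row_number", F.row_number().over(w2)).drop("changed", "grp")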
You can partition by account, name and type, and then order the final output by account followed by Event_date.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("20190910", "A1", "Linda", "b2c"),
        ("20190911", "A1", "Tom", "consultant"),
        ("20190912", "A1", "John", "b2c"),
        ("20190913", "A1", "Tom", "consultant"),
        ("20190914", "A1", "Tom", "consultant"),
        ("20190915", "A1", "Linda", "consultant"),
        ("20190916", "A1", "Linda", "b2c"),
        ("20190917", "B1", "John", "b2c"),
        ("20190916", "B1", "John", "consultant"),
        ("20190910", "B1", "Linda", "b2c"),
        ("20190911", "B1", "John", "b2c"),
        ("20190915", "C1", "John", "consultant"),
        ("20190916", "C1", "Linda", "consultant"),
        ("20190917", "C1", "John", "b2c"),
        ("20190916", "C1", "John", "consultant"),
        ("20190910", "C1", "Tom", "b2c"),
        ("20190911", "C1", "John", "b2c"),
    ],
    ["Event_date", "account", "name", "type"],
)
w = Window.partitionBy("account", "name", "type").orderBy("Event_date")
df = df.withColumn("row_number", row_number().over(w)).orderBy("account", "Event_date")
This restarts the row number for each (account, name, type) group, with the final output ordered by account and then Event_date.

Create new Dataframe from Json element inside XML using Pyspark

Hi, I'm dealing with a rather difficult XML file which I'm trying to reformat and clean for some processing. I've been using PySpark to process the data into a dataframe, and I am using com.databricks.spark.xml to read the file.
My dataframe looks like this; each field is JSON-formatted:
+----------------+---------------------------------+
|      Identifier|                             Info|
+----------------+---------------------------------+
|          {JSON}|                           {JSON}|
+----------------+---------------------------------+
This is a sample value from the Identifier column
{
  "Other": [
    {
      "_Type": "A",
      "_VALUE": "999"
    },
    {
      "_Type": "B",
      "_VALUE": "31086"
    },
    {
      "_Type": "C",
      "_VALUE": "13123"
    },
    {
      "_Type": "D",
      "_VALUE": "32323"
    },
    {
      "_Type": "E",
      "_VALUE": "2223"
    },
    {
      "_Type": "F",
      "_VALUE": "100"
    }
  ]
}
And this is what the Info column looks like:
{
  "Demo": {
    "BirthDate": "2009-09-13",
    "BirthPlace": {
      "_VALUE": null,
      "_nil": true
    },
    "Rel": {
      "_VALUE": null,
      "_nil": true
    }
  },
  "EmailList": {
    "_VALUE": null,
    "_nil": true
  },
  "Name": {
    "LastName": "Marwan",
    "FullName": {
      "_VALUE": null,
      "_nil": true
    },
    "GivenName": "Saad",
    "MiddleName": null,
    "PreferredFamilyName": {
      "_VALUE": null,
      "_nil": true
    }
  },
  "OtherNames": {
    "_VALUE": null,
    "_nil": true
  }
}
I am trying to create a dataframe that looks like the following
+-------+--------+-----------+------------+------------+
| F| E| LastName| GivenName | BirthDate|
+-------+--------+-----------+------------+------------+
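One possible approach (a sketch only, not a tested answer; the schemas below are assumptions inferred from the sample documents above) is to parse both columns with from_json, explode the Other array, and pivot _Type into columns:
from pyspark.sql import functions as F, types as T

# assumed schemas, covering only the fields needed for the target dataframe
identifier_schema = T.StructType([
    T.StructField("Other", T.ArrayType(T.StructType([
        T.StructField("_Type", T.StringType()),
        T.StructField("_VALUE", T.StringType()),
    ])))
])
info_schema = T.StructType([
    T.StructField("Demo", T.StructType([
        T.StructField("BirthDate", T.StringType()),
    ])),
    T.StructField("Name", T.StructType([
        T.StructField("LastName", T.StringType()),
        T.StructField("GivenName", T.StringType()),
    ])),
])

parsed = (df
    .withColumn("ident", F.from_json("Identifier", identifier_schema))
    .withColumn("info", F.from_json("Info", info_schema)))

# one row per element of Other, with the interesting fields flattened out
flat = parsed.select(
    F.col("info.Name.LastName").alias("LastName"),
    F.col("info.Name.GivenName").alias("GivenName"),
    F.col("info.Demo.BirthDate").alias("BirthDate"),
    F.explode("ident.Other").alias("o"),
).select(
    "LastName", "GivenName", "BirthDate",
    F.col("o._Type").alias("type"),
    F.col("o._VALUE").alias("value"),
)

# pivot the wanted _Type values ("F", "E") into columns
result = (flat
    .groupBy("LastName", "GivenName", "BirthDate")
    .pivot("type", ["F", "E"])
    .agg(F.first("value"))
    .select("F", "E", "LastName", "GivenName", "BirthDate"))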

PostgreSQL: Paginate JSONB data

This may be a long shot, but is there any way to limit (paginate) JSONB data within a query?
We are investigating the differences between MongoDB and PostgreSQL JSONB, and this may be a critical factor.
I have used both MongoDB and PostgreSQL (using JSONB) and IMO, PostgreSQL wins 90% of the time.
This is because most data in real-life is inherently relational and PostgreSQL gives you the best of both worlds. It's a powerful relational database but also has the flexibility of JSONB when required (e.g. JSON can be perfect for unstructured data).
Its disadvantages show at humongous (cough) scale - MongoDB can win then, e.g. when there are huge amounts of raw JSON data (or data which can easily be converted to JSON) with no/limited relations.
The power of PostgreSQL JSONB is best illustrated with an example: -
Let's create a table (t) as follows: -
create table t (
id serial primary key,
data jsonb);
... with some demo data ...
insert into t (id, data)
values (1, '[{"name": "A", "age": 20},
{"name": "B", "age": 21},
{"name": "C", "age": 22},
{"name": "D", "age": 23},
{"name": "E", "age": 24},
{"name": "F", "age": 25},
{"name": "G", "age": 26}]'),
(2, '[{"name": "H", "age": 27},
{"name": "I", "age": 28},
{"name": "J", "age": 29},
{"name": "K", "age": 30},
{"name": "L", "age": 31}]'),
(3, '[{"name": "M", "age": 32},
{"name": "N", "age": 33},
{"name": "O", "age": 34},
{"name": "P", "age": 35},
{"name": "Q", "age": 36}]');
1. Simple select
If we simply select all from t we get 3 rows with a JSONB array in the data column.
select *
from t;
----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id | data
----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | [{"age": 20, "name": "A"}, {"age": 21, "name": "B"}, {"age": 22, "name": "C"}, {"age": 23, "name": "D"}, {"age": 24, "name": "E"}, {"age": 25, "name": "F"}, {"age": 26, "name": "G"}]
2 | [{"age": 27, "name": "H"}, {"age": 28, "name": "I"}, {"age": 29, "name": "J"}, {"age": 30, "name": "K"}, {"age": 31, "name": "L"}]
3 | [{"age": 32, "name": "M"}, {"age": 33, "name": "N"}, {"age": 34, "name": "O"}, {"age": 35, "name": "P"}, {"age": 36, "name": "Q"}]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2. Select + Un-nest
We can then "un-nest" the JSONB array in the data column by using the jsonb_array_elements function - this will return 17 rows of id and data JSONB objects.
select id,
jsonb_array_elements(data)
from t;
----+--------------------------
id | data
----+--------------------------
1 | {"age": 20, "name": "A"}
1 | {"age": 21, "name": "B"}
1 | {"age": 22, "name": "C"}
1 | {"age": 23, "name": "D"}
1 | {"age": 24, "name": "E"}
1 | {"age": 25, "name": "F"}
1 | {"age": 26, "name": "G"}
2 | {"age": 27, "name": "H"}
2 | {"age": 28, "name": "I"}
2 | {"age": 29, "name": "J"}
2 | {"age": 30, "name": "K"}
2 | {"age": 31, "name": "L"}
3 | {"age": 32, "name": "M"}
3 | {"age": 33, "name": "N"}
3 | {"age": 34, "name": "O"}
3 | {"age": 35, "name": "P"}
3 | {"age": 36, "name": "Q"}
-------------------------------
3. Select + Un-nest + Pagination
We can then paginate the previous "un-nested" query above: -
select id,
jsonb_array_elements(data)
from t
limit 5; -- return 1st 5
----+---------------------------
id | data
----+---------------------------
1 | {"age": 20, "name": "A"}
1 | {"age": 21, "name": "B"}
1 | {"age": 22, "name": "C"}
1 | {"age": 23, "name": "D"}
1 | {"age": 24, "name": "E"}
--------------------------------
select id,
jsonb_array_elements(data)
from t
limit 5 offset 5; -- return next 5
----+---------------------------
id | data
----+---------------------------
1 | {"age": 25, "name": "F"}
1 | {"age": 26, "name": "G"}
2 | {"age": 27, "name": "H"}
2 | {"age": 28, "name": "I"}
2 | {"age": 29, "name": "J"}
--------------------------------
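One caveat: without an ORDER BY, the order of rows under LIMIT/OFFSET is not guaranteed to be stable across queries. A sketch that pins the order down using WITH ORDINALITY (available since PostgreSQL 9.4):
select id,
       elem as data
from t,
     jsonb_array_elements(data) with ordinality as a(elem, pos)
order by id, pos
limit 5 offset 5; -- a deterministic "next 5"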
4. Select + Un-nest + Pagination + Re-nest
We can take this one step further and group by id again, putting the JSON back into an array using the jsonb_agg function: -
with t_unnested as (
select id,
jsonb_array_elements(data) as data
from t
limit 5 offset 5
)
select id, jsonb_agg (data)
from t_unnested
group by id;
----+--------------------------------------------------------------------------------
id | data
----+--------------------------------------------------------------------------------
1 | [{"age": 25, "name": "F"}, {"age": 26, "name": "G"}]
2 | [{"age": 27, "name": "H"}, {"age": 28, "name": "I"}, {"age": 29, "name": "J"}]
----+--------------------------------------------------------------------------------
5. Select + Un-nest + Pagination + Re-nest + Custom Object
We can take the previous query and re-construct a new object with new fields, e.g. person_id and person_info. This will return a single column with a new custom JSONB object (again a row per id).
with t_unnested as (
select id,
jsonb_array_elements(data) as data
from t
limit 5 offset 5
),
t_person as (
select jsonb_build_object (
'person_id', id,
'person_info', jsonb_agg (data)
) as person
from t_unnested
group by id
)
select person from t_person;
-----------------------------------------------------------------------------------------------------------------
person
-----------------------------------------------------------------------------------------------------------------
{"person_id": 1, "person_info": [{"age": 25, "name": "F"}, {"age": 26, "name": "G"}]}
{"person_id": 2, "person_info": [{"age": 27, "name": "H"}, {"age": 28, "name": "I"}, {"age": 29, "name": "J"}]}
-----------------------------------------------------------------------------------------------------------------
6. Select + Un-nest + Pagination + Re-nest + Custom Object + Further Re-Nest
The previous query returned 2 rows; we can create a single row by once again using the jsonb_agg function, i.e.
with t_unnested as (
select id,
jsonb_array_elements(data) as data
from t
limit 5 offset 5
),
t_person as (
select jsonb_build_object (
'person_id', id,
'person_info', jsonb_agg (data)
) as person
from t_unnested
group by id
)
select jsonb_agg(person) from t_person;
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
person
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[{"person_id": 1, "person_info": [{"age": 25, "name": "F"}, {"age": 26, "name": "G"}]}, {"person_id": 2, "person_info": [{"age": 27, "name": "H"}, {"age": 28, "name": "I"}, {"age": 29, "name": "J"}]}]
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hopefully this shows the power of PostgreSQL JSONB: storing JSONB, and un-nesting (and re-nesting) JSON arrays with pagination.

How can I achieve mongo "unwind" in postgres JSONB? (Flatten nested arrays)

I recently looked into migrating our product database from MongoDB to PostgreSQL. Coming from MongoDB, I am used to "unwinding" objects and arrays.
Suppose you have the following object:
{
  "styleGroupId": "2",
  "brand": "MOP",
  "colorVariants": [
    {
      "color": "red",
      "colorCode": "222",
      "sizeVariants": [
        {"gtin": "444", "size": "M"},
        {"gtin": "555", "size": "L"}
      ]
    },
    {
      "color": "blue",
      "colorCode": "111",
      "sizeVariants": [
        {"gtin": "66", "size": "M"},
        {"gtin": "77", "size": "L"}
      ]
    }
  ]
}
If you want to flatten it, in mongo you use the following:
db.test.aggregate([
  { $unwind: "$colorVariants" },
  { $unwind: "$colorVariants.sizeVariants" }
])
which will result in objects like this:
{
  "_id" : ObjectId("5a7dc59dafc86d25964b873c"),
  "styleGroupId" : "2",
  "brand" : "MOP",
  "colorVariants" : {
    "color" : "red",
    "colorCode" : "222",
    "sizeVariants" : {
      "gtin" : "444",
      "size" : "M"
    }
  }
}
I have spent hours searching for "mongo unwind in postgres" but could hardly find a satisfying answer. Also a lot of the resources on querying JSONB data in postgres barely touch nested arrays. Hopefully this post will help other poor souls searching for a migration from mongoDb to postgres.
The function:
create or replace function jsonb_unwind(target jsonb, path text[])
returns jsonb language plpgsql as $$
begin
  -- if the value at `path` is an array, replace it with its first element;
  -- otherwise return the input unchanged
  if jsonb_typeof(target #> path) = 'array' then
    return jsonb_set(target, path, target #> path -> 0);
  else
    return target;
  end if;
end $$;
Example usage:
select
  jsonb_unwind(
    jsonb_unwind(json_data, '{colorVariants}'),
    '{colorVariants, sizeVariants}')
from my_table;
Test it in rextester.
The answer can be implicitly found in this post: How to query nested arrays in a postgres json column?
Edit
Paired with this answer (use - to remove a key from a JSONB value), there is an easier way to get something closer to mongo unwind, starting from 9.6:
You need to state each array level in the FROM clause of your query
exclude each nested object from your output so it doesn't appear multiple times
optionally concatenate object and subobject
when concatenating objects with shared keys, only the key of the object right of the || operator will be preserved
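The shared-key rule in one line:
select '{"a": 1, "b": 2}'::jsonb || '{"b": 20}'::jsonb;
-- {"a": 1, "b": 20}   (the right-hand "b" wins)
Applying this to the sample data: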
SELECT
  data::jsonb - 'colorVariants' ||
  (colorVariants - 'sizeVariants') ||
  sizeVariants
FROM
  test,
  jsonb_array_elements(data -> 'colorVariants') colorVariants,
  jsonb_array_elements(colorVariants -> 'sizeVariants') sizeVariants;
the result then is
?column?
-------------------------------------------------------------------------------------------------------
{"gtin": "11", "size": "M", "brand": "MOP", "color": "red", "colorCode": "222", "styleGroupId": "1"}
{"gtin": "22", "size": "L", "brand": "MOP", "color": "red", "colorCode": "222", "styleGroupId": "1"}
{"gtin": "33", "size": "M", "brand": "MOP", "color": "blue", "colorCode": "111", "styleGroupId": "1"}
{"gtin": "44", "size": "L", "brand": "MOP", "color": "blue", "colorCode": "111", "styleGroupId": "1"}
{"gtin": "444", "size": "M", "brand": "MOP", "color": "red", "colorCode": "222", "styleGroupId": "2"}
{"gtin": "555", "size": "L", "brand": "MOP", "color": "red", "colorCode": "222", "styleGroupId": "2"}
{"gtin": "66", "size": "M", "brand": "MOP", "color": "blue", "colorCode": "111", "styleGroupId": "2"}
{"gtin": "77", "size": "L", "brand": "MOP", "color": "blue", "colorCode": "111", "styleGroupId": "2"}
Old post
You have to state each array level in the FROM clause of your query.
You have to specifically list the attributes of each level; if you only state e.g. colorVariants, the nested arrays will also be returned.
So in the specific case the solution would be:
SELECT
data->'brand' as brand,
data->'styleGroupId' as styleGroupId,
colorVariants->'color' as color,
colorVariants->'colorCode' as colorCode,
sizeVariants->'gtin' as GTIN,
sizeVariants->'size' as size
FROM
test,
jsonb_array_elements(data->'colorVariants') colorVariants,
jsonb_array_elements(colorVariants->'sizeVariants') sizeVariants
which will result in
brand | stylegroupid | color | colorcode | gtin | size
-------+--------------+--------+-----------+-------+------
"MOP" | "1" | "red" | "222" | "11" | "M"
"MOP" | "1" | "red" | "222" | "22" | "L"
"MOP" | "1" | "blue" | "111" | "33" | "M"
"MOP" | "1" | "blue" | "111" | "44" | "L"
"MOP" | "2" | "red" | "222" | "444" | "M"
"MOP" | "2" | "red" | "222" | "555" | "L"
"MOP" | "2" | "blue" | "111" | "66" | "M"
"MOP" | "2" | "blue" | "111" | "77" | "L"
A very simple example. Suppose you have the following data in a userGroups table:
id | userId | groups
1 | 21 | [{"name": "group_1" }, { "name": "group_2" }]
If you use the json_array_elements function on the groups attribute:
select id, userId, json_array_elements(groups) from userGroups;
you will get results similar to those of the MongoDB $unwind operator:
id | userId | groups
1 | 21 | {"name": "group_1" }
1 | 21 | {"name": "group_2" }
Hope it helps someone!

OrientDB ETL Edge transformer 2 joinFieldName(s)

With one joinFieldName and lookup, the Edge transformer works perfectly. However, now two keys are required, i.e. a compound index in the lookup. How can two joinFieldNames be specified?
This is the scripted (post-processing) version:
Create edge Expands from (select from MC where sample=1 and mkey=6) to (select from Event where sample=1 and mcl=6).
This works, but is not suitable for production.
Can anyone help?
You can simply add 2 joinFieldNames, like:
{ "edge": { "class": "Conn",
"joinFieldName": "b1",
"lookup": "A.a1",
"joinFieldName": "b2",
"lookup": "A.a2",
"direction": "out"
}}
see below my test data:
json1.json
{
"source": { "file": { "path": "/home/ivan/Scrivania/cose/etl/stak39517796/data1.csv" } },
"extractor": { "csv": {} },
"transformers": [
{ "vertex": { "class": "A" } }
],
"loader": {
"orientdb": {
"dbURL": "plocal:/home/ivan/OrientDB/db_installati/enterprise/orientdb-enterprise-2.2.10/databases/stack39517796",
"dbType": "graph",
"dbAutoCreate": true,
"classes": [
{"name": "A", "extends": "V"},
{"name": "B", "extends": "V"},
{"name": "Conn", "extends": "E"}
]
}
}
}
json2.json
{
"source": { "file": { "path": "/home/ivan/Scrivania/cose/etl/stak39517796/data2.csv" } },
"extractor": { "csv": {} },
"transformers": [
{ "vertex": { "class": "B" } },
{ "edge": { "class": "Conn",
"joinFieldName": "b1",
"lookup": "A.a1",
"joinFieldName": "b2",
"lookup": "A.a2",
"direction": "out"
}}
],
"loader": {
"orientdb": {
"dbURL": "plocal:/home/ivan/OrientDB/db_installati/enterprise/orientdb-enterprise-2.2.10/databases/stack39517796",
"dbType": "graph",
"dbAutoCreate": true,
"classes": [
{"name": "A", "extends": "V"},
{"name": "B", "extends": "V"},
{"name": "Conn", "extends": "E"}
]
}
}
}
data1.csv
a1,a2
1,1
1,2
2,3
data2.csv
b1,b2
1,1
2,3
1,2
execution order:
json1
json2
and here is the final result:
orientdb {db=stack39517796}> select from v
+----+-----+------+----+----+-------+----+----+--------+
|# |#RID |#CLASS|a1 |a2 |in_Conn|b2 |b1 |out_Conn|
+----+-----+------+----+----+-------+----+----+--------+
|0 |#17:0|A |1 |1 |[#25:0]| | | |
|1 |#18:0|A |1 |2 |[#27:0]| | | |
|2 |#19:0|A |2 |3 |[#26:0]| | | |
|3 |#21:0|B | | | |1 |1 |[#25:0] |
|4 |#22:0|B | | | |3 |2 |[#26:0] |
|5 |#23:0|B | | | |2 |1 |[#27:0] |
+----+-----+------+----+----+-------+----+----+--------+