Partition rows in a PySpark DataFrame

I have a dataframe:
data = [{"ID": 'asriыjewfsldflsar2', 'val': 5},
        {"ID": 'dsgвarwetreg', 'val': 89},
        {"ID": 'ewrсt43gdfb', 'val': 36},
        {"ID": 'q23м4534tgsdg', 'val': 58},
        {"ID": '34tя5erbdsfgv', 'val': 52},
        {"ID": '43t2ghnaef', 'val': 123},
        {"ID": '436tываhgfbds', 'val': 457},
        {"ID": '435t5вч3htrhnbszfdhb', 'val': 54},
        {"ID": '35yteвrhfdhbxfdbn', 'val': 1},
        {"ID": '345ghаывserh', 'val': 1},
        {"ID": 'asrijываewfsldflsar2', 'val': 5},
        {"ID": 'dsgarываwetreg', 'val': 89},
        {"ID": 'ewrt433gdfb', 'val': 36},
        {"ID": 'q2345выа34tgsdg', 'val': 58},
        {"ID": '34t5eоrbdsfgv', 'val': 52},
        {"ID": '43tghолnaef', 'val': 123},
        {"ID": '436thапgfbds', 'val': 457},
        {"ID": '435t5укн3htrhnbszfdhb', 'val': 54},
        {"ID": '35ytк3erhfdhbxfdbn', 'val': 1},
        {"ID": '345g244hserh', 'val': 1}]
df = spark.createDataFrame(data)
I want to split the rows into 4 groups. I used to be able to do this with row_number():
.withColumn('part', F.row_number().over(Window.orderBy(F.lit(1))) % n)
Unfortunately this method does not suit me, because I have a large dataframe that will not fit into memory. I tried to use the hash function, but I think I'm doing it wrong:
df2 = df.withColumn('hashed_name', F.hash('ID') % N) \
        .withColumn('divide', F.floor(F.col('hashed_name') / 13)) \
        .sort('divide')
Is there another way to split the rows into groups besides row_number()?
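For reference, a minimal sketch of the hash idea above, assuming the goal is simply 4 groups of roughly equal size (the abs() call and the fixed 4 are illustrative additions, not from the original attempt): hash() can return negative numbers, so folding it through abs() keeps the group ids in 0..3.
from pyspark.sql import functions as F

# Sketch only: derive a stable group id straight from the ID hash.
# abs() folds negative hash values so the result stays in the range 0..3.
df2 = df.withColumn('part', F.abs(F.hash('ID')) % 4)
df2.groupBy('part').count().show()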

You can use partitionBy() when saving the dataframe in Delta format.
df.write.format("delta").mode("overwrite").partitionBy("ColumnName").save("path_to_save_the_dataframe")
hope this helps!
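For example, combined with a derived grouping column (a sketch only; the part column, the 4-way split via hash, and the output path are placeholders rather than part of the original answer):
from pyspark.sql import functions as F

# Hypothetical example: derive a 4-way group, then write one folder per group.
# Requires the Delta Lake package, as in the snippet above; the path is a placeholder.
(df.withColumn("part", F.abs(F.hash("ID")) % 4)
   .write.format("delta")
   .mode("overwrite")
   .partitionBy("part")
   .save("/tmp/df_by_part"))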

Hi, you can use coalesce() to bring the dataframe down to the desired number of partitions, and later you can use the partition number in further queries.
df1 = df.coalesce(4)
df1.createOrReplaceTempView('df')
espsql = "select x.*, spark_partition_id() as part from df x"
df_new = spark.sql(espsql)
newsql = "select distinct part from df_new"
spark.sql(newsql).take(5)
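A pure DataFrame-API equivalent of the SQL above could look like the sketch below; note that repartition(4) forces exactly 4 partitions, whereas coalesce(4) only reduces the count to at most 4.
from pyspark.sql import functions as F

# Sketch: tag each row with the partition it landed in after repartitioning.
df_new = df.repartition(4).withColumn('part', F.spark_partition_id())
df_new.select('part').distinct().show()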

Related

Coalesce value bound to object's key into parent's value

I have a PostgreSQL 12.x database. There is a column data in a table typename that contains JSON. The actual JSON data is not fixed to a particular structure; these are some examples:
{"emt": {"key": " ", "source": "INPUT"}, "id": 1, "fields": {}}
{"emt": {"key": "Stack Overflow", "source": "INPUT"}, "id": 2, "fields": {}}
{"emt": {"key": "https://www.domain.tld/index.html", "source": "INPUT"}, "description": {"key": "JSONB datatype", "source": "INPUT"}, "overlay": {"id": 5, "source": "bOv"}, "fields": {"id": 1, "description": "Themed", "recs ": "1"}}
Basically, what I'm trying to come up with is a (database migration) script that will find any object with the keys key and source, take the actual value of key, and assign it to the key that the object was originally bound to. For instance:
{"emt": " ", "id": 1, "fields": {}}
{"emt": "Stack Overflow", "id": 2, "fields": {}}
{"emt": "https://www.domain.tld/index.html", "description": "JSONB datatype", "overlay": {"id": 5, "source": "bOv"}, "fields": {"id": 1, "description": "Themed", "recs ": "1"}}
I started finding the rows that contained "source": "INPUT" by using:
select * from typename
where jsonb_path_exists(data, '$.** ? (@.type() == "string" && @ like_regex "INPUT")');
...but then I'm not sure how to update the returned subset or loop through it :/
It took me a while but here is the update statement:
update typename
set data = jsonb_set(data, '{emt}', jsonb_extract_path(data, 'emt', 'key')::jsonb, false)
where jsonb_typeof(data -> 'emt') = 'object'
and jsonb_path_exists(data, '$.emt.key ? (@.type() == "string")')
and jsonb_path_exists(data, '$.emt.source ? (@.type() == "string" && @ like_regex "INPUT")');
There are probably better ways to implement that where clause, but that one works ;)
One downside is that I had to figure out how many keys are involved and write one update statement per key; e.g. in the original example there were two such keys, emt and description, so it would take two update statements.

How to add key/value to element in array of array (jsonb type) in plain sql (encountered ERROR: aggregate function calls cannot be nested)

dbfiddle here:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=5924abbdad955e159de7f3571ebbac5a
Say I have a table
CREATE TABLE yewu (
    id serial PRIMARY KEY,
    org varchar(50),
    data jsonb
);
I want to update the table from
id org data
1 OA [{"profiles": [{"id": 1}, {"id": 2}]}, {"profiles": [{"id": 3}, {"id": 4}]}]
2 OB [{"profiles": [{"id": 1}, {"id": 2}]}, {"profiles": [{"id": 3}, {"id": 4}]}]
to
id org data
1 OA [{"profiles": [{"id": 1,"org":"OA"}, {"id": 2,"org":"OA"}]}, {"profiles": [{"id": 3,"org":"OA"}, {"id": 4,"org":"OA"}]}]
2 OB [{"profiles": [{"id": 1,"org":"OB"}, {"id": 2,"org":"OB"}]}, {"profiles": [{"id": 3,"org":"OB"}, {"id": 4,"org":"OB"}]}]
This is what I tried:
UPDATE
    yewu
SET
    data = (
        SELECT
            jsonb_agg((
                SELECT
                    jsonb_set(oc, '{profiles}', jsonb_agg((
                        SELECT
                            jsonb_set(p, '{org}', yewu.org::jsonb)
                        FROM jsonb_array_elements(oc -> 'profiles') p)))))
        FROM
            jsonb_array_elements(data) oc)
RETURNING
    id;
and error:
ERROR: aggregate function calls cannot be nested
LINE 8: jsonb_set(oc, '{profiles}', jsonb_agg((
^
Sample Query:
select
    t2.id,
    t2.org,
    jsonb_agg(t2.p1) as "data"
from (
    select
        t1.id,
        t1.org,
        t1.profiles,
        jsonb_build_object('profiles', jsonb_agg(t1.p1)) as p1
    from (
        select
            id,
            org,
            jsonb_array_elements("data")->'profiles' as "profiles",
            jsonb_array_elements(jsonb_array_elements("data")->'profiles') || jsonb_build_object('org', org) as p1
        from yewu
    ) t1
    group by t1.id, t1.org, t1.profiles
) t2
group by t2.id, t2.org
------------------------- RETURN -------------------------
id org data
--------------------------------------------------------------------------------------------------------------------------------------------------
1 OA [{"profiles": [{"id": 1, "org": "OA"}, {"id": 2, "org": "OA"}]}, {"profiles": [{"id": 3, "org": "OA"}, {"id": 4, "org": "OA"}]}]
2 OB [{"profiles": [{"id": 1, "org": "OB"}, {"id": 2, "org": "OB"}]}, {"profiles": [{"id": 3, "org": "OB"}, {"id": 4, "org": "OB"}]}]

postgres 11.6 - Creating array of JSON Objects from JSON array

I have the following schema here: http://sqlfiddle.com/#!17/5c73a/1
I want to create a query where the results will be something like this:
id | tags
_________________________________
1  | [{"id": "id", "title": "first"}, {"id": "id", "title": "second"}, {"id": "id", "title": "third"}]
2  | [{"id": "id", "title": "fourth"}, {"id": "id", "title": "fifth"}, {"id": "id", "title": "sixth"}]
The idea is to build an array with one object per element of the tags array; the important part is the title value.
You need to unnest the array and then aggregate it back:
select t.id, jsonb_agg(jsonb_build_object('id', 'id', 'title', tg.title))
from things t
cross join jsonb_array_elements(tags) as tg(title)
group by t.id;
Online example

How can I get "complex JSON" from a kafka topic and insert it in several tables in MySQL?

I'm a Kafka beginner, so it is possible there is an API or a tool that could help me that I don't know about. If I'm approaching this problem the wrong way and you let me know, I would really appreciate it.
I have JSON in my topic which looks something like this:
{
    "id": "001",
    "value": "30000",
    "items": [
        {
            "id": 1,
            "description": "chicken breast",
            "value": "2300"
        },
        {
            "id": 2,
            "description": "Cookies",
            "value": "2400"
        }
    ]
}
And I need to take it and insert it into several related MySQL tables.
Should I create a Kafka Streams app and transform my data to simplify that JSON? Or is there a way to "map" or construct each row that I need to insert into each table with Kafka Connect?
P.S.: This is a simplified JSON example; the JSON in my topic will be much more complex.
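Whichever route is chosen (a Kafka Streams application or a Kafka Connect transform), the flattening step itself is conceptually small. Below is a plain-Python sketch of that step only, using hypothetical orders / order_items targets that are not part of the question:
import json

# Hedged, framework-agnostic sketch: split one message into a parent row
# and one child row per item. Table and column names are hypothetical.
message = '''{"id": "001", "value": "30000",
              "items": [{"id": 1, "description": "chicken breast", "value": "2300"},
                        {"id": 2, "description": "Cookies", "value": "2400"}]}'''

doc = json.loads(message)

order_row = {"id": doc["id"], "value": doc["value"]}
item_rows = [
    {"order_id": doc["id"], "item_id": item["id"],
     "description": item["description"], "value": item["value"]}
    for item in doc["items"]
]

print("orders:", order_row)         # one row for the parent table
print("order_items:", item_rows)    # one row per nested item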

Search and update a JSON array element in Postgres

I have a jsonb column that stores an array of elements like the following:
[
    {"id": "11", "name": "John", "age":"25", ..........},
    {"id": "22", "name": "Mike", "age":"35", ..........},
    {"id": "33", "name": "Tom", "age":"45", ..........},
    .....
]
I want to replace the 2nd object (id=22) with a totally new object. I don't want to update each property one by one, because there are many properties and their values could all have changed. I want to just identify the 2nd element and replace the whole object.
I know there is jsonb_set(). However, to update the 2nd element, I need to know its array index (1) so I can do the following:
jsonb_set(data, '{1}', '{"id": "22", "name": "Don", "age":"55"}',true)
But I couldn't find any way to search and get that index. Can someone help me out?
One way I can think of is to combine row_number and json_array_elements:
-- test data
create table test (id integer, data jsonb);
insert into test values (1, '[{"id": "22", "name": "Don", "age":"55"}, {"id": "23", "name": "Don2", "age":"55"},{"id": "24", "name": "Don3", "age":"55"}]');
insert into test values (2, '[{"id": "32", "name": "Don", "age":"55"}, {"id": "33", "name": "Don2", "age":"55"},{"id": "34", "name": "Don3", "age":"55"}]');
select subrow, id, row_number() over (partition by id)
from (
select jsonb_array_elements(data) as subrow, id
from test
) as t;
subrow | id | row_number
------------------------------------------+----+------------
{"id": "22", "name": "Don", "age":"55"} | 1 | 1
{"id": "23", "name": "Don2", "age":"55"} | 1 | 2
{"id": "24", "name": "Don3", "age":"55"} | 1 | 3
{"id": "32", "name": "Don", "age":"55"} | 2 | 1
{"id": "33", "name": "Don2", "age":"55"} | 2 | 2
{"id": "34", "name": "Don3", "age":"55"} | 2 | 3
-- apparently you can filter what you want from here
select subrow, id, row_number() over (partition by id)
from (
select jsonb_array_elements(data) as subrow, id
from test
) as t
where subrow->>'id' = '23';
In addition, think about your schema design. It may not be the best idea to store your data this way.