Joining two dataframes in Spark is blowing up - scala

I am trying to join two dataframes on a single field. In order to do this, I must first make sure the field is unique. So my order of events goes:
Read in the first dataframe
Select the field I want to join on (say, field1), and another field I want to bring in on the join (field2)
Do .distinct
Then, for the second table:
Read in the second dataframe
Do a leftouter join on field1 with the first table
Do .distinct
I have tried to run my script and it is taking way longer than it should.
To debug this, I put a println for the record count before and after the join, and here are the results:
Before the join, the record count was 904,326. After, 2,658,632.
So I think the join is blowing up, but I'm not sure why. I suspect it has to do with using only one distinct after selecting two fields?
Please help!
Here is the code:
val ticketProduct = Source.fromArg(args, "f1").read
  .select($"INSTRUMENT_SK", $"TICKET_CODES_SK")
  .distinct

val instrumentD = Source.fromArg(args, "f2").read
  // println("instrumentD count before join is " + instrumentD.count)
  .join(ticketProduct, Seq("INSTRUMENT_SK"), "leftouter")
  // .select($"SERIAL_NBR", $"TICKET_CODES_SK")
  .distinct

println("instrumentD count after join is " + instrumentD.count)

The problem is that by calling distinct you only remove rows where the values of both field1 and field2 are the same.
Since you join on field1, what you actually want is for the values of field1 alone to be unique.
You could try something like the following instead of calling distinct.
dataframe1.groupBy($"field1").agg(org.apache.spark.sql.functions.collect_list($"field2"))
This will result in a dataframe where the column field1 is unique and the multiple values of field2 are aggregated into an array.
The same applies to the second dataframe.
To give an example: Say you have dataframes with the following content.
field1, field2
1, 1
1, 2
field1, field3
1, 1
1, 3
Then distinct does nothing to them, since the rows are distinct.
Now if you did a join on field1, you would get the following:
field1, field2, field3
1, 1, 1
1, 2, 1
1, 1, 3
1, 2, 3
The aggregation, in contrast, would give:
field1, array_of_field2
1, [1,2]
field1, array_of_field3
1, [1,3]
A join on field1 would then result in the following dataframe:
field1, array_of_field2, array_of_field3
1, [1,2], [1,3]
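To make that concrete, here is a minimal runnable sketch of the aggregation approach (assuming a spark-shell style session where spark.implicits._ is in scope; the column names mirror the toy example above):

import spark.implicits._  // assuming a SparkSession named spark, e.g. in spark-shell
import org.apache.spark.sql.functions.collect_list

val df1 = Seq((1, 1), (1, 2)).toDF("field1", "field2")
val df2 = Seq((1, 1), (1, 3)).toDF("field1", "field3")

// One row per field1, with the other column collected into an array
val agg1 = df1.groupBy($"field1").agg(collect_list($"field2").as("array_of_field2"))
val agg2 = df2.groupBy($"field1").agg(collect_list($"field3").as("array_of_field3"))

// The join key is now unique on both sides, so the row count cannot explode
agg1.join(agg2, Seq("field1")).show()
// field1 | array_of_field2 | array_of_field3
// 1      | [1, 2]          | [1, 3]

Applied to the code in the question, the same pattern would mean grouping ticketProduct by INSTRUMENT_SK and collecting TICKET_CODES_SK into an array before the left outer join, so each INSTRUMENT_SK matches at most one row of ticketProduct; use collect_set instead of collect_list if duplicate codes should be dropped.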

Related

JSONB Data Type Modification in Postgresql

I have a question about modifying a jsonb value in Postgres.
Basic setup:
array => ["1", "2", "3"]
I have a PostgreSQL database with an id column and a jsonb column named, let's just say, cards.
 id | cards
----+------------------
  1 | {"1": 3, "4": 2}
That's the data in the table named test.
Question:
How do I convert the cards value of id = 1 FROM {"1": 3, "4": 2} TO {"1": 4, "4": 2, "2": 1, "3": 1}?
How I expect the changes to occur:
From the array, increment by 1 every element that already exists as a key in the cards jsonb (changing {"1": 3} to {"1": 4}), and insert every element that doesn't exist as a key with a value of 1 (changing {"1": 4, "4": 2} to {"1": 4, "4": 2, "2": 1, "3": 1}), purely through Postgres.
Partial Solution
I asked a senior for support regarding my question and I was told this:-
Roughly (names may differ): object keys to explode cards, array_elements to explode the array, left join them, do the calculation, re-aggregate the object. There may be a more direct way to do this but the above brute-force approach will work.
So I tried to follow that using the two functions jsonb_each_text() and jsonb_array_elements_text(), but ended up stuck halfway, as I was unable to understand what they meant by left joining the two columns:
SELECT jsonb_each_text(tester_cards) AS each_text, jsonb_array_elements_text('[["1", 1], ["2", 1], ["3", 1]]') AS array_elements FROM tester WHERE id=1;
TL;DR
I need an UPDATE statement that checks whether each key from an array exists in the jsonb data and either increments its value by 1 or inserts the key with a value of 1.
Now it might look like I'm asking to be spoonfed, but I really haven't managed to find any way to solve this, so any assistance would be highly appreciated 🙇
The key insight is that with jsonb_each and jsonb_object_agg you can round-trip a JSON object in a subquery:
SELECT id, (
  SELECT jsonb_object_agg(key, value)
  FROM jsonb_each(cards)
) AS result
FROM test;
Now you can JOIN these key-value pairs against the jsonb_array_elements of your array input. Your colleague was close, but not quite right: it requires a full outer join, not just a left (or right) join to get all the desired object keys for your output, unless one of your inputs is a subset of the other.
SELECT id, (
  SELECT jsonb_object_agg(COALESCE(obj_key, arr_value), …)
  FROM jsonb_array_elements_text('["1", "2", "3"]') AS arr(arr_value)
  FULL OUTER JOIN jsonb_each(cards) AS obj(obj_key, obj_value) ON obj_key = arr_value
) AS result
FROM test;
Now what's left is only the actual calculation and the conversion to an UPDATE statement:
UPDATE test
SET cards = (
  SELECT jsonb_object_agg(
    COALESCE(key, arr_value),
    COALESCE(obj_value::int, 0) + (arr_value IS NOT NULL)::int
  )
  FROM jsonb_array_elements_text('["1", "2", "3"]') AS arr(arr_value)
  FULL OUTER JOIN jsonb_each_text(cards) AS obj(key, obj_value) ON key = arr_value
);

pyspark group by sum

I have a pyspark dataframe with 4 columns.
id/ number / value / x
I want to group by the columns id and number, and then add a new column with the sum of value per id and number. I want to keep the column x without doing anything to it.
df= df.select("id","number","value","x")
.groupBy( 'id', 'number').withColumn("sum_of_value",df.value.sum())
In the end I want a dataframe with 5 columns: id / number / value / x / sum_of_value.
Can anyone help?
The result you are trying to achieve doesn't make sense. Your output dataframe will only have columns that were grouped by or aggregated (summed in this case). x and value would have multiple values when you group by id and number.
You can have a 3-column output (id, number and sum(value)) like this:
df_summed = df.groupBy('id', 'number').sum('value')
Let's say your DataFrame df has 3 columns initially.
df1 = df.groupBy("id", "number").count()
Now df1 will contain 3 columns: id, number, and count.
Now you can join df1 and df on the columns "id" and "number" and select whatever columns you would like to select.
Hope it helps.
Regards,
Neeraj

PostgreSQL join 2 tables and get all values even if 1 table is empty

I have 2 tables: objects and properties. The properties table holds the properties of the objects, but it is possible that an object does not have any properties.
I would like to write a query that returns all the objects that have properties (a value in the property column) and all the objects that don't have properties (in which case the property column will be empty).
EXAMPLE: Simplified query that gives the same result
SELECT
row_number () OVER() AS id,
seire.id seire_id,
tegevus.arenguvajadus
FROM andmed seire
RIGHT OUTER JOIN tegevused tegevus ON seire.id = tegevus.seire_id
WHERE tegevus.aktiivne = true
Data example:
andmed:
Id, Data
1 , ...
2, ...
tegevused
id, aktiivne, arenguvajadus, seire_id
1, true, something something, 1
1, true, something2 , 1
Expected result
ID, Seire_id, arenguvajadus
1, 1, something something
2, 1, something2
3, 2,
You need to remove the condition on the outer-joined table from your WHERE clause and move it into the join's ON clause; filtering that table in the WHERE clause turns the outer join back into an inner join and drops the objects without properties. I assume tegevused is properties.
SELECT
row_number () OVER() AS id,
seire.id seire_id,
tegevus.arenguvajadus
FROM andmed seire
LEFT OUTER JOIN tegevused tegevus ON seire.id = tegevus.seire_id AND tegevus.aktiivne = true

How to insert record into a dataframe in spark

I have a dataframe (df1) which has 50 columns; the first one is cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But as the two dataframes have different schemas, I cannot do a union. What is the best way to do that?
I used a full outer join, but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
import org.apache.spark.sql.functions.{col, lit}

val features = df1.columns.toSet - "cust_id" // all feature column names
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0)) // lit(0) is an integer literal; cast if your feature types differ
)
// unionAll matches columns by position, so align newDF to df1's column order first
df1.unionAll(newDF.select(df1.columns.map(col): _*))
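A side note: if all of df1's feature columns are numeric, a shorter sketch (under that assumption) is to keep the full outer join from above and fill the resulting nulls directly:

// Join on cust_id (so there is only one cust_id column), then replace the
// nulls that appear in the feature columns of df2-only customers with 0.
val result = df1.join(df2, Seq("cust_id"), "full_outer").na.fill(0)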

How to group by more than 64 keys in BigQuery

Using Google-BigQuery, I created a query with almost 100 fields, grouping by 96 of them:
SELECT
field1,field2,(...),MAX(field100) as max100
FROM dataset.table1
GROUP BY field1,field2,(...),field96
and I got this error
Error: Maximum number of keys in GROUP BY clause is 64, query has 96 GROUP BY keys.
So it seems there is no way to group by more than 64 fields using Google BigQuery. Any suggestions?
If some of these fields are strings, and there is a character which cannot appear in them (say, ':'), then you could concatenate them together and group by concatenation, i.e.
SELECT CONCAT(field1, ':', field2, ':', field3) as composite_field, ...
FROM dataset.table
GROUP BY 1, 2, ..., 64
In order to recover the original fields later, you could use
SELECT
regexp_extract(composite_field, r'([^:]*):') field1,
regexp_extract(composite_field, r'[^:]*:([^:]*)') field2,
regexp_extract(composite_field, r'[^:]*:[^:]*:(.*)') field3,
...
FROM (...)
It seems that this is an undocumented internal limit.
Another solution that I have developed is similar to Mosha's solution.
You can add an extra column called, for example, hashref. That new column is computed from all the columns that you would like to group by, concatenated with a separator (a pipe, for example) and hashed with md5 or sha256.
Then you can group by the new hashref column, and for the other columns you just apply the min() function, which is also an aggregate function.
line = name + "|" + surname + "|" + age
hashref = md5(line)
... and then ...
SELECT hashref, min(name), min(surname)
FROM mytable
GROUP BY hashref