How to combine pyspark dataframes with different shapes and different columns - pyspark

I have two dataframes in Pyspark. One has more than 1000 rows and the other only 4 rows. The columns also are not matching.
df1 with more than 1000 rows:
+----+--------+--------------+-------------+
| ID | col1 | col2 | col 3 |
+----+--------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 |
| 2 | time 2 | value2_col2 | value2_col3 |
+----+--------+--------------+-------------+
...
df2 with only 4 rows:
+-----+--------------+--------------+
| key | col_c | col_d |
+-----+--------------+--------------+
| a | valuea_colc | valuea_cold |
| b | valueb_colc | valueb_cold |
+-----+--------------+--------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| ID | col1 | col2 | col 3 | a_col_c | a_col_d | b_col_c | b_col_d |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2 | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!

I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.

Related

Compare consecutive rows and extract words(excluding the subsets) in spark

I am working on a spark dataframe. Input dataframe looks like below (Table 1). I need to write a logic to get the keywords with maximum length for each session ids. There are multiple keywords that would be part of output for each sessionid. expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in value column. later on tried to compare each row with its subsequent row to see if it is of maximum lenth. But this solution only works party.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
.orderBy("session_id")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
.withColumn("running_max", max("col_length") over ts)
.where($"running_max" === $"col_length")
.select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns does not work inside an orderBy clause with window function so I didn't get desired output.I got 1 output per sesison id. Any suggesions would be highly appreciated. Thanks in advance.
You can solve it by using lead function:
val windowSpec = Window.orderBy("session_id")
dfM
.withColumn("lead",lead("value",1).over(windowSpec))
.filter((functions.length(col("lead")) < functions.length(col("value"))) || col("lead").isNull)
.drop("lead")
.show

PostgreSQL: Transforming rows into columns when more than three columns are needed

I have a table like the following one:
+---------+-------+-------+-------------+--+
| Section | Group | Level | Fulfillment | |
+---------+-------+-------+-------------+--+
| A | Y | 1 | 82.2 | |
| A | Y | 2 | 23.2 | |
| A | M | 1 | 81.1 | |
| A | M | 2 | 28.2 | |
| B | Y | 1 | 89.1 | |
| B | Y | 2 | 58.2 | |
| B | M | 1 | 32.5 | |
| B | M | 2 | 21.4 | |
+---------+-------+-------+-------------+--+
And this would be my desired output:
+---------+-------+--------------------+--------------------+
| Section | Group | Level1_Fulfillment | Level2_Fulfillment |
+---------+-------+--------------------+--------------------+
| A | Y | 82.2 | 23.2 |
| A | M | 81.1 | 28.2 |
| B | Y | 89.1 | 58.2 |
| B | M | 32.5 | 21.4 |
+---------+-------+--------------------+--------------------+
Thus, for each section and group I'd like to obtain their percents of fulfillment for level 1 and level 2. To achieve this, I've tried crosstab(), but using this function returns me an error ("The provided SQL must return 3 columns: rowid, category, and values.") because I'm using more than three columns (I need to maintain section and group as identifiers for each row). Is possible to use crosstab in this case?
Regards.
I find crosstab() unnecessary complicated to use and prefer conditional aggregation:
select section,
"group",
max(fulfillment) filter (where level = 1) as level_1,
max(fulfillment) filter (where level = 2) as level_2
from the_table
group by section, "group"
order by section;
Online example

T-SQL : Pivot table without aggregate

I am trying to understand how to pivot data within T-SQL but can't seem to get it working. I have the following table structure
+-------------------+-----------------------+
| Name | Value |
+-------------------+-----------------------+
| TaskId | 12417 |
| TaskUid | XX00044497 |
| TaskDefId | 23 |
| TaskStatusId | 4 |
| Notes | |
| TaskActivityIndex | 0 |
| ModifiedBy | Orange |
| Modified | /Date(1554540200000)/ |
| CreatedBy | Apple |
| Created | /Date(2121212100000)/ |
| TaskPriorityId | 40 |
| OId | 2 |
+-------------------+-----------------------+
I want to pivot the name column to be columns expected output
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| TASKID | TASKUID | TASKDEFID | TASKSTATUSID | NOTES | TASKACTIVITYINDEX | MODIFIEDBY | MODIFIED | CREATEDBY | CREATED | TASKPRIORITYID | OID |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
| | | | | | | | | | | | |
| 12417 | XX00044497 | 23 | 4 | | 0 | Orange | /Date(1554540200000)/ | Apple | /Date(2121212100000)/ | 40 | 2 |
+--------+------------------------+-----------+--------------+-------+-------------------+------------+-----------------------+-----------+-----------------------+----------------+-----+
Is there an easy way of doing it? The columns are fixed (not dynamic).
Any help appreciated
Try this:
select * from yourtable
pivot
(
min(value)
for Name in ([TaskID],[TaskUID],[TaskDefID]......)
) as pivotable
You can also use case statements.
You must use the aggregate function in the pivot table.
If you want to learn more, here is the reference:
https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017
Output (I only tried three columns):
DB<>Fiddle

Add columns but keep a specific id

I have a table "Listing" that looks like this:
| listing_id | amenities |
|------------|--------------------------------------------------|
| 5629709 | {"Air conditioning",Heating, Essentials,Shampoo} |
| 4156372 | {"Wireless Internet",Kitchen,"Pets allowed"} |
And another table "Amenity" like this:
| amenity_id | amenities |
|------------|--------------------------------------------------|
| 1 | Air conditioning |
| 2 | Kitchen |
| 3 | Heating |
Is there a way to join the two tables in a new one "Listing_Amenity" like this:
| listing_id | amenities |
|------------|-----------|
| 5629709 | 1 |
| 5629709 | 3 |
| 4156372 | 2 |
You could use unnest:
CREATE TABLE Listing_Amenity
AS
SELECT l.listing_id, a.amenity_id
FROM Listing l
, unnest(l.ammenities) sub(elem)
JOIN Amenity a
ON a.ammenities = sub.elem;
db<>fiddle demo

Talend: how to split column data into rows

I have one table:
| id | head1| head2 | head3|
| 1 | fv1 | fw1,fw2,fw3| fv3 |
| 2 | sv2 | sw1,sw2,sw3| sv4 |
And would like to have the following:
| id | head2 |
| 1 | fw1 |
| 1 | fw2 |
| 1 | fw3 |
| 2 | sw1 |
| 2 | sw2 |
| 2 | sw3 |
So I would like to split a comma-delimited content of some columns and then copy it over into the different table as rows for search purposes.
Which Talend component should I use to achieve this? Is that possible?
tNormalize should help you with this problem.
Just select "," as field separator, and head2 as the column to normalize.