I'm attempting to combine multiple columns into a single column, as shown below:
Input
A | B | C | D | E
aa | bb | cc | dd | null
Output
A | B | C | D | E | combine
aa | bb | cc | dd | null | [A: aa, B: bb, C: cc, D: dd, E: null]
Since you need an array of strings, you can fill the nulls and then concatenate each column name with its value:
from pyspark.sql import functions as func
# fill nulls with the string 'null', then build one "name: value" string per column
df.fillna('null').withColumn(
    'combine',
    func.array([func.concat_ws(': ', func.lit(c), func.col(c)) for c in df.columns])
).show(1, False)
+---+---+---+---+----+-------------------------------------+
|A |B |C |D |E |combine |
+---+---+---+---+----+-------------------------------------+
|aa |bb |cc |dd |null|[A: aa, B: bb, C: cc, D: dd, E: null]|
+---+---+---+---+----+-------------------------------------+
I have two dataframes in the following format:
One has a single row:
+-----+--------------------+
| col1| col2|
+-----+--------------------+
| A | [B, C, D] |
+-----+--------------------+
The other one has multiple rows:
+----------+--------------------+
| col1| col2|
+----------+--------------------+
| F |[A, B, C] |
| G |[J, K, B] |
| H |[C, H, D] |
+----------+--------------------+
I am looking for the intersection of these two:
+----------+--------------------+
| col1| col2|
+----------+--------------------+
| F |[B, C] |
| G |[B] |
| H |[C, D] |
+----------+--------------------+
I tried the solution proposed here but it didn't help. Is there any efficient way to find the intersection between a single-row dataframe and a multi-row dataframe?
You can crossJoin col2 from the single-row dataframe and use the array_intersect function for the required intersection:
from pyspark.sql import functions as func
# broadcast the single-row dataframe so the cross join stays cheap
data2_sdf. \
    crossJoin(func.broadcast(data1_sdf.selectExpr('col2 as col_to_check'))). \
    withColumn('reqd_intersect', func.array_intersect('col2', 'col_to_check')). \
    show(truncate=False)
# +----+---------+------------+--------------+
# |col1|col2 |col_to_check|reqd_intersect|
# +----+---------+------------+--------------+
# |F |[A, B, C]|[B, C, D] |[B, C] |
# |G |[J, K, B]|[B, C, D] |[B] |
# |H |[C, H, D]|[B, C, D] |[C, D] |
# +----+---------+------------+--------------+
I have a DataFrame on which I'm trying to update a cell depending on some conditions (like a SQL UPDATE ... WHERE).
For example, let's say I have the following DataFrame:
+-------+-------+
|datas |isExist|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | O |
| AA | O |
+-------+-------+
How could I update the value to X when datas = AA and isExist is O? Here is the expected output:
+-------+-------+
|IPCOPE2|IPROPE2|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | X |
| AA | X |
+-------+-------+
I could do a filter and then a union, but I don't think that's the best solution. I could also use when, but in that case I would have to rebuild the row with the same values for every column except isExist. In this example that would be acceptable, but what if I have 20 columns?
You can create a new column using withColumn (holding either the original or the updated value) and then drop the isExist column.
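A minimal sketch of that approach, assuming the DataFrame above is called df (the intermediate column name isExistUpdated is just illustrative):
import org.apache.spark.sql.functions.{col, when}
// build a new column holding either the updated or the original value,
// then drop the old column and rename the new one back
val updated = df
  .withColumn("isExistUpdated",
    when(col("datas") === "AA" && col("isExist") === "O", "X").otherwise(col("isExist")))
  .drop("isExist")
  .withColumnRenamed("isExistUpdated", "isExist")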
I am not sure why you do not want to use when, as it seems to be exactly what you need. The withColumn method, when used with an existing column name, simply replaces that column with the new value:
import org.apache.spark.sql.functions.when
import spark.implicits._  // already in scope in spark-shell; provides the 'colName syntax
df.withColumn("isExist",
    when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
  .show()
+-----+-------+
|datas|isExist|
+-----+-------+
| AA| x|
| BB| x|
| CC| O|
| CC| O|
| DD| O|
| AA| x|
| AA| x|
| AA| X|
| AA| X|
+-----+-------+
Then you can use withColumnRenamed to change the names of your columns. (e.g. df.withColumnRenamed("datas", "IPCOPE2"))
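For example, to match the column names in the expected output above, both renames can be chained (a small sketch, assuming the DataFrame is df):
val renamed = df
  .withColumnRenamed("datas", "IPCOPE2")
  .withColumnRenamed("isExist", "IPROPE2")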
I have this example dataframe:
id | A | B | C | D
1 |NULL | 1 | 1 |NULL
2 | 1 | 1 | 1 | 1
3 | 1 |NULL |NULL |NULL
and I want to change it to this format:
id | newColumn
1 | {"B", "C"}
2 | {"A","B","C","D"}
3 | {"A"}
In other words, I want to make a new column with a list containing the column names where the row values are not null.
How can I do this in Spark using Scala?
First, get the column names where there is an actual value rather than a null. This can be done with an expression such as:
import org.apache.spark.sql.functions.{col, concat_ws, split, when}
val notNullColNames = Seq("A", "B", "C", "D").map(c => when(col(c).isNotNull, c))
To create an array of values, array would normally be used; however, it keeps null entries for the columns that are null. Instead, one solution is to use concat_ws and split, since concat_ws skips null values:
df.select($"id", split(concat_ws(",", notNullColNames:_*), ",").as("newColumn"))
For the example input, this will output:
+---+------------+
| id| newColumn|
+---+------------+
| 1| [B, C]|
| 2|[A, B, C, D]|
| 3| [A]|
+---+------------+
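For reference, a minimal sketch of how the example input above could be built (assuming a spark-shell session, or import spark.implicits._ from an existing SparkSession):
import spark.implicits._
// nullable integer columns are easiest to express with Option here
val df = Seq(
  (1, None,    Some(1), Some(1), None),
  (2, Some(1), Some(1), Some(1), Some(1)),
  (3, Some(1), None,    None,    None)
).toDF("id", "A", "B", "C", "D")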
I want to join two tables A and B and, for each record in Table B, pick the record from Table A having the max date.
Consider the following tables:
Table A:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | a | 4/1/2018 |
| 3 | a | 8/1/2018 |
| 4 | c | 1/1/2018 |
| 5 | d | 1/1/2018 |
| 6 | e | 1/1/2018 |
+---+-----+----------+
Table B:
+---+-----+----------+
|Key|Value|sent_date |
+---+-----+----------+
| x | a | 2/1/2018 |
| y | a | 7/1/2018 |
| z | a | 11/1/2018|
| p | c | 5/1/2018 |
| q | d | 5/1/2018 |
| r | e | 5/1/2018 |
+---+-----+----------+
The aim is to bring column id from Table A into Table B for each Value in Table B.
For that, Tables A and B need to be joined on column Value, and for each record in B, the row of A with max(A.start_date) is selected under the condition A.start_date < B.sent_date.
Let's consider Value = a here.
In Table A, there are 3 records for Value = a with 3 different start_date values.
So when joining with Table B, for Value = a with sent_date = 2/1/2018, the record with the maximum start_date among those less than sent_date is taken (in this case 1/1/2018), and the corresponding A.id is pulled into Table B.
Similarly, for the record with Value = a and sent_date = 11/1/2018 in Table B, id = 3 from Table A needs to be pulled into Table B.
The result must be as follows:
+---+-----+----------+---+
|Key|Value|sent_date |id |
+---+-----+----------+---+
| x | a | 2/1/2018 | 1 |
| y | a | 7/1/2018 | 2 |
| z | a | 11/1/2018| 3 |
| p | c | 5/1/2018 | 4 |
| q | d | 5/1/2018 | 5 |
| r | e | 5/1/2018 | 6 |
+---+-----+----------+---+
I am using Spark 2.3.
I have joined the two tables (using DataFrames) and found the max(start_date) based on the condition, but I am unable to figure out how to pull the corresponding records.
Can anyone help me out here?
Thanks in advance!
I just changed the date "11/1/2018" to "9/1/2018", because sorting the dates as strings gives incorrect results; once the column is converted to a date type, the logic still works. See below:
scala> val df_a = Seq((1,"a","1/1/2018"),
| (2,"a","4/1/2018"),
| (3,"a","8/1/2018"),
| (4,"c","1/1/2018"),
| (5,"d","1/1/2018"),
| (6,"e","1/1/2018")).toDF("id","valuea","start_date")
df_a: org.apache.spark.sql.DataFrame = [id: int, valuea: string ... 1 more field]
scala> val df_b = Seq(("x","a","2/1/2018"),
| ("y","a","7/1/2018"),
| ("z","a","9/1/2018"),
| ("p","c","5/1/2018"),
| ("q","d","5/1/2018"),
| ("r","e","5/1/2018")).toDF("key","valueb","sent_date")
df_b: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 1 more field]
scala> val df_join = df_b.join(df_a,'valueb==='valuea,"inner")
df_join: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 4 more fields]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> df_join.filter('sent_date >= 'start_date).withColumn("rank", rank().over(Window.partitionBy('key,'valueb,'sent_date).orderBy('start_date.desc))).filter('rank===1).drop("valuea","start_date","rank").show()
+---+------+---------+---+
|key|valueb|sent_date| id|
+---+------+---------+---+
| q| d| 5/1/2018| 5|
| p| c| 5/1/2018| 4|
| r| e| 5/1/2018| 6|
| x| a| 2/1/2018| 1|
| y| a| 7/1/2018| 2|
| z| a| 9/1/2018| 3|
+---+------+---------+---+
UPDATE
Below is a UDF to handle date strings in MM/dd/yyyy format:
scala> def dateConv(x:String):String=
| {
| val y = x.split("/").map(_.toInt).map("%02d".format(_))
| y(2)+"-"+y(0)+"-"+y(1)
| }
dateConv: (x: String)String
scala> val udfdateconv = udf( dateConv(_:String):String )
udfdateconv: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df_a_dt = df_a.withColumn("start_date",date_format(udfdateconv('start_date),"yyyy-MM-dd").cast("date"))
df_a_dt: org.apache.spark.sql.DataFrame = [id: int, valuea: string ... 1 more field]
scala> df_a_dt.printSchema
root
|-- id: integer (nullable = false)
|-- valuea: string (nullable = true)
|-- start_date: date (nullable = true)
scala> df_a_dt.show()
+---+------+----------+
| id|valuea|start_date|
+---+------+----------+
| 1| a|2018-01-01|
| 2| a|2018-04-01|
| 3| a|2018-08-01|
| 4| c|2018-01-01|
| 5| d|2018-01-01|
| 6| e|2018-01-01|
+---+------+----------+
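For completeness, a sketch (under the same spark-shell session, with Window imported as above) of applying the conversion to df_b as well, so that a value like the original 11/1/2018 would also sort correctly once both sides are real dates:
// convert sent_date the same way, then repeat the join / rank logic on proper dates
val df_b_dt = df_b.withColumn("sent_date",
  date_format(udfdateconv('sent_date), "yyyy-MM-dd").cast("date"))
df_b_dt.join(df_a_dt, 'valueb === 'valuea, "inner")
  .filter('sent_date >= 'start_date)
  .withColumn("rank", rank().over(
    Window.partitionBy('key, 'valueb, 'sent_date).orderBy('start_date.desc)))
  .filter('rank === 1)
  .drop("valuea", "start_date", "rank")
  .show()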
Assume I have the following dataframes:
val df1 = sc.parallelize(Seq("a1" -> "a2", "b1" -> "b2", "c1" -> "c2")).toDF("a", "b")
val df2 = sc.parallelize(Seq("aa1" -> "aa2", "bb1" -> "bb2")).toDF("aa", "bb")
And I want the following:
| a | b | aa | bb |
----------------------
| a1 | a2 | aa1 | aa2 |
| a1 | a2 | bb1 | bb2 |
| b1 | b2 | aa1 | aa2 |
| b1 | b2 | bb1 | bb2 |
| c1 | c2 | aa1 | aa2 |
| c1 | c2 | bb1 | bb2 |
So each row of df1 should map to all of the rows of df2. The way I am doing it is the following:
val df1_dummy = df1.withColumn("dummy_df1", lit("dummy"))
val df2_dummy = df2.withColumn("dummy_df2", lit("dummy"))
val desired_result = df1_dummy
.join(df2_dummy, $"dummy_df1" === $"dummy_df2", "left")
.drop("dummy_df1")
.drop("dummy_df2")
It gives the desired result, but it seems like a clumsy way of doing it. Is there a more efficient way? Any recommendations?
That's what crossJoin is for:
val result = df1.crossJoin(df2)
result.show()
// +---+---+---+---+
// |a |b |aa |bb |
// +---+---+---+---+
// |a1 |a2 |aa1|aa2|
// |a1 |a2 |bb1|bb2|
// |b1 |b2 |aa1|aa2|
// |b1 |b2 |bb1|bb2|
// |c1 |c2 |aa1|aa2|
// |c1 |c2 |bb1|bb2|
// +---+---+---+---+
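If one side is small, a broadcast hint keeps the cross join cheap; a sketch of the same idea used with broadcast in the array_intersect answer above:
import org.apache.spark.sql.functions.broadcast
// ship the small dataframe to every executor instead of shuffling both sides
val result = df1.crossJoin(broadcast(df2))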