I'm working with a dataframe that contains a column holding a JSON array, which looks like this:
| col1 | col2 |
| ---- | --------------------------------------------------------------------- |
| aaaa | {"k1":"v1", "k2":{"k3":"v3"}},{"k1":"v2", "k2":{"k3":"v4"}} |
| bbbb | {"k11":"v11", "k21":{"k31":"v31"}},{"k11":"v21", "k21":{"k31":"v41"}} |
col2 here is a string containing a nested JSON array. My goal is to convert col2 from a string to an array so I can apply PySpark's explode function to it and get:
| col1 | col2 |
| ---- | ---------------------------------- |
| aaaa | {"k1":"v1", "k2":{"k3":"v3"}} |
| aaaa | {"k1":"v2", "k2":{"k3":"v4"}} |
| bbbb | {"k11":"v11", "k21":{"k31":"v31"}} |
| bbbb | {"k11":"v21", "k21":{"k31":"v41"}} |
I've been stuck for a while; any help will be appreciated, thanks in advance!
Use this:
import pyspark.sql.functions as f

df = (
    df
    # split on the "},{" between the two objects; the split consumes those braces
    .withColumn('col2', f.split(f.col('col2'), r'\},\{'))
    # re-attach the braces the split removed: the first element lost its closing "}",
    # the second lost its opening "{" (rows with more than two objects would need both)
    .withColumn('col2', f.expr('transform(col2, (element, idx) -> case when idx = 0 then concat(element, "}") else concat("{", element) end)'))
    .withColumn('col2', f.explode(f.col('col2')))
)
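Alternatively, a minimal sketch that splits on the comma between objects without consuming the braces, using a lookaround regex (like the code above, it assumes "},{" never occurs inside a nested value):
import pyspark.sql.functions as f

df = (
    df
    # split on a comma that is preceded by "}" and followed by "{", keeping both braces
    .withColumn('col2', f.split(f.col('col2'), r'(?<=\}),(?=\{)'))
    .withColumn('col2', f.explode(f.col('col2')))
)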
I want to explode a string column based on a specific delimiter (| in my case).
I have a dataset like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa|bb |
| 2 | cc |
| 3 | dd|ee |
| 4 | ff |
+-----+--------------+
I want an output like this:
+-----+--------------+
|Col_1|Col_2 |
+-----+--------------+
| 1 | aa |
| 1 | bb |
| 2 | cc |
| 3 | dd |
| 3 | ee |
| 4 | ff |
+-----+--------------+
Use the explode and split functions, and use \\ to escape |.
val df1 = df.select(col("Col_1"), explode(split(col("Col_2"),"\\|")).as("Col_2"))
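For reference, a minimal PySpark sketch of the same approach (assuming the column names shown above):
from pyspark.sql import functions as F

df1 = df.select(
    "Col_1",
    # split on the literal "|" (escaped, since split takes a regex) and explode
    F.explode(F.split(F.col("Col_2"), r"\|")).alias("Col_2"),
)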
I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic that gets the keywords with maximum length for each session id. There are multiple keywords that should be part of the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later on, I tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partly.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id",$"value",$"Timestamp").withColumn("col_length",length($"value"))
val ts = Window
  .orderBy("session_id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
  .withColumn("running_max", max("col_length") over ts)
  .where($"running_max" === $"col_length")
  .select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one row per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// partition by session and order by time, so lead looks at the next keystroke within the same session
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")

dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  // keep a row when the next value is shorter (the word "resets") or when it is the last row of the session
  .filter((length(col("lead")) < length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show
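For readers on PySpark, a sketch of the same lead-based logic (column names taken from Table 1):
from pyspark.sql import Window, functions as F

w = Window.partitionBy("session_id").orderBy("Timestamp")
result = (
    dfM
    .withColumn("lead", F.lead("value", 1).over(w))
    # keep local maxima: the next value is shorter, or this is the last row of the session
    .filter((F.length("lead") < F.length("value")) | F.col("lead").isNull())
    .drop("lead")
)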
I have two dataframes in PySpark. One has more than 1000 rows and the other only 4 rows. The columns also do not match.
df1 with more than 1000 rows:
+----+--------+--------------+-------------+
| ID | col1 | col2 | col 3 |
+----+--------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 |
| 2 | time 2 | value2_col2 | value2_col3 |
+----+--------+--------------+-------------+
...
df2 with only 4 rows:
+-----+--------------+--------------+
| key | col_c | col_d |
+-----+--------------+--------------+
| a | valuea_colc | valuea_cold |
| b | valueb_colc | valueb_cold |
+-----+--------------+--------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| ID | col1 | col2 | col 3 | a_col_c | a_col_d | b_col_c | b_col_d |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
| 1 | time1 | value_col2 | value_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2 | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+--------------+---------------+--------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!
I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.
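For reference, a minimal sketch of what that can look like (the pivot step is my assumption about how df2's rows were flattened into the a_col_c ... b_col_d columns):
from pyspark.sql import functions as F

# flatten df2 (4 rows) into a single row with one column per key/value pair;
# with several aggregations, pivot names the columns "<key>_<alias>", e.g. a_col_c
df2_wide = (
    df2
    .groupBy()
    .pivot("key")
    .agg(F.first("col_c").alias("col_c"), F.first("col_d").alias("col_d"))
)

# a cross join attaches that single wide row to every row of df1
result = df1.crossJoin(df2_wide)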
In PySpark, I have dataframe_a with:
+-----------+----------------------+
| str1 | array_of_str |
+-----------+----------------------+
| John | [mango, apple] |
| Tom | [mango, orange] |
| Matteo | [apple, banana] |
and dataframe_b with
+-----------+----------------------+
| key | value |
+-----------+----------------------+
| mango | 1 |
| apple | 2 |
| orange | 3 |
and I want to create a new column of type Array joined_result that maps each element in array_of_str (dataframe_a) to its value in dataframe_b, such as:
+-----------+----------------------+----------------------------------+
| str1 | array_of_str | joined_result |
+-----------+----------------------+----------------------------------+
| John | [mango, apple] | [1, 2] |
| Tom | [mango, orange] | [1, 3] |
| Matteo | [apple, banana] | [2] |
I'm not sure how to do it. I know I can use a UDF with a lambda function, but I can't manage to make it work :( Help!
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType
# START EXTRACT OF CODE
ret = (df
.select(['str1', 'array_of_str'])
.withColumn('joined_result', F.udf(
map(lambda x: ??????, ArrayType(StringType))
)
)
return ret
# END EXTRACT OF CODE
My answer to your question:
# dataframe_b is small, so collect it into an in-memory lookup dict
lookup_list = map(lambda row: row.asDict(), dataframe_b.collect())
lookup_dict = {lookup['key']: lookup['value'] for lookup in lookup_list}

def mapper(keys):
    # skip keys that have no match in dataframe_b (e.g. "banana")
    return [lookup_dict[key] for key in keys if key in lookup_dict]

# use the element type of dataframe_b's value column in ArrayType
dataframe_a = dataframe_a.withColumn('joined_result', F.udf(mapper, ArrayType(StringType()))('array_of_str'))
It works as you want :-)
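If you want to avoid a UDF, a join-based sketch is also possible (caveats: the inner join drops rows whose array matches nothing at all, and collect_list does not guarantee element order):
from pyspark.sql import functions as F

result = (
    dataframe_a
    # one row per array element, then look each element up in dataframe_b
    .withColumn('item', F.explode('array_of_str'))
    .join(dataframe_b, F.col('item') == F.col('key'), 'inner')
    .groupBy('str1')
    .agg(
        F.first('array_of_str').alias('array_of_str'),
        F.collect_list('value').alias('joined_result'),
    )
)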
I am relatively new to Spark and Scala. I have a dataframe which has the following format:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS | Col_7 |
| 1234 | AAAA | 1111 | afsdf | ewqre | 1970-01-01 00:00:00.0 | false |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true |
| 1234 | AAAA | 1111 | dafsd | afwew | 2015-01-17 07:09:32.748 | false |
| 5678 | BBBB | 2222 | afsdf | qwerq | 1970-01-01 00:00:00.0 | true |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04 | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0 | false |
What I need to do is to get the row that corresponds to the latest timestamp.
In the example above, the keys are Col1, Col2 and Col3. Col_TS represents the timestamp and Col_7 is a boolean that determines the validity of the record.
What I want to do is to find a way to group these records based on the keys and retain the one that has the latest timestamp.
So the output of the operation in the dataframe above should be:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS | Col_7 |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04 | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0 | false |
I came up with a partial solution, but this way I can only return a dataframe with the key columns on which the records are grouped, and not the other columns.
df = df.groupBy("Col1","Col2","Col3").agg(max("Col_TS"))
| Col1 | Col2 | Col3 | max(Col_TS) |
| 1234 | AAAA | 1111 | 2017-01-17 07:09:32.748 |
| 5678 | BBBB | 2222 | 2016-12-08 07:58:43.04 |
| 9101 | CCCC | 3333 | 1970-01-01 00:00:00.0 |
Can someone help me in coming up with a Scala code for performing this operation?
You can use a window function as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// within each key group, order by timestamp descending so first() picks the latest timestamp
val windowSpec = Window.partitionBy("Col1", "Col2", "Col3").orderBy(col("Col_TS").desc)

df.withColumn("maxTS", first("Col_TS").over(windowSpec))
  .where(col("maxTS") === col("Col_TS"))
  .drop("maxTS")
  .show(false)
You should get output like the following:
+----+----+----+-----+-----+-----------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5|Col_TS                 |Col_7|
+----+----+----+-----+-----+-----------------------+-----+
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:43.04 |false|
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:32.748|true |
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:00.0  |false|
+----+----+----+-----+-----+-----------------------+-----+
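For readers on PySpark, a sketch of the same window idea (column names from the question):
from pyspark.sql import Window, functions as F

w = Window.partitionBy("Col1", "Col2", "Col3").orderBy(F.col("Col_TS").desc())
result = (
    df
    .withColumn("maxTS", F.first("Col_TS").over(w))
    .where(F.col("maxTS") == F.col("Col_TS"))
    .drop("maxTS")
)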
One option is to first order the data frame by Col_TS, then group by Col1, Col2 and Col3 and take the last item from each of the other columns:
val val_columns = Seq("Col_4", "Col_5", "Col_TS", "Col_7").map(x => last(col(x)).alias(x))
df.orderBy("Col_TS")
  .groupBy("Col1", "Col2", "Col3")
  .agg(val_columns.head, val_columns.tail: _*)
  .show()
+----+----+----+-----+-----+--------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5| Col_TS|Col_7|
+----+----+----+-----+-----+--------------------+-----+
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:...| true|
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:...|false|
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:...|false|
+----+----+----+-----+-----+--------------------+-----+