parse dict key:value with pyspark (keys are not predefined) - pyspark

I have a Spark DataFrame that looks like:
[Row(id = '1', dictField={"keyA":"valueA","keyB":"valueB"}),
Row(id = '2', dictField={"keyC":"valueC","keyD":"valueD","keyA":"valueA"}),
Row(id = '3', dictField={"keyZ":"valueZ","keyA":"valueA"})]
I am trying to break it into the following format.
+---+----+------+
| id| key| value|
+---+----+------+
|  1|keyA|valueA|
|  1|keyB|valueB|
|  2|keyC|valueC|
|  2|keyD|valueD|
|  2|keyA|valueA|
|  3|keyZ|valueZ|
|  3|keyA|valueA|
+---+----+------+
Please note - the keys are not predefined/known.

If your column is in proper JSON format, then you can use a JSON parser (get_json_object) to solve the problem.
I have used a UDF and written the code in Spark Scala; you can refer to the code below. I have used the given schema:
|-- id: string (nullable = true)
|-- dictField: string (nullable = true)
val df = Seq(
  ("1", "{\"keyA\":\"valueA\",\"keyB\":\"valueB\"}"),
  ("2", "{\"keyC\":\"valueC\",\"keyD\":\"valueD\",\"keyA\":\"valueA\"}"),
  ("3", "{\"keyZ\":\"valueZ\",\"keyA\":\"valueA\"}")
).toDF("id", "dictField")
+---+-------------------------------------------------+
|id |dictField |
+---+-------------------------------------------------+
|1 |{"keyA":"valueA","keyB":"valueB"} |
|2 |{"keyC":"valueC","keyD":"valueD","keyA":"valueA"}|
|3 |{"keyZ":"valueZ","keyA":"valueA"} |
+---+-------------------------------------------------+
// UDF to parse the JSON-like string into an Array[(String, String)]
def parse_str(value: String) = {
  // strip braces and quotes, then split into "key:value" fragments
  val values = value.replace("{", "").replace("}", "").replace("\"", "").split(",").map(_.trim)
  values.foldLeft(Array[(String, String)]()) {
    case (acc, present) =>
      val Array(k, v) = present.split(":")
      acc :+ (k, v)
  }
}
// wrap the function in a UDF
val parsed_udf = udf(parse_str _)
// apply the UDF and explode the resulting array
val result = df.withColumn("parse", explode(parsed_udf($"dictField")))
// select the required columns
result.select($"id", $"parse._1".as("key"), $"parse._2".as("value")).show()
+---+----+------+
| id| key| value|
+---+----+------+
| 1|keyA|valueA|
| 1|keyB|valueB|
| 2|keyC|valueC|
| 2|keyD|valueD|
| 2|keyA|valueA|
| 3|keyZ|valueZ|
| 3|keyA|valueA|
+---+----+------+
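As an aside, since Spark 2.1 the same result can be obtained without a UDF by parsing the JSON string directly into a map and exploding it. A minimal sketch, assuming the df defined above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// parse the JSON string into a MapType column, then emit one (key, value) row per entry
val result2 = df
  .withColumn("parsed", from_json($"dictField", MapType(StringType, StringType)))
  .select($"id", explode($"parsed").as(Seq("key", "value")))
result2.show()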

This can be a working solution for you: parse the string into a MapType column with from_json, then use the explode function from the Spark API.
Create the DataFrame here:
import pyspark.sql.functions as F
import pyspark.sql.types as T

df_new = spark.createDataFrame(
    [str({"val1": "3", "val11": "31"}), str({"val2": "4", "val22": "44"})],
    T.StringType()
)
# Output
+----------------------------+
|value |
+----------------------------+
|{'val1': '3', 'val11': '31'}|
|{'val2': '4', 'val22': '44'}|
+----------------------------+
Logic Here
df_new = df_new.withColumn('col', F.from_json("value",T.MapType(T.StringType(), T.StringType())))
df_new = df_new.select("col", F.explode("col").alias("x", "y"))
df_new.show(truncate=False)
Output
+------------------------+-----+---+
|col |x |y |
+------------------------+-----+---+
|[val1 -> 3, val11 -> 31]|val1 |3 |
|[val1 -> 3, val11 -> 31]|val11|31 |
|[val2 -> 4, val22 -> 44]|val2 |4 |
|[val2 -> 4, val22 -> 44]|val22|44 |
+------------------------+-----+---+

Related

How do you split a column such that first half becomes the column name and the second the column value in Scala Spark?

I have a column which has value like
+----------------------+-----------------------------------------+
|UserId |col |
+----------------------+-----------------------------------------+
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
+----------------------+-----------------------------------------+
and what I want is something like this:
+----------------------+--------------------------------+
|UserId |firstname | lastname| middlename|
+----------------------+--------------------------------+
|1 |abc | null | null |
|2 |null | xyz | null |
|3 |pqr | zzz | null |
|4 |aaa | bbb | xxx |
+----------------------+--------------------------------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the max length of col
but as you may have guessed I get something like this as the output:
+----------------------+--------------------------------+
|UserId |col0 | col1 | col2 |
+----------------------+--------------------------------+
|1 |abc | null | null |
|2 |xyz | null | null |
|3 |pqr | zzz | null |
|4 |aaa | xxx | bbb |
+----------------------+--------------------------------+
NOTE: The above is just an example. There could be more additions to the columns, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on, for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve the result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
(1, "firstname=joe;age=33"),
(2, "lastname=smith;country=usa"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")
df.
select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
groupBy("user_id").pivot("key").agg(first("value").as("value")).
orderBy("user_id"). // only for ordered output
show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
Since your data is delimited by ; and your key/value pairs by =, you may consider using str_to_map as follows:
Create a temporary view of your data, e.g.
df.createOrReplaceTempView("my_table")
then run the following on your Spark session:
result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
SELECT
UserId,
str_to_map(col,';','=') full_name
FROM
my_table
)
SELECT
UserId,
full_name['firstname'] as firstname,
full_name['lastname'] as lastname,
full_name['middlename'] as middlename
FROM
split_data
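Putting it together, a hedged sketch of the full call (assuming a SparkSession named spark and the view registered above):
val result_df = spark.sql("""
  WITH split_data AS (
    SELECT UserId, str_to_map(col, ';', '=') AS full_name
    FROM my_table
  )
  SELECT
    UserId,
    full_name['firstname']  AS firstname,
    full_name['lastname']   AS lastname,
    full_name['middlename'] AS middlename
  FROM split_data
""")
result_df.show()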
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section:
Existence of duplicate keys in column key_values
Only duplicate key columns will be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
(1, "firstname=joe;age=33;moviegenre=comedy"),
(2, "lastname=smith;country=usa;moviegenre=drama"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")
val mainCols = df.columns diff Seq("key_values")
val dfNew = df.
withColumn("kv_arr", split($"key_values", ";")).
withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))
val dupeKeys = Seq("moviegenre") // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)
dfNew.select(
mainCols.map(col) ++
dupeKeys.map(col) ++
nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
).
orderBy("user_id"). // only for ordered output
show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
*/
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.
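For illustration, a hedged sketch (Spark 2.4+) of the exploded key/value pairs before the pivot; this is the stage where both moviegenre entries survive:
df.select($"user_id",
    explode(expr("transform(split(key_values, ';'), kv -> split(kv, '='))")).as("kv"))
  .select($"user_id", $"kv"(0).as("key"), $"kv"(1).as("value"))
  .show(false)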

Spark DF: create Seq column in withColumn

I have a df:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|2   |qwertyuio|
+----+---------+
and I want to repeat each row, dividing col2 into 3 substrings of length 3:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|1   |abc      |
|1   |def      |
|1   |ghi      |
|2   |qwertyuio|
|2   |qwe      |
|2   |rty      |
|2   |uio      |
+----+---------+
I was trying to create a new column of Seq containing Seq((col("col1"), substring(col("col2"),0,3)), ...):
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column, and reduce it, getting one final Seq. Then pass it to DF and append it to the initial df.
I am getting an error in the withColumn like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
You can use the Spark array function instead:
val df1 = df.union(
  df.select(
    $"col1",
    explode(array(
      // substring positions are 1-based in Spark SQL
      substring(col("col2"), 1, 3),
      substring(col("col2"), 4, 3),
      substring(col("col2"), 7, 3)
    )).as("col2")
  )
)
df1.show
+----+---------+
|col1|     col2|
+----+---------+
|   1|abcdefghi|
|   2|qwertyuio|
|   1|      abc|
|   1|      def|
|   1|      ghi|
|   2|      qwe|
|   2|      rty|
|   2|      uio|
+----+---------+
You can use a UDF also:
val df = spark.sparkContext.parallelize(Seq((1L,"abcdefghi"), (2L,"qwertyuio"))).toDF("col1","col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// UDF that splits the string into chunks of three characters
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+
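As an aside on the regex used above: "(?<=\\G...)" is a lookbehind anchored at \\G, the end of the previous match, so split breaks the string after every three characters. A quick sketch in the Scala REPL:
"abcdefghi".split("(?<=\\G...)")
// res0: Array[String] = Array(abc, def, ghi)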

Creating a new dataframe with many rows for each row in existing dataframe

I currently have a dataframe
df1 =
+-----+
| val|
+-----+
| 1|
| 2|
| 3|
....
| 2456|
+-----+
Each value corresponds to a single cell in a 3d cube.
I have a function findNeighbors which returns a list of the neighboring cubes, which I then map to df1 to get the neighbors of every row.
df2 = df1.map(row => findNeighbors(row(0).toInt))
This results in something like
df2 =
+---------------+
| neighbors|
+---------------+
| (1,2), (1, 7)|
| (2,1), (2, 3)|
.... etc
+---------------+
Where, for each row, for each Array in that row, the first item is the value of the cell and the second is the value of its neighbor.
I now want to create a new dataframe that takes all of those nested arrays and makes them rows like this:
finalDF =
+-----+------+
| cell|neighb|
+-----+------+
| 1| 2|
| 1| 7|
| 2| 1|
| 2| 3|
.... etc
+-----+------+
And this is where I am stuck
I tried using the code below, but I can't append to a local dataframe from within the foreach function.
var df: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)
val colNames = Seq("cell", "neighb")
neighborsDf.foreach(row => {
var rowDf: DataFrame = row.toDF(colNames: _*)
df.union(rowDf)
})
I'm sure there is a much better way to approach this problem, but I'm very new and very lost in scala/spark, and 10 hours of googling hasn't helped me.
Starting a little down the track, a somewhat similar example:
val df2 = df.select(explode($"neighbours").as("neighbours_flat"))
val df3 = df2.select(
  col("neighbours_flat").getItem(0).as("cell"),
  col("neighbours_flat").getItem(1).as("neighbour"))
df3.show(false)
Starting from the neighbours field definition:
+----------------+
|neighbours_flat |
+----------------+
|[[1, 2], [1, 7]]|
|[[2, 1], [2, 3]]|
+----------------+
results in:
+----+---------+
|cell|neighbour|
+----+---------+
|1 |2 |
|1 |7 |
|2 |1 |
|2 |3 |
+----+---------+
You need to have an array def and then use explode.
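For a self-contained run of the snippet above, a hedged sketch of the assumed input, with neighbours held as an array of [cell, neighbour] pairs (array<array<int>>):
val df = Seq(
  Seq(Seq(1, 2), Seq(1, 7)),
  Seq(Seq(2, 1), Seq(2, 3))
).toDF("neighbours")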

Replace words in Data frame using List of words in another Data frame in Spark Scala

I have two dataframes, let's say df1 and df2, in Spark Scala.
df1 has two fields, 'ID' and 'Text', where 'Text' has some description (multiple words). I have already removed all special characters and numeric characters from the 'Text' field, leaving only alphabets and spaces.
df1 Sample
+---+----------------+
|ID |Text            |
+---+----------------+
|1  |helo how are you|
|2  |hai haiden      |
|3  |hw are u uma    |
+---+----------------+
df2 contains a list of words and corresponding replacement words
df2 Sample
+----+-------+
|Word|Replace|
+----+-------+
|helo|hello  |
|hai |hi     |
|hw  |how    |
|u   |you    |
+----+-------+
I would need to find all occurrence of words in df2("Word") from df1("Text") and replace it with df2("Replace")
With the sample dataframes above, I would expect a resulting dataframe, DF3 as given below
df3 Sample
+---+-----------------+
|ID |Text             |
+---+-----------------+
|1  |hello how are you|
|2  |hi haiden        |
|3  |how are you uma  |
+---+-----------------+
Your help is greatly appreciated in doing the same in Spark using Scala.
It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following :
val keyVal = df2.map( r =>( r(0).toString, r(1).toString ) ).collect.toMap
This will give you a Map to refer to :
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can use a UDF to create a function that will utilize the keyVal Map to replace values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now, you can call the udf getVal on your dataframe to get the desired result.
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
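If the lookup table is small enough to collect but shared across many tasks, broadcasting it is a common refinement, so each executor receives one copy instead of serializing the Map with every task closure. A hedged sketch:
// broadcast the driver-side Map once to all executors
val bKeyVal = spark.sparkContext.broadcast(keyVal)
val getVal = udf { (text: String) =>
  // look each word up in the broadcast map, falling back to the word itself
  text.split(" ").map(w => bKeyVal.value.getOrElse(w, w)).mkString(" ")
}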
I will demonstrate only for the first id and assume that you cannot do a collect action on your df2. First you need to be sure that the schema of your dataframe has an array type for the text column of df1:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
After that you can do an explode on the text column
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
Assuming your replace dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Do a select with coalescing null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
and then group by id and aggregate with the collect_list function:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
Keep in mind that you should preserve the initial order of the text elements.
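One hedged way to honor that note: carry each word's position along with posexplode, then sort on it when re-assembling (column names ID/Text/Word/Replace as in the question):
import org.apache.spark.sql.functions._
val exploded = df1.select($"ID", posexplode(split($"Text", " ")).as(Seq("pos", "word")))
val replaced = exploded
  .join(df2, exploded("word") === df2("Word"), "left_outer")
  .select(exploded("ID"), exploded("pos"),
    coalesce(df2("Replace"), exploded("word")).as("replaced"))
val result = replaced
  .groupBy("ID")
  // sorting the (pos, word) structs restores the original word order
  .agg(sort_array(collect_list(struct($"pos", $"replaced"))).as("arr"))
  .select($"ID", concat_ws(" ", $"arr.replaced").as("Text"))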
I suppose the code below should solve your problem. I have solved this by using RDDs:
val wordRdd = df1.rdd.flatMap { row =>
  val wordList = row.getAs[String]("Text").split(" ").toList
  wordList.map { word => Row.fromTuple(row.getAs[Int]("id"), word) }
}.zipWithIndex()
val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))
val opRdd = wordDf.join(df2, wordDf("word") === df2("word"), "left_outer").drop(df2("word")).rdd
  .groupBy(_.getAs[Int]("id"))
  .map(x => Row.fromTuple(x._1, x._2.toList.sortBy(_.getAs[Long]("index"))
    .map(row => if (row.getAs[String]("Replace") != null) row.getAs[String]("Replace") else row.getAs[String]("word"))
    .mkString(" ")))
val opDF = sqlContext.createDataFrame(opRdd,
  StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))

Splitting row in multiple row in spark-shell

I have imported data into a Spark dataframe in spark-shell. The data is filled in like:
Col1 | Col2 | Col3 | Col4
A1 | 11 | B2 | a|b;1;0xFFFFFF
A1 | 12 | B1 | 2
A2 | 12 | B2 | 0xFFF45B
Here in Col4, the values are of different kinds and I want to separate them (suppose "a|b" is of the alphabet type, "1" or "2" is of the digit type, and "0xFFFFFF" or "0xFFF45B" is of the hexadecimal-number type):
So, the output should be :
Col1 | Col2 | Col3 | alphabets | digits | hexadecimal
A1 | 11 | B2 | a | 1 | 0xFFFFFF
A1 | 11 | B2 | b | 1 | 0xFFFFFF
A1 | 12 | B1 | | 2 |
A2 | 12 | B2 | | | 0xFFF45B
Hope I've made my query clear to you and I am using spark-shell. Thanks in advance.
Edit, after getting this answer about how to use backreferences in regexp_replace:
You can use regexp_replace with a backreference, then split twice and explode. It is, imo, cleaner than my original solution.
val df = List(
("A1" , "11" , "B2" , "a|b;1;0xFFFFFF"),
("A1" , "12" , "B1" , "2"),
("A2" , "12" , "B2" , "0xFFF45B")
).toDF("Col1" , "Col2" , "Col3" , "Col4")
val regExStr = "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$"
val res = df
.withColumn("backrefReplace",
split(regexp_replace('Col4,regExStr,"$1;$2;$3"),";"))
.select('Col1,'Col2,'Col3,
explode(split('backrefReplace(0),"\\|")).as("letter"),
'backrefReplace(1) .as("digits"),
'backrefReplace(2) .as("hexadecimal")
)
+----+----+----+------+------+-----------+
|Col1|Col2|Col3|letter|digits|hexadecimal|
+----+----+----+------+------+-----------+
| A1| 11| B2| a| 1| 0xFFFFFF|
| A1| 11| B2| b| 1| 0xFFFFFF|
| A1| 12| B1| | 2| |
| A2| 12| B2| | | 0xFFF45B|
+----+----+----+------+------+-----------+
You still need to replace the empty strings with null, though...
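A hedged sketch of that cleanup, mapping a when over every column of res:
// turn empty strings into nulls across all columns
val cleaned = res.select(res.columns.map(c =>
  when(col(c) === "", lit(null)).otherwise(col(c)).as(c)): _*)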
Previous Answer (somebody might still prefer it):
Here is a solution that sticks to DataFrames but is also quite messy. You can first use regexp_extract three times (possible to do less with backreference?), and finally split on "|" and explode. Note that you need a coalesce for explode to return everything (you still might want to change the empty strings in letter to null in this solution).
val res = df
.withColumn("alphabets", regexp_extract('Col4,"(^[A-z|]+)?",1))
.withColumn("digits", regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",2))
.withColumn("hexadecimal",regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",3))
.withColumn("letter",
explode(
split(
coalesce('alphabets,lit("")),
"\\|"
)
)
)
res.show
+----+----+----+--------------+---------+------+-----------+------+
|Col1|Col2|Col3| Col4|alphabets|digits|hexadecimal|letter|
+----+----+----+--------------+---------+------+-----------+------+
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| a|
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| b|
| A1| 12| B1| 2| null| 2| null| |
| A2| 12| B2| 0xFFF45B| null| null| 0xFFF45B| |
+----+----+----+--------------+---------+------+-----------+------+
Note: The regexp part could be so much better with backreference, so if somebody knows how to do it, please comment!
Not sure this is doable while staying 100% with DataFrames; here's a (somewhat messy?) solution using RDDs for the split itself:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// we switch to RDD to perform the split of Col4 into 3 columns
val rddWithSplitCol4 = input.rdd.map { r =>
val indexToValue = r.getAs[String]("Col4").split(';').map {
case s if s.startsWith("0x") => 2 -> s
case s if s.matches("\\d+") => 1 -> s
case s => 0 -> s
}
val newCols: Array[String] = indexToValue.foldLeft(Array.fill[String](3)("")) {
case (arr, (index, value)) => arr.updated(index, value)
}
(r.getAs[String]("Col1"), r.getAs[Int]("Col2"), r.getAs[String]("Col3"), newCols(0), newCols(1), newCols(2))
}
// switch back to Dataframe and explode alphabets column
val result = rddWithSplitCol4
.toDF("Col1", "Col2", "Col3", "alphabets", "digits", "hexadecimal")
.withColumn("alphabets", explode(split(col("alphabets"), "\\|")))
result.show(truncate = false)
// +----+----+----+---------+------+-----------+
// |Col1|Col2|Col3|alphabets|digits|hexadecimal|
// +----+----+----+---------+------+-----------+
// |A1 |11 |B2 |a |1 |0xFFFFFF |
// |A1 |11 |B2 |b |1 |0xFFFFFF |
// |A1 |12 |B1 | |2 | |
// |A2 |12 |B2 | | |0xFFF45B |
// +----+----+----+---------+------+-----------+