parse dict key:value with pyspark (keys are not predefined) - pyspark

I have a Spark DataFrame that looks like:
[Row(id = '1', dictField={"keyA":"valueA","keyB":"valueB"}),
Row(id = '2', dictField={"keyC":"valueC","keyD":"valueD","keyA":"valueA"}),
Row(id = '3', dictField={"keyZ":"valueZ","keyA":"valueA"})]
I am trying to break it into the following format.
+---+----+------+
| id| key| value|
+---+----+------+
|  1|keyA|valueA|
|  1|keyB|valueB|
|  2|keyC|valueC|
|  2|keyD|valueD|
|  2|keyA|valueA|
|  3|keyZ|valueZ|
|  3|keyA|valueA|
+---+----+------+
Please note - the keys are not predefined/known.

If your column is in proper JSON format, then you can use a JSON parser (get_json_object) to solve the problem.
I have used a UDF and written the code in Spark Scala; you can refer to the code below. I have used the given schema:
|-- id: string (nullable = true)
|-- dictField: string (nullable = true)
val df = Seq(
  ("1", "{\"keyA\":\"valueA\",\"keyB\":\"valueB\"}"),
  ("2", "{\"keyC\":\"valueC\",\"keyD\":\"valueD\",\"keyA\":\"valueA\"}"),
  ("3", "{\"keyZ\":\"valueZ\",\"keyA\":\"valueA\"}")
).toDF("id", "dictField")
+---+-------------------------------------------------+
|id |dictField |
+---+-------------------------------------------------+
|1 |{"keyA":"valueA","keyB":"valueB"} |
|2 |{"keyC":"valueC","keyD":"valueD","keyA":"valueA"}|
|3 |{"keyZ":"valueZ","keyA":"valueA"} |
+---+-------------------------------------------------+
// UDF to parse the JSON-like string into an Array[(String, String)]
def parse_str(value: String) = {
  // strip braces and quotes, then split into "key:value" fragments
  val values = value.replace("{", "").replace("}", "").replace("\"", "").split(",").map(_.trim)
  values.foldLeft(Array[(String, String)]()) {
    case (acc, present) =>
      val Array(k, v) = present.split(":")
      acc :+ (k, v)
  }
}
// wrap the function in a UDF
val parsed_udf = udf(parse_str _)
// apply the UDF and explode the resulting array
val result = df.withColumn("parse", explode(parsed_udf($"dictField")))
// select the required columns
result.select($"id", $"parse._1".as("key"), $"parse._2".as("value")).show()
+---+----+------+
| id| key| value|
+---+----+------+
| 1|keyA|valueA|
| 1|keyB|valueB|
| 2|keyC|valueC|
| 2|keyD|valueD|
| 2|keyA|valueA|
| 3|keyZ|valueZ|
| 3|keyA|valueA|
+---+----+------+
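As an aside, since Spark 2.1 the same result can be obtained without a UDF by parsing the JSON string directly into a map and exploding it. A minimal sketch, assuming the df defined above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// parse the JSON string into a MapType column, then emit one (key, value) row per entry
val result2 = df
  .withColumn("parsed", from_json($"dictField", MapType(StringType, StringType)))
  .select($"id", explode($"parsed").as(Seq("key", "value")))
result2.show()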

This can be a working solution for you: parse the string into a MapType column with from_json, then use the explode function from the Spark API.
Create the DataFrame here:
import pyspark.sql.functions as F
import pyspark.sql.types as T

df_new = spark.createDataFrame(
    [str({"val1": "3", "val11": "31"}), str({"val2": "4", "val22": "44"})],
    T.StringType()
)
# Output
+----------------------------+
|value |
+----------------------------+
|{'val1': '3', 'val11': '31'}|
|{'val2': '4', 'val22': '44'}|
+----------------------------+
Logic Here
df_new = df_new.withColumn('col', F.from_json("value",T.MapType(T.StringType(), T.StringType())))
df_new = df_new.select("col", F.explode("col").alias("x", "y"))
df_new.show(truncate=False)
Output
+------------------------+-----+---+
|col |x |y |
+------------------------+-----+---+
|[val1 -> 3, val11 -> 31]|val1 |3 |
|[val1 -> 3, val11 -> 31]|val11|31 |
|[val2 -> 4, val22 -> 44]|val2 |4 |
|[val2 -> 4, val22 -> 44]|val22|44 |
+------------------------+-----+---+

Related

How do you split a column such that first half becomes the column name and the second the column value in Scala Spark?

I have a column which has value like
+----------------------+-----------------------------------------+
|UserId |col |
+----------------------+-----------------------------------------+
|1 |firstname=abc |
|2 |lastname=xyz |
|3 |firstname=pqr;lastname=zzz |
|4 |firstname=aaa;middlename=xxx;lastname=bbb|
+----------------------+-----------------------------------------+
and what I want is something like this:
+----------------------+--------------------------------+
|UserId |firstname | lastname| middlename|
+----------------------+--------------------------------+
|1 |abc | null | null |
|2 |null | xyz | null |
|3 |pqr | zzz | null |
|4 |aaa | bbb | xxx |
+----------------------+--------------------------------+
I have already done this:
var new_df = df.withColumn("temp_new", split(col("col"), "\\;")).select(
(0 until numCols).map(i => split(col("temp_new").getItem(i), "=").getItem(1).as(s"col$i")): _*
)
where numCols is the max length of col
but as you may have guessed I get something like this as the output:
+----------------------+--------------------------------+
|UserId |col0 | col1 | col2 |
+----------------------+--------------------------------+
|1 |abc | null | null |
|2 |xyz | null | null |
|3 |pqr | zzz | null |
|4 |aaa | xxx | bbb |
+----------------------+--------------------------------+
NOTE: The above is just an example. There could be more additions to the columns, like firstname=aaa;middlename=xxx;lastname=bbb;age=20;country=India and so on, for around 40-50 column names and values. They are dynamic and I don't know most of them in advance.
I am looking for a way to achieve the result with Scala in Spark.
You could apply groupBy/pivot to generate key columns after converting the key/value-pairs string column into a Map column via SQL function str_to_map, as shown below:
val df = Seq(
(1, "firstname=joe;age=33"),
(2, "lastname=smith;country=usa"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe")
).toDF("user_id", "key_values")
df.
select($"user_id", explode(expr("str_to_map(key_values, ';', '=')"))).
groupBy("user_id").pivot("key").agg(first("value").as("value")).
orderBy("user_id"). // only for ordered output
show
/*
+-------+----+-------+---------+--------+
|user_id| age|country|firstname|lastname|
+-------+----+-------+---------+--------+
| 1| 33| null| joe| null|
| 2|null| usa| null| smith|
| 3| 44| aus| zoe| cooper|
| 4|null| null| john| doe|
+-------+----+-------+---------+--------+
*/
Since your data is delimited by ; and your key/value pairs by =, you may consider using str_to_map as follows:
Create a temporary view of your data, e.g.
df.createOrReplaceTempView("my_table")
then run the following on your Spark session:
result_df = sparkSession.sql("<insert sql below here>")
WITH split_data AS (
SELECT
UserId,
str_to_map(col,';','=') full_name
FROM
my_table
)
SELECT
UserId,
full_name['firstname'] as firstname,
full_name['lastname'] as lastname,
full_name['middlename'] as middlename
FROM
split_data
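Putting it together, a hedged sketch of the full call (assuming a SparkSession named spark and the view registered above):
val result_df = spark.sql("""
  WITH split_data AS (
    SELECT UserId, str_to_map(col, ';', '=') AS full_name
    FROM my_table
  )
  SELECT
    UserId,
    full_name['firstname']  AS firstname,
    full_name['lastname']   AS lastname,
    full_name['middlename'] AS middlename
  FROM split_data
""")
result_df.show()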
This solution is proposed in accordance with the expanded requirement described in the other answer's comments section:
Existence of duplicate keys in column key_values
Only duplicate key columns will be aggregated as ArrayType
There are probably other approaches. The solution below uses groupBy/pivot with collect_list, followed by extracting the single element (null if empty) from the non-duplicate key columns.
val df = Seq(
(1, "firstname=joe;age=33;moviegenre=comedy"),
(2, "lastname=smith;country=usa;moviegenre=drama"),
(3, "firstname=zoe;lastname=cooper;age=44;country=aus"),
(4, "firstname=john;lastname=doe;moviegenre=drama;moviegenre=comedy")
).toDF("user_id", "key_values")
val mainCols = df.columns diff Seq("key_values")
val dfNew = df.
withColumn("kv_arr", split($"key_values", ";")).
withColumn("kv", explode(expr("transform(kv_arr, kv -> split(kv, '='))"))).
groupBy("user_id").pivot($"kv"(0)).agg(collect_list($"kv"(1)))
val dupeKeys = Seq("moviegenre") // user-provided
val nonDupeKeys = dfNew.columns diff (mainCols ++ dupeKeys)
dfNew.select(
mainCols.map(col) ++
dupeKeys.map(col) ++
nonDupeKeys.map(k => when(size(col(k)) > 0, col(k)(0)).as(k)): _*
).
orderBy("user_id"). // only for ordered output
show
/*
+-------+---------------+----+-------+---------+--------+
|user_id| moviegenre| age|country|firstname|lastname|
+-------+---------------+----+-------+---------+--------+
| 1| [comedy]| 33| null| joe| null|
| 2| [drama]|null| usa| null| smith|
| 3| []| 44| aus| zoe| cooper|
| 4|[drama, comedy]|null| null| john| doe|
+-------+---------------+----+-------+---------+--------+
*/
Note that higher-order function transform is used to handle the key/value split, as SQL function str_to_map (used in the original solution) can't handle duplicate keys.
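For illustration, a hedged sketch (Spark 2.4+) of the exploded key/value pairs before the pivot; this is the stage where both moviegenre entries survive:
df.select($"user_id",
    explode(expr("transform(split(key_values, ';'), kv -> split(kv, '='))")).as("kv"))
  .select($"user_id", $"kv"(0).as("key"), $"kv"(1).as("value"))
  .show(false)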

Spark DF: create Seq column in withColumn

I have a df:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|2   |qwertyuio|
+----+---------+
and I want to repeat each row, dividing col2 into 3 substrings of length 3:
+----+---------+
|col1|col2     |
+----+---------+
|1   |abcdefghi|
|1   |abc      |
|1   |def      |
|1   |ghi      |
|2   |qwertyuio|
|2   |qwe      |
|2   |rty      |
|2   |uio      |
+----+---------+
I was trying to create a new column of Seq containing Seq((col("col1"), substring(col("col2"),0,3)), ...):
val df1 = df.withColumn("col3", Seq(
(col("col1"), substring(col("col2"),0,3)),
(col("col1"), substring(col("col2"),3,3)),
(col("col1"), substring(col("col2"),6,3)) ))
My idea was to select that new column, and reduce it, getting one final Seq. Then pass it to DF and append it to the initial df.
I am getting an error in the withColumn like:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon
You can use the Spark array function instead:
val df1 = df.union(
  df.select(
    $"col1",
    explode(array(
      // substring positions are 1-based in Spark SQL
      substring(col("col2"), 1, 3),
      substring(col("col2"), 4, 3),
      substring(col("col2"), 7, 3)
    )).as("col2")
  )
)
df1.show
+----+---------+
|col1|     col2|
+----+---------+
|   1|abcdefghi|
|   2|qwertyuio|
|   1|      abc|
|   1|      def|
|   1|      ghi|
|   2|      qwe|
|   2|      rty|
|   2|      uio|
+----+---------+
You can use a UDF also:
val df = spark.sparkContext.parallelize(Seq((1L,"abcdefghi"), (2L,"qwertyuio"))).toDF("col1","col2")
df.show(false)
// input
+----+---------+
|col1|col2 |
+----+---------+
|1 |abcdefghi|
|2 |qwertyuio|
+----+---------+
// UDF that splits the string into chunks of three characters
val getSeq = udf((col2: String) => col2.split("(?<=\\G...)"))
df.withColumn("col2", explode(getSeq($"col2")))
.union(df).show(false)
+----+---------+
|col1|col2 |
+----+---------+
|1 |abc |
|1 |ghi |
|1 |abcdefghi|
|1 |def |
|2 |qwe |
|2 |rty |
|2 |uio |
|2 |qwertyuio|
+----+---------+
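As an aside on the regex used above: "(?<=\\G...)" is a lookbehind anchored at \\G, the end of the previous match, so split breaks the string after every three characters. A quick sketch in the Scala REPL:
"abcdefghi".split("(?<=\\G...)")
// res0: Array[String] = Array(abc, def, ghi)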

Creating a new dataframe with many rows for each row in existing dataframe

I currently have a dataframe
df1 =
+-----+
| val|
+-----+
| 1|
| 2|
| 3|
....
| 2456|
+-----+
Each value corresponds to a single cell in a 3d cube.
I have a function findNeighbors which returns a list of the neighboring cubes, which I then map to df1 to get the neighbors of every row.
df2 = df1.map(row => findNeighbors(row(0).toInt))
This results in something like
df2 =
+---------------+
| neighbors|
+---------------+
| (1,2), (1, 7)|
| (2,1), (2, 3)|
.... etc
+---------------+
Where, for each row, for each Array in that row, the first item is the value of the cell and the second is the value of its neighbor.
I now want to create a new dataframe that takes all of those nested arrays and makes them rows like this:
finalDF =
+-----+------+
| cell|neighb|
+-----+------+
| 1| 2|
| 1| 7|
| 2| 1|
| 2| 3|
.... etc
+-----+------+
And this is where I am stuck
I tried using the code below, but I can't append to a local dataframe from within the foreach function.
var df: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)
val colNames = Seq("cell", "neighb")
neighborsDf.foreach(row => {
var rowDf: DataFrame = row.toDF(colNames: _*)
df.union(rowDf)
})
I'm sure there is a much better way to approach this problem, but I'm very new and very lost in scala/spark, and 10 hours of googling hasn't helped me.
Starting a little down the track, a somewhat similar example:
val df2 = df.select(explode($"neighbours").as("neighbours_flat"))
val df3 = df2.select(
  col("neighbours_flat").getItem(0).as("cell"),
  col("neighbours_flat").getItem(1).as("neighbour"))
df3.show(false)
Starting from the neighbours field definition:
+----------------+
|neighbours_flat |
+----------------+
|[[1, 2], [1, 7]]|
|[[2, 1], [2, 3]]|
+----------------+
results in:
+----+---------+
|cell|neighbour|
+----+---------+
|1 |2 |
|1 |7 |
|2 |1 |
|2 |3 |
+----+---------+
You need to have an array def and then use explode.
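For a self-contained run of the snippet above, a hedged sketch of the assumed input, with neighbours held as an array of [cell, neighbour] pairs (array<array<int>>):
val df = Seq(
  Seq(Seq(1, 2), Seq(1, 7)),
  Seq(Seq(2, 1), Seq(2, 3))
).toDF("neighbours")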

Replace words in Data frame using List of words in another Data frame in Spark Scala

I have two dataframes, let's say df1 and df2, in Spark Scala.
df1 has two fields, 'ID' and 'Text', where 'Text' has some description (multiple words). I have already removed all special characters and numeric characters from the 'Text' field, leaving only alphabets and spaces.
df1 Sample
+---+----------------+
|ID |Text            |
+---+----------------+
|1  |helo how are you|
|2  |hai haiden      |
|3  |hw are u uma    |
+---+----------------+
df2 contains a list of words and corresponding replacement words
df2 Sample
+----+-------+
|Word|Replace|
+----+-------+
|helo|hello  |
|hai |hi     |
|hw  |how    |
|u   |you    |
+----+-------+
I would need to find all occurrence of words in df2("Word") from df1("Text") and replace it with df2("Replace")
With the sample dataframes above, I would expect a resulting dataframe, DF3 as given below
df3 Sample
+---+-----------------+
|ID |Text             |
+---+-----------------+
|1  |hello how are you|
|2  |hi haiden        |
|3  |how are you uma  |
+---+-----------------+
Your help is greatly appreciated in doing the same in Spark using Scala.
It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following :
val keyVal = df2.map( r =>( r(0).toString, r(1).toString ) ).collect.toMap
This will give you a Map to refer to :
scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)
Now you can use a UDF to create a function that will utilize the keyVal Map to replace values:
val getVal = udf[String, String](x => x.split(" ").map(w => keyVal.getOrElse(w, w)).mkString(" "))
Now, you can call the udf getVal on your dataframe to get the desired result.
df1.withColumn("text" , getVal(df1("text")) ).show
+---+-----------------+
| id| text|
+---+-----------------+
| 1|hello how are you|
| 2| hi haiden|
| 3| how are you uma|
+---+-----------------+
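If the lookup table is small enough to collect but shared across many tasks, broadcasting it is a common refinement, so each executor receives one copy instead of serializing the Map with every task closure. A hedged sketch:
// broadcast the driver-side Map once to all executors
val bKeyVal = spark.sparkContext.broadcast(keyVal)
val getVal = udf { (text: String) =>
  // look each word up in the broadcast map, falling back to the word itself
  text.split(" ").map(w => bKeyVal.value.getOrElse(w, w)).mkString(" ")
}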
I will demonstrate only for the first id and assume that you cannot do a collect action on your df2. First you need to be sure that the schema of your dataframe has an array type for the text column of df1:
+---+--------------------+
| id| text|
+---+--------------------+
| 1|[helo, how, are, ...|
+---+--------------------+
with schema like this:
|-- id: integer (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
After that you can do an explode on the text column
res1.withColumn("text", explode(res1("text")))
+---+----+
| id|text|
+---+----+
| 1|helo|
| 1| how|
| 1| are|
| 1| you|
+---+----+
Assuming your replace dataframe looks like this:
+----+-------+
|word|replace|
+----+-------+
|helo| hello|
| hai| hi|
+----+-------+
Joining the two dataframes will look like this:
res6.join(res8, res6("text") === res8("word"), "left_outer")
+---+----+----+-------+
| id|text|word|replace|
+---+----+----+-------+
| 1| you|null| null|
| 1| how|null| null|
| 1|helo|helo| hello|
| 1| are|null| null|
+---+----+----+-------+
Do a select with coalescing null values:
res26.select(res26("id"), coalesce(res26("replace"), res26("text")).as("replaced_text"))
+---+-------------+
| id|replaced_text|
+---+-------------+
| 1| you|
| 1| how|
| 1| hello|
| 1| are|
+---+-------------+
and then group by id and aggregate with the collect_list function:
res33.groupBy("id").agg(collect_list("replaced_text"))
+---+---------------------------+
| id|collect_list(replaced_text)|
+---+---------------------------+
| 1| [you, how, hello,...|
+---+---------------------------+
Keep in mind that you should preserve the initial order of the text elements.
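One hedged way to honor that note: carry each word's position along with posexplode, then sort on it when re-assembling (column names ID/Text/Word/Replace as in the question):
import org.apache.spark.sql.functions._
val exploded = df1.select($"ID", posexplode(split($"Text", " ")).as(Seq("pos", "word")))
val replaced = exploded
  .join(df2, exploded("word") === df2("Word"), "left_outer")
  .select(exploded("ID"), exploded("pos"),
    coalesce(df2("Replace"), exploded("word")).as("replaced"))
val result = replaced
  .groupBy("ID")
  // sorting the (pos, word) structs restores the original word order
  .agg(sort_array(collect_list(struct($"pos", $"replaced"))).as("arr"))
  .select($"ID", concat_ws(" ", $"arr.replaced").as("Text"))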
I suppose the code below should solve your problem. I have solved this by using RDDs:
val wordRdd = df1.rdd.flatMap { row =>
  val wordList = row.getAs[String]("Text").split(" ").toList
  wordList.map { word => Row.fromTuple(row.getAs[Int]("id"), word) }
}.zipWithIndex()
val wordDf = sqlContext.createDataFrame(
  wordRdd.map(x => Row.fromSeq(x._1.toSeq ++ Seq(x._2))),
  StructType(List(StructField("id", IntegerType), StructField("word", StringType), StructField("index", LongType))))
val opRdd = wordDf.join(df2, wordDf("word") === df2("word"), "left_outer").drop(df2("word")).rdd
  .groupBy(_.getAs[Int]("id"))
  .map(x => Row.fromTuple(x._1, x._2.toList.sortBy(_.getAs[Long]("index"))
    .map(row => if (row.getAs[String]("Replace") != null) row.getAs[String]("Replace") else row.getAs[String]("word"))
    .mkString(" ")))
val opDF = sqlContext.createDataFrame(opRdd,
  StructType(List(StructField("id", IntegerType), StructField("Text", StringType))))

Splitting row in multiple row in spark-shell

I have imported data into a Spark dataframe in spark-shell. The data is filled in like:
Col1 | Col2 | Col3 | Col4
A1 | 11 | B2 | a|b;1;0xFFFFFF
A1 | 12 | B1 | 2
A2 | 12 | B2 | 0xFFF45B
Here in Col4, the values are of different kinds and I want to separate them (suppose "a|b" is of the alphabet type, "1" or "2" is of the digit type, and "0xFFFFFF" or "0xFFF45B" is of the hexadecimal-number type):
So, the output should be :
Col1 | Col2 | Col3 | alphabets | digits | hexadecimal
A1 | 11 | B2 | a | 1 | 0xFFFFFF
A1 | 11 | B2 | b | 1 | 0xFFFFFF
A1 | 12 | B1 | | 2 |
A2 | 12 | B2 | | | 0xFFF45B
Hope I've made my query clear to you and I am using spark-shell. Thanks in advance.
Edit, after getting this answer about how to use backreferences in regexp_replace:
You can use regexp_replace with a backreference, then split twice and explode. It is, imo, cleaner than my original solution.
val df = List(
("A1" , "11" , "B2" , "a|b;1;0xFFFFFF"),
("A1" , "12" , "B1" , "2"),
("A2" , "12" , "B2" , "0xFFF45B")
).toDF("Col1" , "Col2" , "Col3" , "Col4")
val regExStr = "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$"
val res = df
.withColumn("backrefReplace",
split(regexp_replace('Col4,regExStr,"$1;$2;$3"),";"))
.select('Col1,'Col2,'Col3,
explode(split('backrefReplace(0),"\\|")).as("letter"),
'backrefReplace(1) .as("digits"),
'backrefReplace(2) .as("hexadecimal")
)
+----+----+----+------+------+-----------+
|Col1|Col2|Col3|letter|digits|hexadecimal|
+----+----+----+------+------+-----------+
| A1| 11| B2| a| 1| 0xFFFFFF|
| A1| 11| B2| b| 1| 0xFFFFFF|
| A1| 12| B1| | 2| |
| A2| 12| B2| | | 0xFFF45B|
+----+----+----+------+------+-----------+
You still need to replace the empty strings with null, though...
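A hedged sketch of that cleanup, mapping a when over every column of res:
// turn empty strings into nulls across all columns
val cleaned = res.select(res.columns.map(c =>
  when(col(c) === "", lit(null)).otherwise(col(c)).as(c)): _*)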
Previous Answer (somebody might still prefer it):
Here is a solution that sticks to DataFrames but is also quite messy. You can first use regexp_extract three times (possible to do less with backreference?), and finally split on "|" and explode. Note that you need a coalesce for explode to return everything (you still might want to change the empty strings in letter to null in this solution).
val res = df
.withColumn("alphabets", regexp_extract('Col4,"(^[A-z|]+)?",1))
.withColumn("digits", regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",2))
.withColumn("hexadecimal",regexp_extract('Col4,"^([A-z|]+)?;?(\\d+)?;?(0x.*)?$",3))
.withColumn("letter",
explode(
split(
coalesce('alphabets,lit("")),
"\\|"
)
)
)
res.show
+----+----+----+--------------+---------+------+-----------+------+
|Col1|Col2|Col3| Col4|alphabets|digits|hexadecimal|letter|
+----+----+----+--------------+---------+------+-----------+------+
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| a|
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| b|
| A1| 12| B1| 2| null| 2| null| |
| A2| 12| B2| 0xFFF45B| null| null| 0xFFF45B| |
+----+----+----+--------------+---------+------+-----------+------+
Note: The regexp part could be so much better with backreference, so if somebody knows how to do it, please comment!
Not sure this is doable while staying 100% with DataFrames; here's a (somewhat messy?) solution using RDDs for the split itself:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// we switch to RDD to perform the split of Col4 into 3 columns
val rddWithSplitCol4 = input.rdd.map { r =>
val indexToValue = r.getAs[String]("Col4").split(';').map {
case s if s.startsWith("0x") => 2 -> s
case s if s.matches("\\d+") => 1 -> s
case s => 0 -> s
}
val newCols: Array[String] = indexToValue.foldLeft(Array.fill[String](3)("")) {
case (arr, (index, value)) => arr.updated(index, value)
}
(r.getAs[String]("Col1"), r.getAs[Int]("Col2"), r.getAs[String]("Col3"), newCols(0), newCols(1), newCols(2))
}
// switch back to Dataframe and explode alphabets column
val result = rddWithSplitCol4
.toDF("Col1", "Col2", "Col3", "alphabets", "digits", "hexadecimal")
.withColumn("alphabets", explode(split(col("alphabets"), "\\|")))
result.show(truncate = false)
// +----+----+----+---------+------+-----------+
// |Col1|Col2|Col3|alphabets|digits|hexadecimal|
// +----+----+----+---------+------+-----------+
// |A1 |11 |B2 |a |1 |0xFFFFFF |
// |A1 |11 |B2 |b |1 |0xFFFFFF |
// |A1 |12 |B1 | |2 | |
// |A2 |12 |B2 | | |0xFFF45B |
// +----+----+----+---------+------+-----------+