Explode multiple columns into separate rows in Spark Scala

I have a DataFrame with the following structure:
Col1                  Col2                  Col3
Data1Col1,Data2Col1   Data1Col2,Data2Col2   Data1Col3,Data2Col3
I want the resultant dataset to be of the following type:
Col1       Col2       Col3
Data1Col1  Data1Col2  Data1Col3
Data2Col1  Data2Col2  Data2Col3
Please suggest how to approach this. I have tried explode, but that results in duplicate rows.

val df = Seq(("C,D,E,F","M,N,O,P","K,P,B,P")).toDF("Col1","Col2","Col3")
df.show
+-------+-------+-------+
| Col1| Col2| Col3|
+-------+-------+-------+
|C,D,E,F|M,N,O,P|K,P,B,P|
+-------+-------+-------+
val res1 = df
  .withColumn("Col1", split(col("Col1"), ","))
  .withColumn("Col2", split(col("Col2"), ","))
  .withColumn("Col3", split(col("Col3"), ","))
res1.show
+------------+------------+------------+
| Col1| Col2| Col3|
+------------+------------+------------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|
+------------+------------+------------+
val zip = udf((x: Seq[String], y: Seq[String], z: Seq[String]) => z.zip(x.zip(y)))
val res14 = res1.withColumn("test", explode(zip(col("Col1"), col("Col2"), col("Col3"))))
res14.show
+------------+------------+------------+-----------+
| Col1| Col2| Col3| test|
+------------+------------+------------+-----------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[K, [C, M]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [D, N]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[B, [E, O]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [F, P]]|
+------------+------------+------------+-----------+
res14.withColumn("t3",col("test._1")).withColumn("tn",col("test._2")).withColumn("t2",col("tn._2")).withColumn("t1",col("tn._1")).select("t1","t2","t3").show
+---+---+---+
| t1| t2| t3|
+---+---+---+
| C| M| K|
| D| N| P|
| E| O| B|
| F| P| P|
+---+---+---+
res1 - DataFrame with the split (array) columns
res14 - intermediate DataFrame after zipping and exploding
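If you are on Spark 2.4 or later, here is a sketch of an alternative (an addition, not part of the answer above) that skips the UDF entirely by using the built-in arrays_zip. When arrays_zip is given plain column references, the struct fields it produces carry the input column names, so they can be selected straight back out:

import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// zip the three array columns element-wise, explode one struct per position,
// then pull the fields back out as the original column names
val res2 = res1
  .withColumn("z", explode(arrays_zip(col("Col1"), col("Col2"), col("Col3"))))
  .select(col("z.Col1").as("Col1"), col("z.Col2").as("Col2"), col("z.Col3").as("Col3"))

res2.show
// should yield one row per position: (C, M, K), (D, N, P), (E, O, B), (F, P, P)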

Related

Programmatically extract columns from Struct column as individual columns

I have a dataframe as follows
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val initialData = Seq(
  Row("ABC1", List(Row("Java", "XX", 120), Row("Scala", "XA", 300))),
  Row("Michael", List(Row("Java", "XY", 200), Row("Scala", "XB", 500))),
  Row("Robert", List(Row("Java", "XZ", 400), Row("Scala", "XC", 250)))
)

val arrayStructSchema = new StructType().add("name", StringType)
  .add("SortedDataSet", ArrayType(new StructType()
    .add("name", StringType)
    .add("author", StringType)
    .add("pages", IntegerType)))

val df = spark
  .createDataFrame(spark.sparkContext.parallelize(initialData), arrayStructSchema)
df.printSchema()
df.show(5, false)
+-------+-----------------------------------+
|name |SortedDataSet |
+-------+-----------------------------------+
|ABC1 |[[Java, XX, 120], [Scala, XA, 300]]|
|Michael|[[Java, XY, 200], [Scala, XB, 500]]|
|Robert |[[Java, XZ, 400], [Scala, XC, 250]]|
+-------+-----------------------------------+
I need to extract each element of the struct as an individual indexed column
Right now, I'm doing the following
val newDf = df.withColumn("Col1", sort_array('SortedDataSet).getItem(0))
.withColumn("Col2", sort_array('SortedDataSet).getItem(1))
.withColumn("name_1",$"Col1.name")
.withColumn("author_1",$"Col1.author")
.withColumn("pages_1",$"Col1.pages")
.withColumn("name_2",$"Col2.name")
.withColumn("author_2",$"Col2.author")
.withColumn("pages_2",$"Col2.pages")
This is simple as I have only 2 arrays and 5 columns. What do I do when I have multiple arrays and columns?
How can I do this programmatically?
One approach would be to flatten the dataframe to generate indexed array elements using posexplode, followed by a groupBy and pivot on the generated indices, like below:
Given the sample dataset:
df.show(false)
// +-------+--------------------------------------------------+
// |name |SortedDataSet |
// +-------+--------------------------------------------------+
// |ABC1 |[[Java, XX, 120], [Scala, XA, 300]] |
// |Michael|[[Java, XY, 200], [Scala, XB, 500], [Go, XD, 600]]|
// |Robert |[[Java, XZ, 400], [Scala, XC, 250]] |
// +-------+--------------------------------------------------+
Note that I've slightly generalized the sample data to showcase arrays with uneven sizes.
val flattenedDF = df.
  select($"name", posexplode($"SortedDataSet")).
  groupBy($"name").pivot($"pos" + 1).agg(
    first($"col.name").as("name"),
    first($"col.author").as("author"),
    first($"col.pages").as("pages")
  )
flattenedDF.show
// +-------+------+--------+-------+------+--------+-------+------+--------+-------+
// | name|1_name|1_author|1_pages|2_name|2_author|2_pages|3_name|3_author|3_pages|
// +-------+------+--------+-------+------+--------+-------+------+--------+-------+
// | ABC1| Java| XX| 120| Scala| XA| 300| null| null| null|
// |Michael| Java| XY| 200| Scala| XB| 500| Go| XD| 600|
// | Robert| Java| XZ| 400| Scala| XC| 250| null| null| null|
// +-------+------+--------+-------+------+--------+-------+------+--------+-------+
To revise the column names created by pivot to the wanted names:
val pattern = "^\\d+_.*"
val flattenedCols = flattenedDF.columns.filter(_ matches pattern)
def colRenamed(c: String): String =
c.split("_", 2).reverse.mkString("_") // Split on first "_" and switch segments
flattenedDF.
select($"name" +: flattenedCols.map(c => col(c).as(colRenamed(c))): _*).
show
// +-------+------+--------+-------+------+--------+-------+------+--------+-------+
// | name|name_1|author_1|pages_1|name_2|author_2|pages_2|name_3|author_3|pages_3|
// +-------+------+--------+-------+------+--------+-------+------+--------+-------+
// | ABC1| Java| XX| 120| Scala| XA| 300| null| null| null|
// |Michael| Java| XY| 200| Scala| XB| 500| Go| XD| 600|
// | Robert| Java| XZ| 400| Scala| XC| 250| null| null| null|
// +-------+------+--------+-------+------+--------+-------+------+--------+-------+
If your arrays have the same size, you can avoid doing an expensive explode, group by and pivot, by selecting the array and struct elements dynamically:
val arrSize = df.select(size(col("SortedDataSet"))).first().getInt(0)
val df2 = (1 to arrSize).foldLeft(df)(
  (d, i) =>
    d.withColumn(
      s"Col$i",
      sort_array(col("SortedDataSet"))(i - 1)
    )
)
val colNames = df.selectExpr("SortedDataSet[0] as tmp").select("tmp.*").columns
// colNames: Array[String] = Array(name, author, pages)
val colList = (1 to arrSize).map("Col" + _ + ".*").toSeq
// colList: scala.collection.immutable.Seq[String] = Vector(Col1.*, Col2.*)
val colRename = df2.columns ++ (
for {x <- (1 to arrSize); y <- colNames}
yield (x,y)
).map(
x => x._2 + "_" + x._1
).toArray[String]
// colRename: Array[String] = Array(name, SortedDataSet, Col1, Col2, name_1, author_1, pages_1, name_2, author_2, pages_2)
val newDf = df2.select("*", colList: _*).toDF(colRename: _*)
newDf.show(false)
+-------+-----------------------------------+---------------+----------------+------+--------+-------+------+--------+-------+
|name |SortedDataSet |Col1 |Col2 |name_1|author_1|pages_1|name_2|author_2|pages_2|
+-------+-----------------------------------+---------------+----------------+------+--------+-------+------+--------+-------+
|ABC1 |[[Java, XX, 120], [Scala, XA, 300]]|[Java, XX, 120]|[Scala, XA, 300]|Java |XX |120 |Scala |XA |300 |
|Michael|[[Java, XY, 200], [Scala, XB, 500]]|[Java, XY, 200]|[Scala, XB, 500]|Java |XY |200 |Scala |XB |500 |
|Robert |[[Java, XZ, 400], [Scala, XC, 250]]|[Java, XZ, 400]|[Scala, XC, 250]|Java |XZ |400 |Scala |XC |250 |
+-------+-----------------------------------+---------------+----------------+------+--------+-------+------+--------+-------+
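If the array sizes could differ between rows, one small tweak (my assumption, not part of the original answer) is to derive arrSize from the largest array instead of the first row; indices past the end of a shorter array should then come back as null structs:

import org.apache.spark.sql.functions.{col, max, size}

// size of the longest SortedDataSet array across all rows
val arrSize = df.agg(max(size(col("SortedDataSet")))).first().getInt(0)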

Collect most occurring unique values across columns after a groupby in Spark

I have the following dataframe
val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("XX","j","i","h",null))
.toDF("main","value1","value2","value3","value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
| ZZ| a| a| b| b|
| ZZ| a| b| c| d|
| YY| b| e| null| f|
| YY| b| b| null| f|
| XX| j| i| h| null|
+----+------+------+------+------+
I need to group by the main column and pick the two most occurring values from the remaining columns for each main value.
I did the following:
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
val newdf2 = newdf.groupBy('main).agg(collect_set('values).alias("values"))
val newdf3 = newdf2.select('main, flatten($"values").alias("values"))
To get the data in the following form
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j, i, h,]|
+----+--------------------+
Now I need to pick the two most occurring items from the list as two columns. I'm not sure how to do that.
So, in this case the expected output should be
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| i|
+----+------+------+
Nulls should not be counted, and the final values should be null only if there are no other values to fill.
Is this the best way to do things? Is there a better way of doing it?
You can use a UDF to select the two values from the array that occur most often.
input.withColumn("values", array("value1", "value2", "value3", "value4"))
.groupBy("main").agg(flatten(collect_list("values")).as("values"))
.withColumn("max", maxUdf('values)) //(1)
.cache() //(2)
.withColumn("value1", 'max.getItem(0))
.withColumn("value2", 'max.getItem(1))
.drop("values", "max")
.show(false)
with maxUdf being defined as
def getMax[T](array: Seq[T]) = {
  array
    .filter(_ != null)                      // remove null values
    .groupBy(identity).mapValues(_.length)  // count occurrences of each value
    .toSeq.sortWith(_._2 > _._2)            // sort by count, descending (3)
    .map(_._1).take(2)                      // return the two (or one) most common values
}

val maxUdf = udf(getMax[String] _)
Remarks:
(1) Using a UDF here means that the whole array with all entries for a single value of main has to fit into the memory of one Spark executor.
(2) cache is required here, or the UDF will be called twice, once for value1 and once for value2.
(3) The sortWith here is stable, but it might be necessary to add some extra logic to handle the situation where two elements have the same number of occurrences (like i, j and h for the main value XX); a sketch of one option follows below.
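For example, a minimal sketch of deterministic tie-breaking (my addition, not from the original answer): sort by count descending and then by the value itself.

import org.apache.spark.sql.functions.udf

def getMaxDeterministic(array: Seq[String]): Seq[String] =
  array
    .filter(_ != null)
    .groupBy(identity).mapValues(_.length)
    .toSeq
    .sortBy { case (value, count) => (-count, value) }  // highest count first, ties broken alphabetically
    .map(_._1)
    .take(2)

val maxUdfDeterministic = udf(getMaxDeterministic _)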
Here is my attempt without a UDF.
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy('main).orderBy('count.desc)

newdf3.withColumn("values", explode('values))
  .groupBy('main, 'values).agg(count('values).as("count"))
  .filter("values is not null")
  .withColumn("target", concat(lit("value"), lit(row_number().over(w))))
  .filter("target < 'value3'")  // string comparison keeps only value1 and value2
  .groupBy('main).pivot('target).agg(first('values)).show
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| null|
+----+------+------+
The last row has a null value because I modified your dataframe in this way:
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j,,,]| <- For null test
+----+--------------------+

Update values in a column based on values of another data frame's column values in PySpark

I have two data frames in PySpark: df1
+---+-----------------+
|id1| items1|
+---+-----------------+
| 0| [B, C, D, E]|
| 1| [E, A, C]|
| 2| [F, A, E, B]|
| 3| [E, G, A]|
| 4| [A, C, E, B, D]|
+---+-----------------+
and df2:
+---+-----------------+
|id2| items2|
+---+-----------------+
|001| [B]|
|002| [A]|
|003| [C]|
|004| [E]|
+---+-----------------+
I would like to create a new column in df1 that updates the values in the items1 column, so that it only keeps values that also appear (in any row of) items2 in df2. The result should look as follows:
+---+-----------------+----------------------+
|id1| items1| items1_updated|
+---+-----------------+----------------------+
| 0| [B, C, D, E]| [B, C, E]|
| 1| [E, A, C]| [E, A, C]|
| 2| [F, A, E, B]| [A, E, B]|
| 3| [E, G, A]| [E, A]|
| 4| [A, C, E, B, D]| [A, C, E, B]|
+---+-----------------+----------------------+
I would normally use collect() to get a list of all values in items2 column and then use a udf applied to each row in items1 to get an intersection. But the data is extremely large (over 10 million rows) and I cannot use collect() to get such list. Is there a way to do this while keeping data in a data frame format? Or some other way without using collect()?
The first thing you want to do is explode the values in df2.items2 so that contents of the arrays will be on separate rows:
from pyspark.sql.functions import explode
df2 = df2.select(explode("items2").alias("items2"))
df2.show()
#+------+
#|items2|
#+------+
#| B|
#| A|
#| C|
#| E|
#+------+
(This assumes that the values in df2.items2 are distinct; if not, you would need to add df2 = df2.distinct().)
Option 1: Use crossJoin:
Now you can crossJoin the new df2 back to df1 and keep only the rows where df1.items1 contains an element in df2.items2. We can achieve this using pyspark.sql.functions.array_contains and this trick that allows us to use a column value as a parameter.
After filtering, group by id1 and items1 and aggregate using pyspark.sql.functions.collect_list:
from pyspark.sql.functions import expr, collect_list
df1.alias("l").crossJoin(df2.alias("r"))\
.where(expr("array_contains(l.items1, r.items2)"))\
.groupBy("l.id1", "l.items1")\
.agg(collect_list("r.items2").alias("items1_updated"))\
.show()
#+---+---------------+--------------+
#|id1| items1|items1_updated|
#+---+---------------+--------------+
#| 1| [E, A, C]| [A, C, E]|
#| 0| [B, C, D, E]| [B, C, E]|
#| 4|[A, C, E, B, D]| [B, A, C, E]|
#| 3| [E, G, A]| [A, E]|
#| 2| [F, A, E, B]| [B, A, E]|
#+---+---------------+--------------+
Option 2: Explode df1.items1 and left join:
Another option is to explode the contents of items1 in df1 and do a left join. After the join, we have to do a similar group by and aggregation as above. This works because collect_list will ignore the null values introduced by the non-matching rows.
df1.withColumn("items1", explode("items1")).alias("l")\
.join(df2.alias("r"), on=expr("l.items1=r.items2"), how="left")\
.groupBy("l.id1")\
.agg(
collect_list("l.items1").alias("items1"),
collect_list("r.items2").alias("items1_updated")
).show()
#+---+---------------+--------------+
#|id1| items1|items1_updated|
#+---+---------------+--------------+
#| 0| [E, B, D, C]| [E, B, C]|
#| 1| [E, C, A]| [E, C, A]|
#| 3| [E, A, G]| [E, A]|
#| 2| [F, E, B, A]| [E, B, A]|
#| 4|[E, B, D, C, A]| [E, B, C, A]|
#+---+---------------+--------------+
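If you are on Spark 2.4 or later, a third option (a sketch of my own, not from the original answer) is to collapse the original df2 (before the explode above) into a single-row array of distinct items, cross join that one row onto df1, and let array_intersect do the filtering, which avoids exploding df1; the lookup and all_items2 names are just placeholders:

from pyspark.sql.functions import array_intersect, collect_set, explode

# one row holding the distinct set of all items in df2.items2
lookup = df2.select(explode("items2").alias("item")).agg(collect_set("item").alias("all_items2"))

df1.crossJoin(lookup)\
    .withColumn("items1_updated", array_intersect("items1", "all_items2"))\
    .drop("all_items2")\
    .show()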

How to index Spark CoreNLP analysis?

I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER analysis and found it works well. However, I want to extend the simple example so that I can map the analysis back to an original dataframe id. See below, where I have added two more rows to the simple example.
val input = Seq(
(1, "<xml>Apple is located in California. It is a great company.</xml>"),
(2, "<xml>Google is located in California. It is a great company.</xml>"),
(3, "<xml>Netflix is located in California. It is a great company.</xml>")
).toDF("id", "text")
input.show()
input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id| text|
+---+--------------------+
| 1|<xml>Apple is loc...|
| 2|<xml>Google is lo...|
| 3|<xml>Netflix is l...|
+---+--------------------+
I can then run this dataframe through the Spark CoreNLP wrapper to do both sentiment and NER analysis.
val output = input
.select(cleanxml('text).as('doc))
.select(explode(ssplit('doc)).as('sen))
.select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
However, in the output below I have lost the connection back to the original dataframe row ids.
+--------------------+--------------------+--------------------+---------+
| sen| words| nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
|It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--------------------+--------------------+--------------------+---------+
Ideally, I want something like the following:
+--+---------------------+--------------------+--------------------+---------+
|id| sen| words| nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
| 1| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
| 2| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
| 3| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--+---------------------+--------------------+--------------------+---------+
I have tried to create a UDF but am unable to make it work.
Using the UDFs defined in the Stanford CoreNLP wrapper for Apache Spark, you can use the following code to produce the desired output:
val output = input.withColumn("doc", cleanxml('text).as('doc))
.withColumn("sen", ssplit('doc).as('sen))
.withColumn("sen", explode($"sen"))
.withColumn("words", tokenize('sen).as('words))
.withColumn("ner", ner('sen).as('nerTags))
.withColumn("sentiment", sentiment('sen).as('sentiment))
.drop("text")
.drop("doc").show()
This will produce the following DataFrame:
+--+---------------------+--------------------+--------------------+---------+
|id| sen| words| nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...| 2|
| 1| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...| 3|
| 2| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...| 3|
| 3| It is a great com...|[It, is, a, great...| [O, O, O, O, O, O]| 4|
+--+---------------------+--------------------+--------------------+---------+

Spark reduce and aggregate on same data-set

I have a text file which I read and then split using the split operation. This results in an RDD of Array(A, B, C, D, E, F, G, H, I) rows.
I would like to find max(F) - min(G) for every key E (reduce by key E). Then I want to combine (sum) the resulting values by key C and append that sum to every row with the same key.
For example:
+--+--+--+--+
| C| E| F| G|
+--+--+--+--+
|en| 1| 3| 1|
|en| 1| 4| 0|
|nl| 2| 1| 1|
|nl| 2| 5| 2|
|nl| 3| 9| 3|
|nl| 3| 6| 4|
|en| 4| 9| 1|
|en| 4| 2| 1|
+--+--+--+--+
Should result in
+--+--+-------------+---+
| C| E|max(F)-min(G)|sum|
+--+--+-------------+---+
|en| 1|            4| 12|
|nl| 2|            4| 10|
|nl| 3|            6| 10|
|en| 4|            8| 12|
+--+--+-------------+---+
What would be the best way to tackle this? Currently I am trying to perform the max(F)-min(G) by running
val maxCounts = logEntries.map(line => (line(4), line(5).toLong)).reduceByKey((x, y) => math.max(x, y))
val minCounts = logEntries.map(line => (line(4), line(6).toLong)).reduceByKey((x, y) => math.min(x, y))
val maxMinCounts = maxCounts.join(minCounts).map{ case(id, maxmin) => (id, (maxmin._1 - maxmin._2)) }
And then join the resulting RDDs. However, this becomes tricky when I also want to sum these values and append them to my existing data set.
I would love to hear any suggestions!
This kind of logic is easily implemented in the DataFrame API as well, but you need to explicitly form your columns from the array:
import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy('C)

val df = rdd
  .map { case Array(_, _, c, _, e, f, g, _, _) => (c, e, f.toLong, g.toLong) }  // split yields strings, so convert F and G
  .toDF("C", "E", "F", "G")
  .groupBy('C, 'E)
  .agg((max('F) - min('G)).as("diff"))
  .withColumn("sum", sum('diff).over(window))
Assuming, like your sample data, that unique E's never span multiple C's, you could do something like this.
import math.{max, min}

case class FG(f: Int, g: Int) {
  def combine(that: FG) =
    FG(max(f, that.f), min(g, that.g))
  def result = f - g
}

val result = {
  rdd
    .map { case Array(_, _, c, _, e, f, g, _, _) =>
      ((c, e), FG(f.toInt, g.toInt)) }   // split yields strings, so convert F and G
    .reduceByKey(_ combine _)
    .map { case ((c, _), fg) =>
      (c, fg.result) }
    .reduceByKey(_ + _)
}
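For completeness, a hypothetical sketch of how rdd might be built for the snippets above, assuming a comma-delimited text file with the nine fields A..I per line (the file path and delimiter are placeholders, since the question does not specify them):

val lines = spark.sparkContext.textFile("/path/to/logfile.txt")  // placeholder path
val rdd = lines.map(_.split(","))                                // Array(A, B, C, D, E, F, G, H, I) per line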