How to join two dataframes? - scala

I cannot get Spark's DataFrame join to work (no result is produced). Here is my code:
val e = Seq((1, 2), (1, 3), (2, 4))
var edges = e.map(p => Edge(p._1, p._2)).toDF()
var filtered = edges.filter("start = 1").distinct()
println("filtered")
filtered.show()
filtered.printSchema()
println("edges")
edges.show()
edges.printSchema()
var joined = filtered.join(edges, filtered("end") === edges("start"))//.select(filtered("start"), edges("end"))
println("joined")
joined.show()
It requires case class Edge(start: Int, end: Int) to be defined at the top level. Here is the output it produces:
filtered
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
edges
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
| 2| 4|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
joined
+-----+---+-----+---+
|start|end|start|end|
+-----+---+-----+---+
+-----+---+-----+---+
I don't understand why the output is empty. Why doesn't the first row of filtered get combined with the last row of edges?

Renaming the columns on one of the two sides makes the join work:
val f2 = filtered.withColumnRenamed("start", "fStart").withColumnRenamed("end", "fEnd")
f2.join(edges, f2("fEnd") === edges("start")).show
I believe this happens because filtered("start") is the same column object as edges("start"): filtered is just a filtered view of edges, so the two DataFrames share their column definitions. Since the columns are identical, Spark cannot tell which one you are referencing in the join condition.
That is also why you can do things like
edges.select(filtered("start")).show
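An alternative to renaming (not from the original answer, just a minimal sketch; it assumes import spark.implicits._ for the $ syntax) is to alias both sides of the join and qualify the columns through those aliases:
// a minimal sketch: alias both sides so the shared lineage can be
// disambiguated through the "f" and "e" qualifiers
val joinedViaAlias = filtered.as("f")
  .join(edges.as("e"), $"f.end" === $"e.start")
  .select($"f.start", $"e.end")
joinedViaAlias.show
// +-----+---+
// |start|end|
// +-----+---+
// |    1|  4|
// +-----+---+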


Scala Spark: Flatten Array of Key/Value structs

I have an input dataframe which contains an array-typed column. Each entry in the array is a struct consisting of a key (one of about four values) and a value. I want to turn this into a dataframe with one column for each possible key, and nulls where that value is not in the array for that row. Keys are never duplicated in any of the arrays, but they may be out of order or missing.
So far the best I've got is
val arrayCol = "entries"  // assumed: the name of the array-typed column
val wantedCols = df.columns
.filter(_ != arrayCol)
.filter(_ != "col")
val flattened = df
.select((wantedCols.map(col(_)) ++ Seq(explode(col(arrayCol)))):_*)
.groupBy(wantedCols.map(col(_)):_*)
.pivot("col.key")
.agg(first("col.value"))
This does exactly what I want, but it's hideous and I have no idea what the ramifications of grouping on every-column-but-one would be. What's the RIGHT way to do this?
EDIT: Example input/output:
case class testStruct(name : String, number : String)
val dfExampleInput = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))))
.toDF("index", "state", "entries")
dfExampleInput.show
+-----+-----+------------------+
|index|state| entries|
+-----+-----+------------------+
| 0| KY| [[A, 45]]|
| 1| OR|[[A, 30], [B, 10]]|
+-----+-----+------------------+
val dfExampleOutput = Seq(
(0, "KY", "45", null),
(1, "OR", "30", "10"))
.toDF("index", "state", "A", "B")
dfExampleOutput.show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
FURTHER EDIT:
I submitted a solution myself (see below) that handles this well so long as you know the keys in advance (in my case I do). If finding the keys is an issue, another answer holds code to handle that.
Here is a way to do it without groupBy, pivot, agg, or first. Please check the code below.
scala> val df = Seq((0, "KY", Seq(("A", "45"))),(1, "OR", Seq(("A", "30"),("B", "10")))).toDF("index", "state", "entries").withColumn("entries",$"entries".cast("array<struct<name:string,number:string>>"))
df: org.apache.spark.sql.DataFrame = [index: int, state: string ... 1 more field]
scala> df.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
scala> df.show(false)
+-----+-----+------------------+
|index|state|entries |
+-----+-----+------------------+
|0 |KY |[[A, 45]] |
|1 |OR |[[A, 30], [B, 10]]|
+-----+-----+------------------+
scala> val finalDFColumns = df.select(explode($"entries").as("entries")).select("entries.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df.limit(0))((cdf,c) => cdf.withColumn(c,lit(null))).columns
finalDFColumns: Array[String] = Array(index, state, entries, A, B)
scala> // `max` (the largest size of the entries array) is not defined in the original snippet; one way to compute it:
scala> val max = df.select(org.apache.spark.sql.functions.max(size($"entries"))).first().getInt(0)
max: Int = 2
scala> val finalDF = df.select($"*" +: (0 until max).map(i => $"entries".getItem(i)("number").as(i.toString)): _*)
finalDF: org.apache.spark.sql.DataFrame = [index: int, state: string ... 3 more fields]
scala> finalDF.show(false)
+-----+-----+------------------+---+----+
|index|state|entries |0 |1 |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala> finalDF.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).show(false)
+-----+-----+------------------+---+----+
|index|state|entries |A |B |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
Final Output
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).drop($"entries").show(false)
+-----+-----+---+----+
|index|state|A |B |
+-----+-----+---+----+
|0 |KY |45 |null|
|1 |OR |30 |10 |
+-----+-----+---+----+
I wouldn't worry too much about grouping by several columns, other than potentially making things confusing. In that vein, if there is a simpler, more maintainable way, go for it. Without example input/output, I'm not sure if this gets you where you're trying to go, but maybe it'll be of use:
Seq(Seq("k1" -> "v1", "k2" -> "v2")).toDS() // some basic input based on my understanding of your description
.select(explode($"value")) // flatten the array
.select("col.*") // de-nest the struct
.groupBy("_2") // one row per distinct value
.pivot("_1") // one column per distinct key
.count // or agg(first) if you want the value in each column
.show
+---+----+----+
| _2| k1| k2|
+---+----+----+
| v2|null| 1|
| v1| 1|null|
+---+----+----+
Based on what you've now said, I get the impression that there are many columns like "state" that aren't required for the aggregation, but need to be in the final result.
For reference, if you didn't need to pivot, you could add a struct column with all such fields nested within, then add it to your aggregation, e.g. .agg(first($"myStruct"), first($"number")); a quick sketch of this is shown below. The main advantage is having only the actual key column(s) referenced in the groupBy. But when using pivot things get a little weird, so we'll set that option aside.
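A minimal sketch of that non-pivot variant (not part of the original answer), assuming the dfExampleInput defined above and that only a single number per index is wanted, since there is no pivot:
import org.apache.spark.sql.functions.{explode, first, struct}
// carry the non-key columns inside one struct so that only the real grouping
// key ("index") appears in the groupBy
dfExampleInput
  .select($"index", struct($"state").as("meta"), explode($"entries").as("entry"))
  .select($"index", $"meta", $"entry.*")
  .groupBy($"index")
  .agg(first($"meta").as("meta"), first($"number").as("number"))
  .select($"index", $"meta.state", $"number")
  .show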
In this use case, the simplest way I could come up with involves splitting your dataframe and joining it back together after the aggregation using some row key. In this example I am assuming that "index" is suitable for that purpose:
val mehCols = dfExampleInput.columns.filter(_ != "entries").map(col)
val mehDF = dfExampleInput.select(mehCols:_*)
val aggDF = dfExampleInput
.select($"index", explode($"entries").as("entry"))
.select($"index", $"entry.*")
.groupBy("index")
.pivot("name")
.agg(first($"number"))
mehDF.join(aggDF, Seq("index")).show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
I doubt you would see much of a difference in performance, if any. Maybe at the extremes, e.g. very many meh columns, or very many pivot columns, or something like that, or maybe nothing at all. Personally, I would test both with decently-sized input, and if there wasn't a significant difference, use whichever one seemed easier to maintain.
Here is another way, based on the assumption that there are no duplicates in the entries column, i.e. Seq(testStruct("A", "30"), testStruct("A", "70"), testStruct("B", "10")) would cause an error. The following solution combines both the RDD and DataFrame APIs for the implementation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.cache
// get all possible keys from entries, i.e. Seq[A, B, C]
val finalCols = df.select(explode($"entries").as("entry"))
.select($"entry".getField("name").as("entry_name"))
.distinct
.collect
.map{_.getAs[String]("entry_name")}
.sorted // Attention: we need to retain the order of the columns
// 1. when generating row values and
// 2. when creating the schema
val rdd = df.rdd.map{ r =>
// transform the entries array into a map, i.e. Map(A -> 30, B -> 10)
val entriesMap = r.getSeq[Row](2).map{r => (r.getString(0), r.getString(1))}.toMap
// transform finalCols into a map with null values, i.e. Map(A -> null, B -> null, C -> null)
val finalColsMap = finalCols.map{c => (c, null)}.toMap
// replace null values with those that are present from the current row by merging the two previous maps
// Attention: this should retain the order of finalColsMap
val merged = finalColsMap ++ entriesMap
// concatenate the two first row values ["index", "state"] with the values from merged
val finalValues = Seq(r(0), r(1)) ++ merged.values
Row.fromSeq(finalValues)
}
val extraCols = finalCols.map{c => s"`${c}` STRING"}
val schema = StructType.fromDDL("`index` INT, `state` STRING," + extraCols.mkString(","))
val finalDf = spark.createDataFrame(rdd, schema)
finalDf.show
// +-----+-----+---+----+----+
// |index|state| A| B| C|
// +-----+-----+---+----+----+
// | 0| KY| 45|null|null|
// | 1| OR| 30| 10|null|
// | 2| FL| 30| 10| 20|
// | 3| TX| 19| 60| 40|
// +-----+-----+---+----+----+
Note: the solution requires one extra action to retrieve the unique keys, although it doesn't cause any shuffling since it is based on narrow transformations only.
I've worked out a solution myself:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
def extractFromArray(colName : String, key : String, numKeys : Int, keyName : String) = {
val indexCols = (0 to numKeys-1).map(col(colName).getItem(_))
indexCols.foldLeft(lit(null))((innerCol : Column, indexCol : Column) =>
when(indexCol.isNotNull && (indexCol.getItem(keyName) === key), indexCol)
.otherwise(innerCol))
}
Example:
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.withColumn("A", extractFromArray("entries", "B", 3, "name"))
.show
which produces:
+-----+-----+--------------------+-------+
|index|state| entries| A|
+-----+-----+--------------------+-------+
| 0| KY| [[A, 45]]| null|
| 1| OR| [[A, 30], [B, 10]]|[B, 10]|
| 2| FL|[[A, 30], [B, 10]...|[B, 10]|
| 3| TX|[[B, 60], [A, 19]...|[B, 60]|
+-----+-----+--------------------+-------+
This solution is a little different from other answers:
It works on only a single key at a time
It requires the key name and number of keys be known in advance
It produces a column of structs, rather than doing the extra step of extracting specific values
It works as a simple column-to-column operation, rather than requiring transformations on the entire DF
It can be evaluated lazily
The first three issues can be handled by the calling code, which leaves the function somewhat more flexible for cases where you already know the keys or where the structs contain additional values to extract.
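For illustration, here is a hedged sketch of what such calling code might look like when several keys are known up front (the key list, column names, and numKeys value are assumptions based on the example above, not part of the original answer):
// hypothetical calling code: one extractFromArray call per known key,
// then pull the "number" field out of each matched struct
val knownKeys = Seq("A", "B", "C")   // assumed to be known in advance
val widened = knownKeys.foldLeft(df)((acc, key) =>
  acc.withColumn(key, extractFromArray("entries", key, 3, "name").getField("number")))
widened.drop("entries").show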

How can I split a column containing an array of structs into separate columns?

I have the following scenario:
case class attribute(key:String,value:String)
case class entity(id:String,attr:List[attribute])
val entities = List(entity("1",List(attribute("name","sasha"),attribute("home","del"))),
entity("2",List(attribute("home","hyd"))))
val df = entities.toDF()
// df.show
+---+--------------------+
| id| attr|
+---+--------------------+
| 1|[[name,sasha], [d...|
| 2| [[home,hyd]]|
+---+--------------------+
//df.printSchema
root
|-- id: string (nullable = true)
|-- attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
What I want to produce is:
+---+-----+----+
| id| name|home|
+---+-----+----+
|  1|sasha| del|
|  2| null| hyd|
+---+-----+----+
How do I go about this? I looked at quite a few similar questions on Stack Overflow but couldn't find anything useful.
My main motive is to do a groupBy on different attributes, so I want to bring the data into the above-mentioned format.
I looked into the explode functionality. It breaks a list down into separate rows, which I don't want. I want to create more columns from the array of attributes.
Similar things I found:
Spark - convert Map to a single-row DataFrame
Split 1 column into 3 columns in spark scala
Spark dataframe - Split struct column into 2 columns
This can easily be reduced to "PySpark converting a column of type 'map' to multiple columns in a dataframe" or "How to get keys and values from MapType column in SparkSQL DataFrame". First convert attr to map<string, string>:
import org.apache.spark.sql.functions.{explode, map_from_entries, map_keys}
val dfMap = df.withColumn("attr", map_from_entries($"attr"))
then it's just a matter of finding the unique keys
val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect
then selecting from the map
val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
| 1|sasha| del|
| 2| null| hyd|
+---+-----+----+
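Since the stated motive was to group by individual attributes, here is a quick usage sketch on result (the aggregation itself is hypothetical):
// with "name" and "home" as top-level columns, grouping by an attribute
// is now a plain groupBy
result.groupBy($"home").count().show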
A less efficient but more concise variant is to explode and pivot:
val result = df
.select($"id", explode(map_from_entries($"attr")))
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
| 1| del|sasha|
| 2| hyd| null|
+---+----+-----+
but in practice I'd advise against it.

Spark scala dataframe: Merging multiple columns into single column

I have a spark dataframe which looks something like below:
+---+------+----+
| id|animal|talk|
+---+------+----+
| 1| bat|done|
| 2| mouse|mone|
| 3| horse| gun|
| 4| horse|some|
+---+------+----+
I want to generate a new column, say merged, which would look something like this:
+---+-----------------------------------------------------------+
| id| merged columns |
+---+-----------------------------------------------------------+
| 1| [{name: animal, value: bat}, {name: talk, value: done}] |
| 2| [{name: animal, value: mouse}, {name: talk, value: mone}] |
| 3| [{name: animal, value: horse}, {name: talk, value: gun}] |
| 4| [{name: animal, value: horse}, {name: talk, value: some}] |
+---+-----------------------------------------------------------+
Basically, combining all the columns into an Array of case class merged(name:String, value: String).
Can anyone help me with how to do this in Scala?
Here, for simplicity, I have used only two columns, but a generic answer that works for N columns would greatly help.
Your expected output doesn't seem to reflect your requirement of producing a list of name-value structured objects. If I understand it correctly, consider using foldLeft to iteratively convert the wanted columns to StructType name-value columns, and group them into an ArrayType column:
import org.apache.spark.sql.functions._
val df = Seq(
(1, "bat", "done"),
(2, "mouse", "mone"),
(3, "horse", "gun"),
(4, "horse", "some")
).toDF("id", "animal", "talk")
val cols = df.columns.filter(_ != "id")
val resultDF = cols.
foldLeft(df)( (accDF, c) =>
accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
).
select($"id", array(cols.map(col): _*).as("merged"))
resultDF.show(false)
// +---+-----------------------------+
// |id |merged |
// +---+-----------------------------+
// |1 |[[animal,bat], [talk,done]] |
// |2 |[[animal,mouse], [talk,mone]]|
// |3 |[[animal,horse], [talk,gun]] |
// |4 |[[animal,horse], [talk,some]]|
// +---+-----------------------------+
resultDF.printSchema
// root
// |-- id: integer (nullable = false)
// |-- merged: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- name: string (nullable = false)
// | | |-- value: string (nullable = true)

How to combine 2 different dataframes together?

I have 2 DataFrames:
Users (~29.000.000 entries)
|-- userId: string (nullable = true)
Impressions (~1000 entries)
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- id: string (nullable = true)
I want to walk through all the Users and attach one Impression from these ~1000 entries to each User. So effectively every ~1000th User gets the same Impression: once the loop over the Impressions reaches the end, it starts from the beginning again and assigns the same ~1000 impressions to the next ~1000 users.
At the end I want to have a DataFrame with the combined data. The Users dataframe could be reused by adding the Impressions columns to it, or a newly created dataframe would also work as a result.
Do you have any ideas what would be a good solution here?
What I would do is use the old trick of adding a monotonically increasing ID to both dataframes, then create a new column on your LARGER dataframe (Users) which contains the modulo of each row's ID and the size of the smaller dataframe.
This new column then provides a rolling matching key against the items in the Impressions dataframe.
This is a minimal example (tested) to give you the idea. Obviously this will also work if you have 1000 impressions to join against:
var users = Seq("user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8", "user9").toDF("users")
var impressions = Seq("a", "b", "c").toDF("impressions").withColumn("id", monotonically_increasing_id())
var cnt = impressions.count
users=users.withColumn("id", monotonically_increasing_id())
.withColumn("mod", $"id" mod cnt)
.join(impressions, $"mod"===impressions("id"))
.drop("mod")
users.show
+-----+---+-----------+---+
|users| id|impressions| id|
+-----+---+-----------+---+
|user1| 0| a| 0|
|user2| 1| b| 1|
|user3| 2| c| 2|
|user4| 3| a| 0|
|user5| 4| b| 1|
|user6| 5| c| 2|
|user7| 6| a| 0|
|user8| 7| b| 1|
|user9| 8| c| 2|
+-----+---+-----------+---+
Sketch of idea:
Add monotonically increasing id to both dataframes Users and Impressions via
val indexedUsersDF = usersDf.withColumn("index", monotonicallyIncreasingId)
val indexedImpressionsDF = impressionsDf.withColumn("index", monotonicallyIncreasingId)
(see spark dataframe :how to add a index Column )
Determine the number of rows in Impressions via count and store it as an int, e.g.
val numberOfImpressions = ...
Apply a UDF to the index column in indexedUsersDF that computes the modulo in a separate column (e.g. moduloIndex)
val moduloIndexedUsersDF = indexedUsersDF.select(...)
Join moduloIndexedUsersDF and indexedImpressionsDF on
moduloIndexedUsersDF("moduloIndex") === indexedImpressionsDF("index")
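Putting the sketch together (a minimal, untested version; it assumes the dataframes are called usersDf and impressionsDf as above, uses the current monotonically_increasing_id() function, and replaces the UDF with a plain column expression for the modulo):
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
val indexedUsersDF = usersDf.withColumn("index", monotonically_increasing_id())
val indexedImpressionsDF = impressionsDf.withColumn("index", monotonically_increasing_id())
// number of rows in Impressions
val numberOfImpressions = indexedImpressionsDF.count()
// note: monotonically_increasing_id is not consecutive across partitions,
// so the modulo only approximates a round-robin assignment
val moduloIndexedUsersDF = indexedUsersDF
  .withColumn("moduloIndex", col("index") % numberOfImpressions)
val combinedDF = moduloIndexedUsersDF
  .join(indexedImpressionsDF, moduloIndexedUsersDF("moduloIndex") === indexedImpressionsDF("index"))
  .drop("moduloIndex", "index")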

Assign label to categorical data in a table in PySpark

I want to assign labels to the categorical numbers in the dataframe below using PySpark SQL.
In the MARRIAGE column, 1 = Married and 2 = UnMarried. In the EDUCATION column, 1 = Grad and 2 = UnderGrad.
Current Dataframe:
+--------+---------+-----+
|MARRIAGE|EDUCATION|Total|
+--------+---------+-----+
| 1| 2| 87|
| 1| 1| 123|
| 2| 2| 3|
| 2| 1| 8|
+--------+---------+-----+
Resulting Dataframe:
+---------+---------+-----+
|MARRIAGE |EDUCATION|Total|
+---------+---------+-----+
|Married |Grad | 87|
|Married |UnderGrad| 123|
|UnMarried|Grad | 3|
|UnMarried|UnderGrad| 8|
+---------+---------+-----+
Is it possible to assign the labels using a single UDF and withColumn()? Is there any way to do it in a single UDF by passing the whole dataframe and keeping the column names as they are?
I can think of a solution that does the operation on each column using separate UDFs, as below. But I can't figure out if there's a way to do it all together.
from pyspark.sql import functions as F
def assign_marital_names(record):
    if record == 1:
        return "Married"
    elif record == 2:
        return "UnMarried"

def assign_edu_names(record):
    if record == 1:
        return "Grad"
    elif record == 2:
        return "UnderGrad"
assign_marital_udf = F.udf(assign_marital_names)
assign_edu_udf = F.udf(assign_edu_names)
df.withColumn("MARRIAGE", assign_marital_udf("MARRIAGE")).\
withColumn("EDUCATION", assign_edu_udf("EDUCATION")).show(truncate=False)
One UDF can result in only one column, but that column can be a struct, and the UDF can apply labels to both MARRIAGE and EDUCATION. See the code below:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import Row
udf_result = StructType([StructField('MARRIAGE', StringType()), StructField('EDUCATION', StringType())])
marriage_dict = {1: 'Married', 2: 'UnMarried'}
education_dict = {1: 'Grad', 2: 'UnderGrad'}
def assign_labels(marriage, education):
    return Row(marriage_dict[marriage], education_dict[education])
assign_labels_udf = F.udf(assign_labels, udf_result)
df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION')).printSchema()
root
|-- MARRIAGE: long (nullable = true)
|-- EDUCATION: long (nullable = true)
|-- Total: long (nullable = true)
|-- labels: struct (nullable = true)
| |-- MARRIAGE: string (nullable = true)
| |-- EDUCATION: string (nullable = true)
But as you see, it's not replacing the original columns, it's just adding a new one. To replace them you will need to use withColumn twice and then drop labels.