I have the following data, which captures a user-supervisor relationship.
user |supervisor |id
-----|-----------|----
a | b | 1
b | c | 2
c | d | 3
e | b | 4
I want to explode the relationship hierarchy between the user and supervisor as below.
user |supervisor |id
-----|-----------|----
a | b | 1
a | c | 1
a | d | 1
b | c | 2
b | d | 2
c | d | 3
e | b | 4
e | c | 4
e | d | 4
As you can see, for user 'a' the immediate supervisor is 'b', but 'b' in turn has 'c' as its supervisor, so indirectly 'c' is a supervisor for 'a' as well, and so on. In short, my aim is to explode the hierarchy to any level for a given user. What is the best way to implement this in spark-scala?
Here is an approach using DataFrames. I show one level of the hierarchy, but the same step can be repeated to go to deeper levels (see the iterative sketch after the output below):
val df = sc.parallelize(Array(("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("e", "b", 4))).toDF("user", "supervisor", "id")
scala> df.show
+----+----------+---+
|user|supervisor| id|
+----+----------+---+
| a| b| 1|
| b| c| 2|
| c| d| 3|
| e| b| 4|
+----+----------+---+
Let's enable cross joins:
spark.conf.set("spark.sql.crossJoin.enabled", true)
Then join the same table:
val dfjoin = df.as("df1").join(df.as("df2"), $"df1.supervisor" === $"df2.user", "left").select($"df1.user", $"df1.supervisor".as("s1"), $"df1.id", $"df2.supervisor".as("s2"))
I use a UDF to combine the two supervisor columns into an array:
import org.apache.spark.sql.functions.udf
val combineUdf = udf((x: String, y: String) => Seq(x, y))
val dfcombined = dfjoin.withColumn("combined", combineUdf($"s1", $"s2")).select($"user", $"combined", $"id")
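As an aside, a UDF isn't strictly needed for this step; the built-in array function should produce the same combined column:
import org.apache.spark.sql.functions.array
val dfcombined = dfjoin.withColumn("combined", array($"s1", $"s2")).select($"user", $"combined", $"id")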
The last step is to flatten the array into separate rows and filter out the null supervisors coming from rows that did not join:
val dfexplode = dfcombined.withColumn("supervisor", explode($"combined")).select($"user", $"id", $"supervisor").filter($"supervisor".isNotNull)
The first level hierarchy looks like this:
scala> dfexplode.show
+----+---+----------+
|user| id|supervisor|
+----+---+----------+
| c| 3| d|
| b| 2| c|
| b| 2| d|
| a| 1| b|
| a| 1| c|
| e| 4| b|
| e| 4| c|
+----+---+----------+
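The question asks for the hierarchy at any level, so the join step can be repeated until no new pairs show up, i.e. a transitive closure. A minimal sketch of that loop, assuming the supervisor chains contain no cycles (otherwise it would not terminate) and using illustrative variable names:
import org.apache.spark.sql.DataFrame

var result: DataFrame = df      // all user -> supervisor pairs found so far
var frontier: DataFrame = df    // pairs added in the previous round
var newCount = frontier.count

while (newCount > 0) {
  // follow one more supervisor step from the current frontier
  val next = frontier.as("cur")
    .join(df.as("d"), $"cur.supervisor" === $"d.user")
    .select($"cur.user", $"d.supervisor", $"cur.id")

  // keep only pairs we have not seen before
  val added = next.except(result)
  newCount = added.count
  result = result.union(added)
  frontier = added
}

result.orderBy("id", "supervisor").show
For the sample data this should yield the nine rows listed in the question.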
Related
There is a dataframe that contains two columns, one is a key and the other is a value, like below:
+--------------------+--------------------+
| key| value |
+--------------------+--------------------+
|a |abcde |
I want to slice the value into multiple values with their positions and generate a new dataframe under the same key, like below:
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|a |[a, 0] |
|a |[b, 1] |
|a |[c, 2] |
|a |[d, 3] |
|a |[e, 4] |
I have tried to use join() and StructType() but failed. Is there any possible method to do that? Thanks!
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as f
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("a","abcd",3000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (index, value)") , f.col("Name")).show()
+-----+-----+----+
|index|value|Name|
+-----+-----+----+
|    0|    a|   a|
|    1|    b|   a|
|    2|    c|   a|
|    3|    d|   a|
+-----+-----+----+
# to explain the above functions more clearly:
f.expr( "..." )                  # write a SQL expression
"posexplode( [..] )"             # turn an array into rows, one per element, along with each element's index
"filter( [..] , x -> x != '' )"  # filter the empty-string elements out of the array
"split( Department, '' )"        # split the column on the empty string (i.e. into characters), which adds an empty string to the array that we then filter out
Here's the update to fit with your exact request, just a little more manipulation to put it into your required format:
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (index, myvalue)") , f.col("Name"), f.expr("array(myvalue, index) as value")).drop("index","myvalue").show()
+----+------+
|Name| value|
+----+------+
|   a|[a, 0]|
|   a|[b, 1]|
|   a|[c, 2]|
|   a|[d, 3]|
+----+------+
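Since the main question above is about spark-scala, here is a rough equivalent of the same expression in Scala (an untested sketch, assuming Spark 2.4+ for the higher-order filter function):
import org.apache.spark.sql.functions.{expr, col}
df.select(expr("posexplode(filter(split(Department, ''), x -> x != '')) as (index, value)"), col("Name")).show()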
The following snippet transforms your data into your specified format:
import pyspark.sql.functions as F
df = spark.createDataFrame([("a", "abcde",)], ["key", "value"])
df_split = df.withColumn("split", F.array_remove(F.split("value", ""), ""))
df_split.show()
df_exploded = df_split.select("key", F.posexplode("split"))
df_exploded.show()
df_array = df_exploded.select("key", F.array("col", "pos").alias("value"))
df_array.show()
Output:
+---+-----+---------------+
|key|value| split|
+---+-----+---------------+
| a|abcde|[a, b, c, d, e]|
+---+-----+---------------+
+---+---+---+
|key|pos|col|
+---+---+---+
| a| 0| a|
| a| 1| b|
| a| 2| c|
| a| 3| d|
| a| 4| e|
+---+---+---+
+---+------+
|key| value|
+---+------+
| a|[a, 0]|
| a|[b, 1]|
| a|[c, 2]|
| a|[d, 3]|
| a|[e, 4]|
+---+------+
First, the string is split into an array, where the split pattern is the empty string; this leaves a trailing empty-string element in the array, which array_remove drops.
Then each element of the split column array is turned into a row, with its position in the array in column pos.
Lastly, the columns are combined into an array.
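For reference, the same pipeline in Scala (the language of the main question), assuming Spark 2.4+ where array_remove and posexplode are available; names here are illustrative:
import org.apache.spark.sql.functions.{array, array_remove, col, posexplode, split}

// inside spark-shell; in an application, create a SparkSession and import spark.implicits._
val df = Seq(("a", "abcde")).toDF("key", "value")

val dfSplit    = df.withColumn("split", array_remove(split(col("value"), ""), ""))
val dfExploded = dfSplit.select(col("key"), posexplode(col("split")))
val dfArray    = dfExploded.select(col("key"), array(col("col"), col("pos")).alias("value"))

dfArray.show()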
This is a question identical to
Pyspark: Split multiple array columns into rows
but I want to know how to do it in Scala.
for a dataframe like this,
+---+---------+---------+---+
| a| b| c| d|
+---+---------+---------+---+
| 1|[1, 2, 3]|[, 8, 9] |foo|
+---+---------+---------+---+
I want to have it in following format
+---+---+-------+------+
| a| b| c | d |
+---+---+-------+------+
| 1| 1| None | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+-------+------+
In scala, I know there's an explode function, but I don't think it's applicable here.
I tried
import org.apache.spark.sql.functions.arrays_zip
but I get an error saying arrays_zip is not a member of org.apache.spark.sql.functions, although it's clearly listed in https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html
The approach below might be helpful to you:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val arrayData = Seq(
Row(1,List(1,2,3),List(0,8,9),"foo"))
val arraySchema = new StructType().add("a",IntegerType).add("b", ArrayType(IntegerType)).add("c", ArrayType(IntegerType)).add("d",StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.select($"a",$"d",explode($"b",$"c")).show(false)
val zip = udf((x: Seq[Int], y: Seq[Int]) => x.zip(y))
df.withColumn("vars", explode(zip($"b", $"c"))).select($"a", $"d",$"vars._1".alias("b"), $"vars._2".alias("c")).show()
/*
+---+---+---+---+
| a| d| b| c|
+---+---+---+---+
| 1|foo| 1| 0|
| 1|foo| 2| 8|
| 1|foo| 3| 9|
+---+---+---+---+
*/
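The arrays_zip error in the question usually just means the cluster is running a Spark version older than 2.4, where that function was added. On Spark 2.4+ the UDF can be skipped; a small sketch:
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

df.withColumn("vars", explode(arrays_zip(col("b"), col("c"))))
  .select(col("a"), col("d"), col("vars.b").alias("b"), col("vars.c").alias("c"))
  .show()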
I have a dataframe with an id, a category column, and a column A, as below. I want to populate column B such that it compares the current value of A with the previous value of B and stores the max for each category. I tried window functions, lag, max per category, etc., but the biggest challenge I'm facing is how to remember the earlier max while comparing two values.
+---+--------+---+---+
| id|category|  A|  B|
+---+--------+---+---+
|  1|   Fruit|  1|  1|
|  2|   Fruit|  5|  5|
|  3|   Fruit|  3|  5|
|  4|   Fruit|  4|  5|
|  1| Dessert|  4|  4|
|  2| Dessert|  2|  4|
|  1| Veggies| 11| 11|
|  2| Veggies|  7| 11|
|  3| Veggies| 12| 12|
|  4| Veggies|  3| 12|
+---+--------+---+---+
A running maximum of A within each category should do the trick:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

df.withColumn("B", max($"A").over(Window.partitionBy($"category").orderBy($"id")))
With an orderBy in the window, the default frame runs from the start of the partition up to the current row, so max acts as a running maximum; for the sample data this reproduces the B column shown in the question.
I had a hard time expressing this with Spark SQL, but managed with functional programming using the Dataset API:
scala> case class Food(category: String, a: Int, b: Option[Int] = None)
defined class Food
scala> val ds = spark.createDataset(
| List(
| Food("Fruit", 1),
| Food("Fruit", 5),
| Food("Fruit", 3),
| Food("Fruit", 4),
| Food("Dessert", 4),
| Food("Dessert", 2),
| Food("Veggies", 11),
| Food("Veggies", 7),
| Food("Veggies", 12),
| Food("Veggies", 3)
| )
| )
ds: org.apache.spark.sql.Dataset[Food] = [category: string, a: int ... 1 more field]
scala> ds.show
+--------+---+----+
|category| a| b|
+--------+---+----+
| Fruit| 1|null|
| Fruit| 5|null|
| Fruit| 3|null|
| Fruit| 4|null|
| Dessert| 4|null|
| Dessert| 2|null|
| Veggies| 11|null|
| Veggies| 7|null|
| Veggies| 12|null|
| Veggies| 3|null|
+--------+---+----+
scala> :paste
// Entering paste mode (ctrl-D to finish)
ds.groupByKey(_.category)
.flatMapGroups { (key, iter) =>
if (iter.hasNext) {
val head = iter.next
iter.scanLeft(head.copy(b = Some(head.a))) { (x, y) =>
val a = x.b.map(b => if(x.a > b) x.a else b).getOrElse(x.a)
y.copy(b = if(y.a > a) Some(y.a) else Some(a))
}
} else iter
}
.show
// Exiting paste mode, now interpreting.
+--------+---+---+
|category| a| b|
+--------+---+---+
| Veggies| 11| 11|
| Veggies| 7| 11|
| Veggies| 12| 12|
| Veggies| 3| 12|
| Dessert| 4| 4|
| Dessert| 2| 4|
| Fruit| 1| 1|
| Fruit| 5| 5|
| Fruit| 3| 5|
| Fruit| 4| 5|
+--------+---+---+
I am trying to dynamically add columns to a DataFrame from a Seq of String.
Here's an example: the source dataframe looks like this:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi| |
|2 |bla |blo | | |
|3 |b | c | a | d |
+-----+---+----+---+---+
I also have a Seq of String which contains the names of the columns I want to add. If a column already exists in the source DataFrame, it should not be added again, i.e. some kind of diff, as shown below.
The Seq looks like :
val columns = Seq("A", "B", "F", "G", "H")
The expectation is:
+-----+---+----+---+---+---+---+---+
|id | A | B | C | D | F | G | H |
+-----+---+----+---+---+---+---+---+
|1 |toto|tata|titi|tutu|null|null|null
|2 |bla |blo | | |null|null|null|
|3 |b | c | a | d |null|null|null|
+-----+---+----+---+---+---+---+---+
What I've done so far is something like this :
val difference = columns diff sourceDF.columns
val finalDF = difference.foldLeft(sourceDF)((df, field) => if (!sourceDF.columns.contains(field)) df.withColumn(field, lit(null)) else df)
.select(columns.head, columns.tail:_*)
But I can't figure out how to do this in Spark more efficiently, in a simpler and easier-to-read way...
Thanks in advance
Here is another way using Seq.diff, single select and map to generate your final column list:
import org.apache.spark.sql.functions.{lit, col}
val newCols = Seq("A", "B", "F", "G", "H")
val updatedCols = newCols.diff(df.columns).map{ c => lit(null).as(c)}
val selectExpr = df.columns.map(col) ++ updatedCols
df.select(selectExpr:_*).show
// +---+----+----+----+----+----+----+----+
// | id| A| B| C| D| F| G| H|
// +---+----+----+----+----+----+----+----+
// | 1|toto|tata|titi|null|null|null|null|
// | 2| bla| blo|null|null|null|null|null|
// | 3| b| c| a| d|null|null|null|
// +---+----+----+----+----+----+----+----+
First we find the diff between newCols and df.columns, which gives us F, G and H. Next we transform each element of that list to lit(null).as(c) via map. Finally, we concatenate the existing columns and the new list to produce selectExpr, which is used for the select.
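One small note: lit(null) gives a column of NullType, which some output formats (e.g. Parquet) cannot write. If a concrete type is needed, the literal can be cast, for example to StringType:
import org.apache.spark.sql.types.StringType
val updatedCols = newCols.diff(df.columns).map(c => lit(null).cast(StringType).as(c))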
Below is an optimised version of your logic.
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> import org.apache.spark.sql.functions.lit
scala> val df1 = newCol.foldLeft(df)((df, name) => df.withColumn(name, lit(null)))
scala> df1.show()
+---+----+----+----+----+----+----+----+
| id| A| B| C| D| F| G| H|
+---+----+----+----+----+----+----+----+
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|
+---+----+----+----+----+----+----+----+
If you do not want to use foldLeft, you can use a runtime mirror (Scala reflection ToolBox), which builds and compiles the chain of withColumn calls at runtime. Check the code below.
scala> import scala.reflect.runtime.universe.runtimeMirror
scala> import scala.tools.reflect.ToolBox
scala> import org.apache.spark.sql.DataFrame
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|null|
| 2| bla| blo|null|null|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> def compile[A](code: String): DataFrame => A = {
| val tb = runtimeMirror(getClass.getClassLoader).mkToolBox()
| val tree = tb.parse(
| s"""
| |import org.apache.spark.sql.DataFrame
| |def wrapper(context:DataFrame): Any = {
| | $code
| |}
| |wrapper _
| """.stripMargin)
|
| val fun = tb.compile(tree)
| val wrapper = fun()
| wrapper.asInstanceOf[DataFrame => A]
| }
scala> def AddColumns(df:DataFrame,withColumnsString:String):DataFrame = {
| val code =
| s"""
| |import org.apache.spark.sql.functions._
| |import org.apache.spark.sql.DataFrame
| |var data = context.asInstanceOf[DataFrame]
| |data = data
| """ + withColumnsString +
| """
| |
| |data
| """.stripMargin
|
| val fun = compile[DataFrame](code)
| val res = fun(df)
| res
| }
scala> val Columns = Seq("A", "B", "F", "G", "H")
scala> val newCol = Columns filterNot df.columns.toSeq.contains
scala> var cols = ""
scala> newCol.foreach{ name =>
| cols = ".withColumn(\""+ name + "\" , lit(null))" + cols
| }
scala> val df1 = AddColumns(df,cols)
scala> df1.show
+---+----+----+----+----+----+----+----+
| id| A| B| C| D| H| G| F|
+---+----+----+----+----+----+----+----+
| 1|toto|tata|titi|null|null|null|null|
| 2| bla| blo|null|null|null|null|null|
| 3| b| c| a| d|null|null|null|
+---+----+----+----+----+----+----+----+
I am having a problem figuring this out. Here is the problem statement:
Let's say I have a dataframe. I want to select the value of column C where column B's value is foo, create a new column D, and repeat that value "3" for all rows.
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
You will have to extract the column C value (i.e. 3) from the row where column B is foo:
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then use that value with withColumn to create a new column D using the lit function:
df.withColumn("D", lit(value)).show(false)
You should get your desired output.
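If a typed value is preferred over the Any returned by first()(0), Row.getAs can be used instead, e.g. assuming C is an integer column:
val value = df.filter(col("B") === "foo").select("C").first().getAs[Int]("C")
df.withColumn("D", lit(value)).show(false)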