Column comparison in spark scala

I have 2 dataframes like this.
scala> df1.show
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df2.show
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
How can I take the max of the Count column after joining the two dataframes?
Expected output:
+---+---------+
| id| Count|
+---+---------+
| 1|20.565656|
| 2|  40.7767|
+---+---------+

You can do this:
df1.union(df2).groupBy("ID").max("Count").show()
+---+----------+
| ID|max(Count)|
+---+----------+
| 1| 20.565656|
| 2| 40.7767|
+---+----------+
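If you want the result column to keep the name Count instead of max(Count), here is a minimal sketch using agg with an alias (same df1/df2 as above):
import org.apache.spark.sql.functions.max

df1.union(df2)
  .groupBy("ID")
  .agg(max("Count").alias("Count"))
  .show()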

After joining both dataframes, derive a single Count column from the two count columns by keeping the greater of the two values, as the when/otherwise expression below does.
For deriving one column from several columns like this, a built-in column expression (or the greatest function, see the sketch after the output below) is generally preferable to a hand-written UDF.

scala> df.show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df1.show()
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
scala> df.alias("x").join(df1.alias("y"), List("ID"))
.select(col("ID"), col("x.count").alias("Xcount"),col("y.count").alias("Ycount"))
.withColumn("Count", when(col("Xcount") >= col("Ycount"), col("Xcount")).otherwise(col("Ycount")))
.drop("Xcount", "YCount")
.show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2| 40.7767|
+---+---------+

Related

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter out what was in table_a to only the IDs that are in the table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any imports other than from pyspark.sql.functions import col
You can join the two tables and specify how="left_semi".
A left semi-join returns only the columns of the left side of the relation, for the rows that have a match on the right, so there is no BID column to drop afterwards.
result_table = table_a.join(table_b, table_a.AID == table_b.BID, how="left_semi")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
In case the second dataframe has duplicates or multiple values and you only want the distinct ones, the approach below can be useful for such cases.
Create the dataframes:
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Get all the unique values of the val column of the second dataframe into a list variable, then filter on it:
from pyspark.sql import functions as F
df_lookup_var = df_lookup.agg(F.collect_set("val").alias("val")).collect()[0][0]
print(df_lookup_var)
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too (collect the BID values into a Python list first):
b_ids = [row.BID for row in table_b.select("BID").collect()]
table_a.where(col("AID").isin(b_ids))

reverse effect of explode function

In scala with spark-2.4, I would like to filter the value inside the arrays in a column.
From
+---+------------+
| id| letter|
+---+------------+
| 1|[x, xxx, xx]|
| 2|[yy, y, yyy]|
+---+------------+
To
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
I thought of using explode + filter
val res = Seq(("1", Array("x", "xxx", "xx")), ("2", Array("yy", "y", "yyy"))).toDF("id", "letter")
res.withColumn("tmp", explode(col("letter"))).filter(length(col("tmp")) < 3).drop(col("letter")).show()
And I'm getting
+---+---+
| id|tmp|
+---+---+
| 1| x|
| 1| xx|
| 2| yy|
| 2| y|
+---+---+
How do I zip/groupBy back by id ?
Or maybe there is a better, more optimised solution ?
You can filter the array without explode() in Spark 2.4:
res.withColumn("letter", expr("filter(letter, x -> length(x) < 3)")).show()
Output:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
In Spark 2.4+, higher-order functions (filter) are the way to go; alternatively, explode and aggregate back with collect_list:
res.withColumn("tmp",explode(col("letter")))
.filter(length(col("tmp")) < 3)
.drop(col("letter"))
// aggregate back
.groupBy($"id")
.agg(collect_list($"tmp").as("letter"))
.show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
As the explode/groupBy approach introduces a shuffle, it's better to filter the array in place, for example with a UDF:
def filter_arr(maxLength:Int)= udf((arr:Seq[String]) => arr.filter(str => str.size<=maxLength))
res
.select($"id",filter_arr(maxLength = 2)($"letter").as("letter"))
.show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
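For completeness, since Spark 3.0 the same higher-order filter is also exposed directly in the Scala functions API, so the predicate can be written as a Scala lambda instead of a SQL string (a sketch, assuming Spark 3.0+):
import org.apache.spark.sql.functions.{col, filter, length}

res.withColumn("letter", filter(col("letter"), x => length(x) < 3)).show()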

Replicating rows in Spark dataframe according values in a column

I would like to replicate rows according to their value for a given column. For example, I got this DataFrame:
+-----+
|count|
+-----+
| 3|
| 1|
| 4|
+-----+
I would like to get:
+-----+
|count|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
I tried to use withColumn method, according to this answer.
val replicateDf = originalDf
.withColumn("replicating", explode(array((1 until $"count").map(lit): _*)))
.select("count")
But $"count" is a ColumnName and cannot be used to represent its values in the above expression.
(I also tried with explode(Array.fill($"count"){1}) but same problem here.)
What do I need to change? Is there a cleaner way?
array_repeat is available from 2.4 onwards. If you need a solution for lower versions, you can use a udf() or the RDD API. For the RDD approach:
import org.apache.spark.sql.Row

val df = Seq(3,1,4).toDF("count")
val rdd1 = df.rdd.flatMap { x => val y = x.getAs[Int]("count"); for (p <- 0 until y) yield Row(y) }
spark.createDataFrame(rdd1, df.schema).show(false)
Results:
+-----+
|count|
+-----+
|3 |
|3 |
|3 |
|1 |
|4 |
|4 |
|4 |
|4 |
+-----+
With the Dataset API alone (no RDD):
scala> df.flatMap( r=> { (0 until r.getInt(0)).map( i => r.getInt(0)) } ).show
+-----+
|value|
+-----+
| 3|
| 3|
| 3|
| 1|
| 4|
| 4|
| 4|
| 4|
+-----+
For udf(), below would work
val df = Seq(3,1,4).toDF("count")

def array_repeat(x: Int): Array[Int] = {
  val y = for (p <- 0 until x) yield x
  y.toArray
}

val udf_array_repeat = udf(array_repeat _)

df.withColumn("count2", explode(udf_array_repeat('count))).select("count2").show(false)
EDIT :
Check #user10465355's answer below for more information about array_repeat.
You can use array_repeat function:
import org.apache.spark.sql.functions.{array_repeat, explode}
val df = Seq(1, 2, 3).toDF
df.select(explode(array_repeat($"value", $"value"))).show()
+---+
|col|
+---+
| 1|
| 2|
| 2|
| 3|
| 3|
| 3|
+---+
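Applied to the original count column from the question, a sketch of the same idea (assuming Spark 2.4+ and the originalDf from the question):
import org.apache.spark.sql.functions.{array_repeat, col, explode}

originalDf
  .select(explode(array_repeat(col("count"), col("count"))).as("count"))
  .show()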

Extracting array index in Spark Dataframe

I have a Dataframe with a Column of Array Type
For example :
val df = List(("a", Array(1d,2d,3d)), ("b", Array(4d,5d,6d))).toDF("ID", "DATA")
df: org.apache.spark.sql.DataFrame = [ID: string, DATA: array<double>]
scala> df.show
+---+---------------+
| ID| DATA|
+---+---------------+
| a|[1.0, 2.0, 3.0]|
| b|[4.0, 5.0, 6.0]|
+---+---------------+
I wish to explode the array and have an index, like this:
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
|  a|         1| 1.0|
|  a|         2| 2.0|
|  a|         3| 3.0|
|  b|         1| 4.0|
|  b|         2| 5.0|
|  b|         3| 6.0|
+---+----------+----+
I would like to be able to do that with Scala, Sparklyr or SparkR.
I'm using Spark 1.6.
There is a posexplode function available in Spark's functions API:
import org.apache.spark.sql.functions._
df.select($"ID", posexplode($"DATA"))
PS: This is only available from version 2.1.0 onwards.
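A fuller sketch of that approach (assuming Spark 2.1+; posexplode produces columns named pos and col by default, with pos 0-based):
import org.apache.spark.sql.functions.{col, posexplode}

df.select(col("ID"), posexplode(col("DATA")))
  .select(col("ID"), (col("pos") + 1).as("DATA_INDEX"), col("col").as("DATA"))
  .show()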
With Spark 1.6, you can register your dataframe as a temporary table and then run Hive QL over it to get the desired result.
df.registerTempTable("tab")
sqlContext.sql("""
  select
    ID, exploded.DATA_INDEX + 1 as DATA_INDEX, exploded.DATA
  from
    tab
    lateral view posexplode(tab.DATA) exploded as DATA_INDEX, DATA
""").show
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
| a| 1| 1.0|
| a| 2| 2.0|
| a| 3| 3.0|
| b| 1| 4.0|
| b| 2| 5.0|
| b| 3| 6.0|
+---+----------+----+

Spark dataframe filter

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out records which have first 3 characters of column 'c2' either 'MSL' or 'HCP'.
So the output should be like below.
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can any one please help on this?
I know that df.filter($"c2".rlike("MSL")) selects matching records, but how do I exclude them?
Version: Spark 1.6.2
Scala: 2.10
This works too. Concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
df.filter(not(
  substring(col("c2"), 0, 3).isin("MSL", "HCP")
))
I used the below to filter rows from a dataframe and this worked for me (Spark 2.2).
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
  option("header", "true").
  option("delimiter", "|").
  option("inferSchema", "true").
  load("D:\\test.csv")
import spark.implicits._
val filter=data.filter($"dept" === "IT" )
OR
val filter=data.filter($"dept" =!= "IT" )
val df1 = df.filter(not(df("c2").rlike("MSL")) && not(df("c2").rlike("HCP")))
This worked. (Note that rlike matches anywhere in the string; anchor the patterns as "^MSL" and "^HCP" if you only want to test the leading characters.)
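For completeness, a minimal sketch using Column.startsWith, which tests the prefix directly instead of a regex or substring (same df as above):
import org.apache.spark.sql.functions.col

df.filter(!col("c2").startsWith("MSL") && !col("c2").startsWith("HCP")).show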