How to collect and process column-wise data in Spark - scala

I have a dataframe containing 7 days × 24 hours of data, so it has 168 hourly columns plus an id column.
id d1h1 d1h2 d1h3 ..... d7h24
aaa 21 24 8 ..... 14
bbb 16 12 2 ..... 4
ccc 21 2 7 ..... 6
What I want to do is find the top 3 values for each day:
id d1 d2 d3 .... d7
aaa [22,2,2] [17,2,2] [21,8,3] [32,11,2]
bbb [32,22,12] [47,22,2] [31,14,3] [32,11,2]
ccc [12,7,4] [28,14,7] [11,2,1] [19,14,7]

import org.apache.spark.sql.functions._

var df = ...
// keep the first 3 elements of an already-sorted array
val first3 = udf((list: Seq[Double]) => list.slice(0, 3))
for (i <- 1 to 7) {
  // the 24 hourly columns of day i
  val columns = (1 to 24).map(x => "d" + i + "h" + x)
  df = df
    .withColumn("d" + i, first3(sort_array(array(columns.head, columns.tail: _*), asc = false)))
    .drop(columns: _*)
}
This should give you what you want. In effect, for each day I aggregate the 24 hourly columns into an array column, sort it in descending order, and take the first 3 elements.
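On Spark 2.4+ the UDF can also be avoided with the built-in slice function. A minimal sketch of the same idea, assuming df still holds the original hourly columns:

import org.apache.spark.sql.functions.{array, col, slice, sort_array}

// for each day, sort the 24 hourly columns descending and keep the first three
val top3PerDay = (1 to 7).foldLeft(df) { (acc, i) =>
  val columns = (1 to 24).map(h => s"d${i}h$h")
  acc
    .withColumn(s"d$i", slice(sort_array(array(columns.map(col): _*), asc = false), 1, 3))
    .drop(columns: _*)
}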

Define pattern:
val p = "^(d[1-7])h[0-9]{1,2}$".r
Group columns:
import org.apache.spark.sql.functions._

val cols = df.columns.tail
  .groupBy { case p(d) => d }
  .map { case (c, cs) =>
    val sorted = sort_array(array(cs map col: _*), false)
    array(sorted(0), sorted(1), sorted(2)).as(c)
  }
And select:
df.select($"id" +: cols.toSeq: _*)

Related

spark dataframe aggregation of column based on condition in scala

I have CSV data in the following format.
I need to find the top 2 vendors whose turnover is greater than 100 in the year 2017.
Turnover = Sum(invoices whose status is Paid-in-Full) - Sum(invoices whose status is Exception or Rejected)
I have loaded the data from the CSV in a Databricks Scala notebook as follows:
val invoices_data = spark.read.format(file_type)
  .option("header", "true")
  .option("dateFormat", "M/d/yy")
  .option("inferSchema", "true")
  .load("invoice.csv")
Then I tried to group by vendor name:
val avg_invoice_by_vendor = invoices_data.groupBy("VendorName")
But now I don't know how to proceed further.
Here is sample csv data.
Id InvoiceDate Status Invoice VendorName
2 2/23/17 Exception 23 V1
3 11/23/17 Paid-in-Full 56 V1
1 12/20/17 Paid-in-Full 12 V1
5 8/4/19 Paid-in-Full 123 V2
6 2/6/17 Paid-in-Full 237 V2
9 3/9/17 Rejected 234 V2
7 4/23/17 Paid-in-Full 78 V3
8 5/23/17 Exception 345 V4
You can use a UDF to sign the invoice amount depending on its status, and after grouping aggregate the dataframe with the sum function:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'Status / 'Invoice symbol syntax

// +invoice for Paid-in-Full, -invoice for Exception or Rejected
def signInvoice: (String, Int) => Int = (status: String, invoice: Int) => {
  status match {
    case "Exception" | "Rejected" => -invoice
    case "Paid-in-Full" => invoice
    case _ => throw new IllegalStateException("wrong status")
  }
}
val signInvoiceUdf = spark.udf.register("signInvoice", signInvoice)

val top2_vendors = invoices_data
  // the CSV dates come in as strings like "2/23/17", so parse them explicitly
  .withColumn("InvoiceDate", to_date(col("InvoiceDate"), "M/d/yy"))
  .filter(year(col("InvoiceDate")) === 2017)
  .withColumn("Invoice", col("Invoice").cast("int"))
  .groupBy("VendorName")
  .agg(sum(signInvoiceUdf('Status, 'Invoice)).as("sum_invoice"))
  .filter(col("sum_invoice") > 100)
  .orderBy(col("sum_invoice").desc)
  .take(2) // note: take(2) returns an Array[Row], not a DataFrame
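For what it's worth, here is an alternative sketch (my own variant, not the answer above) that avoids the UDF by signing the amount with when; it assumes the column names from the sample (Status, Invoice, InvoiceDate, VendorName), and any status outside the three known ones becomes null and is simply ignored by sum:

import org.apache.spark.sql.functions._

val signed = when(col("Status") === "Paid-in-Full", col("Invoice"))
  .when(col("Status").isin("Exception", "Rejected"), -col("Invoice"))

val top2Vendors = invoices_data
  .withColumn("InvoiceDate", to_date(col("InvoiceDate"), "M/d/yy"))
  .filter(year(col("InvoiceDate")) === 2017)
  .groupBy("VendorName")
  .agg(sum(signed).as("turnover"))
  .filter(col("turnover") > 100)
  .orderBy(col("turnover").desc)
  .limit(2)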
I used the pivot method to solve the above issue.
invoices_data
  .filter(invoices_data("InvoiceStatusDesc") === "Paid-in-Full" ||
          invoices_data("InvoiceStatusDesc") === "Exception" ||
          invoices_data("InvoiceStatusDesc") === "Rejected")
  .filter(year(to_date(invoices_data("InvoiceDate"), "M/d/yy")) === 2017)
  .groupBy("InvoiceVendorName").pivot("InvoiceStatusDesc").sum("InvoiceTotal")

How to explode an array into multiple columns in Spark

I have a Spark dataframe that looks like:
id DataArray
a array(3,2,1)
b array(4,2,1)
c array(8,6,1)
d array(8,2,4)
I want to transform this dataframe into:
id col1 col2 col3
a 3 2 1
b 4 2 1
c 8 6 1
d 8 2 4
What function should I use?
Use apply:
import org.apache.spark.sql.functions.col

df.select(
  col("id") +: (0 until 3).map(i => col("DataArray")(i).alias(s"col${i + 1}")): _*
)
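If the array length were not fixed at 3, a small sketch (my assumption, not part of the answer above) could derive the width from the data first:

import org.apache.spark.sql.functions.{col, max, size}

// find the longest array, then generate that many columns
val n = df.select(max(size(col("DataArray")))).head().getInt(0)

df.select(
  col("id") +: (0 until n).map(i => col("DataArray")(i).alias(s"col${i + 1}")): _*
)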
You can use foldLeft to add each column from DataArray.
Make a list of the column names you want to add:
import org.apache.spark.sql.functions.col

val columns = List("col1", "col2", "col3")

columns.zipWithIndex
  .foldLeft(df) { (memoDF, column) =>
    // column._1 is the new column name, column._2 is the array index
    memoDF.withColumn(column._1, col("DataArray")(column._2))
  }
  .drop("DataArray")
Hope this helps!

Scala: How to match data from one column to other columns in a data frame

I have the data below and would like to match values from the ID column of df1 against df2.
df1:
ID key
1
2
3
4
5
Df2:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 1 10 765
10 12 19 909
11 2 20 708
Code:
// first attempt
val finalDF = Df1.join(DF2.withColumnRenamed("key", "key2"),
    $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
  .select($"ID", $"key2".as("key"))
finalDF.show(false)
val notMatchingDF = finalDF.filter($"key" === "")
val matchingDF = finalDF.except(notMatchingDF)

// second attempt
import org.apache.spark.sql.functions._
val columnsToCheck = (DF2.columns.toSet - "key").toList
val tempSelectedDetailsDF = DF2.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
val arrayContains = udf((array: collection.mutable.WrappedArray[String], value: String) => array.contains(value))
val finalDF = df1.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
  .select($"ID", $"key2".as("key"))
  .na.fill("")
I am getting the output below:
ID key
1 777
1 765
2 708
3
4
5
However, I am expecting the output below. The logic is: df1 has the ID value 1, and in df2 the value 1 matches more than once, hence the output above; but I should not match a second occurrence when it already matched in the first occurrence.
Expected output:
ID key
1 777
2 708
3
4
5
I should not match the second occurrence when it matches in the first occurrence.
I would suggest you create an increasing id for df2 to identify the order of matches when joined with df1, so that it is easy later on to filter in only the first matches. For that you can benefit from monotonically_increasing_id().
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val finalDF = Df1
  .join(DF2.withColumnRenamed("key", "key2").withColumn("order", monotonically_increasing_id()),
    $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
  .select($"ID", $"key2".as("key").cast(StringType), $"order")
Then you separate the dataframe into matching and non-matching dataframes:
val notMatchingDF = finalDF.filter($"key".isNull || $"key" === "")
val matchingDF = finalDF.except(notMatchingDF)
After that, on matchingDF, generate row numbers over a window partitioned by ID and sorted by the increasing id generated above. Then filter in the first matching rows, union in the non-matching dataframe, drop the newly created column, and fill all nulls with an empty string:
import org.apache.spark.sql.expressions._

def windowSpec = Window.partitionBy("ID").orderBy("order")

matchingDF.withColumn("order", row_number().over(windowSpec))
  .filter($"order" === 1)
  .union(notMatchingDF)
  .drop("order")
  .na.fill("")
You should have your requirement fulfilled
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+

Pyspark dataframe create new column from other columns and from it

I have a PySpark dataframe DF.
Now I would like to create a new column with the condition below.
city customer sales orders checkpoint
a eee 20 20 1
b sfd 28 30 0
C sss 30 30 1
d zzz 35 40 0
DF = DF.withColumn("NewCol",
    func.when(DF.month == 1, DF.sales + DF.orders).otherwise(func.greatest(DF.sales, DF.orders))
    + func.when(DF.checkpoint == 1, func.lit(0)).otherwise(func.lag("NewCol").over(Window.partitionBy(DF.city, DF.customer).orderBy(DF.city, DF.customer))))
I got an error that NewCol is not defined, which is expected.
Please advise me on this.
I created the column first:
DF = DF.withColumn("NewCol", func.lit(None))
for i in range(2):
    if i <= 2:
        DF = DF.withColumn("NewCol",
            func.when(DF.month == 1, DF.sales + DF.orders).otherwise(func.greatest(DF.sales, DF.orders))
            + func.when(DF.checkpoint == 1, func.lit(0)).otherwise(
                func.lag("NewCol").over(Window.partitionBy(DF.city, DF.customer).orderBy(DF.city, DF.customer))))

Combine two rdds

I am new to Spark. Could someone help me find a way to combine two RDDs into a final RDD as per the logic below, in Scala, preferably without using sqlContext (dataframes)?
RDD1 = column1, column2, column3 (362825 records)
RDD2 = column2_distinct (same as column2 from RDD1 but containing only distinct values), column4 (2621 records)
Final RDD = column1, column2, column3, column4
Example-
RDD1 =
userid | progid | Rating
a 001 5
b 001 3
b 002 4
c 003 2
RDD2=
progid(distinct) | id
001 1
002 2
003 3
Final RDD=
userid | progid | id | rating
a 001 1 5
b 001 1 3
b 002 2 4
c 003 3 2
Code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rawRdd1 = pairrdd1.map(x => x._1.split(",")(0) + "," + x._1.split(",")(1) + "," + x._2) // 362825 records
val rawRdd2 = pairrdd2.map(x => x._1 + "," + x._2) // 2621 records
val schemaString1 = "userid programid rating"
val schemaString2 = "programid id"
val fields1 = schemaString1.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val fields2 = schemaString2.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema1 = StructType(fields1)
val schema2 = StructType(fields2)
val rowRDD1 = rawRdd1.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2)))
val rowRDD2 = rawRdd2.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1)))
val DF1 = sparkSession.createDataFrame(rowRDD1, schema1)
val DF2 = sparkSession.createDataFrame(rowRDD2, schema2)
DF1.createOrReplaceTempView("df1")
DF2.createOrReplaceTempView("df2")
val resultDf = DF1.join(DF2, Seq("programid"))
val DF3 = sparkSession.sql("""SELECT df1.userid, df1.programid, df2.id, df1.rating FROM df1 JOIN df2 on df1.programid == df2.programid""")
println(DF1.count()) //362825 records
println(DF2.count()) //2621 records
println(DF3.count()) //only 297 records
I am expecting the same number of records as DF1, with a new column (id) from DF2 attached, holding the value corresponding to each programid.
It is a bit ugly but should work (Spark 2.0):
val rdd1 = sparkSession.sparkContext.parallelize(List("a,001,5", "b,001,3", "b,002,4","c,003,2"))
val rdd2 = sparkSession.sparkContext.parallelize(List("001,1", "002,2", "003,3"))
val groupedRDD1 = rdd1.map(x => (x.split(",")(1),x))
val groupedRDD2 = rdd2.map(x => (x.split(",")(0),x))
val joinRDD = groupedRDD1.join(groupedRDD2)
// convert back to String
val cleanJoinRDD = joinRDD.map(x => x._1 + "," + x._2._1.replace(x._1 + ",","") + "," + x._2._2.replace(x._1 + ",",""))
cleanJoinRDD.collect().foreach(println)
I think the better option is to use Spark SQL.
First of all, why do you split, concatenate and split the row again? You could just do it in one step:
val rowRdd1 = pairrdd1.map { x =>
  // split returns an Array, so destructure with an Array pattern rather than a tuple
  val Array(userid, progid) = x._1.split(",")
  val rating = x._2
  Row(userid, progid, rating)
}
My guess is that your problem might be that there are some additional characters in your keys, so they don't match in the joins. A simple approach would be to do a left join and inspect the rows where they don't match, as sketched below.
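A hedged sketch of that inspection, reusing DF1 and DF2 from the question (columns programid and id as defined there): a left join keeps every DF1 row, so any row where id comes back null is a programid that failed to match and can be checked for stray characters.

import org.apache.spark.sql.functions.col

val unmatched = DF1
  .join(DF2, Seq("programid"), "left")
  .filter(col("id").isNull)   // no DF2 match for this programid
  .select("programid")
  .distinct()

unmatched.show(50, truncate = false)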
It could be something like an extra space in the rows, which you could fix like this for both RDDs:
val rowRdd1 = pairrdd1.map { x =>
  val Array(userid, progid) = x._1.split(",").map(_.trim)
  val rating = x._2
  Row(userid, progid, rating)
}
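Since the question asks for a way without sqlContext, here is a minimal RDD-only sketch (my addition, built on the groupedRDD1/groupedRDD2 pairs from the first answer): leftOuterJoin keeps every RDD1 record whether or not an id was found, which matches the expectation of getting all 362825 rows back.

// keys are progid; values are the original comma-separated lines
val joined = groupedRDD1.leftOuterJoin(groupedRDD2)

val finalRdd = joined.map { case (progid, (left, rightOpt)) =>
  val Array(userid, _, rating) = left.split(",")        // e.g. "a,001,5"
  val id = rightOpt.map(_.split(",")(1)).getOrElse("")  // e.g. "001,1" -> "1"
  s"$userid,$progid,$id,$rating"
}

finalRdd.collect().foreach(println) // e.g. a,001,1,5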