How would I repeat each row in a Scala dataframe N times - scala

Here is the before of the dataframe:
and here is the after:
notice how the rows that are repeated are all next to each other, as opposed to just starting the dataframe over from scratch at the end.
Thanks

Try with array_repeat with struct function then explode the array.
Example:
df.show()
/*
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 2| 5|
| 3| 6|
+----+----+
*/
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.withColumn("arr",explode(array_repeat(struct(df.columns.head,df.columns.tail:_*),7))).
select("arr.*").
toDF("col1","col2").
show(100,false)
/*
+----+----+
|col1|col2|
+----+----+
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
+----+----+
*/

Here's a function which duplicates a DataFrame:
def repeatRows(df: DataFrame, numRepeats: Int): DataFrame = {
(1 until numRepeats).foldLeft(df)((growingDF, _) => growingDF.union(df))
}
The problem of having the resulting DataFrame sorted is separate from the duplication process, and hence wasn't included in the function, but can be easily achieved afterwards.
So let's take your problem:
// Problem setup
val someDF = Seq((1,4),(2,4),(3,6)).toDF("col1","col2")
// Duplicate followed by sort
val duplicatedSortedDF = repeatRows(someDF, 3).sort("col1")
// Show result
duplicatedSortedDF.show()
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 1| 4|
| 1| 4|
| 2| 4|
| 2| 4|
| 2| 4|
| 3| 6|
| 3| 6|
| 3| 6|
+----+----+
And there you have it.

Related

Iterate Over a Dataframe as each time column is passing to do transformation

I have a dataframe with 100 columns and col names like col1, col2, col3.... I want to apply certain transformation on the values of columns based on condition matches. I can store the column names in a array of string. And pass the value each element of the array in withColumn and based on When condition i can transform the values of the column vertically.
But the question is, as Dataframe is immutable, so each updated version is need to store in a new variable and also new dataframe need to pass in withColumn to transform for next iteration.
Is there any way to create array of dataframe so that new dataframe can be stored as a element of array and it can iterate based on the value of iterator.
Or is there any other way to handle the same.
var arr_df : Array[DataFrame] = new Array[DataFrame](60)
--> This throws error "not found type DataFrame"
val df(0) = df1.union(df2)
for(i <- 1 to 99){
val df(i) = df(i-1).withColumn(col(i), when(col(i)> 0, col(i) +
1).otherwise(col(i)))
Here col(i) is an array of strings that stores the name of the columns of the original datframe .
As a example :
scala> val original_df = Seq((1,2,3,4),(2,3,4,5),(3,4,5,6),(4,5,6,7),(5,6,7,8),(6,7,8,9)).toDF("col1","col2","col3","col4")
original_df: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 2 more fields]
scala> original_df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 3| 4|
| 2| 3| 4| 5|
| 3| 4| 5| 6|
| 4| 5| 6| 7|
| 5| 6| 7| 8|
| 6| 7| 8| 9|
+----+----+----+----+
I want to iterate 3 columns : col1, col2, col3 if the value of that column is greater than 3, then it will be updated by +1
Check below code.
scala> df.show(false)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |4 |
|2 |3 |4 |5 |
|3 |4 |5 |6 |
|4 |5 |6 |7 |
|5 |6 |7 |8 |
|6 |7 |8 |9 |
+----+----+----+----+
scala> val requiredColumns = df.columns.zipWithIndex.filter(_._2 < 3).map(_._1).toSet
requiredColumns: scala.collection.immutable.Set[String] = Set(col1, col2, col3)
scala> val allColumns = df.columns
allColumns: Array[String] = Array(col1, col2, col3, col4)
scala> val columnExpr = allColumns.filterNot(requiredColumns(_)).map(col(_)) ++ requiredColumns.map(c => when(col(c) > 3, col(c) + 1).otherwise(col(c)).as(c))
scala> df.select(columnExpr:_*).show(false)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |4 |
|2 |3 |5 |5 |
|3 |5 |6 |6 |
|5 |6 |7 |7 |
|6 |7 |8 |8 |
|7 |8 |9 |9 |
+----+----+----+----+
If I understand you right, you are trying to do a dataframe wise operation. you dont need to iterate for this . I can show you how it can be done in pyspark. probably it can be taken over in scala.
from pyspark.sql import functions as F
tst= sqlContext.createDataFrame([(1,7,0),(1,8,4),(1,0,10),(5,1,90),(7,6,0),(0,3,11)],schema=['col1','col2','col3'])
expr = [F.when(F.col(coln)>3,F.col(coln)+1).otherwise(F.col(coln)).alias(coln) for coln in tst.columns if 'col3' not in coln]
tst1= tst.select(*expr)
results:
tst1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 8|
| 1| 9|
| 1| 0|
| 6| 1|
| 8| 7|
| 0| 3|
+----+----+
This should give you the desired result
You can iterate over all columns and apply the condition in single line as below,
original_df.select(original_df.columns.map(c => (when(col(c) > lit(3), col(c)+1).otherwise(col(c))).alias(c)):_*).show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 3| 5|
| 2| 3| 5| 6|
| 3| 5| 6| 7|
| 5| 6| 7| 8|
| 6| 7| 8| 9|
| 7| 8| 9| 10|
+----+----+----+----+
You can use foldLeft whenever you want to make changes on multiple columns as below
val original_df = Seq(
(1,2,3,4),
(2,3,4,5),
(3,4,5,6),
(4,5,6,7),
(5,6,7,8),
(6,7,8,9)
).toDF("col1","col2","col3","col4")
//Filter the columns that yuou want to update
val columns = original_df.columns
columns.foldLeft(original_df){(acc, colName) =>
acc.withColumn(colName, when(col(colName) > 3, col(colName) + 1).otherwise(col(colName)))
}
.show(false)
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |5 |
|2 |3 |5 |6 |
|3 |5 |6 |7 |
|5 |6 |7 |8 |
|6 |7 |8 |9 |
|7 |8 |9 |10 |
+----+----+----+----+

Process multiple dataframes in parallel Scala

I am a newbie in Scala-Spark. I have a dataframe like the one below that I need to split into different chunks of data based into a group ID and process them independently in parallel.
+----+-------+-----+-------+
|user|feature|value|groupID
+----+-------+-----+-------+
| 1| 100| 1| A|
| 2| 20B| 0| B|
| 3| 30A| 1| B|
| 4| 40A| 1| B|
| 5| 50A| 1| A|
| 6| 10A| 0| B|
| 7| 200| 1| A|
| 8| 30B| 1| B|
| 9| 400| 0| A|
| 10| 50C| 0| A|
+----+-------+-----+-------+
1 Step I need to split it to have two different df like these ones: I can user a filter for this. But I am not sure if (due to the large number of different dataframes they will produce) I should save them into ADLS as parquets or keep them in memory.
+----+-------+-----+-------+
|user|feature|value|groupID
+----+-------+-----+-------+
| 1| 100| 1| A|
| 5| 50A| 1| A|
| 7| 200| 1| A|
| 9| 400| 0| A|
| 10| 50C| 0| A|
+----+-------+-----+-------+
+----+-------+-----+-------+
|user|feature|value|groupID
+----+-------+-----+-------+
| 2| 20B| 0| B|
| 3| 30A| 1| B|
| 4| 40A| 1| B|
| 6| 10A| 0| B|
| 8| 30B| 1| B|
+----+-------+-----+-------+
2 Step Process independently each dataframe in a parallel fashion and get independent processed dataframes.
To give some context:
The number of groupIds will be high therefore they cannot be hardcoded.
The processing of each dataframe would ideally happen in parallel.
I ask for a brief idea on how to proceed: I have seen .par.foreach (but is not clear to me how to apply this on a dynamic number of dataframes and how to store them independently nor if the best efficient way)
Check below code.
scala> df.show(false)
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|1 |100 |1 |A |
|2 |20B |0 |B |
|3 |30A |1 |B |
|4 |40A |1 |B |
|5 |50A |1 |A |
|6 |10A |0 |B |
|7 |200 |1 |A |
|8 |30B |1 |B |
|9 |400 |0 |A |
|10 |50C |0 |A |
+----+-------+-----+-------+
Get distinct groupid values from dataframe.
scala> val groupIds = df.select($"groupID").distinct.as[String].collect // Get distinct group ids.
groupIds: Array[String] = Array(B, A)
Use .par for parallel process. You need add your logic inside map.
scala> groupIds.par.map(groupid => df.filter($"groupId" === lit(groupid))).foreach(_.show(false)) // here you might need add your logic to save or any other inside map function not foreach.., for example I have added logic to show dataframe content in foreach.
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|2 |20B |0 |B |
|3 |30A |1 |B |
|4 |40A |1 |B |
|6 |10A |0 |B |
|8 |30B |1 |B |
+----+-------+-----+-------+
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|1 |100 |1 |A |
|5 |50A |1 |A |
|7 |200 |1 |A |
|9 |400 |0 |A |
|10 |50C |0 |A |
+----+-------+-----+-------+

how to rename the Columns Produced by count() function in Scala

I have the below df:
+------+-------+--------+
|student| vars|observed|
+------+-------+--------+
| 1| ABC | 19|
| 1| ABC | 1|
| 2| CDB | 1|
| 1| ABC | 8|
| 3| XYZ | 3|
| 1| ABC | 389|
| 2| CDB | 946|
| 1| ABC | 342|
|+------+-------+--------+
I wanted to add a new frequency column groupBy two columns "student", "vars" in SCALA.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies BUT losing observed column from the df.
I would like to create a new df as follows without losing "observed" column
+------+-------+--------+------------+
|student| vars|observed|total_count |
+------+-------+--------+------------+
| 1| ABC | 9|22
| 1| ABC | 1|22
| 2| CDB | 1|7
| 1| ABC | 2|22
| 3| XYZ | 3|3
| 1| ABC | 8|22
| 2| CDB | 6|7
| 1| ABC | 2|22
|+------+-------+-------+--------------+
You cannot do this directly but there are couple of ways,
You can join original df with count df. check here
You collect the observed column while doing aggregation and explode it again
With explode:
val frequency = df.groupBy("student", "vars").agg(collect_list("observed").as("observed_list"),count("*").as("total_count")).select($"student", $"vars",explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
We can use Window functions as well
val windowSpec = Window.partitionBy("student","vars")
val frequency = df.withColumn("total_count", count(col("student")) over windowSpec)
.show
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+

Incremental addition with condition in dataframe [duplicate]

This question already has answers here:
How to calculate cumulative sum using sqlContext
(4 answers)
Closed 4 years ago.
I have a DataFrame like this :
finalSondDF.show()
+---------------+------------+----------------+
|webService_Name|responseTime|numberOfSameTime|
+---------------+------------+----------------+
| webservice1| 80| 1|
| webservice1| 87| 2|
| webservice1| 283| 1|
| webservice2| 77| 2|
| webservice2| 80| 1|
| webservice2| 81| 1|
| webservice3| 63| 3|
| webservice3| 145| 1|
| webservice4| 167| 1|
| webservice4| 367| 2|
| webservice4| 500| 1|
+---------------+------------+----------------+
and I want to get a result like this :
+---------------+------------+----------------+------+
|webService_Name|responseTime|numberOfSameTime|Result|
+---------------+------------+----------------+------+
| webservice1| 80| 1| 1|
| webservice1| 87| 2| 3| ==> 2+1
| webservice1| 283| 1| 4| ==> 1+2+1
| webservice2| 77| 2| 2|
| webservice2| 80| 1| 3| ==> 2+1
| webservice2| 81| 1| 4| ==> 2+1+1
| webservice3| 63| 3| 3|
| webservice3| 145| 1| 4| ==> 3+1
| webservice4| 167| 1| 1|
| webservice4| 367| 2| 3| ==> 1+2
| webservice4| 500| 1| 4| ==> 1+2+1
+---------------+------------+----------------+------+
here the result is the sum of numberOfSameTime inferior of the current responseTime
I can't find a logic to do that. Can any one help me !!
If your data is in increasing order with responseTime column for each group of webService_Name column then you can benefit from cumulative sum using Window function as below
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("webService_Name").orderBy("responseTime")
import org.apache.spark.sql.functions._
df.withColumn("Result", sum("numberOfSameTime").over(windowSpec)).show(false)
and you should have
+---------------+------------+----------------+------+
|webService_Name|responseTime|numberOfSameTime|Result|
+---------------+------------+----------------+------+
|webservice1 |80 |1 |1 |
|webservice1 |87 |2 |3 |
|webservice1 |283 |1 |4 |
|webservice2 |80 |1 |3 |
|webservice2 |81 |1 |4 |
|webservice2 |77 |2 |2 |
|webservice3 |145 |1 |4 |
|webservice3 |63 |3 |3 |
|webservice4 |167 |1 |1 |
|webservice4 |367 |2 |3 |
|webservice4 |500 |1 |4 |
+---------------+------------+----------------+------+
Note that the responseTime as to be number type and in increasing order for each webService_Name for the above case to work
You can use Window function available in spark and calculate the cumulative sum as below.
//dummy data
val d1 = spark.sparkContext.parallelize(Seq(
("webservice1", 80, 1),
("webservice1", 87, 2),
("webservice1", 283, 1),
("webservice2", 77, 2),
("webservice2", 80, 1),
("webservice2", 81, 1),
("webservice3", 63, 3),
("webservice3", 145, 1),
("webservice4", 167, 1),
("webservice4", 367, 2),
("webservice4", 500, 1)
)).toDF("webService_Name","responseTime","numberOfSameTime")
//window functionn
val window = Window.partitionBy("webService_Name").orderBy($"webService_Name")
.rowsBetween(Long.MinValue, 0)
// create new column for Result
d1.withColumn("Result", sum("numberOfSameTime").over(window)).show(false)
Output:
+---------------+------------+----------------+------+
|webService_Name|responseTime|numberOfSameTime|Result|
+---------------+------------+----------------+------+
|webservice4 |167 |1 |1 |
|webservice4 |367 |2 |3 |
|webservice4 |500 |1 |4 |
|webservice2 |77 |2 |2 |
|webservice2 |80 |1 |3 |
|webservice2 |81 |1 |4 |
|webservice3 |63 |3 |3 |
|webservice3 |145 |1 |4 |
|webservice1 |80 |1 |1 |
|webservice1 |87 |2 |3 |
|webservice1 |283 |1 |4 |
+---------------+------------+----------------+------+
Hope this helps!

apply an aggregate result to all ungrouped rows of a dataframe in spark

assume there is a dataframe as follows:
machine_id | value
1| 5
1| 3
2| 6
2| 9
2| 14
I want to produce a final dataframe like this
machine_id | value | diff
1| 5| 1
1| 3| -1
2| 6| -4
2| 10| 0
2| 14| 4
the values in "diff" column is computed as groupBy($"machine_id").avg($"value") - value.
note that the avg for machine_id==1 is (5+3)/2 = 4 and for machine_id ==2 is (6+10+14)/3 = 10
What is the best way to produce such a final dataframe in Apache Spark?
You can use Window function to get the desired output
Given the dataframe as
+----------+-----+
|machine_id|value|
+----------+-----+
|1 |5 |
|1 |3 |
|2 |6 |
|2 |10 |
|2 |14 |
+----------+-----+
You can use following code
df.withColumn("diff", avg("value").over(Window.partitionBy("machine_id")))
.withColumn("diff", 'value - 'diff)
to get the final result as
+----------+-----+----+
|machine_id|value|diff|
+----------+-----+----+
|1 |5 |1.0 |
|1 |3 |-1.0|
|2 |6 |-4.0|
|2 |10 |0.0 |
|2 |14 |4.0 |
+----------+-----+----+