How to implement uniqueConcatenate and uniqueCount in Spark Scala [closed]

I am trying to transform some data; the older code is in TIBCO and uses the uniqueConcatenate and uniqueCount functions.
I am not sure how to achieve the same output in Spark Scala.
uniqueConcatenate Example:
uniqueCount Example:
I tried to use collect_set, but I need the result over a partition by other columns, and that does not seem to work for me.
Please help me here!

For uniqueConcatenate you can use the collect_set() function, which aggregates a column into a set of distinct values.
For example:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._

case class Record(col1: Option[Int] = None, col2: Option[Int] = None, col3: Option[Int] = None)

val df: DataFrame = Seq(
  Record(Some(1), Some(1), Some(1)),
  Record(Some(1), None, Some(3)),
  Record(Some(1), Some(3), Some(3))
).toDF("col1", "col2", "col3")
df.show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| 1|
| 1|null| 3|
| 1| 3| 3|
+----+----+----+
*/
df.agg(
  concat_ws(". ", collect_set("col1")).as("col1"),
  concat_ws(". ", collect_set("col2")).as("col2"),
  concat_ws(". ", collect_set("col3")).as("col3")
).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1|1. 3|1. 3|
+----+----+----+
*/
For uniqueCount, you can use countDistinct in a similar way:
import org.apache.spark.sql.functions.countDistinct

df.agg(
  countDistinct("col1").as("col1"),
  countDistinct("col2").as("col2"),
  countDistinct("col3").as("col3")
).show()
/*
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 2|
+----+----+----+
*/
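Since the question mentions needing these values over a partition by other columns, both can also be computed as window functions. countDistinct is not supported over a window, but size(collect_set(...)) gives the same count. A minimal sketch over the df above, partitioning on col1 purely for illustration:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, concat_ws, size}

// One window partition per distinct col1 value (stand-in for your real grouping columns)
val byCol1 = Window.partitionBy("col1")

df.withColumn("col2_uniqueConcatenate", concat_ws(". ", collect_set($"col2").over(byCol1)))
  .withColumn("col2_uniqueCount", size(collect_set($"col2").over(byCol1)))
  .show()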

Related

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like below -
I am trying to create another dataframe from this which has 2 columns - the column name and the sum of the values in each column - like this -
So far, I've tried this (in Spark 2.2.0), but it throws a stack trace -
// Note: this fails because a DataFrame (df) cannot be used inside a UDF,
// which runs on the executors rather than the driver.
val get_count: (String => Long) = (c: String) => {
  df.groupBy("id")
    .agg(sum(c) as "s")
    .select("s")
    .collect()(0)
    .getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives of accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq(
  (1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
  (0, 1, 0, 6, 0), (0, 1, 0, 4, 3)
).toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail: _*) may seem strange. The idea is simply to compute all the aggregations collected in aggs. One would expect something simpler, like df.agg(aggs: _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*)
presumably to ensure that at least one column is used, and this is why you need to split aggs into aggs.head and aggs.tail.
What I do is define a method that creates a struct from the desired values:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

def kv(columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This function receives a list of columns to transpose (your last 3 columns, in your case) and transforms each of them into a struct with the column name as the key and the column value as the value.
Then use that method to create the structs and process them as you want:
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previously defined function to turn the desired columns into that struct, then deconstruct the struct so each row has a key column and a value column.
Finally, group by the column name and sum the values.
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

IF function in spark gives error [duplicate]

This question already has answers here:
Apache Spark, add an "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame
(4 answers)
Closed 4 years ago.
I am using the below code but it gives an error. Kindly guide.
val a = Seq(
  ("ram,shyam,hari", 12, 10),
  ("shyam,ram,hari", 3, 5)
).toDF("name", "id", "dt")
  .withColumn("newcol", if($"id">$"dt",0,1))  // `if(cond, a, b)` is not valid Scala, hence the parse error below
  .show()
Error is as follows,
:14: error: ')' expected but ',' found.
.withColumn("newcol",if($"id">$"dt",0,1)).show()
You need when.otherwise:
val df = Seq(("ram,shyam,hari",12,10),("shyam,ram,hari",3,5)).toDF("name","id","dt")
df.withColumn("newcol", when($"id" > $"dt", 0).otherwise(1)).show
//+--------------+---+---+------+
//| name| id| dt|newcol|
//+--------------+---+---+------+
//|ram,shyam,hari| 12| 10| 0|
//|shyam,ram,hari| 3| 5| 1|
//+--------------+---+---+------+
Or you can cast the comparison result to int:
df.withColumn("newcol", ($"id" <= $"dt").cast("int")).show
//+--------------+---+---+------+
//| name| id| dt|newcol|
//+--------------+---+---+------+
//|ram,shyam,hari| 12| 10| 0|
//|shyam,ram,hari| 3| 5| 1|
//+--------------+---+---+------+
Use when / otherwise:
import org.apache.spark.sql.functions.when
df.withColumn("newcol", when($"id" > $"dt", 0).otherwise(1))

minBy equivalent in Spark Dataframe [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I'm looking for the equivalent of the minBy aggregate in a Spark DataFrame, or I may need to aggregate manually. Any thoughts? Thanks.
https://prestodb.io/docs/current/functions/aggregate.html#min_by
There is no direct function to get the 'min_by' values from a DataFrame.
It is a two-stage operation in Spark: first group by the column, then apply the min function to get the minimum value of each numeric column for each group.
scala> val inputDF = Seq(("a", 1),("b", 2), ("b", 3), ("a", 4), ("a", 5)).toDF("id", "count")
inputDF: org.apache.spark.sql.DataFrame = [id: string, count: int]
scala> inputDF.show()
+---+-----+
| id|count|
+---+-----+
| a| 1|
| b| 2|
| b| 3|
| a| 4|
| a| 5|
+---+-----+
scala> inputDF.groupBy($"id").min("count").show()
+---+----------+
| id|min(count)|
+---+----------+
| b| 2|
| a| 1|
+---+----------+
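If, like Presto's min_by, you also need the value of another column at the row where the minimum occurs, a common workaround is to take the min of a struct whose first field is the ordering column. This is only a sketch; the payload column is hypothetical:
import org.apache.spark.sql.functions.{min, struct}

val withPayload = Seq(("a", 1, "x"), ("b", 2, "y"), ("b", 3, "z"), ("a", 4, "w"), ("a", 5, "v"))
  .toDF("id", "count", "payload")

// Structs compare field by field, so the minimum struct is the one with the smallest count;
// its payload is then the min_by value.
withPayload
  .groupBy($"id")
  .agg(min(struct($"count", $"payload")).as("m"))
  .select($"id", $"m.count".as("min_count"), $"m.payload".as("min_by_payload"))
  .show()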

How to use DataFrame Window expressions and withColumn and not to change partition?

For some reason I have to convert an RDD to a DataFrame, then do something with the DataFrame.
My interface is an RDD, so I have to convert the DataFrame back to an RDD, and when I use df.withColumn the partition count changes to 1, so I have to repartition and sortBy the RDD.
Is there any cleaner solution?
This is my code :
val rdd = sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val partition = rdd.getNumPartitions
println(partition + "rdd")
val df=rdd.toDF()
val rdd2=df.rdd
val result = rdd.toDF("col1")
  .withColumn("csum", sum($"col1").over(Window.orderBy($"col1")))
  .withColumn("rownum", row_number().over(Window.orderBy($"col1")))
  .withColumn("avg", $"csum" / $"rownum").rdd
println(result.getNumPartitions + "rdd2")
Let's make this as simple as possible: we will generate the same data in 4 partitions.
scala> val df = spark.range(1,9,1,4).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
+---+
scala> df.rdd.getNumPartitions
res13: Int = 4
We don't need 3 window functions to prove this, so let's do it with one:
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val df2 = df.withColumn("csum", sum($"id").over(Window.orderBy($"id")))
df2: org.apache.spark.sql.DataFrame = [id: bigint, csum: bigint]
So what's happening here is that we didn't just add a column; we computed a cumulative sum window over the data, and since you haven't provided a partitioning column, the window function moves all the data to a single partition. You even get a warning from Spark:
scala> df2.rdd.getNumPartitions
17/06/06 10:05:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
res14: Int = 1
scala> df2.show
17/06/06 10:05:56 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+----+
| id|csum|
+---+----+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
| 6| 21|
| 7| 28|
| 8| 36|
+---+----+
So let's now add a column to partition on. We will create a new DataFrame just for the sake of demonstration:
scala> val df3 = df.withColumn("x", when($"id"<5,lit("a")).otherwise("b"))
df3: org.apache.spark.sql.DataFrame = [id: bigint, x: string]
It indeed has the same number of partitions that we defined explicitly on df:
scala> df3.rdd.getNumPartitions
res18: Int = 4
Let's perform our window operation, using the column x to partition:
scala> val df4 = df3.withColumn("csum", sum($"id").over(Window.orderBy($"id").partitionBy($"x")))
df4: org.apache.spark.sql.DataFrame = [id: bigint, x: string ... 1 more field]
scala> df4.show
+---+---+----+
| id| x|csum|
+---+---+----+
| 5| b| 5|
| 6| b| 11|
| 7| b| 18|
| 8| b| 26|
| 1| a| 1|
| 2| a| 3|
| 3| a| 6|
| 4| a| 10|
+---+---+----+
The window function will repartition our data using the default shuffle partition count set in the Spark configuration.
scala> df4.rdd.getNumPartitions
res20: Int = 200
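That 200 is the default value of spark.sql.shuffle.partitions. If the fan-out is a concern, one option (a sketch, assuming the same spark session and df3 as above) is to lower that setting before running the window:
// Reduce the number of shuffle partitions used by the window's exchange
spark.conf.set("spark.sql.shuffle.partitions", "4")

val df5 = df3.withColumn("csum", sum($"id").over(Window.partitionBy($"x").orderBy($"id")))
df5.rdd.getNumPartitions  // expected to be 4 now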
I was just reading about controlling the number of partitions when using groupBy aggregation (https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html), and it seems the same trick works with Window. In my code I'm defining a window like
windowSpec = Window \
.partitionBy('colA', 'colB') \
.orderBy('timeCol') \
.rowsBetween(1, 1)
and then doing
next_event = F.lead('timeCol', 1).over(windowSpec)
and creating a dataframe via
df2 = df.withColumn('next_event', next_event)
and indeed, it has 200 partitions. But, if I do
df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)
it has 10!
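For reference, a rough Scala equivalent of that last trick (assuming a DataFrame df with the same colA, colB and timeCol columns; this is a sketch of the idea, not a tested snippet for a specific dataset):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

val windowSpec = Window.partitionBy("colA", "colB").orderBy("timeCol")

// Repartitioning on the window's partition columns first keeps the partition count at 10
val df2 = df
  .repartition(10, col("colA"), col("colB"))
  .withColumn("next_event", lead(col("timeCol"), 1).over(windowSpec))

df2.rdd.getNumPartitions  // 10 rather than the default 200, as reported above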

How to use Spark ML ALS algorithm? [duplicate]

Is it possible to factorize a Spark DataFrame column? By factorizing I mean mapping each unique value in the column to a consistent ID.
Example, the original dataframe:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| A|
|1473492972|4060600988513370| A|
|1473509764|4060600988513370| B|
|1473513432|4060600988513370| C|
|1473513432|4060600988513370| A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| 0|
|1473492972|4060600988513370| 0|
|1473509764|4060600988513370| 1|
|1473513432|4060600988513370| 2|
|1473513432|4060600988513370| 0|
+----------+----------------+--------------------+
In Scala itself it would be fairly simple, but since Spark distributes its dataframes over nodes I'm not sure how to keep a mapping from A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading one entire column into the memory of a single machine might not be possible.
Can it be done?
You can use StringIndexer to encode letters into indices:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")

val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
| col1| col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370| A| 0.0|
|1473492972|4060600988513370| A| 0.0|
|1473509764|4060600988513370| B| 1.0|
|1473513432|4060600988513370| C| 2.0|
|1473513432|4060600988513370| A| 0.0|
+----------+----------------+----+---------+
Data:
val df = spark.createDataFrame(Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A")
)).toDF("col1", "col2", "col3")
You can use a user-defined function.
First you create the mapping you need:
val updateFunction = udf { (x: String) =>
  x match {
    case "A" => 0
    case "B" => 1
    case "C" => 2
    case _   => 3
  }
}
And now you only have to apply it to your DataFrame:
df.withColumn("col3", updateFunction(df.col("col3")))