How to create a customised user-defined aggregate distinct function - Scala

I have a dataframe which contains 4 columns.
Dataframe sample
id1  id2  id3  id4
------------------
a1   a2   a3   a4
b1   b2   b3   b4
b1   b2   b3   b4
c1   c2   c3   c4
null b2   null null
c1   null null null
null null a3   null
null null null a4
c1   null null null
null null null d4
There are 2 types of rows: either all the columns have data, or only one column does.
I want to perform a distinct operation across all the columns such that, when comparing rows, only the values present in a row are compared and null values are ignored.
The output dataframe should be:
id1  id2  id3  id4
a1   a2   a3   a4
b1   b2   b3   b4
c1   c2   c3   c4
null null null d4
I have looked at multiple examples of UDAFs in Spark, but I have not been able to adapt them to this case.

You can filter on all the columns as below:
df.filter($"id1" =!= "" && $"id2" =!= "" && $"id3" =!= "" && $"id4" =!= "")
and you should get your final dataframe.
The above code is for a static four-column dataframe. With more than four columns this method becomes tedious, because you would have to write a separate check for every column.
A solution to that is to use a udf function as below:
import org.apache.spark.sql.functions._
import scala.collection.mutable
// true only when no cell in the row is null or empty
def checkIfNull = udf((co: mutable.WrappedArray[String]) => !(co.contains(null) || co.contains("")))
df.filter(checkIfNull(array(df.columns.map(col): _*))).show(false)
I hope the answer is helpful.

It is possible to solve this by exploiting the fact that dropDuplicates is order dependent (see the answer here). However, this is not very efficient; there should be a more efficient solution.
First remove all duplicates with distinct(), then iteratively order by each column and drop its duplicates. The columns are sorted in descending order so that nulls are placed last.
Example with four static columns:
val df2 = df.distinct()
  .orderBy($"id1".desc).dropDuplicates("id1")
  .orderBy($"id2".desc).dropDuplicates("id2")
  .orderBy($"id3".desc).dropDuplicates("id3")
  .orderBy($"id4".desc).dropDuplicates("id4")
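To make the intended comparison rule explicit, here is a minimal plain-Scala sketch with no Spark involved. It assumes a row can be modelled as a vector of optional cells, and that a row should be dropped when another row with more non-null cells agrees with it on every cell it has. The names `NullAwareDistinct`, `row`, `covers` and `distinctIgnoringNulls` are hypothetical, and the pairwise comparison is quadratic, so this illustrates the rule rather than a scalable solution.

```scala
// Plain-Scala illustration of the null-aware distinct rule (not Spark code).
object NullAwareDistinct {
  // None stands for a null cell.
  type Row = Vector[Option[String]]

  // Helper for building rows; an empty string is treated as a null cell.
  def row(cells: String*): Row =
    cells.map(c => if (c.isEmpty) None else Some(c)).toVector

  // `partial` is covered by `full` when every non-null cell of `partial`
  // equals the cell in the same position of `full`.
  def covers(full: Row, partial: Row): Boolean =
    full.zip(partial).forall {
      case (_, None)    => true
      case (f, present) => f == present
    }

  def distinctIgnoringNulls(rows: Seq[Row]): Seq[Row] = {
    val unique = rows.distinct
    // Keep a row unless some row with more non-null cells covers it.
    unique.filter { r =>
      !unique.exists(o => o.count(_.isDefined) > r.count(_.isDefined) && covers(o, r))
    }
  }
}
```

Applied to the sample rows from the question, this keeps the three complete rows and the row that contains only d4, because no complete row has d4 in its last column.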

Related

Roll up Hierarchical data in Spark

I am new to Spark and Scala and trying to figure out how to roll up hierarchical data. The input data map looks like this:
key value
e1  e2
e2  e3
e4  e2
e5  e4
e6  e4
I am trying to find a way to roll up the input map into a data frame like below:
id level1 level2 level3
e1 e2     e3     null
e2 e3     null   null
e3 null   null   null
e4 e2     null   null
e5 e4     e2     e3
e6 e4     e2     e3
I need to use Scala and Spark.
Any help will be appreciated.
Thanks
Since you are interested in 3 levels, you could perform a left join twice to achieve this. I first prepared df1 by using a union to collect all possible keys and an aggregate to obtain the first level, then renamed the columns.
// assumes: import org.apache.spark.sql.functions.{col, max}
// rename columns and ensure that all values and keys exist as ids
val df1 = df.union(df.selectExpr("value as key","NULL as value"))
  .groupBy("key")
  .agg(max("value").as("value"))
  .withColumnRenamed("key","id")
  .withColumnRenamed("value","level1")
// first repetition for level 2
// note: Column equality is ===; plain == would compare the Column objects themselves
val level2 = df1.alias("l1").join(
  df1.alias("l2"),
  col("l1.level1") === col("l2.id"),
  "left"
).selectExpr("l1.*","l2.level1 as level2")
// second repetition for level 3
val level3 = level2.alias("l1").join(
  df1.alias("l2"),
  col("l1.level2") === col("l2.id"),
  "left"
).selectExpr("l1.*","l2.level1 as level3")
level3.show()
The example above was written so that the pattern between each repetition is easy to identify, i.e. you could loop to your desired level or depth in Scala, e.g.:
// assumes: import org.apache.spark.sql.functions.{col, max}
val maxLevel = 3
// base lookup of id -> direct parent (level1); cached because it is joined repeatedly
val base = df.union(df.selectExpr("value as key","NULL as value"))
  .groupBy("key")
  .agg(max("value").as("value"))
  .withColumnRenamed("key","id")
  .withColumnRenamed("value","level1")
  .cache()
var result = base
for (depth <- 1 until maxLevel) {
  // the next level is the direct parent (l2.level1) of the current deepest level
  result = result.alias("l1").join(
    base.alias("l2"),
    col("l1.level" + depth) === col("l2.id"),
    "left"
  ).selectExpr("l1.*", "l2.level1 as level" + (depth + 1))
}
result.show()
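The repeated lookup that the joins above perform can also be sketched on an ordinary Scala Map, which may help when checking the expected levels by hand. This is a plain-Scala illustration without Spark, assuming each key has at most one parent; `RollUp`, `ancestors` and `rollUp` are hypothetical names.

```scala
// Plain-Scala illustration of rolling up a child -> parent map (not Spark code).
object RollUp {
  // Follow parent links from `id`, returning exactly `maxLevels` entries;
  // entries are None once the chain has ended.
  def ancestors(parentOf: Map[String, String], id: String, maxLevels: Int): Vector[Option[String]] =
    Iterator.iterate(parentOf.get(id))(_.flatMap(parentOf.get))
      .take(maxLevels)
      .toVector

  // Every key and every value becomes an id, as in the union step above.
  def rollUp(parentOf: Map[String, String], maxLevels: Int): Map[String, Vector[Option[String]]] = {
    val ids = parentOf.keySet ++ parentOf.values
    ids.map(id => id -> ancestors(parentOf, id, maxLevels)).toMap
  }
}
```

With the input map from the question, `RollUp.rollUp(input, 3)` gives e5 the chain e4, e2, e3. Note that following the links strictly gives e4 the chain e2, e3, whereas the table in the question lists e2, null for that row.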

How to build chain of segments in scope of pyspark dataframe

I have a huge pyspark dataframe with segments and their subsegments, like this:
SegmentId SubSegmentStart SubSegmentEnd
1 a1 a2
1 a2 a3
2 b1 b2
3 c1 c2
3 c3 c4
3 c2 c3
I need to group records by SegmentId and add a new Index column that orders the subsegments of each segment into a chain, using their start and end points.
So I need to get the following dataframe:
SegmentId SubSegmentStart SubSegmentEnd Index
1 a1 a2 0
1 a2 a3 1
2 b1 b2 0
3 c1 c2 0
3 c3 c4 2
3 c2 c3 1
How can I do this with PySpark?

How to add new columns to a dataframe in a loop using scala on Azure Databricks

I have imported a csv file into a dataframe in Azure Databricks using scala.
--------------
A B C D E
--------------
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
--------------
Now I want to hash some selected columns and add each result as a new column to that dataframe.
--------------------------------
A B B2 C D D2 E
--------------------------------
a1 b1 hash(b1) c1 d1 hash(d1) e1
a2 b2 hash(b2) c2 d2 hash(d2) e2
--------------------------------
This is the code I have:
val data_df = spark.read.format("csv").option("header", "true").option("sep", ",").load(input_file)
...
...
for (col <- columns) {
  if (columnMapping.keys.contains((col))){
    val newColName = col + "_token"
    // Now here I want to add a new column to data_df and the content would be hash of the current value
  }
}
// And here I would like to upload selected columns (B, B2, D, D2) to a SQL database
Any help will be highly appreciated.
Thank you!
Try this -
import org.apache.spark.sql.functions.{col, udf}
val colsToApplyHash = Array("B","D")
val hashFunction:String => String = <ACTUAL HASH LOGIC>
val hash = udf(hashFunction)
// fold over the column names, adding one hashed column per step
val finalDf = colsToApplyHash.foldLeft(data_df){
  case(acc,colName) => acc.withColumn(colName+"2",hash(col(colName)))
}
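The foldLeft pattern above can be tried without a cluster on a single row modelled as a Map. The following is a minimal plain-Scala sketch, not Spark code; SHA-256 is only an assumed stand-in for the unspecified hash logic, and `HashColumns`, `sha256Hex` and `addHashed` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Plain-Scala illustration of folding over column names (not Spark code).
object HashColumns {
  type Row = Map[String, String]

  // Assumed stand-in for the hash logic: hex-encoded SHA-256 of the UTF-8 bytes.
  def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // For each selected column present in the row, add a "<name>2" entry
  // holding the hash of its value; missing columns are skipped.
  def addHashed(row: Row, cols: Seq[String]): Row =
    cols.foldLeft(row) { (acc, name) =>
      acc.get(name).fold(acc)(v => acc + ((name + "2") -> sha256Hex(v)))
    }
}
```

For example, `HashColumns.addHashed(Map("A" -> "a1", "B" -> "b1"), Seq("B", "D"))` adds a B2 entry and leaves the row otherwise unchanged. If SHA-2 is acceptable in the Spark version, the built-in `sha2` function in `org.apache.spark.sql.functions` may be usable in place of a udf.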

Using Spark: binning column1 and finding the mean of column2 based on column1's bins

I am learning Apache Spark and the Scala language, so I would appreciate some help. I get 3 columns (c1, c2 and c3) by querying Cassandra and load them into a dataframe in the Scala code. I have to bin c1 (bin size = 3, in the statistical sense, as in a histogram) and find the mean of c2 and c3 within each c1 bin. Are there any pre-built functions that I can use for this instead of traditional for loops and if conditions?
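Try this. Note that it groups on the raw c1 value, so replace the key with c1's bin index if you need bins of size 3: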
val modifiedRDD = rdd.map{case(c1, c2, c3) => ((c1), (c2, c3, 1))}
val reducedRDD = modifiedRDD.reduceByKey{case(x, y) => (x._1+y._1, x._2+y._2, x._3+y._3)}
val finalRDD = reducedRDD.map{case((c1), (totalC2, totalC3, count)) => (c1, totalC2/count, totalC3/count)}
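Since the snippet above keys on c1 itself, here is a minimal plain-Scala sketch of the binning step on an ordinary collection. It assumes numeric columns and that a bin index is floor(c1 / binSize); `BinnedMeans` and `meansByBin` are hypothetical names, and the same key function could be used in the `map` that builds `modifiedRDD`.

```scala
// Plain-Scala illustration of binning c1 and averaging c2 and c3 per bin (not Spark code).
object BinnedMeans {
  // Group (c1, c2, c3) triples into c1-bins of width `binSize` and return,
  // per bin index, the mean of c2 and the mean of c3.
  def meansByBin(rows: Seq[(Double, Double, Double)], binSize: Double): Map[Long, (Double, Double)] =
    rows
      .groupBy { case (c1, _, _) => math.floor(c1 / binSize).toLong }
      .map { case (bin, group) =>
        val n = group.size.toDouble
        val meanC2 = group.map(_._2).sum / n
        val meanC3 = group.map(_._3).sum / n
        (bin, (meanC2, meanC3))
      }
}
```

With a bin size of 3, c1 values in [0, 3) fall into bin 0, values in [3, 6) into bin 1, and so on.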

Multiple level grouping in Crystal Reports

I have a report with 4 columns:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row2 A1 B1 C1 D2
Row3 A1 B1 C1 D1
Row4 A1 B1 C1 D2
Row5 A1 B1 C1 D1
I tried grouping on the 4 columns, but I got output with a blank space after every row.
In this report I would like to get the output as:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row2 A1 B1 C1 D2
<-------------an empty space ----------->
Row3 A1 B1 C1 D1
Row4 A1 B1 C1 D2
<-------------an empty space ----------->
Row5 A1 B1 C1 D1
How can I achieve the above output?
A standard group by would sort the record like this:
ColumnA|ColumnB|ColumnC|ColumnD
Row1 A1 B1 C1 D1
Row3 A1 B1 C1 D1
Row5 A1 B1 C1 D1
Row2 A1 B1 C1 D2
Row4 A1 B1 C1 D2
Since you don't have a standard grouping, another approach may work. You basically want a blank line after the D2 value. This will only work if you always have D2 values at the end of a group.
Create a new blank detail section under the main section
Detail one
A1 B1 C1 D1
Detail two
<blank>
Then put a conditional suppress expression on detail two
ColumnD <> "D2"
Then whenever D2 is present the blank detail section will be displayed.
You can use a Formula instead of a field Value for grouping.
select Column4
case D1 : "Group1"
case D2 : "Group2"
case D3 : "Group3"
case D4 : "Group3"
case D5 : "Group3"
case D6 : "Group4"
default "Group5"
Is that your problem?
The blank lines can be generated as a group footer.