Spark Count Large Number of Columns - scala

Ran into this a little while ago, and I think there should be a better/more efficient way of doing this:
I have a DF with about 70k columns and roughly 10k rows. Essentially, for each column I want a count of the rows where that column's value is 1.
df.columns.map( c => df.where(column(c)===1).count )
This works for a small number of columns, but in this case, the large number of columns causes the process to take hours and appears to iterate through each column and query the data.
What optimizations can I do to get to the results faster?

You can replace the value of each column with 1 or 0, depending on whether the column's previous value matches the condition, and then sum every column in a single aggregation. Afterwards you can collect the single row of the resulting DataFrame and turn it into an array.
So the code would be as follows:
import org.apache.spark.sql.functions.{col, lit, sum, when}
val aggregation_columns = df.columns.map(c => sum(col(c)))
df
  .columns
  .foldLeft(df)((acc, elem) => acc.withColumn(elem, when(col(elem) === 1, lit(1)).otherwise(lit(0))))
  .agg(aggregation_columns.head, aggregation_columns.tail: _*)
  .collect()
  .flatMap(row => df.columns.indices.map(i => row.getLong(i)))
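A minimal sketch of the same single-pass idea without the intermediate foldLeft/withColumn step, assuming df is the wide DataFrame from the question:

import org.apache.spark.sql.functions.{col, sum, when}

// One conditional-sum expression per column: 1 when the value is 1, otherwise 0.
val countExprs = df.columns.map(c => sum(when(col(c) === 1, 1).otherwise(0)).alias(c))

val counts: Array[Long] = df
  .agg(countExprs.head, countExprs.tail: _*)
  .collect()(0)                 // single result row
  .toSeq
  .map(_.asInstanceOf[Long])
  .toArray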

count_if counts all rows for which a condition holds. This SQL expression can be evaluated for all columns in a single pass:
val df = ...
df.selectExpr(df.columns.map(c => s"count_if($c = 1) as $c"): _*).show()
explain prints, for three columns a, b and c:
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(if (((a#10 = 1) = false)) null else (a#10 = 1)), count(if (((b#11 = 1) = false)) null else (b#11 = 1)), count(if (((c#12 = 1) = false)) null else (c#12 = 1))])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#13]
   +- *(1) HashAggregate(keys=[], functions=[partial_count(if (((a#10 = 1) = false)) null else (a#10 = 1)), partial_count(if (((b#11 = 1) = false)) null else (b#11 = 1)), partial_count(if (((c#12 = 1) = false)) null else (c#12 = 1))])
      +- *(1) LocalTableScan [a#10, b#11, c#12]

Related

Spark specify multiple logical condition in where clause of spark dataframe

While defining multiple logical/relational conditions in a Spark Scala DataFrame I get the error mentioned below, but the same thing works fine in PySpark.
Python code:
df2=df1.where(((col('a')==col('b')) & (abs(col('c')) <= 1))
| ((col('a')==col('fin')) & ((col('b') <= 3) & (col('c') > 1)) & (col('d') <= 500))
| ((col('a')==col('b')) & ((col('c') <= 15) & (col('c') > 3)) & (col('d') <= 200))
| ((col('a')==col('b')) & ((col('c') <= 30) & (col('c') > 15)) & (col('c') <= 100)))
Tried the Scala equivalent:
val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) & (abs(col("c") ) <= 1))
notebook:2: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) & (abs(col("c") ) <= 1))
How can I define multiple logical conditions in a Spark DataFrame using Scala?
eq returns a Boolean, <= returns a Column. They are incompatible.
You probably want this :
df.where((col("a") === col("b")) && (abs(col("c") ) <= 1))
=== is used for equality between columns and returns a Column, so we can use && to combine multiple conditions in the same where.
With Spark you should use
=== instead of == or eq (see explanation)
&& instead of & (&& is logical AND, & is bitwise AND)
val df_aqua_xentry_dtb_match = df_aqua_xentry.where((col("a") === col("b")) && (abs(col("c") ) <= 1))
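For reference, a sketch of the full Scala equivalent of the original Python filter, assuming the same column names (a, b, c, d, fin):

import org.apache.spark.sql.functions.{abs, col}

// Direct translation of the PySpark conditions: === for column equality,
// && for logical AND, || for logical OR.
val df2 = df1.where(
  ((col("a") === col("b"))   && (abs(col("c")) <= 1)) ||
  ((col("a") === col("fin")) && (col("b") <= 3)  && (col("c") > 1)  && (col("d") <= 500)) ||
  ((col("a") === col("b"))   && (col("c") <= 15) && (col("c") > 3)  && (col("d") <= 200)) ||
  ((col("a") === col("b"))   && (col("c") <= 30) && (col("c") > 15) && (col("c") <= 100))
)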
Please see the below solution.
df.where("StudentId == 1").explain(true)
== Parsed Logical Plan ==
'Filter ('StudentId = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
+- LocalRelation [_1#3, _2#4, _3#5]
== Analyzed Logical Plan ==
StudentId: int, SubjectName: string, Marks: int
Filter (StudentId#7 = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
+- LocalRelation [_1#3, _2#4, _3#5]
== Optimized Logical Plan ==
LocalRelation [StudentId#7, SubjectName#8, Marks#9]
Here we used a where clause; internally the optimizer converted it to a filter operation, even though we wrote where at the code level.
So we can apply a filter function on the rows of the DataFrame, like below:
df.filter(row => row.getString(1) == "A" && row.getInt(0) == 1).show()
Here 0 and 1 are column indexes of the DataFrame. In my case the schema is (StudentId: Int, SubjectName: String, Marks: Int).
There are a few issues with your Scala version of the code.
"eq" compares object references in Scala (the equivalent of Java's ==), so when you compare two Columns using "eq" it returns a Boolean instead of a Column. Use the "===" operator for Column comparison.
String comparison
scala> "praveen" eq "praveen"
res54: Boolean = true
scala> "praveen" eq "nag"
res55: Boolean = false
scala> lit(1) eq lit(2)
res56: Boolean = false
scala> lit(1) eq lit(1)
res57: Boolean = false
Column comparison
scala> lit(1) === lit(2)
res58: org.apache.spark.sql.Column = (1 = 2)
scala> lit(1) === lit(1)
19/08/02 14:00:40 WARN Column: Constructing trivially true equals predicate, '1 = 1'. Perhaps you need to use aliases.
res59: org.apache.spark.sql.Column = (1 = 1)
You are using the bitwise AND operator "&" instead of the "and"/"&&" operator for the Column type. This is the reason you were getting the above error (it was expecting a Boolean instead of a Column).
scala> df.show
+---+---+
| id|id1|
+---+---+
| 1| 2|
+---+---+
scala> df.where((col("id") === col("id1")) && (abs(col("id")) > 2)).show
+---+---+
| id|id1|
+---+---+
+---+---+
scala> df.where((col("id") === col("id1")) and (abs(col("id")) > 2)).show
+---+---+
| id|id1|
+---+---+
+---+---+
Hope this helps !

Spark/Scala repeated creation of DataFrames using the same function on different data subsets

My current code repeatedly creates new DataFrames (df_1, df_2, df_3) using the same function, but applied to different subsets of the original DataFrame df (e.g. where("category == 1")).
I would like to create a function that can automate the creation of these DataFrames.
In the following example, my DataFrame df has three columns: "category", "id", and "amount". Assume I have 10 categories. For each category, I want to sum the incoming (positive) amounts and count the outgoing (negative) amounts per "id":
import org.apache.spark.sql.functions.{col, count, sum, when}

val df_1 = df.where("category == 1")
  .groupBy("id")
  .agg(sum(when(col("amount") > 0, col("amount"))).alias("total_incoming_cat_1"),
       count(when(col("amount") < 0, col("amount"))).alias("total_outgoing_cat_1"))
val df_2 = df.where("category == 2")
  .groupBy("id")
  .agg(sum(when(col("amount") > 0, col("amount"))).alias("total_incoming_cat_2"),
       count(when(col("amount") < 0, col("amount"))).alias("total_outgoing_cat_2"))
val df_3 = df.where("category == 3")
  .groupBy("id")
  .agg(sum(when(col("amount") > 0, col("amount"))).alias("total_incoming_cat_3"),
       count(when(col("amount") < 0, col("amount"))).alias("total_outgoing_cat_3"))
I would like something like this:
def new_dfs(L:List, df:DataFrame): DataFrame={
for l in L{
val df_+l df.filter($amount == l)
.groupBy("id")
.agg(sum(when(col("amount") > 0, (col("amount")))).alias("total_incoming_cat_"+l),
count(when(col("amount") < 0, (col("amount")))).alias("total_outgoing_cat_"+l))
}
}
Is it not better to group by category and id?
df
  .groupBy("category", "id")
  .agg(sum(when(col("amount") > 0, col("amount"))).alias("total_incoming_cat"),
       count(when(col("amount") < 0, col("amount"))).alias("total_outgoing_cat"))
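If you still need one pair of columns per category, like the df_1/df_2/df_3 output above, a pivot sketch (assuming the categories are the integers 1 to 10) could look like this:

import org.apache.spark.sql.functions.{col, count, sum, when}

// One aggregation over the whole DataFrame, then the categories pivoted into
// columns named like "1_total_incoming_cat", "1_total_outgoing_cat", ...
val summary = df
  .groupBy("id")
  .pivot("category", 1 to 10)
  .agg(sum(when(col("amount") > 0, col("amount"))).alias("total_incoming_cat"),
       count(when(col("amount") < 0, col("amount"))).alias("total_outgoing_cat"))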

Iterate rows and columns in Spark dataframe

I have the following Spark dataframe that is created dynamically:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sf1 = StructField("name", StringType, nullable = true)
val sf2 = StructField("sector", StringType, nullable = true)
val sf3 = StructField("age", IntegerType, nullable = true)
val fields = List(sf1,sf2,sf3)
val schema = StructType(fields)
val row1 = Row("Andy","aaa",20)
val row2 = Row("Berta","bbb",30)
val row3 = Row("Joe","ccc",40)
val data = Seq(row1,row2,row3)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
Now I need to iterate over each row and column in sqlDF to print each column; this is my attempt:
sqlDF.foreach { row =>
  row.foreach { col => println(col) }
}
row is of type Row, which is not iterable; that's why this code throws a compilation error on row.foreach. How can I iterate over each column in a Row?
Consider you have a Dataframe like below
+-----+------+---+
| name|sector|age|
+-----+------+---+
| Andy| aaa| 20|
|Berta| bbb| 30|
| Joe| ccc| 40|
+-----+------+---+
To loop over your DataFrame and extract its elements, you can choose one of the approaches below.
Approach 1 - Loop using foreach
Looping over a DataFrame's typed fields directly with a foreach loop is not possible. To do this, first define the schema of the DataFrame using a case class, and then convert the DataFrame to that type.
import spark.implicits._
import org.apache.spark.sql._
case class cls_Employee(name:String, sector:String, age:Int)
val df = Seq(cls_Employee("Andy","aaa", 20), cls_Employee("Berta","bbb", 30), cls_Employee("Joe","ccc", 40)).toDF()
df.as[cls_Employee].take(df.count.toInt).foreach(t => println(s"name=${t.name},sector=${t.sector},age=${t.age}"))
Approach 2 - Loop using rdd
Use rdd.collect on top of your DataFrame. The row variable will contain each row of the DataFrame as an RDD Row. To get the elements of a row, use row.mkString(","), which gives the row's values as comma-separated text; using the split function (an inbuilt function) you can then access each column value of the RDD row by index.
for (row <- df.rdd.collect) {
  val name = row.mkString(",").split(",")(0)
  val sector = row.mkString(",").split(",")(1)
  val age = row.mkString(",").split(",")(2)
}
Note that there are two drawbacks to this approach.
1. If there is a , in a column value, the data will be wrongly split across adjacent columns.
2. rdd.collect is an action that returns all the data to the driver's memory, and the driver's memory might not be large enough to hold it, causing the application to fail.
I would recommend using Approach 1.
Approach 3 - Using where and select
You can directly use where and select, which will internally loop over and find the data. To avoid an IndexOutOfBoundsException when no row matches, an if condition is used:
if(df.where($"name" === "Andy").select(col("name")).collect().length >= 1)
name = df.where($"name" === "Andy").select(col("name")).collect()(0).get(0).toString
Approach 4 - Using temp tables
You can register the DataFrame as a temp table in Spark's catalog. Then you can run a select query against it, as in any other database, and collect the result into a variable:
df.registerTempTable("student")
name = sqlContext.sql("select name from student where name='Andy'").collect()(0).toString().replace("[","").replace("]","")
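On Spark 2.x and later, registerTempTable is deprecated; a sketch of the same idea using the temp-view API already used earlier in the question:

// Same approach with the non-deprecated temp-view API.
df.createOrReplaceTempView("student")
val name = spark.sql("select name from student where name = 'Andy'")
  .collect()(0).getString(0)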
You can convert a Row to a Seq with toSeq. Once turned into a Seq you can iterate over it as usual with foreach, map, or whatever you need:
sqlDF.foreach { row =>
  row.toSeq.foreach { col => println(col) }
}
Output:
Berta
bbb
30
Joe
Andy
aaa
20
ccc
40
You should use mkString on your Row:
sqlDF.foreach { row =>
  println(row.mkString(","))
}
But note that this will be printed inside the executors' JVMs, so normally you won't see the output (unless you run with master = local).
sqlDF.foreach did not work for me, but Approach 1 from @Sarath Avanavu's answer works; however, it also sometimes shuffled the order of the records.
I found one more way that works:
df.collect().foreach { row =>
  println(row.mkString(","))
}
You should iterate over the partitions, which allows Spark to process the data in parallel, and then you can do a foreach on each row inside the partition.
You can further group the data in a partition into batches if need be:
sqlDF.foreachPartition { partitionedRows: Iterator[Model1] =>
  if (partitionedRows.take(1).nonEmpty) {
    partitionedRows.grouped(numberOfRowsPerBatch).foreach { batch =>
      batch.foreach { row =>
        .....
      }
    }
  }
}
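A self-contained sketch of that pattern on a plain DataFrame of Rows (the batch size is an assumed value, and the per-row work here is just a println that runs on the executors):

import org.apache.spark.sql.Row

val batchSize = 100 // assumed batch size
sqlDF.foreachPartition { rows: Iterator[Row] =>
  rows.grouped(batchSize).foreach { batch =>
    batch.foreach { row =>
      // Runs on the executor; replace with the real per-row work.
      println(row.mkString(","))
    }
  }
}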
This worked fine for me
sqlDF.collect().foreach(row => row.toSeq.foreach(col => println(col)))
Simply collect the result and then apply foreach:
df.collect().foreach(println)
My solution uses FOR because it was what I needed:
Solution 1:
case class campos_tablas(name: String, sector: String, age: Int)

for (row <- df.as[campos_tablas].take(df.count.toInt)) {
  print(row.name.toString)
}
Solution 2:
for (row <- df.take(df.count.toInt)) {
  print(row(0).toString)
}
Let's assume resultDF is the DataFrame.
val resultDF = // DataFrame //
val resultRow = resultDF.count
val resultSet = resultDF.collectAsList

var itr = 0
while (itr < resultRow) {
  val col1 = resultSet.get(itr).getInt(0)
  val col2 = resultSet.get(itr).getString(1) // if the column holds a String value
  val col3 = resultSet.get(itr).getLong(2)   // if the column holds a Long value
  // Write other logic for your code //
  itr = itr + 1
}

Check every column in a spark dataframe has a certain value

Can we check whether every column in a Spark DataFrame contains a certain string (for example "Y") using Spark SQL or Scala?
I have tried the following but don't think it is working properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this, filtering each column for the rows that contain 'Y' and then unioning the filtered results together:
//Get all columns
val columns: Array[String] = df.columns
//For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))
//Union all the dataframes together into one final dataframe
val output: DataFrame = seqDfs.reduceRight(_ union _)
You can use the DataFrame method columns to get all the column names
val columnNames: Array[String] = df.columns
and then add all filters in a loop
var filteredDf = df.select(df.col("*"))
for (name <- columnNames) {
  filteredDf = filteredDf.filter(s"$name == 'Y'")
}
or you can build a SQL query using the same approach.
If you want to keep every row in which any of the columns is equal to 1 (or anything else), you can dynamically build a query like this:
from pyspark.sql.functions import col, lit

cols = [col(c) == lit(1) for c in df.columns]
query = cols[0]
for c in cols[1:]:
    query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
from functools import reduce

res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in df.columns)))
res.show()
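For the original Scala question, requiring 'Y' in every column means AND-ing the per-column predicates rather than OR-ing them; a minimal sketch:

import org.apache.spark.sql.functions.{col, lit}

// Keep only the rows where every column equals "Y".
val allColumnsY = df.columns
  .map(c => col(c) === lit("Y"))
  .reduce(_ && _)

df.filter(allColumnsY).show()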

Spark Dataframe select based on column index

How do I select the columns of a DataFrame that sit at certain indexes in Scala?
For example, if a DataFrame has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how do I do that?
The following selects all the columns from DataFrame df whose names appear in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is a similar array colNos which has
colNos = Array(10,20,25,45)
How do I transform the above df.select to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based or Column-based variant of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
@user6910411's answer above works like a charm, and the number of tasks / the logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you go with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, there is a shortcut too:
val colNames = df.schema.fieldNames
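For example, a sketch that combines this shortcut with the index list from the question (colNos holds the wanted indexes):

import org.apache.spark.sql.functions.col

// Map the wanted indexes to column names, then to Columns.
val colNos = Seq(10, 20, 25, 45)
val selectCols = colNos.map(i => col(df.schema.fieldNames(i)))
val subsetDf = df.select(selectCols: _*)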
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert the Array[String] into an Array[org.apache.spark.sql.Column] for select to accept it.
OR Wrap it in a function using Currying (high five to my colleague for this):
// Subsets the DataFrame using the beg_val & end_val indexes.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))