Getting the number of rows in a Spark dataframe without counting - scala

I am applying many transformations on a Spark DataFrame (filter, groupBy, join). I want to have the number of rows in the DataFrame after each transformation.
I am currently counting the number of rows using the function count() after each transformation, but this triggers an action each time, which is not really optimal.
I was wondering if there is any way of knowing the number of rows without having to trigger an action other than the original job.

You could use an accumulator for each stage and increment it in a map after each stage. Then, at the end, after you run your action, you would have a count for all the stages.
import org.apache.spark.sql.functions.{col, lit, max}

val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
val groupByCounter = spark.sparkContext.longAccumulator("group-counter")
val joinCounter = spark.sparkContext.longAccumulator("join-counter")

// Note: .map on a DataFrame (Dataset[Row]) needs a Row encoder in scope,
// e.g. passed explicitly as RowEncoder(schema).
myDataFrame
  .filter(col("x") === lit(3))
  .map(x => {
    filterCounter.add(1)
    x
  })
  .groupBy(col("x"))
  .agg(max("y"))
  .map(x => {
    groupByCounter.add(1)
    x
  })
  .join(myOtherDataframe, col("x") === col("y"))
  .map(x => {
    joinCounter.add(1)
    x
  })
  .count()

print(s"count for filter = ${filterCounter.value}")
print(s"count for group by = ${groupByCounter.value}")
print(s"count for join = ${joinCounter.value}")

Each operator has a couple of metrics of its own. These metrics are visible in the SQL tab of the Spark UI.
If the SQL tab is not an option, we can introspect the query execution object of the DataFrame after execution to access the metrics (internally, accumulators).
Example: df.queryExecution.executedPlan.metrics will give the metrics of the top-most node in the DAG.
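A minimal sketch of that introspection (not from the original answer), assuming a DataFrame named df that is small enough to collect:

// Run an action that executes this exact plan (collect() does; count() builds a
// separate plan), then read the metrics of the top-most node of the executed plan.
// Keys are metric names such as "numOutputRows".
df.collect()
df.queryExecution.executedPlan.metrics.foreach { case (name, metric) =>
  println(s"$name = ${metric.value}")
}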

Coming back to this question with a bit more experience on Apache Spark, to complement randal's answer.
You can also use a UDF to increment a counter.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, lit, max, udf}
import org.apache.spark.util.LongAccumulator

val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
val groupByCounter = spark.sparkContext.longAccumulator("group-counter")
val joinCounter = spark.sparkContext.longAccumulator("join-counter")

def countUdf(acc: LongAccumulator): UserDefinedFunction = udf { (x: Int) =>
  acc.add(1)
  x
}

myDataFrame
  .filter(col("x") === lit(3))
  .withColumn("x", countUdf(filterCounter)(col("x")))
  .groupBy(col("x"))
  .agg(max("y"))
  .withColumn("x", countUdf(groupByCounter)(col("x")))
  .join(myOtherDataframe, col("x") === col("y"))
  .withColumn("x", countUdf(joinCounter)(col("x")))
  .count()

print(s"count for filter = ${filterCounter.value}")
print(s"count for group by = ${groupByCounter.value}")
print(s"count for join = ${joinCounter.value}")
This should be more efficient because Spark only has to deserialize the column used in the UDF, but it has to be used carefully, as Catalyst can more easily reorder the operations (for example, pushing a filter below the call to the UDF).
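One way to see what Catalyst actually did (a quick sanity check, not part of the original answer) is to print the plans of the instrumented DataFrame before triggering the action:

// explain(true) prints the logical and physical plans, which shows where the
// counting UDFs ended up relative to the filters and joins after optimization.
val instrumented = myDataFrame
  .filter(col("x") === lit(3))
  .withColumn("x", countUdf(filterCounter)(col("x")))
instrumented.explain(true)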

Related

How do you change schema in a Spark `DataFrame.map()` operation without joins?

In Spark v3.0.1 I have a DataFrame of arbitrary schema.
I want to turn that DataFrame of arbitrary schema into a new DataFrame with the same schema and a new column that is the result of a calculation over the data discretely present in each row.
I can safely assume that certain columns of certain types are available for the logical calculation despite the DataFrame being of arbitrary schema.
I have solved this previously by creating a new Dataset[outcome] of two columns:
the KEY from the input DataFrame
the OUTCOME of the calculation
... and then joining that DF back on the initial input to add the new column:
import spark.implicits._

val inputDf = Seq(
  ("1", "input1", "input2"),
  ("2", "anotherInput1", "anotherInput2")
).toDF("key", "logicalInput1", "logicalInput2")

case class outcome(key: String, outcome: String)

val outcomes = inputDf.map(row => {
  val input1 = row.getAs[String]("logicalInput1")
  val input2 = row.getAs[String]("logicalInput2")
  val key = row.getAs[String]("key")
  val result = if (input1 != "") input1 + input2 else input2
  outcome(key, result)
})

val finalDf = inputDf.join(outcomes, Seq("key"))
Is there a more efficient way to map a DataFrame to a new DataFrame with an extra column given arbitrary columns on the input DF upon which we can assume some columns exist to make the calculation?
I'd like to take the inputDF and map over each row, generating a copy of the row and adding a new column to it with the outcome result without having to join afterwards...
NOTE that in the example above, a simple solution exists using the Spark API... My calculation is not as simple as concatenating strings together, so .map or a UDF is required for the solution. I'd like to avoid UDFs if possible, though that could work too.
Before answering the exact question about using .map, I think it is worth a brief discussion about using UDFs for this purpose. UDFs were mentioned in the "note" of the question but not in detail.
When we use .map (or .filter, .flatMap, or any other higher-order function) on any Dataset [1], we force Spark to fully deserialize the entire row into an object, transform the object with a function, and then serialize the entire object again. This is very expensive.
A UDF is effectively a wrapper around a Scala function that routes values from certain columns to the arguments of the UDF. Therefore, Spark is aware of which columns are required by the UDF and which are not and thus we save a lot of serialization (and possibly IO) costs by ignoring columns that are not used by the UDF.
In addition, the query optimizer can't really help with .map, but a UDF can be part of a larger plan whose execution cost the optimizer will (in theory) minimize.
I believe that a UDF will usually be better in the kind of scenario put forth in the question. Another smell that indicates UDFs are a good solution is how little code they require compared to other solutions.
val outcome = udf { (input1: String, input2: String) =>
  if (input1 != "") input1 + input2 else input2
}

inputDf.withColumn("outcome", outcome(col("logicalInput1"), col("logicalInput2")))
Now to answer the question about using .map! To avoid the join, we need to have the result of the .map be a Row that has all the contents of the input row with the output added. Row is effectively a sequence of values with type Any. Spark manipulates these values in a type-safe way by using the schema information from the dataset. If we create a new Row with a new schema, and provide .map with an Encoder for the new schema, Spark will know how to create a new DataFrame for us.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StringType

val newSchema = inputDf.schema.add("outcome", StringType)
val newEncoder = RowEncoder(newSchema)

inputDf
  .map { row =>
    val rowWithSchema = row.asInstanceOf[GenericRowWithSchema] // This cast might not always be possible!
    val input1 = row.getAs[String]("logicalInput1")
    val input2 = row.getAs[String]("logicalInput2")
    val key = row.getAs[String]("key")
    val result = if (input1 != "") input1 + input2 else input2
    new GenericRowWithSchema(rowWithSchema.toSeq.toArray :+ result, newSchema).asInstanceOf[Row] // Encoder is invariant so we have to cast again.
  }(newEncoder)
  .show()
Not as elegant as the UDFs, but it works in this case. However, I'm not sure that this solution is universal.
[1] DataFrame is just an alias for Dataset[Row]
You should use withColumn with a UDF. I don't see why map should be preferred, and I think it's very difficult to append a column with the DataFrame API.
Or you can switch to the Dataset API.
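A minimal sketch of that Dataset alternative, assuming the column names from the question (the case class names are illustrative, not part of the original answers):

import spark.implicits._

// Typed views of the input and output rows; names are illustrative only.
case class InputRow(key: String, logicalInput1: String, logicalInput2: String)
case class ResultRow(key: String, logicalInput1: String, logicalInput2: String, outcome: String)

val resultDs = inputDf.as[InputRow].map { r =>
  val result = if (r.logicalInput1 != "") r.logicalInput1 + r.logicalInput2 else r.logicalInput2
  ResultRow(r.key, r.logicalInput1, r.logicalInput2, result)
}

This keeps the full row plus the new column without a join, at the cost of the full deserialization discussed above.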

Dynamically generate code having filter, withColumnRenamed and coalesce condition Scala Spark

I have a piece of code which I want to generate dynamically. I want to take the columns below in the form of a list or sequence and perform a filter operation with coalesce inside, plus drop and withColumnRenamed statements.
Here is the list of columns that I want to accept dynamically (here as a string).
val cols = "a|tmp_a,b|tmp_b"
The code looks something like this:
val df1 = df2.filter(!(coalesce(col("a"), lit(0)) === coalesce(col("tmp_a"), lit(0))) || !(upper(col("b")) === upper(col("tmp_b"))))
  .drop("a")
  .drop("b")
  .withColumnRenamed("tmp_a", "a")
  .withColumnRenamed("tmp_b", "b")
If more columns are added to cols, how can the code be adapted dynamically? New column pairs should use the same filter condition as the "b|tmp_b" above.
Given an input with the pairs of column names, you can create the two types of filter conditions (below, the first column pair uses the first filter pattern and the rest use the second). After the DataFrame is filtered, the drop and withColumnRenamed steps can be applied with a foldLeft.
val cols = "a|tmp_a,b|tmp_b,c|tmp_c".split(",").map(_.split("\\|"))

val filterCondHead = !(coalesce(col(cols.head(0)), lit(0)) === coalesce(col(cols.head(1)), lit(0)))
val filterCondTail = cols.tail.map(c => !(upper(col(c(0))) === upper(col(c(1))))).reduce(_ || _)

val df2 = df.filter(filterCondHead || filterCondTail)
val df3 = cols.foldLeft(df2) { case (df, c) =>
  df.drop(c(0)).withColumnRenamed(c(1), c(0))
}
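For reference, a hypothetical input the snippet above could run against (only the column names come from the question; the values are invented):

import spark.implicits._

// Invented sample data; after the foldLeft, the tmp_* values replace a, b, c.
val df = Seq(
  (1, 1, "x", "X", "p", "q"),
  (2, 3, "y", "y", "m", "m")
).toDF("a", "tmp_a", "b", "tmp_b", "c", "tmp_c")

df3.show() would then contain only the columns a, b and c, holding the renamed tmp values for the rows that survive the filter.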

spark dataframe filter and select

I have a Spark Scala DataFrame and need to filter the elements based on a condition and select the count.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter(col("user") === ("subscriber").select("cnt")
The error I am facing is: value select is not a member of org.apache.spark.sql.Column
Also, for some reason, count is a Dataset[Row].
Any thoughts on how to get the count in a single line?
Dataset[Row] is DataFrame: DataFrame is simply a type alias for Dataset[Row], so no need to worry, it is a DataFrame.
See this for a better understanding: Difference between DataFrame, Dataset, and RDD in Spark.
Regarding "select is not a member of org.apache.spark.sql.Column": it's purely a compile error.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter
  .filter(col("user") === "subscriber")
  .select("cnt")
will work, since you were missing the closing parenthesis of the filter call.
You are missing a ")" before .select; please check the code below.
The Column class doesn't have a .select method; you have to invoke select on a DataFrame.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter(col("user") === "subscriber").select("cnt")
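One caveat worth adding (not from the original answers): .alias("cnt") on the grouped DataFrame names the DataFrame itself (a subquery alias), not the aggregated column, which is called count, so select("cnt") may still fail at analysis time even with the parenthesis fixed. A sketch that names the column explicitly, assuming the same df and column names as the question:

import org.apache.spark.sql.functions.{col, count}

// Filter first, then aggregate, naming the result column explicitly.
val subscriberCount = df
  .filter(col("user") === "subscriber")
  .agg(count("*").alias("cnt"))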

How can I parallelize different SparkSQL execution efficiently?

Environment
Scala
Apache Spark: Spark 2.2.1
EMR on AWS: emr-5.12.1
Content
I have one large DataFrame, like below:
val df = spark.read.option("basePath", "s3://some_bucket/").json("s3://some_bucket/group_id=*/")
There are ~1 TB of JSON files at s3://some_bucket, and they include 5000 partitions of group_id.
I want to execute a conversion using SparkSQL, and it differs for each group_id.
The Spark code is like below:
// Create view
val df = spark.read.option("basePath", "s3://data_lake/").json("s3://data_lake/group_id=*/")
df.createOrReplaceTempView("lakeView")

// one of the queries looks like this:
// SELECT
//   col1 as userId,
//   col2 as userName,
//   .....
// FROM
//   lakeView
// WHERE
//   group_id = xxx;
val queries: Seq[String] = getGroupIdMapping

// ** Want to know better ways **
queries.par.foreach(query => {
  val convertedDF: DataFrame = spark.sql(query)
  convertedDF.write.save("s3://another_bucket/")
})
par parallelizes with up to Runtime.getRuntime.availableProcessors threads, which will be equal to the number of the driver's cores.
But it seems weird and not efficient enough, because it has nothing to do with Spark's own parallelization.
I really want to do something like groupBy in scala.collection.Seq.
This is not valid Spark code:
df.groupBy(groupId).foreach((groupId, parDF) => {
  parDF.createOrReplaceTempView("lakeView")
  val convertedDF: DataFrame = spark.sql(queryByGroupId)
  convertedDF.write.save("s3://another_bucket")
})
1) First of all, if your data is already stored in files per group id, there is no reason to mix it up and then group by id using Spark.
It's much simpler and more efficient to load only the relevant files for each group id.
2) Spark itself parallelizes the computation. So in most cases there is no need for external parallelization.
But if you feel that Spark doesn't utilize all resources, you can:
a) If each individual computation takes less than a few seconds, then the task scheduling overhead is comparable to the task execution time, so it's possible to get a boost by running a few tasks in parallel.
b) If the computation takes a significant amount of time but resources are still underutilized, then most probably you should increase the number of partitions of your dataset (see the short repartition sketch after the thread-pool example below).
3) If you finally decide to run several tasks in parallel, it can be achieved this way:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
// Implicit so that Future.apply and Future.sequence pick up this pool.
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)

val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map(query => {
  Future {
    // spark stuff here
    0
  }
})

val allDone: Future[Seq[Int]] = Future.sequence(results)
// wait for results
Await.result(allDone, scala.concurrent.duration.Duration.Inf)
executor.shutdown() // otherwise the JVM will probably not exit
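And for point 2b above, increasing the partition count is a one-liner; the number here is purely illustrative:

// Illustrative only: pick a partition count that matches your cluster's total cores.
val repartitioned = df.repartition(200)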

Spark Union inside a loop gives void

I am trying to build an RDD through an iterative union with another RDD inside a loop, but the result is correct only if I perform an action on the result RDD inside the loop.
var rdd: RDD[Int] = sc.emptyRDD

for (i <- 1 to 5) {
  val rdd1 = sc.parallelize(Array(1))
  rdd = rdd ++ rdd1
}
// rdd.foreach(println) => void

for (i <- 1 to 5) {
  val rdd1 = sc.parallelize(Array(1))
  rdd = rdd ++ rdd1
  rdd.foreach(x => x)
}
// rdd.foreach(println) => (1, 1, 1, 1, 1)
If I create rdd1 outside the loop, everything works fine, but not inside.
Is there a specific lightweight action to solve this problem?
One thing to keep in mind is that when you apply the foreach action to your RDD, the action is applied on each individual worker. Therefore, in the first case, if you check the stdout of each executor, you will find the printed values from rdd. If you want these values printed to your console, you can aggregate the elements of the RDD (or a subset of them) at the driver and then apply your function (e.g. rdd.collect.foreach(println), rdd.take(3).foreach(println), etc.).
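A minimal sketch of that suggestion, assuming the rdd from the question:

// Bring the elements back to the driver before printing, so the output shows up
// in the driver's console rather than in the executors' stdout.
rdd.collect().foreach(println) // all elements; only safe for small RDDs
rdd.take(3).foreach(println)   // or just a sample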