Spark columnar performance - scala

I'm a relative beginner to Spark. I have a wide dataframe (1000 columns) to which I want to add columns based on whether a corresponding column has missing values,
so
+----+
|   A|
+----+
|   1|
|null|
|   3|
+----+
becomes
+----+-----+
|   A|A_MIS|
+----+-----+
|   1|    0|
|null|    1|
|   3|    0|
+----+-----+
This is part of a custom ML transformer, but the algorithm should be clear.
override def transform(dataset: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
  var ds = dataset
  dataset.columns.foreach(c => {
    if (dataset.filter(col(c).isNull).count() > 0) {
      ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
    }
  })
  ds.toDF()
}
Loop over the columns; if a column has more than 0 nulls, create a new column.
The dataset passed in is cached (using the .cache method) and the relevant config settings are the defaults.
This is running on a single laptop for now, and takes on the order of 40 minutes for the 1000 columns, even with a minimal number of rows.
I thought the problem was due to hitting a database, so I tried with a parquet file instead, with the same result. Looking at the jobs UI, it appears to be doing file scans in order to do the count.
Is there a way I can improve this algorithm to get better performance, or tune the caching in some way? Increasing spark.sql.inMemoryColumnarStorage.batchSize just got me an OOM error.

Remove the condition:
if (dataset.filter(col(c).isNull).count() > 0)
and leave only the internal expression. As written, Spark requires #columns data scans.
If you want to prune columns, compute the statistics once, as outlined in Count number of non-NaN entries in each column of Spark dataframe with Pyspark, and use a single drop call.
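For illustration, here is a minimal sketch of that suggestion, assuming the same dataset and _MIS naming as in the question's transform; the helper names (withMis, nonNullCounts, noNullIndicators, result) are just illustrative. The indicators are added unconditionally in one projection, the null statistics come from a single aggregation, and the unneeded indicators are removed with one drop call:
import org.apache.spark.sql.functions.{col, count, when}

// Add the indicator for every column unconditionally -- one projection, no per-column scans.
val withMis = dataset.columns.foldLeft(dataset.toDF()) { (df, c) =>
  df.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
}

// Compute the non-null counts for all columns in a single aggregation job.
val nonNullCounts = dataset
  .agg(count(col(dataset.columns.head)), dataset.columns.tail.map(c => count(col(c))): _*)
  .toDF(dataset.columns: _*)
  .first()
val rowCount = dataset.count()

// Drop the indicators for columns that have no nulls, in a single drop call.
val noNullIndicators = dataset.columns
  .filter(c => nonNullCounts.getAs[Long](c) == rowCount)
  .map(_ + "_MIS")
val result = withMis.drop(noNullIndicators: _*)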

Here's the code that fixes the problem.
override def transform(dataset: Dataset[_]): DataFrame = {
  var ds = dataset
  val rowCount = dataset.count()
  val exprs = dataset.columns.map(count(_))
  val colCounts = dataset.agg(exprs.head, exprs.tail: _*).toDF(dataset.columns: _*).first()
  dataset.columns.foreach(c => {
    if (colCounts.getAs[Long](c) > 0 && colCounts.getAs[Long](c) < rowCount) {
      ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
    }
  })
  ds.toDF()
}

Related

Spark: how to group rows into a fixed size array?

I have a dataset that looks like this:
+---+
|col|
+---+
| a|
| b|
| c|
| d|
| e|
| f|
| g|
+---+
I want to reformat this dataset so that the rows are aggregated into arrays of fixed length, like so:
+------+
| col|
+------+
|[a, b]|
|[c, d]|
|[e, f]|
| [g]|
+------+
I tried this:
spark.sql("select collect_list(col) from (select col, row_number() over (order by col) row_number from dataset) group by floor(row_number/2)")
But the problem with this is that my actual dataset is too large to process in a single partition for row_number()
As you wish to distribute this, a couple of steps are necessary.
In case you wish to run the code, I am starting from this:
var df = List(
  "a", "b", "c", "d", "e", "f", "g"
).toDF("col")

val desiredArrayLength = 2
First, split your dataframe into a small one which you can process on a single node, and a larger one whose number of rows is a multiple of the desired array size (in your example, this is 2).
val nRowsPrune = 1 // number of rows to prune so that the remaining dataframe has a row count
                   // that is a multiple of the desired array length
val dfPrune = df.sort(desc("col")).limit(nRowsPrune)
df = df.join(dfPrune, Seq("col"), "left_anti") // separate the small from the large dataframe
By construction, you can apply the original code on the small dataframe,
val groupedPruneDf = dfPrune
  //.withColumn("g", floor((lit(-1) + row_number().over(w)) / lit(desiredArrayLength))) // added -1 as row_number starts from 1
  //.groupBy("g")
  .agg(collect_list("col").alias("col"))
  .select("col")
Now, we need to figure out a way to deal with the remaining large dataframe. However, we have now made sure that df has a number of rows which is a multiple of the array size.
This is where we use a great trick: repartitioning using repartitionByRange. Basically, the partitioning guarantees to preserve the sorting, and as you are partitioning, each partition will have the same size.
You can now collect each array within each partition:
val nRows = df.count()
val maxNRowsPartition = desiredArrayLength // make sure it's a multiple of the desired array length
val nPartitions = math.max(1, math.floor(nRows / maxNRowsPartition)).toInt

df = df.repartitionByRange(nPartitions, $"col".desc)
  .withColumn("partitionId", spark_partition_id())

val w = Window.partitionBy($"partitionId").orderBy("col")
val groupedDf = df
  .withColumn("g", floor((lit(-1) + row_number().over(w)) / lit(desiredArrayLength))) // added -1 as row_number starts from 1
  .groupBy("partitionId", "g")
  .agg(collect_list("col").alias("col"))
  .select("col")
Finally, combining the two results yields what you are looking for:
val result = groupedDf.union(groupedPruneDf)
result.show(truncate=false)

Flatmap on Spark Dataframe in Scala

I have a Dataframe. I need to create one or more rows from each row in the dataframe. I am hoping flatMap could help me solve the problem. One or more rows would be created by applying logic to 2 columns of the row.
Example Input dataframe
+------+------+------+
|  Name|Float1|Float2|
+------+------+------+
|  Java|   2.3|   0.2|
|Python|   3.2|   0.5|
| Scala|   4.3|   0.8|
+------+------+------+
Logic:
If |Float1 + Float2| = |Float1| then one row is created.
e.g.: 2.3 + 0.2 = |2.5| = 2
|2.3| = 2
If |Float1 + Float2| > |Float1| then two rows are created.
e.g.: 4.3 + 0.8 = |5.1| = 5
|4.3| = 4
Can we solve this problem using flatMap or any other transformation in Spark?
Create a UDF that takes in the two columns and returns a list.
Once you have the list, use the explode function on that column, which will give you what you desire.
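A minimal sketch of that approach, assuming an input dataframe named df with the Name/Float1/Float2 columns from the question; the UDF name makeRows, the floor-based comparison, and the values emitted into the list are illustrative guesses, not the asker's exact logic:
import org.apache.spark.sql.functions.{col, explode, udf}

// Hypothetical UDF: emit one value when floor(|Float1 + Float2|) equals floor(|Float1|),
// otherwise emit two. The contents of the list are placeholders for the real logic.
val makeRows = udf { (f1: Double, f2: Double) =>
  if (math.floor(math.abs(f1 + f2)) == math.floor(math.abs(f1))) Seq(f1)
  else Seq(f1, f1 + f2)
}

// explode turns each element of the returned list into its own row.
val exploded = df.withColumn("value", explode(makeRows(col("Float1"), col("Float2"))))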

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to be able to take only the x Ids with the highest counts, where x is the number I want, determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if the variable howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and completely wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where("query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe, add each of these to it one by one, and return this newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val howMany = 2
val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )

How to merge two or more columns into one?

I have a streaming Dataframe over which I want to calculate min and avg for some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The dataframe looks like this:
+-----+-----+
|    1|    2|
+-----+-----+
|   24|   55|
|   20|   51|
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
  .agg(res.head, res.tail: _*)
I'm expecting the output, after applying the min and avg mathematical operations, to be:
+---------+---------+
|result(1)|result(2)|
+---------+---------+
|   20, 22|   51, 53|
+---------+---------+
How should I write the expression?
Use the struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
struct(min(col(name)), avg(col(name))) as s"result($name)")
^^^^^^ HERE
The power of struct can be seen when you want to reference one field in the struct and you can use the name (not index).
q.select("structCol.name")
What you want to do is merge the values of multiple columns together into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you:
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+

pyspark dataframe complex calculation with previous row

I am working with PySpark and trying to figure out how to do a complex calculation involving previous rows. I think there are generally two ways to do calculations with previous rows: window functions, and mapPartitions. I think my problem is too complex to solve with window functions, and I want the result as a separate row, not a column. So I am trying to use mapPartitions. I am having trouble with its syntax.
For instance, here is a rough draft of the code.
def change_dd(rows):
    prev_rows = []
    prev_rows.append(rows)
    for row in rows:
        new_row = []
        for entry in row:
            # Testing to figure out syntax, things would get more complex
            new_row.append(entry + prev_rows[0])
        yield new_row

updated_rdd = select.rdd.mapPartitions(change_dd)
However, I can't access the individual data in prev_rows. It seems prev_rows[0] is an itertools.chain. How do I iterate over prev_rows[0]?
edit
neighbor = sc.broadcast(df_sliced.where(df_sliced.id == neighbor_idx).collect()[0][:-1]).value
current = df_sliced.where(df_sliced.id == i)

def oversample_dt(dataframe):
    for row in dataframe:
        new_row = []
        for entry, neigh in zip(row, neighbor):
            if isinstance(entry, str):
                if scale < 0.5:
                    new_row.append(entry)
                else:
                    new_row.append(neigh)
            else:
                if isinstance(entry, int):
                    new_row.append(int(entry + (neigh - entry) * scale))
                else:
                    new_row.append(entry + (neigh - entry) * scale)
        yield new_row

sttt = time.time()
sample = current.rdd.mapPartitions(oversample_dt).toDF(schema)
In the end, I ended up doing it like this for now, but I really don't want to use collect in the first line. If someone knows how to fix this, or can point out any problems with my use of PySpark, please tell me.
edit2
Suppose Alice and its neighbor Alice_2, with scale = 0.4:
+---+-------+------+
|age|   name|height|
+---+-------+------+
| 10|  Alice|   170|
| 11|Alice_2|   175|
+---+-------+------+
Then, I want a row
+----------+-------+------------+
|       age|   name|      height|
+----------+-------+------------+
|10 + 1*0.4|Alice_2| 170 + 5*0.4|
+----------+-------+------------+
Why not use dataframes?
Add a column to the dataframe with the previous values using window functions like this:
from pyspark.sql import SparkSession, functions
from pyspark.sql.window import Window
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([{'name': 'Alice', 'age': 1}, {'name': 'Alice_2', 'age': 2}])
df.show()
+---+-------+
|age| name|
+---+-------+
| 1| Alice|
| 2|Alice_2|
+---+-------+
window = Window.partitionBy().orderBy('age')
df = df.withColumn("age-1", functions.lag(df.age).over(window))
df.show()
You can use this function for every column
+---+-------+-----+
|age| name|age-1|
+---+-------+-----+
| 1| Alice| null|
| 2|Alice_2| 1|
+---+-------+-----+
And then just do your calculation.
And if you want to use the rdd, then just use df.rdd.