dataframe.selectExpr performance for selecting a large number of columns - scala

I am using a Spark DataFrame in Scala. My dataframe holds about 400 columns, with 1,000 to 1 million rows. I am running a
dataframe.selectExpr operation (over the 1st to 400th columns) based on certain criteria, and once they are fetched, I am aggregating the values of all these columns.
My selectExpr statement:
val df = df2.selectExpr(getColumn(beginDate, endDate, x._2): _*)
The getColumn method fetches day-wise columns between the start and end dates from my dataframe (this may be up to 365 columns, as we have day-wise data).
My summing expression is:
df.map(row => (row(0), row(1), row(2), (3 until row.length).map(row.getLong(_)).sum)).collect()
I find that selecting this many columns degrades the performance of my job. Is there any way to make fetching these 400 columns faster?

A DataFrame is better optimized than an RDD, so it is good that you are already using one. But please check the Spark UI to see which stage is taking the most time, and whether the time goes into the calculation or into loading the data (Reshaping Data). Try to scale up slowly for faster output, and check whether partitioning can also help your code run faster (Apache Spark Code Faster). Making the code faster depends on various factors; try to optimize using those.
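As a hedged follow-up sketch (not part of the answer above): the per-row sum in the question's map/collect step can be pushed down into a single Catalyst expression, so the summing stays distributed and only the reduced rows come back to the driver. It assumes, as the map expression implies, that the first three columns are keys and the remaining ~400 are numeric day columns.

import org.apache.spark.sql.functions._

// Columns 0-2 are keys, the rest are the day-wise values to sum per row.
val keyCols = df.columns.take(3).map(col)
val dayCols = df.columns.drop(3).map(col)

// One expression that adds every day column; the sum runs inside Spark
// rather than after a collect() on the driver.
val rowTotal = dayCols.reduce(_ + _).as("day_total")

val summed = df.select((keyCols :+ rowTotal): _*)
summed.show(5)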

Related

How to calculate median on timestamp windowed data in Spark?

I am very new to Spark and have been stuck on this for a while now.
I am using Databricks for this. I create a dataframe df from a database, then I group the data into 30-minute time windows. The first challenge is that I am unable to figure out whether this groupBy operation is correct; using agg does not seem to be working:
val window30 = df.groupBy(window($"X_DATE_TS", "30 minute"), $"X_ID").agg(sort_array($"X_VALUE"))
display(window30)
Secondly, if I want to calculate the median of the column X_VALUE, how do I do that over the multiple rows selected in each window?
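(This related question has no answer shown here. As a hedged sketch only: Spark SQL's percentile_approx aggregate gives an approximate median per group, which fits the 30-minute window grouping from the question. The output column name median_value is an assumption.)

import org.apache.spark.sql.functions._

// Approximate median of X_VALUE per 30-minute window and X_ID.
// percentile_approx(..., 0.5) is the approximate 50th percentile.
val window30 = df
  .groupBy(window($"X_DATE_TS", "30 minutes"), $"X_ID")
  .agg(expr("percentile_approx(X_VALUE, 0.5)").as("median_value"))

display(window30)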

Efficient filtering of a large dataset in Spark based on other dataset

I have an issue with filtering a large dataset, which I believe is because of an inefficient join. What I'm trying to do is the following:
Dataset info contains a lot of user data, identified by a user ID and a timestamp.
Dataset filter_dates contains, for every user ID, the most recent timestamp processed.
The processing job then needs to determine, from a large input source, what data in info is new and to be processed.
What I tried doing was the following:
info
  .join(
    broadcast(filter_dates),
    info("userid") === filter_dates("userid"),
    "left"
  )
  .filter('info_date >= 'latest_date || 'latest_date.isNull)
But this is incredibly slow (taking many hours), and in the Spark UI I see it processing an absurd amount of data as Input.
Then, just for fun, I tried it with this:
val startDate = // Calculate the minimum startDate from filter_dates
info
.filter('info_date >= startDate)
And this was blazingly fast (taking some minutes), but of course it is not optimal because it ends up re-processing some data. This leads me to believe there's something fundamentally wrong with how I joined the datasets.
Does anyone know how I can improve the joins?
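(No answer is shown for this related question either. As a hedged sketch of one common pattern: keep the cheap coarse filter on the global minimum latest_date so the scan is pruned early, then apply the exact per-user condition with the broadcast join on the reduced set. It assumes latest_date is a timestamp column and that joining on Seq("userid") is acceptable for your schema.)

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an active SparkSession named spark

// Coarse prune first: nothing older than the smallest latest_date can possibly be new.
val startDate = filter_dates.agg(min($"latest_date")).head().getTimestamp(0)

// Exact per-user filter on the much smaller candidate set.
val fresh = info
  .filter($"info_date" >= lit(startDate))
  .join(broadcast(filter_dates), Seq("userid"), "left")
  .filter($"info_date" >= $"latest_date" || $"latest_date".isNull)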

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip, dst_port, and I also want the latest value from the received_time column of the original dataframe.
I want a dataframe with distinct src_ip values together with their count and the latest received_time in a new column called last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get this task done.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
This calculates, for each combination of the groupBy columns, the count of received timestamps as well as the maximum timestamp for that group.
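To match the last_seen column name asked for in the question, a small follow-up on the mydf above:

val result = mydf.withColumnRenamed("max_received_time", "last_seen")
result.show(5)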

Create sample values for failure records in Spark

I have a scenario where my dataframe has 3 columns: a, b and c. I need to validate whether the length of each column's value is equal to 100. Based on that validation I create status columns a_status, b_status, c_status with values 5 (success) and 10 (failure). For failure scenarios I need to update a count and create new columns a_sample, b_sample, c_sample, each holding up to 5 failure sample values separated by ",". To create the sample columns I tried this:
df = df.select(
  df.columns.toList.map(col(_)) :::
  df.columns.toList.map(x =>
    lit(getSample(df.select(x, x + "_status").filter(x + "_status=10").select(x).take(5)))
      .alias(x + "_sample")
  ): _*
)
The getSample method just takes the array of rows and concatenates them into a string. This works fine for a limited number of columns and a small data size. However, when the number of columns is greater than 200 and the data exceeds 1 million rows, it has a huge performance impact. Is there an alternative approach for this?
While the details of your problem statement are unclear, you can break up the task into two parts:
1. Transform data into a format where you identify several different types of rows you need to sample.
2. Collect sample by row type.
The industry jargon for "row type" is stratum/strata, and the way to do (2) without collecting data to the driver (which you don't want to do when the data is large) is stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it works with fractions rather than exact row counts. If you absolutely must get a sample with an exact number of rows, there are two strategies:
Oversample by fraction and then filter unneeded rows, e.g., using the row_number() window function followed by a filter 'row_num < n.
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).
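A hedged sketch of the oversample-then-trim strategy described above, for a single column's status; the fraction, seed, and the choice of a_status are illustrative assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Stratified sample of failure rows (status = 10) for column a; keys missing
// from the fractions map are dropped entirely.
val fractions = Map(10 -> 0.01)  // assumed oversampling fraction
val sampled = df.stat.sampleBy("a_status", fractions, seed = 42L)

// Trim the oversampled result to exactly 5 rows per stratum.
val w = Window.partitionBy(col("a_status")).orderBy(rand(42L))
val firstFive = sampled
  .withColumn("row_num", row_number().over(w))
  .filter(col("row_num") <= 5)
  .drop("row_num")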

Unable to Convert to DataFrame from RDD after applying partitioning

I am using Spark 2.1.0. When I try to use a window function on a DataFrame:
val winspec = Window.partitionBy("partition_column")
DF.withColumn("column", avg(DF("col_name")).over(winspec))
my plan changes and adds the lines below to the physical plan. Because of this an extra stage and extra shuffling occur, and since the data is huge this slows down my query dramatically; it runs for hours.
+- Window [avg(cast(someColumn#262 as double)) windowspecdefinition(partition_column#460, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS someColumn#263], [partition_column#460]
+- *Sort [partition_column#460 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(partition_column#460, 200)
I also see the stage as MapInternalPartition, which I think means it is partitioned internally, but I don't know what this is. I think that because of this my 100 tasks took 30+ minutes: 99 of them completed within 1-2 minutes while the last task took the remaining 30 minutes, leaving my cluster idle with no parallel processing. This makes me wonder: is the data partitioned properly when a window function is used?
I tried to apply hash partitioning by converting it to an RDD, because we cannot apply a custom/HashPartitioner to a DataFrame. So if I do this:
val myVal = DF.rdd.partitioner(new HashPartitioner(10000))
I get a return type of Any, on which I am not offered any actions to perform.
I also checked and saw that the column used for partitioning in the window function contains all NULL values.
TL;DR:
1. A shuffle when using window functions is not extra. It is required for correctness and cannot be removed.
2. Applying a partitioner shuffles the data.
3. Datasets cannot reuse RDD partitioning. Partitioning with a Dataset should be done with the repartition method:
df.repartition($"col_name")
But it won't help you because of 2)
And this:
val myVal = DF.rdd.partitioner(new HashPartitioner(10000))
wouldn't return Any; it wouldn't compile at all, as there is no partitioner method on RDD that takes arguments.
The correct method is partitionBy, but it is not applicable to an RDD[Row] (it is defined only for key-value RDDs), and it wouldn't help you anyway because of 3).
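(Purely to illustrate the partitionBy point, a minimal sketch assuming partition_column is a string: the RDD[Row] has to be keyed first, and even then the partitioning is lost once you convert back to a DataFrame, per 3) above.)

import org.apache.spark.HashPartitioner

// partitionBy is only defined on key-value RDDs, so the RDD[Row] must be keyed first.
val keyed = DF.rdd.keyBy(row => row.getAs[String]("partition_column"))
val partitioned = keyed.partitionBy(new HashPartitioner(10000))
// Converting partitioned back to a DataFrame would not preserve this partitioning.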
If there is enough memory, you can try a broadcast join against the pre-aggregated averages:
df.join(
  broadcast(df.groupBy("partition_column").agg(avg("col_name"))),
  Seq("partition_column")
)
Edit:
If you're trying to compute a running average (avg with Window.partitionBy("partition_column") computes a global average per group, not a running average), then you're out of luck.
If the partitioning column contains only NULLs, then the task is not distributed and runs fully sequentially.
To compute a global running average you can try applying logic similar to How to compute cumulative sum using Spark.
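(A hedged sketch of a global running average for reference; the ordering column event_time is an assumption, since the question does not name one.)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Global running average ordered by an assumed event_time column.
// Without partitionBy, all rows go through a single partition, which is the
// trade-off discussed in the linked cumulative-sum question.
val runningSpec = Window
  .orderBy("event_time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withRunningAvg = DF.withColumn("running_avg", avg(DF("col_name")).over(runningSpec))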