I have written some code in PySpark to load some data from MongoDB into a Spark dataframe, apply some filters, process the data (using an RDD), and then write the result back to MongoDB.
# 1) Load the data
df_initial = spark.read.format("com.mongodb.spark.sql").options().schema(schema).load() #df_initial is a Spark dataframe
df_filtered = df_initial.filter(...)
# 2) Process the data
rdd_to_process = df_filtered.rdd
processed_rdd = rdd_to_process.mapPartitions(lambda iterator: process_data(iterator))
# 3) Create a dataframe from the RDD
df_final = spark.createDataFrame(processed_rdd, schema)
df_to_write = df_final.select(...)
# 4) Write the dataframe to MongoDB
df_to_write.write.format("com.mongodb.spark.sql").mode("append").save()
I would like to measure the time each part takes (loading the data, processing the RDD, creating the dataframe, and writing back the data).
I tried to put timers between the parts, but from what I understand all Spark operations are lazy, so everything is actually executed in the last line.
Is there a way to measure the time spent by each part so that I can identify bottlenecks?
Thanks
Spark can inline some operations, especially if you use the DataFrame API. That's why you cannot get execution statistics for "code parts", only for the different stages.
There is no easy way to get this information from the context directly, but the REST API exposes a lot of information you can use. For example, to get the time spent in each stage you can use the following code:
import datetime
import requests

# the REST API reports timestamps like 2017-01-01T12:00:00.000GMT
parse_datetime = lambda date: datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%S.%fGMT")
dates_interval = lambda dt1, dt2: parse_datetime(dt2) - parse_datetime(dt1)

app_id = spark.sparkContext.applicationId
data = requests.get(spark.sparkContext.uiWebUrl + "/api/v1/applications/" + app_id + "/stages").json()

for stage in data:
    stage_time = dates_interval(stage['submissionTime'], stage['completionTime']).total_seconds()
    print("Stage {} took {}s (tasks: {})".format(stage['stageId'], stage_time, stage['numCompleteTasks']))
Example output looks like this:
Stage 4 took 0.067s (tasks: 1)
Stage 3 took 0.53s (tasks: 1)
Stage 2 took 1.592s (tasks: 595)
Stage 1 took 0.363s (tasks: 1)
Stage 0 took 2.367s (tasks: 595)
But then it's your job to identify which stages correspond to the operations you want to measure.
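To make that mapping a little easier, each stage entry returned by the API also exposes a 'name' field (and a more verbose 'details' field) containing the call site that produced the stage, so you can print it next to the timings to see which part of your code each stage belongs to. A minimal sketch, reusing the helpers from the snippet above:

for stage in data:
    stage_time = dates_interval(stage['submissionTime'], stage['completionTime']).total_seconds()
    # 'name' holds the call site of the operation that created the stage (e.g. a save or a collect),
    # which helps tie a stage back to a part of your code
    print("Stage {} ({}) took {}s".format(stage['stageId'], stage['name'], stage_time))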
Related
I'm getting data from HBase within a TimeRange, so I divide the time range into chunks and scan the HBase columns within each chunked TimeRange.
Suppose I have a TimeRange from June to August; dividing it into weeks gives a list of 8 weekly TimeRanges.
From that list, I scan the HBase columns via repartition & mapPartitions, like:
sparkSession.sparkContext.parallelize(chunkedTimeRange.toList).repartition(noOfCores).mapPartitions{
  // Scan cols of HBase logic
  // This gives DF as output
}
I get a DF from the above, and then I do some filtering on that DF using mapPartitions and foreachPartition, like:
df.mapPartitions{
  rows => {
    rows.toList.par.foreach(
      cols => {
        json.filter(condition).foreach(//code)
        anotherJson.filter(condition).foreach(//code)
      }
    )
  }
  // returns DF
}
This DF is used by other methods. Since mapPartitions is lazy, I called an action after the above, like:
df.persist(StorageLevel.MEMORY_AND_DISK)
df.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
This foreachPartition is unnecessarily executing twice: one stage takes around 2.5 min (128 tasks) and the other takes 40s (200 tasks), which is not necessary.
200 is the value set in the Spark config:
spark.sql.shuffle.partitions=200
How can I avoid this unnecessary foreachPartition? Is there any other way I can still make this better in terms of performance?
I found a similar question; unfortunately, I didn't get much information from it.
Screenshot of foreachPartition happening twice for the same DF
If any clarification is needed, please mention it in a comment.
You need to "reuse" the persisted DataFrame:
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
Otherwise, the foreachPartition runs on a DF which has not been persisted, and it does every step of the DF computation again.
I'm trying to count the number of valid and invalid records present in a file. Below is the code that does this:
val badDataCountAcc = spark.sparkContext.longAccumulator("BadDataAcc")
val goodDataCountAcc = spark.sparkContext.longAccumulator("GoodDataAcc")
val dataframe = spark
  .read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load(path)
  .filter(data => {
    val matcher = regex.matcher(data.toString())
    if (matcher.find()) {
      goodDataCountAcc.add(1)
      println("GoodDataCountAcc: " + goodDataCountAcc.value)
      true
    } else {
      badDataCountAcc.add(1)
      println("BadDataCountAcc: " + badDataCountAcc.value)
      false
    }
  })
  .withColumn("FileName", input_file_name())

dataframe.show()

val filename = dataframe
  .select("FileName")
  .distinct()
val name = filename.collectAsList().get(0).toString()
println("" + filename)
println("Bad data Count Acc: " + badDataCountAcc.value)
println("Good data Count Acc: " + goodDataCountAcc.value)
I ran this code on sample data that has 2 valid and 3 invalid records. Inside the filter, where I'm printing the counts, the values are correct. But outside the filter, when I print the counts, they come out as 4 for good data and 6 for bad data.
Questions:
When I remove the withColumn statement at the end, along with the code which calculates the distinct filename, the values are printed correctly. I'm not sure why.
I do have a requirement to get the input filename as well. What would be the best way to do that here?
First of all, accumulators belong to the RDD API, while you are using DataFrames. DataFrames are compiled down to RDDs in the end, but they sit at a higher level of abstraction. It is better to use aggregations instead of accumulators in this context.
From the Spark Accumulators documentation:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map().
Your DataFrame filter will be compiled to an RDD filter, which is not an action but a transformation (and thus lazy), so this only-once guarantee does not hold in your case.
How many times your code is executed is implementation-dependent and may change between Spark versions, so you should not rely on it.
Regarding your two questions:
(BEFORE EDIT) This cannot be answered from your code snippet because it doesn't contain any actions. Is it even the exact snippet you use? I suspect that if you executed the code you posted, with nothing added except the missing imports, it would print 0 twice, because nothing is executed. Either way, you should always assume that an accumulator inside an RDD transformation may be executed multiple times (or even not at all, if it is in a DataFrame operation that can be optimized away).
Your approach of using withColumn is perfectly fine.
I'd suggest using DataFrame expressions and aggregations (or the equivalent Spark SQL if you prefer that). The regex matching can be done using rlike on the columns themselves, instead of relying on toString(), e.g. .withColumn("IsGoodData", $"myColumn1".rlike(regex1) && $"myColumn2".rlike(regex2)).
Then you can count the good and bad records using an aggregation like dataframe.groupBy($"IsGoodData").count().
EDIT: With the additional lines, the answer to your first question is also clear: the first evaluation comes from dataframe.show() and the second from filename.collectAsList(), which you probably also removed since it depends on the added column. Please make sure you understand the distinction between Spark transformations and actions, and Spark's lazy evaluation model; otherwise you won't be very happy with it :-)
I have a data set of weather data and I am trying to query it to get the average lows and average highs for each year. I have no problem submitting the job and getting the desired result, but it takes hours to run. I thought it would run much faster; am I doing something wrong, or is it just not as fast as I think it should be?
The data is a CSV file with over 100,000,000 entries.
The columns are date, weather station, measurement (TMAX or TMIN), and value.
I am running the job on my university's Hadoop cluster; I don't have much more information about the cluster than that.
Thanks in advance!
import sys
from random import random
from operator import add
from pyspark.sql import SQLContext, Row
from pyspark import SparkContext
if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    sqlContext = SQLContext(sc)

    file = sys.argv[1]
    lines = sc.textFile(file)
    parts = lines.map(lambda l: l.split(","))
    obs = parts.map(lambda p: Row(station=p[0], date=int(p[1]), measurement=p[2], value=p[3]))
    weather = sqlContext.createDataFrame(obs)
    weather.registerTempTable("weather")

    # AVERAGE TMAX/TMIN PER YEAR
    query2 = sqlContext.sql("""select SUBSTRING(date,1,4) as Year, avg(value) as Average, measurement
        from weather
        where value<130 AND value>-40
        group by measurement, SUBSTRING(date,1,4)
        order by SUBSTRING(date,1,4) """)

    query2.show()
    query2.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("hdfs:/user/adduccij/tmax_tmin_year.csv")

    sc.stop()
Make sure that the Spark job in fact started in cluster (and not local) mode, e.g. if you're using YARN, that the job is launched in 'yarn-client' mode.
If that's the case, make sure you've provided enough executors, cores, and executor/driver memory. You can get the actual cluster/job information either from the resource manager (e.g. YARN) page or from the Spark context (sqlContext.getAllConfs).
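In PySpark (which your script uses), a quick sanity check along those lines could look like this; sc is the SparkContext from your script, and the keys filtered for are just the ones that are usually relevant:

# check where the job actually runs and with what resources
print(sc.master)                           # should be e.g. 'yarn', not 'local[*]'
for key, value in sc.getConf().getAll():   # effective Spark configuration
    if "executor" in key or "driver" in key or "master" in key:
        print(key, "=", value)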
100 million records is not that small. Even if each record is only 30 bytes, the overall size is still 3 GB, and that can take a while if you only have a handful of executors.
If the above suggestions do not help, then try to find out which part of the query is taking long. A few speed-up tips (a rough sketch follows the list):
Cache the weather dataframe.
Break the query into 2 parts: the 1st part does the group by, and its output is cached.
The 2nd part does the order by.
Instead of coalesce(1), write the RDD with the default partitioning and then merge the part files from the shell (e.g. with hdfs dfs -getmerge) to get your CSV output.
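Here is a rough sketch of those tips, reusing the names from your script (the exact split of the query and the output handling are assumptions, so adjust as needed):

weather.cache()  # tip 1: cache the DataFrame behind the "weather" temp table

# tip 2: the first part does the group by, and its output is cached
grouped = sqlContext.sql("""select SUBSTRING(date,1,4) as Year, avg(value) as Average, measurement
    from weather
    where value<130 AND value>-40
    group by measurement, SUBSTRING(date,1,4)""")
grouped.cache()

# tip 3: the second part does the order by on the cached group-by result
query2 = grouped.orderBy("Year")
query2.show()

# tip 4: write with the default partitioning instead of coalesce(1), then merge the
# part files from the shell (e.g. hdfs dfs -getmerge) to get a single CSV
query2.rdd.map(lambda x: ",".join(map(str, x))).saveAsTextFile("hdfs:/user/adduccij/tmax_tmin_year.csv")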
I am inserting data into a Hive table iteratively in Spark.
For example: let's say there are 10,000 items. First these items are split into 5 lists, each with 2,000 items. After that I iterate over those 5 lists.
In each iteration, the 2,000 items map to many more rows, so at the end of an iteration about 15M records are inserted into the Hive table. Each iteration completes in 40 minutes.
The issue is that after each iteration, Spark waits before starting on the next 2,000 items. The waiting time is about 90 minutes! In that gap there are no active tasks in the Spark web UI.
By the way, the iterations start directly with the Spark process; there is no other Scala or Java code at the beginning or the end of the iterations.
Any idea?
Thanks
val itemSeq = uniqueIDsDF.select("unique_id").map(r => r.getLong(0)).collect.toSeq // Get 10K items
val itemList = itemSeq.sliding(2000, 2000).toList // Create 5 lists of 2000 items each

itemList.foreach(currItem => {
  // starting code (iteration start)
  val currListDF = currItem.toDF("unique_id")
  val currMetadataDF = hive_raw_metadata.join(broadcast(currListDF), Seq("unique_id"), "inner")
  currMetadataDF.registerTempTable("metaTable")
  // further logic here ....
})
I found the reason: even if the insert task looks completed in the Spark UI, the insert process still continues in the background. Only after the write to HDFS is completed does the next iteration start. That is the reason for the gap in the web UI.
AFAIK, you are trying to divide the DataFrame, pass the data in batches, and do some processing, as in your pseudo code, which was not entirely clear.
As you mentioned above in your answer, whenever an action happens it takes some time for the insertion into the sink.
But basically, I feel your sliding logic can be improved like this...
Based on that assumption, I have 2 options for you; you can choose the most suitable one...
Option #1: foreachPartitionAsync (AsyncRDDActions)
I would suggest you use the partition iterator's grouping capability:
val repartitioned = df.repartition(numofpartitionsyouwant) // numPartitions; note that repartition returns a new DataFrame

// since this is partition-wise processing to the sink, it should be faster than the approach you are adopting
repartitioned.rdd.foreachPartitionAsync { partitionIterator =>
  partitionIterator.grouped(2000).foreach { group =>
    group.foreach { item =>
      // do your insertions here, or whatever you wanted to ....
    }
  }
}
Note: the RDD action will be executed in the background. All of these executions are submitted to the Spark scheduler and run concurrently; depending on your Spark cluster size, some of the jobs may wait until executors become available for processing.
Option #2: randomSplit
The second approach is DataFrame.randomSplit, which I think you can use in this case to divide the DataFrame into equally sized pieces: passing equal weights returns an array of (approximately) equally sized DataFrames, even if the weights sum to more than 1.
Note: the weights (the first argument of randomSplit) will be normalized if they don't sum to 1.
DataFrame[] randomSplit(double[] weights): Randomly splits this DataFrame with the provided weights.
Refer to the randomSplit implementation for details.
It will look like this:
val equalsizeddfArray = yourdf.randomSplit(Array(0.3, 0.3, 0.3, 0.3, 0.3)) // the weights intentionally sum to > 1; they are normalized, so in your case the 10,000-record DataFrame is split into an array of 5 DataFrames of roughly 2,000 records each
and then...
for (i <- 0 until equalsizeddfArray.length) {
  // your logic ....
}
Note:
The above logic is sequential...
If you want to execute the splits in parallel (and they are independent), you can use Futures:
import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
// Now wait for the tasks to finish before exiting the app
Await.result(Future.sequence(Seq(yourtaskfuncOndf1(), yourtaskfuncOndf2(), ..., yourtaskfuncOndf10())), Duration(10, MINUTES))
Out of the above 2 options, I would prefer approach #2, since the randomSplit function takes care (by normalizing the weights) of dividing the data into equally sized parts for processing.
I want to aggregate data based on intervals of a timestamp column.
I saw that it takes 53 seconds for the computation, but 5 minutes to write the result to the CSV file. It seems like the csv() write takes too much time.
How can I optimize the code, please?
Here is my code snippet:
val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:\\dataSet.csv\\inputDataSet.csv")
// convert all columns to numeric values in order to apply the aggregation function
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }
// add a new column including the new timestamp column
val result2 = df.withColumn("new_time", ((unix_timestamp(col("_c0")) / 300).cast("long") * 300).cast("timestamp")).drop("_c0")
val finalresult = result2.groupBy("new_time").agg(result2.drop("new_time").columns.map(mean(_)).head, result2.drop("new_time").columns.map(mean(_)).tail: _*).sort("new_time")
finalresult.coalesce(1).write.option("header", "true").csv("C:/result_with_time.csv") // <= it takes too much time to write
Here are some thoughts on optimization based on your code.
inferSchema: it will be faster to provide a predefined schema rather than using inferSchema.
Instead of writing to your local filesystem, you can try writing to HDFS and then copying the file to local.
df.coalesce(1).write will take more time than just df.write, but with the latter you will get multiple files, which can be combined using different techniques; or you can simply leave them in one directory as multiple part files.