I have a set of weather data and I am trying to query it to get the average lows and average highs for each year. I have no problem submitting the job and getting the desired result, but it takes hours to run. I thought it would run much faster; am I doing something wrong, or is it just not as fast as I think it should be?
The data is a CSV file with over 100,000,000 entries.
The columns are date, weather station, measurement (TMAX or TMIN), and value.
I am running the job on my university's Hadoop cluster; I don't have much more information than that about the cluster.
Thanks in advance!
import sys
from random import random
from operator import add
from pyspark.sql import SQLContext, Row
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    sqlContext = SQLContext(sc)

    file = sys.argv[1]
    lines = sc.textFile(file)
    parts = lines.map(lambda l: l.split(","))
    obs = parts.map(lambda p: Row(station=p[0], date=int(p[1]), measurement=p[2], value=p[3]))

    weather = sqlContext.createDataFrame(obs)
    weather.registerTempTable("weather")

    # AVERAGE TMAX/TMIN PER YEAR
    query2 = sqlContext.sql("""select SUBSTRING(date,1,4) as Year, avg(value) as Average, measurement
        from weather
        where value<130 AND value>-40
        group by measurement, SUBSTRING(date,1,4)
        order by SUBSTRING(date,1,4) """)

    query2.show()
    query2.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("hdfs:/user/adduccij/tmax_tmin_year.csv")

    sc.stop()
Make sure that the Spark job in fact started in cluster (and not local) mode, e.g. if you're using YARN, that the job is launched in 'yarn-client' mode.
If that's true, make sure you've provided enough executors, cores per executor, and executor and driver memory. You can get the actual cluster/job information from either the resource manager page (e.g. YARN) or from the Spark context (sqlContext.getAllConfs).
100 million records is not that small. Even if each record is only 30 bytes, the overall size is 3 GB, and that can take a while if you only have a handful of executors.
If the above suggestions do not help, try to find out which part of the query is taking long. A few speed-up tips:
Cache the weather dataframe.
Break the query into two parts: the first part does the group by, and its output is cached.
The second part does the order by.
Instead of coalesce(1), write the RDD with the default number of partitions and then merge the part files from the shell (e.g. with hdfs dfs -getmerge) to get your CSV output.
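Putting those tips together, a minimal sketch of the posted script might look like this (the input path, the app name, and the float cast on value are assumptions of mine; adjust them to your data):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

if __name__ == "__main__":
    sc = SparkContext(appName="WeatherAverages")
    sqlContext = SQLContext(sc)

    # Hypothetical input path; use sys.argv[1] as in the original script if you prefer.
    lines = sc.textFile("hdfs:/user/adduccij/weather.csv")
    obs = lines.map(lambda l: l.split(",")) \
               .map(lambda p: Row(station=p[0], date=int(p[1]), measurement=p[2], value=float(p[3])))

    weather = sqlContext.createDataFrame(obs)
    weather.cache()                      # tip 1: cache the weather dataframe
    weather.registerTempTable("weather")

    # Tip 2a: group by first and cache the (much smaller) aggregate...
    grouped = sqlContext.sql("""select SUBSTRING(date,1,4) as Year, avg(value) as Average, measurement
        from weather
        where value<130 AND value>-40
        group by measurement, SUBSTRING(date,1,4)""")
    grouped.cache()

    # Tip 2b: ...then order the cached aggregate separately.
    result = grouped.orderBy("Year")
    result.show()

    # Tip 3: write with the default number of partitions instead of coalesce(1),
    # then merge the part files from the shell:
    #   hdfs dfs -getmerge /user/adduccij/tmax_tmin_year tmax_tmin_year.csv
    result.rdd.map(lambda r: ",".join(map(str, r))) \
        .saveAsTextFile("hdfs:/user/adduccij/tmax_tmin_year")

    sc.stop()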
Related
Say I have these two parquet files:
import pandas as pd
pd.DataFrame([[0]], columns=["a"]).to_parquet("/tmp/1.parquet")
pd.DataFrame([[0],[2]], columns=["a"]).to_parquet("/tmp/2.parquet")
I would like to have a new parquet file that is a row-wise union of the two.
The resulting DataFrame should look like this:
a
0 0
1 0
2 2
I also would like to repartition that new file with a pre-determined number of partitions.
You can certainly solve this problem in Pandas, Spark, or other computing frameworks, but each of them will require a different implementation. With Fugue you can have one implementation for different computing backends; more importantly, the logic is unit-testable without any heavy backend.
from fugue import FugueWorkflow

def merge_and_save(file1, file2, file3, partition_num):
    dag = FugueWorkflow()
    df1 = dag.load(file1)
    df2 = dag.load(file2)
    df3 = df1.union(df2, distinct=False)
    df3.partition(num=partition_num).save(file3)
    return dag
To unit test this logic, just use small local files and the default execution engine. Assuming you have a function assert_eq:
merge_and_save(f1, f2, f3, 4).run()
assert_eq(pd.read_parquet(f3), expected_df)
And in real production, if the input files are large, you can switch to Spark:
merge_and_save(f4, f5, f6, 100).run(spark_session)
It's worth pointing out that partition_num is not respected by the default local execution engine, so we can't assert on the number of output files; it does take effect when the backend is Spark or Dask.
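For comparison, if you're happy to tie the code to Spark directly, the same row-wise union and repartition can be written with plain PySpark (a sketch; the output path /tmp/3.parquet and the partition count of 4 are made up, and an existing SparkSession is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("/tmp/1.parquet")
df2 = spark.read.parquet("/tmp/2.parquet")

# Row-wise union that keeps duplicates, then a fixed number of output partitions.
merged = df1.unionByName(df2).repartition(4)
merged.write.mode("overwrite").parquet("/tmp/3.parquet")

The trade-off is the one described above: this version needs a Spark session even for unit tests, while the Fugue version can run the same logic on plain Pandas locally.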
I'll be getting data from HBase within a TimeRange, so I divide the time range into chunks and scan the columns from HBase within each chunked TimeRange.
Suppose I have a TimeRange from June to August; dividing it weekly gives a list of 8 weekly TimeRanges.
From that, I scan the columns of HBase via repartition & mapPartitions like:
sparkSession.sparkContext.parallelize(chunkedTimeRange.toList).repartition(noOfCores).mapPartitions {
  // Scan Cols of Hbase Logic
  // This gives DF as output
}
I'll get a DF from the above and do some filtering on that DF using mapPartitions and foreachPartition like:
df.mapPartitions {
  rows => {
    rows.toList.par.foreach(
      cols => {
        json.filter(condition).foreach(//code)
        anotherJson.filter(condition).foreach(//code)
      }
    )
  }
  // returns DF
}
This DF is used by other methods; since mapPartitions is lazy, I called an action after the above like:
df.persist(StorageLevel.MEMORY_AND_DISK)
df.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
This foreachPartition is unnecessarily executing twice: one stage takes around 2.5 min (128 tasks) and the other takes 40 s (200 tasks), which is not necessary.
200 is the value set in the Spark config:
spark.sql.shuffle.partitions=200
How can I avoid this unnecessary foreachPartition? Is there any way I can still make it better in terms of performance?
I found a similar question; unfortunately, I didn't get much information from it.
Screenshot of foreachPartition happening twice for the same DF
If any clarification is needed, please mention it in a comment.
You need to "reuse" the persisted Dataframe:
val df2 = df.persist(StorageLevel.MEMORY_AND_DISK)
df2.foreachPartition((x: Iterator[org.apache.spark.sql.Row]) => x: Unit)
Otherwise the foreachPartition runs on a DF that has not been persisted, and every step of the DF computation is done again.
I'm trying to integrate DynamoDB with Spark on EMR using the solution provided in the AWS blog:
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark
I'm able to retrieve the results as expected, but the task calculator always shows the warning "The calculated max number of concurrent map tasks is less than 1, use 1 instead", and it takes more than 2 minutes to fetch the data.
$ spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
/* Importing DynamoDBInputFormat and DynamoDBOutputFormat */
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.input.tableName", "customer") // Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1")
jobConf.set("dynamodb.throughput.read", "1")
jobConf.set("dynamodb.throughput.read.percent", "1")
jobConf.set("dynamodb.version", "2011-12-05")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
var customers= sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
customers.count()
The cluster has 2 nodes, both m3.xlarge spot instances.
I'm not sure how to increase the number of Hadoop map tasks.
Any help would be appreciated.
I also created a Hive table that maps to the DynamoDB table and tried the same query from the Hive shell; query performance there is normal.
select * from customer where custid='123456' -- Time taken is only 4 seconds
When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows on a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. Then, after joining the dataframes, the result of the join has different counts. Afterwards I also filter the result, and that also has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
  .join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
  .join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
  .join(chartSiteInstance, impJoinKey, "left")
  .withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
  .withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
I have written some code in PySpark to load data from MongoDB into a Spark dataframe, apply some filters, process the data (using an RDD), and then write the result back to MongoDB.
# 1) Load the data
df_initial = spark.read.format("com.mongodb.spark.sql").options().schema(schema).load() #df_initial is a Spark dataframe
df_filtered = df_initial.filter(...)
# 2) Process the data
rdd_to_process = df_filtered.rdd
processed_rdd = rdd_to_process.mapPartitions(lambda iterator: process_data(iterator))
# 3) Create a dataframe from the RDD
df_final = spark.createDataFrame(processed_rdd, schema)
df_to_write = df_final.select(...)
# 4) Write the dataframe to MongoDB
df_to_write.write.format("com.mongodb.spark.sql").mode("append").save()
I would like to measure the time each part takes (loading the data, processing the RDD, creating the dataframe, and writing back the data).
I tried putting timers between each part, but from what I understand all Spark operations are lazy, so everything is executed in the last line.
Is there a way to measure the time spent by each part so that I can identify bottlenecks?
Thanks
Spark can inline some operations, especially if you use the DataFrame API. That's why you cannot get execution statistics for "code parts", only for the different stages.
There is no easy way to get this information from the context directly, but the REST API exposes a lot of information you can use. For example, to get the time spent in each stage you can use the following:
import datetime
import requests

parse_datetime = lambda date: datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%S.%fGMT")
dates_interval = lambda dt1, dt2: parse_datetime(dt2) - parse_datetime(dt1)

app_id = spark.sparkContext.applicationId
data = requests.get(spark.sparkContext.uiWebUrl + "/api/v1/applications/" + app_id + "/stages").json()

for stage in data:
    stage_time = dates_interval(stage['submissionTime'], stage['completionTime']).total_seconds()
    print("Stage {} took {}s (tasks: {})".format(stage['stageId'], stage_time, stage['numCompleteTasks']))
Example output looks like this:
Stage 4 took 0.067s (tasks: 1)
Stage 3 took 0.53s (tasks: 1)
Stage 2 took 1.592s (tasks: 595)
Stage 1 took 0.363s (tasks: 1)
Stage 0 took 2.367s (tasks: 595)
But then it's your job to identify which stages are responsible for the operations you want to measure.
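If you only need rough wall-clock numbers rather than per-stage statistics, another common workaround is to force each lazy step with an action and time it in the driver. A sketch reusing the variables from the question (the extra count() actions and cache() calls add work of their own, so treat the timings as approximate):

import time

def timed(label, action):
    # Run a zero-argument callable, print its wall-clock time, and return its result.
    start = time.time()
    result = action()
    print("{} took {:.1f}s".format(label, time.time() - start))
    return result

# count() forces each lazy step; cache() keeps earlier work from being redone later.
df_filtered.cache()
timed("load + filter", lambda: df_filtered.count())

processed_rdd.cache()
timed("mapPartitions processing", lambda: processed_rdd.count())

timed("build final dataframe", lambda: df_to_write.count())

timed("write to MongoDB",
      lambda: df_to_write.write.format("com.mongodb.spark.sql").mode("append").save())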