What happens if you count twice? [duplicate] - scala

I ran an action twice, and the second run takes very little time, so I suspect that Spark automatically caches some results. But I could not find any source confirming this.
I'm using Spark 1.4.
import re

doc = sc.textFile('...')
doc_wc = doc.flatMap(lambda x: re.split(r'\W', x)) \
            .filter(lambda x: x != '') \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda x, y: x + y)
%%time
doc_wc.take(5) # first time
# CPU times: user 10.7 ms, sys: 425 µs, total: 11.1 ms
# Wall time: 4.39 s
%%time
doc_wc.take(5) # second time
# CPU times: user 6.13 ms, sys: 276 µs, total: 6.41 ms
# Wall time: 151 ms

From the documentation:
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
The underlying filesystem will also be caching access to the disk.
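If you plan to reuse the RDD, you can make the caching explicit instead of relying on shuffle files; here is a minimal sketch in Scala (the question uses PySpark, so the exact API differs slightly):

import org.apache.spark.storage.StorageLevel

// Explicitly persist the reduceByKey result, as the documentation recommends.
// The input path is elided, as in the question.
val docWc = sc.textFile("...")
  .flatMap(_.split("\\W"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_ONLY)

docWc.take(5)  // first action computes and caches the RDD
docWc.take(5)  // later actions read from the cache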

Related

Pyspark count() is taking long time before and after using subtract command

This is my code:
spark_df1 = spark.read.option('header', 'True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
spark_df1.count()  # This command took around 1.40 min to execute

spark_df1 = spark.read.option('header', 'True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
test_data = spark_df1.sample(fraction=0.001)
spark_df2 = spark_df1.subtract(test_data)
spark_df2.count()  # This command takes more than 20 min to execute. Can anyone explain
                   # why the same count command takes so long here?
Why is count() taking a long time before and after using the subtract command?
The gist is that subtract is an expensive operation involving joins and distinct, both of which incur shuffles, so it takes much longer than a plain spark_df1.count(). How much longer depends on the Spark executor configuration and the partitioning scheme. Please update the question as requested in the comments for an in-depth analysis.
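If the intent is simply to carve off a small test set, one alternative worth sketching (not part of the original answer) is randomSplit, which avoids the join-plus-distinct shuffle that subtract performs; caching the source DataFrame before reusing it also helps. A hedged Scala sketch, reusing the path and fraction from the question:

// Cache the source so the gzipped CSV is not re-read for every action.
val spark_df1 = spark.read.option("header", "true")
  .csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
  .cache()

// randomSplit yields disjoint pieces without subtract's expensive shuffle.
val Array(trainData, testData) = spark_df1.randomSplit(Array(0.999, 0.001), seed = 42)

trainData.count()
testData.count()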

NetLogo Profiler: How can the exclusive time be greater than the inclusive time?

I am trying to optimize my NetLogo model using the Profiler extension. I get the following output [excerpt]:
Profiler
BEGIN PROFILING DUMP
Sorted by Exclusive Time
Name                    Calls  Incl T(ms)  Excl T(ms)  Excl/calls
COMPLETE-COOKING        38741       0.711    4480.369       0.116
GET-RECIPE              10701    2618.651    2618.651       0.245
GET-EQUIPMENT           38741    1204.293    1204.293       0.031
SELECT-RECIPE-AT-TICK     990    9533.460     470.269       0.475
GIVE-RECIPE-REVIEW      10701       4.294     449.523       0.042
COMPLETE-COOKING and GIVE-RECIPE-REVIEW have a greater exclusive than inclusive time.
How can this be? And if it is an error, how do I fix it?

Spark 3.0 is much slower to read json files than Spark 2.4

I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same files. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going on? Is there a configuration problem with Spark 3.0?
Spark 2.4
scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349
Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349
Here are the details:
As it turns out, the default behavior of Spark 3.0 has changed: it tries to infer timestamps unless a schema is specified, which results in a huge amount of text scanning. When I loaded the data with inferTimestamp=false, the time did come close to that of Spark 2.4, but Spark 2.4 still beats Spark 3 by ~3+ seconds (maybe within an acceptable range, but the question is why?). I have no idea why this behavior was changed, but it should have been announced in BOLD letters.
Spark 2.4
spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 29706 ms
res0: Long = 2605349
spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 31431 ms
res0: Long = 2605349
Spark 3.0
spark.time(spark.read.option("inferTimestamp","false").json("/data/20200528/").count)
Time taken: 32826 ms
res0: Long = 2605349
spark.time(spark.read.option("inferTimestamp","false").option("prefersDecimal","false").json("/data/20200528/").count)
Time taken: 34011 ms
res0: Long = 2605349
Note:
Make sure you never set prefersDecimal to true, even when inferTimestamp is false; doing so again takes a huge amount of time.
Spark 3.0 + JDK 11 is slower than Spark 3.0 + JDK 8 by almost 6 sec.
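A more robust option, not covered in the answer above, is to supply an explicit schema so Spark skips inference entirely; a sketch (the field list is illustrative, since the real data has 5 more fields beyond created and id):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("created", LongType),
  StructField("id", StringType)
  // add the remaining fields of the actual JSON here
))

spark.time(spark.read.schema(schema).json("/data/20200528/").count())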

Spark processing time increases for each sequential task

I'm new to Spark and Scala, and have just written some Scala code that groups the data in my RDD by four keys and sums a fifth. My intention is to replicate this SQL statement:
SELECT iso, id1, id2, convert(thresh), sum(area::double)
FROM my_table
GROUP BY iso, id1, id2, convert(thresh)
My Scala code is as follows, where finalRDD is an RDD and convert is a simple function that bins integers from 1-100:
finalRDD
  .map({ case Array(lon, lat, thresh, area, iso, id1, id2) =>
    ((iso, id1, id2, convert(thresh)), area.toDouble) })
  .reduceByKey(_ + _)
  .map({ case (key, value) =>
    Array(key._1, key._2, key._3, key._4, value).mkString(",") })
  .saveAsTextFile(conf.get("output.path"))
When I run this process on a large dataset (3.5 TB, parallelized into 113,000 tasks), the first 30,000 tasks each finish within a second or two. An hour later, however, each task takes 15 - 20 minutes to complete.
I'm running this in Spark 2.0.0 on AWS EMR with 13 d2.8xlarge machines (244 GB memory and 36 cores each). I run the .jar using:
spark-submit --master yarn --executor-memory 20g --jars my_jar.jar
The Spark UI reports that all tasks are finishing relatively quickly (3 - 5 seconds), but it does show increasing amounts of scheduler delay time, the largest of which is 23 minutes.
Is this the likely cause of the increasing time to completion for each task? And is there a way to better structure my code (or config) to avoid this? Is reduceByKey to blame?
--- UPDATE ---
I've updated my code to use Spark DataFrames, letting Spark determine the best execution plan to group and sum my data. The updated code is as follows:
case class TableRow(iso: String, id1: String, id2: String, thresh: Long, area: Double)

finalRDD
  .map({ case Array(lon, lat, thresh, area, iso, id1, id2) =>
    TableRow(iso, id1, id2, matchTest(thresh), area.toDouble) })
  .toDF()
  .groupBy("iso", "id1", "id2", "thresh")
  .agg(sum("area").alias("area_out"))
  .write
  .format("csv")
  .save(conf.get("output.path"))
When running this updated code, I did not experience the scheduler delay error, which was great. I did encounter errors connecting to executors and having them time out. Having already spent a fair amount of time and money on this d2.8xlarge cluster, I ended up splitting my job into 14 separate jobs, which all completed successfully.
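For the executor connection and timeout errors, one commonly tried knob (not something verified on this cluster, so treat it as a sketch with illustrative values) is to raise Spark's network timeout while keeping the heartbeat interval well below it:

import org.apache.spark.sql.SparkSession

// Illustrative values only; spark.executor.heartbeatInterval must stay well below spark.network.timeout.
val spark = SparkSession.builder()
  .appName("group-and-sum")  // hypothetical application name
  .config("spark.network.timeout", "600s")
  .config("spark.executor.heartbeatInterval", "60s")
  .getOrCreate()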

Need advice on efficiently inserting millions of time series data into a Cassandra DB

I want to use a Cassandra database to store time series data from a test site. I am using Pattern 2 from the "Getting started with Time Series Data Modeling" tutorial, but instead of storing the date used to limit the row size as a date, I store it as an int counting the number of days elapsed since 1970-01-01; the timestamp of each value is the number of nanoseconds since the epoch (some of our measuring devices are that precise, and the precision is needed). My table for the values looks like this:
CREATE TABLE values (channel_id INT, day INT, time BIGINT, value DOUBLE, PRIMARY KEY ((channel_id, day), time))
I created a simple benchmark that uses asynchronous execution and prepared statements for bulk loading instead of batches:
def valueBenchmark(numVals: Int): Unit = {
  val vs = session.prepare(
    "insert into values (channel_id, day, time, value) values (?, ?, ?, ?)")
  val currentFutures = mutable.MutableList[ResultSetFuture]()
  for (i <- 0 until numVals) {
    currentFutures += session.executeAsync(vs.bind(
      -1: JInt, i / 100000: JInt, i.toLong: JLong, 0.0: JDouble))
    if (currentFutures.length >= 10000) {
      currentFutures.foreach(_.getUninterruptibly)
      currentFutures.clear()
    }
  }
  if (currentFutures.nonEmpty) {
    currentFutures.foreach(_.getUninterruptibly)
  }
}
JInt, JLong and JDouble are simply java.lang.Integer, java.lang.Long and java.lang.Double, respectively.
When I run this benchmark for 10 million values, it takes about two minutes on a locally installed single-node Cassandra. My computer is equipped with 16 GiB of RAM and a quad-core i7 CPU. I find this quite slow. Is this normal performance for inserts with Cassandra?
I already read these:
Anti-Patterns in Cassandra
Another question on write performance
Are there any other things I could check?
Simple maths:
10 million inserts / 2 minutes ≈ 83,333 inserts/sec, which is great for a single machine; did you expect something faster?
By the way, what are the specs of your hard drives? SSD or spinning disks?
You should know that massive insert scenarios are more CPU-bound than I/O-bound. Try to execute the same test on a machine with 8 physical cores (so 16 vcores with Hyper-Threading) and compare the results.