Scala/Spark Select Column very slow - scala

I am new to Scala/Spark (about a week now).
The following code is being run on my 8-core, 64-bit Windows 10 laptop.
The dataframe has 1700 rows.
ONE select takes over ten seconds.
Watching the console shows the main hang is at this point:
17/09/02 12:23:46 INFO FileSourceStrategy: Pruning directories with:
The Code
{
  val major: String = name.substring(0, name.indexOf("_SCORE")) + "_idx1"
  println(major)
  val majors = dfMergedDroppedDeleted
    .select(col(major))
    .collect().toSeq
  println(s"got majors ${majors.size}")
}
This should take milliseconds (based on experience with Hibernate, R, MySQL, etc.).
I am assuming there is something wrong with my configuration of Spark?
Any suggestions would be most welcome.
The full console output up to the hang is as follows:
1637_1636_1716_idx1
1637_1636_1716_idx2
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 765
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 763
17/09/02 12:23:08 INFO BlockManagerInfo: Removed broadcast_51_piece0 on 192.168.0.13:62246 in memory (size: 113.7 KB, free: 901.6 MB)
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 761
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 764
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 762
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 766
17/09/02 12:23:08 INFO BlockManagerInfo: Removed broadcast_50_piece0 on 192.168.0.13:62246 in memory (size: 20.7 KB, free: 901.6 MB)
17/09/02 12:23:08 INFO FileSourceStrategy: Pruning directories with:

Putting the dataframe in cache makes a big difference.
val dfMergedDroppedDeletedCached: DataFrame = dfMergedDroppedDeleted.cache()
However, the caching process itself is slow, so this only pays off if you are performing multiple operations.
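For example, a rough sketch of the pattern, reusing the cached DataFrame for several actions (the column names are the ones from the console output above):

val dfCached = dfMergedDroppedDeleted.cache()

// The first action pays the cost of materialising the cache;
// later actions against dfCached are served from memory.
val countIdx1 = dfCached.filter(dfCached("1637_1636_1716_idx1").isNotNull).count()
val countIdx2 = dfCached.filter(dfCached("1637_1636_1716_idx2").isNotNull).count()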
UPDATE
Credit to Ramesh Maharjan, who wrote in a comment:
The time-consuming part is not the select. select is distributed in nature and would be executed on the local data in every executor. The time-consuming part is the collect: it gathers all the data onto the driver node, and that takes a lot of time. That's why collect is generally recommended against, and, if it is necessary, to be used as little as possible.
I have changed the query to be as follows:
val majorstr: String = dfMergedDroppedDeletedCached
  .filter(dfMergedDroppedDeletedCached(major).isNotNull)
  .select(col(major))
  .limit(1)
  .first().getString(0)
Not exactly Oracle speeds, but much faster than using collect.
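An equivalent way to pull back only a handful of rows is take, which, unlike collect, only ships the requested rows to the driver (a sketch against the same cached DataFrame):

// take(n) only brings n rows back to the driver, whereas collect()
// materialises the entire selected column on the driver.
val firstMajor: String = dfMergedDroppedDeletedCached
  .filter(dfMergedDroppedDeletedCached(major).isNotNull)
  .select(col(major))
  .take(1)
  .head
  .getString(0)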

Related

Why spark streaming is running on previous topics records?

I ran ZooKeeper and the Kafka broker, but I didn't run a Kafka producer.
I ran the Spark Streaming code and printed the unfiltered stream here.
My question is: why am I receiving this stream of data, namely
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
even though I'm not running the producer?
And what do the messages below mean?
19/06/24 20:20:00 INFO JobScheduler: Finished job streaming job 1561378800000 ms.0 from job set of time 1561378800000 ms
19/06/24 20:20:00 INFO JobScheduler: Total delay: 0.028 s for time 1561378800000 ms (execution: 0.021 s)
19/06/24 20:20:00 INFO MapPartitionsRDD: Removing RDD 161 from persistence list
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1716
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1893
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1944
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
...
19/06/24 20:20:00 INFO KafkaRDD: Removing RDD 160 from persistence list
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1628
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1781
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1570
19/06/24 20:20:00 INFO BlockManager: Removing RDD 161
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1808
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 2020
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1624
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1918
19/06/24 20:20:00 INFO ContextCleaner: Cleaned
You may want to check what you've set in 'auto.offset.reset'.
From the Spark Streaming Guide:
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
They set the offset reset to "latest". Yours seems to be set to earliest.
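For reference, a minimal sketch of wiring those parameters into a direct stream with spark-streaming-kafka-0-10 (the StreamingContext ssc and the topic name are assumptions, not from the question):

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// With "auto.offset.reset" -> "latest" and a fresh group.id, the stream only
// sees records produced after the job starts; with "earliest" it replays the
// whole topic, which matches the behaviour described in the question.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,                                                          // assumption: an existing StreamingContext
  PreferConsistent,
  Subscribe[String, String](Seq("vehicle-topic"), kafkaParams)  // topic name is a placeholder
)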

Why are the executors getting killed by the driver?

The first stage of my spark job is quite simple.
It reads from a large number of files (around 30,000 files, 100 GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true)) (see the sketch below)
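A rough sketch of that first stage (the path, the parser, and the filter predicate are placeholders):

// Placeholder parser: split each tab-separated line into a Map of field name -> value.
def parseLine(line: String): Map[String, Any] =
  line.split('\t').zipWithIndex.map { case (value, i) => s"field$i" -> (value: Any) }.toMap

val coalesced = sc.textFile("hdfs:///data/input/*")   // ~30,000 files -> RDD[String]
  .map(parseLine)                                     // -> RDD[Map[String, Any]]
  .filter(record => record.nonEmpty)                  // placeholder filter predicate
  .coalesce(100, shuffle = true)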
When running it, I observe quite peculiar behavior. The number of executors grows until it reaches the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approx. 14000/33428 tasks) and only a few executors remain; they are killed by the driver. When this stage is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But, in this case, I do not understand why they would become idle. There remain a lot of tasks to do in the stage...
Do you have any idea of why it happens?
EDIT
More details from the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe that there's some spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out.
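For example, if the executors are being reclaimed by dynamic allocation while long UDF tasks are still pending, raising the idle timeout is one knob to try (a sketch; the values are illustrative):

import org.apache.spark.SparkConf

// Keep idle executors around longer so dynamic allocation does not reclaim
// them while long-running UDF tasks are still pending in the same stage.
// (The default for spark.dynamicAllocation.executorIdleTimeout is 60s.)
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "200")
  .set("spark.dynamicAllocation.executorIdleTimeout", "600s")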

Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded

I read the data from my input files and create a paired RDD, which is then converted to a Map for future lookups, and I then broadcast this map. The map is a few GB in size. Is there a way to do collectAsMap() in a more efficient manner, or to replace it with some other call?
val result_paired_rdd = prods_user_flattened.collectAsMap()
val broadcastMap = sc.broadcast(result_paired_rdd)  // keep the handle so tasks can read broadcastMap.value
I get the following error. I also tried passing --executor-memory 7G to spark-submit.
15/08/31 08:29:51 INFO BlockManagerInfo: Removed taskresult_48 on host3:48924 in memory (size: 11.4 MB, free: 3.6 GB)
15/08/31 08:29:51 INFO BlockManagerInfo: Added taskresult_50 in memory on host3:48924 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:29:52 INFO BlockManagerInfo: Added taskresult_51 in memory on host2:60182 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:30:02 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
From the logs it looks like the driver is running out of memory.
For certain actions like collect, RDD data from all workers is transferred to the driver JVM.
Check your driver JVM settings.
Avoid collecting so much data onto the driver JVM.
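A sketch of the corresponding settings (the sizes are illustrative; the driver needs enough heap to hold the whole collected map):

import org.apache.spark.SparkConf

// collectAsMap() pulls every (key, value) pair onto the driver, so the driver
// heap must hold the whole map plus the serialized task results.
// spark.driver.memory only takes effect before the driver JVM starts, so in
// practice pass it via spark-submit (--driver-memory 8G) rather than in code.
val conf = new SparkConf()
  .set("spark.driver.memory", "8g")
  .set("spark.driver.maxResultSize", "4g")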

Spark SQL + Cassandra: bad performance

I'm just starting to use Spark SQL + Cassandra, and am probably missing something important, but one simple query takes ~45 seconds. I'm using the spark-cassandra-connector library and running a local web server which also hosts Spark. So my setup is roughly like this:
In sbt:
"org.apache.spark" %% "spark-core" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"org.apache.spark" %% "spark-sql" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M3" excludeAll(ExclusionRule(organization = "org.slf4j"))
In the code I have a singleton that hosts the SparkContext and CassandraSQLContext; it's then called from the servlet. Here's what the singleton code looks like:
object SparkModel {
  val conf =
    new SparkConf()
      .setAppName("core")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)
  val sqlC = new CassandraSQLContext(sc)
  sqlC.setKeyspace("core")
  val df: DataFrame = sqlC.cassandraSql(
    "SELECT email, target_entity_id, target_entity_type " +
    "FROM tracking_events " +
    "LEFT JOIN customers " +
    "WHERE entity_type = 'User' AND entity_id = customer_id")
}
And here is how I use it:
get("/spark") {
  SparkModel.df.collect().map(r => TrackingEvent(r.getString(0), r.getString(1), r.getString(2))).toList
}
Cassandra, Spark, and the web app run on the same host, in a virtual machine on my MacBook Pro with decent specs. Cassandra queries by themselves take 10-20 milliseconds.
When I call this endpoint for the first time, it takes 70-80 seconds to return the result. Subsequent queries take ~45 seconds. The log of the subsequent operation looks like this:
12:48:50 INFO org.apache.spark.SparkContext - Starting job: collect at V1Servlet.scala:1146
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Got job 1 (collect at V1Servlet.scala:1146) with 1 output partitions (allowLocal=false)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Final stage: ResultStage 1(collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Parents of final stage: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Missing parents: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146), which has no missing parents
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(18696) called with curMem=26661, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1 stored as values in memory (estimated size 18.3 KB, free 787.3 MB)
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(8345) called with curMem=45357, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.1 KB, free 787.3 MB)
12:48:50 INFO o.a.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on localhost:56289 (size: 8.1 KB, free: 787.3 MB)
12:48:50 INFO org.apache.spark.SparkContext - Created broadcast 1 from broadcast at DAGScheduler.scala:874
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.s.scheduler.TaskSchedulerImpl - Adding task set 1.0 with 1 tasks
12:48:50 INFO o.a.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 1.0 (TID 1, localhost, NODE_LOCAL, 59413 bytes)
12:48:50 INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 1.0 (TID 1)
12:48:50 INFO com.datastax.driver.core.Cluster - New Cassandra host localhost/127.0.0.1:9042 added
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
12:49:35 INFO o.a.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 45199 ms on localhost (1/1)
12:49:35 INFO o.a.s.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
12:49:35 INFO o.a.spark.scheduler.DAGScheduler - ResultStage 1 (collect at V1Servlet.scala:1146) finished in 45.199 s
As you can see from the log, the longest pauses are between these 3 lines (21 + 24 seconds):
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
Apparently, I'm doing something wrong. What is it? How can I improve this?
EDIT: Important addition: the tables are tiny (~200 entries for tracking_events, ~20 for customers), so reading them entirely into memory shouldn't take any significant time. And it's a local Cassandra installation; no cluster, no networking involved.
SELECT email, target_entity_id, target_entity_type
FROM tracking_events
LEFT JOIN customers
WHERE entity_type = 'User' AND entity_id = customer_id
This query will read all of the data from both the tracking_events and customers tables. I would compare the performance to just doing a SELECT COUNT(*) on both tables. If it is significantly different then there may be an issue, but my guess is this is just the amount of time it takes to read both tables entirely into memory.
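For instance, a quick comparison using the sqlC from the question's SparkModel (a sketch; the table names are the ones above):

// If these counts also take ~45 s, the time is going into scanning the tables
// rather than into the join itself.
val trackingCount = sqlC.cassandraSql("SELECT COUNT(*) FROM tracking_events").collect()
val customerCount = sqlC.cassandraSql("SELECT COUNT(*) FROM customers").collect()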
There are a few knobs for tuning how reads are done, and since the defaults are oriented towards a much bigger dataset, you may want to change these:
spark.cassandra.input.split.size_in_mb: approximate amount of data to be fetched into a Spark partition (default: 64 MB)
spark.cassandra.input.fetch.size_in_rows: number of CQL rows fetched per driver request (default: 1000)
I would make sure you are generating at least as many tasks as you have cores, so you can take advantage of all of your resources. To do this, shrink input.split.size_in_mb.
The fetch size controls how many rows are paged at a time by an executor core, so increasing it can improve speed in some use cases.
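A sketch of how those two properties could be applied to the SparkConf from the question (the values are only illustrative, and the property names are the ones listed above):

val conf = new SparkConf()
  .setAppName("core")
  .setMaster("local[*]")                                    // use all local cores instead of a single one
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.input.split.size_in_mb", "16")      // smaller splits -> more tasks
  .set("spark.cassandra.input.fetch.size_in_rows", "5000")  // more CQL rows paged per request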

Apache spark message understanding

Requesting help to understand this message:
INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is **2202921** bytes
What does 2202921 mean here?
My job does a shuffle operation, and while reading shuffle files from the previous stage, it prints this message first and then, after some time, fails with the error below:
14/11/12 11:09:46 WARN scheduler.TaskSetManager: Lost task 224.0 in stage 4.0 (TID 13938, ip-xx-xxx-xxx-xx.ec2.internal): FetchFailed(BlockManagerId(11, ip-xx-xxx-xxx-xx.ec2.internal, 48073, 0), shuffleId=2, mapId=7468, reduceId=224)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Marking Stage 4 (coalesce at <console>:49) as failed due to a fetch failure from Stage 3 (map at <console>:42)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Stage 4 (coalesce at <console>:49) failed in 213.446 s
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting Stage 3 (map at <console>:42) and Stage 4 (coalesce at <console>:49) due to fetch failure
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 2)
14/11/12 11:09:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Removed 11 successfully in removeExecutor
14/11/12 11:09:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 11 (11893/12836, false)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting failed stages
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting Stage 3 (MappedRDD[13] at map at <console>:42), which has no missing parents
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(25472) called with curMem=474762, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 24.9 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(5160) called with curMem=500234, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.0 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on ip-xx.ec2.internal:35571 (size: 5.0 KB, free: 10.4 GB)
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Updated info of block broadcast_6_piece0
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting 943 missing tasks from Stage 3 (MappedRDD[13] at map at <console>:42)
14/11/12 11:09:46 INFO cluster.YarnClientClusterScheduler: Adding task set 3.1 with 943 tasks
My code looks like this:
(rdd1 ++ rdd2).map { t => (t.id, t) }.groupByKey(1280).map {
  case (id, sequence) =>
    val newrecord = sequence.maxBy {
      case Fact(id, key, factType, day, group, c_key, s_key, plan_id, size,
                is_mom, customer_shipment_id, customer_shipment_item_id, asin,
                company_key, product_line_key, dw_last_updated, measures) =>
        dw_last_updated.toLong
    }
    (PARTITION_KEY + "=" + newrecord.day.toString + "/part", newrecord)
}.coalesce(2048, true).saveAsTextFile("s3://myfolder/PT/test20nodes/")
I derived 1280 because I have 20 nodes, each with 32 cores; I computed it as 2*32*20.
For a shuffle stage, Spark creates ShuffleMapTasks that write their intermediate results to disk. The location information is stored in MapStatuses and sent to the MapOutputTrackerMaster (the driver).
Then, when the next stage starts to run, it needs these location statuses, so the executors ask the MapOutputTrackerMaster for them. The MapOutputTrackerMaster serializes the statuses to bytes and sends them to the executors. The number in the message is the size of these serialized statuses in bytes.
These statuses are sent via Akka, and Akka has a limit on the maximum message size. You can set it via spark.akka.frameSize:
Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB. Increase this if your tasks need to send back large results to the driver (e.g. using collect() on a large dataset).
If the size is greater than spark.akka.frameSize, Akka will refuse to deliver the message and your job will fail, so it can help to adjust spark.akka.frameSize to a suitable value.
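A minimal sketch of raising that limit (the value is illustrative; whether it is needed depends on your Spark version and payload size):

import org.apache.spark.SparkConf

// spark.akka.frameSize is specified in MB. It is a legacy setting; on newer
// Spark versions the equivalent knob is spark.rpc.message.maxSize.
val conf = new SparkConf()
  .set("spark.akka.frameSize", "128")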