Spark Streaming: StreamingContext doesn't read data files - scala

I'm new to Spark Streaming and I'm trying to get started with it using the Spark shell.
Assuming I have a directory called "dataTest" placed in the root directory of spark-1.2.0-bin-hadoop2.4.
The simple code that I want to test in the shell is (after typing $.\bin\spark-shell):
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(2))
val data = ssc.textFileStream("dataTest")
println("Nb lines is equal to= "+data.count())
data.foreachRDD { (rdd, time) => println(rdd.count()) }
ssc.start()
ssc.awaitTermination()
And then, I copy some files into the directory "dataTest" (I also tried renaming some existing files in this directory).
But unfortunately I do not get what I want (i.e., I get no output, so it seems like ssc.textFileStream doesn't work correctly), just things like:
15/01/15 19:32:46 INFO JobScheduler: Added jobs for time 1421346766000 ms
15/01/15 19:32:46 INFO JobScheduler: Starting job streaming job 1421346766000 ms.0 from job set of time 1421346766000 ms
15/01/15 19:32:46 INFO SparkContext: Starting job: foreachRDD at <console>:20
15/01/15 19:32:46 INFO DAGScheduler: Job 69 finished: foreachRDD at <console>:20, took 0,000021 s
0
15/01/15 19:32:46 INFO JobScheduler: Finished job streaming job 1421346766000 ms.0 from job set of time 1421346766000 ms
15/01/15 19:32:46 INFO MappedRDD: Removing RDD 137 from persistence list
15/01/15 19:32:46 INFO JobScheduler: Total delay: 0,005 s for time 1421346766000 ms (execution: 0,002 s)
15/01/15 19:32:46 INFO BlockManager: Removing RDD 137
15/01/15 19:32:46 INFO UnionRDD: Removing RDD 78 from persistence list
15/01/15 19:32:46 INFO BlockManager: Removing RDD 78
15/01/15 19:32:46 INFO FileInputDStream: Cleared 1 old files that were older than 1421346706000 ms: 1421346704000 ms
15/01/15 19:32:46 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()

Did you try moving text files from another directory into the directory that is being monitored? For a file stream to work, you have to atomically put the files into the monitored directory, so that as soon as a file becomes visible in the directory listing, Spark can read all the data in it (which may not be the case if you are copying files into the directory).
This is well documented in the Basic Sources subsection of the programming guide.
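A minimal sketch of the atomic-move pattern (the file names and staging path here are hypothetical): write the file completely outside the monitored directory first, then move it in, so it only ever appears in the listing fully written.

import java.nio.file.{Files, Paths, StandardCopyOption}

// Stage the file somewhere outside the monitored directory first...
val staged = Paths.get("/tmp/staging/part-0001.txt")
// ...then move it in; on the same filesystem the move is atomic, so
// textFileStream never observes a half-written file.
Files.move(staged, Paths.get("dataTest/part-0001.txt"), StandardCopyOption.ATOMIC_MOVE)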

Copying the file with the command line, or doing a "save as" into the directory, worked for me.
When you copy normally (e.g., from an IDE), the modified date may not be updated, and the streaming context monitors the modified date.

I also struggled with the same issue, and for me the solution was to edit and save the file I want to use as the input stream while the streaming job is running, and then move the input file directly into the streaming directory, also while streaming is still running.

The following code worked for me:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// An object (rather than a class) is needed so main() can be launched directly
object StreamingData {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val input = ssc.textFileStream("file:///C:/Users/M1026352/Desktop/Spark/StreamInput")
    val lines = input.flatMap(_.split(" "))
    val words = lines.map(word => (word, 1))
    val counts = words.reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
You just need to keep the text file in Unix format.
If you open the file in Notepad++, go to
Settings -> Preferences -> New Document -> Unix/OSX.
Then change the file name so it gets picked up by the streaming job.
https://stackoverflow.com/a/41495776/5196927
Refer to the link above.
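For completeness, a small sketch (the path is hypothetical) of converting a file to Unix line endings programmatically instead of via Notepad++:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Rewrite Windows CRLF line endings as Unix LF in place.
val path = Paths.get("dataTest/input.txt")
val text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
Files.write(path, text.replace("\r\n", "\n").getBytes(StandardCharsets.UTF_8))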

I think this should generally work; however, the problem may be that Spark Streaming is recommended to be run as a standalone application rather than in spark-shell.
I ran this as a standalone application (on other streaming data) and it worked.
data.count() gives you how many elements are in each RDD of the DStream, which is the same as what you are counting in your foreachRDD().
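As a small illustrative sketch (not from the original answer), using the data DStream from the question: count() on a DStream returns another DStream, so it has to go through an output operation such as print() rather than println:

// DStream.count() yields a DStream[Long] with one element per batch;
// println() on it would just print the DStream object, not the counts.
val counts = data.count()
counts.print()   // prints the per-batch element count every interval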

I am doing almost the same thing (as a standalone app on a Windows 8 laptop) and it works fine for me; however, I have the "dataTest" folder located as a subfolder of "bin". Maybe try that?

I just used your code in the shell and it works fine. When I put some files into the directory (HDFS), I got an output log like this:
15/07/23 10:46:36 INFO dstream.FileInputDStream: Finding new files took 9 ms
15/07/23 10:46:36 INFO dstream.FileInputDStream: New files at time 1437619596000 ms:
hdfs://master:9000/user/jared/input/hadoop-env.sh
15/07/23 10:46:36 INFO storage.MemoryStore: ensureFreeSpace(235504) called with curMem=0, maxMem=280248975
......
15/07/23 10:46:36 INFO input.FileInputFormat: Total input paths to process : 1
15/07/23 10:46:37 INFO rdd.NewHadoopRDD: Input split: hdfs://master:9000/user/jared/input/hadoop-env.sh:0+4387
15/07/23 10:46:42 INFO dstream.FileInputDStream: Finding new files took 107 ms
15/07/23 10:46:42 INFO dstream.FileInputDStream: New files at time 1437619598000 ms:
15/07/23 10:46:42 INFO scheduler.JobScheduler: Added jobs for time 1437619598000 ms
15/07/23 10:46:42 INFO dstream.FileInputDStream: Finding new files took 23 ms
15/07/23 10:46:42 INFO dstream.FileInputDStream: New files at time 1437619600000 ms:
15/07/23 10:46:42 INFO scheduler.JobScheduler: Added jobs for time 1437619600000 ms
15/07/23 10:46:43 INFO dstream.FileInputDStream: Finding new files took 42 ms
15/07/23 10:46:43 INFO dstream.FileInputDStream: New files at time 1437619602000 ms:
15/07/23 10:46:43 INFO scheduler.JobScheduler: Added jobs for time 1437619602000 ms
15/07/23 10:46:43 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1830 bytes result sent to driver
15/07/23 10:46:43 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6098 ms on localhost (1/1)
15/07/23 10:46:43 INFO scheduler.DAGScheduler: ResultStage 0 (foreachRDD at <console>:29) finished in 6.178 s
15/07/23 10:46:43 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/23 10:46:43 INFO scheduler.DAGScheduler: Job 66 finished: foreachRDD at <console>:29, took 6.647137 s
101

Related

How to fetch Kafka Stream and print it in Spark Shell?

First I created an SBT build file in a folder in the following way:
val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
)
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
Later, in the same folder where my "build.sbt" exists, I started the Spark shell in the following way:
>/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1
These are the warnings shown while the Spark shell starts:
WARN AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.Bind
java.net.BindException: Address already in use
WARN AbstractLifeCycle: FAILED org.spark-project.jetty.server.Server#75bf9e67:
java.net.BindException: Address already in use
In the Spark shell I am importing the following packages:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Then in the Spark shell I create a configuration in the following way:
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver").set("spark.driver.allowMultipleContexts", "true").setMaster("local");
After creating the configuration, I create a new Spark streaming context with it in the following way:
val ssc = new StreamingContext(conf, Seconds(10))
During the creation of the Spark streaming context, the warnings shown above are raised again, along with other warnings, as shown below:
WARN AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.Bind
java.net.BindException: Address already in use
.
.
.
WARN AbstractLifeCycle: FAILED org.spark-project.jetty.server.Server#75bf9e67:
java.net.BindException: Address already in use
.
.
.
WARN SparkContext: Multiple running SparkContexts detected in the same JVM!
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
.
.
.
WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext#616f1c2e
Then, using the created Spark streaming context, I create a kafkaStream in the following way:
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", map("spark-topic" -> 5))
Then I print the stream and start the ssc in the following way:
kafkaStream.print()
ssc.start
After running the above commands in the shell, the stream starts, but no values are ever printed; only scheduler information is logged.
The output that gets printed repeatedly is shown below:
17/08/18 10:01:30 INFO JobScheduler: Starting job streaming job 1503050490000 ms.0 from job set of time 1503050490000 ms
17/08/18 10:01:30 INFO JobScheduler: Finished job streaming job 1503050490000 ms.0 from job set of time 1503050490000 ms
17/08/18 10:01:30 INFO JobScheduler: Total delay: 0.003 s for time 1503050490000 ms (execution: 0.000 s)
17/08/18 10:01:30 INFO BlockRDD: Removing RDD 3 from persistence list
17/08/18 10:01:30 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[3] at createStream at <console>:39 of time 1503050490000 ms
17/08/18 10:01:30 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1503050470000 ms)
17/08/18 10:01:30 INFO InputInfoTracker: remove old batch metadata: 1503050470000 ms
17/08/18 10:01:30 INFO BlockManager: Removing RDD 3
17/08/18 10:01:40 INFO JobScheduler: Added jobs for time 1503050500000 ms
-------------------------------------------
Time: 1503050500000 ms
-------------------------------------------
WARN AbstractLifeCycle: FAILED SelectChannelConnector#0.0.0.0:4040: java.net.Bind
java.net.BindException: Address already in use
It means that the port it needs is already in use. As a rule, port 4040 is used by the Spark Thrift server, so try to stop the Thrift server using stop-thriftserver.sh from the spark/sbin folder, or check what else is using this port and free it.
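If the port cannot be freed, another option (a sketch, not part of the original answer) is to move the Spark UI off port 4040 via the standard spark.ui.port setting; 4041 is just an example value:

import org.apache.spark.SparkConf

// Can also be passed at launch, e.g. --conf spark.ui.port=4041
val conf = new SparkConf()
  .setAppName("KafkaReceiver")
  .set("spark.ui.port", "4041")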
I was able to fix it by doing the following things:
Take care of case sensitivity, because Scala is a case-sensitive language.
In the part of the code below, Map() must be used instead of map():
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", map("spark-topic" -> 5)) // wrong practice! This is why Spark couldn't fetch your stream!
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("spark-topic" -> 5)) // right practice!
Check whether the producer is actually streaming to the Kafka topic mentioned in the Map! Spark cannot fetch your stream from that topic when the Kafka producer streaming data to it has stopped, or when the streamed data has run out; in that case Spark starts removing RDDs from the array buffer and displays messages like the ones below.
17/08/18 10:01:30 INFO JobScheduler: Starting job streaming job 1503050490000 ms.0 from job set of time 1503050490000 ms
17/08/18 10:01:30 INFO JobScheduler: Finished job streaming job 1503050490000 ms.0 from job set of time 1503050490000 ms
17/08/18 10:01:30 INFO JobScheduler: Total delay: 0.003 s for time 1503050490000 ms (execution: 0.000 s)
17/08/18 10:01:30 INFO BlockRDD: Removing RDD 3 from persistence list
17/08/18 10:01:30 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[3] at createStream at <console>:39 of time 1503050490000 ms
17/08/18 10:01:30 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1503050470000 ms)
17/08/18 10:01:30 INFO InputInfoTracker: remove old batch metadata: 1503050470000 ms
17/08/18 10:01:30 INFO BlockManager: Removing RDD 3
17/08/18 10:01:40 INFO JobScheduler: Added jobs for time 1503050500000 ms
-------------------------------------------
Time: 1503050500000 ms
-------------------------------------------
Follow the comments and response of #YehorKrivokon and #VinodChandak to avoid the warnings faced!
Kind of late, but this could help somebody else. The Spark shell already instantiates a SparkContext, which is available as sc, so to create the StreamingContext, just pass the existing sc as the argument. Hope this helps!
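A minimal sketch of that, for the shell (the 10-second batch interval is just the value from the question):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the shell's existing SparkContext instead of building a new SparkConf;
// this avoids the "Only one SparkContext may be running in this JVM" error.
val ssc = new StreamingContext(sc, Seconds(10))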

Why are the executors getting killed by the driver?

The first stage of my Spark job is quite simple.
It reads from a big number of files (around 30,000 files and 100 GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true))
When running it, I observe quite peculiar behavior. The number of executors grows until the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approx. 14000/33428 tasks) and only a few executors remain; they are killed by the driver. When this task is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But in that case, I do not understand why they would become idle; there remain a lot of tasks to do in the stage...
Do you have any idea of why it happens?
EDIT
More details about the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe that there's some spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out.
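As a sketch of the kind of setting that may apply here (assuming dynamic allocation is in use; the 300s value is only an example):

import org.apache.spark.SparkConf

// With dynamic allocation, executors that appear idle for longer than
// spark.dynamicAllocation.executorIdleTimeout (default 60s) are reclaimed.
// Raising it can keep executors alive through long work that Spark does not
// observe as task activity.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "300s")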

Spark hangs during RDD reading

I have an Apache Spark master node. When I try to iterate through RDDs, Spark hangs.
Here is an example of my code:
val conf = new SparkConf()
  .setAppName("Demo")
  .setMaster("spark://localhost:7077")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")
println("Start")
records.collect().foreach(println)
println("Finish")
Spark log says:
Start
16/04/05 17:32:23 INFO FileInputFormat: Total input paths to process : 1
16/04/05 17:32:23 INFO SparkContext: Starting job: collect at Application.scala:23
16/04/05 17:32:23 INFO DAGScheduler: Got job 0 (collect at Application.scala:23) with 2 output partitions
16/04/05 17:32:23 INFO DAGScheduler: Final stage: ResultStage 0 (collect at Application.scala:23)
16/04/05 17:32:23 INFO DAGScheduler: Parents of final stage: List()
16/04/05 17:32:23 INFO DAGScheduler: Missing parents: List()
16/04/05 17:32:23 INFO DAGScheduler: Submitting ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19), which has no missing parents
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1811.0 B, free 122.3 KB)
16/04/05 17:32:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.18.199.187:55983 (size: 1811.0 B, free: 2.4 GB)
16/04/05 17:32:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/05 17:32:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19)
16/04/05 17:32:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
I see only a "Start" message. Seems Spark do nothing to read RDDs. Any ideas how to fix it?
UPD
The data I want to read:
123v4n312bv4nb12,Action,Comedy
2n4vhj2gvrh24gvr,Action,Drama
sjfu326gjrw6g374,Drama,Horror
If Spark hangs on such a small dataset, I would first look at:
Am I trying to connect to a cluster that doesn't respond/exist? If I am trying to connect to a running cluster, I would first try to run the same code locally with setMaster("local[*]"). If this works, I would know that something is going on with the "master" I am trying to connect to.
Am I asking for more resources than what the cluster has to offer? For example, if the cluster manages 2 GB and I ask for a 3 GB executor, my application will never get scheduled and will sit in the job queue forever.
Specific to the comments above: if you started your cluster with sbin/start-master.sh you will NOT get a running cluster. At the very minimum you need a master and a worker (for standalone); you should use the start-all.sh script. I recommend a bit more homework and following a tutorial. A quick local sanity check is sketched below.
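A minimal sketch of that local check, reusing the file path from the question: if this runs with local[*], the problem lies with the standalone master being targeted.

import org.apache.spark.{SparkConf, SparkContext}

// Run entirely locally, using all available cores, to rule out cluster issues.
val conf = new SparkConf()
  .setAppName("Demo")
  .setMaster("local[*]")
val sc = new SparkContext(conf)
sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")
  .collect()
  .foreach(println)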
Use this instead:
val bufferedSource = io.Source.fromFile("/path/filename.csv")
for (line <- bufferedSource.getLines) {
  println(line)
}

Spark SQL + Cassandra: bad performance

I'm just starting to use Spark SQL + Cassandra, and am probably missing something important, but one simple query takes ~45 seconds. I'm using the cassandra-spark-connector library, and run a local web server which also hosts Spark. So my setup is roughly like this:
In sbt:
"org.apache.spark" %% "spark-core" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"org.apache.spark" %% "spark-sql" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M3" excludeAll(ExclusionRule(organization = "org.slf4j"))
In the code I have a singleton that hosts the SparkContext and CassandraSQLContext. It's then called from the servlet. Here's what the singleton code looks like:
object SparkModel {
  val conf =
    new SparkConf()
      .setAppName("core")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)
  val sqlC = new CassandraSQLContext(sc)
  sqlC.setKeyspace("core")
  val df: DataFrame = sqlC.cassandraSql(
    "SELECT email, target_entity_id, target_entity_type " +
    "FROM tracking_events " +
    "LEFT JOIN customers " +
    "WHERE entity_type = 'User' AND entity_id = customer_id")
}
And here how I use it:
get("/spark") {
SparkModel.df.collect().map(r => TrackingEvent(r.getString(0), r.getString(1), r.getString(2))).toList
}
Cassandra, Spark and the web app run on the same host in virtual machine on my Macbook Pro with decent specs. Cassandra queries by themselves take 10-20 milliseconds.
When I call this endpoint for the first time, it takes 70-80 seconds to return the result. Subsequent queries take ~45 seconds. The log of the subsequent operation looks like this:
12:48:50 INFO org.apache.spark.SparkContext - Starting job: collect at V1Servlet.scala:1146
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Got job 1 (collect at V1Servlet.scala:1146) with 1 output partitions (allowLocal=false)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Final stage: ResultStage 1(collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Parents of final stage: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Missing parents: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146), which has no missing parents
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(18696) called with curMem=26661, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1 stored as values in memory (estimated size 18.3 KB, free 787.3 MB)
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(8345) called with curMem=45357, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.1 KB, free 787.3 MB)
12:48:50 INFO o.a.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on localhost:56289 (size: 8.1 KB, free: 787.3 MB)
12:48:50 INFO org.apache.spark.SparkContext - Created broadcast 1 from broadcast at DAGScheduler.scala:874
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.s.scheduler.TaskSchedulerImpl - Adding task set 1.0 with 1 tasks
12:48:50 INFO o.a.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 1.0 (TID 1, localhost, NODE_LOCAL, 59413 bytes)
12:48:50 INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 1.0 (TID 1)
12:48:50 INFO com.datastax.driver.core.Cluster - New Cassandra host localhost/127.0.0.1:9042 added
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
12:49:35 INFO o.a.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 45199 ms on localhost (1/1)
12:49:35 INFO o.a.s.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
12:49:35 INFO o.a.spark.scheduler.DAGScheduler - ResultStage 1 (collect at V1Servlet.scala:1146) finished in 45.199 s
As you can see from the log, the longest pauses are between these 3 lines (21 + 24 seconds):
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
Apparently, I'm doing something wrong. What's that? How can I improve this?
EDIT: Important addition: the size of the tables is tiny (~200 entries for tracking_events, ~20 for customers), so reading them in their whole into memory shouldn't take any significant time. And it's a local Cassandra installation, no cluster, no networking is involved.
"SELECT email, target_entity_id, target_entity_type " +
"FROM tracking_events " +
"LEFT JOIN customers " +
"WHERE entity_type = 'User' AND entity_id = customer_id")
This query will read all of the data from both the tracking_events and customers tables. I would compare the performance to just doing a SELECT COUNT(*) on both tables. If it is significantly different then there may be an issue, but my guess is this is just the amount of time it takes to read both tables entirely into memory.
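A quick way to run that comparison (a sketch; it reuses the sqlC context from the question):

// Full-table scans with no join: if these take roughly as long as the
// original query, the time is going into reading the tables, not the join.
val trackingCount = sqlC.cassandraSql("SELECT COUNT(*) FROM tracking_events").collect()
val customersCount = sqlC.cassandraSql("SELECT COUNT(*) FROM customers").collect()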
There are a few knobs for tuning how reads are done, and since the defaults are oriented towards a much bigger dataset, you may want to change these:
spark.cassandra.input.split.size_in_mb: approx. amount of data to be fetched into a Spark partition (default 64 MB)
spark.cassandra.input.fetch.size_in_rows: number of CQL rows fetched per driver request (default 1000)
I would make sure you are generating at least as many tasks as you have cores, so you can take advantage of all of your resources. To do this, shrink the input split size.
The fetch size controls how many rows are paged at a time by an executor core, so increasing it can improve speed in some use cases. A sketch of both settings follows.
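A sketch of how these could be set, using the property names above (the values are only examples for a small local dataset):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // smaller splits produce more tasks, so every core gets work even on tiny tables
  .set("spark.cassandra.input.split.size_in_mb", "8")
  // more rows per page can help large sequential reads
  .set("spark.cassandra.input.fetch.size_in_rows", "10000")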

Why, after some stages, are all the tasks assigned to one machine (executor) in Spark?

I encountered this problem: when I run a machine learning task on Spark, after some stages all the tasks are assigned to one machine (executor), and the stage execution gets slower and slower.
[the spark conf setting]
val conf = new SparkConf().setMaster(sparkMaster).setAppName("ModelTraining").setSparkHome(sparkHome).setJars(List(jarFile))
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "LRRegistrator")
conf.set("spark.storage.memoryFraction", "0.7")
conf.set("spark.executor.memory", "8g")
conf.set("spark.cores.max", "150")
conf.set("spark.speculation", "true")
conf.set("spark.storage.blockManagerHeartBeatMs", "300000")
val sc = new SparkContext(conf)
val lines = sc.textFile("hdfs://xxx:52310"+inputPath , 3)
val trainset = lines.map(parseWeightedPoint).repartition(50).persist(StorageLevel.MEMORY_ONLY)
[the warn log from the spark]
14/09/19 10:26:23 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(45, TS-BH109, 48384, 0)
14/09/19 10:27:18 WARN TaskSetManager: Lost TID 726 (task 14.0:9)
14/09/19 10:29:03 WARN SparkDeploySchedulerBackend: Ignored task status update (737 state FAILED) from unknown executor Actor[akka.tcp://sparkExecutor#TS-BH96:33178/user/Executor#-913985102] with ID 39
14/09/19 10:29:03 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(30, TS-BH136, 28518, 0)
14/09/19 11:01:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(47, TS-BH136, 31644, 0) with no recent heart beats: 47765ms exceeds 45000ms
Any suggestions?
Can you post the executor log -- anything suspicious there? In particular, you have Lost TID 726 (task 14.0:9). Further up in the driver log you should see which executor TID 726 was assigned to -- I'd check the error log on that machine and see if anything of interest shows up there.
My guess (and it's only a guess) is that your executor is crashing. At that point the system will try to launch a new executor, but that is generally slow. In the meantime the current task might get resent to an existing executor, hosing your system further.