Issue passing data from one task function to another in Locust with sequential tasks - locust

We are trying to achieve user-scenario based load testing which creates an order via one URL and then passes that order's ID to the next URL to get the status of the order.
We are using Locust's sequential tasks for this, as we want the requests to run in order: first request -> second request -> third request. We are already getting the response data as expected, but we are not able to pass that data to the third task in order to display the status of the order.
import json
from locust import HttpLocust, TaskSet, task, TaskSequence, seq_task


class MyTaskSequence(TaskSequence):
    response_data = ""

    @seq_task(1)
    def index(self):
        print("--- First Task")
        response = self.client.get("/order/testing-06a5c/")
        print(response.status_code)

    @seq_task(2)
    def get_details(self):
        print("--- Second Task")
        response = self.client.post(return_data, headers={"authority": "staging.domain.com", "referer":"https://staging.domain.com/"})
        print(response.status_code)
        response_data = json.loads(response.text)
        print(response_data["details"]["claim_uri"])
        self.response_data

    def on_start(self):
        self.get_details()

    @seq_task(3)
    def post_details(self):
        print(self.get_details())
        print("-- Third Task", self.response_data)
        # return_data = self.response_data["details"]["claim_uri"]
        # response = self.client.post(return_data, headers={"authority": "staging.domain.com", "referer":"https://staging.domain.com/"})
        # print(response.text)


class MyLocust(HttpLocust):
    task_set = MyTaskSequence
    min_wait = 5000
    max_wait = 15000
    host = 'https://staging.domain.com'
Output:
[2018-11-19 19:24:19,784] system.local/INFO/stdout:
[2018-11-19 19:24:19,784] system.local/INFO/stdout: --- First Task
[2018-11-19 19:24:19,785] system.local/INFO/stdout:
[2018-11-19 19:24:20,235] system.local/INFO/stdout: 200
[2018-11-19 19:24:20,235] system.local/INFO/stdout:
[2018-11-19 19:24:29,372] system.local/INFO/stdout: --- Second Task
[2018-11-19 19:24:29,373] system.local/INFO/stdout:
[2018-11-19 19:24:29,653] system.local/INFO/stdout: 200
[2018-11-19 19:24:29,654] system.local/INFO/stdout:
[2018-11-19 19:24:29,654] system.local/INFO/stdout: /payment/initiate/claim/bc6d5024-f608-41af-8e78-191798c31a69/
[2018-11-19 19:24:29,654] system.local/INFO/stdout:
[2018-11-19 19:24:37,089] system.local/INFO/stdout: --- Second Task
[2018-11-19 19:24:37,089] system.local/INFO/stdout:
[2018-11-19 19:24:37,367] system.local/INFO/stdout: 200
[2018-11-19 19:24:37,367] system.local/INFO/stdout:
[2018-11-19 19:24:37,367] system.local/INFO/stdout: /payment/initiate/claim/72217a35-01fc-488e-885e-aea81a57a463/
[2018-11-19 19:24:37,368] system.local/INFO/stdout:
[2018-11-19 19:24:37,368] system.local/INFO/stdout: None
[2018-11-19 19:24:37,368] system.local/INFO/stdout:
[2018-11-19 19:24:37,368] system.local/INFO/stdout: ('-- Third Task', '')
[2018-11-19 19:24:37,368] system.local/INFO/stdout:
^C[2018-11-19 19:24:40,598] system.local/ERROR/stderr: KeyboardInterrupt
We want to pass the response_data["details"]["claim_uri"] value to @seq_task(3), so that we can form a dynamic target of the form domain.com/response_data["details"]["claim_uri"]. This value is a string and has to be passed anew every time the third task is invoked.
We tried this approach but are getting None in the output.
Is there anything we are missing?

The issue was with the response_data variable: its scope was local because it was assigned without self.
The code below works.
@seq_task(2)
def get_details(self):
    <-- function data -->
    self.response_data = json.loads(response.text)
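For reference, here is a minimal sketch of the full corrected sequence, using the pre-1.0 Locust API from the question (HttpLocust/TaskSequence/seq_task). The second task's target URL and the headers are placeholders taken from the question (the original snippet posted to an undefined return_data), so adjust them to the real endpoints:

import json
from locust import HttpLocust, TaskSequence, seq_task


class MyTaskSequence(TaskSequence):
    # per-user state, written by one task and read by the next
    response_data = None

    @seq_task(1)
    def index(self):
        self.client.get("/order/testing-06a5c/")

    @seq_task(2)
    def get_details(self):
        # placeholder URL; the original code posted to an undefined return_data
        response = self.client.post(
            "/order/testing-06a5c/",
            headers={"authority": "staging.domain.com",
                     "referer": "https://staging.domain.com/"},
        )
        # assign to the instance attribute, not to a local variable
        self.response_data = json.loads(response.text)

    @seq_task(3)
    def post_details(self):
        # the claim_uri produced by the previous task becomes the next target
        claim_uri = self.response_data["details"]["claim_uri"]
        response = self.client.post(
            claim_uri,
            headers={"authority": "staging.domain.com",
                     "referer": "https://staging.domain.com/"},
        )
        print(response.status_code)


class MyLocust(HttpLocust):
    task_set = MyTaskSequence
    min_wait = 5000
    max_wait = 15000
    host = "https://staging.domain.com"

Because the second task sets self.response_data on the TaskSequence instance, the third task reads the same attribute on the same instance, which is exactly what the fix above relies on.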

Related

Job submitted via Spark job server fails with error

I am using Spark Job Server to submit Spark jobs on a cluster. The application I am trying to test is a Spark program based on Sansa query and the Sansa stack. Sansa is used for scalable processing of huge amounts of RDF data, and Sansa query is one of the Sansa libraries, used for querying RDF data.
When I run the Spark application with the spark-submit command it works correctly as expected. But when I run the program through Spark Job Server, the application fails most of the time with the exception below.
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_0 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/05/29 18:57:00 INFO SparkContext: Invoking stop() from shutdown hook
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event, stopping job manger.
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event externally, stopping job manager
20/05/29 18:57:00 INFO SparkUI: Stopped Spark web UI at http://10.138.32.96:46627
20/05/29 18:57:00 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 63, us1salxhpw0653.corpnet2.com, executor 1, partition 3, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 60) in 513 ms on us1salxhpw0653.corpnet2.com (executor 1) (1/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 64, us1salxhpw0669.corpnet2.com, executor 2, partition 4, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 61) in 512 ms on us1salxhpw0669.corpnet2.com (executor 2) (2/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 5.0 in stage 3.0 (TID 65, us1salxhpw0670.corpnet2.com, executor 3, partition 5, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 62) in 536 ms on us1salxhpw0670.corpnet2.com (executor 3) (3/560)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_4 in memory on us1salxhpw0669.corpnet2.com:34922 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_3 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO DAGScheduler: Job 2 failed: save at SansaQueryExample.scala:32, took 0.732943 s
20/05/29 18:57:00 INFO DAGScheduler: ShuffleMapStage 3 (save at SansaQueryExample.scala:32) failed in 0.556 s due to Stage cancelled because SparkContext was shut down
20/05/29 18:57:00 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
    at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1923)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1922)
    at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:584)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at scala.util.Try$.apply(Try.scala:192)
Code used for direct execution:
object SansaQueryExampleWithoutSJS {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("sansa stack example").getOrCreate()
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = spark.rdf(lang)(input)
    println(graphRdd.collect().foreach(println))
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
  }
}
Code Integrated with Spark Job Server
object SansaQueryExample extends SparkSessionJob {
  override type JobData = Seq[String]
  override type JobOutput = collection.Map[String, Long]

  override def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
    JobData Or Every[ValidationProblem] = {
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))
  }

  override def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = sparkSession.rdf(lang)(input)
    println(graphRdd.collect().foreach(println))
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
    sparkSession.sparkContext.parallelize(data).countByValue
  }
}
The steps for executing an application via Spark Job Server are explained here; mainly (a rough sketch of these calls follows the list):
upload the jar into SJS through the REST API
create a Spark context with memory and cores as required, through another API
execute the job via yet another API call, mentioning the jar and the context already created
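For illustration only, here is a minimal Python sketch of those three REST calls against the standard SJS routes (/jars, /contexts, /jobs); the host/port, jar name, context settings and class path are placeholder values assumed for the example, not the asker's actual setup:

import requests

SJS = "http://localhost:8090"  # placeholder Spark Job Server address

# 1. upload the jar into SJS through the REST API
with open("sansa-query-example.jar", "rb") as jar:  # placeholder jar file
    requests.post(SJS + "/jars/sansa-example", data=jar.read())

# 2. create a Spark context with the required memory and cores
requests.post(SJS + "/contexts/sansa-context",
              params={"num-cpu-cores": "4", "memory-per-node": "2g"})

# 3. run the job, referencing the uploaded jar (appName) and the context
resp = requests.post(SJS + "/jobs",
                     params={"appName": "sansa-example",
                             "classPath": "SansaQueryExample",  # placeholder class path
                             "context": "sansa-context",
                             "sync": "false"},
                     data="input.string = a b c")  # config read by validate()
print(resp.json())

Adjust the endpoint values to match the actual deployment; the point is only to show which call corresponds to which of the three steps above.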
When I observed different executions of the program, I could see that Spark Job Server behaves inconsistently: the program works on a few occasions without any errors. I also observed that the SparkContext is being shut down for some unknown reason. I am using SJS 0.8.0, Sansa 0.7.1 and Spark 2.4.

Spark job keeps failing by running the same map 3 times

My job has a step where I convert the data frame to an RDD[(key, value)], but the step runs 3 times, gets stuck on the third attempt, and fails.
Spark UI shows :
Active Jobs(1)
Job Id (Job Group) Description Submitted Duration Stages: Succeeded/Total Tasks (for all stages): Succeeded/Total
3 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 05:50:02 1.6 min 0/1 210/45623
Completed Jobs (2)
2 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 05:16:28 23 min 1/1 46742/46075 (21 failed)
1 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 04:47:58 17 min 1/1 47369/46795 (20 failed)
This is the code:
val eventsRDD = eventsDF.map { r =>
  val customerId = r.getAs[String]("customerId")
  val itemId = r.getAs[String]("itemId")
  val countryId = r.getAs[Long]("countryId").toInt
  val timeStamp = r.getAs[String]("eventTimestamp")
  val totalRent = r.getAs[Int]("totalRent")
  val totalPurchase = r.getAs[Int]("totalPurchase")
  val totalProfit = r.getAs[Int]("totalProfit")
  val store = r.getAs[String]("store")
  val itemName = r.getAs[String]("itemName")
  val itemName = if (itemName.size > 0 && itemName.nonEmpty && itemName != null) itemName else "NA"
  (itemId, (customerId, countryId, timeStamp, totalRent, totalProfit, totalPurchase, store, itemName))
}
Can someone tell me what is wrong here? If I want to persist/cache, which one should I use?
Error:
16/10/25 23:28:55 INFO YarnClientSchedulerBackend: Asked to remove non-existent executor 181
16/10/25 23:28:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1477415847345_0005_02_031011 on host: ip-172-31-14-104.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1477415847345_0005_02_031011
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Your map operation is resulting in some error, and that propagates to the driver, which results in task failure.
By default spark.task.maxFailures has a value of 4, which means:
Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1.
So when your task fails, Spark tries to recompute the map operation until it has failed 4 times in all.
If I want to persist/cache, which one should I use?
cache is just a specific persist operation, in which the RDD is persisted with the default storage level (MEMORY_ONLY).
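As an aside, here is a minimal PySpark sketch (the question's code is Scala, but the configuration key and the cache/persist relationship are the same; the app name and retry value are illustrative):

from pyspark import SparkConf, SparkContext, StorageLevel

# spark.task.maxFailures controls how often a single task may fail
# before the whole job is aborted (default: 4)
conf = SparkConf().setAppName("retry-demo").set("spark.task.maxFailures", "8")
sc = SparkContext(conf=conf)

rdd_a = sc.parallelize(range(10))
rdd_b = sc.parallelize(range(10))

rdd_a.cache()                            # shorthand: default storage level
rdd_b.persist(StorageLevel.MEMORY_ONLY)  # the explicit equivalent for an RDD

Neither call fixes a failing map function, though: caching only avoids recomputing an RDD once its lineage has been evaluated successfully.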

FileNotFound Error in Spark

I am running a simple Spark program on a cluster:
val logFile = "/home/hduser/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println()
println()
println()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
println()
println()
println()
println()
println()
and I get the following error:
15/10/27 19:44:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 6]
15/10/27 19:44:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/10/27 19:44:01 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 7]
15/10/27 19:44:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/10/27 19:44:01 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/27 19:44:01 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:55) failed in 7.636 s
15/10/27 19:44:01 INFO DAGScheduler: Job 0 failed: count at SimpleApp.scala:55, took 7.810387 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.19): java.io.FileNotFoundException: File file:/home/hduser/README.md does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The file is in the correct place. If I replace README.md with README.txt it works fine. Can someone help with this?
Thanks
If you are running a multinode cluster, make sure all nodes have the file in the same given path, with respect to their own filesystem. Or, you know, just use HDFS.
In the multi-node case the "/home/hduser/README.md" path is distributed to the worker nodes as well, but README.md probably exists only on the master node. When the workers try to access this file, they won't look into the master's filesystem; instead, each will try to find it on its own filesystem. If you have the same file on the same path on every node, the code is very likely to work. To do this, copy the file to every node's filesystem using the same path.
As you've probably noticed, the solution above is cumbersome. The Hadoop filesystem, HDFS, solves this issue and many more. You should look into it.
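As a rough sketch of the HDFS suggestion (in PySpark rather than the question's Scala, and with a placeholder namenode URI), the difference between a node-local path and an HDFS path looks like this:

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("read-demo"))

# A file: path like this must exist on every executor node's local filesystem:
# local_lines = sc.textFile("file:///home/hduser/README.md")

# An HDFS path is visible to every executor, so no per-node copies are needed
# ("hdfs://namenode:8020" is a placeholder for the cluster's namenode):
lines = sc.textFile("hdfs://namenode:8020/user/hduser/README.md")
print(lines.filter(lambda line: "a" in line).count())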
It is simply because a file with the .md extension contains plain text with formatting information. When you save this file with a .txt extension, the formatting information is removed or not considered. sc.textFile() works with plain text.

Link celery tasks on TimeLimitExceeded or Failure

'worker_collect' is the celery package.
All the tasks are routed to the same queue.
I'm executing 2 tasks.
The task 'collect_data' is calling a task 'do_request'.
I put a sleep(5) in order to force the timeout.
When the task 'do_request' times out or fails, I'm trying to call the task 'fetch_result'.
The function 'fetch_result' is empty for now, but afterwards it should get the result of the task 'do_request'.
My question:
Where do I have to catch the raised 'TimeLimitExceeded' exception in order to call the function 'fetch_result'?
Here is the code:
from __future__ import absolute_import
from time import sleep
from worker_collect.celery import app
from celery.utils.log import get_task_logger
from celery.exceptions import TimeLimitExceeded


@app.task(time_limit=3, max_retries=None)
def fetch_result():
    logger = get_task_logger(__name__)
    logger.info('-------------------')
    logger.info('----- FETCH -------')
    logger.info('-------------------')
    logger.info('-------------------')


@app.task(time_limit=3, max_retries=None)
def do_request():
    logger = get_task_logger(__name__)
    logger.info('---------------------')
    logger.info('----- DO REQUEST ----')
    logger.info('---------------------')
    logger.info('---------------------')
    sleep(5)
    return 'do_request DONE'


@app.task(time_limit=8, max_retries=None)
def collect_data():
    logger = get_task_logger(__name__)
    logger.info('---------------------')
    logger.info('-----COLLECT DATA----')
    logger.info('---------------------')
    do_request.apply_async(link=fetch_result.s(), link_error=fetch_result.s())
    logger.info('---------------------')
    logger.info('--END COLLECT DATA---')
    logger.info('---------------------')
Here are the Celery logs:
[2015-02-13 15:18:08,275: INFO/Beat] beat: Starting...
[2015-02-13 15:18:08,301: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2015-02-13 15:18:08,315: INFO/MainProcess] mingle: searching for neighbors
[2015-02-13 15:18:09,328: INFO/MainProcess] mingle: all alone
[2015-02-13 15:18:09,366: WARNING/MainProcess] celery@collector ready.
[2015-02-13 15:18:09,416: INFO/MainProcess] Received task: worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]
[2015-02-13 15:18:09,419: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,420: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: -----COLLECT DATA----
[2015-02-13 15:18:09,421: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,469: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,470: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: --END COLLECT DATA---
[2015-02-13 15:18:09,470: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,470: INFO/MainProcess] Received task: worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]
[2015-02-13 15:18:09,472: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ---------------------
[2015-02-13 15:18:09,472: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ----- DO REQUEST ----
[2015-02-13 15:18:09,472: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ---------------------
[2015-02-13 15:18:09,473: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ---------------------
[2015-02-13 15:18:09,475: INFO/MainProcess] Task worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283] succeeded in 0.0564397389971s: None
[2015-02-13 15:18:12,478: ERROR/MainProcess] Task worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c] raised unexpected: TimeLimitExceeded(3,)
Traceback (most recent call last):
File "/home/bschom/celery_env/local/lib/python2.7/site-packages/billiard/pool.py", line 639, in on_hard_timeout
raise TimeLimitExceeded(job._timeout)
TimeLimitExceeded: TimeLimitExceeded(3,)
[2015-02-13 15:18:12,479: ERROR/MainProcess] Hard time limit (3s) exceeded for worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]
[2015-02-13 15:18:12,587: ERROR/MainProcess] Process 'Worker-2' pid:14159 exited with 'signal 9 (SIGKILL)'
Thanks in advance.

Scala Spark SQLContext Program throwing array out of bound exception

I am new to Apache Spark. I am trying to create a schema and load data from hdfs. Below is my code:
// importing sqlcontext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// defining the schema
case class Author1(Author_Key: Long, Author_ID: Long, Author: String, First_Name: String, Last_Name: String, Middle_Name: String, Full_Name: String, Institution_Full_Name: String, Country: String, DIAS_ID: Int, R_ID: String)

val D_Authors1 = sc.textFile("hdfs:///user/D_Authors.txt")
  .map(_.split("\\|"))
  .map(auth => Author1(auth(0).trim.toLong, auth(1).trim.toLong, auth(2), auth(3), auth(4), auth(5), auth(6), auth(7), auth(8), auth(9).trim.toInt, auth(10)))

// register the table
D_Authors1.registerAsTable("D_Authors1")
val auth = sqlContext.sql("SELECT * FROM D_Authors1")
sqlContext.sql("SELECT * FROM D_Authors").collect().foreach(println)
When I execute this code it throws an ArrayIndexOutOfBoundsException. Below is the error:
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/08/18 06:57:14 INFO FileInputFormat: Total input paths to process : 1
14/08/18 06:57:14 INFO SparkContext: Starting job: collect at <console>:24
14/08/18 06:57:14 INFO DAGScheduler: Got job 5 (collect at <console>:24) with 2 output partitions (allowLocal=false)
14/08/18 06:57:14 INFO DAGScheduler: Final stage: Stage 5(collect at <console>:24)
14/08/18 06:57:14 INFO DAGScheduler: Parents of final stage: List()
14/08/18 06:57:14 INFO DAGScheduler: Missing parents: List()
14/08/18 06:57:14 INFO DAGScheduler: Submitting Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174), which has no missing parents
14/08/18 06:57:14 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174)
14/08/18 06:57:14 INFO YarnClientClusterScheduler: Adding task set 5.0 with 2 tasks
14/08/18 06:57:14 INFO TaskSetManager: Starting task 5.0:0 as TID 38 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:14 INFO TaskSetManager: Serialized task 5.0:0 as 4401 bytes in 1 ms
14/08/18 06:57:15 INFO TaskSetManager: Starting task 5.0:1 as TID 39 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:15 INFO TaskSetManager: Serialized task 5.0:1 as 4401 bytes in 0 ms
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 38 (task 5.0:0)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 10
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 39 (task 5.0:1)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 9
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Your problem has nothing to do with Spark.
Format your code correctly (I have corrected)
Don't mix camel case and underscore naming: use underscores for SQL fields and camel case for Scala vals.
When you get an exception, read it; it usually tells you what you are doing wrong. In your case it's probably that some of the records in hdfs:///user/D_Authors.txt are not what you expect them to be.
When you get an exception, debug it: try actually catching the exception and printing out the records that fail to parse.
_.split("\\|") drops trailing empty strings; use _.split("\\|", -1).
In Scala you don't need magic numbers to manually access elements of an array; it's ugly and more prone to error. Use a pattern match instead.
Here is a simple example which includes handling of unexpected records:
case class Author(author: String, authorAge: Int)

myData.map(_.split("\t", -1) match {
  case Array(author, authorAge) => Author(author, authorAge.toInt)
  case unexpectedArrayForm =>
    throw new RuntimeException("Record did not have correct number of fields: " +
      unexpectedArrayForm.mkString("\t"))
})
Now if you coded it like this, your exception would tell you straight away exactly what is wrong with your data.
One final point/concern: why are you using Spark SQL? Your data is in text form; are you trying to transform it into, say, Parquet? If not, why not just use the regular Scala API to perform your analysis? Moreover, it is type checked and compile checked, unlike SQL.