'worker_collect' is the Celery package.
All the tasks are routed to the same queue.
I'm executing two tasks.
The task 'collect_data' calls the task 'do_request'.
I put a sleep(5) in 'do_request' in order to force the timeout.
When the task 'do_request' times out or fails, I try to call the task 'fetch_result'.
The function 'fetch_result' is empty for now, but afterwards it should get the result of the task 'do_request'.
My question:
Where do I have to catch the raised 'TimeLimitExceeded' exception in order to call the function 'fetch_result'?
Here is the code:
from __future__ import absolute_import
from time import sleep

from worker_collect.celery import app
from celery.utils.log import get_task_logger
from celery.exceptions import TimeLimitExceeded


@app.task(time_limit=3, max_retries=None)
def fetch_result():
    logger = get_task_logger(__name__)
    logger.info('-------------------')
    logger.info('----- FETCH -------')
    logger.info('-------------------')
    logger.info('-------------------')


@app.task(time_limit=3, max_retries=None)
def do_request():
    logger = get_task_logger(__name__)
    logger.info('---------------------')
    logger.info('----- DO REQUEST ----')
    logger.info('---------------------')
    logger.info('---------------------')
    sleep(5)
    return 'do_request DONE'


@app.task(time_limit=8, max_retries=None)
def collect_data():
    logger = get_task_logger(__name__)
    logger.info('---------------------')
    logger.info('-----COLLECT DATA----')
    logger.info('---------------------')
    do_request.apply_async(link=fetch_result.s(), link_error=fetch_result.s())
    logger.info('---------------------')
    logger.info('--END COLLECT DATA---')
    logger.info('---------------------')
Here are the Celery logs:
[2015-02-13 15:18:08,275: INFO/Beat] beat: Starting...
[2015-02-13 15:18:08,301: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2015-02-13 15:18:08,315: INFO/MainProcess] mingle: searching for neighbors
[2015-02-13 15:18:09,328: INFO/MainProcess] mingle: all alone
[2015-02-13 15:18:09,366: WARNING/MainProcess] celery@collector ready.
[2015-02-13 15:18:09,416: INFO/MainProcess] Received task: worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]
[2015-02-13 15:18:09,419: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,420: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: -----COLLECT DATA----
[2015-02-13 15:18:09,421: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,469: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,470: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: --END COLLECT DATA---
[2015-02-13 15:18:09,470: INFO/Worker-5] worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283]: ---------------------
[2015-02-13 15:18:09,470: INFO/MainProcess] Received task: worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]
[2015-02-13 15:18:09,472: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ---------------------
[2015-02-13 15:18:09,472: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ----- DO REQUEST ----
[2015-02-13 15:18:09,472: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ---------------------
[2015-02-13 15:18:09,473: INFO/Worker-2] worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]: ---------------------
[2015-02-13 15:18:09,475: INFO/MainProcess] Task worker_collect.tasks.collect_data[1df333f3-cc34-4d10-ad1b-181432bf2283] succeeded in 0.0564397389971s: None
[2015-02-13 15:18:12,478: ERROR/MainProcess] Task worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c] raised unexpected: TimeLimitExceeded(3,)
Traceback (most recent call last):
File "/home/bschom/celery_env/local/lib/python2.7/site-packages/billiard/pool.py", line 639, in on_hard_timeout
raise TimeLimitExceeded(job._timeout)
TimeLimitExceeded: TimeLimitExceeded(3,)
[2015-02-13 15:18:12,479: ERROR/MainProcess] Hard time limit (3s) exceeded for worker_collect.tasks.do_request[333e3002-c014-416f-a85a-ca839bd5376c]
[2015-02-13 15:18:12,587: ERROR/MainProcess] Process 'Worker-2' pid:14159 exited with 'signal 9 (SIGKILL)'
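From what I understand so far, the hard time_limit kills the pool worker with SIGKILL, so there seems to be no place inside 'do_request' itself where anything could be caught. The only direction I can think of is a soft limit instead; here is a minimal sketch of what I mean (assuming a soft_time_limit is acceptable here, since SoftTimeLimitExceeded is raised inside the task body and can be caught):

from celery.exceptions import SoftTimeLimitExceeded

@app.task(soft_time_limit=3, max_retries=None)
def do_request():
    try:
        sleep(5)  # the slow part that exceeds the limit
        return 'do_request DONE'
    except SoftTimeLimitExceeded:
        # with a soft limit the exception is raised inside the task body,
        # so the follow-up task can be triggered explicitly from here
        fetch_result.delay()
        raise

I'm not sure this is the idiomatic way, which is why I'm asking where the exception is supposed to be caught.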
Thanks in advance.
Related
Code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import pyspark.sql.functions as f

retail_sales_transaction = glueContext.create_dynamic_frame.from_catalog(
    database="conform_main_mobconv",
    table_name="retail_sales_transaction"
).select_fields(["business week", "transaction_key", "dh_audit_record_type", "dh_audit_active_record"])

# TODO: Implement delta logic here & exclude Deleted & Inactive records here
df_retail_sales_transaction = (retail_sales_transaction.toDF()
    .filter((f.col('dh_audit_record_type') != 'DELETE') & (f.col('dh_audit_active_record') == '1')))
The error I'm getting is:
df_retail_sales_transaction= (retail_sales_transaction.toDF().filter((f.col('dh_audit_record_type')!='DELETE') & (f.col('dh_audit_active_record')=='1')))
py4j.protocol.Py4JJavaError: An error occurred while calling o86.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 14, 172.35.203.73, executor 1): java.lang.UnsupportedOperationException
I had to implement something close to that; the Filter transform did the job. Try it:
from awsglue.transforms import Filter

dynamicframeFiltered = Filter.apply(
    frame=retail_sales_transaction,
    f=lambda row: row["dh_audit_record_type"] != 'DELETE' and row["dh_audit_active_record"] == '1'
)

dynamicframeFiltered.toDF().show(1)
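If the toDF() conversion itself keeps failing, another direction that might be worth a try is resolving ambiguous ("choice") column types on the DynamicFrame before converting. This is only an untested sketch, and it assumes the UnsupportedOperationException comes from choice types in the catalog table; the column names and target types below are guesses:

# Untested sketch: cast ambiguous ("choice") columns to concrete types before toDF()
resolved = retail_sales_transaction.resolveChoice(
    specs=[
        ("dh_audit_record_type", "cast:string"),
        ("dh_audit_active_record", "cast:string"),
    ]
)
df_retail_sales_transaction = resolved.toDF().filter(
    (f.col("dh_audit_record_type") != "DELETE")
    & (f.col("dh_audit_active_record") == "1")
)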
How should I handle SoftTimeLimitExceeded with a group of Celery tasks?
I have a task:
@shared_task(bind=True, priority=2, autoretry_for=(EasySNMPError,),
             soft_time_limit=10, retry_kwargs={'max_retries': 3, 'countdown': 2},
             acks_late=True)
def discover_interface(self, interface_id: int) -> dict:
    logger.info(f'start disco interface {interface_id}')
    p = probes.DiscoverSNMP()
    try:
        with utils.Timer() as t1:
            p.discover_interface(interface_id=interface_id)
            logger.info(f'stop disco interface {interface_id}')
    except SoftTimeLimitExceeded:
        p.stats['message'] = 'soft time limit exceeded'
    return p.stats
When I run it I get the soft time limit exception.
Log from Celery:
[2021-02-24 16:30:20,298: WARNING/MainProcess] Soft time limit (10s) exceeded for pollers.tasks.discover_interface[6b166746-95c4-458c-bb63-318bdb588d00]
[2021-02-24 16:30:20,374: ERROR/MainProcess] Process 'ForkPoolWorker-2' pid:19585 exited with 'signal 11 (SIGSEGV)'
[2021-02-24 16:30:20,393: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Traceback (most recent call last):
File "/home/kolekcjoner/miniconda3/envs/kolekcjoner/lib/python3.6/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
Execution in the Python console:
job = group([tasks.discover_interface.s(1870361)]).apply_async()
In[45]: job.join_native()
Traceback (most recent call last):
File "/home/kolekcjoner/miniconda3/envs/kolekcjoner/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-45-9330abace970>", line 1, in <module>
job.join_native()
File "/home/kolekcjoner/miniconda3/envs/kolekcjoner/lib/python3.6/site-packages/celery/result.py", line 818, in join_native
raise value
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
How should I change it so that job.join_native() does NOT raise an exception, but instead I get the message 'soft time limit exceeded' returned from the task?
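For reference, I know join_native() accepts propagate=False, which would hand back the stored exception instead of raising it, roughly like this:

# assumption: just stop join_native() from re-raising; the stored result
# here is still the WorkerLostError, not the p.stats dict I want
results = job.join_native(propagate=False)

But that still would not give me the 'soft time limit exceeded' stats, because the worker itself dies with SIGSEGV before the task body can return anything.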
I am using Spark Job Server to submit Spark jobs to a cluster. The application I am trying to test is a Spark program based on Sansa Query and the Sansa stack. Sansa is used for scalable processing of huge amounts of RDF data, and Sansa Query is one of the Sansa libraries, used for querying RDF data.
When I run the application with the spark-submit command it works correctly as expected. But when I run the program through Spark Job Server, the application fails most of the time with the exception below.
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_0 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
20/05/29 18:57:00 INFO SparkContext: Invoking stop() from shutdown hook
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event, stopping job manger.
20/05/29 18:57:00 INFO JobManagerActor: Got Spark Application end event externally, stopping job manager
20/05/29 18:57:00 INFO SparkUI: Stopped Spark web UI at http://10.138.32.96:46627
20/05/29 18:57:00 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 63, us1salxhpw0653.corpnet2.com, executor 1, partition 3, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 60) in 513 ms on us1salxhpw0653.corpnet2.com (executor 1) (1/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 64, us1salxhpw0669.corpnet2.com, executor 2, partition 4, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 61) in 512 ms on us1salxhpw0669.corpnet2.com (executor 2) (2/560)
20/05/29 18:57:00 INFO TaskSetManager: Starting task 5.0 in stage 3.0 (TID 65, us1salxhpw0670.corpnet2.com, executor 3, partition 5, NODE_LOCAL, 4942 bytes)
20/05/29 18:57:00 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 62) in 536 ms on us1salxhpw0670.corpnet2.com (executor 3) (3/560)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_4 in memory on us1salxhpw0669.corpnet2.com:34922 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO BlockManagerInfo: Added rdd_44_3 in memory on us1salxhpw0653.corpnet2.com:37017 (size: 16.0 B, free: 366.2 MB)
20/05/29 18:57:00 INFO DAGScheduler: Job 2 failed: save at SansaQueryExample.scala:32, took 0.732943 s
20/05/29 18:57:00 INFO DAGScheduler: ShuffleMapStage 3 (save at SansaQueryExample.scala:32) failed in 0.556 s due to Stage cancelled because SparkContext was shut down
20/05/29 18:57:00 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job 2 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
    at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1923)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1922)
    at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:584)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at scala.util.Try$.apply(Try.scala:192)
Code used for direct execution:
object SansaQueryExampleWithoutSJS {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("sansa stack example").getOrCreate()
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = spark.rdf(lang)(input)
    println(graphRdd.collect().foreach(println))
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
  }
}
Code integrated with Spark Job Server:
object SansaQueryExample extends SparkSessionJob {
  override type JobData = Seq[String]
  override type JobOutput = collection.Map[String, Long]

  override def validate(sparkSession: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = {
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))
  }

  override def runJob(sparkSession: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput = {
    val input = "hdfs://user/dileep/rdf.nt"
    val sparqlQuery: String = "SELECT * WHERE {?s ?p ?o} LIMIT 10"
    val lang = Lang.NTRIPLES
    val graphRdd = sparkSession.rdf(lang)(input)
    println(graphRdd.collect().foreach(println))
    val result = graphRdd.sparql(sparqlQuery)
    result.write.format("csv").mode("overwrite").save("hdfs://user/dileep/test-out")
    sparkSession.sparkContext.parallelize(data).countByValue
  }
}
The steps for executing an application via Spark Job Server are explained here; mainly (a rough sketch of the calls follows this list):
upload the jar into SJS through a REST API
create a Spark context with memory and cores as required, through another API
execute the job via another API, mentioning the jar and context already created
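The three calls look roughly like this. This is only a sketch from memory of the spark-jobserver README, written with Python requests for illustration; the host, jar name, and exact parameter names (e.g. num-cpu-cores, mem-per-node) are assumptions and may need adjusting for your deployment:

import requests

SJS = "http://sjs-host:8090"  # assumed Spark Job Server address; 8090 is the default port

# 1. upload the application jar
with open("sansa-query-example.jar", "rb") as jar:  # jar name is an assumption
    requests.post(f"{SJS}/jars/sansa", data=jar.read())

# 2. create a long-running context with the requested resources
requests.post(f"{SJS}/contexts/sansa-context",
              params={"num-cpu-cores": "4", "mem-per-node": "2g"})

# 3. run the job on the context created above
requests.post(f"{SJS}/jobs",
              params={"appName": "sansa",
                      "classPath": "SansaQueryExample",
                      "context": "sansa-context"},
              data="input.string = a b c")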
When I observed different executions of the program, I could see that Spark Job Server behaves inconsistently: the program works on a few occasions without any errors. I also observed that the SparkContext is being shut down for some unknown reason. I am using SJS 0.8.0, Sansa 0.7.1, and Spark 2.4.
I am new to Apache Spark. I am trying to create a schema and load data from HDFS. Below is my code:
// importing sqlcontext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
//defining the schema
case class Author1(Author_Key: Long, Author_ID: Long, Author: String, First_Name: String, Last_Name: String, Middle_Name: String, Full_Name: String, Institution_Full_Name: String, Country: String, DIAS_ID: Int, R_ID: String)
val D_Authors1 = sc.textFile("hdfs:///user/D_Authors.txt")
  .map(_.split("\\|"))
  .map(auth => Author1(auth(0).trim.toLong, auth(1).trim.toLong, auth(2), auth(3), auth(4), auth(5), auth(6), auth(7), auth(8), auth(9).trim.toInt, auth(10)))
//register the table
D_Authors1.registerAsTable("D_Authors1")
val auth = sqlContext.sql("SELECT * FROM D_Authors1")
sqlContext.sql("SELECT * FROM D_Authors").collect().foreach(println)
When I execute this code, it throws an array out of bounds exception. Below is the error:
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/08/18 06:57:14 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/08/18 06:57:14 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/08/18 06:57:14 INFO FileInputFormat: Total input paths to process : 1
14/08/18 06:57:14 INFO SparkContext: Starting job: collect at <console>:24
14/08/18 06:57:14 INFO DAGScheduler: Got job 5 (collect at <console>:24) with 2 output partitions (allowLocal=false)
14/08/18 06:57:14 INFO DAGScheduler: Final stage: Stage 5(collect at <console>:24)
14/08/18 06:57:14 INFO DAGScheduler: Parents of final stage: List()
14/08/18 06:57:14 INFO DAGScheduler: Missing parents: List()
14/08/18 06:57:14 INFO DAGScheduler: Submitting Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174), which has no missing parents
14/08/18 06:57:14 INFO DAGScheduler: Submitting 2 missing tasks from Stage 5 (SchemaRDD[26] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExistingRdd [Author_Key#22L,Author_ID#23L,Author#24,First_Name#25,Last_Name#26,Middle_Name#27,Full_Name#28,Institution_Full_Name#29,Country#30,DIAS_ID#31,R_ID#32], MapPartitionsRDD[23] at mapPartitions at basicOperators.scala:174)
14/08/18 06:57:14 INFO YarnClientClusterScheduler: Adding task set 5.0 with 2 tasks
14/08/18 06:57:14 INFO TaskSetManager: Starting task 5.0:0 as TID 38 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:14 INFO TaskSetManager: Serialized task 5.0:0 as 4401 bytes in 1 ms
14/08/18 06:57:15 INFO TaskSetManager: Starting task 5.0:1 as TID 39 on executor 1: orf-bat.int..com (NODE_LOCAL)
14/08/18 06:57:15 INFO TaskSetManager: Serialized task 5.0:1 as 4401 bytes in 0 ms
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 38 (task 5.0:0)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 10
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/08/18 06:57:15 WARN TaskSetManager: Lost TID 39 (task 5.0:1)
14/08/18 06:57:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 9
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$1.next(Iterator.scala:853)
at scala.collection.Iterator$$anon$1.head(Iterator.scala:840)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:179)
at org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:174)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Your problem has nothing to do with Spark.
Format your code correctly (I have corrected it above).
Don't mix camel case and underscore naming: use underscores for SQL fields and camel case for Scala vals.
When you get an exception, read it; it usually tells you what you are doing wrong. In your case it's probably that some of the records in hdfs:///user/D_Authors.txt are not in the format you expect.
When you get an exception, debug it: try actually catching the exception and printing out the records that fail to parse.
_.split("\\|") drops trailing empty strings, so a record like "a|b||" only yields Array("a", "b"); use _.split("\\|", -1) to keep the empty fields.
In Scala you don't need magic numbers that manually access elements of an array; it's ugly and more prone to error. Use a pattern match instead.
Here is a simple example which includes unusual record handling:
case class Author(author: String, authorAge: Int)
myData.map(_.split("\t", -1) match {
case Array(author, authorAge) => Author(author, authorAge.toInt)
case unexpectedArrayForm =>
throw new RuntimeException("Record did not have correct number of fields: " +
unexpectedArrayForm.mkString("\t"))
})
Now if you coded it like this, your exception would tell you straight away exactly what is wrong with your data.
One final point/concern: why are you using Spark SQL? Your data is in text form; are you trying to transform it into, say, Parquet? If not, why not just use the regular Scala API to perform your analysis? Moreover, it's type-checked and compile-checked, unlike SQL.
I am playing with the demo cases from the Celery tutorial. However, results are shown as disabled when I start the task app, as below. Any idea?
celery --app=plmtcheck worker -l info
Then I see:
- ** ---------- .> app: plmtcheck:0x7f9fd2fdf160
- ** ---------- .> transport: amqp://guest@localhost:5672//
- ** ---------- .> results: disabled
I can see the result is ready:
[2014-05-05 16:16:55,382: INFO/MainProcess] Connected to amqp://guest@127.0.0.1:5672//
[2014-05-05 16:16:55,389: INFO/MainProcess] mingle: searching for neighbors
[2014-05-05 16:16:56,401: INFO/MainProcess] mingle: all alone
[2014-05-05 16:16:56,422: WARNING/MainProcess] celery@D-NYC-00552088-Linux ready.
[2014-05-05 16:17:27,726: INFO/MainProcess] Received task: plmtcheck.add[7ea5a501-1085-48b7-8f7e-dac8ac2c5377]
[2014-05-05 16:17:27,759: INFO/MainProcess] Task plmtcheck.add[7ea5a501-1085-48b7-8f7e-dac8ac2c5377] succeeded in 0.032166894000056345s: 37
My code is simply
from celery import Celery

app = Celery('plmtcheck', backend='amqp', broker='amqp://')

@app.task
def add(x, y):
    return x + y

if __name__ == '__main__':
    app.worker_main()
Install the django-celery-results package, then add it to INSTALLED_APPS in settings.py:
INSTALLED_APPS = [
...
'django_celery_results',
...
]
This will do what you intend to do.
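You will probably also need to point Celery at the new backend and run the package's migrations; a minimal sketch, assuming the django-db backend from the django-celery-results docs:

# settings.py -- assumption: store task results in the Django database
CELERY_RESULT_BACKEND = 'django-db'

# then create the result tables once:
#   python manage.py migrate django_celery_results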