I am trying to convert my DynamicFrame to a DataFrame in an AWS Glue ETL job, and I am getting the exception below. Please note that my DynamicFrame has DynamicRecords with different columns per record, but my understanding was that DynamicFrame handles those and still generates a DataFrame.
'An error occurred while calling o80.showString.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 619, ip-172-31-7-160.ec2.internal, executor 2): java.lang.NullPointerException\n\tat com.amazonaws.services.glue.schema.types.StructType.fieldIndex(StructType.java:53)\n\tat com.amazonaws.services.glue.RecordToRow$$anonfun$convertField$1.apply(RecordToRow.scala:93)\n\tat com.amazonaws.services.glue.RecordToRow$$anonfun$convertField$1.apply(RecordToRow.scala:92)\n\tat scala.collection.Iterator$class.foreach(Iterator.scala:893)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1336)\n\tat com.amazonaws.services.glue.RecordToRow$.convertField(RecordToRow.scala:92)\n\tat com.amazonaws.services.glue.RecordToRow$.apply(RecordToRow.scala:43)\n\tat com.amazonaws.services.glue.DynamicRecord.toRow(DynamicRecord.scala:260)\n\tat com.amazonaws.services.glue.DynamicFrame$$anonfun$toDF$1.apply(DynamicFrame.scala:280)\n\tat com.amazonaws.services.glue.DynamicFrame$$anonfun$toDF$1.apply(DynamicFrame.scala:280)\n\tat scala.collection.Iterator$$anon$11.next(Iterator.scala:409)\n\tat scala.collection.Iterator$$anon$11.next(Iterator.scala:409)\n\tat scala.collection.Iterator$$anon$11.next(Iterator.scala:409)\n\tat org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)\n\tat org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)\n\tat org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)\n\tat org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:287)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:108)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)\n\tat scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)\n\tat org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)\n\tat scala.Option.foreach(Option.scala:257)\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)\n\tat 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)\n\tat org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)\n\tat org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)\n\tat org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)\n\tat org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)\n\tat org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)\n\tat org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)\n\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)\n\tat org.apache.spark.sql.Dataset.head(Dataset.scala:2150)\n\tat org.apache.spark.sql.Dataset.take(Dataset.scala:2363)\n\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:241)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:280)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: java.lang.NullPointerException\n\tat com.amazonaws.services.glue.schema.types.StructType.fieldIndex(StructType.java:53)\n\tat com.amazonaws.services.glue.RecordToRow$$anonfun$convertField$1.apply(RecordToRow.scala:93)\n\tat com.amazonaws.services.glue.RecordToRow$$anonfun$convertField$1.apply(RecordToRow.scala:92)\n\tat scala.collection.Iterator$class.foreach(Iterator.scala:893)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1336)\n\tat com.amazonaws.services.glue.RecordToRow$.convertField(RecordToRow.scala:92)\n\tat com.amazonaws.services.glue.RecordToRow$.apply(RecordToRow.scala:43)\n\tat com.amazonaws.services.glue.DynamicRecord.toRow(DynamicRecord.scala:260)\n\tat com.amazonaws.services.glue.DynamicFrame$$anonfun$toDF$1.apply(DynamicFrame.scala:280)\n\tat com.amazonaws.services.glue.DynamicFrame$$anonfun$toDF$1.apply(DynamicFrame.scala:280)\n\tat scala.collection.Iterator$$anon$11.next(Iterator.scala:409)\n\tat scala.collection.Iterator$$anon$11.next(Iterator.scala:409)\n\tat scala.collection.Iterator$$anon$11.next(Iterator.scala:409)\n\tat org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)\n\tat org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)\n\tat org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)\n\tat org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)\n\tat 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:287)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:108)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\t... 1 more\n')
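For context, the step that fails is the DynamicFrame-to-DataFrame conversion itself (com.amazonaws.services.glue.DynamicFrame.toDF and RecordToRow in the trace above). A minimal sketch of that step, assuming dyf is the DynamicFrame built earlier in the job (how it is created is omitted here):

import com.amazonaws.services.glue.DynamicFrame

// dyf: the DynamicFrame produced earlier in the job (assumed, not shown here)
val df = dyf.toDF()   // per-record RecordToRow conversion happens here
df.show()             // showString is the action that surfaces the NullPointerException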
How to resolve Py4JJavaError
I got this error:
"name": "Py4JJavaError",
"message": "An error occurred while calling o1993.collectToPython.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 88.0 failed 1 times, most recent failure: Lost task 4.0 in stage 88.0 (TID 339) (LAPTOP-LT9BMOHO executor driver): org.apache.spark.SparkException: Python worker failed to connect back.\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)\r\n\tat org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)\r\n\tat org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)\r\n\tat org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)\r\n\tat org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec.$anonfun$doExecute$1(FlatMapGroupsInPandasExec.scala:94)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)\r\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\r\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)\r\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)\r\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)\r\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\r\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\r\n\tat java.lang.Thread.run(Thread.java:750)\r\nCaused by: java.net.SocketTimeoutException: Accept timed out\r\n\tat java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)\r\n\tat java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)\r\n\tat java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)\r\n\tat java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)\r\n\tat java.net.ServerSocket.implAccept(ServerSocket.java:545)\r\n\tat java.net.ServerSocket.accept(ServerSocket.java:513)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)\r\n\t... 
26 more\r\n\nDriver stacktrace:\r\n\tat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)\r\n\tat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\r\n\tat scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)\r\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)\r\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)\r\n\tat scala.Option.foreach(Option.scala:407)\r\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)\r\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)\r\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)\r\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)\r\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\r\nCaused by: org.apache.spark.SparkException: Python worker failed to connect back.\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)\r\n\tat org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)\r\n\tat org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)\r\n\tat org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)\r\n\tat org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec.$anonfun$doExecute$1(FlatMapGroupsInPandasExec.scala:94)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)\r\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\r\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)\r\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)\r\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)\r\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\r\n\tat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\r\n\tat java.lang.Thread.run(Thread.java:750)\r\nCaused by: java.net.SocketTimeoutException: Accept timed out\r\n\tat java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)\r\n\tat java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)\r\n\tat java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)\r\n\tat java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)\r\n\tat java.net.ServerSocket.implAccept(ServerSocket.java:545)\r\n\tat java.net.ServerSocket.accept(ServerSocket.java:513)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)\r\n\t... 26 more\r\n",
"stack": "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mPy4JJavaError\u001b[0m Traceback (most recent call last)\nCell \u001b[1;32mIn[16], line 22\u001b[0m\n\u001b[0;32m 20\u001b[0m model_df_validation[i]\u001b[39m=\u001b[39mmodel_data\u001b[39m.\u001b[39mfilter((col(\u001b[39m\"\u001b[39m\u001b[39mWEEK_START_DATE\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m>\u001b[39m\u001b[39m=\u001b[39m data_start_dates[i]) \u001b[39m&\u001b[39m (col(\u001b[39m\"\u001b[39m\u001b[39mWEEK_START_DATE\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m<\u001b[39m testing_start_dates[i]))\n\u001b[0;32m 21\u001b[0m model_df_validation[i]\u001b[39m=\u001b[39mmodel_df_validation[i]\u001b[39m.\u001b[39mwithColumn(\u001b[39m\"\u001b[39m\u001b[39mTRAIN_VALID\u001b[39m\u001b[39m\"\u001b[39m,when(model_df_validation[i]\u001b[39m.\u001b[39mWEEK_START_DATE\u001b[39m<\u001b[39mvalidation_start_dates[i],\u001b[39m\"\u001b[39m\u001b[39mTRAIN\u001b[39m\u001b[39m\"\u001b[39m)\u001b[39m.\u001b[39motherwise(\u001b[39m\"\u001b[39m\u001b[39mVALID\u001b[39m\u001b[39m\"\u001b[39m))\u001b[39m.\u001b[39mcache()\n\u001b[1;32m---> 22\u001b[0m all_errors[i], best_error_mean[i], best_loss_init[i], best_n_estimators_init[i], best_max_depth_init[i], best_learning_rate_init[i], best_subsample_init[i], best_max_features_init[i] \u001b[39m=\u001b[39m run_validation(model_df_validation[i], model_input_cols, model_output_col, loss_values, n_estimators_values, max_depth_values, learning_rate_values, subsample_values, max_features_values)\n\nCell \u001b[1;32mIn[13], line 82\u001b[0m, in \u001b[0;36mrun_validation\u001b[1;34m(model_df, train_model_inputs, model_output_col, loss_values, n_estimators_values, max_depth_values, learning_rate_values, subsample_values, max_features_values)\u001b[0m\n\u001b[0;32m 76\u001b[0m coeff \u001b[39m=\u001b[39m coeff\u001b[39m.\u001b[39mwithColumn(\n\u001b[0;32m 77\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mE\u001b[39m\u001b[39m\"\u001b[39m, col(\u001b[39m\"\u001b[39m\u001b[39mVALID_Y_PREDICTED_UNITS\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m-\u001b[39m col(\u001b[39m\"\u001b[39m\u001b[39mVALID_Y_TRUE_UNITS\u001b[39m\u001b[39m\"\u001b[39m))\u001b[39m.\u001b[39mwithColumn(\n\u001b[0;32m 78\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mAE\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mabs\u001b[39m(col(\u001b[39m\"\u001b[39m\u001b[39mE\u001b[39m\u001b[39m\"\u001b[39m)))\u001b[39m.\u001b[39mwithColumn(\n\u001b[0;32m 79\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mSE\u001b[39m\u001b[39m\"\u001b[39m, col(\u001b[39m\"\u001b[39m\u001b[39mE\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m*\u001b[39m col(\u001b[39m\"\u001b[39m\u001b[39mE\u001b[39m\u001b[39m\"\u001b[39m))\u001b[39m.\u001b[39mwithColumn(\n\u001b[0;32m 80\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mAPE\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m100\u001b[39m\u001b[39m*\u001b[39m\u001b[39mabs\u001b[39m(col(\u001b[39m\"\u001b[39m\u001b[39mVALID_Y_PREDICTED_UNITS\u001b[39m\u001b[39m\"\u001b[39m) \u001b[39m-\u001b[39m col(\u001b[39m\"\u001b[39m\u001b[39mVALID_Y_TRUE_UNITS\u001b[39m\u001b[39m\"\u001b[39m))\u001b[39m/\u001b[39mcol(\u001b[39m\"\u001b[39m\u001b[39mVALID_Y_TRUE_UNITS\u001b[39m\u001b[39m\"\u001b[39m))\n\u001b[0;32m 81\u001b[0m coeff \u001b[39m=\u001b[39m coeff\u001b[39m.\u001b[39mwithColumn(\u001b[39m\"\u001b[39m\u001b[39mused_error\u001b[39m\u001b[39m\"\u001b[39m, 
when(col(\u001b[39m\"\u001b[39m\u001b[39mVALID_Y_TRUE_UNITS\u001b[39m\u001b[39m\"\u001b[39m)\u001b[39m<\u001b[39m\u001b[39m=\u001b[39m\u001b[39m100\u001b[39m, \u001b[39m2\u001b[39m\u001b[39m*\u001b[39mcol(\u001b[39m\"\u001b[39m\u001b[39mAE\u001b[39m\u001b[39m\"\u001b[39m))\u001b[39m.\u001b[39motherwise(col(\u001b[39m\"\u001b[39m\u001b[39mAPE\u001b[39m\u001b[39m\"\u001b[39m))) \u001b[39m# to do: pass relative importance as parameter\u001b[39;00m\n\u001b[1;32m---> 82\u001b[0m error_mean \u001b[39m=\u001b[39m coeff\u001b[39m.\u001b[39;49mfilter(coeff\u001b[39m.\u001b[39;49mused_error \u001b[39m<\u001b[39;49m \u001b[39m300\u001b[39;49m)\u001b[39m.\u001b[39;49mselect(mean(\u001b[39m'\u001b[39;49m\u001b[39mused_error\u001b[39;49m\u001b[39m'\u001b[39;49m))\u001b[39m.\u001b[39;49mcollect() \u001b[39m#to do: pass threshold as a parameter\u001b[39;00m\n\u001b[0;32m 84\u001b[0m all_errors \u001b[39m=\u001b[39m []\n\u001b[0;32m 85\u001b[0m all_errors\u001b[39m.\u001b[39mappend({\u001b[39m\"\u001b[39m\u001b[39mloss\u001b[39m\u001b[39m\"\u001b[39m:loss, \u001b[39m\"\u001b[39m\u001b[39mn_estimators\u001b[39m\u001b[39m\"\u001b[39m:n_estimators, \u001b[39m\"\u001b[39m\u001b[39mmax_depth\u001b[39m\u001b[39m\"\u001b[39m:max_depth, \u001b[39m\"\u001b[39m\u001b[39mlearning_rate\u001b[39m\u001b[39m\"\u001b[39m:learning_rate, \u001b[39m\"\u001b[39m\u001b[39msubsample\u001b[39m\u001b[39m\"\u001b[39m:subsample, \u001b[39m\"\u001b[39m\u001b[39mmax_features\u001b[39m\u001b[39m\"\u001b[39m:max_features, \u001b[39m\"\u001b[39m\u001b[39merror_mean\u001b[39m\u001b[39m\"\u001b[39m: error_mean})\n\nFile \u001b[1;32mC:\\Spark\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\python\\pyspark\\sql\\dataframe.py:817\u001b[0m, in \u001b[0;36mDataFrame.collect\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 807\u001b[0m \u001b[39m\"\"\"Returns all the records as a list of :class:`Row`.\u001b[39;00m\n\u001b[0;32m 808\u001b[0m \n\u001b[0;32m 809\u001b[0m \u001b[39m.. 
versionadded:: 1.3.0\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 814\u001b[0m \u001b[39m[Row(age=2, name='Alice'), Row(age=5, name='Bob')]\u001b[39;00m\n\u001b[0;32m 815\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[0;32m 816\u001b[0m \u001b[39mwith\u001b[39;00m SCCallSiteSync(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_sc):\n\u001b[1;32m--> 817\u001b[0m sock_info \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_jdf\u001b[39m.\u001b[39;49mcollectToPython()\n\u001b[0;32m 818\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mlist\u001b[39m(_load_from_socket(sock_info, BatchedSerializer(CPickleSerializer())))\n\nFile \u001b[1;32mC:\\Spark\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\python\\lib\\py4j-0.10.9.5-src.zip\\py4j\\java_gateway.py:1321\u001b[0m, in \u001b[0;36mJavaMember.__call__\u001b[1;34m(self, *args)\u001b[0m\n\u001b[0;32m 1315\u001b[0m command \u001b[39m=\u001b[39m proto\u001b[39m.\u001b[39mCALL_COMMAND_NAME \u001b[39m+\u001b[39m\\\n\u001b[0;32m 1316\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mcommand_header \u001b[39m+\u001b[39m\\\n\u001b[0;32m 1317\u001b[0m args_command \u001b[39m+\u001b[39m\\\n\u001b[0;32m 1318\u001b[0m proto\u001b[39m.\u001b[39mEND_COMMAND_PART\n\u001b[0;32m 1320\u001b[0m answer \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mgateway_client\u001b[39m.\u001b[39msend_command(command)\n\u001b[1;32m-> 1321\u001b[0m return_value \u001b[39m=\u001b[39m get_return_value(\n\u001b[0;32m 1322\u001b[0m answer, \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mgateway_client, \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mtarget_id, \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mname)\n\u001b[0;32m 1324\u001b[0m \u001b[39mfor\u001b[39;00m temp_arg \u001b[39min\u001b[39;00m temp_args:\n\u001b[0;32m 1325\u001b[0m temp_arg\u001b[39m.\u001b[39m_detach()\n\nFile \u001b[1;32mC:\\Spark\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\python\\pyspark\\sql\\utils.py:190\u001b[0m, in \u001b[0;36mcapture_sql_exception.<locals>.deco\u001b[1;34m(*a, **kw)\u001b[0m\n\u001b[0;32m 188\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mdeco\u001b[39m(\u001b[39m*\u001b[39ma: Any, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkw: Any) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m Any:\n\u001b[0;32m 189\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m--> 190\u001b[0m \u001b[39mreturn\u001b[39;00m f(\u001b[39m*\u001b[39ma, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkw)\n\u001b[0;32m 191\u001b[0m \u001b[39mexcept\u001b[39;00m Py4JJavaError \u001b[39mas\u001b[39;00m e:\n\u001b[0;32m 192\u001b[0m converted \u001b[39m=\u001b[39m convert_exception(e\u001b[39m.\u001b[39mjava_exception)\n\nFile \u001b[1;32mC:\\Spark\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\spark-3.3.1-bin-hadoop3\\python\\lib\\py4j-0.10.9.5-src.zip\\py4j\\protocol.py:326\u001b[0m, in \u001b[0;36mget_return_value\u001b[1;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[0;32m 324\u001b[0m value \u001b[39m=\u001b[39m OUTPUT_CONVERTER[\u001b[39mtype\u001b[39m](answer[\u001b[39m2\u001b[39m:], gateway_client)\n\u001b[0;32m 325\u001b[0m \u001b[39mif\u001b[39;00m answer[\u001b[39m1\u001b[39m] \u001b[39m==\u001b[39m REFERENCE_TYPE:\n\u001b[1;32m--> 326\u001b[0m \u001b[39mraise\u001b[39;00m Py4JJavaError(\n\u001b[0;32m 327\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mAn error occurred while calling 
\u001b[39m\u001b[39m{0}\u001b[39;00m\u001b[39m{1}\u001b[39;00m\u001b[39m{2}\u001b[39;00m\u001b[39m.\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39m\n\u001b[0;32m 328\u001b[0m \u001b[39mformat\u001b[39m(target_id, \u001b[39m\"\u001b[39m\u001b[39m.\u001b[39m\u001b[39m\"\u001b[39m, name), value)\n\u001b[0;32m 329\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m 330\u001b[0m \u001b[39mraise\u001b[39;00m Py4JError(\n\u001b[0;32m 331\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mAn error occurred while calling \u001b[39m\u001b[39m{0}\u001b[39;00m\u001b[39m{1}\u001b[39;00m\u001b[39m{2}\u001b[39;00m\u001b[39m. Trace:\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m{3}\u001b[39;00m\u001b[39m\\n\u001b[39;00m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39m\n\u001b[0;32m 332\u001b[0m \u001b[39mformat\u001b[39m(target_id, \u001b[39m\"\u001b[39m\u001b[39m.\u001b[39m\u001b[39m\"\u001b[39m, name, value))\n\n\u001b[1;31mPy4JJavaError\u001b[0m: An error occurred while calling o1993.collectToPython.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 88.0 failed 1 times, most recent failure: Lost task 4.0 in stage 88.0 (TID 339) (LAPTOP-LT9BMOHO executor driver): org.apache.spark.SparkException: Python worker failed to connect back.\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)\r\n\tat org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)\r\n\tat org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)\r\n\tat org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)\r\n\tat org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec.$anonfun$doExecute$1(FlatMapGroupsInPandasExec.scala:94)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)\r\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\r\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)\r\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)\r\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)\r\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\r\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\r\n\tat java.lang.Thread.run(Thread.java:750)\r\nCaused by: java.net.SocketTimeoutException: Accept timed out\r\n\tat 
java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)\r\n\tat java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)\r\n\tat java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)\r\n\tat java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)\r\n\tat java.net.ServerSocket.implAccept(ServerSocket.java:545)\r\n\tat java.net.ServerSocket.accept(ServerSocket.java:513)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)\r\n\t... 26 more\r\n\nDriver stacktrace:\r\n\tat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)\r\n\tat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\r\n\tat scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)\r\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)\r\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)\r\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)\r\n\tat scala.Option.foreach(Option.scala:407)\r\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)\r\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)\r\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)\r\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)\r\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\r\nCaused by: org.apache.spark.SparkException: Python worker failed to connect back.\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)\r\n\tat org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)\r\n\tat org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)\r\n\tat org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)\r\n\tat org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec.$anonfun$doExecute$1(FlatMapGroupsInPandasExec.scala:94)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)\r\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\r\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)\r\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:329)\r\n\tat org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)\r\n\tat 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)\r\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)\r\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)\r\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)\r\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)\r\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\r\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\r\n\tat java.lang.Thread.run(Thread.java:750)\r\nCaused by: java.net.SocketTimeoutException: Accept timed out\r\n\tat java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)\r\n\tat java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)\r\n\tat java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)\r\n\tat java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)\r\n\tat java.net.ServerSocket.implAccept(ServerSocket.java:545)\r\n\tat java.net.ServerSocket.accept(ServerSocket.java:513)\r\n\tat org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)\r\n\t... 26 more\r\n"
}
I am getting the error below while writing a Spark Structured Streaming DataFrame.
Please tell me where I am going wrong when running this code.
Here df is reading from the s3://abc/testing location, and I am writing this DataFrame to a different S3 location using Spark streaming:
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.Trigger

val q = df.writeStream
  .trigger(Trigger.Once)
  .option("checkpointLocation", "s3://abc/checkpoint")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .write
      .mode(SaveMode.Append)
      .parquet("s3://abc/demo")
  }
  .start()

q.processAllAvailable()
q.stop()
While running the above code I get the error below:
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: [id = 82cae180-6190-499a-99ae, runId = 23aa9dca-c6ef-49ff-b860]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[s3://abc/testing]: {"logOffset":0}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
FileStreamSource[s3://abc/testing]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:379)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:269)
Caused by: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:116)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:114)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:139)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:200)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$3(SparkPlan.scala:252)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:248)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:158)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:157)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:999)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:999)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:437)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:421)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:294)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:884)
at line7d42fe70c8664871b443fdc5f6bbc35869.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$withCreateExtract$5(command-3858326:61)
at line7d42fe70c8664871b443fdc5f6bbc35869.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$withCreateExtract$5$adapted(command-3858326:56)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:39)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:593)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:591)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:276)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:274)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:74)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:591)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:231)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:276)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:274)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:74)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:199)
at org.apache.spark.sql.execution.streaming.OneTimeExecutor.execute(TriggerExecutor.scala:39)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:193)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:358)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:269)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 31 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2339)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2434)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:273)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:308)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:480)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:401)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:127)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$4(SQLExecution.scala:308)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$3(SQLExecution.scala:308)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:307)
at org.apache.spark.sql.execution.SQLExecution$.withOptimisticTransaction(SQLExecution.scala:325)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:306)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:104)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:68)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:54)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:101)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:104)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The error "Total size of serialized results of 31 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB" means that when an executor tries to send its results to the driver, they exceed spark.driver.maxResultSize. You can resolve it by increasing that limit until the job works, but that is not recommended if an executor is trying to send too much data.
Another thing that could cause this is skewed data. You should check how the data is distributed across the worker nodes; a possible scenario is that all the data ends up on a single node, which causes a huge amount of data to move in and out of a single worker. In that case you can try to repartition your data to split the load between your workers, which is a much better solution than increasing the limit.
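A minimal sketch of the two mitigations described above (the numbers are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

// Option 1: raise the driver-side limit; it has to be in place before the
// SparkContext starts, e.g. when the session is built or via --conf on spark-submit:
val spark = SparkSession.builder()
  .appName("stream-to-parquet")
  .config("spark.driver.maxResultSize", "8g")   // illustrative value
  .getOrCreate()

// Option 2 (usually preferable): spread the load by repartitioning the micro-batch
// inside foreachBatch before writing it out, e.g.:
//   .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
//     batchDF
//       .repartition(200)                       // illustrative partition count
//       .write
//       .mode(SaveMode.Append)
//       .parquet("s3://abc/demo")
//   }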
I have ORC files in my HDFS. One of the fields is a Map(String, String). Somehow there are some rows with the value Map(null, null) in this field. A null map key is a critical error for Java, so when I try to access this field I get a NullPointerException.
I want to read these files and change this field to an empty map.
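For reference, the reason a null key is fatal here: the ORC record reader copies each map cell into a java.util.TreeMap (see the TreeMap.put / TreeMap.compare frames in the stack trace below), and TreeMap does not accept null keys. A minimal standalone reproduction of that failure:

val m = new java.util.TreeMap[String, String]()
m.put(null, null)   // throws java.lang.NullPointerException in TreeMap.compare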
I tried to do it this way:
val df = spark.read.format("orc").load("/tmp/bad_orc")
def func(s: org.apache.spark.sql.Row): String = {
try
{
if ( s(14) == null ) // the 14'th column is the column with Map(String,String) type
{
return "Ok"
}
else
{
return "Zero"
}
}
catch
{
case x: Exception => return "Erro"
}
}
df.rdd.map(func).take(20)
I get this exception when I run this script:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 417.0 failed 4 times, most recent failure: Lost task 0.3 in stage 417.0 (TID 97094, srvg1076.local.odkl.ru, executor 86): java.lang.NullPointerException
at java.util.TreeMap.compare(TreeMap.java:1294)
at java.util.TreeMap.put(TreeMap.java:538)
at org.apache.orc.mapred.OrcMapredRecordReader.nextMap(OrcMapredRecordReader.java:507)
at org.apache.orc.mapred.OrcMapredRecordReader.nextValue(OrcMapredRecordReader.java:554)
at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:104)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
And when I try to access any other column in this ORC, everything is OK.
How can I catch this exception, and how can I fix these files? Please help.
I am writing an app that will extract logs, so I implemented a function (Listing 1) that takes a string as a parameter and extracts valuable information from it (regexes: Listing 2). I wanted this method to be shippable to other workers, so I implemented it as a serializable class.
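The Matcher of Listing 1 is not reproduced here; a hypothetical stand-in, just to show its shape (the regex is illustrative and stands in for the real patterns of Listing 2), would look roughly like this:

class Matcher extends Serializable {
  // illustrative pattern only; the real regexes are in Listing 2
  private val logPattern = """^(\S+)\s+(\S+)\s+(.*)$""".r

  def matchLog(log: String): String = log match {
    case logPattern(timestamp, level, message) => s"$timestamp $level $message"
    case _ => log
  }
}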
I have a problem with applying this method to DStreams. Here is my stream-mining solution:
import org.apache.spark.streaming.{Seconds, StreamingContext}

def streamMinner(): Unit = {
  val ssc = new StreamingContext(sc, Seconds(2))
  val logsStream = ssc.textFileStream("logs/")

  // Does not work
  val extractLogs = logsStream.map(log => new Matcher().matchLog(log))
  extractLogs.print(1)

  // Works
  // val words = logsStream.transform(rdd => rdd.map(log => matchLog(log)))
  // words.print()

  ssc.start()
  ssc.awaitTermination()
}
The problem is in the line where every element of logsStream is mapped with a new object of the Matcher class (new Matcher().matchLog(log)).
Apache Spark gave me the errors below:
ERROR YarnScheduler: Lost executor 2 on host1: Container marked as failed: container_e743_1499728610705_0043_01_000003 on host: host1. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000003
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
ERROR YarnScheduler: Lost executor 5 on host2: Container marked as failed: container_e743_1499728610705_0043_01_000006 on host: host2. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000006
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
...
ERROR YarnScheduler: Lost executor 6 ...
ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
17/07/11 09:41:09 ERROR JobScheduler: Error running job streaming job 1499758850000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host1): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_e743_1499728610705_0043_01_000007 on host: host1. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000007
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1335)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.take(RDD.scala:1309)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:768)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:767)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:226)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host1): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_e743_1499728610705_0043_01_000007 on host: host1. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_01_000007
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1882)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1335)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.take(RDD.scala:1309)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:768)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$5$1.apply(DStream.scala:767)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:227)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:226)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
scala> 17/07/11 09:41:11 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from host3/11.11.11.11:11111 is closed
17/07/11 09:41:11 ERROR YarnScheduler: Lost executor 4 on host3: Slave lost
17/07/11 09:41:12 ERROR TransportClient: Failed to send RPC 7741519719369750843 to host3/11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:12 ERROR YarnScheduler: Lost executor 1 on host2: Slave lost
17/07/11 09:41:12 ERROR TransportClient: Failed to send RPC 7734757459881277232 to host3//11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:12 ERROR YarnScheduler: Lost executor 3 on host4: Slave lost
17/07/11 09:41:12 ERROR TransportClient: Failed to send RPC 5816053641531447955 to host3//11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:12 ERROR YarnScheduler: Lost executor 7 on host2: Slave lost
17/07/11 09:41:13 ERROR TransportClient: Failed to send RPC 8774007142277591342 to host3/11.11.11.11:11111: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
17/07/11 09:41:13 ERROR YarnScheduler: Lost executor 8 on host1: Slave lost
17/07/11 09:41:19 ERROR YarnScheduler: Lost executor 1 on host3: Container marked as failed: container_e743_1499728610705_0043_02_000002 on host: host3. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e743_1499728610705_0043_02_000002
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:600)
at org.apache.hadoop.util.Shell.run(Shell.java:511)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:783)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 50
When I comment out these lines:
val extractLogs = logsStream.map( log => new Matcher().matchLog(log))
extractLogs.print(1)
and uncomment these lines:
// val words = logsStream.transform( rdd => rdd.map( log => matchLog(log)))
// words.print()
everything works fine. My question is why? I'm afraid that the solution that works may not be parallelized on the cluster, because the method matchLog is not serializable. Has anyone had a similar problem, or does anyone know how to deal with it?
Listing 1:
case class logValues2(time_stamp: String, action: String, protocol: String, connection_id: String, src_ip: String, dst_ip: String, src_port: String, dst_port: String, duration: String, bytes: String, user: String) extends Serializable

class Matcher extends Serializable {
  // The *_reg regular expressions used below are shown in Listing 2 (see the UPDATE).
  def matchLog(x: String): logValues2 = {
    var dst_ip = " "
    var dst_port = " "
    var time_stamp = time_stamp_reg.findAllIn(x).mkString(",")
    var action = action_reg.findAllIn(x).mkString(",")
    var protocol = protocol_reg.findAllIn(x).mkString(",")
    var connection_id = connection_id_reg.findAllIn(x).mkString(",")
    var ips = ips_reg.findAllIn(x).mkString(" ").split(""" """)
    var src_ip = ips(0)
    if (ips.length > 1) {
      dst_ip = ips(1)
    } else {
      dst_ip = " "
    }
    var ports = ports_reg.findAllIn(x).mkString(" ").split(""" """)
    var src_port = ports(0)
    if (ports.length > 1) {
      dst_port = ports(1)
    } else {
      dst_port = " "
    }
    var duration = duration_reg.findAllIn(x).mkString(",")
    var bytes = bytes_reg.findAllIn(x).mkString(",")
    var user = user_reg.findAllIn(x).mkString(",")
    var logObject = logValues2(time_stamp, action, protocol, connection_id, src_ip, dst_ip, src_port, dst_port, duration, bytes, user)
    return logObject
  }
}
The method above is also implemented separately (not within the Matcher class).
UPDATE:
My regular expressions (Listing 2):
val time_stamp_reg = """^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)""".r
val action_reg = """((?<=:\s)\w{4,10}(?=\s\w{2})|(?<=\w\s)(\w{7,9})(?=\s[f]))""".r
val protocol_reg = """(?<=[\w:]\s)(\w+)(?=\s[cr])""".r
val connection_id_reg = """(?<=\w\s)(\d+)(?=\sfor)""".r
val ips_reg = """(?<=[\d\w][:\s])(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?=\/\d+|\z| \w)""".r
val ports_reg = """(?<=\d\/)(\d{1,6})(?=\z|[\s(])""".r
val duration_reg = """(?<=duration\s)(\d{1,2}:\d{1,2}:\d{1,2})(?=\s|\z)""".r
val bytes_reg = """(?<=bytes\s)(\d+)(?=\s|\z)""".r
val user_reg = """(?<=\\\\)(\d+)(?=\W)""".r
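On the serialization concern above: one commonly recommended pattern in Spark is to keep the parsing function and every value it touches inside a top-level singleton object (the "functions in a global singleton object" approach from the Spark programming guide), so the closure shipped to the executors captures only that object and nothing from the enclosing class or REPL line. The sketch below only illustrates that pattern; it is not the original code: ParsedLog, LogParser and the simplified regexes are placeholders for logValues2, Matcher and the Listing 2 patterns.
case class ParsedLog(timeStamp: String, action: String)

object LogParser extends Serializable {
  // Simplified stand-ins for the Listing 2 patterns.
  private val timeStampReg = """^\S+""".r
  private val actionReg = """(?<=:\s)\w+""".r

  def parse(line: String): ParsedLog =
    ParsedLog(
      timeStampReg.findFirstIn(line).getOrElse(" "),
      actionReg.findFirstIn(line).getOrElse(" ")
    )
}

// Usage with a DStream[String] such as logsStream:
// val parsed = logsStream.map(LogParser.parse)
// parsed.print(1)
Whether closure capture is actually what is killing the containers here is hard to tell from exit code 50 alone; restructuring like this simply removes it as a variable.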
In Spark 1.6.0 I have a data frame with a column that holds a job description, like:
Description
bartender
bartender
employee
taxi-driver
...
I retrieve a list of unique values from that column with:
val jobs = people.select("Description").distinct().rdd.map(r => r(0).asInstanceOf[String]).repartition(4)
I then try, for each job description, to retrieve people with that job and do something, but I get a NullPointerException:
jobs.foreach { ajob =>
  var peoplewithjob = people.filter($"Description" === ajob)
  // ... do stuff
}
I don't understand why this happens, because every job has been extracted from the people data frame, so there should be at least one person with that job... any hint is more than welcome! Here's the stack trace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 1 times, most recent failure: Lost task 3.0 in stage 4.0 (TID 206, localhost): java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
at jago.Run$$anonfun$main$1.apply(Run.scala:89)
at jago.Run$$anonfun$main$1.apply(Run.scala:82)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It happens because Spark doesn't support nested actions or transformations: people is a driver-side object, so the closure passed to jobs.foreach, which runs on the executors, cannot use it, and the attempt fails with the NullPointerException you see inside the DataFrame constructor. If you want to operate on distinct values extracted from the DataFrame, you have to fetch the results to the driver and iterate locally:
// or toLocalIterator
jobs.collect.foreach { ajob =>
  var peoplewithjob = people.filter($"Description" === ajob)
}
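The toLocalIterator mentioned in the comment is useful when the set of distinct jobs is too large to collect at once: it pulls the distinct values to the driver one partition at a time instead. A minimal sketch of that variant:
jobs.toLocalIterator.foreach { ajob =>
  val peoplewithjob = people.filter($"Description" === ajob)
  // ... do stuff, still one job description at a time on the driver
}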
Depending on what kind of transformations you apply as "do stuff", it can be a better idea to simply groupBy and aggregate:
people.groupBy($"Description").agg(...)
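For example, if "do stuff" were simply counting people per job description (a hypothetical aggregation, since the real logic isn't shown), the whole thing becomes a single distributed job:
import org.apache.spark.sql.functions.count

// Assumes the same implicits already in scope in the question (for the $"..." syntax).
val peoplePerJob = people.groupBy($"Description").agg(count($"Description").as("n_people"))
peoplePerJob.show()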