How to fix an error when an empty string is written to Elasticsearch from an Apache Spark job? - scala

An exception is thrown when I run my Scala app, which calls myRDD.saveToEs (I also tried saveToEs on a DataFrame). My ES version is 2.3.5.
I am using Spark 1.5.0, so maybe there is a way to configure this in the SparkContext that I am not aware of.
The stack trace is as follows:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): org.apache.spark.util.TaskCompletionListenerException: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse [foo_eff_dt];Invalid format: ""; Bailing out..
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The field named foo_eff_dt has a value in some records and is empty in others. I am not sure whether this is causing the exception.
My Scala code snippet looks like this:
fooRDD.saveToEs("foo/bar")
Please help/guide me in resolving this.
TIA.

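I think you are trying to insert a date into Elasticsearch, and a field mapped as a date cannot hold an empty string, which is why parsing "" fails. The mapping for the field presumably looks like this: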
{
"format": "strict_date_optional_time||epoch_millis",
"type": "date"
}
If you don't strictly need a date field, you can easily resolve this by changing the field's mapping type to string.
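If the date mapping has to stay, another option is to leave the field out of the document whenever it is empty, since a missing field is not parsed at all. The following is only a minimal sketch under assumptions: it presumes each record is a Map[String, Any], and the names EmptyFieldCleaner and dropEmptyFields are made up for illustration.

```scala
// Sketch: remove empty fields from a document before it is indexed.
// EmptyFieldCleaner and dropEmptyFields are hypothetical names.
object EmptyFieldCleaner {
  /** Drops entries whose value is null or a blank string. */
  def dropEmptyFields(doc: Map[String, Any]): Map[String, Any] =
    doc.filter {
      case (_, null)      => false            // no value at all
      case (_, s: String) => s.trim.nonEmpty  // "" or whitespace only
      case _              => true             // keep everything else
    }
}

// Possible usage in the Spark job, assuming fooRDD is an RDD[Map[String, Any]]:
//   import org.elasticsearch.spark._
//   fooRDD.map(EmptyFieldCleaner.dropEmptyFields).saveToEs("foo/bar")
```

Documents without foo_eff_dt are then indexed without that field, while documents with a valid date keep it.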

Related

java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.useDeprecatedKafkaOffsetFetching()Z

I want to display a Spark DataFrame, and I used:
df.writeStream.outputMode("append").start().awaitTermination()
But I got the following error when I ran this line:
21/07/16 01:20:53 ERROR MicroBatchExecution: Query [id = f243e6e6-c02e-4e70-b5c3-6a821fd33232, runId = 312544cf-fea8-45b4-94a1-c052306538cf] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.useDeprecatedKafkaOffsetFetching()Z
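Check the version of Spark and the versions of the dependencies you have added, and make sure they all match. A NoSuchMethodError like this typically means a connector library was built against a different Spark version than the one on the classpath. Aligning the versions should resolve the issue.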
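As an illustration of keeping the versions aligned, here is a hedged build.sbt sketch. It assumes the project is built with sbt and uses the Kafka source, neither of which the question states outright; the version number is a placeholder, not a recommendation.

```scala
// build.sbt (sketch): pin every Spark module to one shared version.
// "3.1.2" is an assumed value; use the version your cluster actually runs.
val sparkVersion = "3.1.2"

libraryDependencies ++= Seq(
  // Provided by the cluster at runtime, so it is not bundled into the jar.
  "org.apache.spark" %% "spark-sql"            % sparkVersion % "provided",
  // The Kafka connector must match the Spark version above.
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)
```

Using a single sparkVersion value makes it harder for one module to drift ahead of the others.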

DATABRICKS SparkException: Exception thrown in awaitResult - CAN'T DISPLAY DATAFRAME

I need some help please.
I run this command: display(df), but when I try to download the dataframe I obtain the following error:
SparkException: Exception thrown in awaitResult: Caused by: java.io.IOException: Failed to read job commit marker: FileStatus{path=dbfs:/databricks-results/1390434353332427/_committed_8779047008713225709; isDirectory=false; length=114; replication=1; blocksize=67108864; modification_time=1583486899000; access_time=0; owner=; group=; permission=rwx-wx-wx; isSymlink=false}
Thanks in advance!

Not able to insert data into redshift table if any column has any NULL values from s3

I have source data in S3 in the format below.
WM_ID,SOURCE_SYSTEM,DB_ID,JOB_NUM,NOTE_TYPE,NOTE_TEXT,NOTE_DATE_TIME
WOR25,CORE,NI,NI1LBE14,GEN,"",2020-02-01 17:23:32
WOR25,FSI,NI,NI1LBR39,CPN,"",2020-02-04 13:47:35
WOR25,FSI,NI,NI1LBE14,ACC,"",2020-02-03 13:22:56
WOR25,CORE,NI,NI1LBR39,FIT,NA,2020-02-05 13:13:08
Here NOTE_TEXT is NULL in some rows. When I try to insert this data into a Redshift table using the JDBC loader in StreamSets Transformer (spark-submit), it does not work.
RUN_ERROR: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.sql.SQLFeatureNotSupportedException: [Amazon][JDBC](10220) Driver does not support this optional feature.
at com.amazon.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.amazon.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown Source)
at com.amazon.jdbc.common.SPreparedStatement.setNull(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:658)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at
If I convert all the NULL values to strings, it works as expected. Can anyone guide me to the correct approach?
The problem is that the Redshift JDBC driver itself doesn’t support writing null values. The workaround is to convert to string.
You can use Field Replacer to replace NULLs with a placeholder.
We at StreamSets are looking at resolving this in a future release.
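For readers who do the replacement in code rather than with the Field Replacer stage, the same idea can be sketched in plain Scala. This is only an illustration: NullReplacer and fillNulls are hypothetical names, the record is modelled as a Map[String, String], and the "NA" placeholder is borrowed from the sample data above.

```scala
// Sketch: replace null values with a placeholder before writing over JDBC.
// NullReplacer and fillNulls are hypothetical names for illustration.
object NullReplacer {
  /** Returns a copy of the record in which every null value becomes the placeholder. */
  def fillNulls(record: Map[String, String], placeholder: String): Map[String, String] =
    record.map { case (key, value) => key -> Option(value).getOrElse(placeholder) }
}

// In a hand-written Spark job, the DataFrame counterpart would be:
//   df.na.fill("NA", Seq("NOTE_TEXT"))
```

Whether a placeholder is acceptable depends on how downstream queries treat that column, so pick a value that cannot be confused with real data.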

ERROR: com.streamsets.pipeline.api.StageException: JDBC_52 - Error starting LogMiner

I have been getting the following error while running Oracle CDC since this morning. It was running fine before, but since this morning it fails continuously.
What is the exact reason for this error?
The pipeline, cdc_test stopped at 2019-06-15 13:37:46 due to the following error:
UNKNOWN com.streamsets.pipeline.api.StageException: JDBC_52 - Error starting LogMiner
at com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.OracleCDCSource.startGeneratorThread(OracleCDCSource.java:454)
at com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.OracleCDCSource.produce(OracleCDCSource.java:325)
at com.streamsets.pipeline.api.base.configurablestage.DSource.produce(DSource.java:38)
at com.streamsets.datacollector.runner.StageRuntime.lambda$execute$2(StageRuntime.java:283)
at com.streamsets.pipeline.api.impl.CreateByRef.call(CreateByRef.java:40)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:235)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:298)
at com.streamsets.datacollector.runner.StagePipe.process(StagePipe.java:219)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.processPipe(ProductionPipelineRunner.java:810)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.runPollSource(ProductionPipelineRunner.java:554)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunner.run(ProductionPipelineRunner.java:383)
at com.streamsets.datacollector.runner.Pipeline.run(Pipeline.java:527)
at com.streamsets.datacollector.execution.runner.common.ProductionPipeline.run(ProductionPipeline.java:109)
at com.streamsets.datacollector.execution.runner.common.ProductionPipelineRunnable.run(ProductionPipelineRunnable.java:75)
at com.streamsets.datacollector.execution.runner.standalone.StandaloneRunner.start(StandaloneRunner.java:703)
at com.streamsets.datacollector.execution.AbstractRunner.lambda$scheduleForRetries$0(AbstractRunner.java:349)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:226)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:222)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at com.streamsets.datacollector.metrics.MetricSafeScheduledExecutorService$MetricsTask.run(MetricSafeScheduledExecutorService.java:100)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: ORA-01291: missing logfile
ORA-06512: at "SYS.DBMS_LOGMNR", line 58
ORA-06512: at line 1
Typically, this means that while the pipeline was stopped, Oracle deleted one or more logfiles, so the pipeline cannot pick up where it left off.
This blog entry gives a lot of detail on the issue and steps to resolve it: https://streamsets.com/blog/replicating-oracle-mysql-json#ora-01291-missing-logfile

Spark fails on big shuffle jobs with java.io.IOException: Filesystem closed

I often find that Spark fails on large jobs with a rather unhelpful exception. The worker logs look normal, with no errors, but the workers end up in state "KILLED". This is extremely common for large shuffles, so operations like .distinct.
The question is, how do I diagnose what's going wrong, and ideally, how do I fix it?
Given that a lot of these operations are monoidal, I've been working around the problem by splitting the data into, say, 10 chunks, running the app on each chunk, then running the app on all of the resulting outputs. In other words: meta-map-reduce.
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
As of September 1st 2014, this is an "open improvement" in Spark; please see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in that link, the shutdown hooks likely run in the wrong order when an executor fails, which results in this message. You will have to do a little more investigation to figure out the root cause of the problem (i.e. why your executor failed). If it is a large shuffle, it might be an out-of-memory error that causes the executor to fail, which in turn causes the Hadoop FileSystem to be closed in its shutdown hook. The RecordReaders in that executor's running tasks then throw the "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in a subsequent release and then you will get a more helpful error message :)
Something calls DFSClient.close() or DFSClient.abort(), closing the client. The next file operation then results in the above exception.
I would try to figure out what calls close()/abort(). You could use a breakpoint in your debugger, or modify the Hadoop source code to throw an exception in these methods, so you would get a stack trace.
The “Filesystem closed” exception can be resolved if the Spark job is running on a cluster. You can set properties like spark.executor.cores, spark.driver.cores and spark.akka.threads to the maximum values your available resources allow. I had the same problem when my dataset was pretty large, with about 20 million records of JSON data. I fixed it with the above properties and it ran like a charm. In my case, I set those properties to 25, 25 and 20 respectively. Hope it helps!!
Reference Link:
http://spark.apache.org/docs/latest/configuration.html
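The properties mentioned above can be set programmatically on a SparkConf. The snippet below is only a sketch: the values are the ones quoted in the answer, the object and application names are made up, and it assumes a Spark 1.x job, since spark.akka.threads is a Spark 1.x setting.

```scala
import org.apache.spark.SparkConf

// Sketch only: values are those quoted in the answer, not general recommendations.
object ShuffleJobConf {
  val conf: SparkConf = new SparkConf()
    .setAppName("large-shuffle-job")     // hypothetical application name
    .set("spark.executor.cores", "25")   // cores per executor
    .set("spark.driver.cores", "25")     // cores for the driver process
    .set("spark.akka.threads", "20")     // Spark 1.x only
}
```

The same keys can also be passed at launch time with --conf on spark-submit, which avoids hard-coding cluster-specific numbers in the application.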