Not able to insert data into a Redshift table if any column has NULL values from S3

I have source data in S3 in the following format:
WM_ID,SOURCE_SYSTEM,DB_ID,JOB_NUM,NOTE_TYPE,NOTE_TEXT,NOTE_DATE_TIME
WOR25,CORE,NI,NI1LBE14,GEN,"",2020-02-01 17:23:32
WOR25,FSI,NI,NI1LBR39,CPN,"",2020-02-04 13:47:35
WOR25,FSI,NI,NI1LBE14,ACC,"",2020-02-03 13:22:56
WOR25,CORE,NI,NI1LBR39,FIT,NA,2020-02-05 13:13:08
Here NOTE_TEXT is NULL for some rows. When I try to insert into the Redshift table using the JDBC loader in a StreamSets Transformer pipeline (spark-submit), it fails with the following error:
RUN_ERROR: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.sql.SQLFeatureNotSupportedException: [Amazon][JDBC](10220) Driver does not support this optional feature.
at com.amazon.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.amazon.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown Source)
at com.amazon.jdbc.common.SPreparedStatement.setNull(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:658)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at
If I convert all of the NULL values to strings, it works as expected. Can anyone guide me to the correct approach?

The problem is that the Redshift JDBC driver itself does not support writing null values: the setNull call in the stack trace fails with SQLFeatureNotSupportedException. The workaround is to convert the NULLs to strings before the write.
You can use a Field Replacer to replace NULLs with a placeholder value.
We at StreamSets are looking at resolving this in a future release.
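For illustration, here is a minimal sketch of the same workaround as a hand-written Spark job rather than a Transformer pipeline; the S3 path, JDBC URL, table name, credentials and the "NA" placeholder are all assumptions:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("s3-to-redshift").getOrCreate()
// Read the CSV from S3 (path is a placeholder).
val df = spark.read.option("header", "true").csv("s3://my-bucket/notes/")
// Replace NULLs in NOTE_TEXT with a placeholder string so the Redshift
// JDBC driver never receives a setNull call for that column.
val noNulls = df.na.fill("NA", Seq("NOTE_TEXT"))
// Write through the Redshift JDBC driver (connection details are placeholders).
noNulls.write
  .format("jdbc")
  .option("url", "jdbc:redshift://example-cluster:5439/dev")
  .option("dbtable", "public.notes")
  .option("user", "example_user")
  .option("password", "example_password")
  .mode("append")
  .save()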

Related

Azure Analysis Services processing using Azure Data Factory custom activity without Azure Batch

I have a .NET DLL for processing Azure Analysis Services. It works fine in .NET.
Now I want to run the same DLL using an Azure Data Factory custom activity without Azure Batch.
What approach can I use?
Using HDInsight, I got this error:
FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
17/08/05 11:31:07 INFO mapreduce.Job: Task Id : attempt_1501755343944_0172_m_000000_3, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
17/08/05 11:31:12 INFO mapreduce.Job: Task Id : attempt_1501755343944_0172_m_000000_4, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
17/08/05 11:31:17 INFO mapreduce.Job: Task Id : attempt_1501755343944_0172_m_000000_5, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild
Currently, custom .NET activities can only be executed on Azure Batch or Azure HDInsight. Since you don't want to use the first option, HDInsight is the only way.
Basically, you need to create an HDInsight service and describe it as a linked service for your custom .NET activity pipeline.
See Use Custom .Net activity with HDInsight.

How to fix an error when an empty string is being written to elastic search from an Apache Spark job?

An exception is thrown when I execute my Scala app, which calls myRDD.saveToEs (I also tried saveToEs from a DataFrame). My ES version is 2.3.5.
I am using Spark 1.5.0, so maybe there is a way to configure this in the SparkContext that I am not aware of.
The stack trace is as follows:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): org.apache.spark.util.TaskCompletionListenerException: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse [foo_eff_dt];Invalid format: ""; Bailing out..
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The field named foo_eff_dt has values in some cases and is empty in others. I am not sure if this is what is causing the exception.
My Scala code snippet looks like this:
fooRDD.saveToEs("foo/bar")
Please help/guide me in resolving this.
TIA.
I think you are trying to insert a date into Elasticsearch, and in Elasticsearch a date field cannot be an empty string. The field is presumably mapped as a date, something like:
{
"format": "strict_date_optional_time||epoch_millis",
"type": "date"
}
If you don't have a strict need for a date field, you can easily resolve this by changing the mapping type to string.
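Alternatively, if you want to keep the date mapping, a minimal sketch (assuming the data is in a DataFrame named fooDF, and that your elasticsearch-hadoop configuration omits null fields) is to turn the empty strings into nulls before writing, so foo_eff_dt is simply absent for those documents:
import org.apache.spark.sql.functions.{col, when}
import org.elasticsearch.spark.sql._
// Empty strings become null, so foo_eff_dt is not sent as "" to Elasticsearch.
val cleaned = fooDF.withColumn(
  "foo_eff_dt",
  when(col("foo_eff_dt") === "", null).otherwise(col("foo_eff_dt"))
)
cleaned.saveToEs("foo/bar")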

Quartz JobPersistenceException with HSQLDB

I'm trying to run a scheduler using Quartz with HSQLDB (version 2.2.7), but I get the following exception:
Caused by: org.quartz.JobPersistenceException: Couldn't store job: data exception: string data, right truncation [See nested exception: java.sql.SQLDataException: data exception: string data, right truncation]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.storeJob(JobStoreSupport.java:1132)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport$3.execute(JobStoreSupport.java:1071)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport$40.execute(JobStoreSupport.java:3716)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(JobStoreSupport.java:3788)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreTX.executeInLock(JobStoreTX.java:90)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInLock(JobStoreSupport.java:3712)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.storeJobAndTrigger(JobStoreSupport.java:1059)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.core.QuartzScheduler.scheduleJob(QuartzScheduler.java:822)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.StdScheduler.scheduleJob(StdScheduler.java:243)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at com.aepona.ase.services.terminalstatus.utils.SimpleTriggerExample.afterPropertiesSet(SimpleTriggerExample.java:32)[320:com.aepona.ase.services.terminalstatus:3.0.7.VFB-SNAPSHOT]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1514)[439:org.springframework.beans:3.1.4.RELEASE]
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1452)[439:org.springframework.beans:3.1.4.RELEASE]
... 14 more
Caused by: java.sql.SQLDataException: data exception: string data, right truncation
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)[315:org.hsqldb.hsqldb:2.2.7]
at org.apache.commons.dbcp.DelegatingPreparedStatement.executeUpdate(DelegatingPreparedStatement.java:102)[301:org.apache.servicemix.bundles.commons-dbcp:1.2.2.7]
at org.quartz.impl.jdbcjobstore.StdJDBCDelegate.insertJobDetail(StdJDBCDelegate.java:530)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
at org.quartz.impl.jdbcjobstore.JobStoreSupport.storeJob(JobStoreSupport.java:1126)[475:org.apache.servicemix.bundles.quartz:1.8.6.1]
Is anyone familiar with this issue?
I am using quartz-scheduler 1.8.6. The DB script that comes with it for HSQLDB uses the BINARY data type in the qrtz_job_details, qrtz_triggers, qrtz_calendars and qrtz_blob_triggers tables. I changed BINARY to BLOB and the problem was solved.

MongoDB Hadoop connector fails to query a MongoDB-backed Hive table

I am using the MongoDB Hadoop connector to query MongoDB through a Hive table in Hadoop.
I am able to execute
select * from mongoDBTestHiveTable;
But when I try to execute the following query
select id from mongoDBTestHiveTable;
it throws the following exception.
The class in question exists in the Hive lib folder.
Exception stacktrace:
Diagnostic Messages for this Task:
Error: java.io.IOException: Cannot create an instance of InputSplit class = com.mongodb.hadoop.hive.input.HiveMongoInputFormat$MongoHiveInputSplit:Class com.mongodb.hadoop.hive.input.HiveMongoInputFormat$MongoHiveInputSplit not found
at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:147)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:370)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:402)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class com.mongodb.hadoop.hive.input.HiveMongoInputFormat$MongoHiveInputSplit not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:144)
... 10 more
Container killed by the ApplicationMaster.
Please advise.
You also need to add the mongo-hadoop-* jars as well as the MongoDB driver jar to the MR1/MR2 classpath on all workers.

Spark fails on big shuffle jobs with java.io.IOException: Filesystem closed

I often find that Spark fails on large jobs with a rather unhelpful, meaningless exception. The worker logs look normal, with no errors, but they end up in state "KILLED". This is extremely common for large shuffles, i.e. operations like .distinct.
The question is, how do I diagnose what's going wrong, and ideally, how do I fix it?
Given that a lot of these operations are monoidal, I've been working around the problem by splitting the data into, say, 10 chunks, running the app on each chunk, and then running the app on all of the resulting outputs. In other words: meta-map-reduce.
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
As of September 1st, 2014, this is an "open improvement" in Spark; see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in that ticket, the shutdown hooks are likely run in the wrong order when an executor fails, which results in this message. You will have to do a little more investigation to figure out the main cause of the problem (i.e. why your executor failed). If it is a large shuffle, it might be an out-of-memory error that causes the executor failure, which then causes the Hadoop FileSystem to be closed in its shutdown hook. The RecordReaders in the running tasks of that executor then throw the "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in a subsequent release and then you will get a more helpful error message :)
Something calls DFSClient.close() or DFSClient.abort(), closing the client. The next file operation then results in the above exception.
I would try to figure out what calls close()/abort(). You could use a breakpoint in your debugger, or modify the Hadoop source code to throw an exception in these methods, so you would get a stack trace.
The "Filesystem closed" exception can be resolved if the Spark job is running on a cluster. You can set properties like spark.executor.cores, spark.driver.cores and spark.akka.threads to the maximum values your resources allow. I had the same problem when my dataset was pretty large, with about 20 million JSON records. I fixed it with the above properties and it ran like a charm. In my case, I set those properties to 25, 25 and 20 respectively. Hope it helps!
Reference Link:
http://spark.apache.org/docs/latest/configuration.html
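For completeness, a sketch of how those settings might be applied when building the SparkContext yourself; the numbers are the ones quoted in this answer, not general recommendations, and spark.akka.threads only exists in the older Akka-based Spark releases:
import org.apache.spark.{SparkConf, SparkContext}
// Tuning values taken from the answer above; adjust to your own cluster.
val conf = new SparkConf()
  .setAppName("big-shuffle-job")
  .set("spark.executor.cores", "25")
  .set("spark.driver.cores", "25")
  .set("spark.akka.threads", "20") // Akka-era Spark (1.x) only
val sc = new SparkContext(conf)
// ... large shuffle work, e.g. rdd.distinct(), goes here ...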