Sqoop import to avrodatafile or Parquet files failing in dataproc clusters - google-cloud-dataproc

When we run a Sqoop import in GCP Dataproc clusters to either avrodatafile or parquetfile, it fails with the errors below. However, import to textfile works. It feels like some additional JARs are required.
The required Sqoop jars are loaded from GCS.
COMMAND used:
gcloud dataproc jobs submit hadoop \
--cluster={cluster_name} \
--region=us-central1 \
--class=org.apache.sqoop.Sqoop --jars={sqoop_jars_gcs}/sqoop-1.4.7.jar,{sqoop_jars_gcs}/avro-1.8.2.jar,{sqoop_jars_gcs}/terajdbc4.jar,{sqoop_jars_gcs}/log4j-1.2.17.jar,{sqoop_jars_gcs}/sqoop-connector-teradata-1.2c5.jar,{sqoop_jars_gcs}/tdgssconfig.jar,{sqoop_jars_gcs}/avro-1.8.2.jar \
-- import \
-Dmapreduce.job.user.classpath.first=true \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect={db_connection}DATABASE={source_db} \
--username={userid} \
--password-file {passfile} \
--driver com.teradata.jdbc.TeraDriver \
-e "sql query AND \$CONDITIONS" \
--target-dir=<dir> \
--delete-target-dir \
--as-<avrodatafile/parquetfile> \
--split-by <column>
Error when running --as-avrodatafile:
We have avro-1.8.2.jar on the classpath but still no luck.
INFO - Error: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
INFO - at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
INFO - at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
INFO - at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
INFO - at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
INFO - at java.security.AccessController.doPrivileged(Native Method)
INFO - at javax.security.auth.Subject.doAs(Subject.java:422)
INFO - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
INFO - at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
INFO - Caused by: java.lang.reflect.InvocationTargetException
INFO - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
INFO - at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
INFO - at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
INFO - at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
INFO - ... 7 more
INFO - Caused by: java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
INFO - at org.apache.sqoop.mapreduce.AvroImportMapper.<init>(AvroImportMapper.java:43)
INFO - ... 12 more
INFO - Caused by: java.lang.ClassNotFoundException: org.apache.avro.mapred.AvroWrapper
INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO - ... 13 more
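For what it's worth, the missing class org.apache.avro.mapred.AvroWrapper ships in the separate avro-mapred artifact rather than in avro itself, so passing avro-1.8.2.jar alone is not enough. A minimal sketch of staging the extra jar, assuming the hadoop2-classified build from Maven Central and the same placeholder GCS prefix used above:
# Stage the extra jar next to the other Sqoop jars (the gs:// prefix is a placeholder)
gsutil cp avro-mapred-1.8.2-hadoop2.jar {sqoop_jars_gcs}/
# ...then append it to the job submission, e.g.:
#   --jars=...,{sqoop_jars_gcs}/avro-1.8.2.jar,{sqoop_jars_gcs}/avro-mapred-1.8.2-hadoop2.jar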
Error when running --as-parquetfile:
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO - at java.lang.reflect.Method.invoke(Method.java:498)
INFO - at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
INFO - Caused by: java.lang.NoClassDefFoundError: org/kitesdk/data/mapreduce/DatasetKeyOutputFormat
INFO - at org.apache.sqoop.mapreduce.DataDrivenImportJob.getOutputFormatClass(DataDrivenImportJob.java:213)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:98)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:263)
INFO - at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:748)
INFO - at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:522)
INFO - at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
INFO - at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
INFO - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
INFO - at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
INFO - at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
INFO - ... 5 more
INFO - Caused by: java.lang.ClassNotFoundException: org.kitesdk.data.mapreduce.DatasetKeyOutputFormat
INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO - ... 17 more

I ran into a similar issue. I was able to use the following resources to change the kite-sdk dependency to a newer version; I had to download the latest kite-sdk jars.
https://community.cloudera.com/t5/Support-Questions/Issue-when-using-parquet-org-kitesdk-data/td-p/128233
https://discuss.cloudxlab.com/t/sqoop-import-to-hive-as-parquet-file-is-failing/1089/6

This worked for me only when using a very specific set of jar versions, mostly from Cloudera, as follows:
# Jars used:
# https://repo1.maven.org/maven2/org/apache/parquet/parquet-format/2.9.0/parquet-format-2.9.0.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/sqoop/sqoop/1.4.7.7.2.10.0-148/sqoop-1.4.7.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-core/1.0.0-cdh6.3.4/kite-data-core-1.0.0-cdh6.3.4.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-mapreduce/1.0.0-cdh6.3.4/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-hadoop-compatibility/1.0.0-cdh6.3.4/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro/1.8.2.7.2.10.0-148/avro-1.8.2.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro-mapred/1.8.2.7.2.10.0-148/avro-mapred-1.8.2.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-avro/1.10.99.7.2.10.0-148/parquet-avro-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-common/1.10.99.7.2.10.0-148/parquet-common-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-column/1.10.99.7.2.10.0-148/parquet-column-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-hadoop/1.10.99.7.2.10.0-148/parquet-hadoop-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-jackson/1.10.99.7.2.10.0-148/parquet-jackson-1.10.99.7.2.10.0-148.jar
# https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-encoding/1.10.99.7.2.10.0-148/parquet-encoding-1.10.99.7.2.10.0-148.jar
export CLUSTER_NAME=
export CLUSTER_REGION=us-central1 # update accordingly
export GCS_BUCKET="" # name only
export DRIVER_CLASS=com.mysql.jdbc.Driver
export CONNECT_STRING="jdbc:..."
export USERNAME=
export PASSWORD="" # testing only - use password-file
export TABLE=
export JDBC_JAR=gs://${GCS_BUCKET}/sqoop/jars/mysql-connector-java-5.0.8-bin.jar
export SQOOP_JAR=gs://${GCS_BUCKET}/sqoop/jars/cloudera/sqoop-1.4.7.7.2.10.0-148.jar
export AVRO_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-1.8.2.7.2.10.0-148.jar
export AVRO_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-mapred-1.8.2.7.2.10.0-148.jar
export PARQUET_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-core-1.0.0-cdh6.3.4.jar
export PARQUET_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
export PARQUET_JAR3=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
export PARQUET_JAR4=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-common-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR5=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-avro-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR6=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-hadoop-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR7=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-column-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR8=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-encoding-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR9=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-jackson-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR10=gs://${GCS_BUCKET}/sqoop/jars/parquet-format-2.9.0.jar
gcloud dataproc jobs submit hadoop \
--cluster=${CLUSTER_NAME} \
--class=org.apache.sqoop.Sqoop \
--region=${CLUSTER_REGION} \
--jars=${JDBC_JAR},${SQOOP_JAR},${AVRO_JAR1},${AVRO_JAR2},${PARQUET_JAR1},${PARQUET_JAR2},${PARQUET_JAR3},${PARQUET_JAR4},${PARQUET_JAR5},${PARQUET_JAR6},${PARQUET_JAR7},${PARQUET_JAR8},${PARQUET_JAR9},${PARQUET_JAR10} \
-- import \
-Dmapreduce.job.user.classpath.first=true \
-Dparquetjob.configurator.implementation=hadoop \
--driver ${DRIVER_CLASS} \
--connect=${CONNECT_STRING} \
--username=${USERNAME} \
--password=${PASSWORD} \
--target-dir="gs://${GCS_BUCKET}/sqoop/out/parquet_output4/" \
--table=${TABLE} \
--delete-target-dir \
--as-parquetfile \
-m 1 \
--verbose
# --parquet-configurator-implementation kite \
#--compression-codec snappy \
#--query=""

Related

pyspark container - spark-submitting a pyspark script throws file not found error

Solution:
Add the following environment variables to the container:
export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
Trying to create a Spark container and spark-submit a PySpark script.
I am able to create the container, but running the PySpark script throws the following error:
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
Questions:
Any idea why this error is occurring?
Do I need to install Python separately, or does it come bundled with the Spark download?
Do I need to install PySpark separately, or does it come bundled with the Spark download?
What is preferable regarding the Python installation: download it and put it under /opt/python, or use apt-get?
pyspark script:
from pyspark import SparkContext

sc = SparkContext("local", "count app")
words = sc.parallelize(
    ["scala",
     "java",
     "hadoop",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark"]
)
counts = words.count()
# print() is required on Python 3 (the original used the Python 2 print statement)
print("Number of elements in RDD -> %i" % counts)
output of spark-submit:
newuser@c1f28230da16:~$ spark-submit count.py
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/02/01 19:58:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:319)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:250)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
    ... 15 more
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
output of printenv:
newuser@c1f28230da16:~$ printenv
HOME=/home/newuser
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
PYTHONPATH=:/opt/spark/python:/opt/spark/python/lib/py4j-0.10.4-src.zip
TERM=xterm
SHLVL=1
SPARK_HOME=/opt/spark
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/java/bin:/opt/spark/bin
_=/usr/bin/printenv
myspark dockerfile:
ARG JDK_PACKAGE=openjdk-14.0.2_linux-x64_bin.tar.gz
ARG SPARK_HOME=/opt/spark
ARG SPARK_PACKAGE=spark-3.0.1-bin-hadoop3.2.tgz
#MAINTAINER demo@gmail.com
#LABEL maintainer="demo@foo.com"
############################################
### Install openjava
############################################
# Base image stage 1
FROM ubuntu as jdk
ARG JAVA_HOME
ARG JDK_PACKAGE
WORKDIR /opt/
## download open java
# ADD https://download.java.net/java/GA/jdk14.0.2/205943a0976c4ed48cb16f1043c5c647/12/GPL/$JDK_PACKAGE /
# ADD $JDK_PACKAGE /
COPY $JDK_PACKAGE .
RUN mkdir -p $JAVA_HOME/ && \
    tar -zxf $JDK_PACKAGE --strip-components 1 -C $JAVA_HOME && \
    rm -f $JDK_PACKAGE
############################################
### Install spark search
############################################
# Base image stage 2
From ubuntu as spark
#ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE
WORKDIR /opt/
## download spark
COPY $SPARK_PACKAGE .
RUN mkdir -p $SPARK_HOME/ && \
    tar -zxf $SPARK_PACKAGE --strip-components 1 -C $SPARK_HOME && \
    rm -f $SPARK_PACKAGE
# Mount elasticsearch.yml config
### ADD config/elasticsearch.yml /opt/elasticsearch/config/elasticsearch.yml
############################################
### final
############################################
From ubuntu as finalbuild
ARG JAVA_HOME
ARG SPARK_HOME
ARG SPARK_PACKAGE
WORKDIR /opt/
# get artifacts from previous stages
COPY --from=jdk $JAVA_HOME $JAVA_HOME
COPY --from=spark $SPARK_HOME $SPARK_HOME
# Setup JAVA_HOME, this is useful for docker commandline
ENV JAVA_HOME $JAVA_HOME
ENV SPARK_HOME $SPARK_HOME
# setup paths
ENV PATH $PATH:$JAVA_HOME/bin
ENV PATH $PATH:$SPARK_HOME/bin
ENV PYTHONPATH $PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip
# Expose ports
# EXPOSE 9200
# EXPOSE 9300
# Define mountable directories.
#VOLUME ["/data"]
## give permission to entire setup directory
RUN useradd newuser --create-home --shell /bin/bash && \
    echo 'newuser:newpassword' | chpasswd && \
    chown -R newuser $SPARK_HOME $JAVA_HOME && \
    chown -R newuser:newuser /home/newuser && \
    chmod 755 /home/newuser
    #chown -R newuser:newuser /home/newuser
    #chown -R newuser /home/newuser && \
# Install Python
RUN apt-get update && \
    apt-get install -yq curl && \
    apt-get install -yq vim && \
    apt-get install -yq python3.9
## Install PySpark and Numpy
#RUN \
#    pip install --upgrade pip && \
#    pip install numpy && \
#    pip install pyspark
#
USER newuser
WORKDIR /home/newuser
# RUN chown -R newuser /home/newuser
I added the following environment variables to the container and it works:
export PYSPARK_PYTHON=/usr/bin/python3.9
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9
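To avoid exporting these by hand in every shell, one option (an assumption on my side, not part of the original fix) is to persist them in Spark's conf/spark-env.sh inside the image; the interpreter path matches the python3.9 installed by the Dockerfile above:
# Persist the interpreter choice so spark-submit always picks it up
echo 'export PYSPARK_PYTHON=/usr/bin/python3.9' >> $SPARK_HOME/conf/spark-env.sh
echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.9' >> $SPARK_HOME/conf/spark-env.sh
# Alternative: make a bare "python" resolve to python3.9
# ln -s /usr/bin/python3.9 /usr/local/bin/python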

Snowflake-kafka connect -> Error: Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jcajce.provider.BouncyCastleFipsProvider

I am getting this error message when I send a POST request to start my Kafka Snowflake connector:
[ec2-user@ip-10-0-64-123 tmp]$ curl -X POST -H "Content-Type: application/json" --data @snowflake.json http://internal-test-dev-alb-39351xyz.eu-central-1.elb.amazonaws.com:80/connectors
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Request failed.</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /connectors. Reason:
<pre> Request failed.</pre></p><hr>Powered by Jetty:// 9.4.18.v20190429<hr/>
</body>
</html>
When I look in the logs:
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jcajce.provider.BouncyCastleFipsProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 13 more
[2020-09-28 17:21:32,308] WARN /connectors (org.eclipse.jetty.server.HttpChannel:597)
javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.NoClassDefFoundError: org/bouncycastle/jcajce/provider/BouncyCastleFipsProvider
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:408)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:365)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:318)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:542)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1700)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1667)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:505)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.glassfish.jersey.server.ContainerException: java.lang.NoClassDefFoundError: org/bouncycastle/jcajce/provider/BouncyCastleFipsProvider
at org.glassfish.jersey.servlet.internal.ResponseWriter.rethrow(ResponseWriter.java:254)
at org.glassfish.jersey.servlet.internal.ResponseWriter.failure(ResponseWriter.java:236)
at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:436)
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232)
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:679)
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:392)
... 33 more
Caused by: java.lang.NoClassDefFoundError: org/bouncycastle/jcajce/provider/BouncyCastleFipsProvider
...
Here is my Dockerfile to build the connector:
FROM openjdk:8-jre
# Add Confluent Repository and install Confluent Platform
RUN wget -qO - http://packages.confluent.io/deb/5.3/archive.key | apt-key add -
RUN echo "deb [arch=amd64] http://packages.confluent.io/deb/5.3 stable main" > /etc/apt/sources.list.d/confluent.list
RUN apt-get update && apt-get install -y --no-install-recommends confluent-kafka-connect-* confluent-schema-registry gettext confluent-kafka-2.12
RUN mkdir -p /usr/share/java/kafka-connect-pubsub/
RUN mkdir -p /etc/kafka-connect-pubsub/
# Script to configure properties in various config files.
COPY config-templates/configure.sh /configure.sh
# RUN wget -q -P /usr/share/java/kafka-connect-jdbc/ https://s3.amazonaws.com/datahub-public-repo/ojdbc10.jar
# RUN wget -q -P /usr/share/java/kafka-connect-jdbc/ https://s3.amazonaws.com/datahub-public-repo/mysql-connector-java-5.1.41-bin.jar
# RUN wget -q -P /usr/share/java/kafka-connect-jdbc/ https://s3.amazonaws.com/datahub-public-repo/terajdbc4.jar
# RUN wget -q -P /usr/share/java/kafka-connect-jdbc/ https://s3.amazonaws.com/datahub-public-repo/tdgssconfig.jar
RUN wget -q -P /usr/share/java/ https://repo1.maven.org/maven2/com/snowflake/snowflake-kafka-connector/1.4.4/snowflake-kafka-connector-1.4.4.jar
RUN wget -q -P /usr/share/java/ https://repo1.maven.org/maven2/org/bouncycastle/bc-fips/1.0.2/bc-fips-1.0.2.jar
RUN wget -q -P /usr/share/java/ https://repo1.maven.org/maven2/org/bouncycastle/bcpkix-fips/1.0.4/bcpkix-fips-1.0.4.jar
COPY config-templates/connect-standalone.properties.template /etc/kafka/connect-standalone.properties.template
COPY config-templates/snowflake.properties.template /etc/kafka/snowflake.properties.template
COPY config-templates/connect-distributed.properties.template /etc/kafka/connect-distributed.properties.template
# COPY config-templates/jdbc-source.properties.template /etc/kafka-connect-jdbc/jdbc-source.properties.template
# COPY config-templates/jdbc-sink.properties.template /etc/kafka-connect-jdbc/jdbc-sink.properties.template
# COPY config-templates/pubsub-sink-connector.properties.template /etc/kafka-connect-pubsub/pubsub-sink-connector.properties.template
COPY config-templates/kafka-run-class /usr/bin/kafka-run-class
# Modify these lines to reflect your client Keystore and Truststore.
# COPY replicantSuperUser.kafka.client.keystore.jks /replicantSuperUser.kafka.client.keystore.jks
# COPY kafka.client.truststore.jks /tmp/kafka.client.truststore.jks
RUN chmod 755 configure.sh /usr/bin/kafka-run-class
ENTRYPOINT /configure.sh && $KC_CMD && bash
Here is my snowflake.json
{
"name":"Snowflaketest",
"config":{
"connector.class":"com.snowflake.kafka.connector.SnowflakeSinkConnector",
"tasks.max":"8",
"topics":"dat.slt.isc.incoming.json",
"buffer.count.records":"10000",
"buffer.flush.time":"60",
"buffer.size.bytes":"5000000",
"snowflake.url.name":"https://t1234.eu-central-1.snowflakecomputing.com:443",
"snowflake.user.name":"asdf_CONNECT",
"snowflake.private.key":"MIIFLTBXBgkqhkiG9w0BBQ0wSjApBgkqhkiG9w0BBQwwHAQIWi2iAjGL9JsCAggAMAw********",
"snowflake.private.key.passphrase":"hchajvdzSvcmqamIWe1jvrF***",
"snowflake.database.name":"db_SANDBOX",
"snowflake.schema.name":"LOADDB",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"com.snowflake.kafka.connector.records.SnowflakeJsonConverter"
}
}
Any pointers?
My guess is that the bouncycastle jars are not placed in the correct location in the openjdk image. Any idea where they should be placed, or if there is any other way to fix the problem?
Any help is highly appreciated.
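One thing that may be worth checking (a guess in line with the above, not a confirmed fix from this thread): the stack trace shows Kafka Connect's isolated PluginClassLoader failing to resolve the class, and that classloader only sees jars that sit in the same plugin directory as the connector itself, so the BouncyCastle FIPS jars may need to live next to the Snowflake connector jar rather than loose in /usr/share/java. A hedged sketch of that layout; the directory name is an assumption:
# Co-locate the connector and the BouncyCastle FIPS jars in one plugin directory
mkdir -p /usr/share/java/kafka-connect-snowflake
cd /usr/share/java/kafka-connect-snowflake
wget -q https://repo1.maven.org/maven2/com/snowflake/snowflake-kafka-connector/1.4.4/snowflake-kafka-connector-1.4.4.jar
wget -q https://repo1.maven.org/maven2/org/bouncycastle/bc-fips/1.0.2/bc-fips-1.0.2.jar
wget -q https://repo1.maven.org/maven2/org/bouncycastle/bcpkix-fips/1.0.4/bcpkix-fips-1.0.4.jar
# and make sure the worker config includes the parent directory, e.g.
#   plugin.path=/usr/share/java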

Permission issue encountered during sqoop import using Hcatalog

I am trying to use sqoop import with HCatalog integration to ingest data from Teradata to Hive. Below is my sqoop import command:
sqoop import -libjars /path/tdgssconfig.jar \
-Dmapreduce.job.queuename=${queue} \
-Dmapreduce.map.java.opts=-Xmx16g \
-Dmapreduce.map.memory.mb=20480 \
--driver com.teradata.jdbc.TeraDriver \
--connect jdbc:teradata:<db-url>,charset=ASCII,LOGMECH=LDAP \
--username ${srcDbUsr} \
--password-file ${srcDbPassFile} \
--verbose \
--query "${query} AND \$CONDITIONS" \
--split-by ${splitBy} \
--fetch-size ${fetchSize} \
--null-string '\\N' \
--null-non-string '\\N' \
--fields-terminated-by , \
--hcatalog-database ${tgtDbName} \
--hcatalog-table ${tgtTblName} \
--hcatalog-partition-keys ${partitionKey} \
--hcatalog-partition-values "${partitionValue}"
And I encountered the below error - Error adding partition to metastore. Permission denied.:
18/07/03 12:14:02 INFO mapreduce.Job: Job job_1530241180113_6487 failed with state FAILED due to: Job commit failed: org.apache.hive.hcatalog.common.HCatException : 2006 : Error adding partition to metastore. Cause : org.apache.hadoop.security.AccessControlException: Permission denied. user=<usr-name> is not the owner of inode=<partition-key=partition-value>
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkOwner(DefaultAuthorizationProvider.java:195)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:181)
at org.apache.sentry.hdfs.SentryAuthorizationProvider.checkPermission(SentryAuthorizationProvider.java:178)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3560)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3543)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkOwner(FSDirectory.java:3508)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOwner(FSNamesystem.java:6559)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermissionInt(FSNamesystem.java:1807)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermission(FSNamesystem.java:1787)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setPermission(NameNodeRpcServer.java:654)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.setPermission(AuthorizationProviderProxyClientProtocol.java:174)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setPermission(ClientNamenodeProtocolServerSideTranslatorPB.java:454)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2141)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)
at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.registerPartitions(FileOutputCommitterContainer.java:969)
at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitJob(FileOutputCommitterContainer.java:249)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:274)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
How can I resolve this permission issue?
Figured out the issue: Sqoop HCatalog cannot add files to a Hive internal (managed) table, because the table resides under the Hive warehouse directories and the owner is hive, not the submitting user. The resolution is to create an external table so that the underlying directories have the user (not hive) as the owner, as in the sketch below.
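A minimal sketch of what that can look like (the columns, storage format, and HDFS path are placeholders, not from the original post); the Sqoop command itself stays the same and just targets the external table:
# Create a user-owned location and an EXTERNAL table on top of it
hdfs dfs -mkdir -p "/user/<usr-name>/warehouse/${tgtTblName}_ext"
hive -e "
CREATE EXTERNAL TABLE ${tgtDbName}.${tgtTblName}_ext (
  id INT,
  payload STRING
)
PARTITIONED BY (${partitionKey} STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/<usr-name>/warehouse/${tgtTblName}_ext';
"
# Then point --hcatalog-database/--hcatalog-table at the external table.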

How to execute an application uploaded to worker nodes with --files option?

I am uploading a file to my worker nodes using spark-submit, and I would like to access this file. This file is a binary, which I would like to execute. I already know how to execute the file through Scala, but I keep getting a "File not found" exception and I can't find a way to access it.
I use the following command to submit my job.
spark-submit --class Main --master yarn --deploy-mode cluster --files las2las myjar.jar
When the job is being executed, I noticed that the file was uploaded to the staging directory of the currently running application. When I tried to run the following, it didn't work.
val command = "hdfs://url/user/username/.sparkStaging/" + sparkContext.applicationId + "/las2las" !!
This is the exception that gets thrown:
17/10/22 18:15:57 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Cannot run program "hdfs://url/user/username/.sparkStaging/application_1486393309284_26788/las2las": error=2, No such file or directory
So, my question is, how can I access the las2las file?
Use SparkFiles:
val path = org.apache.spark.SparkFiles.get("las2las")
How can I access the las2las file?
When you go to the YARN UI at http://localhost:8088/cluster and click on the application ID of the Spark application, you'll be redirected to the page with the container logs. Click Logs. In stderr you should find lines that look similar to the following:
===============================================================================
YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_YARN_STAGING_DIR -> file:/Users/jacek/.sparkStaging/application_1508700955259_0002
SPARK_USER -> jacek
SPARK_YARN_MODE -> true
command:
{{JAVA_HOME}}/bin/java \
-server \
-Xmx1024m \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.worker.ui.port=44444' \
'-Dspark.driver.port=55365' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@192.168.1.6:55365 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
1 \
--app-id \
application_1508700955259_0002 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__spark_libs__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_libs__618005180363157241.zip" } size: 218111116 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_conf__.zip" } size: 105328 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE
===============================================================================
I executed my Spark application as follows:
YARN_CONF_DIR=/tmp \
./bin/spark-shell --master yarn --deploy-mode client --files hello.sh
so the line of interest is:
hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE
You should find a similar line with the path to the shell script (mine is /Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh).
This file is a binary, which I would like to execute.
With the line, you can try to execute it.
import scala.sys.process._
scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh" !!
warning: there was one feature warning; re-run with -feature for details
java.io.IOException: Cannot run program "/Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:69)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:113)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:129)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 50 elided
Caused by: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 54 more
It won't work by default since the file is not marked as executable.
$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rw-r--r-- 1 jacek staff 33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
(I don't know if you can inform Spark or YARN to make a file executable).
Let's make the file executable.
scala> s"chmod +x /Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
res2: String = ""
It is indeed an executable shell script.
$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rwxr-xr-x 1 jacek staff 33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
Let's execute it then.
scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
+ echo 'Hello world'
res3: String =
"Hello world
"
It worked fine given the following hello.sh:
#!/bin/sh -x
echo "Hello world"

Not able to export CLOB data from hive to db2

I am successfully able to import data with a CLOB data type from DB2 to Hive. After some processing on the table in Hive, I want to load the table back to DB2.
Command used to import:
$SQOOP_HOME/bin/sqoop import --connect jdbc:db2://192.168.145.64:50000/one --table clobtest --username db2inst1 --password dbuser --hive-import --map-column-hive CLOBB=STRING --inline-lob-limit 155578 --target-dir /tmp/1 --m 1
Command used to export:
$SQOOP_HOME/bin/sqoop export --connect jdbc:db2://192.168.145.64:50000/one --username db2inst1 --password dbuser --export-dir /user/hive/warehouse/clobtest --table clobtest --input-fields-terminated-by '\0001' --input-null-string '\\N' --input-null-non-string '\\N' --m 1
At the time of export, I am getting the below error:
Error: java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Could not buffer record
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:218)
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:46)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:84)
... 10 more
Caused by: java.lang.CloneNotSupportedException: com.cloudera.sqoop.lib.ClobRef
at java.lang.Object.clone(Native Method)
at org.apache.sqoop.lib.LobRef.clone(LobRef.java:109)
at clobtest.clone(clobtest.java:222)
at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:213)
... 15 more
Any idea about this error?
Thanks.
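One avenue that may be worth trying (a suggestion prompted by the ClobRef clone failure above, not a confirmed fix from this thread) is to ask Sqoop to treat the CLOB column as a plain Java String during export, so the generated record class does not carry a ClobRef field. --map-column-java is a standard Sqoop option; CLOBB is the column name taken from the import command above:
$SQOOP_HOME/bin/sqoop export --connect jdbc:db2://192.168.145.64:50000/one \
  --username db2inst1 --password dbuser \
  --export-dir /user/hive/warehouse/clobtest --table clobtest \
  --map-column-java CLOBB=String \
  --input-fields-terminated-by '\0001' \
  --input-null-string '\\N' --input-null-non-string '\\N' \
  --m 1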