Read and transform Parquet files in Cloud Data Fusion

I'm trying to ingest and transform a Parquet file in Cloud Data Fusion. I can see that I can ingest the Parquet file using the GCS plugin, but when I want to transform it using the Wrangler plugin I don't see any capability to do that. Does the Wrangler plugin have that ability at all, or should I consider another approach? By the way, I just deployed my pipeline to see whether I'm able to ingest the Parquet file from GCS, but I see this error in the logs:
java.lang.NoClassDefFoundError: org/xerial/snappy/Snappy
at org.apache.parquet.hadoop.codec.SnappyDecompressor.decompress(SnappyDecompressor.java:62) ~[parquet-hadoop-1.8.3.jar:1.8.3]
at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51) ~[parquet-hadoop-1.8.3.jar:1.8.3]
at java.io.DataInputStream.readFully(DataInputStream.java:195) ~[na:1.8.0_275]
at java.io.DataInputStream.readFully(DataInputStream.java:169) ~[na:1.8.0_275]
at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:204) ~[parquet-encoding-1.8.3.jar:1.8.3]
at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:89) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:72) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.column.Encoding$1.initDictionary(Encoding.java:90) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.column.Encoding$4.initDictionary(Encoding.java:149) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:343) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:82) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:77) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:270) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101) ~[parquet-column-1.8.3.jar:1.8.3]
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140) ~[parquet-hadoop-1.8.3.jar:1.8.3]
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214) ~[parquet-hadoop-1.8.3.jar:1.8.3]
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227) ~[parquet-hadoop-1.8.3.jar:1.8.3]
at io.cdap.plugin.format.parquet.input.PathTrackingParquetInputFormat$ParquetRecordReader.nextKeyValue(PathTrackingParquetInputFormat.java:76) ~[1614054281928-0/:na]
at io.cdap.plugin.format.input.PathTrackingInputFormat$TrackingRecordReader.nextKeyValue(PathTrackingInputFormat.java:136) ~[na:na]
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper.nextKeyValue(CombineFileRecordReaderWrapper.java:90) ~[hadoop-mapreduce-client-core-2.9.2.jar:na]
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:65) ~[hadoop-mapreduce-client-core-2.9.2.jar:na]
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214) ~[spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[spark-core_2.11-2.3.4.jar:2.3.4]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130) ~[spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:129) ~[spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:141) [spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.4.jar:2.3.4]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.4.jar:2.3.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_275]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_275]
Caused by: java.lang.ClassNotFoundException: org.xerial.snappy.Snappy
at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[na:1.8.0_275]
at io.cdap.cdap.common.lang.InterceptableClassLoader.findClass(InterceptableClassLoader.java:44) ~[na:na]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[na:1.8.0_275]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[na:1.8.0_275]
... 43 common frames omitted
Do I need to install a specific module on my cluster (is that even possible)?

What image version of Dataproc were you using?
It seems that some versions of Dataproc 2.0 didn't include the Snappy libraries under the Hadoop directories.
As a workaround, you can copy the Snappy jars into the following directories:
sudo cp /usr/lib/hive/lib/snappy-java-*.jar /usr/lib/hadoop-mapreduce/
sudo cp /usr/lib/hive/lib/snappy-java-*.jar /usr/lib/hadoop/lib
sudo cp /usr/lib/hive/lib/snappy-java-*.jar /usr/lib/hadoop-yarn/lib
This can be done in a Dataproc initialization action [1] script, configured through a Data Fusion compute profile as follows:
In Data Fusion, go to: System Admin -> Configuration -> System Compute Profiles -> (create new) -> Advanced Settings -> Initialization Actions.
Alternatively, for a deployed pipeline, go in Studio to: Configure -> Compute Config -> (select profile) -> Customize -> Advanced Settings -> Initialization Actions.
Note: these options are not available in Developer instances, only Basic/Enterprise.
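For reference, a minimal init-action sketch that just wraps the copy commands above (the script name and the bucket you upload it to are your own choice):

#!/bin/bash
# copy-snappy.sh -- Dataproc initialization action, runs on every cluster node.
# Makes the snappy-java jars shipped with Hive visible to Hadoop, MapReduce and YARN.
# Init actions run as root, so sudo is not required here.
set -euo pipefail
for dir in /usr/lib/hadoop-mapreduce /usr/lib/hadoop/lib /usr/lib/hadoop-yarn/lib; do
  cp /usr/lib/hive/lib/snappy-java-*.jar "${dir}/"
done

Upload the script to a GCS bucket and reference its gs:// path in the Initialization Actions field described above.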
[1] https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions

Related

OrientDB compatibility with Java17

We are using OrientDB version 3.0.10 and are planning to migrate our application from Java 11 to Java 17, but the application stops working due to an OrientDB exception. The same issue also exists on the latest version of OrientDB.
22-Mar-2022 12:35:15.140 ERROR [Catalina-utility-1] o.a.catalina.core.StandardContext.listenerStart - Exception sending context initialized event to listener instance of class [com.progress.fathom.servlet.FathomContextInitializer]
java.lang.ExceptionInInitializerError: null
at javassist.util.proxy.DefineClassHelper.<clinit>(DefineClassHelper.java:216)
at javassist.util.proxy.FactoryHelper.toClass(FactoryHelper.java:128)
at javassist.util.proxy.ProxyFactory.createClass3(ProxyFactory.java:552)
at javassist.util.proxy.ProxyFactory.createClass2(ProxyFactory.java:537)
at javassist.util.proxy.ProxyFactory.createClass1(ProxyFactory.java:473)
at javassist.util.proxy.ProxyFactory.createClass(ProxyFactory.java:444)
at com.orientechnologies.orient.object.enhancement.OObjectEntityEnhancer.getProxiedInstance(OObjectEntityEnhancer.java:97)
at com.orientechnologies.orient.object.db.OObjectDatabaseTx.newInstance(OObjectDatabaseTx.java:211)
at com.orientechnologies.orient.object.db.OObjectDatabaseTx.newInstance(OObjectDatabaseTx.java:115)
at com.orientechnologies.orient.object.enhancement.OObjectEntitySerializer.serializeObject(OObjectEntitySerializer.java:146)
at com.orientechnologies.orient.object.db.OObjectDatabaseTx.save(OObjectDatabaseTx.java:499)
at com.orientechnologies.orient.object.db.OObjectDatabaseTx.save(OObjectDatabaseTx.java:444)
at com.progress.isq.ipqos.AbstractDatabase.updateDatabaseSchemaVersion(AbstractDatabase.java:455)
at com.progress.isq.ipqos.AbstractDatabase.checkAndUpdateSchema(AbstractDatabase.java:481)
at com.progress.isq.ipqos.AbstractDatabase.applySchema(AbstractDatabase.java:399)
at com.progress.isq.ipqos.AbstractDatabase.openDatabase(AbstractDatabase.java:374)
at com.progress.isq.ipqos.AbstractDatabase.open(AbstractDatabase.java:296)
at com.progress.isq.ipqos.Probe.loadProject(Probe.java:1817)
at com.progress.isq.ipqos.Probe.loadAndInitializeProjectResources(Probe.java:1195)
at com.progress.isq.ipqos.Probe.createAndInitializeBaselineResources(Probe.java:678)
at com.progress.isq.ipqos.Probe.start(Probe.java:391)
at com.progress.fathom.servlet.FathomContextInitializer.contextInitialized(FathomContextInitializer.java:298)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4768)
at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5230)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:726)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:698)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:696)
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:1024)
at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1911)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at org.apache.tomcat.util.threads.InlineExecutorService.execute(InlineExecutorService.java:75)
at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:825)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:475)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1618)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:319)
at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:123)
at org.apache.catalina.util.LifecycleBase.setStateInternal(LifecycleBase.java:423)
at org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:366)
at org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:946)
at org.apache.catalina.core.StandardHost.startInternal(StandardHost.java:835)
at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1396)
at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1386)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make protected final java.lang.Class java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain) throws java.lang.ClassFormatError accessible: module java.base does not "opens java.lang" to unnamed module #46762e3e
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Method.checkCanSetAccessible(Method.java:199)
at java.base/java.lang.reflect.Method.setAccessible(Method.java:193)
at javassist.util.proxy.SecurityActions$3.run(SecurityActions.java:94)
at javassist.util.proxy.SecurityActions$3.run(SecurityActions.java:90)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:569)
at javassist.util.proxy.SecurityActions.getMethodHandle(SecurityActions.java:89)
at javassist.util.proxy.DefineClassHelper$SecuredPrivileged$2.getDefineClassMethodHandle(DefineClassHelper.java:143)
at javassist.util.proxy.DefineClassHelper$SecuredPrivileged$2.<init>(DefineClassHelper.java:136)
at javassist.util.proxy.DefineClassHelper$SecuredPrivileged.<clinit>(DefineClassHelper.java:134)
... 52 common frames omitted
To fix this we have added --add-opens=java.base/java.lang=ALL-UNNAMED to our application as a mitigation, but we know that this is not the proper fix for the issue; OrientDB needs to come up with a fix, I think.
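For anyone else hitting this under Tomcat, a minimal sketch of how the flag can be passed via CATALINA_OPTS (assuming a standard Catalina layout; adjust to your own startup scripts):

# $CATALINA_BASE/bin/setenv.sh -- sourced by catalina.sh on startup
CATALINA_OPTS="$CATALINA_OPTS --add-opens=java.base/java.lang=ALL-UNNAMED"
export CATALINA_OPTS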
Another thing: OrientDB has already added other Java packages to Add-Opens in the OrientDB jar's manifest file:
Manifest-Version: 1.0
Implementation-Title: OrientDB Object
Implementation-Version: 3.2.5
Built-By: tglman
Specification-Vendor: Orient Technologies
Specification-Title: OrientDB Object
Implementation-Vendor-Id: com.orientechnologies
Implementation-Vendor: Orient Technologies
Implementation-Build-Date: 2022-02-14 18:19:43+0000
Add-Opens: jdk.unsupported/sun.misc=ALL-UNNAMED java.base/sun.security.x509=ALL-UNNAMED
X-Compile-Target-JDK: 8
Implementation-Build: c4298657c01683192ba0b7bfffdf82226c164506
X-Compile-Source-JDK: 8
Created-By: Apache Maven 3.6.3
Build-Jdk: 1.8.0_312
Specification-Version: 3.2
Implementation-URL: http://www.orientechnologies.com
Does OrientDB officially support Java 17, and is there any plan to fix this issue?

AWS GLUE: Cassandra connection using SSL is not working

I wanted to connect to Cassandra using Spark. When connecting to Cassandra using the default port it works, but when I try accessing it via SSL the job fails. Below is the code:
val spark: SparkSession = SparkSession.builder()
.config("spark.cassandra.connection.host","server.abc")
.config("spark.cassandra.connection.port","9142")
.config("spark.cassandra.connection.ssl.enabled",true)
.config("spark.cassandra.connection.ssl.trustStore.path","s3:/dev-code/certs/trust.jks")
.config("spark.cassandra.connection.ssl.trustStore.password","mypass")
.config("spark.cassandra.auth.username","myuser")
.config("spark.cassandra.auth.password","userpass")
.appName("CassandraIntegration").getOrCreate()
FYI: the job has access to the S3 bucket; I am able to read a CSV file from the same location. Also, both ports, 9042 and 9142, are enabled. I closed 9042 and kept only port 9142, but the error still persists.
Below is the error:
ERROR [main] glue.ProcessLauncher (Logging.scala:logError(94)): Exception in User Class
java.io.IOException: Failed to open native connection to Cassandra at {server.abc:9142} :: Error instantiating class com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory (specified by advanced.ssl-engine-factory.class): Cannot initialize SSL Context
at com.datastax.spark.connector.cql.CassandraConnector$.createSession(CassandraConnector.scala:173)
at com.datastax.spark.connector.cql.CassandraConnector$.$anonfun$sessionCache$1(CassandraConnector.scala:161)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:32)
at com.datastax.spark.connector.cql.RefCountedCache.syncAcquire(RefCountedCache.scala:69)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:57)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:103)
at com.datastax.spark.connector.datasource.CassandraCatalog$.com$datastax$spark$connector$datasource$CassandraCatalog$$getMetadata(CassandraCatalog.scala:455)
at com.datastax.spark.connector.datasource.CassandraCatalog$.getTableMetaData(CassandraCatalog.scala:421)
at org.apache.spark.sql.cassandra.DefaultSource.getTable(DefaultSource.scala:68)
at org.apache.spark.sql.cassandra.DefaultSource.inferSchema(DefaultSource.scala:72)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:296)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
at MyCsvToCassandrsJob$.main(csv-to-cassanra-job:63)
at MyCsvToCassandrsJob.main(csv-to-cassanra-job-job)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke(ProcessLauncher.scala:47)
at com.amazonaws.services.glue.SparkProcessLauncherPlugin.invoke$(ProcessLauncher.scala:47)
at com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:75)
at com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:123)
at com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:29)
at com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)
Caused by: java.lang.IllegalArgumentException: Error instantiating class com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory (specified by advanced.ssl-engine-factory.class): Cannot initialize SSL Context
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:253)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:108)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.buildSslEngineFactory(DefaultDriverContext.java:414)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.lambda$new$4(DefaultDriverContext.java:279)
at com.datastax.oss.driver.internal.core.util.concurrent.LazyReference.get(LazyReference.java:55)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.getSslEngineFactory(DefaultDriverContext.java:733)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.buildSslHandlerFactory(DefaultDriverContext.java:470)
at com.datastax.oss.driver.internal.core.util.concurrent.LazyReference.get(LazyReference.java:55)
at com.datastax.oss.driver.internal.core.context.DefaultDriverContext.getSslHandlerFactory(DefaultDriverContext.java:799)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.init(DefaultSession.java:348)
at com.datastax.oss.driver.internal.core.session.DefaultSession$SingleThreaded.access$1100(DefaultSession.java:300)
at com.datastax.oss.driver.internal.core.session.DefaultSession.lambda$init$0(DefaultSession.java:146)
at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at com.datastax.oss.driver.shaded.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
at com.datastax.oss.driver.shaded.netty.channel.DefaultEventLoop.run(DefaultEventLoop.java:54)
at com.datastax.oss.driver.shaded.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at com.datastax.oss.driver.shaded.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Cannot initialize SSL Context
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.<init>(DefaultSslEngineFactory.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.datastax.oss.driver.internal.core.util.Reflection.buildFromConfig(Reflection.java:246)
... 18 more
Caused by: java.nio.file.NoSuchFileException: s3:/dev-code/certs/trust.jks
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.buildContext(DefaultSslEngineFactory.java:119)
at com.datastax.oss.driver.internal.core.ssl.DefaultSslEngineFactory.<init>(DefaultSslEngineFactory.java:72)
... 23 more
It would be a big help if there is any workaround for this problem.
At the bottom of your error message, I see this:
NoSuchFileException: s3:/dev-code/certs/trust.jks
Alex is right, in that you need to provide a path to that file that the Spark connector can actually get to. From the looks of it, S3 won't work here.
I added the .jks file from S3 into the "Referenced files path" of the Glue job and then accessed it by providing just the file name, as the file is automatically placed under the /tmp folder. But that alone still did not solve the issue.
From this website, I understood that we need to provide all the default values as well.
Below is my final code:
val spark: SparkSession = SparkSession.builder()
.config("spark.cassandra.connection.host","server.abc")
.config("spark.cassandra.connection.port","9142")
.config("spark.cassandra.connection.ssl.enabled",true)
.config("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.config("spark.cassandra.connection.ssl.trustStore.path","trust.jks")
.config("spark.cassandra.connection.ssl.trustStore.password","mypass")
.config("spark.cassandra.connection.ssl.trustStore.type","JKS")
.config("spark.cassandra.connection.ssl.protocol","TLS")
.config("spark.cassandra.auth.username","myuser")
.config("spark.cassandra.auth.password","userpass")
.appName("CassandraIntegration").getOrCreate()

Failure while Loading Parquet file to Azure Synapse

I load pre-generated Parquet files (present in Azure Blob Storage) into Azure Synapse tables every day. The job ran successfully until yesterday, but all of a sudden it's failing.
Note: a test copy job executes successfully when the Parquet file is loaded to a CSV in the same Blob Storage.
Error message while loading the Parquet file to Synapse:
Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message:
java.io.IOException:Error reading summaries total entry:6
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:190)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:112)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:45)
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:202)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:62)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)
java.util.concurrent.ExecutionException:java.lang.ExceptionInInitializerError total entry:9
java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
org.apache.parquet.hadoop.ParquetFileReader.runAllInParallel(ParquetFileReader.java:227)
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:185)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:112)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:45)
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:202)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:62)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)
java.lang.ExceptionInInitializerError:null total entry:24
org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
org.apache.hadoop.security.Groups.<init>(Groups.java:86)
org.apache.hadoop.security.Groups.<init>(Groups.java:66)
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2753)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2745)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2611)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
org.apache.parquet.hadoop.ParquetFileReader.readSummaryMetadata(ParquetFileReader.java:360)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:158)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:155)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
java.base/java.lang.Thread.run(Thread.java:844)
java.lang.StringIndexOutOfBoundsException:begin 0, end 3, length 1 total entry:27
java.base/java.lang.String.checkBoundsBeginEnd(String.java:3116)
java.base/java.lang.String.substring(String.java:1885)
org.apache.hadoop.util.Shell.<clinit>(Shell.java:49)
org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
org.apache.hadoop.security.Groups.<init>(Groups.java:86)
org.apache.hadoop.security.Groups.<init>(Groups.java:66)
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2753)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2745)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2611)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
org.apache.parquet.hadoop.ParquetFileReader.readSummaryMetadata(ParquetFileReader.java:360)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:158)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:155)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
java.base/java.lang.Thread.run(Thread.java:844)
.,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,

Spark 3 stream job fails with Cannot run program "chmod"

Spark 3.0 on Kubernetes, reading data from Kafka and pushing data out via a third-party Segment IO REST API.
I am facing the error below while running a Spark streaming job:
Caused by: java.io.IOException: Cannot run program "chmod": error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
at org.apache.hadoop.util.Shell.run(Shell.java:901)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:865)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:252)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:232)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:351)
at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1228)
at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:100)
at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353)
at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400)
at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:605)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:696)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:692)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.create(FileContext.java:698)
at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:310)
at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:133)
at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:136)
at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createAtomic(CheckpointFileManager.scala:316)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.writeBatchToFile(HDFSMetadataLog.scala:131)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.$anonfun$add$3(HDFSMetadataLog.scala:120)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:118)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:588)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:598)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:585)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:223)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:352)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:350)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:191)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:334)
... 1 more
Caused by: java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
Check your PATH environment variable.
(Maybe you overrode it to add some Spark/Kafka jars to the path?)
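To verify what the containers actually see, something along these lines can help (the pod name is a placeholder):

# Check that chmod resolves on PATH inside a running executor pod; ulimit -u is
# worth a look too, since error=11 on fork can also mean the process limit was hit.
kubectl exec my-spark-executor-1 -- sh -c 'echo "$PATH"; command -v chmod; ulimit -u'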

Problems with Kafka Source initialization in Siddhi

I can't create a stream from a Kafka topic using Siddhi, even if I create the stream with Design View.
I copied all required jars to the lib and bundle folders, and even started Kafka with ZooKeeper locally (I don't know why I need it locally, but never mind).
On tooling.sh start I have the following error:
[2020-02-26 22:15:43,041] WARNING {org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils lambda$getBundlesInfo$1} - Error when loading the OSGi bundle information from /home/Hed/StreamProcessor/siddhi-tooling-5.1.2/lib/kafka-clients-2.3.0.jar
java.io.IOException: Required bundle manifest headers do not exist
at org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils.getBundleInfo(OSGiLibBundleDeployerUtils.java:183)
at org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils.lambda$getBundlesInfo$1(OSGiLibBundleDeployerUtils.java:135)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
at java.util.stream.StreamSpliterators$DistinctSpliterator.forEachRemaining(StreamSpliterators.java:1291)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
For this script:
#App:name("HelloKafka")
#App:description('Consume events from a Kafka Topic and publish to a different Kafka Topic')
#source(type='kafka',
topic.list='kafka_topic',
partition.no.list='0',
threading.option='single.thread',
group.id="group",
bootstrap.servers='localhost:9092',
#map(type='json'))
define stream SweetProductionStream (name string, amount double);
I see this error on the Run command:
io.siddhi.core.exception.SiddhiAppCreationException: Error on 'HelloKafka' # Line: 10. Position: 26, near '#source(type='kafka',
topic.list='kafka_topic',
partition.no.list='0',
threading.option='single.thread',
group.id="group",
bootstrap.servers='localhost:9092',
#map(type='json'))'. org/apache/kafka/clients/producer/Producer
at io.siddhi.core.util.ExceptionUtil.populateQueryContext(ExceptionUtil.java:43)
at io.siddhi.core.util.parser.helper.DefinitionParserHelper.addEventSource(DefinitionParserHelper.java:388)
at io.siddhi.core.util.SiddhiAppRuntimeBuilder.defineStream(SiddhiAppRuntimeBuilder.java:117)
at io.siddhi.core.util.parser.SiddhiAppParser.defineStreamDefinitions(SiddhiAppParser.java:374)
at io.siddhi.core.util.parser.SiddhiAppParser.parse(SiddhiAppParser.java:230)
at io.siddhi.core.SiddhiManager.createSiddhiAppRuntime(SiddhiManager.java:85)
at io.siddhi.core.SiddhiManager.createSiddhiAppRuntime(SiddhiManager.java:95)
at io.siddhi.distribution.editor.core.internal.DebugRuntime.createRuntime(DebugRuntime.java:201)
at io.siddhi.distribution.editor.core.internal.DebugRuntime.<init>(DebugRuntime.java:56)
at io.siddhi.distribution.editor.core.internal.DebugProcessorService.start(DebugProcessorService.java:38)
at io.siddhi.distribution.editor.core.internal.EditorMicroservice.start(EditorMicroservice.java:761)
at io.siddhi.distribution.editor.core.internal.EditorMicroservice.startWithVariables(EditorMicroservice.java:781)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invokeResource(HttpMethodInfo.java:187)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invoke(HttpMethodInfo.java:143)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.dispatchMethod(MSF4JHttpConnectorListener.java:218)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.lambda$onMessage$58(MSF4JHttpConnectorListener.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/Producer
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at io.siddhi.core.util.SiddhiClassLoader.loadClass(SiddhiClassLoader.java:32)
at io.siddhi.core.util.SiddhiClassLoader.loadExtensionImplementation(SiddhiClassLoader.java:48)
at io.siddhi.core.util.parser.helper.DefinitionParserHelper.addEventSource(DefinitionParserHelper.java:346)
... 21 more
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.clients.producer.Producer cannot be found by siddhi-io-kafka_5.0.7
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:448)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:361)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:353)
at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:161)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 28 more
Can somebody tell me what I am doing wrong? :(
Please make sure you have added the OSGi-converted jars to "C:\Program Files\WSO2\Enterprise Integrator\7.0.2\streaming-integrator\lib".
The OSGi-converted jar list:
kafka_2.12_2.3.0_1.0.0
kafka_clients_2.3.0_1.0.0
metrics_core_2.2.0_1.0.0
scala_library_2.12.8_1.0.0
zkclient_0.11_1.0.0
zookeeper_3.4.14_1.0.0
Then, copy the original jars to "C:\Program Files\WSO2\Enterprise Integrator\7.0.2\streaming-integrator\samples\sample-clients\lib".
The list of original jars:
kafka_2.12-2.3.0
kafka-clients-2.3.0
metrics-core-2.2.0
scala-library-2.12.8
zkclient-0.11
zookeeper-3.4.14
In order to generate the OSGi-converted jars, copy all original jars to a folder called "source" and create an empty folder called "destination". Then run the following command in the terminal:
MINGW32 /c/Program Files/WSO2/Enterprise Integrator/7.0.2/streaming-integrator/bin
$ ./jartobundle.sh C:/DevTools/source C:/DevTools/destination
Finally, distribute the OSGi-converted and the original jars in accordance with the directories above.
PS1: in my case I am using kafka_2.12-2.4.1, but the basename of the jars does not change.
PS2: adapt the directories to your installation path.
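Putting the whole sequence together as one sketch from the MINGW shell (paths are the examples above; adapt them per PS2):

# Convert the Kafka client jars to OSGi bundles, then distribute both sets.
SI_HOME="/c/Program Files/WSO2/Enterprise Integrator/7.0.2/streaming-integrator"
mkdir -p /c/DevTools/source /c/DevTools/destination
cp kafka_2.12-2.3.0.jar kafka-clients-2.3.0.jar metrics-core-2.2.0.jar \
   scala-library-2.12.8.jar zkclient-0.11.jar zookeeper-3.4.14.jar /c/DevTools/source/
"$SI_HOME/bin/jartobundle.sh" C:/DevTools/source C:/DevTools/destination
cp /c/DevTools/destination/*.jar "$SI_HOME/lib/"                   # OSGi-converted jars
cp /c/DevTools/source/*.jar "$SI_HOME/samples/sample-clients/lib/" # original jars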
For more details check WSO2 documentation: Kafka transport