Using the filter transform in Debezium 1.5 - debezium

I want to filter some records in a source connector that I created.
The image contains debezium-scripting-1.5.0.Beta1.jar in /kafka/connect/debezium-connector-mysql (enabled using the property "ENABLE_DEBEZIUM_SCRIPTING=true" on the connect-base image).
My connector has the following properties:
"transforms": "filter",
"transforms.filter.type": "io.debezium.transforms.Filter"
When registering my source connector, it fails to configure the connector with the following stack trace:
Caused by: java.lang.ClassNotFoundException: io.debezium.config.EnumeratedValue
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
The message is clear: the EnumeratedValue class (which is part of the Debezium core package) cannot be found.
But when I change the "transforms" property to "unwrap", the error goes away.

This method worked for me.
I have a <some-path>/debezium/debezium-connector-mongo/ folder, and
plugin.path=<some-path>/debezium
When I start the connector, it works fine.
However, when I add the debezium-scripting plugin in its own <some-path>/debezium/debezium-script/ folder (under the same plugin.path), the error appears.
So I moved everything from <some-path>/debezium/debezium-script/ into <some-path>/debezium/debezium-connector-mongo/ (merging the two folders into one), and that worked.
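For reference, here is a minimal sketch (in Python, against the Kafka Connect REST API) of registering a connector with a complete filter SMT configuration; the connector name, database properties, and filter condition are hypothetical, and it assumes the scripting jars (plus a JSR-223 Groovy implementation) sit in the same plugin folder as the connector jar, e.g. /kafka/connect/debezium-connector-mysql:
# Hedged sketch: hypothetical connector name and filter condition.
# The Filter SMT needs a language and a condition in addition to the type.
import json
import requests
connector = {
    "name": "inventory-connector",  # hypothetical
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        # ... database.* connection properties omitted ...
        "transforms": "filter",
        "transforms.filter.type": "io.debezium.transforms.Filter",
        "transforms.filter.language": "jsr223.groovy",
        "transforms.filter.condition": "value.op == 'c'",  # e.g. keep only inserts
    },
}
resp = requests.post(
    "http://localhost:8083/connectors",  # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code, resp.text)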

Related

databricks-connect, py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache

The connection to Databricks works fine, and working with DataFrames goes smoothly (operations like join, filter, etc.).
The problem appears when I call cache on a DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache.
: java.io.InvalidClassException: failed to read class descriptor
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$client53442a94a3$$anonfun$mapPartitions$1$$anonfun$apply$23
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:257)
at org.apache.spark.sql.util.ProtoSerializer.org$apache$spark$sql$util$ProtoSerializer$$readResolveClassDescriptor(ProtoSerializer.scala:4316)
at org.apache.spark.sql.util.ProtoSerializer$$anon$4.readClassDescriptor(ProtoSerializer.scala:4304)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1857)
... 71 more
I work with Java 8 as required; clearing pycache doesn't help.
The same code submitted as a job to Databricks works fine.
It looks like a local problem at the Python-JVM bridge level, but the Java version (8) and Python version (3.7) are as required. Switching to Java 13 produces much the same message.
Versions: databricks-connect==6.2.0, openjdk version "1.8.0_242", Python 3.7.6
EDIT:
The behavior depends on how the DataFrame is created: if the source of the DataFrame is external, it works fine; if the DataFrame is created locally, the error appears.
# works fine
df = spark.read.csv("dbfs:/some.csv")
df.cache()
# ERROR in 'cache' line
df = spark.createDataFrame([("a",), ("b",)])
df.cache()
This is a known issue, and I think a recent patch fixed it. It was seen on Azure; I am not sure whether you are using Azure or AWS, but it has been resolved. Please check the issue: https://github.com/MicrosoftDocs/azure-docs/issues/52431

WSO2 Integration Studio v6.5.0's built-in Kafka template throws NoClassDefFoundError

I've installed WSO2 Integration Studio version 6.5.0 on my Windows workstation and created a project using the Kafka Consumer and Producer built-in template.
Then I configured the project with my own Kafka server settings (topic name "myTopic").
I then right-clicked the composite application and chose Export Project Artifacts and Run.
The Console window displayed the following messages at the very top:
[2019-06-25 09:23:45,499] [micro-integrator] INFO - LibraryArtifactDeployer Synapse Library named '{org.wso2.carbon.connector}kafkaTransport' has been deployed from file : C:\IntegrationStudio\runtime\microesb\tmp\carbonapps\-1234\1561465425230TestCompositeApplication_1.0.0.car\kafkaTransport-connector_2.0.6\kafkaTransport-connector-2.0.6.zip
[2019-06-25 09:23:45,517] [micro-integrator] INFO - SynapseImportFactory Successfully created Synapse Import: kafkaTransport
[2019-06-25 09:23:45,533] [micro-integrator] ERROR - ClassMediatorFactory
Error in instantiating class :
org.wso2.carbon.connector.KafkaProduceConnector
java.lang.NoClassDefFoundError: org/apache/kafka/common/header/Headers
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
[snipped rest for clarity]
I've tried uninstalling Integration Studio and running it with elevated rights, to no avail.
I expected the project to be deployed normally.
EDIT: after copying:
kafka_2.11-2.2.1.jar
metrics-core-2.2.0.jar
zkclient-0.11.jar
kafka-clients-2.2.1.jar
scala-library-2.11.12.jar
zookeeper-3.4.13.jar
to the EI_HOME/lib directory, the exception changed to:
org.apache.axis2.deployment.DeploymentException: kafka/consumer/ConsumerTimeoutException
at org.apache.synapse.deployers.AbstractSynapseArtifactDeployer.deploy(AbstractSynapseArtifactDeployer.java:219)
at org.wso2.carbon.application.deployer.synapse.SynapseAppDeployer.deployArtifactType(SynapseAppDeployer.java:1099)
at org.wso2.carbon.application.deployer.synapse.SynapseAppDeployer.deployArtifacts(SynapseAppDeployer.java:114)
at org.wso2.carbon.application.deployer.internal.ApplicationManager.deployCarbonApp(ApplicationManager.java:272)
at org.wso2.carbon.application.deployer.CappAxis2Deployer.deploy(CappAxis2Deployer.java:72)
at org.apache.axis2.deployment.repository.util.DeploymentFileData.deploy(DeploymentFileData.java:136)
at org.apache.axis2.deployment.DeploymentEngine.doDeploy(DeploymentEngine.java:807)
at org.apache.axis2.deployment.repository.util.WSInfoList.update(WSInfoList.java:144)
[snipped for clarity]
Caused by: org.apache.axis2.deployment.DeploymentException: kafka/consumer/ConsumerTimeoutException
at org.apache.synapse.deployers.AbstractSynapseArtifactDeployer.deploy(AbstractSynapseArtifactDeployer.java:207)
... 87 more
Caused by: java.lang.NoClassDefFoundError: kafka/consumer/ConsumerTimeoutException
at org.wso2.carbon.inbound.endpoint.protocol.kafka.KAFKAPollingConsumer.startsMessageListener(KAFKAPollingConsumer.java:90)
at org.wso2.carbon.inbound.endpoint.protocol.kafka.KAFKAProcessor.init(KAFKAProcessor.java:96)
at org.apache.synapse.inbound.InboundEndpoint.init(InboundEndpoint.java:79)
at org.apache.synapse.deployers.InboundEndpointDeployer.deploySynapseArtifact(InboundEndpointDeployer.java:57)
at org.apache.synapse.deployers.AbstractSynapseArtifactDeployer.deploy(AbstractSynapseArtifactDeployer.java:197)
... 87 more
Caused by: java.lang.ClassNotFoundException: kafka.consumer.ConsumerTimeoutException cannot be found by synapse-core_2.1.7.wso2v111
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:475)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:421)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:412)
at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 92 more
Have you copied the required jars from the kafka_home/libs folder to EI_home/lib? If yes, then please share your code so the issue can be looked at in more detail.
According to this documentation, https://docs.wso2.com/display/EI650/Kafka+Inbound+Protocol, the recommended Kafka version is kafka_2.9.2-0.8.1.1. You can download it from http://kafka.apache.org/downloads.html. Please use those jars and copy them to EI_HOME/lib. There is a GitHub issue for this as well: https://github.com/wso2/product-ei/issues/2239
I may be a bit too late, but we have been using the custom inbound endpoint for Kafka. We also faced exactly the same issue, and that was the only way to fix it.
You could use https://github.com/wso2-extensions/esb-inbound-kafka/blob/master/docs/config.md to configure it.

How to save data-frame in MySQL using PySpark

I am new to Apache Spark. I have a use case where I have to save DataFrame data in MySQL. I got the below code to do this:
data_frame.write.format('jdbc').options(
    url='URI',
    driver='com.mysql.jdbc.Driver',
    dbtable=table_name,
    user=user_name,
    password='your_password').mode('append').save()
But when I ran the code, I got the below error:
File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o207.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
I might be missing some very minute detail. How can I fix this?
The error description clearly indicates that Spark is not able to locate the JDBC driver class. You will have to include the JAR file for com.mysql.jdbc.Driver using
pyspark --jars <jar-file-location>
See this question - How to add third-party Java JAR files for use in PySpark.
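As a hedged alternative sketch (the path, URL, table, and credentials below are placeholders), the driver jar can also be attached when building the SparkSession through the spark.jars setting, so the class is on the classpath before the write runs:
# Sketch with placeholder path/URL/credentials: attach the MySQL connector jar
# via spark.jars so com.mysql.jdbc.Driver is resolvable at write time.
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("mysql-write")
    .config("spark.jars", "/path/to/mysql-connector-java.jar")  # placeholder path
    .getOrCreate()
)
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
(
    df.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/mydb")  # placeholder URL
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "my_table")  # placeholder table
    .option("user", "user")
    .option("password", "password")
    .mode("append")
    .save()
)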

Spark executor is throwing error "java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver"

I am trying to import a table from my Oracle database using Spark, and I am using Scala for the import.
My JDBC driver is ojdbc7.jar, and it is added to both spark.driver.extraClassPath and spark.executor.extraClassPath in the configuration file:
spark.driver.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
I can successfully import the table and print its schema. But when performing any operation like count() or show(), it throws the error below:
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 21 more
This error occurred because Spark was not able to locate ojdbc7.jar on every core node. Placing the jar in a shared location like /usr/lib/spark/jars resolves the issue.
You can also do a few other things, including adding the jar file's full path as an artifact dependency under the spark section of the interpreter.
If you just want %jdbc to work, update the jdbc section under the interpreter, add the jar file's full path as an artifact under the dependencies, and update default.driver, default.url, default.user, and default.password accordingly.
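For illustration, here is a minimal PySpark sketch of the same idea (the question uses Scala, but the mechanism is identical): supplying the Oracle driver jar through spark.jars ships it to the executors, which is where count() and show() actually need oracle.jdbc.OracleDriver; the URL, table, and credentials are placeholders:
# Sketch with placeholder JDBC URL/table/credentials. printSchema() only needs
# the driver class on the Spark driver, while count()/show() need it on the
# executors as well.
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("oracle-read")
    .config("spark.jars", "/home/hadoop/ojdbc7.jar")  # distributed to executors too
    .getOrCreate()
)
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # placeholder
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", "MY_SCHEMA.MY_TABLE")  # placeholder
    .option("user", "scott")  # placeholder
    .option("password", "tiger")  # placeholder
    .load()
)
df.printSchema()
print(df.count())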

Hive Table Creation Using MongoDB Hadoop Driver

I am trying to connect from a Hive database to a collection in MongoDB using the driver (jars) provided on the wiki site. Here are the steps I followed:
I created a collection in MongoDB called "Diamond" under a database called "Moe", and it has 20 documents.
I wanted to connect from Hive via the Hadoop MongoDB driver and view these documents.
I have both MongoDB and Hive installed and configured on the same server. However, I don't see any variable called HIVE_CLASSPATH; I wonder where that is.
So I installed 3 drivers on the server:
mongo-hadoop-core-1.5.2.jar;
mongo-hadoop-hive-1.5.2.jar;
mongo-java-driver-3.0.0.jar;
Now, I connect to Hive and then add these three jars to my classpath with the following commands (they get added successfully):
add jar /hadoopgdc/hadoop-2.6.0/share/hadoop/common/lib/mongo-hadoop-hive-1.5.2.jar;
add jar /hadoopgdc/hadoop-2.6.0/share/hadoop/common/lib/mongo-hadoop-core-1.5.2.jar;
add jar /hadoopgdc/hadoop-2.6.0/share/hadoop/common/lib/mongo-java-driver-3.0.0.jar;
Now I create a table in Hive:
CREATE TABLE Diamond
(
carat DOUBLE,
cut STRING,
color STRING,
clarity STRING,
depth DOUBLE,
table DOUBLE,
price DOUBLE,
xcord DOUBLE,
ycord DOUBLE,
zcord DOUBLE
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"carat":"carat","cut":"cut",
"color":"color", "clarity":"clarity", "depth":"depth", "table":"table",
"price":"price", "xcord":"x", "ycord":"y", "zcord":"z"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/Moe.Diamond');
However, when I execute the above command in Hive, I get the error below:
java.lang.NoClassDefFoundError: com/mongodb/util/JSON
at com.mongodb.hadoop.hive.BSONSerDe.initialize(BSONSerDe.java:110)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:210)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:268)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:261)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:587)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:573)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3784)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:256)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:155)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1355)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1139)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:945)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:756)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: com.mongodb.util.JSON
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 23 more
FAILED: Execution Error, return code -101 from
org.apache.hadoop.hive.ql.exec.DDLTask
I have tried the following:
- placing the jars in every possible directory, with no effect
- checking that the class that is supposed to be missing is in fact present in the jar file
- and yes, the MongoStorageHandler class is very much in the jar as well.
I am done breaking my head over this! If anyone can shed some light on what I could do to alleviate my anxiety, it would be great.
Thanks again.
Mario
I identified what the issue was. To connect from Hive to MongoDB, the MongoDB driver invokes a Java class from a Hive jar library:
java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.hooks.PreExecutePrinter
Now, this class is supposed to be found in the jar file hive-exec-0.11.0.1.3.2.0-111.jar. However, it is available only in more recent versions of Hive, not older ones.
It is not available in 0.11.0.1.3.2.0-111 but is clearly present in 0.13.0.2.1.7.0-784.
The solution here was to connect to a version of Hive that is supported by the driver. MongoDB does state that its driver supports a certain version of Hadoop, but it doesn't drill down to the individual application (Hive / Sqoop).
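A quick, hedged way to check which jar actually contains a given class (and therefore whether your Hive version ships it) is to scan the jar entries; the directory and class name below are just the ones from this thread and should be adjusted to your installation:
# Sketch: list the jars in a directory that contain a given class.
# jar_dir and target are illustrative values, not fixed paths.
import os
import zipfile
jar_dir = "/usr/lib/hive/lib"  # illustrative Hive lib directory
target = "org/apache/hadoop/hive/ql/hooks/PreExecutePrinter.class"
for name in sorted(os.listdir(jar_dir)):
    if not name.endswith(".jar"):
        continue
    with zipfile.ZipFile(os.path.join(jar_dir, name)) as jar:
        if target in jar.namelist():
            print(f"{name} contains {target}")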