Integrating Confluent Schema Registry with Apache Atlas - apache-kafka

Problem Definition
I am trying to integrate the data that exists in Confluent Schema Registry with Apache Atlas. I have found plenty of links that talk about the possibility of this integration, but none of them give any technical detail on how it is actually done.
Question
Could anyone help me import the data (and metadata) from Schema Registry into Apache Atlas in real time? Is there a hook, an event listener, or something similar I could use to implement it?
Example
Here is what I have from Schema Registry:
{
"subject":"order-value",
"version":1,
"id":101,
"schema":"{\"type\":\"record\",\"name\":\"cart_closed\",\"namespace\":\"com.akbar.avro\",\"fields\":[{\"name\":\"_g\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"_s\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"_u\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"application_version\",\"type\":[\"int\",\"null\"],\"default\":null},{\"name\":\"client_time\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"event_fingerprint\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"os\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"php_session_id\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"platform\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"server_time\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"site\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"user_agent\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"payment_method_id\",\"type\":[\"int\",\"null\"],\"default\":null},{\"name\":\"page_view\",\"type\":[\"boolean\",\"null\"],\"default\":null},{\"name\":\"items\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"record\",\"name\":\"item\",\"fields\":[{\"name\":\"brand_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"category_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"discount\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"order_item_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"price\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"product_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"quantity\",\"type\":[\"int\",\"null\"],\"default\":null},{\"name\":\"seller_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"variant_id\",\"type\":[\"long\",\"null\"],\"default\":null}]}}},{\"name\":\"cart_id\",\"type\":[\"long\",\"null\"],\"default\":null}]}"
}
How can I import it into Apache Atlas?
What I have done
I checked the Schema Registry documentation, which shows an architecture diagram of Schema Registry using Kafka as its backing store.
So I decided to set the Kafka URL, but I did not find anywhere to set the Kafka configuration. I tried to change the atlas.kafka.bootstrap.servers
variable in atlas-application.properties. I have also tried to call import-kafka.sh from the hook-bin directory, but it wasn't successful.
Error log
2021-04-25 15:48:34,162 ERROR - [main:] ~ Thread Thread[main,5,main] died (NIOServerCnxnFactory$1:92)
org.apache.atlas.exception.AtlasBaseException: EmbeddedServer.Start: failed!
at org.apache.atlas.web.service.EmbeddedServer.start(EmbeddedServer.java:115)
at org.apache.atlas.Atlas.main(Atlas.java:133)
Caused by: java.lang.NullPointerException
at org.apache.atlas.util.BeanUtil.getBean(BeanUtil.java:36)
at org.apache.atlas.web.service.EmbeddedServer.auditServerStatus(EmbeddedServer.java:128)
at org.apache.atlas.web.service.EmbeddedServer.start(EmbeddedServer.java:111)
... 1 more
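One workaround, in the absence of a ready-made hook, is to read each subject from the Schema Registry REST API (or, for real-time updates, consume the _schemas topic that Schema Registry uses as its backing store) and push it into Atlas through the v2 REST API. Below is a minimal sketch of that idea; the avro_schema type name and its attributes are assumptions that would need a matching typedef in Atlas, and admin:admin is just the default Atlas login:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class SchemaRegistryToAtlasSync {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String REGISTRY = "http://localhost:8081"; // Schema Registry REST endpoint
    private static final String ATLAS = "http://localhost:21000";   // Atlas REST endpoint (default port)

    public static void main(String[] args) throws Exception {
        String subject = "order-value";

        // 1. Fetch the latest registered version of the subject; the response is the JSON
        //    shown above (subject, version, id, schema).
        HttpRequest get = HttpRequest.newBuilder()
                .uri(URI.create(REGISTRY + "/subjects/" + subject + "/versions/latest"))
                .GET()
                .build();
        String registryResponse = HTTP.send(get, HttpResponse.BodyHandlers.ofString()).body();

        // 2. Wrap it in an Atlas entity and POST it to the v2 entity API. The type and
        //    attribute names are placeholders; a real sync would parse the "schema" field
        //    out of registryResponse with a JSON library instead of storing it verbatim.
        String entity = "{\"entity\": {"
                + "\"typeName\": \"avro_schema\","
                + "\"attributes\": {"
                + "\"qualifiedName\": \"" + subject + "@schema-registry\","
                + "\"name\": \"" + subject + "\","
                + "\"description\": " + toJsonString(registryResponse)
                + "}}}";

        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest post = HttpRequest.newBuilder()
                .uri(URI.create(ATLAS + "/api/atlas/v2/entity"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(entity))
                .build();

        HttpResponse<String> response = HTTP.send(post, HttpResponse.BodyHandlers.ofString());
        System.out.println("Atlas responded with HTTP " + response.statusCode());
    }

    // Minimal JSON string escaping so the raw registry response can be stored as a text attribute.
    private static String toJsonString(String raw) {
        return "\"" + raw.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}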

Related

"SchemaRegistryException: Failed to get Kafka cluster ID" for LOCAL setup

I downloaded the tarball (I am on a Mac) for Confluent version 7.0.0 from the official Confluent site and was following the LOCAL (single-node) setup. Kafka and ZooKeeper start fine, but the Schema Registry keeps failing. (Note: I am behind a corporate VPN.)
The exception message in the SchemaRegistry logs is:
[2021-11-04 00:34:22,492] INFO Logging initialized #1403ms to org.eclipse.jetty.util.log.Slf4jLog (org.eclipse.jetty.util.log)
[2021-11-04 00:34:22,543] INFO Initial capacity 128, increased by 64, maximum capacity 2147483647. (io.confluent.rest.ApplicationServer)
[2021-11-04 00:34:22,614] INFO Adding listener: http://0.0.0.0:8081 (io.confluent.rest.ApplicationServer)
[2021-11-04 00:35:23,007] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryException: Failed to get Kafka cluster ID
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.kafkaClusterId(KafkaSchemaRegistry.java:1488)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.<init>(KafkaSchemaRegistry.java:166)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:71)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.configureBaseApplication(SchemaRegistryRestApplication.java:90)
at io.confluent.rest.Application.configureHandler(Application.java:271)
at io.confluent.rest.ApplicationServer.doStart(ApplicationServer.java:245)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:44)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:180)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.kafkaClusterId(KafkaSchemaRegistry.java:1486)
... 7 more
My schema-registry.properties file has bootstrap URL set to
kafkastore.bootstrap.servers=PLAINTEXT://localhost:9092
I saw some posts saying it might be the Schema Registry failing to connect to the Kafka cluster URL, possibly because of the localhost address. I am fairly new to Kafka and basically just need this local setup to run a git repo that uses some Kafka topics, so my questions are:
How can I fix this? (I am behind a corporate VPN, but I figured that shouldn't affect a local setup.)
Do I even need the Schema Registry?
I ended up just going with the local Docker setup instead. The only change I had to make to the Docker Compose YAML was the schema-registry port (I changed it to 8082 or 8084, I don't remember exactly; just an unused port not already taken by another Confluent service listed in docker-compose.yaml), and my local setup is working fine now.
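For what it's worth, a quick way to check whether the broker itself is reachable and able to return a cluster ID (which is exactly the call Schema Registry times out on) is the Kafka AdminClient. A minimal sketch, assuming the broker really is on localhost:9092:
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ClusterIdCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // The broker address from kafkastore.bootstrap.servers (without the PLAINTEXT:// prefix)
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "10000");

        try (AdminClient admin = AdminClient.create(props)) {
            // If this times out, Schema Registry will fail the same way; if it prints an ID,
            // the "Failed to get Kafka cluster ID" error has some other cause.
            String clusterId = admin.describeCluster().clusterId().get();
            System.out.println("Connected, cluster ID = " + clusterId);
        }
    }
}
If this hangs, the VPN or listener configuration is the place to look rather than Schema Registry itself.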

Using the Beam Python SDK and PortableRunner to connect to Kafka with SSL

I have the code below for connecting to Kafka using the Python Beam SDK. I know that the ReadFromKafka transform is run in a Java SDK harness (Docker container), but I have not been able to figure out how to make ssl.truststore.location and ssl.keystore.location accessible inside the SDK harness's Docker environment. The job_endpoint argument points to java -jar beam-runners-flink-1.10-job-server-2.27.0.jar --flink-master localhost:8081
pipeline_args.extend([
    '--job_name=paul_test',
    '--runner=PortableRunner',
    '--sdk_location=container',
    '--job_endpoint=localhost:8099',
    '--streaming',
    "--environment_type=DOCKER",
    f"--sdk_harness_container_image_overrides=.*java.*,{my_beam_sdk_docker_image}:{my_beam_docker_tag}",
])
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
    kafka = pipeline | ReadFromKafka(
        consumer_config={
            "bootstrap.servers": "bootstrap-server:17032",
            "security.protocol": "SSL",
            "ssl.truststore.location": "/opt/keys/client.truststore.jks",  # how do I make this available to the Java SDK harness
            "ssl.truststore.password": "password",
            "ssl.keystore.type": "PKCS12",
            "ssl.keystore.location": "/opt/keys/client.keystore.p12",  # how do I make this available to the Java SDK harness
            "ssl.keystore.password": "password",
            "group.id": "group",
            "basic.auth.credentials.source": "USER_INFO",
            "schema.registry.basic.auth.user.info": "user:password"
        },
        topics=["topic"],
        max_num_records=2,
        # expansion_service="localhost:56938"
    )
    kafka | beam.Map(lambda x: print(x))
I tried specifying the image override option as --sdk_harness_container_image_overrides='.*java.*,beam_java_sdk:latest', where beam_java_sdk:latest is a Docker image I based on apache/beam_java11_sdk:2.27.0 and that pulls the credentials in its entrypoint.sh. But Beam does not appear to use it; I see
INFO org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory - Still waiting for startup of environment apache/beam_java11_sdk:2.27.0 for worker id 1-1
in the logs, which is soon inevitably followed by
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /opt/keys/client.keystore.p12 of type PKCS12
In conclusion, my question is this: in Apache Beam, is it possible to make files available inside the Java SDK harness Docker container from the Python Beam SDK? If so, how might it be done?
Many thanks.
Currently, there is no straightforward way to achieve this. There are ongoing discussions and tracking issues about supporting this kind of expansion service customization (see here, here, BEAM-12538 and BEAM-12539). That is the short answer.
The long answer is yes, you can do that. You would have to copy and paste ExpansionService.java into your codebase and build your own custom expansion service, where you specify the default environment (DOCKER) and the default environment config (your image) here. You then have to run this expansion service manually and specify its address using the expansion_service parameter of ReadFromKafka.

Apache Camel Kafka support for Confluent schema Registry

I am trying to create a Camel route with the Kafka component that consumes events using io.confluent.kafka.serializers.KafkaAvroDeserializer and a schema registry URL, along with other component parameters. I am not sure whether this is fully supported by Camel-Kafka currently. Can someone please comment on this?
from("kafka:{{kafka.notification.topic}}?brokers={{kafka.notification.brokers}}"
+ "&maxPollRecords={{kafka.notification.maxPollRecords}}"
+ "&seekTo={{kafka.notification.seekTo}}"
+ "&specificAvroReader=" + "true"
+ "&valueDeserializer=" + "io.confluent.kafka.serializers.KafkaAvroDeserializer"
+"&schemaRegistryURL=localhost:9021"
+ "&allowManualCommit={{kafka.notification.autocommit}})
specificAvroReader & schemaRegistryURL are the properties which seems to be not supported.
I believe the only way currently to get camel-kafka to work with the Confluent Schema Registry is to write a custom AvroSerializer/AvroDeserializer (extending io.confluent.kafka.serializers.AbstractKafkaAvroSerializer / io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer). E.g.:
BlablaDeserializer extends AbstractKafkaAvroDeserializer implements Deserializer<Object>
and
BlablaSerializer extends AbstractKafkaAvroSerializer implements Serializer<Object>
and then set them on the Camel Kafka component, e.g. for the value it will be:
kafkaConfiguration.setValueDeserializer(...)  // on an org.apache.camel.component.kafka.KafkaConfiguration instance
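For reference, a sketch of such a deserializer (class name taken from the answer above; the exact Confluent base-class API may differ slightly between versions, so treat this as an approximation):
import java.util.HashMap;
import java.util.Map;
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import org.apache.kafka.common.serialization.Deserializer;

public class BlablaDeserializer extends AbstractKafkaAvroDeserializer implements Deserializer<Object> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Camel versions without the schemaRegistryURL option will not forward the registry URL,
        // so it may have to be supplied here by hand (the URL below is an assumption).
        Map<String, Object> withRegistry = new HashMap<>(configs);
        withRegistry.putIfAbsent("schema.registry.url", "http://localhost:8081");
        configure(new KafkaAvroDeserializerConfig(withRegistry));
    }

    @Override
    public Object deserialize(String topic, byte[] data) {
        return deserialize(data); // delegates to AbstractKafkaAvroDeserializer
    }

    @Override
    public void close() {
    }
}
It can then be registered on the component, e.g. kafkaConfiguration.setValueDeserializer(BlablaDeserializer.class.getName()).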
I got this working after adding this to Gradle:
compile 'org.apache.camel:camel-kafka:3.0.0-M2'
which can be found in this staging repository https://repository.apache.org/content/repositories/orgapachecamel-1124/org/apache/camel/
I think 3.0.0-M2 will be officially supported by Camel early next week.
Edit: 3.0.0-M2 is available now: https://repository.apache.org/content/repositories/releases/org/apache/camel/apache-camel/3.0.0-M2/
It has support for Camel Kafka and the Confluent Schema Registry.
As already answered, this has been solved in the meantime. You do not have to write your own serializer/deserializer.
I made a full example with camel-quarkus and the Confluent Schema Registry:
https://github.com/tstuber/camel-quarkus-kafka-schema-registry

kafka-connect distributed mode

I am trying to start kafka-connect using:
connect-distributed /etc/schema-registry/connect-avro-distributed.properties
but then I get:
[2017-02-15 12:45:35,962] INFO Instantiated task mysql-adventureworks-source-0 with version 3.1.2 of type io.confluent.connect.jdbc.source.JdbcSourceTask (org.apache.kafka.connect.runtime.Worker:264)
[2017-02-15 12:45:35,963] ERROR Failed to start task mysql-adventureworks-source-0 (org.apache.kafka.connect.runtime.Worker:280)
org.apache.kafka.common.config.ConfigException: Missing Schema registry url!
at io.confluent.connect.avro.AvroConverter.configure(AvroConverter.java:64)
at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:268)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:757)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startWork(DistributedHerder.java:750)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.handleRebalanceCompleted(DistributedHerder.java:708)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:204)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:174)
at java.lang.Thread.run(Thread.java:745)
The schema registry URL is present in the properties file. I have also tried to start it using:
connect-distributed /etc/kafka/connect-distributed.properties
which uses the JSON format, but I get the same error.
Any ideas?
Please start the Schema Registry service pointing at the right property file. It should work.
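For what it's worth, the stack trace shows the exception coming from AvroConverter.configure, which means the converter itself never received a schema.registry.url entry. Kafka Connect builds that converter configuration from the key.converter.* and value.converter.* prefixed properties in the worker file passed to connect-distributed. A minimal sketch of what the converter expects (illustrative only, not part of the original setup):
import java.util.HashMap;
import java.util.Map;
import io.confluent.connect.avro.AvroConverter;

public class AvroConverterConfigCheck {
    public static void main(String[] args) {
        Map<String, Object> config = new HashMap<>();
        // Removing this entry reproduces "ConfigException: Missing Schema registry url!"
        config.put("schema.registry.url", "http://localhost:8081");

        AvroConverter converter = new AvroConverter();
        converter.configure(config, false); // false = this instance converts record values
        System.out.println("AvroConverter configured");
    }
}
In the worker properties this corresponds to entries such as value.converter=io.confluent.connect.avro.AvroConverter together with value.converter.schema.registry.url=http://localhost:8081 (and the same pair for key.converter), which the Confluent-shipped connect-avro-distributed.properties normally already contains.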

Seam 2.2GA + JBoss AS 5.1GA + Postgres 8.4

Sorry for the big wall of text, but it's mostly logs.
Thanks for any help with any of my problems.
I've been trying to get help on the Seam forums, but in vain.
I'm trying the setup mentioned in the title, but unsuccessfully.
I have it all installed correctly, and the problems start with seam-gen.
This is my build.properties
#Generated by seam setup
#Sat Aug 29 19:12:18 BRT 2009
hibernate.connection.password=abc123
workspace.home=/home/rgoytacaz/workspace
hibernate.connection.dataSource_class=org.postgresql.ds.PGConnectionPoolDataSource
model.package=com.atom.Commerce.model
hibernate.default_catalog=PostgreSQL
driver.jar=/home/rgoytacaz/postgresql-8.4-701.jdbc4.jar
action.package=com.atom.Commerce.action
test.package=com.atom.Commerce.test
database.type=postgres
richfaces.skin=glassX
glassfish.domain=domain1
hibernate.default_schema=Core
database.drop=n
project.name=Commerce
hibernate.connection.username=postgres
glassfish.home=C\:/Program Files/glassfish-v2.1
hibernate.connection.driver_class=org.postgresql.Driver
hibernate.cache.provider_class=org.hibernate.cache.HashtableCacheProvider
jboss.domain=default
project.type=ear
icefaces.home=
database.exists=y
jboss.home=/srv/jboss-5.1.0.GA
driver.license.jar=
hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
hibernate.connection.url=jdbc\:postgresql\:Atom
icefaces=n
./seam create-project works okay, but when I try generate-entities, I get the following...
generate-model:
[echo] Reverse engineering database using JDBC driver /home/rgoytacaz/postgresql-8.4-701.jdbc4.jar
[echo] project=/home/rgoytacaz/workspace/Commerce
[echo] model=com.atom.Commerce.model
[hibernate] Executing Hibernate Tool with a JDBC Configuration (for reverse engineering)
[hibernate] 1. task: hbm2java (Generates a set of .java files)
[hibernate] log4j:WARN No appenders could be found for logger (org.hibernate.cfg.Environment).
[hibernate] log4j:WARN Please initialize the log4j system properly.
[javaformatter] Java formatting of 4 files completed. Skipped 0 file(s).
This is problem no. 1: what is this, and how do I fix it? (I had to do this step in Eclipse, where it worked.)
Then I imported the seam-gen created project into Eclipse and deployed it to JBoss 5.1. While my server starts, I noticed the following:
03:18:56,405 ERROR [SchemaUpdate] Unsuccessful: alter table PostgreSQL.atom.productsculturedetail add constraint FKBD5D849BC0A26E19 foreign key (culture_Id) references PostgreSQL.atom.cultures
03:18:56,406 ERROR [SchemaUpdate] ERROR: cross-database references are not implemented: "postgresql.atom.productsculturedetail"
03:18:56,407 ERROR [SchemaUpdate] Unsuccessful: alter table PostgreSQL.atom.productsculturedetail add constraint FKBD5D849BFFFC9417 foreign key (product_Id) references PostgreSQL.atom.products
03:18:56,408 ERROR [SchemaUpdate] ERROR: cross-database references are not implemented: "postgresql.atom.productsculturedetail"
03:18:56,408 INFO [SchemaUpdate] schema update complete
Problem no. 2: what are these cross-database references?
And what about this:
03:18:55,089 INFO [SettingsFactory] JDBC driver: PostgreSQL Native Driver, version: PostgreSQL 8.4 JDBC3 (build 701)
Problem no. 3: I told build.properties to use the JDBC4 driver; I don't know why Seam insists on using the JDBC3 driver. Where do I change this?
When I go to http://localhost:5443/Commerce and try to browse the auto-generated CRUD UI, I get this error: Error reading 'resultList' on type com.atom.Commerce.action.ProductsList_$$_javassist_seam_2
And this is what shows up in my server logs:
03:34:00,828 INFO [STDOUT] Hibernate:
select
products0_.product_Id as product1_0_,
products0_.active as active0_
from
PostgreSQL.atom.products products0_ limit ?
03:34:00,848 WARN [JDBCExceptionReporter] SQL Error: 0, SQLState: 0A000
03:34:00,849 ERROR [JDBCExceptionReporter] ERROR: cross-database references are not implemented: "postgresql.atom.products"
Position: 81
03:34:00,871 SEVERE [viewhandler] Error Rendering View[/ProductsList.xhtml]
javax.el.ELException: /ProductsList.xhtml: Error reading 'resultList' on type com.atom.Commerce.action.ProductsList_$$_javassist_seam_2
Caused by: javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: could not execute query
Problem no. 4: what is going on here? Cross-database references?
Thanks for any help with any of my problems.
You did receive a few answers on the Seam forums (here and here), but you didn't follow up. Anyway, all these are actually caused by one problem:
As Stuart Douglas told you, you shouldn't use a catalog when connecting to PostgreSQL. To fix this, replace the property "hibernate.default_catalog=PostgreSQL" in your properties file with the property "hibernate.default_catalog.null=", so that your file looks like this:
...
model.package=com.atom.Commerce.model
hibernate.default_catalog.null= # <-- This is the replaced property
driver.jar=/home/rgoytacaz/postgresql-8.4-701.jdbc4.jar
...
You should be able to use seam generate-entities fine afterwards (assuming the rest of your configuration is correct). I'd recommend doing the generation into a clean folder.
A cross-database reference is when a query tries to access two or more different databases. PostgreSQL does not support this, and thus complains when there is more than one period in the table name; so in PostgreSQL.atom.productsculturedetail, the leading PostgreSQL. catalog prefix has to go. Hibernate adds this prefix when you tell it to use a default catalog, which we already fixed in step 1 above (by telling it not to use a catalog), so this problem should be fixed after you regenerate your entities.
(Note that this is effectively the same as what Stuart Douglas told you, that you should remove the catalog="PostgreSQL" attribute in the annotations on your entity classes.)
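For illustration, the mapping without a catalog would look roughly like this (entity and column names are guessed from the log output above, so adjust them to your generated classes):
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
// Instead of @Table(catalog = "PostgreSQL", schema = "atom", name = "products"):
// the catalog attribute is what makes Hibernate emit "PostgreSQL.atom.products".
@Table(schema = "atom", name = "products")
public class Products {

    @Id
    private Long product_Id;

    private Boolean active;

    // getters, setters and the remaining columns omitted
}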
When you specified the postgresql-8.4-701.jdbc4.jar file in the properties file, this didn't mean that the driver supports JDBC4. Although the name of the file would suggest so, the driver's website clearly states that "The driver provides a reasonably complete implementation of the JDBC 3 specification". This shouldn't be a problem for you, as you're not using the driver directly (or at least you're not supposed to). The driver is sufficient for Hibernate to fulfill its requirements and provide the required functionality.
This issue is caused by the same problem above. Hibernate is unable to read data from the database because of the incorrect query. Fixing the catalog problem should fix this issue.