Integrating Confluent Schema Registry with Apache Atlas
Problem Definition
I am trying to integrate the data that exists in Confluent Schema Registry with Apache Atlas. I have looked at lots of links on this topic; they mention that such an integration is possible, but none of them give any technical detail about how it was actually done.
Question
Can anyone help me import the data (and metadata) from Schema Registry into Apache Atlas in real time? Is there any hook, event listener, or something similar I could use to implement this?
Example
Here is what I have from Schema Registry:
{
"subject":"order-value",
"version":1,
"id":101,
"schema":"{\"type\":\"record\",\"name\":\"cart_closed\",\"namespace\":\"com.akbar.avro\",\"fields\":[{\"name\":\"_g\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"_s\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"_u\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"application_version\",\"type\":[\"int\",\"null\"],\"default\":null},{\"name\":\"client_time\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"event_fingerprint\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"os\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"php_session_id\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"platform\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"server_time\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"site\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"user_agent\",\"type\":[\"string\",\"null\"],\"default\":null},{\"name\":\"payment_method_id\",\"type\":[\"int\",\"null\"],\"default\":null},{\"name\":\"page_view\",\"type\":[\"boolean\",\"null\"],\"default\":null},{\"name\":\"items\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"record\",\"name\":\"item\",\"fields\":[{\"name\":\"brand_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"category_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"discount\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"order_item_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"price\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"product_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"quantity\",\"type\":[\"int\",\"null\"],\"default\":null},{\"name\":\"seller_id\",\"type\":[\"long\",\"null\"],\"default\":null},{\"name\":\"variant_id\",\"type\":[\"long\",\"null\"],\"default\":null}]}}},{\"name\":\"cart_id\",\"type\":[\"long\",\"null\"],\"default\":null}]}"
}
How can I import this into Apache Atlas?
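To make the question concrete, the kind of bridge I imagine is a small service that reads subjects from the Schema Registry REST API and registers them in Atlas through its REST API. Below is a rough sketch of that idea only: the Atlas type name avro_schema and its attributes are my assumptions (a matching typedef would have to exist in Atlas), the endpoints and admin/admin credentials are illustrative, and a real solution would presumably be event-driven (for example, reacting to Schema Registry's _schemas topic) rather than polling.

// Rough sketch: pull one subject from Schema Registry and push it to Atlas over REST.
// Type name, attributes, URLs and credentials are assumptions, not a known working recipe.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class SchemaRegistryToAtlas {

    private static final String REGISTRY = "http://localhost:8081"; // Schema Registry (assumed)
    private static final String ATLAS = "http://localhost:21000";   // Atlas server (assumed)
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // 1. Fetch the latest version of a subject, e.g. GET /subjects/order-value/versions/latest
        HttpRequest getSchema = HttpRequest.newBuilder()
                .uri(URI.create(REGISTRY + "/subjects/order-value/versions/latest"))
                .GET()
                .build();
        String schemaJson = HTTP.send(getSchema, HttpResponse.BodyHandlers.ofString()).body();

        // 2. Wrap it in an Atlas v2 entity payload. The type and attribute names are made up
        //    for illustration and must match a typedef created in Atlas beforehand.
        String entityJson = "{ \"entity\": { \"typeName\": \"avro_schema\", \"attributes\": {"
                + "\"qualifiedName\": \"order-value@1\","
                + "\"name\": \"order-value\","
                + "\"schemaText\": " + quote(schemaJson)
                + "} } }";

        // 3. POST the entity to Atlas (basic auth with the default admin/admin is assumed).
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest createEntity = HttpRequest.newBuilder()
                .uri(URI.create(ATLAS + "/api/atlas/v2/entity"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(entityJson))
                .build();
        System.out.println(HTTP.send(createEntity, HttpResponse.BodyHandlers.ofString()).body());
    }

    // Minimal JSON string escaping for the embedded Schema Registry response.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}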
What I have done
I checked the Schema Registry documentation, whose architecture diagram shows that Schema Registry stores its data in Kafka. So I decided to point Atlas at that Kafka, but I could not find where to set this Kafka configuration. I tried changing the atlas.kafka.bootstrap.servers property in atlas-application.properties, and I also tried calling import-kafka.sh from the hook-bin directory, but neither attempt was successful.
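For reference, this is the only Kafka-related line I touched in atlas-application.properties (the broker address is just my local setup, and I am not even sure this property is the right place for it):
atlas.kafka.bootstrap.servers=localhost:9092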
Error log
2021-04-25 15:48:34,162 ERROR - [main:] ~ Thread Thread[main,5,main] died (NIOServerCnxnFactory$1:92)
org.apache.atlas.exception.AtlasBaseException: EmbeddedServer.Start: failed!
at org.apache.atlas.web.service.EmbeddedServer.start(EmbeddedServer.java:115)
at org.apache.atlas.Atlas.main(Atlas.java:133)
Caused by: java.lang.NullPointerException
at org.apache.atlas.util.BeanUtil.getBean(BeanUtil.java:36)
at org.apache.atlas.web.service.EmbeddedServer.auditServerStatus(EmbeddedServer.java:128)
at org.apache.atlas.web.service.EmbeddedServer.start(EmbeddedServer.java:111)
... 1 more
Related
"SchemaRegistryException: Failed to get Kafka cluster ID" for LOCAL setup
I downloaded the tarball (I am on macOS) for Confluent version 7.0.0 from the official Confluent site and was following the LOCAL (1 node) setup. Kafka and ZooKeeper start fine, but the Schema Registry keeps failing (note: I am behind a corporate VPN). The exception message in the Schema Registry logs is:
[2021-11-04 00:34:22,492] INFO Logging initialized #1403ms to org.eclipse.jetty.util.log.Slf4jLog (org.eclipse.jetty.util.log)
[2021-11-04 00:34:22,543] INFO Initial capacity 128, increased by 64, maximum capacity 2147483647. (io.confluent.rest.ApplicationServer)
[2021-11-04 00:34:22,614] INFO Adding listener: http://0.0.0.0:8081 (io.confluent.rest.ApplicationServer)
[2021-11-04 00:35:23,007] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryException: Failed to get Kafka cluster ID
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.kafkaClusterId(KafkaSchemaRegistry.java:1488)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.<init>(KafkaSchemaRegistry.java:166)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:71)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.configureBaseApplication(SchemaRegistryRestApplication.java:90)
at io.confluent.rest.Application.configureHandler(Application.java:271)
at io.confluent.rest.ApplicationServer.doStart(ApplicationServer.java:245)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:44)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:180)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.kafkaClusterId(KafkaSchemaRegistry.java:1486)
... 7 more
My schema-registry.properties file has the bootstrap URL set to kafkastore.bootstrap.servers=PLAINTEXT://localhost:9092. I saw some posts saying the Schema Registry may be unable to connect to the Kafka cluster URL, possibly because of the localhost address. I am fairly new to Kafka and basically just need this local setup to run a git repo that uses some topics, so my questions are: How can I fix this? (I am behind a corporate VPN, but I figured that shouldn't affect this.) And do I even need the Schema Registry?
I ended up just going with the local Docker setup instead, and the only change I had to make to the docker-compose YAML was the schema-registry port (I changed it to 8082 or 8084, I don't remember exactly, just an unused port that is not already taken by another Confluent service listed in docker-compose.yaml). My local setup is working fine now.
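For reference, the tweak was along these lines in the schema-registry service of Confluent's docker-compose.yaml (8084 here is only an example of a free host port; the container keeps listening on 8081):
schema-registry:
  ports:
    - "8084:8081"
Clients on the host then reach the registry at http://localhost:8084.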
Using the Beam Python SDK and PortableRunner to connect to Kafka with SSL
I have the code below for connecting to Kafka using the Python Beam SDK. I know that the ReadFromKafka transform is run in a Java SDK harness (a Docker container), but I have not been able to figure out how to make ssl.truststore.location and ssl.keystore.location accessible inside the SDK harness's Docker environment. The job_endpoint argument points to java -jar beam-runners-flink-1.10-job-server-2.27.0.jar --flink-master localhost:8081.
pipeline_args.extend([
    '--job_name=paul_test',
    '--runner=PortableRunner',
    '--sdk_location=container',
    '--job_endpoint=localhost:8099',
    '--streaming',
    "--environment_type=DOCKER",
    f"--sdk_harness_container_image_overrides=.*java.*,{my_beam_sdk_docker_image}:{my_beam_docker_tag}",
])

with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
    kafka = pipeline | ReadFromKafka(
        consumer_config={
            "bootstrap.servers": "bootstrap-server:17032",
            "security.protocol": "SSL",
            "ssl.truststore.location": "/opt/keys/client.truststore.jks",  # how do I make this available to the Java SDK harness?
            "ssl.truststore.password": "password",
            "ssl.keystore.type": "PKCS12",
            "ssl.keystore.location": "/opt/keys/client.keystore.p12",  # how do I make this available to the Java SDK harness?
            "ssl.keystore.password": "password",
            "group.id": "group",
            "basic.auth.credentials.source": "USER_INFO",
            "schema.registry.basic.auth.user.info": "user:password",
        },
        topics=["topic"],
        max_num_records=2,
        # expansion_service="localhost:56938"
    )
    kafka | beam.Map(lambda x: print(x))
I tried specifying the image override option as --sdk_harness_container_image_overrides='.*java.*,beam_java_sdk:latest', where beam_java_sdk:latest is a Docker image I based on apache/beam_java11_sdk:2.27.0 that pulls the credentials in its entrypoint.sh. But Beam does not appear to use it; I see
INFO org.apache.beam.runners.fnexecution.environment.DockerEnvironmentFactory - Still waiting for startup of environment apache/beam_java11_sdk:2.27.0 for worker id 1-1
in the logs, which is soon inevitably followed by
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /opt/keys/client.keystore.p12 of type PKCS12
In conclusion, my question is this: in Apache Beam, is it possible to make files available inside the Java SDK harness Docker container from the Python Beam SDK? If so, how might it be done? Many thanks.
Currently, there is no straightforward way to achieve this. There are ongoing discussions and tracking issues to provide support for this kind of expansion service customization (see here, here, BEAM-12538 and BEAM-12539). That is the short answer. The long answer is yes, you can do it: you would have to copy & paste ExpansionService.java into your codebase and build your own custom expansion service, in which you specify the default environment (DOCKER) and default environment config (your image) here. You then have to run this expansion service manually and specify its address using the expansion_service parameter of ReadFromKafka.
Apache Camel Kafka support for Confluent Schema Registry
I am trying to create a Camel route with the Kafka component, consuming events with io.confluent.kafka.serializers.KafkaAvroDeserializer and a schema registry URL along with other component parameters. I am not sure whether this is fully supported by camel-kafka at the moment. Can someone please comment on this?
from("kafka:{{kafka.notification.topic}}?brokers={{kafka.notification.brokers}}"
    + "&maxPollRecords={{kafka.notification.maxPollRecords}}"
    + "&seekTo={{kafka.notification.seekTo}}"
    + "&specificAvroReader=" + "true"
    + "&valueDeserializer=" + "io.confluent.kafka.serializers.KafkaAvroDeserializer"
    + "&schemaRegistryURL=localhost:9021"
    + "&allowManualCommit={{kafka.notification.autocommit}}")
specificAvroReader and schemaRegistryURL are the properties that do not seem to be supported.
I believe the only way currently to get camel-kafka to work with Confluent Schema Registry is to write a custom AvroSerializer/AvroDeserializer (extending io.confluent.kafka.serializers.AbstractKafkaAvroSerializer / io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer). For example: BlablaDeserializer extends AbstractKafkaAvroDeserializer implements Deserializer<Object> and BlablaSerializer extends AbstractKafkaAvroSerializer implements Serializer<Object>, and then set them on the Camel component's KafkaConfiguration, e.g. for the value: kafkaConfiguration.setValueDeserializer(...)
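For illustration, a rough sketch of what such a deserializer might look like (the class name is made up, and it simply mirrors what Confluent's stock KafkaAvroDeserializer does, so double-check it against the Confluent version you are using):
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;

public class BlablaDeserializer extends AbstractKafkaAvroDeserializer implements Deserializer<Object> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Hands schema.registry.url and related settings down to the Confluent base class
        configure(new KafkaAvroDeserializerConfig(configs));
    }

    @Override
    public Object deserialize(String topic, byte[] bytes) {
        // The base class looks up the writer schema in the registry and decodes the record
        return deserialize(bytes);
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }
}
However you wire it in, the deserializer still needs schema.registry.url in its configuration to reach the registry.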
Got this working after adding this to Gradle: compile 'org.apache.camel:camel-kafka:3.0.0-M2', which can be found in this staging repository: https://repository.apache.org/content/repositories/orgapachecamel-1124/org/apache/camel/ I think 3.0.0-M2 will be officially released by Camel early next week. Edit: 3.0.0-M2 is available now at https://repository.apache.org/content/repositories/releases/org/apache/camel/apache-camel/3.0.0-M2/ and has support for camel-kafka with Confluent Schema Registry.
As already answered, this has been solved in the meantime. You do not have to write your own serializer/deserializer. I made a full example with camel-quarkus and Confluent Schema Registry: https://github.com/tstuber/camel-quarkus-kafka-schema-registry
kafka-connect distributed mode
I am trying to start Kafka Connect in distributed mode using connect-distributed /etc/schema-registry/connect-avro-distributed.properties, but then I get:
[2017-02-15 12:45:35,962] INFO Instantiated task mysql-adventureworks-source-0 with version 3.1.2 of type io.confluent.connect.jdbc.source.JdbcSourceTask (org.apache.kafka.connect.runtime.Worker:264)
[2017-02-15 12:45:35,963] ERROR Failed to start task mysql-adventureworks-source-0 (org.apache.kafka.connect.runtime.Worker:280)
org.apache.kafka.common.config.ConfigException: Missing Schema registry url!
at io.confluent.connect.avro.AvroConverter.configure(AvroConverter.java:64)
at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:268)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:757)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startWork(DistributedHerder.java:750)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.handleRebalanceCompleted(DistributedHerder.java:708)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:204)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:174)
at java.lang.Thread.run(Thread.java:745)
The schema registry URL is present in the properties file. I've also tried to start it using connect-distributed /etc/kafka/connect-distributed.properties, which uses the JSON format, but I get the same error. Any ideas?
Please start the Schema Registry service pointing to the right property file. It should work.
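For reference, the AvroConverter that throws this error reads the registry address from the worker's converter properties in connect-avro-distributed.properties, which usually look something like this (the localhost URL is just an example):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081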
Seam 2.2GA + JBoss AS 5.1GA + Postgres 8.4
Sorry for the big wall of text, but it's mostly logs. Thanks for any help with any of my problems; I've been trying to get help from the Seam forums, but in vain. I'm trying the setup mentioned in the title, but unsuccessfully. I have it all installed correctly, and the problems start with seam-gen. This is my build.properties:
#Generated by seam setup
#Sat Aug 29 19:12:18 BRT 2009
hibernate.connection.password=abc123
workspace.home=/home/rgoytacaz/workspace
hibernate.connection.dataSource_class=org.postgresql.ds.PGConnectionPoolDataSource
model.package=com.atom.Commerce.model
hibernate.default_catalog=PostgreSQL
driver.jar=/home/rgoytacaz/postgresql-8.4-701.jdbc4.jar
action.package=com.atom.Commerce.action
test.package=com.atom.Commerce.test
database.type=postgres
richfaces.skin=glassX
glassfish.domain=domain1
hibernate.default_schema=Core
database.drop=n
project.name=Commerce
hibernate.connection.username=postgres
glassfish.home=C\:/Program Files/glassfish-v2.1
hibernate.connection.driver_class=org.postgresql.Driver
hibernate.cache.provider_class=org.hibernate.cache.HashtableCacheProvider
jboss.domain=default
project.type=ear
icefaces.home=
database.exists=y
jboss.home=/srv/jboss-5.1.0.GA
driver.license.jar=
hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
hibernate.connection.url=jdbc\:postgresql\:Atom
icefaces=n
./seam create-project works okay, but when I try generate-entities, I get the following:
generate-model:
[echo] Reverse engineering database using JDBC driver /home/rgoytacaz/postgresql-8.4-701.jdbc4.jar
[echo] project=/home/rgoytacaz/workspace/Commerce
[echo] model=com.atom.Commerce.model
[hibernate] Executing Hibernate Tool with a JDBC Configuration (for reverse engineering)
[hibernate] 1. task: hbm2java (Generates a set of .java files)
[hibernate] log4j:WARN No appenders could be found for logger (org.hibernate.cfg.Environment).
[hibernate] log4j:WARN Please initialize the log4j system properly.
[javaformatter] Java formatting of 4 files completed. Skipped 0 file(s).
This is problem no. 1. What is this, and how do I fix it? I had to do this in Eclipse, and it worked. Then I imported the seam-gen created project into Eclipse and deployed it to JBoss 5.1. While my servers start, I've noticed the following:
03:18:56,405 ERROR [SchemaUpdate] Unsuccessful: alter table PostgreSQL.atom.productsculturedetail add constraint FKBD5D849BC0A26E19 foreign key (culture_Id) references PostgreSQL.atom.cultures
03:18:56,406 ERROR [SchemaUpdate] ERROR: cross-database references are not implemented: "postgresql.atom.productsculturedetail"
03:18:56,407 ERROR [SchemaUpdate] Unsuccessful: alter table PostgreSQL.atom.productsculturedetail add constraint FKBD5D849BFFFC9417 foreign key (product_Id) references PostgreSQL.atom.products
03:18:56,408 ERROR [SchemaUpdate] ERROR: cross-database references are not implemented: "postgresql.atom.productsculturedetail"
03:18:56,408 INFO [SchemaUpdate] schema update complete
Problem no. 2: what are these cross-database references? And what about this:
03:18:55,089 INFO [SettingsFactory] JDBC driver: PostgreSQL Native Driver, version: PostgreSQL 8.4 JDBC3 (build 701)
Problem no. 3: I told build.properties to use the JDBC4 driver, so I don't know why Seam insists on using the JDBC3 driver. Where do I change this? When I go to http://localhost:5443/Commerce and try to browse the auto-generated CRUD UI, I get this error:
Error reading 'resultList' on type com.atom.Commerce.action.ProductsList_$$_javassist_seam_2
And this is what shows up in my server logs:
03:34:00,828 INFO [STDOUT] Hibernate: select products0_.product_Id as product1_0_, products0_.active as active0_ from PostgreSQL.atom.products products0_ limit ?
03:34:00,848 WARN [JDBCExceptionReporter] SQL Error: 0, SQLState: 0A000
03:34:00,849 ERROR [JDBCExceptionReporter] ERROR: cross-database references are not implemented: "postgresql.atom.products" Position: 81
03:34:00,871 SEVERE [viewhandler] Error Rendering View[/ProductsList.xhtml] javax.el.ELException: /ProductsList.xhtml: Error reading 'resultList' on type com.atom.Commerce.action.ProductsList_$$_javassist_seam_2
Caused by: javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: could not execute query
Problem no. 4: what is going on here? Cross-database references again? Thanks for any help with any of my problems.
You did receive a few answers on the Seam forums (here and here), but you didn't follow up. Anyway, all of these are actually caused by one problem: as Stuart Douglas told you, you shouldn't use a catalog when connecting to PostgreSQL.
1. To fix this, replace the property hibernate.default_catalog=PostgreSQL in your properties file with the property hibernate.default_catalog.null=, so that your file looks like this:
...
model.package=com.atom.Commerce.model
hibernate.default_catalog.null= # <-- This is the replaced property
driver.jar=/home/rgoytacaz/postgresql-8.4-701.jdbc4.jar
...
You should be able to use seam generate-entities fine afterwards (assuming the rest of your configuration is correct). I'd recommend doing the generation into a clean folder.
2. A cross-database reference is when a query tries to access two or more different databases. PostgreSQL does not support this, and thus complains when there is more than one period in a table name; so in PostgreSQL.atom.productsculturedetail, the PostgreSQL. prefix should be removed. Hibernate adds this prefix when you tell it to use a default catalog, which we already fixed in step 1 above (by telling it not to use a catalog), so this problem should go away after you regenerate your entities. (Note that this is effectively the same as what Stuart Douglas told you: remove the catalog="PostgreSQL" attribute in the annotations on your entity classes.)
3. Specifying the postgresql-8.4-701.jdbc4.jar file in the properties file doesn't mean that the driver supports JDBC4. Although the name of the file would suggest so, the driver's website clearly states that "The driver provides a reasonably complete implementation of the JDBC 3 specification". This shouldn't be a problem for you, as you're not using the driver directly (or at least you're not supposed to); the driver is sufficient for Hibernate to fulfill its requirements and provide the required functionality.
4. This issue is caused by the same problem as above: Hibernate is unable to read data from the database because of the incorrect query. Fixing the catalog problem should fix this issue as well.