Can the Hudi Metadata table be queried? - pyspark

Going through the Hudi documentation I saw the Metadata Config section and was curious about how it is used. I created a table with the metadata table enabled, and the directory was created under /.hoodie/metadata. Has anybody experimented with this feature? Is the metadata exposed, or is it only used internally by Hudi? What is it used for? I couldn't understand it from the docs.
I used the following Hudi options to create a table in S3 using PySpark.
hudi_options_insert = {
    "hoodie.table.name": "table_p5",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.table": "table_p5",
    "hoodie.datasource.hive_sync.database": "poc_hudi",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_fields": "ds",
    "hoodie.insert.shuffle.parallelism": 6,
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.insert.parallelism": 6
}
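For context, here is a minimal sketch of how an options dict like this gets passed to the writer; the DataFrame, app name, and S3 paths below are placeholders rather than my actual ones:
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle jar is already on the Spark classpath.
spark = SparkSession.builder.appName("hudi_poc").getOrCreate()

# Placeholder source data; in reality this is whatever DataFrame is being bulk-inserted.
df = spark.read.parquet("s3://my-bucket/raw/events/")

(df.write.format("hudi")
    .options(**hudi_options_insert)
    .mode("overwrite")
    .save("s3://my-bucket/poc_hudi/table_p5/"))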
Thanks a mil.

Related

Debezium Connector filter "partly" working

We have a Debezium connector that works without any errors. Two filtering conditions are applied, and one of them works as intended but the other seems to have no effect. These are the important parts of the config:
"connector.class": "io.debezium.connector.oracle.OracleConnector",
"transforms.filter.topic.regex": "topicname",
"database.connection.adapter": "logminer",
"transforms": "filter",
"schema.include.list": "xxxx",
"transforms.filter.type": "io.debezium.transforms.Filter",
"transforms.filter.language": "jsr223.groovy",
"tombstones.on.delete": "false",
"transforms.filter.condition": "value.op == \"c\" && value.after.QUEUELOCATIONTYPE == 5",
"table.include.list": "xxxxxx",
"skipped.operations": "u,d,r",
"snapshot.mode": "initial",
"topics": "xxxxxxx"
As you can see, we want to get records whose op is "c" and whose QUEUELOCATIONTYPE is 5. In the Kafka topic all the records have the op field set to "c", but the second condition does not work: there are records with QUEUELOCATIONTYPE of 2, 3, 4, etc.
A sample record is given below.
"payload": {
"before": null,
"after": {
"EVENTOBJECTID": "749dc9ea-a7aa-44c2-9af7-10574769c7db",
"QUEUECODE": "STDQSTDBKP",
"STATE": 6,
"RECORDDATE": 1638964344000,
"RECORDREQUESTOBJECTID": "32b7f617-60e8-4020-98b0-66f288433031",
"QUEUELOCATIONTYPE": 4,
"RETRYCOUNT": 0,
"RECORDCHANNELCODE": null,
"MESSAGEBROKERSERVERID": 1
},
"op": "c",
"ts_ms": 1638953572392,
"transaction": null
}
}
What may be the problem? Even though I didn't expect it to make a difference, I tried switching the order of the conditions. There are no errors and the connector is running.
OK, solved it. I was using a pre-created config. While reading the documentation I saw that "skipped.operations": "u,d,r" is not an Oracle connector option; it comes from the MySQL connector documentation. So I deleted it and changed the connector name (cached data causes problems quite often). Looks like it's working now.

Kafka Connect - Not working for UPDATE operation

I am new to Kafka Connect source and sink connectors. I created an application to transfer table data from one schema (Schema1) to another schema (Schema2), using Oracle as the database. I successfully transferred rows for the INSERT operation from table "Schema1.Header" to table "Schema2.Header", but the UPDATE operation does not work with the config below.
SOURCE Config:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:oracle:thin:#localhost:1524:XE",
  "connection.user": "USER",
  "connection.password": "user1234",
  "dialect.name": "OracleDatabaseDialect",
  "topic.prefix": "Schema1.Header",
  "incrementing.column.name": "SC_NO",
  "mode": "incrementing",
  "query": "SELECT * FROM (SELECT HEADER_V1.* FROM Schema1.Header HEADER_V1 INNER JOIN Schema1.LINE_V1 LINE_V1 ON HEADER_V1.SC_NO = LINE_V1.SC_NO AND LINE_V1.CLNAME_CODE ='XXXXXX' AND HEADER_V1.ITEM_TYPE = 'XXX')",
  "transforms": "ReplaceField",
  "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.ReplaceField.blacklist": "col_3,col_10"
}
SINK Config:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "connection.url": "jdbc:oracle:thin:#localhost:1524:XE",
  "connection.user": "USER2",
  "connection.password": "user21234",
  "dialect.name": "OracleDatabaseDialect",
  "topics": "Schema1.Header",
  "table.name.format": "Schema2.Header",
  "tasks.max": "1"
}
Kindly help me to fix this issue.
Note: I need to do all CRUD operations on the Schema1 tables only; using Kafka Connect I am transferring that data to the Schema2 tables. Newly inserted rows get transferred, but updated rows are not transferred via Kafka Connect. What do I have to do to achieve this?
According to this blog you need to set the mode to timestamp (or better timestamp+incrementing if you want to capture both new and updated rows) in your source config.
In addition, you then need to specify the timestamp.column.name which should point to a timestamp column that is updated every time the row is updated.
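A minimal sketch of the adjusted source config, assuming the header table has a column that is set on every update; the column name LAST_UPDATED below is hypothetical, and every property not shown stays as in the question:
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"mode": "timestamp+incrementing",
"incrementing.column.name": "SC_NO",
"timestamp.column.name": "LAST_UPDATED"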

Kafka Connect: Topic shows 3x the number of events than expected

We are using Kafka Connect JDBC to sync tables between two databases (Debezium would be perfect for this but is out of the question).
The sync in general works fine, but it seems the topics contain 3x the number of events/messages expected.
What could be the reason for this?
Some additional information:
The target database contains the expected number of rows (the count of messages in the topics divided by 3).
Most of the topics are split into 3 partitions (the key is set via an SMT, and the DefaultPartitioner is used).
JDBC Source Connector
{
  "name": "oracle_source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:oracle:thin:#dbdis01.allesklar.de:1521:stg_cdb",
    "connection.user": "****",
    "connection.password": "****",
    "schema.pattern": "BBUCH",
    "topic.prefix": "oracle_",
    "table.whitelist": "cdc_companies, cdc_partners, cdc_categories, cdc_additional_details, cdc_claiming_history, cdc_company_categories, cdc_company_custom_fields, cdc_premium_custom_field_types, cdc_premium_custom_fields, cdc_premiums, cdc, cdc_premium_redirects, intermediate_oz_data, intermediate_oz_mapping",
    "table.types": "VIEW",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "id",
    "timestamp.column.name": "ts",
    "key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "validate.non.null": false,
    "numeric.mapping": "best_fit",
    "db.timezone": "Europe/Berlin",
    "transforms": "createKey, extractId, dropTimestamp, deleteTransform",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "id",
    "transforms.extractId.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractId.field": "id",
    "transforms.dropTimestamp.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.dropTimestamp.blacklist": "ts",
    "transforms.deleteTransform.type": "de.meinestadt.kafka.DeleteTransformation"
  }
}
JDBC Sink Connector
{
  "name": "postgres_sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://writer.branchenbuch.psql.integration.meinestadt.de:5432/branchenbuch",
    "connection.user": "****",
    "connection.password": "****",
    "key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.schemas.enable": true,
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "delete.enabled": true,
    "auto.create": true,
    "auto.evolve": true,
    "topics.regex": "oracle_cdc_.*",
    "transforms": "dropPrefix",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "oracle_cdc_(.*)",
    "transforms.dropPrefix.replacement": "$1"
  }
}
Strange Topic Count
This isn't an answer per se, but it's easier to format here than in the comments box.
It's not clear why you'd be getting duplicates. Some possibilities would be:
You have more than one instance of the connector running
You have one instance of the connector running but have previously run other instances which loaded the same data to the topic
Data is coming from multiple tables and being merged into one topic (not possible here based on your config, but it could be a possibility if you were using a Single Message Transform to modify the target-topic name)
In terms of investigation I would suggest:
Isolate the problem by splitting the connector into one connector per table.
Examine each topic and locate examples of the duplicate messages. See if there is a pattern to which topics have duplicates. KSQL will be useful here:
SELECT ROWKEY, COUNT(*) FROM source GROUP BY ROWKEY HAVING COUNT(*) > 1
I'm guessing at ROWKEY (the key of the Kafka message) - you'll know your data and which columns should be unique and can be used to detect duplicates.
Once you've found a duplicate message, use kafkacat to examine the duplicate instances. Do they have the exact same Kafka message timestamp?
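If kafkacat isn't at hand, a small consumer script can do the same inspection. This is only a sketch under assumptions: the broker address is a placeholder, and the topic name is derived from the topic.prefix and table.whitelist in the source config above.
from confluent_kafka import Consumer

# Placeholder broker; topic name assumed to be topic.prefix ("oracle_") + table ("cdc_companies").
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dup-inspection",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["oracle_cdc_companies"])

seen = {}  # key bytes -> list of (timestamp_ms, partition, offset)
try:
    while True:
        msg = consumer.poll(5.0)
        if msg is None:
            break  # crude end-of-topic heuristic: nothing received within the timeout
        if msg.error():
            continue
        _, ts = msg.timestamp()
        seen.setdefault(msg.key(), []).append((ts, msg.partition(), msg.offset()))
finally:
    consumer.close()

# Print only keys that occur more than once, so their timestamps/partitions can be compared.
for key, occurrences in seen.items():
    if len(occurrences) > 1:
        print(key, occurrences)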
For more back and forth, StackOverflow isn't such an appropriate platform - I'd recommend heading to http://cnfl.io/slack and the #connect channel.

kafka jdbc source connector error in query mode [duplicate]

This question already has an answer here:
How to add explicit WHERE clause in Kafka Connect JDBC Source connector
(1 answer)
Closed 3 years ago.
I have a JDBCSourceConnector in Kafka that uses a query to stream data from the database, but I have a problem with the query I wrote for selecting data.
I tested the query in Postgres psql and also in DBeaver and it works fine, but in the Kafka config it produces an SQL syntax error.
Error
ERROR Failed to run query for table TimestampIncrementingTableQuerier{name='null', query='select "Users".* from "Users" join "SchoolUserPivots" on "Users".id = "SchoolUserPivots".user_id where school_id = 1 and role_id = 2', topicPrefix='teacher', timestampColumn='"Users".updatedAt', incrementingColumn='id'}: {} (io.confluent.connect.jdbc.source.JdbcSourceTask:221)
org.postgresql.util.PSQLException: ERROR: syntax error at or near "WHERE"
Config json
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "timestamp.column.name": "\"Users\".updatedAt",
  "incrementing.column.name": "id",
  "connection.password": "123",
  "tasks.max": "1",
  "query": "select \"Users\".* from \"Users\" join \"SchoolUserPivots\" on \"Users\".id = \"SchoolUserPivots\".user_id where school_id = 1 and role_id = 2",
  "timestamp.delay.interval.ms": "5000",
  "mode": "timestamp+incrementing",
  "topic.prefix": "teacher",
  "connection.user": "user",
  "name": "SourceTeacher",
  "connection.url": "jdbc:postgresql://ip:5432/school",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter": "org.apache.kafka.connect.json.JsonConverter"
}
You can't use "mode": "timestamp+incrementing" with a custom query that already includes a WHERE clause, because the connector appends its own WHERE clause for the timestamp/incrementing predicates.
See https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector for more details, as well as https://github.com/confluentinc/kafka-connect-jdbc/issues/566. That GitHub issue suggests one workaround: wrapping your query in a subselect.
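A rough, untested sketch of that workaround adapted to the query from the question; note the timestamp and incrementing columns then have to be addressed as the subselect exposes them (here assumed to be updatedAt and id), and the rest of the config stays as posted:
"query": "select * from (select \"Users\".* from \"Users\" join \"SchoolUserPivots\" on \"Users\".id = \"SchoolUserPivots\".user_id where school_id = 1 and role_id = 2) u",
"timestamp.column.name": "updatedAt",
"incrementing.column.name": "id",
"mode": "timestamp+incrementing"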

error in accessing tables from hive using apache drill

I am trying to read the data from my table abc, which is in Hive, using Drill. For that I have created a Hive storage plugin with the configuration mentioned below:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://<ip>:<port>",
    "fs.default.name": "hdfs://<ip>:<port>/",
    "hive.metastore.sasl.enabled": "false",
    "hive.server2.enable.doAs": "true",
    "hive.metastore.execute.setugi": "true"
  }
}
With this I am able to see the databases in Hive, but when I try to access any table in a particular database, e.g.
select * from hive.db.abc;
it throws the following error
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: From line 1, column 15 to line 1, column 18: Object 'abc' not found within 'hive.db' SQL Query null [Error Id: b6c56276-6255-4b5b-a600-746dbc2f3d67 on centos2.example.com:31010]
(org.apache.calcite.runtime.CalciteContextException) From line 1, column 15 to line 1, column 18: Object 'abc' not found within 'hive.db'
sun.reflect.NativeConstructorAccessorImpl.newInstance0():-2
sun.reflect.NativeConstructorAccessorImpl.newInstance():62
sun.reflect.DelegatingConstructorAccessorImpl.newInstance():45
java.lang.reflect.Constructor.newInstance():423
org.apache.calcite.runtime.Resources$ExInstWithCause.ex():463
org.apache.calcite.sql.SqlUtil.newContextException():800
org.apache.calcite.sql.SqlUtil.newContextException():788
org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError():4703
org.apache.calcite.sql.validate.IdentifierNamespace.resolveImpl():127
org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl():177
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2972
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2957
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect():3216
org.apache.calcite.sql.validate.SelectNamespace.validateImpl():60
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.SqlSelect.validate():226
org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression():903
org.apache.calcite.sql.validate.SqlValidatorImpl.validate():613
org.apache.drill.exec.planner.sql.SqlConverter.validate():190
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateNode():630
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateAndConvert():202
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():174
org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():146
org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():84
org.apache.drill.exec.work.foreman.Foreman.runSQL():567
org.apache.drill.exec.work.foreman.Foreman.run():264
java.util.concurrent.ThreadPoolExecutor.runWorker():1149
java.util.concurrent.ThreadPoolExecutor$Worker.run():624
java.lang.Thread.run():748
Caused By (org.apache.calcite.sql.validate.SqlValidatorException) Object 'abc' not found within 'hive.db'
sun.reflect.NativeConstructorAccessorImpl.newInstance0():-2
sun.reflect.NativeConstructorAccessorImpl.newInstance():62
sun.reflect.DelegatingConstructorAccessorImpl.newInstance():45
java.lang.reflect.Constructor.newInstance():423
org.apache.calcite.runtime.Resources$ExInstWithCause.ex():463
org.apache.calcite.runtime.Resources$ExInst.ex():572
org.apache.calcite.sql.SqlUtil.newContextException():800
org.apache.calcite.sql.SqlUtil.newContextException():788
org.apache.calcite.sql.validate.SqlValidatorImpl.newValidationError():4703
org.apache.calcite.sql.validate.IdentifierNamespace.resolveImpl():127
org.apache.calcite.sql.validate.IdentifierNamespace.validateImpl():177
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2972
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateFrom():2957
org.apache.drill.exec.planner.sql.SqlConverter$DrillValidator.validateFrom():267
org.apache.calcite.sql.validate.SqlValidatorImpl.validateSelect():3216
org.apache.calcite.sql.validate.SelectNamespace.validateImpl():60
org.apache.calcite.sql.validate.AbstractNamespace.validate():84
org.apache.calcite.sql.validate.SqlValidatorImpl.validateNamespace():947
org.apache.calcite.sql.validate.SqlValidatorImpl.validateQuery():928
org.apache.calcite.sql.SqlSelect.validate():226
org.apache.calcite.sql.validate.SqlValidatorImpl.validateScopedExpression():903
org.apache.calcite.sql.validate.SqlValidatorImpl.validate():613
org.apache.drill.exec.planner.sql.SqlConverter.validate():190
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateNode():630
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.validateAndConvert():202
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():174
org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():146
org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():84
org.apache.drill.exec.work.foreman.Foreman.runSQL():567
org.apache.drill.exec.work.foreman.Foreman.run():264
java.util.concurrent.ThreadPoolExecutor.runWorker():1149
java.util.concurrent.ThreadPoolExecutor$Worker.run():624
java.lang.Thread.run():748
You should upgrade to a newer Hive version. For Drill 1.13 that is Hive 2.3.2: starting from Drill 1.13, Drill leverages version 2.3.2 of the Hive client [1].
Support for Hive 3.0 is upcoming [2].
Also, please follow the guide with the Hive plugin configuration needed for your environment [3]. You can omit the "hive.metastore.sasl.enabled", "hive.server2.enable.doAs" and "hive.metastore.execute.setugi" properties, since you have specified their default values [4]. For "hive.metastore.uris" and "fs.default.name" you should specify the same values as in your hive-site.xml. A trimmed plugin config along these lines is sketched after the references.
[1] https://drill.apache.org/docs/hive-storage-plugin
[2] https://issues.apache.org/jira/browse/DRILL-6604
[3] https://drill.apache.org/docs/hive-storage-plugin/#hive-remote-metastore-configuration
[4] https://github.com/apache/hive/blob/rel/release-2.3.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L824
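For illustration, the storage plugin from the question reduced to the properties discussed above would look roughly like this (keep the metastore and HDFS addresses in sync with your hive-site.xml):
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://<ip>:<port>",
    "fs.default.name": "hdfs://<ip>:<port>/"
  }
}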