This is regarding the kafka-connect-spooldir connector for CSV. Is there a way to avoid hardcoding the schema and let the connector create the schema dynamically? I have a lot of CSV files to process, a few hundred GB per day and sometimes a couple of terabytes, and occasionally some files gain new columns while others drop columns.
I am able to successfully read the CSV and write to Elasticsearch, following this post: https://www.confluent.io/blog/ksql-in-action-enriching-csv-events-with-data-from-rdbms-into-AWS/
So now I do not want to supply the key and value schemas explicitly.
From https://docs.confluent.io/current/connect/kafka-connect-spooldir/connectors/csv_source_connector.html I figured that schema.generation.enabled can be set to true.
Here's my REST API call, including my connector config:
$ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://xxx:000/connectors/ -d '{
  "name": "csv1",
  "config": {
    "tasks.max": "1",
    "connector.class": "com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector",
    "input.file.pattern": "^.*csv$",
    "halt.on.error": "false",
    "topic": "order",
    "schema.generation.enabled": "true",
    "schema.generation.key.name": "orderschema",
    "schema.generation.value.name": "orderdata",
    "csv.first.row.as.header": "true",
    "csv.null.field.indicator": "EMPTY_SEPARATORS",
    "batch.size": "5000"
  }
}'
When I submit this, I get the following error:
{
"name": "order",
"connector": {
"state": "FAILED",
"worker_id": "localhost:000",
"trace": "org.apache.kafka.connect.errors.DataException: More than one schema was found for the input pattern.\nSchema: {\"name\":\"com.github.jcustenborder.kafka.connect.model.Value\",\"type\":\"STRUCT\",\"isOptional\":false,\"fieldSchemas\":
What's the solution for this?
I was able to parse all the data now. The trick was to process one file first (any file), then add another one just to check. It looks like the connector updates the schema automagically that way (as Robin Moffatt calls it).
After that I added all the files to the folder and it processed them just fine. YAY!
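For anyone hitting the same error, the workflow that worked can be sketched roughly as below; the directory paths are placeholders for whatever input.path the connector is configured with, not values from the original post.
# Hypothetical paths -- substitute the connector's configured input.path.
INPUT_DIR=/data/csv-input
STAGING_DIR=/data/csv-staging

# 1. Drop a single CSV in first so the connector can generate the schema.
mv "$STAGING_DIR/orders-000.csv" "$INPUT_DIR/"

# 2. Check that the connector task is RUNNING before adding more, e.g.:
#    curl -s http://xxx:000/connectors/csv1/status

# 3. Then move in the remaining files; they are picked up with the generated schema.
mv "$STAGING_DIR"/*.csv "$INPUT_DIR/"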
I am trying to ingest data from MySQL to BigQuery. I am using Debezium components running on Docker for this purpose.
Anytime I try to deploy the BigQuery sink connector to Kafka connect, I am getting this error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 2 error(s):\nFailed to construct GCS client: Failed to access JSON key file\nAn unexpected error occurred while validating credentials for BigQuery: Failed to access JSON key file\nYou can also find the above list of errors at the endpoint `/connector-plugins/{connectorType}/config/validate`"}
The error suggests it's an issue with locating the service account key file.
I granted the service account BigQuery Admin and Editor permissions, but the error persists.
This is my BigQuery connector configuration file:
{
  "name": "kcbq-connect1",
  "config": {
    "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
    "tasks.max": "1",
    "topics": "kcbq-quickstart1",
    "sanitizeTopics": "true",
    "autoCreateTables": "true",
    "autoUpdateSchemas": "true",
    "schemaRetriever": "com.wepay.kafka.connect.bigquery.retrieve.IdentitySchemaRetriever",
    "schemaRegistryLocation": "http://localhost:8081",
    "bufferSize": "100000",
    "maxWriteSize": "10000",
    "tableWriteWait": "1000",
    "project": "dummy-production-overview",
    "defaultDataset": "debeziumtest",
    "keyfile": "/Users/Oladayo/Desktop/Debezium-Learning/key.json"
  }
}
Can anyone help?
Thank you.
I needed to mount the service account key from my local directory into the Kafka Connect container. That was how I solved the issue. Thank you :)
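For reference, one way to do that with docker-compose is sketched below; the service name, image, and container path are assumptions for illustration, not taken from the original setup, and the connector's keyfile setting would then need to point at the container-side path.
# docker-compose.yml (excerpt) -- hypothetical service name and container path
services:
  connect:
    image: debezium/connect:latest
    volumes:
      # Bind-mount the local key into the container; set "keyfile" to /etc/gcp/key.json.
      - /Users/Oladayo/Desktop/Debezium-Learning/key.json:/etc/gcp/key.json:ro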
I am trying to sync my Postgres database with AWS Elasticsearch using PGSync.
I have defined a simple schema:
[
  {
    "database": "tenancyportal",
    "index": "properties",
    "nodes": [
      {
        "table": "properties",
        "schema": "public",
        "columns": ["id", "address"]
      }
    ]
  }
]
But when I am trying to bootstrap the database using
bootstrap --config schema.json
I get the following error:
elasticsearch.exceptions.NotFoundError: NotFoundError(404,
'index_not_found_exception', 'no such index [:9200]', :9200,
index_or_alias)
In the screenshot below, you can see that the GET URL for Elasticsearch is completely wrong; I am not able to understand what config is causing it to be formed like that.
It looks like your AWS Elasticsearch URL is not constructed properly. This was addressed in a recent update to PGSync. Can you pull the latest master branch and try again?
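While pulling the latest version, it may also be worth double-checking how the Elasticsearch endpoint is being supplied to PGSync. A rough sketch is below; the variable names follow PGSync's documented environment settings as I understand them, so treat them as assumptions and confirm against the README.
# Assumed PGSync connection settings -- confirm the exact variable names in the PGSync README.
export ELASTICSEARCH_HOST=search-mydomain.us-east-1.es.amazonaws.com   # placeholder AWS ES endpoint
export ELASTICSEARCH_PORT=443
# Depending on the PGSync version, an https/scheme setting may also be needed for AWS Elasticsearch.
bootstrap --config schema.json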
OS: Ubuntu 18.x
docker image (from dockerhub.com, as of 2020-09-25): confluentinc/cp-schema-registry:latest
I am exploring the HTTP API for the Confluent Schema Registry. First off, is there a definitive statement somewhere about which version of the JSON Schema specification the registry assumes? For now, I am assuming Draft v7.0. More broadly, I believe the API that returns the supported schema types should list versions. For example, instead of:
$ curl -X GET http://localhost:8081/schemas/types
["JSON","PROTOBUF","AVRO"]
you would have:
$ curl -X GET http://localhost:8081/schemas/types
[{"flavor": "JSON", "version": "7.0"}, {"flavor": "PROTOBUF", "version": "1.2"}, {"flavor": "AVRO", "version": "3.5"}]
so at least programmers would know for sure what the Schema Registry assumes.
This issue aside, I cannot seem to POST a rather trivial JSON schema to the registry:
$ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{ "schema": "{ \"type\": \"object\", \"properties\": { \"f1\": { \"type\": \"string\" } } }" }' http://localhost:8081/subjects/mytest-value/versions
{"error_code":42201,"message":"Either the input schema or one its references is invalid"}
Here I am POSTing the schema to the mytest-value subject. The schema itself I took from the Confluent documentation and escaped accordingly.
Can you tell why this schema is not POSTing to the registry? And more generally, can I assume full support for Draft v7.0 of the JSON Schema definition?
You need to pass the schemaType field: "If no schemaType is supplied, schemaType is assumed to be AVRO." https://docs.confluent.io/current/schema-registry/develop/api.html#post--subjects-(string-%20subject)-versions:
'{"schemaType": "JSON", "schema": "{\"type\": \"object\", \"properties\": {\"f1\": {\"type\": \"string\"}}}"}'
I agree that listing the supported specification versions in that output would be helpful.
I don't know if I did something wrong or not, but here is my configuration.
// payload.json
{
  "plugin_name": "postgresql-database-plugin",
  "allowed_roles": "*",
  "connection_url": "postgresql://{{username}}:{{password}}@for-testing-vault.rds.amazonaws.com:5432/test-app",
  "username": "test",
  "password": "testtest"
}
then run this command:
curl --header "X-Vault-Token: ..." --request POST --data @payload.json http://ip_add.us-west-1.compute.amazonaws.com:8200/v1/database/config/postgresql
roles configuration:
// readonlypayload.json
{
  "db_name": "test-app",
  "creation_statements": ["CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";"],
  "default_ttl": "1h",
  "max_ttl": "24h"
}
then run this command:
curl --header "X-Vault-Token: ..." --request POST --data @readonlypayload.json http://ip_add.us-west-1.compute.amazonaws.com:8200/v1/database/roles/readonly
Then created a policy:
path "database/creds/readonly" {
capabilities = [ "read" ]
}
path "/sys/leases/renew" {
capabilities = [ "update" ]
}
and run this to get the token:
curl --header "X-Vault-Token: ..." --request POST --data '{"policies": ["db_creds"]}' http://ip_add.us-west-1.compute.amazonaws.com:8200/v1/auth/token/create | jq
executed this command to get the values:
VAULT_TOKEN=... consul-template.exe -template="config.yml.tpl:config.yml" -vault-addr "http://ip_add.us-west-1.compute.amazonaws.com:8200" -log-level debug
Then I receive these errors:
URL: GET http://ip_add.us-west-1.compute.amazonaws.com:8200/v1/database/creds/readonly
Code: 500. Errors:
* 1 error occurred:
* failed to find entry for connection with name: "test-app"
Any suggestions will be appreciated, thanks!
EDIT: I also tried this command on the server:
vault read database/creds/readonly
It still returns:
* 1 error occurred:
* failed to find entry for connection with name: "test-app"
For those coming to this page via Googling for this error message, this might help:
Unfortunately, the Vault database role's db_name parameter is a bit misleading. The value needs to match a database/config/ entry, not an actual database name per se. The GRANT statement itself is where the database name is relevant; db_name is just a reference to the config name, which may or may not match the database name. (In my case, the config names carry other data, such as an environment prefix in front of the DB name.)
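Applied to the configuration in the question, where the connection was written to database/config/postgresql, the role would presumably need db_name set to that config name rather than the Postgres database name:
// readonlypayload.json -- db_name references the database/config/postgresql entry
{
  "db_name": "postgresql",
  "creation_statements": ["CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";"],
  "default_ttl": "1h",
  "max_ttl": "24h"
}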
In case this issue is not yet resolved:
Vault is not able to find the database 'test-app' in Postgres, or authentication to the database 'test-app' with the given credentials fails, so a connection failure happened.
Log in to Postgres and check whether the database 'test-app' exists by running \l.
For creating roles in Postgres you should use the default database 'postgres'. Try changing the name from 'test-app' to 'postgres' and check.
Change connection_url in payload.json:
"connection_url": "postgresql://{{username}}:{{password}}@for-testing-vault.rds.amazonaws.com:5432/postgres",
I have some problems with the documentation for hood; there is no explanation of what is supposed to be in config.json.
I've tried:
{
  "development": {
    "driver": "postgres",
    "source": "my_development"
  }
}
but I get the following error:
hood db:migrate
2014/06/23 12:53:14 applying migrations...
panic: missing "=" after "my_development" in connection info string"
From the hood documentation:
The driver and source fields are the strings you would pass to the sql.Open(2) function.
So the driver value should be postgres (for your example, the name registered by the lib/pq driver), and the source value should be either a list of key=value pairs or a full connection URI (as described in the PostgreSQL documentation).
Some examples (from here):
postgres://pqgotest:password@localhost/pqgotest?sslmode=verify-full
user=pqgotest dbname=pqgotest sslmode=verify-full
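Putting that together for the config in the question, a working config.json might look like the sketch below; the user, password, and sslmode values are placeholders, not values from the original post.
{
  "development": {
    "driver": "postgres",
    "source": "user=myuser password=mypassword dbname=my_development sslmode=disable"
  }
}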