Druid spec to convert one datatype into another

I want a Druid spec that will convert the datatype of a particular column from one datatype to another.

Use the spec below; it works as an example:
"dimensionsSpec": {
"dimensions": [
"mtime",
"name1",
"name2",
{
"type": "long",
"name": "sal"
},
{
"type": "long",
"name": "amount"
},
"response",
{
"type": "long",
"name": "response_id"
},
"venue_name"
]
}
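Plain strings in the dimensions list are ingested as string dimensions; wrapping a name in an object with "type": "long" (or "float"/"double") is what makes Druid ingest that column as a numeric dimension. For context, here is a rough sketch of where the dimensionsSpec sits inside a native batch ingestion spec; the datasource name, input paths, and everything outside the dimensionsSpec are illustrative assumptions, not part of the original question:

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "timestampSpec": { "column": "mtime", "format": "auto" },
      "dimensionsSpec": {
        "dimensions": [
          "name1",
          { "type": "long", "name": "sal" }
        ]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}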

Related

Can I use a single s3-sink connector to point at the same timestamp field name when different topics use different Avro schemas?

schema for topic t1
{
  "type": "record",
  "name": "Envelope",
  "namespace": "t1",
  "fields": [
    {
      "name": "before",
      "type": [
        "null",
        {
          "type": "record",
          "name": "Value",
          "fields": [
            {
              "name": "id",
              "type": {
                "type": "long",
                "connect.default": 0
              },
              "default": 0
            },
            {
              "name": "createdAt",
              "type": [
                "null",
                {
                  "type": "string",
                  "connect.version": 1,
                  "connect.name": "io.debezium.time.ZonedTimestamp"
                }
              ],
              "default": null
            }
          ],
          "connect.name": "t1.Value"
        }
      ],
      "default": null
    },
    {
      "name": "after",
      "type": [
        "null",
        "Value"
      ],
      "default": null
    }
  ],
  "connect.name": "t1.Envelope"
}
schema for topic t2
{
  "type": "record",
  "name": "Value",
  "namespace": "t2",
  "fields": [
    {
      "name": "id",
      "type": {
        "type": "long",
        "connect.default": 0
      },
      "default": 0
    },
    {
      "name": "createdAt",
      "type": [
        "null",
        {
          "type": "string",
          "connect.version": 1,
          "connect.name": "io.debezium.time.ZonedTimestamp"
        }
      ],
      "default": null
    }
  ],
  "connect.name": "t2.Value"
}
s3-sink Connector configuration
connector.class=io.confluent.connect.s3.S3SinkConnector
behavior.on.null.values=ignore
s3.region=us-west-2
partition.duration.ms=1000
flush.size=1
tasks.max=3
timezone=UTC
topics.regex=t1,t2
aws.secret.access.key=******
locale=US
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
value.converter.schemas.enable=false
name=s3-sink-connector
aws.access.key.id=******
errors.tolerance=all
value.converter=org.apache.kafka.connect.json.JsonConverter
storage.class=io.confluent.connect.s3.storage.S3Storage
key.converter=org.apache.kafka.connect.storage.StringConverter
s3.bucket.name=s3-sink-connector-bucket
path.format=YYYY/MM/dd
timestamp.extractor=RecordField
timestamp.field=after.createdAt
With this connector configuration I get an error for topic t2: "createdAt field does not exist".
If I set timestamp.field=createdAt instead, the same error is thrown for topic t1.
How can I point at the "createdAt" field in both schemas at the same time with one connector for both topics?
Is this possible with a single s3-sink connector configuration, and if so, which properties do I need to use?
If there is another way to do this, please suggest that as well.
All topics will need the same timestamp field; there's no way to configure topic-to-field mappings.
Your t2 schema doesn't have an after field, so you need to run two separate connectors.
The field is also required to be present in all records, otherwise the partitioner won't work.
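A rough sketch of the two-connector split, assuming the connectors are created through Connect's REST API with JSON configs: only the properties that differ are shown, the connector names are made up, and everything else stays the same as the configuration above.

{
  "name": "s3-sink-connector-t1",
  "config": {
    "topics": "t1",
    "timestamp.extractor": "RecordField",
    "timestamp.field": "after.createdAt"
  }
}

{
  "name": "s3-sink-connector-t2",
  "config": {
    "topics": "t2",
    "timestamp.extractor": "RecordField",
    "timestamp.field": "createdAt"
  }
}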

ADF Copy activity delimited to parquet data type mapping with extra column

I'm trying to use a Copy activity in ADF to copy data from a CSV file to a Parquet file. I can get the column name mappings to work, and I can mostly get the data types to map successfully. However, I am adding a dynamic column called LoadDate that is created from an expression in ADF, and I can't seem to get it to map correctly. Looking at the Parquet file that is output, the "date" column that comes from the delimited file gets the type INT96, which Azure Databricks correctly reads as a date; the LoadDate column, which is generated from an expression, instead gets the type BYTE_ARRAY.
I just can't seem to get it to output the extra column in the correct format. Any help would be appreciated.
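For reference, a dynamic column like LoadDate is typically injected through the copy activity source's additionalColumns setting; the question doesn't show the exact expression used, so the one below is only an assumed sketch:

"source": {
  "type": "DelimitedTextSource",
  "additionalColumns": [
    {
      "name": "LoadDate",
      "value": {
        "value": "@utcnow()",
        "type": "Expression"
      }
    }
  ]
}

Columns added this way come through as string values, which may be why the mapping to INT96 doesn't take effect for LoadDate even though it does for the "date" column read from the file.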
Below is the mapping section of my JSON.
"mappings": [
{
"source": {
"name": "Date",
"type": "DateTime",
"physicalType": "String"
},
"sink": {
"name": "Date",
"type": "DateTime",
"physicalType": "INT_96"
}
},
{
"source": {
"name": "Item",
"type": "String",
"physicalType": "String"
},
"sink": {
"name": "Item",
"type": "String",
"physicalType": "UTF8"
}
},
{
"source": {
"name": "Opt",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "Opt",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "Branch",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "Branch",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "QTY",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "QTY",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "LoadDate",
"type": "DateTime",
"physicalType": "String"
},
"sink": {
"name": "LoadDate",
"type": "DateTime",
"physicalType": "INT_96"
}
}
]

Kafka Connect - JDBC Avro connector: how to define a custom schema registry

I was following a tutorial on Kafka Connect, and I am wondering if there is a possibility to define a custom schema registry for a topic whose data comes from a MySQL table.
I can't find where to define it in my JSON connect config, and I don't want to create a new version of that schema after creating it.
My MySQL table, called stations, has this schema:
Field          | Type
---------------+-------------
code           | varchar(4)
date_measuring | timestamp
attributes     | varchar(256)
where attributes contains JSON data rather than a plain string (I have to use that type because the JSON fields inside attributes are variable).
My connector is
{
  "value.converter.schema.registry.url": "http://localhost:8081",
  "_comment": "The Kafka topic will be made up of this prefix, plus the table name ",
  "key.converter.schema.registry.url": "http://localhost:8081",
  "name": "jdbc_source_mysql_stations",
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "transforms": [
    "ValueToKey"
  ],
  "transforms.ValueToKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
  "transforms.ValueToKey.fields": [
    "code",
    "date_measuring"
  ],
  "connection.url": "jdbc:mysql://localhost:3306/db_name?useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC",
  "connection.user": "confluent",
  "connection.password": "**************",
  "table.whitelist": [
    "stations"
  ],
  "mode": "timestamp",
  "timestamp.column.name": [
    "date_measuring"
  ],
  "validate.non.null": "false",
  "topic.prefix": "mysql-"
}
and creates this schema
{
  "subject": "mysql-stations-value",
  "version": 1,
  "id": 23,
  "schema": "{\"type\":\"record\",\"name\":\"stations\",\"fields\":[{\"name\":\"code\",\"type\":\"string\"},{\"name\":\"date_measuring\",\"type\":{\"type\":\"long\",\"connect.version\":1,\"connect.name\":\"org.apache.kafka.connect.data.Timestamp\",\"logicalType\":\"timestamp-millis\"}},{\"name\":\"attributes\",\"type\":\"string\"}],\"connect.name\":\"stations\"}"
}
Where "attributes" field is of course a String.
Unlike I would apply it this other schema.
{
  "type": "record",
  "name": "stations",
  "namespace": "com.mycorp.mynamespace",
  "fields": [
    {
      "name": "code",
      "type": "string"
    },
    {
      "name": "date_measuring",
      "type": {
        "type": "long",
        "connect.version": 1,
        "connect.name": "org.apache.kafka.connect.data.Timestamp",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "attributes",
      "type": {
        "type": "record",
        "name": "AttributesRecord",
        "fields": [
          { "name": "H1", "type": "long", "default": 0 },
          { "name": "H2", "type": "long", "default": 0 },
          { "name": "H3", "type": "long", "default": 0 },
          { "name": "H", "type": "long", "default": 0 },
          { "name": "Q", "type": "long", "default": 0 },
          { "name": "P1", "type": "long", "default": 0 },
          { "name": "P2", "type": "long", "default": 0 },
          { "name": "P3", "type": "long", "default": 0 },
          { "name": "P", "type": "long", "default": 0 },
          { "name": "T", "type": "long", "default": 0 },
          { "name": "Hr", "type": "long", "default": 0 },
          { "name": "pH", "type": "long", "default": 0 },
          { "name": "RX", "type": "long", "default": 0 },
          { "name": "Ta", "type": "long", "default": 0 },
          { "name": "C", "type": "long", "default": 0 },
          { "name": "OD", "type": "long", "default": 0 },
          { "name": "TU", "type": "long", "default": 0 },
          { "name": "MO", "type": "long", "default": 0 },
          { "name": "AM", "type": "long", "default": 0 },
          { "name": "N03", "type": "long", "default": 0 },
          { "name": "P04", "type": "long", "default": 0 },
          { "name": "SS", "type": "long", "default": 0 },
          { "name": "PT", "type": "long", "default": 0 }
        ]
      }
    }
  ]
}
Any suggestions, please?
If it's not possible, I suppose I'll have to create a Kafka Streams application to produce another topic, even though I'd rather avoid that.
Thanks in advance!
I don't think you're really asking about using a "custom" registry (which you'd do with the two lines that say which registry you're using), but rather how you can parse the data / apply a schema after the record is pulled from the database.
You can write your own Transform, or you can use Kafka Streams; those are really the main options here. There is a SetSchemaMetadata transform, but I'm not sure that'll do what you want (parse a string into an Avro record).
Or, if you must shove JSON data into a single database attribute, maybe you shouldn't use MySQL, and should instead use a document database, which has more flexible data constraints.
Otherwise, you can use BLOB rather than varchar and put binary Avro data into that column, but then you'd still need a custom deserializer to read the data.
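If you do go the custom Transform route, it would be wired into the source connector config roughly as below. The transform class and its field property are hypothetical (you would implement and package the SMT yourself), and it would sit alongside the existing ValueToKey entry in the transforms list:

"transforms": "parseAttributes,ValueToKey",
"transforms.parseAttributes.type": "com.example.ParseJsonAttributes",
"transforms.parseAttributes.field": "attributes"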

AVRO schema with optional record

Hi folks, I need to create an Avro schema for the following example:
{ "Car" : { "Make" : "Ford" , "Year": 1990 , "Engine" : "V8" , "VIN" : "123123123" , "Plate" : "XXTT9O",
"Accident" : { "Date" :"2020/02/02" , "Location" : "NJ" , "Driver" : "Joe" } ,
"Owner" : { "Name" : "Joe" , "LastName" : "Doe" } }
Accident and Owner are optional objects, and the schema also needs to validate the following subset message:
{ "Car" : { "Make" : "Tesla" , "Year": 2020 , "Engine" : "4ELEC" , "VIN" : "54545426" , "Plate" : "TESLA" }
I read the Avro specs and saw a lot of optional attribute and array examples, but none of them worked for a record. How can I define a record as optional? Thanks.
The following schema, without any optional attributes, works:
{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "Car",
      "type": {
        "name": "Car",
        "type": "record",
        "fields": [
          { "name": "Make", "type": "string" },
          { "name": "Year", "type": "int" },
          { "name": "Engine", "type": "string" },
          { "name": "VIN", "type": "string" },
          { "name": "Plate", "type": "string" },
          {
            "name": "Accident",
            "type": {
              "name": "Accident",
              "type": "record",
              "fields": [
                { "name": "Date", "type": "string" },
                { "name": "Location", "type": "string" },
                { "name": "Driver", "type": "string" }
              ]
            }
          },
          {
            "name": "Owner",
            "type": {
              "name": "Owner",
              "type": "record",
              "fields": [
                { "name": "Name", "type": "string" },
                { "name": "LastName", "type": "string" }
              ]
            }
          }
        ]
      }
    }
  ]
}
When I change the Owner object as suggested, avro-tools returns an error.
{ "name": "Owner",
"type": [
"null",
"record" : {
"name": "Owner",
"fields": [
{"name": "Name", "type": "string" },
{"name": "LastName", "type": "string" }
]
}
] , "default": null }
]
}
}
]
}
Test:
Projects/avro_test$ java -jar avro-tools-1.8.2.jar fromjson --schema-file CarStackOver.avsc Car.json > o2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" org.apache.avro.SchemaParseException: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)): was expecting comma to separate ARRAY entries
at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream#4034c28c; line: 26, column: 13]
at org.apache.avro.Schema$Parser.parse(Schema.java:1034)
at org.apache.avro.Schema$Parser.parse(Schema.java:1004)
at org.apache.avro.tool.Util.parseSchemaFromFS(Util.java:165)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:83)
at org.apache.avro.tool.Main.run(Main.java:87)
at org.apache.avro.tool.Main.main(Main.java:76)
Caused by: org.codehaus.jackson.JsonParseException: Unexpected character (':' (code 58)): was expecting comma to separate ARRAY entries
at [Source: org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream#4034c28c; line: 26, column: 13]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:482)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:222)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:224)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:197)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeArray(JsonNodeDeserializer.java:224)
at org.codehaus.jackson.map.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:200)
at org.codehaus.jackson.map.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:58)
at org.codehaus.jackson.map.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2704)
at org.codehaus.jackson.map.ObjectMapper.readTree(ObjectMapper.java:1344)
at org.apache.avro.Schema$Parser.parse(Schema.java:1032)
You can make a record optional by putting it in a union with null, like this:
{
  "name": "Owner",
  "type": [
    "null",
    {
      "name": "Owner",
      "type": "record",
      "fields": [
        { "name": "Name", "type": "string" },
        { "name": "LastName", "type": "string" }
      ]
    }
  ],
  "default": null
},
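One caveat with the avro-tools fromjson test above: once Owner (and Accident) are unions, Avro's JSON encoding expects the non-null branch to be named explicitly, and absent values to be spelled out as null. A sketch of what the input record might then need to contain (whether the simple name Owner or the full name com.acme.avro.Owner is expected depends on the Avro version):

"Owner": { "com.acme.avro.Owner": { "Name": "Joe", "LastName": "Doe" } }

and, when the owner is absent:

"Owner": null

Note also that defaults apply to schema resolution rather than to JSON decoding, so the subset message will likely still need explicit "Accident": null and "Owner": null entries for fromjson to accept it.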

How do I COPY a nested Avro field to Redshift as a single field?

I have the following Avro schema for a record, and I'd like to issue a COPY to Redshift:
"fields": [{
"name": "id",
"type": "long"
}, {
"name": "date",
"type": {
"type": "record",
"name": "MyDateTime",
"namespace": "com.mynamespace",
"fields": [{
"name": "year",
"type": "int"
}, {
"name": "monthOfYear",
"type": "int"
}, {
"name": "dayOfMonth",
"type": "int"
}, {
"name": "hourOfDay",
"type": "int"
}, {
"name": "minuteOfHour",
"type": "int"
}, {
"name": "secondOfMinute",
"type": "int"
}, {
"name": "millisOfSecond",
"type": ["int", "null"],
"default": 0
}, {
"name": "zone",
"type": {
"type": "string",
"avro.java.string": "String"
},
"default": "America/New_York"
}],
"noregistry": []
}
}]
I want to condense the MyDateTime object into a single column in Redshift during the COPY. I saw that you can map nested JSON data to a top-level column (https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-json-jsonpaths), but I haven't figured out a way to concatenate the fields directly in the COPY command.
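For what it's worth, a JSONPaths file (which COPY also accepts for Avro input) can only map one source path to one target column, roughly like the sketch below; it has no way to combine several paths into a single column, which is the limitation here:

{
  "jsonpaths": [
    "$.id",
    "$.date.year"
  ]
}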
In other words, is there a way to convert the following record (originally in Avro format)
{
  "id": 6,
  "date": {
    "year": 2010,
    "monthOfYear": 10,
    "dayOfMonth": 12,
    "hourOfDay": 14,
    "minuteOfHour": 26,
    "secondOfMinute": 42,
    "millisOfSecond": {
      "int": 0
    },
    "zone": "America/New_York"
  }
}
Into a row in Redshift that looks like:
id | date
---------------------------------------------
6 | 2010-10-12 14:26:42:000 America/New_York
I'd like to do this directly with COPY
You would need to declare the Avro file(s) as a Redshift Spectrum external table and then use a query over that table to insert the data into the local Redshift table.
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html