I am trying out Kafka and have got the io.debezium.connector.sqlserver.SqlServerConnector registered correctly, as the connector status endpoint returns:
{
"name": "test_connector",
"connector": {
"state": "RUNNING",
"worker_id": "kafka-connect:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "kafka-connect:8083"
}
],
"type": "source"
}
When I enable CDC on the table, it appears to work according to the logs:
INFO Snapshot ended with SnapshotResult [status=COMPLETED, offset=SqlServerOffsetContext [sourceInfoSchema=Schema{io.debezium.connector.sqlserver.Source:STRUCT}, sourceInfo=SourceInfo [serverName=TestServer, changeLsn=NULL, commitLsn=0000006d:00000a46:0003, eventSerialNo=null, snapshot=FALSE, sourceTime=2022-02-28T04:51:31.813Z], snapshotCompleted=true, eventSerialNo=1]] (io.debezium.pipeline.ChangeEventSourceCoordinator)
However, edits to the table do not seem to be picked up, even though I can confirm the edits are making it into the CDC change table.
INFO WorkerSourceTask{id=test_connector-0} Either no records were produced by the task since the last offset commit, or every record has been filtered out by a transformation or dropped due to transformation or conversion errors. (org.apache.kafka.connect.runtime.WorkerSourceTask)
Once I confirm this works, my next step is to see whether I can stream the changes using KSQL.
Any suggestions?
I have a JSON document where I need to set the document ID as a combination of two fields.
{
"Event_start_time": "2021-05-16T08:27:21.164Z",
"allbeat": {
"heartbeat": {
"pkt_loss_pct": 0,
"type": "ping",
"bu_id": 1,
"minimum_rtt": 32.248,
"jitter": 0.09999999999999788,
"target_state": "Up",
"average_rtt": 32.35,
"maximum_rtt": 32.436,
"tenant_id": 1,
"target": "google.com",
"port": 0
}
}
}
From the above document, can we set a key as the combination of Event_start_time and allbeat.heartbeat.target using the available SMTs?
There is no available Single Message Transform that I'm aware of that will do this. You could write your own, or you could use stream processing (e.g. Kafka Streams, ksqlDB) to do it; a sketch of the stream-processing approach is shown below.
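For illustration, here is a minimal consume/re-key/produce sketch in Python with confluent_kafka (a plain client-side loop rather than Kafka Streams or ksqlDB); the broker address, group id, and topic names are placeholder assumptions:
import json
from confluent_kafka import Consumer, Producer

# Placeholder broker, group id and topics -- adjust to your environment.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "rekey-heartbeat",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["heartbeat-raw"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Build the new record key from the two fields.
    key = f'{event["Event_start_time"]}_{event["allbeat"]["heartbeat"]["target"]}'
    # Write the unchanged value back out under the combined key.
    producer.produce("heartbeat-rekeyed", key=key, value=msg.value())
    producer.poll(0)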
I am starting to get into Apache Kafka (Confluent) and have some questions regarding the use of schemas.
First, is my general understanding correct that a schema is used for validating the data? My understanding of schemas is that when the data is "produced", the keys and values are checked against the predefined schema and handled accordingly.
My current technical setup is as follows:
Python:
from confluent_kafka import Producer
from config import conf
import json
# create producer
producer = Producer(conf)
producer.produce("datagen-topic", json.dumps({"product":"table","brand":"abc"}))
producer.flush()
In Confluent, I set up a JSON key schema for my topic:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"properties": {
"brand": {
"type": "string"
},
"product": {
"type": "string"
}
},
"required": [
"product",
"brand"
],
"type": "object"
}
Now, when I produce the data, the message in Confluent contains only content in "Value". Key and Header are null:
{
"product": "table",
"brand": "abc"
}
Basically, it doesn't make a difference whether I have this schema set up or not, so I guess it's just not working the way I set it up. Can you help me figure out where my thinking is wrong or what input my code is missing?
The Confluent Python library's Producer class doesn't interact with the Schema Registry in any way, so your message won't be validated.
You'll want to use SerializingProducer like in the example - https://github.com/confluentinc/confluent-kafka-python/blob/master/examples/json_producer.py
If you want non-null keys and headers, you'll need to pass them explicitly to the produce() call.
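For example, a minimal sketch based on that json_producer.py example (the bootstrap servers, Schema Registry URL, key, and header values are placeholder assumptions):
from confluent_kafka import SerializingProducer
from confluent_kafka.serialization import StringSerializer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer

# The same JSON schema you registered, given to the serializer.
schema_str = """
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "brand": {"type": "string"},
    "product": {"type": "string"}
  },
  "required": ["product", "brand"]
}
"""

schema_registry_client = SchemaRegistryClient({"url": "http://localhost:8081"})
json_serializer = JSONSerializer(schema_str, schema_registry_client)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": json_serializer,
})

# Key and headers have to be passed explicitly, otherwise they stay null.
producer.produce(
    topic="datagen-topic",
    key="table-abc",
    value={"product": "table", "brand": "abc"},
    headers=[("source", b"demo")],
)
producer.flush()
With this setup the value is checked against the schema before it is sent; if it doesn't match, produce() raises an error instead of publishing the message.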
I am having an issue with the inline dataset for Common Data Model in Azure Data Factory.
Simply put, everything in ADF appears to connect and read from my manifest file and entity definition, but when I click the "Data preview" button I always get "No output data". I find this bizarre, as the data can be read perfectly when using the CDM connector to the same files in Power BI. What am I doing wrong that stops the data being read into the data preview and the subsequent transformations in the mapping data flow?
My Manifest file looks as below (referring to an example entity):
{
"$schema": "CdmManifest.cdm.json",
"jsonSchemaSemanticVersion": "1.0.0",
"imports": [
{
"corpusPath": "cdm:/foundations.cdm.json"
}
],
"manifestName": "manifestExample",
"explanation": "example",
"entities": [
{
"type": "LocalEntity",
"entityName": "Entityname",
"entityPath": "folder/EntityName.cdm.json/Entityname",
"dataPartitions": [
{
"location": "folder/data/Entityname/Entityname.csv",
"exhibitsTraits": [
{
"traitReference": "is.partition.format.CSV",
"arguments": [
{
"name": "columnHeaders",
"value": "true"
},
{
"name": "delimiter",
"value": ","
}
]
}
]
}
]
},
...
I am having exactly the same output message, "No output data". I am using the model.json format, not a manifest. If I sink the source, it moves no data but reports no error. My CDM originates from a Power BI dataflow. PowerApps works fine, but historization and privileges make it useless.
Edit:
On Microsoft's info about this preview feature we can find this screen. My guess is that the CDM the ADF source expects is not the same as the one that originates from Power BI.
We are working on implementing a custom logging solution. Most of the information we need is already present in Log Analytics from the Data Factory analytics solution, but getting log info on data flows is a challenge. When querying, we get this error in the output: "Too large to parse".
Since data flows are a complex and critical piece of a pipeline, we desperately need data like rows copied, skipped, read, etc. for each activity within the data flow. Can you please help with how to get that information?
You can get the same information shown in the ADF portal UI by making a POST request to the REST endpoint below. You can find more information, and read about authentication, at the following link: https://learn.microsoft.com/en-us/rest/api/datafactory/pipelineruns/querybyfactory
You can choose to query by factory or for a specific pipeline run id depending on your needs.
https://management.azure.com/subscriptions/<subscription id>/resourcegroups/<resource group name>/providers/Microsoft.DataFactory/factories/<ADF resource Name>/pipelineruns/<pipeline run id>/queryactivityruns?api-version=2018-06-01
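For example, a minimal Python sketch of that call (the subscription, resource group, factory, and run id placeholders are kept from the URL above; authentication via azure-identity is just one of the options described in the linked docs):
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription id>"
resource_group = "<resource group name>"
factory_name = "<ADF resource Name>"
pipeline_run_id = "<pipeline run id>"

# Acquire a management-plane token (one of several possible auth methods).
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourcegroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}/pipelineruns/{pipeline_run_id}"
    f"/queryactivityruns?api-version=2018-06-01"
)

# The API requires a time window for the query.
body = {
    "lastUpdatedAfter": "2020-07-28T00:00:00Z",
    "lastUpdatedBefore": "2020-07-29T00:00:00Z",
}

response = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
for activity in response.json().get("value", []):
    # Data flow activities carry per-stage metrics (rows, bytes, timings) in their output.
    print(activity.get("activityName"), activity.get("output"))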
Below is an example of the data you can get from one stage:
{
"stage": 7,
"partitionTimes": [
950
],
"lastUpdateTime": "2020-07-28 18:24:55.604",
"bytesWritten": 0,
"bytesRead": 544785954,
"streams": {
"CleanData": {
"type": "select",
"count": 241231,
"partitionCounts": [
950
],
"cached": false
},
"ProductData": {
"type": "source",
"count": 241231,
"partitionCounts": [
950
],
"cached": false
}
},
"target": "MergeWithDeltaLakeTable",
"time": 67589,
"progressState": "Completed"
}
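To pull the row counts the question asks about out of that structure, a small Python sketch (assuming the stage object above has been parsed from the activity output; trimmed here to the fields used):
import json

# The stage object from the example above, trimmed to the fields used here.
stage = json.loads("""{
  "target": "MergeWithDeltaLakeTable",
  "streams": {
    "CleanData":   {"type": "select", "count": 241231},
    "ProductData": {"type": "source", "count": 241231}
  }
}""")

for name, stream in stage["streams"].items():
    print(f'{stage["target"]} / {name} ({stream["type"]}): {stream["count"]} rows')
# MergeWithDeltaLakeTable / CleanData (select): 241231 rows
# MergeWithDeltaLakeTable / ProductData (source): 241231 rows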
Reading the Concourse documentation about Implementing a Resource Type, specifically what the check, in, and out scripts must emit, it is not clear why this output is needed or how Concourse uses it. My questions are:
1) How does Concourse use the output of the check script, the in script, and the out script?
2) And why is it required that the in and out scripts emit the version? What happens if they don't?
For context, here are the relevant parts of the documentation:
1) For the check script:
...[it] must print the array of new versions, in chronological order,
to stdout, including the requested version if it's still valid.
For example:
[
{ "ref": "61cbef" },
{ "ref": "d74e01" },
{ "ref": "7154fe" }
]
2) For the in script:
The script must emit the fetched version, and may emit metadata as a list of key-value pairs. This data is intended for public consumption and will make it upstream, intended to be shown on the build's page.
For example:
{
"version": { "ref": "61cebf" },
"metadata": [
{ "name": "commit", "value": "61cebf" },
{ "name": "author", "value": "Hulk Hogan" }
]
}
3) Similar to the in script, the out script:
The script must emit the resulting version of the resource. For
example, the git resource emits the sha of the commit that it just
pushed.
For example:
{
"version": { "ref": "61cebf" },
"metadata": [
{ "name": "commit", "value": "61cebf" },
{ "name": "author", "value": "Mick Foley" }
]
}
Concourse uses the check result to detect whether a new version of the resource is available. Depending on your pipeline definition, the presence of a new version can trigger a job. The in script is then used to fetch that specific version, using parameters provided by the pipeline, whilst the out script takes care of writing (pushing) a new version.
As your in is going to use the information provided by check, you may want to use a similar structure, but you're not obliged to. It is useful to emit the same version information from check/in/out so that it can be logged and you can tell which version of each resource a given build in your pipeline used; a minimal check sketch is shown below.
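For illustration, here is a minimal sketch of a check script written in Python (resource scripts can be any executable; real resources usually use shell, and the version list here is made up): it reads the {"source": ..., "version": ...} payload Concourse passes on stdin and prints the array of versions to stdout, which is exactly what Concourse consumes when deciding whether to trigger jobs.
#!/usr/bin/env python3
# Sketch of /opt/resource/check for a hypothetical resource type.
import json
import sys

payload = json.load(sys.stdin)
source = payload.get("source", {})   # resource configuration from the pipeline
current = payload.get("version")     # last version Concourse saw, or None on first check

# Hypothetical: pretend these versions were discovered in the external system, oldest first.
all_versions = [{"ref": "61cbef"}, {"ref": "d74e01"}, {"ref": "7154fe"}]

if current in all_versions:
    # Emit the requested version (if still valid) plus everything newer, in order.
    new_versions = all_versions[all_versions.index(current):]
else:
    # First check, or the old version disappeared: emit only the latest.
    new_versions = all_versions[-1:]

json.dump(new_versions, sys.stdout)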