I have a Java-based web application that uses two Microsoft SQL Server backend databases: a live transactional database and a reporting database. The reporting database lags the transactional one by around 30 minutes; incremental data is loaded by a SQL job that runs every 30 minutes and takes around 20-25 minutes to execute. The job runs an SSIS package, which further processes the data from the reporting database and stores it in HDFS and HBase, where it is eventually used for analytics.
Now I want to reduce this lag, and to do so I am thinking of implementing a messaging framework. After some research, I learned that Kafka could serve my purpose, since it can work as an ETL tool as well as a messaging framework.
How should I proceed? Should I create topics that mirror the table structures in SQL Server and operate on those? Should I change my application to write every change to Kafka first and then to the transactional database? Please advise on how to use Kafka for this use case.
There are a couple of ways to do this that require minimal code, and then there's always the option of writing your own code.
(Some coworkers just finished looking at this with SQL Server and Oracle, so I know a little about it.)
If you're using the Enterprise edition of SQL Server, you could use Change Data Capture and Confluent Kafka Connect to read all the changes to the data. This (seems to) require an Enterprise license and may carry some additional cost (I was fuzzy on the details here; this may have been because we're using an older version of SQL Server or because we have many database servers).
If you aren't using, or can't use, the CDC stuff, Kafka Connect's JDBC support also has a mode where it polls the database for changes. This works best if your records have some kind of timestamp column, which is usually the case.
A poll-only mode without CDC means you won't get every change. For example, if you poll every 30 seconds and a record changes twice in that window, you won't get a separate message for each change; you'll get one message reflecting both changes, if that makes sense. This is probably acceptable for your business domain, but it's something to be aware of.
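To make the coalescing concrete, here is a small simulation of timestamp-based polling. It doesn't touch Kafka or a real database; `Row`, `poll_changes`, and the integer timestamps are all made up for illustration. It only shows why two updates between polls come out as a single record.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    status: str
    updated_at: int  # stands in for a "last modified" timestamp column


def poll_changes(table, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    changed = [row for row in table.values() if row.updated_at > last_seen]
    new_mark = max((row.updated_at for row in changed), default=last_seen)
    return changed, new_mark


table = {1: Row(1, "NEW", updated_at=0)}
_, mark = poll_changes(table, last_seen=-1)  # first poll picks up the initial row

# Two updates land before the next poll; the second overwrites the first.
table[1] = Row(1, "PAID", updated_at=10)
table[1] = Row(1, "SHIPPED", updated_at=20)

changed, mark = poll_changes(table, mark)  # next poll
print([(row.id, row.status) for row in changed])  # the "PAID" state is never seen
```

If the intermediate "PAID" state matters downstream, that's the case for log-based CDC rather than polling.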
Anyway, Kafka Connect is pretty cool - it will auto create Kafka topics for you based on your table names, including posting the Avro schemas to Schema Registry. (The topic names are knowable, so if you're in an environment with auto topic creation = false, well you can create the topics manually yourself based on the table names). Starting from no Kafka Connect knowledge it took me maybe 2 hours to figure out enough of the configuration to dump a large SQL Server database to Kafka.
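For a rough idea of what that configuration looks like, here's a sketch that builds a JDBC source connector config as a Python dict and works out the topic names you'd need to pre-create. The property names are the ones I remember from the Confluent JDBC source connector, so check them against the docs for your version; the connection URL, table names, column name, and prefix are placeholders I made up.

```python
import json


def jdbc_source_config(name, connection_url, tables, timestamp_column, topic_prefix):
    """Build a Kafka Connect JDBC source config (property names as I recall them)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": connection_url,
            "table.whitelist": ",".join(tables),
            "mode": "timestamp",  # poll for rows whose timestamp column advanced
            "timestamp.column.name": timestamp_column,
            "topic.prefix": topic_prefix,
            "poll.interval.ms": "30000",
        },
    }


def expected_topics(connector):
    """Topics are named prefix + table name, so they can be created up front."""
    cfg = connector["config"]
    return [cfg["topic.prefix"] + table for table in cfg["table.whitelist"].split(",")]


# Placeholder values, not taken from any real environment.
example = jdbc_source_config(
    name="sqlserver-source",
    connection_url="jdbc:sqlserver://localhost:1433;databaseName=Sales",
    tables=["Orders", "LineItems"],
    timestamp_column="ModifiedAt",
    topic_prefix="sqlserver-",
)
print(json.dumps(example, indent=2))
print(expected_topics(example))
```

The printed JSON is the shape you'd submit to the Kafka Connect REST interface to create the connector.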
I found additional documentation in a Github repository of a Confluent employee describing all this, with documentation of the settings, etc.
There's always the option of having your web app be a Kafka producer itself and ignoring the lower-level database stuff. This may be a better solution if, for example, a request creates a number of records across the data store but is really one related event (an Order may spawn off some LineItem records in your relational database, but the downstream database only cares that an order was made).
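As a sketch of that idea, here's how the app could build one "order placed" message instead of one message per row. The `OrderPlaced` and `LineItem` classes and the JSON encoding are my own assumptions; the actual send call depends on whichever Kafka client you use, so it's left out.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class LineItem:
    sku: str
    quantity: int


@dataclass
class OrderPlaced:
    order_id: str
    customer_id: str
    items: List[LineItem]


def to_message(event):
    """Serialize the whole order as one key/value pair for a single topic."""
    key = event.order_id.encode("utf-8")  # same order -> same key -> same partition
    value = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return key, value


event = OrderPlaced("o-1", "c-42", [LineItem("sku-a", 2), LineItem("sku-b", 1)])
key, value = to_message(event)
print(key, value)
```

The relational database still gets its Order and LineItem rows; downstream consumers only ever see the one event.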
On the consumer end (ie "next to" your other database) you could either use Kafka Connect on the other end to pick up changes, maybe even writing a custom plugin if required, or write your own Kafka consumer microservice to put the changes into the other database.
Related
As the title says, I have 2 separate servers and I want both connectors to read from the same source and write to their respective topics. A single connector works well. When I create another one on a different server, both seem to be running, but no data flows through either of them.
My question is: is it possible to run 2 Debezium connectors that read from the same source? I couldn't find any information about this in the documentation.
Edit: I've tested it with an Oracle database and have never seen it work well. I definitely wouldn't recommend it, especially with Oracle.
So generally speaking, Debezium does not recommend using multiple connectors per database source and prefers that you adjust your connector configuration instead. We understand that isn't always possible when you have different business use cases at play.
That said, it's important that if you do deploy multiple connectors you properly configure each connector so that it doesn't share state such as the same database history topic, etc.
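A quick sanity check along those lines might look like the sketch below: it compares two connector configs and reports any state-bearing setting they have in common. The list of keys is my assumption about which settings carry per-connector state (the history topic setting has different names across Debezium versions), so confirm it against the documentation for your connector and version.

```python
# Settings assumed to hold per-connector state or identity; verify for your version.
STATE_KEYS = (
    "name",
    "database.server.name",                 # logical name in older releases
    "topic.prefix",                         # logical name in newer releases
    "database.history.kafka.topic",         # history topic, older naming
    "schema.history.internal.kafka.topic",  # history topic, newer naming
)


def shared_state(config_a, config_b):
    """Return the state-related keys whose values are identical in both configs."""
    return sorted(
        key
        for key in STATE_KEYS
        if key in config_a and key in config_b and config_a[key] == config_b[key]
    )


# Hypothetical configs for two connectors reading the same database.
first = {
    "name": "inventory-a",
    "topic.prefix": "inv_a",
    "schema.history.internal.kafka.topic": "schema-history.inventory",
}
second = {
    "name": "inventory-b",
    "topic.prefix": "inv_b",
    "schema.history.internal.kafka.topic": "schema-history.inventory",
}
print(shared_state(first, second))  # a non-empty list means the connectors collide
```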
For certain database platforms, such as MySQL, having multiple source connectors doesn't place any real burden on the database. But on other databases, like Oracle, running multiple connectors can have a pretty substantial impact.
When an Oracle connector streams changes, it starts an Oracle LogMiner mining session. This session is responsible for loading, reading, parsing, and preparing the contents of the data read in a special in-memory table that the connector uses to generate change events. When you run multiple connectors, you will have concurrent Oracle LogMiner sessions, and each session will consume its own share of PGA memory to support the steps taken by Oracle LogMiner. Depending on your database's volatility, this can be stressful on the database server, since Oracle specifically assigns one LogMiner session to a CPU.
For an Oracle environment, I highly recommend you avoid using multiple connectors unless you need to stream changes from different PDBs within the same instance, since there is really no technical reason to read, load, parse, and generate change data for the same redo entries multiple times, once per connector deployment.
I need to occasionally send a record from an Oracle DB to a Kafka topic. Is there a way to do so using a procedure or similar, without extra tools?
You have several options, detailed in this blog and this talk.
tl;dr:
"Query-based" CDC, poll the database for changes, using Kafka Connect JDBC Source connector
"Log-based" CDC, use the redo log to capture every change into Kafka, using GoldenGate / Qlik Attunity / etc etc etc
There's also a REST API if you want the lowest footprint, and then you just push records from a PL/SQL REST call over to Kafka. Probably fine for small infrequent volumes but it's going to be inflexible and brittle in the long run.
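To show what such a REST call has to contain, here's the request built in Python (the PL/SQL side would need to produce the same URL, header, and body). I'm assuming the Confluent REST Proxy v2 JSON format as I remember it; the host, port, topic, and record contents are placeholders. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_produce_request(proxy_url, topic, records):
    """Build (but do not send) a produce request in REST Proxy v2 JSON format."""
    body = json.dumps({"records": [{"value": record} for record in records]})
    return urllib.request.Request(
        url=f"{proxy_url}/topics/{topic}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


# Placeholder proxy address and payload.
request = build_produce_request(
    "http://localhost:8082", "accounts", [{"account_id": 1, "status": "OPEN"}]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# urllib.request.urlopen(request) would send it, given a running proxy.
```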
We are in the process of designing an ETL process, where we’ll be getting a daily account file (maybe half a million records, could grow) from client and we’ll be loading that file to our database.
Our current process splits the file into smaller files and loads them to staging. Sometimes, or if the process fails, we try to figure out how many records we have processed and then start again from that point. Is there a better alternative?
We are thinking about using Kafka. I'm pretty new to Kafka. I would really appreciate some feedback on whether Kafka is the way to go, or whether it's overkill for a simple ETL process where we just load the data to a staging table and finally to the destination table.
Apache Kafka® is a distributed streaming platform. What exactly does
that mean?
A streaming platform has three key capabilities:
Publish and subscribe to streams of records, similar to a message
queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
Building real-time streaming data pipelines that reliably get data
between systems or applications
Building real-time streaming applications that transform or react to
the streams of data
https://kafka.apache.org/intro
If you run into errors that force you to check the last committed record in your staging database, and you need the system to manage this automatically, Kafka can help ease the process.
Though Kafka is built to work with massive data loads spread across a cluster, you certainly can use it for smaller problems and take advantage of its queuing functionality and offset management, even with one broker (server) and a low number of partitions (level of parallelism).
If you don't anticipate any scale at all, I would suggest you consider RabbitMQ.
RabbitMQ is a message-queueing software also known as a message
broker or queue manager. Simply said; it is software where queues are
defined, to which applications connect in order to transfer a message
or messages.
https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html
“How to know if Apache Kafka is right for you” by Amit Rathi
https://link.medium.com/enGzNaNvT4
In case you choose Kafka:
When you receive a file, create a process which iterates over all its lines and sends them to Kafka (Kafka Producer).
Create another process which continuously receives events from Kafka (Kafka Consumer) and writes them in mini batches to the database (similar to your small files).
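Here's a toy version of those two processes, to show why offsets remove the "how far did we get?" guesswork. A plain Python list stands in for the Kafka topic and a dict for the committed offset; `FlakyDb` is a made-up database that fails once. None of this is a Kafka client API, it's just the idea.

```python
def produce(log, lines):
    """Producer side: append every line of the file to the log (the 'topic')."""
    for line in lines:
        log.append(line)


def consume(log, state, batch_size, write_batch):
    """Consumer side: write mini batches, committing the offset after each success."""
    while state["offset"] < len(log):
        batch = log[state["offset"]:state["offset"] + batch_size]
        write_batch(batch)              # raises on failure, offset stays untouched
        state["offset"] += len(batch)   # "commit" only after the write succeeded


class FlakyDb:
    """Hypothetical database that fails on one specific write call."""

    def __init__(self, fail_on_call):
        self.rows, self.calls, self.fail_on_call = [], 0, fail_on_call

    def write_batch(self, batch):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise RuntimeError("simulated database outage")
        self.rows.extend(batch)


log, state, db = [], {"offset": 0}, FlakyDb(fail_on_call=2)
produce(log, [f"account-{i}" for i in range(5)])
try:
    consume(log, state, 2, db.write_batch)  # second batch fails
except RuntimeError:
    pass
offset_after_failure = state["offset"]
consume(log, state, 2, db.write_batch)      # restart resumes at the committed offset
print(offset_after_failure, state["offset"], db.rows)
```

With a real database a batch can fail halfway through, so you'd still want the write to be transactional or idempotent; the sketch assumes a failed batch writes nothing.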
Setup Kafka:
https://dzone.com/articles/kafka-setup
Kafka Consumer/Producer simple example:
http://www.stackframelayout.com/programowanie/kafka-simple-producer-consumer-example/
Don't assume importing data is as easy as dumping it in your database and having the computer handle all the processing work. As you've discovered, an automated load can have problems.
First, database ELT processes depreciate the hard drive. Do not stage the data into one table prior to inserting it in its native table. Your process should only import the data one time to its native table to protect hardware.
Second, you don't need third-party software to middle-man the work. You need control so you're not manually inspecting what was inserted. This means your process is to first clean / transform the data prior to import. You want to prevent all problems prior to load by cleaning and structuring and even processing the data. The load should only be an SQL insert script. I have torn apart many T-SQL scripts where someone thought it convenient to integrate processing with database commands. Don't do it.
Here's how I manage imports from spreadsheet reports. Excel formulas are better than learning ETL tools like SSIS. I use cell formulas to validate whether the record is valid to go into our system. This result is its own column, and then if that column is true, a concatenation column displays an insert script.
=if(J1, concatenate("('", A1, "', ", B1, "),"), "")
If the column is false, the concat column shows nothing. This allows me to copy/paste the inserts into SSMS and conduct mass inserts via "insert into table values" scripts.
If this is actually updating existing records, as your comment appears to suggest, then you need to master the data, organizing what's changed in logs for your users.
Synchronization steps:
1. Log what is there before you update
2. Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
3. Log what you're updating with, and timestamp when you're updating it
4. Save and close the logs
5. Only when 1-4 are done should you post an update to production
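The compare step in the list above can be sketched as a keyed diff that sorts records into creates, updates, and deletes. The dict-of-dicts layout and the `diff_records` name are just for illustration; the "before" values are kept so they can go into the log.

```python
def diff_records(current, incoming):
    """Compare two keyed snapshots and classify the differences."""
    creates = {k: v for k, v in incoming.items() if k not in current}
    deletes = {k: v for k, v in current.items() if k not in incoming}
    updates = {
        k: {"before": current[k], "after": incoming[k]}
        for k in incoming
        if k in current and current[k] != incoming[k]
    }
    return {"creates": creates, "updates": updates, "deletes": deletes}


# Made-up account snapshots keyed by account id.
current = {1: {"balance": 100}, 2: {"balance": 50}, 3: {"balance": 75}}
incoming = {1: {"balance": 100}, 2: {"balance": 60}, 4: {"balance": 10}}

changes = diff_records(current, incoming)
print(changes)
```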
My guide to synchronizing data sources and handling Creates/Updates/Deletes:
sync local files with server files
We have a use case where we are using Kafka Connect to source and sink data. It's like a typical ETL.
We want to understand whether Kafka Connect can identify the delta changes between previous streams, i.e. we want to send only the changed data to the client and not the whole table or view. Also, we prefer not to run explicit code that identifies changes via a query on the source and destination DB.
Is there any preferred approach towards it?
As gasparms said, use a CDC tool to pull all change events from your database. You can then use Kafka Streams or KSQL to filter, join, and aggregate as required by your ETL.
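To illustrate the kind of filtering you'd do on those change events, here's a plain-Python sketch that reduces each event to just the changed columns. The event shape (`op`, `before`, `after`) is modeled loosely on the envelope Debezium emits, and the `id` key column is an assumption; in practice you'd express the same logic in Kafka Streams or KSQL.

```python
def delta(event, key="id"):
    """Reduce a change event to the data that actually changed.

    Assumes events look like {"op": "c"|"u"|"d", "before": {...}, "after": {...}}.
    Returns None for updates where no column value differs.
    """
    before = event.get("before") or {}
    after = event.get("after") or {}
    if event["op"] == "c":
        return dict(after)                      # new row: send everything
    if event["op"] == "d":
        return {key: before[key], "_deleted": True}
    changed = {col: val for col, val in after.items() if before.get(col) != val}
    if not changed:
        return None                             # no-op update, nothing to send
    changed[key] = after[key]                   # always keep the key for the client
    return changed


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ann", "city": "Oslo"}},
    {"op": "u", "before": {"id": 1, "name": "Ann", "city": "Oslo"},
     "after": {"id": 1, "name": "Ann", "city": "Bergen"}},
    {"op": "u", "before": {"id": 1, "name": "Ann", "city": "Bergen"},
     "after": {"id": 1, "name": "Ann", "city": "Bergen"}},
    {"op": "d", "before": {"id": 1, "name": "Ann", "city": "Bergen"}, "after": None},
]
deltas = [d for d in (delta(e) for e in events) if d is not None]
print(deltas)
```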
What's your source system that you want to get data from? For Oracle (and several other sources), as of GoldenGate 12.3.1 they actually bundle the Kafka Connect handler as part of the download itself. You also have other options such as DBVisit.
For open-source DBs then Debezium definitely fits the bill, and there is a nice tutorial here.
Have you looked at a CDC (Change Data Capture) approach? There are several connectors that read a database's commit log (or similar) and stream events. Using these events you'll get every change in a table.
Examples
Oracle Golden Gate -
http://www.oracle.com/technetwork/middleware/goldengate/oracle-goldengate-exchange-3805527.html
Postgres - https://github.com/debezium
MySQL - https://github.com/debezium
I wonder if it is possible, or if someone has tried, to set up Apache Kafka as a consumer of the PostgreSQL logical log stream? Does that even make sense?
https://wiki.postgresql.org/wiki/Logical_Log_Streaming_Replication
I have a legacy source system that I need to build a realtime dashboard from. For some reasons I can't hook into the application events (btw, it's a Java app). Instead, I'm thinking of some kind of lambda architecture: when the dashboard initializes, it reads from the persisted "data warehouse" data, which gets there after some ETL. After that, change events are streamed via Kafka to the dashboard.
Another use of the events stored in Kafka would be a kind of change data capture approach for data warehouse population. This is necessary because there is no commercial CDC tool that supports PostgreSQL, and the source application updates tables without keeping history.
A combination of xstevens's PostgreSQL WAL to protobuf project - decoderbufs (https://github.com/xstevens/decoderbufs) - and his pg_kafka producer (https://github.com/xstevens/pg_kafka) might be a start.
Take a look at Bottled Water which:
uses the logical decoding feature (introduced in PostgreSQL 9.4) to
extract a consistent snapshot and a continuous stream of change events
from a database. The data is extracted at a row level, and encoded using Avro. A
client program connects to your database, extracts this data, and
relays it to Kafka
They also have Docker images so looks like it'd be easy to try it out.
The Debezium project provides a CDC connector for streaming data changes from Postgres into Apache Kafka. Currently it supports Decoderbufs and wal2json as logical decoding plug-ins. Bottled Water referenced in Steve's answer is comparable, but it is not actively maintained any longer.
Disclaimer: I'm the project lead of Debezium