I want to build a Flink connector to an OrientDB database, and I want to handle its data locality characteristics. What is the best way to load all the data in parallel into a DataSet? I know I have to handle the OrientDB side of reading the data, but on the Flink side, what are the principles and best practices for handling this kind of task?
You can refer to the flink-hbase connector under flink-connectors in the Flink source. Basically, you have to implement the InputFormat interface, or RichInputFormat if you need access to the RuntimeContext.
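To make that concrete, below is a rough sketch of what a parallel input format for OrientDB could look like. It is not a working connector: the OrientDB access is left as placeholder comments, and the class name and split strategy are made up. Each parallel subtask receives one split and reads only the data assigned to it; for data locality you would model splits after OrientDB clusters and use LocatableInputSplit/LocatableInputSplitAssigner (as the HBase connector does) instead of the generic classes used here.

    import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
    import org.apache.flink.api.common.io.RichInputFormat;
    import org.apache.flink.api.common.io.statistics.BaseStatistics;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.io.GenericInputSplit;
    import org.apache.flink.core.io.InputSplitAssigner;

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Iterator;

    // Hypothetical sketch: each parallel subtask reads one "chunk" of OrientDB data.
    // The OrientDB-specific parts are left as placeholders.
    public class OrientDbInputFormat extends RichInputFormat<String, GenericInputSplit> {

        private transient Iterator<String> records;  // would wrap an OrientDB result cursor

        @Override
        public void configure(Configuration parameters) {
            // read connection settings (URL, credentials, target class/cluster) from parameters
        }

        @Override
        public BaseStatistics getStatistics(BaseStatistics cachedStatistics) {
            return cachedStatistics;  // no statistics available
        }

        @Override
        public GenericInputSplit[] createInputSplits(int minNumSplits) {
            // One split per desired read task; with OrientDB you could map splits to clusters
            // and use LocatableInputSplits if you know which node hosts each cluster.
            GenericInputSplit[] splits = new GenericInputSplit[minNumSplits];
            for (int i = 0; i < minNumSplits; i++) {
                splits[i] = new GenericInputSplit(i, minNumSplits);
            }
            return splits;
        }

        @Override
        public InputSplitAssigner getInputSplitAssigner(GenericInputSplit[] inputSplits) {
            // LocatableInputSplitAssigner would give you locality-aware assignment instead
            return new DefaultInputSplitAssigner(inputSplits);
        }

        @Override
        public void open(GenericInputSplit split) throws IOException {
            // open an OrientDB connection and start a query restricted to this split,
            // e.g. only the clusters assigned to split.getSplitNumber()
            records = Collections.emptyIterator();  // placeholder
        }

        @Override
        public boolean reachedEnd() {
            return !records.hasNext();
        }

        @Override
        public String nextRecord(String reuse) {
            return records.next();
        }

        @Override
        public void close() {
            // close the OrientDB connection / cursor
        }
    }

You would then create the DataSet with env.createInput(new OrientDbInputFormat()), and Flink distributes the splits across the parallel subtasks.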
Related
Well, I'm a newbie to Apache Flink and have been reading some source code on the Internet.
Sometimes I see StreamExecutionEnvironment, but I have also seen StreamTableEnvironment.
I've read the official docs but I still can't figure out the difference between them.
Furthermore, I'm trying to code a Flink Stream Job, which receives the data from Kafka. In this case, which kind of environment should I use?
A StreamExecutionEnvironment is used with the DataStream API. You need a StreamTableEnvironment if you are going to use the higher level Table or SQL APIs.
The section of the docs on creating a TableEnvironment covers this in more detail.
Which you should use depends on whether you want to work with lower level data streams, or a higher level relational API. Both can be used to implement a streaming job that reads from Kafka.
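As a minimal sketch (the package of the Table API bridge class varies a bit between Flink versions), creating the two environments looks roughly like this; the table environment is created on top of the stream environment, so you can mix the two APIs in one job:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    public class EnvironmentsSketch {
        public static void main(String[] args) {
            // DataStream API entry point
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Table/SQL API entry point, layered on top of the stream environment
            StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

            // env.addSource(...) / env.fromSource(...) for DataStream jobs;
            // tableEnv.executeSql("CREATE TABLE ... WITH ('connector' = 'kafka', ...)") for Table/SQL jobs
        }
    }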
The documentation includes a pair of code walkthroughs introducing both APIs, which should help you figure out which API is better suited for your use case. See the DataStream walkthrough and the Table walkthrough.
To learn the DataStream API, spend a day working through the self-paced training. There's also training for Flink SQL. Both are free.
I want to capture data changes from a few tables in a huge PostgreSQL database.
Initially I planned to use the logical decoding feature with Debezium. But this solution has significant overhead since it's necessary to decode the entire WAL. Another solution uses triggers and PgQ.
Is there any general way to integrate PgQ with Kafka or perhaps a Kafka connector for this purpose?
You either go transaction log, or you go query-based.
Which you use depends on your use of the data. Query-based polls the DB, log-based uses the log (WAL).
I'm interested in your assertion that Debezium has "significant overhead"—have you quantified this? I know there are lots of people using it and it's not usually raised as an issue.
For query-based capture use the Kafka Connect JDBC source connector.
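For illustration, a query-based configuration could look roughly like the following. The property names come from the Confluent JDBC source connector, but the connection URL, credentials, tables, and column names are placeholders, and newer connector versions may use slightly different option names. It is shown as a Java map that you would normally submit as JSON to the Kafka Connect REST API:

    import java.util.Map;

    // Hypothetical JDBC source connector configuration (query-based change capture).
    // Usually submitted as JSON to the Kafka Connect REST API; shown here as a Java map.
    public class JdbcSourceConfigSketch {
        public static void main(String[] args) {
            Map<String, String> config = Map.of(
                    "connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url", "jdbc:postgresql://db.example.com:5432/appdb",  // placeholder
                    "connection.user", "kafka_connect",                               // placeholder
                    "connection.password", "secret",                                  // placeholder
                    "table.whitelist", "orders,customers",                            // only poll these tables
                    "mode", "timestamp+incrementing",        // detect both new and updated rows
                    "timestamp.column.name", "updated_at",   // placeholder column
                    "incrementing.column.name", "id",        // placeholder column
                    "topic.prefix", "pg-"                    // topics become pg-orders, pg-customers
            );
            config.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }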
You can see pros and cons of each approach here: http://rmoff.dev/ksny19-no-more-silos
We have a use case where we are using Kafka Connect to source and sink data. It's like a typical ETL.
We want to understand whether Kafka Connect can identify the delta changes since the previous run, i.e. we want to send only the changed data to the client and not the whole table or view. Also, we would prefer not to execute explicit code that identifies changes via queries on the source and destination DBs.
Is there any preferred approach towards it?
As gasparms said, use a CDC tool to pull all change events from your database. You can then use Kafka Streams or KSQL to filter, join, and aggregate as required by your ETL.
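For example, a very small Kafka Streams sketch (the topic names and the filter predicate here are made up) that forwards only the change events you care about to a downstream topic could look like this:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    import java.util.Properties;

    // Hypothetical sketch: read CDC events from one topic, keep only the interesting ones,
    // and write them to a topic consumed by the client.
    public class ChangedRowsFilter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "changed-rows-filter");   // made-up id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> changes = builder.stream("db.changes");          // placeholder input topic

            changes
                    .filter((key, value) -> value != null && value.contains("\"op\":\"u\""))  // e.g. only updates
                    .to("client.changes");                                            // placeholder output topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }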
What's your source system that you want to get data from? For Oracle (and several other sources), as of GoldenGate 12.3.1 they actually bundle the Kafka Connect handler as part of the download itself. You also have other options such as DBVisit.
For open-source DBs, Debezium definitely fits the bill, and there is a nice tutorial here.
Have you looked at a CDC (Change Data Capture) approach? There are several connectors that read a commit log or similar in a database and stream events. Using these events you'll get every change in a table.
Examples:
Oracle GoldenGate - http://www.oracle.com/technetwork/middleware/goldengate/oracle-goldengate-exchange-3805527.html
Postgres - https://github.com/debezium
MySQL - https://github.com/debezium
We are planning to use REST API calls to ingest data from an endpoint and store the data in HDFS. The REST calls are done in a periodic fashion (daily or maybe hourly).
I've already done Twitter ingestion using Flume, but I don't think Flume would suit my current use case, because I am not consuming a continuous data firehose like the Twitter one, but rather making discrete, regular, time-bound invocations.
The idea I have right now is to use custom Java code that takes care of the REST API calls and saves the responses to HDFS, and then use an Oozie coordinator to schedule that Java jar.
I would like to hear suggestions/alternatives (if there's anything easier than what I'm thinking of right now) about the design and which Hadoop-based component(s) to use for this use case. If you feel I can stick with Flume, then kindly also give me an idea of how to do this.
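For context, a bare-bones sketch of that custom-Java approach might look like the following; the endpoint URL, HDFS path, and file naming are placeholders, error handling and authentication are left out, and the daily/hourly scheduling would be handled by the Oozie coordinator (or cron) invoking the jar:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical sketch: call a REST endpoint once and store the raw response in HDFS.
    // Scheduling is handled outside this program, e.g. by an Oozie coordinator.
    public class RestToHdfs {
        public static void main(String[] args) throws Exception {
            URL endpoint = new URL("https://api.example.com/data");          // placeholder endpoint
            HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
            conn.setRequestMethod("GET");

            Configuration conf = new Configuration();                        // picks up core-site.xml / hdfs-site.xml
            Path target = new Path("/data/ingest/" + System.currentTimeMillis() + ".json");  // placeholder path

            try (InputStream in = conn.getInputStream();
                 FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(target)) {
                in.transferTo(out);                                          // stream the response into HDFS
            } finally {
                conn.disconnect();
            }
        }
    }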
As stated on the Apache Flume website:
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
As you can see, among the features attributed to Flume is the gathering of data. "Pushing-like" or "emitting-like" data sources are easy to integrate thanks to HttpSource, AvroSource, ThriftSource, etc. In your case, where the data must be, let's say, "actively pulled" from an HTTP-based service, the integration is not so obvious, but it can be done. For instance, by using the ExecSource, which runs a script that gets the data and pushes it to the Flume agent.
If you use your own custom code in charge of pulling the data and writing it into HDFS, such a design will be OK, but you will be missing some interesting built-in Flume characteristics (which you will probably have to implement yourself):
Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until it is effectively written. This is achieved through the usage of an internal channel buffering data both at the input (absorbing load peaks) and the output (retaining data until it is effectively persisted), and through the transaction concept.
Performance. The usage of transactions and the possibility of configuring multiple parallel sinks (data processors) will make your deployment able to deal with really large amounts of data generated per second.
Usability. By using Flume you don't need to deal with the storage details (e.g. the HDFS API). And if some day you decide to change the final storage, you only have to reconfigure the Flume agent to use the new sink.
I wonder if it is possible, or if someone has tried, to set up Apache Kafka as a consumer of the PostgreSQL logical log stream? Does that even make sense?
https://wiki.postgresql.org/wiki/Logical_Log_Streaming_Replication
I have a legacy source system that I need to build a realtime dashboard from. For some reasons I can't hook into the application events (btw, it's a Java app). Instead, I'm thinking of some kind of lambda architecture: when the dashboard initializes, it reads from persisted data (a "data warehouse") which gets there after some ETL. Then change events are streamed via Kafka to the dashboard.
Another use of the events stored in Kafka would be a kind of change data capture approach for populating the data warehouse. This is necessary because there is no commercial CDC tool that supports PostgreSQL, and the source application updates tables without keeping history.
A combination of xstevens's PostgreSQL WAL-to-protobuf project, decoderbufs (https://github.com/xstevens/decoderbufs), and his pg_kafka producer (https://github.com/xstevens/pg_kafka) might be a start.
Take a look at Bottled Water which:
uses the logical decoding feature (introduced in PostgreSQL 9.4) to extract a consistent snapshot and a continuous stream of change events from a database. The data is extracted at a row level, and encoded using Avro. A client program connects to your database, extracts this data, and relays it to Kafka
They also have Docker images, so it looks like it'd be easy to try it out.
The Debezium project provides a CDC connector for streaming data changes from Postgres into Apache Kafka. Currently it supports Decoderbufs and wal2json as logical decoding plug-ins. Bottled Water, referenced in Steve's answer, is comparable, but it is no longer actively maintained.
Disclaimer: I'm the project lead of Debezium
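For reference, here is a minimal sketch of what the connector configuration might look like; the property names follow the Debezium Postgres connector documentation but can differ between Debezium versions, and the hostname, credentials, and table names are placeholders. It is shown as a Java map that would normally be submitted as JSON to the Kafka Connect REST API:

    import java.util.Map;

    // Hypothetical Debezium Postgres connector configuration, shown as a Java map.
    // In practice this is usually posted as JSON to the Kafka Connect REST API.
    // Property names may vary between Debezium versions; treat this as a sketch.
    public class DebeziumPostgresConfigSketch {
        public static void main(String[] args) {
            Map<String, String> config = Map.of(
                    "connector.class", "io.debezium.connector.postgresql.PostgresConnector",
                    "plugin.name", "wal2json",                 // or "decoderbufs"
                    "database.hostname", "pg.example.com",     // placeholder
                    "database.port", "5432",
                    "database.user", "cdc_user",               // placeholder
                    "database.password", "cdc_password",       // placeholder
                    "database.dbname", "appdb",                // placeholder
                    "database.server.name", "appdb-server",    // logical name used to prefix topic names
                    "table.whitelist", "public.orders,public.customers"  // only capture these tables
            );
            config.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }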