I am aware that one can use Storm or Spark Streaming to do real-time data processing with Kafka, but I want to ask whether Kafka has any functionality similar to a Flume interceptor, wherein data cleaning, etc. can be done on the fly on an event.
Currently there is no such feature in a released version of Kafka, but the next release (0.10.0.0 according to the roadmap) will have Kafka Streams, which looks similar to what you are asking for.
What you are looking for is Kafka Interceptors, which are actually inspired by the Flume Interceptor interface. As @Lundahl points out, the current version doesn't support this, but the next one will.
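To make the "cleaning on the fly" idea concrete, here is a minimal sketch of an interceptor chain in plain Python. It only models the concept; it is not Kafka's or Flume's actual interface, and the names `Interceptor`, `strip_whitespace`, `drop_empty_body` and `apply_interceptors` are made up for illustration. Records are assumed to be plain dicts.

```python
from typing import Callable, List, Optional

# A record is modelled as a plain dict; Interceptor is a hypothetical type,
# not a Kafka or Flume API.
Record = dict
Interceptor = Callable[[Record], Optional[Record]]


def strip_whitespace(record: Record) -> Optional[Record]:
    """Trim string values; non-string values pass through unchanged."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def drop_empty_body(record: Record) -> Optional[Record]:
    """Filter out events whose 'body' field is missing or empty."""
    return record if record.get("body") else None


def apply_interceptors(record: Record, chain: List[Interceptor]) -> Optional[Record]:
    """Run the record through each interceptor in order; None means 'dropped'."""
    current: Optional[Record] = record
    for interceptor in chain:
        if current is None:
            return None
        current = interceptor(current)
    return current
```

Each event would pass through the chain just before it is handed to the producer, so cleaning happens per event rather than in a separate processing job.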
I am new to Apache Kafka and its stream services related to APIs, and was wondering whether there is any formal documentation on where to obtain the initial raw data required for ingestion.
In essence, I want to try my hand at building a rudimentary crypto trading bot, but was under the impression that HTTP APIs may have more latency than APIs that integrate with Kafka Streams. For example, I know RapidAPI has a library of HTTP APIs that can be accessed to help pull data, but I was unsure whether there is something similar if I wanted the data to be ingested through Kafka Streams. My impression is that the two data sources will differ in some way, but I am unsure whether that is actually the case.
I tried digging around on Google, but it's not very clear what APIs or source data Kafka Streams takes, or whether they are the same or similar and just handled differently.
If you have any insight or documentation, that would be greatly appreciated. Also feel free to let me know if my understanding is completely wrong.
"any formal documentation on where to obtain the initial raw data required for ingestion?"
Kafka accepts binary data. You can feed in serialized data from anywhere, although you are restricted by (configurable) message size limits.
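As a rough illustration of "serialized data from anywhere", the sketch below turns an event into bytes and checks it against a size limit before it would be sent. The 1,000,000-byte figure and the names `MAX_MESSAGE_BYTES`, `serialize` and `fits` are assumptions for the example; the real limit is whatever your cluster is configured with.

```python
import json

# Assumed limit for illustration only; check your own broker/producer settings.
MAX_MESSAGE_BYTES = 1_000_000


def serialize(event: dict) -> bytes:
    """Encode an event as compact UTF-8 JSON with a stable key order."""
    return json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")


def fits(payload: bytes, limit: int = MAX_MESSAGE_BYTES) -> bool:
    """Return True if the serialized payload is within the size limit."""
    return len(payload) <= limit
```

JSON is used here only because it ships with the standard library; any format that ends up as bytes (Avro, Protobuf, plain text) would work the same way.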
"APIs that integrate with Kafka Streams"
Kafka Streams is an intra-cluster library; it doesn't integrate with anything but Kafka.
If you want to periodically poll/fetch an HTTP/1 API, then you would use a regular HTTP client and a Kafka Producer.
The answer is probably similar for streaming HTTP/2 or WebSocket: you still can't use Kafka Streams, and you'd have to deal with batching records into a Kafka Producer request.
Instead, you should look for Kafka Connect projects on the web that work with HTTP, or opt for something like Apache NiFi, a broader project with lots of different "processors" like GetHTTP and PublishKafka.
Once the data is in Kafka, you are welcome to use Kafka Streams/KSQL to do some processing.
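The "HTTP client plus Kafka Producer" approach, including the batching concern, could be sketched roughly like this. The HTTP call and the producer are passed in as plain callables (`fetch` and `send_batch`, both hypothetical) so the sketch runs without a broker; in a real setup they would wrap your HTTP client and your Kafka producer. The 16,384-byte default is an arbitrary assumption.

```python
import json
from typing import Callable, Iterable, Iterator, List


def batched(payloads: Iterable[bytes], max_batch_bytes: int) -> Iterator[List[bytes]]:
    """Group payloads into batches whose total size stays within max_batch_bytes.

    A single payload larger than the limit is emitted as its own batch.
    """
    batch: List[bytes] = []
    size = 0
    for payload in payloads:
        if batch and size + len(payload) > max_batch_bytes:
            yield batch
            batch, size = [], 0
        batch.append(payload)
        size += len(payload)
    if batch:
        yield batch


def poll_once(fetch: Callable[[], List[dict]],
              send_batch: Callable[[List[bytes]], None],
              max_batch_bytes: int = 16_384) -> int:
    """Fetch one round of events, serialize them, and hand batches to the producer."""
    payloads = [json.dumps(event).encode("utf-8") for event in fetch()]
    sent = 0
    for batch in batched(payloads, max_batch_bytes):
        send_batch(batch)
        sent += len(batch)
    return sent
```

Calling `poll_once` on a timer gives the periodic-poll variant; for a WebSocket feed, the same `batched` helper could be applied to messages as they arrive.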
I haven't used Kafka before and wanted to know: if messages are published through Kafka, what are the possible ways to capture that info?
Is the only way to receive that info via Kafka "Consumers", or can REST APIs also be used here?
While reading up, I found that Kafka needs ZooKeeper running too.
I don't need to publish info, just process data received from a Kafka publisher.
Any pointers will help.
Kafka is a distributed streaming platform that allows you to process streams of records in near real-time.
Producers publish records/messages to Topics in the cluster.
Consumers subscribe to Topics and process those messages as they are available.
The Kafka docs are an excellent place to get up to speed on the core concepts: https://kafka.apache.org/intro
"Is the only way to receive that info via Kafka "Consumers", or can REST APIs also be used here?"
Kafka has its own TCP-based protocol, not a native HTTP client (assuming that's what you actually mean by REST).
Consumers are the only way to get and subsequently process data; however, plenty of external tooling exists so that you barely have to write any code to work with that data if you don't want to.
We have a scenario where Kafka Producer should read a list of incoming files and produce them to Kafka Topics. I've read about FileSourceConnector (http://docs.confluent.io/3.1.0/connect/connect-filestream/filestream_connector.html) but it reads only one file and sends new lines added to that file. File rotation is not handled. A few questions:
1) Is it better to implement our own Producer code to meet our requirement, or can we extend the File Connector class so that it reads new files and sends them to Kafka topics?
2) Is there any other source connector that can be used in this scenario?
In terms of performance and ease of development, which approach is better? i.e., developing our Producer code to read files and send to Kafka or extending the Connector code and making changes to it.
Any kind of feedback will be greatly appreciated!
Thank you!
I personally used the Producer API directly. I handled file rotation and could publish in real time. The tricky part was making sure the files were exactly the same on the source and sink systems (exactly-once processing).
Have you taken a look at Akka Streams - Reactive Kafka? https://github.com/akka/reactive-kafka
Check this example: https://github.com/ktoso/akka-streams-alpakka-talk-demos-2016/blob/master/src/main/java/javaone/step1_file_to_kafka/Step1KafkaLogStreamer.java
You could write a producer as you suggested or, better yet, write your own connector using the developer API.
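Whichever route you take (custom producer or custom connector), the core of it is tracking how far each file has been read and noticing new or rotated files. Below is a minimal sketch of that bookkeeping, under the assumption that files are line-oriented UTF-8 and that a file shrinking means it was rotated or truncated. `read_new_lines` and the in-memory `offsets` dict are hypothetical; a real implementation would persist the offsets and hand each line to a producer.

```python
import os
from typing import Dict, List, Tuple


def read_new_lines(directory: str, offsets: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (filename, line) pairs appended since the last call.

    offsets maps filename -> byte position already processed; it is updated in place.
    """
    results: List[Tuple[str, str]] = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        position = offsets.get(name, 0)
        if os.path.getsize(path) < position:
            position = 0  # file shrank: assume it was rotated or truncated
        with open(path, "rb") as handle:
            handle.seek(position)
            for raw in handle:
                if not raw.endswith(b"\n"):
                    break  # partial line still being written; retry on next scan
                results.append((name, raw.decode("utf-8").rstrip("\n")))
                position += len(raw)
        offsets[name] = position
    return results
```

Calling this in a loop picks up both new files and lines appended to existing ones. Keying by file name is the simplest choice; it will not detect a rotation where the replacement file is already larger than the stored offset.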
I need to fetch messages from Kafka topics and notify other systems via HTTP based APIs. That is, get message from topic, map to the 3rd party APIs and invoke them. I intend to write a Kafka Sink Connector for this.
For this use case, is Kafka Connect the right choice, or should I go with a Kafka client?
Use Kafka clients when you have full control over your code and are an experienced developer: you want to connect an application to Kafka and can modify the application's code. Clients let you:
• push data into Kafka
• pull data from Kafka
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Use Kafka Connect when you don't have control over third-party code, are new to Kafka, and have to connect Kafka to datastores whose code you can't modify.
Kafka Connect’s scope is narrow: it focuses only on copying streaming data to and from Kafka and does not handle other tasks.
http://docs.confluent.io/2.0.0/connect/
I am adding a few lines from another blog to explain the differences:
Companies that want to adopt Kafka write a bunch of code to publish their data streams. What we’ve learned from experience is that doing this correctly is more involved than it seems. In particular, there are a set of problems that every connector has to solve:
• Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream. Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it. We will cover the various nuances of schema management for data pipelines in a future blog post.
• Fault tolerance: Run several instances of a process and be resilient to failures
• Parallelism: Horizontally scale to handle large scale datasets
• Latency: Ingest, transport and process data in real-time, thereby moving away from once-a-day data dumps.
• Delivery semantics: Provide strong guarantees when machines fail or processes crash
• Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
These are really hard problems in their own right, it just isn’t feasible to solve them separately in each connector. Instead you want a single infrastructure platform connectors can build on that solves these problems in a consistent way.
Until recently, adopting Kafka for data integration required significant developer expertise; developing a Kafka connector required building on the client APIs.
https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/
Kafka Connect will work well for this purpose, but this would also be a pretty straightforward consumer application, because consumers also have the benefits of fault tolerance/scalability, and in this case you're probably just doing simple message-at-a-time processing within each consumer instance. You can also easily use enable.auto.commit for this application, so you will not encounter the tricky parts of using the consumer directly. The main thing Kafka Connect would give you compared to using the consumer here is that the connector could be made generic for different input formats, but that may not be important to you for a custom connector.
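The message-at-a-time processing mentioned above boils down to "take one message, map it to the third-party request, invoke it". A minimal sketch of that mapping follows, assuming messages are UTF-8 JSON with `id` and `status` fields. The endpoint URL, the field names and the output keys `orderId`/`state` are all made up for illustration, and the consumer loop itself is left out so the code runs without a broker.

```python
import json
import urllib.request
from typing import Callable, Iterable

# Hypothetical third-party endpoint, used only for illustration.
NOTIFY_URL = "https://api.example.com/notify"


def to_request(message: bytes, url: str = NOTIFY_URL) -> urllib.request.Request:
    """Map one message (assumed UTF-8 JSON with 'id' and 'status') to an HTTP POST."""
    event = json.loads(message.decode("utf-8"))
    body = json.dumps({"orderId": event["id"], "state": event["status"]}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )


def notify_all(messages: Iterable[bytes],
               invoke: Callable[[urllib.request.Request], object]) -> int:
    """Process messages one at a time; an exception from invoke stops the loop."""
    delivered = 0
    for message in messages:
        invoke(to_request(message))
        delivered += 1
    return delivered
```

In a real application `invoke` could be `urllib.request.urlopen` and `messages` would come from the consumer's poll loop. The same `to_request` mapping would sit inside a sink connector's put logic if you go the Kafka Connect route.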
You should use a Kafka Connect sink when you are using a Kafka Connect source to produce messages to a specific topic.
For example, when you are using file-source you should use file-sink to consume what the source has produced, and when you are using jdbc-source you should use jdbc-sink to consume what you have produced.
Because the schemas of the producer and the sink consumer should be compatible, you should use a compatible source and sink on both sides.
If in some cases the schemas are not compatible, you can use the SMT (Single Message Transform) capability, available since Kafka 0.10.2, to write message transformers that translate messages between incompatible producers and consumers.
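Conceptually, a transform of this kind is just a function applied to each record between the source and the sink. The sketch below shows the idea with a field rename in plain Python; it is not the Kafka Connect transform API (which is Java), and `rename_fields` is a hypothetical helper.

```python
from typing import Dict


def rename_fields(value: dict, renames: Dict[str, str]) -> dict:
    """Return a copy of value with keys renamed; keys not in renames are kept as-is."""
    return {renames.get(key, key): field for key, field in value.items()}
```

For example, a source that emits `user_id` can be bridged to a sink that expects `userId` by passing `{"user_id": "userId"}` as the rename map.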
Note: if you want to transfer messages faster, I suggest using Avro and Schema Registry to transfer messages more efficiently.
If you can code in Java, you can use Kafka Streams, the Spring-Kafka project, or another stream processing framework to achieve what you want.
In the book Kafka in Action, it is explained as follows:
The purpose of Kafka Connect is to help move data in or out of Kafka without having to deal with writing our own producers and clients. Connect is a framework that is already part of Kafka that really can make it simple to use pieces that have been already been built to start your streaming journey.
As for your problem: first, one of the simplest questions to ask is whether you can modify the application code of the systems you need to exchange data with.
Second, if you have in-depth knowledge of those systems and the custom connector will be used by others, writing it is worth it, because it may help people who are not experts in those systems. Otherwise, if the connector will be used only by you, I think you should write a Kafka connector. That way you get more flexibility and a simpler implementation.
I have a use case for real-time streaming: we will be using Kafka (0.9) as a message buffer and Spark Streaming (1.6) for stream processing (HDP 2.4). We will receive ~80-90K events/sec over HTTP. Can you please suggest a recommended architecture for data ingestion into Kafka topics that will be consumed by Spark Streaming?
We are considering a Flafka architecture.
Is Flume listening on HTTP and sending to Kafka (Flafka) a good option for real-time streaming?
Please share other possible approaches if any.
One approach could be Kafka Connect. Look for a source connector that fits your needs, or develop a new custom one.