Write to AWS SQS queue using Spark - scala

Is there any way to stream or write data to Amazon SQS queue from Spark using a library?
Nothing suitable is listed on Spark Packages.
What things can I try?

One idea is to use Alpakka's SQS connector, which is built on Akka Streams.

I wrote a small library to write a DataFrame to SQS:
https://github.com/fabiogouw/spark-aws-messaging

Related

Can we use SQL query in Kafka Stream API?

Can we consume data from a Kafka stream with filters? The docs claim that we can apply our own logic while consuming data through the Streams API, but I can't find any samples of that.
Yes, filter is a literal method name in the Streams API DSL and examples are readily available in the Kafka Streams documentation pages.
If you want to use SQL for filtering, install ksqlDB or use Spark SQL, Flink, etc.

Nifi and Spark Integration

I want to create a Spark session within a NiFi custom processor written in Scala. So far I can create the Spark session in a standalone Scala project, but when I put the same code inside the onTrigger method of the NiFi custom processor, the session is never created. Is there any way to achieve this? So far I have imported the spark-core and spark-sql libraries.
Any feedback is appreciated.
Not possible with Flow File. Period.
You need Kafka between NiFi and Spark Streaming or Spark Structured Streaming. Here is a good read, by the way: https://community.cloudera.com/t5/Community-Articles/Spark-Structured-Streaming-with-NiFi-and-Kafka-using-PySpark/ta-p/245068

Kafka streams vs Kafka connect for Kafka HBase ETL pipeline

I have a straightforward scenario for an ETL job: take data from a Kafka topic and put it into an HBase table. In the future I'm going to add support for some logic after reading data from the topic.
I am considering two options:
Use Kafka Streams to read data from the topic and then write each record via the native HBase driver
Use a Kafka -> HBase connector
I have the following concerns about my options:
Is it a good idea to write data each time it arrives in a Kafka Streams window? I suspect it will degrade performance.
The Kafka HBase connector is supported only by a third-party developer. I'm not sure about the code quality of this solution, or whether it leaves room to add custom aggregation logic over data from the topic.
I have been searching for Kafka-to-HBase ETL options myself, and so far my research tells me that it's not a good idea to interact with an external system from within a Kafka Streams application (check the answer here and here). Kafka Streams is very powerful and a great fit for a Kafka -> transform message -> Kafka kind of use case; you can then have Kafka Connect take the data from the Kafka topic and write it to a sink.
Since you do not want to use the third-party Kafka Connect connector for HBase, one option is to write a connector yourself using the Connect API. The other option is to use the Kafka consumer/producer API and write the app the traditional way: poll the messages, write to the sink, commit the batch, and move on.
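The poll, write, commit loop described above can be sketched in plain Java. This is only an illustration of the ordering, not Kafka or HBase code: FakeSource is a hypothetical stand-in for a Kafka consumer, and the in-memory list stands in for HBase writes. The point it shows is that offsets are committed only after the batch has reached the sink, which is what gives at-least-once delivery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a Kafka consumer: hands out batches and tracks a committed offset. */
final class FakeSource {
    private final List<String> log;
    private int position = 0;   // offset of the next record to hand out
    private int committed = 0;  // offset up to which records are acknowledged

    FakeSource(List<String> log) { this.log = log; }

    /** Returns up to maxRecords records, or an empty list when the log is exhausted. */
    List<String> poll(int maxRecords) {
        int end = Math.min(position + maxRecords, log.size());
        List<String> batch = new ArrayList<>(log.subList(position, end));
        position = end;
        return batch;
    }

    void commit() { committed = position; }

    int committedOffset() { return committed; }
}

final class PollWriteCommit {
    /** Drains the source in batches and commits only after the sink accepted the batch. */
    static List<String> run(FakeSource source, int batchSize) {
        List<String> sink = new ArrayList<>(); // stand-in for HBase puts
        while (true) {
            List<String> batch = source.poll(batchSize);
            if (batch.isEmpty()) {
                break; // a real consumer would keep polling instead of stopping
            }
            sink.addAll(batch); // 1. write the whole batch to the sink
            source.commit();    // 2. only then acknowledge the offsets
        }
        return sink;
    }

    public static void main(String[] args) {
        FakeSource source = new FakeSource(Arrays.asList("r1", "r2", "r3", "r4", "r5"));
        List<String> written = run(source, 2);
        System.out.println(written + " committed=" + source.committedOffset());
    }
}
```

If the process crashed between the write and the commit, the uncommitted batch would be polled again after a restart, so the sink should tolerate duplicates.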

Multiple Streams support in Apache Flink Job

My question is regarding the Apache Flink framework.
Is there any way to support more than one streaming source, like Kafka and Twitter, in a single Flink job? Is there any workaround? Can we process more than one streaming source at a time in a single Flink job?
I am currently working with Spark Streaming and this is a limitation there.
Is this achievable with other streaming frameworks like Apache Samza, Storm, or NiFi?
A response is much awaited.
Yes, this is possible in Flink and Storm (no clue about Samza or NIFI...)
You can add as many source operators as you want and each can consume from a different source.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = ... // see Flink webpage for more details
DataStream<String> stream1 = env.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
DataStream<String> stream2 = env.readTextFile("/tmp/myFile.txt");
DataStream<String> allStreams = stream1.union(stream2);
For Storm using the low-level API, the pattern is similar. See "An Apache Storm bolt receive multiple input tuples from different spout/bolt".
Some solutions have already been covered; I just want to add that in a NiFi flow you can ingest many different sources and process them either separately or together.
It is also possible to ingest a source, and have multiple teams build flows on this without needing to ingest the data multiple times.

Kafka -> Flink DataStream -> MongoDB

I want to set up Flink so that it transforms and redirects data streams from Apache Kafka to MongoDB. For testing purposes I'm building on top of the flink-streaming-connectors.kafka example (https://github.com/apache/flink).
The Kafka streams are being read properly by Flink and I can map them, etc., but the problem occurs when I want to save each received and transformed message to MongoDB. The only example I've found of MongoDB integration is flink-mongodb-test from GitHub. Unfortunately it uses a static data source (a database), not a data stream.
I believe there should be some DataStream.addSink implementation for MongoDB, but apparently there isn't.
What would be the best way to achieve this? Do I need to write a custom sink function, or am I missing something? Maybe it should be done in a different way?
I'm not tied to any solution, so any suggestion would be appreciated.
Below is an example of exactly what I'm getting as input and what I need to store as output.
Apache Kafka Broker <-------------- "AAABBBCCCDDD" (String)
Apache Kafka Broker --------------> Flink: DataStream<String>
Flink: DataStream.map({
return ("AAABBBCCCDDD").convertTo("A: AAA; B: BBB; C: CCC; D: DDD")
})
.rebalance()
.addSink(MongoDBSinkFunction); // store the row in MongoDB collection
As you can see in this example I'm using Flink mostly for Kafka's message stream buffering and some basic parsing.
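The parsing step in the pseudocode above could look roughly like the following plain Java function, which would be the body of the map call. It assumes, based only on the single sample message, that every message consists of four fixed-width fields of three characters each, labelled A to D; the class and method names are made up for illustration.

```java
final class FixedWidthParser {
    // Assumptions taken from the sample "AAABBBCCCDDD": four fields, three characters each.
    private static final int FIELD_WIDTH = 3;
    private static final String[] LABELS = {"A", "B", "C", "D"};

    /** Turns "AAABBBCCCDDD" into "A: AAA; B: BBB; C: CCC; D: DDD". */
    static String convert(String raw) {
        if (raw.length() != FIELD_WIDTH * LABELS.length) {
            throw new IllegalArgumentException("unexpected message length: " + raw.length());
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < LABELS.length; i++) {
            if (i > 0) {
                out.append("; ");
            }
            // Append the label followed by the i-th fixed-width slice of the message.
            out.append(LABELS[i]).append(": ")
               .append(raw, i * FIELD_WIDTH, (i + 1) * FIELD_WIDTH);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("AAABBBCCCDDD"));
    }
}
```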
As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors to load the content of the topic into your MongoDB database.
Kafka -> Flink -> Kafka -> Mongo/Anything
With this approach you can maintain the at-least-once semantics.
There is currently no Streaming MongoDB sink available in Flink.
However, there are two ways for writing data into MongoDB:
Use the DataStream.write() call of Flink. It allows you to use any OutputFormat (from the Batch API) with streaming. Using Flink's HadoopOutputFormatWrapper, you can use the official MongoDB Hadoop connector.
Implement the Sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java Client library.
Neither approach provides any sophisticated processing guarantees. However, when you're using Flink with Kafka (and checkpointing enabled) you'll have at-least-once semantics: in case of an error, the data is streamed again to the MongoDB sink.
If you're doing idempotent updates, redoing these updates shouldn't cause any inconsistencies.
If you really need exactly-once semantics for MongoDB, you should probably file a JIRA in Flink and discuss with the community how to implement this.
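To illustrate the point about idempotent updates: with at-least-once delivery some records arrive twice after a failure, but if each write is an upsert keyed by a stable id, the final state is the same. The sketch below uses a TreeMap as a stand-in for a MongoDB collection and records in a made-up "id=value" format; it is not MongoDB client code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IdempotentReplay {
    /** Applies "id=value" records as upserts into a map that stands in for a collection. */
    static Map<String, String> applyUpserts(List<String> records) {
        Map<String, String> collection = new TreeMap<>();
        for (String record : records) {
            String[] kv = record.split("=", 2);
            // put() replaces any existing document with the same id, so repeats are harmless.
            collection.put(kv[0], kv[1]);
        }
        return collection;
    }

    public static void main(String[] args) {
        List<String> deliveredOnce = Arrays.asList("k1=a", "k2=b", "k3=c");
        // After a failure the job restarts from the last checkpoint, so k2 and k3 arrive again.
        List<String> withReplay = Arrays.asList("k1=a", "k2=b", "k3=c", "k2=b", "k3=c");
        System.out.println(applyUpserts(deliveredOnce).equals(applyUpserts(withReplay)));
    }
}
```

An append-style insert without a stable key would instead leave duplicate documents after the replay, which is why the choice of key matters more than the sink implementation here.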