Raise Alert through apache spark - scala

I am using Apache Spark to consume real-time data from Apache Kafka. The data comes from sensors and is in JSON format.
Example of the data format:
{
"meterId" : "M1",
"meterReading" : "100"
}
I want to apply rules that raise alerts in real time, i.e. if I have not received data from meter "M1" in the last 2 hours, or if meterReading exceeds some limit, an alert should be created.
How can I achieve this in Scala?

I will respond here as an answer, since this is too long for a comment.
As I said, JSON in Kafka should be one message per line, so send this instead: {"meterId":"M1","meterReading":"100"}
If you are using Kafka, you can create a stream with KafkaUtils:
JavaPairDStream<String, String> input = KafkaUtils.createStream(jssc, zkQuorum, group, topics);
The pair is <kafkaMessageKey, jsonMessage>, so you can look only at the JSON message if you don't need the key.
On input you can use the many methods described in the JavaPairDStream documentation, e.g. you can use map to extract only the messages into a simple JavaDStream.
And of course you can use a JSON parser such as Gson, Jackson or org.json; which one depends on your use case, performance requirements and so on.
So you need to do something like this:
JavaDStream<String> messagesOnly = input.map(
new Function<Tuple2<String, String>, String>() {
public String call(Tuple2<String, String> message) {
return message._2();
}
}
);
Now you have only the messages, without the Kafka message key, and you can apply the logic you described in the question.
JavaDStream<String> alerts = messagesOnly.filter(
new Function<String, Boolean>() {
public Boolean call(String message) {
// parse the JSON message here, e.g. with Gson
// return true to keep the message as an alert (meterReading exceeds the limit),
// false to drop it
return false; // placeholder: replace with your own rule
}
}
);
Now you have only the alert messages, and you can send them somewhere else.
Edit: below is the same example in Scala.
// batch every 2 seconds
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
def filterLogic(message: String): Boolean = {
// here your logic for filtering
??? // placeholder so the skeleton compiles; replace with your own rule
}
// map _._2 takes your json messages
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
// filtered data after filter transformation
val filtered = messages.filter(m => filterLogic(m))
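
The filter above only covers the "reading exceeds a limit" rule. The "no data for 2 hours" rule needs some per-meter state. Below is a minimal sketch of both rules in plain Java, independent of Spark. MeterAlerts, readingLimit and silenceMillis are hypothetical names chosen for illustration, and the readings are assumed to be already parsed from JSON. In Spark Streaming, state like this would typically be kept with updateStateByKey or mapWithState rather than a local map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Spark): tracks per-meter state for the two alert rules.
class MeterAlerts {
    private final double readingLimit;  // assumed threshold for meterReading
    private final long silenceMillis;   // how long a meter may stay silent, e.g. 2 hours
    private final Map<String, Long> lastSeen = new HashMap<>();

    MeterAlerts(double readingLimit, long silenceMillis) {
        this.readingLimit = readingLimit;
        this.silenceMillis = silenceMillis;
    }

    // Records the arrival time of a reading and reports whether it exceeds the limit.
    boolean onReading(String meterId, double reading, long nowMillis) {
        lastSeen.put(meterId, nowMillis);
        return reading > readingLimit;
    }

    // Meters seen at least once whose latest reading is older than the silence window.
    List<String> silentMeters(long nowMillis) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > silenceMillis) {
                silent.add(entry.getKey());
            }
        }
        return silent;
    }

    public static void main(String[] args) {
        long twoHours = 2L * 60 * 60 * 1000;
        MeterAlerts alerts = new MeterAlerts(150.0, twoHours);
        System.out.println(alerts.onReading("M1", 100.0, 0L));          // false
        System.out.println(alerts.onReading("M1", 200.0, 1_000L));      // true
        System.out.println(alerts.silentMeters(1_000L + twoHours + 1)); // [M1]
    }
}
```

Note that a meter that has never sent anything is not known to this sketch; you would have to seed the map with the expected meter ids to catch that case.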

Related

Get the last records of KStream

I'm very new to the Kafka Streams API.
I have a KStream like this:
KStream<Long,String> joinStream = builder.stream("output");
The values of the records in the KStream look like this:
The stream is updated every second.
I need to build a REST API whose result is calculated from the values of profit and spotPrice.
But I'm struggling to get the value of the last record.
I am assuming that by "the last value" you mean the max value of the stream, since values arrive continuously. In that case you can use the reduce transformation to keep the output stream updated with the max value.
final StreamsBuilder builder = new StreamsBuilder();
KStream<Long, String> stream = builder.stream("INPUT_TOPIC", Consumed.with(Serdes.Long(), Serdes.String()));
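stream
.mapValues(value -> Long.valueOf(value))
// the value type is now Long, so pass matching serdes instead of relying on the defaults
// (Grouped is available from Kafka Streams 2.1 on)
.groupByKey(Grouped.with(Serdes.Long(), Serdes.Long()))
.reduce(new Reducer<Long>() {
@Override
public Long apply(Long currentMax, Long v) {
return (currentMax > v) ? currentMax : v;
}
})
.toStream().to("OUTPUT_TOPIC", Produced.with(Serdes.Long(), Serdes.Long()));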
return builder.build();
If you want to retrieve it in a REST API, I suggest taking a look at Spring Cloud Stream with the Kafka Streams binder (https://cloud.spring.io/spring-cloud-stream-binder-kafka/spring-cloud-stream-binder-kafka.html), which lets you hand the messages over to Spring Web.
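
If "last record" really means the most recently arrived value rather than the max, only the reducer changes. Here is a small plain-Java sketch of that difference. The fold method is a hypothetical stand-in for how a reducer is applied to the records of one key in arrival order; it is not Kafka Streams API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

class ReduceSemantics {
    // Applies the reducer to one key's values in arrival order and returns the final aggregate.
    static long fold(List<Long> values, BinaryOperator<Long> reducer) {
        long aggregate = values.get(0);
        for (Long value : values.subList(1, values.size())) {
            aggregate = reducer.apply(aggregate, value);
        }
        return aggregate;
    }

    public static void main(String[] args) {
        List<Long> arrivals = Arrays.asList(3L, 9L, 4L);
        BinaryOperator<Long> max = Long::max;                      // keeps the largest value seen
        BinaryOperator<Long> latest = (previous, next) -> next;    // keeps the newest value
        System.out.println(fold(arrivals, max));    // 9
        System.out.println(fold(arrivals, latest)); // 4
    }
}
```

In the topology above, that would mean a reducer that simply returns its second argument.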

How to make Serdes work with multi-step kafka streams

I am new to Kafka and I'm building a starter project using the Twitter API as a data source. I have created a Producer which queries the Twitter API and sends the data to my Kafka topic, with a string serializer for both key and value. My Kafka Streams application reads this data and does a word count, additionally grouped by the date of the tweet. This part is done through a KTable called wordCounts to make use of its upsert functionality. The structure of this KTable is:
Key: {word: exampleWord, date: exampleDate}, Value: numberOfOccurences
I then attempt to restructure the data in the KTable stream into a flat structure so I can later send it to a database. You can see this in the wordCountsStructured KStream object. This restructures the data to look like the structure below. The value is initially a JsonObject, but I convert it to a string to match the Serdes which I set.
Key: null, Value: {word: exampleWord, date: exampleDate, Counts: numberOfOccurences}
However, when I try to send this to my second kafka topic, I get the error below.
A serializer (key:
org.apache.kafka.common.serialization.StringSerializer / value:
org.apache.kafka.common.serialization.StringSerializer) is not
compatible to the actual key or value type (key type:
com.google.gson.JsonObject / value type: com.google.gson.JsonObject).
Change the default Serdes in StreamConfig or provide correct Serdes
via method parameters.
I'm confused by this since the KStream I am sending to the topic is of type <String, String>. Does anyone know how I might fix this?
public class TwitterWordCounter {
private final JsonParser jsonParser = new JsonParser();
public Topology createTopology(){
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("test-topic2");
KTable<JsonObject, Long> wordCounts = textLines
//parse each tweet as a tweet object
.mapValues(tweetString -> new Gson().fromJson(jsonParser.parse(tweetString).getAsJsonObject(), Tweet.class))
//map each tweet object to a list of json objects, each of which containing a word from the tweet and the date of the tweet
.flatMapValues(TwitterWordCounter::tweetWordDateMapper)
//update the key so it matches the word-date combination so we can do a groupBy and count instances
.selectKey((key, wordDate) -> wordDate)
.groupByKey()
.count(Materialized.as("Counts"));
/*
In order to structure the data so that it can be ingested into SQL, the value of each item in the stream must be straightforward: property, value
so we have to:
1. take the columns which include the dimensional data and put them into the value of the stream.
2. label the count with 'count' as the column name
*/
KStream<String, String> wordCountsStructured = wordCounts.toStream()
.map((key, value) -> new KeyValue<>(null, MapValuesToIncludeColumnData(key, value).toString()));
KStream<String, String> wordCountsPeek = wordCountsStructured.peek(
(key, value) -> System.out.println("key: " + key + "value:" + value)
);
wordCountsStructured.to("test-output2", Produced.with(Serdes.String(), Serdes.String()));
return builder.build();
}
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application1111");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "myIPAddress");
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
TwitterWordCounter wordCountApp = new TwitterWordCounter();
KafkaStreams streams = new KafkaStreams(wordCountApp.createTopology(), config);
streams.start();
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
//this method is used for taking a tweet and transforming it to a representation of the words in it plus the date
public static List<JsonObject> tweetWordDateMapper(Tweet tweet) {
try{
List<String> words = Arrays.asList(tweet.tweetText.split("\\W+"));
List<JsonObject> tweetsJson = new ArrayList<JsonObject>();
for(String word: words) {
JsonObject tweetJson = new JsonObject();
tweetJson.add("date", new JsonPrimitive(tweet.formattedDate().toString()));
tweetJson.add("word", new JsonPrimitive(word));
tweetsJson.add(tweetJson);
}
return tweetsJson;
}
catch (Exception e) {
System.out.println(e);
System.out.println(tweet.serialize().toString());
return new ArrayList<JsonObject>();
}
}
public JsonObject MapValuesToIncludeColumnData(JsonObject key, Long countOfWord) {
key.addProperty("count", countOfWord); //new JsonPrimitive(count));
return key;
}
}
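Because you perform a key-changing operation (selectKey) before groupByKey(), Kafka Streams creates a repartition topic, and for that topic it falls back on the default key and value serdes, which you have set to the String serde.
You can pass the serdes explicitly by changing the call to groupByKey(Grouped.with(keySerde, valueSerde)), and this should help. The serdes have to match the actual types at that point in the topology; the error message shows that both the key and the value are JsonObject there, so both need a JSON serde rather than the String serde.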
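
To illustrate why the error only shows up at runtime, here is a simplified plain-Java sketch. The Serializer interface and the write method are hypothetical stand-ins, not Kafka's own classes; they only mimic a framework that applies the configured default serializer without knowing the record's real type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

class SerdeMismatchDemo {
    // Simplified stand-in for a serializer interface.
    interface Serializer<T> {
        byte[] serialize(T data);
    }

    static final Serializer<String> STRING_SERIALIZER =
            text -> text.getBytes(StandardCharsets.UTF_8);

    // The framework side only sees "some serializer" and "some object",
    // so a type mismatch surfaces as a ClassCastException when the record is written.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static String write(Serializer configured, Object record) {
        try {
            byte[] bytes = configured.serialize(record);
            return "wrote " + bytes.length + " bytes";
        } catch (ClassCastException e) {
            return "serializer not compatible with " + record.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(write(STRING_SERIALIZER, "hello")); // wrote 5 bytes
        // A non-String record (standing in for JsonObject) hits the mismatch branch.
        System.out.println(write(STRING_SERIALIZER, Collections.singletonMap("word", "x")));
    }
}
```

The final KStream<String, String> in the question is fine; it is the intermediate repartition step, where the records are still JSON objects, that gets the String serializer.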

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click stream collection and processing. The application consists of Kafka as the event source, a map function and a sink, as shown in the image below:
I want to enrich the incoming click stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.
A simplified slice of the CSV file is shown below:
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I have done some research and found a few potential solutions:
1. Solution: Broadcast the enrichment data and connect it with the event stream using some IP matching logic.
1. Result: It worked well for a few sample IP locations but not with the whole CSV file. The JVM heap reached 3.5 GB, and there is no way to put broadcast state on disk (with RocksDB).
2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction before event processing starts, and enrich the event data in the flatMap method.
2. Result: The enrichment data is too big to store on the JVM heap, so it's impossible to load it into ValueState. Also, de/serializing everything through ValueState is bad practice for data that is key-value in nature.
3. Solution: To avoid the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state, using MapState.
3. Result: Loading the CSV file into MapState in the open() method gave me an error saying that I cannot put into MapState there, because open() does not run in a keyed context (as in this question: Flink keyed stream key is null).
4. Solution: Because MapState (backed by RocksDB) needs a keyed context, I tried to load the whole CSV file into the local RocksDB instance (disk) inside the process function, after turning the DataStream into a KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {
var ipMapState: MapState[String, String] = _
var csvFinishedFlag: ValueState[Boolean] = _
override def processElement(event: Event,
ctx: KeyedProcessFunction[Long, Event, Event]#Context,
out: Collector[Event]): Unit = {
val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])
ipMapState = getRuntimeContext.getMapState(ipDescriptor)
csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)
if (!csvFinishedFlag.value()) {
val csv = new CSVParser(defaultCSVFormat)
val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
for (row <- fileSource.getLines()) {
val Some(List(start, end, country)) = csv.parseLine(row)
ipMapState.put(start, country)
}
fileSource.close()
csvFinishedFlag.update(true)
}
out.collect {
if (ipMapState.contains(event.userIp)) {
// the map value is the country name itself (MapState[String, String])
val country = ipMapState.get(event.userIp)
event.copy(data =
event.data.copy(
ipLocation = Some(country)
))
} else {
event
}
}
}
}
4. Result: It's too hacky, and the blocking file read stalls event processing.
Could you tell me what I can do in this situation?
Thanks
What you can do is to implement a custom partitioner, and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));
DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
.partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
.flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
@Override
public int partition(final Long sensorMeasurement, final int numPartitions) {
return Math.toIntExact(sensorMeasurement % numPartitions);
}
}
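
As a side note, here is a small plain-Java sketch of that assignment rule. It assumes sensor ids are Long values as in the excerpt; safePartition is a hypothetical variant, not part of the linked example. With the % operator a negative id yields a negative result, which falls outside the valid range 0..numPartitions-1, while Math.floorMod stays inside it.

```java
class PartitionAssignment {
    // Same rule as the partitioner above; only valid for non-negative ids.
    static int partition(long sensorId, int numPartitions) {
        return Math.toIntExact(sensorId % numPartitions);
    }

    // Hypothetical variant that also maps negative ids into [0, numPartitions).
    static int safePartition(long sensorId, int numPartitions) {
        return Math.toIntExact(Math.floorMod(sensorId, (long) numPartitions));
    }

    public static void main(String[] args) {
        System.out.println(partition(10L, 4));      // 2
        System.out.println(partition(-3L, 4));      // -3
        System.out.println(safePartition(-3L, 4));  // 1
    }
}
```

Whichever rule is used, the code that loads the reference data has to apply exactly the same one.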
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {
private Map<Long, SensorReferenceData> referenceData;
@Override
public void open(final Configuration parameters) throws Exception {
super.open(parameters);
referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
}
@Override
public void flatMap(
final SensorMeasurement sensorMeasurement,
final Collector<EnrichedMeasurements> collector) throws Exception {
SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
}
private Map<Long, SensorReferenceData> loadReferenceData(
final int partition,
final int numPartitions) {
SensorReferenceDataClient client = new SensorReferenceDataClient();
return client.getSensorReferenceDataForPartition(partition, numPartitions);
}
}
Note that the enrichment is not being done on a keyed stream, so you cannot use keyed state or timers in the enrichment function.
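
For the IP data in the question, the lookup inside each instance has to match an address against a range, not against an exact key. A minimal plain-Java sketch of such a lookup follows. IpRangeLookup is a hypothetical helper, and it assumes IPv4 addresses and non-overlapping ranges as in the CSV sample.

```java
import java.util.Map;
import java.util.TreeMap;

class IpRangeLookup {
    private static final class Range {
        final long end;
        final String country;

        Range(long end, String country) {
            this.end = end;
            this.country = country;
        }
    }

    // Ranges sorted by their numeric start address.
    private final TreeMap<Long, Range> rangesByStart = new TreeMap<>();

    // Converts a dotted IPv4 address to its numeric value.
    static long toLong(String ipv4) {
        long result = 0;
        for (String part : ipv4.split("\\.")) {
            result = result * 256 + Integer.parseInt(part);
        }
        return result;
    }

    void add(String startIp, String endIp, String country) {
        rangesByStart.put(toLong(startIp), new Range(toLong(endIp), country));
    }

    // Finds the range with the greatest start <= ip and checks that ip is within its end.
    String countryOf(String ip) {
        long value = toLong(ip);
        Map.Entry<Long, Range> candidate = rangesByStart.floorEntry(value);
        if (candidate != null && value <= candidate.getValue().end) {
            return candidate.getValue().country;
        }
        return null; // no range covers this address
    }

    public static void main(String[] args) {
        IpRangeLookup lookup = new IpRangeLookup();
        lookup.add("1.1.1.1", "100.100.100.100", "United States of America");
        lookup.add("100.100.100.101", "200.200.200.200", "China");
        System.out.println(lookup.countryOf("8.8.8.8"));   // United States of America
        System.out.println(lookup.countryOf("150.0.0.1")); // China
        System.out.println(lookup.countryOf("0.0.0.1"));   // null
    }
}
```

How to slice ranges across partitions is a separate design choice; the sketch only covers the lookup within one slice.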

Kafka Streams: action on n-th event

I'm trying to find the best way to perform an action on every n-th event in Kafka Streams.
My case: I have an input stream with some Events. I have to filter them by eventType == login, and on each n-th login (let's say, the fifth) for the same accountId, send this Event to the output stream.
After some investigation and several attempts, I have the version of the code below (I'm using Kotlin).
data class Event(
val payload: Any = {},
val accountId: String,
val eventType: String = ""
)
// intermediate class to keep the key and value of the original event
data class LoginEvent(
val eventKey: String,
val eventValue: Event
)
fun process() {
val userLoginsStoreBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("logins"),
Serdes.String(),
Serdes.Integer()
)
val streamsBuilder = StreamsBuilder().addStateStore(userLoginsStoreBuilder)
val inputStream = streamsBuilder.stream<String, String>(inputTopic)
inputStream.map { key, event ->
KeyValue(key, json.readValue<Event>(event))
}.filter { _, event -> event.eventType == "login" }
.map { key, event -> KeyValue(event.accountId, LoginEvent(key, event)) }
.transform(
UserLoginsTransformer("logins", 5),
"logins"
)
.filter { _, value -> value }
.map { key, _ -> KeyValue(key.eventKey, json.writeValueAsString(key.eventValue)) }
.to("fifth_login", Produced.with(Serdes.String(), Serdes.String()))
...
}
class UserLoginsTransformer(private val storeName: String, private val loginsThreshold: Int = 5) :
TransformerSupplier<String, LoginEvent, KeyValue<LoginEvent, Boolean>> {
override fun get(): Transformer<String, LoginEvent, KeyValue<LoginEvent, Boolean>> {
return object : Transformer<String, LoginEvent, KeyValue<LoginEvent, Boolean>> {
private lateinit var store: KeyValueStore<String, Int>
@Suppress("UNCHECKED_CAST")
override fun init(context: ProcessorContext) {
store = context.getStateStore(storeName) as KeyValueStore<String, Int>
}
override fun transform(key: String, value: LoginEvent): KeyValue< LoginEvent, Boolean> {
val counter = (store.get(key) ?: 0) + 1
return if (counter == loginsThreshold) {
store.delete(key)
KeyValue(value, true)
} else {
store.put(key, counter)
KeyValue(value, false)
}
}
override fun close() {
}
}
}
}
My biggest concern is that transform function is not thread-safe in my case. I've checked the implementation of the KV-store that is used in my case and this is RocksDB store (non-transactional) so the value may be updated between reading and comparison and the wrong event will be sent to the output.
My other ideas:
Use materialized views as a store without a transformer but I'm stuck with implementation.
Create a custom persistent KV store that will use TransactionalRocksDB (not sure if it is worth).
Create a custom persistent KV store that will use ConcurrentHashMap inside (it may lead to the high memory consumption in case of many users that we are expecting).
One more note: I'm using Spring Cloud Stream so maybe this framework has a built-in solution for my case but I didn't find it.
I would appreciate any suggestions. Thanks in advance.
There is no reason to be concerned about thread safety here. If you run with multiple threads, each thread will have its own RocksDB that stores one shard of the overall data (note that the overall state is sharded based on input topic partitions, and a single shard is never processed by different threads). Hence, your code will work correctly. The only thing you need to ensure is that the data is partitioned by accountId, such that login events of a single account go to the same shard.
If your input data is already partitioned by accountId when written into your input topic, you don't need to do anything. If not, and you can control the upstream application, it might be simplest to use a custom partitioner in the upstream application's producer to get the partitioning you need. If you can't change the upstream application, you would need to repartition the data after you have set the accountId as the new key, i.e., by calling through() (or repartition(), which replaces it from Kafka Streams 2.6 on) before you call transform().
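
The counting logic itself can be sketched in plain Java as follows. NthLoginCounter is a hypothetical class, and the HashMap is only a stand-in for the per-shard state store; the sketch relies on the point made above that one account's events are always processed sequentially by one thread.

```java
import java.util.HashMap;
import java.util.Map;

class NthLoginCounter {
    // Stand-in for the state store of one shard; accessed by a single thread only.
    private final Map<String, Integer> store = new HashMap<>();
    private final int threshold;

    NthLoginCounter(int threshold) {
        this.threshold = threshold;
    }

    // Returns true on every n-th login of the account and resets that account's counter.
    boolean onLogin(String accountId) {
        int counter = store.getOrDefault(accountId, 0) + 1;
        if (counter == threshold) {
            store.remove(accountId);
            return true;
        }
        store.put(accountId, counter);
        return false;
    }

    public static void main(String[] args) {
        NthLoginCounter counter = new NthLoginCounter(3);
        System.out.println(counter.onLogin("a")); // false
        System.out.println(counter.onLogin("b")); // false
        System.out.println(counter.onLogin("a")); // false
        System.out.println(counter.onLogin("a")); // true
        System.out.println(counter.onLogin("a")); // false
    }
}
```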

Avro with Kafka - Deserializing with changing schema

Based on an Avro schema, I generated a class (Data) that matches the schema.
I then encode the data and send it to another application "A" using Kafka:
Data data; // declaration only, for the example; the object was initialized earlier
EncoderFactory encoderFactory = EncoderFactory.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = encoderFactory.directBinaryEncoder(out, null);
DatumWriter<Data> writer;
writer = new SpecificDatumWriter<Data>(Data.class);
writer.write(data, encoder);
byte[] avroByteMessage = out.toByteArray();
On the other side (in application "A") I deserialize the data by implementing Deserializer:
class DataDeserializer implements Deserializer<Data> {
@Override
public void configure(Map<String, ?> configs, boolean isKey) {
// nothing to do
}
@Override
public Data deserialize(String topic, byte[] data) {
if (data == null) {
return null;
}
try {
DatumReader<Data> reader = new SpecificDatumReader<Data>(Data.class);
DecoderFactory decoderFactory = DecoderFactory.get();
BinaryDecoder decoder = decoderFactory.binaryDecoder(data, null);
Data decoded = reader.read(null, decoder);
return decoded;
} catch (Exception e) {
// keep the original exception as the cause
throw new SerializationException("Error when deserializing byte[] to Data", e);
}
}
@Override
public void close() {
// nothing to do
}
}
The problem is that this approach requires SpecificDatumReader, i.e. the Data class has to be integrated into the application code. This could be problematic: the schema could change, and then the Data class would have to be regenerated and integrated once more.
2 questions:
Should I use GenericDatumReader in the application? How do I do that correctly? (I can simply save the schema in the application.)
Is there a simple way to work with SpecificDatumReader if Data changes? How could it be integrated without much trouble?
Thanks
I use GenericDatumReader -- well, actually I derive my reader class from it, but you get the point. To use it, I keep my schemas in a special Kafka topic -- Schema surprisingly enough. Consumers and producers both, on startup, read from this topic and configure their respective parsers.
If you do it like this, you can even have your consumers and producers update their schemas on the fly, without having to restart them. This was a design goal for me -- I didn't want to have to restart my applications in order to add or change schemas. Which is why SpecificDatumReader doesn't work for me, and honestly why I use Avro in the first place instead of something like Thrift.
Update
The normal way to do Avro is to store the schema in the file with the records. I don't do it that way, primarily because I can't. I use Kafka, so I can't store the schemas directly with the data -- I have to store the schemas in a separate topic.
The way I do it, first I load all of my schemas. You can read them from a text file; but like I said, I read them from a Kafka topic. After I read them from Kafka, I have an array like this:
val schemaArray: Array[String] = Array(
"""{"name":"MyObj","type":"record","fields":[...]}""",
"""{"name":"MyOtherObj","type":"record","fields":[...]}"""
)
Apologies for the Scala BTW, but it's what I've got.
At any rate, you then need to create a parser and, for each schema, parse it, create readers and writers, and save them off to Maps:
val parser = new Schema.Parser()
val schemas = Map(schemaArray.map{s => parser.parse(s)}.map(s => (s.getName, s)):_*)
val readers = schemas.map(s => (s._1, new GenericDatumReader[GenericRecord](s._2)))
val writers = schemas.map(s => (s._1, new GenericDatumWriter[GenericRecord](s._2)))
var decoder: BinaryDecoder = null
I do all of that before I parse an actual record -- that's just to configure the parser. Then, to decode an individual record I would do:
val byteArray: Array[Byte] = ... // <-- Avro encoded record
val schemaName: String = ... // <-- name of the Avro schema
val reader = readers.get(schemaName).get
decoder = DecoderFactory.get.binaryDecoder(byteArray, decoder)
val record = reader.read(null, decoder)
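
The "update schemas on the fly" part boils down to a registry keyed by schema name whose entries can be swapped while the application keeps running. Here is a rough plain-Java sketch of that idea only. ReaderRegistry is a hypothetical name, and Function<byte[], String> is just a stand-in for a GenericDatumReader so the example runs without Avro on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class ReaderRegistry {
    // Schema name -> reader; concurrent so a schema-listener thread can update it safely.
    private final Map<String, Function<byte[], String>> readers = new ConcurrentHashMap<>();

    // Called at startup and again whenever a new or changed schema arrives.
    void register(String schemaName, Function<byte[], String> reader) {
        readers.put(schemaName, reader);
    }

    String decode(String schemaName, byte[] payload) {
        Function<byte[], String> reader = readers.get(schemaName);
        if (reader == null) {
            throw new IllegalArgumentException("unknown schema: " + schemaName);
        }
        return reader.apply(payload);
    }

    public static void main(String[] args) {
        ReaderRegistry registry = new ReaderRegistry();
        byte[] payload = "abc".getBytes(StandardCharsets.UTF_8);
        registry.register("MyObj", bytes -> "v1:" + new String(bytes, StandardCharsets.UTF_8));
        System.out.println(registry.decode("MyObj", payload)); // v1:abc
        // A schema update replaces the reader without a restart.
        registry.register("MyObj", bytes -> "v2:" + new String(bytes, StandardCharsets.UTF_8));
        System.out.println(registry.decode("MyObj", payload)); // v2:abc
    }
}
```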