Flink Scala join between two Streams doesn't seem to work

I want to join two streams of JSON messages coming from a Kafka producer.
The code works if I only filter the data, but it does not seem to work when I join the streams. I want to print the joined stream to the console, but nothing appears.
This is my code:
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.json4s._
import org.json4s.native.JsonMethods
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object App {
def main(args : Array[String]) {
case class Data(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Data, stt: Stt)
case class Datas(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor2(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Datas, stt: Stt)
val properties = new Properties()
properties.setProperty("bootstrap.servers", "")
properties.setProperty("group.id", "test")
val env = StreamExecutionEnvironment.getExecutionEnvironment
val consumer1 = new FlinkKafkaConsumer010[String]("topics1", new SimpleStringSchema(), properties)
val stream1 = env.addSource(consumer1)
val consumer2 = new FlinkKafkaConsumer010[String]("topics2", new SimpleStringSchema(), properties)
val stream2 = env.addSource(consumer2)
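val s1 = stream1.map { x =>
  implicit val formats = DefaultFormats
  JsonMethods.parse(x).extract[Sensor] // assumed: the parsing call is missing from the post
}
val s2 = stream2.map { x =>
  implicit val formats = DefaultFormats
  JsonMethods.parse(x).extract[Sensor2] // assumed: the parsing call is missing from the post
}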
val s1t = s1.assignAscendingTimestamps { x => x.data.timestamp }
val s2t = s2.assignAscendingTimestamps { x => x.data.timestamp }
val j1pre = s1t.join(s2t)
.apply((g, s) => (s.sensor_name, g.sensor_name, s.data.measurement))
I think the problem is in the assignment of the timestamps. I suspect that assignAscendingTimestamps on the two sources is not the right function.
The JSON produced by the Kafka producer has a field data.timestamp that should be used as the event timestamp, but I don't know how to manage that.
I also thought that I might have to assign a time window (a batch, as in Spark) to the incoming tuples, but I'm not sure this is the right solution.

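I think your code needs just some minor adjustments. First of all, since you want to work in event time, you should set the appropriate TimeCharacteristic on the environment: env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime).
Also, the code you pasted is missing a sink for the stream. If you want to print to the console, call print() on the joined stream (j1pre.print()) and make sure the job is started with env.execute().
The rest of your code seems fine.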
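To see why the event-time timestamps matter for the join, here is a minimal plain-Scala model (no Flink API involved) of what a tumbling-window join does: each element is placed in the window that contains its timestamp, and only elements with the same key in the same window are paired. The names `Reading`, `windowStart` and `joinByWindow`, the choice of `sensorName` as the key, and the window size are all assumptions made for illustration; they are not taken from the question's code.

```scala
// Hypothetical stand-in for the parsed sensor records in the question.
case class Reading(sensorName: String, timestamp: Long, measurement: Int)

object TumblingJoinModel {
  // Start of the tumbling window (of length sizeMs) that contains ts.
  // Assumes non-negative timestamps and no window offset.
  def windowStart(ts: Long, sizeMs: Long): Long = ts - (ts % sizeMs)

  // Pair up readings that share both the key and the window.
  def joinByWindow(left: Seq[Reading], right: Seq[Reading], sizeMs: Long): Seq[(String, Int, Int)] =
    for {
      l <- left
      r <- right
      if l.sensorName == r.sensorName
      if windowStart(l.timestamp, sizeMs) == windowStart(r.timestamp, sizeMs)
    } yield (l.sensorName, l.measurement, r.measurement)
}
```

With a 5000 ms window, readings of the same sensor at 1000 ms and 4000 ms fall into the same window and are joined, while a reading at 7000 ms lands in the next window and is not. In Flink itself, an event-time window emits its results only after the watermark has passed the end of the window.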


How can I get all the data in flatMapGroup in Spark Structured Streaming?

I have a scenario in which we need to calculate 'charges' for a stream of events which has the following details:
1. Event contains eventTime, facet, units
2. Free quantity per facet that needs to offset the earliest events based on the eventTime
3. Prices are also specified per facet
4. Events that arrive in a single minute can be considered equivalent (for reduced state maintenance) and all of them need to have free units proportionally distributed
I was hoping to make this work in the following manner using Spark Structured Streaming:
1. Aggregate events at a minute level per facet using the window function per facet
2. Join with the price and free quantity
3. Group by facet
4. flatMapGroups by facet to then sort the aggregations by window start time and apply the results
What I am noticing is that the output of step 4 contains only the aggregations for which new events came in, not all the aggregations since the watermark.
Question: how can I fix this code to get all aggregations since the watermark?
Could someone help?
package test
import java.sql.Timestamp
import java.util.UUID
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{sum, udf, window}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
case class UsageEvent(Id: String, facetId: String, Units: Double, timeStamp:Timestamp)
case class FacetPricePoints(facetId: String, Price: Double, FreeUnits: Double)
case class UsageBlock(facetId: String, start:Timestamp, Units: Double)
case class UsageBlockWithPrice(facetId: String, start:Timestamp, Units: Double, Price: Double, FreeUnits: Double)
case class UsageBlockWithCharge(facetId: String, start:Timestamp, Units: Double, Price: Double, FreeUnits: Double, ChargedUnits: Double, Charge: Double)
object TestProcessing {
def getUsageEventStream(ts: Timestamp, units: String) : UsageEvent = {UsageEvent(UUID.randomUUID().toString, "F1", units.toInt % 20, ts)}
implicit def ordered: Ordering[Timestamp] = new Ordering[Timestamp] {def compare(x: Timestamp, y: Timestamp): Int = x compareTo y}
def ChargeUsageBlock(Key: String, Value: Iterator[UsageBlockWithPrice]): Iterator[UsageBlockWithCharge] = {
  // Process the blocks in window order so the earliest usage consumes the free units first.
  val usageBlocks = Value.toList.sortBy(ub => ub.start)
  var freeUnits = 0.0
  var freeUnitsSet = false
  val newUe = for (ue <- usageBlocks)
    yield {
      // Take the free allowance from the first block only, then count it down.
      freeUnits = if (!freeUnitsSet) ue.FreeUnits else freeUnits
      freeUnitsSet = true
      val freeUnitsInBlock = if (freeUnits > ue.Units) ue.Units else freeUnits
      val chargedUnits = ue.Units - freeUnitsInBlock
      freeUnits -= freeUnitsInBlock // todo: need to specify precision and rounding
      UsageBlockWithCharge(ue.facetId, ue.start, ue.Units, ue.Price, freeUnitsInBlock, chargedUnits, chargedUnits * ue.Price)
    }
  newUe.iterator
}
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.appName("Test").getOrCreate()
val stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
import spark.implicits._
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val prices = ssc.sparkContext.parallelize(List( FacetPricePoints("F1", 30.0, 100.0))).toDF()
val getUsageEventStreamUDF = udf((ts: Timestamp, units: String) => getUsageEventStream(ts, units)) // .where($"Units" < 2).
val usageEventsRaw = stream.withColumn("Usage", getUsageEventStreamUDF(stream("timestamp"), stream("value"))).select("Usage.*").as[UsageEvent].dropDuplicates("Id").withWatermark("timeStamp", "1 hour")
val aggUsage = usageEventsRaw.groupBy($"facetId", window($"timeStamp", "1 minute")).agg(sum($"Units") as "Units").selectExpr("facetId", "window.start", "Units").as[UsageBlock]
val fifoRate = (Key: String, Value: Iterator[UsageBlockWithPrice]) => { ChargeUsageBlock(Key, Value) }
val aggUsageCharge = aggUsage.joinWith(prices, prices.col("facetId") === usageEventsRaw.col("facetId")).select("_1.*", "_2.Price", "_2.FreeUnits").as[UsageBlockWithPrice].groupByKey(x => x.facetId).flatMapGroups(fifoRate).withWatermark("start", "1 hour")
val fin = aggUsageCharge.writeStream.trigger((Trigger.ProcessingTime("10 seconds"))).outputMode(OutputMode.Update).format("console").start()
// this applies freeUnits for every minute instead of just applying it once
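
The FIFO allocation that ChargeUsageBlock is meant to perform can also be written without mutable state. The following is a minimal plain-Scala sketch, independent of Spark; `Block`, `Charged` and `FifoCharge` are hypothetical names, and the free allowance and price are passed in as plain arguments instead of being read from each block. It illustrates the allocation only and does not address which aggregates Structured Streaming hands to flatMapGroups.

```scala
// Hypothetical, simplified versions of the usage-block classes in the question.
case class Block(start: Long, units: Double)
case class Charged(start: Long, units: Double, freeUnits: Double, chargedUnits: Double, charge: Double)

object FifoCharge {
  // Apply the free allowance to the earliest blocks first and charge the remainder at `price`.
  def charge(blocks: Seq[Block], freeAllowance: Double, price: Double): List[Charged] = {
    val sorted = blocks.sortBy(_.start)
    val (_, out) = sorted.foldLeft((freeAllowance, List.empty[Charged])) {
      case ((remaining, acc), b) =>
        val free = math.min(remaining, b.units)   // free units consumed by this block
        val charged = b.units - free              // units that must be paid for
        (remaining - free, Charged(b.start, b.units, free, charged, charged * price) :: acc)
    }
    out.reverse // restore ascending start order
  }
}
```

For example, with an allowance of 100 units, a block of 70 units followed by a block of 60 units results in 0 charged units for the first block and 30 for the second.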

Consuming RESTful API and converting to Dataframe in Apache Spark

I am trying to convert the output of a RESTful API URL directly to a DataFrame in the following way:
package trials
import org.apache.spark.sql.SparkSession
import org.json4s.jackson.JsonMethods.parse
import scala.io.Source.fromURL
object DEF {
implicit val formats = org.json4s.DefaultFormats
case class Result(success: Boolean,
message: String,
result: Array[Markets])
case class Markets(
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.config("spark.sql.shuffle.partitions", "4")
.getOrCreate()
import spark.implicits._
val parsedData = parse(fromURL("https://bittrex.com/api/v1.1/public/getmarkets").mkString).extract[Array[Result]]
val mySourceDataset = spark.createDataset(parsedData)
The error is as follows and it repeats for every record:
Caused by: org.json4s.package$MappingException: Expected collection but got JObject(List((success,JBool(true)), (message,JString()), (result,JArray(List(JObject(List((MarketCurrency,JString(LTC)), (BaseCurrency,JString(BTC)), (MarketCurrencyLong,JString(Litecoin)), (BaseCurrencyLong,JString(Bitcoin)), (MinTradeSize,JDouble(0.01435906)), (MarketName,JString(BTC-LTC)), (IsActive,JBool(true)), (Created,JString(2014-02-13T00:00:00)), (Notice,JNull), (IsSponsored,JNull), (LogoUrl,JString(https://bittrexblobstorage.blob.core.windows.net/public/6defbc41-582d-47a6-bb2e-d0fa88663524.png))))))))) and mapping Result[][Result, Result]
at org.json4s.reflect.package$.fail(package.scala:96)
The structure of the JSON returned from this URL is:
{
  "success": boolean,
  "message": string,
  "result": [ ... ]
}
So Result class should be aligned with this structure:
case class Result(success: Boolean,
message: String,
result: List[Markets])
And I also refined slightly the Markets class:
case class Markets(
MarketCurrency: String,
BaseCurrency: String,
MarketCurrencyLong: String,
BaseCurrencyLong: String,
MinTradeSize: Double,
MarketName: String,
IsActive: Boolean,
Created: String,
Notice: Option[String],
IsSponsored: Option[Boolean],
LogoUrl: String
)
But the main issue is in the extraction of the main data part from the parsed JSON:
val parsedData = parse(fromURL("{url}").mkString).extract[Array[Result]]
The root of the returned structure is not an array, but corresponds to Result. So it should be:
val parsedData = parse(fromURL("{url}").mkString).extract[Result]
Then, I suppose that you do not want to load the wrapper into the DataFrame, but rather the Markets that are inside it. That is why it should be loaded like this:
val mySourceDataset = spark.createDataset(parsedData.result)
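And it finally produces the DataFrame:
+--------------+------------+------------------+----------------+------------+----------+--------+-------------------+------+-----------+--------------------+
|MarketCurrency|BaseCurrency|MarketCurrencyLong|BaseCurrencyLong|MinTradeSize|MarketName|IsActive|            Created|Notice|IsSponsored|             LogoUrl|
+--------------+------------+------------------+----------------+------------+----------+--------+-------------------+------+-----------+--------------------+
|           LTC|         BTC|          Litecoin|         Bitcoin|  0.01435906|   BTC-LTC|    true|2014-02-13T00:00:00|  null|       null|https://bittrexbl...|
|          DOGE|         BTC|          Dogecoin|         Bitcoin|396.82539683|  BTC-DOGE|    true|2014-02-13T00:00:00|  null|       null|https://bittrexbl...|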

Kafka and akka (scala): How to create Source[CommittableMessage[Array[Byte], String], Consumer.Control]?

For a unit test, I would like to create a source that emits committable messages and materializes a Consumer.Control.
Alternatively, I would like to transform a source created like this:
val message: Source[Array[Byte], NotUsed] = Source.single("one message".getBytes)
into something like this:
Source[CommittableMessage[Array[Byte], String], Consumer.Control]
The goal is to unit test the actor's behavior on a message without having to install Kafka on the build machine.
You can use this helper to create a CommittableMessage:
package akka.kafka.internal
import akka.Done
import akka.kafka.ConsumerMessage.{CommittableMessage, CommittableOffsetBatch, GroupTopicPartition, PartitionOffset}
import akka.kafka.internal.ConsumerStage.Committer
import org.apache.kafka.clients.consumer.ConsumerRecord
import scala.collection.immutable
import scala.concurrent.Future
object AkkaKafkaHelper {
  // Committer stub: every commit succeeds immediately without contacting Kafka.
  private val committer = new Committer {
    def commit(offsets: immutable.Seq[PartitionOffset]): Future[Done] = Future.successful(Done)
    def commit(batch: CommittableOffsetBatch): Future[Done] = Future.successful(Done)
  }

  def commitableMessage[K, V](key: K, value: V, topic: String = "topic", partition: Int = 0, offset: Int = 0, groupId: String = "group"): CommittableMessage[K, V] = {
    val partitionOffset = PartitionOffset(GroupTopicPartition(groupId, topic, partition), offset)
    val record = new ConsumerRecord(topic, partition, offset, key, value)
    CommittableMessage(record, ConsumerStage.CommittableOffsetImpl(partitionOffset)(committer))
  }
}
Use Consumer.committableSource to create a Source[CommittableMessage[K, V], Control]. The idea is that in your test you would produce one or more messages onto some topic, then use committableSource to consume from that same topic.
The following is an example that illustrates this approach: it's a slightly adjusted excerpt from the IntegrationSpec in the Akka Streams Kafka project. IntegrationSpec uses scalatest-embedded-kafka, which provides an in-memory Kafka instance for ScalaTest specs.
Source(1 to 100)
.map(n => new ProducerRecord(topic1, partition0, null: Array[Byte], n.toString))
val consumerSettings = createConsumerSettings(group1)
val (control, probe1) = Consumer.committableSource(consumerSettings, TopicSubscription(Set(topic1)))
.filterNot(_.record.value == InitialMsg)
.mapAsync(10) { elem =>
elem.committableOffset.commitScaladsl().map { _ => Done }
.expectNextN(25).toSet should be(Set(Done))
Await.result(control.isShutdown, remainingOrDefault)

How to ensure constant Avro schema generation and avoid the 'Too many schema objects created for x' exception?

I am experiencing a reproducible error while producing Avro messages with reactive kafka and avro4s. Once the identityMapCapacity of the client (CachedSchemaRegistryClient) is reached, serialization fails with
java.lang.IllegalStateException: Too many schema objects created for <myTopic>-value
This is unexpected, since all messages should have the same schema - they are serializations of the same case class.
val avroProducerSettings: ProducerSettings[String, GenericRecord] =
ProducerSettings(system, Serdes.String().serializer(),
val avroProdFlow: Flow[ProducerMessage.Message[String, GenericRecord, String],
ProducerMessage.Result[String, GenericRecord, String],
NotUsed] = Producer.flow(avroProducerSettings)
val avroQueue: SourceQueueWithComplete[Message[String, GenericRecord, String]] =
Source.queue(bufferSize, overflowStrategy)
The serializer is a KafkaAvroSerializer, instantiated with a new CachedSchemaRegistryClient(settings.schemaRegistry, 1000)
Generating the GenericRecord:
def toAvro[A](a: A)(implicit recordFormat: RecordFormat[A]): GenericRecord =
  recordFormat.to(a)
val makeEdgeMessage: (Edge, String) => Message[String, GenericRecord, String] = { (edge, topic) =>
  val edgeAvro: GenericRecord = toAvro(edge)
  val record = new ProducerRecord[String, GenericRecord](topic, edge.id, edgeAvro)
  ProducerMessage.Message(record, edge.id)
}
The schema is created deep in the code (io.confluent.kafka.serializers.AbstractKafkaAvroSerDe#getSchema, invoked by io.confluent.kafka.serializers.AbstractKafkaAvroSerializer#serializeImpl) where I have no influence on it, so I have no idea how to fix the leak. Looks to me like the two confluent projects do not work well together.
The issues I have found here, here and here do not seem to address my use case.
The two workarounds I currently see are:
1. not using the schema registry, which is obviously not a long-term solution
2. creating a custom SchemaRegistryClient that does not rely on object identity, which is doable, but I would like to avoid creating more issues than I solve by reimplementing it
Is there a way to generate or cache a consistent schema depending on message/record type and use it with my setup?
edit 2017.11.20
The issue in my case was that each instance of GenericRecord carrying my message was serialized by a different instance of RecordFormat, containing a different instance of the Schema. The implicit resolution here generated a new instance each time.
def toAvro[A](a: A)(implicit recordFormat: RecordFormat[A]): GenericRecord = recordFormat.to(a)
The solution was to pin the RecordFormat instance to a val and reuse it explicitly. Many thanks to https://github.com/heliocentrist for explaining the details.
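The effect described above can be reproduced in plain Scala without avro4s. In the sketch below, `Format` is a hypothetical stand-in for RecordFormat (each instance owns its own schema object), and `ImplicitIdentityDemo`, `schemaOf`, `freshPerCall` and `pinned` are names invented for the illustration. It compares what an identity-based cache would see when the implicit is a def versus a val.

```scala
// Hypothetical stand-in for avro4s' RecordFormat: each instance owns its own schema object.
final class Format[A] {
  val schema: AnyRef = new Object
}

object ImplicitIdentityDemo {
  // Returns the schema object carried by whichever Format the compiler resolves.
  def schemaOf[A](a: A)(implicit format: Format[A]): AnyRef = format.schema

  // An implicit def is re-evaluated at every call site, so each call sees a new schema object.
  def freshPerCall(): Boolean = {
    implicit def format: Format[String] = new Format[String]
    schemaOf("a") eq schemaOf("b")
  }

  // An implicit val is evaluated once, so every call shares the same schema object.
  def pinned(): Boolean = {
    implicit val format: Format[String] = new Format[String]
    schemaOf("a") eq schemaOf("b")
  }
}
```

Here `freshPerCall()` returns false and `pinned()` returns true: only the pinned variant hands the same schema instance to every call, which is what a cache keyed on object identity needs. With avro4s, the equivalent would be to hold the format in a val (for example something along the lines of `val edgeFormat = RecordFormat[Edge]`, assuming the avro4s version in use provides that factory) and pass it explicitly.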
original response:
After waiting for a while (and getting no answer on the GitHub issue), I had to implement my own SchemaRegistryClient. Over 90% of it is copied from the original CachedSchemaRegistryClient, just translated into Scala. Using a Scala mutable.Map fixed the memory leak. I have not performed any comprehensive tests, so use at your own risk.
import java.util
import io.confluent.kafka.schemaregistry.client.rest.entities.{ Config, SchemaString }
import io.confluent.kafka.schemaregistry.client.rest.entities.requests.ConfigUpdateRequest
import io.confluent.kafka.schemaregistry.client.rest.{ RestService, entities }
import io.confluent.kafka.schemaregistry.client.{ SchemaMetadata, SchemaRegistryClient }
import org.apache.avro.Schema
import scala.collection.mutable
class CachingSchemaRegistryClient(val restService: RestService, val identityMapCapacity: Int)
    extends SchemaRegistryClient {
  val schemaCache: mutable.Map[String, mutable.Map[Schema, Integer]] = mutable.Map()
  val idCache: mutable.Map[String, mutable.Map[Integer, Schema]] =
    mutable.Map(null.asInstanceOf[String] -> mutable.Map())
  val versionCache: mutable.Map[String, mutable.Map[Schema, Integer]] = mutable.Map()

  def this(baseUrl: String, identityMapCapacity: Int) {
    this(new RestService(baseUrl), identityMapCapacity)
  }
  def this(baseUrls: util.List[String], identityMapCapacity: Int) {
    this(new RestService(baseUrls), identityMapCapacity)
  }

  def registerAndGetId(subject: String, schema: Schema): Int =
    restService.registerSchema(schema.toString, subject)

  def getSchemaByIdFromRegistry(id: Int): Schema = {
    val restSchema: SchemaString = restService.getId(id)
    (new Schema.Parser).parse(restSchema.getSchemaString)
  }

  def getVersionFromRegistry(subject: String, schema: Schema): Int = {
    val response: entities.Schema = restService.lookUpSubjectVersion(schema.toString, subject)
    response.getVersion.intValue // assumed: the return expression is missing from the post
  }

  override def getVersion(subject: String, schema: Schema): Int = synchronized {
    val schemaVersionMap: mutable.Map[Schema, Integer] =
      versionCache.getOrElseUpdate(subject, mutable.Map())
    val version: Integer = schemaVersionMap.getOrElse(
      schema, {
        if (schemaVersionMap.size >= identityMapCapacity) {
          throw new IllegalStateException(s"Too many schema objects created for $subject!")
        }
        val version = new Integer(getVersionFromRegistry(subject, schema))
        schemaVersionMap.put(schema, version)
        version // assumed: the result expression is missing from the post
      }
    )
    version.intValue
  }

  override def getAllSubjects: util.List[String] = restService.getAllSubjects()
  override def getByID(id: Int): Schema = synchronized { getBySubjectAndID(null, id) }

  override def getBySubjectAndID(subject: String, id: Int): Schema = synchronized {
    val idSchemaMap: mutable.Map[Integer, Schema] = idCache.getOrElseUpdate(subject, mutable.Map())
    idSchemaMap.getOrElseUpdate(id, getSchemaByIdFromRegistry(id))
  }

  override def getSchemaMetadata(subject: String, version: Int): SchemaMetadata = {
    val response = restService.getVersion(subject, version)
    val id = response.getId.intValue
    val schema = response.getSchema
    new SchemaMetadata(id, version, schema)
  }

  override def getLatestSchemaMetadata(subject: String): SchemaMetadata = synchronized {
    val response = restService.getLatestVersion(subject)
    val id = response.getId.intValue
    val version = response.getVersion.intValue
    val schema = response.getSchema
    new SchemaMetadata(id, version, schema)
  }

  override def updateCompatibility(subject: String, compatibility: String): String = {
    val response: ConfigUpdateRequest = restService.updateCompatibility(compatibility, subject)
    response.getCompatibilityLevel // assumed: the return expression is missing from the post
  }

  override def getCompatibility(subject: String): String = {
    val response: Config = restService.getConfig(subject)
    response.getCompatibilityLevel // assumed: the return expression is missing from the post
  }

  override def testCompatibility(subject: String, schema: Schema): Boolean =
    restService.testCompatibility(schema.toString(), subject, "latest")

  override def register(subject: String, schema: Schema): Int = synchronized {
    val schemaIdMap: mutable.Map[Schema, Integer] =
      schemaCache.getOrElseUpdate(subject, mutable.Map())
    val id = schemaIdMap.getOrElse(
      schema, {
        if (schemaIdMap.size >= identityMapCapacity)
          throw new IllegalStateException(s"Too many schema objects created for $subject!")
        val id: Integer = new Integer(registerAndGetId(subject, schema))
        schemaIdMap.put(schema, id)
        idCache(null).put(id, schema)
        id // assumed: the result expression is missing from the post
      }
    )
    id.intValue
  }
}

Create a RDD : too many fields => use case class for RDD

I have a labeled intrusion dataset that I want to use to test different supervised machine learning techniques.
Here is part of my code:
object parser_dataset {
val conf = new SparkConf()
.set("spark.executor.memory", "8g")
classOf[Array[scala.Tuple3[Int, Int, Int]]],
val context = new SparkContext(conf)
def load(file: String): RDD[(Int, String, String,String,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Double,Double,Double,Double,Double,Double,Double, Int, Int,Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
val data = context.textFile(file)
val res = data.map(x => {
val s = x.split(",")
(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
})
res
}
def main(args: Array[String]) {
val data = this.load("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected")
This is not my code; it was given to me and I only modified some parts (especially the RDD and splitting parts). I'm a newbie at Scala and Spark :)
So I added case classes above my load function, like this:
case class BasicFeatures(duration:Int, protocol_type:String, service:String, flag:String, src_bytes:Int, dst_bytes:Int, land:Int, wrong_fragment:Int, urgent:Int)
case class ContentFeatures(hot:Int, num_failed_logins:Int, logged_in:Int, num_compromised:Int, root_shell:Int, su_attempted:Int, num_root:Int, num_file_creations:Int, num_shells:Int, num_access_files:Int, num_outbound_cmds:Int, is_host_login:Int, is_guest_login:Int)
case class TrafficFeatures(count:Int, srv_count:Int, serror_rate:Double, srv_error_rate:Double, rerror_rate:Double, srv_rerror_rate:Double, same_srv_rate:Double, diff_srv_rate:Double, srv_diff_host_rate:Double, dst_host_count:Int, dst_host_srv_count:Int, dst_host_same_srv_rate:Double, dst_host_diff_srv_rate:Double, dst_host_same_src_port_rate:Double, dst_host_srv_diff_host_rate:Double, dst_host_serror_rate:Double, dst_host_srv_serror_rate:Double, dst_host_rerror_rate:Double, dst_host_srv_rerror_rate:Double, attack_type:String )
But now I am confused: how can I use these to solve my problem? I still need an RDD in which one feature corresponds to one field.
Here is one line of the file I want to parse:
The maximum tuple size supported by Scala is 22, and Scala functions are likewise limited to 22 parameters. Hence you cannot create a tuple with more than 22 elements.
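One way around the limit, sketched here under stated assumptions, is to parse each line into nested case classes, so that no single tuple or constructor needs more than 22 fields. The example is deliberately scaled down: `BasicFeatures`, `TrafficFeatures`, `Connection` and `ConnectionParser` below are hypothetical, shortened versions of the classes in the question, and the sample line in the usage note is made up.

```scala
// Groups of related columns; each group stays well below 22 fields.
case class BasicFeatures(duration: Int, protocolType: String, service: String, flag: String)
case class TrafficFeatures(count: Int, srvCount: Int, serrorRate: Double)

// The record nests the groups instead of flattening every column into one tuple.
case class Connection(basic: BasicFeatures, traffic: TrafficFeatures, attackType: String)

object ConnectionParser {
  // Parse one comma-separated line into the nested structure.
  // Assumes exactly 8 columns in the order shown; no validation is done.
  def parse(line: String): Connection = {
    val s = line.split(",")
    Connection(
      BasicFeatures(s(0).toInt, s(1), s(2), s(3)),
      TrafficFeatures(s(4).toInt, s(5).toInt, s(6).toDouble),
      s(7)
    )
  }
}
```

Calling `ConnectionParser.parse("0,tcp,http,SF,8,8,0.0,normal.")` yields a `Connection` whose fields are reached by name, for example `.traffic.count`. In Spark, the same function could be passed to `map` on the result of `textFile`, giving an RDD of `Connection` values instead of an RDD of tuples.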