TopologyException on left join - scala

I'm trying to do a simple stream.leftJoin(table) but get the following exception at runtime:
TopologyException: Invalid topology: StateStore null is not added yet
This is roughly what my code looks like; I commented out the implementation details to keep it short:
val streamsConfiguration: Properties = {
val p = new Properties()
// api config
p.put(StreamsConfig.APPLICATION_ID_CONFIG, /** */)
p.put(StreamsConfig.CLIENT_ID_CONFIG, /** */)
// kafka broker
p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// local state store
p.put(StreamsConfig.STATE_DIR_CONFIG, "./streams-state")
p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
// serdes
p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, classOf[StringSerde])
p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, classOf[StringSerde])
p
}
val builder = new StreamsBuilderS()
val rawInfoTable: KTableS[String, String] = builder.table("station_info")
val infoTable: KTableS[String, StationInfo] = rawInfoTable.mapValues{jsonString =>
/** */
}.filter(/** */)
.mapValues(/** */)
val rawStatusStream: KStreamS[String, String] = builder.stream("station_status")
val statusStream: KStreamS[String, StationStatus] = rawStatusStream.flatMapValues{jsonString =>
/** */
}
val outputStream: KStreamS[String, String] = statusStream
.leftJoin(infoTable, calculateStats)
.filter((_, availability) => {
/** */
})
.map((stationId: String, availability) => {
/** */
})
outputStream.to("low_availability")
val streams = new KafkaStreams(builder.build(), streamsConfiguration)
streams.cleanUp()
streams.start()
I even tried to manually add a StateStore via:
val store = Stores.inMemoryKeyValueStore("my-store")
val storeBuilder = Stores.keyValueStoreBuilder(store, new StringSerde(), new StringSerde())
val builder = new StreamsBuilderS()
builder.addStateStore(storeBuilder)
But it doesn't seem to change anything. I'm using the Kafka Streams wrapper from Lightbend: "com.lightbend" %% "kafka-streams-scala" % "0.2.1"
All the examples I checked don't seem to care about adding a state store, so I'm somewhat confused. Can somebody point me in the right direction? Does this have something to do with the STATE_DIR_CONFIG? Or with the Kafka cluster I'm running locally?
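For comparison only (not a verified fix for the Lightbend 0.2.1 wrapper), here is roughly how the same stream-table left join looks against the plain Java DSL, with the table materialized into an explicitly named store so the join has a concrete state store to reference. The store name, serdes and joiner below are placeholders:

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{Consumed, Materialized, Produced, ValueJoiner}
import org.apache.kafka.streams.state.KeyValueStore

val builder = new StreamsBuilder()

// Materialize the table into a named store ("station-info-store" is a placeholder name)
val infoTable = builder.table(
  "station_info",
  Materialized
    .as[String, String, KeyValueStore[Bytes, Array[Byte]]]("station-info-store")
    .withKeySerde(Serdes.String())
    .withValueSerde(Serdes.String())
)

val statusStream = builder.stream("station_status", Consumed.`with`(Serdes.String(), Serdes.String()))

// Placeholder joiner standing in for calculateStats; info is null for unmatched keys in a left join
val joiner = new ValueJoiner[String, String, String] {
  override def apply(status: String, info: String): String = s"$status|$info"
}

statusStream
  .leftJoin(infoTable, joiner)
  .to("low_availability", Produced.`with`(Serdes.String(), Serdes.String()))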

Related

Adding a name to source processor of Kafka streams app results in serialization exception

I'm trying to name my source processor using the Consumed.as() method (full code below):
val usersOrdersStreams: KStream[UserId, Order] = builder
.stream[UserId, Order](ordersByUserTopic)(Consumed.as("topic-name"))
However, when I run the application I get the following exception:
org.apache.kafka.common.config.ConfigException: Please specify a value serde or set one through StreamsConfig#DEFAULT_VALUE_SERDE_CLASS_CONFIG
When I looked at the definition of .as() I saw this:
public static <K, V> Consumed<K, V> as(final String processorName) {
return new Consumed<>(null, null, null, null, processorName);
}
So I guessed the issue was that the key/value serdes were set to null.
I tried to solve it by adding a call to withValueSerde():
val orderSerde = ...
val usersOrdersStreams: KStream[UserId, Order] = builder
.stream[UserId, Order](ordersByUserTopic)(Consumed.as("topic-name").withValueSerde(orderSerde))
But got the same error. What am I doing wrong?
Note: if I remove the Consumed.as() part, the code works and the exception is not thrown.
Following is the full code (some imports were removed for readability reasons):
import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.streams.kstream.{GlobalKTable, JoinWindows, TimeWindows, Windowed}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes
import org.apache.kafka.streams.scala.serialization.Serdes._
import scala.concurrent.duration._
object KafkaStreamsApp {
implicit def serde[A >: Null : Decoder : Encoder]: Serde[A] = {
val serializer = (a: A) => a.asJson.noSpaces.getBytes
val deserializer = (aAsBytes: Array[Byte]) => {
val aAsString = new String(aAsBytes)
val aOrError = decode[A](aAsString)
aOrError match {
case Right(a) => Option(a)
case Left(error) =>
Option.empty
}
}
Serdes.fromFn[A](serializer, deserializer)
}
implicit val orderSerde: Serde[Order] = serde[Order]
// Topics
final val ordersByUserTopic = "orders-by-user"
final val filterOrders = "filter-low-orders"
final val applyMapValues = "mapValues-apply-discount"
final val payedOrdersTopic = "filtered-orders"
type UserId = String
case class Order(user: UserId, amount: Double)
val builder = new StreamsBuilder
val usersOrdersStreams: KStream[UserId, Order] =
builder.stream[UserId, Order](ordersByUserTopic)(Consumed.as("vvv").withValueSerde(orderSerde))
def paidOrdersTopology(): Unit = {
usersOrdersStreams
.filter((_, v) => v.amount > 1000.0, named = Named.as(filterOrders))
.mapValues(v => v.copy(amount = v.amount * 0.85), named = Named.as(applyMapValues))
.to(payedOrdersTopic)
}
def main(args: Array[String]): Unit = {
val props = new Properties
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-application")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.stringSerde.getClass)
paidOrdersTopology()
val topology: Topology = builder.build()
println(topology.describe())
val application: KafkaStreams = new KafkaStreams(topology, props)
application.start()
}
}
So... after some digging I managed to find the issue: the key serde was missing. The following code sets only the value serde, which creates a Consumed object with a null key serde:
val orderSerde = ...
val usersOrdersStreams: KStream[UserId, Order] = builder
.stream[UserId, Order](ordersByUserTopic)(Consumed.as("topic-name").withValueSerde(orderSerde))
When I added the key serde as well:
val orderSerde = ...
val consumed = Consumed.as("topic-name")
.withKeySerde(Serdes.stringSerde) // the key serde that was missing
.withValueSerde(orderSerde)
val usersOrdersStreams: KStream[UserId, Order] =
builder.stream[UserId, Order](ordersByUserTopic)(consumed)
The code started working.
The only thing I'm not sure about is why the error stated that the value serde was missing, when it was actually the key serde that was missing.
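For what it's worth, an equivalent way to build the same Consumed is to start from the Consumed.with factory, which takes both serdes up front, and attach the processor name afterwards. This is a sketch that assumes the Java org.apache.kafka.streams.kstream.Consumed is the one in scope (as in the quoted definition above) and that your Kafka Streams version exposes withName:

// `with` is a Scala keyword, hence the backticks
val consumed: Consumed[UserId, Order] =
  Consumed.`with`(Serdes.stringSerde, orderSerde) // both serdes supplied up front
    .withName("topic-name")                       // processor name, if withName is available in your version

val usersOrdersStreams: KStream[UserId, Order] =
  builder.stream[UserId, Order](ordersByUserTopic)(consumed)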

How to iterate over key values of a Kafka Streams Table

I'm new to Kafka Streams and I tried to iterate over the items of a Kafka Streams table via the KeyValueStore:
The idea is to create a KTable (I've also tried with a GlobalKTable) backed by a KeyValueStore.
Then a separate thread is in charge of reading the content of the KeyValueStore in order to iterate over the last value of each key.
val streamProperties: Properties = {
val p = new Properties()
p.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-application")
p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, config.getStringList("kafka.bootstrap.servers").toList.mkString(","))
p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray.getClass.getName)
p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
p
}
val builder: StreamsBuilder = new StreamsBuilder()
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.KeyValueStore
val globalTable = builder.table("test",
Materialized
.as[String, Array[Byte], KeyValueStore[org.apache.kafka.common.utils.Bytes, Array[Byte]]]("TestStore")
.withCachingDisabled()
.withKeySerde(Serdes.String())
.withValueSerde(Serdes.ByteArray())
)
val streams: KafkaStreams = new KafkaStreams(builder.build(), streamProperties)
streams.start()
val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
def run() = {
println("\n\n\n tick \n\n\n")
try {
val keyValueStore = streams.store(globalTable.queryableStoreName(), QueryableStoreTypes.keyValueStore())
keyValueStore.all().toIterator.foreach { k => // foreach forces the iterator; map alone is lazy and never runs
print(k.key)
}
} catch {
case _: Throwable => println("error")
}
}
}
val f = ex.scheduleAtFixedRate(task, 1, 10, TimeUnit.SECONDS)
}
}
In the thread, the keyValueStore stays empty even when I produce items on the topic "test".
Is there something I missed or didn't understand?
One thing missing is the state directory location config:
p.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp")
Without it, Kafka Streams will not throw an exception, but stateful features such as (global) KTables can silently fail to work.
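To illustrate the suggestion, here is a minimal sketch: the config line from the answer, plus a guard around the store lookup so it only runs once the instance reports RUNNING (querying earlier typically fails with InvalidStateStoreException). The readStore helper and the guard are my additions for illustration; "TestStore" is the name passed to Materialized.as above:

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.state.QueryableStoreTypes

// Inside the streamProperties block, as suggested above:
// p.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp")

// Illustrative guard around the store lookup
def readStore(streams: KafkaStreams): Unit = {
  if (streams.state() == KafkaStreams.State.RUNNING) {
    val store = streams.store("TestStore", QueryableStoreTypes.keyValueStore[String, Array[Byte]]())
    val it = store.all()
    try {
      while (it.hasNext) print(it.next().key)
    } finally {
      it.close() // KeyValueIterator must always be closed
    }
  }
}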

Spark rdd write to Hbase

I am able to read the messages from Kafka using the below code:
val ssc = new StreamingContext(sc, Seconds(50))
val topicmap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap)
But I am trying to read each message from Kafka and put it into HBase. This is my code to write into HBase, but with no success.
lines.foreachRDD(rdd => {
rdd.foreach(record => {
val i = +1
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "test")
val thePut = new Put(Bytes.toBytes(i))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
})
})
Well, you are not actually executing the Put; you are merely creating a Put request and adding data to it. What you are missing is an
hTable.put(thePut);
Adding another answer!
You can use foreachPartition to establish the connection once per partition at the executor level, which is more efficient than opening it for each row, a costly operation.
lines.foreachRDD(rdd => {
rdd.foreachPartition(iter => {
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "test")
iter.foreach(record => {
val i = +1
val thePut = new Put(Bytes.toBytes(i))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(record))
//missing part in your code
hTable.put(thePut);
})
})
})
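One small addition worth making to the partition-level version (my suggestion, not part of the answer above): flush and close the HTable at the end of each partition, so any client-side buffered puts are written and the connection is released. A sketch of just the tail of the loop:

rdd.foreachPartition(iter => {
  val hConf = new HBaseConfiguration()
  val hTable = new HTable(hConf, "test")
  iter.foreach(record => {
    // ... build the Put and call hTable.put(thePut) as above ...
  })
  hTable.flushCommits() // write any puts still buffered on the client
  hTable.close()        // release the table/connection for this partition
})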

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from Kafka messages and put them into HBase. Though the Scala script runs without any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
def blah(row: Array[String]) {
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "test")
val thePut = new Put(Bytes.toBytes(row(0)))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
hTable.put(thePut)
}
}
object TheMain extends Serializable{
def run() {
val ssc = new StreamingContext(sc, Seconds(1))
val topicmap = Map("test" -> 1)
val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap).map(_._2)
val words = lines.map(line => line.split(",")).map(line => (line(0),line(1)))
val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
ssc.start()
}
}
TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call this at the end of your blah() method -- it looks like the puts are currently being buffered but never executed, or executed at some arbitrary time.
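For completeness, a minimal sketch of Blaher with the suggested call added at the end of blah(); the close() call is an extra precaution of mine, the rest mirrors the question's code:

object Blaher {
  def blah(row: Array[String]): Unit = {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
    hTable.flushCommits() // execute the buffered Put operations now
    hTable.close()
  }
}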

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

edit 2
Indirectly solved the problem by repartitioning the RDD into 8 partitions. Hit a roadblock with Avro objects not being "java serialisable"; found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
edit 1: Removed local variable reference in map function
I'm writing a driver to run a compute-heavy job on Spark using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
override def write(t: Topic) {
synchronized {
super.write(t)
}
}
}
def makeTopic(x: ForumTopic): Topic = {
// Omitted to save space
}
val writer = new MyWriter
val q =
DBCrawler.db.withSession {
Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
}
val sz = q.size
val c = new AtomicInteger(0)
q.par.foreach {
x =>
writer.write(makeTopic(x))
val count = c.incrementAndGet()
print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
}
writer.close()
}
And my transformation looks as follows:
def sparkNLPTransformation() {
val sc = new SparkContext("local[8]", "forumAddNlp")
// io configuration
val job = new Job()
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
ParquetOutputFormat.setWriteSupportClass(job,classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)
// configure annotator
val props = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
val an = DAnnotator(props)
// annotator function
def annotatePosts(ann : DAnnotator, top : Topic) : Topic = {
val new_p = top.getPosts.map{ x=>
val at = new Annotation(x.getPostText.toString)
ann.annotator.annotate(at)
val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
val r = SpecificData.get().deepCopy[Post](x.getSchema,x)
if(t.nonEmpty) r.setTrees(t)
r
}
val new_t = SpecificData.get().deepCopy[Topic](top.getSchema,top)
new_t.setPosts(new_p)
new_t
}
// transformation
val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
val new_ds = ds.map(x=> ( null, annotatePosts(x._2) ) )
new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
classOf[Void],
classOf[Topic],
classOf[ParquetOutputFormat[Topic]],
job.getConfiguration
)
}
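For reference, a minimal sketch of the workaround described in edit 2 above: switch Spark to Kryo serialization and repartition the input RDD so the map runs on all 8 local cores. The conf keys are standard Spark settings; the Avro/Kryo registrator snippet mentioned in the edit is not reproduced here, and the names (ParquetInputFormat, Topic, job) are reused from the transformation above:

import org.apache.spark.{SparkConf, SparkContext}

// Kryo instead of the default Java serialization, so the Avro-generated classes can be shipped to executors
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("forumAddNlp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Same input as above, but with an explicit number of partitions
val ds = sc.newAPIHadoopFile("forum_dataset.parq",
  classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
  .repartition(8)
// ...then run the same map/saveAsNewAPIHadoopFile as above on `ds`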
Can you confirm that the data is indeed in multiple blocks in HDFS? The total block count on the forum_dataset.parq file