How to extend the Transformer in Kafka Scala?

I am working on a Kafka streaming implementation of a word counter in Scala in which I extended the transformer:
class WordCounter extends Transformer[String, String, (String, Long)]
It is then called in the stream as follows:
val counter: KStream[String, Long] = filtered_record.transform(new WordCounter, "count")
However, I am getting the error below when running my program via sbt:
[error] required: org.apache.kafka.streams.kstream.TransformerSupplier[String,String,org.apache.kafka.streams.KeyValue[String,Long]]
I can't seem to figure out how to fix it, and could not find any appropriate Kafka example of a similar implementation.
Anyone got any idea of what I am doing wrong?

The signature of transform() is:
def transform[K1, V1](transformerSupplier: TransformerSupplier[K, V, KeyValue[K1, V1]],
                      stateStoreNames: String*): KStream[K1, V1]
Thus, transform() takes a TransformerSupplier as its first argument, not a Transformer.
See also the Javadocs.
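For example, a minimal sketch of a call that matches that signature (assuming the plain Java Kafka Streams API, the existing filtered_record stream, and a state store named "count"; WordCounter would also need to extend Transformer[String, String, KeyValue[String, Long]] rather than return a Scala tuple):

val counter: KStream[String, Long] = filtered_record.transform(
  // Wrap the Transformer in a TransformerSupplier; get() returns a fresh instance per stream task.
  new TransformerSupplier[String, String, KeyValue[String, Long]] {
    override def get(): Transformer[String, String, KeyValue[String, Long]] = new WordCounter
  },
  "count"
)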

Zio-Kafka: Producing messages to topic

I'm trying to produce some messages to a Kafka topic using the library zio-kafka, version 0.15.0.
Clearly, my comprehension of the ZIO ecosystem is suboptimal, because I cannot produce a simple message. My program is the following:
object KafkaProducerExample extends zio.App {
  val producerSettings: ProducerSettings = ProducerSettings(List("localhost:9092"))

  val producer: ZLayer[Blocking, Throwable, Producer[Nothing, String, String]] =
    ZLayer.fromManaged(Producer.make(producerSettings, Serde.string, Serde.string))

  val effect: RIO[Nothing with Producer[Nothing, String, String], RecordMetadata] =
    Producer.produce("topic", "key", "value")

  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
    effect.provideSomeLayer(producer).exitCode
  }
}
The compiler gives me the following error:
[error] KafkaProducerExample.scala:19:28: Cannot prove that zio.blocking.Blocking with zio.Has[zio.kafka.producer.Producer.Service[Nothing,String,String]] <:< Nothing with zio.kafka.producer.Producer[Nothing,String,String].
[error] effect.provideSomeLayer(producer).exitCode
[error] ^
[error] one error found
Can anyone help me in understanding what's going on?
OK, it turned out that ZIO requires some type hints during the creation of the producer layer:
val producer: ZLayer[Blocking, Throwable, Producer[Any, String, String]] =
  ZLayer.fromManaged(Producer.make[Any, String, String](producerSettings, Serde.string, Serde.string))
When calling the make smart constructor, we have to give it the types we want to use. The first represents the environment needed to build the key and value serializers, while the last two are the types of the messages' keys and values.
In this case, we need no environment at all to build the two serializers, so we pass Any.
Finally, the Producer.produce function also requires some type hints:
val effect: RIO[Producer[Any, String, String], RecordMetadata] =
  Producer.produce[Any, String, String]("topic", "key", "value")
After making the above changes, the types align perfectly and the compiler is happy again.
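Putting the pieces together, a minimal sketch of the corrected program (assuming zio-kafka 0.15.0 and the imports shown) would look like this:

import org.apache.kafka.clients.producer.RecordMetadata
import zio._
import zio.blocking.Blocking
import zio.kafka.producer.{Producer, ProducerSettings}
import zio.kafka.serde.Serde

object KafkaProducerExample extends zio.App {
  val producerSettings: ProducerSettings = ProducerSettings(List("localhost:9092"))

  // Any: the serializers need no environment to be built.
  val producer: ZLayer[Blocking, Throwable, Producer[Any, String, String]] =
    ZLayer.fromManaged(Producer.make[Any, String, String](producerSettings, Serde.string, Serde.string))

  val effect: RIO[Producer[Any, String, String], RecordMetadata] =
    Producer.produce[Any, String, String]("topic", "key", "value")

  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] =
    effect.provideSomeLayer(producer).exitCode
}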

Scala: How to take any generic sequence as input to this method

Scala noob here. Still trying to learn the syntax.
I am trying to reduce the code I have to write to convert my test data into DataFrames. Here is what I have right now:
def makeDf[T](seq: Seq[(Int, Int)], colNames: String*): Dataset[Row] = {
  val context = session.sqlContext
  import context.implicits._
  seq.toDF(colNames: _*)
}
The problem is that the above method only takes a sequence of the shape Seq[(Int, Int)] as input. How do I make it take any sequence as input? I can change the input type to Seq[AnyRef], but then the code fails to recognize the toDF call as a valid symbol.
I am not able to figure out how to make this work. Any ideas? Thanks!
Short answer:
import scala.reflect.runtime.universe.TypeTag
def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = ...
Explanation:
When you are calling seq.toDF you are actually using an implicit defined in SQLImplicits:
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(s))
}
which in turn requires the generation of an Encoder. The problem is that encoders are defined only for certain types, specifically subtypes of Product (i.e. tuples, case classes, etc.). You also need to add the TypeTag implicit so that Scala can get around type erasure (at runtime all sequences have the type Seq regardless of the generic type parameter; the TypeTag provides that information).
As a side note, you do not need to extract the SQLContext from the session; you can simply use:
import sparkSession.implicits._
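A sketch of how the short answer's signature can be filled in, reusing the body from the question (session is the asker's SparkSession; the Point case class is made up for illustration):

import org.apache.spark.sql.DataFrame
import scala.reflect.runtime.universe.TypeTag

def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = {
  import session.implicits._
  seq.toDF(colNames: _*)
}

case class Point(x: Int, y: Int)

val df1 = makeDf(Seq((1, 2), (3, 4)), "a", "b")            // tuples are Products
val df2 = makeDf(Seq(Point(1, 2), Point(3, 4)), "x", "y")  // so are case classes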
As #AssafMendelson already explained, the real reason you cannot create a Dataset of Any is that Spark needs an Encoder to transform objects from their JVM representation to its internal representation, and Spark cannot guarantee the generation of such an Encoder for the Any type.
Assaf's answer is correct, and will work.
However, IMHO, it is too restrictive, as it will only work for Products (tuples and case classes); even if that covers most use cases, a few are still excluded.
Since what you really need is an Encoder, you may leave that responsibility to the client, which in most situations will only need to call import spark.implicits._ to get one in scope.
Thus, this is what I believe to be the most general solution.
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}

// Implicit SparkSession to make the call to further methods more transparent.
implicit val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

def makeDf[T: Encoder](seq: Seq[T], colNames: String*)
                      (implicit spark: SparkSession): DataFrame =
  spark.createDataset(seq).toDF(colNames: _*)

def makeDS[T: Encoder](seq: Seq[T])
                      (implicit spark: SparkSession): Dataset[T] =
  spark.createDataset(seq)
Note: This is basically re-inventing the already defined functions from Spark.
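A hypothetical usage sketch (the Word case class and the values are made up) showing that this version also covers types that are not Products but do have an Encoder:

case class Word(text: String, count: Long)

val wordsDf = makeDf(Seq(Word("foo", 1L), Word("bar", 2L)), "text", "count")
val lettersDs = makeDS(Seq("a", "b", "c"))  // String is not a Product, but an Encoder exists for it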

SBT confused about Scala types

SBT is throwing the following error:
value split is not a member of (String, String)
[error] .filter(arg => arg.split(delimiter).length >= 2)
For the following code block:
implicit def argsToMap(args: Array[String]): Map[String, String] = {
  val delimiter = "="
  args
    .filter(arg => arg.split(delimiter).length >= 2)
    .map(arg => arg.split(delimiter)(0) -> arg.split(delimiter)(1))
    .toMap
}
Can anyone explain what might be going on here?
Some details:
java version "1.8.0_191"
sbt version 1.2.7
scala version 2.11.8
I've tried both on the command line and with IntelliJ. I've also tried Java 11 and Scala 2.11.12, to no avail.
I'm not able to replicate this on another machine (different OS, SBT, IntelliJ, etc., though), and I can also write a minimal failing case:
value split is not a member of (String, String)
[error] Array("a", "b").map(x => x.split("y"))
The issue is that the filter method is added to arrays via an implicit conversion.
When you call args.filter(...), args is normally converted to ArrayOps via the Predef.refArrayOps implicit method.
However, you are defining an implicit conversion from Array[String] to Map[String, String].
This implicit has higher priority than Predef.refArrayOps and is therefore used instead.
So args is converted into a Map[String, String]. The filter method of that Map expects a function of type ((String, String)) => Boolean as its parameter, which is why the compiler complains that split is not a member of (String, String).
I believe what happened is that the implicit method was getting invoked a bit too eagerly. That is, the Tuple2 that seemingly comes out of nowhere is the result of the implicit function converting each String into a key/value pair: the implicit function was recursively calling itself. I found this out after eventually getting a stack overflow with some other code that was manipulating a collection of Strings.
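One way to sidestep the conflict while keeping the original semantics (a sketch, not the only possible fix) is to apply the intended Predef conversion explicitly, so the compiler can never pick argsToMap itself as the view for args:

implicit def argsToMap(args: Array[String]): Map[String, String] = {
  val delimiter = "="
  // Predef.refArrayOps is invoked explicitly, so this method is never chosen
  // as the implicit view for `args` and cannot recurse into itself.
  Predef.refArrayOps(args)
    .map(arg => arg.split(delimiter))
    .filter(parts => parts.length >= 2)
    .map(parts => parts(0) -> parts(1))
    .toMap
}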

spark kafkastream to cassandra in scala

I'm trying to insert Kafka stream JSON data into Cassandra using Scala, but unfortunately I'm getting stuck. My code is:
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val records = kafkaStream.map(_._2)
val collection = records.flatMap(_.split(",")).map(s => event(s(0).toString, s(1).toString))
case class event(vehicleid: String, vehicletype: String)
collection.foreachRDD(x => println(x))
collection.saveToCassandra("traffickeyspace", "test", SomeColumns("vehicleid", "vehicletype"))
The error I'm getting is:
not enough arguments for method saveToCassandra: (implicit connector: com.datastax.spark.connector.cql.CassandraConnector, implicit rwf: com.datastax.spark.connector.writer.RowWriterFactory[event])Unit. Unspecified value parameter rwf. kafkatesting.scala /SparkRedis/src/com/spark/test line 48 Scala Problem
and the other error is:
could not find implicit value for parameter rwf: com.datastax.spark.connector.writer.RowWriterFactory[event] kafkatesting.scala /SparkRedis/src/com/spark/test line 48 Scala Problem
My JSON record from the producer is:
{"vehicleId":"3a92516d-58a7-478e-9cff-baafd98764a3","vehicleType":"Small Truck","routeId":"Route-37","longitude":"-95.30818","latitude":"33.265877","timestamp":"2018-03-28 06:21:47","speed":58.0,"fuelLevel":25.0}
You actually cannot declare your case class where you have. Case classes have to be defined at the top-level scope to get the TypeTag they need. Look here for more details: Scala - No TypeTag Available Exception when using case class to try to get TypeTag?
So move your case class to the top-level scope of the file you are in. This way it gets its TypeTag, allowing it to get its ColumnMapper, which in turn allows it to pick up its implicit RowWriterFactory.
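A structural sketch of that layout (the enclosing object name is hypothetical and the stream setup is elided):

// Top level of the file: here the case class gets its TypeTag, so the connector
// can derive the implicit RowWriterFactory[event].
case class event(vehicleid: String, vehicletype: String)

object KafkaTesting {  // hypothetical enclosing object
  def run(): Unit = {
    // ... build ssc, kafkaParams, topics, kafkaStream and collection as in the question ...
    // collection.saveToCassandra("traffickeyspace", "test", SomeColumns("vehicleid", "vehicletype"))
  }
}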

Persistence with NamedObjects in Spark Job Server

I'm using the latest SJS version (master), and the application extends SparkHiveJob. In the runJob implementation, I have the following:
val eDF1 = hive.applySchema(rowRDD1, schema)
I would like to persist eDF1 and tried the following
val rdd_topersist = namedObjects.getOrElseCreate("cleanedDF1", {
  NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY)
})
where the following compile errors occur
could not find implicit value for parameter persister: spark.jobserver.NamedObjectPersister[spark.jobserver.NamedDataFrame]
not enough arguments for method getOrElseCreate: (implicit timeout:scala.concurrent.duration.FiniteDuration, implicit persister:spark.jobserver.NamedObjectPersister[spark.jobserver.NamedDataFrame])spark.jobserver.NamedDataFrame. Unspecified value parameter persister.
Obviously this is wrong, but I can't figure out what is wrong. I'm fairly new to Scala.
Can someone help me understand this syntax from NamedObjectSupport?
def getOrElseCreate[O <: NamedObject](name: String, objGen: => O)
                                     (implicit timeout: FiniteDuration = defaultTimeout,
                                      persister: NamedObjectPersister[O]): O
I think you should define an implicit persister. Looking at the test code, I see something like this:
https://github.com/spark-jobserver/spark-jobserver/blob/ea34a8f3e3c90af27aa87a165934d5eb4ea94dee/job-server-extras/test/spark.jobserver/NamedObjectsSpec.scala#L20
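Based on that spec, a sketch (assuming the DataFramePersister class from job-server-extras, as used in the linked test) would bring the persister into implicit scope before the call:

implicit def dataFramePersister: NamedObjectPersister[NamedDataFrame] = new DataFramePersister

val rdd_topersist = namedObjects.getOrElseCreate("cleanedDF1", {
  // Same NamedDataFrame as above; the implicit persister now satisfies getOrElseCreate.
  NamedDataFrame(eDF1, true, StorageLevel.MEMORY_ONLY)
})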