Spark Kafka stream to Cassandra in Scala

I'm trying to insert Kafka-stream JSON data into Cassandra using Scala, but I'm getting stuck. My code is:
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val records = kafkaStream.map(_._2)
val collection = records.flatMap(_.split(",")).map(s => event(s(0).toString, s(1).toString))
case class event(vehicleid: String, vehicletype: String)
collection.foreachRDD(x => println(x))
collection.saveToCassandra("traffickeyspace", "test", SomeColumns("vehicleid", "vehicletype"))
The error I'm getting is:
not enough arguments for method saveToCassandra: (implicit connector: com.datastax.spark.connector.cql.CassandraConnector, implicit rwf: com.datastax.spark.connector.writer.RowWriterFactory[event])Unit. Unspecified value parameter rwf. kafkatesting.scala /SparkRedis/src/com/spark/test line 48 Scala Problem
and the other error is:
could not find implicit value for parameter rwf: com.datastax.spark.connector.writer.RowWriterFactory[event] kafkatesting.scala /SparkRedis/src/com/spark/test line 48 Scala Problem
My JSON record from the producer is:
{"vehicleId":"3a92516d-58a7-478e-9cff-baafd98764a3","vehicleType":"Small Truck","routeId":"Route-37","longitude":"-95.30818","latitude":"33.265877","timestamp":"2018-03-28 06:21:47","speed":58.0,"fuelLevel":25.0}

You cannot declare your case class where you have. Case classes have to be defined at the top-level scope to get the TypeTag they need. See this question for more details: Scala - No TypeTag Available Exception when using case class to try to get TypeTag?
So move your case class to the top-level scope of the file you are in. That way it gets its TypeTag, which gives it its ColumnMapper, which in turn lets the compiler find the implicit RowWriterFactory.
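For concreteness, a minimal sketch of how the file could be laid out after that change. The broker list, topic name, Cassandra host, and the wrapping object name are placeholders, and the comma-split parsing from the question is kept only to keep the sketch short; a real JSON parser would be more appropriate for the record shown above.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Case class at the top level of the file, so it gets a TypeTag and the
// connector can derive its ColumnMapper / RowWriterFactory
case class event(vehicleid: String, vehicletype: String)

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // adjust to your cluster
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // adjust brokers
    val topics = Set("traffic")                                       // adjust topic name

    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Naive comma-split kept from the question; a JSON parser would be more robust
    val collection = kafkaStream
      .map(_._2)
      .map(_.split(","))
      .map(fields => event(fields(0), fields(1)))

    collection.saveToCassandra("traffickeyspace", "test", SomeColumns("vehicleid", "vehicletype"))

    ssc.start()
    ssc.awaitTermination()
  }
}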

Related

How to extend the transformer in Kafka Scala?

I am working on a Kafka streaming implementation of a word counter in Scala in which I extended the transformer:
class WordCounter extends Transformer[String, String, (String, Long)]
It is then called in the stream as follows:
val counter: KStream[String, Long] = filtered_record.transform(new WordCounter, "count")
However, I am getting the error below when running my program via sbt:
[error] required: org.apache.kafka.streams.kstream.TransformerSupplier[String,String,org.apache.kafka.streams.KeyValue[String,Long]]
I can't seem to figure out how to fix it, and could not find any appropriate Kafka example of a similar implementation.
Anyone got any idea of what I am doing wrong?
The signature of transform() is:
def transform[K1, V1](transformerSupplier: TransformerSupplier[K, V, KeyValue[K1, V1]],
stateStoreNames: String*): KStream[K1, V1]
Thus, transform() takes a TransformerSupplier as its first argument, not a Transformer.
See also the javadocs.
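A sketch of what that can look like for the code above, assuming a Kafka Streams 2.x Transformer (init/transform/close) with the counting logic left as a placeholder; filtered_record and the "count" store name are taken from the question:

import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.{KStream, Transformer, TransformerSupplier}
import org.apache.kafka.streams.processor.ProcessorContext

// The output type parameter must be KeyValue[String, Long], as the error message states
class WordCounter extends Transformer[String, String, KeyValue[String, Long]] {
  override def init(context: ProcessorContext): Unit = ()
  override def transform(key: String, value: String): KeyValue[String, Long] =
    KeyValue.pair(value, 1L) // placeholder for the real counting logic
  override def close(): Unit = ()
}

// Wrap the transformer in a TransformerSupplier so the streams runtime can
// create a fresh Transformer instance per task
val counter: KStream[String, Long] = filtered_record.transform(
  new TransformerSupplier[String, String, KeyValue[String, Long]] {
    override def get(): Transformer[String, String, KeyValue[String, Long]] = new WordCounter
  },
  "count"
)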

Scala - template method pattern in a trait

I'm implementing a template method pattern in Scala. The idea is that the method returns a Dataset[Metric].
But when I convert enrichedMetrics to a Dataset with enrichedMetrics.as[Metric], I have to use implicits in order to map the records to the specified type. This means passing a SparkSession to the MetricsProcessor, which doesn't seem like the best solution to me.
The solution I see now is to pass spark: SparkSession as a parameter to the template method and then import spark.implicits._ within it.
Is there a more proper way to implement the template method pattern in this case?
trait MetricsProcessor {
  // Template method
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
    val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
    val enrichedMetrics = enrichMetrics(metricsFromSource, metricId)
    enrichedMetrics.as[Metric] // <-- requires spark.implicits
  }

  // abstract method
  def queryMetrics(startDate: Date, endDate: Date): DataFrame

  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = {
    /* Default implementation */
  }
}
You're missing the Encoder for your type Metric, which Spark cannot find implicitly. For common types like String, Int, etc., Spark has implicit encoders.
Also, you cannot do a simple .as on a DataFrame if the columns of the source and the fields of the destination type don't line up. I'll make some assumptions here.
For a case class Metric
case class Metric( ??? )
the line in parseMetrics will change to,
Option 1 - Explicitly passing the Encoder
enrichedMetrics.map(row => Metric( ??? ))(Encoders.product[Metric])
Option 2 - Implicitly passing the Encoder
implicit val enc : Encoder[Metric] = Encoders.product[Metric]
enrichedMetrics.map(row => Metric( ??? ))
Note: as pointed out in one of the comments, if your parseMetrics method always returns a Dataset[Metric], you can add the implicit encoder to the body of the trait.
Hope this helped.
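For instance, a rough sketch of that last suggestion; the Metric fields are invented for illustration and Date is assumed to be java.sql.Date:

import java.sql.Date
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders}

// Hypothetical fields; the real Metric will differ
case class Metric(metricId: Long, value: Double)

trait MetricsProcessor {
  // One encoder in the trait body, visible to the template method below
  implicit val metricEncoder: Encoder[Metric] = Encoders.product[Metric]

  // Template method: no SparkSession parameter needed just for the conversion
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
    val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
    val enrichedMetrics = enrichMetrics(metricsFromSource, metricId)
    enrichedMetrics.as[Metric]
  }

  def queryMetrics(startDate: Date, endDate: Date): DataFrame

  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = metricsDf
}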

How is a typed Scala object losing its type?

In the following piece of code, entities is a Map[String, Seq[String]] object that I receive from some other piece of code. The goal is to map the entities object into a two column Spark DataFrame; but, before I get there, I found some very unusual results.
val data: Map[String, Seq[String]] = Map("idtag" -> Seq("things", "associated", "with", "id"))
println(data)
println(data.toSeq)
data.toSeq.foreach{println}
data.toSeq.map{case(id: String, names: Seq[String]) => names}.foreach{println}
val eSeq: Seq[(String, Seq[String])] = entities.toSeq
println(eSeq.head)
println(eSeq.head.getClass)
println(eSeq.head._1.getClass)
println(eSeq.head._2.getClass)
eSeq.map{case(id: String, names: Seq[String]) => names}.foreach{println}
The output of the above on the console is:
Map(idtag -> List(things, associated, with, id))
ArrayBuffer((idtag,List(things, associated, with, id)))
(idtag,List(things, associated, with, id))
List(things, associated, with, id)
(0CY4NZ-E,["MEC", "Marriott-MEC", "Media IQ - Kimberly Clark c/o Mindshare", "Mindshare", "WPP", "WPP Plc", "Wavemaker Global", "Wavemaker Global Ltd"])
class scala.Tuple2
class java.lang.String
class java.lang.String
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to scala.collection.Seq
at package.EntityList$$anonfun$toStorage$4.apply(EntityList.scala:31)
The data object that I hardcoded behaves as expected. The .toSeq function on the entities map produces a Seq (implemented as an ArrayBuffer) of tuples, and these tuples can be processed through mapping.
But with the entities object, you can see that when I take the first element using .head, it is effectively a Tuple2[String, String]. How can that possibly happen? How does the second element of the tuple turn into a String and cause the exception?
Further confusing me, if the last line is changed to reflect the Tuple2[String, String]:
eSeq.map{case(id: String, names: String) => names}.foreach{println}
then we get a compile error:
/path/to/repo/src/main/scala/package/EntityList.scala:31: error: pattern type is incompatible with expected type;
found : String
required: Seq[String]
eSeq.map{case(id: String, names: String) => names}.foreach{println}
I can't replicate this odd behavior with a Map[String, Seq[String]] that I create myself, as you can see in this code. Can anyone explain this behavior and why it happens?
The problem appears to be that entities.toSeq is lying about the type of the data it returns, so I would look at "some other piece of code" and check that it is doing the right thing.
Specifically, it claims to return Seq[(String, Seq[String])] and the compiler believes it. But getClass shows that the second object in the tuple is actually a java.lang.String, not a Seq[String].
If that is what is happening, the match statement will use unapply to extract the values and then fail when it tries to cast names to the stated type.
I note that the string appears to be a list of strings enclosed in [ ], so it seems possible that whatever is creating entities is failing to parse this into a Seq but claiming that it has succeeded.
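Purely as an illustration of how such a lie can come about (the values and the cast below are invented, not taken from the real upstream code): an unchecked asInstanceOf survives erasure at runtime and only blows up when the element is finally used as a Seq.

// Illustration only: upstream code that claims Seq[String] values but stores raw strings
val raw: Map[String, Any] =
  Map("0CY4NZ-E" -> """["MEC", "Marriott-MEC", "Mindshare"]""") // an unparsed JSON-ish string

// The cast compiles and, because of erasure, succeeds at runtime
val entities: Map[String, Seq[String]] = raw.asInstanceOf[Map[String, Seq[String]]]

val eSeq: Seq[(String, Seq[String])] = entities.toSeq
println(eSeq.head._2.getClass) // class java.lang.String, despite the declared Seq[String]

// Fails only here, when the value is finally used as a Seq[String]
eSeq.map { case (_, names: Seq[String]) => names }.foreach(println)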

SBT confused about Scala types

SBT is throwing the following error:
value split is not a member of (String, String)
[error] .filter(arg => arg.split(delimiter).length >= 2)
For the following code block:
implicit def argsToMap(args: Array[String]): Map[String, String] = {
  val delimiter = "="
  args
    .filter(arg => arg.split(delimiter).length >= 2)
    .map(arg => arg.split(delimiter)(0) -> arg.split(delimiter)(1))
    .toMap
}
Can anyone explain what might be going on here?
Some details:
java version "1.8.0_191"
sbt version 1.2.7
scala version 2.11.8
I've tried both on the command line and in IntelliJ. I've also tried Java 11 and Scala 2.11.12, to no avail.
I'm not able to replicate this on another machine (different OS, SBT, IntelliJ, etc., though), and I can also write a minimal failing case:
value split is not a member of (String, String)
[error] Array("a", "b").map(x => x.split("y"))
The issue is that the filter method is added to arrays via an implicit.
When you call args.filter(...), args is converted to ArrayOps via the Predef.refArrayOps implicit method.
You are defining an implicit conversion from Array[String] to Map[String, String].
This implicit has higher priority than Predef.refArrayOps and is therefore used instead.
So args is converted into a Map[String, String]. The filter method of that Map expects a function of type ((String, String)) => Boolean, i.e. a predicate over key/value pairs, as its parameter.
I believe what happened is that the implicit method was getting invoked a bit too eagerly. That is, the Tuple2 that seemingly comes out of nowhere is the result of the implicit function converting each String into a key/value pair. The implicit function was also recursively calling itself; I found this out after eventually getting a stack overflow with some other code that was manipulating a collection of Strings.
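If you want to keep the implicit conversion, one possible way to sidestep both problems, offered as a sketch rather than the definitive fix, is to select Predef's Array conversion explicitly inside the body so argsToMap is never a candidate there (this also splits each argument only once):

implicit def argsToMap(args: Array[String]): Map[String, String] = {
  val delimiter = "="
  // Apply Predef's Array -> ArrayOps conversion explicitly, so the compiler
  // never considers argsToMap itself when resolving .map/.filter on args
  Predef.refArrayOps(args)
    .map(_.split(delimiter))
    .filter(_.length >= 2)
    .map(parts => parts(0) -> parts(1))
    .toMap
}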

Issue converting scala case class to spark dataset with embedded options that wrap monads

I ran into a problem when trying to convert a Seq of case classes into a Spark Dataset today. I thought I'd share the solution here as it was tough to pin down.
I have a case class I am trying to convert to a Dataset
case class Foo(name: String, names: Option[List[String]])
val myData: Seq[Foo] = Seq(Foo("A", Some(List("T","U"))),
Foo("B", Some(List("V","W"))))
val myFooDataset = sparkSession.createDataset(myData)
This errors out and complains that there is no encoder. How can I get this to work?
The answer in this case is to convert your embedded Lists to Seq. In fact, just having a column that is a List (without being wrapped in an Option) will work, but as soon as you wrap it in an Option it needs to be a Seq instead (a quick illustration of that variant follows the fixed code below).
case class Foo(name: String, names: Option[Seq[String]])
val myData: Seq[Foo] = Seq(Foo("A", Some(Seq("T","U"))),
Foo("B", Some(Seq("V","W"))))
val myFooDataset = sparkSession.createDataset(myData)
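And to illustrate the aside above, a bare List column without the Option wrapper encodes fine as well; the case class name and values here are made up for the example, and the same sparkSession and implicits import as in the question are assumed:

import sparkSession.implicits._

// Made-up example: List is fine as long as it is not wrapped in an Option
case class FooBare(name: String, names: List[String])

val bareData: Seq[FooBare] = Seq(
  FooBare("A", List("T", "U")),
  FooBare("B", List("V", "W"))
)

val myBareDataset = sparkSession.createDataset(bareData)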