Connecting list of case classes to kafka producer? - scala

I have the below case class:
case class Alpakka(id: Int, name: String, animal_type: String)
I am trying to connect a list of these case classes to a producer in kafka by using the following code:
def connectEntriesToProducer(seq: Seq[Alpakka]) = {
  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")
  seq.map(alpakka => new ProducerRecord[String, String]("alpakkas", alpakka.asJson.noSpaces))
    .runWith(Producer.plainSink(producerSettings))
}
I am using circe to convert the case class to json. However I keep getting a compiler error saying this:
Error:(87, 34) type mismatch;
found : akka.stream.scaladsl.Sink[org.apache.kafka.clients.producer.ProducerRecord[String,String],scala.concurrent.Future[akka.Done]]
required: org.apache.kafka.clients.producer.ProducerRecord[String,String] => ?
.runWith(Producer.plainSink(producerSettings))
I'm not sure what's going on!

You are trying to build a Graph from a Seq instead of a Source.
Your method connectEntriesToProducer should look like
def connectEntriesToProducer(seq: Source[Alpakka, NotUsed]) = {
Note, Source instead of Seq.
Alternatively, you can build a Source from a Seq, but you'll have to use an immutable.Seq, since Source.apply only takes an immutable iterable.
def connectEntriesToProducer(seq: scala.collection.immutable.Seq[Alpakka]) = {
  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")
  Source(seq)
    .map(alpakka => new ProducerRecord[String, String]("alpakkas", alpakka.asJson.noSpaces))
    .runWith(Producer.plainSink(producerSettings))
}
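For completeness, a minimal sketch of calling the fixed method. The circe imports are an assumption about how .asJson is brought into scope (generic derivation plus the syntax import) and belong in the file where connectEntriesToProducer is defined; the sample values are just illustrative.

import io.circe.generic.auto._
import io.circe.syntax._

// A List is already a scala.collection.immutable.Seq, so it can be passed directly.
connectEntriesToProducer(List(
  Alpakka(1, "Bossy", "cow"),
  Alpakka(2, "Rex", "dog")
))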

Related

Scala spark kafka code - functional approach

I have the following code in Scala. I am using Spark SQL to pull data from Hadoop, perform some group by on the result, serialize it and then write that message to Kafka.
I've written the code, but I want to write it in a functional way. Should I create a new class with a function 'getCategories' to get the categories from Hadoop? I am not sure how to approach this.
Here is the code:
class ExtractProcessor {
  def process(): Unit = {
    implicit val formats = DefaultFormats
    val spark = SparkSession.builder().appName("test app").getOrCreate()
    try {
      val df = spark.sql("SELECT DISTINCT SUBCAT_CODE, SUBCAT_NAME, CAT_CODE, CAT_NAME " +
        "FROM CATEGORY_HIERARCHY " +
        "ORDER BY CAT_CODE, SUBCAT_CODE ")
      val result = df.collect().groupBy(row => (row(2), row(3)))
      val categories = result.map(cat =>
        category(cat._1._1.toString(), cat._1._2.toString(),
          cat._2.map(subcat =>
            subcategory(subcat(0).toString(), subcat(1).toString())).toList))
      val jsonMessage = write(categories)
      val kafkaKey = java.security.MessageDigest.getInstance("SHA-1").digest(jsonMessage.getBytes("UTF-8")).map("%02x".format(_)).mkString
      val key = write(kafkaKey)
      Logger.log.info(s"Json Message: ${jsonMessage}")
      Logger.log.info(s"Kafka Key: ${key}")
      KafkaUtil.apply.send(key, jsonMessage, "testTopic")
    }
And here is the Kafka code:
class KafkaUtil {
  def send(key: String, message: String, topicName: String): Unit = {
    val properties = new Properties()
    properties.put("bootstrap.servers", "localhost:9092")
    properties.put("client.id", "test publisher")
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](properties)
    try {
      val record = new ProducerRecord[String, String](topicName, key, message)
      producer.send(record)
    }
    finally {
      producer.close()
      Logger.log.info("Kafka producer closed...")
    }
  }
}

object KafkaUtil {
  def apply: KafkaUtil = {
    new KafkaUtil
  }
}
Also, for writing unit tests, what should I be testing in the functional approach? In OOP we unit test the business logic, but in my Scala code there is hardly any business logic.
Any help is appreciated.
Thanks in advance,
Suyog
Your code consists of:
1) Loading the data into spark df
2) Crunching the data
3) Creating a json message
4) Sending json message to kafka
Unit tests are good for testing pure functions.
You can extract step 2) into a method with a signature like
def getCategories(df: DataFrame): Seq[Category] and cover it with a test (a sketch follows below).
In the test, the data frame can be generated from a plain hard-coded in-memory sequence.
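A hedged sketch of that extraction, assuming Category and Subcategory case classes that mirror the category/subcategory classes used in the question:

import org.apache.spark.sql.DataFrame

case class Subcategory(code: String, name: String)
case class Category(code: String, name: String, subcategories: List[Subcategory])

// Groups the collected rows by (CAT_CODE, CAT_NAME) and nests the subcategories,
// mirroring the logic inside process() but with no Spark session or Kafka involved.
def getCategories(df: DataFrame): Seq[Category] =
  df.collect()
    .groupBy(row => (row(2).toString, row(3).toString))
    .map { case ((catCode, catName), rows) =>
      Category(catCode, catName,
        rows.map(r => Subcategory(r(0).toString, r(1).toString)).toList)
    }
    .toSeq

// In a test, the input can be built from a hard-coded sequence, e.g.
// val df = spark.createDataFrame(Seq(("S1", "Sub 1", "C1", "Cat 1")))
//   .toDF("SUBCAT_CODE", "SUBCAT_NAME", "CAT_CODE", "CAT_NAME")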
Step 3) can also be covered by a unit test if you feel it is error-prone.
Steps 1) and 4) are to be covered by an end-to-end test
By the way, val result = df.collect().groupBy(row => (row(2), row(3))) is inefficient: it pulls every row to the driver before grouping. It is better to push the grouping into Spark (for example by grouping on the CAT_CODE and CAT_NAME columns before collecting), so that only the grouped result is brought to the driver.
Also, there is no need to initialize a new KafkaProducer for every single message; see the sketch below.
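A minimal sketch of a reusable producer (KafkaSink is an illustrative name, not part of the original code): the producer is created once, shared by all sends, and closed only on shutdown.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaSink {
  private lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }

  // Fire-and-forget send; the same producer instance handles every message.
  def send(key: String, message: String, topic: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, key, message))

  // Call once on application shutdown.
  def close(): Unit = producer.close()
}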

Scalacache with redis support

I am trying to integrate Redis with scalacache. Keys are usually strings, but values can be objects, Set[String], etc. The cache is initialized like this:
val cache: RedisCache = RedisCache(config.host, config.port)
private implicit val scalaCache: ScalaCache[Array[Byte]] = ScalaCache(cacheService.cache)
But while calling put, I am getting the error "Could not find any Codecs for type Set[String] and Repr". It looks like I need to provide a codec for my cache input, as suggested here, so I added:
class A extends Codec[Set[String], Array[Byte]] with GZippingBinaryCodec[Set[String]]
Even with my class A in place, it throws the same error. What am I missing?
As you mentioned in the link, you can either serialize values in a binary format:
import scalacache.serialization.binary._
or as JSON using circe:
import scalacache.serialization.circe._
import io.circe.generic.auto._
Looks like it's solved in the next release by binary and circe serialization. I am on version 10 and solved it with the following:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

implicit object SetBinaryCodec extends Codec[Any, Array[Byte]] {
  override def serialize(value: Any): Array[Byte] = {
    val stream: ByteArrayOutputStream = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(stream)
    oos.writeObject(value)
    oos.close()
    stream.toByteArray
  }

  override def deserialize(data: Array[Byte]): Any = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(data))
    val value = ois.readObject
    ois.close()
    value
  }
}
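A quick round-trip check of the codec (plain JDK serialization, so values come back typed as Any):

val bytes = SetBinaryCodec.serialize(Set("a", "b"))
val restored = SetBinaryCodec.deserialize(bytes) // Set(a, b), typed as Any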
Perks of being up to date. Will upgrade the version, posted it just in case somebody needs it.

Type mismatch with existential types inside Option

I'm trying to pass a bunch of classes to the SparkConf.registerKryoClasses method, which has the following signature:
def registerKryoClasses(classes: Array[Class[_]]): SparkConf
Since I may or may not have classes that need to be registered, I'm wrapping it in an Option and tried this (in a simplified version):
class SomeClass(val app: String, val classes: Option[Array[Class[_]]]) {
  val conf = classes match {
    case Some(cs) ⇒ new SparkConf()
      .setAppName(app)
      .registerKryoClasses(cs)
    case None ⇒ new SparkConf()
      .setAppName(app)
  }
  // more stuff
}
IntelliJ tells me there is a type mismatch on cs and then lists the expected and actual types, which look identical.
What am I missing here?
The following works fine for me,
def someClass(app: String, classes: Option[Array[Class[Any]]]): SparkConf = {
  val conf = classes match {
    case Some(cs) ⇒ new SparkConf()
      .setAppName(app)
      .registerKryoClasses(cs)
    case None ⇒ new SparkConf()
      .setAppName(app)
    // more stuff
  }
  conf
}
val conf = someClass("lol", Some(Array(classOf[String], classOf[Int])))
// org.apache.spark.SparkConf = org.apache.spark.SparkConf@685d92cf
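If you want to keep the existential type Array[Class[_]] from the question, a hedged alternative is to fold over the Option so registerKryoClasses is only called when classes are present (makeConf is just an illustrative name):

// registerKryoClasses returns the same SparkConf, so fold composes cleanly.
def makeConf(app: String, classes: Option[Array[Class[_]]]): SparkConf = {
  val base = new SparkConf().setAppName(app)
  classes.fold(base)(cs => base.registerKryoClasses(cs))
}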

Mapping column types Slick 3.1.1

I am new to Slick and am having a really hard time getting java.sql.Date/Time/Timestamp mapped to Joda-Time types.
trait ColumnTypeMappings {
  val profile: JdbcProfile
  import profile.api._

  val localTimeFormatter = DateTimeFormat.forPattern("HH:mm:ss")
  val javaTimeFormatter = new SimpleDateFormat("HH:mm:ss")

  implicit val myDateColumnType = MappedColumnType.base[LocalDate, Date](
    ld => new java.sql.Date(ld.toDateTimeAtStartOfDay(DateTimeZone.UTC).getMillis),
    d => new LocalDateTime(d.getTime).toLocalDate
  )

  implicit val myTimeColumnType = MappedColumnType.base[LocalTime, Time](
    lt => new java.sql.Time(javaTimeFormatter.parse(lt.toString(localTimeFormatter)).getTime),
    t => new LocalTime(t.getTime)
  )

  implicit val myTimestampColumnType = MappedColumnType.base[DateTime, Timestamp](
    dt => new java.sql.Timestamp(dt.getMillis),
    ts => new DateTime(ts.getTime, DateTimeZone.UTC)
  )
}
In the auto-generated Tables.scala I include the mapping like this:
trait Tables extends ColumnTypeMappings {
  val profile: slick.driver.JdbcDriver
  import profile.api._
  import scala.language.implicitConversions
  // + rest of the auto generated code by slick codegen
}
And to wrap it all up I use it like this:
object TestTables extends Tables {
  val profile = slick.driver.MySQLDriver
}

import TestTables._
import profile.api._

val db = Database.forURL("url", "user", "password", driver = "com.mysql.jdbc.Driver")
val q = Company.filter(_.companyid === 1).map(_.name)
val action = q.result
val future = db.run(action)
val result = Await.result(future, Duration.Inf)
I get a NullPointerException on implicit val myDateColumnType... when running this. I've verified that this last block of code works if I remove the mapping.
Try changing implicit val to implicit def in your definitions of the MappedColumnTypes. The reason why is related to the answer given by Maksym Chernenko to this question. Generally, the JdbcProfile driver (that defines api.MappedColumnType) has not been injected yet, and that causes the NPE. You can either make your "mapper" vals lazy, or change them from val to def (as shown below):
implicit def myDateColumnType = MappedColumnType.base[LocalDate, Date](
  ld => new java.sql.Date(ld.toDateTimeAtStartOfDay(DateTimeZone.UTC).getMillis),
  d => new LocalDateTime(d.getTime).toLocalDate
)

implicit def myTimeColumnType = MappedColumnType.base[LocalTime, Time](
  lt => new java.sql.Time(javaTimeFormatter.parse(lt.toString(localTimeFormatter)).getTime),
  t => new LocalTime(t.getTime)
)

implicit def myTimestampColumnType = MappedColumnType.base[DateTime, Timestamp](
  dt => new java.sql.Timestamp(dt.getMillis),
  ts => new DateTime(ts.getTime, DateTimeZone.UTC)
)
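If you prefer to keep vals, the same fix can be expressed with lazy vals, which defer evaluation until the profile has actually been mixed in; a minimal sketch for the first mapping:

// Evaluated on first use, after profile.api._ is actually available.
implicit lazy val myDateColumnType = MappedColumnType.base[LocalDate, Date](
  ld => new java.sql.Date(ld.toDateTimeAtStartOfDay(DateTimeZone.UTC).getMillis),
  d => new LocalDateTime(d.getTime).toLocalDate
)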
So I think the issue may be that you are extending ColumnTypeMappings in your Tables.scala. The documentation doesn't make it clear, but I think the auto-generated code relating to the database should not be touched, as it is used by Slick to map the rows in the DB; instead, extend TestTables with ColumnTypeMappings to do the implicit conversion when you get the result back from the database.
I haven't particularly delved into Slick 3.x yet, so I may be wrong, but I think that makes sense.
Edit: No, I was wrong :(. Apologies.

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
  implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
  implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}

final class HBaseSC(sc: SparkContext) extends Serializable {
  def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
    data map { case (cf, columns) =>
      val content = columns map { column =>
        val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
        column -> interpret(CellUtil.cloneValue(cell))
      } toMap

      cf -> content
    }

  def makeConf(table: String) = {
    val conf = HBaseConfiguration.create()
    conf.setBoolean("hbase.cluster.distributed", true)
    conf.setInt("hbase.client.scanner.caching", 10000)
    conf.set(TableInputFormat.INPUT_TABLE, table)
    conf
  }

  def hbase[A](table: String, data: Map[String, List[String]])
    (interpret: Array[Byte] => A) =
    sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
        Bytes.toString(key.get) -> extract(data, row, interpret)
      }
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the row key and the second is a map whose keys are column families and whose values are maps from column names to cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicits; even using a plain function that takes sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
  implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}

final class TsvRDD(val sc: SparkContext) extends Serializable {
  def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
    val contents = line.split(separator).toList
    (fields, contents).zipped.toMap
  }
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(@transient val sc: SparkContext) extends Serializable {
  ...
}
and make sure sc is not used within extract function, since it won't be available on workers.
If it's necessary to access the Spark context from within a distributed computation, the rdd.context method can be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
  case (k, v) =>
    val ctx = rdd.context
    ....
}