JSON string gets truncated in apache flink - scala

I am using Kafka as my source in Apache Flink. My Kafka producer sends a large JSON string, but it gets truncated to 4095 characters and my JSON parsing fails. I have no control over the size of the JSON string. Can you please explain what is happening and help me tackle this issue?
Code:
import java.sql.Timestamp
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import models._
import org.apache.flink.types.Row
import net.liftweb.json._
case class ParsedPage(data: String, domain: String, url: String, text: String)

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "test")
    val kafkaConsumer = new FlinkKafkaConsumer010[String]("first_topic", new SimpleStringSchema(), properties)
    val stream: DataStream[String] = env.addSource[String](kafkaConsumer)
    val result: DataStream[Row] = stream.filter(_.nonEmpty).flatMap { p =>
      try {
        implicit val formats = DefaultFormats
        Seq(parse(p).extract[ParsedPage])
      } catch {
        case e: Exception =>
          println(s"Parsing failed for message in txn: $p")
          Seq()
      }
    }.map { p =>
      // placeholder values mapped into the output Row
      val bookJson = ""
      val cityId = ""
      Row.of(bookJson, cityId)
    }
    stream.print()
    env.execute()
  }
}
Flink Version : 1.4.2
Scala Version: 2.11.8
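
One way to narrow down where the truncation happens (in the producer, in the topic, or inside the Flink job) is to log the raw record length before any parsing. A minimal diagnostic sketch that plugs into the stream defined above, using only the Flink DataStream API already imported:

// Diagnostic only: print the raw length of each record as it arrives from Kafka,
// before the JSON is parsed, to see whether the 4095-character cut happens upstream.
stream
  .map(msg => s"raw message length: ${msg.length}")
  .print()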

Related

Unable to solve the error: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

I am trying to read data from a table through SparkSession and publish it to a Kafka topic, using the piece of code below:
import java.io.{ByteArrayOutputStream, StringWriter}
import java.util.Properties
import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.avro.io._
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.producer._
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.kafka.common.serialization.ByteArraySerializer
import org.apache.spark.sql.SparkSession

object Producer extends Serializable {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

    val lines = Source.fromFile("file").mkString
    val schema = new Schema.Parser().parse(lines)

    val spark = new SparkSession.Builder().enableHiveSupport().getOrCreate()
    import spark.implicits._
    val df = spark.sql("select * from table")

    df.rdd.map { value =>
      val prod = new KafkaProducer[String, Array[Byte]](props)
      val records = new GenericData.Record(schema)
      records.put("col1", value.getString(1))
      records.put("col2", value.getString(2))
      records.put("col3", value.getString(3))
      records.put("col4", value.getString(4))
      val writer = new SpecificDatumWriter[GenericRecord](schema)
      val out = new ByteArrayOutputStream()
      val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
      writer.write(records, encoder)
      encoder.flush()
      out.close()
      val serializedBytes: Array[Byte] = out.toByteArray()
      // key taken from the first column (col1 was undefined in the original snippet)
      val record = new ProducerRecord("topic", value.getString(1), serializedBytes)
      val data = prod.send(record)
      prod.flush()
      prod.close()
    }
    spark.close()
  }
}
The following error is thrown when I execute it:
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
Serialization stack:
- object not serializable (class: org.apache.avro.Schema$RecordSchema, value:
{"type":"record","name":"data","namespace":"com.data.record","fields":[{"name":"col1","type":"string"},{"name":"col2","type":"string"},{"name":"col3","type":"string"},{"name":"col4","type":"string"}]})
- field (class: scala.runtime.ObjectRef, name: elem, type: class java.lang.Object)
- object (class scala.runtime.ObjectRef,
{"type":"record","name":"data","namespace":"com.data.record","fields":[{"name":"col1","type":"string"},{"name":"col2","type":"string"},{"name":"col3","type":"string"},{"name":"col4","type":"string"}]})
- field (class: com.kafka.driver.KafkaProducer.Producer$$anonfun$main$1, name: schema$1, type: class scala.runtime.ObjectRef)
However, it runs fine when I pull the dataset to the driver using df.rdd.collect.foreach. Instead, I need to publish the messages from the executors, hence rdd.map. I am not sure what exactly I am missing here that is causing this error. Any help towards resolving this would be highly appreciated, thanks!
I figured out that the Schema and KafkaProducer objects need to be created on the executors rather than serialized from the driver. To do that, I modified the code above as follows:
import java.io.{ByteArrayOutputStream, StringWriter}
import java.util.Properties
import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.avro.io._
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.producer._
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.kafka.common.serialization.ByteArraySerializer
import org.apache.spark.sql.SparkSession

object Producer extends Serializable {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

    val spark = new SparkSession.Builder().enableHiveSupport().getOrCreate()
    import spark.implicits._
    val df = spark.sql("select * from table")

    df.foreachPartition { rows =>
      // Producer and schema are instantiated per partition, i.e. on the executors
      val prod = new KafkaProducer[String, Array[Byte]](props)
      val lines = Source.fromFile("file").mkString
      val schema = new Schema.Parser().parse(lines)
      rows.foreach { value =>
        val records = new GenericData.Record(schema)
        records.put("col1", value.getString(1))
        records.put("col2", value.getString(2))
        records.put("col3", value.getString(3))
        records.put("col4", value.getString(4))
        val writer = new SpecificDatumWriter[GenericRecord](schema)
        val out = new ByteArrayOutputStream()
        val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
        writer.write(records, encoder)
        encoder.flush()
        out.close()
        val serializedBytes: Array[Byte] = out.toByteArray()
        // key taken from the first column (col1 was undefined in the original snippet)
        val record = new ProducerRecord("topic", value.getString(1), serializedBytes)
        val data = prod.send(record)
      }
      prod.flush()
      prod.close()
    }
    spark.close()
  }
}
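
A note on the design choice: creating the KafkaProducer and parsing the Schema inside foreachPartition means they are instantiated per partition on the executors instead of being serialized from the driver, which is exactly what the NotSerializableException complained about. An alternative sketch (my own assumption, not from the original post) is to keep only the schema JSON in the closure as a String, which is serializable, and parse it on the executors; props is assumed to be the same Properties object as above:

import java.io.ByteArrayOutputStream
import java.util.Properties
import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{DataFrame, Row}

object AvroPublisher {
  // Sketch: only the schema JSON string is captured by the closure; it is parsed
  // again on each executor, so no Schema instance ever needs to be serialized.
  def publish(df: DataFrame, props: Properties, schemaPath: String): Unit = {
    val schemaJson: String = Source.fromFile(schemaPath).mkString

    df.foreachPartition { rows: Iterator[Row] =>
      val schema = new Schema.Parser().parse(schemaJson) // parsed on the executor
      val producer = new KafkaProducer[String, Array[Byte]](props)
      val writer = new SpecificDatumWriter[GenericRecord](schema)

      rows.foreach { row =>
        val record = new GenericData.Record(schema)
        record.put("col1", row.getString(1))
        record.put("col2", row.getString(2))
        record.put("col3", row.getString(3))
        record.put("col4", row.getString(4))

        val out = new ByteArrayOutputStream()
        val encoder = EncoderFactory.get().binaryEncoder(out, null)
        writer.write(record, encoder)
        encoder.flush()
        producer.send(new ProducerRecord("topic", row.getString(1), out.toByteArray))
      }
      producer.flush()
      producer.close()
    }
  }
}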

Testing a utility function by writing a unit test in apache spark scala

I have a utility function written in Scala to read Parquet files from an S3 bucket. Could someone help me write unit test cases for it?
Below is the function that needs to be tested:
def readParquetFile(spark: SparkSession,
                    locationPath: String): DataFrame = {
  spark.read
    .parquet(locationPath)
}
So far I have created a SparkSession whose master is local:
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {
  lazy val spark: SparkSession = {
    SparkSession.builder().master("local").appName("Test App").getOrCreate()
  }
}
I am stuck on testing the function; here is the code where I am stuck. The question is: should I create a real Parquet file and load it to see whether the DataFrame is created, or is there a mocking framework to test this?
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.scalatest.FunSpec

class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {

  import spark.implicits._

  it("reads a parquet file and creates a dataframe") {

  }
}
Edit:
Based on the inputs from the comments, I came up with the code below, but I still do not understand how it can be leveraged.
I am using https://github.com/findify/s3mock
// ScalaTest / spark-fast-tests imports as in the previous snippet, plus the
// s3mock and AWS SDK classes used below:
import com.amazonaws.auth.{AWSStaticCredentialsProvider, AnonymousAWSCredentials}
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import io.findify.s3mock.S3Mock

class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {

  import spark.implicits._

  it("reads a parquet file and creates a dataframe") {
    val api = S3Mock(port = 8001, dir = "/tmp/s3")
    api.start

    val endpoint = new EndpointConfiguration("http://localhost:8001", "us-west-2")
    val client = AmazonS3ClientBuilder
      .standard
      .withPathStyleAccessEnabled(true)
      .withEndpointConfiguration(endpoint)
      .withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials()))
      .build

    /** Use it as usual. */
    client.createBucket("foo")
    client.putObject("foo", "bar", "baz")
    val url = client.getUrl("foo", "bar")
    println(url.getFile())

    val df = ReadAndWrite.readParquetFile(spark, url.getPath())
    df.printSchema()
  }
}
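
For reference, one way the s3mock endpoint could be exercised end to end is to point Spark's s3a filesystem at it. This is only a sketch under the assumption that hadoop-aws and a matching AWS SDK are on the test classpath; the bucket name and Parquet path are placeholders:

// Hypothetical: wire Spark's s3a filesystem to the local s3mock endpoint.
val sparkForS3 = org.apache.spark.sql.SparkSession.builder()
  .master("local")
  .appName("s3mock test")
  .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:8001")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
  .config("spark.hadoop.fs.s3a.access.key", "dummy")
  .config("spark.hadoop.fs.s3a.secret.key", "dummy")
  .getOrCreate()

// After writing a real Parquet file into the mocked bucket, the function under test
// can be pointed at it with an s3a:// path (placeholder path shown here).
val df = ReadAndWrite.readParquetFile(sparkForS3, "s3a://foo/some-parquet-dir")
df.printSchema()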
I figured it out and kept it simple. I was able to complete some basic test cases.
Here is my solution; I hope it helps someone.
import org.apache.spark.sql
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import loaders.ReadAndWrite

class ReadAndWriteTestSpec extends FunSuite with BeforeAndAfterEach {

  private val master = "local"
  private val appName = "ReadAndWrite-Test"
  var spark: SparkSession = _

  override def beforeEach(): Unit = {
    spark = new sql.SparkSession.Builder().appName(appName).master(master).getOrCreate()
  }

  test("creating data frame from parquet file") {
    val sparkSession = spark
    import sparkSession.implicits._

    val peopleDF = spark.read.json("src/test/resources/people.json")
    peopleDF.write.mode(SaveMode.Overwrite).parquet("src/test/resources/people.parquet")

    val df = ReadAndWrite.readParquetFile(sparkSession, "src/test/resources/people.parquet")
    df.printSchema()
  }

  test("creating data frame from text file") {
    val sparkSession = spark
    import sparkSession.implicits._

    val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession, "src/test/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()
    peopleDF.printSchema()
  }

  test("counts should match with number of records in a text file") {
    val sparkSession = spark
    import sparkSession.implicits._

    val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession, "src/test/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()
    peopleDF.printSchema()
    assert(peopleDF.count() == 3)
  }

  test("data should match with sample records in a text file") {
    val sparkSession = spark
    import sparkSession.implicits._

    val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession, "src/test/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()
    peopleDF.printSchema()
    assert(peopleDF.take(1)(0)(0).equals("Michael"))
  }

  test("Write a data frame as csv file") {
    val sparkSession = spark
    import sparkSession.implicits._

    val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession, "src/test/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    // header argument should be boolean to the user to avoid confusions
    ReadAndWrite.writeDataframeAsCSV(peopleDF, "src/test/resources/out.csv", java.time.Instant.now().toString, ",", "true")
  }

  override def afterEach(): Unit = {
    spark.stop()
  }
}
case class Person(name: String, age: Int)
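
The tests above call ReadAndWrite.readTextfileToDataSet and ReadAndWrite.writeDataframeAsCSV, which are not shown in the question. A minimal sketch of what such a loaders.ReadAndWrite object could look like, inferred from the call sites (the bodies below are my assumptions, not the asker's actual implementation):

package loaders

import org.apache.spark.sql.{DataFrame, Dataset, SaveMode, SparkSession}

object ReadAndWrite {

  def readParquetFile(spark: SparkSession, locationPath: String): DataFrame =
    spark.read.parquet(locationPath)

  // Assumed helper: read a plain text file into a Dataset[String].
  def readTextfileToDataSet(spark: SparkSession, locationPath: String): Dataset[String] =
    spark.read.textFile(locationPath)

  // Assumed helper: write the DataFrame as CSV under outputPath/timestamp,
  // passing the delimiter and header flag straight through to the writer.
  def writeDataframeAsCSV(df: DataFrame,
                          outputPath: String,
                          timestamp: String,
                          delimiter: String,
                          header: String): Unit =
    df.write
      .mode(SaveMode.Overwrite)
      .option("sep", delimiter)
      .option("header", header)
      .csv(s"$outputPath/$timestamp")
}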

Scala Package Issue With ZKStringSerializer

I am trying to use the class ZKStringSerializer, which I get with
import kafka.utils.ZKStringSerializer
According to the entirety of the internet, and even my own code before I restarted my computer, this should allow my code to work. However, I now get an incredibly confusing compile error:
object ZKStringSerializer in package utils cannot be accessed in package kafka.utils
This is confusing because this file is not supposed to be in any package, and I don't specify a package anywhere. This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.ZkConnection
import java.util.Properties
import org.apache.kafka.clients.admin
import kafka.admin.{AdminUtils, RackAwareMode}
import kafka.utils.ZKStringSerializer
import kafka.utils.ZkUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SpeedTester {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
    import spark.implicits._

    val zookeeperConnect = "localhost:2181"
    val sessionTimeoutMs = 10000
    val connectionTimeoutMs = 10000
    val zkClient = new ZkClient(zookeeperConnect, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer)

    val topicName = "testTopic"
    val numPartitions = 8
    val replicationFactor = 1
    val topicConfig = new Properties
    val isSecureKafkaCluster = false
    val zkUtils = new ZkUtils(zkClient, new ZkConnection(zookeeperConnect), isSecureKafkaCluster)
    AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig)

    // Create producer for topic testTopic and actually push values to the topic
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9592")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val TOPIC = "testTopic"

    for (i <- 1 to 50) {
      val record = new ProducerRecord(TOPIC, "key", s"hello $i")
      producer.send(record)
    }
    val record = new ProducerRecord(TOPIC, "key", "the end" + new java.util.Date)
    producer.send(record)
    producer.flush()
    producer.close()
  }
}
I know this is too late, but for others who run into the same issue:
In the latest versions of Kafka, kafka.utils has been deprecated, so please use the Kafka AdminClient APIs instead.
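For example, topic creation can be done with org.apache.kafka.clients.admin.AdminClient; a minimal sketch, assuming a broker on localhost:9092 and the same topic settings as in the question:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopicWithAdminClient {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val admin = AdminClient.create(props)
    try {
      // topic name, number of partitions, replication factor
      val topic = new NewTopic("testTopic", 8, 1.toShort)
      admin.createTopics(Collections.singleton(topic)).all().get()
    } finally {
      admin.close()
    }
  }
}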

Integration test Flink and Kafka with scalatest-embedded-kafka

I would like to run an integration test with Flink and Kafka. The process reads from Kafka, does some manipulation with Flink, and puts the resulting datastream back into Kafka.
I would like to test the process from beginning to end. For now I use scalatest-embedded-kafka.
I put an example here; I tried to keep it as simple as possible:
import java.util.Properties

import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}

import scala.collection.mutable.ListBuffer

object SimpleFlinkKafkaTest {

  class CollectSink extends SinkFunction[String] {
    override def invoke(string: String): Unit = {
      synchronized {
        CollectSink.values += string
      }
    }
  }

  object CollectSink {
    val values: ListBuffer[String] = ListBuffer.empty[String]
  }

  val kafkaPort = 9092
  val zooKeeperPort = 2181

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:" + kafkaPort.toString)
  props.put("schema.registry.url", "localhost:" + zooKeeperPort.toString)

  val inputString = "mystring"
  val expectedString = "MYSTRING"
}

class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {

  "runs with embedded kafka" should {

    "work" in {
      implicit val config = EmbeddedKafkaConfig(
        kafkaPort = SimpleFlinkKafkaTest.kafkaPort,
        zooKeeperPort = SimpleFlinkKafkaTest.zooKeeperPort
      )

      withRunningKafka {
        publishStringMessageToKafka("input-topic", SimpleFlinkKafkaTest.inputString)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)

        val kafkaConsumer = new FlinkKafkaConsumer011(
          "input-topic",
          new SimpleStringSchema,
          SimpleFlinkKafkaTest.props
        )

        implicit val typeInfo = TypeInformation.of(classOf[String])

        val inputStream = env.addSource(kafkaConsumer)
        val outputStream = inputStream.map(_.toUpperCase)

        val kafkaProducer = new FlinkKafkaProducer011(
          "output-topic",
          new SimpleStringSchema(),
          SimpleFlinkKafkaTest.props
        )
        outputStream.addSink(kafkaProducer)

        env.execute()

        consumeFirstStringMessageFrom("output-topic") shouldEqual SimpleFlinkKafkaTest.expectedString
      }
    }
  }
}
I had an error, so I added the line implicit val typeInfo = TypeInformation.of(classOf[String]), but I don't really understand why I have to do that.
For now this code doesn't work: it runs without interruption, but it does not stop and does not give any result.
Does anyone have an idea? Or even a better way to test this kind of pipeline?
Thanks!
EDIT: added env.execute() and the error changed.
Here's a simple solution I came up with.
The idea is to:
- Start the embedded Kafka server
- Create the test topics (here "input" and "output")
- Launch the Flink job in a Future to avoid blocking the main thread
- Publish a message to the input topic
- Check the result on the output topic
And the working prototype:
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {

  "runs with embedded kafka on arbitrary available ports" should {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    "work" in {
      val userDefinedConfig = EmbeddedKafkaConfig(kafkaPort = 9092, zooKeeperPort = 2182)

      val properties = new Properties()
      properties.setProperty("bootstrap.servers", "localhost:9092")
      properties.setProperty("zookeeper.connect", "localhost:2182")
      properties.setProperty("group.id", "test")
      properties.setProperty("auto.offset.reset", "earliest")

      val kafkaConsumer = new FlinkKafkaConsumer011[String]("input", new SimpleStringSchema(), properties)
      val kafkaSink = new FlinkKafkaProducer011[String]("output", new SimpleStringSchema(), properties)

      val stream = env
        .addSource(kafkaConsumer)
        .map(_.toUpperCase)
        .addSink(kafkaSink)

      withRunningKafkaOnFoundPort(userDefinedConfig) { implicit actualConfig =>
        createCustomTopic("input")
        createCustomTopic("output")
        Future { env.execute() }

        publishStringMessageToKafka("input", "Titi")
        consumeFirstStringMessageFrom("output") shouldEqual "TITI"
      }
    }
  }
}
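
One design note on the prototype: it uses the wildcard import org.apache.flink.streaming.api.scala._, which brings in Flink's implicit TypeInformation derivation for the Scala DataStream API, so the explicit implicit val typeInfo = TypeInformation.of(classOf[String]) from the original attempt is no longer needed.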

Unable to serialize SparkContext in foreachRDD

I am trying to save streaming data from Kafka to Cassandra. I am able to read and parse the data, but when I call the lines below to save it, I get a Task not serializable exception. My class extends Serializable, but I am not sure why I am seeing this error; I didn't get much help even after googling for three hours. Can somebody give me some pointers?
val collection = sc.parallelize(Seq((obj.id, obj.data)))
collection.saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import kafka.serializer.StringDecoder
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector.SomeColumns
import java.util.Formatter.DateTime

object StreamProcessor extends Serializable {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamProcessor")
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new SQLContext(sc)

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = args.toSet

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        try {
          rdd.foreachPartition { iter =>
            iter.foreach {
              case (key, msg) =>
                val obj = msgParseMaster(msg)
                val collection = sc.parallelize(Seq((obj.id, obj.data)))
                collection.saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
            }
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }

  import org.json4s._
  import org.json4s.native.JsonMethods._

  case class wordCount(id: Long, data: String) extends Serializable

  implicit val formats = DefaultFormats

  def msgParseMaster(msg: String): wordCount = {
    val m = parse(msg).extract[wordCount]
    return m
  }
}
I am getting:
org.apache.spark.SparkException: Task not serializable
Below is the full log:
16/08/06 10:24:52 ERROR JobScheduler: Error running job streaming job 1470504292000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at
SparkContext isn't serializable, so you can't use it inside foreachRDD, and from the way your graph is used you don't need it. Instead, you can simply map over the stream, parse out the relevant data, and save each resulting RDD to Cassandra:
stream
  .map { case (_, msg) =>
    val result = msgParseMaster(msg)
    (result.id, result.data)
  }
  .foreachRDD { rdd =>
    if (!rdd.isEmpty)
      rdd.saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
  }
You cannot call sc.parallelize within a function passed to foreachPartition: that function has to be serialized and sent to each executor, and SparkContext is (intentionally) not serializable (it should only reside in the driver application, not on the executors).
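As a side note, saveToCassandra on an RDD of tuples is provided by the implicits in com.datastax.spark.connector._, which the question already imports, so the (id, data) pairs map directly onto the id and data columns named in SomeColumns.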