Integration test Flink and Kafka with scalatest-embedded-kafka (Scala)

I would like to run an integration test with Flink and Kafka. The process reads from Kafka, does some manipulation with Flink, and puts the resulting datastream back into Kafka.
I would like to test the process from beginning to end. For now I use scalatest-embedded-kafka.
I put an example here; I tried to keep it as simple as possible:
import java.util.Properties

import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}

import scala.collection.mutable.ListBuffer

object SimpleFlinkKafkaTest {

  class CollectSink extends SinkFunction[String] {
    override def invoke(string: String): Unit = {
      synchronized {
        CollectSink.values += string
      }
    }
  }

  object CollectSink {
    val values: ListBuffer[String] = ListBuffer.empty[String]
  }

  val kafkaPort = 9092
  val zooKeeperPort = 2181

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:" + kafkaPort.toString)
  props.put("schema.registry.url", "localhost:" + zooKeeperPort.toString)

  val inputString = "mystring"
  val expectedString = "MYSTRING"
}
class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {

  "runs with embedded kafka" should {
    "work" in {
      implicit val config = EmbeddedKafkaConfig(
        kafkaPort = SimpleFlinkKafkaTest.kafkaPort,
        zooKeeperPort = SimpleFlinkKafkaTest.zooKeeperPort
      )

      withRunningKafka {
        publishStringMessageToKafka("input-topic", SimpleFlinkKafkaTest.inputString)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)

        val kafkaConsumer = new FlinkKafkaConsumer011(
          "input-topic",
          new SimpleStringSchema,
          SimpleFlinkKafkaTest.props
        )

        implicit val typeInfo = TypeInformation.of(classOf[String])

        val inputStream = env.addSource(kafkaConsumer)
        val outputStream = inputStream.map(_.toUpperCase)

        val kafkaProducer = new FlinkKafkaProducer011(
          "output-topic",
          new SimpleStringSchema(),
          SimpleFlinkKafkaTest.props
        )
        outputStream.addSink(kafkaProducer)

        env.execute()

        consumeFirstStringMessageFrom("output-topic") shouldEqual SimpleFlinkKafkaTest.expectedString
      }
    }
  }
}
I got an error, so I added the line implicit val typeInfo = TypeInformation.of(classOf[String]), but I don't really understand why I have to do that.
For now this code doesn't work: it runs without interruption, but it never stops and never gives any result.
Does anyone have an idea? Or an even better idea for testing this kind of pipeline?
Thanks!
EDIT: added env.execute() and changed the error.
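A note on the implicit TypeInformation: Flink's Scala DataStream API needs an implicit TypeInformation for every stream element type, which is why the compiler complained. Building it by hand works, but the usual fix is the Scala API wildcard import, which brings an implicit derivation into scope. A minimal sketch, assuming the same String stream as above:

// The wildcard import provides createTypeInformation, so TypeInformation[String]
// is derived implicitly and the manual `implicit val typeInfo` is no longer needed.
import org.apache.flink.streaming.api.scala._

val inputStream: DataStream[String] = env.addSource(kafkaConsumer)
val outputStream = inputStream.map(_.toUpperCase)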

Here's a simple solution I came up with.
The idea is to:
Start the embedded Kafka server
Create your test topics (here input and output)
Launch the Flink job in a Future to avoid blocking the main thread
Publish a message to the input topic
Check the result on the output topic
And the working prototype:
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {

  "runs with embedded kafka on arbitrary available ports" should {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    "work" in {
      val userDefinedConfig = EmbeddedKafkaConfig(kafkaPort = 9092, zooKeeperPort = 2182)

      val properties = new Properties()
      properties.setProperty("bootstrap.servers", "localhost:9092")
      properties.setProperty("zookeeper.connect", "localhost:2182")
      properties.setProperty("group.id", "test")
      properties.setProperty("auto.offset.reset", "earliest")

      val kafkaConsumer = new FlinkKafkaConsumer011[String]("input", new SimpleStringSchema(), properties)
      val kafkaSink = new FlinkKafkaProducer011[String]("output", new SimpleStringSchema(), properties)

      val stream = env
        .addSource(kafkaConsumer)
        .map(_.toUpperCase)
        .addSink(kafkaSink)

      withRunningKafkaOnFoundPort(userDefinedConfig) { implicit actualConfig =>
        createCustomTopic("input")
        createCustomTopic("output")
        Future { env.execute() }
        publishStringMessageToKafka("input", "Titi")
        consumeFirstStringMessageFrom("output") shouldEqual "TITI"
      }
    }
  }
}
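One note on the design: env.execute() blocks for as long as the unbounded Kafka source keeps running, which is why the original test ran without ever producing a result; launching the job in a Future lets the test thread publish and assert while the job runs. If you also want failures inside the job to surface in the test, you can keep a handle on that Future. A small sketch under the same assumptions as the prototype above:

// Keep a handle on the job Future so exceptions thrown inside the Flink job
// show up in the test instead of disappearing silently.
val job = Future { env.execute("uppercase-test") }

publishStringMessageToKafka("input", "Titi")
consumeFirstStringMessageFrom("output") shouldEqual "TITI"

job.value.foreach {
  case scala.util.Failure(e) => fail(e) // the job already died; report why
  case _                     => ()      // still running, as expected for an unbounded Kafka source
}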

Related

Iceberg's FlinkSink doesn't update metadata file in streaming writes

I have been trying to use Iceberg's FlinkSink to consume data and write it to an Iceberg table.
I can successfully fetch the data from Kinesis, and I see that the data is being written into the appropriate partition. However, I don't see metadata.json being updated, and without it I am not able to query the table.
Any help or pointers are appreciated.
The following is the code.
package test

import java.util.{Calendar, Properties}

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer
import org.apache.flink.streaming.connectors.kinesis.config.{AWSConfigConstants, ConsumerConfigConstants}
import org.apache.flink.table.data.{GenericRowData, RowData, StringData}
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.flink.{CatalogLoader, TableLoader}
import org.apache.iceberg.flink.TableLoader.HadoopTableLoader
import org.apache.iceberg.flink.sink.FlinkSink
import org.apache.iceberg.hadoop.HadoopCatalog
import org.apache.iceberg.types.Types
import org.apache.iceberg.{PartitionSpec, Schema}

import scala.collection.JavaConverters._

object SampleApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val warehouse = "file://<local folder path>"

    val catalog = new HadoopCatalog(new Configuration(), warehouse)
    val ti = TableIdentifier.of("test_table")

    if (!catalog.tableExists(ti)) {
      println("table doesnt exist. creating it.")
      val schema = new Schema(
        Types.NestedField.optional(1, "message", Types.StringType.get()),
        Types.NestedField.optional(2, "year", Types.StringType.get()),
        Types.NestedField.optional(3, "month", Types.StringType.get()),
        Types.NestedField.optional(4, "date", Types.StringType.get()),
        Types.NestedField.optional(5, "hour", Types.StringType.get())
      )

      val props = Map(
        "write.metadata.delete-after-commit.enabled" -> "true",
        "write.metadata.previous-versions-max" -> "5",
        "write.target-file-size-bytes" -> "1048576"
      )

      val partitionSpec = PartitionSpec.builderFor(schema)
        .identity("year")
        .identity("month")
        .identity("date")
        .identity("hour")
        .build()

      catalog.createTable(ti, schema, partitionSpec, props.asJava)
    } else {
      println("table exists. not creating it.")
    }

    val inputProperties = new Properties()
    inputProperties.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1")
    inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")

    val stream: DataStream[RowData] = env
      .addSource(new FlinkKinesisConsumer[String]("test_kinesis_stream", new SimpleStringSchema(), inputProperties))
      .map(x => {
        val now = Calendar.getInstance()
        GenericRowData.of(
          StringData.fromString(x),
          StringData.fromString(now.get(Calendar.YEAR).toString),
          StringData.fromString("%02d".format(now.get(Calendar.MONTH))),
          StringData.fromString("%02d".format(now.get(Calendar.DAY_OF_MONTH))),
          StringData.fromString("%02d".format(now.get(Calendar.HOUR_OF_DAY)))
        )
      })

    FlinkSink
      .forRowData(stream.javaStream)
      .tableLoader(TableLoader.fromHadoopTable(s"$warehouse/${ti.name()}", new Configuration()))
      .build()

    env.execute("test app")
  }
}
Thanks in advance.
You should enable checkpointing:
env.enableCheckpointing(1000)
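For context: Iceberg's FlinkSink commits the newly written data files and rewrites metadata.json only when a Flink checkpoint completes, so a streaming job without checkpointing keeps writing data files but never advances the table metadata. A minimal sketch for the sample job above; the interval and mode are assumptions to tune for your latency needs:

import org.apache.flink.streaming.api.CheckpointingMode

// Iceberg commits data files and updates metadata.json on checkpoint completion,
// so the checkpoint interval controls how often new data becomes queryable.
env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE) // e.g. commit roughly once a minute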

Unable to solve the error: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

I am trying to read data from a table through SparkSession and publish it to a Kafka topic, using the piece of code below:
import java.util.Properties

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.avro.io._
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.producer._
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.kafka.common.serialization.ByteArraySerializer
import org.apache.spark.sql.SparkSession

import java.io.{ByteArrayOutputStream, StringWriter}
import scala.io.Source

object Producer extends Serializable {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

    // The Avro schema is parsed on the driver and captured by the map closure below
    val lines = Source.fromFile("file").mkString
    val schema = new Schema.Parser().parse(lines)

    val spark = new SparkSession.Builder().enableHiveSupport().getOrCreate()
    import spark.implicits._

    val df = spark.sql("select * from table")

    df.rdd.map { value =>
      val prod = new KafkaProducer[String, Array[Byte]](props)
      val records = new GenericData.Record(schema)
      records.put("col1", value.getString(1))
      records.put("col2", value.getString(2))
      records.put("col3", value.getString(3))
      records.put("col4", value.getString(4))

      val writer = new SpecificDatumWriter[GenericRecord](schema)
      val out = new ByteArrayOutputStream()
      val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
      writer.write(records, encoder)
      encoder.flush()
      out.close()

      val serializedBytes: Array[Byte] = out.toByteArray()
      val record = new ProducerRecord("topic", col1.toString, serializedBytes)
      val data = prod.send(record)
      prod.flush()
      prod.close()
    }

    spark.close()
  }
}
And the error below is thrown when I execute it:
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
Serialization stack:
- object not serializable (class: org.apache.avro.Schema$RecordSchema, value: {"type":"record","name":"data","namespace":"com.data.record","fields":[{"name":"col1","type":"string"},{"name":"col2","type":"string"},{"name":"col3","type":"string"},{"name":"col4","type":"string"}]})
- field (class: scala.runtime.ObjectRef, name: elem, type: class java.lang.Object)
- object (class scala.runtime.ObjectRef, {"type":"record","name":"data","namespace":"com.data.record","fields":[{"name":"col1","type":"string"},{"name":"col2","type":"string"},{"name":"col3","type":"string"},{"name":"col4","type":"string"}]})
- field (class: com.kafka.driver.KafkaProducer.Producer$$anonfun$main$1, name: schema$1, type: class scala.runtime.ObjectRef)
However, it runs fine when I pull the dataset to the driver using df.rdd.collect.foreach. But I need to publish the messages from the executors, hence rdd.map. I am not sure what exactly I am missing here that causes this error. Any help towards resolving this would be highly appreciated, thanks!
I figured out that the Schema and KafkaProducer objects need to be created on the executors. To do that, I modified the code above as follows:
import java.util.Properties

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.avro.io._
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.producer._
import org.apache.kafka.common.serialization.StringSerializer
import org.apache.kafka.common.serialization.ByteArraySerializer
import org.apache.spark.sql.SparkSession

import java.io.{ByteArrayOutputStream, StringWriter}
import scala.io.Source

object Producer extends Serializable {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

    val spark = new SparkSession.Builder().enableHiveSupport().getOrCreate()
    import spark.implicits._

    val df = spark.sql("select * from table")

    df.foreachPartition { rows =>
      // Producer and schema are now created on the executor, once per partition,
      // so nothing non-serializable has to be shipped from the driver.
      val prod = new KafkaProducer[String, Array[Byte]](props)
      val lines = Source.fromFile("file").mkString
      val schema = new Schema.Parser().parse(lines)

      rows.foreach { value =>
        val records = new GenericData.Record(schema)
        records.put("col1", value.getString(1))
        records.put("col2", value.getString(2))
        records.put("col3", value.getString(3))
        records.put("col4", value.getString(4))

        val writer = new SpecificDatumWriter[GenericRecord](schema)
        val out = new ByteArrayOutputStream()
        val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
        writer.write(records, encoder)
        encoder.flush()
        out.close()

        val serializedBytes: Array[Byte] = out.toByteArray()
        val record = new ProducerRecord("topic", col1.toString, serializedBytes)
        val data = prod.send(record)
      }

      prod.flush()
      prod.close()
    }

    spark.close()
  }
}
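A common variant (a sketch on my part, not from the original answer) is to read the schema file once on the driver, ship the schema JSON as a plain String, which is serializable, and parse it inside each partition, so the file only needs to exist on the driver machine:

// Hypothetical variant: ship the Avro schema as a serializable String instead of
// re-reading the schema file on every executor.
val schemaJson: String = Source.fromFile("file").mkString // read once on the driver

df.foreachPartition { rows =>
  val schema = new Schema.Parser().parse(schemaJson) // parsed on the executor; nothing non-serializable is captured
  val prod = new KafkaProducer[String, Array[Byte]](props)
  // ... build the GenericData.Record values and send them exactly as in the code above ...
  prod.flush()
  prod.close()
}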

Scala Package Issue With ZKStringSerializer

I am trying to use the class ZKStringSerializer, which I get with
import kafka.utils.ZKStringSerializer
According to the entirety of the internet, and even my own code before I restarted my computer, this should allow my code to work. However, I now get an incredibly confusing compile error:
object ZKStringSerializer in package utils cannot be accessed in package kafka.utils
This is confusing because this file is not supposed to be in any package, and I don't specify a package anywhere. This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.ZkConnection
import java.util.Properties
import org.apache.kafka.clients.admin
import kafka.admin.{AdminUtils, RackAwareMode}
import kafka.utils.ZKStringSerializer
import kafka.utils.ZkUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SpeedTester {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
    import spark.implicits._

    val zookeeperConnect = "localhost:2181"
    val sessionTimeoutMs = 10000
    val connectionTimeoutMs = 10000
    val zkClient = new ZkClient(zookeeperConnect, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer)

    val topicName = "testTopic"
    val numPartitions = 8
    val replicationFactor = 1
    val topicConfig = new Properties
    val isSecureKafkaCluster = false
    val zkUtils = new ZkUtils(zkClient, new ZkConnection(zookeeperConnect), isSecureKafkaCluster)
    AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig)

    // Create producer for topic testTopic and actually push values to the topic
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9592")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val TOPIC = "testTopic"

    for (i <- 1 to 50) {
      val record = new ProducerRecord(TOPIC, "key", s"hello $i")
      producer.send(record)
    }

    val record = new ProducerRecord(TOPIC, "key", "the end" + new java.util.Date)
    producer.send(record)
    producer.flush()
    producer.close()
  }
}
I know this is too late, but for others who run into the same issue: in recent Kafka versions the kafka.utils classes are internal to the broker and can no longer be accessed from outside (hence the compile error), so please use the Kafka AdminClient APIs instead.
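For example, here is a minimal sketch of creating the same topic with the public AdminClient API; the broker address, partition count, and replication factor are taken from the question and may need adjusting:

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.collection.JavaConverters._

// Create the topic through the public AdminClient instead of the internal ZkUtils/AdminUtils.
val adminProps = new Properties()
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val admin = AdminClient.create(adminProps)
try {
  val topic = new NewTopic("testTopic", 8, 1.toShort) // name, partitions, replication factor
  admin.createTopics(List(topic).asJava).all().get()  // block until creation completes
} finally {
  admin.close()
}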

JSON string gets truncated in apache flink

I am using Kafka as my source in Apache Flink. My Kafka producer is sending a big JSON string, but it is getting truncated to 4095 characters and my JSON parsing fails. I have no control over the size of the JSON string. Can you please explain this and help me tackle the issue?
Code:
import java.sql.Timestamp
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import models._
import org.apache.flink.types.Row
import net.liftweb.json._

case class ParsedPage(data: String, domain: String, url: String, text: String)

object HelloWorld {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "test")

    val kafkaConsumer011 = new FlinkKafkaConsumer010[String]("first_topic", new SimpleStringSchema(), properties)
    val stream: DataStream[String] = env.addSource[String](kafkaConsumer011)

    val result: DataStream[Row] = stream.filter(_.nonEmpty).flatMap { p =>
      try {
        implicit val formats = DefaultFormats
        Seq(parse(p).extract[ParsedPage])
      } catch {
        case e: Exception =>
          println(s"Parsing failed for message in txn: $p")
          Seq()
      }
    }.map { p =>
      val bookJson = ""
      val cityId = ""
      Row.of(voyagerId, cityId)
    }

    stream.print()
    env.execute()
  }
}
Flink Version : 1.4.2
Scala Version: 2.11.8

Read the data from HDFS using Scala

I am new to Scala. How can I read a file from HDFS using Scala (not using Spark)?
When I googled it, I only found examples of writing to HDFS.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import java.io.PrintWriter

/**
 * @author ${user.name}
 */
object App {
  //def foo(x : Array[String]) = x.foldLeft("")((a,b) => a + b)

  def main(args: Array[String]) {
    println("Trying to write to HDFS...")
    val conf = new Configuration()
    //conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
    conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
    val fs = FileSystem.get(conf)
    val output = fs.create(new Path("/tmp/mySample.txt"))
    val writer = new PrintWriter(output)
    try {
      writer.write("this is a test")
      writer.write("\n")
    } finally {
      writer.close()
      println("Closed!")
    }
    println("Done!")
  }
}
Please help me. How can I read or load a file from HDFS using Scala?
One of the ways (kinda in functional style) could be like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import scala.collection.immutable.Stream

val hdfs = FileSystem.get(new URI("hdfs://yourUrl:port/"), new Configuration())
val path = new Path("/path/to/file/")
val stream = hdfs.open(path)

def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine))

// This example checks each line for null and prints every existing line, one after another
readLines.takeWhile(_ != null).foreach(line => println(line))
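A slightly different sketch, wrapping the same HDFS input stream in scala.io.Source (the URL and path are placeholders, as above):

import scala.io.Source

// scala.io.Source gives an idiomatic line iterator over the HDFS input stream.
val in = hdfs.open(new Path("/path/to/file/"))
try {
  Source.fromInputStream(in).getLines().foreach(println)
} finally {
  in.close()
}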
Also, you could take a look at this article, or at these related questions; they look related to yours and contain working (but more Java-like) code examples if you're interested.