Iceberg's FlinkSink doesn't update metadata file in streaming writes

I have been trying to use Iceberg's FlinkSink to consume data and write it to an Iceberg table.
I can fetch the data from Kinesis, and I can see data files being written into the appropriate partition. However, metadata.json is never updated, and without that I cannot query the table.
Any help or pointers are appreciated.
The following is the code.
package test
import java.util.{Calendar, Properties}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer
import org.apache.flink.streaming.connectors.kinesis.config.{AWSConfigConstants, ConsumerConfigConstants}
import org.apache.flink.table.data.{GenericRowData, RowData, StringData}
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.flink.{CatalogLoader, TableLoader}
import org.apache.iceberg.flink.TableLoader.HadoopTableLoader
import org.apache.iceberg.flink.sink.FlinkSink
import org.apache.iceberg.hadoop.HadoopCatalog
import org.apache.iceberg.types.Types
import org.apache.iceberg.{PartitionSpec, Schema}
import scala.collection.JavaConverters._
object SampleApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val warehouse = "file://<local folder path>"
val catalog = new HadoopCatalog(new Configuration(), warehouse)
val ti = TableIdentifier.of("test_table")
if (!catalog.tableExists(ti)) {
println("table doesnt exist. creating it.")
val schema = new Schema(
Types.NestedField.optional(1, "message", Types.StringType.get()),
Types.NestedField.optional(2, "year", Types.StringType.get()),
Types.NestedField.optional(3, "month", Types.StringType.get()),
Types.NestedField.optional(4, "date", Types.StringType.get()),
Types.NestedField.optional(5, "hour", Types.StringType.get())
val props = Map(
"write.metadata.delete-after-commit.enabled" -> "true",
"write.metadata.previous-versions-max" -> "5",
"" -> "1048576"
val partitionSpec = PartitionSpec.builderFor(schema)
catalog.createTable(ti, schema, partitionSpec, props.asJava)
} else {
println("table exists. not creating it.")
val inputProperties = new Properties()
inputProperties.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1")
inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")
val stream: DataStream[RowData] = env
.addSource(new FlinkKinesisConsumer[String]("test_kinesis_stream", new SimpleStringSchema(), inputProperties))
.map(x => {
val now = Calendar.getInstance()
.tableLoader(TableLoader.fromHadoopTable(s"$warehouse/${}", new Configuration()))
env.execute("test app")
Thanks in advance.

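You should enable checkpointing on the execution environment. The Iceberg Flink sink commits data files, and therefore writes a new metadata.json, only when a checkpoint completes.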
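A minimal sketch of what enabling checkpointing could look like with the Scala DataStream API used in the question. The object name, the helper name, and the 10-second interval are illustrative assumptions, not values from the original answer:

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object CheckpointingSketch {
  // Returns an environment with periodic checkpointing switched on.
  // intervalMs is an illustrative value; pick one that suits your latency needs.
  def configuredEnv(intervalMs: Long = 10000L): StreamExecutionEnvironment = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(intervalMs)
    env
  }
}
```

In the question's code this would amount to calling env.enableCheckpointing(...) right after the environment is created and before env.execute("test app").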


Convert scala map to dataframe

I am trying to stream Twitter data using Apache Spark and save it as a CSV file in HDFS. I understand that I have to convert it to a DataFrame first, but I have not been able to do so.
Here is my full code:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
//import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
//import org.apache.spark.sql.functions._
import sentimentAnalysis.sentimentScore
case class twitterCaseClass (userID: String = "", user: String = "", createdAt: String = "", text: String = "", sentimentType: String = "")
object twitterStream {
//private val gson = new Gson()
def main(args: Array[String]) {
//Twitter API
System.setProperty("twitter4j.oauth.consumerKey", "#######")
System.setProperty("twitter4j.oauth.consumerSecret", "#######")
System.setProperty("twitter4j.oauth.accessToken", "#######")
System.setProperty("twitter4j.oauth.accessTokenSecret", "#######")
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
import spark.implicits._
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
val tweets ={ col => {
"userID" -> col.getId,
"user" -> col.getUser.getScreenName,
"createdAt" -> col.getCreatedAt.toInstant.toString,
"text" -> col.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
"sentimentType" -> sentimentScore(col.getText).toString
//val tweets =
I am not sure where I went wrong. Another approach would be to use a case class. Is there a good example I can follow?
The result of the map function, which is saved to HDFS, looks like this:
((userID,1345940003533312000),(user,rei_yang),(createdAt,2021-01-04T03:47:57Z),(text,just posted a photo singapore),(sentimentType,NEUTRAL))
Is there a way to convert it to a DataFrame?
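One possible shape of the case class approach, sketched under assumptions: the case class name Tweet and the helper toDataFrame are illustrative, and the sample row reuses the values shown in the output above. Because the fields are named and typed, Spark can derive the DataFrame columns directly:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Mirrors the fields of the question's case class; the name Tweet is illustrative.
case class Tweet(userID: String, user: String, createdAt: String, text: String, sentimentType: String)

object TweetsToDataFrame {
  // Converts a local collection of Tweet values into a DataFrame with one column per field.
  def toDataFrame(spark: SparkSession, tweets: Seq[Tweet]): DataFrame = {
    import spark.implicits._
    tweets.toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tweetsToDF").master("local[*]").getOrCreate()
    val sample = Seq(
      Tweet("1345940003533312000", "rei_yang", "2021-01-04T03:47:57Z", "just posted a photo singapore", "NEUTRAL")
    )
    val df = toDataFrame(spark, sample)
    df.show(truncate = false)
    spark.stop()
  }
}
```

In the streaming job, the same idea would presumably apply inside foreachRDD: map each status to the case class instead of a tuple of pairs, call toDF() on the resulting RDD, and write it with something like df.write.mode("append").csv(path), where path is your HDFS target.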

Testing a utility function by writing a unit test in apache spark scala

I have a utility function written in Scala that reads Parquet files from an S3 bucket. Could someone help me write unit tests for it?
Below is the function that needs to be tested.
def readParquetFile(spark: SparkSession,
locationPath: String): DataFrame = {
So far I have created a SparkSession whose master is local:
import org.apache.spark.sql.SparkSession
trait SparkSessionTestWrapper {
  lazy val spark: SparkSession = {
    SparkSession.builder().master("local").appName("Test App").getOrCreate()
  }
}
I am stuck on testing the function. Here is the code where I am stuck. The question is: should I create a real Parquet file and load it to check that the DataFrame gets created, or is there a mocking framework for testing this?
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.scalatest.FunSpec
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
Based on the input from the comments I came up with the code below, but I still don't understand how it can be leveraged.
I am using
class ReadAndWriteSpec extends FunSpec with DataFrameComparer with SparkSessionTestWrapper {
import spark.implicits._
it("reads a parquet file and creates a dataframe") {
val api = S3Mock(port = 8001, dir = "/tmp/s3")
val endpoint = new EndpointConfiguration("http://localhost:8001", "us-west-2")
val client = AmazonS3ClientBuilder
.withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials()))
/** Use it as usual. */
client.putObject("foo", "bar", "baz")
val url = client.getUrl("foo","bar")
val df = ReadAndWrite.readParquetFile(spark,url.getPath())
I figured it out and kept it simple. I was able to complete some basic test cases.
Here is my solution. I hope it helps someone.
import org.apache.spark.sql
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import loaders.ReadAndWrite
class ReadAndWriteTestSpec extends FunSuite with BeforeAndAfterEach{
private val master = "local"
private val appName = "ReadAndWrite-Test"
var spark : SparkSession = _
override def beforeEach(): Unit = {
spark = new sql.SparkSession.Builder().appName(appName).master(master).getOrCreate()
test("creating data frame from parquet file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF ="src/test/resources/people.json")
val df = ReadAndWrite.readParquetFile(sparkSession,"src/test/resources/people.parquet")
test("creating data frame from text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
test("counts should match with number of records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
assert(peopleDF.count() == 3)
test("data should match with sample records in a text file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
test("Write a data frame as csv file") {
val sparkSession = spark
import sparkSession.implicits._
val peopleDF = ReadAndWrite.readTextfileToDataSet(sparkSession,"src/test/resources/people.txt").map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
//header argument should be boolean to the user to avoid confusions
override def afterEach(): Unit = {
case class Person(name: String, age: Int)
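On the original question of whether a real Parquet file is needed: one option that avoids both S3 and checked-in fixtures is to write a tiny DataFrame to a temporary directory and read it back. The sketch below assumes readParquetFile simply delegates to spark.read.parquet, since its body is not shown in the question; the object name and the sample rows are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetRoundTrip {
  // Hypothetical stand-in for the question's readParquetFile, whose body is not shown.
  def readParquetFile(spark: SparkSession, locationPath: String): DataFrame =
    spark.read.parquet(locationPath)

  // Writes a three-row DataFrame to a temporary directory, reads it back,
  // and returns the number of rows that were read.
  def roundTripCount(spark: SparkSession): Long = {
    import spark.implicits._
    val dir = Files.createTempDirectory("parquet-test").resolve("people.parquet").toString
    Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age").write.parquet(dir)
    readParquetFile(spark, dir).count()
  }
}
```

A test can then assert on the returned count, or compare the read DataFrame against the one that was written.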

Scala Package Issue With ZKStringSerializer

I am trying to use the class ZKStringSerializer, which I get with
import kafka.utils.ZKStringSerializer
According to the entirety of the internet, and even my own code before I restarted my computer, this should allow my code to work. However, I now get an incredibly confusing compile error:
object ZKStringSerializer in package utils cannot be accessed in package kafka.utils
This is confusing because this file is not supposed to be in any package, and I don't specify a package anywhere. This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.ZkConnection
import java.util.Properties
import org.apache.kafka.clients.admin
import kafka.admin.{AdminUtils, RackAwareMode}
import kafka.utils.ZKStringSerializer
import kafka.utils.ZkUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
object SpeedTester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
val rootLogger = Logger.getRootLogger()
import spark.implicits._
val zookeeperConnect = "localhost:2181"
val sessionTimeoutMs = 10000
val connectionTimeoutMs = 10000
val zkClient = new ZkClient(zookeeperConnect, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer)
val topicName = "testTopic"
val numPartitions = 8
val replicationFactor = 1
val topicConfig = new Properties
val isSecureKafkaCluster = false
val zkUtils = new ZkUtils(zkClient, new ZkConnection(zookeeperConnect), isSecureKafkaCluster)
AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig)
// Create producer for topic testTopic and actually push values to the topic
val props = new Properties()
props.put("bootstrap.servers", "localhost:9592")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val TOPIC = "testTopic"
for (i <- 1 to 50) {
val record = new ProducerRecord(TOPIC, "key", s"hello $i")
val record = new ProducerRecord(TOPIC, "key", "the end" + new java.util.Date)
I know this is late, but for others who run into the same issue:
In recent versions of Kafka, kafka.utils was deprecated, so please use the Kafka AdminClient APIs instead.
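A rough sketch of creating the question's topic with the AdminClient instead of ZkClient and AdminUtils. The object and method names are illustrative, and the bootstrap address is an assumption about your setup; createTopic needs a reachable broker, while topicSpec only builds the description.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopicSketch {
  // Describes a topic; the question uses "testTopic" with 8 partitions and replication factor 1.
  def topicSpec(name: String, partitions: Int, replication: Short): NewTopic =
    new NewTopic(name, partitions, replication)

  // Sends the create request and blocks until the broker answers.
  // bootstrapServers is an assumption, e.g. "localhost:9092".
  def createTopic(bootstrapServers: String, topic: NewTopic): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
    val admin = AdminClient.create(props)
    try admin.createTopics(Collections.singletonList(topic)).all().get()
    finally admin.close()
  }
}
```

This talks to the brokers directly, so no ZooKeeper client or serializer is involved.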

Integration test Flink and Kafka with scalatest-embedded-kafka

I would like to run an integration test with Flink and Kafka. The process is to read from Kafka, do some manipulation with Flink, and write the resulting datastream back to Kafka.
I would like to test the process from beginning to end. For now I use scalatest-embedded-kafka.
Here is an example that I tried to keep as simple as possible:
import java.util.Properties
import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}
import scala.collection.mutable.ListBuffer
object SimpleFlinkKafkaTest {

  class CollectSink extends SinkFunction[String] {
    override def invoke(string: String): Unit = {
      synchronized {
        CollectSink.values += string
      }
    }
  }

  object CollectSink {
    val values: ListBuffer[String] = ListBuffer.empty[String]
  }

  val kafkaPort = 9092
  val zooKeeperPort = 2181
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:" + kafkaPort.toString)
  props.put("schema.registry.url", "localhost:" + zooKeeperPort.toString)
  val inputString = "mystring"
  val expectedString = "MYSTRING"
}
class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {
"runs with embedded kafka" should {
"work" in {
implicit val config = EmbeddedKafkaConfig(
kafkaPort = SimpleFlinkKafkaTest.kafkaPort,
zooKeeperPort = SimpleFlinkKafkaTest.zooKeeperPort
withRunningKafka {
publishStringMessageToKafka("input-topic", SimpleFlinkKafkaTest.inputString)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val kafkaConsumer = new FlinkKafkaConsumer011(
new SimpleStringSchema,
implicit val typeInfo = TypeInformation.of(classOf[String])
val inputStream = env.addSource(kafkaConsumer)
val outputStream =
val kafkaProducer = new FlinkKafkaProducer011(
new SimpleStringSchema(),
consumeFirstStringMessageFrom("output-topic") shouldEqual SimpleFlinkKafkaTest.expectedString
I got an error, so I added the line implicit val typeInfo = TypeInformation.of(classOf[String]), but I don't really understand why I have to do that.
For now this code doesn't work: it runs without interruption, but it never stops and never gives any result.
Does anyone have an idea? A better way to test this kind of pipeline would be even more welcome.
Thanks!
EDIT: added env.execute() and changed the error.
Here's a simple solution I came up with.
The idea is to:
1. Start the embedded Kafka server
2. Create your test topics (here input and output)
3. Launch the Flink job in a Future to avoid blocking the main thread
4. Publish a message to the input topic
5. Check the result on the output topic
And the working prototype:
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}
import scala.concurrent.Future
class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {
"runs with embedded kafka on arbitrary available ports" should {
val env = StreamExecutionEnvironment.getExecutionEnvironment
"work" in {
val userDefinedConfig = EmbeddedKafkaConfig(kafkaPort = 9092, zooKeeperPort = 2182)
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2182")
properties.setProperty("", "test")
properties.setProperty("auto.offset.reset", "earliest")
val kafkaConsumer = new FlinkKafkaConsumer011[String]("input", new SimpleStringSchema(), properties)
val kafkaSink = new FlinkKafkaProducer011[String]("output", new SimpleStringSchema(), properties)
val stream = env
withRunningKafkaOnFoundPort(userDefinedConfig) { implicit actualConfig =>
publishStringMessageToKafka("input", "Titi")
consumeFirstStringMessageFrom("output") shouldEqual "TITI"
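The Future step in the list above matters because env.execute() blocks until the job finishes, and a streaming job that reads from Kafka does not finish on its own. A library-free sketch of that pattern follows; blockingJob is a hypothetical stand-in for the Flink job, and the two queues stand in for the input and output topics.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import scala.concurrent.{ExecutionContext, Future}

object BackgroundJobSketch {
  // Stand-in for a streaming job: upper-cases messages until told to stop.
  // In the real test this would be the blocking call to env.execute().
  def blockingJob(input: LinkedBlockingQueue[String],
                  output: LinkedBlockingQueue[String],
                  stop: CountDownLatch): Unit = {
    while (stop.getCount > 0) {
      val msg = input.poll(50, TimeUnit.MILLISECONDS)
      if (msg != null) output.put(msg.toUpperCase)
    }
  }

  // Starts the job on another thread, publishes one message, and waits up to
  // five seconds for the transformed message to appear on the output side.
  def runOnce(message: String)(implicit ec: ExecutionContext): Option[String] = {
    val input = new LinkedBlockingQueue[String]()
    val output = new LinkedBlockingQueue[String]()
    val stop = new CountDownLatch(1)
    Future(blockingJob(input, output, stop)) // does not block the caller
    input.put(message)
    val result = Option(output.poll(5, TimeUnit.SECONDS))
    stop.countDown() // let the background loop exit
    result
  }
}
```

With scala.concurrent.ExecutionContext.Implicits.global in scope, BackgroundJobSketch.runOnce("Titi") plays the role of publishing to the input topic and consuming from the output topic while the job keeps running in the background.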

Scala/Spark serialization error - streaming data to HBASE

I am new to Scala/Spark. In the following code, I am extracting Twitter public stream content into HBase.
When I comment out the last four lines (the put commands for HBase), I can print the content of each tweet on the terminal, but I am unable to write it to the HBase table.
I need help with the following:
1. How can I overcome the serialization error?
2. Are there efficient methods (maybe using Kryo serialization) to overcome this error?
Caused by:
org.apache.hadoop.conf.Configuration Serialization stack:
- object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml)
import twitter4j.auth._
import twitter4j.conf._
import twitter4j._
import twitter4j.json._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.twitter.TwitterUtils
val conf = new SparkConf().setAppName("model1").setMaster("local[*]")
// val sc = new SparkContext(conf)
val TABLE_NAME = "publicrd"
val CF_USER = "user"
val CF_TWEET = "tweet"
val CF_ENTITIES = "entities"
val CF_PLACES = "places"
val hadoopConf = new Configuration
val conf = HBaseConfiguration.create(hadoopConf)
val admin = new HBaseAdmin(conf)
val tableDesc = new HTableDescriptor(Bytes.toBytes(TABLE_NAME))
// Define column family descriptor
val ColumnFamilyDesc1 = new HColumnDescriptor(Bytes.toBytes(CF_USER))
val ColumnFamilyDesc2 = new HColumnDescriptor(Bytes.toBytes(CF_TWEET))
val ColumnFamilyDesc3 = new HColumnDescriptor(Bytes.toBytes(CF_ENTITIES))
val ColumnFamilyDesc4 = new HColumnDescriptor(Bytes.toBytes(CF_PLACES))
// Add column family in table descriptor
// Check if the table exists
if (admin.tableExists(TABLE_NAME)){
print(">>>>>" + TABLE_NAME + " already exists <<<<<")
// Create HBASE table
val table = new HTable(conf, TABLE_NAME)
val timewindow = 2 // seconds
val ssc = new StreamingContext(sc, Seconds(timewindow))
val cb = new ConfigurationBuilder
val ckey = "ckey"
val csecret = "csecret"
val atoken = "atoken"
val atokensecret = "atokensecret"
val auth = new OAuthAuthorization(
val tweets = TwitterUtils.createStream(ssc,Some(auth))
val status = tweets.filter(_.getLang()=="en")
status.foreachRDD(foreachFunc = rdd => {
rdd.foreachPartition {
records => while (records.hasNext) {
var record =
var tweetID = record.getUser().getId().toString//.isInstanceOf[Int]
print("\ntweetID : "+tweetID)
var tweetBody = record.getText()//.toString
print("\ntweetBody : "+tweetBody)
var favoritesCount = record.getFavoriteCount()//.toInt
print("\nfavoritesCount : "+favoritesCount)
var keyrow = "t_"+tweetID //"t_"+
print("\nkeyrow : "+keyrow+"\n")
var theput= new Put(Bytes.toBytes(keyrow))
The code is run on the terminal via:
spark-shell --driver-class-path /opt/hadoop/hbase-1.2.1/lib/hbase-server-1.1.4.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-protocol-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-hadoop2-compat-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-client-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-common-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/htrace-core-3.2.0-incubating.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/guava-19.0.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/spark-streaming-twitter_2.10-1.6.1.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-async-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-core-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-examples-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-media-support-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-stream-4.0.4.jar
It says that the object org.apache.hadoop.conf.Configuration is not serializable, which means it does not implement the Serializable interface even though that is required here. To get rid of the error, add the @transient annotation:
@transient val hadoopConf = new Configuration
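To illustrate what the annotation does, here is a small library-free sketch using plain Java serialization. FakeConfiguration is a hypothetical stand-in for org.apache.hadoop.conf.Configuration, and Job and TransientDemo are illustrative names. Note that the sketch uses @transient lazy val rather than a plain @transient val, so the field is rebuilt on first access after deserialization instead of coming back as null.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Stand-in for org.apache.hadoop.conf.Configuration: a class that is NOT Serializable.
class FakeConfiguration {
  val name: String = "core-default.xml"
}

// A Serializable holder. The @transient lazy val is skipped during serialization
// and re-created on first access after deserialization (for example on an executor).
class Job extends Serializable {
  @transient lazy val conf: FakeConfiguration = new FakeConfiguration
  val tableName: String = "publicrd"
}

object TransientDemo {
  // Serializes a value to bytes and reads it back, as happens when a closure is shipped.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Without the annotation, writeObject would fail with a NotSerializableException for FakeConfiguration. Another common way around the problem, not part of the answer above, is to create the configuration and the table handle inside foreachPartition so that they are never captured by the closure at all.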