Apache Flink: add timestamp to filename - scala

I want to create a BucketingSink in Flink that writes all files to the same folder, but names each file with the current timestamp instead of an incrementing counter. For example:
Here is my code:
val sink = new BucketingSink[Tuple2[String, MyType]]("/tmp/flink/")
sink.setBucketer(new MyTypeBucketer(new SimpleDateFormat("yyyy-MM-dd--HH")))
sink.setInactiveBucketThreshold(120000) // this is 2 minutes
sink.setBatchSize(1024 * 1024 * 64) // this is 64 MB
val writer: AvroKeyValueSinkWriter[String, MyType] = new AvroKeyValueSinkWriter[String, MyType](parseAvroSinkProperties())
def parseAvroSinkProperties(): util.Map[String, String] = {
  val properties = new util.HashMap[String, String]()
  val stringSchema = Schema.create(Type.STRING)
  val myTypeSchema = myType.getClassSchema
  val keySchema = stringSchema.toString
  val valueSchema = myTypeSchema.toString
  val compress = true
  properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA, keySchema)
  properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA, valueSchema)
  properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS, compress.toString)
  properties.put(AvroKeyValueSinkWriter.CONF_COMPRESS_CODEC, DataFileConstants.SNAPPY_CODEC)
  properties // return the populated map
}
class MyTypeBucketer(dateFormatter: SimpleDateFormat) extends DateTimeBucketer[Tuple2[String, MyType]] {
  override def getBucketPath(clock: Clock, basePath: Path, element: Tuple2[String, MyType]) = {
    new Path(s"$basePath/${element.f1.getMyStringProp}")
  }
}
Does anybody have any idea?


Spark: FlatMap and CountVectorizer pipeline

I am working on a pipeline and trying to split a column value before passing it to CountVectorizer.
For this purpose I made a custom Transformer.
class FlatMapTransformer(override val uid: String)
extends Transformer {
/** Param for input column name.
  * @group param
  */
final val inputCol = new Param[String](this, "inputCol", "The input column")
final def getInputCol: String = $(inputCol)
/** Param for output column name.
  * @group param
  */
final val outputCol = new Param[String](this, "outputCol", "The output column")
final def getOutputCol: String = $(outputCol)
def setInputCol(value: String): this.type = set(inputCol, value)
def setOutputCol(value: String): this.type = set(outputCol, value)
def this() = this(Identifiable.randomUID("FlatMapTransformer"))
private val flatMap: String => Seq[String] = { input: String =>
override def copy(extra: ParamMap): SplitString = defaultCopy(extra)
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), explode(flatMapUdf(col($(inputCol)))))
override def transformSchema(schema: StructType): StructType = {
val dataType = schema($(inputCol)).dataType
s"Input column must be of type StringType but got ${dataType}")
val inputFields = schema.fields
!inputFields.exists(_.name == $(outputCol)),
s"Output column ${$(outputCol)} already exists.")
DataTypes.createStructField($(outputCol), DataTypes.StringType, false)))
The code seems legitimate, but the problem occurs when I try to chain it with other operations. Here is my pipeline:
val train = reader.readTrainingData()
val cat_features = getFeaturesByType(taskConfig, "categorical")
val num_features = getFeaturesByType(taskConfig, "numeric")
val cat_ohe_features = getFeaturesByType(taskConfig, "categorical", Some("ohe"))
val cat_features_string_index = cat_features.
filter { feature: String => !cat_ohe_features.contains(feature) }
val catIndexer = cat_features_string_index.map {
feature =>
new StringIndexer()
.setOutputCol(feature + "_index")
val flatMapper = cat_ohe_features.map {
feature =>
new FlatMapTransformer()
.setOutputCol(feature + "_transformed")
val countVectorizer = cat_ohe_features.map {
feature =>
new CountVectorizer()
.setInputCol(feature + "_transformed")
.setOutputCol(feature + "_vectorized")
// val countVectorizer = cat_ohe_features.map {
// feature =>
// val flatMapper = new FlatMapTransformer()
// .setInputCol(feature)
// .setOutputCol(feature + "_transformed")
// new CountVectorizer()
// .setInputCol(flatMapper.getOutputCol)
// .setOutputCol(feature + "_vectorized")
// .setVocabSize(10)
// }
val cat_features_index = cat_features_string_index.map {
(feature: String) => feature + "_index"
val count_vectorized_index = cat_ohe_features.map {
(feature: String) => feature + "_vectorized"
val catFeatureAssembler = new VectorAssembler()
val oheFeatureAssembler = new VectorAssembler()
val numFeatureAssembler = new VectorAssembler()
val featureAssembler = new VectorAssembler()
.setInputCols(Array("cat_features", "num_features", "cat_ohe_features_vectorized"))
val pipelineStages = catIndexer ++ flatMapper ++ countVectorizer ++
val pipeline = new Pipeline().setStages(pipelineStages)
pipeline.fit(dataset = train)
Running this code, I receive an error:
java.lang.IllegalArgumentException: Field "my_ohe_field_trasformed" does not exist.
[info] java.lang.IllegalArgumentException: Field "from_expdelv_areas_transformed" does not exist.
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
[info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
[info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
[info] at scala.collection.AbstractMap.getOrElse(Map.scala:59)
[info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
[info] at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:56)
[info] at org.apache.spark.ml.feature.CountVectorizerParams$class.validateAndTransformSchema(CountVectorizer.scala:75)
[info] at org.apache.spark.ml.feature.CountVectorizer.validateAndTransformSchema(CountVectorizer.scala:123)
[info] at org.apache.spark.ml.feature.CountVectorizer.transformSchema(CountVectorizer.scala:188)
When I uncomment the stringSplitter and countVectorizer, the error is raised in my Transformer instead:
java.lang.IllegalArgumentException: Field "my_ohe_field" does not exist. at
val dataType = schema($(inputCol)).dataType
Result of calling pipeline.getStages:
I might be going about this the wrong way. Any comments are appreciated.
Your FlatMapTransformer#transform is incorrect: you are dropping all other columns when you select only outputCol.
Please modify your method to:
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), explode(flatMapUdf(col($(inputCol)))))
Also, modify your transformSchema to check that the input column exists before checking its data type:
override def transformSchema(schema: StructType): StructType = {
  require(schema.names.contains($(inputCol)), "inputCol is not present in the input DataFrame")
  //... rest as it is
}
Update 1 (based on comments)
Please also fix the return type of the copy method (although this is not the cause of the exception you are facing):
override def copy(extra: ParamMap): FlatMapTransformer = defaultCopy(extra)
Please note that CountVectorizer expects an input column of type ArrayType(StringType, true/false). Since the output column of FlatMapTransformer becomes the input of CountVectorizer, that output column must also be of type ArrayType(StringType, true/false). I think this is not the case; your code currently reads as follows:
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), explode(flatMapUdf(col($(inputCol)))))
The explode function turns each array<string> value into one row per element, so the output column of the transformer becomes StringType. You may want to change this code to:
override def transform(dataset: Dataset[_]): DataFrame = {
val flatMapUdf = udf(flatMap)
dataset.withColumn($(outputCol), flatMapUdf(col($(inputCol))))
Modify the transformSchema method to output ArrayType(StringType):
override def transformSchema(schema: StructType): StructType = {
  val dataType = schema($(inputCol)).dataType
  // The input must be a plain string column
  require(dataType == StringType,
    s"Input column must be of type StringType but got ${dataType}")
  val inputFields = schema.fields
  // The output column must not clash with an existing one
  require(
    !inputFields.exists(_.name == $(outputCol)),
    s"Output column ${$(outputCol)} already exists.")
  schema.add($(outputCol), ArrayType(StringType))
}
Change the vector assembler to this:
val featureAssembler = new VectorAssembler()
.setInputCols(Array("cat_features", "num_features", "cat_ohe_features"))
I tried executing your pipeline on a dummy DataFrame and it worked well. Please refer to this gist for the full code.
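To make the difference between the exploded column and the array column concrete, here is a minimal plain-Scala sketch without Spark. The `tokenize` function is a hypothetical stand-in for the elided `flatMap` body, and the comma separator is an assumption. `flatMap` over the rows mirrors what `explode` does (one string per output row), while `map` mirrors the array column that CountVectorizer expects.

```scala
object ExplodeVsArray {
  // Hypothetical tokenizer standing in for the elided `flatMap` function;
  // the comma separator is an assumption made for this sketch.
  def tokenize(input: String): Seq[String] =
    input.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  // explode-like behaviour: one output row per token, element type String
  def exploded(rows: Seq[String]): Seq[String] = rows.flatMap(tokenize)

  // array-column behaviour: one output row per input row, element type Seq[String]
  def asArrays(rows: Seq[String]): Seq[Seq[String]] = rows.map(tokenize)

  def main(args: Array[String]): Unit = {
    val rows = Seq("a, b", "c")
    println(exploded(rows))
    println(asArrays(rows))
  }
}
```

The second form keeps the row count unchanged, which is also why the other columns of the DataFrame stay aligned with the new column.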

Unable to Analyse data

val patterns = ctx.getBroadcastState(patternStateDescriptor)
The imports I made
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{MapStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
Here's the code
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
val patternStream = new FlinkKafkaConsumer010("patterns", new SimpleStringSchema, properties)
val patterns = env.addSource(patternStream)
var patternData = patterns.map {
str =>
val splitted_str = str.split(",")
PatternStream(splitted_str(0).trim, splitted_str(1).trim, splitted_str(2).trim)
val logsStream = new FlinkKafkaConsumer010("logs", new SimpleStringSchema, properties)
// logsStream.setStartFromEarliest()
val logs = env.addSource(logsStream)
var data = logs.map {
str =>
val splitted_str = str.split(",")
LogsTest(splitted_str.head.trim, splitted_str(1).trim, splitted_str(2).trim)
val keyedData: KeyedStream[LogsTest, String] = data.keyBy(_.metric)
val bcStateDescriptor = new MapStateDescriptor[Unit, PatternStream]("patterns", Types.UNIT, Types.of[PatternStream]) // first type defined is for the key and second data type defined is for the value
val broadcastPatterns: BroadcastStream[PatternStream] = patternData.broadcast(bcStateDescriptor)
val alerts = keyedData
.process(new PatternEvaluator())
// println(alerts.getClass)
// val sinkProducer = new FlinkKafkaProducer010("output", new SimpleStringSchema(), properties)
env.execute("Flink Broadcast State Job")
class PatternEvaluator()
extends KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)] {
private lazy val patternStateDescriptor = new MapStateDescriptor("patterns", classOf[String], classOf[String])
private var lastMetricState: ValueState[String] = _
override def open(parameters: Configuration): Unit = {
val lastMetricDescriptor = new ValueStateDescriptor("last-metric", classOf[String])
lastMetricState = getRuntimeContext.getState(lastMetricDescriptor)
override def processElement(reading: LogsTest,
readOnlyCtx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#ReadOnlyContext,
out: Collector[(String, String, String)]): Unit = {
val metrics = readOnlyCtx.getBroadcastState(patternStateDescriptor)
if (metrics.contains(reading.metric)) {
val metricPattern: String = metrics.get(reading.metric)
val metricPatternValue: String = metrics.get(reading.value)
val lastMetric = lastMetricState.value()
val logsMetric = (reading.metric)
val logsValue = (reading.value)
if (logsMetric == metricPattern) {
if (metricPatternValue == logsValue) {
out.collect((reading.timestamp, reading.value, reading.metric))
override def processBroadcastElement(
update: PatternStream,
ctx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#Context,
out: Collector[(String, String, String)]
): Unit = {
val patterns = ctx.getBroadcastState(patternStateDescriptor)
if (update.metric == "IP") {
patterns.put(update.metric /*,update.operator*/ , update.value)
// else if (update.metric == "username"){
// patterns.put(update.metric, update.value)
// }
// else {
// println("No required data found")
// }
// }
Sample data (logs stream):
"21/09/98","IP", ""
Pattern Stream
I am unable to get the desired result, i.e. 21/09/98,IP,
There is no error as of now; the job just does not analyse the data.
The code does read both streams (checked).
One common source of trouble in cases like this is that the API offers no control over the order in which the patterns and the data are ingested. It could be that processElement is being called before processBroadcastElement.
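A common way to cope with this ordering problem is to buffer records that arrive before their pattern and re-evaluate them once the pattern shows up. The following is only a plain-Scala simulation of that idea, not Flink code: `Log`, `Pattern`, `Evaluator`, `onLog` and `onPattern` are hypothetical names, and in a real job the buffer would have to live in Flink state rather than in a local collection.

```scala
import scala.collection.mutable

object BufferUntilPattern {
  final case class Log(timestamp: String, metric: String, value: String)
  final case class Pattern(metric: String, value: String)

  // Stand-in for the two callbacks of a broadcast process function.
  class Evaluator {
    private val patterns = mutable.Map.empty[String, String] // metric -> expected value
    private val pending  = mutable.ListBuffer.empty[Log]      // logs seen before their pattern
    val alerts           = mutable.ListBuffer.empty[Log]

    // Plays the role of processElement
    def onLog(log: Log): Unit =
      patterns.get(log.metric) match {
        case Some(expected) => if (expected == log.value) alerts += log
        case None           => pending += log // no pattern yet: keep for later
      }

    // Plays the role of processBroadcastElement
    def onPattern(p: Pattern): Unit = {
      patterns.put(p.metric, p.value)
      val (ready, rest) = pending.toList.partition(_.metric == p.metric)
      pending.clear()
      pending ++= rest
      ready.foreach(onLog) // re-evaluate the buffered logs for this metric
    }
  }
}
```

With this arrangement a log that arrives first is not lost; it is matched as soon as the corresponding pattern is registered.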

Empty Iterator : Asynchronous cassandra write

I am trying to implement asynchronous Cassandra writes on objects (not RDDs) using TableWriter. Code snippet below:
class CassandraOperations[T] extends Serializable with Logging {
/** Saves the data from an object or an Iterator of objects to a Cassandra table asynchronously. Uses the specified column names.
  * You can check whether this action has completed via a callback on the returned Future.
  */
def saveToCassandraAsync(
cc: CassandraConnector,
keyspaceName: String,
tableName: String,
columns: ColumnSelector = AllColumns,
data: Iterator[T],
writeConf: WriteConf = WriteConf(ttl = TTLOption.constant(80000)))(implicit rwf: RowWriterFactory[T]):
Future[Unit] = {
implicit val ec = ExecutionContext.global
val writer = TableWriter(cc, keyspaceName, tableName, columns, writeConf)
val futureAction = Future(writer.write(TaskContext.get(), data: Iterator[T]))
And then wait using:
Await.result(resultFuture, TIMEOUT seconds)
The data is available when execution reaches the call to write on this line:
val futureAction = Future(writer.write(TaskContext.get(), data: Iterator[T]))
But data is empty when execution reaches the definition of the function, def write(taskContext: TaskContext, data: Iterator[T]):
def write(taskContext: TaskContext, data: Iterator[T]) {
val updater = OutputMetricsUpdater(taskContext, writeConf)
connector.withSessionDo { session =>
val protocolVersion = session.getCluster.getConfiguration.getProtocolOptions.getProtocolVersion
val rowIterator = new CountingIterator(data)
val stmt = prepareStatement(session).setConsistencyLevel(writeConf.consistencyLevel)
val queryExecutor = new QueryExecutor(
Some(updater.batchFinished(success = true, _, _, _)),
Some(updater.batchFinished(success = false, _, _, _)))
val routingKeyGenerator = new RoutingKeyGenerator(tableDef, columnNames)
val batchType = if (isCounterUpdate) Type.COUNTER else Type.UNLOGGED
val boundStmtBuilder = new BoundStatementBuilder(
protocolVersion = protocolVersion,
ignoreNulls = writeConf.ignoreNulls)
val batchStmtBuilder = new BatchStatementBuilder(
val batchKeyGenerator = batchRoutingKey(session, routingKeyGenerator) _
val batchBuilder = new GroupingBatchBuilder(
val rateLimiter = new RateLimiter((writeConf.throughputMiBPS * 1024 * 1024).toLong, 1024 * 1024)
logDebug(s"Writing data partition to $keyspaceName.$tableName in batches of ${writeConf.batchSize}.")
for (stmtToWrite <- batchBuilder) {
assert(stmtToWrite.bytesCount > 0)
if (!queryExecutor.successful)
throw new IOException(s"Failed to write statements to $keyspaceName.$tableName.")
val duration = updater.finish() / 1000000000d
logInfo(f"Wrote ${rowIterator.count} rows to $keyspaceName.$tableName in $duration%.3f s.")
if (boundStmtBuilder.logUnsetToNullWarning) {
So inside write I see an empty iterator.
Please advise on what the issue could be.
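The thread does not establish the cause, but one thing worth ruling out is that something traverses the iterator before write does (a log statement, a size check, or a debugger evaluation). A Scala Iterator can only be traversed once. The sketch below uses hypothetical helper names and only illustrates that behaviour; it does not touch the Cassandra connector.

```scala
object IteratorOnce {
  // Counts the elements visible on a first and on a second traversal.
  def traverseTwice[T](data: Iterator[T]): (Int, Int) = {
    val first  = data.size // consumes the iterator
    val second = data.size // nothing is left to traverse
    (first, second)
  }

  // Materialising the data first gives a collection that can hand out
  // a fresh iterator for every traversal.
  def traverseTwiceSafely[T](data: Iterator[T]): (Int, Int) = {
    val buffered = data.toVector
    (buffered.iterator.size, buffered.iterator.size)
  }

  def main(args: Array[String]): Unit = {
    println(traverseTwice(List(1, 2, 3).iterator))
    println(traverseTwiceSafely(List(1, 2, 3).iterator))
  }
}
```

If this turns out to be the issue, passing a materialised collection's `.iterator` to the writer at the last moment avoids the problem.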

Union Dataframes based on condition in spark(scala)

I have a folder containing four subfolders, each of which holds parquet files:
Folder -> A.parquet, B.parquet, C.parquet, D.parquet (subfolders). I want to union the data frames for the folder names I pass to a method.
This is my code so far:
val df = listDirectoriesGetWantedFile(folderPath,sqlContext,A,B)
def listDirectoriesGetWantedFile(folderPath: String, sqlContext: SQLContext, str1: String, str2: String): DataFrame = {
var df: DataFrame = null
val sb = new StringBuilder
var done = false
val path = new Path(folderPath)
if (fileSystem.isDirectory(path)) {
var files = fileSystem.listStatus(path)
for (file <- files) {
if (file.getPath.getName.contains(str) && !done) {
done = true
} else if (file.getPath.getName.contains(str2)) {
But I need to split sb and then union the data frames, and I cannot find a way to do that. How can I approach and solve this?
If I understand your question, you could simply do something like this:
def listDirectoriesGetWantedFile(path: String,
                                 sqlContext: SQLContext,
                                 folder1: String,
                                 folder2: String): DataFrame = {
  val df1 = sqlContext.read.parquet(s"$path/$folder1")
  val df2 = sqlContext.read.parquet(s"$path/$folder2")
  df1.unionAll(df2) // on Spark 2.x and later, `union` is the equivalent call
}
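By using the Hadoop FileSystem API, you can check whether each folder exists. So you may try something like this:
def listDirectoriesGetWantedFile(path: String, sqlContext: SQLContext, folders: Seq[String]): DataFrame = {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  // Keep only the sub-folders that actually exist
  val existingFolders = folders
    .map(folder => new Path(s"$path/$folder"))
    .filter(p => fs.exists(p))
    .map(_.toString)
  if (existingFolders.isEmpty) {
    sqlContext.emptyDataFrame // assumption: nothing to read, so return an empty DataFrame
  } else {
    // parquet(paths: String*) reads all given folders into one DataFrame
    sqlContext.read.parquet(existingFolders: _*)
  }
}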
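The two steps above (keep only the folders that exist, then combine whatever was read) can be sketched without Spark or Hadoop. This is a simplified illustration under assumptions: it uses java.nio on the local file system instead of the Hadoop FileSystem, plain Seq values stand in for DataFrames, and the names `existing` and `unionAll` are hypothetical.

```scala
import java.nio.file.{Files, Path}

object ExistingFolders {
  // Keeps only those requested sub-folders that exist under `base`.
  def existing(base: Path, folders: Seq[String]): Seq[Path] =
    folders.map(name => base.resolve(name)).filter(p => Files.isDirectory(p))

  // Combines any number of parts; returns None when there is nothing to combine.
  // With DataFrames the combining function would be a union instead of ++.
  def unionAll[A](parts: Seq[Seq[A]]): Option[Seq[A]] =
    parts.reduceOption(_ ++ _)
}
```

Using `reduceOption` makes the empty case explicit, so the caller decides what an empty result should be.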

Creating serializable objects from Scala source code at runtime

To embed Scala as a "scripting language", I need to be able to compile text fragments to simple objects, such as Function0[Unit] that can be serialised to and deserialised from disk and which can be loaded into the current runtime and executed.
How would I go about this?
Say for example, my text fragment is (purely hypothetical):
This might be wrapped into the following complete text:
package myapp.userscripts
import myapp.DSL._
object UserFunction1234 extends Function0[Unit] {
def apply(): Unit = {
What comes next? Should I use IMain to compile this code? I don't want to use the normal interpreter mode, because the compilation should be "context-free" and not accumulate requests.
What I need to get hold of from the compilation is, I guess, the binary class file? In that case, serialisation is straightforward (a byte array). How would I then load that class into the runtime and invoke the apply method?
What happens if the code compiles to multiple auxiliary classes? The example above contains a closure _.open(). How do I make sure I "package" all those auxiliary things into one object to serialize and class-load?
Note: Given that Scala 2.11 is imminent and the compiler API probably changed, I am happy to receive hints as how to approach this problem on Scala 2.11
Here is one idea: use a regular Scala compiler instance. Unfortunately it seems to require the use of hard disk files both for input and output. So we use temporary files for that. The output will be zipped up in a JAR which will be stored as a byte array (that would go into the hypothetical serialization process). We need a special class loader to retrieve the class again from the extracted JAR.
The following assumes Scala 2.10.3 with the scala-compiler library on the class path:
import scala.tools.nsc
import java.io._
import scala.annotation.tailrec
Wrapping user provided code in a function class with a synthetic name that will be incremented for each new fragment:
val packageName = "myapp"
var userCount = 0
def mkFunName(): String = {
  val c = userCount
  userCount += 1
  s"Fun$c" // synthetic class name; the exact prefix is an assumption, any unique name works
}
def wrapSource(source: String): (String, String) = {
  val fun = mkFunName()
  val code = s"""package $packageName
                |class $fun extends Function0[Unit] {
                |  def apply(): Unit = {
                |    $source
                |  }
                |}
                |""".stripMargin
  (fun, code)
}
A function to compile a source fragment and return the byte array of the resulting jar:
/** Compiles a source code consisting of a body which is wrapped in a `Function0`
  * apply method, and returns the function's class name (without package) and the
  * raw jar file produced in the compilation.
  */
def compile(source: String): (String, Array[Byte]) = {
  val set = new nsc.Settings
  // Temporary output directory for the generated class files
  val d = File.createTempFile("temp", ".out")
  d.delete(); d.mkdir()
  set.d.value = d.getPath
  set.usejavacp.value = true
  val compiler = new nsc.Global(set)
  // Temporary source file holding the wrapped fragment
  val f = File.createTempFile("temp", ".scala")
  val out = new BufferedOutputStream(new FileOutputStream(f))
  val (fun, code) = wrapSource(source)
  out.write(code.getBytes("UTF-8"))
  out.flush(); out.close()
  val run = new compiler.Run()
  run.compile(List(f.getPath))
  f.delete()
  val bytes = packJar(d)
  deleteDir(d)
  (fun, bytes)
}
def deleteDir(base: File): Unit = {
  base.listFiles().foreach { f =>
    if (f.isFile) f.delete()
    else deleteDir(f)
  }
  base.delete()
}
Note: Doesn't handle compiler errors yet!
The packJar method uses the compiler output directory and produces an in-memory jar file from it:
// cf. http://stackoverflow.com/questions/1281229
def packJar(base: File): Array[Byte] = {
  import java.util.jar._
  val mf = new Manifest
  mf.getMainAttributes.put(Attributes.Name.MANIFEST_VERSION, "1.0")
  val bs = new java.io.ByteArrayOutputStream
  val out = new JarOutputStream(bs, mf)
  // Adds a file or, recursively, a directory to the jar
  def add(prefix: String, f: File): Unit = {
    val name0 = prefix + f.getName
    val name = if (f.isDirectory) name0 + "/" else name0
    val entry = new JarEntry(name)
    entry.setTime(f.lastModified())
    out.putNextEntry(entry)
    if (f.isFile) {
      val in = new BufferedInputStream(new FileInputStream(f))
      try {
        val buf = new Array[Byte](1024)
        @tailrec def loop(): Unit = {
          val count = in.read(buf)
          if (count >= 0) {
            out.write(buf, 0, count)
            loop()
          }
        }
        loop()
      } finally {
        in.close()
      }
    }
    out.closeEntry()
    if (f.isDirectory) f.listFiles.foreach(add(name, _))
  }
  base.listFiles().foreach(add("", _))
  out.close()
  bs.toByteArray
}
A utility function that takes the byte array found in deserialization and creates a map from class names to class byte code:
def unpackJar(bytes: Array[Byte]): Map[String, Array[Byte]] = {
  import java.util.jar._
  import scala.annotation.tailrec
  val in = new JarInputStream(new ByteArrayInputStream(bytes))
  val b = Map.newBuilder[String, Array[Byte]]
  @tailrec def loop(): Unit = {
    val entry = in.getNextJarEntry
    if (entry != null) {
      if (!entry.isDirectory) {
        val name = entry.getName
        // cf. http://stackoverflow.com/questions/8909743
        val bs = new ByteArrayOutputStream
        var i = 0
        while (i >= 0) {
          i = in.read()
          if (i >= 0) bs.write(i)
        }
        val bytes = bs.toByteArray
        b += mkClassName(name) -> bytes
      }
      loop()
    }
  }
  loop()
  in.close()
  b.result()
}
// Turns a jar entry path such as "a/b/C.class" into the class name "a.b.C"
def mkClassName(path: String): String = {
  require(path.endsWith(".class"))
  path.substring(0, path.length - 6).replace("/", ".")
}
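The pack and unpack steps can be checked in isolation, without involving the compiler. The following is a minimal sketch of an in-memory jar round trip using only java.util.jar; the object and method names and the entry name "myapp/Fun0.class" are hypothetical, and the bytes are dummy data rather than real byte code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.jar.{JarEntry, JarInputStream, JarOutputStream}

object JarRoundTrip {
  // Writes each (entry name -> bytes) pair into an in-memory jar.
  def pack(entries: Map[String, Array[Byte]]): Array[Byte] = {
    val bs  = new ByteArrayOutputStream
    val out = new JarOutputStream(bs)
    entries.foreach { case (name, bytes) =>
      out.putNextEntry(new JarEntry(name))
      out.write(bytes)
      out.closeEntry()
    }
    out.close()
    bs.toByteArray
  }

  // Reads every entry of an in-memory jar back into a map.
  def unpack(jar: Array[Byte]): Map[String, Array[Byte]] = {
    val in = new JarInputStream(new ByteArrayInputStream(jar))
    val b  = Map.newBuilder[String, Array[Byte]]
    var entry = in.getNextJarEntry
    while (entry != null) {
      val bs  = new ByteArrayOutputStream
      val buf = new Array[Byte](1024)
      var n   = in.read(buf)
      while (n >= 0) { bs.write(buf, 0, n); n = in.read(buf) }
      b += entry.getName -> bs.toByteArray
      entry = in.getNextJarEntry
    }
    in.close()
    b.result()
  }

  // Entry path to class name, e.g. "myapp/Fun0.class" becomes "myapp.Fun0"
  def classNameOf(path: String): String =
    path.stripSuffix(".class").replace('/', '.')
}
```

Reading with a buffer instead of byte by byte is a design choice for the sketch; both approaches yield the same bytes.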
A suitable class loader:
class MemoryClassLoader(map: Map[String, Array[Byte]]) extends ClassLoader {
  override protected def findClass(name: String): Class[_] =
    map.get(name).map { bytes =>
      println(s"defineClass($name, ...)")
      defineClass(name, bytes, 0, bytes.length)
    } .getOrElse(super.findClass(name)) // throws exception
}
And a test case which contains additional classes (closures):
val exampleSource =
  """val xs = List("hello", "world")
    |println(xs.map(_.capitalize).mkString(" "))
    |""".stripMargin
def test(fun: String, cl: ClassLoader): Unit = {
  val clName = s"$packageName.$fun"
  println(s"Resolving class '$clName'...")
  val clazz = Class.forName(clName, true, cl)
  val x = clazz.newInstance().asInstanceOf[() => Unit]
  println("Invoking 'apply':")
  x() // run the compiled fragment
}
locally {
  val (fun, bytes) = compile(exampleSource)
  val map = unpackJar(bytes)
  println("Classes found:")
  map.keys.foreach(k => println(s" '$k'"))
  val cl = new MemoryClassLoader(map)
  test(fun, cl) // should call `defineClass`
  test(fun, cl) // should find cached class
}