Beam pipeline: Kafka to HDFS by time buckets - scala

I am trying to build a very simple pipeline that reads a stream of events from Kafka (KafkaIO.read) and writes the very same events to HDFS, bucketing the events by hour (the hour is taken from a timestamp field of the event, not from processing time).
No assumption can be made about the timestamps of the events (they could span multiple days, even if 99% of the time they arrive in near real-time) and there is absolutely no information about the order of the events. My first attempt is a pipeline running in processing time.
My pipeline looks like this:
val kafkaReader = KafkaIO.read[String, String]()
.withBootstrapServers(options.getKafkaBootstrapServers)
.withTopic(options.getKafkaInputTopic)
.withKeyDeserializer(classOf[StringDeserializer])
.withValueDeserializer(classOf[StringDeserializer])
.updateConsumerProperties(
ImmutableMap.of("receive.buffer.bytes", Integer.valueOf(16 * 1024 * 1024))
)
.commitOffsetsInFinalize()
.withoutMetadata()
val keyed = p.apply(kafkaReader)
.apply(Values.create[String]())
.apply(new WindowedByWatermark(options.getBatchSize))
.apply(ParDo.of[String, CustomEvent](new CustomEvent))
val outfolder = FileSystems.matchNewResource(options.getHdfsOutputPath, true)
keyed.apply(
"write to HDFS",
FileIO.writeDynamic[Integer, CustomEvent]()
.by(new SerializableFunction[CustomEvent, Integer] {
override def apply(input: CustomEvent): Integer = {
// truncate the event timestamp to the start of its hour and use the epoch second as the bucket key
val eventZeroHoured = new Instant(input.eventTime * 1000L).toDateTime.withMinuteOfHour(0).withSecondOfMinute(0)
(eventZeroHoured.getMillis / 1000).toInt
}
})
.via(Contextful.fn(new SerializableFunction[CustomEvent, String] {
override def apply(input: CustomEvent): String = {
convertEventToStr(input)
}
}), TextIO.sink())
.withNaming(new SerializableFunction[Integer, FileNaming] {
override def apply(bucket: Integer): FileNaming = {
new BucketedFileNaming(outfolder, bucket, withTiming = true)
}
})
.withDestinationCoder(VarIntCoder.of()) // coder for the Integer bucket key
.to(options.getHdfsOutputPath)
.withTempDirectory("hdfs://tlap/tmp/gulptmp")
.withNumShards(1)
.withCompression(Compression.GZIP)
)
And this is my WindowedByWatermark:
class WindowedByWatermark(bucketSize: Int = 5000000) extends PTransform[PCollection[String], PCollection[String]] {
val window: Window[String] = Window
.into[String](FixedWindows.of(Duration.standardMinutes(10)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(bucketSize))
)
.withAllowedLateness(Duration.standardMinutes(30))
.discardingFiredPanes()
override def expand(input: PCollection[String]): PCollection[String] = {
input.apply("window", window)
}
}
The pipeline runs flawlessly, but it suffers from incredibly high back pressure during the write phase (the GroupByKey caused by writeDynamic). Most of the events arrive in real-time, hence they belong to the same hour. I also tried bucketing the data by hour and minute, without much help.
After days of pain, I decided to replicate the same pipeline in Flink using a BucketingSink, and the performance is excellent.
val stream = env
.addSource(new FlinkKafkaConsumer011[String](options.kafkaInputTopic, new SimpleStringSchema(), properties))
.addSink(bucketingSink(options.hdfsOutputPath, options.batchSize))
According to my analysis (even using JMX), the threads in Beam are waiting during the write phase to HDFS (and this causes the pipeline to pause the retrieval of data from Kafka).
I have therefore the following questions:
Is it possible to push the bucketing down in Beam as well, the way Flink's BucketingSink does?
Is there a smarter way to achieve the same in Beam?

Related

Using flatMapGroupsWithState with forEachBatch

I have a streaming Spark app in which I remove duplicate rows from the running stream using stateful aggregation with flatMapGroupsWithState.
But when I use foreachBatch on the stream and apply the same functions I created to remove duplicates, it treats each batch as an independent entity and deduplicates only within that single micro-batch.
Code:
case class User(name: String, userId: Integer)
case class StateClass(totalUsers: Int)
def removeDuplicates(inputData: Dataset[User]): Dataset[User] = {
inputData
.groupByKey(_.userId)
.flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(removeDuplicatesInternal)
}
def removeDuplicatesInternal(id: Integer, newData: Iterator[User], state: GroupState[StateClass]): Iterator[User] = {
if (state.hasTimedOut) {
state.remove() // Removing state since no same UserId in 4 hours
return Iterator()
}
if (newData.isEmpty)
return Iterator()
if (!state.exists) {
val firstUserData = newData.next()
val newState = StateClass(1) // Total count = 1 initially
state.update(newState)
state.setTimeoutDuration("4 hours")
Iterator(firstUserData) // Returning UserData first time
}
else {
val newState = StateClass(state.get.totalUsers + 1)
state.update(newState)
state.setTimeoutDuration("4 hours")
Iterator() // Returning empty since state already exists (Already sent this UserData before)
}
}
The input stream I used is userStream.
The above function works fine when I pass the stream to it directly.
val results = removeDuplicates(userStream)
But when I do something like:
userStream
.writeStream
.foreachBatch { (batch, batchId) => writeBatch(batch) }
def writeBatch(batch: Dataset[User]): Unit = {
val distinctBatch = removeDuplicates(batch)
}
I get distinct user data only within that micro-batch, but I want it to be distinct overall, across the 4-hour timeout.
For example:
If the 1st batch has userIds (1, 3, 5, 1) and the 2nd batch has userIds (2, 3, 1).
Expected Behaviour:
Output: 1st Batch = (1, 3, 5) and 2nd Batch = (2)
My Output: 1st Batch = (1, 3, 5) and 2nd Batch = (2, 3, 1)
How can I make it use the same state throughout? Right now it treats each micro-batch differently and creates a separate state for each batch.
PS: The problem is not limited to removing duplicates on the stream; I need to use foreachBatch for some computations on micro-batches, and I need to remove duplicates before writing.
For a runnable test script, refer to this: https://ideone.com/nZ5pq2
The behavior is actually the expected one.
flatMapGroupsWithState leverages the state store only when the query is a streaming one. (For a batch query it doesn't even create a state store, because it isn't necessary.) Once you call foreachBatch, the provided batch parameter is no longer continuous across batches - consider it a dataset from a batch query, where the batch means "a" single micro-batch.
So you still need to pass your streaming dataset to removeDuplicates, or come up with your own way to deduplicate records across batches inside foreachBatch.
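A minimal sketch of the first option, assuming the userStream, removeDuplicates and writeBatch definitions from the question: deduplicate on the streaming Dataset (so flatMapGroupsWithState keeps its state store across micro-batches) and only then hand the result to foreachBatch for the per-batch computations.
import org.apache.spark.sql.Dataset

// Deduplicate on the streaming Dataset, then run the per-batch work on the result.
val query = removeDuplicates(userStream)
  .writeStream
  .outputMode("append")
  .foreachBatch { (batch: Dataset[User], batchId: Long) =>
    writeBatch(batch) // each batch now only contains userIds not seen in the last 4 hours
  }
  .start()

query.awaitTermination()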

Variable number of partitions and batch for loadDataFrame

I'm currently using the neo4j-spark-connector for loading data from the graph into a DataFrame.
My Spark job contains different queries on different parts of the graph. It can also run against different versions of the graph with more or less data, which we don't know upfront. The problem is that I don't know how to set a correct number of partitions and a batch size for each query before knowing how many results it will return.
val result = neo.cypher("MATCH (p:Person)-[:KNOWS]->(other:Person) RETURN p.Name, other.Name")
.partitions(partitions) // PROBLEM
.batch(batch) // PROBLEM
.loadDataFrame
I came up with a temporary solution where I count the occurrences of the most frequent relationship type and divide this by the number of partitions.
val batch: Long = {
// partitions default: 200
println("Calculating batch size...")
val batchSizeCount = if (relationShip == null) {
neo.cypher("MATCH (n)-[r:MOST_OCC_REL]->(m) RETURN COUNT(r)").loadRdd[Long].collect.head
} else {
// assumption: count the explicitly requested relationship type instead
neo.cypher(s"MATCH (n)-[r:$relationShip]->(m) RETURN COUNT(r)").loadRdd[Long].collect.head
}
val newPartitions = if (partitions > 5) {
partitions - 5
} else {
partitions
}
val batchSize = if (batchSizeCount < newPartitions) {
1L
} else {
batchSizeCount / newPartitions
}
println("Batch size for Neo4j: " + batchSize)
batchSize
}
Despite the overhead, this works for most of the (simple) queries, but for more complex queries it does not seem to be correct all the time.
I need to make sure that the reading of source data always uses the appropriate partitions/batch configuration for each run/query.
Is there maybe a better way to determine the correct partitions and batch size without knowing the amount of data a query will return?

How to process multiple parquet files individually in a for loop?

I have multiple Parquet files (around 1000). I need to load each one of them, process it, and save the result to a Hive table. I have a for loop, but it only seems to work with 2 or 5 files, not with 1000; it seems Spark tries to load them all at the same time, and I need it to process them individually in the same Spark session.
I tried using a for loop, then a foreach, and I used unpersist(), but it fails anyway.
import sqlContext.implicits._ // needed for the $"col" syntax below (auto-imported in spark-shell)

val ids = get_files_IDs()
ids.foreach(id => {
println("Starting file " + id)
var df = load_file(id)
var values_df = calculate_values(df)
values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
df.unpersist()
})
def get_files_IDs(): List[String] = {
var ids = sqlContext.sql("SELECT CAST(id AS varchar(10)) AS id FROM table.ids WHERE id IS NOT NULL")
var ids_list = ids.select("id").map(r => r.getString(0)).collect().toList
return ids_list
}
def calculate_values(df:org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame ={
val values_id = df.groupBy($"id", $"date", $"hr_time").agg(avg($"value_a") as "avg_val_a", avg($"value_b") as "avg_value_b")
return values_id
}
def load_file(id:String): org.apache.spark.sql.DataFrame = {
val df = sqlContext.read.parquet("/user/hive/wh/table.db/parquet/values_for_" + id + ".parquet")
return df
}
What I would expect is for Spark to load the file for ID 1, process the data, save it to the Hive table, then discard that data and continue with the second ID, and so on until it finishes the 1000 files, instead of trying to load everything at the same time.
Any help would be very much appreciated! I've been stuck on this for days. I'm using Spark 1.6 with Scala. Thank you!!
EDIT: Added the definitions. Hope it helps to get a better view. Thank you!
OK, so after a lot of inspection I realised that the process was working fine. It processed each file individually and saved the results. The issue was that in some very specific cases the process was taking way too long.
So I can tell that with a for loop or a foreach you can process multiple files and save the results without problems. Unpersisting and clearing the cache does help with performance.
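For reference, a minimal sketch of that pattern under the question's setup (Spark 1.6, reusing the load_file and calculate_values helpers defined above), releasing cached data after each file:
ids.foreach { id =>
  println("Starting file " + id)
  val df = load_file(id)
  val values_df = calculate_values(df)
  values_df.write.mode(SaveMode.Overwrite).saveAsTable("table.values_" + id)
  df.unpersist()          // drop anything cached for this file
  sqlContext.clearCache() // clear the rest of the cache before the next iteration
}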

SparkSQL performance issue with collect method

We are currently facing a performance issue in a Spark SQL application written in Scala. The application flow is described below.
1. The Spark application reads a text file from an input HDFS directory.
2. It creates a DataFrame on top of the file by programmatically specifying the schema. This DataFrame is an exact replica of the input file kept in memory and has around 18 columns.
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
3. It creates a filtered DataFrame from the DataFrame constructed in step 2, containing only unique account numbers (obtained with distinct).
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
4. Using the two DataFrames constructed in steps 2 & 3, we get all the records that belong to one account number and run some JSON parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
5. Finally, the JSON-parsed data is put into an HBase table.
We are facing performance issues when calling the collect method on these DataFrames, because collect fetches all the data to a single node and does the processing there, losing the benefit of parallel processing.
Also, in the real scenario we can expect around 10 billion records of data, so collecting all those records onto the driver node might crash the program itself due to memory or disk space limitations.
I don't think the take method, which fetches a limited number of records at a time, can be used in our case: we have to get all the unique account numbers from the whole dataset, so I am not sure it will suit our requirements.
I would appreciate any help on avoiding the collect calls and on other best practices to follow. Code snippets/suggestions/git links will be very helpful if anyone has faced similar issues.
Code snippet
val eqpSchemaString = "accountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)));
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....))
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema);
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}
The approach usually taken in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism of Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means that every executor opens a connection (which is not serializable, but lives within the boundaries of the function and thus does not need to be sent across the network) and independently sends its contents to HBase, without any need to collect all the data on the driver (or any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
foreachPartition { rows =>
val conn = ??? // Create HBase connection
for (row <- rows) { // Loop over the iterator
val data = parseJson(row) // Your parsing logic
??? // Use 'conn' to save 'data'
}
}
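For completeness, one possible way to fill in the ??? placeholders - a sketch only, assuming the HBase 1.x client API, a hypothetical table named "events" with column family "cf", and keeping the JSON parsing abstract as in the snippet above:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

spark.read.csv(inputPath).foreachPartition { rows =>
  // One connection per partition, created on the executor so it is never serialized
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("events")) // hypothetical table name
  try {
    for (row <- rows) {
      val json = parseJson(row) // your parsing logic, assumed to return the JSON string to store
      val put = new Put(Bytes.toBytes(row.getString(0))) // row key: the account number column
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes(json))
      table.put(put)
    }
  } finally {
    table.close()
    conn.close()
  }
}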
You should avoid collect in your code if you have a large data set.
collect() returns all the elements of the dataset as an array to the driver program. It is usually only useful after a filter or another operation that returns a sufficiently small subset of the data.
It can also cause the driver to run out of memory, because collect() fetches the entire RDD/DataFrame to a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}

Flink: join file with kafka stream

I have a problem I can't really figure out.
So I have a Kafka stream that contains some data like this:
{"adId":"9001", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
And I want to replace 'adId' with another value 'bookingId'.
This value is located in a CSV file, but I can't really figure out how to get it working.
Here is my mapping CSV file:
9001;8
9002;10
So my output would ideally be something like
{"bookingId":"8", "eventAction":"start", "eventType":"track", "eventValue":"", "timestamp":"1498118549550"}
This file can be refreshed at least once every hour, so the job should pick up changes to it.
I currently have this code which doesn't work for me:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(30000); // create a checkpoint every 30 seconds
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
DataStream<String> adToBookingMapping = env.readTextFile(parameters.get("adToBookingMapping"));
DataStream<Tuple2<Integer,Integer>> input = adToBookingMapping.flatMap(new Tokenizer());
//Kafka Consumer
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", parameters.get("bootstrap.servers"));
properties.setProperty("group.id", parameters.get("group.id"));
FlinkKafkaConsumer010<ObjectNode> consumer = new FlinkKafkaConsumer010<>(parameters.get("inbound_topic"), new JSONDeserializationSchema(), properties);
consumer.setStartFromGroupOffsets();
consumer.setCommitOffsetsOnCheckpoints(true);
DataStream<ObjectNode> logs = env.addSource(consumer);
DataStream<Tuple4<Integer,String,Integer,Float>> parsed = logs.flatMap(new Parser());
// output -> bookingId, action, impressions, sum
DataStream<Tuple4<Integer, String,Integer,Float>> joined = runWindowJoin(parsed, input, 3);
public static DataStream<Tuple4<Integer, String, Integer, Float>> runWindowJoin(DataStream<Tuple4<Integer, String, Integer, Float>> parsed,
DataStream<Tuple2<Integer, Integer>> input,long windowSize) {
return parsed.join(input)
.where(new ParsedKey())
.equalTo(new InputKey())
.window(TumblingProcessingTimeWindows.of(Time.of(windowSize, TimeUnit.SECONDS)))
//.window(TumblingEventTimeWindows.of(Time.milliseconds(30000)))
.apply(new JoinFunction<Tuple4<Integer, String, Integer, Float>, Tuple2<Integer, Integer>, Tuple4<Integer, String, Integer, Float>>() {
private static final long serialVersionUID = 4874139139788915879L;
@Override
public Tuple4<Integer, String, Integer, Float> join(
Tuple4<Integer, String, Integer, Float> first,
Tuple2<Integer, Integer> second) {
return new Tuple4<Integer, String, Integer, Float>(second.f1, first.f1, first.f2, first.f3);
}
});
}
The code only runs once and then stops, so it doesn't convert new entries arriving in Kafka using the CSV file. Any ideas on how I could process the Kafka stream with the latest values from my CSV file?
Kind regards,
darkownage
Your goal appears to be to join streaming data with a slow-changing catalog (i.e. a side input). I don't think the join operation is useful here, because it doesn't store the catalog entries across windows. Also, the text file is a bounded input whose lines are read only once.
Consider using connect to create a connected stream, and store the catalog data as managed state so the event stream can perform lookups against it. The operator's parallelism would need to be 1.
You may find a better solution by researching 'side inputs', looking at the solutions that people use today. See FLIP-17 and Dean Wampler's talk at Flink Forward.
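A minimal sketch of the connect approach, written in Scala like the rest of this page, with a hypothetical catalog stream of (adId, bookingId) pairs; for brevity it keeps the catalog in a plain in-memory map rather than Flink managed state, and the enrichment simply swaps adId for bookingId in the JSON:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector
import com.fasterxml.jackson.databind.node.ObjectNode

class EnrichWithCatalog extends CoFlatMapFunction[ObjectNode, (Int, Int), ObjectNode] {
  // For fault tolerance this should live in managed operator state; a map keeps the sketch short.
  private val adToBooking = scala.collection.mutable.Map[Int, Int]()

  // Event side: replace adId with bookingId when a mapping is known, otherwise drop the event.
  override def flatMap1(event: ObjectNode, out: Collector[ObjectNode]): Unit = {
    val adId = event.get("adId").asText().toInt
    adToBooking.get(adId).foreach { bookingId =>
      event.remove("adId")
      event.put("bookingId", bookingId.toString)
      out.collect(event)
    }
  }

  // Catalog side: remember the latest mapping for each adId.
  override def flatMap2(mapping: (Int, Int), out: Collector[ObjectNode]): Unit = {
    adToBooking(mapping._1) = mapping._2
  }
}

// logs: DataStream[ObjectNode] from Kafka; catalog: DataStream[(Int, Int)] for the mappings
// (note that readTextFile reads the CSV only once, so a periodically re-reading source is needed).
// Parallelism 1 so a single operator instance sees the whole catalog.
val enriched = logs
  .connect(catalog)
  .flatMap(new EnrichWithCatalog)
  .setParallelism(1)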