Spark 2.3.1 Structured Streaming Input Rate - scala

I wonder if there is a way to specify the size of the mini-batch in Spark Structured streaming. That is rather than only stating the mini-batch interval (Triggers), I would like to state how many Row can be in a mini-batch (DataFrame) per interval.
Is there a way to do that ?
Aside from the general capability to do that, I particularily need to apply that in testing scenario, where i have an MemoryStream. I would like Spark to consume a certain amount of data from the MemoryStream, instead of taking all of it at once, to actually see how the the overall application behave. My understanding is that the MemoryStream data structure needs to be filled before launching the job on it. Hence, how can i see the mini-batch processing behavior, is spark is able to ingest the entire content of the MemoryStream within the interval that I give ?
EDIT1
In the Kafka Integration I have found the following:
maxOffsetsPerTrigger: Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
But that is just for KAFKA integration. I have also seen
maxFilesPerTrigger: maximum number of new files to be considered in every trigger
So it seems things are defined per source types. Hence, is there a way to control how data is consumed from MEMORYSTREAM[ROW] ?

Look for below guys they can solve your problem:
1.spark.streaming.backpressure.initialRate
2.spark.streaming.backpressure.enabled

Related

Kafka Streams Yearly time Window

There is a requirement in one of the applications that we are working on is, aggregation to happen on a windowed manner and the windowing size may vary monthly/quarterly/half yearly/yearly.
Kafka streams calendar based timed window supports this and I would like to get more inputs on the performance front to know if it would best suit the need.
The memory consumed by the cache to hold the records till the window size.
Number of records that gets streamed on a daily basis within the window is really high.
Please suggest can Kafka stream processing be used in this case and how about the resources for the memory management.?

Prediction/Estimation of missing intervals inside Apache Kafka process

Goal is to process raw readings (15min and 1h interval) from external remote meters (assets) in real time.
Process is defined using simple Apache Kafka producer/consumer and multiple Spring Boot microservices to deduplicate messages, transform (map) readings to our system (instead external codes insert internal IDS and similar stuff) and insert in TimescaleDB (extension of PostgreSql).
Everything seems fine, but there is requirement to perform real time prediction/estimation of missing intervals.
Simple example for one meter and 15 minute readings:
On day 1 we got all readings. We process them and have them ingested in our DB.
On day 2 we are missing all readings - so process is not even
started for this meter.
On day 3 we again got all readings - but only for day 3. Now we need
to predict that whole day 2 is missing and create empty readings and
then estimate them by some algorithm (that is not that important
now).
My question here, is there any way or idea how to do this without querying existing database in one of the microservices and checking if something is missing?
Is it possible to check previous messages in Kafka topics and based on that do the prediction/estimation (kafka streams? - I don't get them at all) and is that even smart to do, or there is any other way/idea to do it?
Personal opinion disclaimer
It is not reasonably possible to check previous messages in Kafka Streams. If you are hellbent on doing it, you could probably try to seek messages and re-consume them but Kafka will fight you every step on the way. The mental model is, that you are transforming or aggregating data that comes in in real time. If you need to query something about previous data, you ought to have collected that information when that data was coming through.
What could work (rather well even) is to separate the prediction of missing data from the transformation.
Create two consumers for the stream.
Have one topology (or whatever it is that does your transformations already) transform the data and load it back into Kafka and from there to timescaledb.
Have one topology (or another microservice) that does what is needed to predict missing data. Your usecase of backfilling a missing day could be handled by something like a count based on daily windows
Make that trigger your backfilling either as part of that topology or as a subsequent microservice and load that data to timescaledb as well.
Are you already using Kafka Streams for the transformations? This would be a classical usecase.
The recognition of missing data not so much
As far as I understand it does not require high throughput. More the opposite. You want to know if there is no data.
As far as I understand it latency is not a (main) concern.
Kafka Streams could be useful if you need to take automated action within seconds after data stops coming in. But even then, you could just write throughput metrics and trigger alerts in this case.
Pther than that, it is a very stateful problem and stream processing is at its best if you can treat every message separately reduce them in a "standard" manner like sums or counts.
I got the impression, that a delay of a few hours / a day is not that tragic and currently the backfilling might be done manually. In this case the cot of Kafka Streams would outweigh the benefits.

How to improve low throughput groupbykey in dataflow pipeline

I have an apache beam batch pipeline (written in java) to transform raw analytics data from bigquery into an aggregated form. Session records (that might now be extended by the next days worth of page events) and a new set of page events are read from bigquery. The pipeline is then performing a groupByKey operation to group by user id (across both datasets) before the aggregation operation to create session records. The groupByKey operation is performing very slowly (a throughput of ~50 per sec) on the larger dataset (~8400000 records) whereas the throughput for the other input (~1000000 records) was much much higher (~10000 per sec). Does anyone have any advice on how I can troubleshoot and ultimately improve the speed of the operation?
From research online I am aware sometimes it can be more efficient to use a Combine operation rather than groupByKey ( among others this article) but I did not think that would be appropriate for the data I'm grouping (BQ TableRow records).
Further info that might be useful:
The groupByKey is taking the 8400000 into approx 3500000 grouped records with a range of ~2000 to 1 records being combined per key
I fully acknowledge I am lacking a full understanding of the intricacies of apache beam and dataflow and am keen to understand a lot more as I will be building out a number of different pipelines.
Below is a screenshot of the dataflow graph
Individual stages in Beam get fused together when running on Dataflow, meaning the throughput of on stage gets tied to others, so it's fully possible it's not the GroupByKey but rather adjacent DoFns that are causing the slowness. If you click on a step you can see, the step-info tab, a field that gives the wall time for executing a particular step. I would see if there is a particular step in your pipeline around that GroupByKey that has a high walltime.

How can I control number of rows and/or output file size in Spark streaming when writing to HDFS - hive?

Using spark streaming to read and process messages from Kafka and write to HDFS - Hive.
Since I wish to avoid creating many small files which spams the filesystem, I would like to know if there's a way to ensure a minimal file size, and/or ability to force a minimal number of output rows in a file, with the exception of a timeout.
Thanks.
As far as I know, there is no way to control the number of lines in your output files. But you can control the number of output files.
Controlling that and considering your dataset size may help you with your needs, since you can calculate the size of each file in your output. You can do that with the coalesce and repartition commands:
df.coalesce(2).write(...)
df.repartition(2).write(...)
Both of them are used to create the number of partitions given as parameter. So if you set 2, you should have 2 files in your output.
The difference are that with repartition you can both increase and decrease your partitions, while with coalesce you can only decrease.
Also,keep in mind that repartition performs a full shuffle to equally distribute the data among the partitions, which may be resource and time expensive. On the other hand, coalesce does not perform a full shuffle, it combines existing partitions instead.
You can find an awesome explanation in this other answer here

Understanding Spark partitioning

I'm trying to understand how Spark partitions data. Suppose I have an execution DAG like that in the picture (orange boxes are the stages). The two groupBy and the join operations are supposed to be very heavy if the RDD's are not partitioned.
Is it wise then to use .partitonBy(new HashPartitioner(properValue)) to P1, P2, P3 and P4 to avoid shuffle? What's the cost of partitioning an existing RDD? When isn't proper to partition an existing RDD? Doesn't Spark partition my data automatically if I don't specify a partitioner?
Thank you
tl;dr The answers to your questions respectively: Better to partition at the outset if you can; Probably less than not partitioning; Your RDD is partitioned one way or another anyway; Yes.
This is a pretty broad question. It takes up a good portion of our course! But let's try to address as much about partitioning as possible without writing a novel.
As you know, the primary reason to use a tool like Spark is because you have too much data to analyze on one machine without having the fan sound like a jet engine. The data get distributed among all the cores on all the machines in your cluster, so yes, there is a default partitioning--according to the data. Remember that the data are distributed already at rest (in HDFS, HBase, etc.), so Spark just partitions according to the same strategy by default to keep the data on the machines where they already are--with the default number of partitions equal to the number of cores on the cluster. You can override this default number by configuring spark.default.parallelism, and you want this number to be 2-3 per core per machine.
However, typically you want data that belong together (for example, data with the same key, where HashPartitioner would apply) to be in the same partition, regardless of where they are to start, for the sake of your analytics and to minimize shuffle later. Spark also offers a RangePartitioner, or you can roll your own for your needs fairly easily. But you are right that there is an upfront shuffle cost to go from default partitioning to custom partitioning; it's almost always worth it.
It is generally wise to partition at the outset (rather than delay the inevitable with partitionBy) and then repartition if needed later. Later on you may choose to coalesce even, which causes an intermediate shuffle, to reduce the number of partitions and potentially leave some machines and cores idle because the gain in network IO (after that upfront cost) is greater than the loss of CPU power.
(The only situation I can think of where you don't partition at the outset--because you can't--is when your data source is a compressed file.)
Note also that you can preserve partitions during a map transformation with mapPartitions and mapPartitionsWithIndex.
Finally, keep in mind that as you experiment with your analytics while you work your way up to scale, there are diagnostic capabilities you can use:
toDebugString to see the lineage of RDDs
getNumPartitions to, shockingly, get the number of partitions
glom to see clearly how your data are partitioned
And if you pardon the shameless plug, these are the kinds of things we discuss in Analytics with Apache Spark. We hope to have an online version soon.
By applying partitionBy preemptively you don't avoid the shuffle. You just push it in another place. This can be a good idea if partitioned RDD is reused multiple times, but you gain nothing for a one-off join.
Doesn't Spark partition my data automatically if I don't specify a partitioner?
It will partition (a.k.a. shuffle) your data a part of the join) and subsequent groupBy (unless you keep the same key and use transformation which preserves partitioning).