PyFlink performance compared to Scala - pyspark

How does PyFlink's performance compare to Flink with Scala?
Big Picture.
The goal is to build Lambda architecture with Cold and Hot Tier.
Cold (Batch) Tier will be implemented with Apache Spark (PySpark).
But with Hot (Streaming) Tier there are different options: Spark Streaming or Flink.
Since Apache Flink does pure streaming rather than Spark's micro-batches, I tend to choose Apache Flink.
But my only point of concern is the performance of PyFlink. Will it have lower latency than PySpark streaming? Is it slower than Scala-written Flink code? In what cases is it slower?
Thank you in advance!

I have implemented something very similar, and from my experience these are a few things to keep in mind:
The performance of the job is completely dependent on the type of code you write. If you use custom UDFs written in Python during extraction, performance will be slower than doing the same thing with Scala-based code. This happens mainly because of the conversion of Python objects to JVM objects and vice versa. The same applies when you are using PySpark.
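The cost of crossing the Python/JVM boundary can be illustrated with a pure-Python sketch (no Spark or Flink needed; `via_roundtrip`, `native`, and the doubling "UDF" are made-up stand-ins for the serialize/deserialize hops a Python UDF incurs):

```python
import pickle

def via_roundtrip(rows):
    # Simulate the Python<->JVM boundary: each batch is serialized,
    # handed across, and deserialized before the UDF runs on it.
    payload = pickle.dumps(rows)               # engine -> Python worker
    decoded = pickle.loads(payload)            # deserialize on the Python side
    result = [x * 2 for x in decoded]          # the actual UDF work
    return pickle.loads(pickle.dumps(result))  # Python worker -> engine

def native(rows):
    # A single-runtime path does the same work with no extra copies.
    return [x * 2 for x in rows]

rows = list(range(100_000))
assert via_roundtrip(rows) == native(rows)  # same answer, extra copies
```

Both paths compute the same result; the round-trip version simply pays twice for serialization, which is the overhead the answer is describing.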
Flink is a true streaming processor; Spark's micro-batches are not. So if your use case really needs a true streaming service, go with Flink.
If you stick to the native functions provided by PyFlink, you will not observe any noticeable difference in performance.
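The same principle can be shown in plain Python, no Flink required: work done by a built-in (C-implemented) function avoids per-element Python interpreter overhead, much as PyFlink's built-in expressions keep the work inside the engine instead of calling back into Python per row (`python_level_sum` is just an illustrative stand-in):

```python
# Built-in sum() runs its loop in C; the hand-written version pays
# Python-interpreter overhead on every element. Analogously, PyFlink's
# built-in expressions execute inside the engine, while a Python UDF
# is invoked from the engine for each row or batch.
def python_level_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(1_000_000))
assert sum(data) == python_level_sum(data)  # same result, different cost
```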

Related

Spark Structured Streaming performance for Scala vs Python

Hi! I'm going to develop a micro-batched program with Kafka + Spark Structured Streaming, but I am unsure whether to use Python or Scala, and which is faster. It would be great if there were any benchmark results comparing Spark Structured Streaming in Scala and Python.
Not really an issue.
The only things to note are that 1) Scala is faster, though the scale of data per micro-batch may mean less of an impact, and 2) Scala has Dataset support for typed rows; PySpark does not.
Most use Scala; PySpark is used more for data science.
That said, real-time machine learning may well be better with PySpark. See for example: https://towardsdatascience.com/building-a-real-time-prediction-pipeline-using-spark-structured-streaming-and-microservices-626dc20899eb

What is the difference between StreamExecutionEnvironment and StreamTableEnvironment in Flink

Well, I'm a newbie to Apache Flink and have been reading some source code on the Internet.
Sometimes I see StreamExecutionEnvironment, but I have also seen StreamTableEnvironment.
I've read the official doc but I still can't figure out their difference.
Furthermore, I'm trying to code a Flink Stream Job, which receives the data from Kafka. In this case, which kind of environment should I use?
A StreamExecutionEnvironment is used with the DataStream API. You need a StreamTableEnvironment if you are going to use the higher level Table or SQL APIs.
The section of the docs on creating a TableEnvironment covers this in more detail.
Which you should use depends on whether you want to work with lower level data streams, or a higher level relational API. Both can be used to implement a streaming job that reads from Kafka.
The documentation includes a pair of code walkthroughs introducing both APIs, which should help you figure out which API is better suited for your use case. See the DataStream walkthrough and the Table walkthrough.
To learn the DataStream API, spend a day working through the self-paced training. There's also training for Flink SQL. Both are free.

Metrics for latency or performance of checkpointing

We are using Spark 2.3 and we are facing a weird issue. The streaming job works perfectly fine in one environment, but in another environment it does not.
We suspect that in the low-performing environment the checkpointing of state is not as performant as in the other.
Are there any metrics that Spark Streaming emits which could confirm our theory?

Parallelised collections in Spark

What is the concept of "parallelised collections" in Spark, and how can it improve the overall performance of a job? Also, how should partitions be configured for it?
Parallel collections are provided in the Scala language as a simple way to parallelize data processing. The basic idea is that when you perform operations like map, filter, etc. on a collection, it is possible to parallelize them using a thread pool. This type of parallelization is called data parallelism because it is based on the data itself. It happens locally in the JVM, and Scala will use as many threads as there are cores available to the JVM.
On the other hand, Spark is based on RDDs, an abstraction that represents a distributed dataset. Unlike Scala parallel collections, these datasets are distributed across several nodes. Spark is also based on data parallelism, but this time it is distributed data parallelism. This allows you to parallelize much more than in a single JVM, but it also introduces other issues related to data shuffling.
In summary, Spark implements a distributed data parallelism system, so every time you execute a map, filter, etc. you are doing something similar to what a Scala parallel collection would do, but in a distributed fashion. Also, the unit of parallelism in Spark is the partition, while in Scala collections it is the individual element.
You could always use Scala parallel collections inside a Spark task to add parallelism within that task, but you won't necessarily see a performance improvement, especially if your data was already evenly distributed across your RDD and each task needs about the same computational resources to execute.
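The local data-parallel map described above (Scala's `List(...).par.map(f)`) has a close stdlib analogue in Python, sketched here for illustration (`f` is a made-up workload; note that CPython threads only speed up I/O-bound work, so a CPU-bound `f` would want `ProcessPoolExecutor` instead):

```python
from concurrent.futures import ThreadPoolExecutor

# The per-element workload; in Scala this would be the function
# passed to .par.map(...).
def f(x):
    return x * x

data = range(8)

# Fan the map out over a local thread pool, like a Scala parallel
# collection: same elements, same order, work shared across threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_result = list(pool.map(f, data))

assert parallel_result == [x * x for x in data]  # order is preserved
```

As the answer notes, the unit of parallelism here is the individual element within one process, whereas Spark distributes whole partitions across a cluster.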

How to use Concurrency API on data frames?

I have a requirement to load various tables in parallel using Scala DataFrames. I have a fact table with around 1.7 TB of data, which takes around 5 minutes to load. I want to load my dimension tables concurrently so that I can reduce my overall run time. I am not well versed in the Concurrency API in Scala.
You need to read up on Spark: the whole point of it is to parallelize the processing of data beyond the scope of a single machine. Essentially, Spark will parallelize the load across as many tasks as you have running in parallel. It all depends on how you set up your cluster; from the question I am guessing you only use one machine and ran it in local mode, in which case you should at least run it with local[number of processors you have].
In case I didn't make it clear: you also shouldn't use any of the other Scala concurrency APIs for this.
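For reference, the driver-side concurrency pattern the question alludes to would look roughly like this (shown in Python with a hypothetical `load_table` stub and made-up table names; in a real job each call would be something like a `spark.read.table(...)` plus a write). As the answer above stresses, Spark already parallelizes each individual load, so this is rarely needed:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one table load; a real implementation
# would submit a Spark job here.
def load_table(name):
    return f"loaded:{name}"

dimension_tables = ["dim_customer", "dim_product", "dim_date"]  # made-up names

# Submitting the independent loads from driver threads lets them run
# concurrently instead of strictly one after another.
with ThreadPoolExecutor(max_workers=len(dimension_tables)) as pool:
    results = list(pool.map(load_table, dimension_tables))

assert results == ["loaded:dim_customer", "loaded:dim_product", "loaded:dim_date"]
```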