Spark Structured Streaming performance for Scala vs Python - scala

Hi~ I'm going to develop a mini-batch program with Kafka + Spark Structured Streaming, but I am very confused about whether to use Python or Scala, and which is faster. It would be great if there were any benchmark results comparing Spark Structured Streaming performance between Scala and Python.

Not really an issue.
The only things to note are that 1) Scala is faster, but the scale of data per micro-batch may mean less of an impact, and 2) Scala has typed Dataset support, which PySpark does not (illustrated in the sketch below).
Most people use Scala; PySpark is used more for data science.
That said, real-time machine learning may well be better with PySpark. See for example: https://towardsdatascience.com/building-a-real-time-prediction-pipeline-using-spark-structured-streaming-and-microservices-626dc20899eb
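As a rough illustration of point 2, here is a minimal Scala sketch of the asker's use case: a Kafka source read with Structured Streaming and mapped onto a typed Dataset. The Event case class, broker address and topic name are assumptions made for the example, and the spark-sql-kafka connector needs to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical event type used only for this illustration.
case class Event(key: String, value: String)

val spark = SparkSession.builder().appName("kafka-structured-streaming").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "events")                       // assumed topic name
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .as[Event] // compile-time typed Dataset[Event]; this step has no PySpark equivalent

val query = events.writeStream
  .format("console")
  .start()

query.awaitTermination()
```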

Related

PyFlink performance compared to Scala

How does PyFlink performance compare to Flink + Scala?
Big picture:
The goal is to build a Lambda architecture with a Cold and a Hot tier.
The Cold (Batch) tier will be implemented with Apache Spark (PySpark).
But for the Hot (Streaming) tier there are different options: Spark Streaming or Flink.
Since Apache Flink is pure streaming rather than Spark's micro-batches, I tend to choose Apache Flink.
But my only point of concern is the performance of PyFlink. Will it have lower latency than PySpark streaming? Is it slower than Flink code written in Scala? In what cases is it slower?
Thank you in advance!
I have implemented something very similar, and from my experience here are a few things:
The performance of the job depends heavily on the kind of code you write. If you use custom UDFs written in Python as part of your extraction, performance will be slower than doing the same thing with Scala-based code; this is mainly because of the conversion of Python objects to JVM objects and vice versa. The same thing happens when you use PySpark.
Flink is a true streaming processor, while Spark's micro-batches are not, so if your use case really needs a true streaming service, go ahead with Flink.
If you stick to the native functions provided in PyFlink, you will not observe any noticeable difference in performance.
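To make the "native functions" point concrete, here is a minimal Scala sketch of a Table API pipeline that uses only built-in expressions (the table name "clicks" is an assumed, already-registered source, and the exact API surface varies a bit by Flink version). Because built-in expressions are planned and executed inside the JVM, an equivalent PyFlink job written against the same built-ins runs essentially the same operators; the Python-to-JVM serialisation cost only appears once you introduce Python UDFs.

```scala
import org.apache.flink.table.api._

// Streaming TableEnvironment; "clicks" is assumed to be a registered source table.
val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
val tableEnv = TableEnvironment.create(settings)

// Only built-in Table API expressions are used here; these run entirely in the
// JVM whether the job is authored in Scala or in PyFlink.
val result = tableEnv
  .from("clicks")
  .filter($"userId".isNotNull)
  .select($"userId", $"url")

result.execute().print()
```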

What is the difference between StreamExecutionEnvironment and StreamTableEnvironment in Flink

Well, I'm a newbie to Apache Flink and have been reading some source code found on the Internet.
Sometimes I see StreamExecutionEnvironment, but I have also seen StreamTableEnvironment.
I've read the official docs but I still can't figure out the difference between them.
Furthermore, I'm trying to write a Flink streaming job that receives data from Kafka. In this case, which kind of environment should I use?
A StreamExecutionEnvironment is used with the DataStream API. You need a StreamTableEnvironment if you are going to use the higher-level Table or SQL APIs.
The section of the docs on creating a TableEnvironment covers this in more detail.
Which one you should use depends on whether you want to work with lower-level data streams or with a higher-level relational API. Both can be used to implement a streaming job that reads from Kafka.
The documentation includes a pair of code walkthroughs introducing both APIs, which should help you figure out which API is better suited for your use case. See the DataStream walkthrough and the Table walkthrough.
To learn the DataStream API, spend a day working through the self-paced training. There's also training for Flink SQL. Both are free.
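To make the two entry points concrete, here is a minimal Scala sketch (package names and connector details vary by Flink version, so treat this as an illustration rather than a template) showing how each environment is created; the Table environment is built on top of the DataStream environment so the two APIs can be mixed.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

// Entry point for the lower-level DataStream API.
val env = StreamExecutionEnvironment.getExecutionEnvironment

// Entry point for the Table / SQL API, created on top of the same stream
// environment so tables and data streams can be converted into each other.
val tableEnv = StreamTableEnvironment.create(env)

// A Kafka-backed source can then be added at either level, e.g. with the
// DataStream Kafka connector, or declared for the Table API with DDL such as:
// tableEnv.executeSql("CREATE TABLE events (...) WITH ('connector' = 'kafka', ...)")
```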

PySpark - Local system performance

I am new to PySpark. I would like to learn it while solving a Kaggle challenge that uses a large dataset.
Does PySpark offer a performance advantage over Pandas when used on a local system? Or does it not matter?
When running locally, PySpark runs with as many worker threads as there are logical cores on your machine - if you run spark.sparkContext.master, it should return local[*] (more information on local configurations can be found here). Since Pandas is single-threaded (unless you're using something like Dask), PySpark should be more performant for large datasets. However, due to the overhead associated with using multiple threads, serializing data and sending it to the JVM, etc., Pandas may be faster for smaller datasets.
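If you want to confirm how a local session is parallelised, the following quick check (written in Scala for consistency with the rest of this thread; the same attributes exist unchanged in PySpark) prints the master URL and the default parallelism:

```scala
// Assumes an existing SparkSession named `spark`, as in spark-shell or pyspark.
println(spark.sparkContext.master)              // e.g. "local[*]" when running locally
println(spark.sparkContext.defaultParallelism)  // in local[*] mode, the number of logical cores
```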

Where can I find all the properties associated with spark structured streaming?

I wonder if there is a listing somewhere of all the properties associated with Spark Structured Streaming?
For instance in the doc we can find:
spark.sql.streaming.schemaInference
spark.sql.streaming.metricsEnabled
When I run spark.sql("SET -v").show(numRows = 200, truncate = false)
as recommended in the documentation for Spark SQL configuration, the only ones I see are:
spark.sql.streaming.numRecentProgressUpdates
spark.sql.streaming.metricsEnabled
spark.sql.streaming.checkpointLocation
However, I don't see spark.sql.streaming.schemaInference.
Hence my question: what is the consistent way to see all the properties that can be used to configure Spark Structured Streaming behavior? Are Spark Streaming properties among those that apply to Spark Structured Streaming? I am interested in controlling the rate per mini-batch (i.e. the size of the mini DataFrame, a.k.a. the number of rows processed per micro-batch).
I tried to find all of the configurations on the official Spark website, but failed.
So here is the original source code for the configuration of Spark 2.4.0.
You can find all of the Structured Streaming configurations by searching for spark.sql.streaming.
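One thing worth knowing is that SET -v only lists configurations that are documented and not marked internal in SQLConf, which is likely why spark.sql.streaming.schemaInference does not show up there even though it can still be set. To narrow the SET -v output to the Structured Streaming entries it does expose, a small filter works, for example:

```scala
import org.apache.spark.sql.functions.col

// List the spark.sql.streaming.* entries that "SET -v" exposes.
spark.sql("SET -v")
  .filter(col("key").startsWith("spark.sql.streaming"))
  .show(numRows = 200, truncate = false)
```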

Parallelised collections in Spark

What is the concept of "parallelised collections" in Spark, and how can this concept improve the overall performance of a job? Also, how should partitions be configured for that?
Parallel collections are provided in the Scala language as a simple way to parallelize data processing in Scala. The basic idea is that when you perform operations like map, filter, etc. on a collection, it is possible to parallelize them using a thread pool. This type of parallelization is called data parallelism because it is based on the data itself. It happens locally in the JVM, and Scala will use as many threads as there are cores available to the JVM.
On the other hand, Spark is based on RDDs, which are an abstraction representing a distributed dataset. Unlike Scala parallel collections, these datasets are distributed across several nodes. Spark is also based on data parallelism, but this time it is distributed data parallelism. This allows you to parallelize much more than in a single JVM, but it also introduces other issues related to data shuffling.
In summary, Spark implements a distributed data parallelism system, so every time you execute a map, filter, etc. you are doing something similar to what a Scala parallel collection would do, but in a distributed fashion. Also, the unit of parallelism in Spark is the partition, while in Scala collections it is each element.
You could always use Scala parallel collections inside a Spark task to parallelize further within that task, but you won't necessarily see a performance improvement, especially if your data was already evenly distributed across your RDD and each task needs about the same computational resources to be executed.
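As a minimal sketch of the two levels of parallelism described above (expensiveTransform is a hypothetical function, and in Scala 2.13+ the .par method additionally requires the scala-parallel-collections module and its CollectionConverters import):

```scala
// Hypothetical per-element work.
def expensiveTransform(x: Int): Int = { Thread.sleep(1); x * x }

val data = (1 to 100000).toVector

// 1) Scala parallel collection: data parallelism inside a single JVM,
//    split across a thread pool sized to the cores available to the JVM.
//    (Scala 2.13+ needs: import scala.collection.parallel.CollectionConverters._)
val localResult = data.par.map(expensiveTransform)

// 2) Spark RDD: distributed data parallelism, where the unit of parallelism
//    is the partition, spread across the cluster's executors.
//    Assumes an existing SparkSession named `spark`.
val rdd = spark.sparkContext.parallelize(data, numSlices = 16)
val distributedResult = rdd.map(expensiveTransform).collect()
```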