How to achieve Task Parallelism in Spark and Scala?

We have a Spark batch job that reads data from an HBase table, applies multiple transformations, and then writes the results to multiple Cassandra tables.
We have multiple independent tasks that all use the same DataFrame (the HBase table data). Basically, we have several dashboards based on the same HBase table data.
Currently everything runs sequentially. How can we run these tasks in parallel?
Is it good practice to use Scala Futures to run the tasks in parallel?
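One possible shape for the Future-based approach asked about here is sketched below. It is only an illustration under assumptions: runJob is a hypothetical stand-in that sleeps instead of running a Spark action, and the pool size is arbitrary. In a real job each body would transform the shared (ideally cached) DataFrame and write to one Cassandra table. Spark's scheduler accepts jobs submitted from several driver threads, so the same pattern should carry over, though how much actually overlaps depends on free executor cores and on the scheduler mode.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelJobs {
  // Hypothetical stand-in for one independent dashboard job.
  // A real body would run a Spark action on the shared DataFrame.
  def runJob(name: String): String = {
    Thread.sleep(100) // simulates the blocking Spark action
    s"$name done"
  }

  // Runs all jobs concurrently on a bounded, dedicated thread pool
  // and returns the results in the same order as jobNames.
  def runAll(jobNames: Seq[String], maxConcurrent: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(maxConcurrent)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = jobNames.map(n => Future(runJob(n)))
      // Block the driver until every job has finished (or one has failed)
      Await.result(Future.sequence(futures), 10.minutes)
    } finally pool.shutdown()
  }
}
```

A dedicated fixed-size pool is used here rather than the global execution context because the jobs block their threads for a long time; the pool size also caps how many Spark jobs are in flight at once.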

Related

How to create multiple Spark tasks to query Cassandra partitions

I have an application that is using Spark (with Spark Job Server) that uses a Cassandra store. My current setup is that of a client mode running with master=local[*]. So there is a single Spark executor which is also the driver process that is using all 8 cores of the machine. I have a Cassandra instance running on the same machine.
The Cassandra tables have a primary key of the form ((datasource_id, date), clustering_col_1...clustering_col_n) where date is a single day of the form "2019-02-07" and is part of a composite partition key.
In my Spark application, I am running a query like so:
df.filter(col("date").isin(days: _*))
In the Spark physical plan, I notice that these filters, along with the filter on the "datasource_id" partition key, are pushed down to the Cassandra CQL query.
For our biggest datasources, I know that the partitions are around 30MB in size. So I have the following setting in the Spark Job Server configuration:
spark.cassandra.input.split.size_in_mb = 1
However I notice that there is no parallelization in the Cassandra loading step. Though there are multiple Cassandra partitions that are >1MB, there are no additional spark partitions created. There is only a single task that does all the querying on a single core, thus taking ~20 secs to load data for a 1 month date range that corresponds to ~1 million rows.
I have tried the alternative approach below:
days.map(day => df.filter(col("date").equalTo(day))).reduce(_ union _)
This does indeed create a spark partition (or task) for every "day" partition in cassandra. However, for smaller datasources where the cassandra partitions are much smaller in size, this method proves to be quite expensive in terms of excessive tasks created and the overhead due to their coordination. For these datasources, it would be totally fine to lump many cassandra partitions into one spark partition. Hence why I thought using the spark.cassandra.input.split.size_in_mb configuration would prove useful in dealing with both small and large datasources.
Is my understanding wrong? Is there something else that I'm missing in order for this configuration to take effect?
P.S. I have also read the answers about using joinWithCassandraTable. However, our code relies on using DataFrame. Also, converting from a CassandraRDD to a DataFrame is not very viable for us since our schema is dynamic and cannot be specified using case classes.

Parallelised collections in Spark

What is the concept of "parallelized collections" in Spark, and how can it improve the overall performance of a job? Also, how should partitions be configured for that?
Parallel collections are provided in the Scala language as a simple way to parallelize data processing. The basic idea is that when you perform operations like map, filter, etc. on a collection, the work can be parallelized using a thread pool. This type of parallelization is called data parallelism because it is based on the data itself. It happens locally in the JVM, and Scala will use as many threads as there are cores available to the JVM.
Spark, on the other hand, is based on RDDs, an abstraction that represents a distributed dataset. Unlike Scala parallel collections, these datasets are distributed across several nodes. Spark is also based on data parallelism, but this time it is distributed data parallelism. This allows you to parallelize much more than in a single JVM, but it also introduces other issues related to data shuffling.
In summary, Spark implements a distributed data parallelism system, so every time you execute a map, filter, etc., you are doing something similar to what a Scala parallel collection would do, but in a distributed fashion. Also, the unit of parallelism in Spark is the partition, while in Scala collections it is each element.
You could always use Scala parallel collections inside a Spark task to parallelize within that task, but you won't necessarily see a performance improvement, especially if your data is already evenly distributed in your RDD and each task needs about the same computational resources.
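To make the "partition as the unit of parallelism" idea concrete, here is a single-JVM analogy in plain Scala. It is not Spark code: the object PartitionSketch, the function sumOfSquares, and the chunking rule are all made up for illustration. The data is split into chunks, each chunk is processed by one task, and the partial results are combined at the end, which is roughly what Spark does across nodes with RDD partitions.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartitionSketch {
  // Splits data into at most numPartitions chunks, sums the squares of each
  // chunk in its own Future, then adds up the partial sums.
  def sumOfSquares(data: Vector[Int], numPartitions: Int): Long = {
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    val partitions = data.grouped(chunkSize).toVector
    // One task per partition, not one task per element
    val partials = partitions.map(p => Future(p.map(x => x.toLong * x).sum))
    Await.result(Future.sequence(partials), 1.minute).sum
  }
}
```

With too few chunks some cores stay idle, and with too many the per-task overhead dominates, which is the same trade-off that applies when choosing the number of partitions in Spark.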

Streaming data store in hive using spark

I am creating an application that receives streaming data, which goes into Kafka and then on to Spark. Spark consumes the data, applies some logic, and then saves the processed data into Hive. The data arrives at a high rate: I am getting 50K records per minute. Spark Streaming uses a 1-minute window, in which it processes the data and saves it to Hive.
My question: is this architecture fine from a production perspective? If so, how can I save the streaming data into Hive? What I am doing is creating a DataFrame from each 1-minute window of data and saving it to Hive using
results.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("stocks")
I have not created the pipeline yet. Is this fine, or do I have to modify the architecture?
Thanks
I would give it a try!
But kafka->spark->hive is not the optimal pipeline for your use case.
Hive is normally based on HDFS, which is not designed for many small inserts/updates/selects.
So your plan can end up with the following problems:
- many small files, which results in bad performance
- your window becomes too small because processing takes too long
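A rough back-of-the-envelope estimate shows why the small-files problem comes up. The figures for the window length and the record rate come from the question; writeTasksPerBatch is an assumption (the number of partitions in each 1-minute DataFrame), since an append typically produces about one file per non-empty write task. The names below are hypothetical.

```scala
object SmallFilesEstimate {
  val batchesPerDay: Int = 24 * 60     // one append per 1-minute window
  val recordsPerBatch: Long = 50000L   // rate stated in the question

  // writeTasksPerBatch is an assumed partition count for each batch DataFrame
  def filesPerDay(writeTasksPerBatch: Int): Long =
    batchesPerDay.toLong * writeTasksPerBatch

  // Average number of records that would land in each file
  def recordsPerFile(writeTasksPerBatch: Int): Long =
    recordsPerBatch / writeTasksPerBatch
}
```

Under the assumed 8 write tasks per batch, this gives 11,520 new files per day with about 6,250 records each, far below the file sizes HDFS handles well.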
Suggestion:
Option 1:
- use Kafka just as a buffer queue and design your pipeline like kafka->hdfs (e.g. with spark or flume)->batch spark to hive/impala table
Option 2:
- kafka->flume/spark to hbase/kudu->batch spark to hive/impala
Option 1 has no "realtime" analysis option. It depends on how often you run the batch Spark job.
Option 2 is a good choice that I would recommend: store something like 30 days in HBase and all older data in Hive/Impala. With a view you will be able to join new and old data for realtime analysis.
Kudu makes the architecture even easier.
Saving data into Hive tables can be tricky if you want to partition it and use it via Hive SQL.
But basically it would work like the following:
xml.write.format("parquet").mode("append").saveAsTable("test_ereignis_archiv")
BR

Does Spark do UnionAll in parallel?

I have 10 DataFrames with the same schema which I'd like to combine into one DataFrame. Each DataFrame is constructed using sqlContext.sql("select ... from ...").cache(), which means that technically, the DataFrames are not really calculated until it's time to use them.
So, if I run:
val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...
will Spark calculate all these DataFrames in parallel or one by one (due to the dot operator)?
And also, while we're here: is there a more elegant way to perform a unionAll on several DataFrames than the one I listed above?
unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.
In general Spark is a distributed computation system. Each operation itself is made up of a bunch of tasks that are processed in parallel. So in general you don't have to worry about whether two operations can run in parallel or not. The cluster resources will be well utilized anyway.
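On the second part of the question, one common way to avoid the long chain is to put the DataFrames in a sequence and reduce it with the union operation. The helper below is a generic sketch with a hypothetical name (UnionSketch.unionAll); it is written against any type so that it runs without Spark, and the Spark call is shown only as a comment. Like the chained version, it only builds the plan and triggers no computation.

```scala
object UnionSketch {
  // Combines many same-schema datasets into one by applying a pairwise union
  def unionAll[A](parts: Seq[A])(union: (A, A) => A): A = {
    require(parts.nonEmpty, "need at least one dataset to union")
    parts.reduce(union)
  }
}

// With Spark DataFrames the call would look like this (not run here):
//   val df_final = UnionSketch.unionAll(Seq(df1, df2, df3, df4))(_ unionAll _)
// On Spark 2.x and later, unionAll is deprecated in favor of union.
```

For example, UnionSketch.unionAll(Seq(Vector(1), Vector(2, 3)))(_ ++ _) concatenates plain Scala vectors in the same way.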

Spark: spark-csv partitioning and parallelism in subsequent DataFrames

I'm wondering how to enforce usage of subsequent, more appropriately partitioned DataFrames in Spark when importing source data with spark-csv.
Summary:
spark-csv doesn't seem to support explicit partitioning on import like sc.textFile() does.
While it gives me an inferred schema "for free", by default the DataFrames it returns normally have only 2 partitions, even though I'm using 8 executors in my cluster.
Even though subsequent DataFrames that have many more partitions are being cached via cache() and used for further processing (immediately after import of the source files), Spark job history is still showing incredible skew in the task distribution - 2 executors will have the vast majority of the tasks instead of a more even distribution that I expect.
I can't post the data, but the code is just some simple joining, adding a few columns via .withColumn(), and then very basic linear regression via spark.mllib.
Below is a comparison image from the Spark History UI showing tasks per executor (the last row is the driver).
Note: I get the same skewed task distribution regardless of calling repartition() on the spark-csv DataFrames or not.
How do I "force" Spark to basically forget those initial DataFrames and start from more appropriately partitioned DataFrames, or force spark-csv to somehow partition its DataFrames differently (without forking it/modifying its source)?
I can resolve this issue using sc.textFile(file, minPartitions), but I'm hoping I don't have to resort to that because of things like the nicely typed schema that spark-csv provides.