Spark save(write) parquet only one file - scala

If I write
dataFrame.write.format("parquet").mode("append").save("temp.parquet")
the temp.parquet folder ends up containing as many files as there are rows.
I don't think I fully understand Parquet yet, but is this expected?

Use coalesce before the write operation:
dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")
EDIT-1
Upon a closer look, the docs do warn about coalesce:
"However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)"
Therefore, as suggested by @Amar, it's better to use repartition.

You can set the number of partitions to 1 to save the output as a single file:
dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")

Although the previous answers are correct, you have to understand the repercussions of repartitioning or coalescing to a single partition. All your data will have to be transferred to a single worker just to be written immediately to a single file.
As is repeatedly mentioned throughout the internet, you should use repartition in this scenario despite the shuffle step that it adds to the execution plan. The shuffle lets the upstream stages still run in parallel across the cluster, so only the final write happens in a single task, whereas coalesce(1) can pull the upstream computation onto one node as well.
There is at least one alternative worth mentioning. You can write a simple script that merges all the files into a single one. That way you avoid generating massive network traffic to a single node of your cluster.
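To make the link between partitions and output files concrete, here is a minimal sketch that writes a DataFrame and counts the resulting part files. It assumes a local SparkSession and a throwaway temp directory; the object and helper names (PartFileCount, writeAndCountParts) are hypothetical and not part of the original question.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartFileCount {
  // Writes df as Parquet into a fresh temp directory (an assumption made for
  // this sketch) and returns how many part files Spark produced.
  def writeAndCountParts(df: DataFrame): Int = {
    val out = Files.createTempDirectory("parquet-demo").resolve("temp.parquet").toString
    df.write.format("parquet").mode("append").save(out)
    new File(out).listFiles().count { f =>
      f.getName.startsWith("part-") && f.getName.endsWith(".parquet")
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("part-file-count").getOrCreate()
    import spark.implicits._

    // 100 rows spread over 4 partitions: each non-empty partition becomes one file.
    val df = (1 to 100).toDF("id").repartition(4)
    println(s"files with 4 partitions: ${writeAndCountParts(df)}")

    // Shuffling everything into one partition yields one file.
    println(s"files after repartition(1): ${writeAndCountParts(df.repartition(1))}")

    spark.stop()
  }
}
```

If the file count matches the row count, as in the question, the DataFrame most likely has at least as many partitions as rows, which df.rdd.getNumPartitions would confirm.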

Related

How to do simple cache file in Flink-Scala?

I am new to Flink and confused about how to cache a file and load it into a dataset; I can't find a simple example. Why do we need to create a dataset first in order to call "RichMapFunction"? How do I cache a file that has nothing to do with any other dataset? The samples I found perform a kind of join with another dataset. Thank you.
When joining two datasets where one of them is small, use broadcast to avoid a shuffle. Without broadcasting, shuffling a large dataset is painful.
E.g. one dataset has 1 billion records and the other has 100 records. With broadcast, the small dataset is distributed to all task managers processing those 1 billion records, so the 1 billion records do not have to move for the join. Without broadcast, the typical behaviour of a join operation is to shuffle both the 1 billion records and the 100 records so that records with the same key end up on the same machine, which is much more expensive than broadcasting.
RichMapFunction provides the open() method and access to the RuntimeContext. In open(), the Flink job can get the broadcast dataset through getRuntimeContext().getBroadcastVariable(). The open() function is called only once for each parallel operator instance, so the broadcast dataset is initialised once and can then be applied to all incoming records. That is the reason to use RichMapFunction instead of MapFunction.
Note - Broadcast applies when the dataset to broadcast is small. You need to create a dataset first and then broadcast it to all operators. Please refer to here for the usage of the API.
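A minimal sketch of the broadcast pattern described above, assuming the legacy Flink DataSet Scala API (org.apache.flink.api.scala) is on the classpath. The object name, the sample data and the broadcast name "smallSet" are made up for illustration.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

object BroadcastSketch {
  def run(): Seq[String] = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Small lookup dataset (stands in for the "100 records" side).
    val small: DataSet[(Int, String)] = env.fromElements((1, "one"), (2, "two"))
    // Larger dataset (stands in for the "1 billion records" side).
    val large: DataSet[Int] = env.fromElements(1, 2, 2, 1, 3)

    large
      .map(new RichMapFunction[Int, String] {
        private var lookup: Map[Int, String] = Map.empty

        // Called once per parallel instance: materialise the broadcast set as a Map.
        override def open(parameters: Configuration): Unit = {
          lookup = getRuntimeContext
            .getBroadcastVariable[(Int, String)]("smallSet")
            .asScala
            .toMap
        }

        // Called per record: a local lookup, no shuffle of the large dataset.
        override def map(key: Int): String = lookup.getOrElse(key, "unknown")
      })
      .withBroadcastSet(small, "smallSet") // same name as in getBroadcastVariable
      .collect()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The name passed to withBroadcastSet has to match the one used in getBroadcastVariable; the large dataset stays where it is and only the small one is shipped to every task.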
Distributed file caching is for the case where an operation (e.g. a map operation) needs to load an external file once and then use it in the operation.
E.g. a trained model is saved on HDFS. The Flink job needs to load the model and apply it to each record. For this case, the Flink job can use the distributed file cache API. The model file is pulled from HDFS to the local machine, and all tasks running on that machine can share the pulled file locally, which saves network traffic and time.
You do not need to create a dataset for the file to be distributed; use registerCachedFile() instead. For the same reason as with the broadcast dataset, using RichMapFunction lets the Flink job load and initialise the distributed file only once.
Please refer to this document for the usage.
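The distributed cache usage can be sketched as follows, again assuming the legacy Flink DataSet Scala API. A local temp file stands in for the model on HDFS; the file contents, the cache name "model" and the object name are assumptions made for this example.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object CachedFileSketch {
  def run(): Seq[Int] = {
    // Stand-in for a model file on HDFS: a local temp file holding one number.
    val modelPath = Files.createTempFile("model", ".txt")
    Files.write(modelPath, "10".getBytes(StandardCharsets.UTF_8))

    val env = ExecutionEnvironment.getExecutionEnvironment
    // Register the file under a name; no dataset is created for it.
    env.registerCachedFile(modelPath.toUri.toString, "model")

    env
      .fromElements(1, 2, 3)
      .map(new RichMapFunction[Int, Int] {
        private var factor: Int = 1

        // Called once per parallel instance: read the locally cached copy.
        override def open(parameters: Configuration): Unit = {
          val file: File = getRuntimeContext.getDistributedCache.getFile("model")
          factor = new String(Files.readAllBytes(file.toPath), StandardCharsets.UTF_8).trim.toInt
        }

        // Apply the loaded "model" to every record.
        override def map(value: Int): Int = value * factor
      })
      .collect()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

In a real job the first argument of registerCachedFile would be the HDFS URI of the model, and open() would deserialise the model instead of parsing a number.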

Handling Skew data in apache spark production scenario

Can anyone explain how skewed data is handled in production for Apache Spark?
Scenario:
We submitted a Spark job using "spark-submit", and in the Spark UI we observed that a few tasks take a long time, which indicates the presence of skew.
Questions:
(1) What steps should we take (repartitioning, coalesce, etc.)?
(2) Do we need to kill the job, include the skew solutions in the jar, and then re-submit the job?
(3) Can we solve this issue by running commands like coalesce directly from the shell without killing the job?
Data skew is primarily a problem when applying non-reducing by-key (shuffling) operations. The two most common examples are:
Non-reducing groupByKey (RDD.groupByKey, Dataset.groupBy(Key).mapGroups, Dataset.groupBy.agg(collect_list)).
RDD and Dataset joins.
Rarely, the problem is related to the properties of the partitioning key and the partitioning function, with no pre-existing issue in the data distribution.
// All keys are unique - no obvious data skew
val rdd = sc.parallelize(Seq(0, 3, 6, 9, 12)).map((_, None))
// Drastic data skew
rdd.partitionBy(new org.apache.spark.HashPartitioner(3)).glom.map(_.size).collect
// Array[Int] = Array(5, 0, 0)
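The result above follows from how a hash partitioner assigns keys: the partition is the key's hashCode modulo the number of partitions, and every key here is a multiple of 3. The following pure Scala sketch reproduces that arithmetic without Spark; the object and function names are hypothetical.

```scala
object HashSkewDemo {
  // Modulo that never returns a negative value.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val r = x % mod
    if (r < 0) r + mod else r
  }

  // Number of keys landing in each partition under hashCode-modulo partitioning.
  def partitionSizes(keys: Seq[Int], numPartitions: Int): Seq[Int] = {
    val counts = keys
      .groupBy(k => nonNegativeMod(k.hashCode, numPartitions))
      .map { case (partition, ks) => partition -> ks.size }
    (0 until numPartitions).map(p => counts.getOrElse(p, 0))
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(0, 3, 6, 9, 12)
    println(partitionSizes(keys, 3)) // all keys are multiples of 3
    println(partitionSizes(keys, 4))
    println(partitionSizes(keys, 5))
  }
}
```

With 3 partitions every key maps to partition 0; with 4 or 5 partitions the same keys spread out, which is why changing the number of partitions or the partitioner helps in this particular case.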
What steps shall we take(re-partitioning,coalesce,etc.)?
Repartitioning (never coalesce) can help you with the latter case by:
Changing the partitioner.
Adjusting the number of partitions to minimize the possible impact of the data (the same rules as for associative arrays apply: prime numbers and powers of two should be preferred, although this might not resolve the problem fully, like 3 in the example above).
The former cases typically won't benefit much from repartitioning, because the skew is naturally induced by the operation itself. Values with the same key cannot be spread across multiple partitions, and the non-reducing character of the process is minimally affected by the initial data distribution.
These cases have to be handled by adjusting the logic of your application. It could mean a number of things in practice, depending on the data or problem:
Removing operation completely.
Replacing exact result with an approximation.
Using different workarounds (typically with joins), for example frequent-infrequent split, iterative broadcast join or prefiltering with probabilistic filter (like Bloom filter).
Do we need to kill the job and then include the skew solutions in the jar and re-submit the job?
Normally you have to at least resubmit the job with adjusted parameters.
In some cases (mostly RDD batch jobs) you can design your application to monitor task execution and to kill and resubmit a particular job in case of possible skew, but this might be hard to implement correctly in practice.
In general, if data skew is possible, you should design your application to be immune to it.
Can we solve this issue by running the commands like (coalesce) directly from shell without killing the job?
I believe this is already answered by the points above, but just to say - there is no such option in Spark. You can of course include these in your application.
We can fine-tune the query to reduce its complexity.
We can try the salting mechanism:
Salt the skewed column with a random number to create a better distribution of data across the partitions.
Spark 3 provides the Adaptive Query Execution mechanism to avoid such scenarios in production.
Below are a couple of Spark properties which we can fine-tune accordingly.
spark.sql.adaptive.enabled=true
spark.databricks.adaptive.autoBroadcastJoinThreshold=30MB # Databricks only; changes sort merge join to broadcast join dynamically, default size = 30 MB
spark.sql.adaptive.coalescePartitions.enabled=true # partitions are coalesced dynamically
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB # default
spark.sql.adaptive.coalescePartitions.minPartitionSize=1MB # a size, not a boolean; 1 MB is the default
spark.sql.adaptive.coalescePartitions.minPartitionNum=<number> # a partition count, not a boolean; defaults to the cluster's default parallelism
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5 # default
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB # default
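The salting mechanism mentioned above can be sketched for a join as follows. This is one possible variant, not necessarily what the answer's author had in mind: the skewed side gets a random salt, and the small side is replicated once per salt value so every salted key still finds its match. The object name, the helper name saltedJoin and the column name "salt" are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, explode, lit, rand}

object SaltedJoin {
  // Joins `skewed` with `small` on `key`, spreading each hot key over `buckets` sub-keys.
  def saltedJoin(skewed: DataFrame, small: DataFrame, key: String, buckets: Int): DataFrame = {
    // Skewed side: attach a random salt in [0, buckets).
    val saltedLeft = skewed.withColumn("salt", (rand() * buckets).cast("int"))
    // Small side: replicate every row once per possible salt value.
    val saltedRight =
      small.withColumn("salt", explode(array((0 until buckets).map(lit(_)): _*)))
    // Join on (key, salt), then drop the helper column.
    saltedLeft.join(saltedRight, Seq(key, "salt")).drop("salt")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("salted-join").getOrCreate()
    import spark.implicits._

    val skewed = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)).toDF("k", "v")
    val small = Seq(("a", "x"), ("b", "y")).toDF("k", "w")

    saltedJoin(skewed, small, "k", buckets = 4).show()
    spark.stop()
  }
}
```

The trade-off is that the small side grows by a factor of `buckets`, so this only pays off when the replicated side is much smaller than the skewed one.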

How can I control number of rows and/or output file size in Spark streaming when writing to HDFS - hive?

I am using Spark Streaming to read and process messages from Kafka and write them to HDFS - Hive.
Since I wish to avoid creating many small files that spam the filesystem, I would like to know if there is a way to ensure a minimal file size and/or to force a minimal number of output rows per file, with the exception of a timeout.
Thanks.
As far as I know, there is no way to control the number of lines in your output files. But you can control the number of output files.
Controlling that and considering your dataset size may help you with your needs, since you can calculate the size of each file in your output. You can do that with the coalesce and repartition commands:
df.coalesce(2).write(...)
df.repartition(2).write(...)
Both of them create the number of partitions given as the parameter, and each partition is normally written as one file. So if you set 2, you should have 2 files in your output.
The difference is that with repartition you can both increase and decrease the number of partitions, while with coalesce you can only decrease it.
Also, keep in mind that repartition performs a full shuffle to distribute the data equally among the partitions, which may be expensive in resources and time. On the other hand, coalesce does not perform a full shuffle; it combines existing partitions instead.
You can find an awesome explanation in this other answer here
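The calculation hinted at above (choose the number of files from the dataset size and a desired file size) could look like the following pure Scala sketch. The function name and the idea of a fixed target file size are assumptions; how you estimate the total size of a batch is left open.

```scala
object OutputFileSizing {
  // Number of partitions (and therefore output files) needed so that each file
  // holds roughly targetFileBytes, rounding up and never going below 1.
  def partitionsFor(totalBytes: Long, targetFileBytes: Long): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetFileBytes > 0, "targetFileBytes must be positive")
    val needed = (totalBytes + targetFileBytes - 1) / targetFileBytes // ceiling division
    math.max(1L, needed).toInt
  }

  def main(args: Array[String]): Unit = {
    val oneGiB = 1024L * 1024 * 1024
    val target = 128L * 1024 * 1024 // 128 MiB per file, an arbitrary choice
    println(partitionsFor(oneGiB, target))
  }
}
```

The result would then be passed to df.repartition(n) or df.coalesce(n) before the write. Because the on-disk size after compression differs from the in-memory size, the estimate is only approximate.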

Parallelize a RDD variable expected from an external file in Spark

Based on the materials I have read and some online posts, I think Spark distributes an RDD read from an external file with sc.textFile across the cluster, for example:
val rdd = sc.textFile(file_path)
However, when my colleague read my code, he asked me to use sc.parallelize. I was confused, because I think sc.parallelize is redundant here. I asked my colleague again and he gave me this answer:
In my experience so far, Spark isn't good at dividing an external file over multiple nodes and workers, so you need to set the partitions to force the job to use multiple workers.
So, based on my colleague's suggestion, what is the easiest way to set the partitions when reading a large file, if sc.textFile cannot do that? One possible way is to collect first and then call sc.parallelize, but I think that wastes too much time and is redundant.
You can call rdd.repartition(..). Collect and parallelize is not the right way to achieve what you describe.
The reason your colleague observed this behaviour is probably small files, as partitioning is driven by the HDFS blocks when reading. So if your files are smaller than the block size, all your data will end up in the same executor.
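A short sketch of the two options that avoid collect plus parallelize: passing a minPartitions hint to sc.textFile, or repartitioning after the read. It assumes a local SparkSession and a generated temp file; the object and helper names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object TextFilePartitions {
  // Reads a text file and shuffles it into exactly numPartitions partitions.
  def readRepartitioned(spark: SparkSession, path: String, numPartitions: Int): RDD[String] =
    spark.sparkContext.textFile(path).repartition(numPartitions)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("textfile-partitions").getOrCreate()

    // Small generated input file, standing in for the external file.
    val path = Files.createTempFile("input", ".txt")
    Files.write(path, (1 to 1000).mkString("\n").getBytes(StandardCharsets.UTF_8))

    // Option 1: ask for a minimum number of input splits at read time (a hint).
    val hinted = spark.sparkContext.textFile(path.toString, minPartitions = 8)
    println(s"with minPartitions hint: ${hinted.getNumPartitions}")

    // Option 2: read, then repartition (costs a shuffle, but the count is exact).
    val forced = readRepartitioned(spark, path.toString, 8)
    println(s"after repartition(8): ${forced.getNumPartitions}")

    spark.stop()
  }
}
```

Both variants keep the data on the executors, whereas collect would pull everything to the driver first.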

Spark dataframe saveAsTable is using a single task

We have a pipeline for which the initial stages are properly scalable - using several dozen workers apiece.
One of the last stages is
dataFrame.write.format(outFormat).mode(saveMode).
partitionBy(partColVals.map(_._1): _*).saveAsTable(tname)
For this stage we end up with a single worker. This clearly does not work for us - in fact the worker runs out of disk space - on top of being very slow.
Why would that command end up running on a single worker/single task only?
Update The output format was parquet. The number of partition columns did not affect the result (tried one column as well as several columns).
Another update None of the following conditions (as posited by an answer below) held:
coalesce or partitionBy statements
window / analytic functions
Dataset.limit
sql.shuffle.partitions
The problem is unlikely to be related in any way to saveAsTable.
A single task in a stage indicates that the input data (Dataset or RDD) has only one partition. This is in contrast to cases where there are multiple tasks but one or more have a significantly higher execution time, which normally corresponds to partitions containing positively skewed keys. Also, you should not confound a single-task scenario with low CPU utilization. The latter is usually a result of insufficient IO throughput (high CPU wait times are the most obvious indication of that), but in rare cases can be traced to the usage of shared objects with low-level synchronization primitives.
Since standard data sources don't shuffle data on write (including cases where the partitionBy and bucketBy options are used), it is safe to assume that the data has been repartitioned somewhere in the upstream code. Usually it means that one of the following happened:
Data has been explicitly moved to a single partition using coalesce(1) or repartition(1).
Data has been implicitly moved to a single partition for example with:
Dataset.limit
Window function applications with window definition lacking PARTITION BY clause.
df.withColumn(
"row_number",
row_number().over(Window.orderBy("some_column"))
)
The spark.sql.shuffle.partitions option is set to 1 and the upstream code includes a non-local operation on a Dataset.
The Dataset is the result of applying a global aggregate function (without a GROUP BY clause). This is usually not an issue, unless the function is non-reducing (collect_list or comparable).
While there is no evidence that it is the problem here, in the general case you should also consider the possibility that the data contains only a single partition all the way back to the source. This usually happens when the input is fetched using a JDBC source, but third-party formats can exhibit the same behavior.
To identify the source of the problem you should either check the execution plan for the input Dataset (explain(true)) or check SQL tab of the Spark Web UI.
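As an illustration of that check, the sketch below builds a Dataset with several partitions, applies a window without PARTITION BY (one of the causes listed above), and inspects both the plan and the partition count. It assumes a local SparkSession; the object and helper names are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object SinglePartitionCheck {
  // A window with ORDER BY but no PARTITION BY needs all rows in one partition.
  def withGlobalRowNumber(df: DataFrame, orderCol: String): DataFrame =
    df.withColumn("row_number", row_number().over(Window.orderBy(orderCol)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("single-partition-check").getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).toDF("some_column").repartition(8)
    val numbered = withGlobalRowNumber(df, "some_column")

    // The physical plan shows an exchange that moves everything to a single partition.
    numbered.explain(true)

    // Partition counts before and after the window; a count of 1 means a single write task.
    println(s"before window: ${df.rdd.getNumPartitions}")
    println(s"after window:  ${numbered.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Running the same two checks on the DataFrame right before saveAsTable, and then on its upstream inputs, narrows down where the data collapses into one partition.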