I want to partition my dataframe with repartitionByRange(n, col('value')). However, each file shouldn't be larger than 128 MB, and I don't want very small files either, so I need to be able to select the value of n dynamically.
It says here that the sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.
How can I use sampleSizePerPartition in pyspark? Can somebody give me an example of the usage?
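For illustration, here is a minimal PySpark sketch of how that config and a dynamically chosen n might be combined. The config call itself is standard, but total_bytes, the 128 MB target and the output path are hypothetical, and the size estimate has to come from somewhere you control (for example, the sizes of the source files):
import math
from pyspark.sql.functions import col

# Hypothetical: total_bytes is an estimate of the dataframe's size on disk,
# e.g. obtained from the sizes of the input files.
target_bytes = 128 * 1024 * 1024
n = max(1, math.ceil(total_bytes / target_bytes))

# Raise the number of sampled rows per partition so the range boundaries
# reflect skewed data better (the default is 100).
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 500)

df.repartitionByRange(n, col("value")).write.parquet("/path/to/output")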
I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.
To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns.
print((df.count(), len(df.columns)))
(33030, 45502)
This data is written to a table in Hive using the following PySpark command (I believe this is standard):
df.write.mode('overwrite').saveAsTable("WIDE_TABLE")
Unfortunately this job does not finish within an 'acceptable' time (10 hours). I cancel it and hence don't have an error message.
When I reduce the number of columns with
df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")
it fares better and finishes in 9.87 minutes, so the method should work.
Can this be solved:
With a better compute instance?
With a better script?
Not at all, and if so, is there another approach?
[EDIT to address questions in comments]
Runtime and driver summary:
2-16 Workers, 112-896 GB Memory, 32-256 Cores (Standard_DS5_v2)
1 Driver, 56 GB Memory, 16 Cores (same as worker)
Runtime: 10.4.x-scala2.12
To give an impression of the timings I've added a table below.
columns    time (mins)
10         1.94
100        1.92
200        3.04
500        9.87
1000       25.91
5000       938.4
Data type of the remaining columns is Integer.
As far as I know, I'm writing the table in the same environment that I am working in. Data flow: Azure Blob CSV -> data read and wrangling -> PySpark DataFrame -> Hive table. The last three steps are on the same cloud machine.
Hope this helps!
I think your case is not related to Spark resource configuration or network connection; it's related to the design of Spark itself.
Long story short, Spark is designed for long and narrow data, which is exactly the opposite of your dataframe. Looking at your experiment, the time consumed grows roughly exponentially as the number of columns increases. Although it's about reading CSV rather than writing a table, you can check this post for a good explanation of why Spark is not good at handling wide dataframes: Spark csv reading speed is very slow although I increased the number of nodes
Although I haven't used Azure AutoML before, based on your dataset and goal, I think you can try the following (see the sketch after this list):
Use a Python pandas DataFrame and a Hive connection library to see if there is any performance improvement
Concatenate all your columns into a single Array / Vector column before you write to Hive
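For the second suggestion, a minimal PySpark sketch could look like the following, assuming the response column is called label (both that name and the target table name are hypothetical):
from pyspark.sql import functions as F

# Assumed: "label" is the response column; every other column is an integer feature.
feature_cols = [c for c in df.columns if c != "label"]

# Collapse the wide frame into two columns: the label and one array column.
narrow_df = df.select(
    F.col("label"),
    F.array(*[F.col(c) for c in feature_cols]).alias("features"),
)

narrow_df.write.mode("overwrite").saveAsTable("WIDE_TABLE_AS_ARRAY")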
While working on a Spark job, I ran into the problem of data skew.
I am working on spatial data and it looks like this:
<key - id of area; value - some metadata that also contains a size and a link to a binary file>.
The main problem is that the data is not distributed well. Since it is spatial data, areas that contain big cities, for instance, are very large, but areas with fields and countryside are very small.
So some Spark partitions of one RDD have a size of 400 KB and others > 1 GB. As a result, the last tasks run for a long time.
The key contains only numbers (like 45949539) and at the moment all data is partitioned by the hash partitioner using this key - completely randomly.
My next transformations are flatMap and reduceByKey. (Yes, the long-running last tasks are here.)
<id1, data1> -> [<id1, data1>, <id2, data2> ....]
During the flatMap I produce all my internal data from the binary file, so I know the size of each file (as I said before, from the metadata).
But I need a specific partitioner that can mix large and small areas within one partition.
Is it possible to do this in Spark?
This job is a kind of legacy job and I can use the RDD API only.
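For illustration, the RDD API does let you plug in your own partitioning function. Below is a minimal PySpark sketch (the same idea applies in Scala by subclassing org.apache.spark.Partitioner); sizes_by_key, NUM_PARTITIONS and rdd are hypothetical, and the per-key sizes are assumed to be available from the metadata:
NUM_PARTITIONS = 16

# Greedy bin packing: assign each key to the currently lightest partition,
# processing the largest areas first so big and small areas get mixed.
loads = [0] * NUM_PARTITIONS
assignment = {}
for key, size in sorted(sizes_by_key.items(), key=lambda kv: -kv[1]):
    target = loads.index(min(loads))
    assignment[key] = target
    loads[target] += size

# Ship the mapping to the executors and use it as the partition function.
assignment_bc = sc.broadcast(assignment)

def size_aware_partitioner(key):
    # Fall back to hash partitioning for keys without size information.
    return assignment_bc.value.get(key, hash(key) % NUM_PARTITIONS)

balanced_rdd = rdd.partitionBy(NUM_PARTITIONS, size_aware_partitioner)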
I have a data source that consists of a huge number of small files. I would like to save this, partitioned by the column user_id, to another storage:
sdf = spark.read.json("...")
sdf.write.partitionBy("user_id").json("...")
The reason for this is that I want another system to be able to delete only selected users' data upon request.
This works, but I still get many files within each partition (due to my input data). For performance reasons I would like to reduce the number of files within each partition, ideally to just one (the process will run each day, so having one output file per user per day would work well).
How do I achieve this with pyspark?
You can use repartition to ensure that each partition gets one file:
sdf.repartition('user_id').write.partitionBy("user_id").json("...")
This will make sure one file is created for each partition, but in the case of coalesce, it can cause trouble if there is more than one partition.
Alternatively, just add coalesce and the number of files you want:
sdf.coalesce(1).write.partitionBy("user_id").json("...")
I'm trying to edit the Hadoop block size configuration through the Spark shell so that the Parquet part files generated are of a specific size. I tried setting several variables this way:
val blocksize:Int = 1024*1024*1024
sc.hadoopConfiguration.setInt("dfs.blocksize", blocksize) //also tried dfs.block.size
sc.hadoopConfiguration.setInt("parquet.block.size", blocksize)
val df = spark.read.csv("/path/to/testfile3.txt")
df.write.parquet("/path/to/output/")
The test file is a large text file of almost 3.5 GB. However, no matter what block size I specify or approach I take, the number of part files created and their sizes are the same. It is possible for me to change the number of part files generated using the repartition and coalesce functions, but I have to use an approach that does not shuffle the data in the dataframe in any way!
I have also tried specifying
df.write.option("parquet.block.size", 1048576).parquet("/path/to/output")
But with no luck. Can someone please highlight what I am doing wrong? Also, is there any other approach I can use to alter the Parquet block sizes that are written to HDFS?
I'm a bit confused about the right way to save data to HDFS after processing it with Spark.
This is what I'm trying to do. I'm calculating the min, max and SD of numeric fields. My input files have millions of rows, but the output will have only around 15-20 fields. So the output is a single value (scalar) for each field.
For example: I will load all the rows of FIELD1 into an RDD, and at the end, I will get 3 single values for FIELD1 (MIN, MAX, SD). I concatenate these three values into a temporary string. In the end, I will have 15 to 20 rows, containing 4 columns in the following format:
FIELD_NAME_1 MIN MAX SD
FIELD_NAME_2 MIN MAX SD
This is a snippet of the code:
//create rdd
val data = sc.textFile("hdfs://x.x.x.x/"+args(1)).cache()
//just get the first column
val values = data.map(_.split(",",-1)(1))
val data_double = values.map(x => if (x == "") 0.0 else x.toDouble)
val min_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(true).take(1)(0)._1
val max_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(false).take(1)(0)._1
val SD = data_double.stdev
So, I have 3 variables, min_value, max_value and SD, that I want to store back to HDFS.
Question 1:
Since the output will be rather small, do I just save it locally on the server, or should I dump it to HDFS? It seems to me that saving the file locally makes more sense.
Question 2:
In Spark, I can just call the following to save an RDD to a text file:
some_RDD.saveAsTextFile("hdfs://namenode/path")
How do I accomplish the same thing for a String variable that is not an RDD in Scala? Should I parallelize my result into an RDD first and then call saveAsTextFile?
To save locally just do
some_RDD.collect()
Then save the resulting array with something like the approach from this question. And yes, if the data set is small and can easily fit in memory, you should collect it and bring it to the driver of the program. Another option, if the data is a little too large to store in memory, is just some_RDD.coalesce(numPartitionsToStoreOn). Keep in mind that coalesce also takes a boolean shuffle parameter; if you are doing calculations on the data before coalescing, you should set this to true to get more parallelism for those calculations. Coalesce will reduce the number of nodes that store data when you call some_RDD.saveAsTextFile("hdfs://namenode/path"). If the file is very small but you need it on HDFS, call repartition(1), which is the same as coalesce(1, true); this will ensure that your data is only saved on one node.
UPDATE:
So if all you want to do is save three values to HDFS, you can do this:
sc.parallelize(List((min_value,max_value,SD)),1).saveAsTextFile("pathTofile")
Basically you are just putting the 3 vars in a tuple, wrapping that in a List, and setting the parallelism to one since the data is very small.
Answer 1: Since you just need a few scalars, I'd suggest storing them in the local file system. You can first do val localValue = rdd.collect(), which will collect all data from the workers to the driver, and then use java.io to write things to disk.
Answer 2: You can do sc.parallelize(Seq(yourString)).saveAsTextFile("hdfs://host/yourFile"). This will write things to part-000* files. If you want everything in one file, hdfs dfs -getmerge is here to help you.