my_data.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
It usually works fine, but when there is very little data, some of the Avro files are empty.
The number of files varies from run to run, and when there are fewer rows than files, some files are empty and contain only the column information.
Is there a way to control the number of output Avro files based on the number of rows? Or to avoid creating an output file when there is no data?
The number of files depends on how many partitions your DataFrame has. Each partition creates its own file. If you know that there is not much data to write, you can repartition the DataFrame before writing it.
my_data.repartition(1)
.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
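To see why small inputs leave empty files, here is a minimal plain-Python model (not Spark code). It assumes rows are dealt out to partitions round-robin, which is only an approximation of what Spark does; the function name rows_per_partition is hypothetical. Whatever the real distribution, with fewer rows than partitions at least (partitions - rows) partitions must be empty.

```python
def rows_per_partition(num_rows, num_partitions):
    """Model only: deal rows out round-robin and count rows per partition."""
    counts = [0] * num_partitions
    for i in range(num_rows):
        counts[i % num_partitions] += 1
    return counts


# 3 rows spread over 8 partitions: most partitions (and so most files) are empty.
counts = rows_per_partition(3, 8)
empty = counts.count(0)
print(counts, "empty partitions:", empty)

# After the equivalent of repartition(1) there is a single, non-empty partition.
single = rows_per_partition(3, 1)
print(single, "empty partitions:", single.count(0))
```

Under this model, repartitioning to one partition is what removes the empty files, at the cost of writing everything from a single task.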
I have a pyspark.sql.DataFrame that I would like to save as .csv. This is what I am doing.
df.toPandas().to_csv('myDF.csv')
Is it possible to partition the data in different chunks and save them as separate files?
You can achieve this with either of the following:
df.repartition(<number of files you want>).write.csv(<output path>)
df.coalesce(<number of files you want>).write.csv(<output path>)
Do not convert the Spark DataFrame to pandas; save it to file directly.
I have a file AA.zip, which in turn contains multiple files, e.g. aa.tar.gz, bb.tar.gz, etc.
I need to read these files in Spark with Scala. How can I achieve that?
The only problem here is extracting the contents of the ZIP file.
ZIPs on HDFS are going to be a bit tricky because they don't split well, so you'll have to process one or more ZIP files per executor. This is also one of the few cases where you probably have to fall back to SparkContext, because for some reason binary file support in Spark is not that good.
https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext
There's a binaryFiles method there that gives you access to the ZIP binary data, which you can then process with the usual ZIP handling from Java or Scala.
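The extraction step itself can be sketched in plain Python with the standard zipfile and tarfile modules (the question asks for Scala, where java.util.zip.ZipInputStream would play the role of zipfile). The helper names extract_tar_gz_members and build_sample_zip and the sample data are hypothetical. In PySpark, sc.binaryFiles yields (path, bytes) pairs, so a function like this could in principle be applied to each pair's bytes with flatMap; that wiring is an assumption and is not shown here.

```python
import io
import tarfile
import zipfile


def extract_tar_gz_members(zip_bytes):
    """Yield (archive_name, member_name, content) for each regular file
    inside each .tar.gz entry of a ZIP held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".tar.gz"):
                continue
            inner = io.BytesIO(zf.read(name))
            with tarfile.open(fileobj=inner, mode="r:gz") as tf:
                for member in tf.getmembers():
                    if member.isfile():
                        yield name, member.name, tf.extractfile(member).read()


def build_sample_zip():
    """Build a small in-memory ZIP with two .tar.gz entries (made-up data)."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w") as zf:
        for archive, text in [("aa.tar.gz", b"hello from aa\n"),
                              ("bb.tar.gz", b"hello from bb\n")]:
            tar_buf = io.BytesIO()
            with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
                info = tarfile.TarInfo(name="data.txt")
                info.size = len(text)
                tf.addfile(info, io.BytesIO(text))
            zf.writestr(archive, tar_buf.getvalue())
    return zip_buf.getvalue()


records = list(extract_tar_gz_members(build_sample_zip()))
for record in records:
    print(record)
```

Note that this reads each archive fully into memory, so it only suits ZIP files that fit comfortably on one executor.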
I need to convert all text files in a folder that are gzipped to parquet. I wonder if I need to gunzip them first or not.
Also, I'd like to partition each file into 100 parts.
This is what I have so far:
sc.textFile("s3://bucket.com/files/*.gz").repartition(100).toDF()
.write.parquet("s3://bucket.com/parquet/")
Is this correct? Am I missing something?
Thanks.
You don't need to uncompress files individually. The only problem with reading gzip files directly is that your reads won't be parallelized. That means, irrespective of the size of the file, you will only get one partition per file because gzip is not a splittable compression codec.
You might face problems if individual files are greater than a certain size (2GB?) because there's an upper limit to Spark's partition size.
Other than that your code looks functionally alright.
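As a rough illustration of why a gzip file ends up in a single partition, the plain-Python sketch below (not Spark code) compresses some made-up lines and then tries to start decoding from the middle of the stream. The helper name can_decode_from is hypothetical. A reader that begins at an arbitrary offset has no gzip header to start from, which is the situation a second task would be in if the file were split.

```python
import gzip
import zlib


def can_decode_from(data, offset):
    """Return True if a gzip decoder can start decoding at the given offset."""
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect a gzip header
    try:
        decoder.decompress(data[offset:])
        return True
    except zlib.error:
        return False


# Made-up sample data: 10,000 short lines.
text = "".join(f"line {i}\n" for i in range(10000)).encode("utf-8")
blob = gzip.compress(text)

round_trip_ok = gzip.decompress(blob) == text
print("full stream decodes:", round_trip_ok)
print("decodes from start:", can_decode_from(blob, 0))
print("decodes from middle:", can_decode_from(blob, len(blob) // 2))
```

Calling repartition after the read, as in the question, is what spreads the data out again once it has been decompressed.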
I understand the basic theory of textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values, where the key is the path of each file, the value is the content of each file.
Now, from a technical point of view, what's the difference between :
val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions
and
val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions
In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?
The main difference, as you mentioned, is that textFile will return an RDD with each line as an element while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.
When reading uncompressed files with textFile, it will split the data into chunks of 32MB. This is advantageous from a memory perspective. This also means that the ordering of the lines is lost; if the order should be preserved, then wholeTextFiles should be used.
wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.
textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values
That's not accurate:
textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (depends on the number of partitions requested, Spark's default number of partitions, and the underlying File System). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it - i.e. there's no way to know which file contained which line. The order of the records in the RDD will follow the alphabetical order of files, and the order of records within the files (order is not "lost").
wholeTextFiles preserves the relation between data and the files that contained it, by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (might cause bad performance or OutOfMemoryError since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration - with multiple files potentially loaded into a single partition.
Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all files are small enough.
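The difference in record shape can be modelled in plain Python (this is not Spark code, and it ignores partitioning entirely). The function names text_file_records and whole_text_files_records and the sample files are hypothetical; the model assumes files are visited in sorted path order, as described above.

```python
import os
import tempfile


def text_file_records(paths):
    """Model of textFile: one record per line; the source file is not kept."""
    records = []
    for path in sorted(paths):
        with open(path) as f:
            records.extend(line.rstrip("\n") for line in f)
    return records


def whole_text_files_records(paths):
    """Model of wholeTextFiles: one (path, content) record per file."""
    records = []
    for path in sorted(paths):
        with open(path) as f:
            records.append((path, f.read()))
    return records


with tempfile.TemporaryDirectory() as directory:
    paths = []
    for name, body in [("a.csv", "1,x\n2,y\n"), ("b.csv", "3,z\n")]:
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write(body)
        paths.append(path)

    lines = text_file_records(paths)
    pairs = whole_text_files_records(paths)

print(lines)
print([(os.path.basename(p), content) for p, content in pairs])
```

In the first shape there is nothing left to say which file "3,z" came from; in the second, each file's whole content has to fit in a single record.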
As of Spark 2.1.1, the following is the code for textFile.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
It internally uses hadoopFile to read local files, HDFS files, and S3 files, using URI schemes such as file://, hdfs://, and s3a://.
The signature of wholeTextFiles, on the other hand, is as follows:
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope
If we compare them, both methods take the same parameters, but textFile is useful for reading files, whereas wholeTextFiles is meant for reading directories of small files. We can also use it with larger files, but performance may suffer.
So when you want to deal with large files, textFile is the better option, whereas for a directory of smaller files, wholeTextFiles is better.
textFile() reads a text file and returns an RDD of Strings. For example, sc.textFile("/mydata.txt") will create an RDD in which each individual line is an element.
wholeTextFiles() reads a directory of text files and returns a pair RDD.
For example, if there are a few files in a directory, the wholeTextFiles() method will create a pair RDD with the file path as the key and the whole file content, as a string, as the value.
See the example below for clarity:
textFile = sc.textFile("ml-100k/u1.data")
textFile.getNumPartitions()
Output- 2
i.e. 2 partitions
textFile = sc.wholeTextFiles("ml-100k/u1.data")
textFile.getNumPartitions()
Output - 1
i.e. Only one partition.
So in short, wholeTextFiles reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.