How to merge multiple HDFS files into one - pyspark

I am working on a Zeppelin cluster (with Spark). When I use write.parquet(), I end up with multiple Parquet files.
Is it possible to merge them into a single file, or do I have to use a path like "/folder/*" every time I read the data?

Use repartition():
df.repartition(1).write.parquet(path)
or, better, coalesce(), which avoids the full shuffle that repartition() performs when reducing the partition count:
df.coalesce(1).write.parquet(path)
Note that the output is still a directory; it will just contain a single part-*.parquet file, so you can point the reader at the directory path without a wildcard.

Related

Merging files within the Azure Data Lake container

Consider the scenario below.
I want my data to flow as follows:
import container ---> Databricks (transform) ---> export container
This is the current situation after the transformation process:
container:
---import
--folder
--mydata.csv
---export
--folder
--part-1-transformed-mydata.csv
--part-2-transformed-mydata.csv
--part-3-transformed-mydata.csv
--initial.txt
--success.txt
--finish.txt
I want the structure below:
---import
--folder
--mydata.csv
---export
--folder
--transformed-mydata.csv
What is the preferred way to do this (the data is a few GB, less than 10) within Databricks? I am also happy to use any functionality in Data Factory, since this Databricks notebook is a step in a pipeline.
Note: I am using Apache Spark 3.0.0 with Scala 2.12 in Databricks, on a standard cluster with 14 GB of memory and 4 cores.
You will either need to repartition the data into a single partition (note this defeats the point of using a distributed computing platform),
or, after the files are generated, simply run a command to concatenate them into a single file. This can be problematic if each file has a header, which you will need to account for in your logic.
It might be better to think of the export folder as the "file", if that makes sense. That doesn't solve your problem, but unless you need to produce a single file for some reason, most consumers won't have an issue reading the data in a directory.
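For the single-file route on Databricks, here is a rough sketch of what it could look like in the notebook, assuming the export container is mounted under /mnt/export and that transformedDf holds the result of the transformation (the mount point, folder names and dataframe name are placeholders):
// Write everything into one partition; fine for a few GB, but single-threaded.
val tmpDir = "/mnt/export/folder/_tmp"
val target = "/mnt/export/folder/transformed-mydata.csv"
transformedDf.coalesce(1)
  .write.mode("overwrite")
  .option("header", "true")
  .csv(tmpDir)
// Find the single part file Spark produced, move it to the desired name,
// then drop the temporary directory along with its marker files.
val partFile = dbutils.fs.ls(tmpDir).filter(_.name.startsWith("part-")).head.path
dbutils.fs.mv(partFile, target)
dbutils.fs.rm(tmpDir, true)  // recurse = true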

What is the fastest way to read a few lines out of a large HDFS dir using Spark?

My goal is to read a few lines from a large HDFS directory; I'm using Spark 2.2.
The directory was generated by a previous Spark job, and each task produced a single small file in it, so the whole directory is about 1 GB and contains thousands of small files.
When I use collect(), head() or limit(), Spark loads all the files and creates thousands of tasks (as seen in the Spark UI), which takes a lot of time, even though I just want to show the first few lines of the files in this directory.
So what is the fastest way to read this directory? Ideally the solution would load only a few lines of data, to save time.
Following is my code:
sparkSession.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load(file).limit(20).toJSON.toString()
sparkSession.sql(s"select * from $file").head(100).toString
sparkSession.sql(s"select * from $file").limit(100).toString
If you point Spark directly at the directory, it will load all the files anyway and only then take the records. So, before the Spark logic, first get a single file name from the directory using whatever language you are working in (Java, Scala or Python) and pass that file name to the read method; that way Spark won't load all the files.
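A minimal sketch of that approach in Scala, assuming the directory contains CSV part files; hdfsDir is a placeholder for your path:
import org.apache.hadoop.fs.{FileSystem, Path}
val hdfsDir = "/path/to/large/dir"
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
// Pick a single non-empty part file instead of the whole directory.
val oneFile = fs.listStatus(new Path(hdfsDir))
  .filter(s => s.isFile && s.getLen > 0)
  .head.getPath.toString
// Only this one small file is scanned, so only a handful of tasks are created.
sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(oneFile)
  .limit(20)
  .show()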

Use of Spark to optimize S3-to-S3 transfer

I am learning Spark/Scala and trying to experiment with the scenario below using Scala.
Scenario: Copy multiple files from one S3 bucket folder to another S3 bucket folder.
Things done so far:
1) Use the AWS S3 SDK and Scala:
- Create a list of files from the S3 source locations.
- Iterate through the list, pass the source and target S3 locations from step 1, and use the S3 API copyObject to copy each of these files to the target locations (configured).
This works.
However, I am trying to understand whether this is the most efficient way when there is a large number of files across multiple folders, or whether I can use Spark to parallelize the copying of the files.
The approach I am thinking of is:
1) Use the S3 SDK to get the source paths, similar to what's explained above
2) Create an RDD of the file keys using sc.parallelize() - something along these lines?
sc.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
.flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
3) Can I use sc.wholeTextFiles in some way to make this work?
I am not sure how to achieve this as of now.
Can you please help me understand if I am thinking in the right direction and also is this approach correct?
Thanks
I don't think AWS makes this complicated, though.
We had the same problem; we transferred around 2 TB in close to 10 minutes.
If you want to transfer from one bucket to another, it is better to use the built-in functionality to transfer within S3 itself:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
AWS CLI Command Example:
aws s3 sync s3://sourcebucket s3://destinationbucket
If you want to do it programmatically, you can use any of the SDKs to invoke the same type of operation. I would avoid reinventing the wheel.
Hope it helps.
I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it would be similar to do something for copy, where you'd drop down to the AWS library for that operation.
But: you may not need to push the work out to many machines, as each of the PUT/x-amz-copy-source calls may be slow but doesn't use any bandwidth. You could just start a single process with many threads and a large HTTP client pool and run them all in that process. Take the list, sort by the largest few first, then shuffle the rest at random to reduce throttling effects. Print out counters to help you profile.
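If you do want to push the copy through Spark, here is a rough sketch of the parallelised server-side copy, assuming the keys were already listed with the S3 SDK (step 1) and the executors have credentials for both buckets; listSourceKeys and the bucket names are placeholders:
import com.amazonaws.services.s3.AmazonS3ClientBuilder
val srcBucket = "source-bucket"
val dstBucket = "destination-bucket"
val keys: Seq[String] = listSourceKeys()  // hypothetical helper from step 1
sc.parallelize(keys, 64)  // spread the keys across tasks
  .foreachPartition { part =>
    // The client is not serializable, so build one per partition on the executor.
    val s3 = AmazonS3ClientBuilder.defaultClient()
    part.foreach { key =>
      // Server-side copy: no object data flows through the Spark executors.
      s3.copyObject(srcBucket, key, dstBucket, key)
    }
  }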

How to write data as single (normal) csv file in Spark? [duplicate]

This question already has answers here: Write single CSV file using spark-csv (16 answers). Closed 5 years ago.
I am trying to save a dataframe as a CSV file on my local drive. But when I do so, a folder is generated and the partition files are written inside it. Is there any suggestion to overcome this?
My Requirement:
To get a normal CSV file with the actual name given in the code.
Code Snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL;DR: You are trying to enforce sequential, in-core concepts on a distributed environment. It cannot end well.
Spark doesn't provide a utility like this. To be able to create one in a semi-distributed fashion, you'd have to implement a multistep, source-dependent protocol where:
You write the header.
You write the data files for each partition.
You merge the files and give the result a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores), nothing like this is implemented in Spark.
You can of course collect the data, use a standard CSV writer (Univocity, Apache Commons) and then put it in the storage of your choice. This is sequential and requires multiple data transfers.
There is no automatic way to do this. I see two solutions:
If the local directory is mounted on all the executors: write the file as you did, but then move/rename the part-*.csv file to the desired name.
Or, if the directory is not available on all executors: collect the dataframe to the driver and then create the file using plain Scala.
But both solutions somewhat destroy parallelism, and thus the goal of Spark.
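A minimal sketch of the second option, assuming the dataframe fits in driver memory; the output path is a placeholder, and the CSV formatting is naive (no quoting or escaping), so use a proper CSV writer for real data:
import java.io.PrintWriter
// Pull everything to the driver; only viable for small dataframes.
val rows = dataframe.collect()
val header = dataframe.columns.mkString(",")
val body = rows.map(_.toSeq.mkString(",")).mkString("\n")
// Write a single file with exactly the requested name.
val writer = new PrintWriter("E:/dataframe.csv")
try writer.write(header + "\n" + body) finally writer.close()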
It is not possible directly, but you can do something like this:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/data/")
import org.apache.hadoop.fs._
// Locate the single part file that Spark wrote into the output directory
// and rename it to the desired file name (coalesce(1) guarantees exactly one part file).
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
val fileName = fs.globStatus(new Path(filePath + "part*"))(0).getPath.getName
fs.rename(new Path(filePath + fileName), new Path(filePath + "dataframe.csv"))

How to automatically edit over 100k files on GCS using Dataflow?

I have over 100 thousand files on Google Cloud Storage that contain JSON objects, and I'd like to create a mirror maintaining the filesystem structure, but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>
But Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files and output them with the same directory and file structure?
Alternatively, is there a better system for doing this kind of edit on a large number of files?
You cannot directly use TextIO for this, but Beam 2.2.0 will include a feature that will help you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0.
Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern
Map the PCollection<Metadata> with a ParDo that does what you want to each file using FileSystems:
Use the FileSystems.open() API to read the input file and then standard Java utilities for working with ReadableByteChannel.
Use FileSystems.create() API to write the output file.
Note that Match is a very simple PTransform (it uses FileSystems under the hood); another way to use it is to copy-paste (the necessary parts of) its code into your project, or to study its code and reimplement something similar. This can be an option if you're hesitant to update your Beam SDK version.
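For illustration, here is a rough sketch (Scala calling Beam's Java FileSystems API) of the per-file logic that could run inside the ParDo from step 2; mirrorFile, the prefix arguments and the transform function are assumptions, not Beam APIs:
import java.nio.channels.Channels
import java.nio.charset.StandardCharsets
import org.apache.beam.sdk.io.FileSystems
import org.apache.beam.sdk.io.fs.MatchResult.Metadata
// Reads one matched file, rewrites its content, and writes it under the mirror
// prefix with the same relative path (e.g. reports/YYYY/MM/DD/<filename>).
def mirrorFile(file: Metadata, inputPrefix: String, outputPrefix: String)
              (transform: String => String): Unit = {
  val inPath  = file.resourceId.toString
  val outPath = outputPrefix + inPath.stripPrefix(inputPrefix)
  // Read the whole input file through the FileSystems API.
  val in = Channels.newInputStream(FileSystems.open(file.resourceId))
  val content = try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
  // Write the transformed JSON to the mirrored location.
  val outId = FileSystems.matchNewResource(outPath, false /* isDirectory */)
  val out = Channels.newOutputStream(FileSystems.create(outId, "application/json"))
  try out.write(transform(content).getBytes(StandardCharsets.UTF_8)) finally out.close()
}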