Reading the last 3 directories' data into a Spark DataFrame recursively - Scala

Is there any way to read the last 3 hours of files into a Spark DataFrame?
On HDFS I have a directory structure like:
/user/hdfs/test/partition_date=2022-09-14/hour=01/fil1.csv
/user/hdfs/test/partition_date=2022-09-14/hour=01/fil2.csv
/user/hdfs/test/partition_date=2022-09-14/hour=01/fil3.csv
/user/hdfs/test/partition_date=2022-09-14/hour=02/fil1.csv
/user/hdfs/test/partition_date=2022-09-14/hour=02/fil2.csv
/user/hdfs/test/partition_date=2022-09-14/hour=02/fil3.csv
/user/hdfs/test/partition_date=2022-09-14/hour=03/file1.csv
/user/hdfs/test/partition_date=2022-09-14/hour=03/file2.csv
/user/hdfs/test/partition_date=2022-09-14/hour=04/file1.csv
Now I want to read only the last 3 hours of data, so the 3 files below should be loaded into Spark.
/user/hdfs/test/partition_date=2022-09-14/hour=03/file1.csv
/user/hdfs/test/partition_date=2022-09-14/hour=03/file2.csv
/user/hdfs/test/partition_date=2022-09-14/hour=04/file1.csv
Can someone please guide me on how to read the data from these folders?
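For illustration, here is a minimal sketch of one way this could be done, assuming the partition layout shown above, that the "last 3 hours" are computed relative to the current time, and that all three hour directories exist (missing ones would need to be filtered out first):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LastThreeHours").getOrCreate()

// Base path from the question.
val basePath = "/user/hdfs/test"

// Build the partition paths for the last 3 hours relative to "now".
val now = LocalDateTime.now()
val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val lastThreeHourPaths = (0 until 3).map { i =>
  val t = now.minusHours(i)
  f"$basePath/partition_date=${t.format(dateFmt)}/hour=${t.getHour}%02d"
}

// Read only those partition directories; the basePath option keeps the
// partition columns (partition_date, hour) in the resulting DataFrame.
val df = spark.read
  .option("header", "true")
  .option("basePath", basePath)
  .csv(lastThreeHourPaths: _*)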

Related

How to save a Dataset[Row] as a text file in Spark? [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Write spark dataframe to single parquet file
(2 answers)
Closed 3 years ago.
I would like to save a Dataset[Row] as a text file with a specific name in a specific location.
Can anybody help me?
I have tried this, but it produces a folder (LOCAL_FOLDER_TEMP/filename) with a Parquet file inside it:
Dataset.write.save(LOCAL_FOLDER_TEMP+filename)
Thanks
You can't save your dataset to a specific filename using the Spark API; there are multiple workarounds to do that.
As Vladislav offered, collect your dataset, then write it to your filesystem using the Scala/Java/Python API.
Apply repartition/coalesce(1), write your dataset, and then change the filename.
Neither is really recommended, because with large datasets it can cause OOM, or you simply lose the benefit of Spark's parallelism.
The second issue, that you are getting a Parquet file, is because Parquet is Spark's default format; you should use:
df.write.format("text").save("/path/to/save")
Please use
RDD.saveAsTextFile()
It writes the elements of the dataset as a text file (or set of text files) in a given directory on the local filesystem, HDFS, or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Reference: rdd-programming-guide
Spark always creates multiple files, one file per partition. If you want a single file, you need to do collect() and then just write it to a file the usual way.
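As a rough sketch of the repartition/coalesce(1)-then-rename workaround mentioned above (assuming the dataset is called df and a SparkSession named spark is in scope; the temporary and target paths are hypothetical, and note that the text source requires a single string column):

import org.apache.hadoop.fs.{FileSystem, Path}

// Write to a temporary directory with a single partition.
val tmpDir = "/tmp/mydata_tmp"            // hypothetical temp location
val finalFile = "/user/test/mydata.txt"   // hypothetical target file name
df.coalesce(1).write.format("text").save(tmpDir)

// Locate the single part file Spark produced and rename it.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(s"$tmpDir/part-*"))(0).getPath
fs.rename(partFile, new Path(finalFile))
fs.delete(new Path(tmpDir), true)         // clean up the temp directory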

how to decompress and read a file containing multiple compressed file in spark

I have a file AA.zip which in turn contains multiple files, for example aa.tar.gz, bb.tar.gz, etc.
I need to read these files in Spark with Scala; how can I achieve that?
The only problem here is extracting the contents of the zip file.
ZIPs on HDFS are going to be a bit tricky because they don't split well, so you'll have to process one or more whole zip files per executor. This is also one of the few cases where you probably have to fall back to SparkContext, because for some reason binary file support in Spark is not that good.
https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext
There is a binaryFiles method there which gives you access to the zip binary data, which you can then process with the usual ZIP handling from Java or Scala.
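A rough sketch of that approach, using java.util.zip against the bytes exposed by binaryFiles (assuming a SparkContext named sc; the path is illustrative, and the further handling of each entry, e.g. the tar.gz contents, is left out):

import java.io.ByteArrayOutputStream
import java.util.zip.ZipInputStream
import scala.collection.mutable.ArrayBuffer

// Each element of binaryFiles is (path, PortableDataStream); ZIPs are not
// splittable, so each archive is handled entirely by one task.
val zipEntries = sc.binaryFiles("/path/to/AA.zip").flatMap { case (_, stream) =>
  val zis = new ZipInputStream(stream.open())
  val entries = ArrayBuffer.empty[(String, Array[Byte])]
  var entry = zis.getNextEntry
  while (entry != null) {
    if (!entry.isDirectory) {
      // Read the full entry (e.g. aa.tar.gz) into memory.
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](4096)
      var n = zis.read(buf)
      while (n != -1) { out.write(buf, 0, n); n = zis.read(buf) }
      entries += ((entry.getName, out.toByteArray))
    }
    entry = zis.getNextEntry
  }
  zis.close()
  entries
}
// Each (name, bytes) pair can then be untarred/gunzipped with the usual Java/Scala libraries.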

What is the fastest way to read a few lines out of a large HDFS directory using Spark?

My goal is to read a few lines out of a large HDFS directory; I'm using Spark 2.2.
The directory was generated by a previous Spark job, and each task generated a single little file in it, so the whole directory is about 1 GB in size and contains thousands of little files.
When I use collect() or head() or limit(), Spark loads all the files and creates thousands of tasks (as seen in the Spark UI), which costs a lot of time, even though I just want to show the first few lines of the files in this directory.
So what is the fastest way to read this directory? Ideally the solution would load only a few lines of data, so it would save time.
Following is my code:
sparkSession.sqlContext.read.format("csv").option("header","true").option("inferschema","true").load(file).limit(20).toJSON.toString()
sparkSession.sql(s"select * from $file").head(100).toString
sparkSession.sql(s"select * from $file").limit(100).toString
If you use Spark directly, it will load all the files anyway and only then take the records. So, even before the Spark logic, you have to get a single file name from the directory using your technology of choice (Java, Scala, or Python) and pass that one file name to the read method; that way Spark won't load all the files.
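A sketch of that idea, using the Hadoop FileSystem API to pick a single part file before involving Spark (the directory path and the part- prefix are illustrative; sparkSession is the session from the question):

import org.apache.hadoop.fs.{FileSystem, Path}

val dirPath = "/path/to/large/dir"   // illustrative directory path

// List the directory outside of Spark and pick a single part file.
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
val firstFile = fs.listStatus(new Path(dirPath))
  .filter(s => s.isFile && s.getPath.getName.startsWith("part-"))
  .head
  .getPath
  .toString

// Read only that one file, so only a handful of tasks are created.
val fewRows = sparkSession.read
  .format("csv")
  .option("header", "true")
  .load(firstFile)
  .limit(20)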

Load DataFrame as Text File into HDFS and S3 [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 5 years ago.
I am trying to load a DataFrame into HDFS and S3 as a text-format file using the code below. The DataFrame's name is finalData.
val targetPath = "/user/test/File"
val now = Calendar.getInstance().getTime()
val formatter = new SimpleDateFormat("yyyyMMddHHmmss")
val timeStampAfterFormatting = formatter.format(now)
val targetFile = s"""$targetPath/test_$timeStampAfterFormatting.txt"""
finalData.repartition(1).rdd.saveAsTextFile(targetFile)
Using the above code I can load the data successfully, but the file name is not the one I provided, and it is also not in text format. A directory has been created with the name I specified.
Directory Name - /user/test/File/test_20170918055206.txt
-bash-4.2$ hdfs dfs -ls /user/test/File/test_20170918055206.txt
Found 2 items
/user/test/File/test_20170918055206.txt/_SUCCESS
/user/test/File/test_20170918055206.txt/part-00000
I want to create the file with the name I specified instead of creating a directory. Can anyone please assist me?
In my opinion, this is working as designed.
You have a repartition operation just before you save your RDD data; this triggers a shuffle across your whole data and eventually produces a new RDD with only one partition.
So that single partition is what is stored on HDFS by your saveAsTextFile operation.
The method is designed this way so that an arbitrary number of partitions can be written in a uniform way.
For example, if your RDD has 100 partitions and you do no coalesce or repartition before writing to HDFS, you will get a directory containing a _SUCCESS flag and 100 part files!
If the method were not designed this way, how could an RDD with multiple partitions be stored in a concise, uniform and elegant way? The user might have to specify all the file names, which would be tedious.
I hope this explanation helps you.
If you then need a single complete file on your local file system, just try the Hadoop client command:
hadoop fs -getmerge [src] [des]
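The same merge can also be done programmatically with FileUtil.copyMerge (present in Hadoop 2.x, removed in Hadoop 3), which concatenates all part files of a directory into one target file; a sketch using the directory from the question and a hypothetical target name:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// Directory Spark wrote (contains _SUCCESS and part-00000) and the
// desired single output file (hypothetical name).
val srcDir  = new Path("/user/test/File/test_20170918055206.txt")
val dstFile = new Path("/user/test/File/test_20170918055206_merged.txt")

// Merge all part files into one file; `true` deletes the source directory
// afterwards, and the last argument is an optional string appended after
// each part (none here).
FileUtil.copyMerge(fs, srcDir, fs, dstFile, true, conf, null)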

Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

This question already has answers here:
Read whole text files from a compression in Spark
(2 answers)
Closed 6 years ago.
I'm trying to create a Spark RDD from several JSON files compressed into a tar archive.
For example, I have 3 files
file1.json
file2.json
file3.json
And these are contained in archive.tar.gz.
I want to create a DataFrame from the JSON files. The problem is that Spark is not reading the JSON files in correctly. Creating an RDD using sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") results in garbled/extra output.
Is there some way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method given in the answer to Read whole text files from a compression in Spark I was able to get things running, but this method does not seem to be suitable for large tar.gz archives (>200 MB compressed), as the application chokes on large archive sizes. Since some of the archives I'm dealing with reach sizes of up to 2 GB after compression, I'm wondering if there is some efficient way to deal with the problem.
I'm trying to avoid extracting the archives and then merging the files together as this would be time consuming.
A solution is given in Read whole text files from a compression in Spark.
Using the code sample provided, I was able to create a DataFrame from the compressed archive like so:
val jsonRDD = sc.binaryFiles("gzarchive/*")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))
val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
This method works fine for tar archives of a relatively small size, but is not suitable for large archive sizes.
A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
See: A Million Little Files – Digital Digressions by Stuart Sierra.
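For reference, the extractFiles and decode helpers used in the snippet above come from the linked answer; a rough reconstruction, assuming Apache Commons Compress is on the classpath, looks something like this:

import java.io.ByteArrayOutputStream
import java.nio.charset.{Charset, StandardCharsets}
import scala.collection.mutable.ArrayBuffer
import scala.util.Try
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream

// Untar a gzipped tar archive delivered as a PortableDataStream and
// return the raw bytes of every regular file inside it.
def extractFiles(ps: PortableDataStream): Try[Seq[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open()))
  val files = ArrayBuffer.empty[Array[Byte]]
  var entry = tar.getNextTarEntry
  while (entry != null) {
    if (!entry.isDirectory) {
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](4096)
      var n = tar.read(buf)
      while (n != -1) { out.write(buf, 0, n); n = tar.read(buf) }
      files += out.toByteArray
    }
    entry = tar.getNextTarEntry
  }
  tar.close()
  files.toSeq
}

// Turn the raw bytes of one file into a String (UTF-8 by default).
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
  new String(bytes, charset)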
Files inside a *.tar.gz file, as you have already mentioned, are compressed. You cannot put the 3 files into a single compressed tar archive and expect the import function (which looks only for text) to know how to decompress the files, unpack them from the tar archive, and then import each file individually.
I would recommend you take the time to upload each individual JSON file manually, since neither sc.textFile nor sqlContext.read.json can handle compressed archives like this.