Parquet file size 0 after converting csv to parquet - scala

In my case I used spark-shell to convert a CSV file into a Parquet file. My CSV file was 126 MB, but after the conversion Hadoop shows the file size as 0, even though I can read the Parquet file using DataFrames. Is this normal, or is my Hadoop cluster not working correctly?
(screenshot: Hadoop web UI)
(screenshot: my hdfs dfs -ls output)
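One thing worth checking: hdfs dfs -ls reports a size of 0 for a directory entry, and a Parquet "file" written by Spark is really a directory whose data lives in the part-*.parquet files inside it. hdfs dfs -du -s -h on the output path shows the real size, or you can sum it from spark-shell. A minimal sketch (the output path below is a placeholder, not the one from the post):

// Sum the size of everything under the Parquet output directory using the
// Hadoop FileSystem API from spark-shell.
import org.apache.hadoop.fs.{FileSystem, Path}

val outputDir = new Path("/user/hadoop/output.parquet")  // placeholder path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// getContentSummary walks the directory tree and returns the total length in bytes
val totalBytes = fs.getContentSummary(outputDir).getLength
println(f"Total Parquet size: ${totalBytes / 1024.0 / 1024.0}%.1f MB")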

Related

How to merge csv files into single parquet file inside a folder in pyspark?

I want to merge three CSV files into a single Parquet file using pyspark.
Below is my S3 path; the folder for the 10th contains three files that I want to merge into a single Parquet file:
"s3://lla.raw.dev/data/shared/sap/orders/2022/09/10/orders1.csv,orders2.csv,orders3.csv"
Single file:
"s3://lla.raw.dev/data/shared/sap/orders/parquet file
Just read from CSVs and write to parquet
(spark
# read from CSV
.read.csv('s3://lla.raw.dev/data/shared/sap/orders/2022/09/10/')
# turn to single file
.coalesce(1)
# write to parquet
.write
.parquet('s3://lla.raw.dev/data/shared/sap/orders/parquet')
)
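The Scala equivalent from spark-shell looks almost identical; a sketch using the same paths (the header option is an assumption about the CSVs):

// Read every CSV in the day folder, collapse to one partition, write Parquet.
val orders = spark.read
  .option("header", "true")  // assumption: the CSVs carry a header row
  .csv("s3://lla.raw.dev/data/shared/sap/orders/2022/09/10/")

orders
  .coalesce(1)               // force a single output file
  .write
  .parquet("s3://lla.raw.dev/data/shared/sap/orders/parquet")

Note that coalesce(1) pushes all the data through a single task, which is fine for three small files but can become a bottleneck on large inputs.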

Pyspark - How to filter out .gz files based on regex pattern in filename when reading into a pyspark dataframe

I have a folder structure as following:
data/level1=x/level2=y/level3=z/
And in this folder, I have some files as following:
filename_type_20201212.gz
filename_type_20201213.gz
filename_pq_type_20201213.gz
How do I read only the files with prefix "filename_type" into a dataframe?
There are many level1, level2, and level3 subfolders, so the whole data/ folder has to be loaded into a pyspark dataframe while reading only the files with the file name prefix above.
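Not a full regex, but a glob on the file name is usually enough for a prefix match. A sketch in Scala (the same basePath and pathGlobFilter options exist in pyspark), assuming the .gz files are CSV-formatted; swap csv for text or another reader if they are not:

// Read only files whose name starts with "filename_type" across all
// level1/level2/level3 partitions; basePath keeps the partition columns.
val df = spark.read
  .option("basePath", "data/")
  .csv("data/*/*/*/filename_type*.gz")

// Alternative (Spark 3.0+): pathGlobFilter filters candidate files by name.
val df2 = spark.read
  .option("basePath", "data/")
  .option("pathGlobFilter", "filename_type*.gz")
  .csv("data/*/*/*/")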

How to get all csv.gz files inside a .tar archive using Scala?

I have the following problem: suppose I have a directory containing compressed .tar archives, each of which contains multiple .csv.gz files. I want to get all the csv.gz files inside a parent *.tar archive. I work with Scala 2.11.7.
This is the tree:
file.tar
| file1.csv.gz
|   file11.csv
| file2.csv.gz
|   file21.csv
| file3.csv.gz
|   file31.csv
I want to get from file.tar a list of files: file1.csv.gz, file2.csv.gz, file3.csv.gz, so that afterwards I can create a dataframe from each csv.gz file to do some transformations.
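One way to list the entries (a sketch, assuming Apache Commons Compress is on the classpath, e.g. org.apache.commons:commons-compress):

// List the *.csv.gz entries contained in a .tar archive.
import java.io.FileInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import scala.collection.mutable.ListBuffer

def listCsvGzEntries(tarPath: String): List[String] = {
  val tarIn = new TarArchiveInputStream(new FileInputStream(tarPath))
  val names = ListBuffer.empty[String]
  try {
    var entry = tarIn.getNextTarEntry
    while (entry != null) {
      if (!entry.isDirectory && entry.getName.endsWith(".csv.gz"))
        names += entry.getName
      entry = tarIn.getNextTarEntry
    }
  } finally tarIn.close()
  names.toList
}

// listCsvGzEntries("file.tar") -> List("file1.csv.gz", "file2.csv.gz", "file3.csv.gz")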

How to extract part of a file saved in HDFS and save it as csv?

My Device file is saved in HDFS and I need to take 100 rows from that saved file
and save them as csv in my local filesystem.
I have tried this command:
hdfs dfs -text /path_to_hdfs/Device/* > Device.csv
Piping it through head keeps only the first 100 rows:
hdfs dfs -text /path_to_hdfs/Device/* | head -100 > Device.csv
This gets the first 100 lines, uncompressed, from the Hadoop files and stores them in the csv file on your local filesystem.
hdfs dfs -cat /path_to_hdfs/Device/* | head -100 > path_to_local_file.csv
(hdfs dfs -copyToLocal copies whole files and writes nothing to stdout, so piping it into head has no effect; pipe -cat or -text into head instead.)
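If you would rather do the sampling with Spark instead of shell pipes, a sketch (assuming the Device files are readable as plain text and spark-shell runs on the machine where the csv should land):

// Take the first 100 lines on the driver and write them to a local csv file.
// take() pulls the rows to the driver, which is fine for a 100-row sample.
import java.io.PrintWriter

val lines = spark.read.textFile("hdfs:///path_to_hdfs/Device/*").take(100)

val out = new PrintWriter("Device.csv")  // local file on the driver machine
try lines.foreach(out.println) finally out.close()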

Spark coalesce loses file when program is aborted

In Scala/Spark I am using a DataFrame and writing it into a single file using:
import org.apache.spark.sql.SaveMode

val dataFrame = rdd.toDF()
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(filePath)
This works fine. But I noticed, using the console and Hadoop's ls command, that while the coalesce is running, the file and folder are not on the Hadoop file system.
When you type hadoop fs -ls hdfs://path, there is no such file or directory. After the coalesce is done, the path is there again, and so is the coalesced file.
This might happen because the coalesce needs to delete the existing output and create a new one?!
Here is the problem: when I kill the process/app while the file is not on the file system, the complete file is lost. So a failure of the system would destroy the file.
Do you have an idea how to prevent the file loss? I thought Spark/Hadoop would take care of this.
Thanks, Alex
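A common pattern for this (a sketch; the paths are placeholders): write the new output to a temporary path and only swap it in after the write has finished, so an aborted job never leaves you without a complete copy.

// Write to a temporary location, then replace the old output with a rename.
// If the job is killed during the write, the previous output is still intact.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val finalPath = new Path("/data/output.parquet")   // placeholder paths
val tmpPath   = new Path("/data/output.parquet_tmp")

dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(tmpPath.toString)

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(finalPath, true)     // drop the old output only once the new one is complete
fs.rename(tmpPath, finalPath)  // rename is a cheap metadata operation on HDFS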