PySpark - How to filter .gz files based on a regex pattern in the filename when reading into a PySpark dataframe

I have a folder structure as follows:
data/level1=x/level2=y/level3=z/
And in these folders, I have files such as the following:
filename_type_20201212.gz
filename_type_20201213.gz
filename_pq_type_20201213.gz
How do I read only the files with prefix "filename_type" into a dataframe?
There are many level1, level2, and level3 subfolders, so the data/ folder has to be loaded into a PySpark dataframe while reading only the files that have the above filename prefix.
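
One way to do this (not from the original post, so take it as a sketch): on Spark 3.0+ the DataFrameReader has a pathGlobFilter option that matches file names only, without disturbing partition discovery, so level1/level2/level3 still become partition columns. Assuming the .gz files are CSV-formatted and the session is called spark:

# sketch, assuming Spark 3.0+ and CSV content inside the .gz files
df = (
    spark.read
    # keep only files whose name starts with "filename_type_"
    .option("pathGlobFilter", "filename_type_*.gz")
    .csv("data/")
)

On older Spark versions, a glob in the path itself together with the basePath option, e.g. spark.read.option("basePath", "data/").csv("data/*/*/*/filename_type_*.gz"), should achieve a similar effect.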

Related

How to merge csv files into single parquet file inside a folder in pyspark?

I want to merge three CSV files into a single Parquet file using PySpark.
Below is my S3 path; the 10th-date folder has three files, and I want to merge those files into a single Parquet file:
"s3://lla.raw.dev/data/shared/sap/orders/2022/09/10/orders1.csv,orders2.csv,orders3.csv"
Single file
"s3://lla.raw.dev/data/shared/sap/orders/parquet file
Just read from CSVs and write to parquet
(spark
    # read from CSV
    .read.csv('s3://lla.raw.dev/data/shared/sap/orders/2022/09/10/')
    # turn to single file
    .coalesce(1)
    # write to parquet
    .write
    .parquet('s3://lla.raw.dev/data/shared/sap/orders/parquet')
)
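A note on the design choice (not from the original answer): coalesce(1) collapses the data into one partition, which is what forces a single output part file, but it also funnels the whole write through a single task, so it is only sensible for modestly sized data.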

Parquet file size 0 after converting csv to parquet

In my case I used spark-shell to convert a CSV file into a Parquet file. My CSV file was 126 MB; after converting to Parquet, Hadoop shows that the file size is 0, although I can read the Parquet file using dataframes. Is this normal, or is my Hadoop cluster not working right?
(screenshots: Hadoop web UI, output of my hdfs dfs -ls command)
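
Not part of the original post, but a sketch of how one might sanity-check this: a Parquet write produces a directory of part files, and a plain hdfs dfs -ls shows size 0 for the directory entry itself, while hdfs dfs -du -h sums the part files. Reading the output back and counting rows also confirms the data is intact (the path below is hypothetical, and the snippet is in PySpark for consistency with the rest of this page):

# hypothetical output path; assumes an existing SparkSession called spark
df = spark.read.parquet("hdfs:///user/me/output.parquet")
print(df.count())   # a non-zero row count confirms the Parquet data was written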

How do I unzip multiple .gz files and output individual csv files contained in each of them?

None of the threads gives a complete answer to this task. I have multiple .gz files in one directory, and I want to extract all the CSV files contained in each of them and output individual CSV files. My code below runs without error, but does not unzip any file. I can't figure out where the problem is.
#unzipping .gz files
import gzip
import shutil
import pandas as pd
import glob, os

for filename in glob.iglob('C:/Users/shedez/Documents/Data/**', recursive=True):
    if filename.endswith('gz'):  # process only the .gz archives
        # write the decompressed CSV next to the archive, dropping the .gz extension
        out_path = filename[:-3]
        with gzip.open(filename, 'rb') as f_in:
            with open(out_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
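Incidentally, since pandas is already imported in the snippet: pandas can read gzip-compressed CSVs directly with pd.read_csv(path, compression='gzip'), which avoids the manual decompression step if the goal is only to load the data rather than to produce .csv files on disk.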

Robot Framework - how to replace the 'list files in a directory' keyword with a .csv file containing the list of files

I have a .csv file which contains the names of all the files in the directory.
CSV
1,txt1.txt
2,txt2.txt
3,txt3.txt
4,txtsum.txt
These four text files and the CSV file are in my directory c:/data. I want to fetch the names of the files from the CSV file instead of using the keyword 'list files in a directory'.
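
One possible sketch, not from the original thread: Robot Framework keywords can be backed by a plain Python function, so a small library file that reads the CSV and returns the filenames could stand in for the directory-listing keyword. The library name and CSV path below are made up; the column layout follows the example above.

# list_from_csv.py - hypothetical custom keyword library
import csv

def get_file_names_from_csv(csv_path):
    """Return the filenames listed in the second column of the CSV."""
    names = []
    with open(csv_path, newline='') as handle:
        for row in csv.reader(handle):
            if len(row) >= 2:          # rows look like: 1,txt1.txt
                names.append(row[1].strip())
    return names

In a .robot file this would be imported with "Library    list_from_csv.py" and called as "Get File Names From Csv    c:/data/files.csv" (the CSV filename here is assumed).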

How to get all the csv.gz files from a .tar archive using Scala?

I have the following problem: suppose I have a directory containing compressed .tar archives, each of which contains multiple .csv.gz files. I want to get all the csv.gz files from the parent *.tar archive. I work with Scala 2.11.7.
This is the tree:
file.tar
|-- file1.csv.gz
|     file11.csv
|-- file2.csv.gz
|     file21.csv
|-- file3.csv.gz
|     file31.csv
I want to get from file.tar a list of files: file1.csv.gz, file2.csv.gz, file3.csv.gz, so that afterwards I can create a dataframe from each csv.gz file to do some transformations.
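
The question asks for Scala, but to illustrate the idea here is a minimal Python sketch using the standard tarfile module (in Scala, Apache Commons Compress' TarArchiveInputStream plays the same role): list the members whose names end in .csv.gz, extract them somewhere readable, then build a dataframe from each one.

# Python illustration only; the archive and file names follow the tree above
import tarfile

with tarfile.open("file.tar") as archive:
    gz_members = [m.name for m in archive.getmembers() if m.name.endswith(".csv.gz")]
    print(gz_members)   # expected, given the tree above: ['file1.csv.gz', 'file2.csv.gz', 'file3.csv.gz']
    archive.extractall(path="extracted")
    # each extracted csv.gz can now be read directly, e.g. spark.read.csv("extracted/file1.csv.gz")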