I have a folder structure like the following:
data/level1=x/level2=y/level3=z/
Inside these folders I have files such as:
filename_type_20201212.gz
filename_type_20201213.gz
filename_pq_type_20201213.gz
How do I read only the files with prefix "filename_type" into a dataframe?
There are many level1/level2/level3 subfolders, so the whole data/ folder has to be loaded into a PySpark dataframe while reading only the files with the above file-name prefix.
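One common approach is to push the prefix filter into the path itself: Spark's file readers accept glob patterns, so a pattern like data/*/*/*/filename_type_*.gz (three wildcards for the three partition levels, then the prefix) selects only the wanted files in spark.read.csv(...); Spark 3 also offers a pathGlobFilter option. As a minimal stdlib sketch of the same glob matching (the directory layout below is a tiny mock-up of the one in the question):

```python
import glob
import os
import tempfile

# Build a miniature copy of the partitioned layout.
root = tempfile.mkdtemp()
d = os.path.join(root, "data", "level1=x", "level2=y", "level3=z")
os.makedirs(d)
for name in ["filename_type_20201212.gz",
             "filename_type_20201213.gz",
             "filename_pq_type_20201213.gz"]:
    open(os.path.join(d, name), "w").close()

# The same glob can be passed to spark.read.csv(...):
# one '*' per partition level, then the file-name prefix.
pattern = os.path.join(root, "data", "*", "*", "*", "filename_type_*.gz")
matches = sorted(os.path.basename(p) for p in glob.glob(pattern))
print(matches)  # only the two files with the wanted prefix
```

Note that the pattern is anchored to the whole file name, so filename_pq_type_20201213.gz is not matched.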
I want to merge three CSV files into a single parquet file using PySpark.
My S3 path is below; the folder for the 10th has three files, and I want to merge them into a single parquet file:
"s3://lla.raw.dev/data/shared/sap/orders/2022/09/10/orders1.csv,orders2.csv,orders3.csv"
Target single file:
"s3://lla.raw.dev/data/shared/sap/orders/parquet"
Just read from the CSVs and write to parquet:
(spark
    # read from CSV
    .read.csv('s3://lla.raw.dev/data/shared/sap/orders/2022/09/10/')
    # collapse to a single partition so one output file is written
    .coalesce(1)
    # write to parquet
    .write
    .parquet('s3://lla.raw.dev/data/shared/sap/orders/parquet')
)
In my case I used spark-shell to convert a CSV file into a parquet file. My CSV file was 126 MB, but after the conversion Hadoop shows the parquet file size as 0, although I can read the parquet file into a dataframe. Is this normal, or is my Hadoop cluster not working correctly?
[screenshot: Hadoop web UI]
[screenshot: output of hdfs dfs -ls]
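A likely explanation (an assumption, since the screenshots are not available): Spark writes parquet output as a directory of part files, and hdfs dfs -ls reports a size of 0 for the directory entry itself. Summing the contents with -du shows the real size (the path below is hypothetical):

```shell
# The parquet "file" is really a directory; ls shows 0 for the
# directory entry itself.
hdfs dfs -ls /user/output/orders.parquet

# Sum the sizes of the part files inside it instead.
hdfs dfs -du -h /user/output/orders.parquet
```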
None of the threads gives a complete answer to this task. I have multiple .gz files in one directory; I want to extract the CSV file inside each of them and write out individual CSV files. My code below runs without error, but it does not unzip any file, and I can't figure out where the problem is.
# unzipping .gz files
import gzip
import shutil
import pandas as pd
import glob, os

for filename in glob.iglob('C:/Users/shedez/Documents/Data/**', recursive=True):
    if filename.endswith('gz'):  # filter dirs
        with gzip.open(filename, 'rb') as f_in:
            with open('C:/Users/shedez/Documents/Data', 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
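For reference, one likely fix (a sketch, assuming each .gz holds a single CSV): the snippet above opens the directory path itself as the output file; each archive needs its own output name, e.g. the input path with the .gz suffix stripped:

```python
import glob
import gzip
import os
import shutil
import tempfile

def gunzip_all(root):
    """Extract every .gz under root to a file of the same name minus .gz."""
    for filename in glob.iglob(os.path.join(root, '**'), recursive=True):
        if filename.endswith('.gz'):
            out_path = filename[:-3]  # strip the '.gz' suffix
            with gzip.open(filename, 'rb') as f_in:
                with open(out_path, 'wb') as f_out:
                    shutil.copyfileobj(f_in, f_out)

# demo on a temporary directory instead of C:/Users/shedez/Documents/Data
root = tempfile.mkdtemp()
with gzip.open(os.path.join(root, 'a.csv.gz'), 'wt') as f:
    f.write('1,txt1.txt\n')
gunzip_all(root)
print(open(os.path.join(root, 'a.csv')).read())  # 1,txt1.txt
```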
I have a .csv file which lists all the files in the directory.
CSV
1,txt1.txt
2,txt2.txt
3,txt3.txt
4,txtsum.txt
These four text files and the CSV file are in my directory c:/data. I want to fetch the names of the files from the CSV file instead of listing the files in the directory.
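A minimal sketch, assuming the CSV has two columns (index, file name) and no header row. The csv module reads each row, and the second column gives the file name; in practice you would open the CSV from c:/data (its exact file name is not given in the question, so an in-memory copy of the sample is used here):

```python
import csv
import io

# Stand-in for open('c:/data/<your-list>.csv', newline='')
sample = "1,txt1.txt\n2,txt2.txt\n3,txt3.txt\n4,txtsum.txt\n"

filenames = [row[1] for row in csv.reader(io.StringIO(sample))]
print(filenames)  # ['txt1.txt', 'txt2.txt', 'txt3.txt', 'txtsum.txt']
```

Each name can then be joined with the directory (os.path.join('c:/data', name)) to open the corresponding text file.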
I have the following problem: suppose I have a directory containing compressed .tar archives, each of which contains multiple .csv.gz files. I want to get all the csv.gz files inside each .tar archive. I work with Scala 2.11.7.
this tree:
file.tar
├── file1.csv.gz
│   └── file11.csv
├── file2.csv.gz
│   └── file21.csv
└── file3.csv.gz
    └── file31.csv
I want to get from file.tar a list of its entries: file1.csv.gz, file2.csv.gz, file3.csv.gz, so that afterwards I can create a dataframe from each csv.gz file and apply some transformations.
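The listing step can be sketched in Python's tarfile module (the other snippets in this thread are Python; in Scala the analogous approach is Apache Commons Compress's TarArchiveInputStream, iterating entries and filtering on the name). The sample archive built below is an assumption mirroring the tree above:

```python
import gzip
import io
import os
import tarfile
import tempfile

# Build a sample file.tar containing three .csv.gz members.
root = tempfile.mkdtemp()
tar_path = os.path.join(root, 'file.tar')
with tarfile.open(tar_path, 'w') as tar:
    for i in (1, 2, 3):
        payload = gzip.compress(b'col1,col2\n1,2\n')
        info = tarfile.TarInfo(name='file%d.csv.gz' % i)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# List the .csv.gz entries; each can then be extracted and read
# into its own dataframe.
with tarfile.open(tar_path, 'r') as tar:
    members = [m.name for m in tar.getmembers()
               if m.name.endswith('.csv.gz')]
print(members)  # ['file1.csv.gz', 'file2.csv.gz', 'file3.csv.gz']
```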