How to read all but one .txt file from a directory in PySpark?

I have a directory with several .txt files. I want to read all these files into a dataframe, but want to exclude one problematic file. Is there a way I can do this?
The files are named #100.1-YYYY1HH10MM.txt, #101.1-YYYY11HH20MM.txt, #102.1-YYYY9HH5MM.txt, etc. You'll note that the file names are prefixed with an incremental number, e.g. #100.1, #102.1, etc. If I want to read all of these files except, say, file number #350.1, how can I do that? I'm not sure whether I can use a regex here.
from pyspark.sql.functions import *
filename = '/mnt/directory/*.txt' #Read all TXT files in the folder
filename = '/mnt/directory/#{1[0-4,7-9],[0,2-3][0-9]}.1.txt' #Try regex to filter out one file

If your list of unwanted files is not too big, then using glob and a loop can be a simple solution:
import glob

dont_want = ['#350.1']  # prefixes of the files to skip
files = []
for x in glob.glob('/mnt/directory/*.txt'):
    if all(y not in x for y in dont_want):
        files.append(x)

df = spark.read.csv(files)  # pass the filtered list of paths to Spark
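
An alternative, if it is acceptable to read every file and drop rows afterwards, is to tag each row with its source path via input_file_name() and filter out the unwanted file. A sketch (the src column name is just illustrative):
from pyspark.sql.functions import col, input_file_name

df = (spark.read.text('/mnt/directory/*.txt')
          .withColumn('src', input_file_name())    # full path of the file each row came from
          .filter(~col('src').contains('#350.1'))  # drop rows that came from the unwanted file
          .drop('src'))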

Related

Append files and add column with last part of each filename

I hope one of you is willing to help a complete Python beginner.
I have managed to create my first script, which appends multiple Excel files in a folder into one merged file. So far so good!
But I also need the script to create an additional column and fill it with the last two characters of the filename of each file it appends.
My script looks like this for now:
import pandas as pd
import glob
# getting excel files to be merged from the Desktop
path = "C:\\Users\\123\\OneDrive\\Descriptions\\Translated"
# read all the files with extension .xlsx i.e. excel
filenames = glob.glob(path + "\*.xlsx")
print('File names:', filenames)
# empty data frame for the new output excel file with the merged excel files
outputxlsx = pd.DataFrame()
# for loop to iterate all excel files
for file in filenames:
    # using concat for excel files
    # after reading them with read_excel()
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True, sort=False)
    # appending data of excel files
    outputxlsx = outputxlsx.append(df, ignore_index=True)
print('Final Excel sheet now generated at the same location:')
outputxlsx.to_excel("C:/Users/123/OneDrive/Descriptions/Translated/Merged.xlsx", index=False)
The files in the folder are named like this:
CZ, PL, TR_cs-CZ
CZ, PL, TR_pl-PL
CZ, PL, TR_tr-TR
So the last column should be:
CZ
PL
TR
Thank you!!
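
A minimal sketch of one way to do this (the Lang column name is my own placeholder; the folder path and read logic come from the question): tag each file's rows with the last two characters of its name before appending, then merge everything at the end.
import glob
import os
import pandas as pd

path = r"C:\Users\123\OneDrive\Descriptions\Translated"
frames = []
for file in glob.glob(os.path.join(path, "*.xlsx")):
    # read all sheets of this workbook into one frame
    df = pd.concat(pd.read_excel(file, sheet_name=None), ignore_index=True, sort=False)
    # last two characters of the file name without its extension,
    # e.g. "CZ, PL, TR_cs-CZ.xlsx" -> "CZ"  ("Lang" is a placeholder column name)
    df["Lang"] = os.path.splitext(os.path.basename(file))[0][-2:]
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
merged.to_excel(os.path.join(path, "Merged.xlsx"), index=False)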

Looping through .txt files and rewriting them under different names

I am trying to read several .txt files from a directory, e.g. B1.txt ... Bn.txt, rescale the values in each file, and rewrite them as new .txt files of the form BF1.txt ... BFn.txt. I was able to read the files using the following snippet, but couldn't rewrite them. Here is my attempt:
NoOFfiles = 4;
for k = 1:NoOFfiles
    filename = sprintf('B%d.txt', k);
    A = load(filename)./1000;
end
Any help would be appreciated.
Thank you for your time!
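
For reference, a rough sketch of the same read-rescale-rewrite loop in Python (numpy is an assumption on my part; the divide-by-1000 rescaling and the B*/BF* naming come from the question):
import numpy as np

no_of_files = 4
for k in range(1, no_of_files + 1):
    A = np.loadtxt(f'B{k}.txt') / 1000   # read and rescale, as in the question
    np.savetxt(f'BF{k}.txt', A)          # write the rescaled values out as BFk.txt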

Error when importing multiple CSV files from a certain folder using MATLAB

I am a newbie in MATLAB programming. I have a problem writing code that imports multiple CSV files from a certain folder and merges them into one:
This is my code:
%% Importing multiple CSV files
myDir = uigetdir; %gets directory
myFiles = dir(fullfile(myDir,'*.csv')); %gets all csv files in struct
for k = 1:length(myFiles)
    data{k} = csvread(myFiles{k});
end
I use uigetdir so that data can be selected from any folder, because I am trying to make an automated program that is flexible for others to use. The code I run only finds the directory and lists the files; it does not merge the CSV files into one and read them via "import data". I want them to be merged and read as one file.
My merged file should be semicolon-delimited and consist of the 47 CSV files merged together (the picture below shows one of the CSV files I have):
[picture: my merged file]
I have been working on this for a whole day but I always get an error. Please help me :( Thank you very much in advance for your help.
As the error message states, you're attempting to reference myFiles as a cell array when it is not. The output of dir is a structure, which cannot be indexed like a cell array.
You want to do something like the following:
for k = 1:numel(myFiles)
    filepath = fullfile(myFiles(k).folder, myFiles(k).name);
    data{k} = csvread(filepath);
end
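
If the final goal is just one merged, semicolon-delimited file, here is a compact sketch of that last step in Python/pandas rather than MATLAB (the folder path stands in for whatever uigetdir returns):
import glob
import os
import pandas as pd

folder = r"C:\path\to\csv\folder"   # placeholder for the uigetdir selection
parts = [pd.read_csv(f, sep=';') for f in glob.glob(os.path.join(folder, '*.csv'))]
merged = pd.concat(parts, ignore_index=True)
merged.to_csv(os.path.join(folder, 'merged.csv'), sep=';', index=False)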

Read multiple files from a folder

I want to read multiple files from a folder but this code does not work properly:
direction = dir('data');
for i = 3:length(direction)
    Fold_name = strcat('data\', direction(i).name);
    filename = fullfile(Fold_name);
    fileid = fopen(filename);
    data = fread(fileid)';
end
I modified your algorithm to make it simpler.
Just use this form:
folder = 'address\datafolder\';   % the folder where your data is located
then:
filenames = dir([folder, '*.txt']);   % whatever your data format is, specify it here in case there are other files you do not want to import; in this example I used .txt files
for k = 1:numel(filenames)
    % do your processing here
end
It should work, and it is a more robust method: it applies to any folder without you having to worry about file names, numbering order, etc., unless you want to pick out specific files of the same format within the folder. I would recommend keeping your data files in a separate folder.
In case you need access to all the files after reading, keep the file IDs and data indexed:
direction = dir('data');
for i = 3:length(direction)
    Fold_name = strcat('data\', direction(i).name);
    filename = fullfile(Fold_name);
    fileid(i) = fopen(filename);
    data{i-2} = fread(fileid(i))';
end

Importing Excel files into MATLAB

I have 4 folders in the same directory where each folder contains ~19 .xls files. I have written the code below to obtain the name of each of the folders and the name of each .xls file within the folders.
path = 'E:\Practice';
folder = path;
dirListing = dir(folder);
dirListing = dirListing(3:end); % first 2 are just pointers
for i = 1:length(dirListing)
    f{i} = fullfile(path, dirListing(i,1).name); % obtain the name of each folder
    files{i} = dir(fullfile(f{i}, '*.xls'));     % find the .xls files
    for j = 1:length(files{1,i})
        File_Name{1,i}{j,1} = files{1,i}(j,1).name; % find the name of each .xls file
    end
end
Now I'm trying to import the data from Excel into MATLAB using xlsread. What I'm struggling with is how to load the data within a loop when the Excel files are in different directories (different folders).
This leaves me with a 1x4 cell named File_Name, where each cell refers to a different folder located under 'path', and within each cell are the names of the spreadsheets to be imported. The sizes of the cells vary, as the number of spreadsheets in each folder varies.
Any ideas?
Thanks in advance.
I'm not sure I'm understanding your problem, but all you have to do is concatenate the string that contains the directory (f{i}) with the file name. Modifying your code:
for i = 1:length(dirListing)
    f{i} = fullfile(path, dirListing(i,1).name); % obtain the name of each folder
    files{i} = dir(fullfile(f{i}, '*.xls'));     % find the .xls files
    for j = 1:length(files{1,i})
        File_Name{1,i}{j,1} = files{1,i}(j,1).name; % find the name of each .xls file
        fullpath = [f{i} '/' File_Name{1,i}{j,1}];
        disp(['Reading file: ' fullpath])
        x = xlsread(fullpath);
    end
end
This works on *nix systems. You may have to join the filenames with a '\' on Windows. I'll find a more elegant way and update this posting.
Edit: The command filesep gives the forward or backward slash, depending on your system. The following should give you the full path:
fullpath = [f{i} filesep File_Name{1,i}{j,1}];
Take a look at the rdir helper function, written by a member of the MATLAB community.
It allows you to recursively search through directories to find files that match a certain pattern, which is exactly what you need here.
You should be able to find all your files in a single call to this function. Then you can loop through the results of rdir, loading the files one at a time into whatever data structure you want.
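
For comparison, here is the same recursive-search-then-load pattern sketched in Python (using glob's recursive ** pattern rather than the rdir helper; the E:\Practice root and the .xls extension are taken from the question):
import glob
import os
import pandas as pd

root = r"E:\Practice"
# find every .xls file under root, no matter which subfolder it sits in
for fullpath in glob.glob(os.path.join(root, '**', '*.xls'), recursive=True):
    print('Reading file:', fullpath)
    x = pd.read_excel(fullpath)   # load into whatever data structure you want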