I have multiple files in my folder. I want to pattern match whether a given file is present, and if it is, store the whole file path in a variable.
How do I achieve this in PySpark?
Since you want to store the whole path in a variable, you can achieve this with a combination of dbutils and regular-expression pattern matching.
Use dbutils.fs.ls(path) to return the list of files present in a folder (storage account or DBFS). Assign its return value to a variable called files.
# my sample path: a mounted storage account folder
files = dbutils.fs.ls("/mnt/repro")
Loop through this list. Using Python's re.match(), you can check whether the current item's file name matches your pattern; if it does, append its path to your result list.
from re import match

matched_files = []
for file in files:
    # print(file)
    if match("sample.*csv", file.name):  # "sample.*csv" is the pattern to be matched
        matched_files.append(file.path)

# print("Matched files: ", matched_files)
Sample output: matched_files contains the full path (file.path) of every file in /mnt/repro whose name matches the pattern.
I have to copy files from a source folder to a target folder; both are in the same storage account (ADLS). The files in the source folder are in .txt format and have a date appended in the file name,
eg: RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
and
RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
(20221201 and 20221202 are the dates in the file names; date format: yyyymmdd)
I have to create a pipeline that sorts and stores the files in ADLS folders in this hierarchy:
ex: adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
Based on the date in each file name, the pipeline should create a year (YYYY) folder, a month (MM) folder inside it, and a day (DD) folder inside that, as in the example above, even if there are multiple files with the same date in their names. Each file should be copied into the folder for its own year, month, and day.
What I have done:
In Get Metadata, I gave the argument to extract **childItems**.
A ForEach activity contains a Copy activity.
In the Copy activity source, the wildcard path is given as *.txt.
For the sink, I used a concat expression with the split and substring functions.
Please check the screenshots of all the activities and expressions.
This pipeline does create the folders based on the date in the file name (like adl/2022/12/01),
but the problem is that it copies all files into every date (DD) folder, like this:
adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/01/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
1.[GET META to extract child items](https://i.stack.imgur.com/GVYgZ.png)
2.[Giving GET META output to FOREACH](https://i.stack.imgur.com/cbo30.png)
3.[Inside FOREACH using COPY ](https://i.stack.imgur.com/U5LK5.png)
4.[Source Data Set](https://i.stack.imgur.com/hyzuC.png)
5.[Sink Data Set](https://i.stack.imgur.com/aiYYm.png)
Expression used in the dataset folder path: #concat('adl','/',dataset().FolderName)
6.[Took parameter for Sink](https://i.stack.imgur.com/QihZR.png)
7.[Sink in copy activity ](https://i.stack.imgur.com/4OzT5.png)
Expression used in the sink for dynamic folders, using the split and substring functions:
#concat(substring(split(item().name,'.')[3],0,4),'/',
substring(split(item().name,'.')[3],4,2),'/',
substring(split(item().name,'.')[3],6,2)
)
**OUTPUT for this pipeline**
adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/01/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
**Required Output is**
adl/2022/12/01/RAINBOW.IND.EXPORT.20221201.WIFI.NETWORK.SCHOOL.txt
adl/2022/12/02/RAINBOW.IND.EXPORT.20221202.WIFI.NETWORK.SCHOOL.txt
(i.e., each file should be copied only to its respective date folder; even if multiple files share the same date, each should go to the date folder derived from its own file name)
I reproduced the above and got the same result when I followed the steps you described.
The Copy activity behaves this way because neither the source nor the sink uses #item().name (the file name for that particular iteration), and the source wildcard path in the Copy activity is *.txt.
That means that on every iteration (for every file name) it copies all the .txt files from the source into that iteration's target folder, which is exactly what happened for you.
To avoid this:
Give #item().name as the source wildcard file name.
This means only that iteration's file name is used as the source of the copy.
(OR)
Keep the wildcard file name in the source as it is (*.txt), create a sink dataset parameter for the file name,
and give #item().name to it in the Copy activity sink.
You can do either of the above, or both at once if you want. I have checked all 3 scenarios:
1. #item().name in the source wildcard file name.
2. #item().name in the sink dataset file-name parameter, keeping the wildcard path the same.
3. Combining both 1 and 2 (#item().name in the wildcard file name and in the sink dataset parameter).
All of them work and give the desired result.
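As a sketch of the second option (the sink dataset parameter names FolderName and FileName below are illustrative, not taken from your pipeline): keep *.txt as the source wildcard, add FolderName and FileName parameters to the sink dataset and reference them in its folder and file path, and in the Copy activity sink set
FolderName: #concat(substring(split(item().name,'.')[3],0,4),'/',
            substring(split(item().name,'.')[3],4,2),'/',
            substring(split(item().name,'.')[3],6,2))
FileName: #item().name
This way each iteration writes exactly one file into the folder derived from the date in its own name.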
Use case: in a Databricks PySpark environment, I want to check whether there are multiple files matching the same file-name pattern in the Azure storage account. If they exist, I expect to get the list of file path locations for each matched file.
I tried using dbutils.fs.ls, but it does not support wildcard patterns.
Workaround: get the paths of all files in the folder, then loop over each file to do file-name pattern matching and build a list of the required file paths.
Is there any other way to get the file paths without looping?
In Databricks, dbutils.fs.ls() doesn't support wildcard paths. The official documentation lists all of the Databricks utilities, and there is no file system utility that supports wildcard paths for matching file names.
You cannot proceed without looping. The following operations were done against a storage account with random files for the demo, and they demonstrate one way to get the files that match your pattern.
Using the os.listdir() function, you can get the list of all files in your container/directory.
import os

# using os.listdir() to get all files in the container
path_dbfs = "dbfs:/mnt/omega/"  # absolute dbfs path to your storage
path = "/dbfs/mnt/omega"
file_names = os.listdir(path)
print(file_names)
['country_data.csv', 'json_input.json', 'json_input.txt', 'person.csv', 'sample_1.csv', 'sample_2.csv', 'sample_3.csv', 'sample_new_date_4.csv', 'store.txt']
Once you have the list of all files, you can use regular expressions with re.search() and the match object's group() method to check whether each file matches the pattern.
import re

# use regex in a loop to get the absolute paths of files matching the pattern
file_to_find_pattern = "sample.*csv"  # pattern to match in this case
# .* indicates 0 or more occurrences of other characters; build the pattern according to your requirement
matched_files = []
for file in file_names:
    val = re.search(file_to_find_pattern, file)
    if val is not None:
        matched_files.append(path_dbfs + val.group())
print(matched_files)
['dbfs:/mnt/omega/sample_1.csv', 'dbfs:/mnt/omega/sample_2.csv', 'dbfs:/mnt/omega/sample_3.csv', 'dbfs:/mnt/omega/sample_new_date_4.csv']
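If you want to keep the looping in one place, here is a minimal sketch that wraps the steps above into a reusable helper (the function name and the sample arguments are illustrative, not required by Databricks):
import os
import re

def find_matching_files(dbfs_prefix, local_path, pattern):
    # return the full dbfs:/ paths of files in local_path whose names match pattern
    matched = []
    for name in os.listdir(local_path):
        if re.search(pattern, name):
            matched.append(dbfs_prefix + name)
    return matched

# usage with the same mount as above
print(find_matching_files("dbfs:/mnt/omega/", "/dbfs/mnt/omega", "sample.*csv"))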
I would like to validate whether my input file name is in a specified format.
My file name should be like <><><>_<>.csv.
Yes, I am using an event-based trigger, so I will get the file name from the trigger.
Expected format: company_country_yearmonth_timestamp.CSV
There is no explicit regex way of validating whether the incoming file name matches a pattern. But if you are using an activity like Lookup or Copy, you can specify a wildcard file name or file path in the source dataset settings to fetch a file matching the pattern.
- wildcardFileName
The file name with wildcard characters under the given container and folder path (or wildcard folder path) to filter source files. Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or a single character). Use ^ to escape if your file name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.
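For instance, for the expected format above, a wildcard file name along these lines (illustrative only) limits the source to files that have at least the four underscore-separated parts and the .CSV extension, although a wildcard cannot validate the content of each part:
*_*_*_*.CSV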
As another example, you can use an If Condition activity with an expression like the one below, using contains().
Here a storage event trigger passes the triggered file name into a parameter; we then use the contains() function to check whether the file name contains a specified string:
#contains(pipeline().parameters.filenameTriggered,'pattern')
If it evaluates to true, a Wait activity is executed.
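If you need a stricter check than contains(), a sketch of an If Condition expression (assuming the triggered file name arrives in pipeline().parameters.filenameTriggered, as above) could verify the number of underscore-separated parts and the extension:
#and(equals(length(split(pipeline().parameters.filenameTriggered,'_')), 4), endsWith(pipeline().parameters.filenameTriggered,'.CSV'))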
There are two parts to my query:
1) How do I save different fields of a structure as separate files (each file containing only the named field of the structure)?
2) How do I force the save command to create directories in the save path when intermediate directories do not exist?
For the first part:
data.a.name='a';
data.a.age=5;
data.b.name='b';
data.b.age=6;
data.c.name='c';
data.c.age=7;
fields = fieldnames(data);
for i = 1:length(fields)
    save(['E:\data\' fields{i} '.mat'], '-struct', 'data');
end
I want to save each field of struct data as a separate .mat file, so that after executing the loop I should have 3 files inside E:\data, viz. a.mat, b.mat and c.mat, where a.mat contains only the data of field 'a', b.mat only the data of field 'b', and so on.
When I execute the above code, I get three files in my directory, but each file contains the identical content of all three variables a, b and c, instead of an individual variable in each file.
The following command does not work:
for i = 1:length(fields)
    save(['E:\data\' fields{i} '.mat'], '-struct', ['data.' fields{i}]);
end
Error using save
The argument to -STRUCT must be the name of a scalar structure variable.
Is there some way to use the save command to achieve this without having to create temporary variables for saving each field?
For the second part:
I have a large number of files which need to be stored in a directory structure. I want the following to work:
test='abcdefgh';
save(['E:\data\' test(1:2) '\' test(3:4) '\' test(5:6) '\result.mat'])
But it shows the following error:
Error using save
Cannot create 'result.mat' because 'E:\data\ab\cd\ef' does not exist.
If any intermediate directories are not present, they should be created by the save command. I can get this to work by checking whether the directory is present using the exist command and then creating it using mkdir, but I am wondering if there is some way to force save to do the work using an argument I am not aware of.
Your field input argument to save is wrong. Per the documentation, the format is:
'-struct',structName,field1,...,fieldN
So the appropriate save syntax is:
data.a.name='a';
data.a.age=5;
data.b.name='b';
data.b.age=6;
data.c.name='c';
data.c.age=7;
fields = fieldnames(data);
for ii = 1:length(fields)
    save(['E:\data\' fields{ii} '.mat'], '-struct', 'data', fields{ii});
end
And no, you cannot force save to generate the intermediate directories. Check for the existence of the save path first and create it if necessary.
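A minimal sketch of that check for your second example (same paths and variable names as in the question):
test = 'abcdefgh';
outDir = ['E:\data\' test(1:2) '\' test(3:4) '\' test(5:6)];
if ~exist(outDir, 'dir')   % create the missing intermediate directories first
    mkdir(outDir);
end
save([outDir '\result.mat']);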
I have a zip file X and I'd like to extract a single file, located at x/x/x/file.txt. How do I do this using Archive::Zip and Perl?
You can use the extractMember method:
extractMember( $memberOrName [, $extractedName ] )
Extract the given member, or match its name and extract it. Returns undef if member doesn't exist in this Zip. If optional second arg is given, use it as the name of the extracted member. Otherwise, the internal filename of the member is used as the name of the extracted file or directory. If you pass $extractedName, it should be in the local file system's format. All necessary directories will be created. Returns AZ_OK on success.
See Archive::Zip::FAQ, "extract file(s) from a Zip". The current version of the example file is online at http://cpansearch.perl.org/src/ADAMK/Archive-Zip-1.30/examples/extract.pl.
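A minimal sketch using extractMember, assuming the archive is named X.zip and the member path is x/x/x/file.txt as in the question:
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

my $zip = Archive::Zip->new();
$zip->read('X.zip') == AZ_OK
    or die 'cannot read X.zip';

# extract the member by its internal name; the optional second argument
# sets the local name, so the x/x/x/ directories are not recreated
$zip->extractMember('x/x/x/file.txt', 'file.txt') == AZ_OK
    or die 'extraction failed';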