Azure Databricks: How to delete files of a particular extension outside of DBFS using python - pyspark

I am able to delete files of a particular extension from the directory /databricks/driver using a bash command in Databricks.
%%bash
rm /databricks/driver/file*.xlsx
But I am unable to figure out how to access and delete a file outside of DBFS in a Python script.
I think dbutils cannot access files outside of DBFS, and the command below returns False because it is looking in DBFS.
dbutils.fs.rm("/databricks/driver/file*.xlsx")
I am eager to be corrected.

Not sure how to do it using dbutils, but I was able to delete the files using glob:
import os
from glob import glob

for file in glob('/databricks/driver/file*.xlsx'):
    os.remove(file)
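That said, dbutils can reach the driver's local filesystem if you prefix the path with the file:/ scheme. dbutils.fs.rm does not expand wildcards, though, so you would list the directory and filter by name first. A rough sketch (untested on this exact path):
# Sketch: dbutils.fs accepts the file:/ scheme for the driver's local disk,
# but rm() does not expand globs, so list and filter instead.
for f in dbutils.fs.ls("file:/databricks/driver/"):
    if f.name.startswith("file") and f.name.endswith(".xlsx"):
        dbutils.fs.rm(f.path)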

Related

Can I include a variable in an `sh` command in Zeppelin?

I'm using Zeppelin with Hadoop on a Spark cluster.
I'd like to run a command to check files on S3, and I'd like to use a variable.
This is my code:
%sh
aws s3 ls s3://my-bucket/my_folder/
Can I replace my-bucket/my_folder/ with a variable?
What do you mean by "a variable"? A Python variable? If so, I'm not sure. But if you just want to pull the path out onto another line, you can use a shell variable:
%sh
export AWS_FOLDER=my-bucket/my_folder/
aws s3 ls s3://$AWS_FOLDER
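If you meant a Python variable, one workaround (a sketch, assuming the AWS CLI is installed on the interpreter host) is to run the command from a Python paragraph with subprocess instead of %sh:
import subprocess

aws_folder = "my-bucket/my_folder/"   # the Python variable you want to reuse
output = subprocess.check_output(["aws", "s3", "ls", "s3://" + aws_folder])
print(output.decode())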

How to copy entire content of a local disk in python 3.6?

I want to create a .py file in Python 3.6 that will copy an entire local disk partition from my PC (partition D) to an external hard drive (partition E). I have only managed to copy folders with the distutils copy_tree command, like this:
copy_tree("D:\\Myfolder", "E:\\SAVE\\Myfolder", update=True)
But I want the program to copy the entire partition, with all folders, subfolders and files.
Any help?
I found the answer:
from distutils.dir_util import copy_tree
copy_tree("D:\\", "E:\\SAVE\\Myfolder", update=True)
This command will copy all folders from the local disk to the other location; see the distutils.dir_util docs for details.
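As a side note, distutils is deprecated in recent Python releases; on Python 3.8+ a rough equivalent is shutil.copytree with dirs_exist_ok=True (a sketch, with the same caveat that locked system files on the partition may raise errors):
import shutil

# dirs_exist_ok requires Python 3.8+; on 3.6, stay with distutils' copy_tree.
shutil.copytree("D:\\", "E:\\SAVE\\Myfolder", dirs_exist_ok=True)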

How can I manage sqoop target dir and result files permissions

When I use sqoop import with the target-dir parameter, the result lands in a folder containing part files and a _SUCCESS file. How can I manage permissions for this folder and its files when I use sqoop? I know we can change permissions after the import, but I need to manage them using sqoop alone.
P.S. I am running sqoop from an Oozie workflow, so perhaps I can use that to specify permissions.

Google Cloud Storage upload files modified today

I am trying to figure out if I can use gsutil's cp command on Windows to upload files to Google Cloud Storage. I have 6 folders on my local computer that get new PDF documents added to them daily. Each folder contains around 2,500 files. All files are currently on Google Cloud Storage in their respective folders. Right now I mainly upload all the new files using Google Cloud Storage Manager. Is there a way to create a batch file and schedule it to run automatically every night so it grabs only the files that have been scanned today and uploads them to Google Cloud Storage?
I tried this format:
python c:\gsutil\gsutil cp "E:\PIECE POs\64954.pdf" "gs://dompro/piece pos"
and it uploaded the file perfectly fine.
This command
python c:\gsutil\gsutil cp "E:\PIECE POs\*.pdf" "gs://dompro/piece pos"
will upload all of the files into a bucket. But how do I only grab files that were changed or generated today? Is there a way to do it?
One solution would be to use the -n parameter on the gsutil cp command:
python c:\gsutil\gsutil cp -n "E:\PIECE POs\*" "gs://dompro/piece pos/"
That will skip any objects that already exist on the server. You may also want to look at using gsutil's -m flag and see if that speeds the process up for you:
python c:\gsutil\gsutil -m cp -n "E:\PIECE POs\*" "gs://dompro/piece pos/"
Since you have Python available to you, you could write a small Python script to find the ctime (creation time) or mtime (modification time) of each file in a directory, see if that date is today, and upload it if so. You can see an example in this question which could be adapted as follows:
import datetime
import os

local_path_to_storage_bucket = [
    ('<local-path-1>', 'gs://bucket1'),
    ('<local-path-2>', 'gs://bucket2'),
    # ... add more here as needed
]

today = datetime.date.today()

for local_path, storage_bucket in local_path_to_storage_bucket:
    for filename in os.listdir(local_path):
        # Build the full path; os.listdir returns bare filenames.
        path = os.path.join(local_path, filename)
        ctime = datetime.date.fromtimestamp(os.path.getctime(path))
        mtime = datetime.date.fromtimestamp(os.path.getmtime(path))
        if today in (ctime, mtime):
            # Using the 'subprocess' library would be better, but this is
            # simpler to illustrate the example.
            os.system('gsutil cp "%s" "%s"' % (path, storage_bucket))
Alternatively, consider using the Google Cloud Storage Python API directly instead of shelling out to gsutil.
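If you go that route, a minimal sketch with the google-cloud-storage client library might look like this (the bucket name and object prefix come from the question; the local folder and credential setup are assumptions):
import datetime
import os

from google.cloud import storage

client = storage.Client()              # uses your configured credentials
bucket = client.bucket("dompro")       # bucket name from the question

local_folder = r"E:\PIECE POs"         # one of the six local folders
today = datetime.date.today()

for filename in os.listdir(local_folder):
    path = os.path.join(local_folder, filename)
    mtime = datetime.date.fromtimestamp(os.path.getmtime(path))
    if mtime == today:
        # Upload under the same object prefix used in the question.
        blob = bucket.blob("piece pos/" + filename)
        blob.upload_from_filename(path)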

Using COPY FROM in postgres - absolute filename of local file

I'm trying to import a CSV file using the COPY FROM command in Postgres.
The database is on a Linux server, and my data is stored locally, i.e. C:\test.csv.
I keep getting the error:
ERROR: could not open file "C:\test.csv" for reading: No such file or directory
SQL state: 58P01
I know that I need to use an absolute path to the file that the server can see, but everything I try brings up the same error.
Can anyone help please?
Thanks
Quote from the PostgreSQL manual:
The file must be accessible to the server and the name must be specified from the viewpoint of the server
So you need to copy the file to the server before you can use COPY FROM.
If you don't have access to the server, you can use psql's \copy command, which is very similar to COPY FROM but works with local files. See the manual for details.
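If you are loading the file from Python rather than psql, psycopg2's copy_expert streams a client-side file over the connection in much the same way as \copy; a sketch, where the table name and connection string are hypothetical:
import psycopg2

conn = psycopg2.connect("host=my-linux-server dbname=mydb user=me")
with conn:
    with conn.cursor() as cur, open(r"C:\test.csv") as f:
        # The file is read locally and streamed to the server over the
        # connection, so the server never needs to see C:\test.csv itself.
        cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", f)
conn.close()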