How can I save a binary file to my project assets using project-lib python? - ibm-cloud

The project-lib documentation shows how to save a pandas DataFrame to the project's assets:
# Import the lib
from project_lib import Project
project = Project(sc,"<ProjectId>", "<ProjectToken>")
# let's assume you have the pandas DataFrame pandas_df which contains the data
# you want to save in your object storage as a csv file
project.save_data("file_name.csv", pandas_df.to_csv())
# the function returns a dict which contains the asset_id, bucket_name and file_name
# upon successful saving of the data
However, if I have a local file ...
! wget url_to_binary_file
How can I then upload that file to the project’s assets?

I needed to read the file as bytes. Note that this reads the whole file into memory, so don't try this if the file is larger than your available memory:
import io

filename = 'thefilename'
with open(filename, 'rb') as z:
    data = io.BytesIO(z.read())

project.save_data(
    filename, data, set_project_asset=True, overwrite=True
)
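Putting the pieces together, here is a minimal end-to-end sketch of downloading a binary file and saving it as a project asset. The URL and file name are placeholders, and it assumes project has already been initialised as in the snippet above:

import io
import urllib.request

# Placeholder URL; substitute the binary file you actually need.
url = "https://example.com/model.bin"
filename = "model.bin"

# Download the file to local storage (the equivalent of the !wget step).
urllib.request.urlretrieve(url, filename)

# Read it back as bytes and push it to the project's object storage.
with open(filename, 'rb') as f:
    data = io.BytesIO(f.read())

project.save_data(
    filename, data, set_project_asset=True, overwrite=True
)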

Related

Generate temporary Directory with files in Python for unittesting

I want to create a temporary directory and put some files in it:
import os
import tempfile
from pathlib import Path
with tempfile.TemporaryDirectory() as tmp_dir:
    # generate some random files in it
    Path('file.txt').touch()
    Path('file2.txt').touch()
    files_in_dir = os.listdir(tmp_dir)
    print(files_in_dir)
Expected: ['file.txt', 'file2.txt']
Result: []
Does anyone know how to do this in Python? Or is there a better way to just do some mocking?
You have to create the files inside the directory by joining their names onto tmp_dir; the with context does not change the current working directory for you, so Path('file.txt').touch() creates the file wherever the script happens to be running.
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, 'file.txt').touch()
    Path(tmp_dir, 'file2.txt').touch()
    files_in_dir = os.listdir(tmp_dir)
    print(files_in_dir)
    # ['file2.txt', 'file.txt']
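Since the question mentions unit testing, here is a minimal sketch of the same idea wrapped in a unittest test case; the class and test names are made up for illustration:

import os
import tempfile
import unittest
from pathlib import Path

class TestWithTempFiles(unittest.TestCase):
    def setUp(self):
        # Create a throwaway directory and register its cleanup.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.tmp_dir = self._tmp.name
        # Populate it with the files the code under test expects.
        Path(self.tmp_dir, 'file.txt').touch()
        Path(self.tmp_dir, 'file2.txt').touch()

    def test_files_are_visible(self):
        self.assertEqual(sorted(os.listdir(self.tmp_dir)), ['file.txt', 'file2.txt'])

if __name__ == '__main__':
    unittest.main()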

Check if directory contains json files using org.apache.hadoop.fs.Path in HDFS

I'm following the steps indicated here: Avoid "Path does not exist" in dir based spark load, to filter which directories in an array contain JSON files before sending them to the spark.read method.
When I use
inputPaths.filter(f => fs.exists(new org.apache.hadoop.fs.Path(f + "/*.json*")))
It returns an empty result even though JSON files exist in at least one of the paths. One of the comments says this approach doesn't work with HDFS; is there a way to make it work?
I am running this in a Databricks notebook.
There is a method for listing the files in a directory:
fs.listStatus(dir)
Something like this (note that listStatus takes a Path, not a String):
inputPaths.filter(f => fs.listStatus(new org.apache.hadoop.fs.Path(f)).exists(file => file.getPath.getName.endsWith(".json")))
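Since this is running in a Databricks notebook, a rough Python equivalent is sketched below; it assumes the notebook-provided dbutils and spark objects, and input_paths is a hypothetical list of candidate directories:

# Hypothetical list of candidate directories.
input_paths = ["dbfs:/mnt/data/day1", "dbfs:/mnt/data/day2"]

def has_json_files(path):
    # dbutils.fs.ls raises if the path does not exist, so treat that as "no JSON files".
    try:
        return any(f.name.endswith(".json") for f in dbutils.fs.ls(path))
    except Exception:
        return False

json_paths = [p for p in input_paths if has_json_files(p)]
df = spark.read.json(json_paths) if json_paths else None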

How to import referenced files in ETL scripts?

I have a script which I'd like to pass a configuration file into. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script?
I've tried from configuration import *, where the referenced file name is configuration.py, but no luck (ImportError: No module named configuration).
I noticed the same issue. I believe there is already a ticket to address it, but here is what AWS support suggests in the meantime.
If you are using the referenced files path variable in a Python shell job, the referenced file is placed in /tmp, where the Python shell job has no access by default. However, the same operation works successfully in a Spark job, because the file is found in the default file directory.
The code below finds the absolute path of sample_config.json, which was referenced in the Glue job configuration, and prints its contents.
import json
import os
import sys

def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    # The referenced file ends up on sys.path, so search each entry for it.
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))

with open(get_referenced_filepath('sample_config.json'), "r") as f:
    data = json.load(f)

print(data)
The Boto3 API can be used to read the referenced file directly from S3 as well:
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('sample_bucket', 'sample_config.json')
# Stream the object line by line rather than reaching into the private _raw_stream attribute.
for line in obj.get()['Body'].iter_lines():
    print(line)
I had this issue with a Glue v2 Spark job, rather than a Python shell job which the other answer discusses in detail.
The AWS documentation says that it is not necessary to zip a single .py file. However, I decided to use a .zip file anyway.
My .zip file contains the following:
Archive:  utils.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Defl:N        5   0% 01-01-2049 00:00 00000000  __init__.py
    6603  Defl:N     1676  75% 01-01-2049 00:00 f4551ccb  utils.py
--------          -------  ---                            -------
    6603             1681  75%                            2 files
Note that __init__.py is present and that the archive is compressed with Deflate (the usual zip format).
In my Glue Job, I added the referenced files path job parameter pointing to my zip file on S3.
In the job script, I needed to explicitly add my zip file to the Python path before the import would work.
import sys
sys.path.insert(0, "utils.zip")
import utils
Failing to do the above resulted in an ImportError: No module named utils error.
For others who are struggling with this, inspecting the following variables helped me to debug the issue and arrive at the solution. Paste this into your Glue job and view the results in CloudWatch.
import sys
import os
print(f"os.getcwd()={os.getcwd()}")
print(f"os.listdir('.')={os.listdir('.')}")
print(f"sys.path={sys.path}")

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is written to the current worker node directory, and I know this because when I execute "ls -a".!! from Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran val locationPt = "pwd".!! and then val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it with the SQL context from HDFS. I overwrite "out.las" in HDFS each time so that it doesn't take up too much space.
I have used the Hadoop API before to get at files; I don't know whether it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the code below, so treat it as a sketch of what to do afterwards.
val file = new org.apache.hadoop.fs.Path(filePath + "out.las")
val fileIn: FSDataInputStream = fs.open(file)
// Allocate a buffer the size of the file and read it fully.
val readIn: Array[Byte] = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
fileIn.readFully(0, readIn)

Can't load a file in Play (always not found)

I can't load a file in Play:
val filePath1 = "/views/layouts/mylayout.scala.html"
val filePath2 = "views/layouts/mylayout.scala.html"
Play.getExistingFile(filePath1)
Play.getExistingFile(filePath2)
Play.resourceAsStream(filePath1)
Play.resourceAsStream(filePath2)
None of these works; they all return None.
You are essentially trying to read a source file at runtime, which is not something you should usually do. If you want to read a file at runtime, I'd recommend putting it somewhere that will end up on the classpath and then using Play.resourceAsStream to read it. The files in the conf directory, and non-compiled files in the app dir, should end up on the classpath.