I have a script which I'd like to pass a configuration file into. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script?
I've tried from configuration import *, where the referenced file name is configuration.py, but no luck (ImportError: No module named configuration).
I noticed the same issue. I believe there is already a ticket to address it, but here is what AWS support suggests in the meantime.
If you are using the referenced files path variable in a Python shell job, the referenced file ends up in /tmp, to which the Python shell job has no access by default. The same operation works in a Spark job, however, because there the file is placed in the default file directory.
The code below finds the absolute path of sample_config.json, which was referenced in the Glue job configuration, and prints its contents.
import json
import os
import sys

def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    # Search every directory on sys.path for the referenced file
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))

with open(get_referenced_filepath('sample_config.json'), "r") as f:
    data = json.load(f)

print(data)
The Boto3 API can be used to access the referenced file directly from S3 as well:
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('sample_bucket', 'sample_config.json')
# Stream the object line by line (note: _raw_stream is a private botocore attribute)
for line in obj.get()['Body']._raw_stream:
    print(line)
I had this issue with a Glue v2 Spark job, rather than a Python shell job which the other answer discusses in detail.
The AWS documentation says that it is not necessary to zip a single .py file. However, I decided to use a .zip file anyway.
My .zip file contains the following:
Archive: utils.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Defl:N 5 0% 01-01-2049 00:00 00000000 __init__.py
6603 Defl:N 1676 75% 01-01-2049 00:00 f4551ccb utils.py
-------- ------- --- -------
6603 1681 75% 2 files
Note that __init__.py is present and the archive is compressed using Deflate (usual zip format).
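For reference, an archive with that layout can be produced with Python's standard zipfile module; this is just a sketch, assuming __init__.py and utils.py sit in the current directory:
import zipfile

# ZIP_DEFLATED gives the Deflate compression shown in the listing above
with zipfile.ZipFile("utils.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("__init__.py")
    zf.write("utils.py")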
In my Glue Job, I added the referenced files path job parameter pointing to my zip file on S3.
In the job script, I needed to explicitly add my zip file to the Python path before the import would work.
import sys
sys.path.insert(0, "utils.zip")
import utils
Failing to do the above resulted in an ImportError: No module named error.
For others who are struggling with this, inspecting the following variables helped me debug the issue and arrive at the solution. Paste them into your Glue job and view the results in CloudWatch.
import sys
import os
print(f"os.getcwd()={os.getcwd()}")
print(f"os.listdir('.')={os.listdir('.')}")
print(f"sys.path={sys.path}")
I have a PyQt5 (5.15.6) application running in Python 3 and want to reference my qss file like this:
qss_file = QtCore.QFile("my_app_qss.qss")
However, I have multiple apps that use the same qss file, so depending on where I run the app from, I need an absolute reference rather than a relative one. I would also like to compile any of those apps with pyinstaller and deploy them to another machine. How can I reference this qss file?
Example folder structure:
main
|-- resources/my_app_qss.qss
|-- apps/
    |-- project1/app1.py
    |-- project2/
        |-- subfolder/app2.py
The issue is that I did not understand that
qss_file = QtCore.QFile("my_app_qss.qss")
is not a path to a file. It is referencing a file that gets built by pyrcc4 from the .qrc source.
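To make that concrete, here is a minimal sketch of the resource route (the file and module names are assumptions, and with PyQt5 the resource compiler is pyrcc5): list the stylesheet in a .qrc file, compile it, and open it through the ':/' resource prefix so the lookup no longer depends on the working directory and survives PyInstaller bundling.
# resources.qrc (hypothetical) lists the stylesheet:
#   <RCC><qresource prefix="/"><file>my_app_qss.qss</file></qresource></RCC>
# Compile it once with:  pyrcc5 resources.qrc -o resources_rc.py
from PyQt5 import QtCore
import resources_rc  # importing the compiled module registers the resources

qss_file = QtCore.QFile(":/my_app_qss.qss")  # ':/' marks a Qt resource path
qss_file.open(QtCore.QFile.ReadOnly | QtCore.QFile.Text)
stylesheet = bytes(qss_file.readAll()).decode("utf-8")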
I'm trying to work with a file stored in HDFS, but when I try to access it I get an error: No such file or directory.
Can you tell me how to access files in HDFS correctly?
Update:
The author of the answer pointed me in the right direction. As a result, this is how I execute the Python script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#import pandas as pd
import sys
for line in sys.stdin:
    print('Hello, ' + line)
# this is hello.py
And the Scala application:
spark.sparkContext.addFile(getClass.getResource("hello.py").getPath, true)
val test = spark.sparkContext.parallelize(List("Body!")).repartition(1)
val piped = test.pipe(SparkFiles.get("./hello.py"))
val c = piped.collect()
c.foreach(println)
Output: Hello, Body!
Now I have to think about whether, as a cluster user, I can install pandas on workers.
I think you should try directly referencing the external file, rather than attempting to download it to your Spark driver just to upload it again:
spark.sparkContext.addFile(s"hdfs://$srcPy")
I have downloaded the graphframes package (from here) and saved it on my local disk. Now, I would like to use it. So, I use the following command:
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --py-files ~/temp/graphframes-0.1.0-spark1.5.jar --jars ~/temp/graphframes-0.1.0-spark1.5.jar --packages graphframes:graphframes:0.1.0-spark1.5
All the pyspark functionality works as expected, except for the new graphframes package: whenever I try to import graphframes, I get an ImportError. When I examine sys.path, I can see the following two paths:
/tmp/spark-1eXXX/userFiles-9XXX/graphframes_graphframes-0.1.0-spark1.5.jar and /tmp/spark-1eXXX/userFiles-9XXX/graphframes-0.1.0-spark1.5.jar; however, these files don't exist. Moreover, the /tmp/spark-1eXXX/userFiles-9XXX/ directory is empty.
What am I missing?
In my case:
1. cd /home/zh/.ivy2/jars
2. jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
3. add /home/zh/.ivy2/jars to PYTHONPATH in spark-env.sh, like the line below:
export PYTHONPATH=$PYTHONPATH:/home/zh/.ivy2/jars:.
This might be an issue in Spark packages with Python in general. Someone else was asking about it too earlier on the Spark user discussion alias.
My workaround is to unpackage the jar to find the python code embedded, and then move the python code into a subdirectory called graphframes.
For instance, I run pyspark from my home directory:
~$ ls -lart
drwxr-xr-x 2 user user 4096 Feb 24 19:55 graphframes
~$ ls graphframes/
__init__.pyc examples.pyc graphframe.pyc tests.pyc
You should not need the py-files or jars parameters, though; something like
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --packages graphframes:graphframes:0.1.0-spark1.5
and having the python code in the graphframes directory should work.
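If you prefer to script that extraction, a jar is just a zip archive, so something along these lines does the same job (the 'graphframes/' member prefix is an assumption; check the jar's actual layout first):
import zipfile

# Pull the embedded Python package out of the graphframes jar
with zipfile.ZipFile("graphframes-0.1.0-spark1.5.jar") as jar:
    py_files = [name for name in jar.namelist() if name.startswith("graphframes/")]
    jar.extractall(path=".", members=py_files)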
Add these lines to your $SPARK_HOME/conf/spark-defaults.conf:
spark.executor.extraClassPath file_path/jar1:file_path/jar2
spark.driver.extraClassPath file_path/jar1:file_path/jar2
In the more general case of importing an 'orphan' Python file (outside the current folder, not part of a properly installed package), use addPyFile, e.g.:
sc.addPyFile('somefolder/graphframe.zip')
addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
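A minimal usage sketch (the zip path and the imported module name follow the answer above and are assumptions about what the archive contains):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orphan-pyfile-demo").getOrCreate()
sc = spark.sparkContext

# Ship the archive to the driver and every executor before importing from it
sc.addPyFile('somefolder/graphframe.zip')

import graphframe  # works only once addPyFile has distributed the zip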
While trying to use the Sphinx MATLAB domain, I can't get the MWE provided on the extension's PyPI site to work.
There is always this Can't import module error. I'd guess that the extension generates pseudo modules from the m-code, but up to now I could not figure out how this mechanism works.
The dir structure looks like this:
root
|--test_data
|  |--MyHandleClass.m
|
|--doc
   |--conf.py
   |--Makefile
   |--index.rst
The files MyHandleClass.m and index.rst contain the example code given on the package site, and the conf.py starts like this:
import sys, os
sys.path.append(os.path.abspath('.'))
sys.path.append(os.path.abspath('./test_data'))
# -- General configuration -----------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
"sphinxcontrib.matlab",
"sphinx.ext.autosummary",
"sphinx.ext.autodoc"]
autodoc_default_flags = ['members','show-inheritance','undoc-members']
autoclass_content = 'both'
mathjax_path = 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default'
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8'
# The master toctree document.
master_doc = 'index'
Error message:
WARNING: autodoc: failed to import module u'test_data'; the following exception was raised:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\sphinx\ext\autodoc.py", line 335, in import_object
__import__(self.modname)
ImportError: No module named test_data
E:\ME\doc\index.rst:13: WARNING: don't know which module to import for autodocumenting u'MyHandleClass' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
After varying this and that, maybe somebody out there has a clue?
Thanks for trying the matlabdomain sphinxcontrib extension. In order to use Sphinx to document MATLAB m-files, you need to add matlab_src_dir in conf.py, as described in the Configuration section of the documentation. This is because the Python interpreter can't import a MATLAB m-file. Therefore you should not add your MATLAB root to the Python sys.path, or you will get the error you received. Instead, set matlab_src_dir to the path containing the folder of the MATLAB project which you want to document.
Given your file structure, in order to document test_data use a conf.py with the following:
import os
# NOTE: don't add MATLAB m-files to `sys.path`
#sys.path.insert(0, os.path.abspath('.'))
# instead add them to `matlab_src_dir`
matlab_src_dir = os.path.abspath('..') # MATLAB
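Putting that together with the extensions list from the question, the relevant part of conf.py ends up roughly like this (paths follow the question's folder layout):
import os

# Do not put the MATLAB folders on sys.path; point matlab_src_dir at the
# directory containing the MATLAB sources (the repo root, which holds test_data/)
matlab_src_dir = os.path.abspath('..')

extensions = [
    "sphinxcontrib.matlab",
    "sphinx.ext.autosummary",
    "sphinx.ext.autodoc"]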
Hope that does it! Please feel free to ask any more questions. I'm happy to help!
While trying to compile a proto file named UserOptions.proto, which imports Account.proto, using the command below
protoc --proto_path=/home/project_new1/account --java_out=/home/project_new1/source /home/project_new1/settings/UserOptions.proto
I get the following error:
/home/project_new1/settings/UserOptions.proto: File does not reside within any path specified using --proto_path (or -I). You must specify a --proto_path which encompasses this file.
PS: UserOptions.proto, present in the directory /home/project_new1/settings, imports Account.proto, present in the directory /home/project_new1/account.
Proto descriptor files:
UserOptions.proto
package settings;
import "Account.proto";
option java_outer_classname = "UserOptionsVOProto";
Account.proto
package account;
option java_outer_classname = "AccountVOProto";
message Object
{
    optional string userId = 1;
    optional string service = 2;
}
As the error message states, the file you pass on the command line needs to be in one of the --proto_paths. In your case, you have only specified one --proto_path of:
/home/project_new1/account
But the file you're passing is:
/home/project_new1/settings/UserOptions.proto
Notice that the file is not in the account subdirectory; it's in settings instead.
You have two options:
(Not recommended) Pass a second --proto_path argument to add .../settings to the path.
(Recommended) Use the root of your source tree as the proto path. E.g.:
protoc --proto_path=/home/project_new1/ --java_out=/home/project_new1 /home/project_new1/settings/UserOptions.proto
In this case, to import Account.proto, you'll need to write:
import "account/Account.proto";
For those of us who want this really spelled out, here is an example where I have installed the protoc beta for gRPC using NuGet Packages Google.Protobuf, Grpc.Core and Grpc.Tools. My solution packages are one level above my Grpc directory (i.e. at BruTrader\packages). My .proto files are at BruTrader\Grpc\protos.
1. My .proto file:
syntax = "proto3";
import "timestamp.proto";
import "enums.proto";
package BruTrader.Grpc;
message DividendMessage {
    double amount = 1;
    google.protobuf.Timestamp dateUnix = 2;
}
2. My GenerateProto.bat file:
..\packages\Google.Protobuf.3.0.0-beta2\tools\protoc.exe -I..\Grpc\protos -I..\packages\Google.Protobuf.3.0.0-beta2\tools\google\protobuf --csharp_out=..\Grpc\Generated --grpc_out=..\Grpc\Generated --plugin=protoc-gen-grpc=..\packages\Grpc.Tools.0.13.0\tools\grpc_csharp_plugin.exe %1
3. My BuildProtos.bat file:
call GenerateProto ..\Grpc\protos\masterinstrument.proto
call GenerateProto .\protos\instrument.proto
etc.
4. BuildProtos.bat is executed as a Pre-build event on my Grpc project like this:
CD $(ProjectDir)
CALL "$(ProjectDir)BuildProtos.bat"
For my environment (Windows 10 Pro operating system and the C++ programming language), I used protoc-3.12.2-win64.zip, which you can download from here. You should open a Windows PowerShell inside the protoc-3.12.2-win64\bin path and then execute one of the following commands:
.\protoc.exe -I=C:\Users\UserName\Desktop\SRC --cpp_out=C:\Users\UserName\Desktop\DST C:\Users\UserName\Desktop\SRC\addressbook.proto
Or
.\protoc.exe --proto_path=C:\Users\UserName\Desktop\SRC --cpp_out=C:\Users\UserName\Desktop\DST C:\Users\UserName\Desktop\SRC\addressbook.proto
Note:
1. My source folder is: C:\Users\UserName\Desktop\SRC
2. My destination folder is: C:\Users\UserName\Desktop\DST
3. My .proto file is: C:\Users\UserName\Desktop\SRC\addressbook.proto