I'm trying to work with a file stored in HDFS.
But when I try to access this file, I get an error: No such file or directory.
Can you tell me how to access files in HDFS correctly?
UPD:
The author of the answer pointed me in the right direction.
As a result, this is how I execute the Python script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#import pandas as pd
import sys
for line in sys.stdin:
    print('Hello, ' + line)
# this is hello.py
And Scala application:
import org.apache.spark.SparkFiles

// ship hello.py (bundled as a resource) to every executor
spark.sparkContext.addFile(getClass.getResource("hello.py").getPath, true)
val test = spark.sparkContext.parallelize(List("Body!")).repartition(1)
// pipe each record through the script
val piped = test.pipe(SparkFiles.get("./hello.py"))
val c = piped.collect()
c.foreach(println)
Output: Hello, Body!
Now I have to think about whether, as a cluster user, I can install pandas on workers.
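If it helps, a variant of hello.py along the lines below (the name check_pandas.py is my own placeholder, not something from this setup) could be shipped and piped exactly the same way to report whether pandas is importable on each worker:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# check_pandas.py (hypothetical name): report whether pandas imports on the worker
import sys

try:
    import pandas
    status = pandas.__version__
except ImportError:
    status = 'pandas not installed'

for line in sys.stdin:
    print(status + ' (' + line.strip() + ')')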
I think you should try directly referencing the external file rather than attempting to download it to your Spark driver just to upload it again:
spark.sparkContext.addFile(s"hdfs://$srcPy")
I have a Talend job that creates a .csv file, and now I want to convert it to .parquet format using Talend v6.5.1. The only option I can think of is the tSystem component calling a Python script from the local directory where the .csv lands temporarily. I know I can convert this easily using pandas or PySpark, but I am not sure the same code will work from tSystem in Talend. Can you please provide suggestions or instructions?
Code:
import pandas as pd
DF = pd.read_csv("Path")
DF.to_parquet("Path.parquet")
If you have an external script on your file system, you can try
"python \"myscript.py\" "
Here is a link on the Talend forum regarding this problem:
https://community.talend.com/t5/Design-and-Development/how-to-execute-a-python-script-file-with-an-argument-using/m-p/23975#M3722
I was able to resolve the problem by following the steps below:
import pandas as pd
import pyarrow as pa
import numpy as np
import sys
filename = sys.argv[1]
test = pd.read_csv("C:\\Users\\your desktop\\Downloads\\TestXML\\" + filename + ".csv")
test.to_parquet("C:\\Users\\your desktop\\Downloads\\TestXML\\" + filename + ".parquet")
I have a script which I'd like to pass a configuration file into. On the Glue jobs page, I see that there is a "Referenced files path" which points to my configuration file. How do I then use that file within my ETL script?
I've tried from configuration import *, where the referenced file name is configuration.py, but no luck (ImportError: No module named configuration).
I noticed the same issue. I believe there is already a ticket to address it, but here is what AWS support suggests in the meantime.
If you are using the referenced files path variable in a Python
shell job, the referenced file is placed in /tmp, where the Python
shell job has no access by default. However, the same operation works
successfully in a Spark job, because the file is found in the default
file directory.
The code below finds the absolute path of sample_config.json, which was referenced in the Glue job configuration, and prints its contents.
import json
import sys, os
def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))

with open(get_referenced_filepath('sample_config.json'), "r") as f:
    data = json.load(f)

print(data)
The Boto3 API can be used to access the referenced file as well:
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('sample_bucket', 'sample_config.json')
for line in obj.get()['Body'].iter_lines():
    print(line)
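If the goal is to parse the configuration rather than stream it line by line, a variant along these lines (same placeholder bucket and key as above) should also work:

import json
import boto3

s3 = boto3.resource('s3')
# read the whole referenced file into memory and parse it as JSON
body = s3.Object('sample_bucket', 'sample_config.json').get()['Body'].read()
data = json.loads(body)
print(data)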
I had this issue with a Glue v2 Spark job, rather than a Python shell job which the other answer discusses in detail.
The AWS documentation says that it is not necessary to zip a single .py file. However, I decided to use a .zip file anyway.
My .zip file contains the following:
Archive: utils.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Defl:N 5 0% 01-01-2049 00:00 00000000 __init__.py
6603 Defl:N 1676 75% 01-01-2049 00:00 f4551ccb utils.py
-------- ------- --- -------
6603 1681 75% 2 files
Note that __init__.py is present and the archive is compressed using Deflate (usual zip format).
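For completeness, here is a sketch of how such an archive could be built with Python's zipfile module (assuming utils.py sits in the current directory; any zip tool that uses Deflate works just as well):

import zipfile

# build utils.zip with Deflate compression, matching the listing above
with zipfile.ZipFile("utils.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("__init__.py", "")  # empty package marker
    zf.write("utils.py")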
In my Glue Job, I added the referenced files path job parameter pointing to my zip file on S3.
In the job script, I needed to explicitly add my zip file to the Python path before the import would work.
import sys
sys.path.insert(0, "utils.zip")
import utils
Failing to do the above resulted in an ImportError: No module named utils.
For others who are struggling with this, inspecting the following variables helped me debug the issue and arrive at the solution. Paste them into your Glue job and view the results in CloudWatch.
import sys
import os
print(f"os.getcwd()={os.getcwd()}")
print(f"os.listdir('.')={os.listdir('.')}")
print(f"sys.path={sys.path}")
I am a beginner with Spark and am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html
But when I run in cmd the following:
C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark
then I receive the following error message:
File "bin\pyspark", line 21
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
SyntaxError: invalid syntax
What am I doing wrong here?
P.S. When in cmd I try just C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4>bin\pyspark
then I receive "'python' is not recognized as an internal or external command, operable program or batch file".
You need to have Python available in the system path; you can add it with setx:
setx path "%path%;C:\Python27"
I'm a fairly new Spark user (as of today, really). I am using Spark 1.6.0 on Windows 10 and 7 machines. The following worked for me:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Using the code above, I was able to launch Spark in an IPython notebook and in my Enthought Canopy Python IDE. Before this, I was only able to launch pyspark through a cmd prompt. The code above will only work if you have your environment variables set correctly for Python and Spark (pyspark).
I run this set of path settings whenever I start pyspark in IPython:
import os
import sys
# Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"') for R
### restart Spark with: ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
os.environ['SPARK_HOME']="G:/Spark/spark-1.5.1-bin-hadoop2.6"
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/bin")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext
##sc.stop() # IF you wish to stop the context
sc = SparkContext("local", "Simple App")
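Continuing from the snippet above (my own addition, assuming sc was created as shown), a quick sanity check confirms the context can run a trivial job:

# should print 10 if the SparkContext is working
print(sc.parallelize(range(10)).count())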
With the reference and help of the user "maxymoo" I was able to find a way to set a PERMANENT path in Windows 7 as well. The instructions are here:
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx
Simply set the path in System -> Environment Variables -> Path
R path on my system: C:\Program Files\R\R-3.2.3\bin
Python path on my system: c:\python27
Spark path on my system: c:\spark-2
The paths must be separated by ";" and there must be no spaces between them.
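If you want to double-check from Python that the entries made it into PATH, here is a small sketch of mine using the paths listed above:

import os

# os.pathsep is ";" on Windows, matching the separator rule above
entries = [e.strip().lower() for e in os.environ["PATH"].split(os.pathsep)]
for wanted in (r"C:\Program Files\R\R-3.2.3\bin", r"c:\python27", r"c:\spark-2"):
    print("%s on PATH: %s" % (wanted, wanted.lower() in entries))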
I have a Plone project which is created by a buildout script and needs a default encoding of utf-8. This is usually done in the sitecustomize.py file of the Python installation. Since there is a virtualenv, I'd like to generate this file automatically, to contain something like:
import sys
sys.setdefaultencoding('utf-8')
After generation I have two empty sitecustomize.py files - one in parts/instance/, and one in parts/buildout; but neither of them seems to be used (I couldn't find them in sys.path).
I tried zopepy:
>>> from os.path import join, isfile
>>> from pprint import pprint
>>> import sys
>>> pprint([p for p in sys.path
... if isfile(join(p, 'sitecustomize.py'))
... ])
and found another one in my local lib/python2.7/site-packages/ directory which looks good; but it doesn't work:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
This directory sits near the end of the sys.path, because I needed to add it by an extra-paths entry (to get another historical package).
Any pointers? Thank you!
System information: CentOS 7, Python 2.7.5
Edit:
I deleted those two empty sitecustomize.py files; now I have a default encoding of utf-8 in the zopepy session, but still ascii in Plone. This surprises me, because I have this in my buildout script:
[zopepy]
recipe=zc.recipe.egg
eggs = ${instance:eggs}
extra-paths = ${instance:extra-paths}
interpreter = zopepy
scripts = zopepy
To debug this, I created a little function which I added to my code, and which displays some information about relevant modules on sys.path:
import sys
from os.path import join, isdir, isfile
def sitecustomize_info():
    plen = len(sys.path)
    print '-' * 79
    print 'sys.path has %(plen)d entries' % locals()
    for tup in zip(range(1, plen+1), sys.path):
        nr, dname = tup
        if isdir(dname):
            for fname in ('site.py', 'sitecustomize.py'):
                if isfile(join(dname, fname)):
                    print '%(nr)4d. %(dname)s/%(fname)s' % locals()
            spname = join(dname, 'site-packages', 'sitecustomize.py')
            if isfile(spname):
                print '%(nr)4d. %(spname)s' % locals()
        else:
            print '? %(dname)s is not a directory' % locals()
    print '-' * 79
Output:
sys.path has 303 entries
8. /usr/lib64/python2.7/site-packages/sitecustomize.py
295. /opt/zope/instances/wnzkb/lib/python2.7/site-packages/sitecustomize.py
? /usr/lib64/python27.zip is not a directory
298. /usr/lib64/python2.7/site.py
? /usr/lib64/python2.7/lib-tk is not a directory
? /usr/lib64/python2.7/lib-old is not a directory
303. /usr/lib/python2.7/site-packages/sitecustomize.py
All sitecustomize.py files look the same (switching to utf-8), and I didn't tweak site.py (for now; if everything else fails, I might need to).
If you really want/need to use the sitecustomize.py trick, you could include this part in your buildout:
[fixencode]
recipe = plone.recipe.command
stop-on-error = yes
update-command = ${fixencode:command}
command =
SITE_PACKAGES=$(${buildout:executable} -c \
'from distutils.sysconfig import get_python_lib;print(get_python_lib())')
cat > $SITE_PACKAGES/../sitecustomize.py << EOF
#!${buildout:executable} -S
import sys
sys.setdefaultencoding('utf-8')
EOF
It will be written one level above the site-packages folder of your virtualenv (i.e. into its lib/pythonX.Y directory).
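To see where that command actually lands (it writes to $SITE_PACKAGES/../sitecustomize.py), a small check of mine can be run with the buildout's interpreter:

import os
from distutils.sysconfig import get_python_lib

# the recipe writes one level above site-packages, i.e. into lib/pythonX.Y
target = os.path.normpath(os.path.join(get_python_lib(), os.pardir, "sitecustomize.py"))
print("%s exists: %s" % (target, os.path.exists(target)))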
It looks like sitecustomize.py is not found anymore unless placed in the global lib directories (Discussion "deleting setdefaultencoding in site.py is evil" (2009), Tracker ticket "sitecustomize.py not found") - and this was made on purpose (!).
This is to prevent users from overriding the default encoding which might have been adjusted by some library. Libraries shouldn't do that, however.
Thus, whoever needs to set the default encoding is pushed toward doing it globally, which looks like a very silly idea to me. I'd consider it much more reasonable to set this in my virtual environment.
Unfortunately the sitecustomize.py modules in a virtualenv seem to be silently ignored; but it is possible to edit the local site.py. Here is a little sed script:
# vv-------1------vv vv---2---vv vv--------3------vv vv-----------4------------vv
sed --in-place -e 's,^\( encoding =\) \("ascii"\) \(# Default value\) \(set by _PyUnicode_Init()\),\1 "utf-8" \3 \2 \4,' lib/python2.7/site.py
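Either way, a quick check with the buildout's Python (the zopepy interpreter, for instance) shows whether the change took effect:

import sys

# expected to print 'utf-8' once sitecustomize.py or site.py has been adjusted
print(sys.getdefaultencoding())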
I would like to convert the following shell script to Scala (just for fun); however, the script must keep running and listen for changes to the *.mkd files. If any file is changed, the script should regenerate the affected doc. File IO has always been my Achilles heel...
#!/bin/sh
for file in *.mkd
do
pandoc --number-sections $file -o "${file%%.*}.pdf"
done
Any ideas around a good approach to this will be appreciated.
The following code, taken from my answer on Watch for project files, can also watch a directory and execute a specific command:
#!/usr/bin/env scala
import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._
val file = Paths.get(args(0))
val cmd = args(1)
val watcher = FileSystems.getDefault.newWatchService
file.register(
  watcher,
  StandardWatchEventKinds.ENTRY_CREATE,
  StandardWatchEventKinds.ENTRY_MODIFY,
  StandardWatchEventKinds.ENTRY_DELETE
)

def exec = cmd run true

@scala.annotation.tailrec
def watch(proc: Process): Unit = {
  val key = watcher.take
  val events = key.pollEvents
  val newProc =
    if (!events.isEmpty) {
      proc.destroy()
      exec
    } else proc
  if (key.reset) watch(newProc)
  else println("aborted")
}
watch(exec)
Usage:
watchr.scala markdownFolder/ "echo \"Something changed!\""
Extensions have to be made to the script to inject file names into the command. As of now this snippet should just be regarded as a building block for the actual answer.
Modifying the script to incorporate the *.mkd wildcards would be non-trivial as you'd have to manually search for the files and register a watch on all of them. Re-using the script above and placing all files in a directory has the added advantage of picking up new files when they are created.
As you can see, it gets pretty big and messy pretty quickly just relying on the Scala and Java APIs; you would be better off relying on alternative libraries, or just sticking to bash and inotify.