pyspark: IOError: [Errno 20] Not a directory

I am running a pyspark job on AWS-EMR and I got the following error:
IOError: [Errno 20] Not a directory: '/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg/my_common/data_tools/myData.yaml'
Does anyone know what I might have missed? Thanks!

I've run into this recently when I switched my Python Spark application from Client deploy mode to Cluster deploy mode.
My workaround is to locate the ZIP file (the artifact that I fed to spark-submit using --py-files):
import os
CURRENT_FILE_PATH = os.path.dirname(__file__)  # in cluster mode this ends up pointing at the --py-files archive
print("[DEBUG] CURRENT_FILE_PATH=" + CURRENT_FILE_PATH)
It comes out something like this:
/mnt2/yarn/usercache/task/appcache/application_1638998214637_0019/container_1638998214637_0019_02_000001/something.zip
then I can use something like:
import json
import zipfile

# CURRENT_FILE_PATH ends with the .zip itself, so open it as an archive
archive = zipfile.ZipFile(CURRENT_FILE_PATH, 'r')
json_bytes = archive.read('myfile.json')
json_data = json.loads(json_bytes)
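Putting the pieces together, here is a minimal sketch of a loader that works in both deploy modes; it assumes the module and myfile.json both sit at the root of the --py-files archive:
import json
import os
import zipfile

def load_resource(name):
    """Read a packaged resource whether the code runs from a plain directory
    (client mode) or from inside the --py-files archive (cluster mode)."""
    base = os.path.dirname(os.path.abspath(__file__))
    if zipfile.is_zipfile(base):
        # base is the .zip shipped via --py-files
        with zipfile.ZipFile(base, 'r') as archive:
            return archive.read(name)
    # base is an ordinary directory on the local filesystem
    with open(os.path.join(base, name), 'rb') as f:
        return f.read()

json_data = json.loads(load_resource('myfile.json'))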
Note: I first tried using pkg_resources but couldn't read in the resulting JSON due to TypeError from json.loads():
import pkg_resources
json_data = pkg_resources.resource_stream(__name__, 'myfile.json')
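For what it's worth, that TypeError most likely comes from handing the stream object itself to json.loads(); a sketch of how pkg_resources could still work (untested, and assuming myfile.json lives next to the importing module):
import json
import pkg_resources

# resource_stream returns a file-like object, so parse it with json.load ...
stream = pkg_resources.resource_stream(__name__, 'myfile.json')
json_data = json.load(stream)

# ... or fetch the raw bytes and parse those with json.loads
json_bytes = pkg_resources.resource_string(__name__, 'myfile.json')
json_data = json.loads(json_bytes)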
See also PySpark: how to resolve path of a resource file present inside the dependency zip file

As the error states, my_common-9.0-py2.7.egg is not a directory.
Are you missing a space in your path?
/mnt/yarn/usercache/hadoop/filecache/12/my_common-9.0-py2.7.egg /my_common/data_tools/myData.yaml

Related

Databricks dbutils.fs.ls shows files. However, reading them throws an IO error

I am running a Spark Cluster and when I'm executing the below command on Databricks Notebook, it gives me the output:
dbutils.fs.ls("/mnt/test_file.json")
[FileInfo(path=u'dbfs:/mnt/test_file.json', name=u'test_file.json', size=1083L)]
However, when I'm trying to read that file, I'm getting the below mentioned error:
with open("mnt/test_file.json", 'r') as f:
for line in f:
print line
IOError: [Errno 2] No such file or directory: 'mnt/test_file.json'
What might be the issue here? Any help/support is greatly appreciated.
In order to access files on a DBFS mount using local file APIs you need to prepend /dbfs to the path, so in your case it should be
with open('/dbfs/mnt/test_file.json', 'r') as f:
    for line in f:
        print(line)
See more details in the docs at https://docs.databricks.com/data/databricks-file-system.html#local-file-apis especially regarding limitations. With Databricks Runtime 5.5 and below there's a 2GB file limit. With 6.0+ there's no longer such a limit as the FUSE mount has been optimized to deal with larger file sizes.
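As a side note (not from the original answer), the same mounted file can also be read through Spark itself using the dbfs: path, e.g.:
# read the mounted JSON file with the Spark reader instead of local file APIs
df = spark.read.json("dbfs:/mnt/test_file.json")
df.show()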

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName, "-o", "out.las").!!
The file is written to the current worker node directory, and I know this because executing "ls -a".!! through Scala shows that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS, and then retrieving it using the SQL context through HDFS. I overwrite "out.las" each time in HDFS so that I don't take up too much space.
I have used the Hadoop API before to get at files; I don't know if it will help you here.
import org.apache.hadoop.fs.{FileSystem, FSDataInputStream, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the below, but something like this should give an idea of what to do afterward:
// read the whole file back from HDFS into a byte array
val path = new Path(filePath + "out.las")
val readIn: Array[Byte] = new Array[Byte](fs.getFileStatus(path).getLen.toInt)
val fileIn: FSDataInputStream = fs.open(path)
fileIn.readFully(0, readIn)
fileIn.close()
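For the approach described above (push each decompressed file to HDFS, then read it back through the SQL context), a rough sketch could look like this; the HDFS target directory is an assumption, and read.las comes from the LAS reader used in the question:
import org.apache.hadoop.fs.{FileSystem, Path}

// copy the locally decompressed file to HDFS, overwriting the previous one
val fs = FileSystem.get(sc.hadoopConfiguration)
val hdfsTarget = new Path("/user/me/dataForHDFS/out.las")
fs.copyFromLocalFile(false, true, new Path("out.las"), hdfsTarget)

// then read it back with the SQL context, as in the question
val dataFrame = sqlContext.read.las(hdfsTarget.toString)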

java.io.FileNotFoundException in Spark

I'm new here, learning Spark and Scala using a Notebook and cluster on Databricks.com. Here is my very simple code to load a file:
import sys.process._
val localpath="file:/tmp/myfile.json"
dbutils.fs.mkdirs("dbfs:/datasets/")
dbutils.fs.cp(localpath, "dbfs:/datasets/")
but I got error like this:
java.io.FileNotFoundException: File file:/tmp/myfile.json does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
at com.databricks.backend.daemon.dbutils.FSUtils$.cp(DBUtilsCore.scala:82)
at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.cp(DbfsUtilsImpl.scala:40)
I'm using a Mac and I've made sure that the file exists at this absolute path. Is this a Spark error? Thanks!
The line:
val localpath="file:/tmp/myfile.json"
should be:
val localpath="file:///tmp/myfile.json"
Basically, URIs are of the format scheme://host/path (see RFC 3986); for a local file the host is empty, hence the three slashes.
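Applied to the snippet from the question, that gives something like this (a sketch, assuming /tmp/myfile.json actually exists on the machine where the notebook code runs):
// copy the local file into DBFS using a fully qualified file URI
val localpath = "file:///tmp/myfile.json"
dbutils.fs.mkdirs("dbfs:/datasets/")
dbutils.fs.cp(localpath, "dbfs:/datasets/myfile.json")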

fails to import module when using matlabdomain

While trying to use the Sphinx MATLAB domain I can't get the MWE to work that is provided on the extension's PyPI site.
There is always this "Can't import module" error. I'd guess that the extension generates pseudo-modules from the m-code, but up to now I couldn't figure out how this mechanism works.
The dir structure looks like this
root
|--test_data
| |--MyHandleClass.m
|
|--doc
|--------conf.py
|--------Makefile
|--------index.rst
The files MyHandleClass.m and index.rst contain the example code given on the package site and the conf.py starts like this
import sys, os
sys.path.append(os.path.abspath('.'))
sys.path.append(os.path.abspath('./test_data'))
# -- General configuration -----------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
"sphinxcontrib.matlab",
"sphinx.ext.autosummary",
"sphinx.ext.autodoc"]
autodoc_default_flags = ['members','show-inheritance','undoc-members']
autoclass_content = 'both'
mathjax_path = 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default'
# The suffix of source filenames.
source_suffix = '.rst'
# The encoding of source files.
#source_encoding = 'utf-8'
# The master toctree document.
master_doc = 'index'
Error msg
WARNING: autodoc: failed to import module u'test_data'; the following exception was raised:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\sphinx\ext\autodoc.py", line 335, in import_object
__import__(self.modname)
ImportError: No module named test_data
E:\ME\doc\index.rst:13: WARNING: don't know which module to import for autodocumenting u'MyHandleClass' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
After varying this and that, maybe somebody out there has a clue?
Thanks for trying the matlabdomain sphinxcontrib extension. In order to use Sphinx to document MATLAB m-files, you need to add matlab_src_dir in conf.py as described in the Configuration section of the documentation. This is because the Python interpreter can't import a MATLAB m-file. Therefore you should not add your MATLAB root to the Python sys.path, or you will get the error you received. Instead set matlab_src_dir to the path containing the folder of your MATLAB project which you want to document.
Given your file structure, in order to document test_data use a conf.py with the following:
import os
# NOTE: don't add MATLAB m-files to `sys.path`
#sys.path.insert(0, os.path.abspath('.'))
# instead add them to `matlab_src_dir`
matlab_src_dir = os.path.abspath('..') # MATLAB
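For reference, the index.rst from the MWE would then pull in the class through the MATLAB domain directives, roughly like this (directive names as provided by sphinxcontrib-matlabdomain; check the example on the package site for the exact version):
Test Data
=========

.. mat:automodule:: test_data

.. mat:autoclass:: MyHandleClass
    :show-inheritance:
    :members: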
Hope that does it! Please feel free to ask any more questions. I'm happy to help!

How to import EPF data using EPFimporter.py provided by Apple

I tried using this link: http://www.apple.com/itunes/affiliates/resources/documentation/epfimporter.html
-----------------------
*Below is the script I executed:*
C:\Documents and Settings\freakk>python D:\freakk\Downloads\EPF_Itunes\EPFImporter\EPFimporter.py \D:\freakk\Downloads\EPF_Itunes\EPFImporter\db\album_popularity_per_genre
-----------------------
*But I am getting these errors:*
2011-10-12 18:24:00,529 [INFO]: Beginning import for the following directories:
    \D:\freakk\Downloads\EPF_Itunes\EPFImporter\db\album_popularity_per_genre
2011-10-12 18:24:00,529 [INFO]: Importing files in \D:\freakk\Downloads\EPF_Itunes\EPFImporter\db\album_popularity_per_genre
Traceback (most recent call last):
  File "D:\freakk\Downloads\EPF_Itunes\EPFImporter\EPFimporter.py", line 452, in <module>
    main()
  File "D:\freakk\Downloads\EPF_Itunes\EPFImporter\EPFimporter.py", line 435, in main
    fieldDelim=fieldSep)
  File "D:\freakk\Downloads\EPF_Itunes\EPFImporter\EPFimporter.py", line 162, in doImport
    fileList = os.listdir(dirPath)
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\D:\\freakk\\Downloads\\EPF_Itunes\\EPFImporter\\db\\album_popularity_per_genre/*.*'
please help me....
See the error log: it's telling you the syntax is incorrect:
C:\\D:\\freakk\\Downloads\\EPF_Itunes\\EPFImporter\\db\\album_popularity_per_genre/*.*
How can the D: directory be inside C:? The importer is not building the correct path to reach it.
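The leading backslash in the argument is presumably the culprit: on Windows, a path that starts with a single \ is resolved against the current drive, so \D:\... turns into C:\D:\... when the working directory is on C:. A quick illustration of that behaviour (hypothetical snippet, not part of EPFimporter.py):
import os
# assuming the current working directory is somewhere on the C: drive,
# a path beginning with a single backslash is rooted at that drive,
# so the D:\ portion ends up nested under C:\
print(os.path.abspath(r"\D:\freakk\Downloads\EPF_Itunes\EPFImporter\db"))
# -> C:\D:\freakk\Downloads\EPF_Itunes\EPFImporter\db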
This EPFImporter code is written mainly with Mac OS in mind, and it assumes that you are in the same directory as EPFImporter.py when you run it.
C:\Documents and Settings\freakk>python D:\freakk\Downloads\EPF_Itunes\EPFImporter\EPFimporter.py \D:\freakk\Downloads\EPF_Itunes\EPFImporter\db\album_popularity_per_genre
The above command will find neither your EPFimporter.py nor album_popularity_per_genre.
Change from drive C: to D:, go to the directory containing EPFimporter.py, and then try:
.....EPFImporter>python EPFimporter.py db\album_popularity_per_genre
Assuming you run it from the EPFImporter folder, something like this may work for you (not tested). Hope this makes things a bit clearer.
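Spelled out (untested, paths taken from the question), the sequence would look something like:
C:\Documents and Settings\freakk>D:
D:\>cd D:\freakk\Downloads\EPF_Itunes\EPFImporter
D:\freakk\Downloads\EPF_Itunes\EPFImporter>python EPFimporter.py db\album_popularity_per_genre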
Solved!
I was trying to import only partial data without the main table.
I tried importing the flat feed instead, and it worked.
Code:
For Flat Feed
C:\Documents and Settings\freakk>python c:\epf\epfimporter.py -f c:\epf\db\application-usa-20111012
Note: don't include the file name (application-usa-20111012.txt); give only the folder name (e.g. application-usa-20111012).