I use an enterprise cluster that has both local and HDFS filesystems. The files I need to process are in NetCDF format, so I use SciSpark to load them. On a workstation that has no HDFS, the code reads from a local folder. However, when HDFS folders are present, it attempts to read from HDFS only. Because the files in the folder are huge (cumulatively hundreds of GBs to TBs), I have to copy them to HDFS first, which is inefficient and inconvenient. The (Scala) code I use for loading the files is shown below:
val ncFilesRDD = sc.netcdfDFSFiles(ncDirectoryPath, List("x1", "x2", "x3"))
val ncFileCRDArrayRDD = ncFilesRDD.map(x => (x.variables.get("x1").get.data.toArray,
x.variables.get("x2").get.data.toArray,
x.variables.get("x3").get.data.toArray
))
I would very much appreciate any help in modifying the code so that it reads from a local directory instead of HDFS.
The Source Code Comment Doc for netcdfDFSFiles says that the files are read from HDFS, so I'm not sure you can use netcdfDFSFiles to read from local.
But there is another function, netcdfFileList, whose doc says: The URI could be an OpenDapURL or a filesystem path.
Since it can take a filesystem path, you can use it like:
val ncFilesRDD = sc.netcdfFileList("file:///your/path", List("x1", "x2", "x3"))
The file:// scheme makes it look at the local filesystem only.
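Applied to the snippet from the question, a minimal sketch would look like the following. This assumes netcdfFileList returns the same kind of dataset objects as netcdfDFSFiles (which is what the original map expects); the directory path is a placeholder.
val ncFilesRDD = sc.netcdfFileList("file:///path/to/netcdf/dir", List("x1", "x2", "x3"))
val ncFileCRDArrayRDD = ncFilesRDD.map(x => (
  x.variables.get("x1").get.data.toArray, // variable x1 as a flat array
  x.variables.get("x2").get.data.toArray, // variable x2
  x.variables.get("x3").get.data.toArray  // variable x3
))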
Related
We have one S3 folder that is used to store different files for separate ETL processing. The ETL processing of one file ends up reading all the other files placed in the same S3 folder. I don't see an option to read only one file from the folder. The Location property of the table is set at the folder level.
Code:
gluedb = "srcgluedb"
gluetbl = "gluesrctable"
dfRead = glue_context.create_dynamic_frame.from_catalog(database=gluedb, table_name=gluetbl)
df = dfRead.toDF()
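One possible workaround, sketched here in Scala/Spark to match the rest of the thread (not from the original post): since the catalog table's Location covers the whole folder, bypass the catalog and point Spark at the exact object key instead. The bucket name, key, and format below are placeholders.
val singleFileDF = spark.read
  .format("json") // match the actual file format (csv, parquet, ...)
  .load("s3://my-etl-bucket/srcfolder/file_to_process.json")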
How does pyspark read from a directory under the hood?
I'm asking because I have a situation where I am attempting to read from a directory and then write to that same location later in overwrite mode. Doing this throws java.io.FileNotFoundException errors (even when manually placing files in the directory), and I assume the read-then-overwrite is the cause, so I am curious about the underlying process Spark uses to read from a directory and what could be happening when I then try to overwrite that same directory.
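A likely explanation (hedged, not confirmed in the original post): Spark reads lazily, so only a file listing happens when the DataFrame is defined; the actual reads happen when an action runs. Overwrite mode deletes the target directory before writing, so if the job is still reading from that directory the listed files are gone by the time the tasks run, hence the FileNotFoundException. A minimal sketch of a common workaround is to materialize the output to a different path first; the paths are placeholders.
val df = spark.read.parquet("/data/events")             // lazy: only a file listing happens here
df.write.mode("overwrite").parquet("/data/events_tmp")  // write somewhere else first
// Replace /data/events with the contents of /data/events_tmp afterwards,
// e.g. via the Hadoop FileSystem API, once the write has succeeded.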
I am trying to read a text file from a local path using Spark, but it's throwing an exception (see the attached error image).
The code I use to read the files is this:
val assetFile = sc.textFile(assetFilePath)
assetFilePath is a variable which represents a path somewhere on my local machine.
val adFile = sc.textFile(adFilePath)
adFilePath is a variable which represents a path somewhere on my local machine.
sc.textFile will by default read from HDFS, not from the local file system, but Spark supports multiple file systems apart from HDFS, such as the local file system, Amazon S3, Azure, and Swift FS.
So in order to read from the local file system, you need to specify the protocol in the file path.
For example:
sc.textFile("file:///tmp/myfile.txt")
This will read a file named myfile.txt from the tmp directory of the local file system on the machine where the Spark driver code is running.
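Applied to the question's variables, the fix is just to make sure the paths carry the file:// scheme; the concrete paths below are placeholders.
val assetFile = sc.textFile("file:///home/user/data/assets.txt")
val adFile = sc.textFile("file:///home/user/data/ads.txt")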
I have a Spark Streaming application that scans a declared directory via the textFileStream method. I have the batch interval set to 10 seconds, and I process the new files and concatenate their content into bigger parquet files. Files are downloaded/created directly in the declared directory, and that causes an issue I cannot bypass.
Let's take the following scenario as an example:
The directory is empty most of the time, then around 100 small files arrive at the same moment the streaming batch triggers. Spark Streaming finds 70 files with the .json extension and 30 files with a _tmp suffix (still being created/downloaded). This situation obviously crashes my app, since the process tries to work with a _tmp file which has in the meantime been renamed to a fully created/downloaded .json file.
I've tried to filter the RDD with the following method:
val jsonData = ssc.textFileStream("C:\\foo").filter(!_.endsWith("tmp"))
But it still causes
jsonData.foreachRDD{ rdd=>
/.../
}
to process the _tmp files, and it throws an exception that there is no such file as *_tmp.
Question
I've read about a staging directory from which I should move files after the create/download process finishes, but neither copy nor move operations trigger textFileStream (why is that? a copy operation creates a new file, so...), and the process ignores those files. Is there any other method to filter out those files and wait until they are complete before processing?
Env spec
The directory is on a Windows 8.1 file system; eventually this would be a Unix-like machine.
Spark 1.5 for Hadoop 2.6
Actually, .filter(!_.endsWith("tmp")) filters the content of the files inside c:\foo, i.e. the data, by default line by line. If it finds a line (not a filename) ending with tmp, it removes that line from the stream.
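To filter by filename instead, one option (a sketch, not from the original answer) is the lower-level fileStream, which accepts a Path filter; textFileStream is essentially fileStream with the Text values mapped to strings.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val jsonData = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "C:\\foo",
    (path: Path) => !path.getName.endsWith("_tmp"), // skip files still being written
    newFilesOnly = true
  ).map(_._2.toString) // recover the line content, as textFileStream does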
In Scala/Spark I am using a DataFrame and writing it into a single file using:
val dataFrame = rdd.toDF()
dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(filePath)
This works fine. But I noticed, using the console and Hadoop's ls command, that while the coalesce is running, the file and folder are not on the Hadoop file system.
When you type hadoop fs -ls hdfs://path, it reports that there is no such file or directory. After the coalesce is done, the path is there again, and so is the coalesced file.
This might happen because the overwrite needs to delete the old file and create a new one?!
Here is the problem: when I kill the process/app while the file is not on the file system, the complete file is deleted. So a failure of the system would destroy the file.
Do you have an idea how to prevent the file loss? I thought Spark/Hadoop would take care of this.
Thanks, Alex
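One common mitigation (a sketch, not from the original thread): write the new output to a temporary path first, and only replace the old directory once that write has succeeded, so a killed job can never leave you with neither the old nor the new file. The paths below are placeholders.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val finalPath = new Path("hdfs:///data/output")
val tmpPath = new Path("hdfs:///data/output_tmp")

dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).parquet(tmpPath.toString)

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(finalPath, true)    // remove the old output only after the new one exists
fs.rename(tmpPath, finalPath) // near-atomic directory rename on HDFS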