Spark-SQL: access file in current worker node directory - scala

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is outputted in the current worker node directory, and I know this because executing "ls -a"!! through scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd"!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?

So I managed to do it now. What I'm doing is I'm saving the file to HDFS, and then retrieving the file using the sql context through hdfs. I overwrite "out.las" each time in HDFS so that I don't have take too much space.

I have used the hadoop API before to get to files, I dunno if it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, but I'm pretty sure I'm passing the java array to scala illegally. But just giving an idea of what to do afterward.
var readIn: Array[Byte] = Array.empty[Byte]
val fileIn: FSDataInputStream = fs.open(file)
val fileIn.readFully(0, readIn)

Related

Synapse Spark exception handling - Can't write to log file

I have written PySpark code to hit a REST API and extract the contents in an XML format and later wrote to Parquet in a data lake container.
I am trying to add logging functionality where I not only write out errors but updates of actions/process we execute.
I am comparatively new to Spark I have been relying on online articles and samples. All explain the error handling and logging through "1/0" examples and saving logs in the default folder structure (not in ADLS account/container/folder) which do not help at all. Most of the code written in Pure Python doesn't run as-is.
Could I get some assistance with setting up the following:
Push errors to a log file under a designated folder sitting under a data lake storage account/container/folder hierarchy".
Catching REST specific exceptions.
This is a sample of what I have written:
''''
LogFilepath = "abfss://raw#.dfs.core.windows.net/Data/logging/data.log"
#LogFilepath2 = "adl://.azuredatalakestore.net/raw/Data/logging/data.log"
print(LogFilepath)
try:
1/0
except Exception as e:
print('My Error...' + str(e))
with open(LogFilepath, "a") as f:
f.write("An error occured: {}\n".format(e))
''''
I have tried it both ABFSS and ADL file paths with no luck. The log file is already available in the storage account/container/folder.
I have reproduced the above using abfss path in with open() function but it gave me the below error.
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://synapsedata#rakeshgen2.dfs.core.windows.net/datalogs.logs'
As per this Documentation
we can use open() on ADLS file with a path like /synfs/{jobId}/mountpoint/{filename}.
For that, first we need to mount the ADLS.
Here I have mounted it using ADLS linked service. you can mount either by Storage account access key or SAS as per your requirement.
mssparkutils.fs.mount(
"abfss://<container_name>#<storage_account_name>.dfs.core.windows.net",
"/mountpoint",
{"linkedService":"<ADLS linked service name>"}
)
Now use the below code to achieve your requirement.
from datetime import datetime
currentDateAndTime = datetime.now()
jobid=mssparkutils.env.getJobId()
LogFilepath='/synfs/'+jobid+'/synapsedata/datalogs.log'
print(LogFilepath)
try:
1/0
except Exception as e:
print('My Error...' + str(e))
with open(LogFilepath, "a") as f:
f.write("Time : {}- Error : {}\n".format(currentDateAndTime,e))
Here I am writing date time along with the error and there is no need to create the log file first. The above code will create and append the error.
If you want to generate the logs daily, you can generate date file names log files as per your requirement.
My Execution:
Here I have executed 2 times.

Check if directory contains json files using org.apache.hadoop.fs.Path in HDFS

I'm following the steps indicated here Avoid "Path does not exist" in dir based spark load to filter which directories in an array contain json files before sending them to the spark.read method.
When I use
inputPaths.filter(f => fs.exists(new org.apache.hadoop.fs.Path(f + "/*.json*")))
It returns empty despite json files existing in the path in one of the paths, one of the comments says this doesn't work with HDFS, is there a way to do make this work?
I running this in a databricks notebook
There is a method for listing files in dir:
fs.listStatus(dir)
Sort of
inputPaths.filter(f => fs.listStatus(f).exists(file => file.getPath.getName.endsWith(".json")))

What is correct directory path format on Windows for StreamingContext.textFileStream?

I am trying to execute a spark streaming application to process the stream of files data to perform word count.
The directory I am reading is from Windows. As shown I using the local directory like "Users/Name/Desktop/Stream".It is not HDFS.
I created a folder as "Stream" in desktop.
I started the Spark Streaming application and after that I added some text files into the folder 'Stream'. But my spark application is not able to read the files. It is always giving the empty results.
Here is my code.
//args(0) = local[2]
object WordCount {
def main(args: Array[String]) {
val ssc = new StreamingContext(args(0), "word_count",Seconds(5))
val lines = ssc.textFileStream("Users/name/Desktop/Stream")
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Output: Getting empty data every 5 seconds
17/05/18 07:35:00 INFO Executor: Running task 0.0 in stage 71.0 (TID 35)
-------------------------------------------
Time: 1495107300000 ms
-------------------------------------------
I tried giving the path as C:/Users/name/Desktop/Stream as well - still the same issue and application could not read the files.
Can anyone please guide if I am giving the incorrect directory path ?
Your code's fine so the only issue is to use proper path to the directory. Please use file:// prefix to denote local file system that would give file://C:/Users/name/Desktop/Stream.
Please start one step at a time to confirm that our understanding is at the same level.
When you execute the Spark Streaming application, create the directory to be in the same directory where you start the application, say Stream. Once you confirm that the application works fine with the local directory we'll fix it globally to read from any directory on Windows (if that's still needed).
Please also make sure that you "move" your files as the operation to create a file in the monitored directory has to be atomic (partial writes will mark the file as processed - see StreamingContext).
Files must be written to the monitored directory by "moving" them from another location within the same file system.
As you can see in the code the directory path will eventually be "wrapped" using Hadoop's File so the issue is to convince it to accept your path:
if (_path == null) _path = new Path(directory)

Test Spark with Tachyon

I have installed Tachyon and Spark according to instructions:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
However, as a newbie I have no idea how to put file "X" into Tachyon File System as they said:
$ ./spark-shell
$ val s = sc.textFile("tachyon-ft://stanbyHost:19998/X")
$ s.count()
$ s.saveAsTextFile("tachyon-ft://activeHost:19998/Y")
What I did was to point to an existing file (that I find through the management UI):
scala> val s = sc.textFile("tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH")
s: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
When I run count, I got this below error:
scala> s.count()
java.lang.NullPointerException: connectionString cannot be null
I assume my path was wrong. So two questions:
How to copy a file into Tachyon?
What is the proper path for its FS?
Sorry, very very newbie !!
UPDATE 1
I am not sure if tachyon-ft://localhost:19998/root/default_tests_files/BasicFile_THROUGH is correct path. I cannot get it either via the browser or wget
This is what I saw in the file system browser
I found out the issue. I didn't do this
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
After I went through this exercise http://ampcamp.berkeley.edu/5/exercises/tachyon.html#run-spark-on-tachyon, I found out the proper path is this:
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
So my setup was fine afterall. The documentation here http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html was causing me a lot of confusion.

Loading files from JAR in Scala

I have the following code structure:
Projects/
classes/
performance/AcPerformance.class
resources/
Aircraft/
allAircraft.txt
I have the contents of the classes folder in a JAR and my AcPerformance scala code is trying to read the contents of the Aircraft folder text files. My code:
val AircraftPerf = getClass.getResource("resources/Aircraft").getFile
val dataDir = new File(AircraftPerf)
val acFile = new File(dataDir, "allAircraft.txt")
for (line <- linesFromResource(acFile)) {
// read in lines
}
When I try to run the code I get the following error:
Caused by: java.io.FileNotFoundException: C:\Projects\file:\C:\Projects\libraries\aircraft.jar!\Aircraft\allAircraft.txt (The filename, directory name, or volume label syntax is incorrect)
Is this the correct way to read the contents of a JAR? THanks!
No, URL's getFile isn't going to do what you want here—the path it gives you isn't a file system path that you could use in a File constructor. You'd be best off using getResourceAsStream and the full path to the resource:
val in = getClass.getResourceAsStream("/resources/Aircraft/allAircraft.txt")
Note that you also need to preface the path with / to make it absolute—in your current version you're looking for a resources directory under performance.