Move files between HDFS directories as part of a Spark Scala application - scala

I am having a problem moving files between two HDFS folders in a Spark application. We are using Spark 2.1 with Scala. I imported the org.apache.hadoop.fs package and used its 'rename' method as a workaround, since I couldn't find a method in that package to move files between HDFS folders.
The code is below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
def move_files(fileName: String, fromLocation: String, toLocation: String, spark: SparkSession): Unit = {
  val conf = spark.sparkContext.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val file_source = new Path(fromLocation + "/" + fileName)
  println(file_source)
  val file_target = new Path(toLocation + fileName)
  println(file_target)
  try {
    fs.rename(file_source, file_target)
  } catch {
    case e: Exception => println(e); println("Exception moving files between folders")
  }
}
The move_files method is called from another method that contains other application logic, and I need to remove the required files from the source directory before proceeding with that logic.
def main(): Unit = {
  /*
  logic
  */
  move_files("abc.xml", "/location/dev/file_folder_source", "/location/dev/file_folder_target", spark)
  /*
  logic
  */
}
The move_files step executes without any errors, but the file is not moved from the source folder to the target folder. Execution then continues with the logic, which fails because of the bad files still present in the source folder. Please suggest another way to move files between folders in HDFS, or point out where I am making a mistake in the code above.

The API fs.rename(file_source, file_target) returns a Boolean: true means the file was moved successfully, false means it was not.
The move_files step appears to run successfully because fs.rename does not throw when it is unable to move the file. It simply returns false and execution continues.
You need to check the return value explicitly in your code.
To use fs.rename this way, create the target directory first and then pass only the target directory path, like below:
val file_target = new Path(toLocation)
fs.mkdirs(file_target)
fs.rename(file_source, file_target)
Note that in the line val file_target = new Path(toLocation), the path contains only the directory, not the file name.
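The check described above could look roughly like the following. This is a minimal sketch, not the asker's code: the object name MoveFile and the method moveFile are hypothetical, and it passes an explicit target path (directory plus file name) rather than relying on how rename treats an existing directory.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: moves one file and reports whether the move happened.
object MoveFile {
  def moveFile(fs: FileSystem, fileName: String, fromDir: String, toDir: String): Boolean = {
    val source = new Path(fromDir, fileName)
    val targetDir = new Path(toDir)
    if (!fs.exists(source)) {
      println(s"Source does not exist: $source")
      false
    } else {
      // Make sure the target directory exists before renaming into it
      fs.mkdirs(targetDir)
      // rename reports failure through its return value, so inspect it
      val moved = fs.rename(source, new Path(targetDir, fileName))
      if (!moved) println(s"rename returned false for $source -> $targetDir")
      moved
    }
  }
}
```

A caller can then stop before the remaining logic runs, for example with if (!MoveFile.moveFile(fs, "abc.xml", from, to)) sys.error("move failed").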

Related

Spark write operation HDFS using temporal path

I am trying to write a CSV file from the Scala code below. I'm using HDFS as a temp directory, then calling writer.write to create a new file in an existing subfolder. I get the error message shown after the code:
val inputFile = "s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv" // INPUT path
val outputFile = "s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val dateFormat = new SimpleDateFormat("yyyyMMdd")
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFile(fileSystem, inputFile, skipHeader = true).toSeq
val writer = new PrintWriter(new File(outputFile))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
writer.write(s"${x.Date},${x.cust},${x.Number},${x.Credit}\n")
})
writer.close()
def getFileSystem(path: String): FileSystem = {
  val hconf = new Configuration() // initialize new hadoop configuration
  new Path(path).getFileSystem(hconf) // get new filesystem to handle data
}
java.io.FileNotFoundException: s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv (No such file or directory)
The same happens whether I choose a new file or an existing one. I've checked that the path is correct; I just want to create a new file there.
The problem is that writing data with a file-system-based source requires a temporary directory. This is part of the commit mechanism used by Spark: data is first written to a temporary directory and, once the tasks finish, the processed files are automatically moved to the final path.
Should I change the temp folder path of each Spark application to S3? I think it is better to process locally (local files or HDFS) and then upload the processed output file to S3.
Also, I just noticed that the Databricks cluster I'm using shows "No Spark configuration set". Does this affect the issue?
If you can read the raw data with Spark/Scala as a DataFrame, you can apply transformations to build the final DataFrame. Once you have the final DataFrame to be written as CSV, the single line below saves it to the S3 bucket path or an HDFS path. Note that Spark writes a directory of part files at that path rather than a single file.
df.write.format("csv").option("header", "true").mode("overwrite").option("sep", ",").save("s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv")
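The FileNotFoundException in the question comes from java.io.File, which treats the s3a string as a path on the local disk. If a plain writer is still preferred over a DataFrame, one option is to open the output stream through the Hadoop FileSystem that the question already obtains. The sketch below assumes the S3A connector and credentials are configured on the cluster; HadoopCsvWriter and writeLines are hypothetical names.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: writes text lines through whichever Hadoop FileSystem owns the path.
object HadoopCsvWriter {
  def writeLines(target: String, lines: Seq[String], conf: Configuration = new Configuration()): Unit = {
    val path = new Path(target)
    // Resolve the file system from the path's scheme (file, hdfs, s3a, ...)
    val fs = path.getFileSystem(conf)
    // Second argument: overwrite the file if it already exists
    val out = fs.create(path, true)
    val writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
    try lines.foreach(line => writer.write(line + "\n"))
    finally writer.close()
  }
}
```

With this, the header and rows from the question would be collected into a Seq[String] and passed to writeLines together with the output path.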

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is outputted in the current worker node directory, and I know this because executing "ls -a"!! through scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd"!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. I save the file to HDFS and then read it back through the SQL context from HDFS. I overwrite "out.las" in HDFS each time so that I don't have to take up too much space.
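The upload step of that workaround might be sketched as follows. This is an assumption about how it could be done, not the poster's actual code: PublishToHdfs and publish are hypothetical names, and the call relies on Hadoop's copyFromLocalFile with overwrite enabled so the same target can be reused for every file.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: copies a worker-local file to a shared file system path.
object PublishToHdfs {
  def publish(fs: FileSystem, localFile: String, sharedTarget: String): Path = {
    val target = new Path(sharedTarget)
    // delSrc = false keeps the local copy; overwrite = true replaces the previous out.las
    fs.copyFromLocalFile(false, true, new Path(localFile), target)
    target
  }
}
```

The returned path, converted with toString, is what would then be handed to sqlContext.read.las in place of the bare "out.las".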
I have used the Hadoop API before to get to files; I don't know if it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the code below; it just gives an idea of what to do afterward. It reads the whole file into a byte array.
val file = new Path(filePath + "out.las")
// Size the buffer to the file length; readFully fills the whole array
val readIn: Array[Byte] = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
val fileIn: FSDataInputStream = fs.open(file)
fileIn.readFully(0, readIn)
fileIn.close()

What is correct directory path format on Windows for StreamingContext.textFileStream?

I am trying to run a Spark Streaming application that performs a word count over a stream of files.
The directory I am reading from is on Windows. As shown, I am using a local directory like "Users/Name/Desktop/Stream". It is not HDFS.
I created a folder named "Stream" on the desktop.
I started the Spark Streaming application and then added some text files to the 'Stream' folder, but my Spark application is not able to read the files. It always gives empty results.
Here is my code.
//args(0) = local[2]
object WordCount {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(args(0), "word_count", Seconds(5))
    val lines = ssc.textFileStream("Users/name/Desktop/Stream")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Output: Getting empty data every 5 seconds
17/05/18 07:35:00 INFO Executor: Running task 0.0 in stage 71.0 (TID 35)
-------------------------------------------
Time: 1495107300000 ms
-------------------------------------------
I also tried giving the path as C:/Users/name/Desktop/Stream, with the same result: the application could not read the files.
Can anyone tell me whether I am giving an incorrect directory path?
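Your code's fine, so the only issue is using the proper path to the directory. Please use the file: scheme to denote the local file system, which would give file:///C:/Users/name/Desktop/Stream (three slashes, so that the drive letter is part of the path and not read as a host name).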
Please start one step at a time to confirm that our understanding is at the same level.
For the first run, create the monitored directory, say Stream, inside the directory from which you start the Spark Streaming application. Once you confirm that the application works with that local directory, we'll fix it to read from any directory on Windows (if that's still needed).
Please also make sure that you "move" your files as the operation to create a file in the monitored directory has to be atomic (partial writes will mark the file as processed - see StreamingContext).
Files must be written to the monitored directory by "moving" them from another location within the same file system.
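One way to do such a move from the JVM is java.nio's Files.move with the ATOMIC_MOVE option. The sketch below assumes the file is first written completely to a staging directory on the same file system as the monitored directory; AtomicDrop and dropInto are hypothetical names.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper: moves a finished file into the monitored directory in one step.
object AtomicDrop {
  def dropInto(staged: Path, monitoredDir: Path): Path = {
    val target = monitoredDir.resolve(staged.getFileName)
    // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the move cannot be atomic,
    // for example when staging and monitored directories are on different file systems
    Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE)
  }
}
```

Because the file only appears under its final name after it is complete, the stream never sees a partially written file.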
As you can see in the code, the directory path will eventually be wrapped in Hadoop's Path, so the issue is to convince it to accept your path:
if (_path == null) _path = new Path(directory)

Unable to access file in relative path in Scala for test resource

I have set up my Scala project using Maven. I am now writing tests and need to access a file under a subdirectory of the resources path, like:
src/test/resource/abc/123.sql
Now I am doing following:
val relativepath = "/setup/setup/script/test.sql"
val path = getClass.getResource(relativepath).getPath
println(path)
but this points to the src/main/resource folder instead of the test resources. Does anyone have an idea what I am doing wrong here?
Just like in Java, it is good practice to put your resource files under src/main/resources and src/test/resources, and Scala provides a nice API for retrieving resource files.
Considering you put your test.sql file under src/test/resources/setup/setup/script/test.sql, you can easily read the file by doing the following:
Scala 2.12
import scala.io.Source
val relativePath = "setup/setup/script/test.sql"
val sqlFile : Iterator[String] = Source.fromResource(relativePath).getLines
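Prior Scala versions
import java.io.InputStream
import scala.io.Source
// Class.getResourceAsStream resolves a name without a leading "/" relative to the class's package
val relativePath = "/setup/setup/script/test.sql"
val stream: InputStream = getClass.getResourceAsStream(relativePath)
val sqlFile: Iterator[String] = Source.fromInputStream(stream).getLines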
This way, you can even keep a file at the same relative path in src/main/resources. When a test accesses the resource, the copy from src/test/resources is used.
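Both lookups return null when the resource is not on the classpath, which surfaces later as a NullPointerException. A small wrapper can make that case explicit. This is a sketch with hypothetical names (ResourceText, read); it goes through the class loader, which expects names without a leading "/".

```scala
import scala.io.Source

// Hypothetical helper: reads a classpath resource as text, or None if it is missing.
object ResourceText {
  def read(resourcePath: String): Option[String] = {
    // ClassLoader lookups take names relative to the classpath root, without a leading "/"
    val name = resourcePath.stripPrefix("/")
    Option(getClass.getClassLoader.getResourceAsStream(name)).map { stream =>
      try Source.fromInputStream(stream, "UTF-8").mkString
      finally stream.close()
    }
  }
}
```

In a test, ResourceText.read("setup/setup/script/test.sql") returning None is a quick sign that the file is not where the build expects it.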
I hope this is helpful.

How to access static resources in jar (that correspond to src/main/resources folder)?

I have a Spark Streaming application built with Maven (as jar) and deployed with the spark-submit script. The application project layout follows the standard directory layout:
myApp
  src
    main
      scala
        com.mycompany.package
          MyApp.scala
          DoSomething.scala
          ...
      resources
        aPerlScript.pl
        ...
    test
      scala
        com.mycompany.package
          MyAppTest.scala
          ...
  target
    ...
  pom.xml
In the DoSomething.scala object I have a method (let's call it doSomething()) that tries to execute a Perl script -- aPerlScript.pl (from the resources folder) -- using scala.sys.process.Process and passing two arguments to the script (the first one is the absolute path to a binary file used as input, the second one is the path/name of the produced output file). I call then DoSomething.doSomething().
The issue is that I was not able to access the script. I tried absolute paths, relative paths, getClass.getClassLoader.getResource and getClass.getResource, and I have specified the resources folder in my pom.xml. None of my attempts succeeded. I don't know how to find the files I put in src/main/resources.
I will appreciate any help.
SIDE NOTES:
I use an external Process instead of a Spark pipe because, at this step of my workflow, I must handle binary files as input and output.
I'm using Spark-streaming 1.1.0, Scala 2.10.4 and Java 7. I build the jar with "Maven install" from within Eclipse (Kepler)
When I use the getClass.getClassLoader.getResource "standard" method to access resources I find that the actual classpath is the spark-submit script's one.
There are a few solutions. The simplest is to use Scala's process infrastructure:
import scala.sys.process._
object RunScript {
  val arg = "some argument"
  val stream = RunScript.getClass.getClassLoader.getResourceAsStream("aPerlScript.pl")
  val ret: Int = (s"/usr/bin/perl - $arg" #< stream).!
}
In this case, ret is the return code for the process and any output from the process is directed to stdout.
A second (longer) solution is to copy the file aPerlScript.pl from the jar file to a temporary location and execute it from there. This code snippet should have most of what you need.
import java.io.{File, FileOutputStream}
import java.nio.channels.Channels
object RunScript {
  // Set up copy destination from the Java temporary directory. This is /tmp on Linux
  val destDir = System.getProperty("java.io.tmpdir")
  // Get a channel to the script in the resources dir
  val source = Channels.newChannel(RunScript.getClass.getClassLoader.getResourceAsStream("aPerlScript.pl"))
  val fileOut = new File(destDir, "aPerlScript.pl")
  val dest = new FileOutputStream(fileOut)
  // Copy file to temporary directory
  dest.getChannel.transferFrom(source, 0, Long.MaxValue)
  source.close()
  dest.close()
  // Schedule the file for deletion when the JVM quits
  sys.addShutdownHook {
    fileOut.delete()
  }
  // Now you can execute the script from fileOut.
}
This approach allows you to bundle native libraries in JAR files. Copying them out allows the libraries to be loaded at runtime for whatever JNI mischief you have planned.
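A more compact variant of the same copy-out idea uses java.nio's Files.copy and a uniquely named temporary file, which avoids two JVMs on one machine overwriting each other's copy. This is a sketch under assumptions: ExtractResource and toTempFile are hypothetical names, and the caller supplies the resource stream.

```scala
import java.io.{File, InputStream}
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical helper: copies a stream (for example a jar resource) to a unique temp file.
object ExtractResource {
  def toTempFile(in: InputStream, prefix: String, suffix: String): File = {
    // createTempFile picks a unique name in the JVM's temporary directory
    val tmp = Files.createTempFile(prefix, suffix)
    try Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    val file = tmp.toFile
    // Remove the copy when the JVM exits normally
    file.deleteOnExit()
    file
  }
}
```

For the Perl script this would be called with getClass.getClassLoader.getResourceAsStream("aPerlScript.pl"), and the returned file's absolute path passed to the external process together with the input and output arguments.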