What is correct directory path format on Windows for StreamingContext.textFileStream? - scala

I am trying to execute a Spark Streaming application that processes a stream of text files and performs a word count.
The directory I am reading from is on Windows. As shown, I am using a local directory like "Users/Name/Desktop/Stream". It is not HDFS.
I created a folder called "Stream" on my desktop.
I started the Spark Streaming application and then added some text files into the folder 'Stream', but my Spark application is not able to read the files. It always gives empty results.
Here is my code.
//args(0) = local[2]
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(args(0), "word_count", Seconds(5))
    val lines = ssc.textFileStream("Users/name/Desktop/Stream")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Output: Getting empty data every 5 seconds
17/05/18 07:35:00 INFO Executor: Running task 0.0 in stage 71.0 (TID 35)
-------------------------------------------
Time: 1495107300000 ms
-------------------------------------------
I tried giving the path as C:/Users/name/Desktop/Stream as well - still the same issue; the application could not read the files.
Can anyone please tell me whether I am giving an incorrect directory path?

Your code is fine, so the only issue is to use a proper path to the directory. Use the file:// prefix to denote the local file system, which gives file://C:/Users/name/Desktop/Stream.
Please take it one step at a time to confirm that our understanding is at the same level.
When you execute the Spark Streaming application, create the directory to be monitored, say Stream, in the same directory where you start the application. Once you confirm that the application works fine with the local directory, we'll fix it to read from any directory on Windows (if that's still needed).
Please also make sure that you "move" your files into the directory, as the operation that creates a file in the monitored directory has to be atomic (partial writes will mark the file as processed - see StreamingContext).
Files must be written to the monitored directory by "moving" them from another location within the same file system.
As you can see in the code, the directory path will eventually be "wrapped" in Hadoop's Path, so the issue is to convince it to accept your path:
if (_path == null) _path = new Path(directory)
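Putting that together, here is a minimal sketch of what the corrected setup could look like. It is an illustration, not the exact code from the question: the file:/// form of the URI and the java.nio atomic move are assumptions about a typical Windows setup.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "word_count", Seconds(5))
// an explicit file: URI keeps Hadoop's Path from falling back to the default file system
val lines = ssc.textFileStream("file:///C:/Users/name/Desktop/Stream")
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()

// drop files into the monitored directory atomically, e.g. by writing them
// elsewhere first and then moving them (these paths are hypothetical):
// java.nio.file.Files.move(
//   java.nio.file.Paths.get("C:/Users/name/Desktop/tmp/input.txt"),
//   java.nio.file.Paths.get("C:/Users/name/Desktop/Stream/input.txt"),
//   java.nio.file.StandardCopyOption.ATOMIC_MOVE)

ssc.awaitTermination()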

Related

How to provide a text file location in spark when file is on server

I am learning Spark to use in my project. I want to run this command in the Spark shell:
val rddFromFile = spark.sparkContext.textFile("abc");
where abc is the file location. My file is on a remote server, and I open the Spark shell from that remote server; how should I specify the file location?
I tried putting a text file on the local C drive and providing that location to read it, but that did not work either. I get a similar error for every file location.
Error:
scala> val rddFromFile = spark.sparkContext.textFile("C:/Users/eee/Spark test/Testspark.txt")
rddFromFile: org.apache.spark.rdd.RDD[String] = C:/Users/eee/Spark test/Testspark.txt MapPartitionsRDD[1] at textFile at <console>:23
scala> rddFromFile.collect().foreach(f=>{
| println(f)
| })
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "C"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
Spark is expecting the file to be present in Hadoop FS, as it looks like that is the default file system set in your app.
To load a file from the local FS, you need to specify it like this:
val rddFromFile = spark.sparkContext.textFile("file:///C:/Users/eee/Spark test/Testspark.txt")
That will work when you run Spark in local mode.
If you run Spark on a cluster, the file would have to be present on all executor nodes.
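As a side note (a sketch, not part of the original answer): you can check which default file system your session resolves un-prefixed paths against, and, if you are on a cluster, ship a small local file with the job via Spark's addFile mechanism. The paths below simply reuse the ones from the question.

// print the default file system Hadoop uses for paths without a scheme
println(spark.sparkContext.hadoopConfiguration.get("fs.defaultFS"))

// read explicitly from the local file system
val rddFromFile = spark.sparkContext.textFile("file:///C:/Users/eee/Spark test/Testspark.txt")
rddFromFile.collect().foreach(println)

// on a cluster, one option is to distribute the file with the job and resolve
// its local copy on each node via SparkFiles
spark.sparkContext.addFile("file:///C:/Users/eee/Spark test/Testspark.txt")
val localCopy = org.apache.spark.SparkFiles.get("Testspark.txt")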

Spark write operation to HDFS using a temporary path

I am trying to write to a CSV file from this Scala code. I'm using HDFS as a temp directory, then just writer.write to create a new file in an existing subfolder. I get the error message shown below the code:
val inputFile = "s3a:/tfsdl-ghd-wb/raidnd/rawdata.csv" // INPUT path
val outputFile = "s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val dateFormat = new SimpleDateFormat("yyyyMMdd")
val fileSystem = getFileSystem(inputFile)
val inputData = readCSVFile(fileSystem, inputFile, skipHeader = true).toSeq
val writer = new PrintWriter(new File(outputFile))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
writer.write(s"${x.Date},${x.cust},${x.Number},${x.Credit}\n")
})
writer.close()
def getFileSystem(path: String): FileSystem = {
val hconf = new Configuration() // initialize new hadoop configuration
new Path(path).getFileSystem(hconf) // get new filesystem to handle data
java.io.FileNotFoundException: s3a:/tfsdl-ghd-wb/raidnd/Incte_19&20.csv (No such file or directory)
The same happens if I choose a new file or an existing one. I've checked that the path is correct; I just want to create a new file in there.
The problem is that in order to write data using a file-system-based source you need a temporary directory. This is part of the commit mechanism used by Spark: data is first written to a temporary directory, and once the tasks are finished, the processed files are automatically moved to the final path.
Should I change the temp folder path for each Spark application to S3? I think it is better to process locally (local files / HDFS) and then upload the processed output file to S3.
Also, I just see that there is no Spark configuration set in the Databricks cluster I'm using; does this interfere with the issue?
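One alternative, sketched here only as an illustration and not taken from the thread: if you want to keep writing the header and rows yourself rather than going through Spark's committer, open the output stream through Hadoop's FileSystem instead of java.io.File, so the s3a scheme is actually honored. The path below reuses the one from the question, with the double slash the scheme normally requires.

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // assumed s3a:// form of the question's path
val outPath = new Path(outputFile)
val fs = outPath.getFileSystem(new Configuration())

// fs.create returns an FSDataOutputStream backed by the target file system (S3A here)
val writer = new PrintWriter(fs.create(outPath, true))
writer.write("Sales,cust,Number,Date,Credit,SKU\n")
// ... write the remaining rows here ...
writer.close()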
If you are able to read the raw data with Spark/Scala in the form of a DataFrame, then you can perform transformations on that DataFrame to build the final DataFrame. Once you have the final DataFrame that needs to be written as a CSV file, you can use the single line of code below to save the CSV to an S3 bucket path or an HDFS path.
df.write.format("csv").option("header", "true").mode("overwrite").option("sep", ",").save("s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv")
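For completeness, a sketch of the read side under the same assumptions (the input path comes from the question, the transformations are placeholders, and note that save() produces a directory of part files rather than a single CSV):

val inputFile = "s3a://tfsdl-ghd-wb/raidnd/rawdata.csv"
val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // created as a directory of part files

val raw = spark.read
  .option("header", "true")
  .option("sep", ",")
  .csv(inputFile)

// ... apply your filtering / aggregation here to build the final DataFrame ...
val finalDf = raw // placeholder

finalDf.write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save(outputFile)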

Copy file from Hdfs to Hdfs scala

Is there a known way, using the Hadoop API or Spark Scala, to copy files from one directory to another on HDFS?
I have tried using copyFromLocalFile but it was not helpful.
Try Hadoop's FileUtil.copy() command, as described here: https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/fs/FileUtil.html#copy(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path,%20boolean,%20org.apache.hadoop.conf.Configuration)
val conf = new org.apache.hadoop.conf.Configuration()
val srcPath = new org.apache.hadoop.fs.Path("hdfs://my/src/path")
val dstPath = new org.apache.hadoop.fs.Path("hdfs://my/dst/path")

org.apache.hadoop.fs.FileUtil.copy(
  srcPath.getFileSystem(conf), // source FileSystem
  srcPath,
  dstPath.getFileSystem(conf), // destination FileSystem
  dstPath,
  true,                        // deleteSource: remove the source after copying, i.e. a move
  conf
)
As I understand your question, the answer is as easy as ABC. Actually, there is no difference between your OS file system and distributed file systems in fundamental concepts like copying files; it is just that each has its own command syntax. For instance, when you want to copy a file from one directory to another, you can do something like:
hdfs dfs -cp /dir_1/file_1.txt /dir_2/file_1_new_name.txt
The first part of the example command just routes the command to the HDFS file system rather than the OS's own file system.
For further reading, see: copying data in HDFS.

Move files between HDFS directories as part of a Spark Scala application

I am facing a problem when moving files between two HDFS folders in a Spark application. We are using Spark 2.1 and Scala as the programming language. I imported the org.apache.hadoop.fs package and used the 'rename' method as a workaround for moving files, as I couldn't find a method to 'move files between HDFS folders' in that package.
Code is as below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def move_files(fileName: String, fromLocation: String, toLocation: String, spark: SparkSession): Unit = {
  val conf = spark.sparkContext.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val file_source = new Path(fromLocation + "/" + fileName)
  println(file_source)
  val file_target = new Path(toLocation + fileName)
  println(file_target)
  try {
    fs.rename(file_source, file_target)
  } catch {
    case e: Exception => println(e); println("Exception moving files between folders")
  }
}
The move_files method is called from another method that has other application logic, and I need to remove the required files from the source directory before proceeding with that logic.
def main() {
  /*
  logic
  */
  move_files("abc.xml", "/location/dev/file_folder_source", "/location/dev/file_folder_target", spark)
  /*
  logic
  */
}
The move_files step executes without any errors, but the file is not moved from the source folder to the target folder. Program execution then moves on with the logic, which errors out due to the presence of bad files in the source folder. Please suggest another way to move files between folders in HDFS, or point out where I am making a mistake in the above code.
The API fs.rename(file_source, file_target) returns a boolean: true means the file was moved successfully, false means it was not moved.
move_files executes successfully because the API doesn't fail when it is unable to move the files. It simply returns false and execution continues.
You need to explicitly check the condition in your code.
To use the fs.rename API this way, create the target directory first and then pass only the target directory path, like below:
val file_target = new Path(toLocation)
fs.mkdirs(file_target)
fs.rename(file_source, file_target)
See this line: val file_target = new Path(toLocation). It contains only the directory path, not the file name.
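A hedged sketch of how the method could check the result explicitly (names reused from the question; throwing on failure is just one possible reaction):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def move_files(fileName: String, fromLocation: String, toLocation: String, spark: SparkSession): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val file_source = new Path(fromLocation + "/" + fileName)
  val file_target = new Path(toLocation)

  fs.mkdirs(file_target)                          // make sure the target directory exists
  val moved = fs.rename(file_source, file_target) // returns false instead of throwing on failure
  if (!moved) {
    // surface the failure instead of silently continuing with bad files in the source folder
    throw new RuntimeException(s"Could not move $file_source to $file_target")
  }
}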

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is output to the current worker node directory, and I know this because when I execute "ls -a".!! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error, only a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran val locationPt = "pwd".!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it with the SQL context through HDFS. I overwrite "out.las" in HDFS each time so that it doesn't take up too much space.
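A minimal sketch of that approach, for illustration only: the copyFromLocalFile call and the HDFS destination path are assumptions, and the .las reader is the one already used in the question.

import org.apache.hadoop.fs.{FileSystem, Path}

// copy the locally decompressed file into HDFS, overwriting the previous one
val fs = FileSystem.get(sc.hadoopConfiguration)
val localOut = new Path("out.las")                     // produced by laszip in the working directory
val hdfsOut = new Path("/user/me/dataForHDFS/out.las") // hypothetical HDFS location
fs.copyFromLocalFile(false, true, localOut, hdfsOut)   // delSrc = false, overwrite = true

// then read it back through the SQL context, as in the question
val dataFrame = sqlContext.read.las(hdfsOut.toString)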
I have used the Hadoop API before to get to files; I don't know if it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, but it should give an idea of what to do afterward:
val file = new Path(filePath + "out.las")
val fileIn: FSDataInputStream = fs.open(file)
// allocate a buffer the size of the file, then read it fully from offset 0
val readIn = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
fileIn.readFully(0, readIn)
fileIn.close()