Reading files from Apache Spark textFileStream - scala

I'm trying to read/monitor txt files from a Hadoop file system directory. But I've noticed that all the txt files inside this directory are actually directories themselves, as shown in the example below:
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/_SUCCESS
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00000
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00001
I want to read all the data inside the part files. I'm trying to use the following code, as shown in this snippet:
val testData = ssc.textFileStream("/crawlerOutput/*/*")
But unfortunately it says /crawlerOutput/*/* doesn't exist. Doesn't textFileStream accept wildcards? What should I do to solve this problem?

The textFileStream() is just a wrapper for fileStream() and does not support subdirectories (see https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html).
You would need to list the specific directories to monitor. If you need to detect new directories, a StreamingListener could be used to check for them, then stop the streaming context and restart it with new values.
Just thinking out loud: if you intend to process each subdirectory once and just want to detect these new directories, you could potentially key off another location that contains job info or a file token; once present, it could be consumed in the streaming context and the appropriate textFile() could be called to ingest the new path.
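For instance, a minimal sketch of that token-based idea might look like the following (the /crawlerOutput/_tokens directory and the 30-second batch interval are illustrative assumptions, not part of the original setup):

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc  = new SparkContext("local[2]", "crawler-ingest")
val ssc = new StreamingContext(sc, Seconds(30))

// Watch a flat "token" directory; each token file contains the path of a finished crawl output directory.
val tokens = ssc.textFileStream("/crawlerOutput/_tokens")

tokens.foreachRDD { rdd =>
  // Pull the newly announced paths to the driver and ingest each one as a normal batch RDD.
  rdd.collect().foreach { dir =>
    val data = sc.textFile(s"$dir/part-*") // textFile() accepts glob patterns
    println(s"$dir: ${data.count()} lines")
  }
}

ssc.start()
ssc.awaitTermination()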

Related

Is there a way to know filenames generated with MultiResourceItemWriter?

I'm writing a spring-batch application with spring-boot support and I'm looking for a way to know which files were generated by MultiResourceItemWriter. The first solution I have in mind is to have a folder containing only the generated files and check its contents, but if there is something already implemented in spring-batch, that would be great!
The intention is to encrypt and then upload each file to an sftp server.
The file names generated by the MultiResourceItemWriter are the combination of the resource name + the suffix created by the ResourceSuffixCreator. For example, if you create the writer like the following:
MultiResourceItemWriter<String> writer = new MultiResourceItemWriter<>();
writer.setResource(new FileSystemResource(new File("data.txt")));
writer.setResourceSuffixCreator(index -> "part" + index);
Then the generated files will be data.txt.part1, data.txt.part2, etc.
MultiResourceItemWriter doesn't perform writes directly but delegates this job to other components.
All those components implement ResourceAwareItemWriterItemStream, so you may write a ResourceAwareItemWriterItemStream delegate, intercept the setResource() method, and store the resource in the current step's execution context as a collection.
If you want to pass this list of resources to the next steps, you may use an ExecutionContextPromotionListener.
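A minimal Scala sketch of that delegate idea (assuming Spring Batch 4.x's List-based ItemWriter signature; the class and field names are illustrative):

import org.springframework.batch.item.ExecutionContext
import org.springframework.batch.item.file.ResourceAwareItemWriterItemStream
import org.springframework.core.io.Resource

// Wraps the real delegate (e.g. a FlatFileItemWriter) and records every resource it is handed.
class ResourceTrackingDelegate[T](delegate: ResourceAwareItemWriterItemStream[T])
    extends ResourceAwareItemWriterItemStream[T] {

  val writtenResources = new java.util.ArrayList[Resource]() // every file opened by MultiResourceItemWriter

  override def setResource(resource: Resource): Unit = {
    writtenResources.add(resource) // remember the file before handing it to the real writer
    delegate.setResource(resource)
  }

  override def open(ctx: ExecutionContext): Unit = delegate.open(ctx)
  override def update(ctx: ExecutionContext): Unit = delegate.update(ctx)
  override def close(): Unit = delegate.close()
  override def write(items: java.util.List[_ <: T]): Unit = delegate.write(items)
}

You would then pass the wrapper to MultiResourceItemWriter.setDelegate() and copy writtenResources into the step execution context (for example in a StepExecutionListener) so the ExecutionContextPromotionListener can hand the list to the next step.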

Searching all file names recursively in hdfs using Spark

I've been looking for a while now for a way to get all the filenames in a directory and its sub-directories in the Hadoop file system (HDFS).
I found out I can use these commands to do it:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.wholeTextFiles(path).map(_._1)
Here is "wholeTextFiles" documentation:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Parameters:
path - Directory to the input data files, the path can be comma separated paths as the list of inputs.
minPartitions - A suggestion value of the minimal splitting number for input data.
Returns:
RDD representing tuples of file path and the corresponding file content
Note: Small files are preferred, large file is also allowable, but may cause bad performance. On some filesystems, .../path/* can be a more efficient way to read all files in a directory rather than .../path/ or .../path. Partitioning is determined by data locality. This may result in too few partitions by default.
As you can see "wholeTextFiles" returns a pair RDD with both the filenames and their content. So I tried mapping it and taking only the file names, but I suspect it still reads the files.
The reason I suspect so: if I try to count (for example), I get the Spark equivalent of "out of memory" (losing executors and not being able to complete the tasks).
I would rather use Spark to achieve this goal the fastest way possible, however, if there are other ways with a reasonable performance I would be happy to give them a try.
EDIT:
To be clear - I want to do this using Spark. I know I can do it with HDFS commands and the like, but I would like to know how to do it with the existing tools provided with Spark, and maybe get an explanation of how I can make "wholeTextFiles" not read the text itself (kind of like how transformations only happen after an action, and some of the "commands" never really happen).
Thank you very much!
This is a way to list out all the files down to the depth of the last subdirectory, without using wholeTextFiles. It makes a recursive call down through the subdirectories:
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import org.apache.spark.SparkContext

val lb = new scala.collection.mutable.ListBuffer[String] // holds the final list of file paths

def getAllFiles(path: String, sc: SparkContext): scala.collection.mutable.ListBuffer[String] = {
  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val files: RemoteIterator[LocatedFileStatus] = fs.listLocatedStatus(new Path(path))
  while (files.hasNext) { // iterate over the entries (files and subdirectories) of this path
    val status = files.next
    val filepath = status.getPath.toString
    lb += filepath
    if (status.isDirectory) getAllFiles(filepath, sc) // recurse only into subdirectories
  }
  lb
}
That's it. It was tested successfully; you can use it as is.
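Alternatively (not part of the original answer), Hadoop's FileSystem.listFiles can do the recursion for you; a minimal sketch, with an illustrative path:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// listFiles(path, recursive = true) walks every subdirectory and returns only files, not directories.
val it = fs.listFiles(new Path("/user/data"), true)
val fileNames = scala.collection.mutable.ListBuffer[String]()
while (it.hasNext) {
  fileNames += it.next().getPath.toString
}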

How can I save RDD data to a local file instead of println

I'd like to print RDD data using Scala, such as below:
res1.foreach{case(userid,tags)=>println(s"${userid}${"\t"}${tags.topicInterests.map(_.id).mkString(",")}")}
Now I want to save the details to a local file instead of using println. How can I implement that?
Use saveAsTextFile() method of the RDD as shown below:
val strRdd = res1.map{case(userid,tags)=>(s"${userid}${"\t"}${tags.topicInterests.map(_.id).mkString(",")}")}
strRdd.saveAsTextFile("/home/test_user/result")
Note that the saveAsTextFile method takes a path (absolute or relative) to a folder/directory, not a file. The RDD data will be written as part files inside the given directory. In this case, a directory called result will be created with part files inside it.
There will be as many part files as there are partitions in strRdd. If the path /home/test_user/result already exists, your code will fail, so you will have to use a directory that doesn't exist yet.
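If you want a single output file rather than several part files (not covered in the original answer), one option is to collapse the RDD to one partition before saving; the target path here is illustrative:

// coalesce(1) collapses the RDD to a single partition, so exactly one part-00000 file is written.
// Only do this when the data comfortably fits in a single task.
strRdd.coalesce(1).saveAsTextFile("/home/test_user/result_single")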
Bonus info: the same saveAsTextFile method also works on other file systems like HDFS, S3, etc., by taking the URL of the target directory instead of just a path.

How to exclude a directory in Stream mapping in p4v

I'm using the visual client for Perforce and I want to exclude a directory from the workspace. Before streams, I would just navigate to my workspace, find the folder in the tree, and exclude it (and I've found this solution in a number of other related questions). However, now that I am using a stream, it won't let me do this; I have to edit the stream mapping, apparently.
So I tried to add this line to the remapped box when editing the stream:
-//NumberPlus/current/Library/... //nplus-mainline/current/Library/
However, I just get an error:
Error in stream specification.
Error detected at line 24
Null directory (//) not allowed in '-//NumberPlus/current/Library/...'.
EDIT: I'm in Windows 8.1, for clarification.
If the folder you want to exclude is specific to your machine, setting P4IGNORE locally is the easiest way to exclude it from being added to the depot.
http://www.perforce.com/blog/120214/new-20121-p4ignore
You'd set P4IGNORE to some name like "p4ignore.txt", create a file with that name, and add "Libraries" to it -- subsequent "p4 add" commands will skip over paths found in the P4IGNORE file, so those files will never get added to the depot.
If this is something that's going to be common to all workspaces of this stream (e.g. it's a build artifact that everyone is going to generate and nobody is supposed to check in), what you want to do is add an "exclude" to the stream's Paths (this will exclude it from both branch views and client views generated by that stream). E.g.:
Paths:
share ...
exclude Libraries/...
The "exclude Libraries/..." is basically the same thing as the exclusion line you would add to the client view, except you specify it as a relative path, you don't need to specify both sides of the mapping, and the "-" is implied by the "exclude" type. The "remap" type is if you want to keep those files but in a different depot location, which doesn't sound applicable here.
More information on defining stream views:
http://www.perforce.com/perforce/doc.current/manuals/p4v/streams_views.html
You can't just edit the mappings for your client workspace if it is switched to a particular stream. The whole point of streams is that your workspace mapping is directly generated from the stream definition. So that's a feature.
It's not totally clear whether:
you don't want the directory in the stream at all, or
it's valid to have the directory in the stream, but you don't want to sync it to your workstation, or
you want the directory sync'd to your workstation, but you want the directory to have different contents (say, from some other stream which has a different version of the library).
However, for all of these situations, I suspect the best path forward is to define a new child stream of your current stream.
You will want to define the path mappings using the "share", "exclude", "isolate", and "import" mapping types.
For example, if you just didn't want the Library/... directory at all, you'd "exclude" it from your parent.
Then that stream simply won't have that directory, and it (of course) won't be on your workstation when you sync to the stream, either.
If you wanted to have a different copy of the code in the Library/... directory, so that it became a point of intentional divergence from the parent, you'd "isolate" it from your parent to submit your own custom version, or "import" it from another stream to use that stream's Library/... directory instead.
In either case, the directory would be part of the stream, and would be sync'd to your workstation, but the contents of that directory would differ from the contents that are used in the parent stream (the exact way in which they'd differ is under your control, as you define the stream accordingly).
Documentation and some examples are here: http://www.perforce.com/perforce/doc.current/manuals/p4v/streams_views.html
and here:
http://www.perforce.com/sites/default/files/pdf/Streams-ebook.pdf
I believe I have solved this. To be clear, I wanted the folder to be completely ignored by version control. I'm using p4connect with Unity and it keeps wanting to include unnecessary stuff in my depot.
All I had to do was add this line to my parent stream in the Paths box:
exclude current/Library/...

How can I rename file uploaded to s3 using javascript api?

The 'pickAndStore' method allows me to specify the full path to the file, but I don't know its extension at this point (the file path has to be defined before the file is uploaded, so it's not possible to provide a path with the correct extension).
If I use 'pick' and then 'store', I end up with 2 files (because both methods upload the file to S3). I can delete the 'old' file, but it's not optimal and can be a pain (take ages) with really big files.
Is there any better solution? Ideally, renaming the existing file.
Currently, there is no workaround for renaming a file.
However, in our JavaScript API v2 we are planning to add a new callback function. The onStart callback will be fired after the user picks a file but before the file is uploaded. There could be an option like renaming the file based on the original filename.
We will keep you updated.