I have CSV files being pushed to Google Cloud Storage and a Pub/Sub subscription that notifies me when they arrive. What I'm trying to accomplish is writing a Beam program that grabs the JSON data from the Pub/Sub subscription, parses out the file location, reads the CSV file from GCS, and then processes it. I already have a step that reads the Pub/Sub messages and turns them into a PCollection. So far I have this:
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
final String output = options.getOutput();
Pipeline pipeline = Pipeline.create(options);
PCollection<String> input = pipeline.apply(PubsubIO.readStrings().fromSubscription(StaticValueProvider.of("beamsub")));
PCollection<String> files = input.apply(ParDo.of(new ParseOutGSFiles()));
Now I need to do something like this:
pipeline.apply("ReadLines", TextIO.read().from(FILEsFROMEARLIER).withCompressionType(TextIO.CompressionType.GZIP))
Any ideas, or is this not possible? It seems like it should be easy.
Thanks in advance
The natural way to express your read would be to use the TextIO.readAll() method, which reads text files from an input PCollection of file names. This method has been introduced in the Beam codebase but is not yet in a released version; it will be included in the Beam 2.2.0 release and the corresponding Dataflow 2.2.0 release.
Your resulting code would look something like this:
Options options = PipelineOptionsFactory.fromArgs(args)
.withValidation().as(Options.class);
final String output = options.getOutput();
Pipeline pipeline = Pipeline.create(options);
PCollection<String> files = pipeline
.apply(PubsubIO.readStrings().fromSubscription("beamsub"))
.apply(ParDo.of(new ParseOutGSFiles()));
PCollection<String> contents = files
.apply(TextIO.readAll().withCompressionType(TextIO.CompressionType.GZIP));
Related
I’ve been looking for a while now for a way to get all filenames in a directory and its sub-directories in Hadoop file system (hdfs).
I found out I can use these commands to get it:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.wholeTextFiles(path).map(_._1)
Here is "wholeTextFiles" documentation:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Parameters:
path - Directory to the input data files; the path can be comma separated paths as the list of inputs.
minPartitions - A suggestion value of the minimal splitting number for input data.
Returns:
RDD representing tuples of file path and the corresponding file content.
Note: Small files are preferred; a large file is also allowable, but may cause bad performance. On some filesystems, .../path/* can be a more efficient way to read all files in a directory rather than .../path/ or .../path. Partitioning is determined by data locality; this may result in too few partitions by default.
As you can see "wholeTextFiles" returns a pair RDD with both the filenames and their content. So I tried mapping it and taking only the file names, but I suspect it still reads the files.
The reason I suspect so: if I try to count (for example), I get the Spark equivalent of "out of memory" (losing executors and being unable to complete the tasks).
I would rather use Spark to achieve this goal the fastest way possible, however, if there are other ways with a reasonable performance I would be happy to give them a try.
EDIT:
To be clear: I want to do this using Spark. I know I can do it with HDFS commands and the like; I would like to know how to do it with the existing tools provided with Spark, and perhaps get an explanation of how I can make "wholeTextFiles" not read the text itself (similar to how transformations only happen after an action, and some of the "commands" never actually execute).
Thank you very much!
This is a way to list all files down to the last subdirectory, without using wholeTextFiles; it recurses until the deepest subdirectory is reached:
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer

val lb = new ListBuffer[String]() // variable to hold the final list of paths

def getAllFiles(path: String, sc: SparkContext): ListBuffer[String] = {
  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val entries: RemoteIterator[LocatedFileStatus] = fs.listLocatedStatus(new Path(path))
  while (entries.hasNext) { // iterates over every entry directly under `path`
    val status = entries.next()
    val filepath = status.getPath.toString
    lb += filepath
    if (status.isDirectory) getAllFiles(filepath, sc) // recursive call, only for subdirectories
  }
  lb
}
That's it. It was tested successfully; you can use it as is.
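A minimal usage sketch of the helper above, assuming a live SparkContext named sc; the HDFS path below is just a placeholder:
// Hypothetical usage; replace the path with a real directory.
val allPaths = getAllFiles("hdfs:///user/data/input", sc)
// The list contains directories as well as files; keep only plain files if that is all you need.
val fs = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
val fileNames = allPaths.filter(p => fs.getFileStatus(new org.apache.hadoop.fs.Path(p)).isFile)
fileNames.foreach(println)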
I need help with a Python script to run a stream in SPSS Modeler. Right now I am using the code below to export data into an Excel file, and it works, but it requires one manual step before exporting the data to the Excel file.
stream = modeler.script.stream()                 # getting the stream in SPSS
output1 = stream.findByType("excelexport", "1")  # then searching for the Excel export node named "1"
results = []                                     # then running the whole stream
output1.run(results)                             # but here I need to press a button to finish execution (have a look at the screenshots)
output1 = stream.findByType("excelexport", "2")  # this is the next step
results = []
output1.run(results)
I would like to fully automate the stream. Please help me! Thanks a lot!
I can only help you using the legacy script. I have several Excel export nodes in my streams, and they are saved according to the month and year of reference.
set Excel_1.full_filename = "\\PATH" >< "TO" >< ".xlsx"
execute Excel_1
Take a look at the image, because Stack Overflow is messing up the written code.
And to fully automate it, you also have to set all the passwords in the initial nodes, for example:
set Database.username = "TEST"
set Database.password = "PASSWORD"
In the stream properties window -> Execution tab, have you selected 'Run this script' on stream execution?
If you make this selection, you can run the stream and produce your output without even opening the SPSS Modeler user interface (via Modeler batch).
I'm trying to read/monitor txt files from a Hadoop file system directory. But I've noticed all txt files inside this directory are directories themselves, as shown in the example below:
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/_SUCCESS
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00000
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00001
I want to read all the data inside the part files. I'm trying to use the following code, as shown in this snippet:
val testData = ssc.textFileStream("/crawlerOutput/*/*")
But unfortunately it says /crawlerOutput/*/* doesn't exist. Doesn't textFileStream accept wildcards? What should I do to solve this problem?
The textFileStream() is just a wrapper for fileStream() and does not support subdirectories (see https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html).
You would need to list the specific directories to monitor. If you need to detect new directories, a StreamingListener could be used to check for them, then stop the streaming context and restart it with the new values.
Just thinking out loud: if you intend to process each subdirectory once and just want to detect these new directories, you could potentially key off another location that contains job info, or a token file that, once present, could be consumed in the streaming context and trigger a call to the appropriate textFile() to ingest the new path. A rough sketch of that idea is below.
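This is only a sketch of that approach, assuming each finished crawl job writes a one-line token file under a separate directory (here /crawlerTokens) containing the path of its new output directory; all names and paths are made up:
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only. Assumes an existing SparkContext `sc`, and that each finished job
// drops a one-line token file in /crawlerTokens holding its output path,
// e.g. "/crawlerOutput/<hash>-<timestamp>.txt" (all paths are hypothetical).
val ssc = new StreamingContext(sc, Seconds(30))
val tokenStream = ssc.textFileStream("/crawlerTokens")

tokenStream.foreachRDD { rdd =>
  // Each record is the path of a newly completed output directory.
  rdd.collect().foreach { outputDir =>
    val data = rdd.sparkContext.textFile(outputDir + "/part-*")
    data.take(5).foreach(println) // placeholder processing
  }
}

ssc.start()
ssc.awaitTermination()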
In my project I have three input files, and I pass the file names as args(0) to args(2); I also have an output file name as args(3). In the source code I use:
val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))
I do nothing with the log but save it as a text file using:
log.coalesce(1, true).saveAsTextFile(args(args.size - 1))
but it still saves to three files: part-00000, part-00001 and part-00002. So is there any way I can save the three inputs to a single output file?
Having multiple output files is a standard behavior of multi-machine clusters like Hadoop or Spark. The number of output files depends on the number of reducers.
How to "solve" it in Hadoop:
merge output files after reduce phase
How to "solve" in Spark:
how to make saveAsTextFile NOT split output into multiple file?
A good info you can get also here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
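As an aside, the "merge output files after reduce phase" approach from the Hadoop link can also be done directly from Scala with Hadoop's FileUtil.copyMerge; this is only a sketch, assuming a Hadoop 2.x client on the classpath and using placeholder paths:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Sketch: merge the part-* files produced by saveAsTextFile into one file.
// Paths are placeholders; copyMerge exists in Hadoop 2.x but was removed in 3.x.
val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(fs, new Path("out"), fs, new Path("out-merged.txt"),
  false, conf, null)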
So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as #climbage mentioned in his comment), your code works if you run it locally.
What you might try is to read the files first and then save the output.
...
val sc = new SparkContext()
val builder = new StringBuilder()
for (i <- 0 until args.size - 1) {
  val file = sc.textFile(args(i))
  // collect() brings the lines to the driver; a foreach on the RDD cannot update a driver-side variable
  file.collect().foreach(line => builder.append(line).append("\n"))
}
// and now you might save the content as a single file
sc.parallelize(Seq(builder.toString), 1).saveAsTextFile("out")
Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files, but rather process the multiple output files instead; a sketch of that follows.
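For instance, if a single file is only wanted so the result can be read back later, the split output can already be consumed as one logical dataset; a minimal sketch, with a placeholder path:
// Sketch: treat the split output of saveAsTextFile as a single dataset.
// "out" is the directory written above; the glob matches every part file.
val merged = sc.textFile("out/part-*")
println(merged.count()) // spans part-00000, part-00001, ... as one RDD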
As mentioned, your problem is somewhat unavoidable via the standard APIs, since the assumption is that you are dealing with large quantities of data. However, if I assume your data is manageable, you could try the following:
import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))
What I am doing here is converting the RDD into a String by performing a collect and then a mkString. I would suggest not doing this in production; it works fine for local data analysis (working with ~5 GB of local data).
I have stored some attachments in MongoDB GridFS using the GridFS put command.
x12 = 'c:\\test\\' + str10
attachment.SaveAsFile(x12)
with open(x12, 'rb') as content_file:
    content = content_file.read()
object_id = fs.put(strattach, filename=str10)
strattach is obtained as follows:
attachment = A1.Item(1)      # processing email attachments using MAPI
strattach = str(attachment)  # converting to a string; without this I get a TypeError saying "can only write strings or file like objects"
A1 is the attachments collection and attachment is the object obtained.
Now the put was successful, and I got the object ID object_id, which was stored in MongoDB along with the file name.
Now I need to rebuild my binary file using the object_id and the file name in Python 2.7.
To do this I read from GridFS using f2 = object_id.read() and tried to apply the write method on f2, which is failing. When I read the manual, it said read in Python 2.7 returns a string instance.
Could you please help me with how I can save that instance back as a binary file in Python 2.7?
Any alternative suggestions would also be helpful.
Thanks