Finding individual filenames when loading multiple files in Apache Spark - scala

I have an Apache Spark job that loads multiple files for processing using
val inputFile = sc.textFile(inputPath)
This is working fine. However, for auditing purposes it would be useful to track which line came from which file when the inputPath is a wildcard. Something like an RDD[(String, String)], where the first string is the line of input text and the second is the filename.
Specifically, I'm using Google's Dataproc and the file(s) are located in Google Cloud Storage, so the paths are similar to 'gs://my_data/*.txt'.

Check out SparkContext#wholeTextFiles.
If you have many input paths, you may find yourself wanting to use worker resources for the file listing itself. To do that, you can use the Scala parts of this answer on improving the performance of wholeTextFiles in PySpark (ignore all the Python bits; the Scala parts are what matter).
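For reference, here is a minimal sketch of that approach applied to the question's gs:// path; this is standard wholeTextFiles usage, not code from the linked answer:

// wholeTextFiles returns one (filePath, fileContents) pair per matched file.
val files = sc.wholeTextFiles("gs://my_data/*.txt")

// Tag every line with the file it came from, giving the RDD[(String, String)]
// described above: (line of text, filename).
val linesWithFile = files.flatMap { case (path, contents) =>
  contents.split("\n").map(line => (line, path))
}

Keep in mind that wholeTextFiles reads each file into memory in one piece, so it suits many small-to-medium files better than a few very large ones.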

Related

Is it possible to read files with uncommon extensions (not like .txt or .csv) in Apache Beam using Python SDK? For example, file with .set extension

Is it possible to read files with uncommon extensions (not like .txt or .csv or .json) in Apache Beam using the Python SDK? For example, I want to read a local file with a .set extension (this is a special file containing an EEG recording). I couldn't find any information about how to implement this on the official page.
If I understand correctly, beam.Create creates a PCollection from an iterable, but what if my data is not iterable (like the data in a .set file)? How do I read it?
If you have only one file to process, you can pass a list to beam.Create that contains your set file.
p | 'Initialise pipeline' >> beam.Create(['your_file.set'])
Regarding reading the .set file: even though it isn't officially supported by Beam's I/O connectors, you can create your own connector in Python:
import apache_beam as beam

class ReadSetContent(beam.DoFn):
    def process(self, file_path):
        # The .set file path is passed in here, so you can open and read it.
        # Yield the file's content so the next transform can process it.
        with open(file_path, 'rb') as f:
            yield f.read()
So the start of your pipeline would look like this:
(p | 'Initialise pipeline' >> beam.Create(['your_file.set'])
| 'Reading content' >> beam.ParDo(ReadSetContent())
| 'Next transformation' >> ...)
You can use the Beam fileio library to read arbitrary files. Any custom processing should be done by subsequent ParDo transforms.

TextIO. Read multiple files from GCS using pattern {}

I tried using the following
TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")
That pattern didn't work, as I get
java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}
Even though those two files do exist. I also tried with a local file, using a similar expression:
TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")
And that did work just fine.
I would've thought there would be support for all kinds of globs for files in GCS, but apparently not. Why is that? Is there a way to accomplish what I'm looking for?
This may be another option, in addition to Scott's suggestion and your comment on his answer:
You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:
PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));
Then create a PCollectionList:
PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);
And then flatten this list into your PCollection for your main input:
PCollection<String> events = eventsList.apply(Flatten.pCollections());
Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.
GCS glob wildcard patterns are documented here (Wildcard Names).
In the case above, you could use:
TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")
Note however that this will also include any other matching files.
Did you try Apache Beam's TextIO.Read.from function? The documentation says that it works with GCS as well:
public TextIO.Read from(java.lang.String filepattern)
Reads text files that reads from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using remote execution service).
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

Reading zipped xml files in Spark

I have a set of large XML files zipped together into a single file, and many such zip files. I was previously using MapReduce to parse the XML with a custom InputFormat and RecordReader, setting splittable=false and reading the zip and XML files.
I am new to Spark. Can someone explain how I can prevent Spark from splitting the zip files and instead process multiple zips in parallel, as I was able to do in MR?
AFAIK, the answer to your question is provided here by @holden:
Please take a look! Thanks :)
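For anyone landing here, a minimal sketch of the usual approach (the path below is a placeholder, and this is not code from the linked answer): sc.binaryFiles hands each zip to a single task as one unsplit stream, while different zip files are still processed in parallel:

import java.util.zip.ZipInputStream
import scala.io.Source

// One (path, stream) pair per zip file; no single archive is ever split.
val zips = sc.binaryFiles("hdfs:///path/to/zips/*.zip")

val xmlDocs = zips.flatMap { case (zipPath, stream) =>
  val zis = new ZipInputStream(stream.open())
  val entries = Iterator
    .continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .filter(_.getName.endsWith(".xml"))
    .map { entry =>
      // ZipInputStream is positioned at the current entry, so this reads
      // exactly that entry's contents as the XML text.
      (zipPath, entry.getName, Source.fromInputStream(zis).mkString)
    }
    .toList    // materialise before closing the stream
  zis.close()
  entries
}

This assumes each archive (and the XML inside it) is small enough for one executor to handle, since nothing within a single zip is parallelised.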

Write RDD in txt file

I have the following type of data:
`org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[((String, String),Int)]] = MapPartitionsRDD[29] at map at <console>:38`
I'd like to write this data to a txt file, so that I have something like
((like,chicken),2) ((like,dog),3) etc.
I store the data in a variable called res.
For the moment I have tried this:
res.coalesce(1).saveAsTextFile("newfile.txt")
But it doesn't seem to work...
If my assumption is correct, you expect the output to be a single .txt file because it was coalesced down to one worker. That is not how Spark is built: it is meant for distributed work and should not be shoe-horned into a form where the output is not distributed. If you want a single local file, you should use a more general command-line tool for that.
All that said, you should see a folder named newfile.txt that contains part files with your expected output.
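For completeness, a minimal sketch of that, assuming res is (or has first been flattened to) an ordinary RDD[((String, String), Int)], since a nested RDD[RDD[...]] cannot be saved directly:

// Format each element the way the question shows it, e.g. ((like,chicken),2).
val formatted = res.map { case ((a, b), n) => s"(($a,$b),$n)" }

// saveAsTextFile always writes a directory; with coalesce(1) the directory
// "newfile.txt" will contain a single part-00000 file holding every line.
formatted.coalesce(1).saveAsTextFile("newfile.txt")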

Saving a string to HDFS creates line feeds for each character

I have a plain text file on my local system that I am reading in and uploading to HDFS. My Spark/Scala code reads the file in, converts it to a string, and then uses the saveAsTextFile function with the HDFS path where I want the file saved. Note that I am using the coalesce function because I want a single file saved, rather than the file getting split.
import scala.io.Source
val fields = Source.fromFile("MyFile.txt").getLines
val lines = fields.mkString
sc.makeRDD(lines).coalesce(1, true).saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")
The code saves my text to HDFS successfully; unfortunately, for some reason, each character in my string has a line feed character after it.
Is there a way around this?
I wasn't able to get this working exactly as I wanted, but I did come up with a workaround: I saved the file locally and then called a shell command through Scala to upload the completed file to HDFS. Pretty straightforward.
I would still appreciate it if anyone could tell me how to copy a string directly to a file in HDFS, though.
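For what it's worth, a likely explanation and a sketch of a fix (not from the original thread): sc.makeRDD(lines) treats the String as a Seq[Char], so every character becomes its own record, and saveAsTextFile writes one record per line. Keeping the whole string as a single record avoids that:

import scala.io.Source

// One record containing the entire file, instead of one record per character.
val text = Source.fromFile("MyFile.txt").getLines().mkString("\n")

sc.parallelize(Seq(text))
  .coalesce(1, shuffle = true)
  .saveAsTextFile("hdfs://myhdfs:8520/user/alan/mysavedfile")

Alternatively, you could skip the RDD entirely and write the string straight to HDFS with the Hadoop FileSystem API (FileSystem.get(sc.hadoopConfiguration) plus fs.create(new Path(...))), which also produces exactly one file.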