Is it possible to read files with uncommon extensions (not like .txt or .csv) in Apache Beam using Python SDK? For example, file with .set extension - apache-beam

Is it possible to read files with uncommon extensions (not like .txt or .csv or .json) in Apache Beam using the Python SDK? For example, I want to read a local file with the .set extension (a special file format containing an EEG recording). I couldn't find any information on the official page about how to implement this.
If I understand correctly, beam.Create creates a PCollection from an iterable, but what if my data is not iterable (like the data in a .set file)? How do I read it?

If you have only one file to process, you can pass beam.Create a list containing your .set file path:
p | 'Initialise pipeline' >> beam.Create(['your_file.set'])
Regarding reading the .set file: even if it isn't officially supported by Beam's I/O connectors, you can create your own connector in Python:
class ReadSetContent(beam.DoFn):
    def process(self, file_path):
        # Your .set file path is passed in here, so you can read it
        # and yield its content, which will be processed by the next transformation.
        with open(file_path, 'rb') as f:
            yield f.read()
So the start of your pipeline would look like this:
(p | 'Initialise pipeline' >> beam.Create(['your_file.set'])
| 'Reading content' >> beam.ParDo(ReadSetContent())
| 'Next transformation' >> ...)

You can use the Beam fileio library to read arbitrary files. Any custom processing should be done by subsequent ParDo transforms.
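For example, here is a minimal sketch of that approach with the Python SDK; the 'data/*.set' pattern is a placeholder, and the parsing inside the DoFn is left generic (plug in whatever library understands the .set format):
import apache_beam as beam
from apache_beam.io import fileio

class ParseSetFile(beam.DoFn):
    def process(self, readable_file):
        # readable_file is a fileio.ReadableFile produced by ReadMatches
        raw_bytes = readable_file.read()
        # parse raw_bytes with an EEG library of your choice, then yield results
        yield (readable_file.metadata.path, len(raw_bytes))

with beam.Pipeline() as p:
    (p
     | 'Match files' >> fileio.MatchFiles('data/*.set')
     | 'Read matches' >> fileio.ReadMatches()
     | 'Parse .set files' >> beam.ParDo(ParseSetFile()))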

Related

TextIO.Read multiple files from GCS using pattern {}

I tried using the following
TextIO.Read.from("gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv")
That pattern didn't work, as I get
java.lang.IllegalStateException: Unable to find any files matching StaticValueProvider{value=gs://xyz.abc/xxx_{2017-06-06,2017-06-06}.csv}
Even though those 2 files do exist. And I tried with a local file using a similar expression
TextIO.Read.from("somefolder/xxx_{2017-06-06,2017-06-06}.csv")
And that did work just fine.
I would've thought there would be support for all kinds of globs for files in GCS, but nope. Why is that? Is there a way to accomplish what I'm looking for?
This may be another option, in addition to Scott's suggestion and your comment on his answer:
You can define a list with the paths you want to read and then iterate over it, creating a number of PCollections in the usual way:
PCollection<String> events1 = p.apply(TextIO.Read.from(path1));
PCollection<String> events2 = p.apply(TextIO.Read.from(path2));
Then create a PCollectionList:
PCollectionList<String> eventsList = PCollectionList.of(events1).and(events2);
And then flatten this list into your PCollection for your main input:
PCollection<String> events = eventsList.apply(Flatten.pCollections());
Glob patterns work slightly differently in Google Cloud Storage vs. the local filesystem. Apache Beam's TextIO.Read transform will defer to the underlying filesystem to interpret the glob.
GCS glob wildcard patterns are documented here (Wildcard Names).
In the case above, you could use:
TextIO.Read.from("gs://xyz.abc/xxx_2017-06-*.csv")
Note however that this will also include any other matching files.
Did you try Apache Beam's TextIO.Read.from function? The documentation says it works with GCS as well:
public TextIO.Read from(java.lang.String filepattern)
Reads text files from the file(s) with the given filename or filename pattern. This can be a local path (if running locally), or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>" (if running locally or using a remote execution service).
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.

Finding individual filenames when loading multiple files in Apache Spark

I have an Apache Spark job that loads multiple files for processing using
val inputFile = sc.textFile(inputPath)
This is working fine. However for auditing purposes it would be useful to track which line came from which file when the inputPath is a wildcard. Something like an RDD[(String, String)] where the first string is the line of input text and the second is the filename.
Specifically, I'm using Google's Dataproc and the file(s) are located in Google Cloud Storage, so the paths are similar to 'gs://my_data/*.txt'.
Check out SparkContext#wholeTextFiles.
If you use many input paths, you may find yourself wanting to use worker resources for the file listing itself. To do that, you can use the Scala parts of this answer on improving the performance of wholeTextFiles in PySpark (ignore all the Python bits; the Scala parts are what matter).
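For illustration, a minimal PySpark sketch (the bucket path is taken from the question; the app name is made up) that turns the wholeTextFiles output into (line, filename) pairs:
from pyspark import SparkContext

sc = SparkContext(appName="audit-lines")

# wholeTextFiles yields one (filename, contents) pair per file;
# flatMap splits each file into lines and tags every line with its source file.
files = sc.wholeTextFiles("gs://my_data/*.txt")
lines_with_source = files.flatMap(
    lambda kv: [(line, kv[0]) for line in kv[1].splitlines()])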

Write RDD in txt file

I have the following type of data:
`org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[((String, String),Int)]] = MapPartitionsRDD[29] at map at <console>:38`
I'd like to write those data in a txt file to have something like
((like,chicken),2) ((like,dog),3) etc.
I store the data in a variable called res
But for the moment I tried with this:
res.coalesce(1).saveAsTextFile("newfile.txt")
But it doesn't seem to work...
If my assumption is correct, you expect the output to be a single .txt file because it was coalesced down to one worker. That is not how Spark is built: it is meant for distributed work, and its output should not be shoe-horned into a non-distributed form. If you need a single file, merge the parts afterwards with a more generic command-line tool.
All that said, you should see a folder named newfile.txt which contains part files with your expected output.
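As a rough illustration in PySpark (the sample data mirrors the question's expected output; names are placeholders), coalescing to one partition still produces a directory, just with a single part file inside:
from pyspark import SparkContext

sc = SparkContext(appName="write-pairs")

# An RDD shaped like the question's data: ((word, word), count)
res = sc.parallelize([(("like", "chicken"), 2), (("like", "dog"), 3)])

# saveAsTextFile always writes a directory; after coalesce(1) it holds
# one part-00000 file with one tuple per line.
res.coalesce(1).map(str).saveAsTextFile("newfile.txt")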

ZIP contents as a Gray stream?

I'm writing a CL library called "xlmanip" to read MS Excel(tm) spreadsheets (not ready for prime time yet -- it only reads "xlsx" spreadsheets and works for the 80% use case of "I want to operate on cell contents"... but I digress.)
One thing that worries me when reading "xlsx" files (ZIP archives of XML documents) is that the current ZIP handling library, Common Lisp ZIP, unpacks the compressed contents as a (vector (unsigned-byte 8)). For a large spreadsheet, that'll cause an issue for the end user.
One alternative I've thought about is delayed loading -- let-over-lambda a closure that effectively demand-loads the worksheet when needed. However, that's just delaying the inevitable.
Are there any ZIP file CL libraries out there that return a Gray stream to a ZIP component's contents as opposed to a (potentially large) (vector (unsigned-byte 8))?
Edit: Clarification
I'm looking for a ZIP component function that returns a stream, not one that takes a stream. The functions that take a stream write the ZIP component's contents directly to the file associated with the stream. I'd rather that xlmanip reads from a stream directly as if the ZIP component were (implicitly, virtually) a file.
Chipz can decompress a ZIP to a stream. It provides a decompress function where you give it an output stream and an input stream to decompress, and it returns the output stream from which the decompressed contents can be read.

Is there a way to replay a gzip-compressed log file in kdb+ without uncompressing it first?

Streaming execute, -11!, does not work on named pipes, so an obvious solution of redirecting the gzip -cd output to a named pipe and passing it to -11! does not work.
-11! accepts a compressed file and streams it so long as it was compressed with -19! (using 2 as the compression algorithm parameter, which is gzip).
The only difference between a normal gzipped file and a kdb compressed one is a few bytes at the beginning of the file.
EDIT (see comment): thanks, this isn't true -- the bytes also differ at the end of the file.
So a possible solution is to prepend your gzipped files (if they weren't produced by -19!) with the appropriate byte array first.
For anyone using kdb+ v3.4+, streaming execution for named pipes was introduced as the function .Q.fps.
Here is a simple example of .Q.fps in action, first create the pipe via command line:
echo "aa" > test.pipe
Next in a q session:
q).Q.fps[0N!]`:test.pipe
,"aa"
Where 0N! is a function used to display the contents of the file.