Reading zipped XML files in Spark - Scala

I have a set of large XML files zipped together into a single file, and many such zip files. I was previously using MapReduce to parse the XML with a custom InputFormat and RecordReader, setting splittable=false, to read the zip and the XML inside it.
I am new to Spark. Can someone show me how to prevent Spark from splitting the zip files and how to process multiple zips in parallel, as I was able to do in MapReduce?

AFAIK, the answer to your question is provided here by @holden:
Please take a look! Thanks :)
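For reference, a minimal sketch of one common approach (my own illustration under assumptions, not necessarily what the linked answer shows): sc.binaryFiles keeps each archive whole, so individual zips are never split, while separate zips are still processed in parallel, much like splittable=false in MapReduce. The input path is illustrative.

import java.util.zip.ZipInputStream
import scala.io.Source

// binaryFiles yields one (path, PortableDataStream) pair per zip, unsplit
val zips = sc.binaryFiles("hdfs:///data/zips/*.zip")
val xmlFiles = zips.flatMap { case (_, stream) =>
  val zis = new ZipInputStream(stream.open())
  Iterator.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .map(entry => (entry.getName, Source.fromInputStream(zis).mkString))
    .toList  // materialize while the stream is still open
}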

Related

Control parquet filenames in spark output

I am writing Spark output to an external system that does not like file extensions (I know, I know, don't start).
Something like:
df.write.partitionBy("date").parquet(some_path)
creates files like: some/path/date=2021-01-01/part-00000-77dd02e8-1a67-4f0d-9c07-b55b4f2e5efc-c000.snappy.parquet
And that makes that external system unhappy.
I am looking for a way to tell spark to write that file without extension.
I know I can just rename it afterwards, but that seems ... stupid (and there's a lot of files) :/
Is there some option I could use to tell spark to just write it the way I want?
df.write.partitionBy("date").parquet(some_path_ends_with_file_name)

Convert EDI format to csv using scala spark?

How can I convert an EDI format file to a CSV file using Spark or Scala?
You can use a tool like this to create a mapping from EDI format to CSV and then generate code in that tool. That code can then be used to convert EDI to CSV in Spark.
For open-source solutions, I think your best bet is EDI Reader from BerryWorks. I haven't tried it myself, but apparently this is what Hortonworks recommends, and I'd trust their judgement in the Big Data area. For the sake of disclosure, I'm not involved with either.
From there, it's still a matter of converting the EDI XML representation to CSV. Given that XML processing is not part of vanilla Spark, your options are again rather limited here. Try Databricks spark-xml maybe?
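If you go the spark-xml route, here is a minimal sketch of the XML-to-CSV step, assuming the EDI has already been turned into XML (e.g. by EDI Reader) and that each record sits under a <record> element; the rowTag value and paths are assumptions:

val df = spark.read
  .format("com.databricks.spark.xml")  // requires the spark-xml package on the classpath
  .option("rowTag", "record")          // hypothetical row element
  .load("/path/to/edi-as-xml")

df.write
  .option("header", "true")
  .csv("/path/to/csv-output")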

Need to read a .dat file in the Scala IDE

I need to read .dat files (binary files) from the local filesystem and write the output to the console using the Scala IDE.
Is it required to first convert the .dat file to a .txt/.csv file so it can be read and transformed, and then convert the .txt/.csv back to .dat?
I tried some existing code
(ref: http://alvinalexander.com/scala/how-to-read-write-binary-files-in-scala-examples)
but am still getting an error. Please share any suggestions.
Thanks in advance.
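There is no need to convert the .dat file first; binary files can be read directly as bytes. A minimal sketch (the file name and the hex output format are assumptions):

import java.nio.file.{Files, Paths}

object ReadDat extends App {
  // read the whole binary file into memory
  val bytes: Array[Byte] = Files.readAllBytes(Paths.get("input.dat"))
  // print each byte as two-digit hex to the console
  println(bytes.map(b => f"$b%02x").mkString(" "))
}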

Finding individual filenames when loading multiple files in Apache Spark

I have an Apache Spark job that loads multiple files for processing using
val inputFile = sc.textFile(inputPath)
This is working fine. However, for auditing purposes it would be useful to track which line came from which file when the inputPath is a wildcard, something like an RDD[(String, String)] where the first string is the line of input text and the second is the filename.
Specifically, I'm using Google's Dataproc and the file(s) are located in Google Cloud Storage, so the paths are similar to 'gs://my_data/*.txt'.
Check out SparkContext#wholeTextFiles.
If you use many input paths, you may find yourself wanting to use worker resources for the file listing. To do that, you can use the Scala parts of this answer on improving the performance of wholeTextFiles in PySpark (ignore the Python bits; the Scala parts are what matter).
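A minimal sketch of the wholeTextFiles approach: each element is a (path, fileContents) pair, which can be flattened into the (line, filename) pairs the question asks for. The wildcard path is the one from the question.

val linesWithFile: org.apache.spark.rdd.RDD[(String, String)] =
  sc.wholeTextFiles("gs://my_data/*.txt")
    .flatMap { case (path, contents) =>
      // keep the source path alongside every line for auditing
      contents.split("\n").map(line => (line, path))
    }

Note that wholeTextFiles reads each file in full within a single task, so it is best suited to files that fit comfortably in memory.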

Pipe multiple files into a zip file

I have several files in a GridFS Document Store and what I'd like to do is to pipe this data into a zip file via stdin in NodeJS. So that I will end up with a zip file containing all these files.
Now my question is how can I give the files a valid filename inside of the zip file. I think I need to emulate/fake a file header containing the filename?
Any help is appreciated!
Thanks
I had problems when writing zip files with Node.js not long ago. I ended up doing something similar to what is described in Zip archives in node.js.
I can't help you directly with your problem, but at least I hope I can point out some things:
Don't try to use node-archive. Even though the description says it allows you to create zip files, the moment I read the source code (since documentation is nonexistent) I realized that's simply not true. It only exposes methods for reading.
Using zip by spawning a process, as recommended in the provided link, seems to be the best way. Something that would work is copying the files to a local folder with whatever names you desire, calling the zip command, and then deleting the files afterwards.
The other option, which seems OK, is to use zipper (https://github.com/rubenv/zipper, though it's better to just use npm). The reason I'm not keen to use it is that there's not much flexibility: it seems to have been done in a day, and it hasn't been modified since the first commit, so I'm not sure it will receive maintenance (sure, you could just fork it...).
I swear that the day I have an entire free weekend with no work, I will write a freaking module that does this as completely as possible. It's silly that one doesn't exist, and it shouldn't be this much of a struggle. End of rant.
Edit:
Not sure if it was there before, but now I've been using the node-compress module (also using gzippo). It works fine.