Spark/Scala Opening Zipped CSV Files - scala

I am new to Spark and Scala. We have ad event log files formatted as CSVs and then compressed using pkzip. I have seen many examples of how to decompress zipped files using Java, but how would I do this using Scala for Spark? Ultimately, we want to fetch, extract, and load the data from each incoming file into an HBase destination table. Maybe this can be done with the HadoopRDD? After this, we are going to introduce Spark Streaming to watch for these files.
Thanks,
Ben

Default compression support
@samthebest's answer is correct if you are using a compression format that is available by default in Spark (Hadoop), namely:
bzip2
gzip
lz4
snappy
I have explained this topic in more depth in my other answer: https://stackoverflow.com/a/45958182/1549135
Reading zip
However, if you are trying to read a zip file you need to create a custom solution. One is mentioned in the answer I have already provided.
If you need to read multiple files from your archive, you might be interested in the answer I have provided: https://stackoverflow.com/a/45958458/1549135
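In both cases the approach is the same: read the archive with sc.binaryFiles, then decompress the PortableDataStream, as in this sample:
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open())
    // Walk the archive entry by entry; each entry is read line by line.
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }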
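The per-entry reading logic above can be tried without a cluster. The following is a minimal sketch using only the JDK zip classes. The object name ZipLines, the helper sampleZip, and the sample CSV content are made up for illustration, and Iterator is used in place of Stream because Stream is deprecated as of Scala 2.13.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipLines {
  // Reads every line of every entry in a zip stream, in archive order.
  def linesOf(in: InputStream): List[String] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator
        .continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap { _ =>
          // ZipInputStream signals end-of-stream at the end of the current
          // entry, so this reader stops at the entry boundary.
          val br = new BufferedReader(new InputStreamReader(zis, UTF_8))
          Iterator.continually(br.readLine()).takeWhile(_ != null)
        }
        .toList // materialize before the stream is closed
    } finally zis.close()
  }

  // Builds a small in-memory archive (hypothetical sample data for trying linesOf).
  def sampleZip(files: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bytes)
    files.foreach { case (name, content) =>
      zos.putNextEntry(new ZipEntry(name))
      zos.write(content.getBytes(UTF_8))
      zos.closeEntry()
    }
    zos.close()
    bytes.toByteArray
  }
}
```

In the Spark job, linesOf would receive content.open() from each PortableDataStream instead of an in-memory stream.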

In Spark, provided your files have the correct filename suffix (e.g. .gz for gzipped) and the format is supported by org.apache.hadoop.io.compress.CompressionCodecFactory, you can just use
sc.textFile(path)
UPDATE: At the time of writing there is a bug in the Hadoop bzip2 library, which means that trying to read bzip2 files using Spark results in weird exceptions, usually ArrayIndexOutOfBounds.

Related

Read a fastparquet file using Akka parquet

I have one of our Python systems generating Parquet files using Pandas and fastparquet. These are to be read by a Scala system that runs atop Akka streams.
Akka does provide a source for reading Avro Parquet files. However, when I try to read the file, I end up with
java.lang.IllegalArgumentException: INT96 not yet implemented.
This is one of the columns that does not need to be read for the Scala application to work. My question is whether I can specify a schema and get just that one column out considering that the generated file is from fastparquet.
The relevant snippet which generates a source for reading Parquet files is:
.map(result => {
  val path = s"s3a://${result.bucketName}/${result.key}"
  val file = HadoopInputFile.fromPath(new Path(path), hadoopConfig)
  val reader: ParquetReader[GenericRecord] =
    AvroParquetReader
      .builder[GenericRecord](file)
      .withConf(hadoopConfig)
      .build()
  AvroParquetSource(reader)
})

Searching all file names recursively in hdfs using Spark

I’ve been looking for a while now for a way to get all filenames in a directory and its sub-directories in the Hadoop file system (HDFS).
I found out I can use these commands to get them:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.wholeTextFiles(path).map(_._1)
Here is the "wholeTextFiles" documentation:
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Parameters:
path - Directory to the input data files, the path can be comma separated paths as the list of inputs.
minPartitions - A suggestion value of the minimal splitting number for input data.
Returns:
RDD representing tuples of file path and the corresponding file content
Note: Small files are preferred, large file is also allowable, but may cause bad performance. On some filesystems, .../path/* can be a more efficient way to read all files in a directory rather than .../path/ or .../path. Partitioning is determined by data locality. This may result in too few partitions by default.
As you can see, "wholeTextFiles" returns a pair RDD with both the filenames and their content. So I tried mapping it and taking only the file names, but I suspect it still reads the files.
The reason I suspect so: if I try to count (for example), I get the Spark equivalent of "out of memory" (losing executors and not being able to complete the tasks).
I would rather use Spark to achieve this goal in the fastest way possible; however, if there are other ways with reasonable performance, I would be happy to give them a try.
EDIT:
To be clear: I want to do this using Spark. I know I can do it using HDFS commands and the like, but I would like to know how to do it with the existing tools provided with Spark, and maybe get an explanation of how I can make "wholeTextFiles" not read the text itself (much like how transformations only happen after an action and some of the "commands" never really run).
Thank you very much!
This lists all files down to the deepest subdirectory without using wholeTextFiles, by recursing into each subdirectory:
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import org.apache.spark.SparkContext
import scala.collection.mutable.ListBuffer

val lb = new ListBuffer[String] // holds the final list of files

def getAllFiles(path: String, sc: SparkContext): ListBuffer[String] = {
  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val files: RemoteIterator[LocatedFileStatus] = fs.listLocatedStatus(new Path(path))
  while (files.hasNext) {
    val status = files.next()
    val filepath = status.getPath.toString
    if (status.isDirectory) getAllFiles(filepath, sc) // recurse into subdirectories only
    else lb += filepath // collect plain files
  }
  lb
}
That's it. It was tested successfully; you can use it as is.
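Hadoop's FileSystem also offers a built-in recursive listing, listFiles(path, recursive), which yields files only and avoids hand-written recursion. A minimal sketch, assuming the Hadoop client library is on the classpath (as it is in a Spark application); the object and method names are made up for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ListBuffer

object HdfsListing {
  // Returns the paths of all files under `root`, descending into subdirectories.
  def allFiles(root: String, conf: Configuration): List[String] = {
    val rootPath = new Path(root)
    val fs = rootPath.getFileSystem(conf) // picks the file system from the path's scheme
    val it = fs.listFiles(rootPath, true) // true = recursive; directories are not returned
    val out = ListBuffer.empty[String]
    while (it.hasNext) out += it.next().getPath.toString
    out.toList
  }
}
```

From a Spark shell this could be called as HdfsListing.allFiles(path, sc.hadoopConfiguration). Only file metadata is listed, so no file contents are read.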

Spark - Get from a directory with nested folders all filenames of a particular data type

I have a directory with some subfolders which contain different parquet files. Something like this:
2017-09-05
    10-00
        part00000.parquet
        part00001.parquet
    11-00
        part00000.parquet
        part00001.parquet
    12-00
        part00000.parquet
        part00001.parquet
What I want is to pass the path to the directory 2017-09-05 and get a list of the names of all parquet files.
I was able to achieve it, but in a very inefficient way:
val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))
So each key is the name I am looking for, but this process requires me to load all files as well, which then I can't use, since I get them in binary (and I don't know how to convert them into a dataframe).
Once I have the keys (so the list of filePaths) I am planning to invoke:
val myParquetDF = sqlContext.read.parquet(filePath);
As you may have guessed, I am quite new to Spark. So if there is a faster or easier approach to read a list of parquet files located in different folders, please let me know.
My partial solution: I wasn't able to get the paths of all the files in a folder, but I was able to load the content of all files of that type into the same dataframe, which was my ultimate goal. In case someone needs it in the future, I used the following line:
val df = sqlContext.read.parquet("C:/MyDocs/2017-09-05/*/*.parquet")
Thanks for your time
You can do it using the HDFS API like this:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
val fs = FileSystem.get(new Configuration())
// globStatus expands wildcards; listStatus expects a concrete path
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
First, it is better to avoid wholeTextFiles, since it reads each whole file at once. Try the textFile method instead.
Second, if you need to read all files recursively under one directory, you can do so with the textFile method by setting:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This configuration enables recursive search (it works for Spark jobs as well as for MapReduce jobs). Then just invoke sc.textFile(path).

decompressing files from hdfs in spark

I am using Spark and I have different kinds of compressed files on HDFS (zip, gzip, 7zip, tar, bz2, tar.gz, etc.). Could anyone please tell me the best way to decompress them? For some formats I could use CompressionCodec, but it does not support all compression formats. For zip files I did some searching and found that ZipFileInputFormat could be used, but I could not find any jar for it.
For some compressed formats (I know it works for tar.gz and zip; I haven't tested the others), you can use the DataFrame API directly and it will take care of the decompression for you:
val df = spark.read.json("compressed-json.tar.gz")

How to read and write DataFrame from Spark

I need to save a DataFrame in CSV or parquet format (as a single file) and then open it again. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headaches... This is what I tried:
To read the file if it exists:
df = sqlContext
  .read.parquet("s3n://bucket/myTest.parquet")
  .toDF("key", "value", "date", "qty")
To write the file:
df.write.parquet("s3n://bucket/myTest.parquet")
This does not work because:
1) write creates the folder myTest.parquet with hadoopish files that I later cannot read with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't mind multiple hadoopish files, as long as I can later read them easily into a DataFrame. Is that possible?
2) I am always working with the same file myTest.parquet, which I update and overwrite in S3. It tells me that the file cannot be saved because it already exists.
So, can someone show me the right way to do the read/write loop? The file format doesn't matter to me (csv, parquet, hadoopish files) as long as I can make the read and write loop work.
You can save your DataFrame with saveAsTable("TableName") and read it back with table("TableName"). The location can be set with spark.sql.warehouse.dir, and you can overwrite an existing table with mode(SaveMode.Overwrite). See the official documentation for more details.
In Java it would look like this:
SparkSession spark = ...
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");
Dataset<Row> data = ...
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");
Now you can read the data back with:
spark.read().table("TableName");
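For the read/write loop on a path rather than a table, a Scala sketch might look like the following. It assumes Spark SQL is on the classpath; the object name ReadWriteLoop and its two helpers are made up for illustration. coalesce(1) is used so that the output directory holds a single part file. Spark still writes a directory, and passing that directory to read.parquet reads it back as one DataFrame.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ReadWriteLoop {
  // Replaces whatever is at `path` with `df`, written as a single part file
  // inside the output directory.
  def save(df: DataFrame, path: String): Unit =
    df.coalesce(1).write.mode(SaveMode.Overwrite).parquet(path)

  // Reads the whole output directory back as one DataFrame.
  def load(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)
}
```

SaveMode.Overwrite is what removes the "already exists" error on the second and later writes to the same path.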