Read Parquet into Scala without Spark

I have a Parquet file which I would like to read into my Scala program without using Spark or other big data technologies.
I found the projects
https://github.com/apache/parquet-mr
https://github.com/51zero/eel-sdk
but could not find detailed enough examples to get them working.
Parquet-MR
https://stackoverflow.com/a/35594368/4533188 mentions this, but the examples given are not complete. For example, it is not clear what path is supposed to be: it is supposed to implement InputFile, but how is that done? Also, from the post it seems to me that Parquet-MR does not directly return the Parquet data as standard Scala classes.
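For orientation, here is a minimal sketch of what a parquet-mr read could look like (this is not from the linked answer; it assumes the parquet-avro and hadoop-client dependencies, and the file path is a placeholder). HadoopInputFile is an existing InputFile implementation, and the rows come back as Avro GenericRecords rather than plain Scala classes:
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// HadoopInputFile implements InputFile for local or HDFS paths.
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), new Configuration())
val reader = AvroParquetReader.builder[GenericRecord](inputFile).build()
try {
  // read() returns null once the file is exhausted.
  Iterator.continually(reader.read()).takeWhile(_ != null).foreach { record =>
    println(record) // fields are accessed by name, e.g. record.get("someColumn")
  }
} finally {
  reader.close()
}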
Eel
Here I tried
import io.eels.component.parquet.ParquetSource
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val parquetFilePath = new Path("file://home/raeg/Datatroniq/Projekte/14. Witzenmann/Teilprojekt Strom und Spannung/python_witzenmann/src/data/1.parquet")
implicit val hadoopConfiguration = new Configuration()
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) // This is required
ParquetSource(parquetFilePath)
.toDataStream()
.collect
.foreach(row => println(row))
but I get the error
java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(ParquetReaderTesting.sc:2582)
at org.apache.hadoop.fs.FileSystem.createFileSystem(ParquetReaderTesting.sc:2589)
at org.apache.hadoop.fs.FileSystem.access$200(ParquetReaderTesting.sc:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(ParquetReaderTesting.sc:2628)
at org.apache.hadoop.fs.FileSystem$Cache.get(ParquetReaderTesting.sc:2610)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:366)
at org.apache.hadoop.fs.FileSystem.get(ParquetReaderTesting.sc:165)
at dataReading.A$A6$A$A6.hadoopFileSystem$lzycompute(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.hadoopFileSystem(ParquetReaderTesting.sc:7)
at dataReading.A$A6$A$A6.get$$instance$$hadoopFileSystem(ParquetReaderTesting.sc:7)
at #worksheet#.#worksheet#(ParquetReaderTesting.sc:30)
in my worksheet.
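As an aside on the error itself: "No FileSystem for scheme: file" usually means the local filesystem implementation is not registered in the Hadoop Configuration (a common issue in worksheets and in fat JARs where the META-INF/services entries are merged away). A sketch of the usual workaround, assuming only hadoop-common on the classpath, is to register it explicitly before obtaining the FileSystem:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, LocalFileSystem}

implicit val hadoopConfiguration = new Configuration()
// Register the local filesystem implementation explicitly so "file://" URIs resolve.
hadoopConfiguration.set("fs.file.impl", classOf[LocalFileSystem].getName)
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration)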

Related

Writing Parquet file with append or overwrite, error File doesn't exist, Scala Spark

When I try to write my final DataFrame with append or overwrite mode, I sometimes get the following error:
Caused by: java.io.FileNotFoundException: File file:/C:/Users/xxx/ScalaSparkProjects/Date=2019-11-02/part-xxxx2x.28232x.213.c000.snappy.parquet does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
And I can't understand why. This is how I am writing the DataFrame as a Parquet file:
df.write.mode("append")
  .partitionBy("Date")
  .format("parquet")
  .save("/data/testing/files")
Why could this be happening?
Based on your information, consider this scenario:
Source DataFrame example under the path /tmp/sourceDF
Target path to save under /tmp/destDF
val sourceDF = spark.read.parquet("/tmp/sourceDF")
At this point Spark only reads the Parquet metadata in this folder to infer the schema. For simplicity, the schema I used is num: Integer.
Now you might think that all data is loaded at this point, but Spark works lazily until an action occurs (actions: df.show(), df.take(1), df.count()),
so the following code results in an error.
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
val sourceDF = spark.read.parquet("/tmp/sourceDF")
// The source directory is deleted before any action has materialized sourceDF.
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()
// The write below triggers the actual read, but the files are already gone.
sourceDF.write.parquet("/tmp/destDF")
The result will be:
java.io.FileNotFoundException: File file:/tmp/sourceDF/part-00000-1915503b-4beb-4e14-87ef-ca8b99fc4b11-c000.snappy.parquet does not exist
To fix this, you have two options I can think of.
Change the order:
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
val sourceDF = spark.read.parquet("/tmp/sourceDF")
sourceDF.write.mode("append").parquet("/tmp/destDF")
// Deletion happens now after writing
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()
Or you can use a checkpoint, which eagerly materializes the DataFrame and saves it to the checkpoint directory:
import scala.reflect.io.Directory
import java.io.File
import spark.implicits._
// set checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
// cache df
val sourceDF = spark.read.parquet("/tmp/sourceDF").checkpoint()
// Now you can delete before writing it out
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()
sourceDF.write.mode("append").parquet("/tmp/destDF")
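A lighter-weight variant (a sketch of mine, not part of the original answer) is to cache the DataFrame and force it with an action before deleting the source. Unlike checkpoint() this is best effort: cached blocks can be evicted or lost, in which case Spark would try to re-read the deleted files.
import scala.reflect.io.Directory
import java.io.File

val sourceDF = spark.read.parquet("/tmp/sourceDF").cache()
sourceDF.count() // an action forces the cache to be populated

// Now the source can be deleted, as long as the cached blocks survive
val directory = new Directory(new File("/tmp/sourceDF"))
directory.deleteRecursively()

sourceDF.write.mode("append").parquet("/tmp/destDF")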

Create and stream a zip file as it is being created with the Play Framework and Scala

My Scala Play API provides endpoints that return a file as a stream via the Ok.chunked function.
I now want to be able to allow the download of multiple files as a zip archive.
I want to create the zip file as a stream which Play directly returns as a file stream, meaning without the need to temporarily save the zip file on disk and serving it while it is being created.
What would be a good way to implement a function that creates this stream?
I solved the issue by using Akka's Alpakka.
import java.nio.file.Paths

import akka.stream.alpakka.file.ArchiveMetadata
import akka.stream.alpakka.file.scaladsl.Archive
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

val path = Paths.get("file.txt") // path of the file to include in the archive
val fileSource: Source[ByteString, _] = FileIO.fromPath(path)
val tupleWithMetaData = (ArchiveMetadata("file.txt"), fileSource)
val stream: Source[ByteString, _] = Source(List(tupleWithMetaData)).via(Archive.zip())
First I create an Akka ByteString source. The source is used inside a Tuple2 together with some ArchiveMetadata. This tuple can then be used to create a new source, which can be connected to Alpakka's Archive.zip().
The resulting stream can then be used with Play's Ok.chunked.
I hope this solution helps you if you have the same question.
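For completeness, here is a sketch of how such a stream might be wired into a Play action (the controller, file names, and paths are hypothetical and not part of the original answer):
import java.nio.file.Paths
import javax.inject.Inject

import akka.stream.alpakka.file.ArchiveMetadata
import akka.stream.alpakka.file.scaladsl.Archive
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString
import play.api.mvc.{AbstractController, ControllerComponents}

class DownloadController @Inject()(cc: ControllerComponents) extends AbstractController(cc) {

  // Streams several files as a single zip archive without writing it to disk first.
  def downloadAll = Action {
    val entries = List("a.txt", "b.txt").map { name =>
      (ArchiveMetadata(name), FileIO.fromPath(Paths.get(s"/tmp/$name")): Source[ByteString, Any])
    }
    val zipStream: Source[ByteString, Any] = Source(entries).via(Archive.zip())
    Ok.chunked(zipStream).as("application/zip")
  }
}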

Spark Scala file stream

I am new to Spark and Scala. I want to keep reading files from a folder and persist the file content in Cassandra. I have written a simple Scala program using file streaming to read the file content, but it is not reading files from the specified folder.
Can anybody correct my sample code below?
I am using Windows 7.
Code:
val spark = SparkHelper.getOrCreateSparkSession()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val lines = ssc.textFileStream("file:///C:/input/")

lines.foreachRDD(file => {
  file.foreach(fc => {
    println(fc)
  })
})

ssc.start()
ssc.awaitTermination()
I think a normal Spark job is needed for this scenario rather than Spark Streaming. Spark Streaming is used in cases where your source is something like Kafka or a socket, with a continuous inflow of data.
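A sketch of what such a plain batch job could look like (my assumptions, not part of the original answer: the spark-cassandra-connector is on the classpath, and the keyspace, table, and paths are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("files-to-cassandra")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// Read every file in the folder as lines of text.
val lines = spark.read.textFile("file:///C:/input/")

// Persist the content via the Cassandra data source.
lines.toDF("content")
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "file_content"))
  .mode("append")
  .save()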

Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD?

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.
I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.
SparkContext seems to have a few file-related methods, but they all seem to be for inputs, not outputs.
How do I do this?
Thanks to marios and kostya, but there are only a few steps to writing a text file into HDFS from Spark:
import java.io.BufferedOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}

// The Hadoop Configuration is accessible from the SparkContext.
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
// The output file can be created from the file system.
val output = fs.create(new Path(filename))
// But a BufferedOutputStream must be used to output an actual text file.
val os = new BufferedOutputStream(output)
os.write("Hello World".getBytes("UTF-8"))
os.close()
Note that FSDataOutputStream, which has been suggested, is a binary DataOutputStream-style stream, not a text output stream. Its writeUTF method appears to write plain text, but it actually uses a length-prefixed modified UTF-8 encoding that adds extra bytes.
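To see the difference the note describes, a small sketch (reusing the fs handle from above; the path is a placeholder):
val demo = fs.create(new Path("/tmp/demo.txt")) // FSDataOutputStream
demo.writeUTF("Hello")                // 2 length bytes + 5 text bytes = 7 bytes, not plain text
demo.write("Hello".getBytes("UTF-8")) // exactly the 5 text bytes
demo.close()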
Here's what worked best for me (using Spark 2.0):
import java.io.BufferedOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16MB HDFS block size
val fs = path.getFileSystem(conf)
if (fs.exists(path))
  fs.delete(path, true)

val out = new BufferedOutputStream(fs.create(path))
val txt = "Some text to output"
out.write(txt.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
Using HDFS API (hadoop-hdfs.jar) you can create InputStream/OutputStream for an HDFS path and read from/write to a file using regular java.io classes. For example:
URI uri = URI.create("hdfs://host:port/file path");
Configuration conf = new Configuration();
FileSystem file = FileSystem.get(uri, conf);
FSDataInputStream in = file.open(new Path(uri));
This code will work with local files as well (change hdfs:// to file://).
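Since the rest of this thread is in Scala, roughly the same read in Scala might look like this (a sketch; host, port, and path are placeholders):
import java.net.URI
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Works for hdfs:// URIs and, with file://, for the local filesystem as well.
val uri = URI.create("hdfs://host:port/path/to/file.txt")
val fs = FileSystem.get(uri, new Configuration())
val in = fs.open(new Path(uri))
try {
  Source.fromInputStream(in, "UTF-8").getLines().foreach(println)
} finally {
  in.close()
}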
One simple way to write files to HDFS is to use SequenceFiles. Here you use the native Hadoop APIs and not the ones provided by Spark.
Here is a simple snippet (in Scala):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._

val conf = new Configuration() // Hadoop configuration
val sfwriter = SequenceFile.createWriter(conf,
  SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
  SequenceFile.Writer.keyClass(classOf[LongWritable]),
  SequenceFile.Writer.valueClass(classOf[Text]))
val lw = new LongWritable()
val txt = new Text()
lw.set(12)
txt.set("hello")
sfwriter.append(lw, txt)
sfwriter.close()
...
In case you don't have a key you can use classOf[NullWritable] in its place:
SequenceFile.Writer.keyClass(classOf[NullWritable])
sfwriter.append(NullWritable.get(), new Text("12345"))
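To read such a SequenceFile back in Spark, a sketch (using the same path as above) could use sc.sequenceFile; note that Hadoop reuses the Writable instances, so convert them to plain values before collecting:
import org.apache.hadoop.io.{LongWritable, Text}

val rdd = sc.sequenceFile("hdfs://nn1.example.com/file1", classOf[LongWritable], classOf[Text])
// Copy the reused Writables into plain Scala values before collecting.
rdd.map { case (k, v) => (k.get(), v.toString) }.collect().foreach(println)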

Setting textinputformat.record.delimiter in spark

In Spark, it is possible to set some Hadoop configuration settings, e.g.
System.setProperty("spark.hadoop.dfs.replication", "1")
This works, the replication factor is set to 1.
Assuming that this is the case, I thought that this pattern (prepending "spark.hadoop." to a regular hadoop configuration property), would also work for the
textinputformat.record.delimiter:
System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")
However, it seems that spark just ignores this setting.
Do I set the textinputformat.record.delimiter in the correct way?
Is there a simpler way of setting the textinputformat.record.delimiter? I would like to avoid writing my own InputFormat, since I really only need to obtain records delimited by two newlines.
I got this working with plain uncompressed files with the below function.
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
def nlFile(path: String) = {
  val conf = new Configuration
  conf.set("textinputformat.record.delimiter", "\n")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)
}
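For the two-newline case the question asks about, the same pattern with "\n\n" as the delimiter should yield the desired records; a usage sketch (the path is a placeholder):
def paragraphFile(path: String) = {
  val conf = new Configuration
  conf.set("textinputformat.record.delimiter", "\n\n")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)
}

val records = paragraphFile("/tmp/blocks.txt") // placeholder path
records.take(3).foreach(println)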