Scala load data into DataFrame

In Python, and specifically pandas, one can pass an Amazon S3 streaming object to pd.read_csv and easily get back a DataFrame from the streamed data. Something like this:
result = s3.get_object(Bucket=my_bucket, Key=my_key)
df = pd.read_csv(result['Body'])
where result['Body'] in the above is a streaming object.
Is there a way to do something similar in Scala? I have the code below for retrieving the streaming input in Scala; the content stream is of type S3ObjectInputStream.
val obj = amazonS3Client.getObject(bucket_name, file_name)
Then the hope is that I could do something like:
scala.read_csv(obj.getObjectContent())
I'm new to Scala so I would appreciate the help. Thanks
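One possible direction (not from the original question): if Spark is already on the classpath, the S3 stream can be read into lines and handed to Spark's CSV parser, since Spark 2.2+ has DataFrameReader.csv(Dataset[String]). A minimal sketch, assuming a SparkSession named spark and the amazonS3Client, bucket_name and file_name from the question:
import scala.io.Source
import spark.implicits._

// Pull the object's content into memory line by line (the stream from getObjectContent)
val obj = amazonS3Client.getObject(bucket_name, file_name)
val lines = Source.fromInputStream(obj.getObjectContent).getLines().toSeq

// Let Spark parse the in-memory lines as CSV into a DataFrame
val df = spark.read
  .option("header", "true")   // assuming the CSV has a header row
  .csv(lines.toDS())
Note that this pulls the whole object onto the driver; for large files, pointing Spark at the object directly (e.g. spark.read.csv on an s3a:// path with the Hadoop S3 connector configured) scales better.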

Related

Read Java object as Dataset in Scala Spark

I have an HDFS path which contains data written by a Java object, say Obj1, and I want to read this path in my Spark Scala code as a Dataset of Obj1.
One way to do this would be to read the HDFS path and apply a map on it to create a new Scala object corresponding to Obj1.
Is there a simpler way to do this? In Java we can do something like:
Dataset<Obj1> obj1DataSet = sparkSession.read().parquet("path").as(Encoders.bean(Obj1.class));
This can be done as follows:
val obj1Encoder: Encoder[Obj1] = Encoders.bean(classOf[Obj1])
val objDataSet : Dataset[Obj1] = sparkSession.read.parquet("hdfs://dataPath/").as(obj1Encoder)

Scala: Print content of function definition

I have a Spark application and I implemented a DataFrame extension,
def transform: DataFrame => DataFrame
so that app developers can pass custom transformations into my framework, like:
builder.load(path).transform(_.filter(col("sample") === lit("")))
Now I want to track what happened during Spark execution:
Log:
val df = spark.read.load(path)
val df2 = df.filter(col("sample") === lit(""))
...
So the idea is to keep a log of the actions and pretty-print it at the end, but to do this I need to somehow get the content of the DataFrame => DataFrame function. Possibly macros can help me, but I am not sure. I don't actually need the code (although I would appreciate it), just a direction to go in.
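One possible direction (not from the original post): Li Haoyi's sourcecode library can capture the source text of an argument at the call site via sourcecode.Text, which avoids hand-writing a macro. A rough sketch, with Builder and report as hypothetical stand-ins for the framework above, assuming the com.lihaoyi sourcecode dependency is added:
import org.apache.spark.sql.DataFrame

// sourcecode.Text[T] is filled in by a macro with both the value and its literal source text
class Builder(df: DataFrame, log: Vector[String] = Vector.empty) {
  def transform(f: sourcecode.Text[DataFrame => DataFrame]): Builder =
    new Builder(f.value(df), log :+ f.source)   // apply the function and remember its source

  def report(): Unit = log.foreach(println)     // pretty-print the recorded steps at the end
}
Usage would look like builder.transform((df: DataFrame) => df.filter(col("sample") === lit(""))); the explicit parameter type is needed so the lambda typechecks before the implicit macro conversion to sourcecode.Text kicks in, and report() then prints the filter expression verbatim.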

How can I create a Spark DataFrame from a Thrift struct object

I tried this:
val temp = Seq[ProcessAction]() // ProcessAction is declared in Thrift
val toDF = temp.toDF()
and got the error:
scala.ScalaReflectionException: none is a term
If I use a case class rather than ProcessAction, I can get the DataFrame...
Is there any way to get rid of this error?
Parquet files understand Thrift-encoded objects, so you could use ThriftParquetWriter to write your Thrift objects to a Parquet file and then use Spark SQL (or something similar) to get those objects into a DataFrame.
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftParquetWriter.java
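Building on that answer, a rough sketch of the write-then-read round trip; it assumes ProcessAction is the Thrift-generated class (so it extends TBase), that actions is a hypothetical Seq[ProcessAction], and that a SparkSession named spark is available:
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.thrift.ThriftParquetWriter

// Write the Thrift objects out as Parquet (no case class needed)
val writer = new ThriftParquetWriter[ProcessAction](
  new Path("/tmp/process_actions.parquet"),
  classOf[ProcessAction],
  CompressionCodecName.SNAPPY)
actions.foreach(writer.write)
writer.close()

// Read the Parquet file back with Spark SQL as a DataFrame
val df = spark.read.parquet("/tmp/process_actions.parquet")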

Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 and Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD, but it is under Dataset (link 1). However, I have an RDD, not a Dataset.
Also, I can't see it under implicits (link 2).
So please help me understand why toDF() can be called on my RDD. Where is this method inherited from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder), and you then call toDF on the DatasetHolder.
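In concrete terms, a minimal sketch assuming a Spark 2.x SparkSession named spark (the same conversion is exposed through spark.implicits._):
import spark.implicits._   // brings rddToDatasetHolder into scope

val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
// rdd is implicitly wrapped in a DatasetHolder, whose toDF produces the DataFrame
val df = rdd.toDF("key", "value")
df.show()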
Yes, you should import the sqlContext implicits like this:
val sqlContext = //create sqlContext
import sqlContext.implicits._
val df = rdd.toDF()
before you call toDF on your RDDs.
Yes, I finally found peace of mind on this issue. It was troubling me like hell; this post is a life saver. I was trying to generically load data from log files into a case class, collecting them in a mutable List, the idea being to finally convert the list into a DataFrame. However, because it was mutable and Spark 2.1.1 had changed the toDF implementation, the list was not getting converted no matter what. I had even considered saving the data to a file and loading it back with .read, but five minutes ago this post saved my day.
I did it exactly as described: after loading the data into the mutable list, I immediately used
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2 and it worked:
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()

Read Parquet files from Scala without using Spark

Is it possible to read Parquet files from Scala without using Apache Spark?
I found a project which allows us to read and write Avro files using plain Scala:
https://github.com/sksamuel/avro4s
However, I can't find a way to read and write Parquet files from a plain Scala program without using Spark.
It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.
Some sample code:
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList
This will return standard Avro GenericRecords, but if you want to turn those into a Scala case class, then you can use my Avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:
case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given record
val bibble = format.from(record)
We can obviously combine that with the original iterator in one step:
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList
There is also a relatively new project called eel; this is a lightweight (non-distributed processing) toolkit for using some of the 'big data' technology in the small.
Yes, you don't have to use Spark to read or write Parquet.
Just use the Parquet library directly from your Scala code (that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet
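For the writing side, a rough sketch using parquet-avro's AvroParquetWriter (the counterpart of the reader shown above), with a made-up schema and record purely for illustration; no Spark involved:
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Hypothetical two-field schema for illustration
val schema: Schema = SchemaBuilder.record("Bibble").fields()
  .requiredString("name")
  .requiredString("location")
  .endRecord()

val record = new GenericData.Record(schema)
record.put("name", "foo")
record.put("location", "bar")

// Write the record to a local Parquet file
val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/bibble.parquet"))
  .withSchema(schema)
  .build()
writer.write(record)
writer.close()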