Read Java object as Dataset in Scala Spark

I have an HDFS path which contains data written by a Java object, say Obj1. I want to read this path in my Spark Scala code as a Dataset of Obj1.
One way to do this would be to read the HDFS path and apply a map over it to create a new Scala object corresponding to Obj1.
Is there a simpler way to do this? In Java we can do something like:
Dataset<Obj1> obj1DataSet = sparkSession.read().parquet("path").as(Encoders.bean(Obj1.class));

This can be done as follows:
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

val obj1Encoder: Encoder[Obj1] = Encoders.bean(classOf[Obj1])
val objDataSet: Dataset[Obj1] = sparkSession.read.parquet("hdfs://dataPath/").as(obj1Encoder)
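Note that Encoders.bean expects Obj1 to follow JavaBean conventions, i.e. a public no-arg constructor plus getters and setters for its fields; the encoder derives the column mapping from those accessors.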

Related

Scala load data into DataFrame

In Python, and specifically pandas, one can pass an Amazon S3 streaming object to pd.read_csv and easily get back a DataFrame with the streaming data. Something like this:
result = s3.get_object(Bucket=my_bucket, Key=my_key)
df = pd.read_csv(result['Body'])
where result['Body'] above is a streaming object.
Is there a way to do something similar in Scala? I have the code below for retrieving the streaming input in Scala; the object is of type S3ObjectInputStream:
val obj = amazonS3Client.getObject(bucket_name, file_name)
Then the hope is that I could do something like:
scala.read_csv(obj.getObjectContent())
I'm new to Scala, so I would appreciate the help. Thanks.
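There is no built-in scala.read_csv, but the stream can be consumed with scala.io.Source. A minimal sketch, reusing amazonS3Client, bucket_name and file_name from the question (the comma split is a naive parse, not a full CSV reader):
import scala.io.Source

val obj = amazonS3Client.getObject(bucket_name, file_name)
// Read the S3ObjectInputStream line by line and split each line on commas.
// Quoted fields containing commas would need a real CSV library.
val rows: List[Array[String]] =
  Source.fromInputStream(obj.getObjectContent)
    .getLines()
    .map(_.split(","))
    .toList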

How to create a dynamic data source reader and different file format readers in Scala Spark

I am trying to create a program in Spark Scala that reads data from different sources, in different formats like CSV, Parquet and Sequence files, dynamically based on configuration settings.
I have tried a lot; please help, I am new to Scala and Spark.
Please use a config file to specify your input file format and location. For example:
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
val configFile = System.getProperty("config.file")
val config = ConfigFactory.parseFile(new File(configFile))
val format = config.getString("inputDataFormat")
Based on the above format, write your conditional statements for reading files.
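A minimal sketch of those conditionals, assuming an existing SparkSession named spark and a config key inputDataPath (both names are illustrative, not from the original answer):
import spark.implicits._ // needed for .toDF on the sequence-file RDD

val path = config.getString("inputDataPath") // hypothetical key
val df = format.toLowerCase match {
  case "csv"     => spark.read.option("header", "true").csv(path)
  case "parquet" => spark.read.parquet(path)
  case "sequence" =>
    // Sequence files have no DataFrame reader, so go through an RDD first.
    spark.sparkContext.sequenceFile[String, String](path).toDF("key", "value")
  case other => throw new IllegalArgumentException(s"Unsupported format: $other")
}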

How can I create a Spark DataFrame from a .thrift file's struct object?

I tried this:
val temp = Seq[ProcessAction]() // ProcessAction is declared in Thrift
val toDF = temp.toDF()
I got the error:
scala.ScalaReflectionException: <none> is a term
If I use a case class object rather than ProcessAction, I can get the DataFrame...
Are there any ways to get rid of this error?
Parquet understands Thrift-encoded objects, so you could use ThriftParquetWriter to write the objects to a Parquet file and then use Spark SQL or something to get those objects into a DataFrame.
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftParquetWriter.java
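A minimal sketch of that route, assuming ProcessAction is the Thrift-generated class (it extends TBase), actions: Seq[ProcessAction] already holds the objects, and a SparkSession named spark exists; the output path is illustrative:
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.thrift.ThriftParquetWriter

// Write the Thrift objects out as a Parquet file.
val out = new Path("/tmp/process_actions.parquet")
val writer = new ThriftParquetWriter[ProcessAction](out, classOf[ProcessAction], CompressionCodecName.SNAPPY)
actions.foreach(writer.write)
writer.close()

// Spark (or any Parquet reader) can then load it as a DataFrame.
val df = spark.read.parquet("/tmp/process_actions.parquet")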

Read Parquet files from Scala without using Spark

Is it possible to read Parquet files from Scala without using Apache Spark?
I found a project which allows us to read and write Avro files using plain Scala:
https://github.com/sksamuel/avro4s
However, I can't find a way to read and write Parquet files in a plain Scala program without using Spark.
It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.
Some sample code:
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList
This will return standard Avro GenericRecords, but if you want to turn those into Scala case classes, you can use my Avro4s library, as linked in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:
import com.sksamuel.avro4s.RecordFormat

case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given record
val bibble = format.from(record)
We can obviously combine that with the original iterator in one step:
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList
There is also a relatively new project called eel; this is a lightweight (non-distributed processing) toolkit for using some of the 'big data' technology in the small.
Yes, you don't have to use Spark to read/write Parquet.
Just use the Parquet library directly from your Scala code (that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet

How to read a properties file in Scala

I am new to Scala programming and I want to read a properties file in Scala.
I can't find any APIs to read a properties file in Scala.
Please let me know if there is any API for this, or another way to read properties files in Scala.
Besides the Java API, there is a library by Typesafe called config with a good API for working with configuration files of different types.
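A minimal sketch of that approach (the config key shown is illustrative):
import com.typesafe.config.ConfigFactory

// Loads application.conf / application.properties from the classpath.
val conf = ConfigFactory.load()
val appName = conf.getString("app.name") // hypothetical key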
You will have to do it in a similar way to how you would go from a Scala Map to a java.util.Map. java.util.Properties extends java.util.Hashtable, which extends java.util.Dictionary.
scala.collection.JavaConverters has functions to convert to and from a Dictionary and a Scala mutable.Map:
import java.util.Properties
import scala.collection.JavaConverters._

val x = new Properties
// load from a .properties file here, e.g. x.load(new java.io.FileInputStream("app.properties"))

scala> x.asScala
res4: scala.collection.mutable.Map[String,String] = Map()
You can then use the Map above to get and retrieve values. But if you wish to convert it back to the Properties type (to store it back etc.), you may have to type-cast it manually.
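For example, a small sketch of going back, assuming m = x.asScala from above:
val props = new Properties
m.foreach { case (k, v) => props.setProperty(k, v) }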
You can just use the Java API.
Consider something along the lines of:
import scala.io.Source

def getPropertyX: Option[String] = Source.fromFile(fileName)
  .getLines()
  .find(_.startsWith("propertyX="))
  .map(_.replace("propertyX=", ""))
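For a file containing a line like propertyX=42, this returns Some("42"); if no such line exists it returns None.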