Read Parquet files from Scala without using Spark - scala

Is it possible to read parquet files from Scala without using Apache Spark?
I found a project which allows us to read and write Avro files using plain Scala:
https://github.com/sksamuel/avro4s
However, I can't find a way to read and write Parquet files from a plain Scala program without using Spark.

It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.
Some sample code:
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

// path is an org.apache.hadoop.fs.Path pointing at the Parquet file
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList
This will return standard Avro GenericRecords, but if you want to turn those into a Scala case class, then you can use my Avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:
import com.sksamuel.avro4s.RecordFormat

case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given GenericRecord
val bibble = format.from(record)
We can obviously combine that with the original iterator in one step:
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList
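One housekeeping note: ParquetReader implements java.io.Closeable, so once the iterator has been exhausted it is worth closing the reader to release the underlying file handles:
// ParquetReader is Closeable; close it after the iterator has been consumed
reader.close()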

There is also a relatively new project called eel; it is a lightweight (non-distributed processing) toolkit for using some of the 'big data' technology in the small.

Yes, you don't have to use Spark to read/write Parquet.
Just use the Parquet library directly from your Scala code (that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet
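If it helps, here is a rough sketch of the sbt dependencies for the parquet-mr approach above (the version numbers are placeholders; check Maven Central for current releases):
// build.sbt - illustrative only; pick current versions from Maven Central
libraryDependencies ++= Seq(
  "org.apache.parquet"  %  "parquet-avro"  % "1.12.3",
  "org.apache.hadoop"   %  "hadoop-client" % "3.3.4",  // provides Path and the file system client
  "com.sksamuel.avro4s" %% "avro4s-core"   % "4.1.0"   // optional, for the case class marshalling
)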

Related

How To create dynamic data source reader and different file format reader in scala spark

I am trying to create a Spark Scala program that reads data from different sources dynamically, based on a configuration setting.
The program should read data in different formats, such as CSV, Parquet, and Sequence files, chosen dynamically from that configuration.
I have tried but haven't managed to get it working; please help, I am new to Scala and Spark.
Please use a config file to specify your input file format and location. For example:
import java.io.File
import com.typesafe.config.{ Config, ConfigFactory }
val configFile = System.getProperty("config.file")
val config = ConfigFactory.parseFile(new File(configFile))
val format = config.getString("inputDataFormat")
Based on the above format, write your conditional statements for reading the files, along the lines of the sketch below.
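For example, a minimal sketch of that branching, assuming a SparkSession named spark, a config key inputDataLocation for the path, and String key/value types for the SequenceFile (all of these are assumptions for illustration):
val path = config.getString("inputDataLocation") // assumed key, alongside inputDataFormat

val df = format.toLowerCase match {
  case "csv"      => spark.read.option("header", "true").csv(path)
  case "parquet"  => spark.read.parquet(path)
  case "sequence" =>
    // SequenceFiles have no built-in DataFrame reader, so go through the RDD API
    import spark.implicits._
    spark.sparkContext.sequenceFile[String, String](path).toDF("key", "value")
  case other      => throw new IllegalArgumentException(s"Unsupported format: $other")
}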

Read Java object as DataSet in scala spark

I have an HDFS path which contains data written by a Java object, say Obj1, and I want to read this path in my Spark Scala code as a Dataset of Obj1.
One way to do this would be to read the HDFS path and apply a map on it to create a new Scala object corresponding to Obj1.
Is there a simpler way to do this? As we know, in Java we can do something like:
Dataset<Obj1> obj1DataSet = sparkSession.read().parquet("path").as(Encoders.bean(Obj1.class));
This can be done as follows:
val obj1Encoder: Encoder[Obj1] = Encoders.bean(classOf[Obj1])
val objDataSet: Dataset[Obj1] = sparkSession.read.parquet("hdfs://dataPath/").as(obj1Encoder)
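Note that Encoders.bean expects Obj1 to follow JavaBean conventions (a public no-arg constructor plus getter/setter pairs). As a purely hypothetical illustration of that shape written in Scala (the field names are made up):
import scala.beans.BeanProperty

// Hypothetical stand-in for Obj1; @BeanProperty generates the getter/setter
// pairs that Encoders.bean relies on.
class Obj1 {
  @BeanProperty var id: Long = 0L
  @BeanProperty var name: String = ""
}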

Scala: Print content of function definition

I have a Spark application and I implemented a DataFrame extension -
def transform: DataFrame => DataFrame
- so app developers can pass custom transformations into my framework, like:
builder.load(path).transform(_.filter(col("sample") === lit("")))
Now I want to track what happened during the Spark execution:
Log:
val df = spark.read()
val df2 = df.filter(col("sample") === lit(""))
...
So, the idea is to keep a log of actions and pretty-print it at the end, but to do this I need to somehow get the content of the DataFrame => DataFrame function. Possibly macros can help me, but I am not sure. I don't actually need the code (though I would appreciate it), just a direction to go in.

Spark ML - Save OneVsRestModel

I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib Multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it on new data.
Currently, the ML implementation of LogisticRegression only supports binary classification. I am instead using OneVsRest, like so:
val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
I would now like to save my OneVsRestModel, but this does not seem to be supported by the API. I have tried:
ovrModel.save("my-ovr") // Cannot resolve symbol save
ovrModel.models.foreach(_.save("model-" + _.uid)) // Cannot resolve symbol save
Is there a way to save this, so I can load it in a new application for making new predictions?
Spark 2.0.0
OneVsRestModel implements MLWritable, so it should be possible to save it directly. The method shown below can still be useful for saving individual models separately.
Spark < 2.0.0
The problem here is that models returns an Array[ClassificationModel[_, _]], not an Array of LogisticRegressionModel (or MLWritable). To make it work you'll have to be specific about the types:
import org.apache.spark.ml.classification.LogisticRegressionModel
ovrModel.models.zipWithIndex.foreach {
  case (model: LogisticRegressionModel, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}
or to be more generic:
import org.apache.spark.ml.util.MLWritable
ovrModel.models.zipWithIndex.foreach {
  case (model: MLWritable, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}
Unfortunately, as of now (Spark 1.6), OneVsRestModel doesn't implement MLWritable, so it cannot be saved on its own.
Note:
All models in the OneVsRest seem to use the same uid, hence we need an explicit index. It will also be useful for identifying the model later.
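For completeness, loading the individually saved models back in another application could then look roughly like this (a sketch; modelPaths is an assumed list of the paths produced by the save step above):
import org.apache.spark.ml.classification.LogisticRegressionModel

// modelPaths: Seq[String] is assumed to hold the same paths written above
val models = modelPaths.map(LogisticRegressionModel.load)
Reassembling a OneVsRestModel from these models doesn't appear to be exposed through the public API in 1.6, so prediction would have to combine the per-class scores manually.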

how to read properties file in scala

I am new to Scala programming and I wanted to read a properties file in Scala.
I can't find any APIs to read a property file in Scala.
Please let me know if there is any API for this, or another way to read properties files in Scala.
Besides the Java API, there is a library by Typesafe called config, with a good API for working with configuration files of different types.
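A minimal sketch of that route (by default ConfigFactory.load() reads application.conf / application.properties from the classpath; the key some.property is just illustrative):
import com.typesafe.config.ConfigFactory

val conf = ConfigFactory.load()              // picks up application.conf from the classpath
val value = conf.getString("some.property")  // illustrative key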
You will have to do it in a similar way to how you would go from a Scala Map to a java.util.Map. java.util.Properties extends java.util.Hashtable, which extends java.util.Dictionary.
scala.collection.JavaConverters has functions to convert back and forth between a Dictionary and a Scala mutable.Map:
import java.util.Properties
import scala.collection.JavaConverters._

val x = new Properties
// load from the .properties file here, then:
val props = x.asScala  // scala.collection.mutable.Map[String,String]
You can then use the Map above to get and retrieve values. But if you wish to convert it back to the Properties type (to store it back, etc.), you might have to type-cast it manually.
You can just use the Java API.
Consider something along the lines of:
import scala.io.Source

// fileName points at the .properties file
def getPropertyX: Option[String] = Source.fromFile(fileName)
  .getLines()
  .find(_.startsWith("propertyX="))
  .map(_.replace("propertyX=", ""))
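For reference, the plain java.util.Properties route mentioned in the other answers would look roughly like this (the file name is a placeholder):
import java.io.FileInputStream
import java.util.Properties

val props = new Properties()
val in = new FileInputStream("app.properties") // placeholder path
try props.load(in) finally in.close()
val propertyX = Option(props.getProperty("propertyX"))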