How to read Parquet files with Flink as a case class (Scala)?

With Spark we can easily read Parquet files and map them to a case class with the following code:
spark.read.parquet("my_parquet_table").as[MyCaseClass]
With Flink, I'm having a lot of trouble doing the same. My case class is generated from an Avro schema, so it is a SpecificRecord.
I tried the following:
val parquetInputFormat = new ParquetRowInputFormat(new Path(path), messageType)
env.readFile(parquetInputFormat, path)
The issue here is the messageType: I was not able to convert either my case class or the Avro schema into a valid messageType.
I tried this:
val messageType = ParquetSchemaConverter.toParquetType(TypeInformation.of(classOf[MyCaseClass]), true)
which ends with the following error:
class org.apache.flink.formats.avro.typeutils.AvroTypeInfo cannot be cast to class org.apache.flink.api.java.typeutils.RowTypeInfo
I could try to use the Table API, but that would mean creating the whole table schema myself, and it would be a pain to maintain.
If someone can point me to an example implementation, or suggest anything that might help, it would be greatly appreciated.
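One workaround that might be worth trying (a sketch only, not verified against a particular Flink version): wrap parquet-avro's AvroParquetInputFormat in Flink's Hadoop compatibility layer, so the records come back as the Avro-generated class instead of Rows. MyCaseClass and the input path below are placeholders.
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapreduce.HadoopInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.parquet.avro.AvroParquetInputFormat

val env = ExecutionEnvironment.getExecutionEnvironment
val job = Job.getInstance()
FileInputFormat.addInputPath(job, new Path("hdfs:///path/to/my_parquet_table"))

// AvroParquetInputFormat produces (Void, MyCaseClass) pairs; keep only the values.
val parquetInput = new HadoopInputFormat[Void, MyCaseClass](
  new AvroParquetInputFormat[MyCaseClass],
  classOf[Void],
  classOf[MyCaseClass],
  job)

val data: DataSet[MyCaseClass] = env.createInput(parquetInput).map(_._2)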

Related

How to serialize/deserialize case classes in a Spark Dataset to/from S3

Let's say I have a Dataset[MyData] where MyData is defined as:
case class MyData(id: String, listA: List[SomeOtherCaseClass])
I want to save the data to S3 and load it back later as the MyData case class.
I know case class data is serializable. But is it possible to do something like:
myData.write.xxxx("s3://someBucket/some")
// later
val myloadedData: Dataset[MyData] = spark.read.yyyy("s3://someBucket/some", MyData)
What does serialization mean to you?
You only need to do exactly what you showed, choosing any available format you like, e.g. CSV, JSON, Parquet, ORC, ...
(I would recommend benchmarking ORC against Parquet for your data to see which one works better for you.)
myData.write.orc("s3://someBucket/somePath")
And, when reading, just use the same format to get a DataFrame back, which you can cast to a Dataset[MyData] using the as[T] method.
val myloadedData: Dataset[MyData] = spark.read.orc("s3://someBucket/somePath").as[MyData]
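One small but important detail: as[MyData] needs an implicit Encoder[MyData] in scope, and for case classes the SparkSession implicits provide it. A short sketch, assuming spark is your SparkSession:
// The implicit Encoder for the case class comes from the SparkSession implicits
import spark.implicits._

myData.write.orc("s3://someBucket/somePath")
val reloaded: Dataset[MyData] = spark.read.orc("s3://someBucket/somePath").as[MyData]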
Or was your question how to connect to S3? If so, and you are running on EMR, then everything is already set up. You only need to prefix your path with s3://, as you already did.

How can I create a Spark DataFrame from a Thrift struct object?

I tried this
val temp = Seq[ProcessAction]() // ProcessAction is declared in Thrift
val toDF = temp.toDF()
I got the error
scala.ScalaReflectionException: none is a term
If I use a case class object rather than ProcessAction, I can get the DataFrame...
Is there any way to get rid of this error?
Parquet understands Thrift-encoded objects, so you could use ThriftParquetWriter to write a Parquet file and then use Spark SQL to get those objects into a DataFrame.
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftParquetWriter.java
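A rough sketch of that approach (the output path is arbitrary, actions stands for your collection of Thrift objects, and spark is a SparkSession; exact constructors may vary across parquet-thrift versions):
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.thrift.ThriftParquetWriter

// Write the Thrift-generated ProcessAction objects out as a Parquet file
val actions: Seq[ProcessAction] = Seq() // your Thrift objects
val writer = new ThriftParquetWriter[ProcessAction](
  new Path("/tmp/process_actions.parquet"),
  classOf[ProcessAction],
  CompressionCodecName.SNAPPY)
actions.foreach(a => writer.write(a))
writer.close()

// Spark reads the Parquet file directly, whatever object model wrote it
val df = spark.read.parquet("/tmp/process_actions.parquet")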

Why does creating a DataFrame fail with "java.lang.UnsupportedOperationException: Schema for type Category is not supported"

I am using Spark 1.4.0.
I am trying to classify text documents into two different categories: scientific or non-scientific.
I have an issue when defining the Category type. I use these commands:
scala> case class LabeledText(id: Long, category: Category, text: String)
defined class LabeledText
scala> val data = Seq(LabeledText(0, Scientific, "hello world"), LabeledText(1, NonScientific, "witaj swiecie")).toDF
But an error appeared:
java.lang.UnsupportedOperationException: Schema for type Category is not supported.
Any help with this will be greatly appreciated.
I think you may have used the Example — Text Classification that I wrote in an attempt to offer an example of LogisticRegression in Spark MLlib.
I deeply apologize for not finishing it (or at least checking it for correctness).
The proper version should start around case class Article(id: Long, topic: String, text: String) and follow along.
The opening example with case class LabeledText is not correct, as it uses the Category type, which I don't think I ever defined properly.
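For reference, a minimal sketch of that corrected starting point: keep the category as a plain String (Spark cannot derive a schema for an arbitrary Category type) and index it later, e.g. with StringIndexer, if a numeric label is needed.
// Keep the category as a String so Spark can derive the schema
import spark.implicits._ // sqlContext.implicits._ on Spark 1.x

case class Article(id: Long, topic: String, text: String)

val articles = Seq(
  Article(0, "scientific", "hello world"),
  Article(1, "non-scientific", "witaj swiecie")).toDF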

How to generate an Avro file using Scala

I have used Python for doing the same. Libraries like fastavro really work. I have the data in CSV format and an Avro schema.
Is there any library in Scala that does something like
val avro = buildAvro(schema,data)
I have really muddled around with this but cannot find a solution.
Try avro4s. Once you add it to your build, you can generate a schema as such:
val schema = AvroSchema[YourType]
Then you can print it out, save it, or do whatever else you need to do with the schema:
println(schema.toString(true))
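If you also need to write records and not just the schema, something along these lines should work with avro4s; treat it as a sketch, since the builder API differs between avro4s versions (YourType and the file name are placeholders):
import java.io.File
import com.sksamuel.avro4s.{AvroOutputStream, AvroSchema}

case class YourType(name: String, value: Int)

val schema = AvroSchema[YourType]

// On some avro4s versions build() takes no arguments instead of the schema
val os = AvroOutputStream.data[YourType].to(new File("out.avro")).build(schema)
os.write(Seq(YourType("a", 1), YourType("b", 2)))
os.close()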

Storing an object to a file

I want to save an object (an instance of a class) to a file. I didn't find any valuable example of this. Do I need to use serialization for it?
How do I do that?
UPDATE:
Here is how I tried to do that
import scala.util.Marshal
import scala.io.Source
import scala.collection.immutable
import java.io._
object Example {
  class Foo(val message: String) extends scala.Serializable

  val foo = new Foo("qweqwe")
  val out = new FileOutputStream("out123.txt")
  out.write(Marshal.dump(foo))
  out.close()
}
First of all, out123.txt contains a lot of extra data and is in the wrong encoding. My gut tells me there should be a more proper way.
At the last ScalaDays, Heather introduced a new library which provides a cool new mechanism for serialization: pickling. I think it would be the idiomatic way to do serialization in Scala and just what you want.
Check out the paper on this topic, the slides, and the talk from ScalaDays '13.
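A quick sketch of what the pickling API looks like (the 0.10.x style; as a later answer notes, it only supports older Scala versions):
import scala.pickling.Defaults._
import scala.pickling.json._

case class Foo(message: String)

val pkl = Foo("qweqwe").pickle   // pickled to a JSON representation
println(pkl.value)               // the JSON string, ready to write to a file
val restored = pkl.unpickle[Foo] // and back again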
It is also possible to serialize to and deserialize from JSON using Jackson.
A nice wrapper that makes it Scala-friendly is Jacks.
JSON has the following advantages:
it is simple, human-readable text
it is a rather byte-efficient format
it can be used directly by JavaScript
it can even be natively stored and queried using a DB like MongoDB
(Edit) Example Usage
Serializing to JSON:
val json = JacksMapper.writeValueAsString[MyClass](instance)
... and deserializing
val obj = JacksMapper.readValue[MyClass](json)
Take a look at Twitter Chill to handle your serialization: https://github.com/twitter/chill. It's a Scala helper for the Kryo serialization library. The documentation/examples on the GitHub page should be sufficient for your needs.
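A small sketch of what that can look like (the pool/instantiator API may differ slightly between Chill versions; the file name is just an example):
import java.nio.file.{Files, Paths}
import com.twitter.chill.ScalaKryoInstantiator

case class Foo(message: String)

// Serialize with a Kryo pool and write the raw bytes to a file
val pool  = ScalaKryoInstantiator.defaultPool
val bytes = pool.toBytesWithClass(Foo("qweqwe"))
Files.write(Paths.get("out123.bin"), bytes)

// ... and read them back
val restored = pool.fromBytes(Files.readAllBytes(Paths.get("out123.bin"))).asInstanceOf[Foo]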
Just adding my answer here for the convenience of someone like me.
The pickling library, which is mentioned by @4lex1v, only supports Scala 2.10/2.11, but I'm using Scala 2.12, so I'm not able to use it in my project.
Then I found BooPickle. It supports Scala 2.11 as well as 2.12!
Here's the example:
import boopickle.Default._
val data = Seq("Hello", "World!")
val buf = Pickle.intoBytes(data)
val helloWorld = Unpickle[Seq[String]].fromBytes(buf)
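Since the original question is about storing to a file, the ByteBuffer returned by Pickle.intoBytes can be written out with a plain FileChannel (the file name is just an example):
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

// Write the pickled bytes to disk
val channel = FileChannel.open(Paths.get("data.bin"),
  StandardOpenOption.CREATE, StandardOpenOption.WRITE)
channel.write(buf)
channel.close()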
For more details, please check here.