I have some data stored as parquet files and case classes matching the data schema. Spark deals well with regular Product types so if I have
case class A(s:String, i:Int)
I can easily do
spark.read.parquet(file).as[A]
But from what I understand, Spark doesn't handle disjunction types, so when I have enums in my parquet, previously encoded as integers, and a scala representation like
sealed trait E
case object A extends E
case object B extends E
I cannot do
spark.read.parquet(file).as[E]
// java.lang.UnsupportedOperationException: No Encoder found for E
Makes sense so far, but then, probably too naively, I try
implicit val eEncoder = new org.apache.spark.sql.Encoder[E] {
def clsTag = ClassTag(classOf[E])
def schema = StructType(StructField("e", IntegerType, nullable = false)::Nil)
}
And I still get the same "No Encoder found for E" :(
My question at this point is: why is the implicit missing from scope (or not recognized as an Encoder[E])? And even if it were picked up, how would such an interface allow me to actually decode the data? I would still need to map the value to the proper case object.
I did read a related answer that says "TL;DR There is no good solution right now, and given Spark SQL / Dataset implementation, it is unlikely there will be one in the foreseeable future." But I'm struggling to understand why a custom Encoder couldn't do the trick.
But I'm struggling to understand why a custom Encoder couldn't do the trick.
Two main reasons:
There is no public API for custom Encoders. The only publicly available ones are the "binary" Kryo and Java Encoders, which produce opaque blobs that are useless in a DataFrame / Dataset[Row] and support no meaningful SQL / DataFrame operations.
Code like this would work fine
import org.apache.spark.sql.Encoders
spark.createDataset(Seq(A, B): Seq[E])(Encoders.kryo[E])
but it is nothing more than a curiosity.
DataFrame is a columnar store. It is technically possible to encode type hierarchies on top of this structure (private UserDefinedType API does that) but it is cumbersome (as you have to provide storage for all possible variants, see for example How to define schema for custom type in Spark SQL?) and inefficient (in general complex types are somewhat second class citizens in Spark SQL, and many optimizations are not accessible with complex schema, subject to future changes).
In a broader sense, the DataFrame API is effectively relational (as in relational algebra), and tuples (the main building block of relations) are by definition homogeneous, so by extension there is no place in the SQL / DataFrame API for heterogeneous structures.
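Given these constraints, one possible workaround is to leave the column as a plain integer in the Dataset schema and convert to the case objects in ordinary Scala code at the boundary. The following is a minimal sketch of that conversion only, with no Spark dependency. The object name EnumDecoding and the codes 0 and 1 are assumptions for illustration; the real codes are whatever the parquet files contain.

```scala
// Sketch: decode integer codes into a sealed hierarchy with an ordinary function.
// The codes 0 and 1 are assumed; replace them with the values stored in the data.
object EnumDecoding {
  sealed trait E
  case object A extends E
  case object B extends E

  // Total decoder: unknown codes become None instead of throwing.
  def decode(code: Int): Option[E] = code match {
    case 0 => Some(A)
    case 1 => Some(B)
    case _ => None
  }

  // Inverse mapping, for writing values back as integers.
  def encode(e: E): Int = e match {
    case A => 0
    case B => 1
  }
}
```

Because E is sealed, the compiler can check that encode covers every case, which is the main benefit of keeping the hierarchy on the Scala side.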
There are a lot of RDDs in Spark; from the docs:
AsyncRDDActions
CoGroupedRDD
DoubleRDDFunctions
HadoopRDD
JdbcRDD
NewHadoopRDD
OrderedRDDFunctions
PairRDDFunctions
PartitionPruningRDD
RDD
SequenceFileRDDFunctions
ShuffledRDD
UnionRDD
and I do not understand what they are supposed to be.
Additionally I noticed that there are
ParallelCollectionRDD
MapPartitionsRDD
which are not listed though they appear very often in my spark-shell as objects.
Question
Why are there different RDDs and what are their respective purposes?
What I understood so far
I understood from tutorials and books (e.g. "Learning Spark") that there are two types of operations on RDDs: Those for RDDs which have pairs (x, y) and all the other operations. So I would expect to have class RDD and PairRDD and that's it.
What I suspect
I suspect that I got it partly wrong and what is actually the case is that a lot of RDD classes could be just one RDD class - but that would make things less tidy. So instead, the developers decided to put different methods into different classes and in order to provide those to any RDD class type, they use implicit to coerce between the class types. I suspect that due to the fact that many of the RDD class types end with "Functions" or "Actions" and text in the respective scaladocs sound like this.
Additionally I suspect that some of the RDD classes still are not like that, but have some more in-depth meaning (e.g. ShuffledRDD).
However - I am not sure about any of this.
First of all, roughly half of the listed classes don't extend RDD but are type classes designed to augment RDD with methods specific to the stored type.
One common example is RDD[(T, U)], commonly known as PairRDD, which is enriched by methods provided by PairRDDFunctions, like combineByKeyWithClassTag, which is the basic building block for all byKey transformations. It is worth noting that there is no such class as PairRDD or PairwiseRDD; these names are purely informal.
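The enrichment mechanism can be illustrated without Spark. The sketch below is a toy model: Box stands in for RDD and PairBoxFunctions for PairRDDFunctions. Both names, and the simplified reduceByKey, are made up for illustration and are not Spark's implementation.

```scala
import scala.language.implicitConversions

object Enrichment {
  // Stand-in for RDD[T]; hypothetical, not a Spark class.
  final class Box[T](val items: Seq[T])

  // Stand-in for PairRDDFunctions: only meaningful when the elements are pairs.
  final class PairBoxFunctions[K, V](self: Box[(K, V)]) {
    def reduceByKey(f: (V, V) => V): Map[K, V] =
      self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
  }

  // The implicit conversion is what makes reduceByKey appear on Box[(K, V)].
  implicit def toPairFunctions[K, V](box: Box[(K, V)]): PairBoxFunctions[K, V] =
    new PairBoxFunctions(box)

  // Box itself has no reduceByKey; the compiler inserts the conversion.
  val counts: Map[String, Int] =
    new Box(Seq("a" -> 1, "b" -> 2, "a" -> 3)).reduceByKey(_ + _)
}
```

A Box[Int] would not compile with reduceByKey, since no conversion applies to it. That is how the extra methods stay tied to the element type rather than to how the collection was created.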
There are also a few commonly used subclasses of RDD which are not part of the public API and as such are not listed above. Some examples worth mentioning are ParallelCollectionRDD and MapPartitionsRDD.
RDD is an abstract class which doesn't implement two important methods:
compute, which computes the result for a given partition
getPartitions, which returns a sequence of partitions for a given RDD
In general there are two reasons to subclass RDD:
create a class representing an input source (e.g. ParallelCollectionRDD, JdbcRDD)
create an RDD which provides non-standard transformations
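Both reasons can be sketched with a toy model in plain Scala. MiniRDD, Part, SeqRDD and MappedRDD below are hypothetical names, and the signatures are simplified; they are not Spark's classes. The sketch only shows how an abstract base with compute and getPartitions leaves room for source classes and transformation classes.

```scala
object MiniRdd {
  // Toy model only: these are not Spark's classes or signatures.
  final case class Part(index: Int)

  abstract class MiniRDD[T] {
    def getPartitions: Seq[Part]
    def compute(part: Part): Iterator[T]

    // Shared behaviour built only on the two abstract methods.
    def collect(): Seq[T] = getPartitions.flatMap(p => compute(p).toSeq)
    def map[U](f: T => U): MiniRDD[U] = new MappedRDD(this, f)
  }

  // Reason 1: an input source (compare ParallelCollectionRDD).
  final class SeqRDD[T](data: Seq[T], numParts: Int) extends MiniRDD[T] {
    def getPartitions: Seq[Part] = (0 until numParts).map(Part(_))
    // Element i goes to partition i % numParts.
    def compute(part: Part): Iterator[T] =
      data.iterator.zipWithIndex.collect { case (x, i) if i % numParts == part.index => x }
  }

  // Reason 2: a transformation over a parent (compare MapPartitionsRDD).
  final class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
    def getPartitions: Seq[Part] = parent.getPartitions
    def compute(part: Part): Iterator[U] = parent.compute(part).map(f)
  }

  val result: Seq[Int] = new SeqRDD(Seq(1, 2, 3, 4), 2).map(_ * 10).collect()
}
```

Note that collect returns elements partition by partition, so the order of result reflects the partitioning rather than the input order.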
So to summarize:
RDD class provides a minimal interface for RDDs.
subclasses of RDD provide internal logic required for actual computations based on external sources and / or parent RDDs. These are either private or part of the developer API and, excluding debug strings or Spark UI, are not exposed directly to the final user.
type classes provide additional methods based on the type of the values which are stored in the RDD and not dependent on how it has been created.
I'm new to Scala and Slick and was surprised by something in the Slick documentation:
The following primitive types are supported out of the box for
JDBC-based databases in JdbcProfile
...
Unit
...
I don't get why this list contains Unit. From my understanding, Unit is similar to Java's void, something I neither can save to nor receive from my database. What is the intention behind it?
edit: you can find it here.
One way to look at Slick is as running Scala code with your database as the execution engine. We are working on allowing more Scala code over time. An expression that contains a unit, e.g. in a tuple, is a valid Scala expression and thus should be runnable by Slick unless there is a good reason why not. So we support Unit.
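For illustration, here is what such expressions look like in plain Scala (no Slick involved; the object name UnitExample is made up):

```scala
object UnitExample {
  // Unit has exactly one value, written ().
  val pair: (Int, Unit) = (42, ())

  // Mapping every element to Unit is equally valid Scala.
  val units: Seq[Unit] = Seq(1, 2, 3).map(_ => ())
}
```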
I am building an application using Scala 2.10, Salat, Play framework 2.1-RC2 (will upgrade to the 2.1 release soon) and MongoDB.
This is a faceless application where JSON web services are exposed for consumers. Up until now, JSON was converted into model objects directly using Play's JSON API and implicit converters. I have to refactor some case classes to avoid the 22-field tuple limit, so instead of a flat case class I am now moving to embedded case classes (and an embedded MongoDB collection).
The web service interface should remain the same: clients should still pass in JSON data in a flat structure as before, but the application needs to map it into the proper case class structure. What's the best way to handle this kind of situation? I am afraid of writing a lot of conversion code: flat JSON to complex case class structure, and from complex case classes back to flat JSON output.
How would you approach such a requirement? I assume many others have run into the 22-field limit on case classes with this kind of requirement.
The Play 2.1 json library relies heavily on combinators (path1 and path2). These combinators all have the same 22 restriction. That gives you two options:
Don't use combinators and construct your objects the hard way: path(json) will give you the value at that point in the path. Searching for 'Accessing value of JsPath' at ScalaJsonCombinators will give more examples.
First transform the json into a structure that does not have more than 22 values in a single object and then use the normal combinators. More information about transforming can be found here: ScalaJsonTransformers
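The shape of the second option can be sketched without Play. In the sketch below a Map[String, String] stands in for the flat JSON object so that it runs on its own. The case classes and field names are hypothetical, and with Play you would express the same regrouping with transformers and Reads/Writes instead.

```scala
object FlatNested {
  // Hypothetical model: field names are made up for illustration.
  final case class Address(street: String, city: String)
  final case class User(name: String, address: Address)

  // Flat input -> nested model; None if a required key is missing.
  def fromFlat(flat: Map[String, String]): Option[User] =
    for {
      name   <- flat.get("name")
      street <- flat.get("street")
      city   <- flat.get("city")
    } yield User(name, Address(street, city))

  // Nested model -> flat output with the same keys the client sent.
  def toFlat(u: User): Map[String, String] =
    Map("name" -> u.name, "street" -> u.address.street, "city" -> u.address.city)
}
```

Each nested class stays well below 22 fields, while the external format remains flat in both directions.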
I need to store Scala class in Morphia. With annotations it works well unless I try to store collection of _ <: Enumeration
Morphia complains that it does not have serializers for that type, and I am wondering, how to provide one. For now I changed type of collection to Seq[String], and fill it with invoking toString on every item in collection.
That works well; however, I'm not sure if that is the right way.
This problem is common to several abstraction layers built on top of MongoDB. It all comes back to one basic reason: there is no enum equivalent in JSON/BSON. Salat, for example, has the same problem.
In fact, the MongoDB Java driver does not support enums, as you can read in the discussion here: https://jira.mongodb.org/browse/JAVA-268 where you can see the issue is still open. Most of the frameworks I have seen for using MongoDB with Java do not implement low-level functionality such as this. I think this choice makes a lot of sense, because it leaves you free to decide how to deal with data structures the low-level driver does not handle, instead of imposing one way of doing it.
In general I feel that the absence of support comes not from a technical limitation but from a design choice. For enums there are multiple ways to map them, each with pros and cons, while for other data types it is probably simpler. I don't know the MongoDB Java driver in detail, but I guess supporting multiple "modes" would have required some refactoring (maybe that's why they are talking about a new version of serialization?).
These are two strategies I am thinking about:
If you want to index on an enum and minimize space usage, map the enum to an integer (please do not use the ordinal; you can assign explicit values to enum constants in Java).
If your concern is queryability from the mongo shell, because your data will be accessed by data scientists, you would rather store the enum using its string value.
To conclude, there is nothing wrong with adding an intermediate data structure between your native object and MongoDB. Salat supports it through CustomTransformers; with Morphia you may need to do the conversion explicitly. Go for it.
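Both strategies can be sketched for a Scala Enumeration using only the standard library. The Status enumeration and its ids below are hypothetical, and how the resulting Int or String is handed to Morphia is left out.

```scala
object EnumStorage {
  // Hypothetical enumeration; ids are assigned explicitly rather than by position.
  object Status extends Enumeration {
    val Active   = Value(10, "ACTIVE")
    val Disabled = Value(20, "DISABLED")
  }

  // Strategy 1: compact integer form, based on the explicit id.
  def toInt(s: Status.Value): Int = s.id
  def fromInt(i: Int): Option[Status.Value] = Status.values.find(_.id == i)

  // Strategy 2: readable string form, based on the declared name.
  def toName(s: Status.Value): String = s.toString
  def fromName(n: String): Option[Status.Value] = Status.values.find(_.toString == n)
}
```

Returning Option on the way back keeps unknown stored values from throwing, which matters once the enumeration changes while old documents remain in the database.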
I want to read a rather large csv file and process it (slice, dice, summarize etc.) interactively (data exploration). My idea is to read the file into a database (H2) and use SQL to process it:
Read the file: I use Ostermiller csv parser
Determine the type of each column: I randomly select 50 rows and derive the type (int, long, double, date, string) of each column
I want to use Squeryl to process. To do so I need to create a case class dynamically. That's the bottleneck so far!
I upload the file to H2 and use any SQL command.
My questions:
Is there a better general interactive way of doing this in Scala?
Is there a way to solve the 3rd point? To state it differently, given a list of types (corresponding to the columns in the csv file), is it possible to dynamically create a case class corresponding to the table in Squeryl? To my understanding I can do that using macros, but I do not have enough exposure to do that.
I think your approach to the first question sounds reasonable.
Regarding your 2nd question - as an addition to drexin's answer - it is possible to generate the bytecode, with a library such as ASM. With such a library you can generate the same byte code as a case class would.
As Scala is a statically typed language, there is no way to dynamically create classes except through reflection, which is slow and dangerous and should therefore be avoided. Even with macros you cannot do this. Macros are evaluated at compile time, not at runtime, so you need to know the structure of your data at compile time. What do you need the case classes for if you don't even know what your data looks like? What benefit do you expect from this over using a Map[String,Any]?
I think you want to create a sealed base class and then a series of case classes as subclasses of it. Each subclass will wrap a different type that you support.
Then you can use match statements and deconstruction to deal with the individual types, and treat them generically via the base class in the places where it doesn't matter.
You can't create a class for an entire row since you don't know enough about it at compile time. Even if you could dynamically generate a class (maybe by invoking the compiler at runtime), you wouldn't be able to benefit from type-safety and most of your code would have to treat it generically anyway.
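A minimal sketch of the sealed base class idea, assuming only three supported column types (Int, Double, String). The names Cell, IntCell and so on are made up for illustration, and the parsing rule (try Int, then Double, else String) is a deliberately simple assumption.

```scala
import scala.util.Try

object Cells {
  // One wrapper per supported column type.
  sealed trait Cell
  final case class IntCell(value: Int) extends Cell
  final case class DoubleCell(value: Double) extends Cell
  final case class StringCell(value: String) extends Cell

  // Try the narrowest type first and fall back to String.
  def parse(raw: String): Cell = {
    val asInt: Option[Cell]    = Try(raw.trim.toInt).toOption.map(IntCell(_))
    val asDouble: Option[Cell] = Try(raw.trim.toDouble).toOption.map(DoubleCell(_))
    asInt.orElse(asDouble).getOrElse(StringCell(raw))
  }

  // Type-specific handling through pattern matching.
  def describe(c: Cell): String = c match {
    case IntCell(v)    => s"int $v"
    case DoubleCell(v) => s"double $v"
    case StringCell(v) => s"string $v"
  }

  // Generic handling through the base type: a row is just a Seq[Cell].
  def parseRow(fields: Seq[String]): Seq[Cell] = fields.map(parse)
}
```

A row then needs no generated class at all: it is a Seq[Cell] whose length and element types are only known at runtime.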