Scala 2.10.6 + Spark 1.6.0: Most appropriate way to map from DataFrame to case classes - scala

I'm looking for the most appropriate way to map the information contained in a DataFrame to some case classes I've defined, given the following situation.
I have 2 Hive tables, plus a third table which represents the many-to-many relationship between them. Let's call them "Item", "Group", and "GroupItems".
I'm considering executing a single query joining them all, to get the information of a single group, and all its items.
So each row of the resulting DataFrame would contain the fields of the Group and the fields of an Item.
Then I've created 4 different case classes to use this information in my application. Let's call them:
- ItemProps1: its properties match some of the Item fields
- ItemProps2: its properties match some of the Item fields
- Item: contains some properties which match some of the Item fields, and has 1 object of type ItemProps1 and another of type ItemProps2
- Group: its properties match the Group fields, and it contains a list of items
What I want to do is map the info contained in the resulting DataFrame into these case classes, but I don't know which would be the most appropriate way.
I know DataFrame has a method "as[U]" which is very useful for this kind of mapping, but I'm afraid it won't be useful in my case.
Then, I've found some options to perform the mapping manually, like the following ones:
import org.apache.spark.sql.Row

df.map {
  case Row(foo: Int, bar: String) => Record(foo, bar)
}
or:
df.collect().map { row =>
  val foo = row.getAs[Int]("foo")
  val bar = row.getAs[String]("bar")
  Record(foo, bar)
}
Is any of these approaches the most appropriate one, or should I do it in another way?
Thanks a lot!
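EDIT: for reference, here is a minimal sketch of how the second approach could be extended to build the nested structure. The column names and case-class fields below are hypothetical placeholders, not the real schema:

// Hypothetical schema; adjust column names to the actual join output.
case class ItemProps1(prop1: String)
case class ItemProps2(prop2: String)
case class Item(itemId: Int, props1: ItemProps1, props2: ItemProps2)
case class Group(groupId: Int, items: Seq[Item])

val groups: Seq[Group] = df.collect()
  .map { row =>
    val item = Item(
      row.getAs[Int]("item_id"),
      ItemProps1(row.getAs[String]("prop1")),
      ItemProps2(row.getAs[String]("prop2")))
    row.getAs[Int]("group_id") -> item
  }
  .groupBy(_._1)                // one entry per distinct group
  .map { case (id, pairs) => Group(id, pairs.map(_._2).toSeq) }
  .toSeq

Since the query returns a single group, the groupBy would produce a one-element collection here, but the same code covers the multi-group case.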

Related

Selecting identically named columns in jOOQ

I'm currently using jOOQ to build my SQL (with code generation via the mvn plugin).
Executing the created query is not done by jOOQ though (I'm using the Vert.x SqlClient for that).
Let's say I want to select all columns of two tables which share some identical column names, e.g. UserAccount(id, name, ...) and Product(id, name, ...). When executing the following code
val userTable = USER_ACCOUNT.`as`("u")
val productTable = PRODUCT.`as`("p")
create().select().from(userTable).join(productTable).on(userTable.ID.eq(productTable.AUTHOR_ID))
the build method query.getSQL(ParamType.NAMED) returns a query like
SELECT "u"."id", "u"."name", ..., "p"."id", "p"."name", ... FROM ...
The problem here is that the result set will contain the columns id and name twice, without the prefix "u." or "p.", so I can't map/parse it correctly.
Is there a way to tell jOOQ to alias these columns like the following, without any further manual effort?
SELECT "u"."id" AS "u.id", "u"."name" AS "u.name", ..., "p"."id" AS "p.id", "p"."name" AS "p.name" ...
I'm using the holy Postgres database :)
EDIT: My current approach is something like
val productFields = productTable.fields().map { it.`as`(name("p.${it.name}")) }
val userFields = userTable.fields().map { it.`as`(name("u.${it.name}")) }
create().select(productFields, userFields, ...)...
This feels really hacky, though.
How to correctly dereference tables from records
You should always use the column references that you passed to the query to dereference values from records in your result. If you didn't pass column references explicitly, then the ones from your generated table via Table.fields() are used.
In your code, that would correspond to:
userTable.NAME
productTable.NAME
So, in a resulting record, do this:
val rec = ...
rec[userTable.NAME]
rec[productTable.NAME]
Using Record.into(Table)
Since you seem to be projecting all the columns (do you really need all of them?) to the generated POJO classes, you can still do this intermediary step if you want:
val rec = ...
val userAccount: UserAccount = rec.into(userTable).into(UserAccount::class.java)
val product: Product = rec.into(productTable).into(Product::class.java)
Because the generated table has all the necessary metadata, it can decide which columns belong to it and which ones don't. The POJO doesn't have this meta information, which is why it can't disambiguate the duplicate column names.
Using nested records
You can always use nested records directly in SQL as well in order to produce one of these 2 types:
Record2<Record[N], Record[N]> (e.g. using DSL.row(table.fields()))
Record2<UserAccountRecord, ProductRecord> (e.g. using DSL.row(table.fields()).mapping(...), or starting from jOOQ 3.17, directly using a Table<R> as a SelectField<R>)
The second jOOQ 3.17 solution would look like this:
// Using an implicit join here, for convenience
create().select(productTable.userAccount(), productTable)
    .from(productTable)
    .fetch();
Auto aliasing all columns
There are a ton of flavours that users might like to have when "auto-aliasing" columns in SQL. Any solution offered by jOOQ would be no better than the one you've already found, so if you still want to auto-alias all columns, then just do what you did.
But usually, the desire to auto-alias is a feature request derived from a misunderstanding of the best approach to a problem in jOOQ (see the options above), so ideally you don't go down the auto-aliasing road.

Merge objects in a mutable list in Scala

I have recently started looking at Scala code and I am trying to understand how to go about a problem.
I have a mutable list of objects; each object has an id: String and values: List[Int]. Because of the way I get the data, more than one object can have the same id. I am trying to merge the items in the list, so if, for example, I have 3 objects with id 123 and whichever values, I end up with just one object with that id and the values of the 3 combined.
I could do this the Java way, iterating and so on, but I was wondering if there is an easier, Scala-specific way of going about this?
The first thing to do is avoid using mutable data and think about transforming one immutable object into another. So rather than mutating the contents of one collection, think about creating a new collection from the old one.
Once you have done that it is actually very straightforward because this is the sort of thing that is directly supported by the Scala library.
case class Data(id: String, values: List[Int])

val list: List[Data] = ???

val result: Map[String, List[Int]] =
  list.groupMapReduce(_.id)(_.values)(_ ++ _)
The groupMapReduce call (available since Scala 2.13) breaks down into three parts:
The first part groups the data by the id field and makes that the key. This gives a Map[String, List[Data]]
The second part extracts the values field and makes that the data, so the result is now Map[String, List[List[Int]]]
The third part combines all the values fields into a single list, giving the final result Map[String, List[Int]]
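For example, with some made-up data, the whole pipeline looks like this:

// Sample data: two entries share the id "123".
val list = List(
  Data("123", List(1, 2)),
  Data("123", List(3)),
  Data("456", List(4)))

val result = list.groupMapReduce(_.id)(_.values)(_ ++ _)
// result: Map("123" -> List(1, 2, 3), "456" -> List(4))

// If you need Data objects back rather than a Map, rebuild them:
val merged = result.map { case (id, vs) => Data(id, vs) }.toList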

Dynamically cast JsonElement value when mapping Json object

What I'm trying to achieve is to map a JSON field that contains more than one type of element to the corresponding values at parse time, so that the casting is done in this early mapping stage and doesn't have to happen deep in the execution schedule.
My actual situation is the following one...
import scala.collection.JavaConverters._ // needed to iterate the Java entry set

obj.exampleField = json
  .getAsJsonObject(exampleField)
  .entrySet.asScala
  .map(entry => entry.getKey -> entry.getValue.getAsString)
  .toMap
At the moment everything is a String, but this will have to change so that exampleField starts containing a field of an Array type.
How can I dynamically map those types within my current .map stage, so that a key holding a String field is already cast to the result of getAsString and, in case it's an array type, to the result of getAsJsonArray?
Or is there no other option than to drop the casting from the current .map stage and move it to the last stage in the execution schedule?
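One possible sketch, assuming Gson's JsonElement API and accepting that the mixed-type result becomes a Map[String, Any]:

import com.google.gson.JsonElement
import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on 2.13+

// Choose the representation based on the element's runtime JSON type.
def castValue(e: JsonElement): Any =
  if (e.isJsonArray) e.getAsJsonArray // arrays stay JsonArray
  else e.getAsString                  // primitives become String

obj.exampleField = json
  .getAsJsonObject(exampleField)
  .entrySet.asScala
  .map(entry => entry.getKey -> castValue(entry.getValue))
  .toMap

The price is losing static typing: downstream code has to inspect each value's type. If the possible shapes are known up front, a small sealed trait (e.g. hypothetical StringValue / ArrayValue cases) would keep the map type-safe.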

Spark Cassandra connector - using IN for filtering with dynamic data

Let's assume that I have an RDD with items of type
case class Foo(name: String, nums: Seq[Int])
and a table my_schema.foo in Cassandra whose partition key is composed of the name and num columns.
Now, I'd like to fetch, for each element in the input RDD, all the corresponding rows, i.e. something like:
SELECT * from my_schema.foo where name = :name and num IN :nums
I've tried the following approaches:
use the joinWithCassandraTable extension: rdd.joinWithCassandraTable("my_schema", "foo").on(SomeColumns("name")), but I don't know how I could specify the IN constraint
for each element of the input RDD, issue a separate query (within a map function). This does not work, as the Spark context is not serializable and cannot be passed into the map
flatMap the input RDD to generate a separate item (name, num) for each num in nums (see the sketch after this list). This will work, but it will probably be way less efficient than using an IN clause.
What would be a proper way of solving this problem?
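For reference, a minimal sketch of the flatMap variant (the third option), assuming the DataStax Spark Cassandra connector. Joining on both partition-key columns lets the connector fetch each partition directly, so the expansion is usually less costly than it looks:

import com.datastax.spark.connector._

// Expand each Foo into one (name, num) key per num, then join the keys
// against the table on both partition-key columns.
val keys = rdd.flatMap(foo => foo.nums.map(num => (foo.name, num)))

val joined = keys
  .joinWithCassandraTable("my_schema", "foo")
  .on(SomeColumns("name", "num"))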

How to use CQLinq to get metrics of Methods and Fields within a single query

I am calculating the average length of identifiers with CQLinq in NDepend, and I want to get the lengths of the names of classes, fields and methods. I walked through this page on CQLinq syntax: http://www.ndepend.com/docs/cqlinq-syntax, and I have code like:
let id_m = Methods.Select(m => new { m.SimpleName, m.SimpleName.Length })
let id_f = Fields.Select(f => new { f.Name, f.Name.Length })
select id_m.Union(id_f)
It doesn't work; one error says:
'System.Collections.Generic.IEnumerable' does not
contain a definition for 'Union'...
The other one is:
cannot convert from
'System.Collections.Generic.IEnumerable' to
'System.Collections.Generic.HashSet'
However, according to MSDN, Union() and Concat() are available on the IEnumerable interface (as LINQ extension methods).
It seems that I cannot use CQLinq in exactly the same way as LINQ. Anyway, is there a way to get the information from the Types, Methods and Fields domains within a single query?
Thanks a lot.
Is there a way to get the information from the Types, Methods and Fields domains within a single query?
Not for now, because a CQLinq query can only match a sequence of types, or a sequence of methods, or a sequence of fields, so you need 3 distinct code queries.
In the next version, CQLinq will be improved a lot, and indeed you'll be able to write things like:
from codeElement in Application.TypesAndMembers
select new { codeElement, codeElement.Name.Length }
The next version will be available before the end of 2016.