Scala Slick 3 - How to get non-matching results on joinLeft?

I would like to join two tables and get the rows from the first table that don't have a matching row in the second table, where the match depends on a condition over a certain column.
For example:
tableA.joinLeft(tableB)
  .on((a: A, b: B) => a.key === b.key && a.field1 =!= b.field1)
  .filter(_._2.map(_.key).isEmpty)
  .map(_._1)
but this checks that key == null in tableB instead of checking the result of the join. What am I doing wrong?

Perhaps you need a full outer join, and then filter on result rows where the second table entry is None (NULL). For example:
tableA.joinFull(tableB)
  .on((a: A, b: B) => /* your join condition here */)
  .filter { case (_, maybeMissing) => maybeMissing.isEmpty }
  .map { case (first, _) => first } // note: with a full join, `first` is itself a Rep[Option[A]]

I've found a solution by splitting it into two queries.
The first query is:
tableA.join(tableB)
  .on((a: A, b: B) => a.key === b.key)
  .filter { case (a, b) => a.field1 =!= b.field1 }
  .map(_._1)
The second query is:
tableA.filterNot(_.key in tableB.map(_.key))
And then "union" the two queries

Related

slick navigation issues to the foreign keys (after joinLeft)

val joined = A joinLeft B on (_.id === _.id)
val query = for {
  (a, bOpt) <- joined
} yield (a.a, bOpt.map(_.b))
So far so good.
B has a foreign key relationship (def FK_R) to C. I'd like to navigate to C.c.
bOpt is a Rep[Option[B]]. How do I access table B so I can navigate to C?
Something like bOpt.FK_R.c.
(There's also C-> D, but let's skip that.)
In a nutshell, I need A.a, B.b, C.c, D.d
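One hedged sketch of a way around this: since Rep[Option[B]] doesn't expose B's columns or foreign keys directly, keep chaining joinLeft and express the FK condition by hand (cUid here is a hypothetical column name standing in for whatever FK_R points at):
val query = for {
  ((a, bOpt), cOpt) <- A joinLeft B on (_.id === _.id) joinLeft
                       C on (_._2.map(_.cUid) === _.id) // cUid is a hypothetical FK column
} yield (a.a, bOpt.map(_.b), cOpt.map(_.c))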

How to update column with another subquery value in Slick?

I want to do something like this:
UPDATE item
SET value = (
  SELECT max(value)
  FROM item
)
WHERE id = 1;
I tried:
for {
  maxValue <- Tables.Item.map(_.value).max
  x <- Tables.Item
    .filter(item => item.id === 1)
    .map(_.value)
    .update(maxValue)
} yield x
but maxValue is a Rep[Int] instead of an Int.
Slick's update doesn't support dynamic values or sub-queries. You have a couple of options for this situation.
First, you can use Plain SQL:
sqlu""" UPDATE item SET value = (SELECT max(value) FROM item) WHERE id = 1 """
Second, you could run the expression as two queries (potentially inside a transaction). This is close to the example you already have, except that update is a DBIO action rather than a Query, so the two steps have to be composed as actions.
I'd expect max to produce an optional value, as there might be no rows in the table:
// Note: flatMap on a DBIO needs an implicit ExecutionContext in scope.
val updateAction: DBIO[Int] =
  Tables.Item.map(_.value).max.result.flatMap {
    case Some(maxValue) =>
      Tables.Item
        .filter(item => item.id === 1)
        .map(_.value)
        .update(maxValue)
    case None =>
      DBIO.successful(0) // ... or whatever behaviour you want
  }
However, perhaps your value field is already optional, in which case you can use your existing for comprehension with .result added to the end of the maxValue expression, as mentioned by #Duelist.
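A minimal sketch of that variant, assuming value is mapped as a Rep[Option[Int]] (so that max.result already produces a matching Option[Int]):
// Assumes the profile api and an implicit ExecutionContext are in scope.
val action: DBIO[Int] = (for {
  maxValue <- Tables.Item.map(_.value).max.result
  rows     <- Tables.Item.filter(_.id === 1).map(_.value).update(maxValue)
} yield rows).transactionally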

Spark dataframe aggregation scala

val df = sc.parallelize(Seq(("a", 1), ("a", null), ("b", null), ("b", 2), ("b", 3), ("c", 2), ("c", 4), ("c", 3))).toDF("col1", "col2")
The output should look like this:
col1  col2
a     null
b     null
c     4
I know I can groupBy on col1 and take the max of col2, using df.groupBy("col1").agg("col2" -> "max").
But my requirement is: if a group contains a null I want to select that null record, and only if there is no null do I want the max of col2.
How can I do this? Can anyone please help me?
As I commented, your use of null makes things unnecessarily problematic, so if you can't avoid null in the first place, I think it makes most sense to turn it into something more useful:
val df = sparkContext.parallelize(Seq(("a", 1), ("a", null), ("b", null), ("b", 2), ("b", 3), ("c", 2), ("c", 4), ("c", 3)))
  .mapValues { v =>
    Option(v) match {
      case Some(i: Int) => i
      case _            => Int.MaxValue
    }
  }
  .groupBy(_._1)
  .map { case (k, v) => k -> v.map(_._2).max }
First, I use Option to get rid of null and to move things down the tree from Any to Int so I can enjoy more type safety. I replace null with MaxValue for reasons I'll explain shortly.
Then I groupBy as you did, but then I map over the groups to pair the keys with the max of the values, which will either be one of your original data items or MaxValue where the nulls once were. If you must, you can turn them back into null, but I wouldn't.
There might be a simpler way to do all this, but I like the null replacement with MaxValue, the pattern matching which helps me narrow the types, and the fact I can just treat everything the same afterwards.
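If you'd rather stay with DataFrames, a hedged alternative sketch: let max ignore nulls (its default behaviour) and then re-impose the null on any group that contained one:
import org.apache.spark.sql.functions._

val result = df
  .groupBy("col1")
  .agg(
    max("col2").as("maxVal"),
    // count how many nulls each group contains
    sum(when(col("col2").isNull, 1).otherwise(0)).as("nulls")
  )
  .select(
    col("col1"),
    // a group with any null yields null; otherwise the max wins
    when(col("nulls") > 0, lit(null)).otherwise(col("maxVal")).as("col2")
  )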

Calculating a variable inside RDD after full outer join in Scala

What I want to do is simple, but I struggle with Scala and RDDs.
The concept is this:
rdd1           rdd2
id  count      id  count
a   2          a   1
b   1          c   5
               d   3
And the result I am searching for is this:
rdd3
id  count
a   3
b   1
c   5
d   3
What I intend to do is perform a full outer join to get common and non-common records, identified by the id field. For now, rdd2 is empty.
rdd1 and rdd2 are:
RDD[(String, org.apache.spark.sql.Row)]
For now, I have the following code:
var rdd3 = rdd1.fullOuterJoin(rdd2).map {
  case (id, left, right) =>
    // TODO
}
How can I calculate that sum between RDDs?
If you are doing a fullOuterJoin you get the key and two Options passed into the closure (one Option represents the left side, the other the right side). So the closure could look like this:
val result = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    (id, left.getOrElse(0) + right.getOrElse(0))
}
This applies if your RDD is of type RDD[(String, Int)]; since yours holds Rows, see the sketch below.
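Since your RDDs are actually RDD[(String, Row)], the same shape applies once you unpack the count from the Row. A sketch, assuming the count sits at index 0 of each Row:
val rdd3 = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    val l = left.map(_.getInt(0)).getOrElse(0)  // index 0 is an assumption
    val r = right.map(_.getInt(0)).getOrElse(0)
    (id, l + r)
}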

Slick 3 multiple outer joins

From the Slick documentation, it's clear how to make a single left join between two tables:
val q = for {
  (t, v) <- titles joinLeft volumes on (_.uid === _.titleUid)
} yield (t, v)
Query q will, as expected, have attributes _1 of type Titles and _2 of type Rep[Option[Volumes]], to cover for non-existing volumes.
Further cascading is problematic:
val q = for {
  ((t, v), c) <- titles
    joinLeft volumes on (_.uid === _.titleUid)
    joinLeft chapters on (_._2.uid === _.volumeUid)
} yield /* etc. */
This won't work because _._2.uid === _.volumeUid is invalid: _._2 is a Rep[Option[Volumes]], so uid is not available on it directly.
According to various sources on the net, this shouldn't be an issue, but then again, sources tend to target different Slick versions, and 3.0 is still rather new. Does anyone have a clue about this issue?
To clarify, the idea is to use two left joins to extract data from three cascading 1:n:n tables.
Equivalent SQL would be:
SELECT *
FROM titles
LEFT JOIN volumes
  ON titles.uid = volumes.title_uid
LEFT JOIN chapters
  ON volumes.uid = chapters.volume_uid
Your second left join is no longer operating on a TableQuery[Titles], but instead on what is effectively a Query[(Titles, Option[Volumes])] (ignoring the result and collection type parameters). When you join the resulting query on your TableQuery[Chapters] you can access the second entry in the tuple using the _2 field (since it's an Option you'll need to map to access the uid field):
val q = for {
  ((t, v), c) <- titles
    joinLeft volumes on (_.uid === _.titleUid)
    joinLeft chapters on (_._2.map(_.uid) === _.volumeUid)
} yield /* etc. */
Avoiding TupleN
If the _N field syntax is unclear, you can also use Slick's capacity for user-defined record types to map your rows alternatively:
// The `Table` variant of the joined row representation
case class TitlesAndVolumesRow(title: Titles, volumes: Volumes)

// The DTO variant of the joined row representation
case class TitleAndVolumeRow(title: Title, volumes: Volume)

implicit object TitleAndVolumeShape
  extends CaseClassShape(TitlesAndVolumesRow.tupled, TitleAndVolumeRow.tupled)
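A hedged usage sketch: with the implicit shape in scope, a query can yield the lifted case class instead of a tuple. This works as-is for an inner join; for the left join above, the volumes field of the lifted class would need to be a Rep[Option[Volumes]]:
val q = for {
  (t, v) <- titles join volumes on (_.uid === _.titleUid)
} yield TitlesAndVolumesRow(t, v)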