val joined = A joinLeft B on (_.id === _.id)
val query = for {
(A, BOpt) <- joined
} yield (A.a, BOpt.map(_.b))
So far so good.
B has a foreign key relationship (def FK_R) to C, and I'd like to navigate to C.c.
BOpt is a Rep[Option[B]]. How do I access table B so that I can navigate to C?
Something like BOpt.FK_R.c
(There's also C-> D, but let's skip that.)
In a nutshell, I need A.a, B.b, C.c, D.d
Related
I would like to join two tables and get the rows from the first table that have no matching row in the second table, where "matching" depends on a condition on a certain column.
For example:
tableA.joinLeft(tableB)
.on((a: A, b: B) => a.key === b.key && a.field1 =!= b.field1)
.filter(_._2.map(_.key).isEmpty)
.map(_._1)
But this checks that key is null in tableB instead of checking the result of the join. What am I doing wrong?
Perhaps you need a full outer join, and then filter on result rows where the second table entry is None (NULL). For example:
tableA.joinFull(tableB)
.on((a: A, b: B) => /* your join condition here */)
.filter { case (_, maybeMissing) => maybeMissing.isEmpty }
.map { case (first, _) => first }
I've found a solution by splitting it into 2 queries:
one query is:
tableA.join(tableB)
.on((a: A, b: B) => a.key === b.key)
.filter { case (a, b) => a.field1 =!= b.field1 }
.map(_._1)
second query is:
tableA.filterNot(_.key in tableB.map(_.key))
And then "union" the two queries
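To make the intended result of the two-query approach concrete, here is a minimal sketch that models it with ordinary Scala collections instead of Slick queries. The case classes A and B and the sample rows are hypothetical, and the sketch assumes that "union" means a duplicate-removing SQL UNION, which `distinct` imitates.

```scala
object UnionSketch {
  // Hypothetical row types standing in for the two tables.
  final case class A(key: Int, field1: String)
  final case class B(key: Int, field1: String)

  val tableA = Seq(A(1, "x"), A(2, "y"), A(3, "z"))
  val tableB = Seq(B(1, "x"), B(2, "other"))

  // First query: inner join on key, keeping A rows whose field1 differs.
  val differing: Seq[A] =
    for { a <- tableA; b <- tableB if a.key == b.key && a.field1 != b.field1 } yield a

  // Second query: A rows whose key does not occur in B at all.
  val bKeys: Set[Int] = tableB.map(_.key).toSet
  val missing: Seq[A] = tableA.filterNot(a => bKeys.contains(a.key))

  // "Union" of the two results; distinct imitates UNION dropping duplicates.
  val result: Seq[A] = (differing ++ missing).distinct
}
```

With these rows, A(1, "x") is excluded because B holds an identical row, A(2, "y") is kept because field1 differs, and A(3, "z") is kept because its key is absent from B.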
I have two RDDs: one has a single column and the other has two. To join them on the key, I added a dummy value of 0 to the single-column RDD. Is there a more efficient way of doing this with join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me restate this question in SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL, table1 has only one column whereas table2 has two, yet the join still works. In the same way, can Spark join on the keys of both RDDs?
The join operation is defined only on pair RDDs (RDD[(K, V)], through PairRDDFunctions), which are quite different from a relation or table in SQL. Each element of a pair RDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode (and equals).
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause, while the value contains the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance, and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and the value in a pair RDD each have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Using the null singleton as the placeholder is arguably much cleaner than using 0, but there is really no way around supplying some value.
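The key/value shape described above can be illustrated without Spark. The following is a minimal sketch using plain Scala collections; the `join` helper, the sample movie rows, and the choice of Unit as the placeholder value are all assumptions made for illustration, not Spark's implementation.

```scala
object PairJoinSketch {
  // Inner join of two key/value collections. The result has the same shape
  // as a pair-RDD join: each key is paired with a tuple of both values.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val rightByKey: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    for {
      (k, v) <- left
      w <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, (v, w))
  }

  // Hypothetical sample data: (id, name) pairs and a list of bare ids.
  val movieNames = Seq(("1", "Toy Story"), ("2", "GoldenEye"), ("3", "Four Rooms"))

  // The id-only side still needs a value; Unit is the placeholder here.
  val movieIds: Seq[(String, Unit)] = Seq("1", "3", "3").map(id => (id, ()))

  // Drop the placeholder again and remove duplicates caused by repeated ids.
  val joined: Seq[(String, String)] =
    join(movieNames, movieIds).map { case (id, (name, _)) => (id, name) }.distinct
}
```

The placeholder carries no information; it exists only because the join signature requires a value on both sides.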
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
lines.map(x => x.split("\t")).map(_.head).collect.toSet)
movienames.filter{case (id, _) => moviesidBD.value contains id}
but if you really want SQL-ish joins then you should simply use Spark SQL.
// toDF needs the SQL implicits in scope, e.g. import sqlContext.implicits._
val movieIdsDf = lines
.map(x => x.split("\t"))
.map(a => Tuple1(a.head))
.toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
On RDDs, the join operation is only defined for pair RDDs, so you need to map each record to a key/value pair first. Below is a sample:
val rdd1=sc.textFile("/data-001/part/")
val rdd_1=rdd1.map(x=>x.split('|')).map(x=>(x(0),x(1)))
val rdd2=sc.textFile("/data-001/partsupp/")
val rdd_2=rdd2.map(x=>x.split('|')).map(x=>(x(0),x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)
From the Slick documentation, it's clear how to make a single left join between two tables.
val q = for {
(t, v) <- titles joinLeft volumes on (_.uid === _.titleUid)
} yield (t, v)
Query q will, as expected, have attributes: _1 of type Titles and _2 of type Rep[Option[Volumes]] to cover for non-existing volumes.
Further cascading is problematic:
val q = for {
((t, v), c) <- titles
joinLeft volumes on (_.uid === _.titleUid)
joinLeft chapters on (_._2.uid === _.volumeUid)
} yield /* etc. */
This won't work because _._2.uid === _.volumeUid is invalid: _._2 is a Rep[Option[Volumes]], which has no uid member.
According to various sources on the net, this shouldn't be an issue, but then again, those sources tend to target different Slick versions and 3.0 is still rather new. Does anyone have a clue about the issue?
To clarify, the idea is to use two left joins to extract data from 3 cascading 1:n:n tables.
Equivalent SQL would be:
Select *
from titles
left join volumes
on titles.uid = volumes.title_uid
left join chapters
on volumes.uid = chapters.volume_uid
Your second left join is no longer operating on a TableQuery[Titles], but instead on what is effectively a Query[(Titles, Option[Volumes])] (ignoring the result and collection type parameters). When you join the resulting query on your TableQuery[Chapters] you can access the second entry in the tuple using the _2 field (since it's an Option you'll need to map to access the uid field):
val q = for {
((t, v), c) <- titles
joinLeft volumes on (_.uid === _.titleUid)
joinLeft chapters on (_._2.map(_.uid) === _.volumeUid)
} yield /* etc. */
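The reason map is needed can be seen in a model of the same cascade built from plain Scala collections rather than Slick. This is only a sketch: the `leftJoin` helper, the row case classes, and the sample data are hypothetical, and the in-memory Option stands in for Slick's Rep[Option[...]].

```scala
object CascadingLeftJoinSketch {
  // Hypothetical row types for the three cascading tables.
  final case class Title(uid: Int, name: String)
  final case class Volume(uid: Int, titleUid: Int)
  final case class Chapter(uid: Int, volumeUid: Int)

  // Left join over in-memory rows: unmatched left rows are kept with None.
  def leftJoin[L, R](ls: Seq[L], rs: Seq[R])(on: (L, R) => Boolean): Seq[(L, Option[R])] =
    ls.flatMap { l =>
      val ms = rs.filter(r => on(l, r))
      if (ms.isEmpty) Seq((l, Option.empty[R])) else ms.map(r => (l, Option(r)))
    }

  val titles = Seq(Title(1, "T1"), Title(2, "T2"))
  val volumes = Seq(Volume(10, 1))
  val chapters = Seq(Chapter(100, 10), Chapter(101, 10))

  val titlesVolumes: Seq[(Title, Option[Volume])] =
    leftJoin(titles, volumes)(_.uid == _.titleUid)

  // The left side is now a tuple whose second element is an Option, so the
  // join key is reached with map, mirroring _._2.map(_.uid) in the query above.
  val rows: Seq[((Title, Option[Volume]), Option[Chapter])] =
    leftJoin(titlesVolumes, chapters)((tv, c) => tv._2.map(_.uid).contains(c.volumeUid))
}
```

A title without volumes survives both joins with None in the volume and chapter positions, which is the behaviour the two SQL left joins are expected to give.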
Avoiding TupleN
If the _N field syntax is unclear, you can also use Slick's capacity for user-defined record types to map your rows alternatively:
// The `Table` variant of the joined row representation
case class TitlesAndVolumesRow(title: Titles, volumes: Volumes)
// The DTO variant of the joined row representation
case class TitleAndVolumeRow(title: Title, volumes: Volume)
implicit object TitleAndVolumeShape
extends CaseClassShape(TitlesAndVolumesRow.tupled, TitleAndVolumeRow.tupled)
When I run Query(query.length).first on a query that represents a join of 2 tables which have several columns with the same names, I get malformed SQL. Consider the example:
// in Main.scala
import scala.slick.driver.MySQLDriver.simple._
object Main extends App {
object Houses extends Table[Long]("Houses") {
def id = column[Long]("id")
def * = id
}
object Rooms extends Table[(Long, Long)]("Rooms") {
def id = column[Long]("id")
def houseId = column[Long]("houseId")
def * = id ~ houseId
}
val query = for {
h <- Houses
r <- Rooms
if h.id === r.houseId
} yield (h, r)
println("QUERY: " + Query(query.length).selectStatement)
}
// in build.sbt
scalaVersion := "2.10.2"
libraryDependencies += "com.typesafe.slick" %% "slick" % "1.0.1"
This example generates the following SQL:
select x2.x3 from
(select count(1) as x3 from
(select x4.`id`, x5.`id`, x5.`houseId`
from `Houses` x4, `Rooms` x5 where x4.`id` = x5.`houseId`) x6) x2
This is clearly wrong and is rejected by MySQL because the id column is duplicated in the select x4.`id`, x5.`id` part.
I could try to do the following:
query.list.size
but that will extract all the rows from the query and send them over the wire, which is going to hinder performance greatly.
What am I doing wrong? Is there some way to fix it?
That's an interesting issue. Usually with SQL, you alias the other column that would cause a name collision, but I'm not sure how that works with Slick (or if it is even possible). I believe you can work around this by selecting only a single column if you just want to count:
val query = for {
h <- Houses
r <- Rooms
if h.id === r.houseId
} yield h.id.count
Now, the count call on id is deprecated, but this one produced a clean SQL statement which looks like this:
select count(x2.`id`) from `Houses` x2, `Rooms` x3 where x2.`id` = x3.`houseId`
Anything that I tried using .length produced a bunch of SQL that was not correct.
EDIT
In response to your comment, if you wanted to leave the query the way it was (and let's forget that the query itself is broken due to field collision/ambiguity in the join) and then also be able to derive a count query from it, that would look like this:
def main(args: Array[String]) {
val query = for {
h <- Houses
r <- Rooms
if h.id === r.houseId
} yield (h,r)
val lengthQuery = query.map(_._1.id.count)
}
The point here is that you should be able to take any query and map it to a count query by selecting a single column (instead of the full objects) and then getting that count for that column. In this case, because the result is a Tuple2, I have to go in an additional level to get to the id column, but I think you get the picture.
Let's imagine I have a table called Foo with a primary key FooID and a non-unique integer column Bar. For some reason, in a SQL query I have to join table Foo with itself multiple times, like this:
SELECT * FROM Foo f1 INNER JOIN Foo f2 ON f2.Bar = f1.Bar INNER JOIN Foo f3 ON f3.Bar = f1.Bar...
I have to achieve this via LINQ to Entities.
Doing
ObjectContext.Foos.Join(ObjectContext.Foos, a => a.Bar, b => b.Bar, (a, b) => new {a, b})
gives me a LEFT OUTER JOIN in the resulting query, and I need inner joins; this is critical.
Of course, I might succeed if I added as many associations of Foo with itself as necessary in the edmx and then used them in my code; Entity Framework would substitute the correct inner join for each of the associations. The problem is that at design time I don't know how many joins I will need. OK, one workaround is to add as many of them as is reasonable...
But, if nothing else, from a theoretical point of view, is it at all possible to create inner joins via EF without explicitly defining the associations?
In LINQ to SQL there was a (somewhat bizarre) way to do this via GroupJoin, like this:
ObjectContext.Foos.GroupJoin(ObjectContext.Foos, a => a.Bar, b => b.Bar, (a, b) => new {a, b}).SelectMany(o => o.b.DefaultIfEmpty(), (o, b) => new {o.a, b})
I've just tried it in EF, the trick does not work there. It still generates outer joins for me.
Any ideas?
In LINQ to Entities, below is one way to do an inner join on multiple instances of the same table:
using (ObjectContext ctx = new ObjectContext())
{
var result = from f1 in ctx.Foo
join f2 in ctx.Foo on f1.bar equals f2.bar
join f3 in ctx.Foo on f1.bar equals f3.bar
select ....;
}