From the Slick documentation, it's clear how to make a single left join between two tables.
val q = for {
  (t, v) <- titles joinLeft volumes on (_.uid === _.titleUid)
} yield (t, v)
The query q will, as expected, yield tuples whose _1 is of type Titles and whose _2 is of type Rep[Option[Volumes]], to account for non-existing volumes.
Further cascading is problematic:
val q = for {
  ((t, v), c) <- titles
    joinLeft volumes on (_.uid === _.titleUid)
    joinLeft chapters on (_._2.uid === _.volumeUid)
} yield /* etc. */
This won't work because _._2.uid === _.volumeUid is invalid: _._2 is an optional value, so uid does not exist on it directly.
According to various sources on the net this shouldn't be an issue, but then again, those sources tend to target different Slick versions, and 3.0 is still rather new. Does anyone have a clue about this?
To clarify, the idea is to use two left joins to extract data from three cascading 1:n:n tables.
Equivalent SQL would be:
SELECT *
FROM titles
LEFT JOIN volumes
  ON titles.uid = volumes.title_uid
LEFT JOIN chapters
  ON volumes.uid = chapters.volume_uid
Your second left join is no longer operating on a TableQuery[Titles], but instead on what is effectively a Query[(Titles, Option[Volumes])] (ignoring the result and collection type parameters). When you join the resulting query with your TableQuery[Chapters], you can access the second entry in the tuple using the _2 field (since it's an Option you'll need to map over it to access the uid field):
val q = for {
  ((t, v), c) <- titles
    joinLeft volumes on (_.uid === _.titleUid)
    joinLeft chapters on (_._2.map(_.uid) === _.volumeUid)
} yield /* etc. */
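For completeness, here is one possible yield, using only the columns that appear in the question (t is a full Titles row, while the volume and chapter sides are optional, so their columns are accessed via map):
val q = for {
  ((t, v), c) <- titles
    joinLeft volumes on (_.uid === _.titleUid)
    joinLeft chapters on (_._2.map(_.uid) === _.volumeUid)
} yield (t.uid, v.map(_.uid), c.map(_.volumeUid)) // Rep[...], Rep[Option[...]], Rep[Option[...]]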
Avoiding TupleN
If the _N field syntax is unclear, you can also use Slick's capacity for user-defined record types to map your rows alternatively:
// The `Table` variant of the joined row representation
case class TitlesAndVolumesRow(title: Titles, volumes: Volumes)
// The DTO variant of the joined row representation
case class TitleAndVolumeRow(title: Title, volumes: Volume)
implicit object TitleAndVolumeShape
  extends CaseClassShape(TitlesAndVolumesRow.tupled, TitleAndVolumeRow.tupled)
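As a sketch of how such a shape is typically used in a query, following the user-defined record type pattern from the Slick documentation (shown with an inner join so both sides are non-optional; for the left join above, the volumes field of the lifted row would need to be lifted as an Option to line up):
// With the implicit shape in scope, the join can be mapped straight into the
// named row type instead of a tuple; running the query yields TitleAndVolumeRow values
val q = for {
  (t, v) <- titles join volumes on (_.uid === _.titleUid)
} yield TitlesAndVolumesRow(t, v)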
Related
val joined = A joinLeft B on (_.id === _.id)
val query = for {
  (A, BOpt) <- joined
} yield (A.a, BOpt.map(_.b))
So far so good.
B has a foreign key relationship (def FK_R) to C. I'd like to navigate to C.c.
BOpt is a Rep[Option[B]]. How do I access table B so I can navigate to C?
Something like BOpt.FK_R.c
(There's also C -> D, but let's skip that.)
In a nutshell, I need A.a, B.b, C.c, D.d
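One common way around this (a sketch, not from the original post) is not to navigate the foreign key through the Option at all, but to chain further joinLeft steps and map over the optional side in each on clause. The cId/dId foreign key columns and the id columns on C and D are hypothetical names standing in for whatever FK_R actually references:
val query = for {
  (((a, bOpt), cOpt), dOpt) <- A
    joinLeft B on (_.id === _.id)
    joinLeft C on (_._2.map(_.cId) === _.id) // cId: hypothetical FK column on B
    joinLeft D on (_._2.map(_.dId) === _.id) // dId: hypothetical FK column on C
} yield (a.a, bOpt.map(_.b), cOpt.map(_.c), dOpt.map(_.d))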
What I want to do is simple, but I struggle with Scala and RDDs.
The concept is this:
rdd1            rdd2
id  count       id  count
a   2           a   1
b   1           c   5
                d   3
And the result I am searching for is this:
rdd2
id count
a 3
b 1
c 5
d 3
What I intend to do is perform a full outer join to get the common and non-common records, identified by the id field. For now, rdd2 is empty.
rdd1 and rdd2 are:
RDD[(String, org.apache.spark.sql.Row)]
For now, I have the following code:
var rdd3 = rdd1.fullOuterJoin(rdd2).map {
  case (id, left, right) =>
    // TODO
}
How can I calculate that sum between RDDs?
If you are doing a fullOuterJoin, you get the key and two Options passed into the closure (one Option represents the left side, the other the right side). So the closure could look like this:
val result = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    (id, left.getOrElse(0) + right.getOrElse(0))
}
This applies if your RDD is of type (String, Int).
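Since the question's RDDs are actually of type RDD[(String, org.apache.spark.sql.Row)], the same pattern applies after first pulling the count out of each Row. A minimal sketch, assuming the count is the first field of the Row and is an Int:
val rdd3 = rdd1.fullOuterJoin(rdd2).map {
  case (id, (left, right)) =>
    // A missing side contributes 0; otherwise read the count from the Row (assumed to be field 0)
    val l = left.map(_.getInt(0)).getOrElse(0)
    val r = right.map(_.getInt(0)).getOrElse(0)
    (id, l + r)
}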
I have two RDDs; one has just one column and the other has two. To join the two RDDs on keys I have added a dummy value of 0. Is there any other, more efficient way of doing this using join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")

// u.data is tab separated (user id, movie id, rating, timestamp); key by movie id with a dummy 0
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1), 0))
val test = moviesid.map(x => x._1)

// u.item is pipe separated (movie id, movie name, ...)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0), x(1)))
val joined = movienames.join(moviesid).distinct()
Edit:
Let me convert this question into SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we would write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here table1 has only one column in SQL, whereas table2 has two, and the join still works; in the same way, Spark should be able to join on keys from both RDDs.
The join operation is defined only on PairwiseRDDs, which are quite different from a relation / table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects, as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause and the value as the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance, and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and the value in a PairwiseRDD each have a clear meaning.
Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around needing some value.
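A minimal sketch of that idea, reusing the names from the question (test is the RDD of bare movie ids) and using Unit as the throwaway value, purely for illustration:
// Pair the single-column RDD with Unit just so join can be applied
val keyedIds = test.map(id => (id, ()))
// join yields (movieId, ((), movieName)); keep only the name
val idToName = keyedIds.join(movienames).mapValues { case (_, name) => name }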
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
  // the movie id is the second tab-separated field of u.data
  lines.map(x => x.split("\t")).map(x => x(1)).collect.toSet)
movienames.filter { case (id, _) => moviesidBD.value contains id }
but if you really want SQL-ish joins then you should simply use SparkSQL.
// toDF requires the SQL implicits to be in scope, e.g. import sqlContext.implicits._
val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a(1))) // movie id column
  .toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
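If the count(1) / group by from the question's SQL is wanted as well, it can be expressed on top of the join (a sketch based on the DataFrames defined above):
movienamesDf
  .join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
  .groupBy(movienamesDf("id"), movienamesDf("name"))
  .count()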
On RDDs, the join operation is only defined for PairwiseRDDs, so you need to convert the values into key/value pairs first. Below is a sample:
val rdd1 = sc.textFile("/data-001/part/")
val rdd_1 = rdd1.map(x => x.split('|')).map(x => (x(0), x(1)))
val rdd2 = sc.textFile("/data-001/partsupp/")
val rdd_2 = rdd2.map(x => x.split('|')).map(x => (x(0), x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)
Given the following LINQ to Entities (EF4) query...
var documents =
from doc in context.Documents
.Include(d => d.Batch.FinancialPeriod)
.Include(d => d.Batch.Contractor)
.Include(d => d.GroupTypeHistory.Select(gth => gth.GroupType))
.Include(d => d.Items.Select(i => i.Versions))
.Include(d => d.Items.Select(i => i.Versions.Select(v => v.ProductPackPeriodic.ProductPack.Product.HomeDelivery)))
.Include(d => d.Items.Select(i => i.Versions.Select(v => v.ProductPackPeriodic.ProductPack.Product.Manufacturer)))
.Include(d => d.Items.Select(i => i.Versions.Select(v => v.ProductPackPeriodic.ProductPeriodic)))
.Include(d => d.Items.Select(i => i.Versions.Select(v => v.ProductPackPeriodic.SpecialContainerIndicator)))
.Include(d => d.Items.Select(i => i.Versions.Select(v => v.Endorsements.Select(e => e.Periodic))))
where doc.ID == this.documentID
select doc;
Document document = documents.FirstOrDefault();
Where ...
a Document will always have a Batch, which in turn will always have a financial period and a contractor
a Document will always have one or more GroupTypeHistory, each of which will always have a GroupType
a Document will have zero or more Items, which in turn will have one or more Versions
a Version will have zero or one ProductPackPeriodic
a ProductPackPeriodic will always have one ProductPack and one ProductPeriodic, along with zero or one SpecialContainerIndicator
a ProductPack will always have one Product
a Product will have zero or one Manufacturer, and zero or one HomeDelivery
a Version will have zero or more Endorsements, each of which will have one Periodic
The above LINQ query generates some of the worst TSQL that I have ever seen, with some of the related tables included multiple times (probably because they are referenced within the query multiple times), and it takes significantly longer to run than I would like (the tables concerned can contain millions of rows, but this is not the cause).
I know that there has to be a better way to write it (taking into account all of the different reference types that I describe above) which will result in better TSQL, but every version that I try fails to return the data correctly.
Can anybody assist in pointing me to a better solution?
If it makes it any easier to understand, were I writing TSQL directly I would be looking at something like the following...
select *
from Document d
inner join Batch b
    inner join FinancialPeriod fp on b.FinancialPeriodID = fp.FinancialPeriodID
    inner join Contractor c on b.ContractorID = c.ContractorID
  on d.BatchID = b.BatchID
inner join DocumentGroupType dgt on d.DocumentID = dgt.DocumentID
left join Item i
    left join ItemVersion iv
        left join ProductPackPeriodic ppp
            inner join ProductPack pack
                inner join Product p
                    left join Manufacturer m on p.ManufacturerID = m.ManufacturerID
                    left join HomeDeliveryProduct hdp on p.ProductID = hdp.ProductID
                  on pack.ProductID = p.ProductID
              on ppp.ProductPackID = pack.ProductPackID
            inner join ProductPeriodic pp on ppp.ProductPeriodicID = pp.ProductPeriodicID
            left join SpecialContainerIndicator sci on ppp.SpecialContainerIndicatorCode = sci.SpecialContainerIndicatorCode
          on iv.ProductPackPeriodicID = ppp.ProductPackPeriodicID
        left join ItemVersionEndorsement ive
            inner join EndorsementPeriodic ep on ive.EndorsementPeriodicID = ep.EndorsementPeriodicID
          on iv.ItemVersionID = ive.ItemVersionID
      on i.ItemID = iv.ItemID
  on d.DocumentID = i.DocumentID
where d.DocumentID = 33 -- example value
This may also make the relationship requirements clearer.
Thanks in advance.
For scenarios like this, I write specific stored procedures and then use EFExtensions with a custom materializer, which gives not only excellent performance but also correctly materialized entities.
I don't have any good answer for EF, but it might suit you to use a micro ORM for certain complex queries. Micro ORMs are essentially low-level wrappers over SQL, that allow you to get strongly typed objects, along with other convenience features. You can take a look at Dapper, for example, which is used by this very site for some of the bottleneck queries. It should perform very close to native SQL performance.
Let's imagine I have a table called Foo with a primary key FooID and an integer, non-unique column Bar. For some reason, in a SQL query I have to join table Foo with itself multiple times, like this:
SELECT * FROM Foo f1 INNER JOIN Foo f2 ON f2.Bar = f1.Bar INNER JOIN Foo f3 ON f3.Bar = f1.Bar...
I have to achieve this via LINQ to Entities.
Doing
ObjectContext.Foos.Join(ObjectContext.Foos, a => a.Bar, b => b.Bar, (a, b) => new {a, b})
gives me a LEFT OUTER JOIN in the resulting query, and I need inner joins; this is critical.
Of course, I might succeed if I added to the edmx as many associations of Foo with itself as necessary and then used them in my code; Entity Framework would substitute a correct inner join for each of the associations. The problem is that at design time I don't know how many joins I will need. OK, one workaround is to add as many of them as seems reasonable...
But, if nothing else, from a theoretical point of view: is it at all possible to create inner joins via EF without explicitly defining the associations?
In LINQ to SQL there was a (somewhat bizarre) way to do this via GroupJoin, like this:
ObjectContext.Foos.GroupJoin(ObjectContext.Foos, a => a.Bar, b => b.Bar, (a, b) => new {a, b}).SelectMany(o => o.b.DefaultIfEmpty(), (o, b) => new {o.a, b})
I've just tried it in EF, the trick does not work there. It still generates outer joins for me.
Any ideas?
In LINQ to Entities, below is one way to do an inner join on multiple instances of the same table:
using (ObjectContext ctx = new ObjectContext())
{
    var result = from f1 in ctx.Foo
                 join f2 in ctx.Foo on f1.bar equals f2.bar
                 join f3 in ctx.Foo on f1.bar equals f3.bar
                 select ....;
}