Performance slick query one to many - scala

I'm using Play 2.5 and Slick 3.1.1, and I'm trying to build an optimal query for multiple one-to-many and one-to-one relations. I have the following DB model:
case class Accommodation(id: Option[Long], landlordId: Long, name: String)
case class LandLord(id: Option[Long], name: String)
case class Address(id: Option[Long], accommodationId: Long, street: String)
case class ExtraCharge(id: Option[Long], accommodationId: Long, title: String)
For data output:
case class AccommodationFull(accommodation: Accommodation, landLord: LandLord, extraCharges:Seq[ExtraCharge], addresses:Seq[Address])
I've created two queries to get an accommodation by id:
/** Retrieve an accommodation by id. */
def findByIdFullMultipleQueries(id: Long): Future[Option[AccommodationFull]] = {
  val q = for {
    (a, l) <- accommodations join landLords on (_.landlordId === _.id)
    if a.id === id
  } yield (a, l)

  for {
    data <- db.run(q.result.headOption)
    ex   <- db.run(extraCharges.filter(_.accommodationId === id).result)
    add  <- db.run(addresses.filter(_.accommodationId === id).result)
  } yield data.map { accLord => AccommodationFull(accLord._1, accLord._2, ex, add) }
}
/** Retrieve an accommodation by id. */
def findByIdFull(id: Long): Future[Option[AccommodationFull]] = {
  val qr = accommodations.filter(_.id === id).join(landLords).on(_.landlordId === _.id)
    .joinLeft(extraCharges).on(_._1.id === _.accommodationId)
    .joinLeft(addresses).on(_._1._1.id === _.accommodationId)
    .result.map { res =>
      res.groupBy(_._1._1._1.id).headOption.map {
        case (k, v) =>
          val addresses = v.flatMap(_._2).distinct
          val extraCharges = v.flatMap(_._1._2).distinct
          val landLord = v.map(_._1._1._2).head
          val accommodation = v.map(_._1._1._1).head
          AccommodationFull(accommodation, landLord, extraCharges, addresses)
      }
    }
  db.run(qr)
}
After testing, the multiple-query version is about 5x faster than the join. How can I write a more optimal join query?
=== Update ===
I'm now testing on PostgreSQL 9.3 with this data:
private[bootstrap] object InitialData {
  def landLords = (1L to 10000L).map { id =>
    LandLord(Some(id), s"Good LandLord $id")
  }

  def accommodations = (1L to 10000L).map { id =>
    Accommodation(Some(id), s"Nice house $id", 100 * id, 3, 5, 500, 1L, None)
  }

  def extraCharge = (1L to 10000L).flatMap { id =>
    (1 to 100).map { nr =>
      ExtraCharge(None, id, s"Extra $nr", 100.0)
    }
  }

  def addresses = (1L to 1000L).flatMap { id =>
    (1 to 100).map { nr =>
      Address(None, id, s"Słoneczna 4 - $nr", "17-200", "", "PL")
    }
  }
}
and here are the results for multiple runs (ms):
JOIN: 367
MULTI: 146
JOIN: 306
MULTI: 110
JOIN: 300
MULTI: 103
== Update 2 ==
After adding indexes it's better, but the multi-query version is still much faster:
def accommodationLandLordIdIndex = index("ACCOMMODATION__LANDLORD_ID__INDEX", landlordId, unique = false)
def addressAccommodationIdIndex = index("ADDRESS__ACCOMMODATION_ID__INDEX", accommodationId, unique = false)
def extraChargeAccommodationIdIndex = index("EXTRA_CHARGE__ACCOMMODATION_ID__INDEX", accommodationId, unique = false)
I made a test:
val multiResult = (1 to 1000).map { i =>
  val start = System.currentTimeMillis()
  Await.result(accommodationDao.findByIdFullMultipleQueries(i), Duration.Inf)
  System.currentTimeMillis() - start
}
println(s"MULTI AVG Result: ${multiResult.sum.toDouble / multiResult.length}")

val joinResult = (1 to 1000).map { i =>
  val start = System.currentTimeMillis()
  Await.result(accommodationDao.findByIdFull(i), Duration.Inf)
  System.currentTimeMillis() - start
}
println(s"JOIN AVG Result: ${joinResult.sum.toDouble / joinResult.length}")
Here are the results for two runs:
MULTI AVG Result: 3.287
JOIN AVG Result: 96.797
MULTI AVG Result: 3.206
JOIN AVG Result: 100.221

Postgres does not automatically create indexes on foreign key columns. The multi-query approach uses an index on all three tables (the primary key), while the single join query has to scan the joined tables for the desired IDs.
Try adding indexes on your accommodationId columns.
Update
While indexes would help if these were 1:1 relationships, it looks like they are all 1:many relationships. In that case, using joins and a later distinct filter returns far more data from the database than you need: with the test data above (100 extra charges and 100 addresses per accommodation), the join produces 100 × 100 = 10,000 rows for a single accommodation, whereas the three separate queries return only 1 + 100 + 100 = 201 rows.
For your data model, doing multiple queries looks like the correct way to process the data.
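As a sketch (assuming the table queries and db from the question, plus an implicit ExecutionContext in scope), the three statements can also be composed into a single DBIO and executed with one db.run call, which keeps the multi-query plan but uses a single session instead of three separate round trips to the pool:
// Sketch only: reuses the question's queries and table definitions.
def findByIdFullCombined(id: Long): Future[Option[AccommodationFull]] = {
  val action = for {
    accLord <- accommodations.join(landLords).on(_.landlordId === _.id)
                 .filter(_._1.id === id).result.headOption
    ex      <- extraCharges.filter(_.accommodationId === id).result
    add     <- addresses.filter(_.accommodationId === id).result
  } yield accLord.map { case (acc, lord) => AccommodationFull(acc, lord, ex, add) }
  db.run(action)
}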

I think it depends on your DB engine. Slick generates queries that may not be optimal (see the docs), but you need to profile the queries at the database level to understand what's happening and to optimize.
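To see what Slick actually sends to the database, you can print the generated SQL and then run it through EXPLAIN ANALYZE in psql. A sketch, assuming the table queries from the question:
// Print the SQL generated for the join query, then profile that exact
// statement on the Postgres side with EXPLAIN ANALYZE.
val joinQuery = accommodations.filter(_.id === 1L)
  .join(landLords).on(_.landlordId === _.id)
  .joinLeft(extraCharges).on(_._1.id === _.accommodationId)
  .joinLeft(addresses).on(_._1._1.id === _.accommodationId)

joinQuery.result.statements.foreach(println)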

Related

How to perform UPSERT or MERGE operation in Apache Spark?

I am trying to update and insert records into an old DataFrame using the unique column "ID", using Apache Spark.
In order to update the DataFrame, you can perform a "left_anti" join on the unique columns and then UNION it with the DataFrame that contains the new records:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, lit}

def refreshUnion(oldDS: Dataset[_], newDS: Dataset[_], usingColumns: Seq[String]): Dataset[_] = {
  val filteredNewDS = selectAndCastColumns(newDS, oldDS)
  oldDS.join(filteredNewDS, usingColumns, "left_anti")
    .select(oldDS.columns.map(columnName => col(columnName)): _*)
    .union(filteredNewDS.toDF)
}

def selectAndCastColumns(ds: Dataset[_], refDS: Dataset[_]): Dataset[_] = {
  val columns = ds.columns.toSet
  ds.select(refDS.columns.map { c =>
    if (!columns.contains(c)) {
      lit(null).cast(refDS.schema(c).dataType) as c
    } else {
      ds(c).cast(refDS.schema(c).dataType) as c
    }
  }: _*)
}
val df = refreshUnion(oldDS, newDS, Seq("ID"))
Spark DataFrames are immutable structures, so you can't update rows in place based on the ID.
The way to update a DataFrame is to merge the older DataFrame with the newer one and save the merged DataFrame to HDFS. To update the older IDs you need some de-duplication key (a timestamp, for example).
I'm adding sample code for this in Scala. You need to call the merge function with the unique id and the timestamp column name; the timestamp should be a Long.
case class DedupableDF(unique_id: String, ts: Long)

def merge(snapshot: DataFrame)(delta: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
  val mergedDf = snapshot.union(delta)
  dedupeData(mergedDf)(uniqueId, timeStampStr)
}

def dedupeData(dataFrameToDedupe: DataFrame)(uniqueId: String, timeStampStr: String): DataFrame = {
  import sqlContext.implicits._

  def removeDuplicates(duplicatedDataFrame: DataFrame): Dataset[DedupableDF] = {
    val dedupableDF = duplicatedDataFrame.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
    val mappedPairRdd =
      dedupableDF.map(row => (row.unique_id, (row.unique_id, row.ts))).rdd
    val reduceByKeyRDD = mappedPairRdd
      .reduceByKey((row1, row2) => if (row1._2 > row2._2) row1 else row2)
      .values
    reduceByKeyRDD.toDF.map(a =>
      DedupableDF(a(0).asInstanceOf[String], a(1).asInstanceOf[Long]))
  }

  /** get distinct unique_id, timestamp combinations **/
  val filteredData =
    dataFrameToDedupe.select(uniqueId, timeStampStr).distinct
  val dedupedData = removeDuplicates(filteredData)

  dataFrameToDedupe.createOrReplaceTempView("duplicatedDataFrame")
  dedupedData.createOrReplaceTempView("dedupedDataFrame")

  val dedupedDataFrame =
    sqlContext.sql(s"""select distinct duplicatedDataFrame.*
                       from duplicatedDataFrame
                       join dedupedDataFrame on
                         (duplicatedDataFrame.${uniqueId} = dedupedDataFrame.unique_id
                          and duplicatedDataFrame.${timeStampStr} = dedupedDataFrame.ts)""")
  dedupedDataFrame
}
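If you are on the DataFrame API anyway, the de-duplication step can also be expressed with a window function instead of dropping down to RDDs. A sketch under the same assumptions (a unique_id column and a Long ts column), keeping the newest row per unique_id:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch only: keep the row with the largest timestamp for every unique_id.
def dedupeByLatest(df: DataFrame, uniqueId: String, timeStampStr: String): DataFrame = {
  val w = Window.partitionBy(col(uniqueId)).orderBy(col(timeStampStr).desc)
  df.withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1)
    .drop("rn")
}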

spark: join rdd based on sequence of another rdd

I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me the following for id1 (an outer join with the lookup table):
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should have all counts, keyed by id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item, count)) }.map(row => (row._1, row._2)).groupByKey()

get(line).map(l => (l._1, l._2)).mapValues(item_count => lookup_rdd.leftOuterJoin(item_count))

def get(line: RDD[(String, Iterable[(String, Int)])]) = for {
  (id, item_cnt) <- line
  i = item_cnt.map(tuple => (tuple._1, tuple._2))
} yield (id, i)
Try the steps below; run each one on your local console to understand what's happening in detail.
The idea is to zipWithIndex and form a sequence based on lookup_rdd:
(i1,0),(i2,1)...(i5,4) and (id1,0),(id2,1)
Index of the final result = [delta (length of the lookup_rdd seq) * index of id1..id2] + index of i1...i5
So the base seq generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2)),
and then, based on the key (i1,id1), reduce and calculate the count.
val res2 = sc.parallelize(arr)  // sample_rdd
val res3 = sc.parallelize(cart) // lookup_rdd
val delta = res3.count

val res83 = res3.map(_._1).zipWithIndex
  .cartesian(res2.map(_._1).distinct.zipWithIndex)
  .map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))

val res86 = res2.map(x => ((x._2, x._1), x._3)).reduceByKey(_ + _)

val res88 = res83.leftOuterJoin(res86)

val res91 = res88.map { x =>
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1, x._2._1._2 + x1))
    case None     => (x._2._1._1, (x._1, x._2._1._2))
  }
}

val res97 = res91.sortByKey(true)
  .map(x => (x._2._1._2, List(x._2._2)))
  .reduceByKey(_ ++ _)

res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)), (id2,List(3,0,0,2,0)))

Slick3 with SQLite - autocommit seems to not be working

I'm trying to write some basic queries with Slick for an SQLite database.
Here is my code:
class MigrationLog(name: String) {
  val migrationEvents = TableQuery[MigrationEventTable]

  lazy val db: Future[SQLiteDriver.backend.DatabaseDef] = {
    val db = Database.forURL(s"jdbc:sqlite:$name.db", driver = "org.sqlite.JDBC")
    val setup = DBIO.seq(migrationEvents.schema.create)
    val createFuture = for {
      tables <- db.run(MTable.getTables)
      createResult <- if (tables.length == 0) db.run(setup) else Future.successful(())
    } yield createResult
    createFuture.map(_ => db)
  }

  val addEvent: (String, String) => Future[String] = (aggregateId, eventType) => {
    val id = java.util.UUID.randomUUID().toString
    val command = DBIO.seq(migrationEvents += (id, aggregateId, None, eventType, "CREATED", System.currentTimeMillis, None))
    db.flatMap(_.run(command).map(_ => id))
  }

  val eventSubmitted: (String, String) => Future[Unit] = (id, batchId) => {
    val q = for { e <- migrationEvents if e.id === id } yield (e.batchId, e.status, e.updatedAt)
    val updateAction = q.update(Some(batchId), "SUBMITTED", Some(System.currentTimeMillis))
    db.map(_.run(updateAction))
  }

  val eventMigrationCompleted: (String, String, String) => Future[Unit] = (batchId, id, status) => {
    val q = for { e <- migrationEvents if e.batchId === batchId && e.id === id } yield (e.status, e.updatedAt)
    val updateAction = q.update(status, Some(System.currentTimeMillis))
    db.map(_.run(updateAction))
  }

  val allEvents = () => {
    db.flatMap(_.run(migrationEvents.result))
  }
}
Here is how I'm using it:
val migrationLog = MigrationLog("test")
val events = for {
  id     <- migrationLog.addEvent("aggregateUserId", "userAccessControl")
  _      <- migrationLog.eventSubmitted(id, "batchID_generated_from_idam")
  _      <- migrationLog.eventMigrationCompleted("batchID_generated_from_idam", id, "Successful")
  events <- migrationLog.allEvents()
} yield events

events.map(_.foreach(event => event match {
  case (id, aggregateId, batchId, eventType, status, submitted, updatedAt) =>
    println(s"$id $aggregateId $batchId $eventType $status $submitted $updatedAt")
}))
The idea is to add an event first, then update it with a batchId (which also updates the status), and then update the status again when the job is done. events should contain events with status Successful.
What happens is that after running this code it prints events with status SUBMITTED. If I wait a while and run the same allEvents query, or check the db from the command line using sqlite3, then it is updated correctly.
I'm properly waiting for futures to be resolved before starting the next operation, and auto-commit should be enabled by default.
Am I missing something?
It turns out the problem was with db.map(_.run(updateAction)), which returns Future[Future[Int]], meaning the command had not finished by the time I tried to run the next query.
Replacing it with db.flatMap(_.run(updateAction)) solved the issue.
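For clarity, here is a minimal sketch of one of the corrected methods (same table and db as above; note that the return type becomes Future[Int]):
val eventSubmitted: (String, String) => Future[Int] = (id, batchId) => {
  val q = for { e <- migrationEvents if e.id === id } yield (e.batchId, e.status, e.updatedAt)
  val updateAction = q.update(Some(batchId), "SUBMITTED", Some(System.currentTimeMillis))
  // flatMap flattens Future[Future[Int]] into Future[Int], so callers really
  // wait for the UPDATE before issuing the next query
  db.flatMap(_.run(updateAction))
}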

Insert if not exists in Slick 3.0.0

I'm trying to insert if not exists; I found this post for 1.0.1 and 2.0.
I found a snippet using transactionally in the 3.0.0 docs:
val a = (for {
  ns <- coffees.filter(_.name.startsWith("ESPRESSO")).map(_.name).result
  _  <- DBIO.seq(ns.map(n => coffees.filter(_.name === n).delete): _*)
} yield ()).transactionally

val f: Future[Unit] = db.run(a)
I'm struggling to write the insert-if-not-exists logic with this structure. I'm new to Slick and have little experience with Scala. This is my attempt to do insert if not exists outside a transaction...
val result: Future[Boolean] = db.run(products.filter(_.name === "foo").exists.result)
result.map { exists =>
  if (!exists) {
    products += Product(
      None,
      productName,
      productPrice
    )
  }
}
But how do I put this in the transactionally block? This is the furthest I can go:
val a = (for {
  exists <- products.filter(_.name === "foo").exists.result
  //???
  // _ <- DBIO.seq(ns.map(n => coffees.filter(_.name === n).delete): _*)
} yield ()).transactionally
Thanks in advance
It is possible to use a single insert ... if not exists query. This avoids multiple database round-trips and race conditions (transactions may not be enough depending on isolation level).
def insertIfNotExists(name: String) = users.forceInsertQuery {
  val exists = (for (u <- users if u.name === name.bind) yield u).exists
  val insert = (name.bind, None) <> (User.apply _ tupled, User.unapply)
  for (u <- Query(insert) if !exists) yield u
}
Await.result(db.run(DBIO.seq(
  // create the schema
  users.schema.create,
  users += User("Bob"),
  users += User("Bob"),
  insertIfNotExists("Bob"),
  insertIfNotExists("Fred"),
  insertIfNotExists("Fred"),
  // print the users (select * from USERS)
  users.result.map(println)
)), Duration.Inf)
Output:
Vector(User(Bob,Some(1)), User(Bob,Some(2)), User(Fred,Some(3)))
Generated SQL:
insert into "USERS" ("NAME","ID") select ?, null where not exists(select x2."NAME", x2."ID" from "USERS" x2 where x2."NAME" = ?)
Here's the full example on github
This is the version I came up with:
val a = (
  products.filter(_.name === "foo").exists.result.flatMap { exists =>
    if (!exists) {
      products += Product(
        None,
        productName,
        productPrice
      )
    } else {
      DBIO.successful(None) // no-op
    }
  }
).transactionally
It's a bit lacking though; for example, it would be useful to return the inserted or existing object.
For completeness, here is the table definition:
case class DBProduct(id: Int, uuid: String, name: String, price: BigDecimal)

class Products(tag: Tag) extends Table[DBProduct](tag, "product") {
  def id = column[Int]("id", O.PrimaryKey, O.AutoInc) // This is the primary key column
  def uuid = column[String]("uuid")
  def name = column[String]("name")
  def price = column[BigDecimal]("price", O.SqlType("decimal(10, 4)"))

  def * = (id, uuid, name, price) <> (DBProduct.tupled, DBProduct.unapply)
}

val products = TableQuery[Products]
I'm using a mapped table; the solution also works for tuples, with minor changes.
Note also that it's not necessary to define the id as optional; according to the documentation it's ignored in insert operations:
When you include an AutoInc column in an insert operation, it is silently ignored, so that the database can generate the proper value
And here is the method:
def insertIfNotExists(productInput: ProductInput): Future[DBProduct] = {
  val productAction = (
    products.filter(_.uuid === productInput.uuid).result.headOption.flatMap {
      case Some(product) =>
        mylog("product was there: " + product)
        DBIO.successful(product)
      case None =>
        mylog("inserting product")
        val productId =
          (products returning products.map(_.id)) += DBProduct(
            0,
            productInput.uuid,
            productInput.name,
            productInput.price
          )
        productId.map { id =>
          DBProduct(
            id,
            productInput.uuid,
            productInput.name,
            productInput.price
          )
        }
    }
  ).transactionally

  db.run(productAction)
}
(Thanks Matthew Pocock from Google group thread, for orienting me to this solution).
I've run into a solution that looks more complete. Section 3.1.7, "More Control over Inserts", of the Essential Slick book has the example.
In the end you get something like:
val entity = UserEntity(UUID.randomUUID, "jay", "jay#localhost")

val exists =
  users
    .filter(u =>
      u.name === entity.name.bind
        && u.email === entity.email.bind
    )
    .exists

val selectExpression = Query(
  (
    entity.id.bind,
    entity.name.bind,
    entity.email.bind
  )
).filterNot(_ => exists)

val action = users
  .map(u => (u.id, u.name, u.email))
  .forceInsertQuery(selectExpression)

exec(action)
// res17: Int = 1

exec(action)
// res18: Int = 0
According to the Slick 3.0 manual's insert query section (http://slick.typesafe.com/doc/3.0.0/queries.html), the inserted value can be returned together with its id, as below:
def insertIfNotExists(productInput: ProductInput): Future[DBProduct] = {
  val productAction = (
    products.filter(_.uuid === productInput.uuid).result.headOption.flatMap {
      case Some(product) =>
        mylog("product was there: " + product)
        DBIO.successful(product)
      case None =>
        mylog("inserting product")
        (products returning products.map(_.id)
          into ((prod, id) => prod.copy(id = id))) += DBProduct(
          0,
          productInput.uuid,
          productInput.name,
          productInput.price
        )
    }
  ).transactionally

  db.run(productAction)
}

Slick 2.1: Return query results as a map [duplicate]

I have methods in my Play app that query database tables with over a hundred columns. I can't define a case class for each such query, because it would be ridiculously big and would have to be changed with every alteration of the table in the database.
I'm using this approach, where the result of the query looks like this:
Map(columnName1 -> columnVal1, columnName2 -> columnVal2, ...)
Example of the code:
implicit val getListStringResult = GetResult[List[Any]](
  r => (1 to r.numColumns).map(_ => r.nextObject).toList
)

def getSomething(): Map[String, Any] = DB.withSession {
  val columns = MTable.getTables(None, None, None, None).list
    .filter(_.name.name == "myTable").head.getColumns.list.map(_.column)
  sql"""SELECT * FROM myTable LIMIT 1""".as[List[Any]].firstOption.map(columns zip _ toMap).get
}
This is not a problem when the query only touches a single database and a single table. But I need to be able to use multiple tables and databases in my query, like this:
def getSomething(): Map[String, Any] = DB.withSession {
  // The line below is no longer valid because of multiple tables/databases
  val columns = MTable.getTables(None, None, None, None).list
    .filter(_.name.name == "table1").head.getColumns.list.map(_.column)

  sql"""
    SELECT *
    FROM db1.table1
    LEFT JOIN db2.table2 ON db2.table2.col1 = db1.table1.col1
    LIMIT 1
  """.as[List[Any]].firstOption.map(columns zip _ toMap).get
}
The same approach can no longer be used to retrieve the column names. This problem doesn't exist with something like PHP's PDO or Java's JdbcTemplate; these retrieve column names without any extra effort.
My question is: how do I achieve this with Slick?
import scala.slick.jdbc.{GetResult, PositionedResult}

object ResultMap extends GetResult[Map[String, Any]] {
  def apply(pr: PositionedResult) = {
    val rs = pr.rs // <- jdbc result set
    val md = rs.getMetaData()
    val res = (1 to pr.numColumns).map { i => md.getColumnName(i) -> rs.getObject(i) }.toMap
    pr.nextRow // <- use Slick's advance method to avoid endless loop
    res
  }
}

val result = sql"select * from ...".as(ResultMap).firstOption
Another variant that produces a map containing only the non-null columns (keys in lowercase):
private implicit val getMap = GetResult[Map[String, Any]](r => {
  val metadata = r.rs.getMetaData
  (1 to r.numColumns).flatMap { i =>
    val columnName = metadata.getColumnName(i).toLowerCase
    val columnValue = r.nextObjectOption
    columnValue.map(columnName -> _)
  }.toMap
})
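A usage sketch (assuming the implicit above is in scope and an implicit session is available, as in the question):
// Runs the multi-database join from the question and returns the first row
// as a column-name -> value map.
val row: Option[Map[String, Any]] =
  sql"""
    SELECT *
    FROM db1.table1
    LEFT JOIN db2.table2 ON db2.table2.col1 = db1.table1.col1
    LIMIT 1
  """.as[Map[String, Any]].firstOption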