tuple result in slick 3 DBIO - scala

There is a table with key/value pairs.
There is another table with an auto-incremented PK.
Take the value for the key from the first table. If it is not present, insert a
default value and return it.
Query the other table based on that value.
Return the filtered result together with the value.
So I could try:
def lastLogs(limit: Long = 666): Future[(Long, Seq[VLogEntry])] = {
  val q: DBIO[(Long, Seq[VLogEntry])] = {
    for {
      existing <- kvTable.filter(_.key === "log").result.headOption
      conf = existing getOrElse KV(key = "log", value = "0")
      last = conf.value.toLong
      rds = vLogTable.filter(_.id > last).take(limit).result
      _ <- kvTable.insertOrUpdate(conf)
    } yield {
      (last, rds)
    }
  }
  db.run(q)
}
This gives a compile error:
found   : HipDAO.this.domain.dbConfig.profile.StreamingProfileAction[Seq[HipDAO.this.domain.VLogTable ...
required: Seq[db.types.Types.VLogEntry]
In Slick 2 I could call list or result on queries inside a session.
How do I do this in Slick 3?

Going through the iterations of what could be done I came to:
def lastLogs(limit: Long = 666): Future[(Long, Seq[VLogEntry])] = {
  val q: DBIO[(Long, Seq[VLogEntry])] = {
    for {
      existing <- kvTable.filter(_.key === "log").result.headOption
      conf = existing getOrElse KV(key = "log", value = "0")
      _ <- kvTable.insertOrUpdate(conf)
      last = conf.value.toLong
      rds <- vLogTable.filter(_.id > last).take(limit).result
    } yield {
      (last, rds)
    }
  }
  db.run(q)
}
Since the difference is only two characters, this seems like an easy fix.
The internet is largely silent on what StreamingProfileAction is and how to read these messages.
Some insight might come from reading Essential Slick 3.
Eventually, the way I came to see it: a DBIO is a monad, and you are supposed to flatMap it.
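A minimal sketch of that two-character difference, reusing the question's vLogTable (an implicit ExecutionContext for DBIO's map/flatMap is assumed to be in scope): inside a DBIO for-comprehension, `=` only names the action, while `<-` flatMaps it and binds its result.
val program = for {
  ran    <- vLogTable.take(10).result  // `<-` flatMaps the action: `ran` is Seq[VLogEntry]
  notRun =  vLogTable.take(10).result  // `=` just names the action: `notRun` is DBIO[Seq[VLogEntry]]
} yield (ran, notRun)
Yielding the unexecuted action (as in the first snippet) is exactly what produces the "found StreamingProfileAction ... required Seq[...]" mismatch.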

Related

Spark: calling a function inside of mapPartitionsWithIndex

I got very strange results with the following code.
I only want to take the partition data and iterate over each element X times.
Here I'm calling my function for each partition:
val myRDDResult = myRDD.mapPartitionsWithIndex(myFunction(_, _, limit), preservesPartitioning = true)
And the function is:
private def myFunction(partitionIndex: Long,
                       partitionData: Iterator[Array[(LabeledPoint, Int, Int)]],
                       limit: Int): Iterator[String] = {
  var newData = ArrayBuffer[String]()
  if (partitionData.nonEmpty) {
    val partDataMap = partitionData.next.map { case (lp, _, neighId) => (lp, neighId) }.toMap
    var newString: String = ""
    for {
      (k1, _) <- partDataMap
      i <- 0 to limit
      _ = {
        // ... some code to generate the content for `newString`
        newData.+=(newString)
      }
    } yield ()
  }
  newData.iterator
}
Here are some values obtained:
partitionData   limit   newData   newData_expected
1640            250     411138    410000 (1640*250)
16256           27      288820    438912
I don't know if I am misunderstanding some concept in my code.
I've also tried changing the for part to this idea: partDataMap.map{ elem => for (i <- 0 to limit) {....} }
Any suggestions?
First, sorry: I downvoted/upvoted your question by mis-click, and since I didn't cancel it within 10 minutes, SO kept it upvoted.
Regarding your code, I think your expected results are wrong: I took the same code as you, simplified it a little, and instead of receiving 410000 elements I got 411640 (note that 0 to limit is inclusive, so each key yields limit + 1 = 251 entries, and 1640 * 251 = 411640). Maybe I copied something incorrectly or ignored some details, but the code giving 411640 looks like this:
val limit = 250
val partitionData: Iterator[Array[Int]] = Seq((1 to 1640).toArray).toIterator
var newData = ArrayBuffer[String]()
if (partitionData.nonEmpty) {
  val partDataMap = partitionData.next.map { nr => nr.toString }
  for {
    value <- partDataMap
    i <- 0 to limit
    _ = {
      newData.+=(s"${value}_${i}")
    }
  } yield ()
}
println(s"new buffer=${newData}")
println(s"Buffer size = ${newData.size}")
Now, to answer your question about why the mapPartitionsWithIndex results differ from your expectations: IMO it's because of your conversion from the Array to a Map. If your array contains duplicated keys, each of them is counted only once after toMap. That would explain why in both cases (if we consider 411640 as the correct expected number) you receive results lower than expected. To be sure of that you can compare partDataMap.size with partitionData.next.size.
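A tiny sketch of that effect (the values are made up):
// toMap keeps only one entry per key, so duplicated keys collapse:
val withDuplicates = Array(("a", 1), ("a", 2), ("b", 3))
println(withDuplicates.length)      // 3
println(withDuplicates.toMap.size)  // 2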

Passing result of one DBIO into another

I'm new to Slick and I am trying to rewrite the following two queries to work in one transaction. My goal is to
1. check if the element exists
2. return the existing element, or create it, handling the auto-increment from MySQL
The two functions are:
def createEmail(email: String): DBIO[Email] = {
  // We create a projection of just the email column, since we're not inserting a value for the id column
  (emails.map(p => p.email)
    returning emails.map(_.id)
    into ((email, id) => Email(id, email))
  ) += email
}

def findEmail(email: String): DBIO[Option[Email]] =
  emails.filter(_.email === email).result.headOption
How can I safely chain them, i.e. first check for existence, return the object if it already exists, and if it does not exist then create it and return the new element, all in one transaction?
You could use a for comprehension:
def findOrCreate(email: String) = {
  (for {
    found <- findEmail(email)
    em <- found match {
      case Some(e) => DBIO.successful(e)
      case None    => createEmail(email)
    }
  } yield em).transactionally
}
val result = db.run(findOrCreate("batman@gotham.gov"))
// Future[Email]
With a little help from the cats library:
def findOrCreate(email: String): DBIO[Email] = {
  OptionT(findEmail(email)).getOrElseF(createEmail(email)).transactionally
}
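For that to compile, a cats Monad instance for DBIO has to be in scope. A minimal hand-rolled sketch (an assumption, not part of the original answer; it needs an implicit ExecutionContext, and this tailRecM is not stack-safe):
import cats.Monad
import cats.data.OptionT
import slick.dbio.DBIO
import scala.concurrent.ExecutionContext

implicit def dbioMonad(implicit ec: ExecutionContext): Monad[DBIO] = new Monad[DBIO] {
  def pure[A](a: A): DBIO[A] = DBIO.successful(a)
  def flatMap[A, B](fa: DBIO[A])(f: A => DBIO[B]): DBIO[B] = fa.flatMap(f)
  def tailRecM[A, B](a: A)(f: A => DBIO[Either[A, B]]): DBIO[B] =
    f(a).flatMap {
      case Left(next) => tailRecM(next)(f)
      case Right(b)   => DBIO.successful(b)
    }
}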

Scala Spark not returning value outside loop [duplicate]

I am new to Scala and Spark and would like some help in understanding why the code below isn't producing my desired outcome.
I am comparing two tables.
My desired output schema is:
case class DiscrepancyData(fieldKey: String, fieldName: String, val1: String, val2: String, valExpected: String)
When I run the code below step by step manually, I actually end up with my desired outcome: a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something, because it returns an empty list (before this code gets called there is other code that reads the tables from Hive, maps, groups, filters, etc.):
val compareCols = Set(year, nominal, adjusted_for_inflation, average_private_nonsupervisory_wage)
val key = "year"

def compare(table: RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
  var discs: ListBuffer[DiscrepancyData] = ListBuffer()

  def compareFields(fieldOne: String, fieldTwo: String, colName: String, row1: Row, row2: Row): DiscrepancyData = {
    if (fieldOne != fieldTwo) {
      DiscrepancyData(
        row1.getAs(key).toString,     // fieldKey
        colName,                      // fieldName
        row1.getAs(colName).toString, // table1Value
        row2.getAs(colName).toString, // table2Value
        row2.getAs(colName).toString) // expectedValue
    }
    else null
  }

  def comparison() {
    for (row <- table) {
      var elem1 = row._2.head      // gets the first element in the iterable
      var elem2 = row._2.tail.head // gets the second element in the iterable
      for (col <- compareCols) {
        var value1 = elem1.getAs(col).toString
        var value2 = elem2.getAs(col).toString
        var disc = compareFields(value1, value2, col, elem1, elem2)
        if (disc != null) discs += disc
      }
    }
  }

  comparison()
  discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
 nominal Double,
 adjusted_for_inflation Double,
 average_private_nonsupervisory_wage String)
Spark is a distributed computing engine. In addition to the "what is the code doing" of classic single-node computing, with Spark we also need to consider "where is the code running".
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? // whatever data
val list = mutable.ListBuffer[String]()
for {
  record <- records
  entry  <- record
} {
  list += entry
}
The Scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operation is serialized and "shipped" to the executors, where the inner operation is executed locally. We can rewrite the above like this:
records.foreach { record =>   // RDD.foreach => serializes the closure and executes it remotely
  record.foreach { entry =>   // record.foreach => local operation on the record collection
    list += entry             // this mutable list object is updated in each executor but never sent back to the driver. All updates are lost
  }
}
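To actually get those entries back to the driver, the data should be transformed on the cluster and collected explicitly; a sketch over the same records value:
// Flatten each List[String] on the executors, then bring the results to the driver:
val allEntries: Array[String] = records.flatMap(identity).collect()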
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it, what's the correct result? Or that each executor comes to a different value, which is the right one?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: do not use null as a return value. I also moved the row operations into the function. Let's rewrite the comparison operation with this in mind:
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
  val key = "year"
  val v1 = row1.getAs(colName).toString
  val v2 = row2.getAs(colName).toString
  if (v1 != v2) {
    Some(DiscrepancyData(
      row1.getAs(key).toString, // fieldKey
      colName,                  // fieldName
      v1,                       // table1Value
      v2,                       // table2Value
      v2)                       // expectedValue
    )
  } else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap { case (str, rows) =>
  val row1 = rows.head
  val row2 = rows.tail.head
  compareCols.flatMap { col => compareFields(col, row1, row2) }
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
  (str, rows) <- table
  col <- compareCols
  dis <- compareFields(col, rows.head, rows.tail.head)
} yield dis
Note that discrepancies is of type RDD[DiscrepancyData]. If we want to get the actual values to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines be updating the same Buffer? They cannot. Each JVM will be seeing its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is perfectly valid (though not idiomatic) Scala code, but it is not valid Spark code. If you replace the RDD with an Array it will work as expected.
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)

Slick 3.0: how to update a variable column list, whose number is known only at runtime

Is it possible with Slick 3.0 to update a variable list of columns, whose number is known only at runtime?
Below is an example of what I want to do (it won't compile):
var q: Query[UserTable, UserTable#TableElementType, Seq] = userTable
var columns = List[Any]()
var values = List[Any]()
if (updateCommands.name.isDefined) {
  columns = q.name :: columns
  values = updateCommands.name.get :: values
}
if (updateCommands.surname.isDefined) {
  columns = q.surname :: columns
  values = updateCommands.surname.get :: values
}
q = q.filter(_.id === updateCommands.id).map(columns).update(values)
Here is what I've done in Slick 3.1. I wasn't sure which was worse, editing a plain SQL statement or issuing multiple queries, so I decided to go with the latter, assuming the Postgres optimizer would see the same WHERE clause in the update queries of a single transaction. My update method looks like this:
def updateUser(user: User, obj: UserUpdate): Future[Unit] = {
  val actions = mutable.ArrayBuffer[DBIOAction[Int, NoStream, Write with Transactional]]()
  val query = users.withFilter(_.id === user.id)
  obj.name.foreach(v => actions += query.map(_.name).update(v))
  obj.email.foreach(v => actions += query.map(_.email).update(Option(v)))
  obj.password.foreach(v => actions += query.map(_.pwdHash).update(Option(encryptPassword(v))))
  slickDb.run(DBIO.seq(actions.map(_.transactionally): _*))
}
In Slick 3.0 a slightly different approach was adopted: instead of having updateAll-style methods, as far as I understand, a path of combinators was chosen.
So the main idea is to define some actions on the data and then combine them against the database in a single run.
Example:
// let's assume that you have some table classes defined somewhere
// then let's define some actions, they might be really different
val action = YourTable.filter(_.id === idToAssert).result
val anotherAction = AnotherTable.filter(_.pets === "fun").result
// and then we can combine them in a db.run
val combinedAction = for {
  someResult <- action
  anotherResult <- anotherAction
} yield (someResult, anotherResult)
db.run(combinedAction) // that returns the actual Future of the result type
In the same way you can deal with lists and sequences; for that please take a look here: http://slick.typesafe.com/doc/3.1.0-M1/dbio.html
DBIO has some functions that allow you to combine a list of actions into one action.
I hope the idea is clear; if you have questions you are welcome to ask in the comments.
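For instance, a list of single-column update actions built at runtime can be combined with DBIO.sequence; a sketch reusing the question's userTable (the id value and column names are just placeholders):
val updates: List[DBIO[Int]] = List(
  userTable.filter(_.id === 1L).map(_.name).update("new name"),
  userTable.filter(_.id === 1L).map(_.surname).update("new surname")
)
val combined: DBIO[Seq[Int]] = DBIO.sequence(updates)
db.run(combined.transactionally)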
To update a variable number of columns you may use this approach, which I used with Slick 3:
def update(id: Long, schedule: Schedule, fieldNames: Seq[String]): Future[_] = {
  val columns = schedules.baseTableRow.create_*.map(_.name).toSeq.filter(fieldNames.map(_.toUpperCase).contains)
  val toBeStored = schedule.withDefaults
  val actions = mutable.ArrayBuffer[DBIOAction[Int, NoStream, Write with Transactional]]()
  val query = schedules.withFilter(_.id === id)
  // this is because of limitations in slick, multiple columns are not possible to be updated!
  columns.find("NAME".equalsIgnoreCase).foreach(x => actions += query.map(_.name).update(toBeStored.name))
  columns.find("NAMESPACE".equalsIgnoreCase).foreach(x => actions += query.map(_.namespace).update(toBeStored.namespace))
  columns.find("URL".equalsIgnoreCase).foreach(x => actions += query.map(_.url).update(toBeStored.url))
  db.run(DBIO.seq(actions: _*).transactionally.withPinnedSession)
}

How to break/escape from a for loop in Scala?

I'm new to Scala and have searched a lot for the solution.
I'm querying the database and storing the value of the HTTP request, parsed as a json4s object, in response. I wait for the response and parse the JSON.
val refService = url("http://url//")
val response = Http(refService OK dispatch.as.json4s.Json)
var checkVal: Boolean = true
val json = Await.result(response, 30 seconds)
val data = json \ "data"
I want to run a loop and check if the value of "name" is present in the returned data. If it is present I want to break and assign checkVal to false. So far I have this:
for {
  JObject(obj) <- data
  JField("nameValue", JString(t)) <- obj // nameValue is the column name in the returned data
} yield {
  checkVal = if (t == name) { break } else true
}
Eclipse is giving me the following error:
type mismatch; found: List[Unit] required: List[String]
Please advise. Thank you.
One of your problems is that you have different return types in yield: if t == name the return type is the type of break, and if t != name the return type is Boolean.
In Scala there is no break operator; this behaviour is achieved using the breakable construct and calling the break() method, which actually throws an exception to exit the breakable block. Also you can use if statements (guards) in the for body to filter your results:
import scala.util.control.Breaks._

breakable {
  for {
    JObject(obj) <- data
    JField("nameValue", JString(t)) <- obj
    if t == name
  } yield {
    checkVal = false
    break()
  }
}
UPDATE:
I used this imperative approach because you are new to Scala, but it's not the Scala way. IMHO you should stick to @Imm's code in the comments to your question.
I actually don't like using pattern matching in for loops, because if for some reason data is not a JObject it won't be handled well. I prefer an approach like the one below.
data match {
  case JObject(fields) => fields.exists {
    case (name: String, value: JString) => name == "nameValue" && value.s == "name"
    case _ => false
  }
  case _ => false // handle error as not a JObject
}
Edit: revised to include your matches.
I would suggest using exists, as it evaluates lazily over the collection members.
Code:
val list = Map(
  "nameValue1" -> 1,
  "nameValue2" -> 2,
  "nameValue3" -> 3,
  "nameValue4" -> 4,
  "nameValue5" -> 5
)
val requiredHeader = "nameValue2"
var keyvalue: Int = 0
list.exists { p =>
  if (p._1.equalsIgnoreCase(requiredHeader)) keyvalue = p._2
  p._1.equalsIgnoreCase(requiredHeader)
}
if (keyvalue != 0) {
  // header present
} else {
  // header doesn't exist
}
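A slightly more idiomatic variant of the same check, avoiding the var (a sketch over the same sample map):
val keyValueOpt: Option[Int] = list.collectFirst {
  case (k, v) if k.equalsIgnoreCase(requiredHeader) => v
}
keyValueOpt match {
  case Some(v) => // header present, value is v
  case None    => // header doesn't exist
}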