How does Scala Slick determine which rows to update in this query - scala

I was asked how Scala Slick determines which rows need to be updated, given this code:
def updateFromLegacy(criteria: CertificateGenerationState, fieldA: CertificateGenerationState, fieldB: Option[CertificateNotification]) = {
val a: Query[CertificateStatuses, CertificateStatus, Seq] = CertificateStatuses.table.filter(status => status.certificateState === criteria)
val b: Query[(Column[CertificateGenerationState], Column[Option[CertificateNotification]]), (CertificateGenerationState, Option[CertificateNotification]), Seq] = a.map(statusToUpdate => (statusToUpdate.certificateState, statusToUpdate.notification))
val c: (CertificateGenerationState, Option[CertificateNotification]) = (fieldA, fieldB)
b.update(c)
}
The above code is (as I see it):
a) looking for all rows that have "criteria" for "certificateState"
b) a query for said columns is created
c) a tuple with the values I want to update to is created
Then the query is used to find the rows where the tuple needs to be applied.
Background
I wonder where Slick keeps track of the IDs of the rows to update.
What I would like to find out:
What is happening behind the covers?
What is Seq in "val a: Query[CertificateStatuses, CertificateStatus, Seq]"
Can someone maybe point out the Slick source where the moving parts are located?

OK - I reformatted your code a little bit to make it easier to read here and divided it into chunks. Let's go through it one by one:
val a: Query[CertificateStatuses, CertificateStatus, Seq] =
CertificateStatuses.table
.filter(status => status.certificateState === criteria)
The above is a query that translates roughly to something along these lines:
SELECT * -- Slick would list all your columns here, but it's essentially the same thing
FROM certificate_statuses
WHERE certificate_state = $criteria
Below, this query is mapped; that is, a SQL projection is applied to it:
val b: Query[
(Column[CertificateGenerationState], Column[Option[CertificateNotification]]),
(CertificateGenerationState, Option[CertificateNotification]),
Seq] = a.map(statusToUpdate =>
(statusToUpdate.certificateState, statusToUpdate.notification))
So instead of * you will have this:
SELECT certificate_state, notification
FROM certificate_statuses
WHERE certificate_state = $criteria
And the last part reuses this constructed query to perform an update:
val c: (CertificateGenerationState, Option[CertificateNotification]) =
(fieldA, fieldB)
b.update(c)
Translates to:
UPDATE certificate_statuses
SET certificate_state = $fieldA, notification = $fieldB
WHERE certificate_state = $criteria
I understand that the last step may be a little bit less straightforward than the others, but that's essentially how you do updates with Slick (here - although it's in the monadic version).
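As a side note, the snippet above only builds the update. A minimal sketch of actually executing it, assuming Slick 3.x with a Database instance named db (the Column types in your question suggest Slick 2.x, where update runs immediately and needs an implicit Session in scope instead):
import scala.concurrent.Future

// Slick 3.x style (an assumption): update(...) builds a DBIO action that is only
// sent to the database when passed to db.run; the Int is the number of affected rows.
val rowsAffected: Future[Int] = db.run(b.update(c))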
As for your questions:
What is happening behind the covers?
This is actually outside of my area of expertise. That being said, it's a relatively straightforward piece of code, and I guess the update transformation may be of some interest. I provided a link to the relevant piece of the Slick sources at the end of this answer.
What is Seq in "val a:Query[CertificateStatuses, CertificateStatus, Seq]"
It's the collection type. Query takes 3 type parameters:
mixed type - the Slick representation of your table (or column - Rep)
unpacked type - the type you get after executing the query
collection type - the collection type in which the above unpacked values are placed for you as the result of the query
So, as an example:
CertificateStatuses - this is your Slick table definition
CertificateStatus - this is your case class
Seq - this is how your results will be retrieved (basically a Seq[CertificateStatus])
I have it explained here: http://slides.com/pdolega/slick-101#/47 (and 3 next slides or so)
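To make those three type parameters concrete, here is a rough sketch of what the table definition (not shown in the question) might look like - the column names, the id field and the implicit mappers for the enum-like types are assumptions:
// Assumes the profile's api._ is imported and implicit column mappers exist
// for CertificateGenerationState and CertificateNotification.
class CertificateStatuses(tag: Tag) extends Table[CertificateStatus](tag, "certificate_statuses") {
  def id               = column[Long]("id", O.PrimaryKey)
  def certificateState = column[CertificateGenerationState]("certificate_state")
  def notification     = column[Option[CertificateNotification]]("notification")
  def * = (id, certificateState, notification) <> (CertificateStatus.tupled, CertificateStatus.unapply)
}
// Query[CertificateStatuses, CertificateStatus, Seq]:
//   mixed type      = CertificateStatuses (the lifted table row)
//   unpacked type   = CertificateStatus   (the case class you get back)
//   collection type = Seq                 (results arrive as Seq[CertificateStatus])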
Can someone maybe point out the slick source where the moving parts are located?
I think this part may be of interest - it shows how query is converted in update statement: https://github.com/slick/slick/blob/51e14f2756ed29b8c92a24b0ae24f2acd0b85c6f/slick/src/main/scala/slick/jdbc/JdbcActionComponent.scala#L320
It may also be worth emphasizing this:
I wonder where Slick keeps track of the IDs of the rows to update.
It doesn't. Look at the generated SQL. You can see it by adding the following configuration to your logging (but you also have it in this answer):
<logger name="slick.jdbc.JdbcBackend.statement" level="DEBUG" />
(I assumed Logback above.)
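If you only want to inspect the statement rather than log it, a small sketch (assuming the profile's api._ is in scope; updateStatement is available in both Slick 2.x and 3.x):
// Returns the SQL Slick generates for this update query without executing it -
// note there is no list of row IDs anywhere, just the original WHERE clause.
val updateSql: String = b.updateStatement
println(updateSql)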

Related

Using getOrElse to return nothing if option type is None (Scala)

I am using an RDD to create a left outer join, and so far I have the following results:
scala> LeftJoinedDataset.foreach(println)
(300000004,Trevor,Parr,Some((35 Jedburgh Road,PL23 6BA)))
(300000006,Ava,Coleman,None)
(200000008,Lisa,Knox,None)
(100000007,Dorothy,Thomson,None)
(400000002,Jasmine,Miller,Some((68 High Street,LE16 3PH)))
(300000009,Ruth,Campbell,None)
(100000005,Deirdre,Pullman,Some((63 Crown Street,SW99 2HY)))
(100000010,Dominic,Parr,None)
(100000001,Simon,Walsh,Some((99 Newgate Street,PA5 9UY)))
(100000003,Liam,Brown,Some((9 Earls Avenue,ML12 2DY)))
To remove the None and Some I have so far used the getOrElse code below:
scala> val LeftJoinedDataset = LeftJoin.map(x=>(x._1,x._2._1._1,x._2._1._2,x._2._2.getOrElse(None)))
This prints out:
scala> LeftJoinedDataset.foreach(println)
(300000004,Trevor,Parr,(35 Jedburgh Road,PL23 6BA))
(300000006,Ava,Coleman,None)
(200000008,Lisa,Knox,None)
(100000007,Dorothy,Thomson,None)
(400000002,Jasmine,Miller,(68 High Street,LE16 3PH))
(300000009,Ruth,Campbell,None)
(100000005,Deirdre,Pullman,(63 Crown Street,SW99 2HY))
(100000010,Dominic,Parr,None)
(100000001,Simon,Walsh,(99 Newgate Street,PA5 9UY))
(100000003,Liam,Brown,(9 Earls Avenue,ML12 2DY))
Although the Some has gone, I still want to remove the None and return no data, e.g.
(300000006,Ava,Coleman) instead of (300000006,Ava,Coleman,None)
How can I do this?
Many thanks.
You can't have a different number of columns in different rows of the same dataset, so you'll have to either drop that column altogether, or deal with the Option values, or fill them with something else (e.g. empty strings).
But just having an Option in that column seems like the best way - it shows the consumer that this data may be absent.
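If you do want to go the "fill them with something else" route, a small sketch, assuming the joined value is ((firstName, lastName), Option[(address, postcode)]) as the printed rows suggest:
// Keep four columns in every row by substituting empty strings for a missing address.
val LeftJoinedDataset = LeftJoin.map { x =>
  (x._1, x._2._1._1, x._2._1._2, x._2._2.getOrElse(("", "")))
}
LeftJoinedDataset.foreach(println)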

How to convert a type Any List to a type Double (Scala)

I am new to Scala and I would like to understand some basic stuff.
First of all, I need to calculate the average of a certain column of a DataFrame and use the result as a Double variable.
After some Internet research I was able to calculate the average and, at the same time, put it into a List of type Any by using the following command:
val avgX_List = mainDataFrame.groupBy().agg(mean("_c1")).collect().map(_(0)).toList
where "_c1" is the second column of my dataframe. This line of code returns a List with type List[Any].
To pass the result into a variable I used the following command:
var avgX = avgX_List(0)
hoping that the var avgX would automatically be of type Double, but that obviously didn't happen.
So now let the questions begin:
What does map(_(0)) do? I know the basic definition of the map() transformation, but I can't find an explanation with this exact argument.
I know that by using the .toList method at the end of the command my result will be a List of type Any. Is there a way I could change this into a List containing Double elements, or even convert this one?
Do you think it would be much more appropriate to put the column of my DataFrame into a List[Double] and then calculate the average of its elements?
Is the solution I showed above correct from any point of view, given my problem? I know that "it is working" is different from "correct solution".
Summing up, I need to calculate the average of a certain column of a DataFrame and have the result as a Double variable.
Note: I am Greek and I sometimes find it hard to understand some English coding "slang".
map(_(0)) is a shortcut for map( (r: Row) => r(0) ), which is in turn a shortcut for map( (r: Row) => r.apply(0) ). The apply method returns Any, and so you are losing the right type. Try using map(_.getAs[Double](0)) or map(_.getDouble(0)) instead.
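Applied to your original line, that would look roughly like this (same structure, but keeping the element type as Double):
import org.apache.spark.sql.functions.mean

val avgX_List: List[Double] =
  mainDataFrame.groupBy().agg(mean("_c1")).collect().map(_.getAs[Double](0)).toList
val avgX: Double = avgX_List(0)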
Collecting all entries of the column and then computing the average would be highly counterproductive, because you'd have to send huge amounts of data to the master node, and then do all the calculations on this single central node. That would be the exact opposite of what Spark is good for.
You also don't need collect(...).toList, because you can access the 0-th entry directly (it doesn't matter whether you get it from an Array or from a List). Since you are collapsing everything into a single Row anyway, you could get rid of the map step entirely by reordering the methods a little bit:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).collect()(0).getDouble(0)
It can be written even shorter using the first method:
val avgX = mainDataFrame.groupBy().agg(mean("_c1")).first().getDouble(0)
The Any data type in Scala can't be converted directly to Double. Use toString and then toDouble on the final captured result, e.g.:
scala> x
res22: Any = 1.0
scala> x.toString.toDouble
res23: Double = 1.0
Note: instead of using map(...).toList, use (0)(0) directly to get the final value from your result set.
Test sample (Scala):
val wa = Array("one","two","two")
val wrdd = sc.parallelize(wa,3).map(x=>(x,1))
val wdf = wrdd.toDF("col1","col2")
val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
Output:
scala> val x = wdf.groupBy().agg(mean("col2")).collect()(0)(0).toString.toDouble
x: Double = 1.0

What's the simplest way to get a Spark DataFrame from arbitrary Array Data in Scala?

I've been banging my head against this one for a couple of days now. It feels like it should be intuitively easy... I really hope someone can help!
I've built an org.nd4j.linalg.api.ndarray.INDArray of word occurrences from some semi-structured data, like this:
import org.nd4j.linalg.factory.Nd4j
import org.nd4s.Implicits._
val docMap = collection.mutable.Map[Int,Map[Int,Int]]() // of the form Map(phrase -> Map(phrasePosition -> word))
val words = ArrayBuffer("word_1","word_2","word_3",..."word_n")
val windows = ArrayBuffer("$phrase,$phrasePosition_1","$phrase,$phrasePosition_2",..."$phrase,$phrasePosition_n")
var matrix = Nd4j.create(windows.length*words.length).reshape(windows.length,words.length)
for (row <- 0 until matrix.shape(0)){
for(column <- 0 until matrix.shape(1)){
//+1 to (row,column) if word occurs at phrase, phrasePosition indicated by window_n.
}
}
val finalmatrix = matrix.T.dot(matrix) // to get co-occurrence matrix
So far so good...
Downstream of this point I need to integrate the data into an existing pipeline in Spark and use its implementation of PCA etc., so I need to create a DataFrame, or at least an RDD. If I knew the number of words and/or windows in advance I could do something like:
case class Row(window : String, word_1 : Double, word_2 : Double, ...etc)
val dfSeq = ArrayBuffer[Row]()
for (row <- 0 until matrix.shape(0)){
dfSeq += Row(windows(row),matrix.get(NDArrayIndex.point(row), NDArrayIndex.all()))
}
sc.parallelize(dfSeq).toDF("window","word_1","word_2",...etc)
but the number of windows and words is determined at runtime. I'm looking for a Windows x Words org.apache.spark.sql.DataFrame as output; the input is a Windows x Words org.nd4j.linalg.api.ndarray.INDArray.
Thanks in advance for any help you can offer.
OK, so after several days' work it looks like the simple answer is: there isn't one. In fact, it looks like trying to use Nd4j in this context at all is a bad idea, for several reasons:
It's (really) hard to get data out of the native INDArray format once you've put it in.
Even using something like Guava, the .data() method brings everything onto the heap, which will quickly become expensive.
You've got the added hassle of having to compile an assembly jar or use HDFS etc. to handle the library itself.
I did also consider using Breeze, which may actually provide a viable solution, but it carries some of the same problems and can't be used on distributed data structures.
Unfortunately, using native Spark / Scala datatypes, although easier once you know how, is - for someone like me coming from Python + numpy + pandas heaven at least - painfully convoluted and ugly.
Nevertheless, I did implement this solution successfully:
import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
//first make a pseudo-matrix from Scala Array[Double]:
var rowSeq = Seq.fill(windows.length)(Array.fill(words.length)(0d))
//iterate through 'rows' and 'columns' to fill it:
for (row <- 0 until windows.length){
for (column <- 0 until words.length){
// rowSeq(row)(column) += 1 if word occurs at phrase, phrasePosition indicated by window_n.
}
}
//create Spark DenseMatrix
val rows : Array[Double] = rowSeq.transpose.flatten.toArray
val matrix = new DenseMatrix(windows.length,words.length,rows)
One of the main operations that I needed Nd4j for was matrix.T.dot(matrix), but it turns out that you can't multiply two matrices of type org.apache.spark.mllib.linalg.DenseMatrix together; one of them (A) has to be an org.apache.spark.mllib.linalg.distributed.RowMatrix and - you guessed it - you can't call matrix.transpose() on a RowMatrix, only on a DenseMatrix! Since it's not really relevant to the question, I'll leave that part out, except to explain that what comes out of that step is a RowMatrix. Credit is also due here and here for the final part of the solution:
val rowMatrix : RowMatrix = transposeAndDotDenseMatrix(matrix)
// get DataFrame from RowMatrix via DenseMatrix
val newdense = new DenseMatrix(rowMatrix.numRows().toInt,rowMatrix.numCols().toInt,rowMatrix.rows.collect.flatMap(x => x.toArray)) // the call to collect() here is undesirable...
val matrixRows = newdense.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Rows")
// then separate columns:
val df2 = (0 until words.length).foldLeft(df)((df, num) =>
df.withColumn(words(num), $"Rows".getItem(num)))
.drop("Rows")
Would love to hear improvements and suggestions on this, thanks.

Strange slowdown in some simple scala code

I am processing a large number of records (CDRs) that are essentially (who, where, how much). To save space I use a lookup to map the strings to integers, and I aggregate the traffic in a map of maps (who maps to a map of (where maps to how much)):
type CDR = (String, String, Int)
type Lookup = scala.collection.mutable.HashMap[String, (Int, Float)]
type Traffic = scala.collection.mutable.HashMap[Int,scala.collection.mutable.HashMap[Int,Int]]
I have found a strange behavior: when I build the lookup tables in advance, the code runs as expected; however, when I start processing and build the maps on the fly, it slows down as it processes the records.
I use the same function to build the lookup tables for this comparison. I essentially check if the code for the lookup is there; if not, I create a new entry (it is a mutable map), like this:
def index(id: String, map: Lookup, reverse: Reverse): Int = {
if (map.contains(id)) {
map(id)._1
} else {
val number = if (map.keys.size == 0) 0 else reverse.keys.max + 1
reverse += ( number -> id)
map += (id -> (number, 0.toFloat))
number
}
}
Am I missing something here?
EDIT ----> I can no longer reproduce the slowdown. I will assume I was either too tired or dumber than usual. The running time now seems to be the same as I expected it to be.
What is mapCellRvs? The default Scala Map's .size (and .keys.size, which is the same thing) simply counts all elements by scanning them linearly.
Try replacing mapCellRvs.keys.size == 0 with mapCellRvs.isEmpty ...
Also, reverse.keys.max is linear as well. You may want to just remember the max somewhere separately, rather than compute it every time.
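Putting both suggestions together, a sketch of how the index function could look (the Reverse alias is not shown in the question, so a HashMap[Int, String] is assumed here):
import scala.collection.mutable

type Lookup  = mutable.HashMap[String, (Int, Float)]
type Reverse = mutable.HashMap[Int, String]  // assumed shape of the Reverse alias

// Keep a running counter instead of recomputing reverse.keys.max on every miss,
// and avoid the keys.size scan entirely.
var nextId = 0

def index(id: String, map: Lookup, reverse: Reverse): Int =
  map.get(id) match {
    case Some((number, _)) => number
    case None =>
      val number = nextId
      nextId += 1
      reverse += (number -> id)
      map += (id -> (number, 0.toFloat))
      number
  }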

Is it possible to return a map of key values using Gremlin Scala

Currently I have two Gremlin queries which fetch two different values, and I am populating them into a map.
Scenario: A->B, A->C, A->D
My queries are below:
graph.V().has(ID,A).out().label().toList()
This fetches the list of outE labels of A.
Result: List(B,C,D)
graph.traversal().V().has("ID",A).outE("interference").as("x").otherV().has("ID",B).select("x").values("value").headOption()
Given A and B, this gets the edge property value (A->B).
Result: 10
Is it possible to combine both these queries to get a result like Map[(B,10),(C,11),(D,12)]?
I am facing some performance issues when I have two queries; it's taking more time.
There is probably a better way to do this but I managed to get something with the following traversal:
gremlin> graph.traversal().V().has("ID","A").outE("interference").as("x").otherV().has("ID").label().as("y").select("x").by("value").as("z").select("y", "z").select(values);
==>[B,1]
==>[C,2]
I would wait for more answers though as I suspect there is a better traversal out there.
The following is working in Scala:
val b = StepLabel[Edge]()
val y = StepLabel[Label]()
val z = StepLabel[Integer]()
graph.traversal().V().has("ID",A).outE("interference").as(b)
.otherV().label().as(y)
.select(b).values("name").as(z)
.select((y,z)).toMap[String,Integer]
This will return a Map[String,Int].