slick db.run is not called - scala

I want to insert a record into the database, but db.run is not called.
My code looks like this:
val insertQueryStep = processStepTemplates returning processStepTemplates.map(_.id) into ((processStep, id) => processStep.copy(id = Some(id)))
/**
 * Generates a new ProcessStepTemplate
 *
 * @param step
 * @return
 */
def addProcessStepTemplateToProcessTemplate(step: ProcessStepTemplatesModel, processId: Int): Future[Some[ProcessStepTemplatesModel]] = {
  println("In DTO: " + step + ", processtemplate: " + processId)
  //val p = processStepTemplates returning processStepTemplates.map(_.id) += step
  val p = insertQueryStep += step
  db.run(p).map(id => {
    println("The query is: " + p)
    println("The generated ID is: " + id)
    //Update the foreign key
    val q = for { p <- processStepTemplates if p.id == id } yield p.processtemplate
    val updateAction = q.update(Some(processId))
    db.run(updateAction).map(id => {
      println("The new process step is: " + step)
      Some(step)
    })
    Some(step)
  })
}
What could be the problem in this case?

You should compose your futures monadically (with flatMap); with map, the inner future is never returned, so nothing waits for it to complete.
Try changing your code in the following way (see comments #1, #2):
def addProcessStepTemplateToProcessTemplate(step: ProcessStepTemplatesModel, processId: Int): Future[Some[ProcessStepTemplatesModel]] = {
  println("In DTO: " + step + ", processtemplate: " + processId)
  //val p = processStepTemplates returning processStepTemplates.map(_.id) += step
  val p = insertQueryStep += step
  db.run(p).flatMap(id => { // #1 change map to flatMap
    println("The query is: " + p)
    println("The generated ID is: " + id)
    //Update the foreign key
    val q = for { p <- processStepTemplates if p.id == id } yield p.processtemplate
    val updateAction = q.update(Some(processId))
    val innerFuture = db.run(updateAction).map(id => {
      println("The new process step is: " + step)
      Some(step)
    })
    innerFuture // #2 return the inner future
  })
}
Also enable logging to detect other issues (connected with the DB schema, queries, etc.).
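As an additional sketch (not from the original answer), the same composition can be written as a single for-comprehension, which desugars to flatMap/map. It assumes the table's id column is mapped as a plain Int and uses Slick's === for the column comparison; == inside a query compares plain Scala values and does not translate into a SQL condition:
def addProcessStepTemplateToProcessTemplate(step: ProcessStepTemplatesModel, processId: Int): Future[Some[ProcessStepTemplatesModel]] =
  for {
    inserted <- db.run(insertQueryStep += step)                    // insert, get back the copy with its generated id
    newId     = inserted.id.getOrElse(sys.error("no id returned")) // id assigned by the database
    _        <- db.run(                                            // update runs only after the insert has completed
                  processStepTemplates
                    .filter(_.id === newId)
                    .map(_.processtemplate)
                    .update(Some(processId))
                )
  } yield Some(step)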

Related

Batching of Dataset Spark scala

I am trying to create batches of rows of a Dataset in Spark.
To maintain the number of records sent to a service, I want to batch the items so that I can control the rate at which the data is sent.
For:
case class Person(name:String, address: String)
case class PersonBatch(personBatch: List[Person])
For a given Dataset[Person] I want to create a Dataset[PersonBatch].
For example, if the input Dataset[Person] has 100 records, the output should be a Dataset[PersonBatch] where every PersonBatch is a list of n records (Person).
I have tried this but it didn't work:
object DataBatcher extends Logger {
  var batchList: ListBuffer[PersonBatch] = ListBuffer[PersonBatch]()
  var batchSize: Long = 500 //default batch size

  def addToBatchList(batch: PersonBatch): Unit = {
    batchList += batch
  }

  def clearBatchList(): Unit = {
    batchList.clear()
  }

  def createBatches(ds: Dataset[Person]): Dataset[PersonBatch] = {
    val dsCount = ds.count()
    logger.info(s"Count of dataset passed for creating batches : ${dsCount}")
    val batchElement = ListBuffer[Person]()
    val batch = PersonBatch(batchElement)
    ds.foreach(x => {
      batch.personBatch += x
      if (batch.personBatch.length == batchSize) {
        addToBatchList(batch)
        batch.requestBatch.clear()
      }
    })
    if (batch.personBatch.length > 0) {
      addToBatchList(batch)
      batch.personBatch.clear()
    }
    sparkSession.createDataset(batchList)
  }
}
I want to run this job on a Hadoop cluster.
Can someone help me with this?
The partition iterator has a grouped function that may be useful for you,
for example :
iter.grouped(batchSize)
Here is a sample code snippet that does a batch insert with iter.grouped(batchSize) (1000 here), inserting into a database:
import java.sql.{Connection, DriverManager}

df.repartition(numofpartitionsyouwant) // numPartitions ~ number of simultaneous DB connections you are planning to allow
def insertToTable(sqlDatabaseConnectionString: String,
                  sqlTableName: String): Unit = {
  val tableHeader: String = dataFrame.columns.mkString(",")
  dataFrame.foreachPartition { partition =>
    // NOTE: one connection per partition (a better way is to use a connection pool)
    val sqlExecutorConnection: Connection =
      DriverManager.getConnection(sqlDatabaseConnectionString)
    // A batch size of 1000 is used since some databases cannot use a batch size of more than 1000, e.g. Azure SQL
    partition.grouped(1000).foreach { group =>
      val insertString: scala.collection.mutable.StringBuilder =
        new scala.collection.mutable.StringBuilder()
      group.foreach { record =>
        insertString.append("('" + record.mkString(",") + "'),")
      }
      sqlExecutorConnection
        .createStatement()
        .executeUpdate(f"INSERT INTO [$sqlTableName] ($tableHeader) VALUES "
          + insertString.stripSuffix(","))
    }
    sqlExecutorConnection.close() // close the connection so that connections won't be exhausted
  }
}
The same per-partition approach with JDBC batch updates:
import java.sql.{DriverManager, SQLException, SQLIntegrityConstraintViolationException}
import java.util
import org.apache.spark.sql.Row

val tableHeader: String = dataFrame.columns.mkString(",")
dataFrame.foreachPartition((it: Iterator[Row]) => {
  println("partition index: ")
  val url = "jdbc:..." + "user=;password=;"
  val conn = DriverManager.getConnection(url)
  conn.setAutoCommit(true)
  val stmt = conn.createStatement()
  val batchSize = 10
  var i = 0
  while (it.hasNext) {
    val row = it.next
    try {
      stmt.addBatch(" UPDATE TABLE SET STATUS = 0 , " +
        " DATE ='" + new java.sql.Timestamp(System.currentTimeMillis()) + "'" +
        " where id = " + row.getAs("IDNUM"))
      i += 1
      if (i % batchSize == 0) {
        stmt.executeBatch
        conn.commit
      }
    } catch {
      case e: SQLIntegrityConstraintViolationException =>
      case e: SQLException =>
        e.printStackTrace()
    } finally {
      stmt.executeBatch
      conn.commit
    }
  }
  val ret = stmt.executeBatch
  System.out.println("Ret val: " + util.Arrays.toString(ret))
  System.out.println("Update count: " + stmt.getUpdateCount)
  conn.commit
  stmt.close
  conn.close() // close the connection so that connections won't be exhausted
})
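Applied to the original question, here is a minimal sketch (the toBatches name is made up for illustration) that turns a Dataset[Person] into a Dataset[PersonBatch] with mapPartitions and grouped, so nothing is collected to the driver:
import org.apache.spark.sql.{Dataset, SparkSession}

def toBatches(spark: SparkSession, ds: Dataset[Person], batchSize: Int): Dataset[PersonBatch] = {
  import spark.implicits._ // product encoder for PersonBatch
  // grouped() chunks each partition's iterator; batches never span partitions,
  // so the last batch of each partition may be smaller than batchSize.
  ds.mapPartitions(_.grouped(batchSize).map(group => PersonBatch(group.toList)))
}
Calling toBatches(spark, personDs, 500) would then yield batches of up to 500 persons per partition.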

Scala - Constructs map from two lists at initialization

I'm new to Scala programming.
I would like to have this kind of immutable map:
Map[ (Int,Int), (List[BoolVar]) ]
From these two lists :
val courseName = List("Course1","Course2")
val serieName = List("Serie1","Serie2")
My goal:
Map[0][0] // List[BoolVar] for "Course1""Serie1"
Map[0][0](0) // a BoolVar from "Course1""Serie1" List
....
I tried this but the syntax is wrong:
val test = Map[(Int, Int), (List[BoolVar])](
  for (course <- List.range(0, courseName.length))
    for (serie <- List.range(0, serieName.length))
      yield (course, serie) ->
        for (indice <- List.range(0, 48))
          yield BoolVar(courseName(course) + " - " + serieName(serie))
);
Thanks for your help
Is this what you are looking for? Just a few minor changes are needed.
But lookups will use round brackets (m((0, 0)) rather than m[0][0]).
val courseName = List("Course1","Course2")
val serieName = List("Serie1","Serie2")
val m = {
  for {
    course <- List.range(0, courseName.length)
    serie <- List.range(0, serieName.length)
  } yield (course, serie) -> {
    for (indice <- List.range(0, 48))
      yield BoolVar(courseName(course) + " - " + serieName(serie))
  }
}.toMap
println( m )
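Lookups on the result then use round brackets with a tuple key, for example (assuming BoolVar is defined as in the question):
val courseSerieVars: List[BoolVar] = m((0, 0)) // the List[BoolVar] for "Course1" / "Serie1"
val oneVar: BoolVar = m((0, 0))(0)             // a single BoolVar from that list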

Scala - append RDD to itself

for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val a = 1
  val c = fordate - 1
  for (b <- a to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    val cumilativeRDD: org.apache.spark.rdd.RDD[String] = sc.union(cumilativeRDD1, cumilativeRDD)
    if (b == c) {
      val incrementalDEviceIDs = dataRDD.subtract(cumilativeRDD)
      val countofIDs = incrementalDEviceIDs.distinct().count()
      println(s"201611 $fordate $countofIDs")
    }
  }
}
I have a data set where I get device IDs on a daily basis. I need to figure out the incremental count per day, but when I union cumilativeRDD with itself it throws the following error:
forward reference extends over definition of value cumilativeRDD
How can I overcome this?
The problem is this line:
val cumilativeRDD: org.apache.spark.rdd.RDD[String] = sc.union(cumilativeRDD1, cumilativeRDD)
You're using cumilativeRDD before its declaration. Assignment works from right to left: the right side of = defines the variable on the left, so you cannot use the variable inside its own definition, because at that point it does not yet exist.
You have to initialize cumilativeRDD in the first iteration; then you can use it in the following iterations:
var cumilativeRDD: Option[org.apache.spark.rdd.RDD[String]] = None
for (fordate <- 2 to 30) {
  val DataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  val c = fordate - 1
  for (b <- 1 to c) {
    val cumilativeRDD1 = sc.textFile("s3n://mypath/" + b + "/*")
    if (cumilativeRDD.isEmpty) cumilativeRDD = Some(cumilativeRDD1)
    else cumilativeRDD = Some(sc.union(cumilativeRDD1, cumilativeRDD.get))
    if (b == c) {
      val IncrementalDEviceIDs = DataRDD.subtract(cumilativeRDD.get)
      val countofIDs = IncrementalDEviceIDs.distinct().count()
      println("201611" + fordate + " " + countofIDs)
    }
  }
}
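If you want to avoid the mutable var altogether, here is a rough sketch (same s3n paths and sc as in the question) that collects the previous days into a Seq and unions them in one call:
for (fordate <- 2 to 30) {
  val dataRDD = sc.textFile("s3n://mypath" + fordate + "/*")
  // sc.union also accepts a Seq of RDDs, so all previous days can be unioned at once
  val cumulativeRDD = sc.union((1 until fordate).map(b => sc.textFile("s3n://mypath/" + b + "/*")))
  val countofIDs = dataRDD.subtract(cumulativeRDD).distinct().count()
  println(s"201611 $fordate $countofIDs")
}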

functional code to below imperative one in scala

I want to write a functional version of finding the pairs of elements with a given sum. Below is the imperative code:
object ArrayUtil {
  def findPairs(arr: Array[Int], sum: Int) = {
    val MAX = 50
    val binmap: Array[Boolean] = new Array[Boolean](MAX)
    for (i <- 0 until arr.length) {
      val temp: Int = sum - arr(i)
      if (temp >= 0 && binmap(temp)) {
        println("Pair with given sum " + sum + " is (" + arr(i) + ", " + temp + ")")
      }
      binmap(arr(i)) = true
    }
  }
}
Study the Standard Library.
def findPairs(arr: Array[Int], sum: Int): List[Array[Int]] =
arr.combinations(2).filter(_.sum == sum).toList
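A quick usage example with arbitrary values:
val pairs = findPairs(Array(1, 4, 45, 6, 10, 8), 16)
pairs.foreach(p => println(p.mkString("(", ", ", ")"))) // prints (6, 10)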

org.apache.spark.SparkException: Task not serializable (scala)

I am new to Scala as well as to Spark. Please help me resolve this issue.
In the spark shell, when I load the functions below individually they run without any exception, but when I copy them into a Scala object and load the same file in the spark shell, they throw a task not serializable exception in the processbatch function when trying to parallelize.
Please find the code below:
import org.apache.spark.sql.Row
import org.apache.log4j.Logger
import org.apache.spark.sql.hive.HiveContext

object Process {
  val hc = new HiveContext(sc)

  def processsingle(wait: Int, patient: org.apache.spark.sql.Row, visits: Array[org.apache.spark.sql.Row]): String = {
    var out = new StringBuilder()
    val processStart = getTimeInMillis()
    for (x <- visits) {
      out.append(", " + x.getAs("patientid") + ":" + x.getAs("visitid"))
    }
  }

  def processbatch(batch: Int, wait: Int, patients: Array[org.apache.spark.sql.Row], visits: Array[org.apache.spark.sql.Row]) = {
    val out = sc.parallelize(patients, batch).map(r => processsingle(wait, r, visits.filter(f => f.getAs("patientid") == r.getAs("patientid")))).collect()
    for (x <- out) println(x)
  }

  def processmeasures(fetch: Int, batch: Int, wait: Int) = {
    val patients = hc.sql("SELECT patientid FROM tableName1 order by p_id").collect()
    val visit = hc.sql("SELECT patientid, visitid FROM tableName2")
    val count = patients.length
    val fetches = if (count % fetch > 0) (count / fetch + 1) else (count / fetch)
    for (i <- 0 to fetches.toInt - 1) {
      val startFetch = i * fetch
      val endFetch = math.min((i + 1) * fetch, count.toInt) - 1
      val fetchSize = endFetch - startFetch + 1
      val fetchClause = "patientid >= " + patients(startFetch).get(0) + " and patientid <= " + patients(endFetch).get(0)
      val fetchVisit = visit.filter(fetchClause).collect()
      val batches = if (fetchSize % batch > 0) (fetchSize / batch + 1) else (fetchSize / batch)
      for (j <- 0 to batches.toInt - 1) {
        val startBatch = j * batch
        val endBatch = math.min((j + 1) * batch, fetch.toInt) - 1
        println(s"Batch from $startBatch to $endBatch")
        val batchVisits = fetchVisit.filter(g => g.getAs[Long]("patientid") >= patients(i * fetch + startBatch).getLong(0) && g.getAs[Long]("patientid") <= patients(math.min(i * fetch + endBatch + 1, endFetch)).getLong(0))
        processbatch(batch, wait, patients.slice(i * fetch + startBatch, i * fetch + endBatch + 1), batchVisits)
      }
    }
    println("Processing took " + getExecutionTime(processStart) + " millis")
  }
}
You should make the Process object Serializable:
object Process extends Serializable {
...
}
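As a minimal, self-contained illustration of the same error and fix (the names below are made up and unrelated to the code above): a closure used in an RDD transformation captures the instance whose method it calls, so that instance's class must be serializable:
import org.apache.spark.{SparkConf, SparkContext}

// Not serializable: adding `extends Serializable` (as suggested above) fixes the error.
class Multiplier(factor: Int) {
  def apply(x: Int): Int = x * factor
}

object TaskNotSerializableDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
    val m = new Multiplier(2)
    try {
      // The closure captures `m`; since Multiplier is not Serializable,
      // Spark throws org.apache.spark.SparkException: Task not serializable.
      sc.parallelize(1 to 5).map(m.apply).collect()
    } catch {
      case e: org.apache.spark.SparkException => println(e.getMessage)
    }
    sc.stop()
  }
}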