Doobie streaming query causes invalid operation - scala

Doobie
Without much experience in either Scala or Doobie, I am trying to select data from a DB2 database. The following query works fine and prints 5 employees, as expected.
import doobie.imports._, scalaz.effect.IO

object ScalaDoobieSelect extends App {

  val urlPrefix = "jdbc:db2:"
  val schema = "SCHEMA"
  val obdcName = "ODBC"
  val url = urlPrefix + obdcName + ":" +
    "currentSchema=" + schema + ";" +
    "currentFunctionPath=" + schema + ";"
  val driver = "com.ibm.db2.jcc.DB2Driver"
  val username = "username"
  val password = "password"

  implicit val han = LogHandler.jdkLogHandler // (ii)

  val xa = DriverManagerTransactor[IO](
    driver, url, username, password
  )

  case class User(id: String, name: String)

  def find(): ConnectionIO[List[User]] =
    sql"SELECT ID, NAME FROM EMPLOYEE FETCH FIRST 10 ROWS ONLY"
      .query[User]
      .process
      .take(5) // (i)
      .list

  find()
    .transact(xa)
    .unsafePerformIO
    .foreach(e => println("ID = %s, NAME = %s".format(e.id, e.name)))
}
Issue
When I want to read all selected rows and remove take(5), so that I have .process.list instead of .process.take(5).list, I get the following error. (i)
com.ibm.db2.jcc.am.SqlException: [jcc][t4][10120][10898][3.64.133] Invalid operation: result set is closed. ERRORCODE=-4470, SQLSTATE=null
I am wondering what take(5) changes so that no error is returned. To get more information about the invalid operation, I tried to enable logging. (ii) Unfortunately, logging is not supported for streaming. How can I get more information about which operation causes this error?
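As a side note, if streaming is not strictly needed for this bounded query, the sketch below avoids .process entirely and materializes the rows directly. This is an assumption on my part (reusing the imports, User case class and xa transactor from above), not a diagnosis of the underlying error.

  // Hedged sketch, not from the original question: materialize the bounded
  // result without streaming, using the same doobie version as above.
  def findAll(): ConnectionIO[List[User]] =
    sql"SELECT ID, NAME FROM EMPLOYEE FETCH FIRST 10 ROWS ONLY"
      .query[User]
      .list

  findAll()
    .transact(xa)
    .unsafePerformIO
    .foreach(e => println("ID = %s, NAME = %s".format(e.id, e.name)))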
Plain JDBC
The plain JDBC query below, which in my opinion is equivalent, works as expected and returns all 10 rows.
import java.sql.{Connection, DriverManager}

object ScalaJdbcConnectSelect extends App {

  val urlPrefix = "jdbc:db2:"
  val schema = "SCHEMA"
  val obdcName = "ODBC"
  val url = urlPrefix + obdcName + ":" +
    "currentSchema=" + schema + ";" +
    "currentFunctionPath=" + schema + ";"
  val driver = "com.ibm.db2.jcc.DB2Driver"
  val username = "username"
  val password = "password"

  var connection: Connection = _

  try {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, username, password)
    val statement = connection.createStatement
    val rs = statement.executeQuery(
      "SELECT ID, NAME FROM EMPLOYEE FETCH FIRST 10 ROWS ONLY"
    )
    while (rs.next) {
      val id = rs.getString("ID")
      val name = rs.getString("NAME")
      println("ID = %s, NAME = %s".format(id, name))
    }
  } catch {
    case e: Exception => e.printStackTrace
  }
  connection.close
}
Environment
As can be seen in the error message, I am using db2jcc.jar version 3.64.133. The DB2 server is version 11.

Related

Batching of Dataset Spark scala

I am trying to create batches of rows of a Dataset in Spark.
To control the number of records sent to a service, I want to batch the items so that I can maintain the rate at which the data is sent.
For
case class Person(name: String, address: String)
case class PersonBatch(personBatch: List[Person])
and a given Dataset[Person], I want to create a Dataset[PersonBatch].
For example, if the input Dataset[Person] has 100 records, the output should be a Dataset[PersonBatch] where every PersonBatch is a list of n Person records.
I have tried this, but it didn't work:
object DataBatcher extends Logger {

  var batchList: ListBuffer[PersonBatch] = ListBuffer[PersonBatch]()
  var batchSize: Long = 500 // default batch size

  def addToBatchList(batch: PersonBatch): Unit = {
    batchList += batch
  }

  def clearBatchList(): Unit = {
    batchList.clear()
  }

  def createBatches(ds: Dataset[Person]): Dataset[PersonBatch] = {
    val dsCount = ds.count()
    logger.info(s"Count of dataset passed for creating batches : ${dsCount}")
    val batchElement = ListBuffer[Person]()
    val batch = PersonBatch(batchElement)
    ds.foreach(x => {
      batch.personBatch += x
      if (batch.personBatch.length == batchSize) {
        addToBatchList(batch)
        batch.requestBatch.clear()
      }
    })
    if (batch.personBatch.length > 0) {
      addToBatchList(batch)
      batch.personBatch.clear()
    }
    sparkSession.createDataset(batchList)
  }
}
I want to run this job on a Hadoop cluster.
Can someone help me with this?
The iterator you get for each partition (e.g. inside foreachPartition or mapPartitions) has a grouped function that may be useful for you, for example:
iter.grouped(batchSize)
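For the Dataset[PersonBatch] case specifically, here is a minimal sketch (an assumption on my part, not tested against your job): it assumes Spark 2.x, the Person and PersonBatch case classes from the question defined at top level, and an implicit SparkSession; batchDataset is a hypothetical helper name.

import org.apache.spark.sql.{Dataset, SparkSession}

// Minimal sketch: group each partition's iterator into fixed-size batches.
def batchDataset(ds: Dataset[Person], batchSize: Int)(implicit spark: SparkSession): Dataset[PersonBatch] = {
  import spark.implicits._
  ds.mapPartitions(_.grouped(batchSize).map(group => PersonBatch(group.toList)))
}

Note that batches are formed per partition, so the last group in each partition may contain fewer than batchSize records.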
Below is a sample code snippet that does a batch insert with iter.grouped(batchSize), here 1000, to insert into a database:
df.repartition(numofpartitionsyouwant) // numPartitions ~ number of simultaneous DB connections you plan to allow

def insertToTable(sqlDatabaseConnectionString: String,
                  sqlTableName: String): Unit = {
  val tableHeader: String = dataFrame.columns.mkString(",")
  dataFrame.foreachPartition { partition =>
    // NOTE: one connection per partition (a better way is to use a connection pool)
    val sqlExecutorConnection: Connection =
      DriverManager.getConnection(sqlDatabaseConnectionString)
    // A batch size of 1000 is used since some databases can't handle batches larger than 1000, e.g. Azure SQL
    partition.grouped(1000).foreach { group =>
      val insertString: scala.collection.mutable.StringBuilder =
        new scala.collection.mutable.StringBuilder()
      group.foreach { record =>
        insertString.append("('" + record.mkString(",") + "'),")
      }
      sqlExecutorConnection
        .createStatement()
        .executeUpdate(f"INSERT INTO [$sqlTableName] ($tableHeader) VALUES "
          + insertString.stripSuffix(","))
    }
    sqlExecutorConnection.close() // close the connection so that connections won't be exhausted
  }
}
val tableHeader: String = dataFrame.columns.mkString(",")
dataFrame.foreachPartition((it: Iterator[Row]) => {
  println("partition index: ")
  val url = "jdbc:..." + "user=;password=;"
  val conn = DriverManager.getConnection(url)
  conn.setAutoCommit(true)
  val stmt = conn.createStatement()
  val batchSize = 10
  var i = 0
  while (it.hasNext) {
    val row = it.next
    import java.sql.SQLException
    import java.sql.SQLIntegrityConstraintViolationException
    try {
      stmt.addBatch(" UPDATE TABLE SET STATUS = 0 , " +
        " DATE ='" + new java.sql.Timestamp(System.currentTimeMillis()) + "'" +
        " where id = " + row.getAs("IDNUM"))
      i += 1
      if (i % batchSize == 0) {
        stmt.executeBatch
        conn.commit
      }
    } catch {
      case e: SQLIntegrityConstraintViolationException =>
      case e: SQLException =>
        e.printStackTrace()
    } finally {
      stmt.executeBatch
      conn.commit
    }
  }
  import java.util
  val ret = stmt.executeBatch
  System.out.println("Ret val: " + util.Arrays.toString(ret))
  System.out.println("Update count: " + stmt.getUpdateCount)
  conn.commit
  stmt.close
})

How to use UPDATE command in spark-sql & DataFrames

I am trying to implement the UPDATE command on DataFrames in Spark, but I am getting this error. Please suggest what should be done.
17/01/19 11:49:39 INFO Replace$: query --> UPDATE temp SET c2 = REPLACE(c2,"i","a");
17/01/19 11:49:39 ERROR Main$: [1.1] failure: ``with'' expected but identifier UPDATE found
UPDATE temp SET c2 = REPLACE(c2,"i","a");
^
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier UPDATE found
UPDATE temp SET c2 = REPLACE(c2,"i","a");
This is the program
object Replace extends SparkPipelineJob {

  val logger = LoggerFactory.getLogger(getClass)
  protected implicit val jsonFormats: Formats = DefaultFormats

  def createSetCondition(colTypeMap: List[(String, DataType)], pattern: String, replacement: String): String = {
    val res = colTypeMap map {
      case (c, t) =>
        if (t == StringType)
          c + " = REPLACE(" + c + ",\"" + pattern + "\",\"" + replacement + "\")"
        else
          c + " = REPLACE(" + c + "," + pattern + "," + replacement + ")"
    }
    return res.mkString(" , ")
  }

  override def execute(dataFrames: List[DataFrame], sc: SparkContext, sqlContext: SQLContext, params: String, productId: Int): List[DataFrame] = {
    import sqlContext.implicits._

    val replaceData = ((parse(params)).extractOpt[ReplaceDataSchema]).get
    logger.info(s"Replace-replaceData --> ${replaceData}")

    val (inputDf, (columnsMap, colTypeMap)) = (dataFrames(0), LoadInput.colMaps(dataFrames(0)))
    val tableName = Constants.TEMP_TABLE
    inputDf.registerTempTable(tableName)

    val colMap = replaceData.colName map {
      x => (x, colTypeMap.get(x).get)
    }
    logger.info(s"colMap --> ${colMap}")

    val setCondition = createSetCondition(colMap, replaceData.input, replaceData.output)
    val query = "UPDATE " + tableName + " SET " + setCondition + ";"
    logger.info(s"query --> ${query}")

    val outputDf = sqlContext.sql(query)
    List(outputDf)
  }
}
Here is some extra information.
17/01/19 11:49:39 INFO Replace$: Replace-replaceData --> ReplaceDataSchema(List(SchemaDetectData(s3n://fakepath/data37.csv,None,None)),List(c2),i,a)
17/01/19 11:49:39 INFO Replace$: colMap --> List((c2,StringType))
data37.csv
c1 c2
90 nine
Please ask for extra information if needed.
Spark SQL doesn't support UPDATE queries. If you want to "modify" the data, you should create a new table with a SELECT, for example:
SELECT c1, REPLACE(c2, 'i', 'a') AS c2 FROM temp
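Equivalently, here is a sketch (not from the original answer) of the same rewrite using the DataFrame API, which avoids the SQL parser entirely; it assumes the built-in regexp_replace function, and replaceInC2 is a hypothetical helper name.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.regexp_replace

// Produce a new DataFrame with c2 rewritten; all other columns are carried over unchanged.
def replaceInC2(df: DataFrame): DataFrame =
  df.withColumn("c2", regexp_replace(df("c2"), "i", "a"))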

Play, Scala, JDBC and concurrency

I have a Scala class that accesses a database through JDBC:
class DataAccess {
  def select = {
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://localhost:3306/db"
    val username = "root"
    val password = "xxxx"
    var connection: Connection = null
    try {
      // make the connection
      Class.forName(driver)
      connection = DriverManager.getConnection(url, username, password)
      // create the statement, and run the select query
      val statement = connection.createStatement()
      val resultSet = statement.executeQuery("SELECT name, descrip FROM table1")
      while (resultSet.next()) {
        val name = resultSet.getString(1)
        val descrip = resultSet.getString(2)
        println("name, descrip = " + name + ", " + descrip)
      }
    } catch {
      case e => e.printStackTrace
    }
    connection.close()
  }
}
I access this class in my Play application, like so:
def testSql = Action {
  val da = new DataAccess
  da.select()
  Ok("success")
}
The method testSql may be invoked by several users concurrently. The question is: could there be a race condition in the while (resultSet.next()) loop (or in any other part of the class)?
Note: I need to use JDBC as the SQL statement will be dynamic.
No, there cannot.
Each thread works with its own local instances of Connection, Statement and ResultSet, so there is no concurrent access to the same object.
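As a small illustration (not part of the original answer), the sketch below runs four concurrent calls; it assumes the DataAccess class above and the default global execution context.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Each call builds its own local Connection, Statement and ResultSet,
// so the concurrent invocations share no mutable state.
val da = new DataAccess
val calls = Future.sequence(List.fill(4)(Future(da.select)))
Await.result(calls, 30.seconds)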

Scala Tail Recursion java.lang.StackOverflowError

I am iteratively querying a mysql table called txqueue that is growing continuously.
Each successive query only considers rows that were inserted into the txqueue table after the query executed in the previous iteration.
To achieve this, each successive query selects rows from the table where the primary key (seqno field in my example below) exceeds the maximum seqno observed in the previous query.
Any newly inserted rows identified in this way are written into a csv file.
The intention is for this process to run indefinitely.
The tail-recursive function below works OK, but after a while it runs into a java.lang.StackOverflowError. Each iterative query returns two to three rows, and results come back every second or so.
Any ideas on how to avoid the java.lang.StackOverflowError?
Is this actually something that can/should be achieved with streaming?
Many thanks for any suggestions.
Here's the code that works for a while:
import java.io.{BufferedWriter, FileWriter}
import java.sql.{Connection, DriverManager}

object TXQImport {

  val driver = "com.mysql.jdbc.Driver"
  val url = "jdbc:mysql://mysqlserveraddress/mysqldb"
  val username = "username"
  val password = "password"
  var connection: Connection = null

  def txImportLoop(startID: BigDecimal): Unit = {
    try {
      Class.forName(driver)
      connection = DriverManager.getConnection(url, username, password)
      val statement = connection.createStatement()

      val newMaxID = statement.executeQuery("SELECT max(seqno) as maxid from txqueue")
      val maxid = new Iterator[BigDecimal] {
        def hasNext = newMaxID.next()
        def next() = newMaxID.getBigDecimal(1)
      }.toStream.max

      val selectStatement = statement.executeQuery("SELECT seqno,someotherfield " +
        " from txqueue where seqno >= " + startID + " and seqno < " + maxid)

      if (startID != maxid) {
        val ts = System.currentTimeMillis
        val file = new java.io.File("F:\\txqueue " + ts + ".txt")
        val bw = new BufferedWriter(new FileWriter(file))
        // iterate over the ResultSet
        while (selectStatement.next()) {
          bw.write(selectStatement.getString(1) + "," + selectStatement.getString(2))
          bw.newLine()
        }
        bw.close()
      }

      connection.close()
      txImportLoop(maxid)
    }
    catch {
      case e => e.printStackTrace
    }
  }

  def main(args: Array[String]) {
    txImportLoop(0)
  }
}
Your function is not tail-recursive, because the recursive call is made inside the try block, so the pending catch keeps a frame on the stack.
That's why you end up with a stack overflow.
You should always annotate the functions you intend to be tail-recursive with @scala.annotation.tailrec; the compiler will then fail the build if tail recursion is impossible, so you won't be surprised by it at run time.
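One possible restructuring is sketched below. It is an assumption on my part, not the asker's code: runOnce is a hypothetical helper standing in for one query/export iteration, and the point is only that the recursive call sits outside try/catch so @tailrec can verify it.

import scala.annotation.tailrec

object TXQImportTailRec {

  // Hypothetical helper: performs one query/export step and returns the new maximum seqno.
  def runOnce(startID: BigDecimal): BigDecimal = ???

  @tailrec
  def txImportLoop(startID: BigDecimal): Unit = {
    val nextStart =
      try runOnce(startID)
      catch { case e: Exception => e.printStackTrace(); startID }
    txImportLoop(nextStart) // the self-call is the last expression and sits outside try/catch
  }
}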

Slick 2.1: Return query results as a map [duplicate]

I have methods in my Play app that query database tables with over a hundred columns. I can't define a case class for each such query, because it would be ridiculously big and would have to be changed with every alteration of the table in the database.
I'm using this approach, where the result of the query looks like this:
Map(columnName1 -> columnVal1, columnName2 -> columnVal2, ...)
Example of the code:
implicit val getListStringResult = GetResult[List[Any]] (
  r => (1 to r.numColumns).map(_ => r.nextObject).toList
)

def getSomething(): Map[String, Any] = DB.withSession {
  val columns = MTable.getTables(None, None, None, None).list.filter(_.name.name == "myTable").head.getColumns.list.map(_.column)
  val result = sql"""SELECT * FROM myTable LIMIT 1""".as[List[Any]].firstOption.map(columns zip _ toMap).get
}
This is not a problem when the query only runs on a single database and a single table. I need to be able to use multiple tables and databases in my query, like this:
def getSomething(): Map[String, Any] = DB.withSession {
  // The line below is no longer valid because of multiple tables/databases
  val columns = MTable.getTables(None, None, None, None).list.filter(_.name.name == "table1").head.getColumns.list.map(_.column)
  val result = sql"""
    SELECT *
    FROM db1.table1
    LEFT JOIN db2.table2 ON db2.table2.col1 = db1.table1.col1
    LIMIT 1
  """.as[List[Any]].firstOption.map(columns zip _ toMap).get
}
The same approach can no longer be used to retrieve column names. This problem doesn't exist when using something like PHP PDO or Java JDBCTemplate; those retrieve column names without any extra effort.
My question is: how do I achieve this with Slick?
import scala.slick.jdbc.{GetResult, PositionedResult}

object ResultMap extends GetResult[Map[String, Any]] {
  def apply(pr: PositionedResult) = {
    val rs = pr.rs // <- JDBC result set
    val md = rs.getMetaData()
    val res = (1 to pr.numColumns).map { i => md.getColumnName(i) -> rs.getObject(i) }.toMap
    pr.nextRow // <- use Slick's advance method to avoid an endless loop
    res
  }
}

val result = sql"select * from ...".as(ResultMap).firstOption
Another variant that produces a map with only the non-null columns (keys in lowercase):
private implicit val getMap = GetResult[Map[String, Any]](r => {
  val metadata = r.rs.getMetaData
  (1 to r.numColumns).flatMap(i => {
    val columnName = metadata.getColumnName(i).toLowerCase
    val columnValue = r.nextObjectOption
    columnValue.map(columnName -> _)
  }).toMap
})
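A brief usage sketch (not part of the original answers), assuming the implicit above is in scope and a Slick 2.1 session is obtained via Play's DB.withSession:

// Read a cross-database join as a column-name -> value map, no case class needed.
val row: Option[Map[String, Any]] = DB.withSession { implicit session =>
  sql"""
    SELECT *
    FROM db1.table1
    LEFT JOIN db2.table2 ON db2.table2.col1 = db1.table1.col1
    LIMIT 1
  """.as[Map[String, Any]].firstOption
}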