Cache Slick DBIO Actions - scala

I am trying to speed up "SELECT * FROM WHERE name=?" kind of queries in Play! + Scala app. I am using Play 2.4 + Scala 2.11 + play-slick-1.1.1 package. This package uses Slick-3.1 version.
My hypothesis was that slick generates Prepared statements from DBIO actions and they get executed. So I tried to cache them buy turning on flag cachePrepStmts=true
However I still see "Preparing statement..." messages in the log which means that PS are not getting cached! How should one instructs slick to cache them?
If I run following code shouldn't the PS be cached at some point?
for (i <- 1 until 100) {
Await.result(db.run(doctorsTable.filter(_.userName === name).result), 10 seconds)
}
Slick config is as follows:
slick.dbs.default {
driver="slick.driver.MySQLDriver$"
db {
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:3306/staging_db?useSSL=false&cachePrepStmts=true"
user = "user"
password = "passwd"
numThreads = 1 // For not just one thread in HikariCP
properties = {
cachePrepStmts = true
prepStmtCacheSize = 250
prepStmtCacheSqlLimit = 2048
}
}
}
Update 1
I tried following as per #pawel's suggestion of using compiled queries:
val compiledQuery = Compiled { name: Rep[String] =>
doctorsTable.filter(_.userName === name)
}
val stTime = TimeUtil.getUtcTime
for (i <- 1 until 100) {
FutureUtils.blockFuture(db.compiledQuery(name).result), 10)
}
val endTime = TimeUtil.getUtcTime - stTime
Logger.info(s"Time Taken HERE $endTime")
In my logs I still see statement like:
2017-01-16 21:34:00,510 DEBUG [db-1] s.j.J.statement [?:?] Preparing statement: select ...
Also timing of this is also remains the same. What is the desired output? Should I not see these statements anymore? How can I verify if Prepared statements are indeed reused.

You need to use Compiled queries - which are exactly doing what you want.
Just change above code to:
val compiledQuery = Compiled { name: Rep[String] =>
doctorsTable.filter(_.userName === name)
}
for (i <- 1 until 100) {
Await.result(db.run(compiledQuery(name).result), 10 seconds)
}
I extracted above name as a parameter (because you usually want to change some parameters in your PreparedStatements) but that's definitely an optional part.
For further information you can refer to: http://slick.lightbend.com/doc/3.1.0/queries.html#compiled-queries

For MySQL you need to set an additional jdbc flag, useServerPrepStmts=true
HikariCP's MySQL configuration page links to a quite useful document that provides some simple performance tuning configuration options for MySQL jdbc.
Here are a few that I've found useful (you'll need to & append them to jdbc url for options not exposed by Hikari's API). Be sure to read through linked document and/or MySQL documentation for each option; should be mostly safe to use.
zeroDateTimeBehavior=convertToNull&characterEncoding=UTF-8
rewriteBatchedStatements=true
maintainTimeStats=false
cacheServerConfiguration=true
avoidCheckOnDuplicateKeyUpdateInSQL=true
dontTrackOpenResources=true
useLocalSessionState=true
cachePrepStmts=true
useServerPrepStmts=true
prepStmtCacheSize=500
prepStmtCacheSqlLimit=2048
Also, note that statements are cached per thread; depending on what you set for Hikari connection maxLifetime and what server load is, memory usage will increase accordingly on both server and client (e.g. if you set connection max lifetime to just under MySQL default of 8 hours, both server and client will keep N prepared statements alive in memory for the life of each connection).
p.s. curious if bottleneck is indeed statement caching or something specific to Slick.
EDIT
to log statements enable the query log. On MySQL 5.7 you would add to your my.cnf:
general-log=1
general-log-file=/var/log/mysqlgeneral.log
and then sudo touch /var/log/mysqlgeneral.log followed by a restart of mysqld. Comment out above config lines and restart to turn off query logging.

Related

Cadence Java client, unable to get history of workflow executions

I am following the idea, mentioned in this answer and trying this:
workflowTChannel.ListClosedWorkflowExecutions(ListClosedWorkflowExecutionsRequest().apply {
domain = "valid-domain-name"
startTimeFilter = StartTimeFilter().apply {
setEarliestTime(Instant.parse("2023-01-01T00:00:00.000Z").toEpochMilli())
setLatestTime(Instant.parse("2024-01-01T00:59:59.999Z").toEpochMilli())
}
})
However, the result is always an empty list.
Fetching the list via UI works fine at the same time.
Using PostgreSQL and local test installation without advanced visibility.
UPD: debugged Cadence locally and found that it expects nanoseconds instead of milliseconds. This way correct parameters must be prepared like this:
Instant.parse("2023-01-01T00:00:00.000Z").toEpochMilli() * 1000000
My guess is that you are using seconds and Cadence expects nanoseconds timestamps.

How do I ensure that my Apache Spark setup code runs only once?

I'm writing a Spark job in Scala that reads in parquet files on S3, does some simple transforms, and then saves them to a DynamoDB instance. Each time it runs we need to create a new table in Dynamo so I've written a Lambda function which is responsible for table creation. The first thing my Spark job does is generates a table name, invokes my Lambda function (passing the new table name to it), waits for the table to be created, and then proceeds normally with the ETL steps.
However it looks as though my Lambda function is consistently being invoked twice. I cannot explain that. Here's a sample of the code:
def main(spark: SparkSession, pathToParquet: String) {
// generate a unique table name
val tableName = generateTableName()
// call the lambda function
val result = callLambdaFunction(tableName)
// wait for the table to be created
waitForTableCreation(tableName)
// normal ETL pipeline
var parquetRDD = spark.read.parquet(pathToParquet)
val transformedRDD = parquetRDD.map((row: Row) => transformData(row), encoder=kryo[(Text, DynamoDBItemWritable)])
transformedRDD.saveAsHadoopDataset(getConfiguration(tableName))
spark.sparkContext.stop()
}
The code to wait for table creation is pretty-straightforward, as you can see:
def waitForTableCreation(tableName: String) {
val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.defaultClient()
val waiter: Waiter[DescribeTableRequest] = client.waiters().tableExists()
try {
waiter.run(new WaiterParameters[DescribeTableRequest](new DescribeTableRequest(tableName)))
} catch {
case ex: WaiterTimedOutException =>
LOGGER.error("Timed out waiting to create table: " + tableName)
throw ex
case t: Throwable => throw t
}
}
And the lambda invocation is equally simple:
def callLambdaFunction(tableName: String) {
val myLambda = LambdaInvokerFactory.builder()
.lambdaClient(AWSLambdaClientBuilder.defaultClient)
.lambdaFunctionNameResolver(new LambdaByName(LAMBDA_FUNCTION_NAME))
.build(classOf[MyLambdaContract])
myLambda.invoke(new MyLambdaInput(tableName))
}
Like I said, when I run spark-submit on this code, it definitely does hit the Lambda function. But I can't explain why it hits it twice. The result is that I get two tables provisioned in DynamoDB.
The waiting step also seems to fail within the context of running this as a Spark job. But when I unit-test my waiting code it seems to work fine on its own. It successfully blocks until the table is ready.
At first I theorized that perhaps spark-submit was sending this code to all of the worker nodes and they were independently running the whole thing. Initially I had a Spark cluster with 1 master and 2 workers. However I tested this out on another cluster with 1 master and 5 workers, and there again it hit the Lambda function exactly twice, and then apparently failed to wait for table creation because it dies shortly after invoking the Lambdas.
Does anyone have any clues as to what Spark might be doing? Am I missing something obvious?
UPDATE: Here's my spark-submit args which are visible on the Steps tab of EMR.
spark-submit --deploy-mode cluster --class com.mypackage.spark.MyMainClass s3://my-bucket/my-spark-job.jar
And here's the code for my getConfiguration function:
def getConfiguration(tableName: String) : JobConf = {
val conf = new Configuration()
conf.set("dynamodb.servicename", "dynamodb")
conf.set("dynamodb.input.tableName", tableName)
conf.set("dynamodb.output.tableName", tableName)
conf.set("dynamodb.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
conf.set("dynamodb.regionid", "us-east-1")
conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
conf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
new JobConf(conf)
}
Also here is a Gist containing some of the exception logs I see when I try to run this.
Thanks #soapergem for adding logging and options. I add an answer (a try one) since it may be a little bit longer than a comment :)
To wrap-up:
nothing strange with spark-submit and configuration options
in https://gist.github.com/soapergem/6b379b5a9092dcd43777bdec8dee65a8#file-stderr-log you can see that the application is executed twice. It passes twice from an ACCEPTED to RUNNING state. And that's consistent with EMR defaults (How to prevent EMR Spark step from retrying?). To confirm that, you can check whether you have 2 tables created after executing the step (I suppose here that you're generating tables with dynamic names; a different name per execution which in case of retry should give 2 different names)
For your last question:
It looks like my code might work if I run it in "client" deploy mode, instead of "cluster" deploy mode? Does that offer any hints to anyone here?
For more information about the difference, please check https://community.hortonworks.com/questions/89263/difference-between-local-vs-yarn-cluster-vs-yarn-c.html In your case, it looks like the machine executing spark-submit in client mode has different IAM policies than the EMR jobflow. My supposition here is that your jobflow role is not allowed to dynamodb:Describe* and that's why you're getting the exception with 500 code (from your gist):
Caused by: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: EmrTest_20190708143902 not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: V0M91J7KEUVR4VM78MF5TKHLEBVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4243)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4210)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:1890)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1857)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:129)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:126)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
To confirm this hypothesis, you an execute your part creating the table and waiting for creation locally (no Spark code here, just a simple java command of your main function) and:
for the first execution ensure that you have all permissions. IMO it will be dynamodb:Describe* on Resources: * (if it's the reason, AFAIK you should use somthing Resources: Test_Emr* in production for principle of least privilege )
for the 2nd execution remove dynamodb:Describe* and check whether you're getting the same stack trace like in the gist
I encountered the same problem in cluster mode too (v2.4.0). I workaround it by launching my apps programmatically using SparkLauncher instead of using spark-submit.sh. You could move your lambda logic into your main method that starts your spark app like this:
def main(args: Array[String]) = {
// generate a unique table name
val tableName = generateTableName()
// call the lambda function
val result = callLambdaFunction(tableName)
// wait for the table to be created
waitForTableCreation(tableName)
val latch = new CountDownLatch(1);
val handle = new SparkLauncher(env)
.setAppResource("/path/to/spark-app.jar")
.setMainClass("com.company.SparkApp")
.setMaster("yarn")
.setDeployMode("cluster")
.setConf("spark.executor.instances", "2")
.setConf("spark.executor.cores", "2")
// other conf ...
.setVerbose(true)
.startApplication(new SparkAppHandle.Listener {
override def stateChanged(sparkAppHandle: SparkAppHandle): Unit = {
latch.countDown()
}
override def infoChanged(sparkAppHandle: SparkAppHandle): Unit = {
}
})
println("app is launching...")
latch.await()
println("app exited")
}
your spark job starts before the table is actually created because defining operations one by one doesn't mean they will wait until previous one is finished
you need to change the code so that block related to spark is starting after table is created, and in order to achieving it you have to either use for-comprehension that insures every step is finished or put your spark pipeline into the callback of waiter called after the table is created (if you have any, hard to tell)
you can also use andThen or simple map
the main point is that all the lines of code written in your main are executed one by one immediately without waiting for previous one to finish

Is it possible to change the socket timout for a single select?

Our project uses mongodb to store its documents. We have it configured in our DataSource.groovy file with a socketTimeout of 60000. Most of our queries do not get any where near that threshold, but I assume we have it just in case something goes wrong.
Anyway, now I am working on a query that is known to be a long running query. It is pretty much guarenteed to be doing a tablespace scan, which we acknowledge and are ok with for now. The problem is the amount of data we currently have is causing it to exceed the socket timeout on random cases. Add to the issue that we are expecting the amount of data to grow considerably larger in the future.
So my question is, is it possible to increase/remove the socket timeout for a single mongodb select? I've found the following:
grailsApplication.mainContext.getBean('mongoDatastore').mongo.mongoOptions.socketTimeout = 0
That appears to work, but that also is changing the socket timeout for everything application wide I assume, which we do not want. Help!
Update: After a lot of trial and error, I found a way to open another mongo connection that reuses the configuration, but leaves off the socketTimeout and appears to work.
class MongoService {
def grailsApplication
def openMongoClientWithoutSocketTimeout() {
def datastore = grailsApplication.mainContext.getBean('mongoDatastore')
def config = grailsApplication.config.grails.mongo
def credentials = MongoCredential.createCredential(config.username, config.databaseName, config.password.toCharArray())
def options = MongoClientOptions.builder()
.autoConnectRetry(config.options.autoConnectRetry)
.connectTimeout(config.options.connectTimeout)
.build()
new MongoClient(datastore.mongo.getAllAddress(), [credentials, credentials], options)
}
def selectCollection(mongoClient, collection) {
def mongoConfig = grailsApplication.config.grails.mongo
mongoClient.getDB(mongoConfig.databaseName).getCollection(collection)
}
}
Not sure if this is the simplest solution though...
What version of mongo do you use? Possible maxTimeMS can help you?
From document page :
db.collection.find({description: /August [0-9]+, 1969/}).maxTimeMS(50)

How to implement master/slave structure with squeryl and play framework

I am running a play framework website that uses squeryl and mysql database.
I need to use squeryl to run all read queries to the slave and all write queries to the master.
How can I achieve this? either via squeryl or via jdbc connector itself.
Many thanks,
I don't tend to use MySQL myself, but here's an idea:
Based on the documentation here, the MySQL JDBC driver will round robin amongst the slaves if the readOnly attribute is properly set on the Connnection. In order to retrieve and change the current Connection you'll want to use code like
transaction {
val conn = Session.currentSession.connection
conn.setReadOnly(true)
//Your code here
}
Even better, you can create your own readOnlyTransaction method:
def readOnlyTransaction(f: => Unit) = {
transaction {
val conn = Session.currentSession.connection
val orig = conn.getReadOnly()
conn.setReadOnly(true)
f
conn.setReadOnly(orig)
}
}
Then use it like:
readOnlyTransaction {
//Your code here
}
You'll probably want to clean that up a bit so the default readOnly state is reset if an exception occurs, but you get the general idea.

How can I connect to a postgreSQL database in scala?

I want to know how can I do following things in scala?
Connect to a postgreSQL database.
Write SQL queries like SELECT , UPDATE etc. to modify a table in that database.
I know that in python I can do it using PygreSQL but how to do these things in scala?
You need to add dependency "org.postgresql" % "postgresql" % "9.3-1102-jdbc41" in build.sbt and you can modify following code to connect and query database. Replace DB_USER with your db user and DB_NAME as your db name.
import java.sql.{Connection, DriverManager, ResultSet}
object pgconn extends App {
println("Postgres connector")
classOf[org.postgresql.Driver]
val con_st = "jdbc:postgresql://localhost:5432/DB_NAME?user=DB_USER"
val conn = DriverManager.getConnection(con_str)
try {
val stm = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
val rs = stm.executeQuery("SELECT * from Users")
while(rs.next) {
println(rs.getString("quote"))
}
} finally {
conn.close()
}
}
I would recommend having a look at Doobie.
This chapter in the "Book of Doobie" gives a good sense of what your code will look like if you make use of this library.
This is the library of choice right now to solve this problem if you are interested in the pure FP side of Scala, i.e. scalaz, scalaz-stream (probably fs2 and cats soon) and referential transparency in general.
It's worth nothing that Doobie is NOT an ORM. At its core, it's simply a nicer, higher-level API over JDBC.
Take look at the tutorial "Using Scala with JDBC to connect to MySQL", replace the db url and add the right jdbc library. The link got broken so here's the content of the blog:
Using Scala with JDBC to connect to MySQL
A howto on connecting Scala to a MySQL database using JDBC. There are a number of database libraries for Scala, but I ran into a problem getting most of them to work. I attempted to use scala.dbc, scala.dbc2, Scala Query and Querulous but either they aren’t supported, have a very limited featured set or abstracts SQL to a weird pseudo language.
The Play Framework has a new database library called ANorm which tries to keep the interface to basic SQL but with a slight improved scala interface. The jury is still out for me, only used on one project minimally so far. Also, I’ve only seen it work within a Play app, does not look like it can be extracted out too easily.
So I ended up going with basic Java JDBC connection and it turns out to be a fairly easy solution.
Here is the code for accessing a database using Scala and JDBC. You need to change the connection string parameters and modify the query for your database. This example was geared towards MySQL, but any Java JDBC driver should work the same with Scala.
Basic Query
import java.sql.{Connection, DriverManager, ResultSet};
// Change to Your Database Config
val conn_str = "jdbc:mysql://localhost:3306/DBNAME?user=DBUSER&password=DBPWD"
// Load the driver
classOf[com.mysql.jdbc.Driver]
// Setup the connection
val conn = DriverManager.getConnection(conn_str)
try {
// Configure to be Read Only
val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
// Execute Query
val rs = statement.executeQuery("SELECT quote FROM quotes LIMIT 5")
// Iterate Over ResultSet
while (rs.next) {
println(rs.getString("quote"))
}
}
finally {
conn.close
}
You will need to download the mysql-connector jar.
Or if you are using maven, the pom snippets to load the mysql connector, you’ll need to check what the latest version is.
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.12</version>
</dependency>
To run the example, save the following to a file (query_test.scala) and run using, the following specifying the classpath to the connector jar:
scala -cp mysql-connector-java-5.1.12.jar:. query_test.scala
Insert, Update and Delete
To perform an insert, update or delete you need to create an updatable statement object. The execute command is slightly different and you will most likely want to use some sort of parameters. Here’s an example doing an insert using jdbc and scala with parameters.
// create database connection
val dbc = "jdbc:mysql://localhost:3306/DBNAME?user=DBUSER&password=DBPWD"
classOf[com.mysql.jdbc.Driver]
val conn = DriverManager.getConnection(dbc)
val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE)
// do database insert
try {
val prep = conn.prepareStatement("INSERT INTO quotes (quote, author) VALUES (?, ?) ")
prep.setString(1, "Nothing great was ever achieved without enthusiasm.")
prep.setString(2, "Ralph Waldo Emerson")
prep.executeUpdate
}
finally {
conn.close
}
We are using Squeryl, which is working well so far for us. Depending on your needs it may do the trick.
Here is a list of supported DB's and the adapters
If you want/need to write your own SQL, but hate the JDBC interface, take a look at O/R Broker
I would recommend the Quill query library. Here is an introduction post by Li Haoyi to get started.
TL;DR
{
import io.getquill._
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
val pgDataSource = new org.postgresql.ds.PGSimpleDataSource()
pgDataSource.setUser("postgres")
pgDataSource.setPassword("example")
val config = new HikariConfig()
config.setDataSource(pgDataSource)
val ctx = new PostgresJdbcContext(LowerCase, new HikariDataSource(config))
import ctx._
}
Define a class ORM:
// mapping `city` table
case class City(
id: Int,
name: String,
countryCode: String,
district: String,
population: Int
)
and query all items:
# ctx.run(query[City])
cmd11.sc:1: SELECT x.id, x.name, x.countrycode, x.district, x.population FROM city x
val res11 = ctx.run(query[City])
^
res11: List[City] = List(
City(1, "Kabul", "AFG", "Kabol", 1780000),
City(2, "Qandahar", "AFG", "Qandahar", 237500),
City(3, "Herat", "AFG", "Herat", 186800),
...
ScalikeJDBC is quite easy to use. It allows you to write raw SQL using interpolated strings.
http://scalikejdbc.org/