Optimize database actions in Slick 3 - PostgreSQL

I have created a database with Slick, and I am trying to create table schemas, select some information, and so on. Here is my code for schema creation:
val createUserTable = UserTable.table.schema.create
val createTaskTable = TaskTable.table.schema.create
Await.result(db.run(DBIO.seq(Queries.createUserTable, Queries.createTaskTable)), 2 seconds)
This code works just fine, but I do not want to use Await.result with every query. What I am looking for is to execute them in batches, at least grouped by purpose (creation, selection and so on). So I have created this method to pass different actions:
def executeAction[T](action: DBIO[T]) =
  Await.result(db.run(action), 2 seconds)
So I am curious: how can I change it to accept some data structure which holds a sequence of queries? For example, List(createUserTable, createTaskTable).
Your help is appreciated!

There are two ways to avoid calling Await for every DBIO action (both are sketched below):
Create a list of DBIO actions, gather them with DBIO.seq, and execute the result.
Use a for-comprehension to compose all the DBIO actions into a single DBIO action.
Either way you avoid calling Await again and again just to get the results of intermediate DBIO actions.
In both cases you still have to wait for the final result on the main thread (i.e. stop the main thread from exiting) with Await.result at least once.
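A minimal sketch of both options, reusing db, createUserTable and createTaskTable from the question. The profile api import, the 2-second timeout and the global ExecutionContext (only needed for the for-comprehension) are assumptions for the example:

import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import slick.jdbc.PostgresProfile.api._

// Option 1: hold the actions in a List and run them as a single DBIO.seq action.
val schemaActions: List[DBIO[Unit]] = List(createUserTable, createTaskTable)
Await.result(db.run(DBIO.seq(schemaActions: _*)), 2.seconds)

// A variant of executeAction that accepts any sequence of actions.
def executeActions(actions: Seq[DBIO[_]]): Unit =
  Await.result(db.run(DBIO.seq(actions: _*)), 2.seconds)

// Option 2: compose the actions with a for-comprehension into one DBIO action
// (useful when a later action depends on the result of an earlier one).
val createSchemas: DBIO[Unit] = for {
  _ <- createUserTable
  _ <- createTaskTable
} yield ()
Await.result(db.run(createSchemas), 2.seconds)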

Related

Duplicate jobs are being generated in DAG for the same action in Spark

I have a Spark Streaming job in which I receive data from a message queue and process a bunch of records. In the process, I call take() on a dataset. Although the take action behaves as expected, in the DAG visualization I see multiple job ids created, and all of them run the same take action. This happens only when the data is on the order of hundreds of thousands of records; I did not observe redundant jobs while running with tens of records on my local machine. Can anyone help me understand the reasoning behind this behavior?
The job ids (91 to 95) are basically running the same action. The following is the code snippet corresponding to the action mentioned above.
val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)
if (!corruptedMessageArray.isEmpty) {
  val firstCorruptedMessage: String = corruptedMessageArray(0)
}
Your question seems to be whether duplicate jobs are created by Spark.
If you look at the screenshot you will see that the jobs have a different number of tasks, hence it is not a simple matter of duplication.
I am not sure exactly what is happening, but it seems that for large datasets take() launches several quick consecutive jobs. A likely explanation is that take(n) first scans a small number of partitions and, if it does not find enough rows, retries with progressively more partitions until it has collected n rows, so each retry shows up as a separate job running the same action.
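Not something from the original answer, but if the extra follow-up jobs are a problem, Spark exposes a knob for how aggressively the scan grows between attempts; a small sketch, assuming a SparkSession named spark and Spark 2.1 or later:

// Scan 10x more partitions on each retry of take(), so fewer follow-up jobs
// are launched on large datasets (the default factor is 4).
spark.conf.set("spark.sql.limit.scaleUpFactor", 10L)

val corruptedMessageArray: Array[String] = corruptedMessageDs.take(1)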

Spark: Multiple independent actions in parallel [duplicate]

This question already has answers here:
How to run concurrent jobs(actions) in Apache Spark using single spark context
(2 answers)
Processing multiple files as independent RDD's in parallel
(3 answers)
How to run multiple Spark jobs in parallel?
(3 answers)
Closed 4 years ago.
I am using the SQL API of Spark 2.0.0.
I would like to know what the good practice is when I have two independent actions that have to be performed on my data. Here is a basic example:
val ds = sc.parallelize(List(
  ("2018-12-07T15:31:48Z", "AAA", 3),
  ("2018-12-07T15:32:48Z", "AAA", 25),
  ("2018-12-07T15:33:48Z", "AAA", 20),
  ("2018-12-07T15:34:48Z", "AAA", 10),
  ("2018-12-07T15:35:48Z", "AAA", 15),
  ("2018-12-07T15:36:48Z", "AAA", 16),
  ("2018-12-07T15:37:48Z", "AAA", 8),
  ("2018-12-07T15:31:48Z", "BBB", 15),
  ("2018-12-07T15:32:48Z", "BBB", 0),
  ("2018-12-07T15:33:48Z", "BBB", 0),
  ("2018-12-07T15:34:48Z", "BBB", 1),
  ("2018-12-07T15:35:48Z", "BBB", 8),
  ("2018-12-07T15:36:48Z", "BBB", 7),
  ("2018-12-07T15:37:48Z", "BBB", 6)
)).toDF("timestamp", "tag", "value")
val newDs = commonTransformation(ds).cache()
newDs.count() // force computation of the dataset
val dsAAA = newDs.filter($"tag" === "AAA")
val dsBBB = newDs.filter($"tag" === "BBB")
actionAAA(dsAAA)
actionBBB(dsBBB)
Using the following functions:
def commonTransformation(ds: Dataset[Row]): Dataset[Row] = {
  ds // do multiple transformations on the dataframe
}
def actionAAA(ds: Dataset[Row]): Unit = {
  Thread.sleep(5000) // sleep to simulate an action that takes time
  ds.show()
}
def actionBBB(ds: Dataset[Row]): Unit = {
  Thread.sleep(5000) // sleep to simulate an action that takes time
  ds.show()
}
In this example, we have an input dataset that contains multiple time series identified by the 'tag' column, and some transformations are applied to the whole dataset.
Then, I want to apply different actions to my data depending on the tag of the time series.
In my example I get the expected result, but I had to wait a long time for both of my actions to be executed, even though I had executors available.
I partially solved the problem by using the Java Future class, which lets me start my actions asynchronously. But with this solution, Spark becomes very slow if I start too many actions relative to its resources, and it ends up taking more time than running the actions one by one.
So for now, my solution is to start multiple actions with a maximum limit on how many run at the same time, but I don't feel like this is the right way to do it (and the maximum limit is hard to guess).
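As a concrete illustration of the bounded-concurrency approach described above (my sketch, not code from the question; the pool size of 2 and the 10-minute timeout are arbitrary assumptions):

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

// A fixed thread pool caps how many Spark actions are submitted at once.
implicit val actionPool: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

// Each action is submitted from its own thread, so Spark can schedule the
// resulting jobs concurrently (FIFO by default; setting spark.scheduler.mode
// to FAIR shares executors more evenly between them).
val runningActions = Seq(
  Future { actionAAA(dsAAA) },
  Future { actionBBB(dsBBB) }
)

// Block once at the end instead of waiting for each action in turn.
Await.result(Future.sequence(runningActions), 10.minutes)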

Spark UI active jobs getting stuck when using scala parallel collection

I have a DataFrame with 1000 columns, and I am trying to get some statistics by doing some operations on each column. I need to sort each column, so I basically can't do multi-column operations on it. I do all these column operations in a function called processColumn.
def processColumn(df: DataFrame): Double = {
  // sort the column
  // get some statistics
}
To get this done, I persist the dataframe in memory and process it with multiple Scala threads. The code is something like this (let's say the initial dataframe is df):
df.columns.grouped(100).foreach { columnGroups =>
  val newDf = df.select(columnGroups.head, columnGroups.tail: _*)
  newDf.persist()
  val parallelCol = columnGroups.par
  parallelCol.tasksupport = new ForkJoinTaskSupport(
    new scala.concurrent.forkjoin.ForkJoinPool(4)
  )
  parallelCol.foreach { columnName =>
    val result = processColumn(df.select(columnName))
    // I am storing result here to a synchronized list
  }
  newDf.unpersist()
}
As you can see, I am specifying 4 threads to run at a time. But sometimes one of the threads gets stuck, I end up with more than 4 active jobs running, and the ones that get stuck never finish.
I suspect the threads started by the Scala parallel collection have a timeout and sometimes do not wait for all jobs to finish; unpersist then gets called, so the active job is stuck forever. I have been looking at the source code to see whether Scala collection operations have a timeout, but I haven't been able to figure it out for sure.
Any help will be highly appreciated. Also, please let me know if you have any questions. Thank you.

Await statement execution completion in Slick

In my tests, I've got some database actions that aren't exposed as Futures at the test level. Sometimes, my tests run fast enough that close() in my cleanup happens before those database actions complete, and then I get ugly errors. Is there a way to detect how many statements are in-flight or otherwise hold off close()?
When you execute a query you get a Future[A], where A is the result type of the query.
You can compose all your queries using Future.sequence() to get a single future, composedFuture, which will be completed when all your queries have returned results.
Now you can use composedFuture.map(_ => close()) to make sure that all queries have finished executing before you close the resource.
The best option is to expose the actions as Futures and then compose them (see the sketch below).
Otherwise you can add Thread.sleep(someSensibleTime) and hope your futures complete within someSensibleTime, but this will make your tests slow and error-prone.
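A rough sketch of that composition (insertUserAction and insertTaskAction are hypothetical placeholders for whatever statements the tests actually run):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

// Hypothetical in-flight statements, exposed as Futures instead of being
// fired and forgotten inside the code under test.
val pendingStatements: Seq[Future[Int]] = Seq(
  db.run(insertUserAction),
  db.run(insertTaskAction)
)

// close() only runs once every pending statement has completed.
val composedFuture: Future[Unit] =
  Future.sequence(pendingStatements).map(_ => db.close())

// In the test's cleanup, wait once for the whole composition.
Await.result(composedFuture, 10.seconds)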
I think this may be database-dependent rather than Slick-dependent.
For example, MySQL lets you see the currently running queries with SHOW PROCESSLIST, and you can act accordingly (a sketch of this follows below).
If that's not an option, I suppose you could poll the DB to observe a chosen side effect, and call close() afterwards.
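If going the MySQL route, here is a rough sketch of that polling idea using Slick's plain SQL support (the query against information_schema.processlist and its filter are assumptions; adapt them to your database):

import scala.concurrent.Await
import scala.concurrent.duration._
import slick.jdbc.MySQLProfile.api._

// Ask MySQL which statements are still executing; once this comes back empty
// (apart from the poll itself), it should be safe to call close().
val inFlightStatements =
  sql"""SELECT id, info
        FROM information_schema.processlist
        WHERE command <> 'Sleep' AND info IS NOT NULL""".as[(Long, String)]

val stillRunning: Vector[(Long, String)] =
  Await.result(db.run(inFlightStatements), 2.seconds)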

Get list of executions filtered by parameter value

I am using Spring Batch 3.0.4 stable. While submitting a job I add some specific parameters to its execution, say, a tag. Job information is persisted in the DB.
Later on I will need to retrieve all the executions marked with a particular tag.
Currently I see 2 options:
Get all job instances with org.springframework.batch.core.explore.JobExplorer#findJobInstancesByJobName. For each instance, get all available executions with org.springframework.batch.core.explore.JobExplorer#getJobExecutions. Filter the resulting collection of executions by checking their JobParameters (sketched below).
Write my own JdbcTemplate-based DAO implementation to run the select query.
While the former option seems pretty inefficient, the latter means writing extra code to deal with the Spring-specific database table structure.
Is there any option I am missing here?
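For reference, a rough sketch of option 1, written in Scala to match the rest of this page (not from the original post; the jobExplorer wiring, the job name "myJob" and the parameter key "tag" are placeholders):

import scala.collection.JavaConverters._
import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.explore.JobExplorer

// Walk instances -> executions and keep those whose "tag" parameter matches.
def executionsWithTag(jobExplorer: JobExplorer, tag: String): Seq[JobExecution] =
  jobExplorer.findJobInstancesByJobName("myJob", 0, Int.MaxValue).asScala
    .flatMap(instance => jobExplorer.getJobExecutions(instance).asScala)
    .filter(execution => tag == execution.getJobParameters.getString("tag"))
    .toSeq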