How do I ensure that my Apache Spark setup code runs only once? - scala

I'm writing a Spark job in Scala that reads in Parquet files on S3, does some simple transforms, and then saves them to a DynamoDB instance. Each time it runs we need to create a new table in DynamoDB, so I've written a Lambda function which is responsible for table creation. The first thing my Spark job does is generate a table name, invoke my Lambda function (passing the new table name to it), and wait for the table to be created; it then proceeds normally with the ETL steps.
However it looks as though my Lambda function is consistently being invoked twice. I cannot explain that. Here's a sample of the code:
def main(spark: SparkSession, pathToParquet: String) {
  // generate a unique table name
  val tableName = generateTableName()
  // call the lambda function
  val result = callLambdaFunction(tableName)
  // wait for the table to be created
  waitForTableCreation(tableName)
  // normal ETL pipeline
  val parquetRDD = spark.read.parquet(pathToParquet)
  val transformedRDD = parquetRDD.map((row: Row) => transformData(row))(kryo[(Text, DynamoDBItemWritable)])
  transformedRDD.saveAsHadoopDataset(getConfiguration(tableName))
  spark.sparkContext.stop()
}
The code to wait for table creation is pretty straightforward, as you can see:
def waitForTableCreation(tableName: String) {
  val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.defaultClient()
  val waiter: Waiter[DescribeTableRequest] = client.waiters().tableExists()
  try {
    waiter.run(new WaiterParameters[DescribeTableRequest](new DescribeTableRequest(tableName)))
  } catch {
    case ex: WaiterTimedOutException =>
      LOGGER.error("Timed out waiting to create table: " + tableName)
      throw ex
    case t: Throwable => throw t
  }
}
And the lambda invocation is equally simple:
def callLambdaFunction(tableName: String) {
  val myLambda = LambdaInvokerFactory.builder()
    .lambdaClient(AWSLambdaClientBuilder.defaultClient)
    .lambdaFunctionNameResolver(new LambdaByName(LAMBDA_FUNCTION_NAME))
    .build(classOf[MyLambdaContract])
  myLambda.invoke(new MyLambdaInput(tableName))
}
Like I said, when I run spark-submit on this code, it definitely does hit the Lambda function. But I can't explain why it hits it twice. The result is that I get two tables provisioned in DynamoDB.
The waiting step also seems to fail within the context of running this as a Spark job. But when I unit-test my waiting code it seems to work fine on its own. It successfully blocks until the table is ready.
At first I theorized that perhaps spark-submit was sending this code to all of the worker nodes and they were independently running the whole thing. Initially I had a Spark cluster with 1 master and 2 workers. However I tested this out on another cluster with 1 master and 5 workers, and there again it hit the Lambda function exactly twice, and then apparently failed to wait for table creation because it dies shortly after invoking the Lambdas.
Does anyone have any clues as to what Spark might be doing? Am I missing something obvious?
UPDATE: Here are my spark-submit args, which are visible on the Steps tab of EMR.
spark-submit --deploy-mode cluster --class com.mypackage.spark.MyMainClass s3://my-bucket/my-spark-job.jar
And here's the code for my getConfiguration function:
def getConfiguration(tableName: String): JobConf = {
  val conf = new Configuration()
  conf.set("dynamodb.servicename", "dynamodb")
  conf.set("dynamodb.input.tableName", tableName)
  conf.set("dynamodb.output.tableName", tableName)
  conf.set("dynamodb.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
  conf.set("dynamodb.regionid", "us-east-1")
  conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
  conf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
  new JobConf(conf)
}
Also here is a Gist containing some of the exception logs I see when I try to run this.

Thanks @soapergem for adding the logging and options. I'm adding an answer (a tentative one) since it may be a little longer than a comment :)
To wrap up:
there is nothing strange about your spark-submit command or configuration options
in https://gist.github.com/soapergem/6b379b5a9092dcd43777bdec8dee65a8#file-stderr-log you can see that the application is executed twice: it passes from the ACCEPTED to the RUNNING state twice. That's consistent with EMR defaults (see "How to prevent EMR Spark step from retrying?"). To confirm it, check whether you have 2 tables created after executing the step (I assume here that you're generating tables with dynamic names, i.e. a different name per execution, which in the case of a retry should give 2 different names). One way to suppress the retry is sketched just below.
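If the double execution is indeed the YARN application retry, one quick way to rule it out is to cap the number of application attempts when submitting. A sketch based on your own spark-submit line (spark.yarn.maxAppAttempts is a standard Spark-on-YARN setting; adjust the rest to your job):
spark-submit --deploy-mode cluster --conf spark.yarn.maxAppAttempts=1 --class com.mypackage.spark.MyMainClass s3://my-bucket/my-spark-job.jar
With a single attempt allowed, a failing first run simply fails instead of silently re-running your main method (and re-invoking the Lambda) a second time.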
For your last question:
It looks like my code might work if I run it in "client" deploy mode, instead of "cluster" deploy mode? Does that offer any hints to anyone here?
For more information about the difference, please check https://community.hortonworks.com/questions/89263/difference-between-local-vs-yarn-cluster-vs-yarn-c.html In your case, it looks like the machine executing spark-submit in client mode has different IAM policies than the EMR jobflow role. My supposition here is that your jobflow role is not allowed to perform dynamodb:Describe* actions, and that's why you're getting the exception (status code 400 in your gist):
Caused by: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: EmrTest_20190708143902 not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: V0M91J7KEUVR4VM78MF5TKHLEBVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4243)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4210)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:1890)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1857)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:129)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:126)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
To confirm this hypothesis, you can execute the part of your code that creates the table and waits for its creation locally (no Spark code here, just run your main function as a plain JVM process) and:
for the first execution ensure that you have all permissions. IMO it will be dynamodb:Describe* on Resource: * (if that is the reason, AFAIK in production you should use something like Resource: Test_Emr* instead, following the principle of least privilege)
for the 2nd execution remove dynamodb:Describe* and check whether you get the same stack trace as in the gist (a minimal local check is sketched after this list)
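If it helps, a quick way to exercise just the DescribeTable permission outside of Spark could look like the sketch below. The table name is only a placeholder; run it under the same IAM role as the EMR jobflow and compare the outcome with a run under your local credentials.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.DescribeTableRequest

object DescribePermissionCheck {
  def main(args: Array[String]): Unit = {
    val client = AmazonDynamoDBClientBuilder.defaultClient()
    // Placeholder table name: an authorization error here would point at a missing
    // dynamodb:Describe* permission, while ResourceNotFoundException means the call
    // itself was allowed but the table does not exist.
    val result = client.describeTable(new DescribeTableRequest("EmrTest_placeholder"))
    println(result.getTable.getTableStatus)
  }
}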

I encountered the same problem in cluster mode too (v2.4.0). I worked around it by launching my apps programmatically using SparkLauncher instead of spark-submit.sh. You could move your Lambda logic into the main method that starts your Spark app, like this:
import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

def main(args: Array[String]): Unit = {
  // generate a unique table name
  val tableName = generateTableName()
  // call the lambda function
  val result = callLambdaFunction(tableName)
  // wait for the table to be created
  waitForTableCreation(tableName)

  val latch = new CountDownLatch(1)
  val handle = new SparkLauncher(env)
    .setAppResource("/path/to/spark-app.jar")
    .setMainClass("com.company.SparkApp")
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setConf("spark.executor.instances", "2")
    .setConf("spark.executor.cores", "2")
    // other conf ...
    .setVerbose(true)
    .startApplication(new SparkAppHandle.Listener {
      override def stateChanged(sparkAppHandle: SparkAppHandle): Unit = {
        // only release the latch once the application reaches a final state
        if (sparkAppHandle.getState.isFinal) latch.countDown()
      }
      override def infoChanged(sparkAppHandle: SparkAppHandle): Unit = {}
    })

  println("app is launching...")
  latch.await()
  println("app exited")
}
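For context, the env passed to the SparkLauncher constructor above is a java.util.Map of environment variables for the launcher process. Something along these lines should do (the paths are assumptions; point them at your cluster's Hadoop/YARN configuration):
import scala.collection.JavaConverters._

// Hypothetical config locations; SparkLauncher accepts a java.util.Map[String, String]
val env: java.util.Map[String, String] = Map(
  "HADOOP_CONF_DIR" -> "/etc/hadoop/conf",
  "YARN_CONF_DIR"   -> "/etc/hadoop/conf"
).asJava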

Your Spark job starts before the table is actually created, because defining operations one after another doesn't mean they wait for the previous one to finish.
You need to change the code so that the Spark-related block only starts after the table is created. To achieve that, either use a for-comprehension that ensures every step has finished, or put your Spark pipeline into a callback of the waiter that is invoked after the table is created (if it exposes one; hard to tell).
You can also use andThen or a simple map.
The main point is that all the lines of code in your main are executed one by one immediately, without waiting for the previous one to finish.

Related

Is it possible to create a batch flink job in streaming flink job?

I have a streaming job using Apache Flink (version 1.8.1) with Scala. The required job flow is as follows:
Kafka -> write to HBase -> send to Kafka again with a different topic
During the write to HBase, there is a need to retrieve data from another table. To ensure that the data is not empty (NULL), the job must check repeatedly (within a certain time) whether the data is empty.
Is this possible with Flink? If yes, can you help provide examples for conditions similar to my needs?
Edit:
I mean, for the problem I described above, I thought about having to create some kind of batch job inside the streaming job, but I couldn't find the right example for my case. So, is it possible to create a batch Flink job inside a streaming Flink job? If yes, can you help provide examples for conditions similar to my needs?
With more recent versions of Flink you can do lookup queries (with a configurable cache) against HBase from the SQL/Table APIs. Your use case sounds like it might be easily implemented in this fashion. See the docs for more info.
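As a rough illustration of what such a lookup join can look like in newer Flink versions (this is only a sketch: the table and column names are made up, and the DDL that registers the Kafka-backed events table and the HBase-backed dim_hbase table, including its lookup cache options, is omitted):
// Assumes `tableEnv` is a StreamTableEnvironment and the two tables have been registered via DDL.
val enriched = tableEnv.sqlQuery(
  """
    |SELECT e.id, d.info
    |FROM events AS e
    |JOIN dim_hbase FOR SYSTEM_TIME AS OF e.proc_time AS d
    |  ON e.id = d.rowkey
  """.stripMargin)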
Just to clarify my comment, I will post a sketch of what I was trying to suggest, based on the Broadcast State pattern. The link provides an example in Java, so I will follow it; in case you want it in Scala, it should not be too different. You will likely have to implement the code below as it is explained in the link I mentioned:
DataStream<String> output = colorPartitionedStream
    .connect(ruleBroadcastStream)
    .process(
        // type arguments in our KeyedBroadcastProcessFunction represent:
        //   1. the key of the keyed stream
        //   2. the type of elements in the non-broadcast side
        //   3. the type of elements in the broadcast side
        //   4. the type of the result, here a string
        new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
            // my matching logic
        }
    );
I was suggesting that you can populate the ruleBroadcastStream at fixed intervals from the database, or from whatever your store is. So instead of:
// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
    .broadcast(ruleStateDescriptor);
as the web page shows, you will need to add a source that you can schedule to run every X minutes:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

BroadcastStream<Rule> ruleBroadcastStream = env
    .addSource(new YourStreamSource())
    .broadcast(ruleStateDescriptor);

public class YourStreamSource extends RichSourceFunction<YourType> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<YourType> ctx) throws Exception {
        while (running) {
            // TODO: yourData = FETCH DATA;
            ctx.collect(yourData);
            // sleep for X minutes between fetches
            Thread.sleep(X * 60 * 1000L);
        }
    }

    @Override
    public void cancel() {
        this.running = false;
    }
}

"IllegalStateException: state should be: open" when using mapPartitions with Mongo connector

The setup
I have a simple Spark application that uses mapPartitions to transform an RDD. As part of this transformation, I retrieve some necessary data from a Mongo database. The connection from the Spark worker to the Mongo database is managed using the MongoDB Connector for Spark (https://docs.mongodb.com/spark-connector/current/).
I'm using mapPartitions instead of the simpler map because there is some relatively expensive setup that is only required once for all elements in a partition. If I were to use map instead, this setup would have to be repeated for every element individually.
The problem
When one of the partitions in the source RDD becomes large enough, the transformation fails with the message
IllegalStateException: state should be: open
or, occasionally, with
IllegalStateException: The pool is closed
The code
Below is the code of a simple Scala application with which I can reproduce the issue:
package my.package

import com.mongodb.spark.MongoConnector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.bson.Document

object MySparkApplication {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder()
      .appName("MySparkApplication")
      .master(???) // The Spark master URL
      .config("spark.jars", ???) // The path at which the application's fat JAR is located.
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.mongodb.keep_alive_ms", "86400000")
      .getOrCreate()

    val mongoConnector: MongoConnector = MongoConnector(Map(
      "uri" -> ??? // The MongoDB URI.
      , "spark.mongodb.keep_alive_ms" -> "86400000"
      , "keep_alive_ms" -> "86400000"
    ))

    val localDocumentIds: Seq[Long] = Seq.range(1L, 100L)
    val documentIdsRdd: RDD[Long] = sparkSession.sparkContext.parallelize(localDocumentIds)

    val result: RDD[Document] = documentIdsRdd.mapPartitions { documentIdsIterator =>
      mongoConnector.withMongoClientDo { mongoClient =>
        val collection = mongoClient.getDatabase("databaseName").getCollection("collectionName")
        // Some expensive query that should only be performed once for every partition.
        collection.find(new Document("_id", 99999L)).first()
        documentIdsIterator.map { documentId =>
          // An expensive operation that does not interact with the Mongo database.
          Thread.sleep(1000)
          collection.find(new Document("_id", documentId)).first()
        }
      }
    }

    val resultLocal = result.collect()
  }
}
The stack trace
Below is the stack trace returned by Spark when I run the application above:
Driver stacktrace:
[...]
at my.package.MySparkApplication.main(MySparkApplication.scala:41)
at my.package.MySparkApplication.main(MySparkApplication.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalStateException: state should be: open
at com.mongodb.assertions.Assertions.isTrue(Assertions.java:70)
at com.mongodb.connection.BaseCluster.getDescription(BaseCluster.java:152)
at com.mongodb.Mongo.getConnectedClusterDescription(Mongo.java:885)
at com.mongodb.Mongo.createClientSession(Mongo.java:877)
at com.mongodb.Mongo$3.getClientSession(Mongo.java:866)
at com.mongodb.Mongo$3.execute(Mongo.java:823)
at com.mongodb.FindIterableImpl.first(FindIterableImpl.java:193)
at my.package.MySparkApplication$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2.apply(MySparkApplication.scala:36)
at my.package.MySparkApplication$$anonfun$1$$anonfun$apply$1$$anonfun$apply$2.apply(MySparkApplication.scala:33)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The research I have done
I have found several people asking about this issue, and it seems that in all of their cases, the problem turned out to be them using the Mongo client after it had been closed. As far as I can tell, this is not happening in my application - opening and closing the connection should be handled by the Mongo-Spark connector, and I would expect the client to only be closed after the function passed to mongoConnector.withMongoClientDo returns.
I did manage to discover that the issue does not arise for the very first element in the RDD. It seems instead that a number of elements are being processed successfully, and that the failure only occurs once the process has taken a certain amount of time. This amount of time seems to be on the order of 5 to 15 seconds.
The above leads me to believe that something is automatically closing the client once it has been active for a certain amount of time, even though it is still being used.
As you can tell by my code, I have discovered the fact that the Mongo-Spark connector exposes a configuration spark.mongodb.keep_alive_ms that, according to the connector documentation, controls "The length of time to keep a MongoClient available for sharing". Its default value is 5 seconds, so this seemed like a useful thing to try. In the application above, I attempt to set it to an entire day in three different ways, with zero effect. The documentation does state that this specific property "can only be configured via a System Property". I think that this is what I'm doing (by setting the property when initialising the Spark session and/or Mongo connector), but I'm not entirely sure. It seems to be impossible to verify the setting once the Mongo connector has been initialised.
One other StackOverflow question mentions that I should try setting the maxConnectionIdleTime option in the MongoClientOptions, but as far as I can tell it is not possible to set these options through the connector.
As a sanity check, I tried replacing the use of mapPartitions with a functionally equivalent use of map. The issue disappeared, which is probably because the connection to the Mongo database is re-initialised for each individual element of the RDD. However, as mentioned above, this approach would have significantly worse performance because I would end up repeating expensive setup work for every element in the RDD.
Out of curiosity I also tried replacing the call to mapPartitions with a call to foreachPartition, also replacing the call to documentIdsIterator.map with documentIdsIterator.foreach. The issue also disappeared in this case. I have no idea why this would be, but because I need to transform my RDD, this is also not an acceptable approach.
The kind of answer I am looking for
"You actually are closing the client prematurely, and here's where: [...]"
"This is a known issue in the Mongo-Spark connector, and here's a link to their issue tracker: [...]"
"You are setting the spark.mongodb.keep_alive_ms property incorrectly, this is how you should do it: [...]"
"It is possible to verify the value of spark.mongodb.keep_alive_ms on your Mongo connector, and here's how: [...]"
"It is possible to set MongoClientOptions such as maxConnectionIdleTime through the Mongo connector, and here's how: [...]"
Edit
Further investigation has yielded the following insight:
The phrase 'System property' used in the connector's documentation refers to a Java system property, set using System.setProperty("spark.mongodb.keep_alive_ms", desiredValue) or the command line option -Dspark.mongodb.keep_alive_ms=desiredValue. This value is then read by the MongoConnector singleton object and passed to the MongoClientCache. However, none of the approaches for setting this property actually works:
Calling System.setProperty() from the driver program sets the value only in the JVM for the Spark driver program, while the value is needed in the Spark worker's JVM.
Calling System.setProperty() from the worker program sets the value only after it is read by MongoConnector.
Passing the command line option -Dspark.mongodb.keep_alive_ms to the Spark option spark.driver.extraJavaOptions again only sets the value in the driver's JVM.
Passing the command line option to the Spark option spark.executor.extraJavaOptions results in an error message from Spark:
Exception in thread "main" java.lang.Exception: spark.executor.extraJavaOptions is not allowed to set Spark options (was '-Dspark.mongodb.keep_alive_ms=desiredValue'). Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit.
The Spark code that throws this error is located in org.apache.spark.SparkConf#validateSettings, where it checks for any worker option value that contains the string -Dspark.
This seems like an oversight in the design of the Mongo connector; either the property should be set through the Spark session (as I originally expected it to be), or it should be renamed to something that doesn't start with spark. I added this information to the JIRA ticket mentioned in the comments.
The core issue is that the MongoConnector uses a cache for MongoClients and follows the loan pattern for managing that cache. Once all loaned MongoClients are returned and the keep_alive_ms time has passed the MongoClient will be closed and removed from the cache.
Due to the nature of how RDDs are implemented (they follow Scala's lazy collection semantics), the following code: documentIdsIterator.map { documentId => ... } is only processed once the RDD is actioned. By that time the loaned MongoClient has already been returned to the cache, and after keep_alive_ms the MongoClient will be closed. This results in a "state should be: open" exception on the client.
How to solve?
Once SPARK-246 is fixed, you could set keep_alive_ms high enough that the MongoClient is not closed while the RDD is being processed. However, that still breaks the contract of the loan pattern that the MongoConnector uses, so it should be avoided.
Instead, reuse the MongoConnector to get the client as needed. That way the cache can still be used if a client is available, and should a client time out for any reason, a new one will be created automatically:
documentIdsRdd.mapPartitions { documentIdsIterator =>
  mongoConnector.withMongoClientDo { mongoClient =>
    // Do some expensive operation
    ...
    // Return the lazy collection
    documentIdsIterator.map { documentId =>
      // Loan the mongoClient
      mongoConnector.withMongoClientDo { mongoClient => ... }
    }
  }
}
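If you would rather not nest loans like that, another option (just a sketch, at the cost of buffering a whole partition's results in memory) is to materialize the results eagerly while the outer MongoClient is still on loan, so nothing lazy runs after the client has been handed back:
documentIdsRdd.mapPartitions { documentIdsIterator =>
  mongoConnector.withMongoClientDo { mongoClient =>
    val collection = mongoClient.getDatabase("databaseName").getCollection("collectionName")
    // Force evaluation inside the loan; the whole partition is held in memory
    // before the iterator is returned to Spark.
    documentIdsIterator.map { documentId =>
      collection.find(new Document("_id", documentId)).first()
    }.toVector.iterator
  }
}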
Connection objects are in general tightly bound to the context in which they were initialized. You cannot simply serialize such objects and pass them around. Instead, they should be initialized in place, inside mapPartitions:
val result: RDD[Document] = documentIdsRdd.mapPartitions { documentIdsIterator =>
  val mongoConnector: MongoConnector = MongoConnector(Map(
    "uri" -> ??? // The MongoDB URI.
    , "spark.mongodb.keep_alive_ms" -> "86400000"
    , "keep_alive_ms" -> "86400000"
  ))
  mongoConnector.withMongoClientDo { mongoClient =>
    ...
  }
}

Setting up and accessing Flink Queryable State (NullPointerException)

I am using Flink v1.4.0 and I have set up two distinct jobs. The first is a pipeline that consumes data from a Kafka Topic and stores them into a Queryable State (QS). Data are keyed by date. The second submits a query to the QS job and processes the returned data.
Both jobs were working fine with Flink v.1.3.2. But with the new update, everything has broken. Here is part of the code for the first job:
private void runPipeline() throws Exception {
    StreamExecutionEnvironment env = configurationEnvironment();
    QueryableStateStream<String, DataBucket> dataByDate = env.addSource(sourceDataFromKafka())
        .map(NewDataClass::new)
        .keyBy(data.date)
        .asQueryableState("QSName", reduceIntoSingleDataBucket());
}
and here is the code on client side:
QueryableStateClient client = new QueryableStateClient("localhost", 6123);

// the state descriptor of the state to be fetched.
ValueStateDescriptor<DataBucket> descriptor = new ValueStateDescriptor<>(
    "QSName",
    TypeInformation.of(new TypeHint<DataBucket>() {}));

jobId = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
String key = "2017-01-06";

CompletableFuture<ValueState<DataBucket>> resultFuture = client.getKvState(
    jobId,
    "QSName",
    key,
    BasicTypeInfo.STRING_TYPE_INFO,
    descriptor);

try {
    ValueState<DataBucket> valueState = resultFuture.get();
    DataBucket bucket = valueState.value();
    System.out.println(bucket.getLabel());
} catch (IOException | InterruptedException | ExecutionException e) {
    throw new RuntimeException("Unable to query bucket key: " + key, e);
}
I have followed the instructions as per the following link:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/queryable_state.html
making sure to enable queryable state on my Flink cluster by copying flink-queryable-state-runtime_2.11-1.4.0.jar from the opt/ folder of the Flink distribution into the lib/ folder, and I checked that it runs in the task manager.
I keep getting the following error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:84)
at org.apache.flink.api.common.state.StateDescriptor.initializeSerializerUnlessSet(StateDescriptor.java:253)
at org.apache.flink.queryablestate.client.QueryableStateClient.getKvState(QueryableStateClient.java:210)
at org.apache.flink.queryablestate.client.QueryableStateClient.getKvState(QueryableStateClient.java:174)
at com.company.dept.query.QuerySubmitter.main(QuerySubmitter.java:37)
Any idea what is happening? I think my requests don't reach the QS at all... I really don't know if and how I should change anything. Thanks.
So, as it turned out, it was 2 things that were causing this error. The first was the use of the wrong constructor for creating a descriptor on the client side. Rather than using the one that only takes as input a name for the QS and a TypeHint, I had to use another one where a keySerialiser along with a default value are provided as per below:
ValueStateDescriptor<DataBucket> descriptor = new ValueStateDescriptor<>(
    "QSName",
    TypeInformation.of(new TypeHint<DataBucket>() {}).createSerializer(new ExecutionConfig()),
    DataBucket.emptyBucket()); // or anything that can be used as a default value
The second was related to the host and port values. The port is different from v1.3.2, now set to 9069, and the host was also different in my case. You can verify both by checking the logs of any task manager for the line:
Started the Queryable State Proxy Server @ ....
Finally, in case you are here because you are looking to allow port-range for queryable state client proxy, I suggest you follow the respective issue (FLINK-7788) here: https://issues.apache.org/jira/browse/FLINK-7788.
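Putting it together, the client is then created against a task manager host and the proxy port rather than the JobManager port. A sketch (in Scala; the host is a placeholder, take the real host/port from the log line above):
// Placeholder host; use whatever "Started the Queryable State Proxy Server @ ..." reports
val client = new QueryableStateClient("task-manager-host", 9069)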

Store execution plan of Spark's dataframe

I am currently trying to store the execution plan of a Spark dataframe into HDFS (through the dataframe.explain(true) command).
The issue I am finding is that when I use the explain(true) command, I am able to see the output on the command line and in the logs; however, if I create a file (let's say a .txt) with the content of the dataframe's explain, the file appears empty.
I believe the issue relates to the configuration of Spark, but I am unable to find any information about this on the internet.
(for those who want to see more about the plan execution of the dataframes using the explain function please refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-dataset-operators.html#explain)
if I create a file (let's say a .txt) with the content of the dataframe's explain
How exactly did you try to achieve this?
explain writes its result to console, using println, and returns Unit, as can be seen in Dataset.scala:
def explain(extended: Boolean): Unit = {
  val explain = ExplainCommand(queryExecution.logical, extended = extended)
  sparkSession.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
    // scalastyle:off println
    r => println(r.getString(0))
    // scalastyle:on println
  }
}
So, unless you redirect the console output to write to your file (along with anything else printed to the console...), you won't be able to write explain's output to file.
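That said, if you want the plan as a string without redirecting the whole console, one workaround is to capture it on the driver. A sketch (assuming df is your DataFrame; it relies on explain printing via Scala's println and on the public queryExecution field):
import java.io.ByteArrayOutputStream

// Option 1: QueryExecution already renders the extended plan as a string
val planText: String = df.queryExecution.toString

// Option 2: capture whatever explain(true) prints to the console
val buffer = new ByteArrayOutputStream()
Console.withOut(buffer) {
  df.explain(true)
}
val capturedPlan: String = buffer.toString("UTF-8")
Either string can then be written out to HDFS with whichever file API you prefer.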
The best way I have found is to redirect the output to a file when you run the job. I have used the following command:
spark-shell --master yarn -i test.scala > getlogs.log
My Scala file has the following simple commands:
val df = sqlContext.sql("SELECT COUNT(*) FROM testtable")
df.explain(true)
exit()

Cache Slick DBIO Actions

I am trying to speed up "SELECT * FROM WHERE name=?" kind of queries in a Play! + Scala app. I am using Play 2.4 + Scala 2.11 + the play-slick-1.1.1 package. This package uses Slick 3.1.
My hypothesis was that Slick generates prepared statements from DBIO actions and they then get executed. So I tried to cache them by turning on the flag cachePrepStmts=true.
However, I still see "Preparing statement..." messages in the log, which means that the prepared statements are not getting cached! How should one instruct Slick to cache them?
If I run the following code, shouldn't the prepared statement be cached at some point?
for (i <- 1 until 100) {
  Await.result(db.run(doctorsTable.filter(_.userName === name).result), 10 seconds)
}
Slick config is as follows:
slick.dbs.default {
  driver = "slick.driver.MySQLDriver$"
  db {
    driver = "com.mysql.jdbc.Driver"
    url = "jdbc:mysql://localhost:3306/staging_db?useSSL=false&cachePrepStmts=true"
    user = "user"
    password = "passwd"
    numThreads = 1 // for now, just one thread in HikariCP
    properties = {
      cachePrepStmts = true
      prepStmtCacheSize = 250
      prepStmtCacheSqlLimit = 2048
    }
  }
}
Update 1
I tried the following as per @pawel's suggestion of using compiled queries:
val compiledQuery = Compiled { name: Rep[String] =>
  doctorsTable.filter(_.userName === name)
}

val stTime = TimeUtil.getUtcTime
for (i <- 1 until 100) {
  FutureUtils.blockFuture(db.run(compiledQuery(name).result), 10)
}
val endTime = TimeUtil.getUtcTime - stTime
Logger.info(s"Time Taken HERE $endTime")
In my logs I still see statements like:
2017-01-16 21:34:00,510 DEBUG [db-1] s.j.J.statement [?:?] Preparing statement: select ...
Also, the timing remains the same. What is the desired output? Should I not see these statements anymore? How can I verify whether prepared statements are indeed being reused?
You need to use Compiled queries - they do exactly what you want.
Just change the above code to:
val compiledQuery = Compiled { name: Rep[String] =>
  doctorsTable.filter(_.userName === name)
}

for (i <- 1 until 100) {
  Await.result(db.run(compiledQuery(name).result), 10 seconds)
}
I extracted name above as a parameter (because you usually want to vary some parameters in your prepared statements), but that's definitely an optional part.
For further information you can refer to: http://slick.lightbend.com/doc/3.1.0/queries.html#compiled-queries
For MySQL you need to set an additional jdbc flag, useServerPrepStmts=true
HikariCP's MySQL configuration page links to a quite useful document that provides some simple performance tuning configuration options for MySQL jdbc.
Here are a few that I've found useful (you'll need to append them to the JDBC url with & for options not exposed by Hikari's API); an example url is sketched after the list below. Be sure to read through the linked document and/or the MySQL documentation for each option; they should be mostly safe to use.
zeroDateTimeBehavior=convertToNull&characterEncoding=UTF-8
rewriteBatchedStatements=true
maintainTimeStats=false
cacheServerConfiguration=true
avoidCheckOnDuplicateKeyUpdateInSQL=true
dontTrackOpenResources=true
useLocalSessionState=true
cachePrepStmts=true
useServerPrepStmts=true
prepStmtCacheSize=500
prepStmtCacheSqlLimit=2048
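For example, appended to the JDBC url from the question (just a sketch; keep or drop individual flags after reading the linked docs):
url = "jdbc:mysql://localhost:3306/staging_db?useSSL=false&cachePrepStmts=true&useServerPrepStmts=true&prepStmtCacheSize=500&prepStmtCacheSqlLimit=2048&rewriteBatchedStatements=true"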
Also, note that statements are cached per thread; depending on what you set for Hikari connection maxLifetime and what server load is, memory usage will increase accordingly on both server and client (e.g. if you set connection max lifetime to just under MySQL default of 8 hours, both server and client will keep N prepared statements alive in memory for the life of each connection).
p.s. curious if bottleneck is indeed statement caching or something specific to Slick.
EDIT
To log statements, enable the query log. On MySQL 5.7 you would add the following to your my.cnf:
general-log=1
general-log-file=/var/log/mysqlgeneral.log
and then sudo touch /var/log/mysqlgeneral.log, followed by a restart of mysqld. Comment out the above config lines and restart to turn off query logging.