Setting up and accessing Flink Queryable State (NullPointerException) - streaming

I am using Flink v1.4.0 and I have set up two distinct jobs. The first is a pipeline that consumes data from a Kafka Topic and stores them into a Queryable State (QS). Data are keyed by date. The second submits a query to the QS job and processes the returned data.
Both jobs were working fine with Flink v.1.3.2. But with the new update, everything has broken. Here is part of the code for the first job:
private void runPipeline() throws Exception {
StreamExecutionEnvironment env = configurationEnvironment();
QueryableStateStream<String, DataBucket> dataByDate = env.addSource(sourceDataFromKafka())
.map(NewDataClass::new)
.keyBy(data.date)
.asQueryableState("QSName", reduceIntoSingleDataBucket());
}
and here is the code on client side:
QueryableStateClient client = new QueryableStateClient("localhost", 6123);
// the state descriptor of the state to be fetched.
ValueStateDescriptor<DataBucket> descriptor = new ValueStateDescriptor<>(
"QSName",
TypeInformation.of(new TypeHint<DataBucket>() {}));
jobId = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
String key = "2017-01-06";
CompletableFuture<ValueState<DataBucket> resultFuture = client.getKvState(
jobId,
"QSName",
key,
BasicTypeInfo.STRING_TYPE_INFO,
descriptor);
try {
ValueState<DataBucket> valueState = resultFuture.get();
DataBucket bucket = valueState.value();
System.out.println(bucket.getLabel());
} catch (IOException | InterruptionException | ExecutionException e) {
throw new RunTimeException("Unable to query bucket key: " + key , e);
}
I have followed the instructions as per the following link:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/state/queryable_state.html
making sure to enable the queryable state on my Flink cluster by including the flink-queryable-state-runtime_2.11-1.4.0.jar from the opt/ folder of your Flink distribution to the lib/ folder and checked it runs in the task manager.
I keep getting the following error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:84)
at org.apache.flink.api.common.state.StateDescriptor.initializeSerializerUnlessSet(StateDescriptor.java:253)
at org.apache.flink.queryablestate.client.QueryableStateClient.getKvState(QueryableStateClient.java:210)
at org.apache.flink.queryablestate.client.QueryableStateClient.getKvState(QueryableStateClient.java:174)
at com.company.dept.query.QuerySubmitter.main(QuerySubmitter.java:37)
Any idea of what is happening? I think that my requests don't reach the QS at all ... Really don't know if and how I should change anything. Thanks.

So, as it turned out, it was 2 things that were causing this error. The first was the use of the wrong constructor for creating a descriptor on the client side. Rather than using the one that only takes as input a name for the QS and a TypeHint, I had to use another one where a keySerialiser along with a default value are provided as per below:
ValueStateDescriptor<DataBucket> descriptor = new ValueStateDescriptor<>(
"QSName",
TypeInformation.of(new TypeHint<DataBucket>() {}).createSerializer(new ExecutionConfig()),
DataBucket.emptyBucket()); // or anything that can be used as a default value
The second was relevant to the host and port values. The port was different from v1.3.2 now set to 9069 and the localhost was also different in my case. You can verify both by checking the logs of any task manager for the line:
Started the Queryable State Proxy Server # ....
Finally, in case you are here because you are looking to allow port-range for queryable state client proxy, I suggest you follow the respective issue (FLINK-7788) here: https://issues.apache.org/jira/browse/FLINK-7788.

Related

Is it possible to create a batch flink job in streaming flink job?

I have a job streaming using Apache Flink (flink version: 1.8.1) using scala. there are flow job requirements as follows:
Kafka -> Write to Hbase -> Send to kafka again with a different topic
During the writing process to Hbase, there was a need to retrieve data from another table. To ensure that the data is not empty (NULL), the job must check repeatedly (within a certain time) if the data is empty.
is this possible with Flink? If yes, can you help provide examples for conditions similar to my needs?
Edit :
I mean, with the problem that I described in the content, I thought about having to create some kind of job batch in the job streaming, but I couldn't find the right example for my case. So, is it possible to create a batch flink job in streaming flink job? If yes, can you help provide examples for conditions similar to my needs?
With more recent versions of Flink you can do lookup queries (with a configurable cache) against HBase from the SQL/Table APIs. Your use case sounds like it might be easily implemented in this fashion. See the docs for more info.
Just to clarify my comment I will post a sketch of what I was trying to suggest based on The Broadcast State Pattern. The link provides an example in Java, so I will follow it. In case you want in Scala it should not be too much different. You will likely have to implement the below code as it is explained on the link that I mentioned:
DataStream<String> output = colorPartitionedStream
.connect(ruleBroadcastStream)
.process(
// type arguments in our KeyedBroadcastProcessFunction represent:
// 1. the key of the keyed stream
// 2. the type of elements in the non-broadcast side
// 3. the type of elements in the broadcast side
// 4. the type of the result, here a string
new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
// my matching logic
}
);
I was suggesting that you can collect the stream ruleBroadcastStream in fixed intervals from the database or whatever is your store. Instead of getting:
// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
.broadcast(ruleStateDescriptor);
like the web page says. You will need to add a source where you can schedule it to run every X minutes.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
BroadcastStream<Rule> ruleBroadcastStream = env
.addSource(new YourStreamSource())
.broadcast(ruleStateDescriptor);
public class YourStreamSource extends RichSourceFunction<YourType> {
private volatile boolean running = true;
#Override
public void run(SourceContext<YourType> ctx) throws Exception {
while (running) {
// TODO: yourData = FETCH DATA;
ctx.collect(yourData);
Thread.sleep("sleep for X minutes");
}
}
#Override
public void cancel() {
this.running = false;
}
}

How do I ensure that my Apache Spark setup code runs only once?

I'm writing a Spark job in Scala that reads in parquet files on S3, does some simple transforms, and then saves them to a DynamoDB instance. Each time it runs we need to create a new table in Dynamo so I've written a Lambda function which is responsible for table creation. The first thing my Spark job does is generates a table name, invokes my Lambda function (passing the new table name to it), waits for the table to be created, and then proceeds normally with the ETL steps.
However it looks as though my Lambda function is consistently being invoked twice. I cannot explain that. Here's a sample of the code:
def main(spark: SparkSession, pathToParquet: String) {
// generate a unique table name
val tableName = generateTableName()
// call the lambda function
val result = callLambdaFunction(tableName)
// wait for the table to be created
waitForTableCreation(tableName)
// normal ETL pipeline
var parquetRDD = spark.read.parquet(pathToParquet)
val transformedRDD = parquetRDD.map((row: Row) => transformData(row), encoder=kryo[(Text, DynamoDBItemWritable)])
transformedRDD.saveAsHadoopDataset(getConfiguration(tableName))
spark.sparkContext.stop()
}
The code to wait for table creation is pretty-straightforward, as you can see:
def waitForTableCreation(tableName: String) {
val client: AmazonDynamoDB = AmazonDynamoDBClientBuilder.defaultClient()
val waiter: Waiter[DescribeTableRequest] = client.waiters().tableExists()
try {
waiter.run(new WaiterParameters[DescribeTableRequest](new DescribeTableRequest(tableName)))
} catch {
case ex: WaiterTimedOutException =>
LOGGER.error("Timed out waiting to create table: " + tableName)
throw ex
case t: Throwable => throw t
}
}
And the lambda invocation is equally simple:
def callLambdaFunction(tableName: String) {
val myLambda = LambdaInvokerFactory.builder()
.lambdaClient(AWSLambdaClientBuilder.defaultClient)
.lambdaFunctionNameResolver(new LambdaByName(LAMBDA_FUNCTION_NAME))
.build(classOf[MyLambdaContract])
myLambda.invoke(new MyLambdaInput(tableName))
}
Like I said, when I run spark-submit on this code, it definitely does hit the Lambda function. But I can't explain why it hits it twice. The result is that I get two tables provisioned in DynamoDB.
The waiting step also seems to fail within the context of running this as a Spark job. But when I unit-test my waiting code it seems to work fine on its own. It successfully blocks until the table is ready.
At first I theorized that perhaps spark-submit was sending this code to all of the worker nodes and they were independently running the whole thing. Initially I had a Spark cluster with 1 master and 2 workers. However I tested this out on another cluster with 1 master and 5 workers, and there again it hit the Lambda function exactly twice, and then apparently failed to wait for table creation because it dies shortly after invoking the Lambdas.
Does anyone have any clues as to what Spark might be doing? Am I missing something obvious?
UPDATE: Here's my spark-submit args which are visible on the Steps tab of EMR.
spark-submit --deploy-mode cluster --class com.mypackage.spark.MyMainClass s3://my-bucket/my-spark-job.jar
And here's the code for my getConfiguration function:
def getConfiguration(tableName: String) : JobConf = {
val conf = new Configuration()
conf.set("dynamodb.servicename", "dynamodb")
conf.set("dynamodb.input.tableName", tableName)
conf.set("dynamodb.output.tableName", tableName)
conf.set("dynamodb.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
conf.set("dynamodb.regionid", "us-east-1")
conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
conf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
new JobConf(conf)
}
Also here is a Gist containing some of the exception logs I see when I try to run this.
Thanks #soapergem for adding logging and options. I add an answer (a try one) since it may be a little bit longer than a comment :)
To wrap-up:
nothing strange with spark-submit and configuration options
in https://gist.github.com/soapergem/6b379b5a9092dcd43777bdec8dee65a8#file-stderr-log you can see that the application is executed twice. It passes twice from an ACCEPTED to RUNNING state. And that's consistent with EMR defaults (How to prevent EMR Spark step from retrying?). To confirm that, you can check whether you have 2 tables created after executing the step (I suppose here that you're generating tables with dynamic names; a different name per execution which in case of retry should give 2 different names)
For your last question:
It looks like my code might work if I run it in "client" deploy mode, instead of "cluster" deploy mode? Does that offer any hints to anyone here?
For more information about the difference, please check https://community.hortonworks.com/questions/89263/difference-between-local-vs-yarn-cluster-vs-yarn-c.html In your case, it looks like the machine executing spark-submit in client mode has different IAM policies than the EMR jobflow. My supposition here is that your jobflow role is not allowed to dynamodb:Describe* and that's why you're getting the exception with 500 code (from your gist):
Caused by: com.amazonaws.services.dynamodbv2.model.ResourceNotFoundException: Requested resource not found: Table: EmrTest_20190708143902 not found (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: V0M91J7KEUVR4VM78MF5TKHLEBVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4243)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4210)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:1890)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1857)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:129)
at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:126)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
To confirm this hypothesis, you an execute your part creating the table and waiting for creation locally (no Spark code here, just a simple java command of your main function) and:
for the first execution ensure that you have all permissions. IMO it will be dynamodb:Describe* on Resources: * (if it's the reason, AFAIK you should use somthing Resources: Test_Emr* in production for principle of least privilege )
for the 2nd execution remove dynamodb:Describe* and check whether you're getting the same stack trace like in the gist
I encountered the same problem in cluster mode too (v2.4.0). I workaround it by launching my apps programmatically using SparkLauncher instead of using spark-submit.sh. You could move your lambda logic into your main method that starts your spark app like this:
def main(args: Array[String]) = {
// generate a unique table name
val tableName = generateTableName()
// call the lambda function
val result = callLambdaFunction(tableName)
// wait for the table to be created
waitForTableCreation(tableName)
val latch = new CountDownLatch(1);
val handle = new SparkLauncher(env)
.setAppResource("/path/to/spark-app.jar")
.setMainClass("com.company.SparkApp")
.setMaster("yarn")
.setDeployMode("cluster")
.setConf("spark.executor.instances", "2")
.setConf("spark.executor.cores", "2")
// other conf ...
.setVerbose(true)
.startApplication(new SparkAppHandle.Listener {
override def stateChanged(sparkAppHandle: SparkAppHandle): Unit = {
latch.countDown()
}
override def infoChanged(sparkAppHandle: SparkAppHandle): Unit = {
}
})
println("app is launching...")
latch.await()
println("app exited")
}
your spark job starts before the table is actually created because defining operations one by one doesn't mean they will wait until previous one is finished
you need to change the code so that block related to spark is starting after table is created, and in order to achieving it you have to either use for-comprehension that insures every step is finished or put your spark pipeline into the callback of waiter called after the table is created (if you have any, hard to tell)
you can also use andThen or simple map
the main point is that all the lines of code written in your main are executed one by one immediately without waiting for previous one to finish

How to set the offset.commit.policy to AlwaysCommitOffsetPolicy in debezium?

I created a Debezium Embedded engine to capture MySQL change data. I want to commit the offsets as soon as I can. In the code, the config is created including follows.
.with("offset.commit.policy",OffsetCommitPolicy.AlwaysCommitOffsetPolicy.class.getName())
Running this returns, java.lang.NoSuchMethodException: io.debezium.embedded.spi.OffsetCommitPolicy$AlwaysCommitOffsetPolicy.<init>(io.debezium.config.Configuration)
However, When I start the embedded engine with,
.with("offset.commit.policy",OffsetCommitPolicy.PeriodicCommitOffsetPolicy.class.getName()), the embedded engine works fine.
Note that the class OffsetCommitPolicy.PeriodicCommitOffsetPolicy constructor includes the config parameter while OffsetCommitPolicy.AlwaysCommitOffsetPolicy doesn't.
public PeriodicCommitOffsetPolicy(Configuration config) {
...
}
How to get the debezium embedded engine to use its AlwaysCommitOffsetPolicy?
Thanks for the report. This is partly bug (which we would appreciate if you could log into our Jira). You can solve this issue by calling a dedicated method embedded engine builder like `io.debezium.embedded.EmbeddedEngine.create().with(OffsetCommitPolicy.always())'
Tested with version 1.4.0Final:
new EmbeddedEngine.BuilderImpl() // create builder
.using(config) // regular config
.using(OffsetCommitPolicy.always()) // explicit commit policy
.notifying(this::handleEvent) // even procesor
.build(); // and finally build!

Azure Service Fabric - Clone Actors with state

Is there any way to clone an actor and its state and create the exact same actor, with the same actor Id and its corresponding state in another application in Azure Service Fabric? I've looked at the backup and restore, but it doesn't seem to do what I need.
We have several instances of the same application type running actors in production. We need this functionality for 2 reasons:
1. We need to combine 2 of the applications into one, so all of the actors will need to be re-created in their current state with their current ID's in the other instance.
2 . We would like to be able to clone production into a QA environment, which is on a different Azure Server, so we can test upgrades and new code in the exact state production is in.
Any help with this is much appreciated!
I don't think that there is built-in support for cloning. But I believe you can implement your own mechanism to do that - you can iterate over your actors to get their states and then pass it to another cluser.
To iterate:
`
var cancellationToken = new CancellationToken();
ContinuationToken continuationToken = null;
var actorProxyFactory = new ActorProxyFactory();
var actorServiceProxy = ActorServiceProxy.Create("fabric:/MyActorApp/MyActorService", partitionKey);
do
{
var queryResult = await actorServiceProxy.GetActorsAsync(continuationToken, cancellationToken);
foreach (var item in queryResult.Items)
{
var actor = actorProxyFactory.CreateActorProxy<IMyActor>("fabric:/MyActorApp/MyActorService", item.ActorId);
var state = await actor.GetState();
}
continuationToken = queryResult.ContinuationToken;
} while (continuationToken != null);
`
If you have a dynamic list of names for actor state, you can get all names using GetStateNamesAsync method:
IEnumerable<string> stateNames = await StateManager.GetStateNamesAsync();
Hope this will help.
For #1, you need to use Alex's solution or similar.
For #2, you can use the existing backup & restore mechanism. The target partition of a service you are restoring doesn't need to be the source partition you created the backup of.

Scala Netty is there any way to share a ReplayingDecoder

I am looking to open up multiple connections using a netty client bootstrap in order to parse messages coming from multiple sources. The messages all have the same format, however, due to the amount of data that needs to be processed, I must run each connection on separate threads (This is assuming netty creates a thread per client channel, which I couldn't find a reference for - if that's not the case, how would this be achieved?).
This is the code that I use to connect to the data server:
var b = new Bootstrap()
.group(group)
.channel(classOf[NioSocketChannel])
.handler(RawFeedChannelInitializer)
var ch1 = b.clone().connect(host, port).sync().channel();
var ch2 = b.clone().connect(host, port).sync().channel();
The initializer calls RawPacketDecoder, which extends ReplayingDecoder, and is defined here.
The code works well without #Sharable when opening a single connection, but for the purpose of my application I must connect to the same server multiple times.
This results in the runtime error #Sharable annotation is not allowed pointing to my RawPacketDecoder class.
I am not entirely sure on how to get past this issue, short of reimplementing in scala an instantiable class of ReplayingDecoder as my decoder based directly on ByteToMessageDecoder.
Any help would be greatly appreciated.
Note: I am using netty 4.0.32 Final
I found the solution in this StockExchange answer.
My issue was that I was using an object based ChannelInitializer (singleton), and ReplayingDecoder as well as ByteToMessageDecoder are not sharable.
My initializer was created as a scala object, and therefore a single instance allowed. Changing the initializer to a scala class and instantiating for each bootstrap clone solved the problem. I modified the bootstrap code above as follows:
var b = new Bootstrap()
.group(group)
.channel(classOf[NioSocketChannel])
//.handler(RawFeedChannelInitializer)
var ch1 = b.clone().handler(new RawFeedChannelInitializer()).connect(host, port).sync().channel();
var ch2 = b.clone().handler(new RawFeedChannelInitializer()).connect(host, port).sync().channel();
I am not sure whether this ensures multithreading as wanted but it does allow to split the data access into multiple connections to the feed server.
Edit Update: After performing additional research on the subject, I have determined that netty does in fact create a thread per channel; this was verified by printing to console after the creation of each channel:
println("No. of active threads: " + Thread.activeCount());
The output shows an incremental number as channels are created and associated with their respective threads.
By default NioEventLoopGroup uses 2*Num_CPU_cores threads as defined here:
DEFAULT_EVENT_LOOP_THREADS = Math.max(1, SystemPropertyUtil.getInt(
"io.netty.eventLoopThreads",
Runtime.getRuntime().availableProcessors() * 2));
This value can be overriden to something else by setting
val group = new NioEventLoopGroup(16)
and then using the group to create/setup the bootstrap.