Modify connector config in KafkaConnect before sending to task - apache-kafka

I'm writing a SinkConnector in Kafka Connect and hitting an issue. The connector has a configuration such as:
{
  "connector.class" : "a.b.ExampleFileSinkConnector",
  "tasks.max" : "1",
  "topics" : "mytopic",
  "maxFileSize" : "50"
}
I define the connector's config like this:
@Override
public ConfigDef config() {
    ConfigDef result = new ConfigDef();
    result.define("maxFileSize", Type.STRING, "10", Importance.HIGH, "size of file");
    return result;
}
In the connector, I build the task configurations like this:
@Override
public List<Map<String, String>> taskConfigs(int maxTasks) {
    List<Map<String, String>> result = new ArrayList<Map<String, String>>();
    for (int i = 0; i < maxTasks; i++) {
        Map<String, String> taskConfig = new HashMap<>();
        taskConfig.put("connectorName", connectorName);
        taskConfig.put("taskNumber", Integer.toString(i));
        taskConfig.put("maxFileSize", maxFileSize);
        result.add(taskConfig);
    }
    return result;
}
and all goes well.
However, if I add this line when building the task configuration in taskConfigs():
taskConfig.put("epoch", "123");
this breaks the whole infrastructure: all connectors are stopped and restarted in an endless loop.
There is no exception or error whatsoever in the connect log file that can help.
The only way to make it work is to add "epoch" in the connector config, which I don't want to do since it is an internal parameter that the connector has to send to the task. It is not intended to be exposed to the connector's users.
Another point I noticed: it is not possible to update the value of any connector config parameter, other than setting it back to its default value. Changing a parameter and sending it to the task produces the same behavior.
I would really appreciate any help on this issue.
EDIT: here is the code of SinkTask::start()
@Override
public void start(Map<String, String> taskConfig) {
    try {
        connectorName = taskConfig.get("connectorName");
        log.info("{} -- Task.start()", connectorName);
        fileNamePattern = taskConfig.get("fileNamePattern");
        rootDir = taskConfig.get("rootDir");
        fileExtension = taskConfig.get("fileExtension");
        maxFileSize = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("maxFileSize"));
        maxTimeMinutes = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("maxTimeMinutes"));
        maxNumRecords = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("maxNumRecords"));
        taskNumber = SimpleFileSinkConnector.parseIntegerConfig(taskConfig.get("taskNumber"));
        epochStart = SimpleFileSinkConnector.parseLongConfig(taskConfig.get("epochStart"));
        log.info("{} -- fileNamePattern: {}, rootDir: {}, fileExtension: {}, maxFileSize: {}, maxTimeMinutes: {}, maxNumRecords: {}, taskNumber: {}, epochStart: {}",
                connectorName, fileNamePattern, rootDir, fileExtension, maxFileSize, maxTimeMinutes, maxNumRecords, taskNumber, epochStart);
        if (taskNumber == 0) {
            checkTempFilesForPromotion();
        }
        computeInitialFilename();
        log.info("{} -- Task.start() END", connectorName);
    } catch (Exception e) {
        log.info("{} -- Task.start() EXCEPTION : {}", connectorName, e.getLocalizedMessage());
    }
}

We found the root cause of the issue. The Kafka Connect Framework is actually behaving as designed - the problem has to do with how we are trying to use the taskConfigs configuration framework.
The Problem
In our design, the FileSinkConnector sets an epoch in its start() lifecycle method, and this epoch is passed down to its tasks by way of the taskConfigs() lifecycle method. So each time the Connector's start() lifecycle method runs, different configuration is generated for the tasks - which is the problem.
Generating different configuration each time is a no-no. It turns out that the Connect Framework detects differences in configuration and will restart/rebalance upon detection - stopping and restarting the connector/task. That restart calls the stop() and start() methods of the connector ... which (of course) produces yet another configuration change (because of the new epoch), and the vicious cycle is on!
This was an interesting and unexpected issue ... due to a behavior in Connect that we had no appreciation for. This is the first time we tried to generate task configuration that was not a simple function of the connector configuration.
Note that this behavior in Connect is intentional and addresses real issues of dynamically-changing configuration - like a JDBC Sink Connector that spontaneously updates its configuration when it detects a new database table it wants to sink.
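For reference, a minimal sketch of the direction this points to (illustrative only, not our exact production code): keep taskConfigs() a pure function of the connector configuration, and let each task derive volatile values such as the epoch locally in its own start() method.
@Override
public List<Map<String, String>> taskConfigs(int maxTasks) {
    List<Map<String, String>> result = new ArrayList<>();
    for (int i = 0; i < maxTasks; i++) {
        Map<String, String> taskConfig = new HashMap<>();
        taskConfig.put("connectorName", connectorName);
        taskConfig.put("taskNumber", Integer.toString(i));
        taskConfig.put("maxFileSize", maxFileSize);
        // no epoch here: the same connector configuration always yields the same task
        // configuration, so Connect sees no difference and does not trigger a rebalance
        result.add(taskConfig);
    }
    return result;
}
// and in SinkTask.start(), instead of reading an epoch from taskConfig:
epochStart = System.currentTimeMillis();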
Thanks to those who helped us!

Related

Kafka listener, get all messages

Good day colleagues.
I have a Kafka project using Spring Kafka that listens to a specific topic.
Once a day I need to read all the messages, put them into a collection and find a specific message there.
I couldn't understand how to read all messages in one @KafkaListener method.
My class is:
@Component
public class KafkaIntervalListener {
    public CountDownLatch intervalLatch = new CountDownLatch(1);
    private final SCDFRunnerService scdfRunnerService;

    public KafkaIntervalListener(SCDFRunnerService scdfRunnerService) {
        this.scdfRunnerService = scdfRunnerService;
    }

    @KafkaListener(topics = "${kafka.interval-topic}", containerFactory = "intervalEventKafkaListenerContainerFactory")
    public void intervalListener(IntervalEvent event) throws UnsupportedEncodingException, JSONException {
        System.out.println("Received interval message: " + event);
        IntervalType type = event.getType();
        Instant instant = event.getInterval();
        List<IntervalEvent> events = new ArrayList<>();
        events.add(event);
        events.size();
        this.intervalLatch.countDown();
    }
}
My events collection always has size = 1;
I tried to use different loops, but then my collection got filled with the same message 530,000,000 times.
UPDATE:
I have found a way to do it with factory.setBatchListener(true); but I need to launch it with @Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow"). Right now the listener method is always listening. Now I am trying something like this:
@Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow")
public void run() throws Exception {
    kafkaIntervalListener.intervalLatch.await();
}
It doesn't work; in debug mode my breakpoint on this line is never hit.
The listener container is, by design, message-driven.
For fetching messages on-demand, it's better to use the Kafka Consumer API directly and fetch messages using the poll() method.
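A rough sketch of what that could look like, scheduled once a day (the bootstrap servers, group id, topic name and String deserialization below are placeholders for illustration; the question's IntervalEvent would need a suitable value deserializer):
// Sketch: read whatever is currently in the topic on demand, from the beginning.
// Uses the plain org.apache.kafka.clients.consumer API; property values are placeholders.
@Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow")
public void readAllMessages() {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "interval-daily-reader");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    List<String> messages = new ArrayList<>();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList("interval-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            if (records.isEmpty()) {
                break; // caught up (an empty first poll can also just mean a slow group join)
            }
            for (ConsumerRecord<String, String> record : records) {
                messages.add(record.value());
            }
        }
    }
    // search 'messages' for the specific message here
}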

Distributed state-machine's zookeeper ensemble fails while processing parallel regions with error KeeperErrorCode = BadVersion

Background:
Diagram: state machine UML state diagram (image)
We have a normal state machine, as depicted in the diagram, that monitors a Spring Batch micro-service (deployed on a streams source/processor/sink design) for each batch that is started.
We receive a sequence of REST calls that internally fire events per batch id on the respective batch's machine object, i.e. a new state machine object is created per batch id.
Each machine has n parallel regions (representing Spring Batch's chunks), as shown in the diagram.
The REST calls arrive from a multi-threaded environment, where two simultaneous calls for the same batchId may come in for different region ids of the BATCHPROCESSING state.
Until now we had this state machine micro-service running as a single node (single installation), but now we want to deploy it on multiple instances to receive REST calls.
For this, we want to introduce the Distributed State Machine. We have the following configuration in place for running it:
@Configuration
@EnableStateMachine
public class StateMachineUMLWayConfiguration extends StateMachineConfigurerAdapter<String, String> {
    ..
    ..
    @Override
    public void configure(StateMachineModelConfigurer<String, String> model) throws Exception {
        model
            .withModel()
                .factory(stateMachineModelFactory());
    }

    @Bean
    public StateMachineModelFactory<String, String> stateMachineModelFactory() {
        StorehubBatchUmlStateMachineModelFactory factory = null;
        try {
            factory = new StorehubBatchUmlStateMachineModelFactory(templateUMLInClasspath, stateMachineEnsemble());
        } catch (Exception e) {
            LOGGER.info("Config's State machine factory got exception :" + factory);
        }
        LOGGER.info("Config's State machine factory method Called:" + factory);
        factory.setStateMachineComponentResolver(stateMachineComponentResolver());
        return factory;
    }

    @Override
    public void configure(StateMachineConfigurationConfigurer<String, String> config) throws Exception {
        config
            .withDistributed()
                .ensemble(stateMachineEnsemble());
    }

    @Bean
    public StateMachineEnsemble<String, String> stateMachineEnsemble() throws Exception {
        return new ZookeeperStateMachineEnsemble<String, String>(curatorClient(), "/batchfoo1", true, 512);
    }

    @Bean
    public CuratorFramework curatorClient() throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .defaultData(new byte[0])
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .connectString("localhost:2181")
                .build();
        client.start();
        return client;
    }
StorehubBatchUmlStateMachineModelFactory's build method:
@Override
public StateMachineModel<String, String> build(String batchChunkId) {
    Model model = null;
    try {
        model = UmlUtils.getModel(getResourceUri(resolveResource(batchChunkId)).getPath());
    } catch (IOException e) {
        throw new IllegalArgumentException("Cannot build model from resource " + resource + " or location " + location, e);
    }
    UmlModelParser parser = new UmlModelParser(model, this);
    DataHolder dataHolder = parser.parseModel();
    ConfigurationData<String, String> configurationData = new ConfigurationData<String, String>(
            null, new SyncTaskExecutor(), new ConcurrentTaskScheduler(), false, stateMachineEnsemble,
            new ArrayList<StateMachineListener<String, String>>(), false,
            null, null, null, null, false,
            null, batchChunkId, null, null);
    return new DefaultStateMachineModel<String, String>(configurationData, dataHolder.getStatesData(), dataHolder.getTransitionsData());
}
We created a new custom service-interface-level method in place of DefaultStateMachineService.acquireStateMachine(machineId):
@Override
public StateMachine<String, String> acquireDistributedStateMachine(String machineId, boolean start) {
    synchronized (distributedMachines) {
        DistributedStateMachine<String, String> distributedStateMachine = distributedMachines.get(machineId);
        StateMachine<String, String> distMachineDelegateX = null;
        if (distributedStateMachine == null) {
            StateMachine<String, String> machine = stateMachineFactory.getStateMachine(machineId);
            distributedStateMachine = (DistributedStateMachine<String, String>) machine;
        }
        distributedMachines.put(machineId, distributedStateMachine);
        return handleStart(distributedStateMachine, start);
    }
}
Problem:
The micro-service deployed on a single instance runs successfully, even when the events it receives come from a multi-threaded environment where one thread hits it with the event REST call belonging to region 1 while another thread simultaneously comes in for region 2 of the same batch. The machine progresses in sync, with the parallel regions processed successfully, up to its last state, i.e. BATCHCOMPLETED.
We also checked on the Zookeeper side that, at the end, the BATCHCOMPLETED state was recorded in the node's current version.
But when, besides the 1st instance, we deploy the same micro-service app-jar in another location to act as a 2nd instance that is also running to accept event REST calls (say, listening on another Tomcat port, 9002), it fails somewhere in the middle, randomly. This failure happens randomly after any one of the events among the parallel regions is fired, when ensemble.setState() is called internally on the state change of that event.
It gives the following error:
o.s.s.support.AbstractStateMachine : Interceptors threw exception, skipping state change
org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
at org.springframework.statemachine.zookeeper.ZookeeperStateMachineEnsemble.setState(ZookeeperStateMachineEnsemble.java:241) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
at org.springframework.statemachine.ensemble.DistributedStateMachine$LocalStateMachineInterceptor.preStateChange(DistributedStateMachine.java:209) ~[spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.StateMachineInterceptorList.preStateChange(StateMachineInterceptorList.java:101) ~[spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.callPreStateChangeInterceptors(AbstractStateMachine.java:859) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.switchToState(AbstractStateMachine.java:880) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.access$500(AbstractStateMachine.java:81) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine$3.transit(AbstractStateMachine.java:335) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.handleTriggerTrans(DefaultStateMachineExecutor.java:286) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.handleTriggerTrans(DefaultStateMachineExecutor.java:211) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.processTriggerQueue(DefaultStateMachineExecutor.java:449) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.access$200(DefaultStateMachineExecutor.java:65) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor$1.run(DefaultStateMachineExecutor.java:323) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50) [spring-core-4.3.13.RELEASE.jar!/:4.3.13.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.scheduleEventQueueProcessing(DefaultStateMachineExecutor.java:352) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.DefaultStateMachineExecutor.execute(DefaultStateMachineExecutor.java:163) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.sendEventInternal(AbstractStateMachine.java:603) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.support.AbstractStateMachine.sendEvent(AbstractStateMachine.java:218) [spring-statemachine-core-2.0.0.RELEASE.jar!/:2.0.0.RELEASE]
at org.springframework.statemachine.ensemble.DistributedStateMachine.sendEvent(DistributedStateMachine.java:108)
..skipping Lines....
Caused by: org.springframework.statemachine.StateMachineException: Error persisting data; nested exception is org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
at org.springframework.statemachine.zookeeper.ZookeeperStateMachinePersist.write(ZookeeperStateMachinePersist.java:113) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
at org.springframework.statemachine.zookeeper.ZookeeperStateMachinePersist.write(ZookeeperStateMachinePersist.java:50) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
at org.springframework.statemachine.zookeeper.ZookeeperStateMachineEnsemble.setState(ZookeeperStateMachineEnsemble.java:235) ~[spring-statemachine-zookeeper-2.0.1.RELEASE.jar!/:2.0.1.RELEASE]
... 73 common frames omitted
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1006) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910) ~[zookeeper-3.4.8.jar!/:3.4.8--1]
at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
Question:
1. Does the configuration mentioned above need something more to avoid the exception shown above?
Both state-machine micro-service instances were tested both in the case where they connect to the same Zookeeper instance, i.e. the same .connectString("localhost:2181").build(), and in the case where they connect to different Zookeeper instances (i.e. 'localhost:2181' and 'localhost:2182').
The same BadVersion exception occurs during the state machine ensemble's processing in both cases.
2. If batches run in parallel, their respective machines need to be created and run in parallel on the state-machine micro-service side.
So technically we need a new state machine per new batchId, running simultaneously.
But looking at ZookeeperStateMachineEnsemble, one znode path seems to be associated with one ensemble, since the ensemble object is instantiated once in the main config class (StateMachineUMLWayConfiguration).
So is it expected to use only that singleton ensemble instance? Can't multiple ensembles be created at run-time, referencing different znode paths, and run in parallel to log their respective distributed state machines' states to their respective znode paths? (See the sketch after point b below.)
a. Batches running in parallel would need separate znode paths to be created. So, because of our attempt to keep a separate znode path per batch, we need a separate ensemble to be instantiated per batch's machine. But that seems to run into a lock condition while getting a connection to the znode through the Curator client.
b. The REST call fired to trigger the event does not complete, as the machine it acquired is stuck waiting for the ensemble to connect.
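For reference, the per-batch-ensemble idea from point (a) amounts to something like the sketch below. This is untested and purely illustrative: whether Spring Statemachine is meant to support building ZookeeperStateMachineEnsemble instances on the fly like this is exactly what we are asking, and the znode path naming is our own assumption.
// Untested sketch: one ensemble per batch, each under its own znode path.
// The lifecycle calls are needed because the ensemble is no longer a container-managed @Bean.
public StateMachineEnsemble<String, String> ensembleForBatch(String batchId) throws Exception {
    ZookeeperStateMachineEnsemble<String, String> ensemble =
            new ZookeeperStateMachineEnsemble<String, String>(curatorClient(), "/batchfoo/" + batchId, true, 512);
    ensemble.afterPropertiesSet(); // normally invoked by Spring when the ensemble is a @Bean
    ensemble.start();
    return ensemble;
}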
Thanks in advance.

MassTransit 3 How to send a message explicitly to the error queue

I'm using MassTransit with Reactive Extensions to stream messages from the queue in batches. Since the behaviour isn't the same as a normal consumer, I need to be able to send a message to the error queue if it fails x number of times.
I've looked through the MassTransit source code, posted on the Google Groups, and can't find an answer.
Is this available on the ConsumeContext interface? Or is this even possible?
Here is my code. I've removed some of it to make it simpler.
_busControl = Bus.Factory.CreateUsingRabbitMq(cfg =>
{
    var host = cfg.Host(new Uri("rabbitmq://localhost/"), h =>
    {
        h.Username("guest");
        h.Password("guest");
    });
    cfg.UseInMemoryScheduler();
    cfg.ReceiveEndpoint(host, "customer_update_queue", e =>
    {
        var _observer = new ObservableObserver<ConsumeContext<Customer>>();
        _observer.Buffer(TimeSpan.FromMilliseconds(1000)).Subscribe(OnNext);
        e.Observer(_observer);
    });
});

private void OnNext(IList<ConsumeContext<Customer>> messages)
{
    foreach (var consumeContext in messages)
    {
        Console.WriteLine("Content: " + consumeContext.Message.Content);
        if (consumeContext.Message.RetryCount > 3)
        {
            // I want to be able to send to the error queue
            consumeContext.SendToErrorQueue();
        }
    }
}
I've found a workaround by using the RabbitMQ client mixed with MassTransit. Since I can't throw an exception when using an Observable, no error queue is created, so I create it manually using the RabbitMQ client as below.
ConnectionFactory factory = new ConnectionFactory();
factory.HostName = "localhost";
factory.UserName = "guest";
factory.Password = "guest";

using (IConnection connection = factory.CreateConnection())
{
    using (IModel model = connection.CreateModel())
    {
        string exchangeName = "customer_update_queue_error";
        string queueName = "customer_update_queue_error";
        string routingKey = "";
        model.ExchangeDeclare(exchangeName, ExchangeType.Fanout);
        model.QueueDeclare(queueName, false, false, false, null);
        model.QueueBind(queueName, exchangeName, routingKey);
    }
}
Then, for the send part, I send the message directly to that error queue when it has failed x number of times, like so:
consumeContext.Send(new Uri("rabbitmq://localhost/customer_update_queue_error"), consumeContext.Message);
Hopefully the batch feature will be implemented soon and I can use that instead.
https://github.com/MassTransit/MassTransit/issues/800

Pax Exam how to start multiple containers

For a project I'm working on, we need to write Pax Exam integration tests which run over multiple Karaf containers.
The idea would be to find a way to extend/configure Pax Exam to start up one (or more) Karaf containers and deploy a bunch of bundles there, and then start the test Karaf container which will then test the functionality.
We need this to verify performance tests and other things.
Does someone know anything about that? Is that actually possible in Pax Exam?
I'm answering my own question, after having found this interesting article.
In particular, have a look at the sections Using the Karaf Shell and Distributed integration tests in Karaf:
http://planet.jboss.org/post/advanced_integration_testing_with_pax_exam_karaf
This is basically what the article says:
First of all, you have to change the test probe header to allow dynamic package imports:
@ProbeBuilder
public TestProbeBuilder probeConfiguration(TestProbeBuilder probe) {
    probe.setHeader(Constants.DYNAMICIMPORT_PACKAGE, "*;status=provisional");
    return probe;
}
After that, the article suggests the following code, which is able to execute commands in the Karaf shell:
@Inject
CommandProcessor commandProcessor;

// The article assumes an executor and a timeout are defined elsewhere in the test class, for example:
private final ExecutorService executor = Executors.newCachedThreadPool();
private static final long COMMAND_TIMEOUT = 10000L; // milliseconds, example value

protected String executeCommands(final String... commands) {
    String response;
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    final PrintStream printStream = new PrintStream(byteArrayOutputStream);
    final CommandSession commandSession = commandProcessor.createSession(System.in, printStream, System.err);
    FutureTask<String> commandFuture = new FutureTask<String>(
            new Callable<String>() {
                public String call() {
                    try {
                        for (String command : commands) {
                            System.err.println(command);
                            commandSession.execute(command);
                        }
                    } catch (Exception e) {
                        e.printStackTrace(System.err);
                    }
                    return byteArrayOutputStream.toString();
                }
            });
    try {
        executor.submit(commandFuture);
        response = commandFuture.get(COMMAND_TIMEOUT, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
        e.printStackTrace(System.err);
        response = "SHELL COMMAND TIMED OUT: ";
    }
    return response;
}
Then the rest is fairly trivial: you will have to implement a layer able to start up child instances of Karaf.
public void createInstances() {
    // Install broker feature that is provided by FuseESB
    executeCommands("admin:create --feature broker brokerChildInstance");
    // Install producer feature provided by an imaginary feature repo
    executeCommands("admin:create --featureURL mvn:imaginary/repo/1.0/xml/features --feature producer producerChildInstance");
    // Install consumer feature provided by an imaginary feature repo
    executeCommands("admin:create --featureURL mvn:imaginary/repo/1.0/xml/features --feature consumer consumerChildInstance");
    // Start the child instances
    executeCommands("admin:start brokerChildInstance");
    executeCommands("admin:start producerChildInstance");
    executeCommands("admin:start consumerChildInstance");
    // You will need to destroy the child instances once you are done.
    // Using @After seems the right place to do that.
}
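For completeness: all of the above assumes the test itself already runs inside a Pax Exam Karaf container. A typical setup for that is sketched below; it is only an outline, and the Karaf artifact coordinates and unpack directory are placeholder assumptions.
// Sketch of a Pax Exam @Configuration that boots the parent Karaf test container.
// Artifact coordinates and the unpack directory are placeholders.
@org.ops4j.pax.exam.Configuration
public Option[] config() {
    return new Option[] {
        KarafDistributionOption.karafDistributionConfiguration()
            .frameworkUrl(CoreOptions.maven("org.apache.karaf", "apache-karaf").type("zip").versionAsInProject())
            .unpackDirectory(new File("target/exam"))
            .useDeployFolder(false),
        KarafDistributionOption.keepRuntimeFolder()
    };
}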

When is a started service not a started service? (SQL Express)

We require programmatic access to a SQL Server Express service as part of our application. Depending on what the user is trying to do, we may have to attach a database, detach a database, back one up, etc. Sometimes the service might not be started before we attempt these operations. So we need to ensure the service is started. Here is where we are running into problems. Apparently the ServiceController.WaitForStatus(ServiceControllerStatus.Running) returns prematurely for SQL Server Express. What is really puzzling is that the master database seems to be immediately available, but not other databases. Here is a console application to demonstrate what I am talking about:
namespace ServiceTest
{
    using System;
    using System.Data.SqlClient;
    using System.Diagnostics;
    using System.ServiceProcess;
    using System.Threading;

    class Program
    {
        private static readonly ServiceController controller = new ServiceController("MSSQL$SQLEXPRESS");
        private static readonly Stopwatch stopWatch = new Stopwatch();

        static void Main(string[] args)
        {
            stopWatch.Start();
            EnsureStop();
            Start();
            OpenAndClose("master");
            EnsureStop();
            Start();
            OpenAndClose("AdventureWorksLT");
            Console.ReadLine();
        }

        private static void EnsureStop()
        {
            Console.WriteLine("EnsureStop enter, {0:N0}", stopWatch.ElapsedMilliseconds);
            if (controller.Status != ServiceControllerStatus.Stopped)
            {
                controller.Stop();
                controller.WaitForStatus(ServiceControllerStatus.Stopped);
                Thread.Sleep(5000); // really, really make sure it stopped ... this has a problem too.
            }
            Console.WriteLine("EnsureStop exit, {0:N0}", stopWatch.ElapsedMilliseconds);
        }

        private static void Start()
        {
            Console.WriteLine("Start enter, {0:N0}", stopWatch.ElapsedMilliseconds);
            controller.Start();
            controller.WaitForStatus(ServiceControllerStatus.Running);
            // Thread.Sleep(5000);
            Console.WriteLine("Start exit, {0:N0}", stopWatch.ElapsedMilliseconds);
        }

        private static void OpenAndClose(string database)
        {
            Console.WriteLine("OpenAndClose enter, {0:N0}", stopWatch.ElapsedMilliseconds);
            var connection = new SqlConnection(string.Format(@"Data Source=.\SQLEXPRESS;initial catalog={0};integrated security=SSPI", database));
            connection.Open();
            connection.Close();
            Console.WriteLine("OpenAndClose exit, {0:N0}", stopWatch.ElapsedMilliseconds);
        }
    }
}
On my machine, this will consistently fail as written. Notice that the connection to "master" has no problems; only the connection to the other database. (You can reverse the order of the connections to verify this.) If you uncomment the Thread.Sleep in the Start() method, it will work fine.
Obviously I want to avoid an arbitrary Thread.Sleep(). Besides the rank code smell, what arbitrary value would I put there? The only thing we can think of is to make dummy connections to our target database in a while loop, catching the SqlException thrown and trying again until it works. But I'm thinking there must be a more elegant solution out there to know when the service is really ready to be used. Any ideas?
EDIT: Based on feedback provided below, I added a check on the status of the database. However, it is still failing. It looks like even the state is not reliable. Here is the function I am calling before OpenAndClose(string):
private static void WaitForOnline(string database)
{
    Console.WriteLine("WaitForOnline start, {0:N0}", stopWatch.ElapsedMilliseconds);
    using (var connection = new SqlConnection(string.Format(@"Data Source=.\SQLEXPRESS;initial catal
    using (var command = connection.CreateCommand())
    {
        connection.Open();
        try
        {
            command.CommandText = "SELECT [state] FROM sys.databases WHERE [name] = @DatabaseName";
            command.Parameters.AddWithValue("@DatabaseName", database);
            byte databaseState = (byte)command.ExecuteScalar();
            Console.WriteLine("databaseState = {0}", databaseState);
            while (databaseState != OnlineState)
            {
                Thread.Sleep(500);
                databaseState = (byte)command.ExecuteScalar();
                Console.WriteLine("databaseState = {0}", databaseState);
            }
        }
        finally
        {
            connection.Close();
        }
    }
    Console.WriteLine("WaitForOnline exit, {0:N0}", stopWatch.ElapsedMilliseconds);
}
I found another discussion dealing with a similar problem. Apparently the solution is to check the sys.database_files of the database in question. But that, of course, is a chicken-and-egg problem. Any other ideas?
Service start != database start.
The service is started when the SQL Server process is running and has responded to the SCM that it is 'alive'. After that, the server will start bringing user databases online. As part of this process, it runs the recovery process on each database to ensure transactional consistency. Recovery of a database can last anywhere from microseconds to whole days; it depends on the amount of log to be redone and the speed of the disk(s).
After the SCM returns that the service is running, you should connect to 'master' and check your database status in sys.databases. Only when the status is ONLINE can you proceed to open it.