Zuul/Ribbon/Hystrix not retrying on different instance - spring-cloud

Background
I'm using Spring cloud Brixton.RC2, with Zuul and Eureka.
I have one gateway service annotated with @EnableZuulProxy and a book-service with a status method. Via configuration I can emulate work in the status method by sleeping for a configurable amount of time.
The Zuul route is simple
zuul.routes.foos.path=/foos/**
zuul.routes.foos.serviceId=reservation-service
I run two instances of the book-service. When I set the sleep time below the Hystrix timeout threshold (1000 ms), I can see requests going to both instances of the book service. This works well.
Problem
I understand that if the Hystrix command fails, it should be possible for Ribbon to retry the command on a different server. This should make the failure transparent to the client.
I read the Ribbon configuration and added the following configuration in Zuul:
zuul.routes.reservation-service.retryable=true // not sure which of these two applies
zuul.routes.foos.retryable=true // not sure which of these two applies
ribbon.MaxAutoRetries=0 // I don't want to retry on the same host; I also tried 1 and it doesn't work either
ribbon.MaxAutoRetriesNextServer=2
ribbon.OkToRetryOnAllOperations=true
Now I update the configuration so that only one service sleeps for more than 1 s, which means I have one healthy instance and one bad one.
When I call the gateway, the calls get sent to both instances, and half of them return a 500. In the gateway I see the Hystrix timeout:
com.netflix.zuul.exception.ZuulException: Forwarding error
[...]
Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: reservation-service timed-out and no fallback available.
[...]
Caused by: java.util.concurrent.TimeoutException: null
Why isn't Ribbon retrying the call on the other instance?
Am I missing something here?
References
Relates to this question (not solved)
Ribbon configuration
According to this commit, Zuul should support retries via Ribbon

By default Zuul uses the SEMAPHORE isolation strategy, which doesn't allow setting a timeout. I haven't been able to get load-balanced retries working with this strategy. What worked for me was (following your example):
1) Changing Zuul's isolation to THREAD:
hystrix:
  command:
    reservation-service:
      execution:
        isolation:
          strategy: THREAD
          thread:
            timeoutInMilliseconds: 100000
IMPORTANT: timeoutInMilliseconds: 100000 is effectively saying "no Hystrix timeout". Why? Because if Hystrix times out first, there won't be any load-balanced retry (I verified this by playing with timeoutInMilliseconds).
2) Then, configure Ribbon's ReadTimeout to the desired value:
reservation-service:
  ribbon:
    ReadTimeout: 800
    ConnectTimeout: 250
    OkToRetryOnAllOperations: true
    MaxAutoRetriesNextServer: 2
    MaxAutoRetries: 0
In this case, after the 1 s service times out in Ribbon, it retries against the 500 ms service.
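As a rough sanity check (a worked example assuming the worst case, where every attempt uses up both the connect and the read timeout), the Hystrix thread timeout has to stay above the total time Ribbon may spend across all retries:
(ConnectTimeout + ReadTimeout) * (MaxAutoRetries + 1) * (MaxAutoRetriesNextServer + 1)
= (250 + 800) * (0 + 1) * (2 + 1) = 3150 ms
So any timeoutInMilliseconds comfortably above ~3150 ms leaves Ribbon room to exhaust its retries before Hystrix aborts the command.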
Below is the log I got in my Zuul instance:
o.s.web.servlet.DispatcherServlet : DispatcherServlet with name 'dispatcherServlet' processing GET request for [/api/stories]
o.s.web.servlet.DispatcherServlet : Last-Modified value for [/api/stories] is: -1
c.n.zuul.http.HttpServletRequestWrapper : Path = null
c.n.zuul.http.HttpServletRequestWrapper : Transfer-Encoding = null
c.n.zuul.http.HttpServletRequestWrapper : Content-Encoding = null
c.n.zuul.http.HttpServletRequestWrapper : Content-Length header = -1
c.n.loadbalancer.ZoneAwareLoadBalancer : Zone aware logic disabled or there is only one zone
c.n.loadbalancer.LoadBalancerContext : storyteller-api using LB returned Server: localhost:7799 for request /api/stories
---> ATTEMPTING THE SLOW SERVICE
com.netflix.niws.client.http.RestClient : RestClient sending new Request(GET: ) http://localhost:7799/api/stories
c.n.http4.MonitoredConnectionManager : Get connection: {}->http://localhost:7799, timeout = 250
com.netflix.http4.NamedConnectionPool : [{}->http://localhost:7799] total kept alive: 1, total issued: 0, total allocated: 1 out of 200
com.netflix.http4.NamedConnectionPool : No free connections [{}->http://localhost:7799][null]
com.netflix.http4.NamedConnectionPool : Available capacity: 50 out of 50 [{}->http://localhost:7799][null]
com.netflix.http4.NamedConnectionPool : Creating new connection [{}->http://localhost:7799]
com.netflix.http4.NFHttpClient : Attempt 1 to execute request
com.netflix.http4.NFHttpClient : Closing the connection.
c.n.http4.MonitoredConnectionManager : Released connection is not reusable.
com.netflix.http4.NamedConnectionPool : Releasing connection [{}->http://localhost:7799][null]
com.netflix.http4.NamedConnectionPool : Notifying no-one, there are no waiting threads
--- HERE'S RIBBON'S TIMEOUT
c.n.l.reactive.LoadBalancerCommand : Got error com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out when executed on server localhost:7799
c.n.loadbalancer.ZoneAwareLoadBalancer : Zone aware logic disabled or there is only one zone
c.n.loadbalancer.LoadBalancerContext : storyteller-api using LB returned Server: localhost:9977 for request /api/stories
---> HERE IT RETRIES
com.netflix.niws.client.http.RestClient : RestClient sending new Request(GET: ) http://localhost:9977/api/stories
c.n.http4.MonitoredConnectionManager : Get connection: {}->http://localhost:9977, timeout = 250
com.netflix.http4.NamedConnectionPool : [{}->http://localhost:9977] total kept alive: 1, total issued: 0, total allocated: 1 out of 200
com.netflix.http4.NamedConnectionPool : Getting free connection [{}->http://localhost:9977][null]
com.netflix.http4.NFHttpClient : Stale connection check
com.netflix.http4.NFHttpClient : Attempt 1 to execute request
com.netflix.http4.NFHttpClient : Connection can be kept alive indefinitely
c.n.http4.MonitoredConnectionManager : Released connection is reusable.
com.netflix.http4.NamedConnectionPool : Releasing connection [{}->http://localhost:9977][null]
com.netflix.http4.NamedConnectionPool : Pooling connection [{}->http://localhost:9977][null]; keep alive indefinitely
com.netflix.http4.NamedConnectionPool : Notifying no-one, there are no waiting threads
o.s.web.servlet.DispatcherServlet : Null ModelAndView returned to DispatcherServlet with name 'dispatcherServlet': assuming HandlerAdapter completed request handling
o.s.web.servlet.DispatcherServlet : Successfully completed request
o.s.web.servlet.DispatcherServlet : DispatcherServlet with name 'dispatcherServlet' processing GET request for [/favicon.ico]
o.s.w.s.handler.SimpleUrlHandlerMapping : Matching patterns for request [/favicon.ico] are [/**/favicon.ico]
o.s.w.s.handler.SimpleUrlHandlerMapping : URI Template variables for request [/favicon.ico] are {}
o.s.w.s.handler.SimpleUrlHandlerMapping : Mapping [/favicon.ico] to HandlerExecutionChain with handler [ResourceHttpRequestHandler [locations=[ServletContext resource [/], class path resource [META-INF/resources/], class path resource [resources/], class path resource [static/], class path resource [public/], class path resource []], resolvers=[org.springframework.web.servlet.resource.PathResourceResolver#a0d875d]]] and 1 interceptor
o.s.web.servlet.DispatcherServlet : Last-Modified value for [/favicon.ico] is: -1
o.s.web.servlet.DispatcherServlet : Null ModelAndView returned to DispatcherServlet with name 'dispatcherServlet': assuming HandlerAdapter completed request handling
o.s.web.servlet.DispatcherServlet : Successfully completed request

Related

How to set up Narayana ConnectionManager so it doesn't stop after some transactions

I'm using Spring Boot, Spring Session, and JTA Narayana (Arjuna). I'm sending select and insert statements in a loop using two different threads.
The application runs correctly for some time, but after a number of transactions the Arjuna ConnectionManager fails to get a connection and generates the following exception:
2019-10-05 22:48:20.724 INFO 27032 --- [o-auto-1-exec-4] c.m.m.db.PrepareStatementExec : START select
2019-10-05 22:49:20.225 WARN 27032 --- [nsaction Reaper] com.arjuna.ats.arjuna : ARJUNA012117: TransactionReaper::check timeout for TX 0:ffffc0a82101:c116:5d989ef0:6e in state RUN
2019-10-05 22:49:20.228 WARN 27032 --- [Reaper Worker 0] com.arjuna.ats.arjuna : ARJUNA012095: Abort of action id 0:ffffc0a82101:c116:5d989ef0:6e invoked while multiple threads active within it.
2019-10-05 22:49:20.234 WARN 27032 --- [Reaper Worker 0] com.arjuna.ats.arjuna : ARJUNA012381: Action id 0:ffffc0a82101:c116:5d989ef0:6e completed with multiple threads - thread http-nio-auto-1-exec-10 was in progress with java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
com.arjuna.ats.internal.jdbc.ConnectionManager.create(ConnectionManager.java:134)
com.arjuna.ats.jdbc.TransactionalDriver.connect(TransactionalDriver.java:89)
java.sql.DriverManager.getConnection(DriverManager.java:664)
java.sql.DriverManager.getConnection(DriverManager.java:208)
com.mono.multidatasourcetest.db.PrepareStatementExec.executeUpdate(PrepareStatementExec.java:51)
The source code is on GitHub: https://github.com/saavedrah/multidataset-test
I'm wondering if the connection should be closed, or if I should change some settings in Arjuna to make the ConnectionManager work.
Although what you are showing is a stack trace printed by the Narayana BasicAction class (rather than an exception), the result for you is ultimately the same: you need to close your connections.
You should most likely add the close call near the place where you are making the getConnection calls, i.e. around https://github.com/saavedrah/multidataset-test/blob/cf910c345db079a4e910a071ac0690af28bd3f81/src/main/java/com/mono/multidatasourcetest/db/PrepareStatementExec.java#L38
e.g.
//connection = getConnection
//do something with it
//connection.close()
But as Connection is AutoCloseable you could just do:
try (Connection connection = DriverManager.getConnection(url, user, password)) { // url/user/password as in your existing code
    connection.doSomething();
} // connection is closed automatically when the block exits
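For illustration, here is a minimal, self-contained sketch of the same idea applied to both the connection and the statement (the JDBC URL, credentials and query are placeholders, not taken from your project):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CloseWithTryWithResources {
    // Placeholder connection details -- substitute the TransactionalDriver URL you already use.
    private static final String URL = "jdbc:arjuna:java:/comp/env/jdbc/myDataSource";

    public void runSelect() throws SQLException {
        // Both resources are closed automatically, in reverse order, when the
        // try block exits -- even if executeQuery() throws.
        try (Connection connection = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement statement = connection.prepareStatement("SELECT 1")) {
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    System.out.println(resultSet.getInt(1));
                }
            }
        }
    }
}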

Getting exception while doing block() on Mono object I got back from ReactiveMongoRepository object

I have a service that streams data to a second service, which receives the stream of objects and saves them to my MongoDB.
Inside my subscribe function on the Flux I get from the streaming service, I use the save method from the ReactiveMongoRepository interface.
When I try to use the block function to get the saved data, I get the following error:
2019-10-11 13:30:38.559 INFO 19584 --- [localhost:27017] org.mongodb.driver.connection : Opened connection [connectionId{localValue:1, serverValue:25}] to localhost:27017
2019-10-11 13:30:38.566 INFO 19584 --- [localhost:27017] org.mongodb.driver.cluster : Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 0, 1]}, minWireVersion=0, maxWireVersion=7, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=6218300}
2019-10-11 13:30:39.158 INFO 19584 --- [ctor-http-nio-4] quote-monitor-service : onNext(Quote(id=null, ticker=AAPL, price=164.8, instant=2019-10-11T10:30:38.800Z))
2019-10-11 13:30:39.411 INFO 19584 --- [ctor-http-nio-4] quote-monitor-service : cancel()
2019-10-11 13:30:39.429 INFO 19584 --- [ntLoopGroup-2-2] org.mongodb.driver.connection : Opened connection [connectionId{localValue:3, serverValue:26}] to localhost:27017
2019-10-11 13:30:39.437 WARN 19584 --- [ctor-http-nio-4] io.netty.util.ReferenceCountUtil : Failed to release a message: DefaultHttpContent(data: PooledSlicedByteBuf(freed), decoderResult: success)
io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
at io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:74) ~[netty-common-4.1.39.Final.jar:4.1.39.Final]
at io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:138) ~[netty-common-4.1.39.Final.jar:4.1.39.Final]
at
reactor.core.Exceptions$ErrorCallbackNotImplemented: java.lang.IllegalStateException: block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-4
Caused by: java.lang.IllegalStateException: block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-4
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:77) ~[reactor-core-3.2.12.RELEASE.jar:3.2.12.RELEASE]
at reactor.core.publisher.Mono.block(Mono.java:1494) ~[reactor-core-3.2.12.RELEASE.jar:3.2.12.RELEASE]
at
My code:
stockQuoteClient.getQuoteStream()
        .log("quote-monitor-service")
        .subscribe(quote -> {
            Mono<Quote> savedQuote = quoteRepository.save(quote);
            System.out.println("I saved a quote! Id: " + savedQuote.block().getId());
        });
After some digging I managed to get it to work, but I don't understand why it works now.
The new code:
stockQuoteClient.getQuoteStream()
        .log("quote-monitor-service")
        .subscribe(quote -> {
            Mono<Quote> savedQuote = quoteRepository.insert(quote);
            savedQuote.subscribe(result ->
                    System.out.println("I saved a quote! Id :: " + result.getId()));
        });
The definition of block(): Subscribe to this Mono and block indefinitely until a next signal is received.
The definition of subscribe(): Subscribe to this Mono and request unbounded demand.
Can someone help me understand why the block didn't work and the subscribe worked?
What am I missing here?
Blocking is bad, since it ties up a thread waiting for a response. It's very bad in a reactive framework, which has few threads at its disposal and is designed so that none of them should be unnecessarily blocked.
This is the very thing that reactive frameworks are designed to avoid, so in this case it simply stops you from doing it:
block()/blockFirst()/blockLast() are blocking, which is not supported in thread reactor-http-nio-4
Your new code, in contrast, works asynchronously. The thread isn't blocked, as nothing actually happens until the repository returns a value (and then the lambda that you passed to savedQuote.subscribe() is executed, printing your result to the console).
However, the new code still isn't optimal or idiomatic from a reactive-streams perspective, as you're doing all your logic in your subscribe method. The normal thing to do is to use a series of flatMap/map calls to transform the items in the stream, and use doOnNext() for side effects (such as printing out a value):
stockQuoteClient.getQuoteStream()
        .log("quote-monitor-service")
        .flatMap(quoteRepository::insert)
        .doOnNext(result -> System.out.println("I saved a quote! Id :: " + result.getId()))
        .subscribe();
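If you also want to handle failures without blocking, a small variation (the same stream, just passing an additional error consumer to subscribe()) would be:
stockQuoteClient.getQuoteStream()
        .log("quote-monitor-service")
        .flatMap(quoteRepository::insert)
        .subscribe(
                result -> System.out.println("I saved a quote! Id :: " + result.getId()),
                error -> System.err.println("Failed to save quote: " + error.getMessage()));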
If you're doing any serious amount of work with reactor / reactive streams, it would be worth reading up on them in general. They're very powerful for non-blocking work, but they do require a different way of thinking (and coding) than more "standard" Java.

Deploy Graylog on GKE

I'm having a hard time deploying Graylog on Google Kubernetes Engine. I'm using this configuration https://github.com/aliasmee/kubernetes-graylog-cluster with some minor modifications. My Graylog server is up but shows this error in the interface:
Error message
Request has been terminated
Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc.
Original Request
GET http://ES_IP:12900/system/sessions
Status code
undefined
Full error message
Error: Request has been terminated
Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc.
Graylog logs show nothing in particular other than this:
org.graylog.plugins.threatintel.tools.AdapterDisabledException: Spamhaus service is disabled, not starting (E)DROP adapter. To enable it please go to System / Configurations.
at org.graylog.plugins.threatintel.adapters.spamhaus.SpamhausEDROPDataAdapter.doStart(SpamhausEDROPDataAdapter.java:68) ~[?:?]
at org.graylog2.plugin.lookup.LookupDataAdapter.startUp(LookupDataAdapter.java:59) [graylog.jar:?]
at com.google.common.util.concurrent.AbstractIdleService$DelegateService$1.run(AbstractIdleService.java:62) [graylog.jar:?]
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
but at the end:
2019-01-16 13:35:00,255 INFO : org.graylog2.bootstrap.ServerBootstrap - Graylog server up and running.
The Elasticsearch health check is green, and there are no issues in the ES or Mongo logs.
I suspect a problem with the connection to Elasticsearch, though.
curl http://ip_address:9200/_cluster/health\?pretty
{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
After reading the tutorial you shared, I was able to identify that the kubelet needs to run with the --allow-privileged argument.
"Elasticsearch pods need for an init-container to run in privileged mode, so it can set some VM options. For that to happen, the kubelet should be running with args --allow-privileged, otherwise the init-container will fail to run."
It's not possible to customize or modify kubelet parameters/arguments on GKE; there is a feature request for that here: https://issuetracker.google.com/118428580, so it may be implemented in the future.
Also, if you modify the kubelet directly on the node(s), it's possible that the master resets the configuration, and it isn't guaranteed that the changes will persist.

Why must eureka.client.serviceUrl.defaultZone be provided in `bootstrap.properties` when configuring the Spring Cloud Config server in a discovery manner?

I was aiming to configure the location of the Spring Cloud Config server by setting spring.application.name, server.port, and eureka.client.serviceUrl.defaultZone in application.properties, together with spring.cloud.config.discovery.enabled=true and spring.cloud.config.discovery.service-id=cloud-config in bootstrap.properties, which turned out to be insufficient. The following error messages are shown in the log:
com.netflix.discovery.DiscoveryClient : DiscoveryClient_BOOTSTRAP/192.168.1.5:bootstrap - was unable to refresh its cache! status = Cannot execute request on any known server
No instances found of configserver (cloud-config)
According to the docs, I moved eureka.client.serviceUrl.defaultZone into bootstrap.properties and succeeded.
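For reference, the working bootstrap.properties then looks roughly like this (a sketch; the defaultZone URL is just a placeholder for your Eureka server):
# bootstrap.properties
spring.cloud.config.discovery.enabled=true
spring.cloud.config.discovery.service-id=cloud-config
eureka.client.serviceUrl.defaultZone=http://localhost:8761/eureka/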
My question is: if spring.application.name and server.port are essential for a Eureka client to register with the Eureka server, why can they be left out of bootstrap.properties for the config client?
I suspect that the config client first uses eureka.client.serviceUrl.defaultZone alone to connect to the Eureka server and fetch the service registrations (without registering itself), so that it can locate the config server and pull its configuration. After that, since the config client is also a Eureka client, it uses the relevant properties in application.properties to register with the Eureka server. As some evidence for my suspicion, I found the following logs during startup of the application:
2017-09-07 06:13:09.651 INFO [bootstrap,,,] 74104 --- [ restartedMain] com.netflix.discovery.DiscoveryClient : Getting all instance registry info from the eureka server
2017-09-07 06:13:09.817 INFO [bootstrap,,,] 74104 --- [ restartedMain] com.netflix.discovery.DiscoveryClient : The response status is 200
2017-09-07 06:13:09.821 INFO [bootstrap,,,] 74104 --- [ restartedMain] com.netflix.discovery.DiscoveryClient : Not registering with Eureka server per configuration
2017-09-07 06:13:37.427 INFO [-,,,] 74104 --- [ restartedMain] com.netflix.discovery.DiscoveryClient : Getting all instance registry info from the eureka server
Is that what is actually happening?

Spring Boot with server.contextPath set vs. URL to hystrix.stream via Eureka Server

I have a Eureka Server with a Turbine instance running and a few discovery clients connected to it. Everything works fine, but if I register a discovery client that has server.contextPath set, it doesn't get recognized by the InstanceMonitor and the Turbine stream is not able to combine its hystrix.stream.
This is how it looks in the logs of Eureka/Turbine server:
2015-02-12 06:56:23.265 INFO 1 --- [ Timer-0] c.n.t.discovery.InstanceObservable : Hosts up:3, hosts down: 0
2015-02-12 06:56:23.266 INFO 1 --- [ Timer-0] c.n.t.monitor.instance.InstanceMonitor : Url for host: http://user-service:8887/hystrix.stream default
2015-02-12 06:56:23.268 ERROR 1 --- [InstanceMonitor] c.n.t.monitor.instance.InstanceMonitor : Could not initiate connection to host, giving up: []
2015-02-12 06:56:23.269 WARN 1 --- [InstanceMonitor] c.n.t.monitor.instance.InstanceMonitor : Stopping InstanceMonitor for: user-service default
com.netflix.turbine.monitor.instance.InstanceMonitor$MisconfiguredHostException: []
at com.netflix.turbine.monitor.instance.InstanceMonitor.init(InstanceMonitor.java:318)
at com.netflix.turbine.monitor.instance.InstanceMonitor.access$100(InstanceMonitor.java:103)
at com.netflix.turbine.monitor.instance.InstanceMonitor$2.call(InstanceMonitor.java:235)
at com.netflix.turbine.monitor.instance.InstanceMonitor$2.call(InstanceMonitor.java:229)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It tries to get the hystrix stream from http://user-service:8887/hystrix.stream whereas the correct URL, including server.contextPath, should be http://user-service:8887/uaa/hystrix.stream
The application.yml of that client contains:
server:
  port: 8887
  contextPath: /uaa
security:
  ignored: /css/**,/js/**,/favicon.ico,/webjars/**
  basic:
    enabled: false
My question is: should I add some additional configuration options to this user-service discovery client so that it registers the proper hystrix.stream URL?
I haven't dug into this yet; I will let you know if I find something before getting an answer to this question.
Current solution
There is one problem when it comes to using server.contextPath and management.context-path. When both are set, the Turbine stream is served on ${HOST_URL}/${server.contextPath}/${management.context-path}/hystrix.stream. In that case I had to drop server.contextPath (I replaced it with a prefix in the controllers' @RequestMapping).
Now, when you use management.context-path, your hystrix.stream is served from the URL that uses it as a prefix. In that case you have to follow Spencer's suggestion and set
turbine.instanceUrlSuffix=/{PUT_YOUR_MANAGEMENT_CONTEXT_PATH_HERE}/hystrix.stream
And of course this management.context-path must be set to the same value for all your discovery clients - that can be done easily with Spring Cloud Config: http://cloud.spring.io/spring-cloud-config/spring-cloud-config.html
You can set turbine.instanceUrlSuffix.<CLUSTERNAME>=/uaa/hystrix.stream, where <CLUSTERNAME> is the value set in turbine.aggregator.clusterConfig. All of the config options from the Turbine 1 wiki work. You don't need to add the port to the suffix, as Spring Cloud Netflix Turbine adds the port from Eureka.
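Put together, a Turbine configuration along those lines might look roughly like this (a sketch only; the cluster and service names come from this example and should be adjusted to your own setup):
turbine:
  aggregator:
    clusterConfig: USER-SERVICE
  appConfig: user-service
  instanceUrlSuffix:
    USER-SERVICE: /uaa/hystrix.stream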