Mesos Kafka task failed memory limit - apache-kafka

I am going to set up a kafka cluster on apache mesos.
I follow the instruction at kafka-mesos on github. I installed a mesos cluster (using Mesosphere without Marathon) with 3 nodes each with 2 CPUs and 4GB memory. I tested the cluster with hello world examples successfully.
I can run kafka-mesos scheduler on it and can add brokers to it.
But when i want to start the broker, an memory limit issued appear.
broker-191-.... TASK_FAILED slave:#c3-S1 reason:REASON_MEMORY_LIMIT
Although, the cluster has 12GB memory, but broker just need 3GB memory with 1GB heap. (I test it with various configuration from 512M till 3GB, but not worked)
What is the problem? and what is the solution?
the complete story is here:
2015-10-17 17:39:24,748 [Jetty-17] INFO ly.stealth.mesos.kafka.HttpServer$ - handling - http://192.168.11.191:7000/api/broker/start
2015-10-17 17:39:28,202 [Thread-605] INFO ly.stealth.mesos.kafka.Scheduler$ - [resourceOffers]
mesos-2#O1160 cpus:2.00 mem:4098.00 disk:9869.00 ports:[31000..32000]
mesos-3#O1161 cpus:2.00 mem:4098.00 disk:9869.00 ports:[31000..32000]
mesos-1#O1162 cpus:2.00 mem:4098.00 disk:9869.00 ports:[31000..32000]
2015-10-17 17:39:28,204 [Thread-605] INFO ly.stealth.mesos.kafka.Scheduler$ - Starting broker 191: launching task broker-191-0abe9e57-b0fb-4d87-a1b4-529acb111940 by offer mesos-2#O1160
broker-191-0abe9e57-b0fb-4d87-a1b4-529acb111940 slave:#c6-S3 cpus:1.00 mem:3096.00 ports:[31000..31000] data:defaults=broker.id\=191\,log.dirs\=kafka-logs\,port\=31000\,zookeeper.connect\=192.168.11.191:2181\\\,192.168.11.192:2181\\\,192.168.11.193:2181\,host.name\=mesos-2\,log.retention.bytes\=10737418240,broker={"stickiness" : {"period" : "10m"\, "stopTime" : "2015-10-17 13:43:29.278"}\, "id" : "191"\, "mem" : 3096\, "cpus" : 1.0\, "heap" : 1024\, "failover" : {"delay" : "1m"\, "maxDelay" : "10m"}\, "active" : true}
2015-10-17 17:39:28,417 [Thread-606] INFO ly.stealth.mesos.kafka.Scheduler$ - [statusUpdate] broker-191-0abe9e57-b0fb-4d87-a1b4-529acb111940 TASK_FAILED slave:#c6-S3 reason:REASON_MEMORY_LIMIT
2015-10-17 17:39:28,418 [Thread-606] INFO ly.stealth.mesos.kafka.Scheduler$ - Broker 191 failed 1, waiting 1m, next start ~ 2015-10-17 17:40:28+03
2015-10-17 17:39:29,202 [Thread-607] INFO ly.stealth.mesos.kafka.Scheduler$ - [resourceOffers]
I found the following in Mesos master log:
...validation.cpp:422] Executor broker-191-... for task broker-191-... uses less CPUs (None) than the minimum required (0.01). Please update your executor, as this will be mandatory in future releases.
...validation.cpp:434] Executor broker-191-... for task broker-191-... uses less memory (None) than the minimum required (32MB). Please update your executor, as this will be mandatory in future releases.
but i set the CPU and MEM for brokers via broker add (update):
broker updated:
id: 191
active: false
state: stopped
resources: cpus:1.00, mem:2048, heap:1024, port:auto
failover: delay:1m, max-delay:10m
stickiness: period:10m, expires:2015-10-19 11:15:53+03

The executor doesn't get the heap setting just the broker. I opened an issue for this https://github.com/mesos/kafka/issues/137. Please increase the mem until a patch is available.
This hasn't been a problem seen I suspect because the mem gets set as a larger value (the size of your data set you don't want to hit disk from when reading) so there is page cache for max efficiencies http://kafka.apache.org/documentation.html#maximizingefficiency

Related

Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache)

I am trying to deploy a cluster of 2 Keycloak docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mecanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instances IP are correctly resolved from Route53 private namespace and they discover each other without any problem (x.x.x.138 is started first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I am seeing a bunch of failure discovery task suspecting/unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan side (cache), everything seems to occur correctly but I am not sure. Every cache is "rebalanced" and each "rebalance" seems to end up with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like its a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any idea ? Anyone successfull at configuring this on ECS (fargate) ?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" once every 2 queries, telling me that there is indeed a cache problem.
In current keycloak docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack in Amazon ECS will "partially" work, meaning the discovery will work, the cluster will form, but the Infinispan cache won't be able to sync between instances, which will produce erratic errors. There is probably a way to make it work "as-is", but I dont see anything blocked between the instances when checking the VPC Flow logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image in file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <-- set stack to tcp
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag/Push the image to Amazon ECR and make sure the port 7600 is accepted in your security group between your Amazon ECS tasks.

Camunda Cockpit and Rest API down but application up/JobExecutor config

We are facing a major incident in our Camunda Orchestrator. When we hit 100 running process instances, Camunda Cockpit takes an eternity and never responds.
We have the same issue when calling /app/engine/.
Few messages are being consumed from RabbitMQ, and then everything stops.
The application however is not down.
I suspect a process engine configuration issue, because of being unable to get the job executor log.
When I set JobExecutorActivate to false, all things go right for Cockpit and queue consumption, but processes stop at the end of the first subprocess.
We have this log loop non stop:
2018/11/17 14:47:33.258 DEBUG ENGINE-14012 Job acquisition thread woke up
2018/11/17 14:47:33.258 DEBUG ENGINE-14022 Acquired 0 jobs for process engine 'default': []
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8338]
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8217]
2018/11/17 14:47:33.258 DEBUG ENGINE-14023 Execute jobs for process engine 'default': [8256]
2018/11/17 14:47:33.258 DEBUG ENGINE-14011 Job acquisition thread sleeping for 100 millis
2018/11/17 14:47:33.359 DEBUG ENGINE-14012 Job acquisition thread woke up
And this log too (for queue consumption):
2018/11/17 15:04:19.582 DEBUG Waiting for message from consumer. {"null":null}
2018/11/17 15:04:19.582 DEBUG Retrieving delivery for Consumer#5d05f453: tags=[{amq.ctag-0ivcbc2QL7g-Duyu2Rcbow=queue_response}], channel=Cached Rabbit Channel: AMQChannel(amqp://guest#127.0.0.1:5672/,4), conn: Proxy#77a5983d Shared Rabbit Connection: SimpleConnection#17a1dd78 [delegate=amqp://guest#127.0.0.1:5672/, localPort= 49812], acknowledgeMode=AUTO local queue size=0 {"null":null}
Environment :
Spring Boot 2.0.3.RELEASE, Camunda v7.9.0 with PostgreSQL, RabbitMQ
Camunda BPM listen and push to 165 RabbitMQ queue.
Configuration :
# Data source (PostgreSql)
com.campDo.fr.camunda.datasource.url=jdbc:postgresql://localhost:5432/campDo
com.campDo.fr.camunda.datasource.username=campDo
com.campDo.fr.camunda.datasource.password=password
com.campDo.fr.camunda.datasource.driver-class-name=org.postgresql.Driver
com.campDo.fr.camunda.bpm.database.jdbc-batch-processing=false
oms.camunda.retry.timer=1
oms.camunda.retry.nb-max=2
SpringProcessEngineConfiguration :
#Bean
public SpringProcessEngineConfiguration processEngineConfiguration() throws IOException {
final SpringProcessEngineConfiguration config = new SpringProcessEngineConfiguration();
config.setDataSource(camundaDataSource);
config.setDatabaseSchemaUpdate("true");
config.setTransactionManager(transactionManager());
config.setHistory("audit");
config.setJobExecutorActivate(true);
config.setMetricsEnabled(false);
final Resource[] resources = resourceLoader.getResources(CLASSPATH_ALL_URL_PREFIX + "/processes/*.bpmn");
config.setDeploymentResources(resources);
return config;
}
Pom dependencies :
<dependency>
<groupId>org.camunda.bpm.springboot</groupId>
<artifactId>camunda-bpm-spring-boot-starter</artifactId>
</dependency>
<dependency>
<groupId>org.camunda.bpm.springboot</groupId>
<artifactId>camunda-bpm-spring-boot-starter-webapp</artifactId>
</dependency>
<dependency>
<groupId>org.camunda.bpm.springboot</groupId>
<artifactId>camunda-bpm-spring-boot-starter-rest</artifactId>
</dependency>
I am quite sure that my job executor config is wrong.
Update :
I can start cockpit and make Camunda consume messages by setting JobExecutorActivate to false, but processes are still stopping at the first job executor required step:
config.setJobExecutorActivate(false);
Thanks for your help.
First: if your process contains async steps (Jobs) then it will pause. Activating the jobExecutor will just say that camunda should manage how these jobs are worked on. If you disable the executor, your processes will still stop and since no-one will execute them, they remain stopped.
Disabling job-execution is only sensible during testing or when you have multiple nodes and only some of them should do processing.
To your main issue: the job executor works with a threadPool. From what you describe, it is very likely, that all threads in the pool block forever, so they never finish and never return, meaning your system is stuck.
This happened to us a while ago when working with a smtp server, there was an infinite timeout on the connection so the threads kept waiting although the machine was not available.
Since job execution in camunda is highly reliable and well tested per se, I yywould suggest that you double check everything you do in your delegates, if you are lucky (and I am right) you will find the spot where you just wait forever ...

TYPO3 - Extension ke_search - Bug in scheduler

I'm using :
TYPO3 6.2
ke_search 2.2
Everything work fine except the indexing process, I mean :
If I manually index (with the backend module) it's OK, no error messages.
If I run manually the scheduler indexing task it's OK, no error messages.
If I run the scheduler with the php typo3/cli_dispatch.phpsh scheduler command, then I got this error :
Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to
allocate 87 bytes) in
/path_to_my_website/typo3/sysext/core/Classes/Cache/Frontend/VariableFrontend.php on line 99
For your information :
my PHP memory_limit setting is on 128M.
Other tasks are OK.
After this error appears on my console, the scheduler task is locked :
I can't figure out what's wrong ?
EDIT : I made flush frontend caches + flush general caches + flush system caches. If I run one more time the scheduler via the console, this is the new error I get :
Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to
allocate 12288 bytes) in
/path_to_my_website/typo3/sysext/core/Classes/Database/QueryGenerator.php
on line 1265
EDIT 2 : if I disable all my indexer configurations, all goes well. But if I enable even 1 configuration -> PHP error.
Here is one of the indexer file :

orientdb 2.2.26 cluster setup - queries hang

Orientdb version - 2.2.26
CLuster - 3 node setup, readQuorum = 2, writeQuorum = "majority", ridBag.embeddedToSbtreeBonsaiThreshold = 2147483647
Nodes - CentOS 7.0, 24 cores and 96 GB RAM
Gremlin-scala/tinkerpop APIs are used for querying and inserting.
This code works fine on single node setup.
Code checks for existing vertex in graph. If vertex does not exist, then the insert operation are batched and send to the db within a transaction.
I see following warnings in orientdb log on all three nodes -
2017-09-15 16:37:31:025 WARNI [dev2] Timeout (852567ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.354 task=record_read(#65:22)) [ODistributedDatabaseImpl]
2017-09-15 16:52:18:239 WARNI [dev2] Timeout (1049042ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.568 task=record_read(#63:24)) [ODistributedDatabaseImpl]
2017-09-15 17:25:22:477 WARNI [dev2] Timeout (1984236ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.889 task=record_read(#63:24)) [ODistributedDatabaseImpl]
There is no problem on network. Firewall is disabled on all three nodes.
Are these logs related to the problem ?
What else I should check to fix the problem ?

How to debug TVHeadend "scan no data, failed"

I have compiled the driver for my Geniatech T230C DVB-T2 USB receiver according to https://www.linuxtv.org/wiki/index.php/Geniatech_T230C and it seems to work.
But when I try to scan for channels with TVHeadend 4.0.8 on my Raspbian Jessie system TVHeadend reports:
:
2017-04-06 02:22:34.000 mpegts: 706MHz in DVB-T2 - scan no data, failed
2017-04-06 02:22:34.000 subscription: 01FC: "scan" unsubscribing
2017-04-06 02:22:44.000 mpegts: 546MHz in DVB-T2 - tuning on Silicon Labs Si2168 : DVB-T #0
2017-04-06 02:22:44.000 subscription: 01FE: "scan" subscribing to mux "546MHz", weight: 2, adapter: "Silicon Labs Si2168 : DVB-T #0", network: "DVB-T2", service: "Raw PID Subscription" comet failure [e=this.el.dom is undefined]
2017-04-06 02:22:49.000 mpegts: 546MHz in DVB-T2 - scan no data, failed
2017-04-06 02:22:49.000 subscription: 01FE: "scan" unsubscribing
:
Only 3 SD channels but no HD channels are found.
With http://www.dvbviewer.com/de/ I could receive all HD Channels in good quality on Windows 10.
What could be the problem?
Could the version of TVHeadend be too old?
I managed to perform a re-scan by shutting down tvheadend (on debian/ubuntu service tvheadend stop), removing the usb tuner, counting to 10 :-), inserting the tuner, starting tvheadend and tailed syslog until it had finished scanning.
After that, I had to go back into the UI and selecting configuration | DVB Inputs | Services | Map Services -> Map All Services.
I then had all my services back (in their new positions). Just a matter then of deleting all the porn and sales channels :-)