Unable to dynamically scale Ignite Pods in Kubernetes

We have been experimenting with the number of Ignite server pods to see the impact on performance.
One thing we have noticed is that if the number of Ignite server pods is increased after client nodes have established communication, the new pod just fails in a loop with the error below.
If, however, the grid is destroyed (all client and server nodes brought down) and the desired number of server nodes is then launched, there are no issues.
Even then, the above procedure is not fully dependable for anything other than launching a single Ignite server.
From reading [this Stack Overflow post][1] and [this documentation][2], it looks like the issue may be that we are not launching the "Kubernetes service":
Ignite's KubernetesIPFinder requires users to configure and deploy a special Kubernetes service that maintains a list of the IP addresses of all the alive Ignite pods (nodes).
However, this is the only documentation I have found, and it says that it is no longer current.
Is this information still relevant for Ignite 2.11.1?
If not, is there more recent documentation?
If this service is indeed needed, are there more concrete examples and information on setting it up?
Error on new Server pod:
[21:37:55,793][SEVERE][main][IgniteKernal] Failed to start manager: GridManagerAdapter [enabled=true, name=o.a.i.i.managers.discovery.GridDiscoveryManager]
class org.apache.ignite.IgniteCheckedException: Failed to start SPI: TcpDiscoverySpi [addrRslvr=null, addressFilter=null, sockTimeout=5000, ackTimeout=5000, marsh=JdkMarshaller [clsFilter=org.apache.ignite.marshaller.MarshallerUtils$1#78422efb], reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=0, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, skipAddrsRandomization=false]
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:281)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:980)
at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1985)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1331)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2141)
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1787)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1172)
at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1066)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:952)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:851)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:721)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:690)
at org.apache.ignite.Ignition.start(Ignition.java:353)
at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:367)
Caused by: class org.apache.ignite.spi.IgniteSpiException: Node with the same ID was found in node IDs history or existing node in topology has the same ID (fix configuration and restart local node) [localNode=TcpDiscoveryNode [id=000e84bb-f587-43a2-a662-c7c6147d2dde, consistentId=8751ef49-db25-4cf9-a38c-26e23a96a3e4, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, fd00:85:4001:5:f831:8cc:cd3:f863%eth0], sockAddrs=HashSet [nkw-mnomni-ignite-1-1-1.nkw-mnomni-ignite-1-1.680e5bbc-21b1-5d61-8dfa-6b27be10ede7.svc.cluster.local/fd00:85:4001:5:f831:8cc:cd3:f863:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=0, intOrder=0, lastExchangeTime=1676497065109, loc=true, ver=2.11.1#20211220-sha1:eae1147d, isClient=false], existingNode=000e84bb-f587-43a2-a662-c7c6147d2dde]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.duplicateIdError(TcpDiscoverySpi.java:2083)
at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1201)
at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:473)
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2207)
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
... 13 more
Server DiscoverySpi Config:
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
<property name="namespace" value="myNameSpace"/>
<property name="serviceName" value="myServiceName"/>
</bean>
</property>
</bean>
</property>
Client DiscoverySpi Configs:
<bean id="discoverySpi" class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder" ref="ipFinder" />
</bean>
<bean id="ipFinder" class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
<property name="shared" value="false" />
<property name="addresses">
<list>
<value>myServiceName.myNameSpace:47500</value>
</list>
</property>
</bean>
Edit:
I have experimented more with this issue. As long as I do not deploy any clients (using the static TcpDiscoveryVmIpFinder above), I am able to scale the server pods up and down without any issue. However, as soon as a single client joins, I am no longer able to scale the server pods up.
I can see that the server pods have ports 47500 and 47100 open, so I am not sure what the issue is. Does the TcpDiscoveryKubernetesIpFinder still need the port to be specified in the client config?
I have tried changing my client config to use the TcpDiscoveryKubernetesIpFinder below, but I am getting a discovery timeout failure (see below).
<property name="discoverySpi">
<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="ipFinder">
<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
<property name="namespace" value="680e5bbc-21b1-5d61-8dfa-6b27be10ede7"/>
<property name="serviceName" value="nkw-mnomni-ignite-1-1"/>
</bean>
</property>
</bean>
</property>
24-Feb-2023 14:15:02.450 WARNING [grid-timeout-worker-#22%igniteClientInstance%] org.apache.ignite.logger.java.JavaLogger.warning Thread dump at 2023/02/24 14:15:02 UTC
Thread [name="main", id=1, state=WAITING, blockCnt=78, waitCnt=3]
Lock [object=java.util.concurrent.CountDownLatch$Sync#45296dbd, ownerName=null, ownerId=-1]
at java.base#17.0.1/jdk.internal.misc.Unsafe.park(Native Method)
at java.base#17.0.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.base#17.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
at java.base#17.0.1/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1047)
at java.base#17.0.1/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
at o.a.i.spi.discovery.tcp.ClientImpl.spiStart(ClientImpl.java:324)
at o.a.i.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2207)
at o.a.i.i.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
at o.a.i.i.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:980)
at o.a.i.i.IgniteKernal.startManager(IgniteKernal.java:1985)
at o.a.i.i.IgniteKernal.start(IgniteKernal.java:1331)
at o.a.i.i.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2141)
at o.a.i.i.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1787)
- locked o.a.i.i.IgnitionEx$IgniteNamedInstance#57ac9100
at o.a.i.i.IgnitionEx.start0(IgnitionEx.java:1172)
at o.a.i.i.IgnitionEx.startConfigurations(IgnitionEx.java:1066)
at o.a.i.i.IgnitionEx.start(IgnitionEx.java:952)
at o.a.i.i.IgnitionEx.start(IgnitionEx.java:851)
at o.a.i.i.IgnitionEx.start(IgnitionEx.java:721)
at o.a.i.i.IgnitionEx.start(IgnitionEx.java:690)
at o.a.i.Ignition.start(Ignition.java:353)
Edit 2:
I also spoke with an admin about opening client-side ports in case that was the issue. He indicated that this should not be needed, as clients should be able to open ephemeral ports to communicate with the server nodes.
[1]: Ignite not discoverable in kubernetes cluster with TcpDiscoveryKubernetesIpFinder
[2]: https://apacheignite.readme.io/docs/kubernetes-ip-finder

It's hard to say precisely what the root cause is, but in general it's something related to the network or domain name resolution.
A public address is assigned to a node on startup and is exposed to other nodes for communication. Other nodes store that address and node ID in their history. Here is what is happening: a new node tries to enter the cluster, it connects to a random node, and the request is transferred to the coordinator. The coordinator issues a TcpDiscoveryNodeAddedMessage that must circle the topology ring and be ACKed by all other nodes. That process did not finish within the join timeout, so the new node tries to re-enter the topology by starting the same join process, but with a new ID. However, other nodes see that the address is already registered under another node ID, which causes the original duplicate node ID error.
Some recommendations:
If the issue is reproducible on a regular basis, I'd recommend collecting more information by enabling DEBUG logging for the following package (see the logging sketch after this list of recommendations):
org.apache.ignite.spi.discovery (traces discovery-related events)
Take thread dumps from the affected nodes (can be done with kill -3). Check for discovery-related issues and search for "lookupAllHostAddr".
Check that it's not a DNS issue and that all public addresses for your nodes resolve instantly, e.g. nkw-mnomni-ignite-1-1-1.nkw-mnomni-ignite-1-1.680e5bbc-21b1-5d61-8dfa-6b27be10ede7.svc.cluster.local. I was asking about the provider because in OpenShift there seems to be a hard limit on DNS resolution time.
Check GC and safepoints.
To hide the underlying issue you can experiment with the Ignite configuration: increase the network timeout and join timeout, or reduce the failure detection timeout (see the configuration sketch below). But I recommend finding the real root cause instead of treating the symptoms.
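As a minimal sketch of the logging recommendation, assuming the nodes log through Ignite's Log4j2 integration (the ignite-log4j2 module); if you use a different logging backend, apply the equivalent logger setting there:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
    <Appenders>
        <Console name="CONSOLE" target="SYSTEM_OUT">
            <PatternLayout pattern="[%d{ISO8601}][%-5p][%t][%c{1}] %m%n"/>
        </Console>
    </Appenders>
    <Loggers>
        <!-- Trace discovery-related events in detail; keep everything else at INFO. -->
        <Logger name="org.apache.ignite.spi.discovery" level="DEBUG"/>
        <Root level="INFO">
            <AppenderRef ref="CONSOLE"/>
        </Root>
    </Loggers>
</Configuration>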
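And a sketch of the timeout settings mentioned above, in the same Spring XML style as the configs in the question; the timeout values are illustrative placeholders, not recommendations:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Failure detection timeout in milliseconds (Ignite default is 10000). -->
    <property name="failureDetectionTimeout" value="30000"/>
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- How long a joining node keeps retrying before giving up (0 means keep retrying forever). -->
            <property name="joinTimeout" value="60000"/>
            <!-- Maximum network timeout for discovery message exchange, in milliseconds. -->
            <property name="networkTimeout" value="10000"/>
            <!-- Same ipFinder as in the question. -->
            <property name="ipFinder">
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder">
                    <property name="namespace" value="myNameSpace"/>
                    <property name="serviceName" value="myServiceName"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>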

Related

How to enable JBossWS' LogRecorders globally?

I'm using Wildfly 8.1.0.Final.
I have RecordingServerHandler configured, and it does get triggered by web service messages. The problem is that LogRecorders are disabled by default.
The Records management article says:
Default processors are not in recording mode upon creation, thus you need to switch them to recording mode through their MBean interfaces (see the Recording flag in the jmx-console).
Enabling them at runtime, one by one for each endpoint, won't do; I need to enable them globally at "development time".
The same article says:
The recorders can be configured in the stacks bean configuration
<!-- Installed Record Processors-->
<bean name="WSMemoryBufferRecorder" class="org.jboss.wsf.framework.management.recording.MemoryBufferRecorder">
<property name="recording">false</property>
</bean>
<bean name="WSLogRecorder" class="org.jboss.wsf.framework.management.recording.LogRecorder">
<property name="recording">false</property>
</bean>
What's "stacks bean configuration"? Does the specified WSLogRecorder name imply that this configuration creates another, non-default, LogRecorder by that name, and that I would need to add it to all endpoints somehow?
Ended up enabling them via JMX at the end of deployment.
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;
/* ... */
MBeanServer server = ManagementFactory.getPlatformMBeanServer();
Set<ObjectName> recorderNames = server.queryNames(
        new ObjectName("jboss.ws:recordProcessor=LogRecorder,*"), null);
for (ObjectName recorderName : recorderNames) {
    server.setAttribute(recorderName, new Attribute("Recording", true));
}

Quartz trigger state is not persisting on server start

We have a requirement to pause a job before application maintenance. We are using Quartz 2.2.1 in a cluster. The database is Oracle.
I have developed a screen with "Pause" functionality. I observed that "pause" works fine until I start the server again. The moment I start the server, the TRIGGER_STATE in the QRTZ_TRIGGERS table resets to "WAITING".
Can anyone please provide a hint?
Thanks a lot in advance.
Rgds - Roy
If you have set overwriteExistingJobs=true (note that the default value is false), then each time the server starts it loads the jobs/triggers from the configuration file and replaces the existing ones (those with the same job/trigger names), thereby overwriting the triggers and their states too, as in your case.
You could try setting overwriteExistingJobs=false in the SchedulerFactoryBean. This may not be convenient for you, however, since if you ever change the job configuration on the server, the existing jobs with the old configuration will remain in the database.
<bean class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
....
<property name="overwriteExistingJobs" value="false"/>
<property name="triggers">
<list>
....
</list>
</property>
....
</bean>

JBoss 5.0.1 GA RMI EJB 3.0

I have a cloud instance where I have installed the JBoss 5.0.1 GA server. The server instance has a public IP and a NATted IP address. I run the JBoss server using -b with the NATted IP address, and the web URL works fine. Now I am creating an external Java client to access an EJB3 bean deployed on the JBoss server, and I am getting an exception; the solutions I found through Google have not helped in my case. Below is the code I am using in the external client to access EJB3.
properties = new Properties();
properties.load(stream);
// Set up the JNDI context
Hashtable ht = new Hashtable();
ht.put(Context.INITIAL_CONTEXT_FACTORY,
        "org.jnp.interfaces.NamingContextFactory");
// "public ip address" is a placeholder for the server's public address,
// typically of the form "jnp://<host>:1099" for JBoss 5
ht.put(Context.PROVIDER_URL, "public ip address");
ht.put(Context.URL_PKG_PREFIXES,
        "org.jboss.naming:org.jnp.interfaces");
// Find and create a reference to the bean using JNDI
context = new InitialContext(ht);
When executing on localhost it works fine; when connecting remotely it throws the exception below: "javax.naming.CommunicationException [Root exception is java.rmi.ConnectException: Connection refused to host: ". Can anyone help me with this?
This is my connector file (ejb3-connectors-jboss-beans.xml):
<!-- EJB3 Connectors -->
<!-- JBoss Remoting Connector
     Note: Bean Name "org.jboss.ejb3.RemotingConnector" is used
     as a lookup value; alter only after checking java references
     to this key. -->
<bean name="org.jboss.ejb3.RemotingConnector" class="org.jboss.remoting.transport.Connector">
    <property name="invokerLocator">
        <value-factory bean="ServiceBindingManager"
                       method="getStringBinding">
            <parameter>
                jboss.remoting:type=Connector,name=DefaultEjb3Connector,handler=ejb3
            </parameter>
            <parameter>
                <null />
            </parameter>
            <parameter>socket://${jboss.bind.address}:${port}</parameter>
            <parameter>
                <null />
            </parameter>
            <parameter>3873</parameter>
        </value-factory>
    </property>
    <property name="serverConfiguration">
        <inject bean="ServerConfiguration" />
    </property>
</bean>
<bean name="ServerConfiguration" class="org.jboss.remoting.ServerConfiguration">
    <property name="invocationHandlers">
        <map keyClass="java.lang.String" valueClass="java.lang.String">
            <entry>
                <key>AOP</key>
                <value>org.jboss.aspects.remoting.AOPRemotingInvocationHandler</value>
            </entry>
        </map>
    </property>
</bean>
Do a telnet from the remote server instance to the IP and port you are trying to connect to on the JBoss box. If that does not work, then you have to solve the networking issue first. (Let me know, so I can guide you on how to do it.)
Also check your EJB3 binding settings and check networking. The out-of-the-box config looks like this:
<mbean code="org.jboss.remoting.transport.Connector"
xmbean-dd="org/jboss/remoting/transport/Connector.xml"
name="jboss.remoting:type=Connector,name=DefaultEjb3Connector,handler=ejb3">
<depends>jboss.aop:service=AspectDeployer</depends>
<attribute name="InvokerLocator">socket://0.0.0.0:3873</attribute>
<attribute name="Configuration">
<handlers>
<handler subsystem="AOP">org.jboss.aspects.remoting.AOPRemotingInvocationHandler</handler>
</handlers>
</attribute>
</mbean>
Thanks!
#leo.
In my case the following two things worked:
1. Running the JBoss server using run.bat -b **public ip (not the NAT IP)** -Djboss.bind.address=0.0.0.0
2. Adding an entry to my **local** machine's hosts file pointing the remote IP to the hostname, i.e. remoteip remotehostname.
Hope it will help others as well.

Is there a list of Jboss port numbers that a JBoss instance uses?

I have to configure an instance of JBoss 5.1.0 to use a different port number (i.e. 8480). To do this I made the following changes to bindings-jboss-beans.xml.
<parameter>
    <set>
        <inject bean="PortsDefaultBindings"/>
        <inject bean="Ports01Bindings"/>
        <inject bean="Ports02Bindings"/>
        <inject bean="Ports03Bindings"/>
        <inject bean="Ports04Bindings"/>
    </set>
</parameter>
<bean name="Ports04Bindings" class="org.jboss.services.binding.impl.ServiceBindingSet">
    <constructor>
        <!-- The name of the set -->
        <parameter>ports-04</parameter>
        <!-- Default host name -->
        <parameter>${jboss.bind.address}</parameter>
        <!-- The port offset -->
        <parameter>400</parameter>
        <!-- Set of bindings to which the "offset by X" approach can't be applied -->
        <parameter><null/></parameter>
    </constructor>
</bean>
The change works fine in that I can access my application using the URL http://localhost:8480/XYZApp.
Now, to be able to do the deployment, I have to tell the infrastructure people all the port numbers that the application will use.
I know that we will be using 8480, but how would I know all the other port numbers that JBoss will use for this instance, based on an offset of 400?
JBoss listens on many ports, one or more for each of its services, but you shouldn't need to open all of those ports if your applications don't make use of the services behind them. For example, if no external application will use the Naming Service, you shouldn't need to open port 1099 (1499 in your case).
Anyway, if you need a list of all the ports JBoss listens on, check the bean named "StandardBindings" in the file conf/bindingservice.beans/META-INF/bindings-jboss-beans.xml. Those are the standard ports, so if you have defined an offset (400 in your case) you will have to add it to each port to get the ports actually used by your JBoss instance; see the sketch below.
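For illustration, the entries inside the "StandardBindings" set look roughly like the following (reconstructed from memory of a stock JBoss 5.1 install, so treat the exact attribute names as an approximation rather than something to copy verbatim); the effective port is the listed base port plus your offset:
<bean class="org.jboss.services.binding.ServiceBindingMetadata">
    <!-- The service this binding belongs to -->
    <property name="serviceName">jboss:service=Naming</property>
    <property name="bindingName">Port</property>
    <!-- Base port; with the ports-04 set (offset 400) this becomes 1099 + 400 = 1499 -->
    <property name="port">1099</property>
    <property name="description">The listening socket for the Naming service</property>
</bean>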

Problem changing Database in Hibernate

I'm having a problem with Hibernate and I don't know exactly what's going on. I have a project at work where I connect to an Oracle 10g database using the following settings:
Host Name: localhost
port: 1521
SID: orcl
user: anfxi
password: password
Now I'm at home trying to work with the same database remotely. I'm connected via VPN and the database IP is now 10.73.98.230, so I imported my WAR and changed the settings in my hibernate.cfg.xml from:
<property name="hibernate.connection.driver_class">oracle.jdbc.OracleDriver</property>
<property name="hibernate.connection.url">jdbc:oracle:thin://localhost:1521:orcl</property>
<property name="hibernate.connection.username">anfexi</property>
<property name="connection.password">password</property>
<property name="connection.pool_size">1</property>
<property name="hibernate.dialect">org.hibernate.dialect.OracleDialect</property>
<property name="show_sql">true</property>
<property name="hbm2ddl.auto">validate</property>
<property name="current_session_context_class">thread</property>
to:
<property name="hibernate.connection.driver_class">oracle.jdbc.OracleDriver</property>
<property name="hibernate.connection.url">jdbc:oracle:thin://10.73.98.230:1521:orcl</property>
<property name="hibernate.connection.username">anfexi</property>
<property name="connection.password">password</property>
<property name="connection.pool_size">1</property>
<property name="hibernate.dialect">org.hibernate.dialect.OracleDialect</property>
<property name="show_sql">true</property>
<property name="hbm2ddl.auto">validate</property>
<property name="current_session_context_class">thread</property>
but I keep getting this error:
ERROR [main] (SchemaValidator.java:135) - could not get database metadata
java.sql.SQLException: Listener refused the connection with the following error:
ORA-12505, TNS:listener does not currently know of SID given in connect descriptor
The Connection descriptor used by the client was:
localhost:1521:orcl
So it seems to still be using localhost as the DB address. I cleaned my project and rebuilt, still with no luck. Is there something else that I could be missing? Does the Hibernate configuration get cached in some file that I have to erase or something?
EDIT
For what it's worth, I can connect using SQL Developer; the problem is just that Hibernate is still using the old localhost:1521:orcl connection descriptor.
Thanks for your help!
Verify that the XML file you are changing in Eclipse is actually being deployed to the server. I run into problems every once in a while where Eclipse doesn't realize it needs to redeploy certain files for my webapp.
If you are using Tomcat and deploying using the workspace metadata (the default), you can check what the actual deployed WAR files look like by looking at your filesystem under:
WORKSPACE/.metadata/.plugins/org.eclipse.wst.server.core/tmp0/wtpwebapps/APPNAME/.../path/to/hibernate.cfg.xml
If you find the config file is NOT being updated, I would recommend undeploying your app in Eclipse, deleting the entire APPNAME directory in the above path, and redeploying clean.
If none of that works, do a project-wide search for "localhost" and see if there could possible be any hardcoded connections strings anywhere.
This kind of problem is usually due to the wrong configuration file being present. Maybe you have two copies of the file and you changed one, but the system is using the other.
Typically when building/compiling, resources get copied to a target/build folder. Check source folders and build target folders etc.
Search the file system for all files with the name hibernate.cfg.xml or with the contents localhost:1521:orcl
Check the classpath, or try explicitly putting the folder with the configuration file you want first in the classpath.
It can also be a case of some other configuration overriding yours, for instance a datasource file or a persistence.xml file. Check those if you have them as well.
How are you running your application? Through a test case, standalone console application, servlet/j2ee container?
It is unable to resolve the "orcl" SID. Maybe the SID is present on your "localhost" but not on the server "10.73.98.230". Verify that you are using the correct SID available on "10.73.98.230".
Try changing this line in your config file:
<property name="hibernate.connection.url">jdbc:oracle:thin:@10.73.98.230:1521:orcl</property>
i.e. replace // with @ (the SID-style Oracle thin URL is jdbc:oracle:thin:@host:port:SID).
For more information on this error you can follow this link: http://www.cryer.co.uk/brian/oracle/ORA12505.htm
Hope this will help.