I have 2 Ubuntu machines, both running Ubuntu Server.
The problem I am facing is that I am unable to run the 2nd node as a peer of the first node.
I have tried different types of genesis files, for example:
1. Providing both peers' info in the genesis file:
{
  "addPeer": {
    "peer": {
      "address": "110.39.197.250:10001",
      "peerKey": "d04da271b57fe63426ae1dc97f6952104037411fcf4f3b739dc217f45e5fc99b"
    }
  }
},
{
  "addPeer": {
    "peer": {
      "address": "135.148.120.71:10001",
      "peerKey": "d64606f47120a1c62a7b7a8869acee11c50b287dbfe18de30556ce98cf478db9"
    }
  }
},
**FYI**
110.39.197.250 ==> local machine IP
135.148.120.71 ==> server IP
Running point 1 results in an error and the related Iroha node exits.
2. Providing only the current node's info in the genesis file:
You can see the genesis file here for both nodes, and you can also see the config file for both nodes.
"addPeer": {
"peer": {
"address": "127.0.0.1:10001",
"peerKey": "d04da271b57fe63426ae1dc97f6952104037411fcf4f3b739dc217f45e5fc99b"
}
}
**FYI**
In the same way, I provide the node's own info in the 2nd running node's genesis file and then run AddPeer in the CLI; then this error comes and both of the nodes stop working.
I am not sure what the correct way is to run Iroha nodes on different servers and then add the 2nd node as a peer of the 1st one. I have not found any helpful documentation either; the only help I found was for deploying multiple nodes on a single machine using Docker, but I want to deploy on different machines and use them as peers. Please guide me.
Related
I have an on-prem Service Fabric 3-node cluster running 8.2.1571.9590. This has been running for months without any problems.
The cluster nodes were rebooted overnight, as part of operating system patching, and the cluster will now not establish connections.
If I run Connect-ServiceFabricCluster -Verbose, I get the timeout error:
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the running processes, I can see that all the expected processes start and are stable, with the exception of FabricGateway.exe, which goes into a boot-loop cycle.
I have confirmed that:
I can do a TCP-IP Ping between the nodes in the cluster
I can do a PowerShell remote session between the nodes in the cluster
No cluster certs have expired.
Any suggestions as to how to debug this issue?
Actual Problem
On checking the Windows event logs (Admin Tools > Event Viewer > Applications and Services Logs > Microsoft Service Fabric > Admin) I could see errors related to the startup of the FabricGateway process. The errors and warnings come in repeated sets in the following basic order:
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
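For reference, the same Admin-channel events can also be pulled from PowerShell; this is a hedged sketch, so confirm the exact log name with the first command before relying on the second:
# Hedged sketch: discover the Service Fabric event log name, then pull recent errors/warnings
Get-WinEvent -ListLog "*ServiceFabric*" | Select-Object LogName
Get-WinEvent -LogName "Microsoft-ServiceFabric/Admin" -MaxEvents 200 |
    Where-Object { $_.LevelDisplayName -in 'Error','Warning' } |
    Format-List TimeCreated, Message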
Using Windows Task Manager (or a similar tool) you would see the FabricGateway.exe process starting and terminating every few seconds.
The net effect of this was that Service Fabric cluster communication could not be established.
Solution
The problem was that the domain account mydomain\admin-old (an old historic account, not used for a long period) had been deleted in Active Directory, so no SID for the account could be found. This failure was causing the boot loop, even though the other admin accounts were valid.
The fix was to remove this deleted ID from each cluster node's currently active settings.xml file. The process I used was:
RDP onto a cluster node VM
Stop the service fabric service
Find the current Service Fabric cluster configuration, e.g. the newest folder of the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit the settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block, so I ended up with:
<Section Name="Security">
<Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
...
Once the file is saved, restart the Service Fabric service; it should start as normal. Remember, it will take a minute or two to start up.
Repeat the process on the other nodes in the cluster.
Once completed, the cluster starts and operates as expected.
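For reference, a hedged PowerShell sketch of the same per-node cycle, assuming the default FabricHostSvc service name and the folder layout shown above:
Stop-Service FabricHostSvc        # stop the Service Fabric host service on this node
# edit settings.xml in the newest Fabric.Config.* folder here, removing the deleted account
Start-Service FabricHostSvc       # restart; allow a minute or two for the node to come up
Get-Service FabricHostSvc         # confirm the service stays in the Running state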
I'm pretty new to Ceph, so I've included all the steps I used to set up my cluster, since I'm not sure what is or is not useful information for fixing my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph. One is a client and 3 are Ceph monitors. Each Ceph node has six 8 GB drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
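For concreteness, the client-side steps were roughly the following (a hedged sketch; the mapped device name and filesystem type are assumptions):
sudo rbd map mypool/myimage          # prints the mapped device, e.g. /dev/rbd0
sudo mkfs.ext4 /dev/rbd0             # device name and filesystem type are assumptions
sudo mount /dev/rbd0 /mnt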
Then, as a test, I shut down and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs, and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
Let me share a similar experience. Some time ago I also tried to perform some tests on Ceph (Mimic, I think), and my VMs in VirtualBox acted very strangely, nothing comparable to actual bare-metal servers, so please bear this in mind... the tests are not quite relevant.
As regarding your problem, try to see the following:
Have at least 3 monitors (an odd number). It's possible that the hang is because of the monitor election; a quick way to check this is sketched below.
Make sure the networking part is OK (separate VLANs for Ceph servers and clients).
DNS is resolving OK (you have added the server names to /etc/hosts).
...just my 2 cents...
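On the monitor-election point, a hedged way to check it, assuming a cephadm deployment like the one described above (the mon ID and paths are assumptions):
sudo systemctl list-units 'ceph*'      # cephadm runs each mon/mgr/OSD as a systemd unit; check they came back
sudo cephadm ls                        # list the daemons cephadm knows about on this host
# Ask a monitor for its own view over its admin socket; this works even without quorum.
# (the mon ID is an assumption; under cephadm it is usually the host's short name)
sudo cephadm shell -- ceph daemon mon.$(hostname -s) mon_status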
I have Airflow running with CeleryExecutor and 2 workers. When my DAG runs, the tasks generate a log on the filesystem of the worker that ran them. But when I go to the Web UI and click on the task logs, I get:
*** Log file does not exist: /usr/local/airflow/logs/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log
*** Fetching from: http://70953abf1c10:8793/log/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='70953abf1c10', port=8793): Max retries exceeded with url: /log/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f329c3a2650>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
http://70953abf1c10:8793/ is obviously not the correct address of the worker. However, celery@70953abf1c10 is the name of this worker in Celery. It seems like Airflow is trying to learn the worker's URL from Celery, but Celery is giving the worker's name instead. How can I solve this?
DejanLekic's solution put me on the right track, but it wasn't entirely obvious, so I'm adding this answer to clarify.
In my case I was running Airflow on Docker containers. By default, Docker containers use a bridge network called bridge. This is a special network that does not automatically resolve hostnames. I created a new bridge network in Docker called airflow-net and had all my Airflow containers join this one (leaving the default bridge was not necessary). Then everything just worked.
By default, Docker sets the hostname to the hex ID of the container. In my case the container ID began with 70953abf1c10 and the hostname was also 70953abf1c10. There is a Docker parameter for specifying hostname, but it turned out to not be necessary. After I connected the containers to a new bridge network, 70953abf1c10 began to resolve to that container.
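A hedged sketch of that setup, with placeholder container names, in case it helps someone reproduce it:
docker network create airflow-net                       # user-defined bridge with built-in DNS
docker network connect airflow-net airflow-webserver    # container names here are placeholders
docker network connect airflow-net airflow-worker
# On a user-defined bridge, containers resolve each other by name/hostname, so the webserver
# can reach http://<worker-hostname>:8793/log/... when it fetches task logs.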
The simplest solution is either to use the default worker name, which will include the hostname, or to explicitly set a node name that has a valid hostname in it (example: celery1@hostname.domain.tld).
If you use the default settings, then the machine running the Airflow worker has its hostname incorrectly set to 70953abf1c10. You should fix this by running something like: hostname -B hostname.domain.tld
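A hedged example of the explicit node-name option above; the flag name varies by Airflow version, so check the worker command's --help output first:
# Airflow 2.x flag shown; on 1.10 it is `airflow worker -cn <name>` instead (check --help first)
airflow celery worker --celery-hostname celery1@worker1.example.com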
I want to start two or more JBoss EAP 6.4 servers in domain mode, but when I started the second one I got this warning:
[Server:server-one] 15:34:35,606 WARN [org.hornetq.core.client]
(hornetq-discovery-group-thread-dg-group1) HQ212034: There are more
than one servers on the network broadcasting the same node id. You
will see this message exactly once (per node) if a node is restarted,
in which case it can be safely ignored. But if it is logged
continuously it means you really do have more than one node on the
same network active concurrently with the same node id. This could
occur if you have a backup node active at the same time as its live
node. nodeID=14bdbf74-f56c-11e4-a65f-738aa3641190
I cannot get this to work.
You must have copied this node from another node. Delete messagingjournal from the data directory on all the nodes and restart all the nodes again.
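A hedged sketch of that cleanup for a domain-mode server named server-one; the exact data path and directory names depend on your configuration, so treat these as assumptions:
# Stop the servers first; paths assume EAP 6.x defaults for a domain-mode server named server-one
rm -rf "$JBOSS_HOME"/domain/servers/server-one/data/messagingjournal
rm -rf "$JBOSS_HOME"/domain/servers/server-one/data/messagingbindings
rm -rf "$JBOSS_HOME"/domain/servers/server-one/data/messaginglargemessages
rm -rf "$JBOSS_HOME"/domain/servers/server-one/data/messagingpaging
# HornetQ regenerates the journal (and a fresh node id) on the next start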
I connected three servers to form an HPC cluster using HTCondor as middleware. When I run the command condor_status from the central manager, it does not show the other nodes. I can run jobs on the central manager and connect to the other nodes via SSH, but it seems that something is missing in the Condor configuration files, where I set the central manager as the Condor host and allow reading and writing for everyone. I keep the daemons MASTER, STARTD in the daemon list for the worker nodes.
When I run condor_status on the central manager it just shows the central manager, and when I run it on a compute node it gives me the error "CEDAR:6001:Failed to connect to" followed by the central manager's IP and port number.
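Roughly, the worker-node settings described above look like this (a hedged sketch; the macro names are standard condor_config knobs, the values are placeholders):
# condor_config.local on a worker node (values are placeholders)
CONDOR_HOST = central-manager.example.com
DAEMON_LIST = MASTER, STARTD
ALLOW_READ  = *
ALLOW_WRITE = *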
I managed to solve it. The problem was the central manager's firewall (in my case iptables), which was running.
So, when I stopped the firewall (su -c "service iptables stop"), all nodes appeared normally when typing condor_status.
The firewall status can be checked using "service iptables status".
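For completeness, a hedged sketch of those firewall commands on a CentOS/RHEL-style box, plus an alternative that opens the default HTCondor port (9618) instead of disabling the firewall entirely (port number and service names are assumptions about this setup):
service iptables status                    # check whether the firewall is running
service iptables stop                      # stop it; condor_status should now list all nodes
chkconfig iptables off                     # optional: keep it off across reboots
# Alternatively, open the default HTCondor port instead of disabling the firewall entirely:
iptables -I INPUT -p tcp --dport 9618 -j ACCEPT
iptables -I INPUT -p udp --dport 9618 -j ACCEPT
service iptables save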
There are a number of things that could be going on here. I'd suggest you follow this tutorial and see if it resolves your problems -
http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/
In my case the service "condor.exe" was not running on the server; I had stopped it manually. I just started it and everything went fine.