I have a three-node Raspberry Pi cluster. I installed MPI on it and tried to run the example program cpi, but it fails with the error shown below.
The command executed on master node:
mpiexec -f machinefile -n 2 mpi-build/examples/cpi
The result:
Process 0 of 2 is on Pi01
Fatal error in PMPI_Reduce: A process has failed,
error stack:PMPI_Reduce(1259)...............:MPI_Reduce(sbuf=0xbebc6630,rbuf=0xbebc6638,count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1071)..........:
MPIR_Reduce_intra(877)..........:
MPIR_Reduce_binomial(184).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(630): Communication error with rank 1
Process 1 of 2 is on Pi02
I've set up SSH keys between the master and each slave, so no password is needed to log in from the master to any node. (I did not share keys between the slaves themselves, only between the master and the slaves, so a slave that connects to another slave still has to log in with a password.)
Programs that print helloworld with the process rank and the PC that executed it work properly, but when a process needs to communicate with another, I get the error as stated above.
What should I do?
I have a pod running in Kubernetes on AWS. Due to limited configuration options in a custom deployment process (not my fault!!) I cannot start the Symfony Messenger worker the way you usually would. What I have to do after a deployment is log into the shell and manually run
bin/console messenger:consume my_kafka_messages
Of course, as soon as the pod is automatically restarted for any reason, my worker is no longer running. So until we can change the company deployment process, I have to make sure I at least get notified if the worker isn't running.
Is there any option, e.g. a Symfony command, that checks whether the worker is running? If that were possible, I could let the system start the worker or at least send me a notification.
How about
bin/console debug:messenger
?
If I do that and get, e.g., this output, is this a sign that the worker is running? Or is it just the configuration of a worker, which could run if it were started and may or may not be running currently?
$ bin/console deb:mess
Messenger
=========
events
------
The following messages can be dispatched:
--------------------------------------------------
#codeCoverageIgnore
App\Domain\KafkaEvents\ProductPictureCollection
handled by App\Handler\ProductPictureHandler
--------------------------------------------------
Of course I can take a crude approach and check the DB, which logs the processed datasets. But it is always possible that there is no data to process for, e.g., 5 days. In that case I would get false alarms although everything is fine.
So checking directly if the worker is running would be much better, but I have no idea how to do it.
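One way to check the process directly is to scan the process table for the consume command. The sketch below assumes the check runs inside the same container as the worker and that a procps-style `ps` is available there. The function names are hypothetical and not part of Symfony.

```python
import subprocess


def worker_in_process_list(command_lines, needle="messenger:consume"):
    """Return True if any command line contains the worker command."""
    return any(needle in line for line in command_lines)


def list_command_lines():
    """Return the command line of every running process.

    Assumes a procps-style `ps` that accepts `-eo args`.
    """
    out = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    ).stdout
    return out.splitlines()[1:]  # skip the header row
```

Calling `worker_in_process_list(list_command_lines())` from a cron job or a liveness probe would then tell you whether to restart the worker or send a notification. It only confirms that the process exists, not that it is consuming messages.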
I'm using a Raspberry Pi 4, v10 (buster).
I installed supervisor per the instructions here: http://supervisord.org/installing.html
Except I changed "pip" to "pip3" because I want to monitor running things that use the python3 kernel.
I'm using Prefect, and the supervisord.conf is running the program with command=/home/pi/.local/bin/prefect "agent local start" (I tried this with and without double quotes)
Looking at the supervisord.log file, it seems the Prefect agent does start: I see the ASCII art that normally shows up when I start it from the command line. But then the log shows "terminated by SIGTERM; not expected" and "WARN received SIGTERM indicating exit request".
I saw this post: "Supervisor gets a SIGTERM for some reason, quits and stops all its processes", but I don't even have the 10Periodic file it references.
Does anyone know why or how Supervisor processes are getting killed by SIGTERM?
It could be that your process exits immediately because your command has no API key, which is required to connect your agent to the Prefect Cloud API. It is also a best practice to always assign a unique label to your agents; the example below uses "raspberry" as a label.
You can also check the logs/status:
supervisorctl status
Here is a command you can try. You can also specify a directory in your supervisor config (I'm not sure whether the environment variables are needed, but I saw them in another Raspberry Pi supervisor user's config):
[program:prefect-agent]
command=/home/pi/.local/bin/prefect agent local start -l raspberry -k YOUR_API_KEY --no-hostname-label
directory=/home/pi
user=pi
environment=HOME="/home/pi",USER="pi"
I have Airflow running with CeleryExecutor and 2 workers. When my DAG runs, the tasks generate a log on the filesystem of the worker that ran them. But when I go to the Web UI and click on the task logs, I get:
*** Log file does not exist: /usr/local/airflow/logs/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log
*** Fetching from: http://70953abf1c10:8793/log/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='70953abf1c10', port=8793): Max retries exceeded with url: /log/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f329c3a2650>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
http://70953abf1c10:8793/ is obviously not the correct address of the worker. However, celery@70953abf1c10 is the name of this worker in Celery. It seems like Airflow is trying to learn the worker's URL from Celery, but Celery is giving the worker's name instead. How can I solve this?
DejanLekic's solution put me on the right track, but it wasn't entirely obvious, so I'm adding this answer to clarify.
In my case I was running Airflow on Docker containers. By default, Docker containers use a bridge network called bridge. This is a special network that does not automatically resolve hostnames. I created a new bridge network in Docker called airflow-net and had all my Airflow containers join this one (leaving the default bridge was not necessary). Then everything just worked.
By default, Docker sets the hostname to the hex ID of the container. In my case the container ID began with 70953abf1c10 and the hostname was also 70953abf1c10. There is a Docker parameter for specifying hostname, but it turned out to not be necessary. After I connected the containers to a new bridge network, 70953abf1c10 began to resolve to that container.
The simplest solution is either to use the default node name, which includes the hostname, or to explicitly set a node name that contains a valid hostname (example: celery1@hostname.domain.tld).
If you use the default settings, then the machine running the Airflow worker has its hostname incorrectly set to 70953abf1c10. You should fix this by running something like: hostname -B hostname.domain.tld
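Whichever fix you pick, the underlying requirement is that the name the worker advertises resolves from the machine running the webserver. A minimal sketch of that check, using only the Python standard library, is below. The function name `resolves` is a hypothetical helper, not part of Airflow or Celery.

```python
import socket


def resolves(hostname):
    """Return True if `hostname` resolves to an address from this machine."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Running `resolves("70953abf1c10")` inside the webserver container, before and after changing the network or hostname, shows whether the log-fetch URL can work at all.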
I want to kill my Kafka Connect distributed worker, but I am unable to determine (or do not know how to determine) which process running on Linux is that worker.
When running
ps aux | grep worker
I do see a lot of worker processes, but I am unsure which is the Connect worker and which are standard non-Connect workers.
It is true that only one of these processes was started yesterday and I suspect that that is the one, but that would obviously not be a sufficient condition in all cases, say for example if the Kafka cluster was brought online yesterday. So, in general, how can I determine which process is a Kafka connect worker?
What is the foolproof method here?
If the other worker processes are not related to Connect, you can search for the Connect process by the properties file that you passed when starting the Connect worker.
ps aux | grep connect-distributed.properties
There is no kill script for Connect workers. You have to send SIGTERM with the kill command to stop the worker process gracefully.
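The same two steps, finding the process by its properties file and sending SIGTERM, can be sketched in Python. This assumes the input lines come from `ps -eo pid,args` and that the worker was started with a file named connect-distributed.properties. The function names are hypothetical.

```python
import os
import signal


def find_pids(ps_lines, marker="connect-distributed.properties"):
    """Return PIDs from `ps -eo pid,args` lines whose command contains `marker`."""
    pids = []
    for line in ps_lines:
        parts = line.split(None, 1)  # split into PID and the rest of the line
        if len(parts) == 2 and parts[0].isdigit() and marker in parts[1]:
            pids.append(int(parts[0]))
    return pids


def stop_gracefully(pids):
    """Send SIGTERM (not SIGKILL) so each worker can shut down cleanly."""
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
```

If several Connect workers on the same host share a properties file name, the marker would need to be the full path to stay unambiguous.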
While setting up the history server and Hive server, WebHDFS returns the following error from its REST API.
curl -sS -L -w '%{http_code}' -X PUT -T /usr/hdp/2.3.4.0-3485/hadoop/mapreduce.tar.gz 'http://ambari1.devcloud.247-inc.net:50070/webhdfs/v1/hdp/apps/2.3.4.0-3485/mapreduce/mapreduce.tar.gz?op=CREATE&user.name=hdfs&overwrite=True&permission=444'
{"RemoteException":
{"exception":"IOException",
"javaClassName":"java.io.IOException",
"message":"Failed to find datanode, suggest to check cluster health."
}}403
As user hdfs, check the cluster status:
hdfs dfsadmin -report
I had the same problem: the report showed that the datanodes were not online. In the end I had to add the nodes to the /etc/hosts file. In your case the cause might be different; the essential thing is that the datanodes are up and can be reached.
I had the same problem with HDP 2.5. MapReduce reported the same error, and Ambari showed 0/3 datanodes alive even though it also showed all 3 datanodes as started.
For me, the reason is DNS.
My machines use a central DNS server, so at first I did not add the IP/host pairs to the /etc/hosts file. After deployment, the MapReduce server failed to start.
Then I added the IP/host pairs of all machines to the /etc/hosts file on each machine and restarted HDFS. The problem was resolved.
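To confirm that every node has an entry on a given machine, a small sketch like the following can parse the hosts file and report which names are missing. The hostnames in the usage note are made up for illustration, and the function names are hypothetical.

```python
def hosts_entries(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # first entry wins, as in lookups
    return mapping


def missing_hosts(text, required):
    """Return the required hostnames that have no entry, in the given order."""
    known = hosts_entries(text)
    return [h for h in required if h not in known]
```

For example, `missing_hosts(open("/etc/hosts").read(), ["datanode1.example", "datanode2.example"])` returns an empty list once both (hypothetical) names are present.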