I am new to Dask and would like to run a test of Dask on a cluster. The cluster has a head server and several other nodes; once I log in to the head server, I can ssh into the other nodes without a password.
I would like to run a simple function over a large array. The function, defined below, converts a numpy datetime64 value into a Python datetime object.
import xarray as xr
import numpy as np
from dask import compute, delayed
import dask.multiprocessing
from datetime import datetime, timedelta
def convertdt64(dt64):
    # seconds since the Unix epoch, then back to a Python datetime
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
    return datetime.utcfromtimestamp(ts)
Then, in the terminal, I apply this function to each element of a 1D array of size N.
values = [delayed(convertdt64)(x) for x in arraydata]
results1 = compute(*values, scheduler='processes')
This uses some cores on the head server and it works, though slowly. Then I tried to launch the function on several nodes of the cluster by using the Client as below:
from dask.distributed import Client
client = Client("10.140.251.254:8786")
results = compute(*values, scheduler='distributed')
It does not work at all. There are some warnings and one error message, shown below.
distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://10.140.251.254:57257 remote=tcp://10.140.251.254:8786>
CancelledError: convertdt64-0205ad5e-214b-4683-b5c4-b6a2a6d8e52f
I also tried dask.bag and got the same error message. What could be the reason that the parallel computation on the cluster does not work? Is it due to some server/network configuration, or to my incorrect use of the Dask client? Thanks in advance for your help!
Best wishes
Shannon X
...then I tried to launch the function on several nodes of the cluster by using the Client as below:
I had similar issues trying to run tasks on the scheduler. The nodes connect just fine. Attempting to submit tasks, however, results in cancellation.
The documented examples were either local or from the same node as the scheduler. When I moved my client to the scheduler node the problem went away.
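For reference, this is roughly the pattern that worked for me once the client ran on the scheduler node (just a sketch reusing the function and array from the question; the loopback address is only a placeholder):
from dask.distributed import Client
from dask import compute, delayed

client = Client("127.0.0.1:8786")          # run this on the scheduler node itself
client.get_versions(check=True)            # raises if python/dask versions differ across nodes
print(client.scheduler_info()["workers"])  # confirm the workers are actually attached

values = [delayed(convertdt64)(x) for x in arraydata]
results = compute(*values)                 # with a Client active, this runs on the cluster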
I'm using kubernetes==10.1.0
My code is in Python 3.7 and looks something like this:
from kubernetes import client as kubernetes_client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod
BatchV1_api = kubernetes_client.BatchV1Api()
api_response = BatchV1_api.list_namespaced_job(namespace="default", watch=False, pretty='true', async_req=False)
The problem is that I have about 1500 jobs in Kubernetes, but api_response returns only 20.
The goal is to implement a program that goes through all jobs and deletes old ones, using the job name as a parameter.
Any idea why I'm getting only partial data from the BatchV1_api.list_namespaced_job function?
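For context, this is the shape of the loop I'm trying to end up with (only a sketch: "old-job-" is a placeholder name filter, it assumes kubeconfig auth, and it pages through results with limit/_continue in case the response is being truncated; depending on your client version you may also need to pass a V1DeleteOptions body to the delete call):
from kubernetes import client as kubernetes_client, config

config.load_kube_config()  # assumes a local kubeconfig
BatchV1_api = kubernetes_client.BatchV1Api()

_continue = None
while True:
    api_response = BatchV1_api.list_namespaced_job(
        namespace="default", limit=100, _continue=_continue)
    for job in api_response.items:
        if job.metadata.name.startswith("old-job-"):  # placeholder condition
            BatchV1_api.delete_namespaced_job(
                name=job.metadata.name, namespace="default",
                propagation_policy="Background")
    _continue = api_response.metadata._continue  # token for the next page, if any
    if not _continue:
        break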
I have launched an AWS EMR cluster, and in a pyspark3 Jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?
I had the same issue, and the reason for the timeout is the driver running out of memory. Since you run collect(), all the data gets sent to the driver. By default, the driver memory is 1000M when creating a Spark application through JupyterHub, even if you set a higher value through config.json. You can see that by executing the following from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory just do
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
From this Stack Overflow question's answer, which worked for me:
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h). Even though the Spark app succeeds, your notebook will receive this error if the app runs longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. edit the /etc/livy/conf/livy.conf file (in the cluster's master node)
2. set the livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
3. restart Livy to update the setting: sudo restart livy-server in the cluster's master
4. test your code again
Alternative way to edit this setting - https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
A simple restart solved this problem for me. In your Jupyter notebook, go to Kernel --> Restart.
Once that is done, if you run the cell with the spark command, you will see that a new Spark session gets established.
You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
The solution might be to increase spark.executor.heartbeatInterval. The default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
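If you want to check what your current session is actually using before changing anything, you can read the value back the same way as the driver memory above (a quick sketch; the second argument is just a fallback in case the key was never set explicitly):
spark.sparkContext.getConf().get('spark.executor.heartbeatInterval', '10s')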
Insufficient reputation to comment.
I tried increasing heartbeatInterval to a much higher value (100 seconds) and still got the same result. FWIW, the error shows up in under 9 seconds.
What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.
I'm currently using NiFi 1.5.0 (but it's the same with previous versions) and I wonder if there is a way to clear all queues at the same time.
When the number of processors increases, it can take a really long time to reset everything.
(I already know how to clear a single queue: How to clear NiFi queues?)
I'm looking for a solution using either the UI or the API
Thanks in advance!
I haven't had time to test this thoroughly, but it should work:
# In your linux shell - NiPyAPI is a Python2/3 SDK for the NiFi API
pip install nipyapi
python
# In Python
from nipyapi import config, canvas, nifi
# Get a flat list of all process groups
pgs = canvas.list_all_process_groups()
# get a flat list of all connections in all process groups
cons = []
for pg in pgs: cons += nifi.ProcessgroupsApi().get_connections(pg.id).connections
# Issue a drop order for every connection in every process group
for con in cons: nifi.FlowfilequeuesApi().create_drop_request(con.id)
Edit: I went ahead and implemented this as it seems useful:
https://github.com/Chaffelson/nipyapi/issues/45
import nipyapi
pg = nipyapi.canvas.get_process_group('MyProcessGroup')
nipyapi.canvas.purge_process_group(pg, stop=True)
The stop option will deschedule the Process Group before purging it, just to be extra handy
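If the goal is literally every queue in the flow, I believe you can point the same call at the root process group instead of a named one (a sketch; it assumes nipyapi is already configured to talk to your NiFi instance, and it is worth trying on a test flow first):
import nipyapi

# Look up the root process group by its id, then purge everything under it
root_pg = nipyapi.canvas.get_process_group(nipyapi.canvas.get_root_pg_id(), 'id')
nipyapi.canvas.purge_process_group(root_pg, stop=True)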
If you want to get rid of all your data completely, you can stop NiFi and remove all of the "_repository" directories (flow file, content, and provenance). This basically resets your NiFi completely in terms of data.
I am working with multiple systems as workers.
Each worker system has part of the data stored locally, and I want each worker to run the computation on its respective file only.
I have tried using:
distributed.scheduler.decide_worker()
send_task_to_worker(worker, key)
but I could not automate assigning the task for each file.
Also, is there any way I can access the local files of a worker? Using the TCP address, I only have access to a temp folder created on the worker for Dask.
You can target computations to run on certain workers using the workers= keyword to the various methods on the client. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.
You might run a function on each of your workers that tells you which files are present:
>>> client.run(os.listdir, my_directory)
{'192.168.0.1:52523': ['myfile1.dat', 'myfile2.dat'],
'192.168.0.2:4244': ['myfile3.dat'],
'192.168.0.3:5515': ['myfile4.dat', 'myfile5.dat']}
You might then submit computations to run on those workers specifically.
future = client.submit(load, 'myfile1.dat', workers='192.168.0.1:52523')
If you are using dask.delayed you can also pass workers= to the persist method. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.
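For example, a dask.delayed version might look like this (a sketch: load is a hypothetical function that reads one of your files, and the worker address comes from the listing above):
from dask import delayed

d = delayed(load)('myfile3.dat')                        # load is hypothetical
future = client.compute(d, workers='192.168.0.2:4244')  # pin to the worker that holds the file
result = future.result()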
Can I run a program that listens on a port in the cluster?
I want to write an application that accepts HTTP requests and performs some calculations using Spark.
Yes, you can run any code you want on the driver node. You can, for example, use the spray.io HTTP server and connect to the Spark actor system:
import org.apache.spark.SparkEnv
implicit val system = SparkEnv.get.actorSystem
But there is no way to execute arbitrary code on workers. Workers run only code blocks inside RDD's map-reduce functions.
It is hard to understand your English, but if I understood you correctly, you are looking for something like Spark-JobServer.