1. Does a SparkSession need to be closed in standard usage?
2. Is it normal to stop the SparkSession every time, even though it takes several seconds to connect again?
3. If the SparkSession is not stopped, are its resources occupied the whole time? Will there be a memory leak?
Thank you for your help
A Spark application can have only one SparkContext. You don't have to close it unless the application is used only by you. The cluster manager will take care of the resources based on usage: if no job is running, the worker resources will be freed, though a small portion of driver resources will remain in use.
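For standard usage, here is a minimal sketch of the lifecycle (assuming one long-running application; the app name and job are placeholders):

```scala
// A minimal sketch, assuming a single long-running application that reuses one
// session. SparkSession.builder.getOrCreate() returns the already-active session
// (and its single SparkContext) if there is one, so repeated calls do not pay
// the startup cost again. stop() is only needed when the application is done.
import org.apache.spark.sql.SparkSession

object SessionLifecycle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("example")        // placeholder name
      .getOrCreate()             // reused for the whole application lifetime

    spark.range(10).count()      // jobs come and go; the session stays up

    spark.stop()                 // optional: release driver resources at shutdown
  }
}
```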
I am trying to perform some actions on the Driver side in Spark while an application is running. The Driver needs to know the tasks' progress before making any decision. I know that a task's progress can be accessed within each executor or task from the RecordReader class by calling getProgress().
The question is, how can I let the Driver call, or have access to, the getProgress() method of each task? I thought about using broadcast variables, but I don't know how the Driver would distinguish between different tasks.
Note that I am not looking for results displayed in Spark UI.
Any help is appreciated!
One way to do this is to send the progress from each Executor thread to a listening thread in the Driver. This has to be a separate thread, as the main thread is blocked while the action is in progress.
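As a rough illustration of that pattern only (this is not a built-in Spark API; the driver host, port 9999, and the message format are all assumptions), the driver could listen on a socket in a daemon thread while each task reports its partition id and a progress figure:

```scala
// Sketch: the driver opens a ServerSocket on a side thread; each task writes
// "partition=<id> done=<n>" over a plain TCP socket. The driver host/port must
// be reachable from the executors, which is an assumption here.
import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import java.net.{ServerSocket, Socket}
import org.apache.spark.TaskContext

object ProgressListener {
  // Run on the driver before triggering the action.
  def startListener(port: Int): Thread = {
    val t = new Thread(() => {
      val server = new ServerSocket(port)
      while (true) {
        val client = server.accept()
        val line = new BufferedReader(
          new InputStreamReader(client.getInputStream)).readLine()
        println(s"progress update: $line")   // e.g. "partition=3 done=1500"
        client.close()
      }
    })
    t.setDaemon(true)
    t.start()
    t
  }

  // Call from inside a task (e.g. within mapPartitions) to report back.
  def report(driverHost: String, port: Int, done: Long): Unit = {
    val socket = new Socket(driverHost, port)
    val out = new PrintWriter(socket.getOutputStream, true)
    out.println(s"partition=${TaskContext.getPartitionId()} done=$done")
    socket.close()
  }
}
```

The partition id lets the Driver distinguish which task a given update came from.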
I have a kind of proxy server running on a WebServer module and I noticed that this server is being killed due to its memory consumption.
Every time the server gets a new request it creates a child client process; the problem I see is that the process remains alive indefinitely.
Here is the server I'm using:
server.js
I thought response.close() was closing and killing client connections, but it is not.
Here is the list of child processes displayed in htop:
(There are even more of these processes; this is just a fragment of the list.)
I really need to kill those processes because they are using all the free memory. Am I missing something?
I could simply restart the server, but the memory will still be wasted.
Thank you!
EDIT:
The processes I mentioned before are threads, not independent processes as I thought (check this).
Every HTTP request creates a new thread, and that's OK, but the thread is not being killed after the script ends.
Also, I found out that no new threads are created if the request handler doesn't run casper (I mean casper.run(..)).
So, new threads are created only if the server runs a casper instance; the problem is that this instance doesn't end after the run function does.
I tried casper.done() as mentioned below, but it kills the whole process instead of the currently running thread. (I did not find any docs for this function.)
When I execute other casper scripts outside the server, on the same machine, the spawned threads and the whole phantom process end successfully. What could be happening?
I am using PhantomJS 2.1.1 and CasperJS 1.1.1.
Please ask me anything if you want more or specific information.
Thanks again for reading!
This is a well known issue with casper:
https://github.com/casperjs/casperjs/issues/1355
It has not been fixed by the casper guys and is currently marked as an enhancement. I guess it's not on their priority list.
Anyway, the workaround is to write a server-side component, e.g. a Node.js server, to handle the incoming requests and, for every request, run a casper script in a new child process to do the scraping. That child process is closed when casper finishes its job. While this works, it is not an optimal solution: spawning a child process for every request is not cheap, and it would be hard to scale such an approach heavily. However, it is a sufficient workaround. More on this approach is in the link above.
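As a rough sketch of that one-child-process-per-request pattern, here it is on the JVM in Scala purely for illustration (the answer suggests Node.js; this assumes casperjs is on the PATH and a hypothetical scrape.js script that takes the target URL as an argument):

```scala
// Sketch: each HTTP request spawns a fresh casperjs child process; the process
// exits when the script finishes, so no casper threads or memory linger in the
// server itself. The /scrape path, port 8080 and scrape.js are assumptions.
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress
import scala.sys.process._

object CasperProxy {
  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/scrape", (exchange: HttpExchange) => {
      val url = Option(exchange.getRequestURI.getQuery).getOrElse("")
      val output = Seq("casperjs", "scrape.js", url).!!  // blocks until casper exits
      val bytes = output.getBytes("UTF-8")
      exchange.sendResponseHeaders(200, bytes.length)
      exchange.getResponseBody.write(bytes)
      exchange.close()
    })
    server.start()
  }
}
```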
I am trying to understand how state management in Spark Streaming works in general. If I run this example program twice will the second run see state from the first run?
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
Is there a way to achieve this? I am thinking about redeploying an application and I would like not to lose the current state.
tl;dr It depends on what you need the other instance to see. Checkpointing is usually a solution.
ssc.checkpoint(".") (at the line 50 in StatefulNetworkWordCount) enables checkpointing that (quoting the official documentation):
Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures.
A failure can be considered a form of redeployment. It is described in the official documentation under Upgrading Application Code, which lists two cases:
Two instances run in parallel.
One is gracefully brought down, and the other reads the state from the checkpoint directory (see the sketch below).
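For the second case, a minimal sketch of the usual recovery pattern (the checkpoint directory is an assumption, and the stateful computation itself is elided since it would match the linked example):

```scala
// A minimal sketch, assuming a checkpoint directory that survives redeployment
// (e.g. on HDFS). StreamingContext.getOrCreate rebuilds the context, including
// the stateful stream's state, from the checkpoint if one exists; otherwise it
// calls the factory function to create a fresh context.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Restartable {
  val checkpointDir = "hdfs:///checkpoints/stateful-wordcount"  // assumption

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("StatefulNetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)
    // ... define the stateful computation here, as in the linked example ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```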
I am using Quartz to schedule jobs to be executed daily as part of a much larger web application. However, after a couple of days, the administrator would like to stop the execution of a particular job (maybe because it is no longer needed). How do I go about doing this? I read the API docs for the Scheduler and it has a method called interrupt(JobKey jobKey), but that method only works with the same instance of the scheduler that was used to schedule the job.
interrupt(JobKey jobKey)
Request the interruption, within this Scheduler instance, of all
currently executing instances of the identified Job, which must be an
implementor of the InterruptableJob interface.
Is there any way of getting the instance of an existing scheduler? Or maybe use a singleton?
You should definitely use a singleton instance of your scheduler. I recommend using an IoC container to manage this in a clean and efficient way.
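For illustration, a minimal sketch in Scala against the Quartz 2.x API: the default StdSchedulerFactory hands back the same Scheduler instance within a JVM, so admin code elsewhere in the web application can interrupt a job it did not schedule itself (the job key "dailyJob"/"reports" is hypothetical, and the job must check a flag for interrupt() to actually stop it):

```scala
// Sketch: an InterruptableJob that cooperatively stops when interrupted, plus an
// admin action that obtains the shared default scheduler and interrupts the job.
import org.quartz.{InterruptableJob, JobExecutionContext, JobKey}
import org.quartz.impl.StdSchedulerFactory

class DailyJob extends InterruptableJob {
  @volatile private var stopped = false

  override def execute(context: JobExecutionContext): Unit = {
    while (!stopped) {
      Thread.sleep(1000)  // placeholder for one unit of real work
    }
  }

  // Called by scheduler.interrupt(jobKey); the work loop checks the flag.
  override def interrupt(): Unit = { stopped = true }
}

object AdminAction {
  def stopDailyJob(): Unit = {
    val scheduler = StdSchedulerFactory.getDefaultScheduler  // same singleton instance
    scheduler.interrupt(JobKey.jobKey("dailyJob", "reports"))
    // To prevent it from running again, also remove it:
    scheduler.deleteJob(JobKey.jobKey("dailyJob", "reports"))
  }
}
```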
We are using Flower to visualize tasks and workers in Celery. The problem is that we use Amazon autoscaling to spawn new workers. Old workers terminate one day and new workers are spawned the next day, registering themselves as new workers, while the old ones still remain listed as offline workers. This makes sense if we are interested in the stats of each worker. Is there a way to hide them if we are not interested in their stats?
Also, most of the time when a new worker registers itself, Flower has an issue showing it; it shows
Unknown worker 'celery#ip-172-XX-XX-XX'
How can we ensure that each worker can be visualized properly when online and avoid this error?
I wanted the same thing, and not so long ago created an issue on GitHub for it: https://github.com/mher/flower/issues/840. Soon after that, Bjoern Stiel wrote an implementation, which is still waiting to be merged (https://github.com/mher/flower/pull/852). You can simply grab that branch and use it. :)