Run DBAPI call asynchronously in Pyspark application - pyspark

I have an application that creates a few dataframes, writes them to disk, then runs a command using vertica_python to load the data into Vertica. The Spark Vertica connector doesn't work because of an encrypted drive.
What I'd like is for the application to run the command that loads the data, then move on to the next job immediately. What it does instead is wait for the load to finish in Vertica before moving to the next job. How can I have it do what I want? Thanks.
What's weird about this problem is that the command I'd like to have run in the background is as simple as db_client.cursor.execute(command). This shouldn't be blocking under normal circumstances, so why is it in Spark?
To be very specific, what is happening is that I'm reading in a dataframe, doing transformations, and writing to s3. Then I'd like to start the db loading the files from s3 before taking the transformed dataframe, transforming it again, writing to s3, loading to db, and so on, multiple times.

I see now what I was doing wrong. Simply putting the DBAPI call in its own thread isn't enough. I have to put the other calls that I want to run concurrently in their own threads as well.
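One way to read that fix is sketched below, using only the standard library. It is not my actual code: load_into_db and transform_and_write are hypothetical stand-ins for the blocking db_client.cursor.execute(command) call and for the Spark transform plus the write to s3, and the sleeps only simulate work. Both kinds of call are submitted to a thread pool, so the load of one batch overlaps with the transform of the next.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_into_db(batch_id):
    # Hypothetical stand-in for the blocking db_client.cursor.execute(command).
    time.sleep(0.05)
    return f"loaded batch {batch_id}"


def transform_and_write(batch_id):
    # Hypothetical stand-in for the Spark transformations and the write to s3.
    time.sleep(0.05)
    return f"wrote batch {batch_id}"


def run_pipeline(num_batches):
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending_load = None
        for batch_id in range(num_batches):
            # Start the next transform while the previous load is still running.
            write = pool.submit(transform_and_write, batch_id)
            results.append(write.result())
            # Collect the previous load before starting the next one.
            if pending_load is not None:
                results.append(pending_load.result())
            pending_load = pool.submit(load_into_db, batch_id)
        # Wait for the final load so that errors are not silently dropped.
        if pending_load is not None:
            results.append(pending_load.result())
    return results
```

Calling result() on each future re-raises any exception from the worker thread, so a failed load still surfaces in the main thread.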

Related

Flink StreamingEnvironment does not terminate when using Kafka as source

I am using Flink for a Streaming Application. When creating the stream from a Collection or a List, the application terminates and everything after the "env.execute" gets executed normally.
I need to use a different source for the stream. More precisely, I use Kafka as a source (env.addSource(...)). In this case, the program just blocks when it reaches the end of the stream.
I created an appropriate Deserialization Schema for my stream, having an extra event that signals the end of the stream.
I know that the isEndOfStream() condition succeeds on that point (I have an appropriate message printed on the screen in this case).
At this point the program just stops and does nothing, so the commands that follow the "execute" line are never reached.
I am using Flink 1.7.2 and the flink-connector-kafka_2.11, with Scala 2.11.12. I am executing using the IntelliJ environment and Maven.
While researching, I found a suggestion to throw an exception on reaching the end of the stream (using the Schema's capabilities). That does not serve my goal, because I also have more operators/commands within the execution of the environment that need to be executed (and do get executed correctly at this point). If I choose to disrupt the program by throwing an exception, I would lose everything else.
After the execution line I use the .getNetRuntime() function to measure the running time of my operators within the stream.
I need to have StreamingEnvironment end like it does when using a List as a source. Is there a way to remove Kafka at this point for example?

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using DataFrames within the listener itself. That assumes using the same SparkContext, since as of 2.1.x only one context is allowed per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible or feasible?
I'm trying to measure the performance of jobs/stages/tasks, record it, and then analyze it programmatically. Maybe that is not the best way? The Web UI is good, but I need to make things presentable.
I can force the creation of dataframes on the endJob event, but a few errors are thrown (basically they refer to not being able to propagate events to the listener), and in general I would like to avoid unnecessary manipulations. I want a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound by the limitation of having a single SparkContext per JVM.
That limitation is however easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend the architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events that it probably already collects for the web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistent storage to store the events, and have some other processing application consume them (just like Spark History Server works).

Spark: check if parquet write succeeded

I have a Spark 1.6.2 (on Scala) process which writes a parquet file, and when it finishes, it should load it back again as a DataFrame. Is there a Spark method to check if a DataFrameWriter finished successfully and resume only after that? I've tried using Future and onComplete, but it doesn't work well with Spark (SparkContext gets shut down).
I assume I can just look for the _SUCCESS file in the folder and loop until it's there, but then if the process gets stuck, I'll have an infinite loop.
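The infinite-loop worry can be handled by putting a deadline on the polling loop. Below is a minimal sketch in Python rather than Scala, assuming the output folder is on a local or mounted filesystem that os.path can see; for HDFS or s3 the existence check would have to go through the corresponding filesystem API instead. The function name wait_for_success and its timeout_s and poll_s parameters are made up for illustration.

```python
import os
import time


def wait_for_success(output_dir, timeout_s=600.0, poll_s=5.0):
    """Return True once output_dir/_SUCCESS exists, False if timeout_s elapses first."""
    marker = os.path.join(output_dir, "_SUCCESS")
    # time.monotonic() is not affected by wall-clock adjustments.
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(marker):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A False return means the marker never showed up in time, so the caller can fail or retry instead of spinning forever.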

Spark streaming merge data

My understanding is that Spark Streaming serialises the closure (e.g. map, filter, etc.) and executes it on worker nodes (as explained here). Is there some way of sending the results back to the driver program and performing further operations on the local machine?
In our specific use case, we are trying to turn the results produced by Spark into an observable stream (using RxScala).
Someone posted a comment but deleted it afterwards. He suggested using collect() on an RDD. A simple test showed that collect gathers data from the worker nodes and executes on the driver node; exactly what I needed.

Multiple Actors that write to the same file + rotate

I have written a very simple webserver in Scala (based on Actors). Its purpose
is to log events from our frontend server (such as when a user clicks a
button or a page is loaded). The file will need to be rotated every 64-100 MB or so and
it will be sent to s3 for later analysis with Hadoop. The amount of traffic will
be about 50-100 calls/s.
Some questions that come to mind:
How do I make sure that all actors can write to one file in a thread-safe way?
What is the best way to rotate the file after X MB? Should I do this
in my code or from the filesystem (if I do it from the filesystem, how do I then verify
that the file isn't in the middle of a write or that the buffer is flushed)?
One simple method would be to have a single file writer actor that serialized all writes to the disk. You could then have multiple request handler actors that fed it updates as they processed logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file. Having more than a single actor would open the possibility of concurrent writes, which would at best corrupt your log file. Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor. Unfortunately, your task is inherently serial at the point you write to disk. You could do something more involved like merge log files coming from multiple actors at rotation time but that seems like overkill. Unless you're generating that 64-100MB in a second or two, I'd be surprised if the extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to calculate the amount that has been written since the last rotation and I don't think tracking in the actor's internal state versus polling the filesystem would make a difference one way or the other.
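To make the single-writer idea concrete, here is a rough sketch in Python rather than Scala, with one thread and a queue standing in for the writer actor and its mailbox. The class name RotatingLogWriter, the max_bytes parameter and the numbered-suffix naming of rotated files are all made up for illustration. Request handlers only enqueue lines; the writer thread alone touches the file and tracks the bytes written since the last rotation in its own state.

```python
import os
import queue
import threading


class RotatingLogWriter:
    """One writer thread owns the file; producers only enqueue lines."""

    def __init__(self, path, max_bytes):
        self.path = path
        self.max_bytes = max_bytes
        self._queue = queue.Queue()
        self._written = 0    # bytes written since the last rotation
        self._rotations = 0  # number of rotations performed so far
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, line):
        # Safe to call from any thread: queue.Queue is thread-safe.
        self._queue.put(line)

    def close(self):
        # None is the shutdown sentinel; lines queued before it are still written.
        self._queue.put(None)
        self._thread.join()

    def _run(self):
        out = open(self.path, "ab")
        try:
            while True:
                line = self._queue.get()
                if line is None:
                    break
                data = (line + "\n").encode("utf-8")
                # Rotate before a write that would push the file past the limit.
                if self._written > 0 and self._written + len(data) > self.max_bytes:
                    out.close()  # closing flushes, so the rotated file is complete
                    self._rotations += 1
                    os.replace(self.path, f"{self.path}.{self._rotations}")
                    out = open(self.path, "ab")
                    self._written = 0
                out.write(data)
                self._written += len(data)
        finally:
            out.close()
```

Because rotation happens inside the writer thread between two writes, a rotated file is never caught in the middle of a write, which also covers the flushing question for this setup.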
You can use a single actor to write every request coming from the different threads. Since all of the requests go through this actor, there will be no concurrency problems.
As for rolling the file, if your write requests can be logged line by line, then you can use log4j's or logback's RollingFileAppender. Otherwise, you can write your own, which is easy as long as you remember to lock the file before performing any delete or update operations.
Rolling usually means that you rename the older files and the current file to other names and then create a new file with the current file name. That way, you can always write to the file with the current file name.