Is it possible to use Spark to process complex entities with complex dependencies? - scala

Consider a scenario (objects and dependencies are Scala classes):
There is a set of dependencies which themselves require a significant amount of data to be instantiated (data coming from a database).
There is a set of objects with a complex nested hierarchy which store references to those dependencies.
The current workflow consists of:
1. Loading the dependency data from a database and instantiating the dependencies (in a fairly complex way, with interdependencies).
2. Loading the object data from the database and instantiating the objects using the previously loaded dependencies.
3. Running operations on a list of objects, such as:
a. Search with a complex predicate
b. Transform
c. Filter
d. Save to the database
e. Reload from the database
We are considering running these operations on multiple machines. One of the options is to use Spark, but it is not clear how to properly support data serialization and distribute/update the dependencies.
Even if we are able to separate the logic in the objects from the data (making the objects easily serializable), the functions we want to run over the objects will still rely on the complex dependencies mentioned above.
Additionally, at least at the moment, we don't have plans to use any operations requiring shuffling of data between machines; all we need is basically sharding.
Does Spark look like a good fit for such scenario?
If yes, how to handle the complex dependencies?
If no, would appreciate any pointers to alternative systems that can handle the workflow.
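For reference, the standard Spark mechanism for shipping heavy read-only dependencies to every executor is a broadcast variable. A minimal sketch, assuming the dependency graph can be built once on the driver and made Serializable (DependencyGraph, objectIds, loadObject and complexPredicate are hypothetical names, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("complex-deps").getOrCreate()
val sc = spark.sparkContext

// Build the heavy, interdependent graph once on the driver.
val deps: DependencyGraph = DependencyGraph.loadFromDb() // hypothetical loader

// Ship one read-only copy to each executor instead of one copy per task.
val depsBc = sc.broadcast(deps)

val results = sc.parallelize(objectIds)
  .map(id => loadObject(id, depsBc.value))            // instantiate with shared deps
  .filter(obj => complexPredicate(obj, depsBc.value)) // search/filter with shared deps
  .collect()
```

Note this only works if the dependencies can be serialized at all; if they hold live resources (connections, sessions), the usual workaround is to rebuild them lazily once per executor, for example inside a mapPartitions call.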

I don't fully understand what you mean by "complex interdependencies", but it seems that if you only need sharding you won't really get much from Spark - just run multiple copies of whatever you have, and use a queue to synchronize the work and distribute to each copy the shard it needs to work on.
We did something similar converting a PySpark job to a Kubernetes setup where a queue holds the list of ids and multiple pods (we control the scale via kubectl) read from that queue; we got much better performance and a simpler solution - see https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/
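For illustration, the worker loop in that kind of setup is tiny. A sketch in Scala, assuming a Redis-backed work queue as in the linked Kubernetes example (the host, queue name and process function are hypothetical):

```scala
import redis.clients.jedis.Jedis

// Each pod runs this loop independently: pop an id, process it, repeat.
// The queue itself provides the sharding; no shuffle, no coordination.
val jedis = new Jedis("queue-host", 6379)

var id = jedis.lpop("work-queue")
while (id != null) {
  process(id) // hypothetical per-shard processing
  id = jedis.lpop("work-queue")
}
jedis.close()
```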

Related

Is it possible to generate DataFrame rows from the context of a Spark Worker?

The fundamental problem is that I am attempting to use Spark to generate data and then work with the data internally. I.e., I have a program that does a thing, and it generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes, and have them each contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside of the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
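A minimal sketch of that seed-and-flatMap pattern in Scala (the seed count, the fan-out of 100 rows per seed, and the output path are arbitrary placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("seed-generator").getOrCreate()
import spark.implicits._

// One seed per logical unit of work, distributed across the executors.
val seeds = spark.range(0L, 1000L).as[Long]

// Each seed fans out into many generated rows on the executor that owns it.
val generated = seeds
  .flatMap(seed => (0 until 100).map(i => (seed, i, s"row-$seed-$i")))
  .toDF("seed", "index", "payload")

// Each partition writes its own part-file; no data returns to the driver.
generated.write.mode("overwrite").parquet("/tmp/generated")
```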

What is the fastest way to persist large complex data objects in PowerShell for a short term period?

Case in point - I have a build which invokes a lot of REST API calls and processes the results. I would like to split the monolithic step that does all that into 3 steps:
1. initial data acquisition - gets data from the REST API. Plain objects, no reference loops or duplicate references
2. data massaging - enriches the data from (1) with all kinds of useful information. May result in duplicate references (the same object is referenced from multiple places) or reference loops.
3. data processing
The catch is that there is a lot of data, and converting it to JSON takes too much time for my taste. I have not checked the Export-CliXml cmdlet, but I suspect it would be slow too.
If I wrote the code in C# I would use some kind of binary serialization, which should be sophisticated enough to handle reference loops and duplicate references.
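For comparison, this is what that looks like on the JVM, where the built-in binary serialization tracks object identity and therefore survives duplicate references and reference loops. A Scala sketch (the Node class is a made-up stand-in for the enriched data):

```scala
import java.io._

// A tiny object graph with a reference loop.
class Node(val name: String) extends Serializable { var next: Node = _ }

val a = new Node("a")
val b = new Node("b")
a.next = b
b.next = a // cycle: a -> b -> a

// ObjectOutputStream keeps a handle table of already-written objects,
// so shared references and cycles are encoded once and restored intact.
val buf = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buf)
out.writeObject(a)
out.close()

val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
val a2 = in.readObject().asInstanceOf[Node]
assert(a2.next.next eq a2) // the loop survived the round trip
```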
Please note that the serialized data would be written to the build staging directory and deserialized almost immediately, as soon as the next step runs.
I wonder what my options are in PowerShell?
EDIT 1
I would like to clarify what I mean by steps. This is a build running on a CI build server. Each step runs in a separate shell and is reported individually on the build page. There is no memory sharing between the steps. The only way to communicate between the steps is either through build variables or the file system. Of course, using a database is also possible, but it is overkill.
Build variables are set using certain API and are exposed to the subsequent steps as environment variables. As such they are quite limited in length.
So I am talking about communicating through the file system. I am sacrificing performance here for the sake of build granularity - instead of having one monolithic step I want to have 3 smaller steps. This way the build is more transparent and communicates clearly what it is doing. But I have to temporarily persist payloads between steps. If it is possible to minimize the overhead, then the benefits are worth it. If the performance is going to degrade significantly, then I will keep the monolithic step.

DynamoDB - How to handle updates using adjacency list pattern?

So, in DynamoDB the recommended approach to a many-to-many relationship is the adjacency list pattern.
Now, it works great when you need to read the data, because you can easily read several items with one request.
But what if I need to update/delete the data? These operations happen on a specific item instead of a query result.
So if I have thousands of replicated items to facilitate a GET operation, how am I going to update all of these replicas?
The easiest way I can think of is, instead of duplicating the data, to store only an immutable ID; but that's pretty much emulating a relational database and will take at least 2 requests.
Simple answer: You just update the duplicated items :) AFAIK redundant data is preferred in NoSQL databases and there are no shortcuts to updating data.
This of course works best when the read/write ratio of the data is heavily on the read side. And in most everyday apps that is the case (my gut feeling, which could be wrong), so updates to data are rare compared to queries.
DynamoDB has a couple of utilities that might be applicable here, though both have their shortcomings:
BatchWriteItem allows you to put or delete multiple items in one or more tables. Unfortunately, it does not allow updates, so it is probably not applicable to your case. The number of operations is also limited to 25.
TransactWriteItems allows you to perform an atomic operation that groups up to 10 action requests in one or more tables. Again, the number of operations is limited for your case.
My understanding is that both of these should be used with caution and consideration, since they might, for example, cause performance bottlenecks. The simple way of updating each item separately is usually just fine. And since the data is redundant, you can use async operations to make multiple updates in parallel.
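To make the last point concrete, a sketch of parallel per-replica updates with the AWS SDK for Java v2 async client, called from Scala (the table name, key attribute names, and the list of replica keys are hypothetical):

```scala
import software.amazon.awssdk.services.dynamodb.DynamoDbAsyncClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, UpdateItemRequest}
import scala.jdk.CollectionConverters._
import java.util.concurrent.CompletableFuture

val client = DynamoDbAsyncClient.create()

// Every (pk, sk) pair that holds a copy of the duplicated attribute.
val replicas = Seq(("USER#1", "GROUP#a"), ("USER#1", "GROUP#b"))

def updateReplica(pk: String, sk: String, newName: String): CompletableFuture[_] =
  client.updateItem(
    UpdateItemRequest.builder()
      .tableName("AppTable")
      .key(Map(
        "PK" -> AttributeValue.builder().s(pk).build(),
        "SK" -> AttributeValue.builder().s(sk).build()).asJava)
      .updateExpression("SET #n = :name")
      .expressionAttributeNames(Map("#n" -> "name").asJava)
      .expressionAttributeValues(
        Map(":name" -> AttributeValue.builder().s(newName).build()).asJava)
      .build())

// Fire all updates concurrently, then wait for every one to finish.
val futures = replicas.map { case (pk, sk) => updateReplica(pk, sk, "new-value") }
CompletableFuture.allOf(futures: _*).join()
```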

Modelling a mutable collection in Spark

Our existing application loads approximately ten million rows from a database into a collection of objects on startup. The collection is stored in a GigaSpaces cache.
As new messages are received by the application, the cache is checked to see if an entry for that message already exists. If not, a new entity is added to the cache based on the data in the message. (At the same time, the new entity is persisted to a database).
We are investigating the feasibility and value add of re-architecting the application using Spark and Scala. The question is, what would be the correct way to model this in Spark.
My first thought is to load from the database into a Spark RDD. Looking up existing entries would obviously be simple. However, because an RDD is immutable, adding new entries to the cache would require a transformation. Given the large set of data, my presumption is that this would not perform well.
The other idea is to create the cache as a mutable Scala collection. However, how would we then integrate this with Spark, given that Spark works with RDDs?
Thanks
This is more of a design question. Spark is not great for fast lookups. It is optimized for batch jobs that need to touch almost the entire dataset, potentially multiple times.
If you want something that has fast search-like capabilities you should look into Elasticsearch.
Other technologies that are often used for storing large in-memory lookup tables are Redis and Memcached.
Since RDDs are immutable, every single cache update would require producing an entirely new RDD from your previous RDD. This is clearly inefficient (you have to manipulate the entire RDD just to update a tiny part of it). As for the other idea of having a mutable Scala collection of RDD elements -- well, that won't be distributable across machines/CPUs, so what's the point?
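To illustrate the inefficiency, here is roughly what a per-message "insert" would look like against an immutable RDD; every update builds a whole new RDD and grows the lineage by one step (sc is the SparkContext; Entity and loadFromDb are hypothetical):

```scala
// ~10M entries loaded once at startup, keyed by id.
val initial = sc.parallelize(loadFromDb()).map(e => (e.id, e))
var cache = initial

// Per incoming message: union in a one-element RDD. The old RDD is not
// modified; a new one (with a longer lineage) replaces it.
def addEntity(e: Entity): Unit =
  cache = cache.union(sc.parallelize(Seq((e.id, e))))
```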
If your goal is to have in-memory, distributable/partitionable operations on your cache, what you're looking for is an operational in-memory data grid, not Apache Spark. For example: Hazelcast, ScaleOut Software, etc.
Apache Spark is notoriously bad at fine-grained transformations like the ones you would need for an in-memory distributed cache.
Sorry if I'm not directly answering the technical question, instead I'm answering your question behind your question...
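For a sense of what the data-grid route looks like, a minimal Hazelcast sketch of the check-then-add flow from the question (Entity, Message, persistToDb and the map name are hypothetical):

```scala
import com.hazelcast.core.Hazelcast

// Each JVM that starts an instance joins the cluster; the map is
// partitioned across members and supports fine-grained, in-place updates.
val hz = Hazelcast.newHazelcastInstance()
val cache = hz.getMap[String, Entity]("entity-cache")

def onMessage(id: String, msg: Message): Unit =
  if (!cache.containsKey(id)) {       // IMap.putIfAbsent would make this atomic
    val entity = Entity.fromMessage(msg) // hypothetical factory
    cache.put(id, entity)                // visible cluster-wide
    persistToDb(entity)                  // hypothetical write-through
  }
```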

Guava - disk backed large collections

Do Guava's collection implementations spill the data to disk beyond a certain size?
I didn't find a simple reference on this, though it sounds like a natural/simple way of supporting large datasets...
We can use other libraries like MapDB of course... or even more comprehensive setups like Voldemort... but I'm just wondering why Guava doesn't have it...
Is it that, from an architectural point of view, if one needs such a large dataset, one is better off splitting the data out into a separate data store instance instead of putting it in the same JVM...?
Nope; Guava sticks to on-heap collections, just like the ones in java.util. On-disk collections are more or less out of scope for Guava.
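Since the question mentions MapDB, here is what the file-backed alternative looks like in practice; a minimal sketch against the MapDB 3.x API (the file and map names are arbitrary):

```scala
import org.mapdb.{DBMaker, Serializer}

// A ConcurrentMap whose entries live in a file, not on the JVM heap.
val db = DBMaker.fileDB("large-map.db").make()
val map = db.hashMap("cache", Serializer.STRING, Serializer.STRING).createOrOpen()

map.put("key", "value") // written through to the file
println(map.get("key")) // prints "value"
db.close()
```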