Spark/Scala parallel write to redis - scala

Is it possible to write to Redis in parallel from spark?
(Or: how to write tens of thousands of keys/lists quickly from spark)
Currently, I'm writing to Redis by key in sequence, and it's taking forever. I need to write about 90,000 lists (of length 2-2000). Speed is extremely important: right now it's taking on the order of 1 hour. Traditional Redis benchmarks claim thousands of writes per second, but in my pipeline I'm not anywhere near that.
Any help is appreciated.

A single Redis instance runs in one thread, so operations are inherently sequential. If you have a Redis cluster, then the instance to which a datum is written depends on a hash slot calculated from the key being written. This hash function (amongst other things) ensures that the load gets distributed across all the Redis instances in the cluster. If your cluster has N instances, you can execute (almost) at most N parallel writes, because each cluster instance is still a single thread. A reasonable Spark Redis connector should exploit the cluster efficiently.
Either way, Redis is really quick, especially if you use mass inserts.
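For reference, a common way to get closer to those benchmark numbers from Spark is to open one connection per partition and pipeline the commands instead of paying one round trip per key. A rough sketch with the Jedis client follows; the host, port and key layout are assumptions, and with a Redis cluster you would use JedisCluster or a connector such as spark-redis instead.

import org.apache.spark.rdd.RDD
import redis.clients.jedis.Jedis

// Rough sketch: one pipelined connection per Spark partition.
def writeLists(rdd: RDD[(String, Seq[String])]): Unit = {
  rdd.foreachPartition { partition =>
    val jedis = new Jedis("redis-host", 6379)   // hypothetical host/port
    val pipeline = jedis.pipelined()            // batch commands instead of one round trip each
    partition.foreach { case (key, values) =>
      pipeline.rpush(key, values: _*)           // write the whole list in one command
    }
    pipeline.sync()                             // flush the pipeline once per partition
    jedis.close()
  }
}

Pipelining lets each partition send its commands in batches and read the replies in bulk, which is typically where the large gap between "one write per round trip" and the published throughput numbers comes from.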

Related

Cassandra as replacement to PostgreSQL

Is Cassandra with multiple nodes a good choice as a replacement for a single-node PostgreSQL? The data being stored is a time series. It is already tens of gigabytes and is expected to grow. The database should be integrated into a pipeline with Apache Spark as the source and, possibly, the destination for results.
What is needed:
1) redundancy: one node failure shouldn't stop the system (all data should be available)
2) speed: more nodes - less time per single insert/select for one client
3) concurrency: more nodes - better speed for simultaneous inserts/selects from different clients
For your points:
1) This is up to you when choosing the keyspace replication factor RF and the consistency levels CL of your inserts and selects. To be available and consistent while handling the loss of one node, you need RF=3 on your keyspace and CL.QUORUM for both inserts and selects (for QUORUM you need RF/2+1 replicas online; 3/2+1=2 with integer division. With RF=5 you would need 5/2+1=3 replicas online, so you could handle the loss of 2).
2) A single request will be handled by a single node acting as coordinator in your cluster. You do not gain much performance here with single, synchronous requests. If you issue many requests asynchronously, they will be spread across more coordinators and you will gain performance.
3) With more clients you get the same effect - the coordinator will be picked at random (well, there is the TokenAwarePolicy, which will pick an appropriate coordinator).
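To make point 1 concrete, here is a minimal sketch of issuing QUORUM writes and reads from Scala with the DataStax Java driver 3.x; the keyspace, table and column names are invented for illustration.

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}
import scala.collection.JavaConverters._

object QuorumExample {
  def main(args: Array[String]): Unit = {
    // Assumes a keyspace "ts" created with RF=3 and a hypothetical table
    // metrics (sensor_id text, ts timestamp, value double, PRIMARY KEY (sensor_id, ts)).
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("ts")

    // With RF=3, QUORUM needs 2 replicas to respond, so one node can be down.
    val insert = new SimpleStatement(
      "INSERT INTO metrics (sensor_id, ts, value) VALUES (?, ?, ?)",
      "sensor-1", new java.util.Date(), Double.box(42.0))
      .setConsistencyLevel(ConsistencyLevel.QUORUM)
    session.execute(insert)

    val select = new SimpleStatement(
      "SELECT value FROM metrics WHERE sensor_id = ? LIMIT 10", "sensor-1")
      .setConsistencyLevel(ConsistencyLevel.QUORUM)
    session.execute(select).asScala.foreach(row => println(row.getDouble("value")))

    session.close()
    cluster.close()
  }
}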
You've mentioned that you use time series data.
1. Naturally, you can vary the replication factor and consistency level. So yes, Cassandra would be good as a replacement.
2. Inserts would be really fast, as Cassandra writes to memory first (the memtable). So yes, Cassandra would be good as a replacement.
3. Cassandra has linear horizontal scalability. So yes, Cassandra would be good as a replacement.
The drawback is that Cassandra is a key-value store, so you have to model your table structure around your queries, whereas PostgreSQL, as an RDBMS, is more flexible and supports the whole set of SQL operations.
You can read more about some pros and cons of using Cassandra with time series data here and here.

How to read all data from OrientDB in parallel

I want to read all data from an OrientDB database, and I don't want to get an iterator - I want to read all the data in parallel, in chunks, from distinct PCs across the network. Is there any way to read the database's clusters in parallel, or any other way to do this?
I have seen the Spark connector for OrientDB; it queries the clusters of the OrientDB classes directly in order to read the values of a complete class in parallel.
Orient-Spark
Git-code
You can use PARALLEL in a SELECT query.
See: https://orientdb.com/docs/last/SQL-Query.html
PARALLEL Executes the query against x concurrent threads, where x refers to the number of processors or cores found on the host operating system of the query. You may find PARALLEL execution useful on long running queries or queries that involve multiple clusters. For simple queries, using PARALLEL may cause a slow down due to the overhead inherent in using multiple threads.
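For illustration, running such a query from Scala might look roughly like this with the OrientDB 2.x document API; the database URL, credentials and class name are hypothetical.

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx
import com.orientechnologies.orient.core.record.impl.ODocument
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery
import scala.collection.JavaConverters._

object ParallelScan {
  def main(args: Array[String]): Unit = {
    val db = new ODatabaseDocumentTx("remote:localhost/mydb")
    db.open("admin", "admin")
    try {
      // PARALLEL asks the server to scan the class's clusters on multiple threads.
      val query = new OSQLSynchQuery[ODocument]("SELECT FROM Measurement PARALLEL")
      val results: java.util.List[ODocument] = db.query(query)
      results.asScala.foreach(doc => println(doc.toJSON))
    } finally {
      db.close()
    }
  }
}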

Spark and sharded JDBC datasources

I have a production sharded cluster of PostgreSQL machines where sharding is handled at the application layer. (Created records are assigned a system generated unique identifier - not a UUID - which includes a 0-255 value indicating the shard # that record lives on.) This cluster is replicated in RDS so large read queries can be executed against it.
I'm trying to figure out the best option for accessing this data within Spark.
I was thinking of creating a small dataset (a text file) that contains only the shard names, i.e., integration-shard-0, integration-shard-1, etc. Then I'd partition this dataset across the Spark cluster so ideally each worker would only have a single shard name (but I'd have to handle cases where a worker has more than one shard). Then when I create a JdbcRDD I'd actually create 1..n such RDDs, one for each shard name residing on that worker, and merge the resulting RDDs together.
This seems like it would work but before I go down this path I wanted to see how other people have solved similar problems.
(I also have a separate Cassandra cluster available as second datacenter for analytic processing which I will be accessing with Spark.)
I ended up writing my own ShardedJdbcRDD for which the preliminary version can be found at the following gist:
https://gist.github.com/cfeduke/3bca88ed793ddf20ea6d
This version doesn't support use from Java, only Scala (I may update it). It also doesn't have the same sub-partitioning scheme that JdbcRDD has, for which I will eventually create an overloaded constructor. Basically, ShardedJdbcRDD will query your RDBMS shards across the cluster; if you have at least as many Spark slaves as shards, each slave will get one shard for its partition.
A future overloaded constructor will support the same range query that JdbcRDD has, so if there are more Spark slaves in the cluster than shards, the data can be broken up into smaller sets through range queries.
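For comparison, a bare-bones version of the "one JdbcRDD per shard, union the results" plan from the question could look like the sketch below; the shard hosts, database, table and id bounds are all hypothetical.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{JdbcRDD, RDD}

object ShardedRead {
  def readAllShards(sc: SparkContext, shardHosts: Seq[String]): RDD[(Long, String)] = {
    val perShard = shardHosts.map { host =>
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection(s"jdbc:postgresql://$host/integration"),
        // JdbcRDD expects exactly two '?' placeholders for the partition bounds.
        "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
        0L, 1000000000L,        // lower/upper bound of the id range (made up)
        1,                      // one partition per shard; raise this to sub-partition
        (rs: ResultSet) => (rs.getLong("id"), rs.getString("payload")))
    }
    sc.union(perShard)          // merge all shard RDDs into one
  }
}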

Redis versus Cassandra (Bigtable data model)

Suppose I need to do the following operations intensively:
put(key, value)
where value is a map of <column name, column value>.
I haven't known NoSQL for long. What I know is that both the Cassandra insert (which conforms to the API defined in the Bigtable paper) and the Redis HSET command could do that. But what are the pros and cons of each approach? Is there any performance or scalability difference?
EDIT:
My requirement is something like an IM server - I need to store session data, and I want all of it to be in memory so that low latency can be easily achieved. A session lasts for at most 2 hours. There are no consistency requirements to consider yet, and disk is only for fail-over. Loss of data is not terrible. All I need is low latency. Operations per second - the more, the better.
Both Redis and Cassandra can be used as a key-value store. The difference is in speed, scale and reliability.
Redis works best as a single server, where the entire data set resides in memory.
Cassandra can handle data sets that don't fit in memory, and data sets that don't fit on a single machine. As part of being distributed over multiple machines, Cassandra is much more reliable: it can handle machine failures, rebuilding machines, and adding capacity to the cluster when needed.
Because Redis is entirely in memory, and reads/writes are served by a single machine (a single Cassandra write will typically talk to multiple machines), Redis will most likely be faster.
If your primary goal is speed, and you don't need to store data reliably, and your data set fits in memory, then Redis would probably be the better solution.
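For the session-store use case in the question, the Redis side might look roughly like the following sketch with the Jedis client; the key layout and field names are invented for illustration.

import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

object SessionStore {
  val TwoHoursInSeconds = 2 * 60 * 60

  def saveSession(jedis: Jedis, sessionId: String, fields: Map[String, String]): Unit = {
    val key = s"session:$sessionId"
    jedis.hmset(key, fields.asJava)       // store the column name -> column value map as a hash
    jedis.expire(key, TwoHoursInSeconds)  // sessions live for at most 2 hours
  }

  def loadSession(jedis: Jedis, sessionId: String): Map[String, String] =
    jedis.hgetAll(s"session:$sessionId").asScala.toMap
}

The TTL on the key means expired sessions disappear on their own, which fits the "loss of data is not terrible" requirement.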

Distributing work to multiple cores: Hadoop or Scala's parallel collections?

What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question: is your Scala code capable of fully utilizing all the cores available? If you have good intrinsic synchronization between the parts of a document being processed, or some other way to parallelize the algorithm without lock contention, then "B" is the way to go. If so, configure one mapper per node and let your mapper utilize the cores as well as it can.
If your gain from parallelization is not that good, and adding more threads (cores) does not improve performance linearly, then "A" can be the better way. The efficiency of "A" also depends on the size of your RAM - you will need enough RAM for 10 mappers per node.
I suspect the ideal solution is somewhere in between. So my suggestion is to develop a mapper which takes the number of threads as a parameter, then run a few tests, increasing the number of threads per mapper while decreasing the number of mappers per node; a sketch of such a mapper body follows.
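As a sketch of what that parameterized, per-node parallelism could look like with Scala parallel collections (assuming Scala 2.12, where .par is built in; on 2.13 the parallel collections live in a separate module, and process stands in for the real per-document work):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

object DocProcessor {
  def process(doc: String): Int = doc.split("\\s+").length  // placeholder work

  def processAll(docs: Seq[String], numThreads: Int): Seq[Int] = {
    val par = docs.par
    // Cap the pool so this mapper uses exactly numThreads cores, matching the
    // "threads per mapper" tuning suggested above.
    par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(numThreads))
    par.map(process).seq
  }
}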
Hadoop is not very good at processing a lot of small files; it is better suited to a small number of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. I also don't think you should use Hadoop only as a distribution mechanism; that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but not try to bend Hadoop to your will.