Scala persistent key value store? - scala

Please advise on persistent key value store for Scala. Is it possible to use Scala Spark components to build such a store with good access times? (I am new to Spark and planing to use it). Thanks!

Spark is a library used for data processing. The underlying datastore is normally Hadoop. So there is a conceptual difference between what Spark is and what a data store is.
You are looking for a persistant key-value store, I would suggest Redis because it is easy to setup, is relatively mature and has a Scala client.
However, you could also use any key-value store depending on your specific needs and what may already exist in your environment. Take a look at these Websites for some guidance:
http://en.wikipedia.org/wiki/NoSQL#Key-value_stores
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores

Related

Is it possible to generate DataFrame rows from the context of a Spark Worker?

The fundamental problem is attempting to use spark to generate data but then work with the data internally. I.e., I have a program that does a thing, and it generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes, and have them each contribute back to the underlying store?
The reason I want to use Spark is that seems to be a very popular framework, and I know this request is a little outside of the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out and run as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).

Spark binary data source vs sc.binaryFiles

Spark 3.0 enables reading binary data using a new data source:
val df = spark.read.format(“binaryFile”).load("/path/to/data")
Using previous spark versions you cloud load data using:
val rdd = sc.binaryFiles("/path/to/data")
Beyond having the option to access binary data using the High-Level API (Dataset) is there any additional benefits or features that spark 3.0 introduce with this feature?
I dont think there is any additional benefit besides developers have more control over data with high level API (Dataframe/ Dataset) than low level (RDD), and they dont need to worry about performance as it is well optimized/ managed by high level API by its own.
Reference -
https://spark.apache.org/docs/3.0.0-preview/sql-data-sources-binaryFile.html
P.S. - I do think my answer does not qualify as a formal answer. I earlier wanted to add it as comment only but unable to do so because I am yet to earn privilege of commenting.. :)

Modelling a mutable collection in Spark

Our existing application loads approximately ten million rows from a database into a collection of objects on startup. The collection is stored in a GigaSpaces cache.
As new messages are received by the application, the cache is checked to see if an entry for that message already exists. If not, a new entity is added to the cache based on the data in the message. (At the same time, the new entity is persisted to a database).
We are investigating the feasibility and value add of re-architecting the application using Spark and Scala. The question is, what would be the correct way to model this in Spark.
My first thought is to load from the database into a Spark RDD. Looking up existing entries would obviously be simple. However, because an RDD is immutable, adding new entries to the cache would require a transformation. Given the large set of data, my presumption is that this would not perform well.
The other idea is to create the cache as a mutable Scala collection. However, how would we then integrate this with Spark, given that Spark works with RDD's?
Thanks
This is more of a design questions. Spark is not great for fast lookups. It is optimize for batch jobs that need to touch almost the entire dataset; potentially multiple times.
If you want something that has fast search-like capabilities you should look into Elastic Search.
Other technologies that are often used for storing large in-memory/lookup tables is redis and memcached.
Since RDDs are immutable, every single cache update would require producing an entirely new RDD from your previous RDD. This is clearly inefficient (you have to manipulate the entire RDD just to update a tiny part of it). As for the other idea of having a mutable scala collection of RDD elements -- well, that won't be distributable across machines/CPUs, so what's the point?
If your goal is to have in-memory, distributable/partitionable operations on your cache, what you're looking for is an operational in-memory data grid, not Apache Spark. For example: Hazelcast, ScaleOut software, etc.
Apache Spark is notoriously bad at fine-grained transformations like the ones you would need for an in-memory distributed cache.
Sorry if I'm not directly answering the technical question, instead I'm answering your question behind your question...

Couchbase patterns for storing a big amount of records

There is quiet popular example of Couchbase + Hadoop tandem for log file processing.
Hadoop is used for MapReduce jobs and storing log files, and Couchbase is for storing and querying the results (of Hadoop jobs).
What about if I want to query for specific log file. Hadoop is not good at this.
What is the best options? Is Couchbase appropriate for kind of usecase or are there any better options? Are there any limitations for this objective in Couchbase?
This could be a good option, but you have to remember that Couchbase keeps the keys in memory (this allows the server to be very fast for all operations set/get/delete ...)
So it depends a lot of the number of entries you have globally.
The reason why we often see the couple Hadoop+Couchbase is to let Hadoop deals with the "BigData" itself, use Couchbase to provide fast access to a subset (that could be very large) of this data.

Deciding suitable key value store : Voldemort vs Cassandra vs Memcached vs Redis

I am using triple store database for one of my project (semantic search engine for healthcare) and it works pretty fine. I am considering on giving it a performance boost by using a layer of key value store above triple store. Triple store querying is slower since we do deep semantic processing.
This is how I am planning to improve performance:
1) Running Hadoop job for all query terms every day by querying triple store.
2) Caching these results in a key value store in a cluster.
3) When user searches for a query term, instead of searching triple store, key value store will be searched first. Triple store will be searched only when query term not found in key value store.
Key value pair which I plan to save is a "String" to "List of POJO mapping". I can save it as a BLOB.
I am confused on using which key value store. I am looking mainly for failover and load balancing support. All I need is a simple key value store which provides above features. I do not need to sort/search within values or any other functionalities.
Please correct me if I am wrong. I am assuming memcached and Redis will be faster since it is in memory. But I do not know if any Java clients of Redis(Jredis) or memchaced(Spymemcached) supports failover. I am not sure whether to go with in memory or persistent storage. I am also considering Voldemort, Cassandra and HBase. Overall key values will be around 2GB to 4GB size. Any pointers on this will be really helpful.
I am very new to nosql and key value stores. Please let me know if you need any more details.
Have you gone over memcached tutorial article (they explain load balancing aspects there, since memcached instances balance load based on your key hash, also spymemcached is discussed how it handles connectivity failures):
Use Memcached for Java enterprise performance, Part 1: Architecture and setup http://www.javaworld.com/javaworld/jw-04-2012/120418-memcached-for-java-enterprise-performance.html
Use Memcached for Java enterprise performance, Part 2: Database-driven web apps http://www.javaworld.com/javaworld/jw-05-2012/120515-memcached-for-java-enterprise-performance-2.html
For enterprise grade fail-over/cross data center replication support in memcached you should use Couchbase that offers these features. The product has evolved from memcached base.
Before you build infrastructure to load your cache, you might just try adding memcached on top of your existing system. First, measure your current performance well. I suggest JMeter or similar tools. Here's the workflow in your application: Check memcached, if it's there, you're done. If not, run the query against the triple store and save the results in memcached. This will improve performance if you have queries that are repeated. Memcached will use the memory you give it efficiently, throwing away things that don't get used very often. Failover is handled by your application (if it's not in memcached, you use your existing infrastructure).
We use triple store and cache data in memcache provided by google app engine and it works fine. It reduced the overhead of sparql query over triple store.
Only cassandra will have mentioned features and CQL full support, which helps in maintaining, otherwise maybe you should look in another direction:
Write heavy, replicated, bigger-than-memory key-value store
Since you want just to cache data in front of your triple store, going with disk-based, or replicated/distributed key-value stores seems to be pointless. All you need is essentially to cache data in front of your queries right on the machines where those queries are done. No "key-value stores", just vanilla Java caching solutions.
In 2016 the best cache for Java is Caffeine.