Scala Redis driver for high performance

I need to develop a Scala-based application that will read/write from/to managed AWS Redis at a very high rate. The official Redis page mentions several clients but offers no comparison. For my project every microsecond matters. I saw similar questions here on SO, but they are all outdated.
Please advise which client has the best performance.

As another answer pointed out, you can use Jedis: https://github.com/xetorthio/jedis/blob/master/src/main/java/redis/clients/jedis/JedisPool.java
Latency may depend more on making requests within the same AZ/VPC (avoiding external networks) and on using Redis pipelines, which batch multiple commands into a single round trip and so reduce the number of network requests. See pipeline usage examples here:
https://github.com/xetorthio/jedis/wiki/AdvancedUsage
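For illustration, here is a minimal Scala sketch of pipelining through a JedisPool; the ElastiCache endpoint and key names are hypothetical:

```scala
import redis.clients.jedis.{JedisPool, JedisPoolConfig}

object PipelineExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical ElastiCache endpoint; replace with your own.
    val pool = new JedisPool(new JedisPoolConfig(), "my-redis.abc123.use1.cache.amazonaws.com", 6379)
    val jedis = pool.getResource
    try {
      val pipeline = jedis.pipelined()
      // Commands are queued locally; nothing is sent yet.
      (1 to 1000).foreach(i => pipeline.set(s"key:$i", i.toString))
      val last = pipeline.get("key:1000")
      pipeline.sync() // one network round trip flushes the whole batch
      println(last.get()) // Response values are readable only after sync()
    } finally {
      jedis.close() // returns the connection to the pool
      pool.close()
    }
  }
}
```

The thousand SETs above cost a single round trip instead of a thousand, which is where most of the latency win comes from.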
Here is another example combining AWS client libraries with Jedis:
https://github.com/fishercoder1534/AmazonElastiCacheExample/blob/master/src/main/java/AmazonElastiCacheExample.java


What is the recommended way to upgrade a Dataproc cluster?

Dataproc seems to be designed to be stateless/immutable. Is this assumption correct? Should we just quit right now if we are planning to deploy a Hive/Presto data warehouse?
We are struggling to find any documentation that suggests how one should care for a cluster once it has been provisioned:
How to upgrade components?
How to install tools (e.g. Hue etc) after a cluster was established?
How to secure access to data + services once deployed?
The FAQs "Can I run a persistent cluster?" don't really address this either.
The internet suggests we should just create a new cluster if we have a problem. As a developer I'm quite happy with the "minimize state" argument, but I work in the enterprise world, which likes solutions like Hive (and its metadata store), Hue, and Zeppelin, and wants to connect external tools like Tableau to a cluster.
The documentation should really make it clear which use cases Dataproc excels at (batch, on-demand, and short-lived workloads) versus what it isn't really designed for (e.g. OLAP).
Dataproc indeed provides the most benefit for on-demand use cases, but this isn't necessarily at odds with being used for OLAP. The main idea is that the stateful components can all be separated from the "processing" resources so that you can better adjust resources according to needs at different points in time.
The recommended architecture for your Hive metadata is to keep your Hive metastore backend off the cluster, e.g. in a CloudSQL instance; many are able to use Dataproc in this way with short-lived or semi-short-lived clusters (e.g. keeping a pool of live clusters but deleting/recreating the oldest each day or each week) combined with initialization actions pointing the Hiveserver at CloudSQL: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/cloud-sql-proxy
In this world, the stateful metastore pieces are all in CloudSQL and bulk storage is all in GCS. Some clusters might sync from GCS to local HDFS for performance reasons (especially if running HDFS on local-SSD), but even for interactive OLAP use cases, this isn't usually necessary; running queries directly against GCS works fine too. There are admittedly some performance pitfalls for older formats due to longer round-trip latency to GCS, but a bit of tuning can bring it mostly in-line; here's a (non-google-owned) blog post about Presto on Dataproc going over some of those.
This also provides much easier ways to handle traditional cluster admin; upgrades are just swapping out entire clusters, additional tools should be done in initialization actions for easy reproducibility on new clusters, and you can more easily define security perimeters at a per-cluster granularity.
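To illustrate the swap-out-the-cluster approach, here is a minimal sketch using the google-cloud-dataproc Java client from Scala; the project ID, cluster name, Cloud SQL instance, and the metadata key consumed by the cloud-sql-proxy initialization action are assumptions for illustration:

```scala
import com.google.cloud.dataproc.v1._

object RecreateClusterSketch {
  def main(args: Array[String]): Unit = {
    val projectId = "my-project" // hypothetical
    val region    = "us-central1"
    val settings = ClusterControllerSettings.newBuilder()
      .setEndpoint(s"$region-dataproc.googleapis.com:443").build()
    val client = ClusterControllerClient.create(settings)
    try {
      val config = ClusterConfig.newBuilder()
        .setGceClusterConfig(GceClusterConfig.newBuilder()
          // Assumed metadata key read by the cloud-sql-proxy init action.
          .putMetadata("hive-metastore-instance", "my-project:us-central1:hive-metastore"))
        .addInitializationActions(NodeInitializationAction.newBuilder()
          .setExecutableFile("gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh"))
        .build()
      val cluster = Cluster.newBuilder()
        .setClusterName("analytics-20190107") // date-stamped and disposable
        .setConfig(config)
        .build()
      // Because state lives in CloudSQL/GCS, an "upgrade" is just creating
      // a replacement cluster like this and deleting the old one.
      client.createClusterAsync(projectId, region, cluster).get()
    } finally client.close()
  }
}
```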

Best approach to construct a real-time rule engine for our streaming events

We are at the beginning of building an IoT cloud platform project. A complete IoT platform solution has several well-known components. One of them is a real-time rule processing engine, which must determine whether streaming events match rules defined dynamically by end users in a readable format (SQL, Drools if/when/then, etc.).
I am confused because there are lots of products and projects out there (Storm, Spark, Flink, Drools, EsperTech, etc.), so, considering we have a 3-person development team (a junior, a mid-senior, and a senior), what would be the best choice?
Choosing one of the streaming projects such as Apache Flink and learning it well?
Choosing one of the complete solutions (AWS, Azure, etc.)?
A BRMS (Business Rule Management System) like Drools is mainly built for quickly adapting to changes in business logic, and such systems are more mature and stable than stream processing engines like Apache Storm, Spark Streaming, and Flink. Stream processing engines, in contrast, are built for high throughput and low latency. A BRMS may not be able to serve hundreds of millions of events in IoT scenarios and may struggle with event-time-based window calculations.
All these solutions can run on IaaS providers. On AWS you may also want to take a look at AWS EMR and Kinesis/Kinesis Analytics.
Some use cases I've seen:
Stream data directly to FlinkCEP (see the sketch after this list).
Use a rule engine for fast, low-latency responses while simultaneously streaming the data to Spark for analysis and machine learning.
You can also run Drools inside Spark and Flink to hot-deploy user-defined rules.
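As a taste of the FlinkCEP option, here is a minimal Scala sketch; the SensorEvent type, source, and threshold are made up for illustration:

```scala
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event type standing in for a real device stream.
case class SensorEvent(deviceId: String, temperature: Double)

object FlinkCepSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // In production this would come from Kafka/Kinesis, not fromElements.
    val events: DataStream[SensorEvent] = env.fromElements(
      SensorEvent("dev-1", 70.0), SensorEvent("dev-1", 92.0), SensorEvent("dev-1", 95.0))

    // Rule: two consecutive readings above 90 within 10 seconds.
    val overheating = Pattern.begin[SensorEvent]("first").where(_.temperature > 90)
      .next("second").where(_.temperature > 90)
      .within(Time.seconds(10))

    val alerts = CEP.pattern(events.keyBy(_.deviceId), overheating)
      .select(matched => s"Overheating on ${matched("first").head.deviceId}")

    alerts.print()
    env.execute("flink-cep-sketch")
  }
}
```

Event-time window logic like the within(...) clause is exactly the area where a classic BRMS tends to struggle.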
Disclaimer: I work for Losant, but you should check it out. It's developer-friendly and super easy to get started with. We also have a workflow engine where you can build custom logic/rules for your application.
Check out the Waylay rules engine, built specifically for real-time IoT data streams.
In the beginning phase, go for a cloud-based IoT platform like Predix, AWS, SAP, or Watson for rapid product development and initial learning.

What are the pros and cons of DynamoDB with respect to other NoSQL databases?

We use MongoDB database add-on on Heroku for our SaaS product. Now that Amazon launched DynamoDB, a cloud database service, I was wondering how that changes the NoSQL offerings landscape?
Specifically for cloud based services or SaaS vendors, how will using DynamoDB be better or worse as compared to say MongoDB? Are there any cost, performance, scalability, reliability, drivers, community etc. benefits of using one versus the other?
For starters, it will be fully managed by Amazon's expert team, so you can bet that it will scale very well with virtually no input from the end user (developer).
Also, since it's built and managed by Amazon, you can assume that they have designed it to work very well with their infrastructure, so performance should be top notch. In addition to being built specifically for their infrastructure, they have chosen SSDs as storage, so right from the start disk throughput will be significantly higher than that of other data stores on AWS that are HDD-backed.
I haven't seen any drivers yet, and I think it's too early to tell how the community will react, but I suspect that Amazon will provide drivers for all of the most popular languages and that the community will receive this well and in turn create additional drivers and tools.
Using MongoDB through an add-on for Heroku effectively turns MongoDB into a SaaS product as well.
In reality one would be comparing whatever service a chosen provider offers to what Amazon can offer, rather than comparing one persistence solution to another.
This is very hard to do. Each provider will have varying levels of service at different price points, and the option of running the database on your own hardware locally for development purposes is a welcome one.
I think the key difference to consider is that MongoDB is software you can install anywhere (on AWS, on another cloud service, or in-house), whereas DynamoDB is a SaaS offering available exclusively as a hosted service from Amazon (AWS). If you want to retain the option of hosting your application in-house, DynamoDB is not an option. If hosting outside of AWS is not a consideration, then DynamoDB should be your default choice unless very specific features weigh more heavily.
There's a table in the following link that summarizes the attributes of DynamoDB and Cassandra:
http://www.datastax.com/dev/blog/amazon-dynamodb
Something that needs improvement in DynamoDB in order for it to become more usable is the ability to index attributes other than the primary key.
UPDATE 1 (06/04/2013)
On 04/18/2013, Amazon announced support for Local Secondary Indexes, which made DynamoDB dramatically more useful:
http://aws.amazon.com/about-aws/whats-new/2013/04/18/amazon-dynamodb-announces-local-secondary-indexes/
I have to be honest; I was very excited when I heard about the new DynamoDB, and I did attend the webinar yesterday. However, it's difficult to make a decision right now because everything they said was still very vague; I have no idea what functionality will be available through their service.
The one thing I do know is that scaling is handled automatically, which is pretty awesome, yet there are still so many unknowns that it's tough to make a solid analysis until all the facts are in and we can start using it.
Thus far I still see Mongo as working much better for me (personally) for the project I've been working on.
Like most DB decisions, it's really going to come down to a project by project decision of what's best for your need.
I anxiously await more information on the product, as for now though it is in beta and I wouldn't jump ship to adopt the latest and greatest only to be a tester :)
I think one of the key differences between DynamoDB and other NoSQL offerings is provisioned throughput: you pay for a specific throughput level on a table, and provided you keep your data well-partitioned, you can always expect that throughput to be met. So as your application load grows you can scale up and keep your performance more-or-less constant.
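To make that concrete, here is a minimal Scala sketch using the AWS SDK for Java (v1); the table name, key schema, and capacity figures are hypothetical:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model._

object ProvisionedThroughputSketch {
  def main(args: Array[String]): Unit = {
    val client = AmazonDynamoDBClientBuilder.defaultClient()

    // Create a table with explicit read/write capacity units.
    client.createTable(new CreateTableRequest()
      .withTableName("Events") // hypothetical table
      .withKeySchema(new KeySchemaElement("eventId", KeyType.HASH))
      .withAttributeDefinitions(new AttributeDefinition("eventId", ScalarAttributeType.S))
      .withProvisionedThroughput(new ProvisionedThroughput(1000L, 500L)))

    // Scaling up later is a single API call; the data stays in place.
    client.updateTable(new UpdateTableRequest()
      .withTableName("Events")
      .withProvisionedThroughput(new ProvisionedThroughput(2000L, 1000L)))
  }
}
```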
Amazon DynamoDB seems like a pretty decent NoSQL solution. It is fast, and it is pretty easy to use. Other than having an AWS account, there really isn't any setup or maintenance required. The feature set and API are fairly small right now compared to MongoDB/CouchDB/Cassandra, but I would expect them to grow over time as feedback from the developer community is received. Right now, all of the official AWS SDKs include a DynamoDB client.
Pros
Lightning fast (uses SSDs internally).
Really (really) reliable (chances of write failures are low).
Seamless scaling (no need to do manual sharding).
Works as a web service (no server, no configuration, no installation).
Easily integrated with other AWS features (you can store a whole table in S3, use EMR, etc.).
Replication is managed internally, so the chances of accidental data loss are negligible.
Cons
Very (very) limited querying.
Scanning is painful (I remember a scan through Java once ran for 6 hours).
Pre-defined throughput, which means sudden increases beyond the set throughput will be throttled.
Throughput is partitioned as the table is sharded internally (so if you provisioned a throughput of 1000 and the table is partitioned in two, but you are reading only the latest data from one partition, your effective read throughput is only 500).
No joins; limited indexing (basically 2 indexes).
No views, triggers, scripts, or stored procedures.
It's really good as an alternative to session storage in a scalable application. Another good use is logging/auditing in an extensive system. NOT preferable for a feature-rich application with frequent enhancements or changes.
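For the session-storage use case just mentioned, basic item operations are simple key-based puts and gets; here is a minimal sketch with the AWS SDK for Java (v1) from Scala, with made-up table and attribute names:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, GetItemRequest, PutItemRequest}
import scala.jdk.CollectionConverters._

object SessionStoreSketch {
  def main(args: Array[String]): Unit = {
    val client = AmazonDynamoDBClientBuilder.defaultClient()

    // Write a session record keyed by a hypothetical "sessionId" attribute.
    val item = Map(
      "sessionId" -> new AttributeValue("abc-123"),
      "userId"    -> new AttributeValue("user-42")
    ).asJava
    client.putItem(new PutItemRequest("Sessions", item))

    // Read it back by key; key-based lookups are the fast path in DynamoDB.
    val key = Map("sessionId" -> new AttributeValue("abc-123")).asJava
    println(client.getItem(new GetItemRequest("Sessions", key)).getItem)
  }
}
```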

Which database back-end shall I use?

I am writing an iPhone app, that requires cloud back-end DB storage. I have a couple options in mind, and was wondering which one is better fit?
What I need:
be able to perform CRUD in the cloud from the iPhone app
the DB needs to scale (speed-wise) without much or any management
schema free
all I need is to store maybe 1 million records
Google App Engine:
Uses Bigtable, scales, and is schema-free, but I'd need to write a RESTful interface
CouchDB:
Recently released iOS support, RESTful built-in, but I worry about scaling when syncing with remote server
SimpleDB: (seems to be my best pick)
Has an iOS SDK, so I can do CRUD directly; auto-scales (I probably won't be running into the 10GB limit); schema-free
MongoDB:
Don't know much about it; from what I hear, it's faster than SimpleDB and easy to set up, but again I'd need to do the admin work
Cassandra:
Too much work for what I need.
Any insight, feedback, or correction is greatly appreciated.
Regards,
Johnny
If you're looking for zero management on your end, then you've already answered yourself that SimpleDB or GAE are probably your best options.
SimpleDB is probably better in your case, because it'll save you from having to write a simple RESTful interface on top of GAE.
Note that both of them aren't great in terms of speed. I worked with both and there's visible query latency. Unfortunately there's no way for you to tune that - you're completely in the hands of Amazon/Google. That's the price you pay for not managing the datastore yourself, so I guess you'll have to decide if you're willing to pay that price.
I recommend that you try SimpleDB, which is simple enough, first. If latency is a problem then you can move to hosting and tuning your own Mongo or some other option.
SQL Azure Services. Meets your requirements above.
http://en.wikipedia.org/wiki/SQL_Azure

Old concepts with new names (namely REST and Cloud computing)

It seems that SaaS and Cloud computing are old concepts with new names, and I am curious if I am wrong.
For cloud computing you can look at: Difference between cloud computing and distributed computing?
Basically, it seems that what we have been doing with hosting all along is cloud computing; it is just that now some companies have put in much greater resources to ensure better uptime than my local ISP. It seems that there is nothing really new here.
For REST, it seems that it is what we have been doing with CGI scripts for 15 years.
Here is a question on REST: What am I not understanding about REST?
It appears that REST is an old concept, and I am curious how it is different from what has been done since the early days of the web and, to a large extent, the early days of using telnet (which HTTP sits on top of).
Am I mistaken in my simplification of these? I try to see how what is new is like what I know so I can see what more has to be learned in that topic, but for cloud computing and REST it seems that very little needs to be learned.
You are both right and wrong. You are right in the sense that new ideas are normally similar to old ideas, and indeed cloud computing is based significantly on distributed computing.
What is new in cloud computing is
virtualization
self-service
With virtualization, you can run multiple operating systems on a single piece of hardware. While that, in itself, isn't new either, it was never considered a relevant piece of the architecture in distributed systems. Virtualization enables self-service: users can create their own clusters of nodes without the administrator of the hardware taking any action. This allows a significant acceleration of deployment and a significant reduction of cost.
For REST, what you are missing is the client API. It is true that on the server side a REST service can be implemented with CGI. What is new here is that it is not an end user retrieving the URL, but a program.
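To illustrate the "program, not end user" point, here is a minimal Scala sketch of a client consuming a REST resource; the endpoint is hypothetical:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RestClientSketch {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()
    // A program, not a browser, requests a machine-readable representation.
    val request = HttpRequest.newBuilder(URI.create("https://api.example.com/orders/42"))
      .header("Accept", "application/json")
      .GET()
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode()}: ${response.body()}")
  }
}
```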
Saying that HTTP is on top of telnet ignores reality; this is like saying that we have made no progress since the introduction of copper wires for communication. Strictly speaking, HTTP is not on top of telnet but on top of TCP (which telnet is also on top of, these days).
Considering Roy's dissertation coined the term REST back in 2000, you can definitely argue that there is nothing new about REST. Additionally, the REST architectural style was synthesized from successful existing practices, so REST implementations pre-date the definition. Having said that, there is nothing simple about designing REST interfaces. Ever since Netscape first abused cookies to allow servers to maintain session state people have been swimming upstream against the web.
REST's recent resurrection has come mainly from people becoming disillusioned with SOAP-based web services. SOAP tried to hide HTTP instead of embracing it, and I think people are starting to realize how effective HTTP can be as a distributed application protocol that can do more than just deliver HTML to web browsers.
RESTful web applications don't use session state, so one could argue that by that virtue alone it is different than most web applications in existence at the moment.
As for Cloud Computing, I find myself agreeing with Larry Ellison for once in my life.
I'm in agreement on what you've posted. You might consider making this community wiki since it's likely to garner many answers based on opinion. Cloud computing seems to have taken off as a buzzword, and this is largely due to a decrease in cost for mass quantities of hardware. And then there is REST which is really just a formal name and definition for something that has been in place for a long time. Some people like to encapsulate ideas with buzzwords and acronyms. Sometimes it's useful to put a name to an idea though.
Not only this, the concept of things being old concepts with new names is old. It's hard to be original these days :P
You are right about REST: it's mostly old concepts with a lot of added pedantry and not much added substance.
Cloud computing has a small but fundamental difference from distributed computing. In distributed computing you had servers dedicated to particular functions, and usually some sort of directory service to locate the correct server. In cloud computing any server is capable of any task and usually the servers queue up for work which is distributed from a central point.