Memcached/other key-value engine isolation (NoSQL)

I have a bunch of web servers (frontends) behind a load balancer. Each Apache process runs as its own user for every virtual host. The code Apache runs is PHP, and it is not trusted code.
I need shared (between web servers) session storage, and each user (vhost) must be limited to accessing only its own session storage, so that one tenant cannot purge or corrupt data another tenant has stored in memcached.
So basically I'm looking for a solution that authenticates users and provides private buckets.
I know the MySQL route is always available, but I want to avoid the performance penalty introduced by the SQL layer.
Do you have any solution in mind?

I found a product called Couchbase which fully complies with my requirements. It has buckets along with a memcached caching layer and access protocol. It supports SASL authentication, with load balancing and fault tolerance as a bonus.
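To make the bucket-per-tenant idea concrete, here is a minimal sketch (Python with pylibmc; the host name, bucket name and password are made up) of a vhost authenticating to its own SASL-protected bucket over the memcached binary protocol, so it can only see its own session keys:

```python
import pylibmc

# Each tenant/vhost gets its own bucket credentials; SASL requires the
# binary memcached protocol.
sessions = pylibmc.Client(
    ["cb-node1.example.com:11211"],   # hypothetical cluster node
    binary=True,
    username="vhost_a_sessions",      # bucket name acts as the SASL user
    password="vhost-a-secret",
)

# Store and read back a session entry; other tenants' credentials cannot
# reach this bucket, so they cannot purge or corrupt it.
sessions.set("sess:abc123", "serialized-session-data", time=3600)
print(sessions.get("sess:abc123"))
```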

Related

Revoking JWT session tokens with a blacklist: should I create a separate system for the blacklist for performance?

I'm creating a web application (in C++, for performance) where I expect to process a tremendous number of events per second, on the order of thousands.
I've been reading about invalidating JWT tokens for my web sessions, and the most reasonable solution is to have a storage place for blacklisted tokens. The list has to be checked on every request, and my question is performance related: should I create a separate system for storing my blacklisted tokens (like Redis), or should I just use the same PostgreSQL database I'm using for everything else? What are the advantages of using another system?
The reason I'm asking is that I saw many discussions about invalidating JWT tokens online, and many suggest using Redis (without explaining whether it's just a solution relevant to their design or a replacement for their SQL database server for some reason). Well, why not use the same database you're already using for your web application? Is there a reason that makes Redis better for this?
Redis is a lot faster, since the data lives in the server's memory rather than requiring you to open a connection to the DB, run a query, and return the results. So if speed matters, Redis is what you want.
The only downside is that if the server restarts, the blacklisted tokens are gone, unless you persist them to disk somewhere.
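A minimal sketch of the Redis approach in Python with redis-py (the key prefix and the use of the JWT jti claim are my own choices): give each blacklist entry a TTL equal to the token's remaining lifetime, so the set never grows beyond the tokens that are still valid. If losing the list on a restart is a concern, Redis persistence (RDB snapshots or AOF) covers that.

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis instance

def blacklist(jti: str, token_exp: int) -> None:
    """Mark a token as revoked until its own expiry time."""
    ttl = max(int(token_exp - time.time()), 1)
    r.setex(f"jwt:blacklist:{jti}", ttl, 1)

def is_blacklisted(jti: str) -> bool:
    """O(1) in-memory lookup, cheap enough to run on every request."""
    return r.exists(f"jwt:blacklist:{jti}") == 1
```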

Can we switch to using authenticated access to MongoDB with no downtime?

We currently have a fairly complex Mongo environment with multiple query routers and data servers in different AWS regions using sharding and replication so that data will be initially written to a master shard in a local data server and then replicated to all regions.
When we first set this up we didn't add any security to the Mongo infrastructure and are using unauthenticated access for read and write. We now need to enable authentication so that the platform components that are writing data can use a single identity for write and read, and our system administrators can use their own user accounts for admin functionality.
The question is whether and how we can switch to using authentication without taking any downtime in the backend. We can change connection strings on the fly in the components that read and write to the DB, and can roll components in and out of load-balancers if we do need a restart. The concern is on the Mongo side.
Can we enable authentication without having to restart?
Can we continue to allow open access from an anonymous user after enabling authentication (to allow backward compatibility while we update the connection strings)?
If not, can we change the connection strings before we enable authentication and have Mongo accept the connection requests even though it isn't enforcing authentication?
Can we add authorization to our DBs and Collections after the fact?
Will there be any risk to replication as we go through this process? We have a couple of TB of data and if things get out of sync it's very difficult to force a resync.
I'm sure I'm missing some things, so any thoughts here will be much appreciated.
Thanks,
Ian
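I can't vouch for a zero-downtime path end to end, but the usual preparation step is to create the users while the deployment still allows anonymous access, switch the clients to credentialed connection strings (MongoDB will accept credentials even before auth is enforced, as long as the users exist), and only then enable enforcement. A rough Python/pymongo sketch, with placeholder hosts, user names, and database names:

```python
from pymongo import MongoClient

# Connect through a mongos while the cluster still allows anonymous access.
client = MongoClient("mongodb://query-router.example.com:27017")  # placeholder host

# Create admin and application users ahead of time.
client.admin.command(
    "createUser", "clusterAdmin",
    pwd="admin-secret",
    roles=[{"role": "userAdminAnyDatabase", "db": "admin"}],
)
client.admin.command(
    "createUser", "appWriter",
    pwd="app-secret",
    roles=[{"role": "readWrite", "db": "platformdb"}],
)

# Components can already switch to credentialed connection strings;
# these still work while authentication is not yet enforced.
authed = MongoClient(
    "mongodb://appWriter:app-secret@query-router.example.com:27017/platformdb"
    "?authSource=admin"
)
```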

In Hadoop, is there any limit to the size of data that can be accessed through Knox + WebHDFS?

In Hadoop, is there any limit to the size of data that can be accessed from, or ingested into, HDFS through Knox + WebHDFS?
Apache Knox is your best option when you need to access WebHDFS resources from outside a cluster that is protected by firewalls. If you don't have access to all of the datanode ports, then direct access to WebHDFS will not work for you. Opening firewall holes for all of those host:port pairs defeats the purpose of the firewall, introduces a management nightmare, and needlessly leaks network details to external clients.
As Hellmar indicated, it depends on your specific use cases and clients. If you need to ingest huge files, or huge numbers of files, then you may want to consider a different approach to accessing the cluster internals for those clients. If you merely need access to files of any size, then you should be able to extend that access to many clients.
Not having to authenticate using Kerberos/SPNEGO to access such resources opens up many possible clients that would otherwise be unusable with secure clusters.
The Knox user's guide has examples for accessing WebHDFS resources; you can find them at http://knox.apache.org/books/knox-0-7-0/user-guide.html#WebHDFS. It also illustrates the Groovy-based scripting available from Knox, which lets you do some really interesting things.
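For clients that are not using the Groovy DSL, the same access is plain HTTP against the gateway. A hedged Python sketch with requests (gateway URL, topology name, credentials and paths are placeholders; WebHDFS CREATE is a two-step call, and Knox rewrites the redirect Location to point back at the gateway):

```python
import requests

KNOX = "https://knox-gateway.example.com:8443/gateway/default"  # placeholder topology URL
AUTH = ("guest", "guest-password")                              # Knox-managed credentials

# Read a file through the gateway.
resp = requests.get(
    f"{KNOX}/webhdfs/v1/tmp/example.txt",
    params={"op": "OPEN"},
    auth=AUTH,
    verify=False,  # use the gateway's CA bundle in real deployments
)
print(resp.text)

# Upload a file: the first PUT returns a Location header, the second PUT
# streams the actual bytes to that location.
step1 = requests.put(
    f"{KNOX}/webhdfs/v1/tmp/example.txt",
    params={"op": "CREATE", "overwrite": "true"},
    auth=AUTH,
    verify=False,
    allow_redirects=False,
)
with open("local-file.txt", "rb") as f:
    requests.put(step1.headers["Location"], data=f, auth=AUTH, verify=False)
```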
In theory, there is no limit. However, using Knox creates a bottleneck: pure WebHDFS would redirect the read/write request for each block to a (possibly) different datanode, parallelizing access, but with Knox everything is routed through a single gateway and serialized.
That being said, you would probably not want to upload a huge file using Knox and WebHDFS. It will simply take too long (and, depending on your client, you may get a timeout).

Caching in a service-oriented architecture

In a distributed systems environment, we have a RESTful service that needs to provide high read throughput at low latency. Due to limitations in the database technology, and given it's a read-heavy system, we decided to use Memcached. Now, in an SOA, there are at least two choices for the location of the cache: either the client looks up the cache before calling the server, or the client always calls the server, which looks up the cache. In both cases, the caching itself is done in a distributed Memcached server.
Option 1: Client -> RESTful Service -> MemCached -> Database
OR
Option 2: Client -> MemCached -> RESTful Service -> Database
I have an opinion, but I'd love to hear arguments for and against either option from SOA experts in the community. Please assume either option is feasible; it's an architecture question. I'd appreciate you sharing your experience.
I have seen
Option 1: Client -> RESTful Service -> Cache Server -> Database
working very well. The pro, IMHO, is that you are able to operate with and use this layer in a way that "frees" part of the load from the DB, assuming that your end users make a lot of similar requests. After all, the client can decide what storage to spare for caching, and how often to clear it.
I prefer Option 1 and I am currently using it. In this way it is easier to control the load on the DB (just as #ekostatinov mentioned). I have lots of data that are required for every user in the system, but the data is never changed (such as some system rules, types of items, etc). It really reduces the DB load. In this way you can also control the behavior of the cache (such as when to clear the items).
Option 1 is the preferred option, as it makes memcached an implementation detail of the service. The other option means that if the business changes and things can't be kept in the cache (or the caching changes in some other way, etc.), the clients would have to change. Option 1 hides all of that behind the service interface.
Additionally, option 1 lets you evolve the service as you wish: e.g., maybe later you decide you need a new technology, or maybe you solve the performance problem in the DB itself. Again, option 1 lets you make all these changes without dragging the clients into the mess.
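To make that concrete, option 1 is essentially the cache-aside pattern implemented inside the service, so the client never knows a cache exists. A rough Python sketch (the db helper, key scheme and TTL are hypothetical):

```python
import json
import pylibmc

cache = pylibmc.Client(["127.0.0.1:11211"], binary=True)
CACHE_TTL = 300  # seconds; tune per resource

def get_product(product_id, db):
    """Read path: try the cache first, fall back to the database."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    row = db.fetch_product(product_id)          # hypothetical data-access call
    cache.set(key, json.dumps(row), time=CACHE_TTL)
    return row

def update_product(product_id, fields, db):
    """Write path: update the database, then invalidate the cached copy."""
    db.update_product(product_id, fields)       # hypothetical data-access call
    cache.delete(f"product:{product_id}")       # next read repopulates the cache
```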
Is the RESTful API exposed to external consumers? In that case, it is up to the consumer to decide whether they want to use a cache and how much stale data they can tolerate.
As far as the RESTful service goes, the service is the container of the business logic and the authority for the data, so it decides how much to cache, the cache expiry, when to flush, etc. A client consuming the REST service always assumes that the service is providing it with the latest data. Hence option 1 is preferred.
Who is the client in this case?
Is it a wrapper for your REST API? Are you providing both the client and the service?
I can share my experience with the Enduro/X middleware implementation. For local XATMI service calls, any client process attaches to shared memory (LMDB) and checks for the result there. If a response is saved, it returns the data directly from shared memory. If the data is not there, the client process takes the longer path and performs the IPC. For REST access, network clients still perform the HTTP invocation, but the HTTP server, acting as an XATMI client, returns the data from shared memory. In real life, this technique greatly boosted a web frontend application that used the middleware via REST calls.

InProc vs. AppFabric session state with single web server

I have an ASP.Net MVC application which makes significant use of session to save state (including large data collections). At present, it is hosted on a single web server. The session is set to the default of InProc.
An issue arises whereby the application freezes for some users when many users are online. I guess that this is because InProc session does not scale too well, and there is only so much memory available to the process. (What happens if memory demand exceeds the available memory? Does it swap out to disk?)
I have a couple of solutions in mind that would help with scalability: (a) SQL Server session state; (b) configure session state to use AppFabric caching. The first option looks like a good solution, except that it will affect performance and require stored items to be serializable.
What about configuring session state to use AppFabric caching (aka Velocity) in an environment where the single web server is also used as the cache host? How does this differ from InProc in this single-server environment? Will this provide more scalability and available memory than InProc, or will it essentially amount to the same constraints?
You would be better off implementing AppFabric Cache for your scenario. As your system grows, you can increase the number of cache servers with each new web node - something you cannot easily do with SQL Server without additional cost. SQL Server licensing also costs much more than AppFabric - which is bundled with a Windows Server license.
The only benefit SQL Server will provide is recoverability, but for what you need, it's probably overkill.
See related SO post discussing AppFabric Cache vs. SQL Server for session.
As for AppFabric Cache vs. InProc...
You could put your AppFabric Cache on another server if you are running into memory limitations. You can't do this with InProc.
Here are some other miscellaneous benefits of the AppFabric Cache:
Supports a local cache to speed up retrieval and reduce the costs involved in serializing/deserializing.
Provides finer grained controls with respect to cache eviction and expiration policies.
Supports compression of session contents to reduce network bandwidth.
Blob mode versus single item retrieval to improve data retrieval for large objects.
Same session state store can be used across multiple applications (via sharedId).
The most important thing is that the session will survive an app-pool recycle and even a redeploy of the application.
Also, AppFabric can serialize IXmlSerializable objects as well as [Serializable] ones. If you try to use the out-of-proc ASP.NET session state service, you ironically cannot serialize IXmlSerializable objects such as XElement. You can also do completely customized serialization if you want. With AppFabric, your app is much more 'Azure-ready' if you ever move that way.
Then of course you can use it for caching other data if you have a need for that.