In Hadoop, is there any limit to the size of data that can be accessed through Knox + WebHDFS?

In Hadoop, is there any limit to the size of data that can be accessed from, or ingested into, HDFS through Knox + WebHDFS?

Apache Knox is your best option when you need to access WebHDFS resources from outside of a cluster that is protected by firewalls. If you don't have access to all of the datanode ports, then direct access to WebHDFS will not work for you. Opening firewall holes for all of those host:port pairs defeats the purpose of the firewall, introduces a management nightmare, and needlessly leaks network details to external clients.
As Hellmar indicated, it depends on your specific use cases and clients. If you need to ingest huge files, or huge numbers of files, then you may want to consider a different approach that gives those clients access to the cluster internals. If you merely need access to files of any size, then you should be able to extend that access to many clients.
Not having to authenticate using Kerberos/SPNEGO to access such resources opens up many possible clients that would otherwise be unusable with secure clusters.
The Knox User's Guide has examples for accessing WebHDFS resources; you can find them at http://knox.apache.org/books/knox-0-7-0/user-guide.html#WebHDFS - it also illustrates the Groovy-based scripting available from Knox, which allows you to do some really interesting things.
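For a rough idea of what client access through the gateway looks like, here is a minimal sketch in Python. The gateway host, port, topology name ("sandbox"), credentials, and file paths are placeholders, not values from the question; substitute whatever your Knox deployment actually uses.

    # Sketch: reading from HDFS through a Knox gateway's WebHDFS API.
    # Host, topology, credentials, and paths below are placeholders.
    import requests

    GATEWAY = "https://knox.example.com:8443/gateway/sandbox"
    AUTH = ("guest", "guest-password")

    # List a directory (op=LISTSTATUS)
    resp = requests.get(
        f"{GATEWAY}/webhdfs/v1/tmp?op=LISTSTATUS",
        auth=AUTH,
        verify=False,  # only if the gateway uses a self-signed certificate
    )
    print(resp.json())

    # Read a file (op=OPEN); Knox proxies the datanode redirect for you,
    # so the client only ever talks to the gateway.
    resp = requests.get(
        f"{GATEWAY}/webhdfs/v1/tmp/example.txt?op=OPEN",
        auth=AUTH,
        verify=False,
        stream=True,
    )
    with open("example.txt", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)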

In theory, there is no limit. However, using Knox creates a bottleneck. Pure WebHDFS would redirect the read/write request for each block to a (possibly) different datanode, parallelizing access; but with Knox everything is routed through a single gateway and serialized.
That being said, you would probably not want to upload a huge file using Knox and WebHDFS. It will simply take too long (and, depending on your client, you may get a timeout).

Related

What are the options of routing HTTP connections to one specific instance out of many instances behind a load balancer?

Assume there is a system that accepts millions of simultaneous WebSocket connections from client applications. I was wondering if there is a way to route WebSocket connections to a specific instance behind a load balancer (or IP/domain/etc.) if clients provide some form of metadata, such as a hash key, instance name, etc.
For instance, let's say each WebSocket client of the above system will always belong to a group (e.g. a max group size of 100), and it will attempt to communicate with the 99 other clients using the above system as a message gateway.
So the system's responsibility is to relay messages sent from clients in a group to the other 99 clients in the same group. Clients won't ever need to communicate with clients who belong to different groups.
Of course, one way to tackle this problem is to use a pub/sub system, such that regardless of which instance clients are connected to, the server can simply publish the message to the pub/sub system with a group identifier, and other clients can subscribe to the messages with that group identifier.
However, the pub/sub system can potentially run into scaling challenges, excessive resource usage (a single message getting published to thousands of instances), management overhead, increased latency, cost, and so on.
If it is possible to guarantee that the WebSocket clients in a group will all be connected to the same instance behind the LB, we can skip the pub/sub system and make things simpler, lower latency, and so on.
Would this be something that is possible to do, and if it isn't, what would be the best option?
(I am using Kubernetes in one of the cloud service providers if that matters.)
Routing in HTTP is generally based on the hostname and/or URL path, and sometimes, to a lesser degree, on other headers like cookies. In this case, that would mean each group should have its own unique URL.
But that part is easy; what I think you're really asking is "given arbitrary URLs, how can I get consistent routing?", which is much, much more complicated. The base concept is consistent hashing: you hash the URL and use that to pick which endpoint to talk to. But then how do you deal with adding or removing replicas without scrambling the mapping entirely? That usually means using a hash ring and assigning portions of the hash space to specific replicas. Unfortunately, this is the point where off-the-shelf tools aren't enough. These kinds of systems require deep knowledge of your protocol and system specifics, so you'll probably need to rig this up yourself.
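To make the hash-ring idea concrete, here is a minimal sketch in Python. The replica names and virtual-node count are illustrative only; a production version would also need to handle ring updates and health checks.

    # Minimal consistent-hashing sketch: map arbitrary keys (e.g. group IDs or URLs)
    # onto a fixed set of backend replicas so that adding/removing a replica only
    # remaps a small fraction of keys. Replica names and vnode count are arbitrary.
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, replicas, vnodes=100):
            self._ring = []  # sorted list of (hash, replica) points on the ring
            for replica in replicas:
                for i in range(vnodes):
                    self._ring.append((self._hash(f"{replica}#{i}"), replica))
            self._ring.sort()
            self._keys = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def node_for(self, key):
            h = self._hash(key)
            idx = bisect.bisect(self._keys, h) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(["ws-backend-1", "ws-backend-2", "ws-backend-3"])
    print(ring.node_for("group-42"))  # every client in group-42 maps to the same backend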

How to handle private files in a microservice

We're working on a backend project and we've started a move to microservices. We already have a few services in place, one of which is a FileService that stores and fetches files (using underlying Amazon S3 storage). The FileService also provides file checksums, authentication, and a retry mechanism, and is used to share files across services and with clients.
We are now building a new service, and part of this service's private data is files that the service stores and uses for its business logic. We have a dilemma over whether we should use the FileService to store and fetch these files, or handle storage and fetching internally in the new service.
The reason to use the FileService is that we get all the features already implemented in it for free (retry, checksum, etc.).
The reason not to use it is that we want the new service to be able to work autonomously, and using the FileService ties the new service to it (it must handle OAuth2 authentication to fetch/upload files, the FileService and the AuthService must be deployed whenever this service is deployed, etc.).
I wanted to know if someone has best practices for storing private files in a microservices environment, and what the best approach is, with its pros and cons.
Converting an in-process FileService component into a microservice will definitely have advantages as well as disadvantages. You've listed several of them, but most importantly you have to build a cost/benefit analysis matrix that applies specifically to your business and domain.
There is no "best practices" approach here.
Costs:
is it okay for you to increase response times? Because now you will have to transfer files twice: S3 -> FileService microservice -> client microservice
how much more likely does losing a connection between nodes become?
how big are your files? Could an unreliable connection between microservices become a problem?
how frequently do you need to access those files? Will you lose the ability to keep a local cache to speed things up?
are you okay with implementing and supporting a separate auth microservice, or can you just whitelist this service in your firewall?
Benefits:
you don't have to redeploy all dependent components every time the logic of storing files or doing retries changes.
you can move to another cloud provider more easily in the future if necessary, again, without redeploying everything.
it is reusable in a heterogeneous environment, where other components may be implemented using different technology stacks.
Conclusion:
There is no way to answer those questions without actually talking with business people and discussing the risks around such a transition.
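As a rough illustration of the "autonomous" side of the trade-off: the features the FileService gives you for free are not necessarily expensive to reproduce locally. The sketch below assumes direct S3 access via boto3; the bucket name, key scheme, and retry policy are placeholders, not anything from the question.

    # Hypothetical in-service storage helper: the new service owns its private files
    # in S3 directly, with a simple retry loop and an integrity checksum, instead of
    # depending on the shared FileService. Bucket and key names are placeholders.
    import hashlib
    import time

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-service-private-files"  # assumed bucket owned by this service

    def put_file(key, data, attempts=3):
        digest = hashlib.sha256(data).hexdigest()
        for attempt in range(1, attempts + 1):
            try:
                s3.put_object(
                    Bucket=BUCKET,
                    Key=key,
                    Body=data,
                    Metadata={"sha256": digest},  # stored for later verification
                )
                return digest
            except Exception:
                if attempt == attempts:
                    raise
                time.sleep(2 ** attempt)  # simple exponential backoff

    def get_file(key):
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        data = obj["Body"].read()
        expected = obj["Metadata"].get("sha256")
        if expected and hashlib.sha256(data).hexdigest() != expected:
            raise IOError(f"checksum mismatch for {key}")
        return data

Whether this duplication is acceptable, or whether it defeats the point of having a shared FileService, is exactly the business decision described above.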

Caching in a Service oriented architecture

In a distributed systems environment, we have a RESTful service that needs to provide high read throughput at low latency. Due to limitations in the database technology, and given that it's a read-heavy system, we decided to use memcached. Now, in an SOA, there are at least two choices for the location of the cache: basically, the client looks up the cache before calling the server, versus the client always calls the server, which looks up the cache. In both cases, caching itself is done in a distributed memcached server.
Option 1: Client -> RESTful Service -> Memcached -> Database
OR
Option 2: Client -> Memcached -> RESTful Service -> Database
I have an opinion, but I'd love to hear arguments for and against either option from SOA experts in the community. Please assume either option is feasible; it's an architecture question. I appreciate you sharing your experience.
I have seen
Option 1: Client -> RESTful Service -> Cache Server -> Database
working very well. The pros, IMHO, are that you are able to operate with and use this layer in a way that takes part of the load off the DB, assuming your end users make a lot of similar requests. After all, the client of the cache layer can decide how much storage to spare for caching, and how often to clear it.
I prefer Option 1 and I am currently using it. In this way it is easier to control the load on the DB (just as #ekostatinov mentioned). I have lots of data that are required for every user in the system, but the data is never changed (such as some system rules, types of items, etc). It really reduces the DB load. In this way you can also control the behavior of the cache (such as when to clear the items).
Option 1 is the preferred option, as it makes memcached an implementation detail of the service. The other option means that if the business changes and things can't be kept in the cache (or another cache is needed, etc.), the clients would have to change. Option 1 hides all that behind the service interface.
Additionally, option 1 lets you evolve the service as you wish: e.g. maybe later you decide you need a new technology, or maybe you'd solve the performance problem in the DB itself. Again, option 1 lets you make all these changes without dragging the clients into the mess.
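As an illustration of option 1, here is a minimal read-through cache sketch that lives entirely inside the service, so the cache never leaks into the client contract. The key format, TTL, pymemcache client, and DB accessor are assumptions for the sake of the example.

    # Sketch of a service-side read-through cache (option 1): the REST handler
    # checks memcached first and falls back to the database. Key format, TTL,
    # and libraries are illustrative only.
    import json
    from pymemcache.client.base import Client

    cache = Client(("memcached.internal", 11211))
    CACHE_TTL = 300  # seconds

    def get_item(item_id):
        key = f"item:{item_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

        row = load_item_from_db(item_id)  # hypothetical DB accessor
        cache.set(key, json.dumps(row), expire=CACHE_TTL)
        return row

    def load_item_from_db(item_id):
        # stand-in for the real database query
        return {"id": item_id, "name": "example"}

Swapping memcached for another store, or removing the cache entirely, changes only this module; clients keep calling the same REST endpoint.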
Is the RESTful API exposed to external consumers? In that case, it is up to the consumer to decide whether they want to use a cache and how much stale data they can accept.
As far as the RESTful service goes, the service is the container of business logic and the authority over the data, so it decides how much to cache, the cache expiry, when to flush, etc. A client consuming the REST service always assumes that the service is providing it with the latest data. Hence option 1 is preferred.
Who is the client in this case? Is it a wrapper for your REST API? Are you providing both the client and the service?
I can share my experience with the Enduro/X middleware implementation. For local XATMI service calls, any client process connects to shared memory (LMDB) and checks for the result there. If there is a saved response, it returns the data directly from shared memory. If the data is not there, the client process takes the longest path and performs the IPC. In the case of REST access, network clients still perform the HTTP invocation, but the HTTP server, acting as an XATMI client, returns the data from shared memory. In real life, this technique greatly boosted a web frontend application that used the middleware via REST calls.

Memcached/other key-value engine isolation

I have a bunch of web servers (frontends) behind a load balancer. Each Apache process runs with its own user for every virtual host. The code that Apache runs is PHP, and it is not trusted code.
I need shared (between web servers) session storage, and I need to limit each user (vhost) to accessing only its own session storage. I want to prevent one tenant from being able to purge or corrupt data stored in memcached.
So I'm basically looking for a solution to authenticate users and create private buckets.
I know the MySQL way is always available, but I want to avoid the performance penalty introduced by the SQL layer.
Do you have any solutions in mind so far?
I found a product called Couchbase which fully complies with my requirements. It has buckets along with a memcached caching layer and access protocol. It has SASL authentication, and load balancing and fault tolerance as a bonus.
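For illustration only, per-tenant access over a SASL-protected, memcached-compatible endpoint (as Couchbase buckets have historically exposed) might look roughly like this in Python. This assumes the python-binary-memcached package (bmemcached); the host, bucket name, and password are placeholders.

    # Sketch: per-tenant access to a SASL-protected memcached-compatible bucket.
    # Assumes the python-binary-memcached package (bmemcached); endpoint and
    # credentials are placeholders for whatever the cluster exposes per vhost.
    import bmemcached

    session_store = bmemcached.Client(
        ("cache.internal:11211",),
        username="vhost42-bucket",  # one bucket/credential pair per tenant
        password="s3cret",
    )

    # Each vhost can only see keys in its own bucket, so another tenant
    # cannot flush or overwrite this data.
    session_store.set("sess:abc123", '{"user_id": 7}', time=1800)
    print(session_store.get("sess:abc123"))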

Is an HTTP REST request the only way to access Azure Storage?

I've started reading about Azure Storage and it seems that the only way to access it is via an HTTP REST request.
I've seen that there are a few wrappers around these requests, for example, StorageClient (by Microsoft) and cloud storage api (http://cloudstorageapi.codeplex.com/), but they all still use REST in the background (to the best of my understanding).
It seems unreasonable to me that this is actually true. If I have a machine in Azure, and I want to access data stored in Azure Storage, it would seem very inefficient to go through an HTTP request for every read or write.
Yes, all storage calls are normalized to the REST API. It's actually very efficient when you consider the problem. You are thinking of a machine in Azure and data in Azure as stored on two servers sitting in a rack. Remember that in Azure your data, your "servers", etc. may be stored in different racks, different zones, and even different datacenters. With the REST API, your apps don't have to care about any of this; they just get the data with the URL.
So while a tiny HTTP overhead may appear inefficient if these were two boxes next to each other, it's actually a very elegant solution when they are on different continents. Factor in concepts such as CDN, and it becomes an even better fit.
Layered onto this base concept are the Azure load balancer and other pieces of the internal infrastructure, which can further optimize every request because they are all the same (HTTP). I also wouldn't be surprised (not sure at all, I don't work for MSFT) if the LB were doing traffic-management optimizations when a request is made intra-datacenter.
Throughput on the storage subsystem in Windows Azure is pretty high. I'd be very surprised if the system cannot deliver to your needs.
There are also many design patterns to increase the scalability of your app, like async processing, batching requests, delayed processing, etc.
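To make "they just get the data with the URL" concrete, a blob read over plain REST is essentially a single GET. In the sketch below, the account, container, blob name, and SAS token are placeholders; the SDK wrappers mentioned in the question ultimately issue this same kind of request under the hood.

    # Sketch: reading a blob from Azure Storage with nothing but an HTTP GET.
    # The account, container, blob name, and SAS token are placeholders.
    import requests

    blob_url = (
        "https://myaccount.blob.core.windows.net/mycontainer/report.csv"
        "?sv=...&sig=..."  # SAS token granting read access (placeholder)
    )

    resp = requests.get(blob_url)
    resp.raise_for_status()
    data = resp.content  # raw bytes of the blob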