Distributed CDN but for a database? - mongodb

My website will have users from all around the globe, not just one location. I know that I can put my file assets on a globally distributed CDN, and those files will be served from the location closest to the user, which will lower the latency.
Is it possible to do the same thing for a (Mongo) database? Or does one still need to pick one location for the database and just put up with increased latency for users who are far away?

It is possible, but you may want to pay special attention to the headers your DB/service returns, especially the caching-related ones like Cache-Control: max-age=<int>; otherwise the CDN may behave just like a plain proxy with no caching capabilities.
In many cases the CDN obeys the origin headers, but if the values are too small they may be overridden by defaults that, depending on the plan/price, can be adjusted.
Some CDNs allow prefetching on a custom schedule, keeping the data always up to date. This can be useful in some cases and avoids flooding the DB with too many connections; it also pairs well with restricting access to the service/DB to the CDN and trusted sources only.
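As a rough illustration of the header part, here is a minimal sketch assuming a small Flask endpoint with pymongo sits between the CDN and MongoDB (the route, field names and max-age value are placeholders, not a specific recommendation):

```python
# Sketch: an API endpoint in front of MongoDB that sets a Cache-Control
# header so a CDN placed in front of it can cache the response.
# Flask/pymongo and all names here are assumptions, not a fixed recipe.
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["mydb"]

@app.route("/api/products/<slug>")
def get_product(slug):
    doc = db.products.find_one({"slug": slug}, {"_id": 0})
    if doc is None:
        return jsonify({"error": "not found"}), 404
    resp = jsonify(doc)
    # Allow the CDN (and browsers) to cache this response for 5 minutes.
    resp.headers["Cache-Control"] = "public, max-age=300"
    return resp
```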

Related

Where can I store data for distributing it online?

My company has an application that can be installed with the Qt Online Installer. The data is stored on our own server, but over time we found out that the internet connection is rather slow for users on the other side of the world. So the question is: "What services, designed for this purpose, can we use to store this data?" While investigating I came across something called a "Content Delivery Network", but I'm not sure whether it fits or not.
Unfortunately, I don't have much experience in this area, so maybe somebody who knows more could give me some advice. Thank you!
CloudFront on AWS. It depends on what your content is, but you can probably store it on S3 and then use CloudFront to cache it at edge locations across the globe.
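For instance, a minimal sketch of that setup with boto3, assuming the bucket already sits behind a CloudFront distribution (bucket name, object key and max-age are placeholders):

```python
# Sketch: upload an installer payload to S3 with a Cache-Control header so
# CloudFront can cache it at edge locations. Paths, bucket name and the
# one-day TTL are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "packages/installer-data.7z",       # local file (example path)
    "my-installer-bucket",              # S3 bucket behind CloudFront (placeholder)
    "qt/installer-data.7z",             # object key clients will request
    ExtraArgs={
        "CacheControl": "public, max-age=86400",   # cache at the edge for 1 day
        "ContentType": "application/x-7z-compressed",
    },
)
```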
Your research led you to the right topic because it sounds like you could benefit from a CDN. CDNs store cached versions of your website, download files, video, etc. on their servers, which typically form a distributed network across the globe known as 'Points of Presence' (PoPs). When a user requests a file from your website, assuming it is leveraging a CDN, the request actually goes to the closest PoP and retrieves the file from there. This improves performance because the user may be very far from your origin server, or your origin server may not have enough resources to answer every request by itself.
The amount of time a CDN caches objects from your site depends on configurable settings. You can inform the CDN on how to cache objects using HTTP cache headers. Here is an intro video from Akamai, the largest CDN, with some helpful explanation of HTTP caching headers.
https://www.youtube.com/watch?v=zAxSE1M4yKE
Cheers.

How do I reduce guessability of MongoDB ObjectIDs?

MongoDB ObjectIDs are guessable.
I'm running an application that has publicly available resources located at
http://application.com/resource/**ObjectID**
These resources need to be publicly accessible (not behind a login); however, I'm trying to reduce the chance of a hacker brute-forcing ObjectIDs and scraping them at will.
My idea is to include a randomly generated key with each MongoDB document, so that it can be matched up when the request is made. For example:
http://application.com/resource/**ObjectID**/**Key**
http://application.com/resource/**ObjectID**?key=**Key**
or even
http://**Key**.application.com/resource/**ObjectID**
If the key doesn't match the one stored in the document, then the server will return 404.
I realize this isn't true protection in the sense of guaranteed privacy, because if someone in the middle is sniffing URLs, they can access the resource. I'm just trying to prevent someone from brute-forcing ObjectIDs.
Is this approach feasible and effective?
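For what it's worth, a minimal sketch of the scheme described above, assuming Flask and pymongo (the route shape, field names and key length are illustrative only):

```python
# Sketch of the "ObjectID + random key" idea: every document carries an
# unguessable key, and any mismatch returns 404. Flask/pymongo and all
# names are assumptions; adapt to your own stack.
import secrets
from bson import ObjectId
from bson.errors import InvalidId
from flask import Flask, abort, jsonify
from pymongo import MongoClient

app = Flask(__name__)
resources = MongoClient()["appdb"]["resources"]

def create_resource(data):
    """Store a document together with a randomly generated access key."""
    data["access_key"] = secrets.token_urlsafe(16)   # ~128 bits of randomness
    return resources.insert_one(data).inserted_id

@app.route("/resource/<object_id>/<key>")
def get_resource(object_id, key):
    try:
        oid = ObjectId(object_id)
    except InvalidId:
        abort(404)                  # invalid ID format looks the same as "not found"
    doc = resources.find_one({"_id": oid})
    # compare_digest avoids leaking the key through timing differences.
    if doc is None or not secrets.compare_digest(doc["access_key"], key):
        abort(404)
    doc.pop("access_key", None)
    doc["_id"] = str(doc["_id"])
    return jsonify(doc)
```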
In the context of the original answer you referenced on ObjectID generation, guessable is qualified by "given enough time".
Rather than focusing on whether your IDs might be guessable by brute force, I would look at approaches to detect and mitigate any brute force attacks. This removes the aspect of giving an adversary enough time to try all possible combinations (an approach that could work regardless of your ID format).
For example, obvious signatures to detect brute force attacks might include:
a large number of 404 requests from a specific IP address (see the sketch further down)
successive requests for incremental ObjectIDs (which should be rare)
invalid ObjectIDs (if the adversary is unaware of the expected format)
There are many different strategies and countermeasures to consider, but a helpful starting point would be OWASP's info on Blocking Brute Force Attacks.
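To illustrate the first signature only, here is a minimal in-process sketch; the window and threshold are arbitrary, and in practice this kind of detection usually lives in a reverse proxy, WAF, or a tool like fail2ban rather than in application code:

```python
# Sketch: flag IPs that produce many 404s within a short window.
# Threshold and window size are arbitrary examples.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_404S_PER_WINDOW = 20

_recent_404s = defaultdict(deque)    # ip -> timestamps of recent 404 responses

def record_404(ip):
    """Record a 404 for this IP; return True if it looks like a brute-force scan."""
    now = time.monotonic()
    hits = _recent_404s[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_404S_PER_WINDOW
```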
These resources need to be publicly accessible (not behind a login); however, I'm trying to reduce the chance of a hacker brute-forcing ObjectIDs and scraping them at will.
If the resources are public, there may be an easier way for an adversary to find them: crawling public pages and fetching the linked resources. In this case you could still apply anti-crawling strategies, but these become trickier if you want to avoid affecting legitimate users. For example, a large number of valid requests from a specific IP might indicate a corporate or ISP proxy rather than someone trying to abuse your service. Smart crawlers can also mimic user patterns (delays between requests, randomness in requested URLs, etc.) in order to try to defeat any protections.
As a starting point see: Anti-crawling Techniques.
First of all, it is hard to see your goal here. The resources should be publicly accessible, yet at the same time you are worried that someone can guess their location. It would help if you included the reason why guessing ObjectIDs in your application would cause a problem.
I assume that you are building some kind of sharing service that lets users exchange data (like Dropbox-style file sharing) and that you are basically trying to protect the data behind the ObjectID.
In that case there is no problem with the approach you outlined, so I will just add another one: create your own IDs (long, randomly generated strings). One potential problem is that they can collide, but that will be extremely rare, so if you wrap the insert in a try/catch and retry on failure, you should be fine.
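A minimal sketch of that retry-on-collision idea with pymongo (the key length is arbitrary, and the DuplicateKeyError relies on _id being unique, which MongoDB enforces by default):

```python
# Sketch: use a long random string as the document _id and retry on the
# (astronomically unlikely) collision. pymongo usage is an assumption.
import secrets
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

collection = MongoClient()["appdb"]["resources"]

def insert_with_random_id(data):
    while True:
        doc = dict(data, _id=secrets.token_urlsafe(24))   # ~192 random bits
        try:
            collection.insert_one(doc)
            return doc["_id"]
        except DuplicateKeyError:
            continue    # collision: generate a new ID and try again
```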

CDN for a RESTful API?

I have a RESTful API whose resources update once a week. That is, I update each resource once a week and then allow clients to access it. It's an ever-changing calculator.
There are probably 10,000 resources which could be requested.
Is it possible to put something like this behind a CDN? Traditionally CDNs are for undeniably static content, i.e. images. I'm not sure where my situation sits on the dynamic <-> static spectrum.
Cheers
90% of the resources might not even get called, and if they are, they will get called a few times only. It won't be a mass of repetitive calls.
Right there in your comments, you just showed me that a CDN is not beneficial to you.
Usually a CDN works like this: on the first call, the object is downloaded from the origin server to the regional CDN node and then delivered to the client, meaning the first GET sees no improvement. Subsequent GETs to the same regional node do get the speed improvement. If you have little to no repeat traffic, you will not see any noticeable improvement.
As I said in the comments, for small files, clients are probably spending as much time on the DNS lookup as they are on the download. Look into a global DNS solution (like Anycast) to reduce connection times. This is easy to set up and requires little to no maintenance.
I think it's entirely reasonable to put it behind a CDN if you think your content will reach the appropriate level of scale. As long as the cache-control headers are set such that the latest content is loaded when the cached version may be stale, you'll be fine.
The main benefit of CDNs comes when resources are requested from a variety of different sources, and so siteY.com can use the same cached version of a resource as siteX.com. Do you anticipate your resources will be requested from various different sources?
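If the weekly update happens on a fixed schedule, one option is to derive max-age from the time remaining until the next update, so cached copies expire right as new data is published. A rough sketch (the Monday 06:00 UTC schedule is purely an example):

```python
# Sketch: compute a Cache-Control header whose max-age expires at the next
# weekly update. The schedule (Mondays at 06:00 UTC) is just an example.
from datetime import datetime, timedelta, timezone

UPDATE_WEEKDAY = 0    # Monday
UPDATE_HOUR = 6       # 06:00 UTC

def cache_control_header(now=None):
    now = now or datetime.now(timezone.utc)
    days_ahead = (UPDATE_WEEKDAY - now.weekday()) % 7
    next_update = (now + timedelta(days=days_ahead)).replace(
        hour=UPDATE_HOUR, minute=0, second=0, microsecond=0
    )
    if next_update <= now:
        next_update += timedelta(days=7)
    max_age = int((next_update - now).total_seconds())
    return "public, max-age=%d" % max_age
```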

Any optimizations in reducing the number of disk accesses for inode number lookup by web-servers?

Web servers typically have a document root denoting the filesystem sub-tree visible via the web. For example, if the document root is /home/foouser/public_html/, then the web server would map a request for http://www.foo.com/pics/foo.jpg to /home/foouser/public_html/pics/foo.jpg. This results in a series of disk requests to obtain the inode number of foo.jpg.
Do web servers do any optimizations to reduce the number of disk accesses, or is it the role of the server admin to set the document root as close to "/" as possible, to reduce the number of disk accesses in the filename-to-inode-number translation?
I know this isn't directly an answer to your question, but by setting up a caching strategy you can drastically reduce disk reads, especially if your static content is not hosted on your server.
Options:
Host static content on a CDN:
Pros: Off-load all load onto someone else's network. Cost?
Cons: Potentially less control. Cost?
Use Contendo/Akamai, which is also a CDN, but with some differences.
Pros: You host your content, but after the first read the CDN will handle caching based on the headers you send with your content (static or not).
Cons: Sometimes headers are really annoying to manage. Cache busting (breaking your own cache) can be annoying to handle when you want to replace old content.
Cache things locally. If you are making a DB request, for instance, you can cache the result. The next time your code runs, check your in-memory cache first (as opposed to making a DB request immediately). You could also cache entire pages, then at an application controller/route level check whether there is a cached version of the page/asset and serve that (see the sketch after this list).
Pros: Lots of control. You can cache almost anything.
Cons: A ton of work to set up caching on every little thing. You need a strategy for every part of your website.
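As a rough illustration of the third option, here is a tiny per-process TTL cache around a database lookup (in a multi-process or multi-server setup you would more likely reach for memcached or Redis):

```python
# Sketch: a minimal in-memory TTL cache wrapped around a DB lookup, so repeat
# requests within `ttl_seconds` skip the database entirely. Per-process only;
# memcached/Redis would be the usual choice across several servers.
import time

_cache = {}    # key -> (expires_at, value)

def cached(ttl_seconds, fetch, key):
    now = time.monotonic()
    entry = _cache.get(key)
    if entry and entry[0] > now:
        return entry[1]                      # cache hit: no DB access
    value = fetch(key)                       # cache miss: hit the database
    _cache[key] = (now + ttl_seconds, value)
    return value

# Usage (db_lookup is whatever function actually queries your database):
# user = cached(300, db_lookup, "user:42")
```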
My recommendation is to start out by moving your assets to Amazon S3 or Rackspace or something similar. Joyent has something for this as well. You could then enable CloudFront for S3, which turns on the CDN and caches things in various regions. This is a really cheap solution (depending on the amount of files you have).
You could also go the contendo route.
The caching on the application side route takes quite a bit of work and completely depends on your server/language/db/configuration.

What technical considerations must a system/network administrator worry about when a site gets onto social bookmarking/sharing sites?

The reason I ask is that Stack Overflow has been Slashdotted, and Redditted.
First, what kinds of effect does this have on the servers that power a website? Second, what can be done by system administrators to ensure that their sites remain up and running as best as possible?
Unfortunately, if you haven't planned for this before it happens, it's probably too late and your users will have a poor experience.
Scalability is your first immediate concern. You may start getting more hits per second than you were getting per month. Your first line of defense is good programming and design. Make sure you're not doing anything stupid like reloading data from a database multiple times per request instead of caching it. Before the spike happens, you need to do some fairly realistic load tests to see where the bottlenecks are.
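As a trivial example of that point, here is a sketch of memoising a lookup for the lifetime of a single request, so the same row is never fetched twice while building one page (the request-context dict is an assumption; use whatever per-request storage your framework provides):

```python
# Sketch: per-request memoisation of a database lookup. `request_ctx` stands
# in for whatever per-request storage your framework offers.
def get_user(request_ctx, db, user_id):
    cache = request_ctx.setdefault("_user_cache", {})
    if user_id not in cache:
        cache[user_id] = db.users.find_one({"_id": user_id})   # single DB hit
    return cache[user_id]

# Within one request, later calls with the same user_id reuse the cached row:
# user = get_user(request_ctx, db, current_user_id)
```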
For absurdly high traffic, consider the ability to switch some dynamic pages over to static pages.
Having a server architecture that can scale also helps. Shared hosts generally don't scale. A single dedicated machine generally doesn't scale. Using something like Amazon's EC2 to host can help, especially if you plan for a cluster of servers from the beginning (even if your cluster is a single computer).
Your next major concern is security. You're suddenly a much bigger target for the bad guys. Make sure you have a good security plan in place. This is something you should always have, but it becomes more important with high usage.
Firstly, ask whether you really want to spend weeks and thousands of dollars planning for something that might not even happen, and if it does happen, lasts about 5 hours.
Easiest solution is to have a good way to switch to a page simply allowing a signup. People will sign up and you can email them when the storm has passed.
More elaborate solutions rely on being able to scale quickly. That's firstly a software issue (can you connect to a DB on another server, can you do load balancing). Secondly, your hosting solution needs to support fast expansion. Amazon EC2 comes to mind, or maybe Slicehost. With both services you can easily start new instances ("Let's move the database to a different server") and expand your instances ("Let's upgrade the DB server to 4GB RAM").
If you keep all data in the DB (including sessions), you can easily have multiple front-end servers. For the database I'd usually try a single server with the highest resources available, but only because I haven't worked with DB replication and it used to be quite hard to do, at least with MySQL. Things might have improved.
The app designer needs to think about scaling up (larger machines with more cores and higher performance) and/or scaling out (distributing workload across multiple systems). The IT guy needs to work out how to best support that. The network is what you look at first, because obviously everything rides on top of it. Starting at the border, that usually means network load balancers and redundant routers being served by multiple providers. You can also look at geographic caching services and apps such as cachefly.
You want to reduce your bottlenecks as much as possible. You also want to design the environment such that it can be scaled out as needed without much work. Do the design work up front and it'll mean less headaches when you do get dugg.
Some ideas (of what I used in the past and current projects):
For boosting performance (if needed) you can put a reverse-proxying, caching Squid in front of your server. Of course, that only works if you don't have session keys and if the pages are somewhat static (meaning they change only once an hour or so) and not personalised.
With Squid you can speed up a bloated and slow CMS like TYPO3, giving you the performance of a static website with the comfort of a CMS.
You can outsource large files to external services like Amazon S3, saving your server's bandwidth.
And if you are able to spend some (three figures per month) bucks, you can also use a Content Delivery Network. With that in place you automatically get scaling, high availability, and low latency for your users. Of course, your pages must be cacheable, so session keys and personalised pages are a no-no. If designed carefully and with CDNs in mind, you can at least cache SOME content, like pics and videos and static stuff.
The load goes up, as other answers have mentioned.
You'll also get an influx of new users/blog comments/votes from bored folks who are only really interested in vandalism. This is mostly a problem for blogs which allow completely anonymous commenting, where some dreadful stuff will be entered. The blog platform might have spam filters sufficient to block it, but manual intervention is frequently required to clean up remaining drivel.
Even a little barrier to entry, like requiring a user name or email address even if no verification is done, will dramatically reduce the volume of the vandalism.