TECHNICAL QUESTION --> When using a CDN (e.g. Cloudflare), does the CDN actually connect to the origin shared server when a visitor loads the website? Because if it doesn't connect, then presumably we don't need to focus that much on the origin server's speed. Or does a CDN still depend on the origin server's speed?
OPTIONS:
1) Use CDN + cheap shared server (e.g. HostGator)
2) Upgrade to VPS or even dedicated server, but do NOT use a CDN
3) Use CDN + upgrade to VPS / dedicated server
Which option is recommended when you are on a budget? My understanding is that there is no need to upgrade the shared server to a VPS, and that the CDN is enough of an upgrade on its own. Or does a CDN still depend on the origin server's speed?
Thanks for any technical insights!
CDNs connect to the origin server to grab static resources the first time they're requested. That's images, scripts, stylesheets, fonts, video, etc. Those resources are then stored on the edge servers, and future requests for them don't go to your server.
However, HTML is NOT cached on the CDN by default. Every page load hits your origin server through the CDN. You can set up CDN page rules to cache HTML pages too, but that's of no use on dynamic sites.
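To make the caching behaviour concrete, here is a minimal sketch of the origin side, assuming a Flask origin with illustrative routes and TTLs: the Cache-Control headers are what the edge consults when deciding how long it may keep a response (s-maxage governs shared caches like the CDN edge, max-age governs browsers). On CloudFlare specifically, HTML typically also needs a "Cache Everything" page rule before the edge will cache it at all.

```python
# Minimal sketch of an origin (Flask assumed; routes and TTLs are illustrative).
# The CDN edge reads Cache-Control to decide what it may cache and for how long.
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/")
def home():
    resp = make_response("<html><body>Hello</body></html>")
    # s-maxage applies to shared caches (the CDN edge), max-age to browsers.
    # Only safe if the page renders identically for every visitor.
    resp.headers["Cache-Control"] = "public, max-age=60, s-maxage=300"
    return resp

@app.route("/styles.css")
def stylesheet():
    resp = make_response("body { margin: 0; }")
    resp.headers["Content-Type"] = "text/css"
    # Long-lived static asset: let the edge hold it for a day.
    resp.headers["Cache-Control"] = "public, max-age=86400"
    return resp

if __name__ == "__main__":
    app.run()
```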
If you have dynamic pages, option (2) is best if you're on a budget, especially if you're using WordPress (which is slow). If money is no concern, option (3) is best.
Related
I want to host web content that could go viral, but I'm cost sensitive. I want the protection of a CDN, but I don't want to pay for it unless it's needed.
I think CDN usage typically routes all requests through the CDN, but this isn't my area of expertise. I'd prefer an architecture where the origin server handles most requests and the CDN takes over under load. Are there any CDNs that support something like this natively?
I'd happily have the origin server issue an HTTP redirect to the CDN when under load. This is such a simple solution that I feel like it must be wrong. Is this a terrible idea?
I wouldn't recommend redirecting to a CDN. That has a couple of problems:
Redirects cause URL changes. Changing the URL temporarily is bad for usability and SEO.
Your server would still get hit for each request in order to issue the redirect. Issuing redirects is less intensive than serving content, but if something goes viral it could still bring down your server.
It might be possible to adjust your DNS records to point to a CDN only when your server comes under load. To make that work you would have to set up the CDN ahead of time and simply not use it. CDNs typically want to become your DNS host, so you would change your NS records to use the CDN's DNS servers and then make configuration changes in the CDN when your server comes under load. Some CDNs even have programmatic APIs to support cases like this. Switching over to a CDN via DNS would take at minimum half an hour, and you would have to set your DNS TTL as low as possible (30 minutes) ahead of time.
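As a rough illustration of that programmatic approach, here is a hedged sketch that watches the local load average and repoints a DNS record through a provider API. The API URL, token variable, and hostnames are hypothetical placeholders, not a real provider's endpoints, and the record's TTL still limits how fast the switch takes effect.

```python
# Hedged sketch: repoint DNS at the CDN only while the origin is under load.
# UPDATE_URL, DNS_API_TOKEN, and the hostnames below are hypothetical
# placeholders -- substitute your DNS host's or CDN's real API.
import json
import os
import time
import urllib.request

UPDATE_URL = "https://dns.example-provider.invalid/v1/records/www"  # hypothetical endpoint
API_TOKEN = os.environ["DNS_API_TOKEN"]
LOAD_THRESHOLD = 4.0                        # 1-minute load average treated as "under load"
CDN_TARGET = "yoursite.cdn.example.net"     # CNAME target supplied by the CDN (placeholder)
ORIGIN_TARGET = "origin.example.net"        # your own server (placeholder)

def point_record_at(target: str) -> None:
    """PUT a CNAME change to the (hypothetical) DNS provider API."""
    body = json.dumps({"type": "CNAME", "content": target, "ttl": 1800}).encode()
    req = urllib.request.Request(
        UPDATE_URL, data=body, method="PUT",
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

current = None
while True:
    desired = CDN_TARGET if os.getloadavg()[0] > LOAD_THRESHOLD else ORIGIN_TARGET
    if desired != current:                  # only call the API when the answer changes
        point_record_at(desired)
        current = desired
    time.sleep(300)                         # re-check every 5 minutes; the DNS TTL adds further delay
```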
You should also consider that many CDNs have a free tier of service.
CloudFlare is famous for having free CDN services that work really well. See their pricing page. They have said that they plan to keep their free tier indefinitely because developers use it for their personal sites and gain familiarity with their services. Those users are then more likely to recommend CloudFlare for their employer's enterprise sites that don't fit under the free tier.
If you are running on AWS, Amazon's CloudFront CDN has a free tier and then makes you pay only when you exceed that usage.
I'm developing an app that will download code from GitHub once in a while. The queries (per IP) will stay way under the limits listed in the GitHub documentation (both per minute and per hour). The queries are plain cURL requests to GitHub's raw code pages over HTTP, without going through the API.
Let's suppose 50k users query a GitHub resource on the same day: are those users risking some kind of ban?
The rate limits for the raw and archive endpoints (which are served by the same service) are currently the same as for authenticated API endpoints: 5,000 per hour. Due to the way the rate limiting works, the actual number of requests you can make is sometimes higher, but you should not rely on that.
In general, those endpoints are not designed to be a CDN or code-distribution network for your app. They're designed to give individual users easy access to the raw contents of a few files without having to clone the entire repository. If you do this anyway and end up using excessive resources for your repository, GitHub Support will reach out and ask you to stop, and your repository may be suspended if you don't stop the excessive resource use promptly.
If you are going to make any sort of automated requests to those endpoints, it's strongly recommended that you send a unique User-Agent header so your requests can be identified, and preferably that the header contain some sort of contact information (a URL or email address related to the project) so that you can be reached about problems.
Your app should also gracefully handle errors like 403 and 429 by backing off or stopping its requests. This is true of every HTTP client.
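A minimal sketch of that client behaviour, using only the standard library (the User-Agent string and example URL are placeholders): a descriptive User-Agent plus exponential backoff when a 403 or 429 comes back.

```python
# Hedged sketch: identify the client and back off on 403/429.
# The User-Agent contents and the example URL are placeholders.
import time
import urllib.error
import urllib.request

USER_AGENT = "my-app/1.0 (+https://example.com/my-app; admin@example.com)"  # placeholder

def fetch_raw(url: str, max_attempts: int = 5) -> bytes:
    delay = 1.0
    for _ in range(max_attempts):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in (403, 429):
                raise
            # Rate limited or blocked: honour Retry-After if present,
            # otherwise back off exponentially.
            retry_after = err.headers.get("Retry-After")
            try:
                wait = float(retry_after) if retry_after else delay
            except ValueError:
                wait = delay
            time.sleep(wait)
            delay *= 2
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")

# data = fetch_raw("https://raw.githubusercontent.com/OWNER/REPO/BRANCH/path/to/file")
```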
If you want to distribute code or assets for your app, you should do so using a CDN backed by a server you control rather than the GitHub raw endpoints. You should be sure to serve this data over HTTPS and implement a secure digital signature mechanism to prevent malicious code or assets from being distributed.
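For the digital-signature part, here is a hedged sketch of verifying a detached Ed25519 signature before trusting a downloaded asset, assuming the third-party `cryptography` package; the key bytes and file names are placeholders.

```python
# Hedged sketch: refuse downloaded assets whose detached Ed25519 signature
# doesn't verify. The public key bytes and file names are placeholders.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

PUBLIC_KEY_BYTES = bytes.fromhex("00" * 32)   # placeholder: ship your real key with the app

def asset_is_authentic(data: bytes, signature: bytes) -> bool:
    """True only if `signature` was produced over `data` by the matching private key."""
    key = Ed25519PublicKey.from_public_bytes(PUBLIC_KEY_BYTES)
    try:
        key.verify(signature, data)
        return True
    except InvalidSignature:
        return False

# with open("asset.bin", "rb") as f, open("asset.bin.sig", "rb") as s:
#     if not asset_is_authentic(f.read(), s.read()):
#         raise SystemExit("refusing to use an unsigned or tampered asset")
```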
I'm trying to figure out how to compare a CMS like Adobe Experience Manager (AEM) with a CDN service like AWS CloudFront. Am I comparing apples to oranges? Thanks in advance.
Yes, you are comparing apples to oranges... but there's probably a reasonable explanation for that -- they are often used together.
A Content Management System (CMS) is a high-level system for creating, modifying, managing, organizing, and publishing content, with WordPress (the software, not the service) being a common example.
Blog hosting web sites are examples of hosted CMS. WordPress (the company) is one example of a hosted (SaaS) CMS service.
A Content Delivery Network (CDN) is a low-level infrastructure provider that typically facilitates global, high-performance delivery of electronic content, using globally-distributed storage, infrastructure, and connectivity. Examples are Amazon CloudFront, Fastly, and CloudFlare.
CDNs typically do not authoritatively store or render the content; they only cache it, and the caching is globally distributed, with copies of the content held in the geographic areas where it is frequently accessed. CDNs often behave like HTTP reverse proxies, pulling content from the authoritative origin server (often a cluster of identical servers), which may itself be globally distributed, though in some cases the CDN provides enough optimization to allow the origin to sit in a single geographic location.
A CMS is often deployed "behind" a CDN -- the CMS server (or cluster) is the origin server. Whether to do this is typically an easy decision, even at small scale. Viewers connect to the CDN and make requests, which the CDN serves from its cache if possible and otherwise forwards to the origin. The generated response is returned to the original requester and, if possible, stored in the CDN's cache. This arrangement often allows the origin to be scaled smaller than it could be without a CDN, since the CDN's cache means less workload for the origin.
Note, though, that CDNs tend to go beyond the simple definition of optimizing global static content delivery, and indeed beyond any proper definition of "CDN."
CDNs are increasingly integrating serverless compute services, such as CloudFront's Lambda@Edge and CloudFlare Workers, which let you deploy serverless functions that can manipulate HTTP headers, make request-routing decisions, and even generate rendered responses. This is outside the traditional scope of a CDN and could conceivably be harnessed to embed an entire CMS into the CDN infrastructure, but it doesn't really blur the distinction between CMS (software) and CDN (infrastructure).
CloudFront also has the ability to detect simultaneous requests from multiple browsers in the same geographic area for exactly the same resource, using something called request collapsing. If a request for content that isn't in the edge cache is already in flight to the origin server and more requests for the same resource arrive, CloudFront will actually hold those pending requests waiting for the server to return the single response to the single request, and will clone that response to all the browsers that are waiting for it. Fastly supports this, too, and appears to provide more granularity of control than CloudFront, which implements the feature automatically.
Some CDNs can also pass-through requests/responses from web browser to origin server that are not properly "content" requests -- HTML form post requests, for example -- which provides multiple advantages, including simpler integration (all the site traffic can pass through a single domain, avoiding cross-origin complications), optimized transport and TCP stack, faster TLS negotiation (due to reduced round-trip time between the browser and the web server it connects to, which is at the CDN), and transforming HTTP/2 (browser-facing) to HTTP/1.1 (server-facing).
CDNs also intrinsically offer a layer of DDoS protection for the origin server, since traffic arrives at the front side of the CDN and only the back side of the CDN contacts your origin server. Requests must be valid, not servable from the cache, and not blocked by the mitigation systems in place at (and managed by) the CDN before your origin server will even see them.
But it's important to note that none of these features are properly part of the "CDN" definition; they are capabilities, among others, that these services bundle into a product marketed as and designed around CDN concepts... so I would suggest that it is often a good idea to use one of these CDN services even in places where actual CDN functionality isn't called for.
So far all the guides I've found for creating REST APIs are for displaying stuff from your own site, but can you display stuff from another site?
Typically you'd do this by:
Proxying calls: When a request comes into your server, make a request to the remote server and pass the response back to the user. You'll want to make sure you can make the requests quickly and cache results aggressively. You'll probably want to use a short timeout for the remote call and rate-limit API requests so your server can't be tied up making all these remote calls (see the sketch after this list).
Pre-fetching: Downloading a data dump periodically, or pre-fetching the data you need, so you can store it locally.
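A minimal sketch of the proxying approach, assuming Flask and a placeholder remote API: a short timeout on the remote call plus a small in-memory cache (rate limiting and cache eviction are left out for brevity).

```python
# Hedged sketch of a proxying endpoint (Flask assumed; REMOTE_BASE is a placeholder).
import time
import urllib.error
import urllib.request

from flask import Flask, Response, abort

app = Flask(__name__)
REMOTE_BASE = "https://api.example.com/v1"   # placeholder remote API
CACHE_TTL = 300                              # seconds to reuse a cached response
_cache: dict[str, tuple[float, bytes]] = {}  # resource -> (fetched_at, body)

@app.route("/proxy/<path:resource>")
def proxy(resource: str):
    now = time.time()
    hit = _cache.get(resource)
    if hit and now - hit[0] < CACHE_TTL:
        return Response(hit[1], mimetype="application/json")
    try:
        # Short timeout: fail fast rather than tying up your own workers.
        with urllib.request.urlopen(f"{REMOTE_BASE}/{resource}", timeout=2) as resp:
            body = resp.read()
    except (urllib.error.URLError, TimeoutError):
        abort(504)
    _cache[resource] = (now, body)
    return Response(body, mimetype="application/json")
```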
Keep in mind:
Are you allowed to use the API this way, according to its terms of use? If it's a website you're scraping, it may be okay for small hobby use, but not for a large commercial operation.
The remote source probably has its own rate limits in place. Can you realistically provide your service under those limits?
As mentioned, cache aggressively to avoid re-requesting the same data. Get to know the HTTP caching standards (Cache-Control, ETag, etc. headers) to minimise network activity (see the conditional-request sketch after this list).
If you are proxying, consider choosing a data center near the API's data center to reduce latency.
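To illustrate that caching point, here is a hedged, standard-library-only sketch of a conditional request with If-None-Match; the URL is a placeholder. A 304 response means the cached copy is still valid and no body is transferred.

```python
# Hedged sketch: re-fetch a resource only if its ETag has changed.
import urllib.error
import urllib.request

def fetch_if_changed(url: str, etag: str | None, cached_body: bytes | None):
    """Return (etag, body), reusing cached_body when the server says 304 Not Modified."""
    headers = {"If-None-Match": etag} if etag else {}
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.headers.get("ETag"), resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:           # Not Modified: keep what we already have
            return etag, cached_body
        raise

# etag, body = fetch_if_changed("https://api.example.com/v1/items", None, None)
# etag, body = fetch_if_changed("https://api.example.com/v1/items", etag, body)
```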
I have an admin-type system for a website with multiple web servers, where users can configure pages and upload images to appear on the page (kind of similar to a CMS). Given that I already have a MongoDB instance with replica sets set up, what is the preferred way to store these uploads so that failover exists, and why?
1) A CDN, such as Amazon S3 / CloudFront.
2) Store the images in MongoDB. I do this now and don't use GridFS because our images are all under 1MB.
3) Use some type of NFS with some sort of failover setup. If #3, then how do you configure this failover?
I use #2 just fine right now and have used #3 without the failover before. If I use MongoDB as the data store for my website and for serving images, could these GET requests for the images ever impact the performance of getting non-image data out of the DB?
could these GET requests for the images ever impact the performance of getting non-image data out of the DB?
Well, more image requests = more HTTP connections to your web servers = more requests for images from MongoDB = more network traffic.
So yes, getting more image data from the DB could, in theory, impact getting non-image data. All it takes is 1,000 image requests per second at 1MB per image and you're pushing roughly 1GB/s (about 8Gbit/s) of traffic between your MongoDB servers and your web servers.
Note that this isn't a MongoDB limitation, this is a limitation of network throughput.
If you start getting lots of traffic, then a CDN is definitely recommended. If you already have an HTTP endpoint that outputs the image, putting a CDN in front of it should be pretty straightforward.
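If it helps, here is a hedged sketch of such an endpoint, assuming Flask and PyMongo with placeholder database, collection, and field names; the Cache-Control header is what lets a CDN in front of it absorb the repeat requests.

```python
# Hedged sketch: serve an image stored directly in a MongoDB document,
# with cache headers so a CDN in front can handle repeat requests.
# Database, collection, and field names are placeholders.
from bson import ObjectId
from flask import Flask, Response, abort
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["mysite"]   # placeholder URI and database

@app.route("/images/<image_id>")
def image(image_id: str):
    doc = db.images.find_one({"_id": ObjectId(image_id)})
    if doc is None:
        abort(404)
    return Response(
        doc["data"],                                      # binary field holding the image bytes
        mimetype=doc.get("content_type", "image/jpeg"),
        headers={"Cache-Control": "public, max-age=86400"},  # let the CDN cache it for a day
    )
```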
Why not a CDN in front of MongoDB?
Red Hat or CentOS clustering with a shared filesystem can provide a failover mechanism for NFS.