What are the advantages and disadvantages of site mirroring [closed] - webserver

Question 1:
When sites are mirrored, the content of their respective servers is synchronized (possibly automatically (live mirrors) or manually). Is this true? Are all servers 'equal', or does a main server exist which then sends its changes to the other 'child servers'? Do all changes have to happen on the main server, with the child servers not allowed to make changes?
Question 2:
Expected advantages:
Global advantage: when a site that is originally hosted in the US is mirrored to a server in London, Europeans will benefit from this. They will get better response times, and because the pool of downloaders is split across the American and European servers, their download speeds can be higher.
Security: When one server crashes or is hacked, the other server can continue to operate normally.
Expected disadvantages:
If live mirroring is not used, some users will have to wait for updated content.
More servers equals higher upkeep costs.
What other items can be added to these lists?

When sites are mirrored, the content of their respective servers is synchronized. Is this true?
Yes, mirror sites should always be synchronized with their masters, even if for several reasons (e.g. update propagation times, network failures, etc.) they may not be.
There are several ways to achieve this; for example, a simple method could be using an rsync command in a cron job, while a better solution is the "push mirroring" technique used by the Debian and Ubuntu Linux distributions.
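As a rough illustration (the host name and paths here are placeholders, not taken from any real setup), a crontab entry on the master that pushes its document root to a mirror once an hour could look like this:

# m h dom mon dow   command
0 * * * *   rsync -az --delete /var/www/ mirror.example.com:/var/www/

The --delete flag removes files on the mirror that no longer exist on the master, so the mirror stays an exact replica instead of accumulating stale content.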
Are all servers 'equal', or does a main server exist, which then sends its changes to other 'child servers'?
No, not all servers are equal; generally the content provider updates one or more master servers which, in turn, provide the updated content to the other mirrors.
For example, in the Fedora infrastructure there are master servers, tier-1 servers (fastest mirrors) and tier-2 servers.
So all changes have to happen on the main server, and child servers are not allowed changes?
Yes, in a mirrored context the content must be updated only on the master servers (one or more).
Expected advantages
Maybe the most comprehensive list of reasons for mirroring can be found on Wikipedia:
To preserve a website or page, especially when it is closed or is about to be closed.
To allow faster downloads for users at a specific geographical location.
To counteract censorship and promote freedom of information.
To provide access to otherwise unavailable information.
To preserve historic content.
To balance load.
To counterbalance a sudden, temporary increase in traffic.
To increase a site's ranking in a search engine.
To serve as a method of circumventing firewalls.
Expected disadvantages
Cost: you have to buy additional servers and spend time to operate them.
Inconsistency: one or more mirrors may fall out of sync with the master (and this can happen not only with manual sync, but also with live sync).
As a further reference, since mirroring is a simple form of a Web Distributed System, you could also be interested in this reading.

Also, for files that are popular for downloading, a mirror helps reduce network traffic, ensures better availability of the web site or files, and lets the site or downloaded files arrive more quickly for users close to the mirror site. Mirroring is the practice of creating and maintaining mirror sites.
A mirror site is an exact replica of the original site and is usually updated frequently to ensure that it reflects the content of the original site. Mirror sites are used to make access faster when the original site may be geographically distant (for example, a much-used Web site in Germany may arrange to have a mirror site in the United States). In some cases, the original site (for example, on a small university server) may not have a high-speed connection to the Internet and may arrange for a mirror site at a larger site with a higher-speed connection and perhaps closer proximity to a large audience.
In addition to mirroring Web sites, you can also mirror files that can be downloaded from a File Transfer Protocol (FTP) server. Netscape, Microsoft, Sun Microsystems, and other companies have mirror sites from which you can download their browser software.
Mirroring could be considered a static form of content delivery.

Related

My website is almost finished, how to proceed

I have a social networking site which is almost ready. On the site people would upload images and put information about themselves for their profile, and would also post messages (which can include images). I am wondering exactly how to proceed (hosting, servers etc.); I am a relative beginner at all this stuff so I am not sure exactly what route to take.
I am thinking of maybe hosting from home initially from my personal computer and maybe expanding by acquiring servers to stack (which I am not exactly sure how to do, honestly) if we grow. Since the site is aimed at a small proportion of the population, I am not expecting huge growth in traffic, but I want to be prepared for spikes, albeit small ones. I was wondering if maybe it is possible to just host it off my computer and store the database (MySQL) on a removable disk along with the images. I was also thinking about cloud hosting, which seems to be the most common, but I was wondering if that really is the best thing to do, given this is a social networking site.
I know this question is very vague and broad, but since I am a beginner I really have no clue how to proceed. What is the best thing to do? Thank you so much!
Hosting from personal computers is a bad idea for a few reasons: your internet bandwidth limits the speed of the website, and you need to maintain 24/7 internet connectivity, power and all the other resources yourself.
I suggest you start with AWS. Get a free AWS account, which comes with a basic-level machine free for 12 months; more details here (https://aws.amazon.com/activate/).
Deploy a machine (instance) in EC2.
Install the web server and MySQL tools on the machine (a rough command sketch is shown below).
Host your files on this machine.
Point your domain at this machine's public IP with your domain service provider (where you bought your domain, for example GoDaddy).
Deploying a machine and configuring the server takes a while, but it's worth it, and the best part is that it's FREE for 12 months, so you need not worry about pricing, connectivity and bandwidth.
Also, when the traffic grows, you can upgrade your server with a few clicks and no config changes.
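To make the middle steps a little more concrete, here is a minimal sketch of the commands you might run on the instance itself, assuming an Ubuntu image and Apache (the package names and paths are assumptions, not part of the original answer):

sudo apt-get update
sudo apt-get install -y apache2 mysql-server   # web server and MySQL
sudo cp -r ~/mysite/* /var/www/html/           # put your site files in the web root

After that, create an A record with your domain provider pointing at the instance's public IP.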

Server Architecture for hosting Java PLAY application in the cloud

This is rather a set of questions than one very specific question. In the last couple of weeks/days I puzzled together information regarding how to properly host a Java PLAY application "in the cloud"; since a lot of this information is scattered over different services, I felt like gathering up all these small pieces in one place, because lots of things are important to be seen in full context. However, I moved my considerations to the bottom of the question, as they are mainly my opinions and subjective findings, which I don't want to be held responsible for. If I got something wrong, please don't hesitate to point that out.
Hosting Java PLAY + MySQL on AWS for world wide accessibility
Our Scenario: we have a quite straightforward application written with the Java PLAY framework (https://www.playframework.com/), working on iOS and Android as well as with a backend system (for administration, content management and API), storing data in a MySQL DB. While most of the users' interactions with the server are quick and easy (login, sync some data), there are also some more data-intensive tasks (download some <100mb data zips to the mobile phone, upload a couple of MB to the server). Therefore we were looking for a solution to properly provide users far away from our servers with reasonable response times. The obvious next step was hosting in the cloud.
Hosting setup within AWS:
Horizontal scaling: for the start, only 1 EC2 instance with our app will be running in eu-1a. We will need to evaluate how many resources one instance actually requires, whether more instances are needed, and whether more instances would actually lead to quicker response times.
Horizontal scaling across regions: once the app generates heavy user load from another region, the whole EC2 instance should be duplicated and put to another region, running a db read replica (see Setting up a globally available web app on amazon web services and https://aws.amazon.com/de/blogs/aws/cross-region-read-replicas-for-amazon-rds-for-mysql/ ).
Vertical scaling of EC2 instances: in recent tests of the old hosting setup, the database proved to be the bottleneck rather than the play app and its server's hardware specifications. Therefore it is not yet fully clear how much vertical scaling would affect response times. If a t2.micro instance serves as well as an m3.xlarge instance, of course we would rather climb our way up from the bottom here.
Vertical scaling of RDS: we will need to estimate how much traffic hits the DB server and what CPU/RAM/etc. will be required. Probably we will work our way up here as well.
Global Redirection: done using Amazon Route 53 (?). A user from Tokyo should be redirected to the EC2 instance running in Asia; a user from Rome to the EC2 instance in Europe. This affects not only API calls within the app, but also content delivery (in both directions).
Open Questions regarding the setup
Is this setup conclusive? Am I missing crucial components?
Regarding global redirection: is Amazon Route 53 the right tool? How does it differ from CloudFront (which strikes me as being purely for content/media distribution)?
How do I define correct data/API endpoints for my app? Of course I don't want to define the database endpoint of a DB read replica during app deployment. Will this also happen during the Route 53 (question 2) setup? The same goes for API calls: of course the app should direct its calls to https://myurl.com/api and from there it should be redirected. Is this realistic?
I would highly appreciate all kinds of thoughts (!), also regarding the background info written below. If you can point me to further reading to solve my questions on my own, I am also very thankful - there is simply a huge load of information regarding this, but this makes it hard to narrow the answers down. I do have knowledge in hosting/servers, but I am pretty sure there are true experts out there waiting to slap me with knowledge. :)
Background-Information
Current Hosting Setup: a load balancer distributes the traffic on 2 root linux servers, both of them running the PLAY app, one of them also holding the MySQL installation.
The current hosting setup has 3 big flaws:
No vertical scalability: the hosting company would take money for each scaling step. Currently the servers are running idle, but if the app booms, we could run short on capacity quickly. Running idle is still paid as if permanently under full load. This is expensive!
No deployment support: currently, we connect through SSH, manually deploy the correct folders to the file system, recompile on the server, set privileges, apply database evolutions; do the same for the second server (with different db connection parameters). What could possibly go wrong. ;)
No worldwide availability: to set up another server in another region of the world would mean a huge effort. To have a synchronized replica of our DB can be done, but once again deploying would mean downtime, room for errors and therefore time and money.
Hosting Options for Java PLAY:
There are a lot of different blog posts about this. In short:
AWS: Amazon Web Services is one of the first places you start looking. Here you get everything that's possible, at a flexible price. You set yourself up an EC2 instance, a MySQL RDS and you're good to go - all of this in the free tier, so you can experiment, play around, test your stuff.
Microsoft Azure: similar to AWS regarding pricing and possibilities. However, I did not dive into setting up and deploying our application for test purposes.
Heroku: super easy deployment from within PLAY, scalable servers. However, at first glance it seems to lack the ability to supply remote regions with high-speed content.
Jelastic: even easier deployment from within PLAY / IntelliJ IDEA. You push your app image to Jelastic, and Jelastic distributes it further to their infrastructure providers.
RedHat OpenShift (https://www.openshift.com/): sounds promising, yet not as complete as AWS.
Lots of choices and possible setups/prices. Especially after finding out about deployment using boxfuse (https://cloudcaptain.sh/) I made my choice for AWS, as it offers absolutely all we need from 1 source. Boxfuse has low monthly costs but is perfectly integrated into AWS. Scaling is supported as well as the 3 common environments (dev/test/prod). Support is outstanding.
The setup looks good. I would however make one change: your large up- and downloads. As mobile speeds may not be ideal, having your app serve long-running requests is something you should avoid, as this will needlessly tie up server threads. Instead, consider having users upload and download straight from S3 using presigned URLs. You can then later add CloudFront to the mix when it makes financial sense to do so.
R53 will work just fine for picking the best server(s) for each end user.
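For reference, a latency-based record in Route 53 is just an ordinary record plus a SetIdentifier and a Region. A hedged sketch of the change batch you would pass to the aws route53 change-resource-record-sets command (the domain, region and IP here are made up):

{
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "eu-west-1",
      "Region": "eu-west-1",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "203.0.113.10" }]
    }
  }]
}

Create one such record per region and Route 53 will answer each DNS query with the entry that has the lowest measured latency for that user.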
For EC2 consider having an ELB + Auto-Scaling Group setup. Even just for a single instance you get the benefit of permanent health monitoring and auto-respawns. If you expect more load you can then auto-scale based on your expected bottleneck (cpu, network i/o). This will give you a more autonomous and robust setup than manually having to scale up and down based on your own monitoring analysis (even though the scaling part is very easy if you stick with immutable infrastructure & blue/green deployments like what Boxfuse offers).
Your focus on vertical server scaling might not serve you well on AWS. I would start thinking about horizontal scaling of app servers behind an Elastic Load Balancer, and possibly look into Elastic Beanstalk.
I'm not sure you can set up a read replica in another region via RDS; you might have to set that up via MySQL servers running on standard EC2 instances. And even if you can, that's going to be some expensive and high-latency data transfer.
If file uploads and downloads are all you are worried about, you just need to put CloudFront (Amazon's CDN service) in front of your application, and allow it to handle file uploads and downloads via its global edge servers. You could even do this without moving your entire application into AWS. I would recommend reading this blog post as a start.

How high frequency trading systems connect to the exchange

I'm trying to study high frequency trading systems. What's the mechanism that HFT systems use to connect to the exchange, and what's the procedure? (Does it have to go through a broker or is it direct access, and if it's direct access, what sort of connection information do I require?)
Thanks in advance for your answers.
Understand that there are two different "connections" in an HFT engine. The first is the connection to a market data source. The second is to a clearing resource. As mentioned in kpavlov's answer, a very expensive COLO (co-location) is needed to get as close to the data source/target as possible. Depending on their nominal latency these COLO resources cost thousands of dollars per month.
With both connections, your trading engine must be certified by the provider (ICE, CME, etc) to comply with their requirements. With CME the certification process is automated, with ICE it employs human review. In any case, the certification requires that your software demonstrate conformance to standards and freedom from undesirable network side effects.
You must also subscribe to your data source(s) and clearing service; neither is inexpensive, and pricing varies over a pretty wide range. During the subscription process you'll gain access to the service provider's technical data specification(s) -- a critical part of designing your trading engine. Using old data that you find on the Internet for design purposes is a recipe for problems later. Subscription also gets you access to the providers' test sites. It is on these test sites that you test and debug your engine.
After you think your engine is ready for deployment you begin connecting to the data/clearing production servers. This connection will get you into a place of shadows -- port roulette. Not every port at the provider's network edge has the same latency. Here you'll learn that you can have the shortest latency yet seldom have orders filled first. Traditional load balancing does little to help this, and CME has begun deployment of FPGA-based systems to ensure correct temporal sequencing of inbound orders, but it's still early in its deployment process.
Once you're running you then get to learn that mistakes can be very expensive. If you place an order prior to a market pre-open event the order is automatically rejected. Do it too often and the clearing provider will charge you a very stiff penalty. Other things can also get you penalized or even kicked-off the service if your systems are determined to be implementing strategies to block others from access, etc.
All the major exchanges' web sites have links to public data and educational resources to help you decide if HFT is "for you" and how to go about it.
It usually requires approval from the exchange to grant access from outside. They protect their servers with firewalls, so your server/network needs to be authorized for access.
A special certification procedure with a technician (by phone) is usually required before they authorize you.
Most liquidity providers use the FIX protocol or custom APIs. You may consider starting to implement your connector with QuickFIX, but it may become a bottleneck later, when your traffic grows.
The information you need to connect over FIX is (a sample session configuration is sketched after this list):
Server IP
Server port
FIX protocol credentials:
SenderCompID
TargetCompID
Username
Password
Other fields
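To make that concrete, most of these fields end up in a QuickFIX session configuration. A sketch of what such a file might look like (all values below are placeholders; your venue's specification defines the real ones):

[DEFAULT]
ConnectionType=initiator
HeartBtInt=30
ReconnectInterval=5
FileStorePath=./store
FileLogPath=./log

[SESSION]
BeginString=FIX.4.4
SenderCompID=MYFIRM
TargetCompID=EXCHANGE
SocketConnectHost=203.0.113.20
SocketConnectPort=9876
StartTime=00:00:00
EndTime=23:59:59

Username and Password are usually not part of the session config itself; they are typically sent as tags 553 and 554 on the Logon message, exactly as the venue's FIX specification dictates.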

Horizontal scalability for distributed apps, how to achieve that?

I would like to disregard web applications here, because to scale them horizontally, i.e. to use multiple server instances together, it is "sufficient" to just duplicate the server software over the machines and use a sort of router that forwards requests to the "less busy" server machine.
But what if my server application allows users to engage with each other in real time?
If the response to the request of a certain client X depends on the context of a client Y whose connection is managed by another machine then "inter machines" communication is needed.
I'd like to know the kind of "design solutions" that people have used in such cases.
For example, the people at Facebook must have already encountered such a situation when enabling the chat feature of their social app.
Thank you in advance for any advice.
One solution to achieve that is to use distributed caches like memcached (Facebook also uses that approach).
Then all the information which is needed on all nodes is stored in that cache (and in a database if it needs to be permanent), and so all nodes can access that information (with a very small latency between the nodes).
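A minimal sketch of that idea in Perl with Cache::Memcached (the server names and keys are invented for illustration):

use Cache::Memcached;

# one client object, pointed at the shared cache nodes every app server can reach
my $cache = Cache::Memcached->new({
    servers => [ 'cache1.internal:11211', 'cache2.internal:11211' ],
});

# node A records that user 42 is online (entry expires after 60 seconds)
$cache->set( 'presence:42', 'online', 60 );

# node B, handling a chat request that involves user 42, reads the same entry
my $status = $cache->get('presence:42') // 'offline';

Anything that must survive a cache node restart still has to be written to the database as well; the cache only removes the need for the nodes to ask each other directly.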
You should consider some solutions that provide transparent horizontal database scalability and guarantee ACID semantics. There are many solutions that offer this at various levels. The people at Facebook, whom you reference, have solved the problem by accepting eventual consistency, but your question leads me to believe that you can't accept eventual consistency.

What's the best IPC mechanism for medium-sized data in Perl? [closed]

I'm working on designing a multi-tiered app in Perl and I'm wondering about the pros and cons of the various IPC mechanisms available to me. I'm looking at handling moderately-sized data, typically a few dozen kilobytes but up to a couple of megabytes, and the load is pretty light, at most a couple of hundred requests per minute.
My primary concerns are maintainability and performance (in that order). I don't think I'll need to scale up to more than one server, or port off of our main platform (RHEL), but I suppose it's something to consider.
I can think of the following options:
Temporary files - Simplistic, probably the worst option in terms of speed and storage requirements
UNIX domain sockets - Not portable, not scalable
Internet Sockets - Portable, scalable
Pipes - Portable, not scalable (?)
Considering that scalability and portability are not my primary concerns, I need to learn more. What's the best choice, and why? Please comment if you need additional information.
EDIT: I'll try to give more detail in response to ysth's questions (warning, wall of text follows):
Are readers/writers in a one-to-one relationship, or something more complicated?
What do you want to happen to the writer if the reader is no longer there or busy?
And vice versa?
What other information do you have about your desired usage?
At this point, I'm contemplating a three-tiered approach, but I'm not sure how many processes I'll have in each tier. I think I need to have more processes towards the left side and fewer toward the right, but maybe I should have the same number across the board:
.---------.        .----------.        .-------.
| Request | -----> | Business | -----> | Data  |
| Manager | <----- |  Logic   | <----- | Layer |
`---------'        `----------'        `-------'
These names are still generic and probably won't make it into the implementation in these forms.
The request manager is responsible for listening for requests from different interfaces, for example web requests and CLI (where response time is important) and e-mail (where response time is less important). It performs logging and manages the responses to the requests (which are rendered in a format appropriate to the type of request).
It sends data about the request to the business logic which performs logging, authorization depending on business rules, etc.
The business logic (if it needs to) then requests data from the data layer, which can either talk to (most often) the internal MySQL database or some other data source outside our team's control (e.g., our organization's primary LDAP servers, or our DB2 employee information database, etc.). This is mostly just a wrapper which formats the data in a uniform way so that it can be handled more easily in the business logic.
The information then flows back to the request manager for presentation.
If, when data is flowing to the right, the reader is busy, for the interactive requests I'd like to simply wait a suitable period of time, and return a timeout error if I don't get access in that amount of time (e.g. "Try again later"). For the non-interactive requests (e.g. e-mail), the polling system can simply exit and try again on the next invocation (which will probably be once per 1-3 minutes).
When data is flowing in the other direction, there shouldn't be any waiting situations. If one of the processes has died when trying to travel back to the left, all I can really do is log and exit.
Anyway, that was pretty verbose, and since I'm still in early design I probably still have some confused ideas in there. Some of what I've mentioned is probably tangential to the issue of which IPC system to use. I'm open to other suggestions on the design, but I was trying to keep the question limited in scope (for example, maybe I should consider collapsing down to two tiers, which is much simpler for IPC). What are your thoughts?
If you're unsure about your exact requirements at the moment, try to think of a simple interface that you can code to, that any IPC implementation (be it temporary files, TCP/IP or whatever) needs to support. You can then choose a particular IPC flavour (I would start with whatever's easiest and/or easiest to debug -- probably temporary files) and implement the interface using that. If that turns out to be too slow, implement the interface using e.g. TCP/IP. Actually implementing the interface does not involve much work as you will essentially just be forwarding calls to some existing library.
The point is that you have a high-level task to perform ("transmit data from program A to program B") which is more or less independent of the details of how it is performed. By establishing an interface and coding to it, you isolate the main program from changes in the event that you need to change the implementation.
Note that you don't need to use any heavyweight Perl language mechanisms to capitalise on the idea of having an interface. You could simply have e.g. 3 different packages (for temp files, TCP/IP, Unix domain sockets), each of which exports the same set of methods. Choosing which implementation you want to use in your main program amounts to choosing which module to use.
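A minimal sketch of that idea, with made-up package and method names (one possible shape of the interface, not a prescription):

package IPC::Transport::TempFile;
use strict;
use warnings;
use File::Temp qw(tempfile);

sub new { my ($class, %opt) = @_; return bless { dir => $opt{dir} // '/tmp' }, $class }

# write one message and return a token the reader can use to fetch it
sub send {
    my ($self, $data) = @_;
    my ($fh, $name) = tempfile( DIR => $self->{dir} );
    print {$fh} $data;
    close $fh;
    return $name;
}

# read a message back by token
sub receive {
    my ($self, $name) = @_;
    open my $fh, '<', $name or die "can't read $name: $!";
    local $/;                      # slurp the whole file
    my $data = <$fh>;
    close $fh;
    return $data;
}

1;

The main program only ever calls new/send/receive, so a later IPC::Transport::TCP (or UNIX-socket) package with the same three methods can be swapped in by changing a single use line.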
Temporary files (and related things, like a shared memory region) are probably a bad bet. If you ever want to run your server on one machine and your clients on another, you will need to rewrite your application. If you pick any of the other options, at least the semantics are essentially the same if you need to switch between them at a later date.
My only real advice, though, is to not write this yourself. On the server side, you should use POE (or Coro, etc.), rather than doing select on the socket yourself. Also, if your interface is going to be RPC-ish, use something like JSON-RPC-Common from CPAN.
Finally, there is IPC::PubSub, which might work for you.
Temporary files have other problems besides that. I think Internet sockets are really the best choice. They are well documented, and as you say, scalable and portable. Even if that is not a core requirement, you get it nearly for free. Sockets are pretty easy to deal with; again, there are copious amounts of documentation. You can build your data sharing mechanism and protocol out in a library and never have to look at it again!
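For example, a deliberately tiny TCP server in Perl using the core IO::Socket::INET module (the port and the one-line-per-request protocol are arbitrary choices for this sketch; the client side is just the mirror image with PeerAddr/PeerPort):

use strict;
use warnings;
use IO::Socket::INET;

my $server = IO::Socket::INET->new(
    LocalPort => 7000,       # arbitrary port for this sketch
    Listen    => 5,
    ReuseAddr => 1,
) or die "listen failed: $!";

while ( my $client = $server->accept ) {
    my $request = <$client>;             # one line per request in this toy protocol
    print {$client} "got: $request";     # reply, then drop the connection
    close $client;
}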
UNIX domain sockets are portable across unices. It's no less portable than pipes. It's also more efficient than IP sockets.
Anyway, you missed a few options, shared memory for example. Some would add databases to that list but I'd say that's a rather heavyweight solution.
Message queues would also be a possibility, though you'd have to change a kernel option for it to handle such large messages. Otherwise, they have an ideal interface for a lot of things, and IMHO they are greatly underused.
I generally agree, though, that using an existing solution is better than building something of your own. I don't know the specifics of your problem, but I'd suggest you check out the IPC section of CPAN.
There are so many different options because most of them are better for some particular case, but you haven't really given any information that would identify your case.
Are readers/writers in a one-to-one relationship, or something more complicated?
What do you want to happen to the writer if the reader is no longer there or busy? And vice versa?
What other information do you have about your desired usage?
For "interactive" requests (holding the connection open while waiting for a response (asynchronously or not): HTTP + JSON. JSON::XS is insanely fast. Everyone and everything can speak HTTP and it's easy to load balance, debug, ...
For queued requests ("please do this, thanks!"): Beanstalkd and Beanstalk::Client. Serialize the requests in the beanstalk queue with JSON.
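And a sketch of the queued path with Beanstalk::Client, again using JSON as the payload format (the tube name and job fields are placeholders):

use JSON::XS qw(encode_json decode_json);
use Beanstalk::Client;

my $queue = Beanstalk::Client->new({ server => 'localhost', default_tube => 'email-requests' });

# producer: enqueue a request and move on
$queue->put({ ttr => 120, data => encode_json( { to => 'user@example.com', template => 'welcome' } ) });

# worker: block until a job arrives, process it, then delete it
my $job     = $queue->reserve;
my $request = decode_json( $job->data );
# ... handle the e-mail request here ...
$job->delete;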
Thrift might also be worth looking into depending on your application.