Here is the problem I am facing. We have a Postfix server that needs to parse emails forwarded from users' accounts and extract some data from them. Usually there are around 200 emails per user. We have tested this with 5 users and all was good, but what do we do if the number of users reaches some greater number, for example 10,000 or 100,000? Do you have any ideas how to make the Postfix solution scalable so it could support this heavy load?
Our current Postfix server is an Ubuntu 10.04 machine with 512 MB of RAM.
Best regards,
Mladjo
Postfix is a mailer. Not a data miner, arbitrary string parser or general purpose light bulb. When receiving 10000 letters, you - the mentally unstable postal worker - do not want to open the letters, read them, cut out some parts, close them and then deliver them.
You want to figure out if they're yours to deliver and put them in the right pile. For the other task, you call on your buddy Cron, who's dating Ms. Perl and has all the right features for the previously mentioned tasks.
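In concrete terms: let Postfix simply deliver the forwarded mail into a maildir, and let a cron job do the parsing afterwards. A minimal sketch of such a sweeper in Python (the maildir path and the extract() function are made-up placeholders; the same thing works just as well in Perl):

    import mailbox

    MAILDIR = "/var/mail/collector"    # assumption: Postfix delivers here

    def extract(text):
        """Placeholder for whatever data extraction the application needs."""

    box = mailbox.Maildir(MAILDIR)
    for key in box.keys():             # keys() returns a list, so removal is safe
        msg = box[key]
        # multipart bodies would need walking; kept simple for the sketch
        body = msg.get_payload(decode=True) or b""
        extract(body.decode("utf-8", "replace"))
        box.remove(key)                # processed: keep the pile small
    box.close()

Because the parsing happens outside the delivery path, Postfix itself only ever sorts mail into piles, and scaling to 100,000 users becomes a question of how fast one cron job (or several, sharded by user) can chew through a maildir.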
Related
For a newsletter mailing to about 50,000 users, using PEAR, is it better to sort the list by mail provider or to leave it in random order?
From my experience using Exim to send large amounts of email, performance will suffer heavily if your email queue grows too large. Depending on your hardware, once you have around 10,000 emails in the queue, you will start to see significant effects of bogosorting, where the server uses more CPU just juggling the queue than actually getting useful work done.
One way of avoiding large queues is, of course, to get the emails delivered as quickly and efficiently as possible. One of the many ways of achieving this is to get Exim to deliver multiple emails over the same TCP connection. This in turn can be achieved by sorting the recipients by domain, but that is not enough! By default, Exim will try to deliver each mail it receives immediately, and each delivery will open its own connection (this gives fast deliveries for very small volumes but will drive server load through the roof for larger volumes). You need to first spool the mails to Exim and then let a queue runner handle the actual delivery; the queue runner will automatically see all the other emails in the queue that should go to the same host and deliver them over the same connection.
Optimizing Exim for sending large amounts of email is a very complex subject that cannot be solved with just a few magic tricks. Crucial configuration options include (but are not limited to) queue_only, queue_run_max, deliver_queue_load_max, remote_max_parallel and split_spool_directory, but you also need a fast spool disk, enough RAM, and to make sure Exim starts new queue runners often enough (a command-line option when starting the Exim daemon).
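For illustration, here is roughly what those options could look like in the Exim main configuration; the values below are invented for the sketch, not recommendations:

    # Exim main configuration - illustrative values only
    queue_only                       # always spool first; queue runners deliver
    queue_run_max = 10               # at most 10 simultaneous queue runners
    deliver_queue_load_max = 8       # abandon queue runs above this load average
    remote_max_parallel = 5          # parallel remote deliveries of one message
    split_spool_directory            # spread the spool over subdirectories

The queue-runner interval is the command-line part, e.g. starting the daemon as exim -bd -q5m so that a new queue runner is launched every five minutes.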
How this relates to PEAR escapes me, but perhaps this gives you some ideas of how to approach your problem.
So we have command-line scripts (written in Python) that sit on customer machines and send us data as CSV every 24 hours. Now we are at the point where we want to be able to tell the clients to send us data at any time. Almost all of the customers are on MS Windows machines, and the requirement is that we can install very little software on the customer machines (most people cannot even log on to customer machines; only a few people can).
I'm not actually sure how to best solve this problem. Here are three possible approaches (but I am looking for something better):
1. We make a daemon in Python and install it on the customer machine. The daemon talks to our servers and we send back configuration information. In that configuration information we send back a "sleep duration". So the daemon sends us the data and then goes to sleep for the number of seconds defined in the "sleep duration" variable. Once the time is up, the daemon pings us and we again send back the configuration information. Rinse and repeat. (A sketch of this option follows the list.)
2. We install a script on the customer machine and it runs every hour. At our end, we've stored how often each customer should send us data (every 24 hours, every 12 hours, etc.), and when the script talks to us we determine how much time has passed and whether it is time for the script to send us data. If it's time, we tell the script to send us the data.
3. We install a very small server-side application (Django or Flask) and it runs on the customer machines. Whenever we want data, we send a request to the customer machine and our small server-side application serves it. For that we may have to ask our customers to reserve a port for us (I am not sure how many customers would actually allow this).
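Here is a minimal sketch of option 1; the endpoint URLs, the JSON shape of the configuration, and read_csv() are all invented for illustration:

    import time
    import requests   # third-party HTTP library; would ship with the daemon

    CONFIG_URL = "https://collect.example.com/config"   # hypothetical endpoint
    DATA_URL = "https://collect.example.com/data"       # hypothetical endpoint

    def read_csv():
        """Placeholder: collect the CSV produced on the customer machine."""
        return open("export.csv", "rb").read()

    while True:
        requests.post(DATA_URL, data=read_csv(), timeout=60)
        cfg = requests.get(CONFIG_URL, timeout=30).json()
        # The server controls the cadence, so "send us data more often"
        # becomes a matter of changing the value returned here.
        time.sleep(cfg.get("sleep_duration", 24 * 3600))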
I'm sure there are better ways. Can you kindly tell me which of the above methods is most suitable? Or please let me know if a better way exists.
I really appreciate all insights; thanks in advance for your help.
Option 3 may not work. Most people have their machines behind a firewall or a router which does NAT. In such a scenario, a server that is listening for requests to come in would typically not be accessible from the public internet.
If they have static IP addresses and the server is accessible from the public internet, then port scanners would detect it and potentially attempt to do undesirable things. You really do not want someone hacking into your customers' systems and wreaking havoc on them. Please avoid this option if possible.
However, it is safe to have a server on a customer system as long as it is the one logging into your server and sending data.
A better solution would be to have an app that continuously feeds data to your server as it is generated. It is relatively easy to do the equivalent of

tail -f csv_file | send_data_home

where send_data_home is a program running on your customer's system. This way there is minimal impact: the CSV file creation is not affected, and send_data_home logs into your server and sends data as it is generated.
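For the Windows machines mentioned above, where tail may not exist, a rough Python equivalent of that pipeline might look like this (send_home() is a placeholder for whatever actually ships the data):

    import time

    def send_home(line):
        """Placeholder: log in to the collection server and send one line."""

    def follow(path):
        # Poor man's 'tail -f': read new lines as the CSV file grows.
        with open(path, "r") as f:
            f.seek(0, 2)               # start at the current end of file
            while True:
                line = f.readline()
                if line:
                    send_home(line)
                else:
                    time.sleep(1.0)    # nothing new yet; poll again shortly

    follow("data.csv")                 # hypothetical CSV path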
We've got different processes that send mail when issues are encountered (e.g. not enough permissions to perform an operation on a certain order item). This works fine, except that sometimes identical messages are sent every 5 minutes. In our environment it is very difficult to synchronize the email sending at the application layer (there are actually several different applications sending out email, so we would have to touch every application if we were to implement this inside the application layer).
It would seem logical to me that filtering out mails (by duplicate subjects) is best done within the email layer, e.g. by the application receiving the SMTP requests.
Yet we'd also prefer not to go down to the SMTP layer ourselves, but rather to use an existing service or application.
Is anybody aware of a web mailer (like googlemail) which does this kind of filtering? It would be OK for us to pay for such a service: being "free as in beer" would be nice, but not being free is not a showstopper.
Thanks in advance
Holger
I find the idea of filtering duplicate e-mail messages by the Subject: header quite worrisome. If they are produced by multiple applications, how can you be certain that the content of the messages is duplicated and that you are not unwittingly dropping important notifications?
The only unique feature of a message that can be used to filter out duplicates is its Message-ID: header. If that header is the same for two messages, then it's usually reasonable to assume that they are copies of the same original message - e.g. one received directly and one that was CC'ed to a mailing list.
That said, you can do pretty much anything you want on most SMTP servers - at least those that are based on a Unix-like OS. For example, Postfix can use custom shell scripts for filtering.
You can, for example, use formail to extract the body of each message and produce its MD5 hash. Comparing the message body hashes along with the Date:, Subject:, From:, To: and Cc: headers is a good start for detecting real duplicates.
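As a hedged sketch of that idea in Python rather than formail (the spool directory and one-hour window are invented, and the re-injection via sendmail follows the simple pipe-filter pattern from Postfix's FILTER_README):

    #!/usr/bin/env python3
    # Sketch of a Postfix pipe filter that discards recent duplicates.
    import hashlib, os, subprocess, sys, time
    from email.parser import BytesParser

    SEEN_DIR = "/var/spool/dedup"   # assumption: exists, writable by the filter
    WINDOW = 3600                   # suppress duplicates seen within one hour

    raw = sys.stdin.buffer.read()
    msg = BytesParser().parsebytes(raw)

    h = hashlib.md5()
    for name in ("Date", "Subject", "From", "To", "Cc"):
        h.update((msg.get(name) or "").encode("utf-8", "replace"))
    h.update(msg.get_payload(decode=True) or b"")   # None for multipart bodies
    marker = os.path.join(SEEN_DIR, h.hexdigest())

    if os.path.exists(marker) and time.time() - os.path.getmtime(marker) < WINDOW:
        sys.exit(0)                 # duplicate within the window: swallow it

    open(marker, "w").close()       # remember it (old markers need a cleanup job)
    # Re-inject for normal delivery; Postfix passes recipients as arguments
    # (a production filter would also preserve the envelope sender with -f).
    subprocess.run(["/usr/sbin/sendmail", "-G", "-i"] + sys.argv[1:],
                   input=raw, check=True)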
I'm looking for ways to gather files from clients. These clients have our software, and we are currently using FTP to gather files from them. The files are collected from the client's database, encrypted and uploaded via FTP to our FTP server. The process is fraught with frustration and obstacles: the software is frequently blocked by common firewalls and often runs into difficulties with VPNs and NAT (switching to passive mode instead of active usually helps).
My question is: what other ideas do people have for getting files from clients programmatically and reliably? Most of the files being submitted are under 1 MB in size, but one of them ranges up to 25 MB.
I'd considered HTTP POST; however, I'm concerned that a 25 MB file would often fail over a POST (the web server timing out before the file could be completely uploaded).
Thoughts?
AndrewG
EDIT: We can use any common web technology. We're using a shared host, which may make central configuration changes difficult. I'm familiar with PHP from a common usage perspective, but not from a setup perspective (I've written lots of code, but not gotten into anything too heavy-duty). Ruby on Rails is also possible, but I would be starting from scratch. Ideally, I'm looking for a "web" way of doing it, as I'd like to eventually transition away from installed code.
Research scp and rsync.
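Both can resume interrupted transfers, which addresses the 25 MB worry. A hypothetical rsync invocation (host and paths are invented):

    rsync -avz --partial --timeout=60 export.csv.gpg upload@collect.example.com:/incoming/

--partial keeps a partially transferred file on the server, so a retried run can continue roughly where it left off instead of starting the whole upload over.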
One option is to have something running in the browser which breaks the upload into chunks, which would hopefully make it more reliable. A control which does this can also give some feedback to the user as the upload progresses, which you wouldn't get with a simple HTTP POST.
A quick Google found this free Java applet which does just that. There will be lots of other free and paid options that do the same thing.
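The same chunking idea can also live in a small installed client rather than the browser; a rough Python sketch (the endpoint URL and the header-based reassembly protocol are inventions for illustration):

    import os
    import requests   # third-party HTTP library

    URL = "https://example.com/upload"   # hypothetical endpoint
    CHUNK = 512 * 1024                   # 512 KB per request

    def upload_in_chunks(path):
        with open(path, "rb") as f:
            offset = 0
            while True:
                data = f.read(CHUNK)
                if not data:
                    break
                # One small POST per chunk: a failed chunk can be retried
                # without redoing the whole 25 MB transfer.
                resp = requests.post(URL, data=data, timeout=30, headers={
                    "X-File-Name": os.path.basename(path),
                    "X-Chunk-Offset": str(offset),
                })
                resp.raise_for_status()
                offset += len(data)

    upload_in_chunks("export.csv")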
You probably mean an HTTP PUT. That should work like a charm if you have a decent web server, but as far as I know it is not restartable.
FTP is the right choice (passive mode to get through the firewalls). Use an FTP server that supports restartable transfers if you often face VPN connection breakdowns (hotel networks are soooo crappy :-) ).
The FTP command that must be supported is REST.
From http://www.nsftools.com/tips/RawFTP.htm:
Syntax: REST position
Sets the point at which a file transfer should start; useful for resuming interrupted transfers. For nonstructured files, this is simply a decimal number. This command must immediately precede a data transfer command (RETR or STOR only); i.e. it must come after any PORT or PASV command.
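Python's standard ftplib exposes this via the rest= argument, which makes it issue REST right before the transfer command. A sketch of a resumable upload (host, credentials and filename are invented; the server must support REST with STOR):

    import ftplib

    ftp = ftplib.FTP("ftp.example.com")   # hypothetical server
    ftp.login("client", "secret")
    ftp.set_pasv(True)                    # passive mode for firewalls/NAT
    ftp.voidcmd("TYPE I")                 # binary mode so SIZE reports exact bytes

    remote = "export.csv.gpg"
    try:
        sent = ftp.size(remote)           # bytes that arrived before the breakdown
    except ftplib.all_errors:
        sent = 0                          # no partial file on the server yet

    with open(remote, "rb") as f:
        f.seek(sent)
        # rest= makes ftplib send "REST <sent>" immediately before STOR.
        ftp.storbinary("STOR " + remote, f, rest=sent if sent else None)
    ftp.quit()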
I am writing a POP3 mail client. I want to leave the messages on the server, but I don't want to have to redownload all messages every time I reconnect.
If I download all the messages today and reconnect tomorrow, does the protocol support the ability to download only the messages from the last 24 hours, or only those after a certain sequential ID? Or will I have to redownload all of the messages again?
I am aware of the Unique IDentification Listing (UIDL) feature, but according to http://www.faqs.org/rfcs/rfc1939.html it is only an optional command, not part of the required core of the specification. Do most mail servers support this feature?
Yes, my client supports IMAP too, but this question is specifically about POP servers.
Have you considered using IMAP?
I've done it.
You'll have to reread all the headers but you can decide which messages to download.
I don't recall anything in the header that will give you a foolproof timestamp, however. I don't believe your solution is possible without keeping a record of what you have already seen.
(In my case I didn't care--I was simply looking for messages with certain identifying features in the header--those messages were downloaded, processed and killed, everything else was untouched.)
I also wonder if you're misunderstanding the protocol. Just because you download a message doesn't mean it's removed from the server. It's only removed from the server if you give an explicit command to kill the message. (And when a message contains so many attachments that the system times out before you properly log off, and thus your kill command is discarded, you'll be driven up the wall!) (It was an oversight in the design. The original logic was to attach one file over 100k, or as many as possible whose total was under 100k. Another task barfed and generated thousands of files of around 100 bytes each. While it was a perfectly legitimate, albeit extreme, e-mail, nothing was able to kill it!)
Thus if I were writing a mail client I would simply download anything I didn't already have locally. If it's supposed to remain on the server, fine, just don't give the kill command.
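For completeness, here is roughly what the "keep a local record, fetch only what's new" approach looks like with Python's poplib and UIDL (host, credentials, and store_message() are placeholders; the server must support UIDL):

    import os
    import poplib

    SEEN_FILE = "seen_uids.txt"           # local record of fetched messages

    def store_message(raw):
        """Placeholder: write the message into the local mail store."""

    seen = set()
    if os.path.exists(SEEN_FILE):
        seen = set(open(SEEN_FILE).read().split())

    pop = poplib.POP3("pop.example.com")  # hypothetical server
    pop.user("me")
    pop.pass_("secret")

    # uidl() lines look like b"1 AABBCC...": message number, then unique ID
    for line in pop.uidl()[1]:
        num, uid = line.decode().split(maxsplit=1)
        if uid in seen:
            continue                      # fetched on an earlier run; skip it
        resp, raw_lines, octets = pop.retr(int(num))
        store_message(b"\r\n".join(raw_lines))
        seen.add(uid)                     # note: no DELE, mail stays on server

    pop.quit()
    open(SEEN_FILE, "w").write("\n".join(sorted(seen)))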
The way I have seen this handled in the past is on a client-by-client basis. For example, if I use Scribe to get e-mail on one machine without deleting, then move to another machine, all e-mails are downloaded again despite the fact that I've seen them before. Internally, I imagine the client has a table that stores whether or not an e-mail has been downloaded previously.
There's nothing in the protocol that I'm aware of that would allow for that.
Sort-of. You can download individual messages, but you can't store state on the remote server.
See the RETR command at http://www.faqs.org/rfcs/rfc1939.html.