Apple Push Notifications in Bulk - iphone

I have an app that involves sending Apple Push Notifications to ~1M users periodically. The setup for doing so has been built and tested for small numbers of notifications. Since there is no way I can test sending at that scale, I am interested in knowing whether there are any gotchas in sending bulk push notifications. I have scripts written in Python that open a single connection to the push server and send all notifications over that connection. Apple recommends keeping it open for as long as possible. But I have also seen that the connection terminates and you need to reestablish it.
All in all, it is disconcerting that successful sends are not acknowledged and only erroneous ones are flagged. From a programmer's standpoint, instead of simply checking one thing, "if (success)", you now need to watch for the numerous things that could go wrong.
My question is: What are the typical set of errors that you need to watch out for to make sure your messages don't silently disappear into oblivion? The connection closing is an easy one. Are there others?

I completely agree with you that this API is very frustrating; if they had sent a response for each notification it would have been much easier to implement.
That said, here's what Apple says you should do (from a Technical Note):
Push Notification Throughput and Error Checking
There are no caps or batch size limits for using APNs. The iOS 6.1
press release stated that APNs has sent over 4 trillion push
notifications since it was established. It was announced at WWDC 2012
that APNs is sending 7 billion notifications daily.
If you're seeing throughput lower than 9,000 notifications per second,
your server might benefit from improved error handling logic.
Here's how to check for errors when using the enhanced binary
interface. Keep writing until a write fails. If the stream is ready
for writing again, resend the notification and keep going. If the
stream isn't ready for writing, see if the stream is available for
reading.
If it is, read everything available from the stream. If you get zero
bytes back, the connection was closed because of an error such as an
invalid command byte or other parsing error. If you get six bytes
back, that's an error response that you can check for the response
code and the ID of the notification that caused the error. You'll need
to send every notification following that one again.
Once everything has been sent, do one last check for an error
response.
It can take a while for the dropped connection to make its way from
APNs back to your server just because of normal latency. It's possible
to send over 500 notifications before a write fails because of the
connection being dropped. Around 1,700 notification writes can fail
just because the pipe is full, so just retry in that case once the
stream is ready for writing again.
Now, here's where the tradeoffs get interesting. You can check for an
error response after every write, and you'll catch the error right
away. But this causes a huge increase in the time it takes to send a
batch of notifications.
Device tokens should almost all be valid if you've captured them
correctly and you're sending them to the correct environment. So it
makes sense to optimize assuming failures will be rare. You'll get way
better performance if you wait for write to fail or the batch to
complete before checking for an error response, even counting the time
to send the dropped notifications again.
None of this is really specific to APNs, it applies to most
socket-level programming.
If your development tool of choice supports multiple threads or
interprocess communication, you could have a thread or process waiting
for an error response all the time and let the main sending thread or
process know when it should give up and retry.
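To make that concrete, here is a minimal Python sketch of the loop the note describes, against the legacy enhanced binary interface. The connect() helper and the pre-serialized frames are assumptions here; real code would build each frame with the enhanced-format command byte and a unique notification identifier.

    import select
    import struct

    ERROR_FMT = '!BBI'                      # command, status, failed notification id
    ERROR_LEN = struct.calcsize(ERROR_FMT)  # 6 bytes

    def check_for_error(sock, timeout=0.0):
        """Return (status, failed_id) if APNs sent an error response,
        or None if nothing is readable or the peer closed silently."""
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            return None
        data = sock.recv(ERROR_LEN)
        if len(data) == ERROR_LEN:
            _command, status, failed_id = struct.unpack(ERROR_FMT, data)
            return status, failed_id
        return None  # zero bytes: connection closed because of a parsing error

    def send_batch(connect, frames):
        """frames: ordered list of (identifier, serialized_frame) tuples.
        connect: caller-supplied helper (hypothetical) that returns a fresh
        SSL socket to the APNs gateway. Blocking writes are used, so the
        'pipe full' case simply blocks instead of failing."""
        sock = connect()
        i = 0
        while i < len(frames):
            _identifier, frame = frames[i]
            try:
                sock.sendall(frame)
                i += 1
            except OSError:
                error = check_for_error(sock, timeout=1.0)
                sock.close()
                sock = connect()  # APNs drops the connection after an error
                if error is not None:
                    _status, failed_id = error
                    ids = [ident for ident, _ in frames]
                    i = ids.index(failed_id) + 1  # resend everything after it
                # otherwise just retry the frame whose write failed
        check_for_error(sock, timeout=2.0)  # one last check after the batch
        sock.close()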

Just wanted to chime in with a first-person perspective, as we send millions of APNS notifications every day.
The reference @Eran quotes is unfortunately about the best resource we have for how Apple manages APNS sockets. It's fine for low volume, but Apple's documentation overall is heavily skewed towards the casual, low-volume developer. You will see plenty of undocumented behavior once you get to scale.
The part of that document about doing error detection asynchronously is critical for high throughput. If you insist on blocking for errors on every send, then you'll need to heavily parallelize your workers to keep up throughput. The recommended way, however, is to just send as fast as you can, and whenever you do get an error: repair and replay.
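A sketch of that repair-and-replay listener in Python; the on_error callback wiring is hypothetical, and a real sender would re-arm the listener after every reconnect:

    import struct
    import threading

    ERROR_LEN = 6  # command (1) + status (1) + failed notification id (4)

    def start_error_listener(sock, on_error):
        """Dedicated reader thread: block on the socket and fire on_error
        the moment APNs reports a problem, while the main thread keeps
        writing notifications at full speed."""
        def listen():
            data = sock.recv(ERROR_LEN)  # blocks until APNs responds or closes
            if len(data) == ERROR_LEN:
                _cmd, status, failed_id = struct.unpack('!BBI', data)
                on_error(status, failed_id)  # repair, replay from failed_id + 1
            else:
                on_error(None, None)  # closed silently: replay unacked frames
        thread = threading.Thread(target=listen, daemon=True)
        thread.start()
        return thread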
The part of that post I take exception to is:
Device tokens should almost all be valid if you've captured them
correctly and you're sending them to the correct environment. So it
makes sense to optimize assuming failures will be rare.
To predicate that advice on such a huge "IF" seems hugely misleading. I can almost guarantee that most developers are not capturing tokens and processing Apple's feedback service 100% "correctly". Even if they were, the system is inherently lossy, so drift is going to happen.
We see a non-zero number of error #8 responses (invalid device token), which I attribute to rooted phones, client bugs, or users intentionally spoofing their tokens to us. We have also seen a number of error #7 responses (invalid payload size) in the past, which we tracked down to improperly encoded messages that a developer on our end had added. That was our fault, of course, but that's my point: saying "optimize assuming failures will be rare" is the wrong message to send to learning developers. What I would say instead would be:
Assume errors will happen.
Hope that they happen infrequently, but
code defensively in case they don't.
If you optimize assuming errors will be rare, you may be putting your infrastructure at risk whenever the APNS service goes down and every message you send returns an error #10.
The trouble comes when trying to figure out how to respond to errors properly. The documentation is ambiguous or absent on how to handle and recover from the different errors; apparently this is left as an exercise for the reader.
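For illustration, a defensive dispatch over the status codes mentioned above might look like this sketch; the action names are invented, and the point is simply that every code gets an explicit decision rather than an assumption of rarity:

    # Status codes from the APNs enhanced binary interface error response.
    INVALID_PAYLOAD_SIZE = 7
    INVALID_TOKEN = 8
    SHUTDOWN = 10

    def recovery_action(status):
        """Map an APNs error status to a recovery action; the caller is
        assumed to know how to carry each action out."""
        if status == INVALID_TOKEN:
            # Bad token (spoofed, stale, wrong environment): purge it,
            # don't retry the message.
            return 'drop_token'
        if status == INVALID_PAYLOAD_SIZE:
            # Our own bug (e.g. a mis-encoded payload): log loudly, don't retry.
            return 'log_and_drop'
        if status == SHUTDOWN:
            # APNs itself is going away: every send will fail, so back off
            # and replay the whole batch later instead of hammering the service.
            return 'backoff_and_replay'
        # Processing/unknown errors: retry a bounded number of times.
        return 'retry'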

Related

low connectivity protocols or technologies

I'm trying to improve the reliability of a server-app-website architecture that another programmer developed.
At the moment, Android smartphones start a TCP connection to a server component to exchange data. The server takes the data, writes it into a DB, and another user can look at the data through a website. The problem is that the smartphones are very regularly in locations where connectivity is really bad. The consequence is that the smartphones lose the TCP connection, and it's hard to reconnect. Now my question is whether there are any protocols that are so lightweight, or so accommodating of bad connectivity, that the data exchange could work better or more reliably.
For example, I was thinking about replacing the raw TCP interface with a RESTful API, but I don't really know how well REST works in this scenario, as I don't have any experience in this area.
Maybe useful to know for answering this question: the server component is programmed in C#; the connecting components are Android smartphones.
Please understand that I'm not adding any code to this question because, in my opinion, it's a purely theoretical question.
Thank you in advance!
REST runs over HTTP, which runs over TCP, so it would have the same connectivity issues.
Moving up the stack to the application, you could perhaps think in terms of 'interference'. I quite often have to use technical equipment in remote areas with limited reception, and it reminds me of trying to communicate in a storm. If you think about it, when you're trying to get someone to do something in a storm where they can hardly hear you and the words get blown away (dropped signal), you don't read them the manual on how to fix something; you shout key words such as 'handle', 'pull', 'pull', 'PULL', 'ok'. So the information reaches them in small bursts you can repeat (pull, what? pull, eh? PULL! oh righto!).
Can you redesign the communications between the android app and the server so the server can recognise key 'words' with corresponding data and build up the request over a period of time? If you consider idempotency, each burst of data would not alter the request if it has already been received (pull, PULL!) and over time the android app could send/receive smaller chunks of the request. If the signal stays up, just keep sending. If it goes down, note which parts of the request haven't been sent and retry them when the signal comes back.
So you're sending the request jigsaw-style but the server knows how to reassemble the pieces in the right order. A STOP word at the end tells the server ok this request is complete, go work on it. Until that word arrives the server can store the incomplete request or discard it if no more data comes in.
If the server responds to the first request chunk with an id, the app can use the id to fetch the response, and keep trying until the full response comes back, at which point the server can remove the response from its jigsaw cache. A fair amount of work, though.
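Here is a rough Python sketch of that jigsaw idea from the client side (the poster's server is C#, and this chunk/STOP wire format is entirely invented for illustration):

    import hashlib

    CHUNK_SIZE = 512  # small bursts survive a flaky link better than one big send

    def make_chunks(request_id, payload):
        """Split a request into small, individually resendable pieces.
        Each piece carries (request_id, index, data) so the server can
        reassemble them in order; resending a piece is idempotent because
        the server just overwrites slot `index`."""
        chunks = [
            (request_id, i, payload[off:off + CHUNK_SIZE])
            for i, off in enumerate(range(0, len(payload), CHUNK_SIZE))
        ]
        # STOP marker: index -1 carries a checksum, so the server knows
        # when the jigsaw is complete and uncorrupted.
        chunks.append((request_id, -1, hashlib.sha256(payload).hexdigest().encode()))
        return chunks

    def upload(send_chunk, request_id, payload):
        """send_chunk is a hypothetical transport call returning True on an
        acknowledged delivery; unacknowledged pieces are simply retried
        whenever the signal comes back."""
        pending = make_chunks(request_id, payload)
        while pending:
            pending = [c for c in pending if not send_chunk(*c)]
        # the server only acts on the request once the STOP chunk has arrived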

Trigger something only after a Resend Request (if one exists) is satisfied with QuickFIX/J

Our system generates some messages (unsolicited cancel for example) it needs to send to the other party after a disconnect/connection lost, as soon as the connection recovers.
The problem is that we trigger sending those in onLogon(), but if there's a Resend Request, that is too early; we have had problems (maybe just because of how it's implemented on the other end) when we had too many messages to send (hundreds).
I'm aware that a ResendRequest may not come at all, and that it's impossible to know that without simply waiting. But what would be the best approach, using QuickFIX/J, to send our messages as soon as possible, yet only after the sequence numbers are synchronized?
EDIT: I'm trying to solve this using FIX 4.2. FIX 4.4 actually introduced tag 789, NextExpectedMsgSeqNum (http://www.onixs.biz/fix-dictionary/4.4/tagNum_789.html), which would solve my problem (as long as the other party sends this optional tag too).
Thanks
My ten cents: it sounds like you're trying to treat two scenarios in one go, and that's difficult. Do one thing at a time. For example, if it's your network that causes you to disconnect, then before your client knows you've disconnected, your clients will send resend requests, right? Meanwhile, if a client disconnects but you don't, then when they reconnect you gap-fill. You've got to look carefully at the scenarios. Yes, a resend request may not come at all; it all depends on how the client configures things on their side. Maybe, per this question, you want to send sequence resets, because actually the messages you're trying to send are quotes, right? I mean, what kind of messages are you trying to resend after a disconnect?
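One possible shape for the 'wait it out' approach, sketched with the quickfix Python binding (QuickFIX/J's callbacks are analogous); the quiet-period length and the send_pending_cancels hook are placeholders:

    import threading
    import quickfix as fix

    QUIET_PERIOD = 5.0  # seconds with no resend traffic before we send

    def send_pending_cancels(sessionID):
        """Hypothetical application hook: send the queued unsolicited cancels."""

    class App(fix.Application):
        """After logon, wait for a quiet period with no ResendRequest before
        sending; every incoming ResendRequest pushes the send back."""

        def onLogon(self, sessionID):
            self._arm_timer(sessionID)

        def fromAdmin(self, message, sessionID):
            msg_type = fix.MsgType()
            message.getHeader().getField(msg_type)
            if msg_type.getValue() == '2':  # 35=2 is ResendRequest
                self._arm_timer(sessionID)  # counterparty is still recovering

        def _arm_timer(self, sessionID):
            if getattr(self, '_timer', None):
                self._timer.cancel()
            self._timer = threading.Timer(
                QUIET_PERIOD, send_pending_cancels, args=[sessionID])
            self._timer.start()

        # remaining callbacks required by the interface
        def onCreate(self, sessionID): pass
        def onLogout(self, sessionID): pass
        def toAdmin(self, message, sessionID): pass
        def toApp(self, message, sessionID): pass
        def fromApp(self, message, sessionID): pass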

Is it possible to get POE::Component::IRC to receive events for its own PRIVMSGs?

I'm currently developing a bot with POE::Component::IRC whose job, amongst other things, is to post a notice to a list of channels on a schedule, for one week.
I can't seem to find a way to check that the message has been successfully sent to a channel though. The old Net::IRC package would fire a message received event for every message sent to a channel, including ones it itself had sent. POE seems not to do this - at least, the irc_public event is not firing when the bot's own message is published on the channel.
Is there a flag I can pass to the event handler to say "I'd really like to receive all messages please, even my own"? Or is there a way to do this with some sort of RAW event handler?
The IRC protocol does not echo your PRIVMSGs back to you, so you just have to trust that the server received your message and handled it the way it should.
If you just want to receive POE events for messages you send, there's a plug-in for that: POE::Component::IRC::Plugin::BotTraffic. It doesn't actually do anything to verify that the messages ever reach the server, though.
Fortunately, IRC runs on top of TCP, which provides guaranteed in-order delivery. Thus, as long as the connection doesn't drop or hang indefinitely, you can pretty safely assume that your commands will reach the server.
If you wanted to be absolutely sure, you could always follow your PRIVMSG with some command, such as TIME or PING, that you know the server will reply to; if it does, you'll know that it received your PRIVMSG too. Of course, even then there's still no guarantee that the server actually passed the message on to the intended recipient(s); things like netsplits do occur from time to time, and can cause messages to be dropped.
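If you wanted to see that confirmation trick outside of POE, the raw exchange looks something like this Python sketch (the socket is assumed to be an already-registered IRC connection, and the token is a placeholder):

    def send_and_confirm(sock, target, text, token='confirm123'):
        """Send a PRIVMSG, then a PING with a unique token. Because the
        server processes commands in order, seeing our token come back
        in a PONG means the PRIVMSG was received and parsed first."""
        sock.sendall(f'PRIVMSG {target} :{text}\r\n'.encode())
        sock.sendall(f'PING :{token}\r\n'.encode())
        buf = b''
        while True:
            data = sock.recv(4096)
            if not data:
                raise ConnectionError('server closed the connection')
            buf += data
            for line in buf.split(b'\r\n'):
                if b'PONG' in line and token.encode() in line:
                    # server got the PRIVMSG; delivery to the recipients
                    # is still not guaranteed
                    return True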

How can I get QuickFix to process messages that come in from a resend request?

I am writing an acceptor application and using a persistent FIX session. I am trying to write a recovery mode, such that if I go offline or my program restarts, when I reconnect I want to reprocess all the messages sent to me during the day to get back to the current state.
To do this, when I start up I send a resend request for all messages to the server. They fire me back all the relevant messages, and they are marked possdupflag=Y and possresend=Y. Before each message, they send a sequence reset for the repeated message they are about to send.
The problem, though, is that these messages do not seem to be processed by my message cracker: neither fromAdmin nor fromApp receives them. I assume they are being ignored because of the dup flag and/or the resend flag. So is there a way for me to tell QuickFIX that I want to see these messages?
On that note- if anyone has any recommendations on better recovery processes I would be open to them.
Thanks.
There's at least a couple of potential problems with this recovery strategy. The first is that it's not very friendly to your trading counterparty. If you only receive a small number of messages during your session then it may not be an issue, but if you receive hundreds of thousands of messages then your counterparty might complain about the massive resends.
The other issue is that message resend is intended for error recovery and is managed by the session protocol layer. In QuickFIX/J (and other FIX engines) the session maintains recovery state in addition to sending the ResendRequest automatically when it detects a sequence number gap. Your approach might work if you reset the next expected incoming sequence number to 1. When the session receives the next message with a higher sequence number it will detect the gap and request the missing messages. If the messages are validated, they will be forwarded to application layer with the PossDup flag set. If you send the ResendRequest message yourself the behavior is undefined since the session state will not have been set up properly.
I recommend using a MessageLog implementation to store your incoming messages in a form you can use for recovery when your application starts. You can look at the implementation of the existing message logs (FileLog, JdbcLog) to get some ideas.
The behaviour occurs because the engine's persistence system tells it that the received messages are resent messages, and so (per the FIX protocol specification) they are discarded. Here we save FIXML strings into our database to provide a similar recovery ability to the one you describe (they are also written to XML files on disk for other reasons). I don't believe there is any way to tell QuickFIX that you want to see duplicate messages, but it is probably better to use a different form of persistence anyway, to save on connection overheads. QuickFIX does provide a way of writing messages to file as they come in, if that helps.
I had the same issue, and what Frank says is absolutely correct.
Just use the method below to set the next expected target sequence number to the beginning sequence number of the desired resend range:
getSession()->setNextTargetMsgSeqNum(atoi(seq.c_str()));
The engine then notices internally that the incoming sequence numbers are way ahead of the expected target number, automatically sends the resend request itself, and all the replayed messages are delivered through the onMessage callback as usual.
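In the quickfix Python binding the equivalent looks roughly like this, assuming the binding exposes the same Session methods as the C++ API (begin_seq being whatever sequence number your recovery starts from):

    import quickfix as fix

    def request_replay_from(session_id, begin_seq):
        """Rewind the expected incoming sequence number so the engine
        itself detects the gap and issues the ResendRequest; the replayed
        messages then arrive through the normal callbacks with
        PossDupFlag=Y, where the message cracker can rebuild state."""
        session = fix.Session.lookupSession(session_id)
        session.setNextTargetMsgSeqNum(begin_seq)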

How can I measure the breakdown of network time spent in iOS?

Uploads from my app are too slow, and I'd like to gather some real data as to where the time is being spent.
By way of example, here are a few stages a request goes through:
Initial radio connection (significant source of latency in EDGE)
DNS lookup (if not cached)
SSL/TLS handshake.
HTTP request upload, including data.
Server processing time.
HTTP response download.
I can address most of these (e.g. by powering up the radio earlier via a dummy request, establishing a dummy HTTP 1.1 connection, etc.), but I'd like to know which ones are actually contributing to network slowness, on actual devices, with my actual data, using actual cell towers.
If I were using WiFi, I could track a bunch of these with Wireshark and some synchronized clocks, but I need cellular data.
Is there any good way to get this detailed breakdown, short of having to (gak!) use very low level socket functions to reproduce my vanilla http request?
OK, the method I would use is not easy, but it does work. Maybe you've already tried this, but bear with me.
I get a time-stamped log of the sending time of each message, the time each message is received, and the time it is acted upon. If this involves multiple processes or threads, I have each one generate a log, and then merge them into a common timeline.
Then I plot out the timeline. (A tool would be nice, but I did it by hand.)
What I look for is things like 1) messages re-transmitted due to timeouts, 2) delays between the time a message is received and the time it's acted upon.
Usually, this identifies problems that I can fix in the code I can control. This improves things, but then I do it all over again, because chances are pretty good that I missed something the last time.
The result is that a system of asynchronous message-passing can be made to run quite fast, once the preventable sources of delay have been eliminated.
There is a tendency in posting questions about performance to look for magic fixes to improve the situation. But, the real magic fix is to refine your diagnostic technique so it tells you what to fix, because it will be different from anyone else's.
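A tiny version of that logging discipline, sketched in Python (on iOS you would emit the same records from your networking callbacks and merge the per-thread logs afterwards):

    import time

    def timeline_logger(log):
        """Append (timestamp, request_id, phase) records; merge the lists
        from every thread or process later and sort by timestamp to plot
        the timeline."""
        def mark(request_id, phase):
            log.append((time.time(), request_id, phase))
        return mark

    # usage: mark each boundary the question lists
    log = []
    mark = timeline_logger(log)
    mark(1, 'dns_start'); mark(1, 'dns_done')
    mark(1, 'tls_start'); mark(1, 'tls_done')
    mark(1, 'request_sent'); mark(1, 'first_byte'); mark(1, 'response_done')

    # gaps between consecutive events show where the time went
    events = sorted(log)
    for (t0, _rid, a), (t1, _rid2, b) in zip(events, events[1:]):
        print(f'{a} -> {b}: {t1 - t0:.3f}s')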
An easy solution would be to open a long-polling connection to the server once the application launches (you can choose beforehand when this connection needs to be established and when to disconnect), but that is a kind of hack if you want to avoid all the packet sniffing, given the limited API exposure iOS provides.