Third party data delivery of lots of data - real-time

Does anyone know how sites that have a real-time feed of a lot of data work? I am referring to something like a stock site, where they can tell you in real time (well, 20 minute delay mostly, but still real-time - 20 minutes as I understand it).
They have thousands of data pieces delivered to them every second, I would imagine: MSFT 25.00 +.23 VOL 12000 ???? for each stock that had a change during some interval.
So, is there just a constant feed of small pushes going on? Or do you think a site will pull from the place that has the real data and say "give me all changes since 12:23:45 CST to now" type query?
I ask this because at work we might have a situation where we need to have at our application's fingertips real time information like this, and it won't make sense to hit our third party provider over and over and over again every second...

Generally there is a server/client protocol defined between the 2 parties. In the company I work for the connection is maintained at all times.
Here is info on real time data feeds to go with your stock example
NYSE,NASDAQ
It is common for data providers to also have FTP sites with (delayed) batched data. One that comes to mind is the NWS EMWIN

Sites like Twitter feed data to certain approved sites in real-time via XMPP (Wiki link).

In the broadest terms, a push model is going to be the best way of achieving "real time" transfer, particularly if you're talking about a large amount of data.
However you do always have a problem when using a purely push model of how to recover from missed data.
Depending on the nature of your data that may not be a problem (thinking of video delivery as an analogue, where the amount of data is huge but there is sufficient redundancy for it to recover from missing data). And if you have any control over the data you may be able to build some redundancy in. For example, on every change event you can provide absolute values rather than changes, or previous value and new value.

I've done this making an attempt to retrieve the stock quote from the source, and falling back to a timestamped on-disk cache of the quote when the main source fails or times out.

Related

Instant Messaging Schema design advice

I'm trying to build an Instant Messaging functionality in my app as part a bigger project.
Chats can have more than 2 participants (group chats)
If participant A delete a message, it still should be visible to participant B (that's why I used the Message Participants table)
Same applies to Conversation.
By same logic, if all participants delete the conversation/message, it should be erased from DB.
Questions :
I'm afraid that this schema is too cumbersome, meaning that the queries will be too slow once the app gets certain traffic mark (1k active users ? I'm guessing)
Message Participants will have multiple records for each message - one for each participants in the chat. Instant Messaging means it will involve those writes with very tight timings. Wouldn't that be a problem?
Should I add a layer of Redis DB, to manage a chat's active session's messaging? it will store the recent messages, and actively sync the PostgreSQL db with those messages (perhaps with Async transactions functionality that postgresql has?)
UPDATED schema :
I would also gladly hear ideas for having a "read" status functionality. I'm assuming it's much more complex with Group chats, so at least offering that for 1:1 chats would be nice.
I am a little confused by your diagram. Shouldn't the Conversation Participants be linked to the Conversations instead of the Message? The FKs look all right, just the lines appear wrong.
I wouldn't be worried about performance yet. The Premature Optimization Anti-Pattern warns us not to give up a clean design for performance reasons until we know whether we are going to have a performance problem. You anticipate 1000 users - that's not much for a modern information system. Even if they are all active at the same time and enter a message every 10 seconds, this will just mean 100 transactions per second, which is nothing to be afraid of. Of course, I don't know the platform on which you are going to run this. But it should be an easy task to set up those tables and write a simple test program that inserts those records as fast as possible.
Your second question makes me wonder how "instant" you expect your message passing to be. Shall all viewers of a message receive each keystroke of the text within a millisecond? Or do they just need to see each message appear right after it was posted? Anyway, the limiting factor for user responsiveness will probably be the network, not the database.
Maybe this is not mainly a database design issue. Let's assume you will have a tremendous rate of postings and viewings. But not all conversations will be busy all the time. If the need arises - but not earlier - it might be necessary to hold the currently busy conversations in memory and using the database just as a backup for future times when they aren't busy any more.
Concerning your additional comments:
100k users: This is a topic not for this forum, but concerning business development of a startup. Many founders of startup companies imagine huge masses of users being attracted to their site, while in reality most startups just fail or only reach very few. So beware of investments (in money, but also in design and implementation effort) that will only pay in the highly improbable case that your company will be the next Whatsapp.
In case you don't really anticipate such masses of users but just want to imagine this as a programming exercise, you still have a difficult task. You won't have the platform to simulate the traffic, so there is no way to make measurements on where you actually have a performance problem to solve. That's one of the reasons for the Premature Optimization warning: Unless you know positively where you have a bottleneck, you - and all of us - will be just guessing and probably make the wrong decisions.
Marking a message as read is easy: Introduce a boolean attribute read at Message Participants, and set it to true as soon as, well, the user has read the message. It's up to your business requirements in which cases and to whom you show this.

Stuck with understanding how to build a scalable system

I need some guidance on how to properly build out a system that will be able to scale. I will give you some information about what I am trying to do and then ask my specific question.
I have a site where I want visitors to send some data to be processed. They input the data into a textarea or upload it in a file. Simple. The data is somewhat preprocessed on the client side before a POST request is made to a REST endpoint.
What I am stuck on is what is a good way to take this posted data store it and then associate an id with it that references the user since I cannot process the data fast enough for it to be returned to the user in a reasonable amount of time?
This question is a bit vague and open to opinion, I admit it. I just need a push in the right direction to keep moving. What I have been considering is throwing the data into a message queue and then having some workers process the data elsewhere and when the data is processed alert the user as to where to find it with some sort of link to an S3 bucket or just a URL to a file. The other idea was to just run the request for each item to be processed against another end-point that already processes individual records in some sort of loop client side. The problem is as follows with this idea:
To process the data it may take somewhere from 30 minutes to 2 hours depending upon the amount that they want processed. It's not ideal for them to just sit there and wait for that to finish depending on the amount of records they need processed, so I have ruled this out mostly.
Any guidance would be very much appreciated as I don't have any coworkers to bounce things off of, nor do I know many people with the domain knowledge that I could freely ask. If this isn't the right place to ask this, could you point me in the right direction as to where it should be asked?
Chris
If I've got you right, your pipeline is:
Accept item from user
Possibly preprocess/validate it (?)
Put into some queue
Process data
Return result.
You man use one or several queues on stage (3). Entity from user gets added to one of the queues. If it's big enough, it could be stored in S3 or storage alike, and only info about it put into the queue: link, add date, user id (or email of alike). Processors can pull items from queue and give feedback to users.
If you have no strict requirements on order, things get much simpler: you don't need any sync between them. Treat all the components: upload acceptors, queues, storages and processors as independent pools of processes. Monitor each pool separately. If there's some bottlenecks - add machines to that pool.

Fetching 1 minute bars from Yahoo Finance

I'm trying to download 1 minute historical stock prices from Yahoo Finance, both for the current day and the previous ones.
Yahoo (just like Google) supports up to 15 days worth of data, using the following API query:
http://chartapi.finance.yahoo.com/instrument/1.0/AAPL/chartdata;type=quote;range=1d/csv
The thing is that data keeps on changing even when the markets are closed! Try refreshing every minute or so and some minute bars change, even from the beginning of the session.
Another interesting thing is that all of these queries return slightly different data for the same bars:
http://chartapi.finance.yahoo.com/instrument/2.0/AAPL/chartdata;type=quote;range=1d/csv
Replace the bold number with 100000 and it will still work but return slightly different data.
Does anyone understand this?
Is there a modern YQL query that can fetch historical minute data instead of this API?
Thanks!
Historical minute data is not as easily accessible as we all would like. I have found that the most affordable way to gather Intraday Stock Price data is to develop automated scripts that log price information for whenever the markets are open.
Similar to the Yahoo data URL that you shared, Bloomberg maintains 1-Day Intraday Price information in JSON format like this : https://www.bloomberg.com/markets/api/bulk-time-series/price/AAPL%3AUS?timeFrame=1_DAY
The URL convention appears easy to input on your own once you have a list of Ticker Symbols and an understanding of the consistent syntax.
To arrive at that URL initially though, without having any idea for guessing / reverse-engineering it, I simply went here https://www.bloomberg.com/quote/AAPL:US and used Developer Tools on my browser and tracked a background GET request which led me to that URL. I wouldn't be surprised if you could employ similar methods on other Price Data-related websites.
You can also write scripts to track price data as fast as your internet goes. One python package that I find pretty handy and is ystockquote
You can have it request price data every couple of seconds and log that into a daily time series database.
Yes there is other APIs.
I don't know if it can still help but if you need intraday data, there is a API on rapidapi called (Quotient) which allows to pull intraday (at 1-min level), EOD market (FX, Crypto, Stocks (US, CANADIAN, UK, AUSTRALIA, EUROPE), ETFs and Futures. It also provides earnings, dividends, splits and a lot others informations.

Avoiding data loss: suggested reading

I am about to work on an app which handles extremely valuable data. Any loss of this data for the user would be very costly, so I'm interested in finding out more about the best architecture design for our needs.
The user will be inputting this data in their iPhone each day. The alternative to using this app is carrying around a piece of paper with this sensitive information on it. So while I know we can be more secure than a piece of paper, I want to make sure we also cover the user stories like "I flushed my phone down the toilet" or "my son deleted the app, where's my data?"
A service like Dropbox comes to mind, but I wouldn't want to require our users to have a Dropbox account; the syncing architecture must be transparent to the user. iCloud is out because web and Android versions may follow.
Can anyone suggest either some good reading on this subject, or some good frameworks to look at? I expect to use a node.js backend, and while we are targeting iPhone first, Android will follow.
The data itself consists of 2 tables, each with a small number of fields, with a many to many relationship. A few new rows will be created by the user each day, but the data will be small and highly compressible.
Turns out this is an extremely difficult issue. In data assurance (this isnt yet a security type situation although could become one because of the assurance aspect) there is ALWAYS a time element. As a simple example what happens if your use has locally updated some piece of data. Just before you have the ability to fully push the data to some cloud service, etc... he / she dumps it in the toilet. Even if good signal was there for transmitting the data there is time in transferring and time necessary for the cloud server to respond saying the data got there properly.
Generally in data assurance, you really have to work to the best you can. You will NEVER be able to solve all issues as there is no data center, nor link to a data center, etc... that is perfect. There is always a chance of data loss. Truly the best you can do, is SYNC as fast as data changes, and if there is loss of connection, as soon as the connection becomes alive again.
Now, for security. Security by itself does not create assurance. If the data itself is something that the customer does not want to lose, and that is his only requirement, then security is un-necessary. If he / she is also worried about other getting their hands on his data, then you have to be worried about data-in-transit (both up and down during syncing), and on the device itself. For the best potential security, encrypt the data locally on the device prior to pushing over the cloud. There are many known attacks that even if using SSL or other services, can get at the data. If you wish, locally encrypt a file, then you could for SOME added security still use SSL (at this point you will have doubly encrypted the data). You also want to sign the data so that there is little chance of it being manipulated in transit, or by the cloud server itself (if a hacker hacked the cloud server). Generally the way to protect the data while on device, you may choose to have the user input a password, and put some fairly strict rules around how passwords are formed, and how many tries you allow before you disallow attempts for 30 minutes or so.
You may also wish to store the data locally in an encrypted form. This way if someone gets the device, they still will need to have the password before they can get the data (unless of course they can crack the algorithm you use to generate the symetric key from the password).
In terms of online data service, you could use iCloud, etc... I am actually NOT a fan of anything cloud. I think it is SO counter enterprise / proprietary data, it isnt even funny. I think it actually almost laughable that so many of these phone / device manufacturers are going SOOOOO cloud based. I think they are abandoning the big companies, as NO big company I know of wants to place their proprietary data on a cloud server that THEY DONT CONTROL. In any case, I would argue that so long as you have a good local encryption scheme prior to sending out the data, then you should be OK. I would from an assurance perspective however look at where the servers are in locale. the reason being that if assurance of data is of prime concern, most larger IT setups like to have replicated data centers on opposing sides of the country / world etc... The reason for this is if an earthquake takes down the data center on one side of the country, it most likely will NOT take down the one on the other side of the country simultaneously. If the data centers for iCloud or whatever you can find are essentially in one locale, then you may consider syncing with one data center on the west coast, and choose a completely differing data center (in this case company) to sync with that is centered on the east coast.
This is all very high level, how you would implement this on an iPhone specifically we could also talk about, byt I hope this at least begins to help pave a path.

Best strategy for synching data in iPhone app

I am working on a regular iPhone app which pulls data from a server (XML, JSON, etc...), and I'm wondering what is the best way to implement synching data. Criteria are speed (less network data exchange), robustness (data recovery in case update fails), offline access and flexibility (adaptable when the structure of the database changes slightly, like a new column). I know it varies from app to app, but can you guys share some of your strategy/experience?
For me, I'm thinking of something like this:
1) Store Last Modified Date in iPhone
2) Upon launching, send a message like getNewData.php?lastModifiedDate=...
3) Server will process and send back only modified data from last time.
4) This data is formatted as so:
<+><data id="..."></data></+> // add this to SQLite/CoreData
<-><data id="..."></data></-> // remove this
<%><data id="..."><attribute>newValue</attribute></data></%> // new modified value
I don't want to make <+>, <->, <%>... for each attribute as well, because it would be too complicated, so probably when receive a <%> field, I would just remove the data with the specified id and then add it again (assuming id here is not some automatically auto-incremented field).
5) Once everything is downloaded and updated, I will update the Last Modified Date field.
The main problem with this strategy is: If the network goes down when I am updating something => the Last Modified Date is not yet updated => next time I relaunch the app, I will have to go through the same thing again. Not to mention potential inconsistent data. If I use a temporary table for update and make the whole thing atomic, it would work, but then again, if the update is too long (lots of data change), the user has to wait a long time until new data is available. Should I use Last-Modified-Date for each of the data field and update data gradually?
I would start by making the update routine atomic, since you'll have enough on your hands figuring out how to get the client-server communication working properly.
After that is a good time to consider tweaking it to be incremental, but only after you do some testing to figure out if it's really necessary. If you're tuning your update protocol to be as low bandwidth as possible, you might discover that even a "big" update is downloaded fast enough.
Another way to look at it is to ask yourself, how often is there going to be network trouble when an average user is doing a sync? You probably don't want to tune for unlikely scenarios.
If you are trying to optimize (minimize) the data transfer you may want to consider a different format than XML, since XML is fairly verbose. Or at least you may want to trade in XML readability for space by making each element name and attribute as small as possible, and eliminate all unnecessary whitespace.
Your basic scheme is good. The thing you need to do is to somehow make your updates idempotent so that you can restart a partially-completed transfer without risk. This is a better way to go than to try to implement some sort of true atomic commit (though you could do that too, using, eg, the SQLite database).
In our experience fairly large updates (10s of KB) can be downloaded quite rapidly, if the server is fast enough. No great need to break updates up into tiny bits. But certainly it won't hurt to try to minimize the amount of data transferred by keeping more granular info on "last update".
(And definitely you should use JSON rather than XML as your transmitted data representation.)
Wonder if you have considered using a Sync Framework to manage the synchronization. If that interests you can take a look at the open source project, OpenMobster's Sync service. You can do the following sync operations
two-way
one-way client
one-way device
bootup
Besides that, all modifications are automatically tracked and synced with the Cloud. You can have your app offline when network connection is down. It will track any changes and automatically in the background synchronize it with the cloud when the connection returns. It also provides synchronization like iCloud across multiple devices
Also, modifications in the Cloud are synched using Push notifications, so the data is always current even if it is stored locally.
In your case,
Criteria are speed (less network data exchange), robustness (data recovery in case update fails), offline access
Speed: Only the changes are sent across the network in both directions
Robustness: It stores data in a transactional store like sqlite and any failed updates are communicated in the SyncML payload. Only the successful operations are processed while the failed operations are re-tried during the next sync
Here is a link to the open source project: http://openmobster.googlecode.com
Here is a link to iPhone App Sync: http://code.google.com/p/openmobster/wiki/iPhoneSyncApp