How to overcome API/WebSocket limitations with OHLC data for a trading platform with lots of users?

I'm using CCXT for some REST API calls and WebSockets. It's OK for one user, but if I wanted many users on the platform, how would I go about an in-house solution?
Currently each chart uses either WebSockets or REST calls. If I have 20 charts, that's 20 calls, and with more users that's 20 times the number of users. If I get a complete coin list with real-time prices from one exchange, that just slows everything down.
Some ideas I have thought about so far are:
Use proxies with REST/Websockets
Use timescale DB to store the data and serve that OR
Use caching on the server, and serve that to the users
Would any of these be a solution? There has got to be a way to overcome rate limiting and reduce the number of calls to the exchanges.

Probably, it's good to think about having separated layers to:
receive market data (a single connection that broadcasts data to OHLC processors)
process OHLC histograms (subscribe to internal market data)
serve histogram data (subscribe to processed data)
The market data stream is huge, and if you think about these layers independently, it will make it easy to scale and even decouple the components later if necessary.
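As a rough illustration of the processing layer, here is a minimal Python sketch that folds one shared trade stream into OHLC candles and notifies subscribers. The class name, the trade fields (symbol, timestamp, price, amount) and the in-memory storage are all assumptions made for the example; they do not come from CCXT or from any exchange API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candle:
    bucket_start: int  # epoch seconds at the start of the interval
    open: float
    high: float
    low: float
    close: float
    volume: float


class OhlcAggregator:
    """Hypothetical processor: one instance serves every user and chart."""

    def __init__(self, interval_s: int) -> None:
        self.interval_s = interval_s
        self.candles: Dict[Tuple[str, int], Candle] = {}
        self.subscribers: List[Callable[[str, Candle], None]] = []

    def subscribe(self, callback: Callable[[str, Candle], None]) -> None:
        self.subscribers.append(callback)

    def on_trade(self, symbol: str, ts: int, price: float, amount: float) -> Candle:
        # Align the timestamp to the start of its interval.
        bucket = ts - ts % self.interval_s
        key = (symbol, bucket)
        candle = self.candles.get(key)
        if candle is None:
            # First trade in this bucket sets open, high, low and close.
            candle = Candle(bucket, price, price, price, price, amount)
            self.candles[key] = candle
        else:
            candle.high = max(candle.high, price)
            candle.low = min(candle.low, price)
            candle.close = price
            candle.volume += amount
        # Fan the updated candle out to the serving layer.
        for callback in self.subscribers:
            callback(symbol, candle)
        return candle


# Usage: three trades in the same one-minute bucket.
updates = []
agg = OhlcAggregator(interval_s=60)
agg.subscribe(lambda symbol, c: updates.append((symbol, c.close)))
agg.on_trade("BTC/USDT", 120, 100.0, 1.0)
agg.on_trade("BTC/USDT", 150, 105.0, 0.5)
candle = agg.on_trade("BTC/USDT", 179, 98.0, 2.0)
```

With this shape, the number of exchange connections depends on how many symbols you track, not on how many users or charts are open.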
With timescale, you can build materialized views that will easily access and retrieve the information. Every materialized view can set a continuous aggregate policy based on the interval of the histograms.
Fetching all data all the time for all the users is not a good idea.
Pagination can help: load the visible histograms first and limit the query results, which avoids heavy I/O and large memory chunks on the server.

Related

Monitor MongoDB Atlas data transfer costs

I have a MongoDB Atlas cluster that serves many customers. Each customer has its own database on the cluster.
I would like to reduce my application's impact on MongoDB data transfer costs, which have been increasing for the last few days, but the billing info provided by Atlas does not break down prices per database. Therefore, I have no way of knowing which customers are costly and what are the most costly queries in terms of data transfer.
Moreover, comparing daily prices against a few queries, I cannot correlate the insertion of resources in my application with prices. For example, say my resources are Cats: one day data transfer costs $5 with 5000 Cats inserted in total across the databases, but the next day it costs $13 with 1500 Cats inserted.
Do you know of tools or something in the Atlas dashboard I might've missed that could help me better track costs per customer, or say, a cost per Cat (in my example) so that I build a pricing model for my customers?
Thank you
You are most likely going to need separate projects and deployments.
A MongoDB client instance is generally capable of using any database on the server (subject to authorization rules and APIs provided in the language in question), therefore to get a breakdown of data transfer by database would require the server to track bytes transferred per operation and then aggregate those counts. As far as I know this isn't a feature that currently exists.
The most practical way of tracking this today is probably writing a layer on top of the driver on the client side that would look at data actually received.
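A minimal sketch of such a client-side layer, in Python, might look like the following. TransferMeter and its record method are hypothetical names, not part of any MongoDB driver, and the JSON-encoded length is only a rough stand-in for the real BSON size on the wire.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List


class TransferMeter:
    """Hypothetical helper: approximate bytes received, per database."""

    def __init__(self) -> None:
        self.bytes_by_db: Dict[str, int] = defaultdict(int)

    def record(self, database: str, documents: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
        docs = list(documents)
        for doc in docs:
            # Assumption: JSON length approximates the transferred size.
            encoded = json.dumps(doc, default=str).encode("utf-8")
            self.bytes_by_db[database] += len(encoded)
        return docs


# Usage: pass every result set through the meter before using it.
meter = TransferMeter()
meter.record("customer_a", [{"name": "Tom", "kind": "cat"}])
meter.record("customer_b", [{"name": "Felix"}, {"name": "Luna"}])
```

In a real application the documents would come from driver calls, and the per-database totals could be exported to whatever metrics system you already use.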

MarkLogic REST interface to send data to Qlik Sense

I need to present ~10 million XML documents to Qlik Sense using MarkLogic REST interface with the intention of analyzing raw data on Qlik.
I'm unable to send that bulk data using simple cts:search.
A template view with SQL call like below is not helping as it is not recognized at Qlik Sense.
xdmp:to-json(xdmp:sql('select * from SC1.V1'))
Is there a better way to achieve this?
I understand it is not usual to load such huge data to Qlik, but what limitations should I consider?
You are unlikely to be able to transfer that volume of data into or out of ANY system in a single 'transaction' (or request). And if you could, you wouldn't want to, because when it fails, it's likely to fail forever as you have to start all over.
You should 'batch' up the documents into manageable chunks. 100MB or '1 minute' is a reasonable upper bound: as size and time increase, the probability of problems goes up (way up) due to timeouts, memory, temp space, transient internet and network problems, etc.
A simple strategy that often works well is to first produce a 'list' of what to extract (document uris, primary keys ..), save that, and then work your way through the list in batches - retrying as needed. Depending on the destination and local storage etc. you can either combine the lot to send on to the recipient, or generally better, send the target data in batches as well.
This approach has good transactional characteristics ... you effectively 'freeze' the set of data when you make the list, but can take your time collecting and sending it. Depending -- you may be able to do so in parallel.
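The list-then-batch strategy can be sketched in Python as follows. The fetch_batch and send_batch callables are placeholders for whatever reads from MarkLogic and writes to the destination; they, the batch size, and the retry count are assumptions for illustration only.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def export_all(
    uris: Sequence[str],
    fetch_batch: Callable[[List[str]], List[str]],
    send_batch: Callable[[List[str]], None],
    batch_size: int = 1000,
    max_attempts: int = 3,
    backoff_s: float = 0.0,
) -> int:
    """Work through a frozen URI list in batches, retrying failed batches."""
    sent = 0
    for batch in batches(uris, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                send_batch(fetch_batch(batch))
                sent += len(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up on this batch after the last attempt
                time.sleep(backoff_s * attempt)
    return sent


# Usage with a fetch function that fails once (simulated).
uris = [f"/docs/{i}.xml" for i in range(5)]
received: List[str] = []
calls = {"n": 0}


def flaky_fetch(batch: List[str]) -> List[str]:
    calls["n"] += 1
    if calls["n"] == 2:
        raise TimeoutError("simulated transient failure")
    return [f"<doc uri='{u}'/>" for u in batch]


total = export_all(uris, flaky_fetch, received.extend, batch_size=2)
```

Only the failed batch is retried, so a transient error costs one batch of work, not the whole export.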

tableau extract vs live

I just need a bit more clarity around tableau extract VS live. I have 40 people who will use tableau and a bunch of custom SQL scripts. If we go down the extract path will the custom SQL queries only run once and all instances of tableau will use a single result set or will each instance of tableau run the custom SQL separately and only cache those results locally?
There are some aspects of your configuration that aren't completely clear from your question. Tableau extracts are a useful tool - they essentially are temporary, but persistent, cache of query results. They act similar to a materialized view in many respects.
You will usually want to employ your extract in a central location, often on Tableau Server, so that it is shared by many users. That's typical. With some work, you can make each individual Tableau Desktop user have a copy of the extract (say by distributing packaged workbooks). That makes sense in some environments, say with remote disconnected users, but is not the norm. That use case is similar to sending out data marts to analysts each month with information drawn from a central warehouse.
So the answer to your question is that Tableau provides features that you can employ as you choose to best serve your particular use case: either replicated or shared extracts. The trick is then just to learn how extracts work and employ them as desired.
The easiest way to have a shared extract, is to publish it to Tableau Server, either embedded in a workbook or separately as a data source (which is then referenced by workbooks). The easiest way to replicate extracts is to export your workbook as a packaged workbook, after first making an extract.
A Tableau data source is the meta data that references an original source, e.g. CSV, database, etc. A Tableau data source can optionally include an extract that shadows the original source. You can refresh or append to the extract to see new data. If published to Tableau Server, you can have the refreshes happen on schedule.
Storing the extract centrally on Tableau Server is beneficial, especially for data that changes relatively infrequently. You can capture the query results, offload work from the database, reduce network traffic and speed your visualizations.
You can further improve performance by filtering (and even aggregating) extracts to have only the data needed to display your viz. Very useful for large data sources like web server logs to do the aggregation once at extract creation time. Extracts can also just capture the results of long running SQL queries instead of repeating them at visualization time.
If you do make aggregated extracts, just be careful that any further aggregation you do in the visualization makes sense. SUMs of SUMs and MINs of MINs are well defined. Averages of averages, etc., are not always meaningful.
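A small worked example in Python shows the difference. The two partitions and their values are made up for illustration; the point is only that sums and minimums compose across pre-aggregated groups while plain averages do not, unless you carry the counts along.

```python
from statistics import mean

# Hypothetical pre-aggregation groups of unequal size.
partitions = {"day1": [10, 20, 30], "day2": [100]}
all_rows = [v for rows in partitions.values() for v in rows]

# SUM and MIN of the per-group aggregates match the aggregates over all rows.
sum_of_sums = sum(sum(rows) for rows in partitions.values())  # 160
min_of_mins = min(min(rows) for rows in partitions.values())  # 10

# The average of per-group averages ignores group sizes.
avg_of_avgs = mean(mean(rows) for rows in partitions.values())  # (20 + 100) / 2 = 60
true_avg = mean(all_rows)  # 160 / 4 = 40

# Keeping sum and count per group lets you recover the true average.
weighted_avg = sum_of_sums / sum(len(rows) for rows in partitions.values())  # 40.0
```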
If you use the extract, then it will behave like a materialized SQL table: changes made upstream of the Tableau extract will not influence the result until the extract is refreshed.
An extract is used when the data needs to be processed very fast. In this case, a copy of the source data is stored in the Tableau memory engine, so query execution is very fast compared to a live connection. The only problem with this method is that the data won't automatically update when the source data is updated.
A live connection is used when handling real-time data. Here each query runs against the source data, so the performance won't be as good as with an extract.
If you need to work on a static database, use an extract; otherwise use a live connection.
I sense from your question that you are worried about performance, which is why you are wondering whether your users should use a Tableau extract or a live connection.
In my opinion, in both cases (live vs extract) it all depends on your infrastructure and the size of the table. It makes no sense to make an extract of a huge table that would take hours to download (for example 1 billion rows and 400 columns).
If all your users are connected directly to a database (not a Tableau Server), you may run into various issues. If the tables they connect to are relatively small and your database handles multiple users well, that may be OK. But if your database has to run many resource-intensive queries in parallel, on big tables, on a database that is not optimized for many simultaneous users and is located in a different time zone with high latency, finding a solution will be a nightmare. In the worst-case scenario you may have to change your data structure and update your infrastructure to allow 40 users to access the data simultaneously.

What is the optimal way to do server side paging in expressjs with mongoose

I'm currently doing a project with my own MEAN stack.
Now in a new project I'm creating, I've got a collection that I'm paging with Express on the server side, returning one page at a time (e.g. 10 results out of the total 2000) along with the total rows found for the query the user performed (e.g. 193 for UserID 3).
Although this works fine, I'm afraid that this will create an enormous load on the server since a user can easily pull 50-60 pages a session with 10, 20, 50 or even 100 results each.
My question to you guys is: if I have say 1000 concurrent users paging every few seconds like this, will MongoDB be able to cope with this? If not, what might be my alternatives here?
Also, is there any way I can simulate such concurrent read tests on my app/MongoDB?
Please take into account that I must do server-side paging because the app will be quite dynamic and information can change very often.
If you're planning on only using a single webserver, you could cache the result set belonging to a certain page in memory. If you're planning on using multiple webservers, caching in-memory would lead to different result sets across servers, so in that case I'd recommend storing your cache either in MongoDB or in Redis.
A certain result set would be stored under a certain key in your cache. Your key would probably be composed of something like entityName + filterOptions + offset + resultsLimit. So for example you're loading movies with title=titanic, skipping the first 100, so offset=100 and loading only 50 per page so limit=50, which would all be concatenated into a single key.
When a request comes in, you would first try to load the result set from the cache. If the result set is inside the cache, you'll return that to the client. If it's not in the cache, you'd query the database for the latest result set, put that in the cache and return it to the client.
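The lookup flow described above can be sketched as follows. The sketch is in Python and not in Node.js, purely for illustration; cache_key, PageCache and the query_db callable are hypothetical names, and a plain dict stands in for Redis or MongoDB.

```python
import json
from typing import Any, Callable, Dict, List

QueryFn = Callable[[str, Dict[str, Any], int, int], List[Any]]


def cache_key(entity: str, filters: Dict[str, Any], offset: int, limit: int) -> str:
    # sort_keys makes equivalent filter dicts produce the same key.
    return f"{entity}:{json.dumps(filters, sort_keys=True)}:{offset}:{limit}"


class PageCache:
    """Cache-aside lookup: try the cache first, fall back to the database."""

    def __init__(self, query_db: QueryFn) -> None:
        self.query_db = query_db
        self.store: Dict[str, List[Any]] = {}  # stand-in for Redis/MongoDB

    def get_page(self, entity: str, filters: Dict[str, Any], offset: int, limit: int) -> List[Any]:
        key = cache_key(entity, filters, offset, limit)
        if key in self.store:
            return self.store[key]
        page = self.query_db(entity, filters, offset, limit)
        self.store[key] = page
        return page


# Usage: the second identical request is served from the cache.
db_calls = []


def fake_query(entity: str, filters: Dict[str, Any], offset: int, limit: int) -> List[Any]:
    db_calls.append((entity, offset, limit))
    return [f"{entity}-{i}" for i in range(offset, offset + limit)]


cache = PageCache(fake_query)
first = cache.get_page("movies", {"title": "titanic"}, 100, 50)
second = cache.get_page("movies", {"title": "titanic"}, 100, 50)
```

Since the data in question changes often, a real cache would also need an expiry time or explicit invalidation on writes, which this sketch leaves out.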
Whether or not you could pull it off with 1000 concurrent users depends a lot on your hardware, the data you are loading, how you're loading it and the efficiency of your implementation. There's one way to find out, and that's testing.
Of course by using the asynchronous capabilities of Node.js you can achieve the best scalability, so every call that can be executed async, such as database calls, should definitely be executed asynchronously.
You could load test your application for free from your local computer using Apache JMeter or let it be tested using for example Azure.

Enterprise integration via a data warehouse, or via messages?

Imagine a large organisation with many applications. The applications are not currently integrated to any great extent. There is a new and empty enterprise data warehouse, and it would store all data in a canonical format. The first step is to set up the warehouse and seed it with data from the applications.
I am looking for pros and cons between the following two enterprise integration patterns:
1) Using a combination of integration tools, set up batching to extract, transform and load data into the warehouse at a periodic interval. Then, as part of the process, integrate the data from the warehouse to the required applications.
2) Using a combination of integration tools, detect changes real-time, or in batch and publish them to a service bus (in canonical format). Then, for each required application, subscribe to the messages to integrate them. The data warehouse is another subscriber to the same messages.
Thanks in advance.
One aspect that is hard to get right with integration-via-messages is periodic datasets.
Say you have a table in your data warehouse (DW) that contains data partitioned by day. If an ETL job loads that table, you can be sure that if the load job is finished, the respective dataset is complete (unless there's a bug in the job).
Messaging systems, on the other hand, usually don't provide guarantees of timely delivery. So you might get 90% of messages for a particular day by midnight, 8% within the next hour, and the remaining 2% within the next 6 hours (and a few messages might never arrive). In this situation, if you have a job that depends on this data, how can you know that the dataset is ready? You can set an arbitrary cutoff time (e.g. 1 hour past midnight) based on previous experience, SLAs, or some other criteria, when you consider the dataset complete, but that will by design be an approximation. You will also need some means to detect missing data (because of lost messages) and re-request it from the source.
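One possible way to implement the cutoff and the missing-data check is sketched below in Python. It assumes the source numbers its messages per dataset from 1 and announces how many it sent; neither of these is guaranteed by messaging systems in general, so treat both as assumptions of the example.

```python
from typing import Iterable, List, Set


def missing_sequences(received: Iterable[int], expected_count: int) -> List[int]:
    """Return the sequence numbers in 1..expected_count that never arrived."""
    seen: Set[int] = set(received)
    return [seq for seq in range(1, expected_count + 1) if seq not in seen]


def dataset_status(received: Iterable[int], expected_count: int, now_s: int, cutoff_s: int) -> str:
    """Classify a daily dataset as complete, still waiting, or incomplete."""
    missing = missing_sequences(received, expected_count)
    if not missing:
        return "complete"
    # Before the cutoff we keep waiting; after it we must re-request data.
    return "waiting" if now_s < cutoff_s else "incomplete"


# Usage: messages 3 and 5 of 5 have not arrived.
gaps = missing_sequences([1, 2, 4], expected_count=5)
before = dataset_status([1, 2, 4], 5, now_s=1800, cutoff_s=3600)
after = dataset_status([1, 2, 4], 5, now_s=7200, cutoff_s=3600)
done = dataset_status([1, 2, 3, 4, 5], 5, now_s=1800, cutoff_s=3600)
```

The list of gaps is what a consumer would use to re-request lost messages from the source.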
This answer talks about similar problems.
Another issue is backfills. Imagine your source sends a backdated message, for example to correct some previously-sent one that belongs to a dataset in the past. Presumably, any consumers of that dataset need to be notified of the change and recompute their results. However, without some additional logic in the DW they might not know about it. With the ETL approach, since you already have dependencies between jobs, if you rerun some job with a backfill date, its dependencies will run automatically, or at least it'll be explicitly known that some consumers are affected.
With these caveats in mind, the messaging approach has some great advantages:
all your systems will be integrated using a uniform approach
the propagation time for your data will potentially be much lower
you won't have to fix ETL jobs that exploded because the data volume has grown past their ability to scale
you won't get SLA violations because your ETL jobs timed out
I guess you are talking about both ETL systems and the Mediation (intra-communication) design pattern. I don't see why you have to choose between them; in my current project we combine them.
The ETL solution is implemented as a layer responsible for managing data integration (via an Orchestrator module). It is a single entry point and part of the Pipes and Filters design pattern concept that we rely on. It's able to perform a variety of tasks of varying complexity on the information that it processes.
On the other hand the Mediation as EAI system acts as "broker" between multiple applications. Whenever an interesting event occurs in an application (for instance, new information is created or a new transaction completed) an integration module in the EAI system is notified. The module then propagates the changes to other relevant applications.
So, bottom line, I can't give you pros and cons for each, since to me they are a good solution together, and their use depends on your goals, design, etc. But from your description, it seems to me that your setup is similar to what I've suggested.