DolphinDB: how could I write data efficiently to a stream table or a dfs table? - streaming

I need help improving data-writing performance in DolphinDB.
The client receives stock quotation data one record at a time. Taking both latency and throughput into consideration, how can I write this data efficiently to a stream table or a dfs table? Any further suggestions for improving the writing efficiency would be appreciated. Many thanks.

On how to improve DolphinDB data import performance:
Increase your network bandwidth. Data import in a distributed system involves many network transfers, so it is recommended to deploy at least 10 Gigabit Ethernet to avoid high latency.
Use bulk import instead of single-record inserts. Inserting one record at a time is not recommended because each insert still goes through the full write path (writing the log, opening a transaction, multiple network round trips), and the insert step itself is only slightly cheaper than a bulk write, so the fixed overhead dominates. That is why bulk import is highly recommended. Bulk import is currently supported in the C++, Python and C# APIs.
Increase the number of remoteexecutor threads. The default value is 1. When a node needs to forward data to other nodes after receiving it, remoteexecutor = 1 means there is only one thread sending that data; increase remoteexecutor so data can be sent to multiple nodes simultaneously.
Use data compression. DolphinDB automatically compresses data blocks when the data volume is relatively large, which conserves network bandwidth.
Partition data in advance on the client. Before importing, a data node of a DFS database groups the incoming data according to the partitioning scheme; doing this grouping on the client beforehand reduces that overhead.
Import data in asynchronous batches. Use multi-threaded batch processing with at least two threads: one thread receives records and appends them to a queue, while another runs in a loop, draining the queue and writing the accumulated records in bulk (see the sketch below).
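To make the bulk-import and asynchronous-batch suggestions concrete, here is a minimal sketch in Java. Quote and bulkWrite() are hypothetical placeholders; bulkWrite() is where the accumulated batch would be handed to whichever bulk-insert entry point your client API provides (for DolphinDB, the bulk write in the C++/Python/C# APIs mentioned above), rather than inserting one record at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the "asynchronous batches" suggestion above.
// Quote and bulkWrite() are hypothetical placeholders.
public class AsyncBatchWriter {
    record Quote(String symbol, double price, long timestamp) {}

    private final BlockingQueue<Quote> queue = new LinkedBlockingQueue<>();
    private final int maxBatch = 10_000;   // flush when this many records are queued
    private final long maxWaitMs = 100;    // ...or after this many milliseconds

    // Receiving thread: called once per incoming quote, returns immediately.
    public void onQuote(Quote q) {
        queue.offer(q);
    }

    // Writer thread: drains the queue and writes each batch in a single bulk call.
    public void runWriterLoop() throws InterruptedException {
        List<Quote> batch = new ArrayList<>(maxBatch);
        while (!Thread.currentThread().isInterrupted()) {
            Quote first = queue.poll(maxWaitMs, TimeUnit.MILLISECONDS);
            if (first == null) continue;           // nothing arrived within the wait window
            batch.add(first);
            queue.drainTo(batch, maxBatch - 1);
            bulkWrite(batch);                      // one round trip for the whole batch
            batch.clear();
        }
    }

    private void bulkWrite(List<Quote> batch) {
        // Hypothetical: convert the batch to columnar arrays and hand it to the
        // client API's bulk insert (stream table append or dfs table write).
    }
}
```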

Related

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API which is our BI platform. Spring Batch breaks this data down into chunks by date rather than by size, so on a particular day one chunk may consist of a million rows. We are running the complete end-to-end flow in the following way:
Control-M sends a trigger to Load Balancer at a scheduled time
Through Load Balancer request lands on to an instance of Spring Batch app
Spring Batch reads data for that day in chunks from Oracle database
Chunks are then sent to target API
My problems are:
The chunks can get heavy. If one contains a million rows, the instance's heap usage grows and at some point chunks get processed at a trickling pace
One instance bears the load of entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample
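As a rough sketch of the idea (the ID column and range values are assumptions, not part of the original question), a Partitioner can carve the day's rows into non-overlapping ID ranges so that each worker step reads a distinct set of rows:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits the day's rows into non-overlapping ID ranges so that no two workers
// read the same rows. minId/maxId are assumed to come from a cheap
// SELECT MIN(id), MAX(id) ... WHERE trade_date = ? query (not shown).
public class IdRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rangeSize = ((maxId - minId) / gridSize) + 1;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("fromId", minId + i * rangeSize);
            ctx.putLong("toId", Math.min(maxId, minId + (i + 1) * rangeSize - 1));
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}
```

For local partitioning, the manager step would be built with something like stepBuilderFactory.get("managerStep").partitioner("workerStep", partitioner).step(workerStep).gridSize(4).taskExecutor(new SimpleAsyncTaskExecutor()) (Spring Batch 4.x style), with a step-scoped reader that pulls fromId/toId out of the step execution context. To spread the workers across several instances rather than threads, the remote partitioning setup from the first link sends the partition metadata to the workers over a message broker.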

How to do simple cache file in Flink-Scala?

I am new to Flink. I am confused about how to cache a file and load it into a dataset; I can't find a simple example. Why do we need to create a dataset first in order to use a RichMapFunction? And how can I cache a file that has nothing to do with any other dataset? In the sample I found, the cached data is joined with another dataset. Thank you.
When joining two datasets where one is small, use a broadcast to avoid a shuffle; shuffling a large dataset is painful.
For example, say one dataset has 1 billion records and the other has 100. With a broadcast, the small dataset is distributed to all task managers processing those 1 billion records, so the large dataset never has to move for the join. Without a broadcast, the typical join behaviour is to shuffle both the 1 billion records and the 100 records so that records with the same key end up on the same machine, which is much more expensive.
RichMapFunction provides the open() method and access to the RuntimeContext. In open(), the Flink job can fetch the broadcast dataset through getRuntimeContext().getBroadcastVariable(). open() is called only once per operator instance, so the broadcast dataset is initialised once and then applied to all incoming records. That is why RichMapFunction is used instead of MapFunction.
Note: broadcasting only applies when the dataset to broadcast is small, and you do need to create a dataset first and then broadcast it to all operators. Please refer to the Flink documentation for the usage of the API.
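A minimal sketch of the broadcast pattern, using the Java DataSet API that the answer describes (the Scala API is analogous); the datasets and field names are made up for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class BroadcastJoinSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Large dataset: (userId, amount). Small dataset: (userId, userName).
        DataSet<Tuple2<String, Double>> big = env.fromElements(
                Tuple2.of("u1", 10.0), Tuple2.of("u2", 20.0));
        DataSet<Tuple2<String, String>> small = env.fromElements(
                Tuple2.of("u1", "Alice"), Tuple2.of("u2", "Bob"));

        DataSet<String> enriched = big
                .map(new RichMapFunction<Tuple2<String, Double>, String>() {
                    private final Map<String, String> names = new HashMap<>();

                    @Override
                    public void open(Configuration parameters) {
                        // Called once per task: fetch the broadcast dataset and index it.
                        List<Tuple2<String, String>> users =
                                getRuntimeContext().getBroadcastVariable("users");
                        for (Tuple2<String, String> u : users) {
                            names.put(u.f0, u.f1);
                        }
                    }

                    @Override
                    public String map(Tuple2<String, Double> record) {
                        return names.getOrDefault(record.f0, "unknown") + ": " + record.f1;
                    }
                })
                .withBroadcastSet(small, "users");   // small dataset is shipped to every task

        enriched.print();
    }
}
```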
The distributed file cache, on the other hand, is for the case where an operation (e.g. a map) needs to load an external file once and then use it while processing records.
For example, a trained model is saved on HDFS and the Flink job needs to load it and apply it to each record. In this case the job can use the distributed cache API: the model file is pulled from HDFS to each local machine, and all tasks running on that machine share the local copy, which saves network traffic and time.
You do not need to create a dataset for the file to be distributed; instead, register it with registerCachedFile(). For the same reason as with broadcast datasets, using RichMapFunction lets the job load and initialise the distributed file only once.
Please refer to the Flink documentation for the usage.
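And a minimal sketch of the distributed file cache, again with the Java DataSet API; the HDFS path and the "model" handling are placeholders:

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class DistributedCacheSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Register the external file once; Flink copies it to every task manager.
        env.registerCachedFile("hdfs:///models/scoring-model.bin", "model");

        DataSet<String> records = env.fromElements("record-1", "record-2");

        DataSet<String> scored = records.map(new RichMapFunction<String, String>() {
            private transient byte[] model;   // stands in for a real deserialized model

            @Override
            public void open(Configuration parameters) throws Exception {
                // Called once per task: read the locally cached copy of the file.
                File modelFile = getRuntimeContext()
                        .getDistributedCache()
                        .getFile("model");
                model = Files.readAllBytes(modelFile.toPath());
            }

            @Override
            public String map(String record) {
                // Apply the (placeholder) model to each record.
                return record + " scored with " + model.length + "-byte model";
            }
        });

        scored.print();
    }
}
```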

How to modify the configuration of Kafka to process large amount of data

I am using kafka_2.10-0.10.0.1. I have two questions:
- I want to know how I can modify the default configuration of Kafka to process large amounts of data with good performance.
- Is it possible to configure Kafka to process the records in memory, without storing them on disk?
thank you
Is it possible to configure Kafka to process the records in memory without storing them on disk?
No. Kafka is all about storing records reliably on disk, and then reading them back quickly off of disk. In fact, its documentation says:
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
You can read more about its design here: https://kafka.apache.org/documentation/#design. The implementation section is also quite interesting: https://kafka.apache.org/documentation/#implementation.
That said, Kafka is also all about processing large amounts of data with good performance. In 2014 it could handle 2 million writes per second on three cheap instances: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines. More links about performance:
https://docs.confluent.io/current/kafka/deployment.html
https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
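On the first question, the biggest wins for large volumes usually come from producer batching and compression. Here is a hedged sketch of a throughput-oriented producer configuration in Java; the values are illustrative starting points rather than recommended defaults, and broker-side tuning is covered in the links above.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Throughput-oriented settings (illustrative values, tune for your workload):
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);          // larger batches per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);              // wait briefly to fill batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // compress batches on the wire
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);   // 64 MB of client-side buffering
        props.put(ProducerConfig.ACKS_CONFIG, "1");                  // trade some durability for latency

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("quotes", "key", "value"));
        }
    }
}
```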

Read from mongodb without lock

We're using MongoDB 2.2.0 at work. The DB contains about 51GB of data (at the moment) and I'd like to do some analytics on the user data that we've collected so far. The problem is that it's the live machine and we can't afford another slave at the moment. I know MongoDB has a read lock which may affect any writes that happen, especially with complex queries. Is there a way to tell MongoDB to treat my (particular) query with the lowest priority?
In MongoDB reads and writes do affect each other. Read locks are shared, but read locks block write locks from being acquired, and of course no other reads or writes can happen while a write lock is held. MongoDB operations yield periodically to keep threads that are waiting for locks from starving. You can read more about the details in the MongoDB concurrency documentation.
What does that mean for your use case? Because there is no way to tell MongoDB to access the data without a read lock, nor is there a way to prioritize requests (at least not yet), whether the reads significantly affect the performance of your writes depends on how much "headroom" you have available while write activity is going on.
One suggestion I can make is, when figuring out how to run analytics, rather than scanning the entire data set (i.e. doing an aggregation query over all historical data), try running smaller aggregation queries on short time slices. This will accomplish two things:
read jobs will be shorter-lived and therefore finish quicker, which gives you a chance to assess what impact the queries have on your "live" performance.
you won't be pulling all of the old data into RAM at once - by spacing out these analytical queries over time you minimize their impact on current write performance.
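A sketch of the time-sliced approach, assuming a current MongoDB Java driver; the database, collection, field names and dates are made up for illustration:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.Date;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;

public class SlicedAnalytics {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("app").getCollection("userEvents");

            Instant sliceStart = Instant.parse("2024-01-01T00:00:00Z");
            Instant end = Instant.parse("2024-01-08T00:00:00Z");

            // One small aggregation per day instead of one query over all history.
            while (sliceStart.isBefore(end)) {
                Instant sliceEnd = sliceStart.plus(1, ChronoUnit.DAYS);

                for (Document partial : events.aggregate(Arrays.asList(
                        Aggregates.match(Filters.and(
                                Filters.gte("createdAt", Date.from(sliceStart)),
                                Filters.lt("createdAt", Date.from(sliceEnd)))),
                        Aggregates.group("$userId", Accumulators.sum("events", 1))))) {
                    mergeIntoReport(partial);   // merge per-day partial results off-line
                }

                sliceStart = sliceEnd;
            }
        }
    }

    private static void mergeIntoReport(Document partial) {
        // Hypothetical: accumulate the per-day partial results into the final report.
    }
}
```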
Depending on what exactly you can't afford about getting another server, you might consider a short-lived AWS instance: it need not be very powerful, but it would be available to run a long analytical query against a copy of your data set. Just be careful when making the copy; doing a full sync off the production system will place a heavy load on it (a more effective approach is to resume from a recent backup or file snapshot).
Such operations are best left to slaves of a replica set. For one thing, read locks can be shared to allow many reads at once, but write locks will block reads. And, while you can't prioritize queries, MongoDB yields long-running read/write queries. The concurrency docs should help.
If you can't afford another server, you can set up a slave on the same machine, provided you have some spare RAM/disk headroom and you use the slave lightly/occasionally. You must be careful, though: your disk I/O will increase significantly.

Use of MSMQ to control SQL write operations

Can I use MSMQ to reduce the number of synchronous write operations to a database and instead have the records written to the database every X minutes?
You can't reduce the number of write operations by queuing them, but you can use a message queue to cluster the writes together.
That might be a bit more efficient (by dint of sharing a single connection), and could also let you schedule the writes at a convenient time if you wanted to ('every X minutes' wouldn't do that, but you could perform the writes during low usage times).
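As a hedged sketch of that arrangement (shown in Java with a plain in-memory queue standing in for MSMQ, and made-up table/column names): producers enqueue records as they arrive, and a scheduled consumer drains whatever is pending and writes it in a single batched transaction.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// The queue "clusters" the writes: producers enqueue records as they arrive,
// and a scheduled consumer drains everything pending and writes it in one
// transaction over a single connection. Table/column names and the JDBC URL
// are placeholders.
public class ScheduledBatchWriter {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void enqueue(String payload) {
        pending.offer(payload);               // producers return immediately
    }

    public void start() {
        scheduler.scheduleAtFixedRate(this::flush, 5, 5, TimeUnit.MINUTES);
    }

    private void flush() {
        List<String> batch = new ArrayList<>();
        pending.drainTo(batch);
        if (batch.isEmpty()) return;

        String sql = "INSERT INTO audit_records (payload) VALUES (?)";
        try (Connection conn = DriverManager.getConnection("jdbc:your-db-url", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);
            for (String payload : batch) {
                ps.setString(1, payload);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();                    // one transaction per flush interval
        } catch (Exception e) {
            // In a real system the failed batch would go back to the queue or a dead letter.
            e.printStackTrace();
        }
    }
}
```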
The increased complexity of that arrangement will normally outweigh the benefits - what do you really want to achieve?