What should be the data structure to store big OHLCV data (exchange example)? - mongodb

Exchanges and sites like TradingView.com respond quickly with data for different intervals: 15m, 30m, 1h, 4h and more. I have two questions about working with this data:
Do they generate the larger timeframes on the fly from smaller ones, or do they store all intervals in the DB?
What is the best NoSQL data structure for storing OHLCV data for high performance and easy data manipulation?
Example structure:
{
  symbol: "BTCUSDT",
  data: [
    {
      ts: 1,
      o: 1,
      h: 2,
      l: 0.5,
      c: 1.5
    }
  ]
}
What can I try next?
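To make the proposed layout concrete, here is a minimal PyMongo sketch of writing and reading a document shaped like the example above; the database/collection names (market, ohlcv) and the second candle's values are assumptions for illustration only:
from pymongo import MongoClient

db = MongoClient()["market"]  # database name is an assumption
# store one document per symbol with the candles embedded in "data"
db.ohlcv.insert_one({
    "symbol": "BTCUSDT",
    "data": [
        {"ts": 1, "o": 1, "h": 2, "l": 0.5, "c": 1.5}
    ]
})
# append another candle to the embedded array (values are made up)
db.ohlcv.update_one(
    {"symbol": "BTCUSDT"},
    {"$push": {"data": {"ts": 2, "o": 1.5, "h": 1.8, "l": 1.2, "c": 1.6}}}
)
# read the whole series for one symbol
series = db.ohlcv.find_one({"symbol": "BTCUSDT"})["data"]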

Most efficient MongoDB find() query for very large data

I am trying to implement a research system for a scientific study in 2D astronomical coordinate systems. On one hand I have a process that generates a lot of geospatial data, which we can think of as being organized into documents with two coordinates and a string value. On the other hand I have a small set of data with only one coordinate, which must match at least one of the two coordinates of the generated data.
Simplifying, I have organized data in two collections:
collA: more than 50GB (continuously increasing)
collB: avg. size ~2GB (constant over time)
Where:
The schema of a document in collA is:
{
terrainType: 'myType00001',
lat: '000000123',
lon: '987000000'
},
{
terrainType: 'myType00002',
lat: '000000124',
lon: '987000000'
},
{
terrainType: 'myType00003',
lat: '000000124',
lon: '997000000'
}
Please note that, first of all, we created indexes to avoid a COLLSCAN. There are two indexes on collA: __lat_idx (unique) and __long_idx (unique). I can guarantee that the generation process cannot produce duplicates in the lat or lon columns (as seen above, lat and lon have nine digits, but that is only for simplicity... in the real case these values are extremely large).
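For reference, a PyMongo sketch of what creating those two unique indexes could look like (index names and fields taken from above; the connection and database name are assumptions):
from pymongo import MongoClient, ASCENDING

coll_a = MongoClient()["research"]["collA"]  # database name is an assumption
coll_a.create_index([("lat", ASCENDING)], unique=True, name="__lat_idx")
coll_a.create_index([("lon", ASCENDING)], unique=True, name="__long_idx")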
The schema of a document in collB is:
{
latOrLon: '0045600'
},
{
latOrLon: '0045622'
},
{
latOrLon: '1145600'
}
I tried some different query strategies.
Strategy A
let cursor = collB.find() // Should this load the entire 2GB into memory?
cursor.forEach(c => {
  collA.find({lat: c.latOrLon})
  collA.find({lon: c.latOrLon})
})
This takes two mongo calls for each document in collB and is extremely slow.
Strategy B
let cursor = collB.find() // Should this load the entire 2GB into memory?
cursor.forEach(c => {
  collA.find({$expr: {$or: [{$eq: ['$lat', c.latOrLon]}, {$eq: ['$lon', c.latOrLon]}]}})
})
This takes one mongo call for each document in collB; faster than A, but still slow.
Strategy C
CHUNK_SIZE = 5000
batch_cursor = collB.find({}, {'_id': 0}, batch_size=CHUNK_SIZE)
chunks = yield_rows(batch_cursor, CHUNK_SIZE)  # yield_rows defined below

for chunk in chunks:
    docs = []
    for doc in chunk:
        docs.append(doc['latOrLon'])
    res = collA.find({
        '$or': [
            {'lon': {'$in': docs}},
            {'lat': {'$in': docs}}
        ]
    })
This first takes 5000 documents from collB, puts their latOrLon values into an array, and sends a single find() query: good efficiency, one mongo call for every 5000 documents in collB.
This solution is very fast, but I noticed that it becomes slower as collA increases in size. Why? Since I am using indexes, looking up an indexed value should cost O(1) in terms of computation time... for example, when collA was around 25GB in size, a full run took roughly 30 minutes. Now the collection is 50GB and it is taking more than 2 hours.
We expect the DB to reach at least 5TB next month, and this will become a problem.
I am asking the university about the possibility of parallelizing this job using MongoDB sharding, but that will not be immediate. I am looking for a temporary solution until we can parallelize this job.
Please note that we tried more than these three strategies, and we mixed Python 3 and NodeJS.
Definition of yield_rows:
def yield_rows(batch_cursor, chunk_size):
    """
    Generator to yield chunks from a cursor.
    :param batch_cursor:
    :param chunk_size:
    :return:
    """
    chunk = []
    for i, row in enumerate(batch_cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk
You can just index the field you are querying on (it doesn't have to be unique); that way only the matching data is returned and MongoDB won't scan the entire collection.
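As a quick way to verify that suggestion from PyMongo, explain() should report an IXSCAN rather than a COLLSCAN once the index is in place (a sketch; the database name and sample value are assumptions):
from pymongo import MongoClient

coll_a = MongoClient()["research"]["collA"]  # database name is an assumption
plan = coll_a.find({"lat": "000000123"}).explain()
print(plan["queryPlanner"]["winningPlan"])  # expect an IXSCAN stage, not COLLSCAN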

Can MongoDB be used to save this format of document, or should something else be used?

I am asking whether MongoDB can be used for the data format below.
**Header**
GSGT Version 1.9.4
Processing Date 12/28/2016 4:07 PM
Content GSAMD-24v1-0_20011747_A1.bpm
Num SNPs 700078
Total SNPs 700078
Num Samples 44
Total Samples 48
File 1 of 44
**Data**
Sample ID SNP Name Allele1 - Plus Allele2 - Plus Allele1 - AB Allele2 - AB
B01 1:100292476 A A A A
B01 1:101064936 A A A A
B01 1:103380393 G G B B
B01 1:104303716 G G B B
B01 1:104864464 C C B B
B01 1:106737318 T T A A
B01 1:109439680 A A A A
...
The above is one data record, and I am going to have millions of such records. I want to find a good database for storing this kind of data, and MongoDB is the one I want to use. One such record would be saved as one document, and the whole dataset would go into a single collection. Below is a description of the data structure.
One record consists of two parts, a header and the data. The data part usually has about 700,000 lines. In order to save it into MongoDB I propose converting the format to JSON documents like this:
{ "header":{
"GSGT Version": "1.9.4",
"Processing Date" : "12/28/2016 4:07 PM",
...
},
"data" : [{
"Sample ID": "B01",
"SNP Name": "1:100292476",
"Allele1 - Plus" : "A",
...
},{
"Sample ID": "B01",
"SNP Name": "1:100292476",
"Allele1 - Plus" : "A",
...
}
...
}
Since the data part has 700,000 lines I am not confident about this design. What is a reasonable number of nested elements in one document? If I save such a record as one document, is that good for querying and saving? Or should I split this data into two collections? Or is there another database better suited than MongoDB for this structure?
Yes, you can.
MongoDB supports rich documents.
You can also index them for faster retrieval and embed them so you can query specific parts.
But you have to make sure that an individual document (record) does not grow beyond 16 megabytes.
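Given the 16-megabyte limit, a rough PyMongo sketch of the two-collection split mentioned in the question might look like this; the collection names (headers, genotypes), the file_id link, and the database name are assumptions for illustration:
from pymongo import MongoClient, ASCENDING

db = MongoClient()["snp_db"]  # database name is an assumption

# one small document per file header
header_id = db.headers.insert_one({
    "GSGT Version": "1.9.4",
    "Processing Date": "12/28/2016 4:07 PM",
    "Num SNPs": 700078,
    "Num Samples": 44
}).inserted_id

# one document per data line, linked back to its header
db.genotypes.insert_many([
    {"file_id": header_id, "Sample ID": "B01", "SNP Name": "1:100292476",
     "Allele1 - Plus": "A", "Allele2 - Plus": "A"},
    {"file_id": header_id, "Sample ID": "B01", "SNP Name": "1:103380393",
     "Allele1 - Plus": "G", "Allele2 - Plus": "G"}
])
db.genotypes.create_index([("file_id", ASCENDING), ("SNP Name", ASCENDING)])
This keeps every document small, at the cost of one extra lookup to fetch the header.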

Mongo Query - Number of Constraints vs Speed (and Indexing!)

Let's say I have a million entries in the db, each with 10 fields ("columns"). It seems to me that the more columns I search by, the faster the query goes. For example:
db.items.find( {
$and: [
{ field1: x },
{ field2: y },
{ field3: z}
]
} )
is faster than:
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
} )
While I would love to say "Great, this makes total sense to me" - it doesn't. I just know it's happening in my particular case, and I am wondering whether this is actually always true. If so, ideally, I would like to know why.
Furthermore, when creating multi-field indexes, does it help to have the fields in any particular order? For example, let's say I add a compound index:
db.collection.ensureIndex( { field1: 1, field2: 1, field3: 1 } )
Do these have any sort of order? If yes, does the order matter? Let's say 90% of items will match the field1 criteria, but only 1% of items will match the field3 criteria. Would ordering them make some sort of difference?
It may simply be that the more restrictive query returns fewer documents, since 90% of items match the field1 criteria and only 1% match the field3 criteria. Check what explain() says for both queries.
Mongo has quite a good profiler. Give it a try. Play with different indexes and different queries - not on a production db, of course.
The order of fields in the index matters. If you have an index { field1: 1, field2: 1, field3: 1 }
and a query db.items.find( { field2: x, field3: y }), the index won't be used at all,
and for the query db.items.find( { field1: x, field3: y }) it can only be used partially, for field1.
On the other hand, the order of conditions in the query does not matter:
db.items.find( { field1: x, field2: y }) is as good as
db.items.find( { field2: y, field1: x }) and will use the index in both cases.
When choosing an indexing strategy you should examine both the data and the typical queries. It may be the case that index intersection works better for you, and instead of a single compound index you get better overall performance with simple indexes like { field1: 1 }, { field2: 1 }, { field3: 1 }, rather than multiple compound indexes for different kinds of queries.
It is also important to check the index size so that it fits in memory - in most cases, anyway.
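As a sketch of trying both approaches from PyMongo (the items collection is from the question; the database name is an assumption, and which plan wins depends entirely on your data):
from pymongo import MongoClient, ASCENDING

items = MongoClient()["test"]["items"]  # database name is an assumption

# option 1: simple single-field indexes the planner may intersect
items.create_index([("field1", ASCENDING)])
items.create_index([("field2", ASCENDING)])
items.create_index([("field3", ASCENDING)])

# option 2: one compound index matching the common query shape
items.create_index([("field1", ASCENDING), ("field2", ASCENDING), ("field3", ASCENDING)])

# compare the winning plans for the two queries from the question
for query in ({"field1": 1, "field2": 2}, {"field1": 1, "field2": 2, "field3": 3}):
    print(query, "->", items.find(query).explain()["queryPlanner"]["winningPlan"])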
It's complicated... MongoDB keeps recently accessed documents in RAM, and the query plan is calculated the first time a query is executed, so the second time you run a query may be much faster than the first time.
But, putting that aside, the order of a compound index does matter. In a compound index, you can use the index in the order it was created, a bit like opening a door, walking through and finding that you have more doors to open.
So, having two overlapping indexes set up, e.g.:
{ city: 1, building: 1, room: 1 }
AND
{ city: 1, building: 1 }
Would be a waste, because you can still search for all the rooms in a particular building using the first two levels (fields) of the "{ city: 1, building: 1, room: 1 }" index.
Your intuition does make sense. If you had to find a particular room in a building, going straight to the right city, straight to the right building and then knowing the approximate place in the building will make it faster to find the room than if you didn't know the approximate place (assuming that there are lots of rooms). Take a look at levels in a B-Tree, e.g. the search visualisation here: http://visualgo.net/bst.html
It's not universally the case though, not all data is neatly distributed in a sort order - for example, English names or words tend to clump together under common letters - there's not many words that start with the letter X.
The (free, online) MongoDB University developer courses cover indexes quite well, but the best way to find out about the performance of a query is to look at the results of the explain() method against your query to see if an index was used, or whether a collection was scanned (COLLSCAN).
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
})
.explain()

PyMongo updating array records with calculated fields via cursor

Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
    "_id" : {
        "clienta" : NumberLong(460011766),
        "clientb" : NumberLong(2886729962)
    },
    "states" : [
        [
            "fixed", "fixed.rotated", "fixed.rotated.off"
        ]
    ],
    "VBPP" : [
        244,
        182,
        184,
        11,
        299
    ],
    "PPF" : 72.4
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (length and variance of an array) with PyMongo before converting to arrays is as follows:
import numpy as np

records_list = []
cursor = db.clientAgg.find({}, {'_id': 0,
                                'states': 1,
                                'VBPP': 1,
                                'PPF': 1})
for record in cursor:
    records_list.append(record)
for dicts in records_list:
    dicts['states'] = len(dicts['states'])
    dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing 500k dictionaries into memory to modify them before converting them to arrays for a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor, with variants of the following, with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)

def iter():
    for item in cursor:
        yield item

l = []
for x in iter():
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id': x['_id']}, {'$set': {'x.VBPS': somefunction as above}}, upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip which will improve your performance.
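For what it's worth, once that bulk API is available the batched client-side version could look roughly like this (a sketch, not the author's code; the database name and the batch size of 1000 are assumptions):
import numpy as np
from pymongo import MongoClient, UpdateOne

db = MongoClient()["test"]  # database name is an assumption
ops = []
for doc in db.clientAgg.find({}, {"states": 1, "VBPP": 1}):
    ops.append(UpdateOne(
        {"_id": doc["_id"]},
        {"$set": {"states": len(doc["states"]),
                  "VBPP": float(np.var(doc["VBPP"]))}}))
    if len(ops) == 1000:
        db.clientAgg.bulk_write(ops, ordered=False)  # one round-trip per 1000 updates
        ops = []
if ops:
    db.clientAgg.bulk_write(ops, ordered=False)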
Addendum:
I have two other ideas. First, run some Javascript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
var collection = db.test_collection;
collection.find().forEach(function(doc) {
var b = 2 * doc.a;
collection.update({_id: doc._id}, {$set: {b: b}});
});
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
$project: {
a: '$a',
b: {$multiply: [2, '$a']}
}
}, {
$out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
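Driven from PyMongo instead of the shell, that same $project/$out pipeline might look like this (a sketch using the same hypothetical collection names):
from pymongo import MongoClient

db = MongoClient()["test"]  # database name is an assumption
db.test_collection.aggregate([
    {"$project": {"a": "$a", "b": {"$multiply": [2, "$a"]}}},
    {"$out": "test_collection2"}
])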
My final thought on this, is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.

What is the performance implications of nested hashes in mongodb?

So I have a very large set of metrics (15GB and growing) that has some of the data in nested hashes. Like so:
{
_id: 'abc0000',
type: 'foo',
data: { a: 20, b: 30, c: 3 }
},
... more data following this schema...
{
_id: 'abc5000',
type: 'bar',
data: { a: 1, b: 2, c: 4, d: 10 }
}
What are the performance implications when I run a query on the nested hashes? The data inside the hash can't be indexed... or rather, it would be pointless to index it.
I can always reshape the data into a flat style: data_a, data_b, etc...
You can create indexes on attributes in nested hashes. Take a look at Indexing with dot notation for more details. You can also create compound indexes if you need them, but be careful about the caveats with parallel arrays. Basically, if you create a compound index, only one of the indexed values can be an array; however, that shouldn't affect you (judging from the posted schema).
So you can create indexes on data.a, data.b or data.a, data.c as per your needs.
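For illustration, the dot-notation indexes described above could be created from PyMongo like this (the metrics collection and database names are assumptions):
from pymongo import MongoClient, ASCENDING

metrics = MongoClient()["test"]["metrics"]  # names are assumptions
metrics.create_index([("data.a", ASCENDING)])                          # single nested field
metrics.create_index([("data.a", ASCENDING), ("data.c", ASCENDING)])   # compound on nested fields
# queries on the nested fields can then use these indexes
docs = list(metrics.find({"data.a": 20, "data.c": 3}))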