Suppose I have several million statements in my PostgreSQL database and I want to get only 10000 of them. But not the first 10000; rather, a random selection of 10000 (ideally I could also choose the logic, e.g. select every 4th statement).
How could I do this using Prisma, or — if it's not possible using Prisma — using a good old PostgreSQL request?
For now, I'm using this code to limit the number of results I'm getting:
const statements = await this.prisma.statement.findMany({
  where: {
    OR: conditions,
  },
  orderBy: {
    createdAt: 'asc',
  },
  take: 10000,
});
This uses my conditions, orders the results in ascending order, and "takes" (limits to) the first 10000 results.
What could I use in place of take, or what query could I run directly in PostgreSQL, to randomly sample my database for records?
Prisma doesn't natively support fetching random data as of now.
There is a Feature Request that discusses exactly the scenario you need.
The alternative is to use queryRaw for raw database access and use PostgreSQL's random() function, as described in the above-mentioned Feature Request.
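For example, a minimal sketch using Prisma's raw query API ($queryRaw in current Prisma versions), assuming the model above maps to a "Statement" table with a "createdAt" column (adjust the names to your schema):

const randomSample = await this.prisma.$queryRaw`
  SELECT * FROM "Statement"
  ORDER BY random()
  LIMIT 10000
`;

// "every 4th statement" style sampling via a window function
const everyFourth = await this.prisma.$queryRaw`
  SELECT *
  FROM (
    SELECT s.*, row_number() OVER (ORDER BY "createdAt") AS rn
    FROM "Statement" s
  ) numbered
  WHERE rn % 4 = 0
  LIMIT 10000
`;

Note that ORDER BY random() has to sort the whole (filtered) table, so on tables with millions of rows PostgreSQL's TABLESAMPLE clause is usually the cheaper option for approximate sampling.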
Related
Currently I have something as below.
Collection1 - system
{
  _id: system_id,
  ... system fields
  system_name: ,
  system_site: ,
  system_group: ,
  ...
  device_errors: [1, 2, 3, 4, 5, 6, 7]
}
I have 2K unique error codes.
I have an error collection as below.
{
_id: error_id,
category,
impact,
action,
}
I have a use case where each system|burt combination can have a unique error_description, because the error contains some system-specific data.
I am not sure how to handle this scenario.
One system can have many errors.
One error can be part of multiple systems.
Now, how do I maintain the details of a burt that are unique to a system? I thought of having a nested field instead of an array in the system collection, but I am worried about scalability.
Any suggestions?
system1|burt1
  error_desc: unique to system1
system2|burt1
  error_description: unique to system2
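Concretely, the separate collection could hold one document per system|burt combination, something like this (the field names here are just an illustration, not settled schema):

{
  system_id: "system1",
  burt_id: "burt1",
  error_id: 1,
  error_description: "description specific to system1 + burt1"
}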
If I store it like this in a separate collection, the API has to make three calls and assemble the response:
1. Find all errors for a set of systems
2. Find the top 50 burts from step 1
3. For those top 50 burts, find the error descriptions
Then combine all three responses and reply to the user?
I don't think this is ideal, since we would have to make three data-source calls to serve a single request.
I have already tried a flattened structure with redundant data:
{
  ... system1_info
  ... error1_info
},
{
  ... system2_info
  ... error1_info
},
{
  ... system1_info
  ... error2_info
},
{
  ... system10_info
  ... error1200_info
}
Here I am using several aggregation stages in a single query (a rough sketch follows the list below):
1. Match
2. Group error
3. Sort
4. Total count of errors (another group)
5. Project
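A rough sketch of that pipeline, assuming the flattened documents live in a collection called system_errors with system_id and error_id fields (both names are assumptions):

db.system_errors.aggregate([
  { $match: { system_id: { $in: systemIds } } },               // 1. Match
  { $group: { _id: "$error_id", count: { $sum: 1 } } },        // 2. Group by error
  { $sort: { count: -1 } },                                    // 3. Sort
  { $group: { _id: null, total: { $sum: 1 },                   // 4. Total count of errors
              errors: { $push: "$$ROOT" } } },
  { $project: { _id: 0, total: 1,                              // 5. Project, slicing to 100
                errors: { $slice: ["$errors", 100] } } }       //    for pagination
])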
I feel this is a heavier query than approach 1 (the actual question above).
Let's say I have 2k errors and 20 million systems = 40 million docs in total.
In the worst case each system has 2k errors, and my query has to support more than one system. Say I have to query for 25k systems:
25k systems * 2k errors => match result
Apply all of the operations mentioned above
Then slice to 100 (for pagination)
If I go with a relational-style model without redundancy, I fetch the 25k systems and then query only the 2k errors, which is far less work than the aggregation above.
Presumably the set of possible errors does not change very frequently. Cache it in the application.
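Something like this, for instance, assuming a Node.js app using the official mongodb driver, a database named mydb, and an errors collection of roughly 2k documents (all of these names are assumptions):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017'); // assumed connection string
const errorCache = new Map();                                // error_id -> { category, impact, action, ... }

async function refreshErrorCache() {
  const errors = await client.db('mydb').collection('errors').find({}).toArray();
  errorCache.clear();
  for (const e of errors) {
    errorCache.set(String(e._id), e);
  }
}

async function start() {
  await client.connect();
  await refreshErrorCache();
  // refresh every 10 minutes; the interval is an arbitrary choice
  setInterval(() => refreshErrorCache().catch(console.error), 10 * 60 * 1000);
}

Requests then only hit the system collection and join the cached error details in memory.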
I have a large table where the columns are user_id, user_feature_1, user_feature_2, ...., user_feature_n
So each row corresponds to a user and his or her features.
I stored this table in MongoDB by storing each column's values as an array, e.g.
{
  'name': 'user_feature_1',
  'values': [
    15,
    10,
    ...
  ]
}
I am using Meteor to pull data from MongoDB, and this way of storage facilitates fast and easy retrieval of the whole column's values for graph plotting.
However, this way of storing has a major drawback: I can't store arrays larger than 16 MB.
There are a couple of possible solutions, but none of them seems good enough:
Store each column's values using GridFS. I am not sure if Meteor supports GridFS, and it lacks support for slicing the data, i.e., I may need to get just the top 1000 values of a column.
Store the table in row-oriented format, e.g.
{
  'user_id': 1,
  'user_feature_1': 10,
  'user_feature_2': 0.9,
  ...
  'user_feature_n': 42
}
But I think this way of storing the data is inefficient for querying a single feature column's values.
Or is MongoDB not suitable at all, and SQL is the way to go? But Meteor does not support SQL.
Update 1:
I found this interesting article, which explains why large arrays in MongoDB are inefficient: https://www.mongosoup.de/blog-entry/Storing-Large-Lists-In-MongoDB.html
The following explanation is from http://bsonspec.org/spec.html:
Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
This means that we can store at most about 1 million values in a document if the values and keys are floats (16 MB / 128 bits).
There is also a third option. A separate document for each user and feature:
{ u:"1", f:"user_feature_1", v:10 },
{ u:"1", f:"user_feature_2", v:11 },
{ u:"1", f:"user_feature_3", v:52 },
{ u:"2", f:"user_feature_1", v:4 },
{ u:"2", f:"user_feature_2", v:13 },
{ u:"2", f:"user_feature_3", v:12 },
You will have no document growth problems and you can query both "all values for user x" and "all values for feature x" without also accessing any unrelated data.
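For example (the collection name features is an assumption), both access patterns can be covered with an index each, and "top N values of a column" becomes an ordinary query with a limit:

// all features for one user, and all values of one feature
db.features.createIndex({ u: 1, f: 1 })
db.features.createIndex({ f: 1, u: 1 })

// first 1000 values of one column
db.features.find({ f: "user_feature_1" }, { _id: 0, u: 1, v: 1 }).limit(1000)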
16 MB / 64-bit float = 2,000,000 uncompressed data points. What kind of graph requires a minimum of 2 million points per column? Instead, try:
Saving a picture on an s3 server
Using a map-reduce solution like hadoop (probably your best bet)
Reducing numbers to small ints if they're currently floats
Computing the data on the fly, on the client (preferred, if possible)
Using a compression algo so you can save a subset & interpolate the rest
That said, a document-based DB would outperform a SQL DB in this use case, because a SQL DB would do exactly what Philipp suggested. Either way, you cannot send multiple 16 MB files to a client; if the client doesn't leave you over the poor UX, you'll go broke on server costs :-).
I have some simple transaction-style data in a flat format like the following:
{
  'giver': 'Alexandra',
  'receiver': 'Julie',
  'amount': 20,
  'year_given': 2015
}
There can be multiple entries for any 'giver' or 'receiver'.
I am mostly querying this data based on the giver field, and then split up by year. So, I would like to speed up these queries.
I'm fairly new to Mongo so I'm not sure which of the following methods would be the best course of action:
1. Restructure the data into the format:
{
  'giver': 'Alexandra',
  'transactions': {
    '2015': [
      {
        'receiver': 'Julie',
        'amount': 20
      },
      ...
    ],
    '2014': ...,
    ...
  }
}
This makes the most sense to me. We place all transactions into subdocuments of a person rather than scattering transactions all over the collection. It stores the data in the form I query it by most, so it should be fast to query by 'giver' and then by 'transactions.year'.
I'm unsure whether restructuring data like this is possible inside of Mongo, or whether I should export it and modify it outside of Mongo with some programming language.
2. Simply index by 'giver'
This doesn't quite match the way I'm querying this data (by 'giver', then 'year'), but it could be fast enough for what I'm looking for. It's simple to do within Mongo and doesn't require restructuring the data.
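Concretely, option 2 would just be a one-liner (the collection name transactions is an assumption here), and it could be extended to a compound index that also covers the per-year split:

db.transactions.createIndex({ giver: 1 })

// or, matching the "by giver, then by year" pattern more closely
db.transactions.createIndex({ giver: 1, year_given: 1 })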
How should I go about adjusting my database to make my queries faster? And which way is the 'Mongo way'?
I have browsed through various examples but have failed to find what I am looking for. What I want is to fetch a specific document by _id and skip over multiple ranges of its embedded array in a single query, or some alternative that is fast enough for my case.
The following query would skip the first comment and return the second:
db.posts.find( { "_id" : 1 }, { comments: { $slice: [ 1, 1 ] } } )
That is: skip comment 0, return comment 1, and leave the rest out of the result.
But what if there were, say, 10000 comments and I wanted to apply the same pattern repeatedly, returning the array values like this:
skip 0, return 1, skip 2, return 3, skip 4, return 5
That would return a document whose comments array has 5000 elements, because half of them are skipped. Is this possible? I used a large number like 10000 because I fear that running multiple queries to do this would not be good performance-wise (an example is shown here: multiple queries to accomplish something similar). Thanks!
I went through several resources and concluded that this is currently impossible to do with one query. Instead, I settled on the view that there are only two options to overcome this problem:
1.) Make a loop of some sort and run several $slice queries while increasing the position of the slice, similar to the resource I linked:
var skip = NUMBER_OF_ITEMS * (PAGE_NUMBER - 1)
db.companies.find({}, { comments: { $slice: [skip, NUMBER_OF_ITEMS] } })
However, depending on the data, I would not want to run 5000 individual queries just to get half of the array contents, so I decided to use option 2.), which seems relatively fast and acceptable performance-wise.
2.) Make a single query by _id for the document you want, and before returning the results to the client (or to some other part of your code), skip the unwanted array items with a for loop and then return the result. I did this on the Java side, since I talk to Mongo via Morphia. I also ran explain() on the query and saw that returning a single document with a 10000-item array by _id is so fast that speed wasn't really an issue; I suspect repeated $slice skips would only be slower.
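For reference, a JavaScript version of option 2 might look something like this (the original was done on the Java side via Morphia; the collection and field names follow the question above, and the connection string is an assumption):

const { MongoClient } = require('mongodb');

async function everyOtherComment(postId) {
  const client = new MongoClient('mongodb://localhost:27017'); // assumed connection string
  await client.connect();
  try {
    const post = await client.db('mydb').collection('posts').findOne({ _id: postId });
    // "skip 0, return 1, skip 2, return 3, ...": keep the odd-indexed comments
    return post.comments.filter((_, idx) => idx % 2 === 1);
  } finally {
    await client.close();
  }
}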
I am moving our messaging system to MongoDB and am curious what approach to take for various stats, like the number of messages per user, etc. In the MS SQL database I have a table with different counts per user, updated by triggers on the corresponding tables, so I can, for example, know how many unread messages UserA has without running an expensive SELECT COUNT(*).
Is the count function in MongoDB also expensive?
I started reading about map/reduce, but my site is high-load, so statistics have to update in real time, and my understanding is that map/reduce is a time-consuming operation.
What would be the best (performance-wise) approach on gathering various aggregate counts in MongoDB?
If you've got a lot of data, then I'd stick with the same approach and increment an aggregate counter whenever a new message is added for a user, using a collection something like this:
counts
{
  userid: 123,
  messages: 10
}
Unfortunately (or fortunately?) there are no triggers in MongoDB, so you'd increment the counter from your application logic:
db.counts.update( { userid: 123 }, { $inc: { messages: 1 } } )
This'll give you the best performance, and you'd probably also put an index on the userid field for fast lookups:
db.counts.ensureIndex( { userid: 1 } )
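One small addition to the answer above: if the counter document might not exist yet, an upsert creates it on the first increment (the upsert flag is my addition, not part of the original answer):

db.counts.update(
  { userid: 123 },
  { $inc: { messages: 1 } },
  { upsert: true }
)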
MongoDB is a good fit for data denormalization. And if your site is high-load, then you need to precalculate almost everything, so use $inc to increment the message count, no doubt.