Is there any limit in Entity Framework on data size or number of records, either transaction-wise or context-wise?

I want to know if there are any limitations on EF transactions or the context in terms of the size of data involved or the number of records dealt with.
I have around 50k records that need to be updated, and this is done by calling SaveContext for every record. Looping through all 50k, I found that some records were not getting updated.

Related

Is it bad to have a chunk size of just 1 in Spring Batch?

I have to process a file that has records with the same ID and different dates. If a specific ID has multiple records with different dates, they all have to be summed. Currently, my solution is to write with a chunk size of one and let a SQL query do the summation, because I don't have a way to know whether multiple entries with the same ID land in the same chunk. Is there a huge performance effect of doing it this way, especially since I am working with 100k worth of data?
Yes, this could impact the performance of your step, since each item will be processed in its own transaction. With 100k items you would have 100k transactions, whereas with chunk-size=1000, for example, you would have only 100 transactions.
The chunk-oriented processing model is not really suitable to what you are trying to do, as items with the same ID could span different chunks. A common technique for this kind of requirement is to load your data in a temporary table (which could be a very fast step if done against sqlite for example) and then run your aggregation SQL query against that table.
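To make the chunk-size point concrete, here is a minimal sketch of a chunk-oriented step configured with a chunk size of 1000; the FileRecord item type and the reader/writer beans are placeholders, not taken from the question:
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChunkConfig {

    // Placeholder item type standing in for one parsed line of the input file.
    public static class FileRecord {
        public long id;
        public String date;
        public double amount;
    }

    @Bean
    public Step loadStep(StepBuilderFactory steps,
                         ItemReader<FileRecord> reader,
                         ItemWriter<FileRecord> writer) {
        return steps.get("loadStep")
                // Items are read one at a time but written and committed 1000
                // at a time: 100k items -> ~100 transactions instead of 100k.
                .<FileRecord, FileRecord>chunk(1000)
                .reader(reader)
                .writer(writer)
                .build();
    }
}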

Limit the frequency with which Firestore retrieves data

I am using Swift and Firestore, and in my application I have a snapshot listener which retrieves data every time certain documents change. As I expect this to happen many times a second, I would like to limit the snapshot listener to retrieving data at most once every 2 seconds, say. Is this possible? I looked everywhere but could not find anything.
Cloud Firestore stores your data in multiple data centers, and only confirms a write operation once the data has been written to all of them. For this reason the maximum update frequency of a single document in Cloud Firestore is roughly once per second. So if your plan is to update a document many times per second, that won't work anyway.
There is no way to set a limit on how frequently Firestore broadcasts out updates to the underlying data. If the data gets updated, it is broadcast out to all active listeners.
The typical solution would be to limit how frequently you update the data. If nobody is going to see a significant chunk of the updates, you might as well not write them to the database. This sort of logic is often accomplished with a client-side throttle/debounce (see 1, 2).
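The throttle idea is language-agnostic; purely as an illustration (shown in Java here, with the class, method, and the writeToFirestore call being my own names, not Firestore API), a minimal time-based throttle could look like this. Note that a plain throttle drops intermediate updates, so if the final value must always reach the database you would want a trailing debounce instead:
import java.util.concurrent.atomic.AtomicLong;

/** Minimal time-based throttle: at most one action per window, extras are dropped. */
public class WriteThrottle {
    private final long windowMillis;
    private final AtomicLong lastRun = new AtomicLong(0);

    public WriteThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Runs the action only if the window has elapsed since the last accepted run. */
    public void submit(Runnable action) {
        long now = System.currentTimeMillis();
        long last = lastRun.get();
        if (now - last >= windowMillis && lastRun.compareAndSet(last, now)) {
            action.run();
        }
    }
}

// Usage (writeToFirestore is a hypothetical wrapper around your document update):
// WriteThrottle throttle = new WriteThrottle(2000);
// throttle.submit(() -> writeToFirestore(latestValue));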

Updating a large number of records in a collection

I have a collection called TimeSheet, which currently has a few thousand records. This will eventually grow to 300 million records in a year. In this collection I embed a few fields from another collection called Department, which mostly won't get any updates; only rarely will some records be updated. By rarely I mean only once or twice a year, and not all records, only less than 1% of the records in the collection.
Mostly, once a department is created there won't be any update; even if there is one, it will happen early on (when there are not many related records in TimeSheet).
Now if someone updates a department after a year, in the worst-case scenario the TimeSheet collection will have about 300 million records in total and about 5 million records matching the department that gets updated. The update query condition will be on an indexed field.
Since this update is time-consuming and creates locks, I'm wondering whether there is a better way to do it. One option I'm considering is to run the update query in batches by adding an extra condition like UpdatedDateTime > somedate && UpdatedDateTime < somedate (a sketch of this follows after the details below).
Other details:
A single document size could be about 3 or 4 KB
We have a replica set containing three replicas.
Is there any other, better way to do this? What do you think about this kind of design? And what do you think if the numbers I have given are smaller, as below?
1) 100 million total records and 100,000 matching records for the update query
2) 10 million total records and 10,000 matching records for the update query
3) 1 million total records and 1000 matching records for the update query
Note: The collection names Department and TimeSheet and their purpose are fictional (they are not the real collections), but the statistics I have given are true.
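For reference, the batched-update idea described above could look roughly like this with the MongoDB Java driver; the connection string, collection, field names, and values are assumptions for the sketch, and a real job would loop over successive date windows:
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import java.time.Instant;
import java.util.Date;

public class BatchedDepartmentUpdate {
    public static void main(String[] args) {
        MongoCollection<Document> timeSheet = MongoClients.create("mongodb://localhost")
                .getDatabase("app").getCollection("timeSheet");

        // One month-sized window; iterating window by window keeps each
        // updateMany bounded instead of touching all 5 million matches at once.
        Date from = Date.from(Instant.parse("2014-01-01T00:00:00Z"));
        Date to   = Date.from(Instant.parse("2014-02-01T00:00:00Z"));

        timeSheet.updateMany(
                Filters.and(
                        Filters.eq("departmentId", 42),            // indexed field
                        Filters.gte("UpdatedDateTime", from),
                        Filters.lt("UpdatedDateTime", to)),
                Updates.set("departmentName", "New department name"));
    }
}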
Let me give you a couple of hints based on my general knowledge and experience:
Use shorter field names
MongoDB stores the same key names in every document. This repetition increases disk usage and can cause performance issues on a very large database like yours (see the sketch after this list).
Pros:
Smaller documents, so less disk space
More documents fit in RAM (more caching)
Index size will be smaller in some scenarios
Cons:
Less readable names
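A minimal sketch of what abbreviated keys could look like (the field names and their long forms are made up for illustration):
import org.bson.Document;
import java.util.Date;

public class ShortFieldNames {
    // Keep the long->short mapping in one place in application code so the
    // abbreviated names never leak into business logic.
    public static Document timesheetEntry(int employeeId, int departmentId,
                                           Date start, Date end) {
        return new Document("eId", employeeId)   // employeeId
                .append("dId", departmentId)     // departmentId
                .append("st", start)             // startTime
                .append("et", end);              // endTime
    }
}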
Optimize on index size
The smaller an index is, the more of it fits in RAM and the fewer index misses occur. Consider the SHA-1 hashes of git commits, for example: a commit is often identified by just its first 5-6 characters. So simply store those 5-6 characters instead of the whole hash.
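A sketch of storing just a short, indexed prefix of a long hash (the values and field names are illustrative):
import org.bson.Document;

public class ShortHashKey {
    public static Document commitDocument(String sha1, String message) {
        // Index the short prefix instead of the full 40-character hash; like
        // git's abbreviated commit IDs, 5-6 characters are usually enough to
        // identify a value, and the index stays much smaller.
        String shortId = sha1.substring(0, 6);
        return new Document("cid", shortId).append("msg", message);
    }
}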
Understand padding factor
Updates that grow a document can cause a costly document move: the old document is deleted, the document is rewritten at a new, empty location, and the indexes are updated, all of which is expensive.
We need to make sure a document doesn't move when an update happens. Each collection has a padding factor, which determines how much extra space is allocated beyond the actual document size when a document is inserted.
You can see the collection padding factor using:
db.collection.stats().paddingFactor
Add a padding manually
In your case you are pretty sure to start with a small document that will grow. Updating your document after a while will cause multiple document moves, so it is better to add padding to the document up front. Unfortunately, there is no easy way to add padding: we can do it by adding some random bytes to a throw-away key during the insert and then deleting that key in the next update (see the sketch below).
Finally, if you are sure that some keys will be added to the documents in the future, preallocate those keys with default values so that later updates don't grow the document and cause document moves.
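A sketch of the manual-padding trick with the MongoDB Java driver (collection and field names are assumptions; this only matters for the older MMAPv1-style storage, where in-place growth causes document moves):
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ManualPadding {
    public static void main(String[] args) {
        MongoCollection<Document> col = MongoClients.create("mongodb://localhost")
                .getDatabase("app").getCollection("timeSheet");

        // Insert with a throw-away filler field so extra space is allocated
        // up front; size the filler to the growth you expect.
        col.insertOne(new Document("_id", 1)
                .append("departmentName", "initial")
                .append("pad", new byte[1024]));

        // On the first real update, drop the filler; the document can now
        // grow into the reclaimed space without being moved on disk.
        col.updateOne(Filters.eq("_id", 1),
                Updates.combine(
                        Updates.set("departmentName", "a much longer value set later"),
                        Updates.unset("pad")));
    }
}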
You can get details about the query causing document move:
db.system.profile.find({ moved: { $exists : true } })
Large number of collections vs. large number of documents in a few collections
The schema depends on the application requirements. If there is a huge collection in which we query only the latest N days of data, we can choose to keep that data in a separate collection and safely archive the old data. This helps make sure that caching in RAM is done properly.
Every collection created incurs a cost, which is more than the cost of creating a document. Each collection has a minimum size of a few KB plus one index (8 KB). Every collection has an associated namespace; by default we have some 24K namespaces. For example, having one collection per user is a bad choice since it is not scalable; after some point Mongo won't allow us to create new collections or indexes.
Generally having many collections has no significant performance penalty. For example, we can choose to have one collection per month, if we know that we are always querying based on months.
Denormalization of data
It's always recommended to keep all the related data for a query, or a sequence of queries, in the same disk location. You sometimes need to duplicate information across different documents to achieve this. For example, in a blog post you'll want to store the post's comments within the post document (sketched after this list).
Pros:
index size will be much smaller, since there are fewer index entries
queries that fetch all the necessary details will be very fast
document size becomes comparable to the page size, which means that when we bring this data into RAM we are mostly not dragging other data along with the page
a document move frees a whole page, not a tiny chunk of a page that may never be reused by later inserts
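A sketch of the embedded-comments shape mentioned above (field names and values are illustrative):
import org.bson.Document;
import java.util.Arrays;

public class EmbeddedComments {
    public static Document blogPost() {
        // Comments live inside the post document, so a single read returns
        // everything needed to render the page; no second query or join.
        return new Document("title", "Denormalization in MongoDB")
                .append("body", "post body ...")
                .append("comments", Arrays.asList(
                        new Document("user", "alice").append("text", "Nice post"),
                        new Document("user", "bob").append("text", "Agreed")));
    }
}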
Capped Collections
Capped collections behave like circular buffers. They are a special type of fixed-size collection that can sustain very high-speed writes and sequential reads. Being fixed size, once the allocated space is filled, new documents are written by deleting the oldest ones. However, document updates are only allowed if the updated document fits within the original document size (play with padding for more flexibility).
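Creating a capped collection might look like this with the Java driver (the collection name and size are placeholders):
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.CreateCollectionOptions;

public class CappedCollectionSetup {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create("mongodb://localhost").getDatabase("app");

        // Fixed-size collection: once ~100 MB is used, the oldest documents
        // are overwritten by new inserts (circular-buffer behaviour).
        db.createCollection("recentEvents",
                new CreateCollectionOptions().capped(true).sizeInBytes(100L * 1024 * 1024));
    }
}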

Is it a good idea to generate per-day collections in MongoDB?

Is it a good idea to create per-day collections for the data on a given day (we could start with per-day and then move to per-hour collections if there is too much data)? Is there a limit on the number of collections we can create in MongoDB, or does it result in performance loss (is it an overhead for MongoDB to maintain so many collections)? Does a large number of collections have any adverse effect on performance?
To give you more context, the data will be much like Facebook feeds, and only the latest data (say the last week or month) is really important to us. Making per-day collections keeps the number of documents low and would probably result in fast access. Even if we need old data, we can fall back to the older collections. Does this make sense, or am I heading in the wrong direction?
What you actually need is to archive the old data. I would suggest you take a look at this thread on the MongoDB mailing list:
https://groups.google.com/forum/#!topic/mongodb-user/rsjQyF9Y2J4
The last post there, from Michael Dirolf (10gen), says:
"The OS will handle LRUing out data, so if all of your queries are
touching the same portion of data that should stay in memory
independently of the total size of the collection."
So I guess you can stay with a single collection, and good indexes will do the work.
Anyhow, if the collection gets too big you can always run a manual archive process.
Yes, there is a limit to the number of collections you can make. From the Mongo documentation Abhishek referenced:
The limitation on the number of namespaces is the size of the namespace file divided by 628.
A 16 megabyte namespace file can support approximately 24,000 namespaces. Each index also counts as a namespace.
Indexes etc. are included in the namespaces, but even so, at one new collection per day it would take something like 60 years to hit that limit (roughly 24,000 namespaces / 365 per year ≈ 65, before counting the per-collection indexes).
However! Have you considered what happens when you want data that spans collections? In other words, if you wanted to know how many users have feeds updated in a week, you're in a bit of a tight spot. It's not easy/trivial to query across collections.
I would recommend instead making one collection to store the data and simply move data out periodically as Tamir recommended. You can easily write a job to move data out of the collection every week or every month.
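A rough sketch of such a job using the MongoDB Java driver (the collection names, the createdAt field, and the 30-day cutoff are assumptions; a production job would also handle failures part-way through):
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class FeedArchiveJob {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create("mongodb://localhost").getDatabase("app");
        MongoCollection<Document> feeds = db.getCollection("feeds");
        MongoCollection<Document> archive = db.getCollection("feedsArchive");

        Date cutoff = new Date(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30));
        Bson old = Filters.lt("createdAt", cutoff);

        // Copy old documents into the archive collection in bounded batches,
        // then delete them from the hot collection.
        List<Document> batch = new ArrayList<>();
        for (Document doc : feeds.find(old)) {
            batch.add(doc);
            if (batch.size() == 1000) {
                archive.insertMany(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            archive.insertMany(batch);
        }
        feeds.deleteMany(old);
    }
}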
Creating a collection is not much overhead, but the overhead is larger than that of creating a new document inside an existing collection.
There is a limit on the number of collections that you can create: http://docs.mongodb.org/manual/reference/limits/#Number of Namespaces
To me, making new collections won't make any performance difference, because only the data you actually query is cached in RAM; in your case that will be the recent feeds etc.
But having per-day/hour collections will make it very easy to deal with old data.

Morphia: is there a difference between fetch and asList performance-wise?

We are using Morphia 0.99 and Java driver 2.7.3. I would like to know whether there is any difference between fetching records one by one using fetch and retrieving the results via asList (assume that there is enough memory to retrieve the records through asList).
We iterate over a large collection; while using fetch I sometimes encounter a "cursor not found" exception on the server during the fetch operation, so I need to execute another command to continue. What could be the reason for this?
1) fetch a record
2) do some calculation on it
3) save it back to the database again
4) fetch another record and repeat the steps until there are no more records
So which one would be faster: fetching records one by one, or retrieving results in bulk using asList? Or is there no difference between them in the Morphia implementation?
As far as I understand the implementation, fetch() streams results from the DB while asList() will load all query results into memory. So they will both get every object that matches the query, but asList() will load them all into memory while fetch() leaves it up to you.
For your use case, neither would be faster in terms of CPU, but fetch() should use less memory and not blow up in case you have a lot of DB records.
Judging from the source-code, asList() uses fetch() and aggregates the results for you, so I can't see much difference between the two.
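To make the difference concrete, a sketch along these lines (WorkItem is a placeholder mapped entity, and the package names below are from later Morphia releases, so they may differ slightly in 0.99):
import org.bson.types.ObjectId;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.annotations.Entity;
import org.mongodb.morphia.annotations.Id;
import org.mongodb.morphia.query.Query;
import java.util.List;

public class FetchVsAsList {

    @Entity("workItems")            // placeholder mapped entity for this sketch
    public static class WorkItem {
        @Id public ObjectId id;
        public double value;
    }

    static void recalculate(WorkItem item) {
        item.value = item.value * 1.1;   // stands in for the real calculation
    }

    static void processAll(Datastore ds) {
        Query<WorkItem> query = ds.createQuery(WorkItem.class);

        // fetch(): documents are streamed from the cursor as you iterate,
        // so memory use stays roughly flat however large the result set is.
        for (WorkItem item : query.fetch()) {
            recalculate(item);
            ds.save(item);
        }

        // asList(): the entire result set is materialised in memory first,
        // then iterated; same work, but everything must fit in RAM at once.
        List<WorkItem> all = ds.createQuery(WorkItem.class).asList();
        for (WorkItem item : all) {
            recalculate(item);
            ds.save(item);
        }
    }
}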
One very useful difference would be if the following two conditions applied to your scenario:
You were using offset and limit in the query.
You were changing values on the object such that it would no longer be returned in the query.
So say you were doing a query on awesome=true, and you were using offset and limit to do multiple queries, returning 100 records at a time to make sure you didn't use up too much memory. If, in your iteration loop, you set awesome=false on an object and saved it, it would cause you to miss updating some records.
In a case like this, fetch() would be a better approach.
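A sketch of that failure mode (the Thing entity and its awesome flag are hypothetical, and the query API shown is the later org.mongodb.morphia style):
import org.bson.types.ObjectId;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.annotations.Entity;
import org.mongodb.morphia.annotations.Id;
import java.util.List;

public class PaginationPitfall {

    @Entity("things")               // hypothetical entity with an "awesome" flag
    public static class Thing {
        @Id public ObjectId id;
        public boolean awesome;
    }

    // Pages through awesome=true documents while flipping the flag on each one.
    // Every save shrinks the query's result set, so the next offset skips over
    // documents that were never visited; some records are missed.
    static void processAwesome(Datastore ds) {
        int pageSize = 100;
        for (int offset = 0; ; offset += pageSize) {
            List<Thing> page = ds.createQuery(Thing.class)
                    .field("awesome").equal(true)
                    .offset(offset)
                    .limit(pageSize)
                    .asList();
            if (page.isEmpty()) {
                break;
            }
            for (Thing t : page) {
                t.awesome = false;
                ds.save(t);
            }
        }
    }
}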