Best way to get atomic / sql timestamp like functionality - azure-service-fabric

I need some functionality to move data to other systems and was wondering if there is some sort of atomic increment . I did see something like that for the reliable collection replication but it does not appear to be exposed.
Any ideas ? I could build my own but having an entire collection for a single value and running transactions over both seems over kill .
I see transactionId but i assume this is not persisted across restarts ?

Related

how to choose best mechanism for delete logs saved to mongodb

I'm implementing a logger using MongoDB and I'm quite new to the concept.
The logger is supposed to log each request and Its response.
I'm facing the question of using the TTL Index of mongo or just using the query overnight approach.
I think that the first method might bring some overhead by using a background thread and probably rebuilding the index after each deletion but, it frees space as soon as the documents expire and this might be beneficial.
The second approach, on the other hand, does not have this kind of overhead but it frees up space just at the end of each day.
It seems to me that the second approach will suit my case better as it would not be the case that my server just goes on the edge of not having enough disk space, but it will always be the case that we need to reduce the overhead on the server.
I'm wondering if there are some aspects to the subject that I'm missing and also I'm not sure about the applications of the MongoDB TTL.
Just my opinion:
It seems to be best to store logs in monthly , daily or hourly collection depends on your applications write load , and at the end of the day to just drop() the oldest collections with custom script. From experience TTL indices not working well when there is heavy write load to your collection since they add additional write load based on expiration time.
For example imagine you insert at 06:00h log events with 100k/sec and your TTL index life time is set to 3h , this mean after 3h at 09:00h you will have those 100k/sec deletes applied to your collection that are also stored in the oplog ... , solution in such cases is to add more shards , but it become kind of expensive... , far easier is to just drop the exprired collection ...
Moreover depending on your project size for bigger collections to speed up searches you can additionally shard and pre-split the collections based on compound index hashed datetime field(every log contain timestamp) with another field which you will search often and this will allow you scalable search across multiple distributed shards.
Also note mongoDB is a general purpose document database and fulltext search is kind of limited to expensinve regex expressions , so in case you need to do fast raw fulltext search in your logs some inverse index search engine like elasticsearch on top of your mongoDB backand maybe a good solution to cover this functionality.

Is dynamically creating and dropping collections in MongoDB going to create scalability issues?

I have an application (built in Meteor) that provides some ad hoc reporting capabilities to the end user. I have built up that functionality by using the aggregation pipeline to produce the results for a given query. This makes it extremely fast and I was using $out to push the results right into a results table.
The results table included a queryID, which the client used to figure out which were the correct results.
Unfortunately, as you may know (and I discovered), that doesn't work so well once you have more than one user running reports at a time because $out deletes the whole results table before pushing the new query in.
I see three possible workarounds:
Run the aggregation, but manually push the results into the results collection
$out the results into a temporary collection (dynamically named to avoid conflicts) and then manually copy the results from there into results collection, immediately dropping the temporary one. This made some sense when I thought I could use copyTo(), but that doesn't appear possible within Meteor, so I think this option doesn't make much sense relative to #1 in this case.
$out the results into a temporary collection (dynamically named to avoid conflicts) and have the client pull its results directly from there. I would then periodically drop the extra collections after say 24 hours (like I do with specific query results in the main collection today).
#3 would be the fastest by far - the time it takes to manually copy rows dwarfs the time it takes the queries to run. But I'm concerned about the impact of creating and dropping so many collections.
We're not talking millions of users here, but if an average of 500 users a day were each running 10-20 reports, there could be an additional 5-10k collections in the database at any one time. That seems like a lot. Perhaps I could be smarter about cleaning them up somehow, though I can't just immediately remove them because a user might want to have multiple tabs open with different reports. Even still, we're potentially talking about hundreds to thousands of collections.
Is that going to be a problem?
Are there other approaches I should consider instead?
Other recommendations?
Thanks!
Dropping a collection in mongoDB is very efficient operation, anyway much more efficient than deleting some documents in a larger collection.
Maximum number of collections is quite high, only limited by namespace namespace in MMAPv1 while no hard limit exists in wiretiger engine.
So I would favor your solution #3.
Some improvements/alternatives you can think:
Consider creating the collections in a separated database (say per day) then you can drop the entire database in a single operation without having to drop individual collections.
Use an endpoint for the result set, cash the results then drop the $out collection. Let cache handle user requirements and only rerun the aggregation if cache has expired or something.
This kind of activity is done very easily in relational databases such as mysql or pgsql. You might consider synchronising your data to a separate relational database for the purposes of reporting.
There is a package https://github.com/perak/mysql-shadow which claims to provide synchronisation. I played with it and it didn't work perfectly, although doing just one way sync is more likely to succeed.
The other option is to use Graphql over a mongo/mysql hybrid database which can be done with the Apollo stack http://www.apollodata.com/

MongoDB Schema Suggestion

I am trying to pick MongoDB as my preferred database. I need help on the design of my table.
App background - analytics app where contacts push their own events and related custom data. Contact can have many events. Eg: contact did this, did that etc.
event_type, custom_data (json), epoch_time
eg:
event 1: event_type: page_visited, custom-data: {url: pricing, referrer: google}, current_time
event 2: event_type: video_watched, custom-data: {url: video_link}, current_time
event 3: event_type: paid, custom_data: {plan:lite, price:35}
These events are custom and are defined by the user. Scalability is a concern.
These are the common use cases:
give me a list of users who have come to pricing page in the last 7 days
give me a list of users who watched the video and paid more than 50
give me a list of users who have visited pricing, watched video but NOT paid at least 20
What's the best way to design my table?
Is it a good idea to use embedded events in this case?
In Mongo they are called collections and not tables, since the data is not rows/columns :)
(1) I'd make an Event collection and a Users collections
(2) I'd do 1 document per Event which has a userId in it.
(3) If you need realtime data you will want an index on what you want to query by (i.e. never do a scan over the whole collection).
(4) if there are things which are needed for reporting only, I'd recommend making a reporting node (i.e. a different mongo instance) and using replication to copy data to that mongo instance. You can put additional indexes for reporting on that node. That way the additional indexes and any expensive queries will not affect production performance.
Notes on sharding
If your events collection is going to become large - you may need to consider sharding. Perhaps sharding by user Id. However, I'd recommend that may be a longer term solution and not to dive into that until you need it.
One thing to note, is that mongo has currently (2.6) a database level write locking implementation. Which means you can only perform 1 write at a time. It allows many reads. Which means that if you want a high write system AND have a lot of users, you will need to look into sharding at some point. However, in my experience so far, administratively 1 primary node with a secondary (and reporting node) is easier to setup. We currently can handle around 10,000 operations per second with that setup.
However, we have had issues with spikes in users coming to the system. You'll want to make sure you have enough memory for your indexes. And SSD's would be recommended to. as a surge in users can result in cache misses (i.e. index not in memory) which causes it to be read off the hard disk.
One final note - there are a lot of NoSQL DB's and they all have their pros and cons. I personally found that high write, low read, and realtime anaysis of lots of data is not really mongo's strength. So it does depend on what you are doing. It sounds like you are still learning the fundamentals. It might be worth a read of all the available types to pick the right tool for the right job.

GET Consistency (and Quorum) in ElasticSearch

I am new to ElasticSearch and I am evaluating it for a project.
In ES, Replication can be sync or async. In case of async, the client is returned success as soon as the document is written to the primary shard. And then the document is pushed to other replicas asynchronously.
When written asynchronously, how do we ensure that when GET is done, data is returned even if it has not propagated to all the replicas. Because when we do a GET in ES, the query is forwarded to one of the replicas of the appropriate shard. Provided we are writing asynchronously, the primary shard may have the document but the selected replica for doingthe GET may not have received/written the document yet. In Cassandra, we can specify consistency levels (ONE, QUORUM, ALL) at the time of writes as well as reads. Is something like that possible for reads in ES?
Right, you can set replication to be async (default is sync) to not wait for the replicas, although in practice this doesn't buy you much.
Whenever you read data you can specify the preference parameter to control where the documents are going to be taken from. If you use preference:_primary you make sure that you always take the document from the primary shard, otherwise, if the get is done before the document is available on all replicas, it might happen that you hit a shard that doesn't have it yet. Given that the get api works in real-time, it usually makes sense to keep replication sync, so that after the index operation returned you can always get back the document by id from any shard that is supposed to contain it. Still, if you try to get back a document while indexing it for the first time, well it can happen that you don't find it.
There is a write consistency parameter in elasticsearch as well, but it is different compared to how other data storages work, and it is not related to whether replication is sync or async. With the consistency parameter you can control how many copies of the data need to be available in order for a write operation to be permissible. If not enough copies of the data are available the write operation will fail (after waiting for up to 1 minute, interval that you can change through the timeout parameter). This is just a preliminary check to decide whether to accept the operation or not. It doesn't mean that if the operation fails on a replica it will be rollbacked. In fact, if a write operation fails on a replica but succeeds on a primary, the assumption is that there is something wrong with the replica (or the hardward it's running on), thus the shard will be marked as failed and recreated on another node. Default value for consistency is quorum, and can also be set to one or all.
That said, when it comes to the get api, elasticsearch is not eventually consistent, but just consistent as once a document is indexed you can retrieve it.
The fact that newly added documents are not available for search till the next refresh operation, which happens every second automatically by default, is not really about eventual consistency (as the documents are there and can be retrieved by id), but more about how search and lucene work and how documents are made visible through lucene.
Here is the answer I gave on the mailing list:
As far as I understand the big picture, when you index a document it's written in the transaction log and then you get a succesful answer from ES.
After, in an asynchronous manner, it's replicated on other nodes and indexed by Lucene.
That said, you can not search immediatly for the document, but you can GET it.
ES will read the tlog if needed when you GET a document.
I think (not sure) that if the replica is not up to date, the GET will be sent on the primary tlog.
Correct me if I'm wrong.

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between mongodb and client in case the client doesn't need user information at all, just offers. Am i wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used tag 'join' in a question. The idea is that I
can make a first query more сomplex using a bunch of fields and
regexes without storing user data in other collections except
references. In these cases I always have to perform two consecutive
queries, but transfering of the results of the first query is not
necessary neither for me nor for the mongodb itself. I just want to
understand could it be done now, will it be possible to do so in the
future or it cannot be implemented for some technical reasons
As far as I understand it there is no immediate hurry to make this possible. Also the way it is coded atm will make this quite a big change to the way cursors work and are defined. A change big enough to possibly cause implementation breaks for other people. It is really a case of whether to set safe for inserts and updates for all future drivers. It is recognised that safe should be default but this will break implementation for other people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all however since most networks are prepped with high traffic in mind and the traffic is cheap there hasn't been a demand to make it able to do chained queries server side in the cursor.
However subselects (which this basically is, it is selecting a set of rows based upon a sub selection of previous rows) have been on mongodb-user a couple of times and there might even be a JIRA for it somewhere, if not might be useful to make one.
As for doing it right now: there is no way.