I have a standalone OpenSearch instance for testing purposes. I want to keep it light and clean, so I'm using ISM to delete indices older than x days.
What I noticed is that by default OpenSearch creates a management index (".opensearch-ism-config") with replica "1".
Since I'm using a standalone instance (it is just for testing, so I'm not worried about redundancy, HA or anything like that) and want to keep my cluster in green status, I have decided that I want those indices to have replica "0".
To achieve that, I have created a template that sets replica "0" for these indices:
{
"order" : 100,
"version" : 1,
"index_patterns" : [".opensearch-ism-*"],
"settings" : {
"index": {
"number_of_shards" : "1",
"number_of_replicas": 0
}
}
}
After the PUT, I start using ISM, so the ISM management index is created only after this template is in place on the OpenSearch node.
What I observe is that all ISM management indices are created with replica "1", so the template is ignored.
I can set replicas to "0" by updating the index settings after creation, but this is not ideal, as ISM indices rotate and new ones are generated from time to time.
Is there any way to have ISM indices apply replica "0" automatically?
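For reference, the manual workaround mentioned above (updating the index settings after creation) could be scripted roughly as follows. This is only a sketch: the function name replica_zero_request and the address http://localhost:9200 are assumptions, and the request is built but not sent.

```python
import json
from urllib.request import Request


def replica_zero_request(base_url, index_pattern=".opensearch-ism-*"):
    """Build (but do not send) a PUT <pattern>/_settings request setting replicas to 0."""
    body = json.dumps({"index": {"number_of_replicas": 0}}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed local test node; adjust host, port and authentication as needed.
req = replica_zero_request("http://localhost:9200")
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would actually send it. It would still have to be re-run whenever a new ISM index appears, which is exactly the drawback described above.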
Is there any way, on the database backend, to force a user to read only from SECONDARY members?
I would like to prevent some users from impacting the performance of the PRIMARY replica set members in my on-premise deployment (not Atlas).
The issue is easy to solve if the customer agrees to add this to the URI:
readPreference=secondary
But I am checking whether there is an option to enforce this from the database side without asking the customer...
The only option I have found is to restrict by server IP address:
use admin
db.createUser(
{
user: "dbuser",
pwd: "password",
roles: [ { role: "readWrite", db: "reporting" } ],
authenticationRestrictions: [ {
clientSource: ["192.0.2.0"],
serverAddress: ["198.51.100.1","192.51.100.2"]
} ]
}
)
There is currently no supported way to enforce this from within MongoDB itself, apart from the authenticationRestrictions configuration for users that is noted in the question itself.
Regarding the comments: the ANALYTICS tag in Atlas is an (automatic) replica set tag. Replica set tags themselves can be used in on-premise deployments. But tags are used in conjunction with read preference, which is set by the client application (at least in the connection string). So that approach doesn't provide any additional enforcement beyond read preference alone for the purposes of this question. Additional information about tags can be found here and here.
In an 'unsupported'/hacky fashion, you could create the user(s) directly and only on the SECONDARY members that you want the client to read from. This would be accomplished by taking the member out of the replica set, starting it up as a standalone, creating the user, and then joining it back to the replica set. While it would probably work, there are a number of implications that don't make this a particularly good approach. For example, elections (for high availability purposes) would change the PRIMARY (therefore where the client can read from) among other things.
Other approaches would involve redirecting/restricting traffic at the network layer. Again, not a great approach.
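For completeness, the client-side option from the question (adding readPreference=secondary to the URI) could be applied programmatically along these lines. This is a minimal sketch: the helper name with_secondary_reads and the example hosts are made up, and it assumes the URI already contains the "/" that must precede the options.

```python
from urllib.parse import parse_qsl, urlencode


def with_secondary_reads(uri):
    """Return the connection string with readPreference forced to 'secondary'."""
    base, _, query = uri.partition("?")
    # Keep every existing option except a previous readPreference.
    params = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if key.lower() != "readpreference"
    ]
    params.append(("readPreference", "secondary"))
    return base + "?" + urlencode(params)


# Hypothetical hosts and replica set name, for illustration only.
example = with_secondary_reads(
    "mongodb://db1.example.net:27017,db2.example.net:27017/reporting?replicaSet=rs0"
)
print(example)
```

This only changes what the client asks for. As noted above, nothing on the server side stops a client from connecting with a different read preference.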
I am using MongoDB to store tree data (MongoDB is the only option for now).
         10              ->>> root node
         /\
        /  \
       8    6            ---->> 8 & 6 are child nodes of 10
      /\    /\
     /  \  /  \
    4    5 2   1         ---->> 4 & 5 are child nodes of 8 ...
Each node is a separate document in mongoDB and each document has bunch of fields.
Sample data,
{
"_id": "234463456453643563456",
"name": "Mike",
"empId": "10",
"managerId": "8",
"hierarchy": [
8,
10
],
"projects" : [ "123", "456", "789"]
}
Here, the hierarchy field holds the manager ids from the first level up to the top level.
Any field of any document might get updated, and a node might move to any location. Basically, an org change.
I have a use case where changes are captured in another system and my system is updated with the full active load (200k records out of 800k records) every 2 hours.
Here, if there is an org change such as 8 moving under 6, the bottom-to-top hierarchy will change for all nodes under 8. If the full load fails partway through, the org hierarchy results will not be correct until the complete load is done.
The result should reflect the state either before the full update or after the full update, not in between. I was thinking of versioning to handle this. Is there any better way to handle this with Mongo?
There are about 200k records in a full load. But the actual changes are often fewer than 1k records, though we don't know that in advance.
If you need an all-or-nothing (atomic) database update, where your database clients must not read an invalid mid-update state, then you need a transaction.
You can optimize by recognizing that some subsets of the graph are valid after you update them, so queries against those subsets are valid, and then you don't need to use the transaction feature of the database.
But you'll still be blocking or rejecting queries from some clients, and that makes your schema, queries or architecture more complex.
If this is a business issue, then I'd push on the business requirements. (If you're in a position to do that, you didn't say whether that was an option.)
Your clients are already reading data that are potentially 2 hours out of date. If the batch update you're applying is sorted, then you can make those updates in time-order, and your clients will always be receiving a state that was recently valid (but maybe not the most recent).
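To make the org-change scenario from the question concrete (8 moving under 6), here is a small in-memory sketch of how the hierarchy arrays depend on the manager links, and which documents are affected by a single move. The function names and the manager_of mapping are illustrative assumptions and are not part of the original schema; no database is involved.

```python
def build_hierarchy(manager_of, emp_id):
    """Return manager ids from the direct manager up to the root."""
    chain = []
    current = manager_of.get(emp_id)
    # Stop at the root (None); the second check guards against accidental cycles.
    while current is not None and current not in chain:
        chain.append(current)
        current = manager_of.get(current)
    return chain


def affected_by_move(manager_of, moved_id):
    """Ids whose hierarchy changes when moved_id gets a new manager."""
    return sorted(
        emp for emp in manager_of
        if emp == moved_id or moved_id in build_hierarchy(manager_of, emp)
    )


# The tree from the question: employee id -> manager id.
manager_of = {"10": None, "8": "10", "6": "10", "4": "8", "5": "8", "2": "6", "1": "6"}

before = build_hierarchy(manager_of, "4")
manager_of["8"] = "6"  # org change: 8 moves under 6
after = build_hierarchy(manager_of, "4")
changed = affected_by_move(manager_of, "8")
print(before, after, changed)
```

Only the moved node and its descendants need new hierarchy arrays, which is the subset that has to become visible to readers all at once.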
I am trying to write a blue/green CFT that tears down and rebuilds the EC2 instances and the ELB, and updates the Route53 alias record with the updated DNS name of this ELB.
If the alias record doesn't exist, I'm able to create the alias record set correctly after the EC2 instances are created and the ELB attaches these instances. But if the record set exists with the old ELB DNS name, the CFT fails with "Alias RecordSet exists". Naturally, I am looking to UPDATE this record with the updated ELB DNS name when running the full CFT. Any suggestions?
"HostRecord" : {
"Type" : "AWS::Route53::RecordSet",
"Properties" : {
"HostedZoneName" : "REDACTED",
"Comment" : "Updates the ELB DNS name into Route 53 recordset.",
"Name" : "REDACTED",
"Type" : "A",
"AliasTarget" : {
"DNSName" : { "Fn::GetAtt" : [ "ESClusterELB" , "DNSName" ] },
"HostedZoneId" : { "Fn::GetAtt" : [ "ESClusterELB" , "CanonicalHostedZoneNameID" ] }
}
}
}
Managing a single resource (such as a RecordSet) from 2 different CloudFormation stacks is not supported.
I have a few recommendations for your use case:
I recommend you manage the record independently from the templates that you're using for blue/green. Once green is created/updated and you want your record to resolve to the green ELB, you can just update the stack that governs the RecordSet, setting it to the appropriate alias.
Using the same base as the first suggestion, you could automate this using the SNS notification triggered by CloudFormation when a stack is created/updated. Using this in conjunction with a Lambda, you could dynamically update the stack that controls the RecordSet.
You could create a custom resource that solely serves the purpose of updating the record set to the wanted alias.
Your problem seems to revolve around creating two CloudFormation stacks with conflicting resources which cannot coexist. One way to approach this is to always create the alias records in such a way that they can coexist.
An approach that should allow that is to use weighted routing. Set the weight of the record set in both stack instances to 1 and set the record set identifier to "blue" or "green" respectively.
Now you should be able to deploy both CFN stacks side by side without conflict. If the blue stack instance is active and the green one is not, all DNS responses will return the blue alias. When you then activate green, it will create a record set alongside the blue one and should start to take about half of the traffic. If you then deactivate the blue stack, green will take over all traffic.
This does mean you need to disable blue to test green in complete isolation, which is perhaps a little inconvenient and may slow down rollback. You could have a two-phase deployment, where you keep the weights as stack parameters, then once green is deployed with weight=1, redeploy blue with weight=0 to take it out of DNS without tearing it down. If green is bad, you can deactivate it and blue with weight zero should take over.
Weighted routing is only one routing option; you could also look at multivalue answer, failover or even geolocation routing.
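As a rough illustration of the weighted approach, the record set resource for each stack could be generated like this. The helper name weighted_alias_record and the zone and record names are placeholders; ESClusterELB is the logical id used in the question, and SetIdentifier and Weight are the AWS::Route53::RecordSet properties that weighted routing relies on.

```python
import json


def weighted_alias_record(color, weight, elb_logical_id="ESClusterELB"):
    """Build a weighted alias RecordSet resource for one stack ("blue" or "green")."""
    return {
        "Type": "AWS::Route53::RecordSet",
        "Properties": {
            "HostedZoneName": "example.com.",  # placeholder zone
            "Name": "app.example.com.",        # placeholder record name
            "Type": "A",
            "SetIdentifier": color,            # must differ between the two stacks
            "Weight": weight,                  # could be exposed as a stack parameter
            "AliasTarget": {
                "DNSName": {"Fn::GetAtt": [elb_logical_id, "DNSName"]},
                "HostedZoneId": {
                    "Fn::GetAtt": [elb_logical_id, "CanonicalHostedZoneNameID"]
                },
            },
        },
    }


green = weighted_alias_record("green", 1)
blue = weighted_alias_record("blue", 0)
print(json.dumps({"HostRecord": green}, indent=2))
```

Because the two record sets share a name and type but have different set identifiers, both stacks can own their own record without the "already exists" conflict.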
Just to get creative: you could also add a parameter that drives a condition in your CF template, so that different portions of the Route53 configuration are executed. For example, a parameter with values such as CREATE, DELETE, IGNORE. Something like that.
I have a collection in which below is the data:
"sel_att" : {
"Technical Specifications" : {
"In Sales Package" : "Charger, Handset, User Manual, Extra Ear Buds, USB Cable, Headset",
"Warranty" : "1 year manufacturer warranty for Phone and 6 months warranty for in the box accessories"
},
"General Features" : {
"Brand" : "Sony",
"Model" : "Xperia Z",
"Form" : "Bar",
"SIM Size" : "Micro SIM",
"SIM Type" : "Single Sim, GSM",
"Touch Screen" : "Yes, Capacitive",
"Business Features" : "Document Viewer, Pushmail (Mail for Exchange, ActiveSync)",
"Call Features" : "Conference Call, Hands Free, Loudspeaker, Call Divert",
"Product Color" : "Black"
},
"Platform/Software" : {
"Operating Frequency" : "GSM - 850, 900, 1800, 1900; UMTS - 2100",
"Operating System" : "Android v4.1 (Jelly Bean), Upgradable to v4.4 (KitKat)",
"Processor" : "1.5 GHz Qualcomm Snapdragon S4 Pro, Quad Core",
"Graphics" : "Adreno 320"
}
}
The data shown above is very large and the fields are all inserted dynamically. How can I index such fields to get faster results?
It seems to me that you have not fully understood the power of document based databases such as MongoDB.
Below are just a few thoughts:
you have 1 million records
you have 1 million index values for that collection
you need enough RAM available to hold 1 million index values in memory, otherwise the benefits of indexing will hardly show
yes, you can use sharding, but you need lots of hardware to accommodate basic needs
What you need for sure is something that can dynamically link random text to valuable indexes and that allows you to search vast amounts of text very fast. And for that you should use a tool like ElasticSearch.
Note that you can and should store your content in a NoSQL database, and yes, MongoDB is a viable option. For the indexing part, ElasticSearch has plugins available to enhance the communication between the two.
P.S. If I recall correctly the plugin is called MongoDB River
EDIT:
I've also added a more comprehensive definition for ElasticSearch. I won't take credit for it since I've grabbed it from Wikipedia:
Elasticsearch is a search server based on Lucene. It provides a
distributed, multitenant-capable full-text search engine with a
RESTful web interface and schema-free JSON documents
EDIT 2:
I've scaled down the numbers a bit since they might be far-fetched for most projects. But the main idea remains the same. Indexes are not recommended for the use case described in the question.
Based on what you want to query, you will end up indexing those fields. You can also have secondary indexes in MongoDB. But beware: creating too many indexes may improve your query performance, but it consumes additional disk space and makes inserts slower because the indexes have to be updated.
MongoDB indexes
Short answer: you can't. Use Elastic Search.
Here is a good tutorial to set up MongoDB River on Elastic Search
The reason is simple: MongoDB does not work like that. It helps you store complex schemaless sets of documents, but you cannot index dozens of different fields and hope to get good performance. Generally, a maximum of 5-6 indexes per collection is recommended.
Elastic Search is commonly used in the fashion described above in many other use-cases, so it is an established pattern. For example, Titan Graph DB has the built-in option to use ES for this purpose. If I were you, I would just use that and would not try to make MongoDB do something it is not built to do.
If you have the time and if your data structure lends itself to it (I think it might, from the JSON above), then you could also use an RDBMS to break down these pieces and store them on the fly with an EAV-like pattern. Elastic Search would be easier to start with and would probably get you good performance more quickly.
Well, there are lots of problems with having many indexes, as has been discussed here. But if you really need to add indexes for dynamic fields, you can create the index from your MongoDB driver.
So, let's say you are using the MongoDB Java driver; then you could create an index like below: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-java-driver/#creating-an-index
coll.createIndex(new BasicDBObject("i", 1)); // create index on "i", ascending
PYTHON
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.create_index
So, when you are populating data using any of the drivers and you find a new field that has come through, you could trigger index creation from the driver itself and not have to do it manually.
P.S.: I have not tried this and it might not be suitable or advisable.
Hope this helps!
Indexing of dynamic fields is tricky. There is no such thing as wildcard-indexes. Your options would be:
Option A: Whenever you insert a new document, do an ensureIndex with the option sparse:true for each of its fields. This does nothing when the index already exists and creates a new one when it's a new field. The drawback will be that you will end up with a very large number of indexes and that inserts could get slow because of all the new and old indexes which need to be created/updated.
Option B: Forget about the field names and refactor your documents to an array of key/value pairs. So
"General Features" : {
"Brand" : "Sony",
"Form" : "Bar"
},
"Platform/Software" : {
"Processor" : "1.5 GHz Qualcomm",
"Graphics" : "Adreno 320"
}
becomes
properties: [
{ category: "General Features", key: "Brand", value: "Sony" },
{ category: "General Features", key: "Form", value: "Bar" },
{ category: "Platform/Software", key: "Processor", value: "1.5 GHz Qualcomm" },
{ category: "Platform/Software", key: "Graphics", value: "Adreno 320" }
]
This allows you to create a single compound index on properties.category and properties.key to cover all the array entries.
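The refactoring in Option B can be done mechanically before inserting a document. Below is a minimal sketch in plain Python (no driver involved), where to_properties is a made-up helper name and the input is the shortened example from above:

```python
def to_properties(sel_att):
    """Flatten {category: {key: value}} into a list of category/key/value entries."""
    return [
        {"category": category, "key": key, "value": value}
        for category, fields in sel_att.items()
        for key, value in fields.items()
    ]


sel_att = {
    "General Features": {"Brand": "Sony", "Form": "Bar"},
    "Platform/Software": {"Processor": "1.5 GHz Qualcomm", "Graphics": "Adreno 320"},
}

properties = to_properties(sel_att)
for entry in properties:
    print(entry)
```

The resulting list is what would be stored under the properties field, so new attribute names become new array entries instead of new fields that need their own index.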
This question pertains specifically to Couchbase, but I believe it would apply to anything with the memcached API.
Let's say I am creating a client/server chat application, and on my server I am storing chat session information for each user in a data bucket. After the chat session is over, I will remove the session object from the data bucket, but at the same time I also want to persist it to a permanent NoSQL datastore for reporting and analytics purposes. I also want session objects to be persisted upon cache eviction, when sessions time out, etc.
Is there some sort of "best practice" (or even a feature of Couchbase that I am missing) that enables me to do this efficiently while maintaining the best possible performance of my in-memory caching system?
Using Couchbase Server 2.0, you could set up two buckets (or two separate clusters if you want to separate physical resources). On the session cluster, you'd store JSON documents (the value in the key/value pair), perhaps like the following:
{
"sessionId" : "some-guid",
"users" : [ "user1", "user2" ],
"chatData" : [ "message1", "message2"],
"isActive" : true,
"timestamp" : [2012, 8, 6, 11, 57, 0]
}
You could then write a Map/Reduce view in the session database that gives you a list of all expired items (note that the example below, with the meta argument, requires a recent build of Couchbase Server 2.0, not the DP4).
function(doc, meta) {
if (doc.sessionId && ! doc.isActive) {
emit(meta.id, null);
}
}
Then, using whichever Couchbase client library you prefer, you could have a task to query the view, get the items and move them into the analytics cluster (or bucket). So in C# this would look something like:
var view = sessionClient.GetView("sessions", "all_inactive");
foreach(var item in view)
{
var doc = sessionClient.Get(item.ItemId);
analyticsClient.Store(StoreMode.Add, item.ItemId, doc);
sessionClient.Remove(item.ItemId);
}
If you instead wanted to use an explicit timestamp or expiry, your view could index based on the timestamp:
function(doc) {
if (doc.sessionId && ! doc.isActive) {
emit(doc.timestamp, null);
}
}
Your task could then query the view by including a startkey to return all documents that have not been touched in x days.
var cutoff = DateTime.Now.AddDays(-1); // yesterday, with month/year rollover handled
var view = sessionClient.GetView("sessions", "all_inactive").StartKey(new int[] { cutoff.Year, cutoff.Month, cutoff.Day });
foreach(var item in view)
{
var doc = sessionClient.Get(item.ItemId);
analyticsClient.Store(StoreMode.Add, item.ItemId, doc);
sessionClient.Remove(item.ItemId);
}
Check out http://www.couchbase.com/couchbase-server/next for more info on Couchbase Server 2.0, and if you need any clarification on this approach, just let me know on this thread.
-- John
CouchDB storage is (eventually) persistent and has no built-in expiry mechanism, so whatever you store in it will remain stored until you remove it. It's not like Memcached, where you can set a timeout for stored data.
So if you are storing sessions in CouchDB, you will have to remove them on your own when they expire, and since it's not an automated mechanism but something you do yourself, there is no reason not to save the data wherever you want at the same time.
BTW, I see no advantage in using persistent NoSQL over SQL for session storage (and vice versa), as the performance of both will be IO bound. A memory-only key store or a hybrid solution is a whole different story.
As for your problem: move the data in your app's session expiry/session close mechanism and/or run a cron job that periodically checks session storage for expired sessions and moves the data.
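The periodic job described above could look roughly like the following. This is only a sketch: two plain dicts stand in for the session store and the permanent store, and the lastSeen field, the archive_expired name and the ten-minute limit are assumptions made for the example.

```python
import time


def archive_expired(sessions, archive, max_idle_seconds, now=None):
    """Move closed or idle sessions from the cache-like store into the archive."""
    now = time.time() if now is None else now
    expired = [
        session_id
        for session_id, doc in sessions.items()
        if not doc["isActive"] or now - doc["lastSeen"] > max_idle_seconds
    ]
    for session_id in expired:
        # Copy first and delete second, so a crash in between never loses data.
        archive.setdefault(session_id, sessions[session_id])
        del sessions[session_id]
    return expired


sessions = {
    "a": {"isActive": True, "lastSeen": 100.0},   # idle for too long
    "b": {"isActive": False, "lastSeen": 990.0},  # explicitly closed
    "c": {"isActive": True, "lastSeen": 995.0},   # still in use
}
archive = {}
moved = archive_expired(sessions, archive, max_idle_seconds=600, now=1000.0)
print(moved, sorted(sessions), sorted(archive))
```

With a real store, the copy and the delete would be two separate calls to two systems, so the job should be safe to re-run after a partial failure, which is why the copy comes first and uses an add-if-missing style write.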