Unique IDs in MongoDB

I'm about to deploy my first production version of a web service that uses MongoDB.
This web service could be prone to attacks (hackers).
I have been using the built-in ObjectID as the unique identifier for each value, and this is exposed publicly (or at least to authenticated users).
Could this be a problem, considering that it is built from data such as the object creation timestamp, machine and process IDs, etc. (http://docs.mongodb.org/manual/core/object-id/)?
Could it be that I'm giving away too much information about when each object was created, how many machines are in use, and so on?
What would your recommendations be?

Not really. An ObjectID can't be used to map your network or your machines, to find hidden objects, or even to infer your traffic and load patterns (as I have found out myself from trying).
For example, unlike an auto-incrementing ID, you cannot easily guess which timestamp, PID, or machine ID will be used to create the next ObjectID. This makes it very hard to crawl for hidden objects, especially if you don't publicly link to them anywhere.
The PID and machine ID don't identify much about your network, and the PID can change at almost any time: when the process is restarted, when you restart the server, or, if you are using a language like PHP, every time a new connection comes in.
The machine ID is another piece of information that means nothing to anyone outside your own machine. I don't believe it is derived from the network interface's hardware address any more (some drivers used to do that), so it cannot be used to identify the machine externally.
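For illustration, here is what an ObjectID actually exposes; a quick sketch using the Python bson package (which ships with pymongo):

from bson import ObjectId

oid = ObjectId()              # a 24-character hex string
print(oid.generation_time)    # creation time (UTC), to the second
# The remaining bytes are a machine/process-derived (or, in newer drivers,
# random) value plus a counter; they are not reversible to a hostname or
# network address.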
So in short, not really.

How exactly does backend work from a developer perspective?

There's a ton of videos and websites trying to explain backend vs. frontend, but unfortunately none of them explains it in a way that teaches you how to develop a backend-driven website (at least I haven't found anything good).
So, I wanted to ensure that I understood it and kindly ask you to confirm or correct me on this topic.
Example:
Say I want to build a Mini-Google. I have a database containing 1,000 stored websites.
Assumption #1:
Every time I type something into the search bar, the autofill suggestions change. This means that every time I type, another endpoint/API gets called, returning the current autofill suggestions. On the developer side, this means the endpoint is, for example, a Python script that gets called with the currently typed word as a parameter and returns all suggestions as, say, JSON:
// Client-side script (JavaScript)
async function onType(input) {
    const response = await fetch("https://api.googlemini.com/suggestions?q=" + encodeURIComponent(input));
    const suggestions = await response.json();
    show(suggestions); // placeholder: render the suggestions under the search bar
}
Assumption #2:
This also means I could manually call the URL backed by the Python script, providing an arbitrary word, and it would always return JSON containing the autofill suggestions for that word.
Question #1:
If A#1 turns out true but A#2 turns out false, how could I prevent a user from randomly accessing the "API" while still returning results when called by a script?
Assumption #3:
After pressing Enter, my page googlemini.com/search?... would be called. Since google.com/search reloads every time you search for a new query (or go to page 2, etc.), I assume that, instead of calling an API, the server receives the client request, first searches through its database, sorts the results, and then returns a whole HTML page as a static webpage:
# Server-side script (Python/Flask)
from flask import Flask, request

app = Flask(__name__)

@app.route("/search")
def oncall():
    query = request.args.get("q")
    results = searchdatabase(query)   # placeholder: look up and sort matches in the database
    html = buildhtml(results)         # placeholder: render the results into an HTML page
    return html
Question #2:
Often, I hear (or at least understand it this way) that the database and the web server are two separate servers. How would that work? Wouldn't that mean the database server needs to be accessible from the web too (of course it would have security layers etc., but technically it would)? How could I access the database server from the web server?
Question #3:
Are there, on a technical basis, any other ways to build backend services?
That's it. I would also appreciate any recommendations, like videos, websites or other resources, to learn how to technically set up and/or secure backend servers.
Thanks in advance.
For your first question: yes, there is a way to prevent misuse.
What you can do is add an identifier to the API, such as an auth token, to identify each user. Every time a user accesses the API you keep a count on the server, and whenever the count exceeds a limit within a given time span you reject the call. The limit can be set so that it doesn't trouble the honest user but punishes the abusive one. There are more complex and effective methods, but this is the basic idea.
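As a rough sketch of that idea (using Flask and an in-memory counter purely for illustration; a real service would validate the token properly and keep the counters in shared storage such as Redis):

import time
from collections import defaultdict
from flask import Flask, request, jsonify

app = Flask(__name__)
LIMIT = 30                    # max calls per token per window
WINDOW = 60                   # window length in seconds
calls = defaultdict(list)     # token -> timestamps of recent calls

@app.route("/suggestions")
def suggestions():
    token = request.headers.get("X-Auth-Token")
    if not token:
        return jsonify(error="missing token"), 401
    now = time.time()
    calls[token] = [t for t in calls[token] if now - t < WINDOW]
    if len(calls[token]) >= LIMIT:
        return jsonify(error="too many requests"), 429   # reject the call
    calls[token].append(now)
    q = request.args.get("q", "")
    return jsonify(suggestions=[q + " example"])          # placeholder suggestions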
For question number two, let me explain a simple concept: a database is a very efficient, resourceful, and expensive data-storage solution, and we never want to use it casually as a general-purpose variable store. We always want to access it in calls: get the data, process the data, update the data. It isn't strictly necessary to run a separate server for the database, but we usually want the database to be accessible to various platforms (Android, iOS, Windows), so it's better to add some abstraction and keep the database as a separate entity behind the web server.
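To make that concrete, here is a minimal sketch (the hostname is hypothetical): the web server connects to a database running on another machine over a private address, and only the web server is exposed to the public internet:

from pymongo import MongoClient

# "db.internal" is a private hostname; the database listens only on the
# internal network, never on a public interface.
client = MongoClient("mongodb://db.internal:27017")
pages = client["minigoogle"]["pages"]
print(pages.count_documents({}))   # the web server queries it like any local resource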
For the last question, I'm not entirely sure what you mean by "other ways", but here is a list of common backend technologies; some can be used in isolation, others are combined with additional tools:
Django
Flask
Django REST Framework
GraphQL
SQL
PHP
Node
Deno

Sync two offline masters when network available

I have a use case where I need to set up two physical stations at a venue. Each station will be running a couple of app servers and a mongodb server.
I can't rely on the venue's internet access so I need my app to be able to work offline and "sync" the dbs every once in a while.
I initially thought about having two masters that would somehow sync with a remote one, but TIL that master-master replication is not possible with MongoDB.
I've read about the active-active approach, however, that won't let me write to a different shard when offline.
I'm running out of ideas, any recommendation would be greatly appreciated.
------ Update on what I'm trying to achieve:
I'm working with a venue that has two entrances. The idea is to be able to capture some information from people attending the events (name, email, etc). After someone registers, we will print a name tag with some of the info.
Everything sounds pretty easy; however, if possible, I would like not to rely on the venue's network (internet). That's where I started struggling to figure out the best approach. I guess what I want is to have a remote Mongo instance but, if the network goes down, to somehow keep saving records locally and send them to the remote instance when the network is available again.
Extra considerations:
- Events last a couple of days, some people lose their name tag overnight, they should be able to go to either of the entrances and get it reprinted. So we should be able to find their info even if they registered in entrance A but they are asking for a reprint in entrance B.
More questions:
- Am I overthinking it? Maybe the venue's network plus a 4G/LTE modem as a backup would be enough? I would prefer not to rely on it, though.
I believe you're overthinking things. Here's what I would do if faced with a similar situation:
From the description, it doesn't sound like the two sites need to be connected in real time at all. I would create a server at Entrance A, another at Entrance B, and consolidate their data at the end of each day if required. This is because:
It's unlikely that one person will register at both sites within a single day. If they lose their tag on that day, just tell them to go back to the entrance where they registered and get it reprinted there. Worst case, you'll create a duplicate entry (it should be obvious which one is the duplicate, since no one loses their tag within seconds), and I would not anticipate hundreds of people losing their tags within a day.
If the attendee lost their tag overnight, both servers will have synced data and should be able to reprint.
If you're concerned about the venue's Wi-Fi, just run cables from the server to the printing stations.
Personally, I would argue that the overnight sync is not really needed at all (see the likelihood of people registering twice). I would just collect the data from both servers after the event ends. That is, unless you have a specific need for the combined data from both entrances during the second day.
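If you do want the overnight sync, a minimal consolidation sketch with pymongo could look like this (hostnames and database/collection names are hypothetical; _id keeps re-runs from creating duplicates):

from pymongo import MongoClient

source = MongoClient("mongodb://entrance-b.local:27017")["venue"]["registrations"]
target = MongoClient("mongodb://entrance-a.local:27017")["venue"]["registrations"]

for doc in source.find():
    # replace_one with upsert=True inserts the record if it is missing and
    # overwrites it if it already exists, so running this twice is harmless.
    target.replace_one({"_id": doc["_id"]}, doc, upsert=True)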
Note: please make sure you're running at least a 3-node replica set. Running a standalone instance in a production environment is not recommended; hardware/disk corruption is a common event.

Client/Server state synchronization for desktop application

I am working on a desktop application that requires synchronization between several clients. Basically, a group of people (let's say between 2 and 10) all run the same application. One of them hosts a server and the other clients connect to that server. The client that hosts the server also connects to his own server.
The applications should stay synchronized between all clients, meaning all clients see the same data in the application. Specifically, the data in question I can define in two separate forms:
A simple property with a certain value (this value must stay synchronized)
A list of properties (the items in the list and their values must stay synchronized)
Simple examples of (1) could be: which item in a list the client currently has selected, and the current location of the client's mouse pointer within the application window. These properties keep changing continuously, but the number of properties is constant and does not grow (i.e. it is defined at design time).
An example of (2) could be a list of chat messages. These lists will grow during runtime with no way to predict how many items there will be.
Here is an example code in C# for the state, client and chat messages:
public class State
{
    // A single value shared between all clients
    public int SimpleInteger { get; set; }

    // List of connected clients and their individual states
    public List<Client> Clients { get; set; }

    // List of chat messages
    public List<ChatMessage> Messages { get; set; }
}

public class Client
{
    public string ClientId { get; set; }
    public string Username { get; set; }
    public ClientState ClientState { get; set; }
}

public class ClientState
{
    public string ClientId { get; set; }
    public int SelectedIndex { get; set; }
    public int MouseX { get; set; }
    public int MouseY { get; set; }
}

public class ChatMessage
{
    public string ClientId { get; set; }
    public string Message { get; set; }
}
I've been working on this on and off for a long time but whatever kind of state synchronization I came up with, it never worked well.
When I search for solutions, I only ever find solutions for games, but those are not very helpful because my requirements are different:
I cannot deal with "dropped updates", I cannot predict (interpolate or extrapolate) what the other clients are doing. Every client needs to receive every update to stay in sync.
On the other hand, I don't care about lag (within reason). It is fine if I see the updates of other client with about a second delay.
When a new client connects (or reconnects), a large portion of the state must be transferred (for example, the list of chat messages from example 2). Each client is required to know the entire history of the chat, so this must be downloaded when a client connects.
My current solution can be summarized as follows:
The server keeps track of the state, e.g. the source of truth.
The state contains the properties that require synchronizing.
The state also contains a list of connected users (and their usernames etc).
Clients also each keep a local copy of the state, which they can act upon immediately. For example, they update their mouse position in their local state continuously.
Whenever a client updates his local state, this update is sent to the server.
Potential exceptions are things that change too quickly, such as the mouse position; those I will only send at regular intervals.
The server also updates the common "source of truth" state.
Finally, the server updates all other clients with the new updated state.
The last two steps are where I'm struggling. I can think of two methods to synchronize the state, one is easy but probably not efficient and the other is efficient but prone to errors.
The server simply sends the entire state to all clients.
As soon as the server receives an update from the client, the update is applied to the state and the new state is broadcasted. Every other client replaces their local state.
I feel this will probably work, but the state can grow in size quickly due to the "list" items (for example, chat messages). In my previous attempts this quickly became a problem, and sending the full state back became much too slow.
The server re-sends the same update (that it received) to all other clients.
Each client then only applies the new update to their state locally to sync back with the server.
This is probably much more efficient and sending the entire state is only necessary when a client connects.
However, in the past I frequently ran into desync issues where clients were no longer in sync. I don't really know what caused it; probably conflicts between messages (for example, the server telling the client to update a value in the state while the client has just updated its local value, so which has precedence?). Once this happens, everything goes completely wrong, as subsequent updates are applied to two different states and produce different outcomes.
I'm looking for some guidance on general concepts on how to achieve this. I'm using several messaging libraries to achieve the actual communication between client and server and that part is not an issue I think. I can make sure in these libraries that every message is received for example (though I'm not sure if the order is guaranteed). Like I said before, lag is not an issue, but I must guarantee every state update is received both by the server and by every other client.
Any help would be great! Thanks.
This is a hard problem and there are enough tricky areas that I wouldn't want to build this myself. Authentication, conflicting updates, API management, network outages, single point of failure, and local persistence come to mind.
If you're up for using a cloud-based solution, Google Cloud Firestore takes care of those tricky areas and does what you need:
Clients save data to the database, by creating, updating, or deleting records. Example code.
Whenever a record is created, updated, or deleted, all clients get realtime notifications. Example code.
(After you follow the links above, make sure you click C# above the code boxes to see the C# code).
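To give a rough idea of what that looks like in code (a sketch using the Python google-cloud-firestore client, since the linked pages show the C# equivalents; collection and field names here are hypothetical):

from google.cloud import firestore

db = firestore.Client()

# 1. A client saves data by creating a record.
db.collection("messages").add({"clientId": "abc", "message": "hello"})

# 2. Every client registers a listener and receives realtime notifications
#    whenever a record is created, updated, or deleted.
def on_change(col_snapshot, changes, read_time):
    for change in changes:
        print(change.type.name, change.document.id, change.document.to_dict())

watch = db.collection("messages").on_snapshot(on_change)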
This is a complicated issue, with many moving parts, as you seem to understand. As I've been researching this, I've read a couple of comments on questions like this one on a variety of Q&A sites, stating that this kind of thing is a project all on its own.
Disclaimer: I haven't done this myself, so I don't know how well this would work, but maybe you can take my suggestions and work with them, if you haven't already done so. I've worked on projects where this was implemented, but I wasn't part of that implementation directly.
Connection
Since you haven't said which library you are using for the connection, I'm going to assume you are using websockets or something similar. If not, I suggest you move to something like websockets. It allows a (near) constant connection between client and server so that data can be pushed in both directions, which saves the client from having to poll and pull the data. The link below seems to have a decent walk-through on how to do it, so I won't try to repeat it. Because links die, here's the first example code they give, which seems pretty simple.
using System;
using System.Net;
using System.Net.Sockets;

class Server {
    public static void Main() {
        TcpListener server = new TcpListener(IPAddress.Parse("127.0.0.1"), 80);
        server.Start();
        Console.WriteLine("Server has started on 127.0.0.1:80.{0}Waiting for a connection...", Environment.NewLine);
        TcpClient client = server.AcceptTcpClient();
        Console.WriteLine("A client connected.");
    }
}
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_server
Client start up
Once you have a stable connection between server and client, you need to make sure the data is in sync. When the user starts the app, you can get the timestamp of the latest change in each table and compare that to the server. If they are exactly the same, you have a somewhat reasonable expectation that the table hasn't changed. I'm assuming each table has a column containing the timestamp for the last edit made to the row.
For the tables that have changed, you can have the server send the new and updated rows to the client based on the client's "last changed timestamp".
Since the internet isn't 100% guaranteed to be connected, you will also need to keep track of the times the client has been connected vs. when they've been on the app (unless the app just won't work without being connected to the server). This information also needs to be sent to the server to compare to data changed during intervals where the client hasn't been connected.
Once timestamp matching has been done, you need to compare the row counts. If they match, you can more reasonably assume the tables are the same. If they aren't, you can see about matching IDs/primary keys. There are a variety of ways to do this, including 1:1 matching (which is slowest but most reliable), or you can do some math with the IDs (assuming numerical IDs) and check for differences in batches of 100 rows (for example). Idea: if summing the sorted, auto-increment integer IDs for the first 100 rows gives the same result on the client and the server, those rows exist on both; if the sums don't match, fall back to the 1:1 match to see what's missing. Because this can be lengthy for large databases, you may want to track this type of sync in another table so it doesn't need to be done all the time.
Instead, you may want a table to track all the data not sent to a client. This would require a confirmation that the data sent was correctly inserted into the client DB. This could also work on the client side to track what hasn't been sent to the server. Of course, this kind of thing can get cumbersome quickly, even if you're just tracking keys, table names, and timestamps. You can rack up millions of rows quickly, if you don't remove old data periodically. This is why I suggest tracking unsent data, so that anything that becomes "sent" is no longer tracked by this table and removed.
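To sketch the start-up comparison described above (Python with a local SQLite cache standing in for the client DB; the table names, the last_modified column, and the fetch_rows_since transport call are all assumptions on my part):

import sqlite3

local = sqlite3.connect("client_cache.db")   # the client's local copy

def last_modified(conn, table):
    # Highest edit timestamp in the table, or None if the table is empty.
    return conn.execute(f"SELECT MAX(last_modified) FROM {table}").fetchone()[0]

def fetch_rows_since(table, since):
    # Placeholder: in the real app this would be a websocket request asking
    # the server for rows in `table` edited after `since`.
    return []

for table in ("messages", "client_states"):
    since = last_modified(local, table)
    for row in fetch_rows_since(table, since):
        print("would upsert", table, row)    # placeholder: apply the newer row locally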
If you don't want to code and manage all that, you can look for a library that does it. There is a variety out there. Even Microsoft has one, but it's on extended support only until 1/1/2021. What happens after that, I doubt even Microsoft knows, but it buys you about 1.25 years to come up with a different solution.
Creating Synchronization Providers With The Sync Framework
The Sync Framework can be used to build apps that synchronize data from any data store using any protocol over a network. We'll show you how it works and get you started building a custom sync provider.
https://learn.microsoft.com/en-us/previous-versions/sql/synchronization/mt490616(v=msdn.10)
https://support.microsoft.com/en-us/lifecycle/search?alpha=Microsoft%20Sync%20Framework%202.1
Normal runtime
Once you have your data synced on startup (or in the background after startup), you can simply send data to the server as the user makes changes. Since you'll have a websocket-type connection, any changes the server receives from one client can be pushed out to all the other clients.
As far as changing the data in real time in your app, you may have to be constantly polling your local/client DB for timestamp changes so the UI can be appropriately updated. There may be something within C# that does this for you or another library you can find.
Conclusion
At this point, I'm out of ideas. It seems reasonable to me this would work, even though it's a lot of work. Hopefully you can take what I have and use it as a foundation to your own ideas on how to accomplish your task. It seems there's a lot of work ahead of you, so good luck!
Footnote
As I'm currently the only answer after several days of it being unanswered, I'm going to assume no one else has anything better to suggest. If they do, I'd encourage them to make their own answer instead of complaining about mine. People tweaking this answer is expected, but please remember community standards when making comments.
I'm only answering this because I haven't seen anyone else do it on this or other sites. It's only been bits and disconnected pieces here & there, with people still not being able to make sense of it as a whole.
This and similar questions have been asked before on this site and closed as "too broad". If you feel this same way as a reader, please vote so on the Question not this answer.
There are several solutions to your problem.
You could use a BizTalk server out of the box. This may not be what you have in mind.
If you want something more home-brewed, you could use WCF (Windows Communication Foundation) with MSMQ (Microsoft Message Queue). This would give you guaranteed message delivery and durable messages (if you want them). You would not have to worry about lost connections and other errors occurring during message transmission.
You can go down another level and use direct TCP and UDP protocols to transmit messages. But now, you have to take care of more error cases.
Any SQL DBMS implements one important part of your problem statement: it maintains shared state. Consider what ACID promises:
Consistency. At any one instant, all clients reading from the database are guaranteed to see the same information.
Atomicity. The client updating the database can use as many steps as needed. When the transaction is committed, the data are changed entirely or not at all.
Isolation. The server gives each client the illusion of interacting with it alone. It handles concurrent updates, and updates the database as though the updates arrived serially.
You may not care about durability for this application.
The mediation among the clients is, for my money, the most useful feature of the DBMS for your application. That will save you work, and headaches. Another, non-obvious, benefit is that it can enforce consistency rules for the state information; that can be remarkably useful to prevent an obsolete/corrupt client from munging the shared state.
The second part of your problem statement is notifying 2-10 clients of changed state. There are any number of ways to do that.
Some DBMSs can access OS services from triggers. You could have an update trigger issue a notification. Alternatively, the updating client could do that.
The actual notification mechanism could be quite simple. Clients could connect to a server (that you write) and block on read(2). The server itself listens on a port for update notifications. On receipt of one, it repeats it to all connected clients. When the client's read request returns, it's time to query the database for the updated state, and post a new read.
To prevent a kind of "thundering herd" problem when several updates arrive back-to-back, when a client reads the update message, it could keep reading updates until EWOULDBLOCK, and only then query the DBMS. OTOH, if it's important to see the intermediate states (to see every update, not just the current state), the DBMS is perfectly capable of storing and providing all versions and distinguishing them with a timestamp or serial number.
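As a rough sketch of such a notify server (plain Python sockets, standard library only; the port number is arbitrary), which simply repeats every notification it receives to all other connected clients:

import socket
import threading

clients = []                  # currently connected client sockets
lock = threading.Lock()

def handle(conn):
    try:
        while True:
            data = conn.recv(1024)            # an update notification
            if not data:                      # client disconnected
                break
            with lock:
                others = [c for c in clients if c is not conn]
            for other in others:
                try:
                    other.sendall(data)       # repeat it to everyone else
                except OSError:
                    pass                      # ignore sockets that already died
    finally:
        with lock:
            clients.remove(conn)
        conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 9000))
server.listen()
while True:
    conn, _ = server.accept()
    with lock:
        clients.append(conn)
    threading.Thread(target=handle, args=(conn,), daemon=True).start()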
If you don't want to use TCP sockets directly, you might prefer ZeroMQ.
In this design, each client has three connections: the DBMS, the read-notify socket, and (maybe) the server-notify socket. The server has N+1 connections, for N clients and one listening socket. You have no locks to implement, very little tracking of participation, no re-synchronization problems, and only short windows of inconsistency among clients as each one acts on its notification.

Proper way to distinguish between multiple services using zeroconf

I'm writing a piece of software that will run on computers as well as phones.
The service uses an HTTP API for communication and will be published over the local network using Zeroconf.
Initially I published my service using _http._tcp. as the service type but I quickly discovered that both my NAS and my music receiver(!) also broadcasts themselves with that exact service type.
So the question now arises how to differentiate between my service and other services that are using HTTP.
Alternatives
Using a different service type
This is certainly the easiest way, and it (almost) guarantees that no other services will be picked up.
However, according to Apple[1], new service types should be registered with IANA. This is obviously not required, but seeing as they recommend it, skipping registration feels like the wrong way to do it.
Using the TXT record
Apple[2] describes the TXT record like this:
When a service is registered, three related DNS records are created: a service (SRV) record, a pointer (PTR) record, and a text (TXT) record. The TXT record contains additional data needed to resolve or use the service, although it is also often empty.
This certainly feels like it could be the right way to do it, but I'm still not sure, and it's hard to find a description of what the field should contain.
My first thought would be to put in something like <service_name>-<version>, which would then be parsed to see which service it actually is.
My NAS seems to use this for identifying model and version numbers.
Try talking to the service
After finding a service one could always perform a HEAD request on a known endpoint and look for a known header set by the service.
This feels like a fairly slow approach and who knows what making a HEAD request to my receiver will do.
And just to be clear, this question has nothing to do with a specific language or framework, it's about the concepts of zeroconf.
I could show some code but I don't see how that would help.
First, does the service you're advertising actually meet the qualifications for _http as defined by RFC 2782? Specifically, is it not just using HTTP as a transport, but also:
can be displayed by "typical" web browser client software, and
is intended primarily to be viewed by a human user.
If not, register your own service type (there are a couple of other services that use HTTP as a transport but don't meet those qualifications, so they have -http as a suffix on the service name; see pgpkey-http, senteo-http, xul-http).
If so, there are a couple of ways to go, depending on how strict your interpretation of the RFC is. The least strict is just adding a TXT record, as you've already noted in your question. iTunes, for example, registers itself with a TXT record in the format iTSh Version=196618.
If you're feeling a little stricter, the RFC only explicitly states that the u=, p= and path= TXT keys exist for HTTP. Perhaps someone can chime in on this, but I haven't seen much discussion on whether adding TXT records to already existing service types is frowned upon or not. With that in mind, the other way is to use an algorithmic instance name, for example adding the suffix "-NicklasAService" to the device name. That hopefully gives it a unique name on the local network while still making the service easy to pick out from the PTR records just by looking for the suffix.
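For example, a registration sketch using the python-zeroconf package (an assumption on my part; any mDNS library exposes the same concepts), combining a TXT record with a recognisable instance-name suffix:

import socket
from zeroconf import Zeroconf, ServiceInfo

info = ServiceInfo(
    "_http._tcp.local.",
    "office-nas-NicklasAService._http._tcp.local.",  # suffix marks it as ours
    addresses=[socket.inet_aton("192.168.1.50")],    # hypothetical address
    port=8080,
    properties={"service": "NicklasAService", "version": "1.2"},  # TXT record
)

zc = Zeroconf()
zc.register_service(info)
# Browsers can now filter on the instance-name suffix and/or the TXT keys.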

How to limit the effect of client modifications to production systems

Our shop has developed a few web/SMS/DB solutions for a dozen client installations. The applications have some real-time performance requirements, and they are just good enough to function properly. The problem is that the clients (owners of the production servers) are using the same server/database for customizations that are causing problems with the performance of the applications that we created and deployed.
A few examples of clients' customizations:
Adding large tables with many text datatypes for the columns that get cast to other data types in the queries
No primary keys, indexes, or FK constraints
Use of external scripts that run count(*) from table where id = x in a loop, to determine how to construct more queries later in the same script (no bulk operations that the planner can optimize, and nothing done in a single pass)
All new code files on the server are created/owned by root, with 0777 permissions
The clients don't take suggestions/criticism well. If we just go ahead and try to port/change the scripts ourselves, the old code can come back, clobbering any changes that we make! Or, with our limited knowledge of their use cases, we break functionality while trying to optimize their changes.
My question is this: how can we limit the resources available to queries/applications other than the ones we create and deploy? Are there any pragmatic options in scenarios like this? We prided ourselves on having an OSS solution, but it seems to have become a liability.
We use PG 8.3 running on a range of Linux distros. The clients prefer PHP, but shell scripts, Perl, Python, and PL/pgSQL are all used on the system in one form or another.
This problem started about two minutes after the first client was given full access to the first computer, and it hasn't gone away since. Anyone whose priority is getting business-oriented work done quickly will be sloppy about it and screw things up for everyone. That's just how things work, because proper design and implementation are harder than cheap hacks. You're not going to solve this problem; all you can do is figure out how to make it easier for the client to work with you than against you. If you do it right, it will look like excellent service rather than nagging.
First off, the database side. There's no way to control per-query resource usage in PostgreSQL. The main difficulty is that tools like "nice" control CPU usage, but if the database doesn't fit in RAM it may very well be I/O usage that is killing you. See this developer message summarizing the issues here.
Now, if in fact it's CPU the clients are burning through, you can use two techniques to improve that situation:
Install a C function that changes the process priority (example 1, example 2) and make sure whenever they run something it gets called first (maybe put it into their psql config file, there are other ways).
Write a script that looks for the backend processes serving their userid and renices them; run it often from cron or as a daemon (see the sketch below).
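A rough sketch of that second technique (assuming psycopg2, and PG 8.3 column names, where the backend pid column in pg_stat_activity is still called procpid; the client's database role name is hypothetical):

import subprocess
import psycopg2

CLIENT_DB_USER = "clientuser"               # the role the client's apps connect as

conn = psycopg2.connect("dbname=postgres")  # connect as a superuser
cur = conn.cursor()
cur.execute("SELECT procpid FROM pg_stat_activity WHERE usename = %s",
            (CLIENT_DB_USER,))
for (pid,) in cur.fetchall():
    # +10 lowers scheduling priority; as noted above this only helps when
    # the contention is CPU rather than I/O.
    subprocess.run(["renice", "+10", "-p", str(pid)])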
It sounds like your problem isn't the particular query processes they're running, but rather other modifications they're making to the larger structure. There's only one way to cope with that: you have to treat the client like they're an intruder and use the approaches of that portion of the computer security field to detect when they screw things up. Seriously! Install an intrusion detection system like Tripwire on the server (there are better tools, that's just the classic example), and have it alert you when they touch anything. New file that's 0777? Should jump right out of a proper IDS report.
On the database side, you can't usefully detect the database being modified in a direct way. You should dump the schema every day into a file (pg_dumpall -g and pg_dump -s), then diff that against the last one you delivered, and again have it alert you when something has changed. If you manage this well, the contact with the client turns into "we noticed you changed something on the server... what is it you're trying to accomplish with that?", which makes you look like you're really paying attention to them. That can turn into a sales opportunity, and they may stop fiddling with things as much just knowing you're going to catch it immediately.
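A minimal sketch of that daily check (paths and the database name are hypothetical; in practice you would mail the diff to yourself rather than print it):

import datetime
import pathlib
import subprocess

snap_dir = pathlib.Path("/var/lib/schema-snapshots")
snap_dir.mkdir(parents=True, exist_ok=True)
today = snap_dir / f"schema-{datetime.date.today()}.sql"

with open(today, "w") as out:
    subprocess.run(["pg_dump", "-s", "clientdb"], stdout=out, check=True)

previous = sorted(snap_dir.glob("schema-*.sql"))[-2:-1]  # yesterday's dump, if any
if previous:
    diff = subprocess.run(["diff", "-u", str(previous[0]), str(today)],
                          capture_output=True, text=True)
    if diff.stdout:
        print("Schema changed since the last delivery:\n" + diff.stdout)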
The other thing you should start doing immediately is to install as much version control software as you can on each client box. You should be able to log in to each system, run the appropriate status/diff tool for the installation, and see what's changed. Get that mailed to you regularly too. Again, this works best if combined with something that dumps the schema as a component of what it manages. Not enough people use serious version control approaches for the code that lives in the database.
That's the main set of technical approaches useful here. The rest of what you've got is a classic consulting client management problem that's far more of a people problem than a computer one. Cheer up, it could be worse--FSM help you if you give them ODBC access and they discover they can write their own queries in Access or something simple like that.