In a MongoDB production setup, if the value of a key is empty or not provided (it's optional), should I use an empty string or null as the value?
1) Are there any pros and cons to using an empty string vs. null?
2) Are there any pros and cons to setting the value to undefined (which removes the property from the existing doc) vs. leaving the property set to an empty string or null?
Thanks
I think the best approach is undefined, i.e. not including the key at all. Mongo doesn't work like SQL, where every column must hold at least a null. If you don't have a value, simply don't include the key. Then a query for all documents where this key doesn't exist will work correctly; otherwise it won't. Also, if you don't store the key, you save a little bit of disk space. So this is the correct way in Mongo. In Mongoose you can strip empty values with a setter:
// Mongoose setter: map null (and undefined) to undefined so the key
// is dropped from the document instead of being stored
function deleteEmpty(v) {
  if (v == null) { // loose equality: matches both null and undefined
    return undefined;
  }
  return v;
}

var UserSchema = new Schema({
  email: { type: String, set: deleteEmpty }
});
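For illustration, here is how such queries might look in the mongo shell (a minimal sketch; the users collection name is made up):

// documents where the key was never stored at all
db.users.find({ email: { $exists: false } });

// documents where the key is present but holds an empty string
db.users.find({ email: "" });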
I would say that null indicates the absence of a value, while an empty string indicates that the value is there, but empty.
When reading the data back you can then distinguish between blank values and non-existent values.
Still, it depends on your use case.
This question has been answered at least 4 times by me and a Google search will get you a lot of information.
You must take into consideration what removing the key means. If your document will eventually use that key in most of its defined state within the application, then you could see a lot of document movement on disk, which negates the benefit of not having those keys: space. The couple of bytes you save will be rendered useless and you will get a swiss-cheese (fragmentation) effect.
However, if you do not use those fields at all, then a few extra bytes on each of millions of documents in your working set could cause real problems that need not be there (if you for some reason want to shove that many documents into your working set). As for the space issue itself: MongoDB fundamentally has a space issue, and I have never known omitting a couple of keys to do much to help it.
Debugging MongoDB mapreduce is painful, so I'm not 100% sure I understand what's going on here, but I think I get the general idea...
The error message I'm getting is this: mr failed, removing collectionCannotCreateIndex: namespace name generated from index name "my_dbname.tmp.mr.collectionname_69.$_id.aggregation_method_1__id.date_key.start_1__id.date_key.timeres_1__id.region.center_2dsphere" is too long (127 byte max)
The key I'm using for mapreduce is a complex object with four or five properties, so I'm guessing what's happening is that when Mongo tries to create its temporary output collections using my specified key, it tries to auto-create an index on that complex key; but since the key itself has several properties, the default name for that index is too long. When I index complex objects like this under "normal" circumstances, I just give the index a custom name. But I don't see a way to do that for the collections mapreduce generates automatically.
Is there a simple way to fix this without changing my key structure?
Well, turns out I was tricked by the error message! The <collectionname> in the error message referenced above is the name of the INPUT collection whose records I'm processing with mapreduce... but the index it refers to is part of the OUTPUT collection! So I just had to give the index in the output collection an explicit name, and voilà, problem solved. Weird behavior.
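For reference, giving an index an explicit short name looks like this in the mongo shell (a sketch; the collection name is made up, and the fields are taken from the error message above):

db.myOutputCollection.createIndex(
    { "_id.date_key.start": 1, "_id.region.center": "2dsphere" },
    { name: "mr_output_idx" }
);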
So here is the problem: I'm using MongoDB in my project, so I have 24-character ObjectIds that use only the hexadecimal alphabet. In this project I make an HTTP request to a provider, and in that request I need to put a unique id for callback purposes, but the provider allows only 20 characters for this id, and I don't know why.
So, my question is: with a 16-character alphabet (hex), there are 16^24 possible Mongo ids, right?
Supposing I use an id based on 64 different characters ([0-9][a-z][A-Z]-_) in the HTTP request, correct me if I'm wrong, but I think there are 64^20 possible ids.
So technically, it is possible to encode every possible MongoDB ObjectId as a corresponding 20-character id, isn't it?
It seems like a classic Base64 encoding, but mysteriously this does not work as I expected. I think I didn't understand how Base64 encoding works, because the generated strings are bigger than the original strings...
Do you think all of this is even possible, or did I totally miss something?
Thanks in advance!
EDIT:
One of my colleague tried something which seems to work.
Here is the Java code :
import org.apache.commons.codec.binary.Base64;  // Apache Commons Codec
import org.apache.commons.codec.binary.Hex;

byte[] decodedHex = Hex.decodeHex("53884594e4b0695f366f8128".toCharArray()); // 24 hex chars -> 12 raw bytes
byte[] encodedHexB64 = Base64.encodeBase64(decodedHex); // 12 bytes -> 16 Base64 chars
System.out.println(new String(encodedHexB64)); // --> U4hFlOSwaV82b4Eo
For a reason that escapes me, doing this is not the same:
String anotherB64 = Base64.encodeBase64String("53884594e4b0695f366f8128".getBytes()); // encodes the 24 ASCII bytes of the string itself
System.out.println(anotherB64);
And it prints: NTM4ODQ1OTRlNGIwNjk1ZjM2NmY4MTI4
MongoDB uses ObjectId as the default primary key for documents because it's fast to generate and very likely to be unique.
But you are not forced to use it as the primary key. You can use any BSON data type in the _id field as long as it is not an array. That being said, you could store your 20-char id in the _id field.
EDIT:
From your original question I didn't know that you're using an existing DB. The _id field is immutable and cannot be changed in an existing document.
If you only wanted to convert the existing ObjectId to something else that's at most 20 chars long, the method you posted will work.
The second method produces a longer string because it Base64-encodes the 24 ASCII characters of the hex string itself (24 bytes encode to 32 Base64 characters), whereas your colleague's method first decodes the hex back into the underlying 12 bytes (12 bytes encode to just 16 Base64 characters, comfortably under the 20-character limit).
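If it helps, the reverse mapping (from the shortened callback id back to the ObjectId hex) is just the mirror image. A minimal sketch, again assuming Apache Commons Codec; the class and method names are invented, and the URL-safe Base64 variant is used so the output stays within the [0-9][a-z][A-Z]-_ alphabet from the question:

import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.codec.binary.Hex;

public class ObjectIdShortener {

    // 24 hex chars -> 12 raw bytes -> 16 URL-safe Base64 chars
    static String shorten(String hexId) throws DecoderException {
        byte[] raw = Hex.decodeHex(hexId.toCharArray());
        return Base64.encodeBase64URLSafeString(raw);
    }

    // 16 Base64 chars -> 12 raw bytes -> 24 hex chars
    static String restore(String shortId) {
        byte[] raw = Base64.decodeBase64(shortId);
        return Hex.encodeHexString(raw);
    }

    public static void main(String[] args) throws DecoderException {
        String shortId = shorten("53884594e4b0695f366f8128");
        System.out.println(shortId);          // U4hFlOSwaV82b4Eo
        System.out.println(restore(shortId)); // 53884594e4b0695f366f8128
    }
}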
I feel like I'm missing something really basic here...
I am trying to merge two datasets in Stata, FranceSQ.dta and FranceHQ.dta. They both have a variable that I created named "uid" that uniquely identifies the observations.
use FranceSQ, clear
merge 1:1 uid using FranceHQ, gen(_merge) keep(match)
Now what's confusing me is that it tells me uid doesn't uniquely identify my observations. What I realized is happening is that when I open FranceSQ everything looks normal, and when I look at my uid variable I see the following values...
25010201
25010202
25010203
...
But then once I try to run the merge, it changes all of my values, so that I see...
2.50101e+10
2.50101e+10
2.50101e+10
...
Any help would be very appreciated...I'm sure there's a simple answer but it's eluding me at the moment.
*** EDIT ***
So Nick's advice helped, thanks! Here is what I was doing that went wrong; I wonder if someone could point out why it didn't work.
1) I created the uid variable in each dataset by concatenating two numeric variables, which cast the uid variable as a string.
2) I ran destring on the whole dataset (because there were a lot of incorrectly cast variables), which turned uid into a double.
3) Then I recast uid as a string. It was at this point that I was unable to do the initial merge. I noticed that the value all of my observations were being changed to was the last value in the dataset.
4) Just because I was tweaking around, I recast the uid variable as double, and got the same results.
I finally got it to work by starting over and never recasting the uid variable as a string in the first place, but I'm still at a loss as to why my previous attempts did not work, or how the merge command decided to change my values.
Very likely, this is a problem with precision. Long integers need to be held in long or double data types. You might need to recast one identifier before the merge.
You should check by looking at the results of describe whether uid has the same data type in both datasets.
To check whether your variable really identifies the observations, type isid uid. Stata will complain anyway if uid is not a unique identifier when performing the merge, but it's a useful check on its own. If uid passes the check in both files, it should still pass in the merged file; so it must be failing in at least one of the source files in order to fail after the merge.
On top of Nick Cox's answer concerning data types, the issue may simply be display formatting. Type describe uid to find out what the current format is, and perhaps format uid %12.0f to get rid of the scientific notation.
I think Stata promotes variables to a more accurate type when it needs to, say when you replace an integer-valued variable with non-integer values; the same thing should happen with merge when you have, say, byte values in one data set and merge in float values for the same variable from the other data set.
Missing values in uid may also be why Stata does not believe this variable identifies observations. Check for these too, before and after the merge (see help data types, referenced above, for the valid ranges of each type).
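Putting these checks together, a minimal Stata sketch (using the dataset and variable names from the question; the format line only applies if uid is numeric):

use FranceSQ, clear
describe uid              // confirm the storage type (string vs long/double)
format uid %12.0f         // show plain digits instead of scientific notation
isid uid                  // fails loudly if uid does not uniquely identify observations
count if missing(uid)     // look for missing values

merge 1:1 uid using FranceHQ, keep(match)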
This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me as funny that my view was being dictated by my database model, so I decided to convert the types stored as strings in my database into integers, and enumerate the possible types in my application with constants that map them to their display names. That way, if I ever get the urge to change a category name again, I can do it by changing a single constant. I also have a hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question: what Postgres commands could I run to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based on it?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name
set content_type = (case when old_content_type = 'a' then 1
                         when old_content_type = 'b' then 2
                         else 3 end);
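The surrounding DDL for the rename-and-add approach described in the question might look like this (a sketch; table_name and the type values are placeholders):

alter table table_name rename column content_type to old_content_type;
alter table table_name add column content_type integer;
-- backfill with the update above, then:
alter table table_name drop column old_content_type;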
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
Ideally you'd have these fields refer to a table containing the definitions of each type, via a foreign key constraint. That way you know your database is clean and holds no invalid values (i.e. you get referential integrity).
There are many ways to handle this:
Having a lookup table for each field that can take a limited set of values (i.e. like an enum) is the most obvious approach (a sketch follows this list), but it breaks down when a table requires many such attributes.
You can use the entity-attribute-value (EAV) model, but beware that it is easy to abuse and causes problems as things grow.
You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between options 1 and 2.
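A sketch of option 1, the lookup table enforced by a foreign key (all names are made up):

create table content_types (
    id   integer primary key,
    name text not null unique   -- display names live here, editable in one place
);

alter table table_name
    add constraint content_type_fk
    foreign key (content_type) references content_types (id);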
I'm just trying to get a grip on when you would need to use a hash and when it might be better to use an array. What kind of real-world object would a hash represent, say, in the case of strings?
I believe sometimes a hash is referred to as a "dictionary", and I think that's a good example in itself. If you want to look up the definition of a word, it's nice to just do something like:
definition['pernicious']
Instead of trying to figure out the correct numeric index that the definition would be stored at.
This answer assumes that by "hash" you're basically just referring to an associative array.
I think you're looking at things from the wrong direction. It is not the object that determines whether you should use a hash, but the manner in which you access it. A common use of a hash is a lookup table. If your objects are strings and you want to check whether they exist in a dictionary, looking them up will (assuming the hash works properly) be O(1). With a sorted array and binary search, the time would instead be O(log n), which may not be acceptable.
Thus, hashes are ideal for use with Dictionaries (hashmaps), sets (hashsets), etc.
They are also a useful way of representing an object without storing the object itself (for example, passwords).
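For instance, PHP's built-in password functions store only a hash of the password, never the password itself (a minimal sketch):

$hash = password_hash('s3cret', PASSWORD_DEFAULT); // store $hash, not the password
var_dump(password_verify('s3cret', $hash));        // later: check a login attempt, prints bool(true)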
The phone book - key = name, value = phone number.
I also think of the old World Book Encyclopedias (actual books). Each article is "hashed" into a single book (cat goes in the "C" volume).
Any time you have data that is well served by a 1-to-1 map.
For example, grades in a class:
"John Smith" => "B+"
"Jacob Jenkens" => "C"
etc
In general hashes are used to find things fast: a hash map can be used to associate one thing with another quickly, and a hash set just stores things "fast".
Please also consider the complexity and cost of the hash function when deciding between a hash container and an ordinary comparison-based (less-than) container: the extra size of the hash value, the time needed to compute a "perfect" hash, and the time needed for a one-to-one comparison at the end in case of a hash collision may in fact add up to more than just walking a tree structure of logarithmic depth using less-than comparisons.
When you need to associate one variable with another. There isn't a "type limit" to what can be a key/value in a hash.
Hashes have many uses. Aside from cryptographic uses, they are commonly used for quick lookups of information. To get similarly quick lookups using an array you would need to keep the array sorted and then use binary search. With a hash you get the fast lookup without having to sort. This is why most scripting languages implement hashing under one name or another (dictionaries, et al.).
I use one often for a "dictionary" of settings for my app.
Setting | Value
I load them from the database or config file, into hashtable for use by my app.
Works well, and is simple.
One example could be zip code associated with an area, city or any postal address.
A good example is a cache with lots of elements in it. You have some identifier by which you want to look up a value (say a URL, and you want to find the corresponding cached web page). You want these lookups to be as fast as possible and don't want to search through all the stored pages every time some URL is requested. A hash table is a great data structure for a problem like this.
One real-world example I just wrote is adding up the amounts people spent on meals when filing expense reports. I needed a daily total with no idea how many items would exist on a particular day and no idea what the date range of the expense report would be. There are restrictions on how much a person can expense, with many variables (what city, weekend, etc...).
The hash table was the perfect tool for this. The key was the date, the value was the receipt amount (converted to USD). The receipts could come in any order; I just kept fetching the value for that date and adding to it until the job was done. Displaying the totals was easy as well; something like the sketch below.
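A minimal PHP sketch of that pattern (the receipt fields are invented):

// sum receipt amounts per day; receipts can arrive in any order
$totals = array();
foreach ($receipts as $r) { // $r = array('date' => ..., 'usd' => ...)
    if (!isset($totals[$r['date']])) {
        $totals[$r['date']] = 0;
    }
    $totals[$r['date']] += $r['usd'];
}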
(php code)
$david = new stdClass();
$david->name  = "david";
$david->age   = 12;
$david->id    = 1;
$david->title = "manager";

$joe = new stdClass();
$joe->name  = "joe";
$joe->age   = 17;
$joe->id    = 2;
$joe->title = "employee";

$users = array();

// option 1: store users by numeric index
$users[] = $david;
$users[] = $joe;

// option 2: key users by title (a hash)
$users[$david->title] = $david;
$users[$joe->title] = $joe;
Now the question: who is the manager?
Answer:
$users["manager"]