How to create a huge set of random documents in MongoDB

I am a newbie with MongoDB. I'm trying to create a database that will contain 10,000 documents, each with a "username" and "Birthday" field.
I want to create 10,000 documents with random usernames and birthdays. Is there a fast way to create this kind of database?
Thank you so much for your help!

Here are some functions that will help you create a random name and a random date between 1950 and 2000, and insert the documents into MongoDB.
function getRandomInt(min, max) {
    return Math.floor(Math.random() * (max - min + 1)) + min;
}

function getRandomDate() {
    // approximate number of days between 1970 and 2000: 30 years * 365 days
    var nr_days1 = 30 * 365;
    // approximate number of days between 1950 and 1970: 20 years * 365 days
    var nr_days2 = -20 * 365;
    // milliseconds in one day
    var one_day = 1000 * 60 * 60 * 24;
    // get a random number of days passed between 1950 and 2000
    var days = getRandomInt(nr_days2, nr_days1);
    return new Date(days * one_day);
}

for (var i = 1; i <= 10000; i++) {
    db.test.insert({
        name: "name" + i,
        birthday: getRandomDate()
    });
}
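If insert speed is the main concern, 10,000 single insert() calls are relatively slow; batching the writes is usually much faster. A minimal sketch of the same loop using insertMany (assuming MongoDB 3.2+ where insertMany is available):
var batch = [];
for (var i = 1; i <= 10000; i++) {
    batch.push({ name: "name" + i, birthday: getRandomDate() });
    // flush every 1000 documents so each batch stays a reasonable size
    if (batch.length === 1000) {
        db.test.insertMany(batch);
        batch = [];
    }
}
if (batch.length > 0) {
    db.test.insertMany(batch);
}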

The best approach would be to read the MongoDB docs on generating test data:
https://docs.mongodb.com/v2.6/tutorial/generate-test-data/
You could also use a dedicated service to generate random data, for example:
https://www.mockaroo.com/

I have tried mgeneratejs and found it very easy to use.
Here is a sample command: mgeneratejs prints data to stdout, and mongoimport then imports that data into mongod:
mongodb-osx-x86_64-4.0.1 $ mgeneratejs '{"name": "$name", "age": "$age", "emails": {"$array": {"of": "$email", "number": 3}}}' -n 5 | mongoimport --uri mongodb://localhost:27017/test --collection user --mode insert
2018-08-09T16:19:13.295+0800 connected to: localhost
2018-08-09T16:19:14.544+0800 imported 5 documents

Related

Calculating moving average for every 5 seconds in MongoDB

I want to calculate a moving average for my data in MongoDB. My data structure is as below:
{
"_id" : NUUID("54ab1171-9c72-57bc-ba20-0a06b4f858b3"),
"DateTime" : ISODate("2018-05-30T21:31:05.957Z"),
"Type" : 3,
"Value" : NumberDecimal("15.905414991993847")
}
I want to calculate the average of the values for each Type over 2 days, in 5-second windows. For now I put Type in the $match stage, but I would prefer to group the results by Type and get separate results per Type. Here is what I did:
var start = new Date("2018-05-30T21:31:05.957Z");
var end = new Date("2018-06-01T21:31:05.957Z");
var arr = new Array();
for (var i = 0; i < 34560; i++) {
    start.setSeconds(start.getSeconds() + 5);
    if (start <= end) {
        var a = new Date(start);
        arr.push(a);
    }
}
db.Data.aggregate([
    { $match: { "DateTime": { $gte: new Date("2018-05-30T21:31:05.957Z"),
                              $lte: new Date("2018-06-01T21:31:05.957Z") }, "Type": 3 } },
    { $bucket: {
        groupBy: "$DateTime",
        boundaries: arr,
        default: "Other",
        output: {
            "count": { $sum: 1 },
            "Value": { $avg: "$Value" }
        }
    }}
])
It seems to be working, but the performance is too slow. How can I make this faster?
I reproduced the behavior you describe with 2 days' worth of 1-second observations in the DB and a $match that pulls just one day's worth. The agg works "fine" if you bucket by, say, 60 seconds. But bucketing by 15 seconds took 6 times as long, about 30 seconds. And every 5 seconds? 144 seconds. A 5-second bucket size yields an array of 17,280 buckets. Yep.
So I went client-side, dragged all 43,200 docs to the client, and wrote a naive linear-search bucket-slot finder and calculation in JavaScript.
c = db.foo.aggregate([
    { $match: { "date": { $gte: new Date(osv), $lte: new Date(endv) } } }
]);
c.forEach(function(r) {
    var x = findSlot(arr, r['date']);
    if (buckets[x] == undefined) {
        buckets[x] = { lb: arr[x], ub: arr[x + 1], n: 0, v: 0 };
    }
    var zz = buckets[x];
    zz['n']++;
    zz['v'] += r['val'];
});
This actually worked somewhat faster, but on the same order of performance: about 92 seconds.
Next, I changed the linear search in findSlot to a bisection search. The 5-second bucketing went from 144 seconds to 0.750 seconds: almost 200x faster. This includes dragging the 43,200 records to the client and running the forEach and bucketing logic above. So it stands to reason that $bucket may not be using a great algorithm and suffers when the bucket array is more than a couple of hundred entries long.
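The findSlot helper isn't shown above; as a rough illustration, a bisection version might look like this (a sketch, assuming arr is the sorted array of bucket boundary dates built in the question):
// Hypothetical bisection slot finder: returns i such that arr[i] <= d < arr[i+1],
// assuming arr is sorted ascending.
function findSlot(arr, d) {
    var lo = 0, hi = arr.length - 1;
    while (hi - lo > 1) {
        var mid = Math.floor((lo + hi) / 2);
        if (arr[mid] <= d) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;
}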
Acknowledging this, we can instead make use of $floor of the delta between the start time and the observation time to bucket the data:
db.foo.aggregate([
{$match:{"date":{$gte:now, $lte:new Date(endv) }}}
// Bucket by turning offset from "now" into floor divided by the number
// of seconds of grouping. In this way, the resulting number becomes the
// slot into the virtual buckets, e.g.:
// date now diff/1000 floor # 5 seconds:
// 1514764800000 1514764800000 0 0
// 1514764802000 1514764800000 2 0
// 1514764804000 1514764800000 4 0
// 1514764806000 1514764800000 6 1
// 1514764808000 1514764800000 8 1
// 1514764810000 1514764800000 10 2
,{$addFields: {"ff": {$floor: {$divide: [ {$divide: [ {$subtract: [ "$date", now ]}, 1000.0 ]}, secondsBucket ] }} }}
// Now just group by the numeric slot number!
,{$group: {_id: "$ff", n: {$sum:1}, tot: {$sum: "$val"}, avg: {$avg: "$val"}} }
// Get it in 0-n order....
,{$sort: {_id: 1}}
]);
found 17280 in 204 millis
So we now have a server-side solution that runs in just 0.204 seconds, or about 700x faster. And you don't have to sort the input, because $group will take care of bundling the slot numbers. The $sort after the $group is optional (but sort of handy...).

How to update collection and increment hours for ISO date

I have an ISO date in my collection documents.
"start" : ISODate("2015-07-25T17:35:00Z"),
"end" : ISODate("2015-09-01T23:59:00Z"),
Currently they are in GMT+0; I need them to be GMT+8. Therefore I need to add 8 hours to the existing fields. How do I do this via a MongoDB query?
Advice appreciated.
Updated Code Snippet
var offset = 8,
    bulk = db.collection.initializeUnorderedBulkOp(),
    count = 0;

db.collection.find().forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": { "startDateTime": new Date(
            doc.startDateTime.valueOf() + ( 1000 * 60 * 60 * offset )
        ) }
    });
    count++;
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeUnorderedBulkOp();
    }
});
if ( count % 1000 != 0 )
    bulk.execute();
I agree wholeheartedly with the answer provided by Ewan here in that you really should keep all times in the database in UTC, and all the sentiments there are correct. I'm only adding to this with practical examples.
As a working example, let's say I have two people using the data, one in New York and one in Sydney, in UTC-5 and UTC+10 respectively. Now consider the following data:
{ "date": ISODate("2015-08-01T04:40:03.389Z") }
Based on that, this is the time the actual "event" takes place. From the perspective of the user in Sydney the event takes place on the 1st of August as a whole day, whereas to the person in New York it is still occurring on the 31st of July.
If however I construct a "localized" time for Sydney as follows, the UTC consideration is still correct:
new Date("2015/08/01")
ISODate("2015-07-31T14:00:00Z")
This enforces the time difference like it should, by converting from the local timezone to UTC; a localized date will therefore select the correct values in UTC. So the Sydney user's perspective of the start of the 1st of August includes all times from 2pm on the 31st of July, and the end date of a range selection is adjusted similarly. With the data in UTC, this assertion from the client end is correct, and from their perspective the selected data was in the expected range.
In the case where you are "aggregating" results for a given day, you build the "time difference" math into the expression. So for UTC+10 you would do:
var offset = 10;
db.collection.aggregate([
    { "$group": {
        "_id": {
            "$subtract": [
                { "$add": [
                    { "$subtract": [ "$date", new Date(0) ] },
                    offset * 1000 * 60 * 60
                ]},
                { "$mod": [
                    { "$add": [
                        { "$subtract": [ "$date", new Date(0) ] },
                        offset * 1000 * 60 * 60
                    ]},
                    1000 * 60 * 60 * 24
                ]}
            ]
        },
        "count": { "$sum": 1 }
    }}
])
This then takes the "offset" for the locale into consideration when reporting the "dates" back from the perspective of the client viewing the data. So anything that occurred on an "adjusted date" falling on a different day, such as late on the 31st of July shifting into the 1st of August, would be aggregated into the correct grouping by this adjustment.
The fact that your data may very well be used from the perspective of people in different timezones is exactly the reason why you should keep dates in UTC format. The client will do the work, or you can adjust accordingly where needed.
In short:
Client: Construct in local time, send in UTC
Server: Provide TZ Offset and adjust from UTC to local on return
Leave your dates in the correct format they are already in and use the methods described here to report on them.
But if you made a mistake
If, however, you made a mistake in the construction of your data and all times are actually "local" times but represented as UTC, i.e.:
ISODate("2015-08-01T11:10:43.569Z") // actually meant to be 11am in UTC+10 :(
Where it should be:
ISODate("2015-08-01T01:10:43.569Z") // This is 11am UTC+10 :)
Then you would correct this as follows:
var offset = 10,
    bulk = db.collection.initializeUnorderedBulkOp(),
    count = 0;

db.collection.find().forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": { "date": new Date(
            doc.date.valueOf() - ( 1000 * 60 * 60 * offset )
        ) }
    });
    count++;
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.collection.initializeUnorderedBulkOp();
    }
});
if ( count % 1000 != 0 )
    bulk.execute();
This reads each document to get the "date" value, adjusts it accordingly, and sends the updated date value back to the document.
By default MongoDB stores all DateTimes as UTC.
There are 2 ways of doing this:
App side (Recommended)
When extracting the start and end from the database, in your language of choice, simply convert them from UTC to a local datetime.
To have a look at a good example in Python, check out this answer
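For instance, a minimal JavaScript sketch of the app-side approach (an illustration only, assuming the document was read with the Node.js driver and you just want to render the stored UTC value at UTC+8):
// Sketch: shift a stored UTC Date by a fixed offset for display only.
// The value stored in MongoDB stays in UTC; only the rendered string changes.
function renderAtOffset(utcDate, offsetHours) {
    var shifted = new Date(utcDate.getTime() + offsetHours * 60 * 60 * 1000);
    return shifted.toISOString().replace("Z", " GMT+" + offsetHours);
}
// e.g. renderAtOffset(doc.start, 8) -> "2015-07-26T01:35:00.000 GMT+8"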
Database side (Not recommended)
The other option is to write a MongoDB query which adds 8 hours on to your start and end, as you originally wanted. However, this then stores a time that is labelled UTC but is really 8 hours in the future, which becomes illogical for other developers and when parsing app-side.
This requires updating based on another value in your document so you'll have to loop through each document as described here.

How to paginate and group in MongoDB?

My objects are of the following structure:
{id: 1234, ownerId: 1, typeId: 3456, date:...}
{id: 1235, ownerId: 1, typeId: 3456, date:...}
{id: 1236, ownerId: 1, typeId: 12, date:...}
I would like to query the database so that it returns all the items that belong to a given ownerId, but only the first item of a given typeId, i.e. the typeId field is unique in the results. I would also like to be able to use skip and limit.
In SQL the query would be something like:
SELECT * FROM table WHERE ownerId=1 SORT BY date GROUP BY typeId LIMIT 10 OFFSET 300
I currently have the following query (using pymongo) but it is giving me errors for using $sort, $limit and $skip:
search_dict['ownerId'] = 1
search_dict['$sort'] = {'date': -1}
search_dict['$limit'] = 10
search_dict['$skip'] = 200
collectionName.group(['typeId'], search_dict, {'list': []}, 'function(obj, prev) {prev.list.push(obj)}')
I have also tried the aggregation route, but as I understand it, grouping will touch all the items in the collection, group them, and only then limit and skip. This will be too computationally expensive and slow. I need an iterative grouping algorithm.
search_dict = {'ownerId': 1}
collectionName.aggregate([
    {'$match': search_dict},
    {'$sort': {'date': -1}},
    {'$group': {'_id': "$typeId"}},
    {'$skip': skip},
    {'$limit': 10}
])
Your aggregation looks correct. You need to include the fields you want in the output in the $group stage using $first; see the sketch below.
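For illustration, here is a mongo-shell sketch of such a pipeline (the same stages translate directly to pymongo's aggregate(); the field names and skip/limit values follow the question, and the index line is only a suggestion):
// Suggested compound index so the $match + $sort can be served from the index.
db.collectionName.createIndex({ ownerId: 1, date: -1 });
db.collectionName.aggregate([
    { $match: { ownerId: 1 } },
    { $sort: { date: -1 } },
    { $group: {
        _id: "$typeId",
        // $first picks the newest document per typeId because of the sort above
        id:   { $first: "$id" },
        date: { $first: "$date" }
    }},
    { $sort: { date: -1 } },   // order the groups themselves before paginating
    { $skip: 200 },
    { $limit: 10 }
])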
grouping will touch all the items in the collection, group them, and then limit and skip. This will be too computationally expensive and slow.
It won't touch all items in the collection. If the match + sort is indexed ({ "ownerId" : 1, "date" : -1 }), the index will be used for the match + sort, and the group will only process the documents that are the result of the match.
The constraint is hardly ever CPU, except in the case of an unindexed sort; it's usually disk I/O.
I need an iterative grouping algorithm.
What precisely do you mean by "iterative grouping"? The grouping is iterative, as it iterates over the result of the previous stage and checks which group each document belongs to!
I am not too sure how you got the idea that this operation should be computationally expensive. This isn't really true for most SQL databases, and it surely isn't for MongoDB. All you need to do is create an index over your sort criterion.
Here is how to prove it:
Open up a mongo shell and execute the following.
var bulk = db.speed.initializeOrderedBulkOp();
for (var i = 1; i <= 100000; i++) {
    bulk.insert({ field1: i, field2: i * i, date: new ISODate() });
    if ((i % 100) == 0) { print(i); }
}
bulk.execute();
The bulk execution may take some seconds. Next, we create a helper function:
Array.prototype.avg = function() {
    var av = 0;
    var cnt = 0;
    var len = this.length;
    for (var i = 0; i < len; i++) {
        var e = +this[i];
        if (!e && this[i] !== 0 && this[i] !== '0') e--;
        if (this[i] == e) { av += e; cnt++; }
    }
    return av / cnt;
}
The troupe is ready, the stage is set:
var times = new Array();
for (var i = 0; i < 10000; i++) {
    var start = new Date();
    db.speed.find().sort({ date: -1 }).skip(Math.random() * 100000).limit(10);
    times.push(new Date() - start);
}
print(times.avg() + " msecs");
The output is in msecs. This is the output of 5 runs for comparison:
0.1697 msecs
0.1441 msecs
0.1397 msecs
0.1682 msecs
0.1843 msecs
The test server runs inside a Docker image which in turn runs inside a VM (boot2docker) on my 2.13 GHz Intel Core 2 Duo with 4GB of RAM, running OSX 10.10.2, with a lot of Safari windows, iTunes, Mail, Spotify and Eclipse running additionally. Not quite a production system. And that collection does not even have an index on the date field. With the index, the averages of 5 runs look like this:
0.1399 msecs
0.1431 msecs
0.1339 msecs
0.1441 msecs
0.1767 msecs
qed, hth.

D3.js- How to format tick values as quarters instead of months

I have a set of data in quarters. Here is the array:
var dataGDP = [
{date: "Q1-2008", GDPreal: "2.8"},
{date: "Q2-2008", GDPreal: "0.6"},
{date: "Q3-2008", GDPreal: "-2.1"},
{date: "Q4-2008", GDPreal: "-4.3"},
{date: "Q1-2009", GDPreal: "-6.8"},
{date: "Q2-2009", GDPreal: "-6.3"},
{date: "Q3-2009", GDPreal: "-5"}
];
How do I get these dates to show up on my X axis like 1Q 2008, 2Q 2008, 3Q 2008, etc.? My X axis uses a time-based scale, and I'm not sure there is a way to parse these dates as they are now using d3.time.format. I can, however, parse them if I use months instead, like 01/2008, 04/2008..., by using: parseDate = d3.time.format("%m/%Y").parse;
Should I write my dates in the array as months and then write a function to convert the months into quarters? Or is there a way to keep the Q1 etc. in the array as it is now and parse the dates?
Here's how I solved this for my data.
Note that I took off 10 seconds from the date (x.getTime() - 10000) to account for the data seeing 3/31/2015 as midnight on 4/1/2015, which throws off the calculation. Depending on your data, you may or may not have to do this.
var xAxis = d3.svg.axis()
.scale( x )
.ticks( d3.time.months, 3 )
.tickFormat( function ( x ) {
// get the milliseconds since Epoch for the date
var milli = (x.getTime() - 10000);
// calculate new date 10 seconds earlier. Could be one second,
// but I like a little buffer for my neuroses
var vanilli = new Date(milli);
// calculate the month (0-11) based on the new date
var mon = vanilli.getMonth();
var yr = vanilli.getFullYear();
// return appropriate quarter for that month
if ( mon <= 2 ) {
return "Q1 " + yr;
} else if ( mon <= 5 ) {
return "Q2 " + yr;
} else if ( mon <= 8 ) {
return "Q3 " + yr;
} else {
return "Q4 " + yr;
}
} )
.orient( "bottom" );
D3 doesn't support quarters (neither parsing nor formatting them). Unless you explicitly need the time-related functionality, you could simply leave those values as they are and use an ordinal scale.
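A minimal sketch of that approach with D3 v3 (assuming the dataGDP array from the question and a width variable for the chart):
// Sketch (D3 v3): treat the quarter labels as plain categories on an ordinal scale.
var x = d3.scale.ordinal()
    .domain(dataGDP.map(function(d) { return d.date; })) // "Q1-2008", "Q2-2008", ...
    .rangeRoundBands([0, width], 0.1);

var xAxis = d3.svg.axis()
    .scale(x)
    .tickFormat(function(d) { return d.replace(/Q(\d)-(\d+)/, "$1Q $2"); }) // "Q1-2008" -> "1Q 2008"
    .orient("bottom");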
There's a good readme here of how to parse quarters manually but d3 doesn't automate the process.
Google group link
The quarter format was recently added
https://github.com/d3/d3-time-format/pull/58
%q yields an integer in the range [1,4]
To create a format function that yields 'Q1 2020', 'Q2 2020', etc., do the following:
import { utcFormat } from 'd3-time-format';
const quarterFormat = d => `Q${utcFormat('%q %Y')(d)}`;

Google Bookmark Export date format?

I've been working on parsing out bookmarks from an export file generated by Google Bookmarks. This file contains the following date attributes:
ADD_DATE="1231721701079000"
ADD_DATE="1227217588219000"
These are not standard unix style timestamps. Can someone point me in the right direction here? I'll be parsing them using c# if you are feeling like really helping me out.
Chrome uses a modified form of the Windows Time format ("Windows epoch") for its timestamps, both in the Bookmarks file and the history files. The Windows Time format is the number of 100-nanosecond intervals since January 1, 1601. The Chrome format is the number of microseconds since the same date, and is thus 1/10 as large.
To convert a Chrome timestamp to or from the Unix epoch, you must convert to seconds and compensate for the difference between the two base date-times (11644473600 seconds).
Here are the conversion formulas for Unix, JavaScript (Unix in milliseconds), Windows, and Chrome timestamps (you can rearrange the +/× and -/÷, but you'll lose a little precision):
u : Unix timestamp       eg: 1378615325
j : JavaScript timestamp eg: 1378615325177
c : Chrome timestamp     eg: 13023088925177000
w : Windows timestamp    eg: 130230889251770000
u = (j / 1000)
u = (c - 11644473600000000) / 1000000
u = (w - 116444736000000000) / 10000000
j = (u * 1000)
j = (c - 11644473600000000) / 1000
j = (w - 116444736000000000) / 10000
c = (u * 1000000) + 11644473600000000
c = (j * 1000) + 11644473600000000
c = (w / 10)
w = (u * 10000000) + 116444736000000000
w = (j * 10000) + 116444736000000000
w = (c * 10)
Note that these are pretty big numbers, so you’ll need to use 64-bit numbers or else handle them as strings like with PHP’s BC-math module.
In Javascript the code will look like this
function chromeDtToDate(st_dt) {
    var microseconds = parseInt(st_dt, 10);
    var millis = microseconds / 1000;
    var past = new Date(1601, 0, 1).getTime();
    return new Date(past + millis);
}
1231721701079000 looks suspiciously like time since Jan 1st, 1970 in microseconds.
perl -wle 'print scalar gmtime(1231721701079000/1_000_000)'
Mon Jan 12 00:55:01 2009
I'd make some bookmarks at known times and try it out to confirm.
Eureka! I remembered having read the ADD_DATE’s meaning at some website, but until today, I could not find it again.
http://MSDN.Microsoft.com/en-us/library/aa753582(v=vs.85).aspx
offers this explanation as a “Note” just before the heading “Exports and Imports”:
“Throughout this file[-]format definition, {date} is a decimal integer that represents the number of seconds elapsed since midnight January 1, 1970.”
Before that, examples of {date} were shown:
<DT><H3 FOLDED ADD_DATE="{date}">{title}</H3>
…
and
<DT>{title}
…
Someday, I will write a VBA macro to convert these to recognizable dates, but not today!
If someone else writes a conversion script first, please share it. Thanks.
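In the meantime, since the thread shows both 16-digit microsecond values (the question's older export) and 10-digit second values (newer exports, see below), here is a small hypothetical JavaScript helper, not part of the original answers, that handles either; the digit-length check is an assumption:
// Hypothetical helper: treat long values as microseconds since the Unix epoch,
// short values as seconds since the Unix epoch.
function addDateToDate(addDate) {
    var n = parseInt(addDate, 10);
    var millis = (String(addDate).length > 12) ? n / 1000 : n * 1000;
    return new Date(millis);
}
// addDateToDate("1231721701079000") -> Mon, 12 Jan 2009 00:55:01 GMT
// addDateToDate("1553220774")       -> Fri, 22 Mar 2019 02:12:54 GMT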
As of the newest Chrome version 73.0.3683.86 (Official Build) (64-bit):
When I export bookmarks, I get an HTML file like "bookmarks_3_22_19.html", and each item has an ADD_DATE attribute which contains a date string, like this:
<DT><A HREF="..." ADD_DATE="1553220774">Stack Overflow</A>
This timestamp is actually seconds (not microseconds) since Jan 1st, 1970. So we can parse it with Javascript like following code:
function ChromeTimeToDate(timestamp) {
    var seconds = parseInt(timestamp, 10);
    var dt = new Date();
    dt.setTime(seconds * 1000);
    return dt;
}
For the example link above, we can call ChromeTimeToDate('1553220774') to get the Date:
ChromeTimeToDate('1553220774')
Fri Mar 22 2019 10:12:54 GMT+0800 (Australian Western Standard Time)
Initially looking at it, it almost looks like if you chopped off the last 6 digits you'd get a reasonable Unix Date using the online converter
1231721701 = Mon, 12 Jan 2009 00:55:01 GMT
1227217588 = Thu, 20 Nov 2008 21:46:28 GMT
The extra 6 digits could be formatting related or some kind of extended attributes.
There is some sample code for the conversion of Unix Timestamps if that is in fact what it is.
look here for code samples: http://www.epochconverter.com/#code
// my groovy (java) code finally came out as:
def convertDate(def epoch)
{
    long dv = epoch / 1000; // convert microseconds to milliseconds
    String dt = new java.text.SimpleDateFormat("dd/MMM/yyyy HH:mm:ss").format(new java.util.Date(dv));
    // to get an epoch date:
    //long epoch = new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss").parse("01/01/1970 01:00:00").getTime() * 1000;
    return dt;
} // end of def
So a Firefox bookmark date exported as JSON gave me:
json.lastModified: 1366313580447014
convert from epoch date: 18/Apr/2013 21:33:00
from:
println "convert from epoch date:" + convertDate(json.lastModified)
function ConvertToDateTime(srcChromeBookmarkDate) {
    //Hp --> The base date which Google Chrome considers while adding bookmarks
    var baseDate = new Date(1601, 0, 1);
    //Hp --> Total number of seconds in a day.
    var totalSecondsPerDay = 86400;
    //Hp --> Read total number of days and seconds from source chrome bookmark date.
    var quotient = Math.floor(srcChromeBookmarkDate / 1000000);
    var totalNoOfDays = Math.floor(quotient / totalSecondsPerDay);
    var totalNoOfSeconds = quotient % totalSecondsPerDay;
    //Hp --> Add total number of days to base google chrome date.
    var targetDate = new Date(baseDate.setDate(baseDate.getDate() + totalNoOfDays));
    //Hp --> Add total number of seconds to target date.
    return new Date(targetDate.setSeconds(targetDate.getSeconds() + totalNoOfSeconds));
}

var myDate = ConvertToDateTime(13236951113528894);
alert(myDate);
//Thu Jun 18 2020 10:51:53 GMT+0100 (Irish Standard Time)
# Python program
import time
d = 1630352263  # for example, if ADD_DATE="1630352263"
print(time.ctime(d))  # prints: Mon Aug 30 22:37:43 2021