MongoDB query to CSV dump (mLab hosted MongoDB)

I am querying an already populated mLab MongoDB database, and I want to store the resulting documents in a single CSV file.
EDIT: the CSV output format I hope to get:
uniqueid status date
191b117fcf5c 0 2017-03-01 15:26:28.217000
191b117fcf5c 1 2017-03-01 18:26:28.217000
The MongoDB document format is:
{
    "_id": {
        "$oid": "58b6bcc00bd666355805a3ee"
    },
    "sensordata": {
        "operation": "chgstatus",
        "user": {
            "status": "1",
            "uniqueid": "191b117fcf5c"
        }
    },
    "created_date": {
        "date": "2017-03-01 17:51:17.216000"
    }
}
Database name: mparking_sensor
Collection name: demo
The Python code to query it is as follows:
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 01 18:55:18 2017
@author: Being_Rohit
"""
import pymongo

uri = 'mongodb://#####:*****@ds157529.mlab.com:57529/mparking_sensor'
client = pymongo.MongoClient(uri)
db = client.get_default_database().demo
print db

results = db.find()
f = open("mytest.csv", "w")
for record in results:
    query1 = (record["sensordata"]["user"], record["created_date"])
    print query1
print "done"
client.close()
EDIT: the output format of query1 I am getting is:
({u'status': u'0', u'uniqueid': u'191b117fcf5c'}, {u'date': u'2017-03-01 17:51:08.263000'})
Does someone know the correct way to dump this data into a .csv file (with pandas or any other means), or another approach that would suit further prediction-based analysis (e.g. linear regression) in the future?

Mongoexport will do the job for you. It can, uniquely among native MongoDB tools, export in CSV format, limited to a specific set of fields.
Your mongoexport command would be something like this:
mongoexport.exe \
--db mparking_sensor \
--collection demo \
--type=csv \
--fields sensordata.user.uniqueid,sensordata.user.status,created_date
That will export something like the following:
sensordata.user.uniqueid,sensordata.user.status,created_date
191b117fcf5c,0,2017-03-01T15:26:28.217000Z
191b117fcf5c,1,2017-03-01T18:26:28.217000Z
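If you want to stay in Python and end up with a DataFrame anyway (the question mentions pandas and later regression work), here is a minimal pymongo + pandas sketch; the URI placeholders, field names, and output filename are taken from the question above, so treat it as an illustration rather than a drop-in solution:

import pandas as pd
import pymongo

# Connection string as in the question, with credentials left as placeholders.
uri = 'mongodb://<user>:<password>@ds157529.mlab.com:57529/mparking_sensor'
client = pymongo.MongoClient(uri)
collection = client.get_default_database()['demo']

# Flatten each nested document into one flat row.
rows = []
for record in collection.find():
    user = record['sensordata']['user']
    rows.append({
        'uniqueid': user['uniqueid'],
        'status': user['status'],
        'date': record['created_date']['date'],
    })

# One row per document, with the three columns from the desired output.
df = pd.DataFrame(rows, columns=['uniqueid', 'status', 'date'])
df.to_csv('mytest.csv', index=False)
client.close()

Keeping the result in a DataFrame also gives you a convenient starting point for the regression analysis mentioned in the question.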

I was trying to export a collection to CSV using mLab's 'export collection' feature, but they make it harder than it needs to be. So I used https://studio3t.com and connected using the standard MongoDB URI.

Related

Running batch operations in MongoDB / Robo 3T

I have a list of users in a file and I want to update their records in a collection, i.e.
db.getCollection('users').update({username: "<a user>"}, { $set: { <set some values here> }})
How can I feed a list of users into this command or something similar in Robo 3T or from a terminal command line?
The following command-line approaches seem to be the easiest options:
Option A) Generate the update queries on the fly from the list of users and pipe them to the mongo shell:
cat file.csv | awk '{ print("db.users.update({username:\""$1"\"},{ $set:{x:1} }) ") }' | mongo
Option B) mongoimport
Step 1) Import the user list to the database in temporary collection:
mongoimport --type csv -d test -c usersToUpdate --headerline file.csv
file.csv:
userlist
John
Donald
Jeny
Step 2) As soon as the collection is imported, you can do the following:
db.usersToUpdate.find({}, {_id: 0, userlist: 1}).forEach(function(theuser) {
    db.users.update({username: theuser.userlist}, { $set: { <set some values here> } });
    print(theuser + " record updated successfully");
})
Step 3) Finally you can clean the temporary usersToUpdate collection with:
db.usersToUpdate.drop()
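If you would rather drive the updates from Python than from the mongo shell, here is a minimal pymongo sketch under the same assumptions (a file.csv with a header line and one username per line, plus an example $set payload, both of which you would replace with your own):

from pymongo import MongoClient, UpdateOne

client = MongoClient('mongodb://localhost:27017')  # assumed connection string
users = client['test']['users']                    # assumed database name

# Read the user list, skipping the header line.
with open('file.csv') as f:
    usernames = [line.strip() for line in f.readlines()[1:] if line.strip()]

# One bulk request instead of one round trip per user.
result = users.bulk_write(
    [UpdateOne({'username': name}, {'$set': {'x': 1}}) for name in usernames]
)
print(result.modified_count, 'records updated')
client.close()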

MongoDB to BigQuery

What is the best way to export data from MongoDB hosted in mLab to Google BigQuery?
Initially, I am trying to do a one-time load from MongoDB to BigQuery, and later on I am thinking of using Pub/Sub for a real-time data flow to BigQuery.
I need help with the first one-time load from MongoDB to BigQuery.
In my opinion, the best practice is building your own extractor. That can be done in the language of your choice, and you can extract to CSV or JSON.
But if you are looking for a fast way, and your data is not huge and can fit on one server, then I recommend using mongoexport. Let's assume you have a simple document structure such as the one below:
{
    "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
    "statuses" : [
        {
            "status" : "dc9e5511-466c-4146-888a-574918cc2534",
            "score" : 53.24388894
        }
    ],
    "stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}
Then you need to define your BigQuery schema (mongodb_schema.json), for example:
$ cat > mongodb_schema.json <<EOF
[
    { "name": "_id", "type": "STRING" },
    { "name": "stored_at", "type": "record", "fields": [
        { "name": "date", "type": "STRING" }
    ]},
    { "name": "statuses", "type": "record", "mode": "repeated", "fields": [
        { "name": "status", "type": "STRING" },
        { "name": "score", "type": "FLOAT" }
    ]}
]
EOF
Now the fun part starts :-) extracting the data from MongoDB as JSON. Let's assume you have a cluster with the replica set name statuses, your db is sample, and your collection is status.
mongoexport \
--host statuses/db-01:27017,db-02:27017,db-03:27017 \
-vv \
--db "sample" \
--collection "status" \
--type "json" \
--limit 100000 \
--out ~/sample.json
As you can see above, I limit the output to 100k records because I recommend running a sample and loading it to BigQuery before doing it for all your data. After running the above command you should have your sample data in sample.json, BUT there is a $date field that will cause an error when loading to BigQuery. To fix that, we can use sed to replace it with a simple field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
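If sed is not available, or you prefer to keep everything in one language, a hedged Python alternative that renames the "$date" keys line by line (assuming mongoexport produced one JSON document per line) looks like this:

import json

def rename_date_keys(value):
    # Recursively rename MongoDB extended-JSON "$date" keys to plain "date".
    if isinstance(value, dict):
        return {('date' if k == '$date' else k): rename_date_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_date_keys(v) for v in value]
    return value

with open('sample.json') as src, open('sample_fixed.json', 'w') as dst:
    for line in src:
        dst.write(json.dumps(rename_date_keys(json.loads(line))) + '\n')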
Now you can compress, upload to Google Cloud Storage (GCS), and then load into BigQuery using the following commands:
# Compress for faster load
gzip sample.json
# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz
# Load to BQ
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/*.json.gz" \
"mongodb_schema.json"
If everything was okay, go back and remove --limit 100000 from the mongoexport command, then re-run the above commands to load everything instead of the 100k sample.
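If you would rather drive the load from Python than from the bq CLI, a minimal sketch with the google-cloud-bigquery client (the bucket, dataset, and table names are the same placeholders used above) might look like this:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ignore_unknown_values=True,
    max_bad_records=999999,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # same effect as --replace
    schema=client.schema_from_json('mongodb_schema.json'),
)

load_job = client.load_table_from_uri(
    'gs://your-bucket/sample/*.json.gz',
    'YOUR_DATASET.mongodb_sample',
    job_config=job_config,
)
load_job.result()  # block until the load job finishes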
ALTERNATIVE SOLUTION:
If you want more flexibility and performance is not your concern, then you can use the mongo CLI tool as well. This way you can write your extraction logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so I could load it into BigQuery more easily:
# Export Logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (x = 0; x < maxCount; x = x + 1) {
    var recToSkip = x * size;
    db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
        var row = record._id + "," + record.stored_at.toISOString();
        record.statuses.forEach(function (l) {
            print(row + "," + l.status + "," + l.score);
        });
    });
}
EOF
# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
"${_MONGO_HOSTS}" \
export-csv.js \
| split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_
# Move all split files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/
# Load files to BigQuery
bq load \
--source_format=CSV \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/sample_*.csv.gz" \
"ID,StoredDate:DATETIME,Status,Score:FLOAT"
TIP: In the above script I used a small trick: piping the output to split so it is broken into multiple files with the sample_ prefix. During the split it also gzips the output, so it is easier to upload to GCS.
From a basic reading of MongoDB's documentation, it sounds like you can use mongoexport to dump your database as JSON. Once you've done that, refer to the BigQuery loading data topic for a description of how to create a table from JSON files after copying them to GCS.
You can read data from MongoDB and stream it to BigQuery. You can find an example in NodeJS here.
This is an extension of the linked example that prevents duplicated records (as long as they are still in the streaming buffer):
const { BigQuery } = require('@google-cloud/bigquery');
const bigqueryClient = new BigQuery();
...
const jsonData = // Array of documents from MongoDB

const inputRows = jsonData.map(row => ({
    insertId: row._id,
    json: row
}));

const insertOptions = {
    raw: true
};

await bigqueryClient
    .dataset(datasetId)
    .table(tableId)
    .insert(inputRows, insertOptions);
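For comparison, a hedged Python sketch of the same streaming insert with the google-cloud-bigquery client (the dataset/table name and the example rows are placeholders) uses row_ids the same way insertId is used above:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('YOUR_DATASET.mongodb_sample')  # placeholder dataset.table

# Illustrative rows; in practice these come from the MongoDB cursor, with _id cast to str.
json_rows = [
    {'_id': '58b6bcc00bd666355805a3ee', 'status': 'dc9e5511', 'score': 53.2},
]

# Passing the MongoDB _id as the row id lets BigQuery deduplicate retried inserts
# while the rows are still in the streaming buffer.
errors = client.insert_rows_json(
    table,
    json_rows,
    row_ids=[row['_id'] for row in json_rows],
)
if errors:
    print('Insert errors:', errors)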

How to export mongo data into csv using pymongo?

My code:
data = db.get_collection('activity_tracker').find({"companyId": "527d4b23-347a-4ad2-81d8-dfd66af5631a", 'userEmail': {'$ne': 'abc@xyz.in'}})
with open('asdxk.csv', 'w') as outfile:
    fields = ['companyId', 'userEmail']
    writer = csv.writer(outfile, fields)
    for post in data:
        writer.writerow([post])
Problem statement:
Using the above code, the data is exported to the CSV file as junk. What I want is for the companyId and userEmail values to be exported to the CSV in row and column format, with companyId and userEmail as the header names.
Use the mongoexport utility to export the data to CSV:
mongoexport -h localhost -d test -c activity_tracker --type=csv \
    --fields companyId,userEmail \
    --query '{"companyId":"527d4b23-347a-4ad2-81d8-dfd66af5631a","userEmail":{"$ne":"abc@xyz.in"}}' \
    --out asdxk.csv
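If the export really has to happen from pymongo (as the question title asks), a minimal sketch using csv.DictWriter with the query from the question would be (the connection string is an assumption):

import csv
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # assumed connection string
db = client['test']

query = {
    'companyId': '527d4b23-347a-4ad2-81d8-dfd66af5631a',
    'userEmail': {'$ne': 'abc@xyz.in'},
}
fields = ['companyId', 'userEmail']

# Project only the fields we need and write one CSV row per document.
cursor = db['activity_tracker'].find(query, {f: 1 for f in fields})
with open('asdxk.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for doc in cursor:
        writer.writerow(doc)
client.close()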

How to use mongoexport with query script file

I'm trying to follow this tutorial: http://www.ultrabug.fr/tag/mongoexport/
and use an sh file for the query line.
This is my file:
#!/bin/bash
d=`date --date="-3 month"`
echo "{ timeCreated: { "\$lte": $d} }"
This is my mongoexport line:
mongoexport --db game_server --collection GameHistory -query /home/dev/test2.sh --out /home/dev/file.json
I keep getting:
assertion: 16619 code FailedToParse: FailedToParse: Expecting '{': offset:0 of:/home/dev/test2.sh
Why? How can I make this work?
I found several errors in your approach, let's examine them one by one.
Date format
MongoDB expects the date to be a number or an ISO 8601 string.
Unfortunately, the unix date utility has no built-in support for this format, so you should use:
d=`date --date="-3 month" -u +"%Y-%m-%dT%H:%M:%SZ"`
Using extended JSON
The JSON specification has no support for dates, so you should use MongoDB extended JSON. Your final query should look like this:
{ "timeCreated": { "$lte": { "$date": "2014-05-12T08:53:29Z" } } }
test2.sh output
The quotation marks are mixed up. Here is an example script that outputs correct JSON:
#!/bin/bash
d=`date --date="-3 month" -u +"%Y-%m-%dT%H:%M:%SZ"`
echo '{ "timeCreated": { "$lte": { "$date": "'$d'" } } }'
Passing query to mongoexport
mongoexport expects --query to be a JSON string, not an .sh script. So when you pass a file path to --query, mongoexport expects it to be a JSON file.
To fix it, you should execute test2.sh yourself and pass the resulting string to mongoexport:
mongoexport --db game_server --collection GameHistory \
--query "`./test2.sh`" --out ./test2.json
N.B. Notice the double quotation marks around the ./test2.sh call. They tell bash to pass the output of ./test2.sh as a single parameter instead of splitting it on the inner quotation marks and whitespace.
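If you would rather build the query in Python than in bash, here is a hedged sketch that computes a similar cutoff date, builds the extended-JSON query, and invokes mongoexport via subprocess (the database, collection, and output path are taken from the question; "-3 month" is approximated as 90 days):

import json
import subprocess
from datetime import datetime, timedelta

# Approximate "-3 month" as 90 days, formatted as an ISO 8601 string.
cutoff = (datetime.utcnow() - timedelta(days=90)).strftime('%Y-%m-%dT%H:%M:%SZ')
query = json.dumps({'timeCreated': {'$lte': {'$date': cutoff}}})

subprocess.check_call([
    'mongoexport',
    '--db', 'game_server',
    '--collection', 'GameHistory',
    '--query', query,
    '--out', '/home/dev/file.json',
])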
You need to add backticks around a script or command to have it evaluated:
mongoexport --db game_server --collection GameHistory \
-query `/home/dev/test2.sh` --out /home/dev/file.json

Can i use mongoexport --query <file> where file is a list of conditions

I have an array of ids stored in a file, and I want to retrieve their data from MongoDB,
so I looked into the mongoexport method. It seems the --query option can only accept a JSON string rather than reading a large JSON document or array from a file. In my case, about 4000 ids are stored in the file. Is there a solution to this?
I was able to use
mongoexport --db db --collection collection --fields name --csv --out ~/data.csv
but how do I read query conditions from a file?
For example, with Mongoid in a Rails application, the query would be Data.where(:_id.in => array).
Or is it possible to do this from the mongo shell by executing a JavaScript file?
Thanks.
I believe you can use a JavaScript file to output the array you need.
You can use the printjson command in your script. For example, create a script.js JavaScript file as follows:
script.js:
printjson( db.albums.find({_id : 18}, {"images" : 1,"_id":0}).toArray() )
Call it as follows:
mongo test script.js > out.txt
In my local environment, the albums collection has the following structure:
db.albums.findOne({"_id": 18})
{
"_id" : 18,
"images" : [
2926,
5377,
8036,
9023,
10119,
11543,
12305,
12556,
12576,
13753,
14414,
14865,
15193,
15933,
17156,
17314,
17391,
20168,
21705,
22016,
22348,
23036,
23452,
24112,
27086,
27310,
27864,
28092,
29184,
29190,
29250,
29354,
29454,
29563,
30366,
30619,
31390,
31825,
31906,
32339,
32674,
33307,
33844,
37475,
37976,
38717,
38774,
39801,
41369,
41752,
44977,
45384,
45643,
46918,
47069,
50099,
52755,
54314,
54497,
62338,
63438,
63572,
63600,
65631,
66953,
67160,
67369,
69802,
71087,
71127,
71282,
73123,
73201,
73954,
74972,
76279,
77054,
78397,
78645,
78936,
79364,
79707,
83065,
83142,
83568,
84160,
85391,
85443,
85488,
86143,
86240,
86949,
89406,
89846,
92591,
92639,
92655,
93844,
93934,
94987,
95324,
95431,
95817,
95864,
96230,
96975,
97026
]
}
So the output I got was:
$ cat out.txt
MongoDB shell version: 2.2.1
connecting to: test
[
{
"images" : [
2926,
5377,
8036,
9023,
10119,
11543,
12305,
12556,
12576,
13753,
14414,
14865,
15193,
15933,
17156,
17314,
17391,
20168,
21705,
22016,
22348,
23036,
23452,
24112,
27086,
27310,
27864,
28092,
29184,
29190,
29250,
29354,
29454,
29563,
30366,
30619,
31390,
31825,
31906,
32339,
32674,
33307,
33844,
37475,
37976,
38717,
38774,
39801,
41369,
41752,
44977,
45384,
45643,
46918,
47069,
50099,
52755,
54314,
54497,
62338,
63438,
63572,
63600,
65631,
66953,
67160,
67369,
69802,
71087,
71127,
71282,
73123,
73201,
73954,
74972,
76279,
77054,
78397,
78645,
78936,
79364,
79707,
83065,
83142,
83568,
84160,
85391,
85443,
85488,
86143,
86240,
86949,
89406,
89846,
92591,
92639,
92655,
93844,
93934,
94987,
95324,
95431,
95817,
95864,
96230,
96975,
97026
]
}
]
Regards,
Moacy
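If the mongo shell route is not a hard requirement, a minimal pymongo sketch for the original problem (the ids.txt filename with one id per line, the test database, and the output filename are assumptions) would be:

import csv
import json
from pymongo import MongoClient

# Read the roughly 4000 ids from the file, one per line.
with open('ids.txt') as f:
    ids = [int(line.strip()) for line in f if line.strip()]

client = MongoClient('mongodb://localhost:27017')  # assumed connection string
cursor = client['test']['albums'].find({'_id': {'$in': ids}}, {'_id': 1, 'images': 1})

# Write one CSV row per matching document.
with open('data.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['_id', 'images'])
    for doc in cursor:
        writer.writerow([doc['_id'], json.dumps(doc.get('images', []))])
client.close()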