Loading contents of json array in redshift

I'm setting up Redshift and importing data from MongoDB. I have successfully used a JSONPaths file for a simple document, but I now need to import from a document containing an array.
{
"id":123,
"things":[
{
"foo":321,
"bar":654
},
{
"foo":987,
"bar":567
}
]
}
How do I load the above into a table like so:
select * from things;
id | foo | bar
--------+------+-------
123 | 321 | 654
123 | 987 | 567
Or is there some other way?
I can't just store the JSON array in a varchar(max) column, as the content of "things" can exceed 64K.

Given
db.baz.insert({
"myid":123,
"things":[
{
"foo":321,
"bar":654
},
{
"foo":987,
"bar":567
}
]
});
The following will display the fields you want
db.baz.find({},{"things.foo":1,"things.bar":1} )
To flatten the result set, unwind the array and project its fields to the top level:
db.baz.aggregate([
    // Emit one document per element of the "things" array
    { "$unwind" : "$things" },
    // Promote the nested fields so each output document is a flat row
    {
        $project : {
            _id : 0,
            id : "$myid",
            foo : "$things.foo",
            bar : "$things.bar"
        }
    }
]);
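The shape being asked for is one flat row per element of "things". A minimal sketch of that flattening in plain JavaScript (no MongoDB needed), assuming the export is post-processed before loading; flattenThings is a hypothetical helper name, and the one-object-per-line output is only an assumption about what a JSON-based load would accept:

```javascript
// Hypothetical helper: turn one document into one flat row per element of "things".
function flattenThings(doc) {
  return (doc.things || []).map(function (thing) {
    return { id: doc.myid, foo: thing.foo, bar: thing.bar };
  });
}

const doc = { myid: 123, things: [{ foo: 321, bar: 654 }, { foo: 987, bar: 567 }] };
const rows = flattenThings(doc);

// Serialise as one JSON object per line.
const lines = rows.map(function (row) { return JSON.stringify(row); }).join("\n");
console.log(lines);
```

Because each row is small, this also sidesteps the size limit that a single column holding the whole array would hit.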

Related

Convert Array to Json in Mongodb Aggregate

I have a Mongo document in the format below:
{
"id":"eafa3720-28e2-11ed-bf07",
"type":"test",
"serviceType_details": [
{
"is_custom_service_type": false,
"bill_amount": 100
}
]
}
The "serviceType_details" key does not have a fixed schema.
I want to export it to Parquet using a MongoDB aggregation so that I can query it with Presto.
My pipeline code:
db.test_collection.aggregate([
{
$match: {
"id": "something"
}
},
{
$addFields: {
...
},
},
{
"$out" : {
"format" : {
"name" : "parquet",
"maxFileSize" : "10GB",
"maxRowGroupSize" : "100MB"
}
}
}
])
Now I want to export the value of "serviceType_details" as a JSON string, not as an array (with the current code, Parquet recognises it as an array).
I have tried $convert and $project, and neither works.
I want the generated Parquet schema to type "serviceType_details" as a string whose value is the stringified version of the array in the Mongo document.
I need it as a string because "serviceType_details" has a completely different schema in each document, which makes it very difficult to maintain an Athena table on top of it.

You can use the $function operator to define custom functions that implement behaviour not supported by the MongoDB Query Language.
It could be done using "$function" like this (replace "$field" with the path of the array field and newFieldName with the output field):
db.test_collection.aggregate([
{
$match: {
"id": "something"
}
},
{
$addFields: {
newFieldName: {
$function: {
body: function(field) {
return (field != undefined && field != null) ? JSON.stringify(field) : "[]"
},
args: ["$field"],
lang: "js"
}
},
},
},
{
"$out" : {
"format" : {
"name" : "parquet",
"maxFileSize" : "10GB",
"maxRowGroupSize" : "100MB"
}
}
}
])
Executing JavaScript inside an aggregation expression may decrease performance. Only use the $function operator if the provided pipeline operators cannot fulfill your application's needs.
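To see what the function body produces, here is the same logic as a standalone function that runs in any JavaScript runtime. The name stringifyField is made up for illustration, and the sample value is the array from the question:

```javascript
// Same logic as the $function body above, as a standalone function.
function stringifyField(field) {
  // `field != null` is false for both null and undefined
  return field != null ? JSON.stringify(field) : "[]";
}

console.log(stringifyField([{ is_custom_service_type: false, bill_amount: 100 }]));
// → [{"is_custom_service_type":false,"bill_amount":100}]
console.log(stringifyField(undefined));
// → []
```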

Query document for record with array value of null

I have a document like this:
{
"_id" : "4mDYgt6gID",
...
"MultipleChoiceQuestions" : [
{
...
"LeadInFile" : null,
...
},
{
...
"LeadInFile" : 'some string',
...
}
],
...
}
How do I query for any documents that have a non-null value in LeadInFile?
I've tried different things; currently I have something like this:
db.getCollection('QuizTime:Quizzes').find({"MultipleChoiceQuestions": [{ "LeadInFile": { $ne: null}}]});
It returns 0 records.

The current form of the query is saying:
Find documents where MultipleChoiceQuestions is [{ "LeadInFile": { $ne: null}}]
Try using dot notation; this is used to access elements of an array or fields in an embedded document. For example:
db.getCollection('QuizTime:Quizzes').find({
"MultipleChoiceQuestions.LeadInFile" : { "$ne" : null }
})
})
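One caveat: on a dotted path through an array, $ne matches only documents where no element has a null (or missing) LeadInFile. If the goal is "at least one element is non-null", $elemMatch is the usual operator. The sketch below mirrors both semantics in plain JavaScript on made-up sample documents, with the corresponding queries shown in the comments:

```javascript
// Sample documents shaped like the one in the question (other fields omitted).
const quizzes = [
  { _id: "a", MultipleChoiceQuestions: [{ LeadInFile: null }, { LeadInFile: "some string" }] },
  { _id: "b", MultipleChoiceQuestions: [{ LeadInFile: null }] },
  { _id: "c", MultipleChoiceQuestions: [{ LeadInFile: "x" }, { LeadInFile: "y" }] }
];

// Mirrors { "MultipleChoiceQuestions.LeadInFile": { $ne: null } }:
// no element may have a null (or missing) LeadInFile.
const noneNull = quizzes.filter(function (q) {
  return q.MultipleChoiceQuestions.every(function (m) { return m.LeadInFile != null; });
});

// Mirrors { MultipleChoiceQuestions: { $elemMatch: { LeadInFile: { $ne: null } } } }:
// at least one element has a non-null LeadInFile.
const anyNonNull = quizzes.filter(function (q) {
  return q.MultipleChoiceQuestions.some(function (m) { return m.LeadInFile != null; });
});

console.log(noneNull.map(function (q) { return q._id; }));   // [ 'c' ]
console.log(anyNonNull.map(function (q) { return q._id; })); // [ 'a', 'c' ]
```

The document from the question has one null and one non-null LeadInFile, so it corresponds to "a" here and is only returned by the $elemMatch form.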

Sorting MongoDB collection by same key with different values

I am trying to query my Mongo database to display all values from a certain collection, sorted by all the values of a certain key. For example, I have the collection:
{
"id":"1235432",
"name":"John Smith",
"occupation":"janitor",
"salary":"30000"
},
{
"id":"23412312",
"name":"Mathew Colins",
"occupation":"janitor",
"salary":"32000"
},
{
"id":"7353452",
"name":"Depp Jefferson",
"occupation":"janitor",
"salary":"33000"
},
{
"id":"342434212",
"name":"Clara Smith",
"occupation":"Accountant",
"salary":"45000"
},
{
"id":"794563452",
"name":"Jonathan Drako",
"occupation":"Accountant",
"salary":"46000"
},
{
"id":"8383747",
"name":"Simon Well",
"occupation":"Accountant",
"salary":"41000"
}
and I am trying to display only the TOP 2 with highest salary by occupation. My query looks something like this at the moment:
Stats.find({occupation:{$exists:true}}).populate('name').sort({salary:1}).limit(2)
but that returns only 1 result instead of one from each occupation.
How can I change my query to display the top 2 of each occupation by salary range?

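You can use aggregate() as shown below. Note that this pipeline returns the two highest salaries overall rather than two per occupation.
db.collectionName.aggregate([
    { $match: { occupation: { $exists: true } } },
    { $sort: { "salary": -1 } },
    { $limit: 2 },
    { $project: { "name": 1, "salary": 1, _id: 0 } }
])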
Output JSON:
{"name" : "Jonathan Drako",
"salary" : "46000"},
{"name" : "Clara Smith",
"salary" : "45000"}

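To get the top 2 within each occupation, one option is to sort, group by occupation, and then trim each group. The sketch below is not part of the original answer: the pipeline assumes MongoDB 3.2+ (for the $slice expression), and topNByOccupation is a hypothetical plain-JavaScript mirror of it for checking the expected shape. The salaries are stored as strings, so MongoDB compares them lexicographically; that happens to agree with numeric order here only because they all have five digits.

```javascript
// Sketch of a per-occupation pipeline: db.collectionName.aggregate(pipeline)
const pipeline = [
  { $match: { occupation: { $exists: true } } },
  { $sort: { salary: -1 } },
  { $group: { _id: "$occupation", top: { $push: { name: "$name", salary: "$salary" } } } },
  { $project: { top: { $slice: ["$top", 2] } } }
];

// The same idea in plain JavaScript (salaries compared numerically).
function topNByOccupation(docs, n) {
  const groups = {};
  docs
    .filter(function (d) { return d.occupation !== undefined; })
    .sort(function (a, b) { return Number(b.salary) - Number(a.salary); })
    .forEach(function (d) {
      const bucket = groups[d.occupation] || (groups[d.occupation] = []);
      if (bucket.length < n) bucket.push({ name: d.name, salary: d.salary });
    });
  return groups;
}

const staff = [
  { name: "John Smith", occupation: "janitor", salary: "30000" },
  { name: "Mathew Colins", occupation: "janitor", salary: "32000" },
  { name: "Depp Jefferson", occupation: "janitor", salary: "33000" },
  { name: "Clara Smith", occupation: "Accountant", salary: "45000" },
  { name: "Jonathan Drako", occupation: "Accountant", salary: "46000" },
  { name: "Simon Well", occupation: "Accountant", salary: "41000" }
];
const top2 = topNByOccupation(staff, 2);
console.log(JSON.stringify(top2, null, 2));
```
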
MongoDB Aggregation with DBRef

Is it possible to aggregate on data that is stored via DBRef?
Mongo 2.6
Let's say I have transaction data like:
{
_id : ObjectId(...),
user : DBRef("user", ObjectId(...)),
product : DBRef("product", ObjectId(...)),
source : DBRef("website", ObjectId(...)),
quantity : 3,
price : 40.95,
total_price : 122.85,
sold_at : ISODate("2015-07-08T09:09:40.262-0700")
}
The trick is "source" is polymorphic in nature - it could be different $ref values such as "webpage", "call_center", etc that also have different ObjectIds. For example DBRef("webpage", ObjectId("1")) and DBRef("webpage",ObjectId("2")) would be two different webpages where a transaction originated.
I would like to ultimately aggregate by source over a period of time (like a month):
db.coll.aggregate( { $match : { sold_at : { $gte : start, $lt : end } } },
{ $project : { source : 1, total_price : 1 } },
{ $group : {
_id : { "source.$ref" : "$source.$ref" },
count : { $sum : "$total_price" }
} } );
The trick is you get a path error trying to use a variable starting with $ either by trying to group by it or by trying to transform using expressions via project.
Any way to do this? Actually trying to push this data via aggregation to a subcollection to operate on it there. Trying to avoid a large cursor operation over millions of records to transform the data so I can aggregate it.

With Mongo 4, I solved this issue in the following way.
Having this structure:
{
"_id" : LUUID("144e690f-9613-897c-9eab-913933bed9a7"),
"owner" : {
"$ref" : "person",
"$id" : NumberLong(10)
},
...
...
}
I needed to use the "owner.$id" field, but because of the "$" in the field name I was unable to reference it in an aggregation.
I transformed "owner.$id" -> "owner" using the following snippet:
db.activities.aggregate([
{
$addFields: {
"owner": {
$arrayElemAt: [{ $objectToArray: "$owner" }, 1]
}
}
},
{
$addFields: {
"owner": "$owner.v"
}
},
{"$group" : {_id:"$owner", count:{$sum:1}}},
{$sort:{"count":-1}}
])
Detailed explanations here - https://dev.to/saurabh73/mongodb-using-aggregation-pipeline-to-extract-dbref-using-lookup-operator-4ekl
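What the two $addFields stages do can be traced in plain JavaScript. This is only an illustration of the transformation, with the NumberLong(10) from the example replaced by a plain number:

```javascript
// A DBRef as it appears in the stored document (values from the example above).
const owner = { $ref: "person", $id: 10 };

// $objectToArray: turn the object into [{ k, v }, ...] pairs, in field order.
const pairs = Object.entries(owner).map(function (entry) {
  return { k: entry[0], v: entry[1] };
});

// $arrayElemAt with index 1 picks the "$id" pair; "$owner.v" then keeps only its value.
const ownerId = pairs[1].v;
console.log(ownerId); // 10
```

The approach relies on "$id" being the second field of the DBRef, which is why index 1 is used.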

You cannot use DBRef values with the aggregation framework. Instead you need to use the JavaScript processing of mapReduce in order to access the property naming that they use:
db.coll.mapReduce(
function() {
emit( this.source.$ref, this["total_price"] )
},
function(key,values) {
return Array.sum( values );
},
{
"query": { "sold_at": { "$gte": start, "$lt": end } },
"out": { "inline": 1 }
}
)
You really should not be using DBRef at all. The usage is basically deprecated now and if you feel you need some external referencing then you should be "manually referencing" this with your own code or implemented by some other library, with which you can do so in a much more supported way.

Mongo: select only one field from the nested object

In Mongo I store objects that have a field "titleComposite". This field contains an array of title objects, like this:
"titleComposite": [
"0": {
"titleType": "01",
"titleText": "Test cover uploading"
}
]
I'm performing a query and would like to receive only the "titleText" value in the returned documents. Here is an example of my query:
db.onix_feed.find({"addedBy":201, "mediaFileComposite":{$exists:false}}, {"isbn13":1,"titleComposite.titleText":1})
In the results I see values like
{
"_id" : ObjectId("559ab286fa4634f309826385"),
"titleComposite" : [ { "titleText" : "The Nonprofit World" } ],
"isbn13" : "9781565495296"
}
Is there any way to get rid of "titleComposite" wrapper object and receive only titleText? For example, take titleText of the first element only?
Would appreciate any help

You can use MongoDB aggregation to achieve your expected result. Rearrange your query as follows:
db.onix_feed.aggregate([
{
$match: {
$and: [
{"addedBy":201},
{"mediaFileComposite":{$exists:false}}
]
}
},
{
$project : { titleText: "$titleComposite.titleText",
"isbn13" : 1 }
}
])
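The $project above yields titleText as an array of strings. If only the first element's title is wanted, as the question suggests, a possible variation is to wrap the path in $arrayElemAt. This is a sketch assuming MongoDB 3.2+; firstTitleText is a hypothetical plain-JavaScript equivalent for one document:

```javascript
// Possible replacement for the $project stage (assumes MongoDB 3.2+ for $arrayElemAt).
const projectStage = {
  $project: {
    isbn13: 1,
    titleText: { $arrayElemAt: ["$titleComposite.titleText", 0] }
  }
};

// Plain JavaScript equivalent of what that stage computes for one document.
function firstTitleText(doc) {
  const titles = (doc.titleComposite || []).map(function (t) { return t.titleText; });
  return titles[0];
}

const book = { isbn13: "9781565495296", titleComposite: [{ titleText: "The Nonprofit World" }] };
console.log(firstTitleText(book)); // The Nonprofit World
```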
