How can I perform a word count in spring-batch and sort the output?

How can I perform a word count in spring-batch and sort the output? - spring-batch

Given an input of some objects each containing a set of strings, I want to count the number of occurrences of each string for the entire batch, and output the word counts (to a CSV in my case) alongside each string (preferably sorted by frequency). How can I achieve this in spring batch? I can't find any suitable examples. I've tried to implement this using item readers/processors but am getting the output duplicated, I'm assuming because it's chunked? Should I use Tasklets for this?
Spring Batch : Aggregating records and write count seems close to what I want to achieve, but it's not clear to me how this is working.
The input is along the lines of:
[{
"id": 1,
"tags": ["foo", "bar"]
}, {
"id": 2,
"tags": ["foo", "baz"]
}
...]
and the desired output from that would be
foo, 2
bar, 1
baz, 1

Related

MongoDB find lowest missing value

I have a collection in MongoDB that looks something like:
{
"foo": "something",
"tag": 0,
},
{
"foo": "bar",
"tag": 1,
},
{
"foo": "hello",
"tag": 0,
},
{
"foo": "world",
"tag": 3,
}
If we consider this example, there are entries in the collection with tag of value 0, 1 or 3 and these aren't unique values, tag value can be repeated. My goal is to find that 2 is missing. Is there a way to do this with a query?

Query1
in the upcoming mongodb 5.2 we will have sort on arrays that could do this query easier without set operation but this will be ok also
group and find the min,max and all the values
take the range(max-min)
the missing are (setDifference range_above tags)
and from them you take only the smallest => 2
Test code here
aggregate(
[{"$group":
{"_id":null,
"min":{"$min":"$tag"},
"max":{"$max":"$tag"},
"tags":{"$addToSet":"$tag"}}},
{"$project":
{"_id":0,
"missing":
{"$min":
{"$setDifference":
[{"$range":[0, {"$subtract":["$max", "$min"]}]}, "$tags"]}}}}])
Query2
in Mongodb 5 (the current version) we can use also $setWindowFields
sort by tag, add the dense-rank(same values=same rank), and the min
then find the difference of tag-min
and then filter those that this difference < rank
and find the max of them (max of the tag that are ok)
increase 1 to find the one missing
*test it before using it to be sure, i tested it 3-4 times seemed ok,
for big collection if you have many different tags, this is better i think. (the above addtoset can cause memory problems)
Test code here
aggregate(
[{"$setWindowFields":
{"output":{"rank":{"$denseRank":{}}, "min":{"$first":"$tag"}},
"sortBy":{"tag":1}}},
{"$set":{"difference":{"$subtract":["$tag", "$min"]}}},
{"$match":{"$expr":{"$lt":["$difference", "$rank"]}}},
{"$group":{"_id":null, "last":{"$max":"$tag"}}},
{"$project":{"_id":0, "missing":{"$add":["$last", 1]}}}])

DMN Nested Object Filering

Literal Expression = PurchaseOrders.pledgedDocuments[valuation.value=62500]
Purchase Order Structure
PurchaseOrders:
[
{
"productId": "PURCHASE_ORDER_FINANCING",
"pledgedDocuments" : [{"valuation" : {"value" : "62500"} }]
}
]
The literal Expression produces null result.
However if
PurchaseOrders.pledgedDocuments[valuation = null]
Return all results !
What am I doing wrong ?

I was able to solve using flatten function - but dont know how it worked :(
In your original question, it is not exactly clear to me what is your end-goal, so I will try to provide some references.
value filtering
First, your PurchaseOrders -> pledgedDocuments -> valuation -> value appears to be a string, so in your original question trying to filter by
QUOTE:
... [valuation.value=62500]
will not help you.
You'll need to filter to something more ~like: valuation.value="62500"
list projection
In your original question, you are projecting on the PurchaseOrders which is a list and accessing pledgedDocuments which again is a list !
So when you do:
QUOTE:
PurchaseOrders.pledgedDocuments (...)
You don't have a simple list; you have a list of lists, it is a list of all the lists of pledged documents.
final solution
I believe what you wanted is:
flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"]
And let's do the exercise on paper about what is actually happening.
First,
Let's focus on PurchaseOrders.pledgedDocuments.
You supply PurchaseOrders which is a LIST of POs,
and you project on pledgedDocuments.
What is that intermediate results?
Referencing your original question input value for POs, it is:
[
[{"valuation" : {"value" : "62500"} }]
]
notice how it is a list of lists?
With the first part of the expression, PurchaseOrders.pledgedDocuments, you have asked: for each PO, give me the list of pledged documents.
By hypothesis, if you supplied 3 POs, and each having 2 documents, you would have obtained with PurchaseOrders.pledgedDocuments a resulting list of again 3 elements, each element actually being a list of 2 elements.
Now,
With flatten(PurchaseOrders.pledgedDocuments) you achieve:
[{"valuation" : {"value" : "62500"} }]
So at this point you have a list containing all documents, regardless of which PO.
Now,
With flatten(PurchaseOrders.pledgedDocuments)[valuation.value="62500"] the complete expression, you still achieve:
[{"valuation" : {"value" : "62500"} }]
Because you have asked on the flattened list, to keep only those elements containing a valuation.value equal to the "62500" string.
In other words iff you have used this expression, what you achieved is:
From any PO, return me the documents having the valuations' value
equals to the string 62500, regardless of the PO the document belongs to.

$reduce with $regex in $group aggregation so that length can be displayed

I think I have a pretty complex one here - not sure if I can do this or not.
I have data that has an address and a data field. The data field is a hex value. I would like to run an aggregation that groups the data by address and then the length of the hex data. All of the data will come in as 16 characters long, but the length of that data should calculated in bytes.
I think I have to take the data, strip the trailing 00's (using regex 00+$), and divide that number by 2 to get the length. After that, I would have to then group by address and final byte length.
An example dataset would be:
{addr:829, data:'4100004822000000'}
{addr:829, data:'4100004813000000'}
{addr:829, data:'4100004804000000'}
{addr:506, data:'0000108000000005'}
{addr:506, data:'0000108000000032'}
{addr:229, data:'0065005500000000'}
And my desired output would be:
{addr:829, length:5}
{addr:506, length:8}
{addr:229, length:4}
Is this even possible in an aggregation query w/o having to use external code to do?

This is not too complicated if your "data" is in fact strings as you show in your sample data. Assuming data exists and is set to something (you can add error checking as needed) you can get the result you want like this:
db.coll.aggregate([
{$addFields:{lastNonZero:{$add:[2,{$reduce:{
initialValue:-2,
input:{$range:[0,{$strLenCP:"$data"},2]},
in:{$cond:{
if: {$eq:["00",{$substr:["$data","$$this",2]}]},
then: "$$value",
else: "$$this"
}}
}}]}}},
{$group:{_id:{
addr:"$addr",
length:{$divide:["$lastNonZero",2]}
}}}
])
I used two stages but of course they could be combined into a single $group if you wish. Here in $reduce I step through data 2 characters at a time, checking if they are equal to "00". Every time they are not I update the value to where I am in the sequence. Since that returns the position of the last non-"00" characters, we add 2 to it to find where the string of zeros that goes to the end starts and then later in $group we divide that by 2 to get the true length.
On your sample data, this returns:
{ "_id" : { "addr" : 229, "length" : 4 } }
{ "_id" : { "addr" : 506, "length" : 8 } }
{ "_id" : { "addr" : 829, "length" : 5 } }
You can add a $project stage to transform the field names into ones you want returned.

Is there a way to return part of an array in a document in MongoDB?

Pretend I have this document:
{
"name": "Bob",
"friends": [
"Alice",
"Joe",
"Phil"
],
"posts": [
12,
15,
55,
61,
525,
515
]
}
All is good with only a handful of posts. However, let's say posts grows substantially (and gets to the point of 10K+ posts). A friend mentioned that I might be able to keep the array in order (i.e. the first entry is the ID of the newest post so I don't have to sort) and append new posts to the beginning. This way, I could get the first, say, 10 elements of the array to get the 10 newest items.
Is there a way to only retrieve posts n at a time? I don't need 10K posts being returned, when most of them won't even be looked at, but I still need to keep around for records.

You can use $slice operator of mongoDB in projection to get n elements from array like following:
db.collection.find({
//add condition here
}, {
"posts": {
$slice: 3 //set number of element here
//negative number slices from end of array
}
})

You can do this :
create a list for posts you want to have (say you want first 3 posts) and return that list
for doc in db.collections.find({your query}):
temp = ()
for i in range (2):
temp.push(doc['posts'][i])
return temp

How to incorporate $lt and $divide in mongodb

I am a newbie in MongoDB but I am trying to query to identify if any of my field meets the requirements.
Consider the following:
I have a collection where EACH document is formatted as:
{
"nutrition" : [
{
"name" : "Energy",
"unit" : "kcal",
"value" : 150.25,
"_id" : ObjectId("fdsfdslkfjsdf")
}
{---then there's more in the array---}
]
"serving" : 4
"id": "Food 1"
}
My current code looks something like this:
db.recipe.find(
{"nutrition": {$elemMatch: {"name": "Energy", "unit": "kcal", "value": {$lt: 300}}}},
{"id":1, _id:0}
)
Under the array nutrition, there's a field with its name called Energy with it's value being a number. It checks if that value is less than 300 and outputs all the documents that meets this requirement (I made it only output the field called id).
Now my question is the following:
1) For each document, I have another field called "serving" and I am supposed to find out if "value"/"serving" is still less than 300. (As in divide value by serving and see if it's still less than 300)
2) Since I am using .find, I am guessing I can't use $divide operator from aggregation?
3) I been trying to play around with aggregation operators like $divide + $cond, but no luck so far.
4) Normally in other languages, I would just create a variable a = value/serving then run it through an if statement to check if it's less than 300 but I am not sure if that's possible in MongoDB
Thank you.

In case anyone was struggling with similar problem, I figured out how to do this.
db.database.aggregate([
{$unwind: "$nutrition"}, //starts aggregation
{$match:{"nutrition.name": "Energy", "nutrition.unit": "kcal"}}, //breaks open the nutrition array
{$project: {"Calories per serving": {$divide: ["$nutrition.value", "$ingredients_servings"]}, //filters out anything not named Energy and unit is not kcal, so basically 'nutrition' array now has become a field with one data which is the calories in kcal data
"id": 1, _id:0}}, //show only the food id
{$match:{"Calories per serving": {$lt: 300}}} //filters out any documents that has their calories per serving equal to or greater than 300
]}
So basically, you open the array and filter out any sub-fields you don't want in the document, then display it using project along with any of your math evaluations that needs to be done. Then you filter out any condition you had, which for me was that I don't want to see any foods that have their calories per serving more than 300.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How can I perform a word count in spring-batch and sort the output? - spring-batch

Related

MongoDB find lowest missing value

DMN Nested Object Filering

$reduce with $regex in $group aggregation so that length can be displayed

Is there a way to return part of an array in a document in MongoDB?

How to incorporate $lt and $divide in mongodb

Categories

Resources