Count number of words across an array of strings

Count number of words across an array of strings - mongodb

I have the following aggregation:
db.subtitles.aggregate()
.match({})
.group({_id: {chunkId: "$chunk_id"}, text: { $push:"$text"}})
What this will render is a result so:
{
"_id" : {
"chunkId" : "ffdd704b-c441-4b49-a32e-fc2277d99250"
},
"text" : [
"Mula doon, sumasama ako sa grocery, sa palengke, sinusundan ko saan napupunta ang pera.",
"Nagkakaroon sila ng resibo na makikita sa kanilang device.",
"Parang ganun na nga, pero…",
"Kaya parang akong naging buhay na QuickBooks. Gusto ko malaman kung ano ang ginagawa ng mga tao sa pera, magkano kinita nila. ",
"Sa kanilang email o text ay may impormasyon na masasabi mo na \"Itong numero na ito, itong text ay galing halimbawa sa Bank of America, at kumpirmado ito\"",
"Mga 4,500 na interbyu o mahigit pa. Sa buong Silangang Africa, sub Saharan Africa at sa Timog Asia.",
"Sa mga umuusbong na merkado, kapag nagbabayad sila ng kuryente, o kapag sumweldo sila.",
"Hindi ko na gustong makita ang nangyari 3 taon nakalipas. Nais ko lang malaman kung kaya mo itong bayaran sa katapusan ng buwan.",
"Saan ako magpunta?"
]
},
…
What I'd like to do is add another field to this group that gives me a total word count for the text array. In this case roughly 136 words.
How could I adjust my aggregation to accomplish this?

You can calculate words before grouping, so you don't need to deal with array of strings but with a single "text" field.
Starting from v4.2 you can benefit from $regexFindAll operator:
db.subtitles.aggregate([
{ $match: {} },
{ $addFields: { words: { $size: { $regexFindAll: { input: "$text", regex: /\w+/ }}}}},
{ $group: {_id: {chunkId: "$chunk_id"}, text: { $push:"$text"}, words: {$sum: "$words"}}}
])
Please read the docs regarding collation to ensure proper behaviour of \w+ regexp. You may want to add some other characters there, e.g. apostrophe etc depending on language. Precise counting may require quite sophisticated regexs especially for non-english strings. See Regex word count - matching words with apostrophe for inspiration.

You can use $stLenCP and $addFields stage
db.subtitle.aggregate([
{ $match: { "_id": ObjectId("5d5b889c33acba0b89b97cda") } },
{ $addFields: {
"length": {
$strLenCP: {
$reduce: {
input: "$text",
initialValue: "",
in: { $concat: ["$$value", "$$this"] }
}
}
}
}}
])

Related

How to get the first element from the result of db.collection.find(...)?

I have an array which is the result from db.zips.find().
sample_training> a = db.zips.find({"pop": 9999})
[
{
_id: ObjectId("5c8eccc1caa187d17ca6ed18"),
city: 'ACMAR',
zip: '88888',
loc: { y: 33.584132, x: 86.51557 },
pop: 9999,
state: 'AL'
}
]
sample_training>
As you can see from the brackets, the result is indeed an array. But I can't get its first element. I tried:
sample_training> a[0]
sample_training> a.0
Uncaught:
SyntaxError: Missing semicolon. (1:1)
> 1 | a.0
| ^
2 |
sample_training>
both failed. How can I get it?

Importing nested JSON documents into Elasticsearch and making them searchable

We have MongoDB-collection which we want to import to Elasticsearch (for now as a one-off effort). For this end, we have exported the collection with monogexport. It is a huge JSON file with entries like the following:
{
"RefData" : {
"DebtInstrmAttrbts" : {
"NmnlValPerUnit" : "2000",
"IntrstRate" : {
"Fxd" : "3.1415"
},
"MtrtyDt" : "2020-01-01",
"TtlIssdNmnlAmt" : "200000000",
"DebtSnrty" : "SNDB"
},
"TradgVnRltdAttrbts" : {
"IssrReq" : "false",
"Id" : "BMTF",
"FrstTradDt" : "2019-04-01T12:34:56.789"
},
"TechAttrbts" : {
"PblctnPrd" : {
"FrDt" : "2019-04-04"
},
"RlvntCmptntAuthrty" : "GB"
},
"FinInstrmGnlAttrbts" : {
"ClssfctnTp" : "DBFNXX",
"ShrtNm" : "AVGO 3.625 10/16/24 c24 (URegS)",
"FullNm" : "AVGO 3 5/8 10/15/24 BOND",
"NtnlCcy" : "USD",
"Id" : "USU1109MAXXX",
"CmmdtyDerivInd" : "false"
},
"Issr" : "549300WV6GIDOZJTVXXX"
}
We are using the following Logstash configuration file to import this data set into Elasticsearch:
input {
file {
path => "/home/elastic/FIRDS.json"
start_position => "beginning"
sincedb_path => "/dev/null"
codec => json
}
}
filter {
mutate {
remove_field => [ "_id", "path", "host" ]
}
}
output {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "firds"
}
}
All this works fine, the data ends up in the index firds of Elasticsearch, and a GET /firds/_search returns all the entries within the _source field.
We understand that this field is not indexed and thus is not searchable, which we are actually after. We want make all of the entries within the original nested JSON searchable in Elasticsearch.
We assume that we have to adjust the filter {} part of our Logstash configuration, but how? For consistency reasons, it would not be bad to keep the original nested JSON structure, but that is not a must. Flattening would also be an option, so that e.g.
"RefData" : {
"DebtInstrmAttrbts" : {
"NmnlValPerUnit" : "2000" ...
becomes a single key-value pair "RefData.DebtInstrmAttrbts.NmnlValPerUnit" : "2000".
It would be great if we could do that immediately with Logstash, without using an additional Python script operating on the JSON file we exported from MongoDB.
EDIT: Workaround
Our current work-around is to (1) dump the MongoDB database to dump.json and then (2) flatten it with jq using the following expression, and finally (3) manually import it into Elastic
ad (2): This is the flattening step:
jq '. as $in | reduce leaf_paths as $path ({}; . + { ($path | join(".")): $in | getpath($path) }) | del(."_id.$oid") '
-c dump.json > flattened.json
References
Walker Rowe: ElasticSearch Nested Queries: How to Search for Embedded Documents
ElasticSearch search in document and in dynamic nested document
Mapping for Nested JSON document in Elasticsearch
Logstash - import nested JSON into Elasticsearch
Remark for the curious: The shown JSON is a (modified) entry from the Financial Instruments Reference Database System (FIRDS), available from the European Securities and Markets Authority (ESMA) who is an European financial regulatory agency overseeing the capital markets.

remove last part with cloudformation syntax

I got an arn reference with Fn::GetAtt: [ logGroup, Arn ]
arn:aws:logs:us-east-1:123456789012:log-group:/log-group-1234:*
but i need:
arn:aws:logs:us-east-1:123456789012:log-group:/log-group-1234
So the last part (*) need be removed.
How can I use the reference to archive it? I can split and select the last session, but how to remove it? (I hard code the log group name as sample only)
{ "Fn::Select" : [ "8", { "Fn::Split": [":", {"Fn::ImportValue": "arn:aws:logs:us-east-1:123456789012:log-group:/log-group-1234:*"}]}] }
Update:
Thanks, #Miles. I made it work
Fn::Select:
- '0'
- Fn::Split:
- ":*"
- Fn::GetAtt: [ LogsGroup, Arn ]

You should be able to split by more than one character. Try:
{
"Fn::Select":[
"0",
{
"Fn::Split":[
":*",
{
"Fn::ImportValue":"arn:aws:logs:us-east-1:123456789012:log-group:/log-group-1234:*"
}
]
}
]
}
Just as a side note, it doesn't make much sense to use ImportValue like that, but I guess you provided this just as a placeholder.

How to use Clauses with API REST Neo4J?

I have a complexe request that perfectly works in Neo4j Browser that I want to use through API Rest, but there are Clauses I can't cope with.
The syntax looks like :
MATCH p=()-[*]->(node1)
WHERE …
WITH...
....
FOREACH … SET …
I constructed the query with Transactional Cyper as i have been suggested by #cybersam, but I don't manage to use more than one clause anyway.
To give an exemle, if I write the statement in one line :
:POST /db/data/transaction/commit {
"statements": [
{
"statement": "MATCH p = (m)-[*]->(n:SOL {PRB : {PRB1}}) WHERE nodes (p)
MATCH q= (o:SOL {PRB : {PRB2}} RETURN n, p, o, q;",
"parameters": {"PRB1": "Title of problem1", "PRB2": "Title of problem2"}
} ],
"resultDataContents": ["graph"] }
I shall obtain :
{"results":[],"errors":[{"code":"Neo.ClientError.Statement.SyntaxError","message":"Invalid input 'R': expected whitespace, comment, ')' or a relationship pattern (line 1, column 90 (offset: 89))\r\n\"MATCH p = (m)-[*]->(n:SOL {PRB : {PRB1}}) WHERE nodes (p) MATCH q= (o:SOL {PRB : {PRB2}} RETURN n, p, o, q;\"\r\n ^"}]}
But if I put it in several lines, :
:POST /db/data/transaction/commit {
"statements": [
{
"statement": "MATCH p = (m)-[*]->(n:SOL {PRB : {PRB1}})
WHERE nodes (p)
MATCH q= (o:SOL {PRB : {PRB2}}
RETURN n, p, o, q;",
"parameters": {"PRB1": "Title of problem1", "PRB2": "Title of problem2"}
}
],
"resultDataContents": ["graph"]
}
it is said :
{"results":[],"errors":[{"code":"Neo.ClientError.Request.InvalidFormat","message":"Unable
to deserialize request: Illegal unquoted character ((CTRL-CHAR, code
10)): has to be escaped using backslash to be included in string
value\n at [Source: HttpInputOverHTTP#41fa906c; line: 4, column:
79]"}]}
Please, I need your help !
Alex

Using the Transaction Cypher HTTP API, you could just pass the same Cypher statement to the API.
To quote from this section of the doc, here is an example of the simplest way to do that:
Begin and commit a transaction in one request If there is no need to
keep a transaction open across multiple HTTP requests, you can begin a
transaction, execute statements, and commit with just a single HTTP
request.
Example request
POST http://localhost:7474/db/data/transaction/commit
Accept: application/json; charset=UTF-8
Content-Type: application/json
{
"statements" : [ {
"statement" : "CREATE (n) RETURN id(n)"
} ]
}
Example response
200: OK
Content-Type: application/json
{
"results" : [ {
"columns" : [ "id(n)" ],
"data" : [ {
"row" : [ 6 ],
"meta" : [ null ]
} ]
} ],
"errors" : [ ]
}

Is there a way to return a JSON object within a xe:restViewColumn?

I'm trying to generate a REST-Service on a XPage with the viewJsonService service type.
Within a column I need to have a JSON object and tried to solve that with this code:
<xe:restViewColumn name="surveyResponse">
<xe:this.value>
<![CDATA[#{javascript:
var arrParticipants = new Array();
arrParticipants.push({"participant": "A", "selection": ["a1"]});
arrParticipants.push({"participant": "B", "selection": ["b1", "b2"]});
return (arrParticipants);
}
]]>
</xe:this.value>
</xe:restViewColumn>
I was expecting to get this for that specific column:
...
"surveyResponse": [
{ "participant": "A",
"selection": [ "a1" ]
},
{ "participant": "B",
"selection": [ "b1", "b2" ]
}
]
...
What I am getting is this:
...
"surveyResponse": [
"???",
"???"
]
...
When trying to use toJson for the array arrParticipants the result is not valid JSON format:
...
"surveyResponse": "[{\"selection\": [\"a1\"],\"participant\":\"A\"},{\"selection\": [\"b1\",\"b2\"],\"participant\":\"B\"}]"
...
When tyring to use fromJson for the array arrParticipants the result is:
{
"code": 500,
"text": "Internal Error",
"message": "Error while executing JavaScript computed expression",
"type": "text",
"data": "com.ibm.xsp.exception.EvaluationExceptionEx: Error while executing JavaScript computed expression at com.ibm.xsp.binding.javascript.JavaScriptValueBinding.getValue(JavaScriptValueBinding.java:132) at com.ibm.xsp.extlib.component.rest.DominoViewColumn.getValue(DominoViewColumn.java:93) at com.ibm.xsp.extlib.component.rest.DominoViewColumn.evaluate(DominoViewColumn.java:133) at com.ibm.domino.services.content.JsonViewEntryCollectionContent.writeColumns(JsonViewEntryCollectionContent.java:213) at com.ibm.domino.services.content.JsonViewEntryCollectionContent.writeEntryAsJson(JsonViewEntryCollectionContent.java:191) at com.ibm.domino.services.content.JsonViewEntryCollectionContent.writeViewEntryCollection(JsonViewEntryCollectionContent.java:170) at com.ibm.domino.services.rest.das.view.RestViewJsonService.renderServiceJSONGet(RestViewJsonService.java:394) at com.ibm.domino.services.rest.das.view.RestViewJsonService.renderService(RestViewJsonService.java:112) at com.ibm.domino.services.HttpServiceEngine.processRequest(HttpServiceEngine.java:167) at com.ibm.xsp.extlib.component.rest.UIBaseRestService._processAjaxRequest(UIBaseRestService.java:242) at com.ibm.xsp.extlib.component.rest.UIBaseRestService.processAjaxRequest(UIBaseRestService.java:219) at com.ibm.xsp.util.AjaxUtilEx.renderAjaxPartialLifecycle(AjaxUtilEx.java:206) at com.ibm.xsp.webapp.FacesServletEx.renderAjaxPartial(FacesServletEx.java:225) at com.ibm.xsp.webapp.FacesServletEx.serviceView(FacesServletEx.java:170) at com.ibm.xsp.webapp.FacesServlet.service(FacesServlet.java:160) at com.ibm.xsp.webapp.FacesServletEx.service(FacesServletEx.java:138) at com.ibm.xsp.webapp.DesignerFacesServlet.service(DesignerFacesServlet.java:103) at com.ibm.designer.runtime.domino.adapter.ComponentModule.invokeServlet(ComponentModule.java:576) at com.ibm.domino.xsp.module.nsf.NSFComponentModule.invokeServlet(NSFComponentModule.java:1281) at com.ibm.designer.runtime.domino.adapter.ComponentModule$AdapterInvoker.invokeServlet(ComponentModule.java:847) at com.ibm.designer.runtime.domino.adapter.ComponentModule$ServletInvoker.doService(ComponentModule.java:796) at com.ibm.designer.runtime.domino.adapter.ComponentModule.doService(ComponentModule.java:565) at com.ibm.domino.xsp.module.nsf.NSFComponentModule.doService(NSFComponentModule.java:1265) at com.ibm.domino.xsp.module.nsf.NSFService.doServiceInternal(NSFService.java:653) at com.ibm.domino.xsp.module.nsf.NSFService.doService(NSFService.java:476) at com.ibm.designer.runtime.domino.adapter.LCDEnvironment.doService(LCDEnvironment.java:341) at com.ibm.designer.runtime.domino.adapter.LCDEnvironment.service(LCDEnvironment.java:297) at com.ibm.domino.xsp.bridge.http.engine.XspCmdManager.service(XspCmdManager.java:272) Caused by: com.ibm.jscript.InterpretException: Script interpreter error, line=7, col=8: Error while converting from a JSON string at com.ibm.jscript.types.FBSGlobalObject$GlobalMethod.call(FBSGlobalObject.java:785) at com.ibm.jscript.types.FBSObject.call(FBSObject.java:161) at com.ibm.jscript.types.FBSGlobalObject$GlobalMethod.call(FBSGlobalObject.java:219) at com.ibm.jscript.ASTTree.ASTCall.interpret(ASTCall.java:175) at com.ibm.jscript.ASTTree.ASTReturn.interpret(ASTReturn.java:49) at com.ibm.jscript.ASTTree.ASTProgram.interpret(ASTProgram.java:119) at com.ibm.jscript.ASTTree.ASTProgram.interpretEx(ASTProgram.java:139) at com.ibm.jscript.JSExpression._interpretExpression(JSExpression.java:435) at com.ibm.jscript.JSExpression.access$1(JSExpression.java:424) at com.ibm.jscript.JSExpression$2.run(JSExpression.java:414) at java.security.AccessController.doPrivileged(AccessController.java:284) at com.ibm.jscript.JSExpression.interpretExpression(JSExpression.java:410) at com.ibm.jscript.JSExpression.evaluateValue(JSExpression.java:251) at com.ibm.jscript.JSExpression.evaluateValue(JSExpression.java:234) at com.ibm.xsp.javascript.JavaScriptInterpreter.interpret(JavaScriptInterpreter.java:221) at com.ibm.xsp.javascript.JavaScriptInterpreter.interpret(JavaScriptInterpreter.java:193) at com.ibm.xsp.binding.javascript.JavaScriptValueBinding.getValue(JavaScriptValueBinding.java:78) ... 27 more Caused by: com.ibm.commons.util.io.json.JsonException: Error when parsing JSON string at com.ibm.commons.util.io.json.JsonParser.fromJson(JsonParser.java:61) at com.ibm.jscript.types.FBSGlobalObject$GlobalMethod.call(FBSGlobalObject.java:781) ... 43 more Caused by: com.ibm.commons.util.io.json.parser.ParseException: Encountered " "object "" at line 1, column 2. Was expecting one of: "false" ... "null" ... "true" ... ... ... ... "{" ... "[" ... "]" ... "," ... at com.ibm.commons.util.io.json.parser.Json.generateParseException(Json.java:568) at com.ibm.commons.util.io.json.parser.Json.jj_consume_token(Json.java:503) at com.ibm.commons.util.io.json.parser.Json.arrayLiteral(Json.java:316) at com.ibm.commons.util.io.json.parser.Json.parseJson(Json.java:387) at com.ibm.commons.util.io.json.JsonParser.fromJson(JsonParser.java:59) ... 44 more "
}
Is there any way to get the desired answer?

Well, the best way to achieve the desired result is to use the xe:customRestService if you need to return a cascaded JSON object.
All other xe:***RestService elements assume that you will return a flat JSON construct of parameter and value pairs, where the value is a simple data type (like boolean, number or string and - funny though - arrays) but not a complex data type (like objects).
This is, that this result here
...
"surveyResponse": [
{ "participant": "A",
"selection": [ "a1" ]
},
{ "participant": "B",
"selection": [ "b1", "b2" ]
}
]
...
will be only available on using xe:customRestService where you can define your JSON result by yourself.
Using the other services the results are limited to this constructions:
...
"surveyResponse": true;
...
or
...
"surveyResponse": [
"A",
"B"
]
...

cant you use built-in server-side javascript function toJson ?

You could try intercepting the AJAX call when reading the JSON and then manually de-sanitise the JSON string data.
There are more details here.
http://www.browniesblog.com/A55CBC/blog.nsf/dx/15112012082949PMMBRD68.htm
Personally I'd recommend against this as unless you are absolutely sure the end user can't inject code into the JSON data.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Count number of words across an array of strings - mongodb

You can use $stLenCP and $addFields stage db.subtitle.aggregate([ { $match: { "_id": ObjectId("5d5b889c33acba0b89b97cda") } }, { $addFields: { "length": { $strLenCP: { $reduce: { input: "$text", initialValue: "", in: { $concat: ["$$value", "$$this"] } } } } }} ])

Related

How to get the first element from the result of db.collection.find(...)?

Importing nested JSON documents into Elasticsearch and making them searchable

remove last part with cloudformation syntax

How to use Clauses with API REST Neo4J?

Is there a way to return a JSON object within a xe:restViewColumn?

Categories

Resources