I have a file:
"data_personnel": [
{
"id": "1",
"name": "Mathieu"
}
],
"struct_hospital": [
{
"id": "9",
"geo": "chamb",
"nb": ""
},
{
"id": "",
"geo": "jsj",
"nb": "SMITH"
},
{
"id": "10",
"geo": "",
"nb": "12"
},
{
"id": "2",
"geo": "marqui",
"nb": "20"
},
{
"id": "4",
"geo": "oliwo",
"nb": "1"
},
{
"id": "1",
"geo": "par",
"nb": "5"
}
]
How can I use sed to get all the values of geo in struct_hospital (chamb, jsj, , marqui, oliwo, etc.)?
The file can be in any form: with tabs, everything on one line, etc.
As pointed out by Sundeep, it makes more sense to use a proper JSON parser.
But if you are looking for a one-time quick and dirty solution, then this might do:
sed -n '/^"struct_hospital"/,/^]/s/^.*"geo"\s*:\s*"\([^"]*\)"\s*,\?.*$/\1/p' input.txt
Sample output (the third line is blank because that geo value is empty):
chamb
jsj

marqui
oliwo
par
Explanation:
/^"struct_hospital"/,/^]/ - only consider lines between struct_hospital and the closing bracket.
s/.../\1/p - search and replace; print only the first capture group of every matching line.
^.*"geo"\s*:\s*"\([^"]*\)"\s*,\?.*$ - matches the geo lines and captures the quoted value following the colon.
In case the input spans a single line, you can use another sed invocation as a preprocessor to insert line breaks:
sed 's/]\|,/\n&/g'
This makes the full command:
sed 's/]\|,/\n&/g' input.txt | sed -n '/^"struct_hospital"/,/^]/s/^.*"geo"\s*:\s*"\([^"]*\)"\s*,\?.*$/\1/p'
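For comparison, here is a minimal sketch of the JSON-parser route mentioned at the top of this answer, using Python's standard json module. It assumes the fragment in the question is missing only its outer braces, so the helper (geo_values, a name made up for this example) adds them before parsing. Whitespace and line layout do not matter to the parser.

```python
import json

def geo_values(text: str) -> list[str]:
    """Return every "geo" value under "struct_hospital", in file order."""
    # Assumption: the fragment lacks the enclosing braces, so add them
    # to obtain a valid JSON object before parsing.
    doc = json.loads("{" + text + "}")
    return [entry.get("geo", "") for entry in doc["struct_hospital"]]

sample = '''
"data_personnel": [{"id": "1", "name": "Mathieu"}],
"struct_hospital": [
    {"id": "9", "geo": "chamb", "nb": ""},
    {"id": "", "geo": "jsj", "nb": "SMITH"},
    {"id": "10", "geo": "", "nb": "12"}
]
'''

print(geo_values(sample))  # ['chamb', 'jsj', '']
```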
I have the malformed JSON file below. I want to quote the value of email, i.e. get "sampleemail#sampledoman.co.org". How do I go about it? I tried the command below, but it doesn't work.
sed -e 's/"email":\[\(.*\)\]/"email":["\1"]/g' sample.json
where sample.json looks like this:
{
"supplementaryData": [
{
"xmlResponse": {
"id": "100001",
"externalid": "200001",
"info": {
"from": "C1022929291",
"phone": "000963586",
"emailadresses": {
"email": [sampleemail#sampledoman.co.org
]
}
},
"status": "OK",
"channel": "mobile"
}
}
]
}
Your code does not work because:
[ is not escaped, so it is not treated as a literal.
You are using BRE, so capturing brackets need to be escaped. In its current format, you will need -E to use extended functionality.
The line does not end with ].
You did not include the space after the colon, so there is no match and hence no replacement.
For your code to work, you can use:
$ sed -E 's/"email": \[(.*)/"email": ["\1"/' sample.json
or
$ sed -E '/\<email\>/s/[a-z#.]+$/"&"/' sample.json
{
"supplementaryData": [
{
"xmlResponse": {
"id": "100001",
"externalid": "200001",
"info": {
"from": "C1022929291",
"phone": "000963586",
"emailadresses": {
"email": ["sampleemail#sampledoman.co.org"
]
}
},
"status": "OK",
"channel": "mobile"
}
}
]
}
With your shown samples, please try the following awk code, written and tested in GNU awk. It sets RS to the empty string, so the whole file (which contains no blank lines) is read as a single record, and then calls match with the regex (.*)(\n[[:space:]]+"emailadresses": {\n[[:space:]]+"email": \[)([^\n]+)(.*). The regex has 4 capturing groups, which GNU awk's three-argument match function stores in the array arr. The print statement then reassembles the record, adding " before and after arr[3], the email address.
awk -v RS= '
match($0,/(.*)(\n[[:space:]]+"emailadresses": {\n[[:space:]]+"email": \[)([^\n]+)(.*)/,arr){
print arr[1] arr[2] "\"" arr[3] "\"" arr[4]
}
' Input_file
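The same quoting idea can be sketched with Python's re module. This is only an illustration, not a replacement for fixing whatever produces the file: the function name quote_bare_email is made up, and the pattern assumes the bare value contains no whitespace, quotes or closing bracket.

```python
import re

# Group 1: the key and opening bracket; group 2: the unquoted value.
BARE_EMAIL = re.compile(r'("email":\s*\[)\s*([^"\]\s]+)')

def quote_bare_email(text: str) -> str:
    """Wrap an unquoted value after "email": [ in double quotes."""
    return BARE_EMAIL.sub(r'\1"\2"', text)

broken = '"email": [sampleemail#sampledoman.co.org\n]'
fixed = quote_bare_email(broken)
print(fixed)
```

A value that is already quoted does not match the pattern, so running the function twice leaves the text unchanged.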
I am trying to append multi-line text containing punctuation to a file in AHK, but the command seems to break at the first line even if I use round brackets. How can I append the text to a file without it breaking?
FileAppend, ({
"accounts": [
{
"active": true,
"type": "dummy",
"ygg": {
"extra": {
"clientToken": "123456789",
"userName": "BX0W"
},
"iat": 1655273051,
"token": "BX0W"
}
}
],
"formatVersion": 3
}), accounts.json
Try this instead. In my experience, even inside a function, a continuation section only works when the opening and closing parentheses are each at the beginning of their own line (with no preceding whitespace). I am guessing that it has something to do with the compiler.
FileAppend,
(
{
"accounts": [
{
"active": true,
"type": "dummy",
"ygg": {
"extra": {
"clientToken": "123456789",
"userName": "BX0W"
},
"iat": 1655273051,
"token": "BX0W"
}
}
],
"formatVersion": 3
}
), accounts.json
Let's say I have a collection like so:
{
"id": "2902-48239-42389-83294",
"data": {
"location": [
{
"country": "Italy",
"city": "Rome"
}
],
"time": [
{
"timestamp": "1626298659",
"data":"2020-12-24 09:42:30"
}
],
"details": [
{
"timestamp": "1626298659",
"data": {
"url": "https://example.com",
"name": "John Doe",
"email": "john#doe.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "https://www.myexample.com",
"name": "John Doe",
"email": "doe#john.com"
}
},
{
"timestamp": "1626298652",
"data": {
"url": "http://example.com/sub/directory",
"name": "John Doe",
"email": "doe#johnson.com"
}
}
]
}
}
The main focus is on the array of subdocuments ("data.details"). I want the output to contain only the relevant matches, e.g.:
db.info.find({"data.details.data.url": "example.com"})
How can I match every document where "data.details.data.url" contains "example.com" but not "myexample.com"? With $regex I get too many results: a query for "example.com" also returns "myexample.com".
Even when I do get partial results (with $match), it is very slow. I tried these aggregation stages:
{ $unwind: "$data.details" },
{
$match: {
"data.details.data.url": /.*example.com.*/,
},
},
{
$project: {
id: 1,
"data.details.data.url": 1,
"data.details.data.email": 1,
},
},
I really don't understand the pattern: with $match, Mongo sometimes recognizes prefixes like "https://" or "https://www." and sometimes does not.
More info:
My collection is dozens of GB in size. I created two indexes:
Compound like so:
"data.details.data.url": 1,
"data.details.data.email": 1
Text Index:
"data.details.data.url": "text",
"data.details.data.email": "text"
It did improve the query performance, but not enough, and I still have this issue with $match vs. $regex. Thanks for any help!
Your mistake is in the regex. It matches all the URLs because the substring example.com appears in all of them. For example, https://www.myexample.com matches on its trailing example.com.
To avoid this, use a different regex, for example one that requires the domain to come directly after the scheme or the www. prefix.
For example:
(http[s]?:\/\/|www\.)YOUR_SEARCH
checks that what you are searching for comes directly after http://, https:// or www.
https://regex101.com/r/M4OLw1/1
Here is the full query.
[
{
'$unwind': {
'path': '$data.details'
}
}, {
'$match': {
'data.details.data.url': /(http[s]?:\/\/|www\.)example\.com/
}
}
]
Note: you must escape special characters in the regex. An unescaped dot matches any character, and an unescaped slash closes the regex literal, causing an error.
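To see the difference between the two patterns outside MongoDB, here is a small check using Python's re module against the three URLs from the question. It only illustrates the regex logic; it does not query a database, and the variable names are made up for the example.

```python
import re

# Unanchored pattern from the question: matches anywhere in the URL.
loose = re.compile(r"example\.com")
# Pattern from this answer: the domain must directly follow the scheme or "www."
strict = re.compile(r"(https?://|www\.)example\.com")

urls = [
    "https://example.com",
    "https://www.myexample.com",
    "http://example.com/sub/directory",
]

loose_hits = [u for u in urls if loose.search(u)]
strict_hits = [u for u in urls if strict.search(u)]

print(loose_hits)   # all three URLs
print(strict_hits)  # ['https://example.com', 'http://example.com/sub/directory']
```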
I'm new to MongoDB and have 4 different problems importing a big (16 GB) JSONL file into my MongoDB (a simple PSA cluster).
Attached below is a sample entry from the JSON dump mentioned above.
I get this file from an external provider, and I have 4 problems with it:
"hotel_id" is the key and should be (re)named to "_id".
"hotel_id" should be treated as a Number rather than a string.
"location" is not properly formatted as GeoJSON (if I understood the MongoDB manual correctly). It should look like
"location": {
"type": "Point",
"coordinates": [-93.26838,37.15845]
}
instead of
"location": {
"coordinates": {
"latitude": 37.15845,
"longitude": -93.26838
}
}
Can "dates" be used to efficiently update only the records that need to be updated?
So my challenge is to transform the data according to my needs, either before or during the import, and in both cases as quickly as possible.
I searched a lot for hints and best practices but have not found a solution yet, maybe because I'm a beginner with MongoDB.
I played around with "jq" to adjust the data, for example to add the type that seems to be necessary for the location (point 3), but wasn't really successful.
cat dump.jsonl | ./bin/jq --arg typeOfField Point '.location + {type: $typeOfField}'
Besides that, I imported a sample dump of roughly 500 MB, which took 1.5 minutes the first time (empty database). If I run it in "upsert" mode, it will take roughly 12 hours. So I am also wondering what the best practice is for importing such a big JSON dump.
Any help is appreciated!! :-)
Kind regards,
Lumpy
{
"hotel_id": "12345",
"name": "Test Hotel",
"address": {
"line_1": "123 Test St",
"line_2": "Apt A",
"city": "Test City",
},
"ratings": {
"property": {
"rating": "3.5",
"type": "Star"
},
"guest": {
"count": 48382,
"average": "3.1"
}
},
"location": {
"coordinates": {
"latitude": 22.54845,
"longitude": -90.11838
}
},
"phone": "555-0153",
"fax": "555-7249",
"category": {
"id": 1,
"name": "Hotel"
},
"rank": 42,
"dates": {
"added": "1998-07-19T05:00:00.000Z",
"updated": "2018-03-22T07:23:14.000Z"
},
"statistics": {
"11": {
"id": 11,
"name": "Total number of rooms - 220",
"value": "220"
},
"12": {
"id": 12,
"name": "Number of floors - 7",
"value": "7"
}
},
"chain": {
"id": -2,
"name": "Test Hotels"
},
"brand": {
"id": 2,
"name": "Test Brand"
}
}
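As one possible way to handle points 1 to 3 before the import, here is a minimal per-record sketch in Python. It assumes every record has a numeric "hotel_id" string and a "location.coordinates" object with "latitude" and "longitude", as in the sample above; the function name to_mongo_doc is made up for this example, and the sketch does not address import speed or point 4.

```python
import json

def to_mongo_doc(record: dict) -> dict:
    """Rename hotel_id to a numeric _id and rewrite location as a GeoJSON Point."""
    doc = dict(record)  # shallow copy so the caller's dict keeps its keys
    # Points 1 and 2: "hotel_id" becomes "_id" and is stored as a number.
    doc["_id"] = int(doc.pop("hotel_id"))
    # Point 3: GeoJSON lists longitude first, then latitude.
    coords = doc["location"]["coordinates"]
    doc["location"] = {
        "type": "Point",
        "coordinates": [coords["longitude"], coords["latitude"]],
    }
    return doc

line = (
    '{"hotel_id": "12345", "name": "Test Hotel", "location": '
    '{"coordinates": {"latitude": 22.54845, "longitude": -90.11838}}}'
)
converted = to_mongo_doc(json.loads(line))
print(json.dumps(converted))
```

Because a JSONL dump holds one record per line, the same function could be applied line by line without loading the whole 16 GB file into memory.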
I have a TextMate grammar with just strings, numbers and colons.
Is it possible to have a pattern that matches the strings that are followed by a colon with a different name than the strings that are not followed by a colon?
(if this is not possible, but it is possible to match a colon followed by a string, that would also suffice)
example:
"a" 1 "this one should be different": "c" "d"
The expected output would be 6 tokens:
"a" //string
1 //number
"this one should be different" //label
":" //colon
"c" //string
"d" //string
ebnf:
VALUE: ( STRING ":" )? NUMBER | STRING
STRING: ....
NUMBER: ....
my current source:
{
"name": "mylang",
"scopeName": "source.mylang",
"patterns": [
{
"match": ":",
"name": "punctuation.definition.colon.mylang"
},
{
"include": "#string"
}
],
"repository": {
"string": {
"begin": "\"",
"beginCaptures": {
"0": {
"name": "punctuation.definition.string.quote.begin.mylang"
}
},
"end": "\"",
"endCaptures": {
"0": {
"name": "punctuation.definition.string.quote.end.mylang"
}
},
"name": "string.quoted.double.mylang",
"patterns": [
{
"include": "#stringcontent"
}
]
},
"stringcontent": {
"patterns": [
{
"match": "(?x) # turn on extended mode\n \\\\ # a literal backslash\n (?: # ...followed by...\n [\"\\\\/bfnrt] # one of these characters\n | # ...or...\n u # a u\n [0-9a-fA-F]{4}) # and four hex digits",
"name": "constant.character.escape.mylang"
},
{
"match": "\\\\.",
"name": "invalid.illegal.unrecognized-string-escape.mylang"
}
]
}
}
}
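One idea that might apply here is a lookahead: match a string as a label only when it is followed by optional whitespace and a colon. The sketch below demonstrates the idea with Python's re module rather than with a TextMate grammar, so it is only an illustration of the regex logic. The group names are made up, the number rule (\d+) is an assumption because the question elides it, and in TextMate a lookahead can only see the rest of the current line.

```python
import re

TOKEN = re.compile(r'''
    (?P<label>"(?:[^"\\]|\\.)*"(?=\s*:))   # a string directly followed by a colon
  | (?P<string>"(?:[^"\\]|\\.)*")          # any other string
  | (?P<colon>:)
  | (?P<number>\d+)                        # assumed number syntax
''', re.VERBOSE)

def tokenize(text: str) -> list[tuple[str, str]]:
    """Return (kind, lexeme) pairs; characters matching no rule are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

tokens = tokenize('"a" 1 "this one should be different": "c" "d"')
for kind, lexeme in tokens:
    print(kind, lexeme)
```

On the example line from the question this yields the 6 expected tokens, with only the third string classified as a label.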