Find one word from given array in string - mongodb

Let's assume we have a user given array:
q = ['dolor', 'sed']
And an item in my db is:
{//data,
'paragraphs' : [{ 'header' : 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam venenatis lectus risus, a interdum lectus rhoncus sed. Vestibulum sit amet massa eu metus iaculis laoreet et non est. ',
//more data
},
{ .. }]
}
I want to find if the 'paragraphs.header' has the word dolor or sed.
I tried $in and search with no success. What should I use?

An inefficient way would be to use regular expressions:
db.foo.find({"paragraphs.header": {$in: [/dolor/, /sed/]}})
Because none of these patterns is anchored to the beginning of the string, the query cannot make good use of an index and has to perform a full scan to find matching documents.
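Note that unanchored patterns such as /sed/ also match inside longer words ("used", "based"). A minimal sketch in Python of building a whole-word pattern from the user's array follows; the helper name build_word_pattern is hypothetical, and it assumes that Python's re.escape output is also valid for MongoDB's regex engine, which should hold for ordinary words. The resulting query is still unanchored, so it scans just like the query above.

```python
import re

def build_word_pattern(words):
    """Build an alternation that matches any of the words as whole words."""
    # Escape each word so characters such as '.' are matched literally.
    escaped = (re.escape(w) for w in words)
    # \b is a word boundary; (?:...) groups without capturing.
    return r"\b(?:" + "|".join(escaped) + r")\b"

q = ["dolor", "sed"]
pattern = build_word_pattern(q)

# Query document that could be passed to a MongoDB driver's find() call.
query = {"paragraphs.header": {"$regex": pattern}}

# The same pattern can be checked locally before it is sent to the server.
print(pattern)
print(bool(re.search(pattern, "a interdum lectus rhoncus sed.")))
print(bool(re.search(pattern, "a used car")))
```

With the array from the question, the pattern matches the sample header but not a string that merely contains "sed" inside another word.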
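If you want an efficient way, you should look at text search (the steps below apply to MongoDB 2.4):
Start mongod with --setParameter textSearchEnabled=true
Create a text index: db.foo.ensureIndex({"paragraphs.header": "text"})
Search: db.foo.runCommand("text", {search: "dolor sed"})
You can specify a language when you create the text index or run the text command, but unfortunately Latin is not supported ;)
From MongoDB 2.6 on, text search is enabled by default and is queried with the $text operator instead, e.g. db.foo.find({$text: {$search: "dolor sed"}}).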

Related

Replace string inside curvy brackets

I need to wrap each string enclosed in parentheses, parentheses included, in \fill{...}, so that (test_string) becomes \fill{(test_string)}. Is this possible?
Example:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam
nonummy nibh euismod tincidunt ut laoreet dolore.
(first_string)
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy
nibh euismod tincidunt ut laoreet dolore.
(second_string)
Transformed into:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam
nonummy nibh euismod tincidunt ut laoreet dolore.
\fill{(first_string)}
Lorem ipsum dolor sit amet, consectetuer
adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore.
\fill{(second_string)}
As suggested, you could use a regex search and replace. In your case, this regular expression should work:
\(([^\)]+)\)
It does the following (explanation taken from an online regex explainer):
\( matches the character ( literally
1st Capturing Group ([^\)]+)
[^\)] matches a single character that is not )
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\) matches the character ) literally
In Visual Studio Code, if you enable regex search by clicking the .* icon in the search bar, you can put this regular expression in. Then, in the replace section, you can put \fill{($1)} where the $1 is the 1st Capturing Group mentioned previously (the first_string, second_string, etc. part found by the regular expression).
There are a lot of Regex posts here on Stackoverflow you may want to read. One notable one is Greedy versus Lazy.
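If the replacement has to be scripted rather than done in an editor, the same regular expression can be applied with Python's re module. This is only a sketch; the function name wrap_in_fill is made up for illustration. A replacement function is used instead of a replacement string so that the backslash in \fill needs no special escaping.

```python
import re

# Same pattern as above: a literal "(", one or more non-")" characters, a literal ")".
PAREN_RE = re.compile(r"\(([^)]+)\)")

def wrap_in_fill(text):
    """Wrap every parenthesised string in \\fill{...}, keeping the parentheses."""
    return PAREN_RE.sub(lambda m: "\\fill{(" + m.group(1) + ")}", text)

sample = "nonummy nibh euismod.\n(first_string)\nut laoreet dolore.\n(second_string)"
print(wrap_in_fill(sample))
```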

Dart: Is there a way to split strings into sentences without using Dart's split method?

I'm looking to split a paragraph of text into individual sentences using Dart. The problem I am having is that sentences can end in a number of punctuation marks (e.g. '.', '!', '?') and in some cases (such as the Japanese language), sentences can end in unique symbols (e.g. '。').
Additionally, Dart's split method removes the split value from the string. For example, "Hello World!" becomes "Hello World" when using the code text.split('! ');
I've looked around at Dart packages available but I'm unable to find anything that does what I'm looking for.
Ideally, I'm looking for something similar to BreakIterator in Java which allows the programmer to define which locale they wish to use when detecting punctuation and also maintains the punctuation mark when splitting the string into sentences. I'm happy to use a solution in Dart that doesn't automatically detect sentence endings based on Locale but if this isn't available I would like to have the ability to define all sentence endings to look for when splitting a string.
Any help is appreciated. Thank you in advance.
It can be done using a regex, something like this:
String str1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. In vulputate odio eros, sit amet ultrices ipsum auctor sed. Mauris in faucibus elit. Nulla quam orci? ultrices a leo a, feugiat pharetra ex. Nunc et ipsum lorem. Integer quis congue nisi! In et sem eget leo ullamcorper consectetur dignissim vitae massa。Nam quis erat ac tellus laoreet posuere. Vivamus eget sapien eget neque euismod mollis.";
// Regular expression: a run of word characters, whitespace, commas and
// apostrophes, then optional sentence-ending punctuation and whitespace.
RegExp re = RegExp(r"(\w|\s|,|')+[。.?!]*\s*");
// Get all the matches:
Iterable<RegExpMatch> matches = re.allMatches(str1);
// Iterate over all matches:
for (RegExpMatch m in matches) {
  // group(0) is the whole match; it is never null for a successful match.
  String match = m.group(0)!;
  print("match: $match");
}
Output:
// match: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
// match: In vulputate odio eros, sit amet ultrices ipsum auctor sed.
// match: Mauris in faucibus elit.
// match: Nulla quam orci?
// match: ultrices a leo a, feugiat pharetra ex.
// match: Nunc et ipsum lorem.
// match: Integer quis congue nisi!
// match: In et sem eget leo ullamcorper consectetur dignissim vitae massa。
// match: Nam quis erat ac tellus laoreet posuere.
// match: Vivamus eget sapien eget neque euismod mollis.
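One caveat: as far as I know, \w in the pattern above only covers ASCII letters, digits and underscore, so text written in Japanese would not be matched by the first group. A variant that avoids this is to match "anything that is not a sentence terminator" instead. The sketch below shows the idea in Python; the function name split_sentences and the terminator set are assumptions, and the pattern itself should carry over to Dart's RegExp largely unchanged.

```python
import re

# A run of non-terminator characters, then any terminators, then trailing whitespace.
SENTENCE_RE = re.compile(r"[^.!?。]+[.!?。]*\s*")

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    sentences = []
    for m in SENTENCE_RE.finditer(text):
        sentence = m.group(0).strip()
        if sentence:  # skip matches that were whitespace only
            sentences.append(sentence)
    return sentences

for s in split_sentences("Hello World! How are you? 元気です。はい。"):
    print("match:", s)
```

Like the original answer, this treats every terminator as a sentence end, so abbreviations such as "e.g." are split too.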

How do I compose format=flowed emails that include hanging indents with vim?

Is there a good way to configure vim to send format=flowed emails that include hanging indents?
My complete vimrc (for testing purposes) is:
set nocompatible
set fo+=awn
set tw=72
set ai
I'm typing something like:
1. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam
posuere dui lorem, et condimentum nulla. Sed pharetra justo nec ante
fringilla non mattis nisi blandit. Donec molestie ligula dolor.
Nulla facilisi. Aliquam vel nulla elit, mollis facilisis metus. Sed
id eros a ante blandit convallis id sit amet elit. Duis malesuada
lobortis leo a placerat. Sed ut ipsum nisl. Sed pretium mauris vitae
velit sollicitudin iaculis.
vim adds a trailing space to each line except the last, per set fo+=w. It also adds spaces for the hanging indent. It looks great!
My mail client sets the format=flowed header. The result when this email is viewed in either Mail.app or mutt is not pretty:
1. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam posuere dui lorem, et condimentum nulla. Sed pharetra justo nec ante fringilla non mattis nisi blandit. Donec molestie ligula dolor. Nulla facilisi. Aliquam vel nulla elit, mollis facilisis metus. Sed id eros a ante blandit convallis id sit amet elit. Duis malesuada lobortis leo a placerat. Sed ut ipsum nisl. Sed pretium mauris vitae velit sollicitudin iaculis.
The paragraph wraps correctly, in the sense that resizing the reader client reflows it (which is not what you'll see here on stackoverflow, but you get the idea). The problem is, there are 5 spaces between "Etiam" and "posuere" and all the other lines that have been joined back together.
Is there a fix for this in vim? Or is this a limitation of the format=flowed spec? How do other people handle this?
This is a limitation of the "format=flowed" MIME parameter as specified in RFC 3676. There is nothing in the specification that would allow a client to recognize the leading spaces as ornaments intended only for plaintext versions of the mail.
Section 4.1 of the RFC states:
If the first character of a line is a space, the line has been space-stuffed (see Section 4.4). Logically, this leading space is deleted before examining the line further (that is, before checking for flowed).
The referenced "space-stuffing" from Section 4.4:
Space-stuffing adds a single space to the start of any line which needs protection when the message is generated. On reception, if the first character of a line is a space, it is logically deleted. This occurs after the test for a quoted line (which logically counts and deletes any quote marks), and before the test for a flowed line.
So an RFC 3676-compliant mail client would remove a single leading space from each line that begins with one, and then (optionally) remove any line breaks that follow a space character. This process would not touch the remaining leading whitespace.
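To make the effect concrete, here is a deliberately simplified sketch in Python of what a receiving client does. It is not a full RFC 3676 implementation: it ignores quoted lines, the "-- " signature separator and the DelSp parameter, and the function name unflow is made up. It shows that only one leading space per line is dropped, so the rest of the hanging indent ends up in the middle of the rejoined paragraph.

```python
def unflow(text):
    """Rejoin format=flowed lines into paragraphs (simplified sketch)."""
    paragraphs = []
    current = ""
    for line in text.split("\n"):
        # Space-stuffing: a single leading space is logically deleted.
        if line.startswith(" "):
            line = line[1:]
        current += line
        # A line ending in a space is "flowed": the paragraph continues.
        if not line.endswith(" "):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs

# Three-space hanging indent plus the trailing space added by fo+=w.
mail = "1. Lorem ipsum Etiam \n   posuere dui lorem, \n   et nulla."
print(unflow(mail))
```

In this example three spaces remain between "Etiam" and "posuere": the trailing space of the first line plus the two indent spaces that survive space-unstuffing.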

How to diff rewrapped text?

When editing documents I always stick to a maximum line width of 80 or 150 characters, depending on what I am writing (code, text, etc.). If I change only a little, the whole paragraph shifts, and many lines are rewrapped to fit the given line width. How do I diff this to see the actual change and not the rewrapping artifacts?
Example, textwidth=30:
The actual changes are rather tiny:
line 9 insert: "Now I change a little"
line 15 insert: "Fill in here something and write totally new stuff with much more lines. "
line 18 change: s/Duis/TYPO/
It does not matter that I use (g)vimdiff here; other software is fine if it can produce the desired diff.
Of course, software can wrap text automatically at the window border, so I also tried putting line breaks only at the end of each paragraph. This is not good either, because diffs are line-based by default, so for a small change in a paragraph I get the whole line, and therefore the whole paragraph, as the diff update :(.
GNU wdiff does a word-by-word diff, not treating spaces and new lines any differently. One can even find vim syntax files for it (e.g. here).
$ cat file1
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Aenean vel molestie
nulla. Pellentesque placerat lacus vel
eros malesuada tristique. Nulla vitae
volutpat justo. Donec est mauris,
$ cat file2
Lorem amet, consectetur adipiscing some
inserted text! elit. Aenean vel molestie
nulla. Pellentesque placerat lacus vel
eros malesuada replacement. Nulla vitae
volutpat justo. Donec est mauris,
$ wdiff file1 file2
Lorem [-ipsum dolor sit-] amet, consectetur
adipiscing {+some inserted text!+} elit. Aenean vel molestie
nulla. Pellentesque placerat lacus vel
eros malesuada [-tristique.-] {+replacement.+} Nulla vitae
volutpat justo. Donec est mauris,
([- ... -] is deleted text, {+ ... +} is inserted text).
(There are other diff programs that do a similar thing: e.g. adiff, and maybe some of the ones listed in https://stackoverflow.com/questions/12625/best-diff-tool)
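If wdiff is not available, the same idea (compare sequences of words, ignoring how they are wrapped) can be sketched with Python's standard difflib module. This is an illustration only, not how wdiff is implemented; the function name word_diff is made up, and the output is a single line rather than a copy of the original wrapping.

```python
import difflib

def word_diff(old, new):
    """Return a wdiff-style word diff that ignores spaces and line breaks."""
    a, b = old.split(), new.split()  # split() treats any whitespace run as one separator
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words only present in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words only present in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

print(word_diff("the quick brown\nfox jumps", "the quick\nbrown cat jumps"))
```

A paragraph that was only rewrapped produces no markers at all, which is exactly the property the question asks for.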
I like Beyond Compare for this kind of side-by-side file comparison. It also lets you do folder comparisons and bit-level comparisons, and you can right-click to select the left-hand file to compare, then another to select the right-hand one; or select two files and right-click Compare to bring them both up straight away.
I use DiffMerge which is free and available on many platforms.

Analyzing and storing text in a data structure

I hope you understand what I want to do. It is hard to choose the best words, because English is not my first language and I distrust automatic translators. I will try to explain as well as I can.
I was thinking about analyzing a long text. Suppose, for example, that I have a string divided into paragraphs.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla vitae elit libero, a pharetra augue. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras mattis consectetur purus sit amet fermentum.
Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Aenean eu leo quam. Pellentesque ornare sem lacinia quam venenatis vestibulum. Cras justo odio, dapibus ac facilisis in, egestas eget quam. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur blandit tempus porttitor. Maecenas sed diam eget risus varius blandit sit amet non magna.
I would like to store this string in an array or something similar, in a way I can find the length or location of the two paragraphs very quickly. For example (pseudocode):
Array => {
paragraphs => {
"Lorem ipsum dolor sit amet, [...] fermentum.",
...
}
}
I don't really know whether this has a name. I suppose there is a lot of theory about how to do this type of task. I am really interested in practices that take performance into account when processing a large amount of text. I would like to have something to study and read carefully.
Any help would be much appreciated. Thanks in advance,
—Alberto
Perhaps read up on Apache UIMA; it is all about analyzing unstructured information, with text analysis being a major component of it.
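For the narrower goal stated in the question (finding the length or location of each paragraph quickly), one simple option is to keep the text as a single string and store an offset and a length per paragraph. The sketch below is unrelated to UIMA; the names Span and index_paragraphs are made up, and it assumes paragraphs are separated by a newline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # offset of the paragraph's first character in the full text
    length: int  # number of characters in the paragraph

def index_paragraphs(text, sep="\n"):
    """Record where each non-empty paragraph starts and how long it is."""
    spans = []
    pos = 0
    for part in text.split(sep):
        if part.strip():
            spans.append(Span(pos, len(part)))
        pos += len(part) + len(sep)  # skip past the paragraph and its separator
    return spans

text = "Lorem ipsum.\nDuis mollis, est."
spans = index_paragraphs(text)
second = spans[1]
print(second, text[second.start:second.start + second.length])
```

Building the index is a single pass over the text; afterwards, the length or location of any paragraph is a list lookup, and the paragraph itself is one slice of the original string.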