Eclipse CDT autocomplete for traditional C identifiers

Eclipse autocomplete works OK for CamelCaseIdentifiers. But it is completely useless for MORE_TRADITIONAL_style_identifiers, which have upper-case prefixes and are separated by underscores.
Something like MTsi should match the latter, just like CCI matches the former.
Is there a way to do that? I could not find any preference.
Incidentally, there is MTst*id.

It looks like this already works, as long as you capitalize every letter in the query:
int MORE_TRADITIONAL_style_identifier();

int main() {
    int x = MTSI/*complete*/ // <-- completes MORE_TRADITIONAL_style_identifier
}
But it doesn't if some of the letters in the query are not capitalized, such as MTsi. I think capital letters are the signal to the matching algorithm that two subsequent letters are potentially the beginnings of two different segments, whereas a sequence of lowercase letters like si just expects to find that substring verbatim.
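That hypothesis can be illustrated with a toy matcher (a Python sketch; this is not Eclipse's actual algorithm): each uppercase letter in the query may start a new segment of the identifier, while lowercase letters must continue the current segment verbatim.

import re

def subword_match(query, identifier):
    # Toy model: every uppercase query letter may start a new segment;
    # lowercase letters must continue the current segment verbatim.
    segments = [s for s in identifier.split('_') if s]
    # Break the query at uppercase letters:
    # 'MTSI' -> ['M', 'T', 'S', 'I'], but 'MTsi' -> ['M', 'Tsi'],
    # i.e. 'si' stays glued to the 'T' piece.
    pieces = re.findall(r'[A-Z][a-z]*|[a-z]+', query)
    i = 0
    for piece in pieces:
        # Find a segment at or after position i that starts with this piece.
        while i < len(segments) and not segments[i].lower().startswith(piece.lower()):
            i += 1
        if i == len(segments):
            return False
        i += 1
    return True

print(subword_match('MTSI', 'MORE_TRADITIONAL_style_identifier'))  # True
print(subword_match('MTsi', 'MORE_TRADITIONAL_style_identifier'))  # False: no segment starts with 'tsi'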
If you feel the matching algorithm could be improved to handle mixed-case queries like this better, you could consider filing a bug and/or contributing a patch.

How to find common patterns in thousands of strings?

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]
Not like that.
But bigger patterns like this:
If I have to __, then I will ___.
["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]
I want to discover patterns in a large array or database of strings, say, going over the contents of an entire book.
Is there a way to find patterns like this?
I can work with JavaScript, Python, PHP.
The following could be a starting point:
The RegExp rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g looks for small (multiple word) patterns that occur at least twice in the text.
By playing around with the repeat quantifier + after (\s+\w+\b) (i.e. changing it to something like {2}) you can restrict your word patterns to a specific number of words (with {2} it is 3 words: the first word plus 2 repetitions), and you will get different results.
(?=.+\1)+ is a lookahead pattern that does not consume any of the matched parts of the string, so there is "more string" left for the remaining match attempts in the while loop.
const str="If I have to do it, then I will do it right. Even if I have to make it, I will not make it without Jack. If I have to do, I will not."
const rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g, r={};
let t;
while (t=rx.exec(str)) r[t[1]]=(rx.lastIndex+=1-t[1].length);
const res=Object.keys(r).map(p=>
[p,[...str.matchAll(p)].length]).sort((a,b)=>b[1]-a[1]||b[0].localeCompare(a[0]));
// list all repeated patterns and their occurrence counts,
// ordered by occurrence count and alphabet:
console.log(res);
I extended my snippet a little by collecting all the matches as keys in an object (r). At the end I take the keys of this object with Object.keys(r), pair each pattern with its occurrence count, and sort by count and then alphabetically.
In the while loop I also reset the rx.lastIndex property so the search for the next pattern starts immediately after the start of the last one found: rx.lastIndex+=1-t[1].length.
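Since the question also mentions Python, here is a rough port of the same idea; it uses the standard re module (backreferences are allowed inside a lookahead there too) and a manual search position instead of lastIndex. The variable names are mine:

import re
from collections import Counter

text = ("If I have to do it, then I will do it right. "
        "Even if I have to make it, I will not make it without Jack. "
        "If I have to do, I will not.")

# Same idea as above: a multi-word sequence that occurs again later on.
rx = re.compile(r'(\b\w+(?:\s+\w+\b)+)(?=.+\1)', re.DOTALL)

counts = Counter()
pos = 0
while (m := rx.search(text, pos)):
    pattern = m.group(1)
    counts[pattern] = len(re.findall(re.escape(pattern), text))
    # Restart just after the start of this match (the lastIndex trick).
    pos = m.start(1) + 1

# Repeated patterns with their occurrence counts, most frequent first:
for pattern, n in counts.most_common():
    print(n, repr(pattern))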

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, but none of them are error-free according to https://www.regextester.com/, so I'm really hoping for help from this community.
What I've tried (piecemeal, but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be: split the input on [\s\W]+ (e.g. with Python's re.split; if you really need strings with more than one word, you can check the length of the result), keep each resulting word whose first character is uppercase (e.g. with Python's str.isupper), and finally filter those against a dictionary of known proper nouns.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" into a regex is practically impossible, and using isupper also works with non-Latin letters (when your strings are Unicode, [A-Z] won't be sufficient to detect uppercase). Filtering against a dictionary is a far more straightforward approach and much easier to maintain (I would recommend a set or another data type suited for fast lookups).
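A minimal sketch of that approach in Python; the proper-noun set and the pattern for the curly-bracket parameters are placeholders you would adapt:

import re

PROPER_NOUNS = {"TikTok"}  # placeholder; maintain your real exceptions here

def flag_title_case(s):
    s = re.sub(r'\{[^}]*\}', ' ', s)              # disregard {parameters}
    words = [w for w in re.split(r'[\s\W]+', s) if w]
    if len(words) < 2:                            # more than one word only
        return False
    # At least one capitalized word after the first that is not a proper noun.
    return any(w[0].isupper() and w not in PROPER_NOUNS for w in words[1:])

print(flag_title_case("Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street."))  # True
print(flag_title_case("Don't post that video on TikTok."))  # False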
Maybe if you can define your use case more clearly we can work out a pure regex solution...

How do I use pg_trgm to be more permissive

I used pg_trgm to check string matches and I am pretty happy with the results. But it is not perfectly the way I want it. I want searches like "poduto" to find "produtos" (the r was missing), and also "sofáa" to find "sofa". I am using PostgreSQL 9.6.
It does find "vermelho" when I type "vermelo" (the h is missing). And it does find "sofa" when I type "sof". It seems that only some letters in the middle can be left out, and I can always miss a final letter. I want to be able to miss any letter in the middle of the word, and also to be able to make "two mistakes", as in the case of "sofáa" and "sofá" (I used an accent and one additional "a").
The solution is to lower pg_trgm.similarity_threshold (or pg_trgm.word_similarity_threshold if you are using <% or %>).
Then words with lower similarity will also be found.
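For example (sketched with psycopg2; the products table, its name column, and the connection string are made up for illustration):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# The default is 0.3; lowering it makes the % operator match more
# distant strings, so "poduto" can still find "produtos".
cur.execute("SET pg_trgm.similarity_threshold = 0.2")

cur.execute(
    "SELECT name, similarity(name, %s) AS sml "
    "FROM products WHERE name %% %s "   # %% escapes the % operator
    "ORDER BY sml DESC",
    ("poduto", "poduto"))
print(cur.fetchall())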

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, which I then use to calculate the difference between a given text and a known corpus.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ascii characters, but I would very much like to have it working with unicode multibyte support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, ngrams appear in which the first byte of such a rune is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over-analysing this problem.
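To make the failure mode concrete (my illustration, in Python for brevity): taking fixed-size byte windows over UTF-8 splits a two-byte character, while character-level windows do not.

text = "Straße"
data = text.encode("utf-8")  # 7 bytes for 6 characters; 'ß' takes 2 bytes

# 2-gram windows over bytes split 'ß' in half:
print([data[i:i + 2] for i in range(len(data) - 1)])
# ... includes b'a\xc3' and b'\x9fe'

# 2-gram windows over characters are fine:
print([text[i:i + 2] for i in range(len(text) - 1)])
# ['St', 'tr', 'ra', 'aß', 'ße']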
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a new key from a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.
import "strings"

// Parse builds a frequency table of rune-based ngrams of length n.
func Parse(text string, n int) map[string]int {
    // chars is a circular buffer of runes, mirrored at offset n so that
    // the current window chars[k:k+n] is always contiguous.
    chars := make([]rune, 2*n)
    table := make(map[string]int)
    k := 0
    // Ranging over a string yields whole runes, not bytes, so multibyte
    // characters are never split.
    for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
        chars[n+k] = chars[k] // mirror the new rune into the second half
        k = (k + 1) % n
        table[string(chars[k:k+n])]++ // count the last n runes as one ngram
    }
    return table
}

Fully correct Unicode visual string reversal

[Inspired largely by trying to explain the problems with Character Encoding independent character swap, but also by these other questions, neither of which contains a complete answer: How to reverse a Unicode string, How to get a reversed String (unicode safe)]
Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to codepoint boundaries rather than going byte-by-byte. But that's not good enough, because of combining glyphs; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special case characters, like bidi overrides and final forms, that will have to be fixed up.
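A quick Python illustration of the combining-glyph problem: reversing by codepoint detaches the accent from its base character.

# 'é' spelled as 'e' + U+0301 COMBINING ACUTE ACCENT
s = "caf" + "e\u0301"
print(s)         # café
print(s[::-1])   # the accent now precedes the 'e' and has no base to combine with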
This pseudo-algorithm handles all the easy cases I know about:
Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string)
Reverse the order of this list.
For each string in the list:
Segment the string into grapheme clusters.
Reverse the order of the grapheme clusters.
Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa)
Join the sequence back into a string.
Recombine the list of reversed words to produce the final reversed string.
... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?
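For the easy cases, here is a rough Python sketch of the steps above. It relies on the third-party regex module (whose \X matches an extended grapheme cluster), uses simple whitespace splitting for word segmentation, and its form-fixup table covers only the Hebrew kaf example; a real implementation would need full bidi and positional-form handling.

import regex  # third-party; supports \X for grapheme clusters

FINAL_FORMS = {'\u05DB': '\u05DA'}  # HEBREW LETTER KAF -> FINAL KAF
REGULAR_FORMS = {v: k for k, v in FINAL_FORMS.items()}

def fix_forms(clusters):
    # Reassign the base character of the first and last cluster if needed.
    if clusters:
        last = clusters[-1]
        clusters[-1] = FINAL_FORMS.get(last[0], last[0]) + last[1:]
        first = clusters[0]
        clusters[0] = REGULAR_FORMS.get(first[0], first[0]) + first[1:]
    return clusters

def visual_reverse(s):
    # 1. Alternating words and separators (a capturing split keeps both).
    parts = regex.split(r'(\s+)', s)
    # 2. Reverse the order of the list.
    parts.reverse()
    out = []
    for part in parts:
        # 3a. Segment into grapheme clusters and reverse their order.
        clusters = regex.findall(r'\X', part)
        clusters.reverse()
        # 3b. Fix up positional forms, then rejoin.
        out.append(''.join(fix_forms(clusters)))
    # 4. Recombine into the final reversed string.
    return ''.join(out)

print(visual_reverse('hello wörld'))  # dlröw olleh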