Using regexp to index a file for imenu, performance is unacceptable - emacs

I'm writing a function for imenu-create-index-function to index a source-code module, for csharp-mode.el.
It works, but delivers completely unacceptable performance. Any tips for fixing this?
The Background
I looked at js.el, the rebadged "espresso" that has been included in Emacs since v23.2. It indexes JavaScript files very nicely, and does a good job with anonymous functions and the various coding styles and patterns in common use. For example, in JavaScript one can do:
(function() {
  var x = ... ;
  function foo() {
    if (x == 1) ...
  }
})();
...to define a scope where x is "private", or inaccessible from other code. This gets indexed nicely by js.el, using regexps, and the inner functions (anonymous or not) within that scope are indexed as well. It works quickly: a big module can be indexed in less than a second.
I tried following a similar approach in csharp-mode, but it's quite a bit more complicated. In JavaScript, everything that gets indexed is a function, so the starting regex is "function", with some elaboration on either end. Once an occurrence of the function keyword is found, 4-8 other regexps get tried via looking-at - the number depends on settings. One nice thing about js mode is that you can turn regexps for various coding styles on or off, to speed things along I suppose. The default "styles" work for most of the code I tried.
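For concreteness, the overall shape of that scan, as I understand it, is something like the following. This is a simplified sketch; the function name and the regexps are mine, not the actual ones in js.el:

(defun my-js-index-sketch ()
  "Simplified sketch of a js.el-style scan; not the real js.el code."
  (let ((index '()))
    (save-excursion
      (goto-char (point-min))
      ;; One cheap anchor keyword...
      (while (re-search-forward "\\_<function\\_>" nil t)
        ;; ...then a handful of looking-at tests for specific styles.
        (when (looking-at "\\s-+\\([A-Za-z_$][A-Za-z0-9_$]*\\)\\s-*(")
          (push (cons (match-string-no-properties 1) (match-beginning 1))
                index))))
    (nreverse index)))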
This approach doesn't carry over well to csharp-mode. It works, but it performs poorly enough to make it not very usable. I think the reasons for this are:
there is no single marker keyword in C# that behaves the way function does in JavaScript. In C# I need to look for namespace, class, struct, interface, enum, and so on.
there's a great deal of flexibility in how C# constructs can be defined. As one example, a class can declare base classes as well as implemented interfaces. Another example: the return type for a method isn't a simple word-like string, but can be something messy like Dictionary<String, List<String>>. The index routine needs to handle all those cases and capture the matches. This makes it run sloooooowly.
I use a lot of looking-back. The marker I use in the current approach is the open curly brace. Once I find one of those, I use looking-back to determine whether the curly belongs to a class, interface, enum, method, etc. (a sketch of this loop follows this list). I've read that looking-back can be slow; I'm not clear on how much slower it is than, say, looking-at.
once I find an open-close pair of curlies, I call narrow-to-region in order to index what's inside. I'm not sure whether this kills performance. I suspect it is not the main culprit, because the perf problems I see happen in modules with one namespace and 2 or 3 classes, which means narrow gets called only 3 or 4 times in total.
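Since the actual code isn't shown here, a hypothetical reconstruction of the loop described in the last two points might look like this (regexps illustrative only):

(while (re-search-forward "{" nil t)
  ;; Classify each brace by scanning backward from point; calling
  ;; looking-back with no LIMIT lets it scan arbitrarily far back,
  ;; which is the suspected cost.
  (cond
   ((looking-back "\\_<class\\_>[^{}]*{" nil)
    nil)    ; treat as a class: narrow-to-region and index the body
   ((looking-back "\\_<interface\\_>[^{}]*{" nil)
    nil)))  ; treat as an interface, and so on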
What's the Question?
My question is: do you have any tips for speeding up imenu-like indexing in a C# buffer?
I'm considering:
avoiding looking-back. I don't know exactly how to do this because when re-search-forward finds, say, the keyword class, the cursor is already in the middle of a class declaration. looking-back seems essential.
instead of using open-curly as the marker, use the keywords like enum, interface, namespace, class
avoid narrow-to-region
Any hard advice? Further suggestions?
Something I've tried and I'm not really enthused about re-visiting: building a wisent-based parser for C#, and relying on semantic to do the indexing. I found semantic to be very very very (etc) difficult to use, hard to discover, and problematic. I had semantic working for a while, but then upgraded to v23.2, and it broke, and I never could get it working again. Simple things - like indexing the namespace keyword - took a very long time to solve. I'm very dissatisfied with it and don't want to try again.

I don't really know C# syntax, and without looking at your elisp it's hard to give an answer, but here goes anyway.
looking-back can be deadly slow, so it's the first thing I'd experiment with. One thing that helps a lot is using the limit arg to, say, restrict the search to the beginning of the current line. A different approach: when you hit the open curly, do backward-char then backward-sexp (or whatever) to get to the front of the previous word, then use looking-at.
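In code, the two suggestions might look something like this (the regexps and the number of backward-sexp calls are illustrative; real C# declarations will need more flexibility):

;; 1. Give looking-back a LIMIT so the backward scan stays cheap:
(looking-back "\\_<class\\_>[^{}]*{" (line-beginning-position))

;; 2. Or walk backward from the brace and use looking-at instead:
(save-excursion
  (backward-char)   ; step back off the open curly
  (backward-sexp)   ; skip the preceding symbol (e.g. the type name)
  (backward-sexp)   ; and again, to reach the keyword (adjust as needed)
  (looking-at "\\_<\\(class\\|interface\\|enum\\)\\_>"))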
Using keywords to search around instead of the open curly is probably what I would have done. Maybe something like (re-search-forward "\\(enum\\|interface\\|namespace\\|class\\)[ \t\n]*{" nil t) and then use match-string-no-properties on the first capture group to see which keyword was found. This might help with the looking-back problem as well.
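A runnable sketch of that idea (the function name is hypothetical, and this regexp deliberately matches just the bare keywords, since real declarations have modifiers, names, and base lists between the keyword and the brace):

(defun my-csharp-scan-sketch ()
  "Collect (KEYWORD . POSITION) pairs for the C#-ish keywords."
  (let ((found '()))
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward
              "\\_<\\(enum\\|interface\\|namespace\\|class\\)\\_>" nil t)
        ;; Dispatch on the capture group to see which keyword was hit.
        (push (cons (match-string-no-properties 1) (match-beginning 1))
              found)))
    (nreverse found)))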
I don't know how expensive narrow-to-region is, but it could be avoided: when you find an open curly, do save-excursion then forward-sexp, and keep that position as a limit for the current iteration of your (I assume recursive) searches.
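Something like this, perhaps (a sketch; point is assumed to be just after the "{" that re-search-forward found, and inner-regexp is a placeholder):

(let ((limit (save-excursion
               (backward-char)   ; back onto the open curly
               (forward-sexp)    ; jump over the balanced {...} pair
               (point))))
  ;; Use LIMIT as the bound for the nested searches, e.g.
  ;; (re-search-forward inner-regexp limit t), instead of narrowing.
  limit)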

Related

Why was the word "refactor" chosen for changing a program or part of it?

My question is partly linguistic, but closely related to programming (of almost anything, web pages or otherwise).
I would like to know why the word refactor was chosen for changing a program or part of it, when another word would probably be more exact and describe the change better.
IDEs (for example NetBeans or Eclipse) use this word mainly for renaming some part of the chosen program (project), including moving a file to another place (which, from the OS's point of view, is probably just renaming).
But renaming is not about changing a factor (since a factor is not changed by being renamed).
Closer to the meaning of the word refactor (as changing a factor) is the manual rewriting of some part, where the rewritten part has changed behaviour internally (but not in what the program does from the outside, as described in the topic What is refactoring and what is only modifying code?).
The word "Refactoring" is derived from mathematics where you find an equivalent expression by applying factoring again. The equivalent expression does not change the final outcome but it is much easier to understand, use, or reuse.
There are many refactoring techniques and renaming is one of them. Other techniques include extract method, extract class, move method, move class, pull/push method to super/sub-class and many more.

What's the recommended replacement for Perl's deprecated-ish given/when?

Now that the Perl devs have decided to sort-of deprecate given/when statements, is there a recommended replacement, beyond just going back to if/elsif/else?
if/elsif/else chains are the best option most of the time, except when something completely different is better than both if/elsif/else and given/when, which is actually reasonably often. Examples of "completely different" approaches are creating different types of objects to handle different scenarios and letting method dispatch do your work for you, or finding an opportunity to make your code more data-driven. Both of those, if they're appropriate and you do them right, can greatly reduce the number of "switch statement" constructs in your code.
Just as a supplement, I've found that a combination of 'for' and if/elsif/else is good if you have some given/when/default code that needs to be quickly updated. Just replace given with for and replace the when statements with a cascade of if & elsif, and replace default with else. This allows all your tests to continue using $_ implicitly, requiring less rewriting. (But be aware that other special smart match features will not work any more.)
This is just for rewriting code that already uses given/when, though. For writing new code, @hobbs has the right answer.

What does internal mean in function names in Emacs Lisp?

Some people use a double dash to indicate that a function is subject to change:
What does the double minus (--) convention in function names mean in Emacs Lisp
Does including internal in function names mean similar things?
Two examples:
where-is-internal
internal-make-var-non-special
The function where-is-internal has a detailed docstring and is mentioned in the manual as well. Is where-is-internal an exception?
Is there a difference between having -internal as a suffix and having internal- as a prefix?
Adding to the confusion, there are also function names with internal-- (with a double dash) as a prefix.
The confusion is not just in the naming convention (variability due to history and perhaps sometimes whim). The confusion is in the very notion of "internal" in free software, where the source code is open to everyone to use or modify (even fork) as they please.
To answer your question from (what I think is) the point of view of Emacs Dev, and thus in terms of the underlying intention: "internal" means that someone using such a function is perhaps more likely to encounter future changes in the Emacs-Dev implementation and use of that function than might be the case for a non-"internal" function. IOW, you might not want to count on it remaining as it is now. That's all.
But there's a lot of "perhaps", "more likely", and "might" in there. In practice, some non-"internal" functions change more radically or more quickly than some "internal" functions. It might be the case that for the former there will be a deprecation grace period, during which the pre-change situation is tolerated, i.e., still works. That might not be the case for something "internal". But again, in practice there is some gray between the black of "internal" and the white of non-"internal".
Someone from Emacs Dev (e.g. @Stefan) will perhaps put this differently or correct my interpretation.
My own take: there have sometimes (often) been functions and variables that the author did not expect users to make use of directly, and thus naturally thought of as "internal", which users have nevertheless put to good use, or even "had" to use (modulo rewriting lots of code). Some such have had their "internal" status removed (no, I don't have examples memorized). Or sometimes a new, non-"internal" function has been added to make the behavior available - e.g., a wrapper or function-valued argument has been added (again, I have no offhand examples to give).
IOW, for Emacs Dev too it is not always clear what should be considered "internal". Just take the label as a flag that you might not want to count too much on that function or variable.
Wrt the various notations: my impression is that the -- convention has recently come to be used more (though there is also some old code that uses it); using internal is an older convention, for the most part.
The "internal" and the "--" conventions are similar. Basically "internal" is used when there's no prefix after which to put a double dash (which is usually the case for functions implemented in C).
And yes, as Drew explains, the intention behind the notion of something being "internal" is just to recommend people not use it directly. IOW if they need the corresponding functionality, they should report a bug requesting to promote its status to "non-internal".

Would it be worth it to use Inline::C to speed up math?

I have been working on a Perl program to process large amounts of DNA. It outputs exactly what I need, but it takes much longer than I would like. Using NYTProf, I have narrowed the major problem area down to the loop that adds my values together. Would using Inline::C to do the math make my program faster, or should I accept the speed and move on? Is there another way to improve the speed? Here is my program and an input it would run, as well as an executable with the default values entered already.
It's unlikely you'll get useful help here (this included). I can see various problems with your code, and none have to do with the choice of language.
Use CPAN. If you're parsing GenBank, then use an appropriate module.
You're writing assembly in Perl, and neither Perl nor you are very good at that. It's nearly impossible to know what's going on when you don't pass parameters to subroutines, instead relying on globals all over the place. What do @X1, @X2, @Y1, @Y2 mean?
The following might be your problem: until ($ender - $starter > $tlength) { (line 153). According to your test case, these start out as 103, 1, and 200, and it's not clear when or if they change. Depending on what's in @te, it might or might not ever get out of the loop; I just can't tell from your code.
It would help if we knew, exactly, what the parameters to add are, what the in-out invariants are, and what it returns.
That's all I got.
I second the recommendation of PDL made in a comment, if it's applicable. Or the use of a CPAN module tailored to your problem (again, if applicable).
I didn't see anything that looked unambiguously like "the loop that adds my values together" in that code; please, show just the code you are considering optimizing, ideally with just enough structure around it to actually run it.
So to answer your generic question generically, yes, Inline::C can be a useful tool for optimization if you are certain your performance problem is limited to what it actually can do for you. In using it, be aware that invoking your C code from Perl or vice versa is non-trivially expensive, so you have to have enough code translated to C to minimize the transitions.

Keeping CL and Scheme straight in your head

Depending on my mood I seem to waffle back and forth between wanting a Lisp-1 and a Lisp-2. Unfortunately, beyond the obvious namespace differences, this leaves all kinds of amusing function name/etc problems you run into. Case in point: trying to write some code tonight, I tried to do (map #'function listvar), which, of course, doesn't work in CL at all. It took me a bit to remember that I wanted mapcar, not map. Of course it doesn't help when slime/emacs shows that map IS defined as something, though obviously not the same function at all.
So, pointers on how to minimize this short of picking one or the other and sticking with it?
Map is more general than mapcar; for example, you could do the following rather than using mapcar:
(map 'list #'function listvar)
How do I keep Scheme and CL separate in my head? I guess when you know both languages well enough, you just know what works in one and not the other. Despite the syntactic similarities, they are quite different languages in terms of style.
Well, I think that as soon as you get enough experience in both languages this becomes a non-issue (just as with similar natural languages, like Italian and Spanish). If you usually program in one language and switch to the other only occasionally, then unfortunately you are doomed to write Common Lisp in Scheme or vice versa ;)
One thing that helps is to have a distinct visual environment for both languages, using syntax highlighting in some other colors etc. Then at least you will always know whether you are in Common Lisp or Scheme mode.
I'm definitely aware that there are syntactic differences, though I'm certainly not fluent enough yet to use them automatically, which makes my code look much more similar currently ;-).
And I had a feeling your answer would be the case, but can always hope for a shortcut <_<.
The easiest way to keep both languages straight is to do your thinking and code writing in Common Lisp. Common Lisp code can be converted into Scheme code with relative ease; however, going from Scheme to Common Lisp can cause a few headaches. I remember once when I was using a letrec in Scheme to store both variables and functions, and I had to split it up into the separate CL constructs for the variable and function namespaces respectively.
In all practicality though I don't make a habit of writing CL code, which makes the times that I do have to all the more painful.