Banish unmatched top-level trees when performing tag search in Emacs Org Mode - emacs

For the past year and a half, I've maintained a monolithic buffer in Org Mode for my engineering notes with my current employer. Despite containing mostly pointers to other documents, this file has become quite large by human standards (48,290 lines of text), while remaining trivially searchable and editable through programmatic means (read: grep and Org Mode tag search).
One thing bothers me, though. When I perform a tag search using Org Mode 6.33x, Org's sparse tree view retains the folded representation of unmatched trees within the buffer (that is, content preceded by a single asterisk, *). This is generally useful for smaller buffers or those better organized into a single tree with multiple branches. However, this doesn't work especially well for documentation where each new tree is generated chronologically, one for each day, as I've been doing.
.
Before I continue, I'll note that my workaround is inherent in what I've just asked, as are the obvious alterations in my documentation habits with this buffer. However, the following questions remain:
1) Why does Org Mode organize trees in this manner when performing sparse tag searching? The technical details are self-evident, the UX decisions less so.
2) If I wished to correct this issue with a script written in Emacs Lisp, what hooks and commands should I explore in more detail to restructure the document view? Writing overrides for the standard commands (for example, org-match-sparse-tree) is already self-evident.
.
Thank you in advance.

As you already noticed the problem only affects the top level headings. The good thing is that in org-mode you can demote easily all headings with simple keystrokes. This way you can avoid the problem. Also cleaning up afterwards are just some simple keystrokes.
Step-by-step instructions:
Mark the full buffer
Call M-right (for outline-demote)
Input * root\n at the beginning of the file
Now, build up your subtree and do what you want with it.
When done you can remove * root\n at the beginning of the file and promote the headings again with M-left
I have got the impression that you can even leave the overall-heading where it is for your application.

Related

How to maintain notes organized with org-mode?

I have a bunch of papers to read and taking notes. The problem is I don't have much time to spend looking for a way to organize my note taking system. For me emacs org-mode seems to be a quite powerfull solution, and pretty straightforward.
I encounter another problem, how can I keep my notes organized with a single file, in a way that I can rapidly access all the notes?
Since you're short time, you might do well to start with a simple system in which you can capture the notes you need to take, and worry later about organization. Consider the following:
* Title of a paper
** First section name
- A note
- Another note
** Second section name
- Yet another note
- A fourth note
- A fifth note
* Title of another paper
** First section name
- Yet more notes
** Second section name
- &c., &c.
Using paper titles as top-level section headings makes it easy to navigate among papers with isearch; C-s Title of a paper RET brings you to the section containing all your notes on that paper. From there, you can search for a section title, or just use TAB on headings to fold and unfold until you're looking at what you want.
Unless I've misunderstood your requirement, that should give you a pretty quick and straightforward way to dive in and start taking your notes, without losing navigability. That'll also give you an opportunity for some initial, shallow exploration of the problem domain; then, once you've gotten past the current glut of work and have time to think about how you want your note-taking system to work, you can explore the problem more deeply, using org-mode's quick outline rearrangement tools at need to turn the scheme you've got into the scheme you need.

Compare two sides of one bilingual text in emacs

How can you work with the two sides of a bilingual/parallel text?
I know how to run diff and ediff on a text to spot little differences, but a bilingual text will have two completely different sides (expect for the paragraphs, number of chapters and other structural elements like notes). End of line and end of paragraph are certainly useful to mark the units of the two sides.
Is it possible to open two buffers, side by side, and tell what matches what?
This is a hard problem but I dug up an old blog post I read awhile ago that is relevant (and even mentions emacs for preprocessing):
https://languagefixation.wordpress.com/2011/02/09/how-to-create-parallel-texts-for-language-learning-part-1/
Especially check out part2
Beyond that, my suggestion is twofold:
1) Operate on small parts at a time (a chapter or less) and not an entire book
2) Utilize alignment tools available to generate metadata which emacs uses to just 'prettify' the buffer
As there's no existing solution (that I know of or can find), you'll have to get dirty with elisp and create a major or minor mode to colorize matching segments and/or navigate segments.
Quick Hack
However, I hacked some elisp together that takes preprocessed text and uses emacs concept of 'paragraph' and 'sentence' to colorize the buffer; it's a little verbose so I stuck it in a gist:
https://gist.github.com/terranpro/3175bb9f3ed00b3a145c
It's pretty ugly but should give you a start; just run it once in each of the text buffers. But be aware that you'll need to have the text already ready in terms of emacs' paragraphs and sentences (two spaces after a period!!). Hope it gives you a decent starting point.

Tool to compare/diff HTML in bulk

I have a lot of HTML files (10,000's and GBs worth) scraped from a server and I want to check to make sure the server produces the same results after some modifications but ignore kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of number, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh and it needs to run under linux)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program (HTML is special case) files, builds abstract syntax trees representing the essential structure of each files, and compares programs for similarity.
Because it is comparing essential program structure, it ignores inessential differences such as comments and whitespace, and deterimines that two code segments are either identical or one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if a file a is a variant of file b (e.g., with a few changes) the CloneDr will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
Doesn't run under Linux, but I assume your problem is hard to enough to solve so that isn't what you are optimizing.
I use winmerge alot in windows and from what i can see some people enjoy meld in linux, so perhaps that could do the trick for you
http://meld.sourceforge.net/
Other examples i saw from a quick googling was Kompare,xxdiff.sourceforge.net, and kdiff3.sourceforge.net
(could only post 1 link so wrote the adresses to xxdiff and kdiff3 as text)
Beyond Compare is purchased software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI based but handles thousands of files very well. It will allow you to specify unimportant changes with regular expressions as well as whitespace (beginning, middle and end of line). The feature set is very extensive, check out a trial download.
I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!

How to maintain an emacs-based knowledge base?

I've been using org-mode for a little while, I've kept it really simple for now, with only two files :
one to act as an inbox, with remember-mode
another where I stick just about anything that's been processed from the inbox
This is great for managing somewhat 'actionable' items, but I keep adding things of a more general nature, that I won't be needing on a day-to-day basis (how-tos, reading notes etc), so it's getting slow and hard to manage.
The material I'm concerning myself with doesn't fit the /projects/tasks/sub-tasks paradigms, they are more like little knowledge nuggets on selected topics, which are inherently more complex to classify and manage.
I've been wondering what kind of structure could be used to handle that kind of information (classification and retrieval), and if there are maybe other modes that could help with the job ?
I guess there is no pre-made answer to this question, since everyone may have different needs.
Noufal gave good conceptual tips that I'll keep in mind, but overall, the accepted answer provided more pragmatic views on this, the linked resource was a GREAT read.
I think this excellent document on how to use org-mode to it's fullest potential will be very helpful to you: "Org Mode: Organize Your Life In Plain Text". It is lengthy reading, but trust me, completely worth the effort.
UPDATE:
You can use the remember-mode section mentioned in the document for your use-case. (I use it for the same use-case) Remember-mode is extremely handy to make quick notes. I use it when I have to store random observations or information that won't go anywhere else. I use the following templates for remember:
(setq org-default-notes-file (concat org-directory "/remember-notes.org"))
(setq org-remember-templates
`(("Todo" ?t "* TODO %?\n %i\n" ,(concat org-directory "/remember-notes.org") bottom)
("Misc" ?m "* %?\n %i\n" ,(concat org-directory "/Notes.org") "Misc")
("iNfo" ?n "* %?\n %i\n" ,(concat org-directory "/Notes.org") "Information")
("Idea" ?i "* %?\n %i\n" ,(concat org-directory "/Notes.org") "Ideas")
("Journal" ?j "* %T %?\n\n %i\n" ,(concat org-directory "/journal.org") bottom)
("Blog" ?b "* %T %? :BLOG:\n\n %i\n" ,(concat org-directory "/journal.org") bottom)
))
As you can see, misc notes and other information goes in the notes.org file under the headings Misc and Information. If the note I'm making doesn't fall in any of the categories defined above, it gets filed in the default file (remember-notes.org) and I can always refile it to another location at a convenient time. This makes my note-taking, jotting down random ideas, and such things extremely simple, without taking away the focus from the job I'm currently doing.
[org-mode] is great for managing somewhat 'actionable' items, but I keep adding things of a more general nature, that I won't be needing on a day-to-day basis (how-tos, reading notes etc), so it's getting slow and hard to manage.
I'm a follower of David Allen and his Getting Things Done methodology. I'm using Emacs for three of the lists he recommends:
Next Actions
Project Resources
The Someday/Maybe list
The material I'm concerning myself with doesn't fit the /projects/tasks/sub-tasks paradigms, they are more like little knowledge nuggets on selected topics, which are inherently more complex to classify and manage.
I've been wondering what kind of structure could be used to handle that kind of information (classification and retrieval), and if there are maybe other modes that could help with the job ?
For this kind of information I've migrated away from emacs. Instead I keep a directory ~/etc/howto, and in that directory I put files that contain "little knowledge nuggets on selected topics", where the key criterion is that the information has long-term value.
I could search this directory with Emacs, but my Emacs Lisp is not so hot, so I wrote a howto shell script instead (some error checking omitted for clarity):
case $# in
1) ;;
*) echo "Usage: $0 <topic>" 1>&2; exit 2 ;;
esac
topic="$1"
# Note the ordering: first exact matches, then beginning matches, then any matches
set xxx `find $HOME/etc/howto/. -name "$topic" -not -type d -print` \
`find $HOME/etc/howto/. -name "${topic}?*" -not -type d -not -name '*~' -print` \
`find $HOME/etc/howto/. -name "?*$topic*" -not -type d -not -name '*~' -print`
shift
case $# in
0) echo "No file found matching *$topic*" 1>&2 ; exit 1 ;;
*) for i
do
less "$i"
done
;;
esac
Examples include:
howto football brings up three nuggets, in this order:
Instructions to give to my wife on how to record a football game on the computer
Instructions for me as exactly what to take and how to dress when I have tickets to a football game
Instructions for transcoding a football game so it can be transmitted over the net and viewed away from home
howto filesystem brings up instructions on how to copy a filesystem
howto batteries brings up a list of recommended rechargeable batteries
One reason I don't use Emacs is that my real script is a little more complicated than what you see above: it also handles PDF and djvu files, so for example howto razor brings up a djvu document of the manual which came with my electric razor.
I have over 500 items in the main directory or in subdirectories, and even at this scale the system works quite well for me. I hope you may find it helpful too.
I personally keep a list of project directories with somewhat similar structure. Each has a tasklist.org, a tracking subdirectory (where I do project estimates and time tracking and always maintain a diary which is the main thing for the project - it will have links to other files for the project), a docs subdirectory which usually consist of stuff I'm going to publish (docs for the project, proposals etc.). I get my agenda-files to the tasklist.org in each of the subdirectories so that my agenda works fine.
I think the organisation of the data would change a little in your case (perhaps topics like "functional programming" etc.). I'm sceptical of how much a hierarchical structure would help since that would confine you into one way of looking at things (tags vs. folders again). Here are some things that come to mind.
Keep a 'master' org file that has links to all the 'interesting' top level pieces of the other content (similar to the diary I mentioned above).
Tag all your material properly (you'll settle on a set of useful tags after a while) and then use the Tag search feature to search through the files quickly. This assumes that all the files are in your agenda-files though.
Finally, if your data is too exotic to put into a structure, you can consider using a full text indexer (like xapian) and integrate that into your Emacs. There was some discussion of this over here.
Update : 26/Nov/2019
I recently came across the hyperbole package which claims to be a information organiser. I haven't used it but I'm quite tempted to and will update this when I do.
I have tried several ways to manage a knowledge base in the past. I have a bunch of "knowledge nuggets" (btw, thanks, I like that term a lot) on all sorts of diverse topics ranging from how to set up apache tomcat ssl cert, to checklists for doing a monthly family budget, to keeping a list of weights and reps completed on workouts.
I've tried keeping these in a wordpress blog, on a personal wiki, using pen and paper, etc.
In the end, emacs and org-mode is the clear winner for me. I love to have the ability to start simple, and build more complex functionality as I need it. I've used a lot of the tips described by Sacha Chua.
In my case, I always end up with a bunch of notes (more formal and organized) mixed in with action items (less formal). In general I maintain one master "action item" list and then create a separate file for notes on each topic. So far, grep has worked well for me to quickly find the file containing notes. I'll often create a emacs bookmark C-x r m to quickly navigate to notes files as well.
Simple Blogs, CMS, and wikis (like Drupal and Wordpress) are good at classification and retrieval. Maybe you could export org files to html and publish them to a blog, cms or wiki? It might not be too difficult to hook into the blog/wiki/cms tagging capability.
At work, we use a wiki (actually, several - a global wiki plus a wiki per project) for this. It's ideal for non-hierarchical data, but can also be used for hierarchical data. It's formattable, hypertextual, linkable-to, searchable, shareable but also ownable, it maintains history, and other good stuff.
Personally, i used to also use a wiki for this. But these days, i generally just forget things instead. Far easier.
You can speed up org-mode by keeping your notes in a separate file (or one for each broad topic) which is not included in the usual list of agenda files. Customise org-agenda-files to see the list.
Use org-remember to enter notes quickly without interrupting your flow. Either tag them at the time, or save them somewhere to be refiled later. You can use a tag in the remember template (customise org-remember-templates) to mark notes for refiling, and use a custom agenda search (org-agenda-custom-commands) to list them.
Tag each note with relevant topics, and use the agenda view's search facilities to find them. You can define a custom search which knows to look in the right files, or you can visit the file and restrict the agenda search to just that file.
I keep a file of notes, and your question has just inspired me to go back and start tagging them all. Works a treat!
I hold tips in .rst (reStructuredText) files in single directory. Each file have own topic.
For search I use M-x occur, or M-x lgrep, or M-x ack.
Web hosted example: http://tips.defun.work/frame.html and it is easy to turn that pages into blog like solution.
Original sources with build script: http://hg.defun.work/tips/
Main advantage of reStructuredText format:
TOC support.
include syntax.
Ability to build JavaScript based full text offline search index in HTML site with TOC, index, reference by Sphinx!!
Remember Markdown suck at extensibility, RST have regular syntax for marking data by your tokens and include any plain text format as inline into your document. Thinks about graphs with dot, prog lang syntax highlighting, etc.
The material I'm concerning myself with doesn't fit the /projects/tasks/sub-tasks paradigms, they are more like little knowledge nuggets on selected topics, which are inherently more complex to classify and manage.
Use bookmarks and Bookmark+. You can create bookmarks for sets of files and directories, in addition to individual files, and you can tag bookmarks or files, a la delicious, for organization and search purposes.

Is there a diff algorithm that preserves line ownership

My goal is coming up with a script to track the point a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional vcs 'blame' scripts. I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separate from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology, a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information, on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (paragraph starts with "My favorite mode"). Does anyone know what that is? (seems to be based on Tichy, see bottom.)
Here's more info on the two missing features:
no concept of a change on a line, (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b' but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated, if the position of edits is the same (just an expanded version of markup_instraline_changes but comparing edit distance on all equal-sized subsets of old and new lines.
no concept of "moving" code that preserves the ownership of the lines, e.g. this diff shouldn't alter the ownership of "line", although its position changes.
a
-line
c
+line
This could be dealt with in the same way but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check Levenshtein distance between all added against all removed lines) and with likely false positives (some, like whitespace-only lines, aren't relevant to my problem).
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84 which appears to be The string-to-string correction problem with block moves (which I haven't read yet) according to Walter Tichy's paper a year later on RCS
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
SD Smart Differencer tools compare source text by parsing the text according to the langauge it is in, discovering the code structures, and computing least-Levensthein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in Beta, and are presently available for COBOL, Java and C#. Lots of other langauges are in the pipe, because the SmartDifferencer is built on top of a langauge-parameterized infrastructure, DMS Software Reengineering Toolkit, which has quite a number of already existing, robust langauge grammars.
I think the idea of what amount of editing a line that can be done while it remains a descendent of some previously written line is very subjective, and based on context, both things that a computer cannot work with. You'd have to specify some sort of configurable minimum similarity on lines in your program I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example incrementing the value of some variable), and this will be be quite a common thing, so your desired algorithm won't really give truthful or useful information about a line quite often.
I would like to suggest an algorithm for this though (which makes tons of hopefully obvious assumptions by the way) so here goes:
Convert both texts to lists of lines
Copy the lists and Strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
Do a Levenshtein distance from the old to new lists ...
... keeping all intermediate data
Find all lines in the new text that were matched with old lines
Mark the line in both new/old original lists as having been matched
Delete the line from the new text (the copy)
Optional: If some matched lines are in a contiguous sequence ...
... in either original text assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts ...
... which are contiguous in the original text
Attribute each with the line match before and after
Run through all groups in old text
If any match before and after attributes with new text groups for each
//If they are inside the same area basically
Concatenate all the lines in both groups (separately and in order)
Include a character to represent where the line breaks are
Repeat
Do a Levenshtein distance on these concatenations
If there are any significantly similar subsequences found
//I can't really define this but basically a high proportion
//of matches throughout all lines involved on both sides
For each matched subsequence
Find suitable newline spots to delimit the subsequence
Mark these lines matched in the original text
//Warning splitting+merging of lines possible
//No 1-to-1 correspondence of lines here!
Delete the subsequence from the new text group concat
Delete also from the new text working list of lines
Until there are no significantly similar subsequences found
Optional: Regroup based on remaining unmatched lines and repeat last step
//Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespaced-removed lines in the old text
Concatenate the lines in new text also (should only be unmatched ones left)
//Newline character added in both cases
Repeat
Do Levenshtein distance on these concatenations
Match similar subsequences in the same way as earlier on
//Don't need to worry deleting from list of new lines any more though
//Similarity criteria should be a fair bit stricter here to avoid
// spurious matchings. Already matched lines in old text might have
// even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
//Anything left unmatched in the old text is deleted stuff
//Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm. Find obvious matches first, and verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar, but both modified and moved: probably just coincidentally similar.
Well, if you try implementing this, tell me how it works out, and what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it just abyssmally fails due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one