Better naming convention for Jupyter notebook - jupyter

When naming a Jupyter notebook, if spaces are used, e.g.
This is my notebook.ipynb
then the title renders very nicely when the notebook is opened in the web browser. However, spaces are evil on the command line. But if I use underscores or hyphens instead:
This_is_my_notebook.ipynb
or
This-is-my-notebook.ipynb
then the rendered title does not look as good. Any suggestions for an alternative convention that is command-line friendly but still looks somewhat nice?

This is really going to depend on your personal preferences, and use cases. Here is the approach I use:
My Approach
[#]_[2-4 word description]_[DS-initials]_[ISO 8601 date].ipynb
e.g.
- jupyter-notebooks
+ 1_exploratory_analysis_ag_2019-02-16.ipynb
+ 1_exploratory_analysis_jw_2019-02-19.ipynb
+ 1_exploratory_analysis.ipynb
+ 1_exploratory_analysis.py
+ 1_exploratory_analysis.html
+ 2_topic_modeling_ag_2019-02-20.ipynb
+ 2_topic_modeling.ipynb
+ 2_topic_modeling.py
+ 2_topic_modeling.html
+ dev_2019-02-15.ipynb
+ dev_2019-02-18.ipynb
+ dev.ipynb
+ dev.py
+ dev.html
Here I've got all the notebooks in a folder called jupyter-notebooks. Files and script outputs would generally live outside of this folder.
If applicable, notebooks are numbered in a logical order e.g. date first created. In many cases it will not make sense to number the files (e.g. this repo of mine).
Notebooks are versioned with timestamps and (optionally) initials of the most recent author. This can be done automatically with a post-save hook (see below).
Non-timestamped versions of the notebooks have .py and .html versions. By committing the .py files to git, you can get some of the benefits of version control. These files can also be generated with a post-save hook.
Miscellaneous development work can be put in dev... files.
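To make the pattern above concrete, here is a minimal sketch of a helper that assembles a file name in the [#]_[description]_[initials]_[date].ipynb order. The function name notebook_name and its parameters are made up for illustration; they are not part of Jupyter or of my setup.

```python
import datetime
import re


def notebook_name(description, initials=None, date=None, number=None):
    """Build '[#]_[description]_[initials]_[ISO 8601 date].ipynb'.

    Hypothetical helper: every part except the description is optional
    and is simply left out when it is not given.
    """
    # Lowercase the description and join its words with underscores.
    slug = "_".join(re.findall(r"[a-z0-9]+", description.lower()))
    parts = []
    if number is not None:
        parts.append(str(number))
    parts.append(slug)
    if initials:
        parts.append(initials.lower())
    if date is not None:
        parts.append(date.isoformat())  # ISO 8601, e.g. 2019-02-16
    return "_".join(parts) + ".ipynb"


print(notebook_name("Exploratory analysis", "ag", datetime.date(2019, 2, 16), number=1))
# 1_exploratory_analysis_ag_2019-02-16.ipynb
print(notebook_name("Topic modeling", number=2))
# 2_topic_modeling.ipynb
```

Because the ISO 8601 date sorts lexicographically, the timestamped versions of one notebook list in chronological order.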
I have post-save hooks set up to automate this. In my configuration file (~/.jupyter/jupyter_notebook_config.py) I've got:
import os
from subprocess import check_call
import glob
import datetime
import re

def timestamped_file(fname):
    # True for names that already end in a date, e.g. dev_2019-02-15.ipynb
    return bool(re.match(r'.*\d{4}-\d{2}-\d{2}\.ipynb', fname))

def post_save(model, os_path, contents_manager):
    if model['type'] != 'notebook':
        return # only do this for notebooks
    # Post-save hook for converting notebooks to .py scripts
    d, fname = os.path.split(os_path)
    if not timestamped_file(fname):
        check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)
        check_call(['jupyter', 'nbconvert', '--to', 'html', fname], cwd=d)
    # Post-save hook for saving datestamped versions of the notebook
    notebooks = glob.glob('*.ipynb') + glob.glob('jupyter-notebooks/*.ipynb')
    if notebooks:
        latest_mod_file = max(notebooks, key=os.path.getctime)
        # Don't do this for datestamped notebooks
        if not timestamped_file(latest_mod_file):
            date = datetime.datetime.now().strftime('%Y-%m-%d')
            fname_timestamped = '{}_{}.ipynb'.format(os.path.splitext(latest_mod_file)[0], date)
            check_call(['cp', latest_mod_file, fname_timestamped])

c.FileContentsManager.post_save_hook = post_save
If saving a non-timestamped file, this will automatically convert to .py and .html, and make a timestamped version.
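If the cp call is a problem (e.g. on Windows), the timestamped copy can be made with the standard library instead. This is only a sketch of a variant, not the hook I actually run: save_timestamped_copy and is_timestamped are hypothetical names, and unlike the hook above it works on the path handed to it rather than globbing the current directory.

```python
import datetime
import os
import re
import shutil


def is_timestamped(fname):
    """True if the name already ends in a YYYY-MM-DD date before .ipynb."""
    return bool(re.search(r"\d{4}-\d{2}-\d{2}\.ipynb$", fname))


def save_timestamped_copy(os_path, today=None):
    """Copy name.ipynb to name_YYYY-MM-DD.ipynb in the same folder.

    Returns the new path, or None if the file is not a notebook or is
    already a timestamped version.
    """
    if not os_path.endswith(".ipynb") or is_timestamped(os.path.basename(os_path)):
        return None
    today = today or datetime.date.today()
    target = "{}_{}.ipynb".format(os.path.splitext(os_path)[0], today.isoformat())
    shutil.copy2(os_path, target)  # stdlib copy, no dependency on the cp command
    return target
```

Inside post_save this could be called as save_timestamped_copy(os_path), since the hook already receives the path of the notebook that was just saved.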
Another Approach
My convention was heavily inspired by this blog post. They describe conventions that I began using a couple years ago before adopting my own variation.
Here's the example they include:
- develop
+ [ISO 8601 date]-[DS-initials]-[2-4 word description].ipynb
+ 2015-06-28-jw-initial-data-clean.html
+ 2015-06-28-jw-initial-data-clean.ipynb
+ 2015-06-28-jw-initial-data-clean.py
+ 2015-07-02-jw-coal-productivity-factors.html
+ 2015-07-02-jw-coal-productivity-factors.ipynb
+ 2015-07-02-jw-coal-productivity-factors.py
- deliver
+ Coal-mine-productivity.ipynb
+ Coal-mine-productivity.html
+ Coal-mine-productivity.py
As you can see, there are separate folders (and naming conventions) for the development and delivery notebooks.

I found myself searching for this answer as well...
I looked through some popular GitHub repositories containing .ipynb files, on the assumption that well-known projects follow some reasonable standard. I found no consistent preference between dashes and underscores; however, I didn't see any notebooks using spaces, so don't do that.
Sources:
https://github.com/IRkernel/IRkernel/tree/master/example-notebooks
https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python/tree/master/Supporting_Notebooks

Related

Can I perform a find/replace driven by a CSV/Excel file?

I have to perform a find/replace across my project's files using a rename rule-set which I have in CSV format.
My rename CSV is simple and in the format from value,to value:
foo,bar
car,dog
...
zip,zip
All from and to values are exact (so no need to do weird regex).
Is there any way (even w/ an extension) to feed this CSV into VS Code and have it perform the find and replace against all files in my project?
I can of course reformat this CSV to other formats (JSON, excel, etc.) fairly easily if that helps.
You could write a simple python script to do the replacing for you.
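A minimal sketch of such a script, assuming the CSV has exactly two columns (from,to) and that all project files are UTF-8 text. The function names and the list of skipped folders are made up for illustration. All rules are applied in one pass per file, so the output of one rule is never picked up by another rule.

```python
import csv
import io
import os
import re


def load_rules(csv_text):
    """Parse 'from,to' rows into a dict, skipping blank rows and no-op rules."""
    rules = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and row[0] and row[0] != row[1]:
            rules[row[0]] = row[1]
    return rules


def apply_rules(text, rules):
    """Replace every exact 'from' value in a single pass over the text."""
    if not rules:
        return text
    # Longest keys first, so 'foobar' wins over 'foo' when both are rules.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)


def replace_in_tree(root, rules, skip_dirs=(".git", ".history")):
    """Rewrite text files under root in place; return the paths that changed."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune folders we never want to touch (assumed names).
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", newline="") as fh:
                    original = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable file: leave it alone
            updated = apply_rules(original, rules)
            if updated != original:
                with open(path, "w", encoding="utf-8", newline="") as fh:
                    fh.write(updated)
                changed.append(path)
    return changed
```

Typical use would be rules = load_rules(open("rename.csv", encoding="utf-8").read()) followed by replace_in_tree("path/to/project", rules), where both paths are placeholders. Run it on a clean working tree so the result can be reviewed with a diff.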
I ended up using the Batch Replacer extension for VS Code.
https://marketplace.visualstudio.com/items?itemName=angelomollame.batch-replacer
Originally I had tried this extension but it wasn't working. I had an aha moment as to why (I have about 500 replace rules). I also use a local history VS Code extension, which creates a (massive) local history in a .history folder in the workspace. Batch Replacer was choking on the tens of thousands of files in that folder (since technically it's in my workspace).
Once I excluded that folder, it worked, though it did take ~1 min to process all my files, and during that time there is no indication that it's running.

Can I configure Jupyter Notebook to split source files and generated files?

I really like Jupyter Notebooks.
However, working with them is cumbersome in conjunction with a source control system like git, because an ipynb-File contains the source code (what you actually write in the notebook) and the generated output text / HTML / images / metadata / ...
For example, merge conflicts are difficult to resolve now, because everything is stored in one huge file with lots of generated data.
I wonder if I can configure Jupyter to store notebooks as
A source file: For example, I imagine this to be a Markdown file where everything surrounded by three backticks (```) is interpreted as a code cell. Diffs of that file would be meaningful and merge conflicts would be simple to resolve manually.
A generated file: This contains everything else. If there is a merge conflict within this file, it can be resolved by regenerating it.
Is this possible?
For reference: There is a slightly more general version of this question which lists various efforts at adapting IPython and Jupyter to this effect, and this answer proposes to solve the problem via Git. There is a Github project with a Git filter based on that answer, and (in its edit at the end) the answer links a few similar tools like nbstripout.
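To illustrate the Git-side approach those tools take, here is a minimal sketch that removes the generated parts of a notebook and keeps only the source. It assumes the nbformat 4 JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"); the function names are made up, and this is not how nbstripout itself is implemented.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with outputs and counters removed."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # generated text, HTML, images
            cell["execution_count"] = None   # changes on every run
    return stripped


def strip_file(src_path, dst_path):
    """Read an .ipynb file and write the stripped version to dst_path."""
    with open(src_path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    with open(dst_path, "w", encoding="utf-8") as fh:
        json.dump(strip_outputs(notebook), fh, indent=1, sort_keys=True)
        fh.write("\n")
```

Committing only the stripped file gives diffs that contain source changes alone; the outputs can be regenerated by re-running the notebook.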

Latex rendering in README.md on Github

Is there any way to render LaTeX in README.md in a GitHub repository? I've googled it and searched on Stack Overflow, but none of the related answers seems feasible.
For short expressions and not-so-fancy math you could use inline HTML to have your LaTeX math rendered by CodeCogs and then embed the resulting image. Here is an example:
- <img src="https://latex.codecogs.com/gif.latex?O_t=\text { Onset event at time bin } t " />
- <img src="https://latex.codecogs.com/gif.latex?s=\text { sensor reading } " />
- <img src="https://latex.codecogs.com/gif.latex?P(s | O_t )=\text { Probability of a sensor reading value when sleep onset is observed at a time bin } t " />
This should render each equation as an image.
Update: Unfortunately, this works great in Eclipse but not on GitHub. The only workaround is the following:
Take your LaTeX equation and go to http://www.codecogs.com/latex/eqneditor.php. At the bottom of the area where your equation is displayed there is a tiny dropdown menu; pick URL Encoded and then paste the result into your GitHub Markdown like this:
![equation](http://latex.codecogs.com/gif.latex?O_t%3D%5Ctext%20%7B%20Onset%20event%20at%20time%20bin%20%7D%20t)
![equation](http://latex.codecogs.com/gif.latex?s%3D%5Ctext%20%7B%20sensor%20reading%20%7D)
![equation](http://latex.codecogs.com/gif.latex?P%28s%20%7C%20O_t%20%29%3D%5Ctext%20%7B%20Probability%20of%20a%20sensor%20reading%20value%20when%20sleep%20onset%20is%20observed%20at%20a%20time%20bin%20%7D%20t)
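If you have many equations, the URL encoding can be scripted instead of going through the editor each time. A small sketch using only the standard library; the function names are made up, and the URL pattern is simply the one used in the links above.

```python
from urllib.parse import quote


def codecogs_url(latex, image_format="gif"):
    """Build a CodeCogs image URL with the LaTeX source percent-encoded."""
    # safe="" makes quote() encode every reserved character, including "/".
    return "https://latex.codecogs.com/{}.latex?{}".format(
        image_format, quote(latex, safe="")
    )


def codecogs_markdown(latex, alt="equation", image_format="gif"):
    """Wrap the URL in Markdown image syntax for pasting into a README."""
    return "![{}]({})".format(alt, codecogs_url(latex, image_format))


print(codecogs_markdown(r"s=\text { sensor reading }"))
# ![equation](https://latex.codecogs.com/gif.latex?s%3D%5Ctext%20%7B%20sensor%20reading%20%7D)
```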
I upload repositories with equations to GitLab because it has native support for LaTeX in .md files:
```math
SE = \frac{\sigma}{\sqrt{n}}
```
The syntax for inline LaTeX is $`\sqrt{2}`$.
GitLab renders equations with JavaScript in the browser instead of showing images, which improves the quality of the equations.
More info here.
Let's hope GitHub will implement this as well in the future.
My trick is to use the Jupyter Notebook.
GitHub has built-in support for rendering .ipynb files. You can write inline and display LaTeX code in the notebook and GitHub will render it for you.
Here's a sample notebook file: https://gist.github.com/cyhsutw/d5983d166fb70ff651f027b2aa56ee4e
Readme2Tex
I've been working on a script that automates most of the cruft out of getting LaTeX typeset nicely into GitHub-flavored Markdown: https://github.com/leegao/readme2tex
There are a few challenges with rendering LaTeX for GitHub. First, GitHub-flavored Markdown strips most tags and most attributes. This means no JavaScript-based libraries (like MathJax) and no CSS styling.
The natural solution then seems to be to embed images of precompiled equations. However, you'll soon realize that LaTeX does more than just turn dollar-sign-enclosed formulas into images.
Simply embedding images from online compilers gives your document a really unnatural look. In fact, I would argue that it's even more readable in your everyday x^2 mathematical slang than with jumpy inline images.
I believe that making sure that your documents are typeset in a natural and readable way is important. This is why I wrote a script that, beyond compiling formulas into images, also ensures that the resulting image is properly fitted and aligned to the rest of the text.
For example, here is an excerpt from a .md file regarding some enumerative properties of regular expressions typeset using readme2tex:
As you might expect, the set of equations at the top is specified by just starting the corresponding align* environment
**Theorem**: The translation $[\![e]\!]$ given by
\begin{align*}
...
\end{align*}
...
Notice that while inline equations ($...$) run with the text, display equations (those that are delimited by \begin{ENV}...\end{ENV} or $$...$$) are centered. This makes it easy for people who are already accustomed to LaTeX to keep being productive.
If this sounds like something that could help, make sure to check it out. https://github.com/leegao/readme2tex
Since May 2022, this has been officially supported:
Inline:
Where $x = 0$, evaluate $x + 1$
Blocks:
Where
$$x = 0$$
Evaluate
$$x + 1$$
One can also use this online editor: https://www.codecogs.com/latex/eqneditor.php, which generates SVG files on the fly. You can put a link in your document like this:
![](https://latex.codecogs.com/svg.latex?y%3Dx%5E2)
which renders the equation as an inline image.
I tested some of the solutions proposed by others and would like to recommend TeXify, created and proposed in a comment by agurodriguez and further described by Tom Hale. I would like to expand on his answer and give some reasons why this is a very good solution:
TeXify is a wrapper around Readme2Tex (mentioned in Lee's answer). To use Readme2Tex you must install a lot of software on your local machine (Python, LaTeX, ...), but TeXify is a GitHub plugin, so you don't need to install anything locally. You only need to install the plugin online in your GitHub account by pressing one button and choosing the repositories for which TeXify will have read/write access to parse your TeX formulas and generate pictures.
When you create or update a *.tex.md file in your repository, TeXify detects the change and generates a *.md file in which the LaTeX formulas are replaced by pictures saved in the tex directory of your repo. So if you create a README.tex.md file, TeXify will generate README.md with pictures instead of TeX formulas. Parsing the formulas and generating the documentation is therefore done automagically on each commit and push :)
Because all your formulas are turned into pictures in the tex directory and README.md links to those pictures, you can even uninstall TeXify and all your old documentation will still work :). The tex directory and the *.tex.md files stay in the repository, so you keep access to your original LaTeX formulas and pictures (you can also safely store other, "hand-made" documentation pictures in the tex directory; TeXify will not touch them).
You can use LaTeX equation syntax directly in the README.tex.md file (without losing .md Markdown syntax), which is very handy. Julii proposed in his answer to use special links (with formulas) to an external service, e.g. http://latex.codecogs.com/gif.latex?s%3D%5Ctext%20%7B%20sensor%20reading%20%7D, which is good but has some drawbacks: the formulas in the links are not easy to read and update, and if there is a problem with that third-party service your old documentation will stop working. With TeXify your old documentation will always work, even if you uninstall the plugin (because all the pictures generated from your LaTeX formulas stay in the repo in the tex directory).
Yuchao Jiang proposed in his answer to use a Jupyter Notebook, which is also nice but has some drawbacks: you cannot use formulas directly in the README.md file; you need to link from there to another *.ipynb file in your repo that contains the LaTeX (MathJax) formulas. The *.ipynb file format is JSON, which is not handy to maintain (e.g. Gist doesn't show a detailed error with a line number when you forget to put a comma in the proper place in an *.ipynb file...).
Here is a link to one of my repos where I use TeXify; its documentation was generated from a README.tex.md file.
Update
Today (2020-12-13) I realised that the TeXify plugin has stopped working, even after reinstallation :(
For automatic conversion upon push to GitHub, take a look at the TeXify app:
GitHub App that looks in your pushes for files with extension *.tex.md and renders its TeX expressions as SVG images
How it works (from the source repository):
Whenever you push, TeXify will run and search for *.tex.md files in your last commit. For each one of those it'll run readme2tex, which will take LaTeX expressions enclosed between dollar signs, convert them to plain SVG images, and then save the output into a .md extension file (that means that a file named README.tex.md will be processed and the output will be saved as README.md). After that, the output file and the new SVG images are committed and pushed back to your repo.
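The naming rule described above (README.tex.md in, README.md out) can be sketched as follows. This only illustrates the file-name mapping; it is not TeXify's code, and both function names are hypothetical.

```python
from pathlib import PurePosixPath


def texify_output_path(path):
    """Map 'docs/README.tex.md' to 'docs/README.md'; None for other files."""
    p = PurePosixPath(path)
    if not p.name.endswith(".tex.md"):
        return None
    return str(p.with_name(p.name[: -len(".tex.md")] + ".md"))


def plan_conversions(changed_files):
    """Pick the *.tex.md sources out of a commit's file list."""
    plan = {}
    for src in changed_files:
        dst = texify_output_path(src)
        if dst is not None:
            plan[src] = dst
    return plan


print(plan_conversions(["README.tex.md", "src/main.py", "docs/notes.tex.md"]))
# {'README.tex.md': 'README.md', 'docs/notes.tex.md': 'docs/notes.md'}
```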
I just published a new version of xhub, a browser extension that renders LaTeX (and other things) in GitHub pages.
Cons:
You have to install the extension once.
Pros:
No need to set up anything.
Just write Markdown with math
Display math:
```math
e^{i\pi} + 1 = 0
```
and inline math $`a^2 + b^2 = c^2`$.
(Syntax like on GitLab.)
Works on light and dark background. (Math has text-color)
You can copy-and-paste the math just like text
As an example, check out this GitHub README:
You can get a continuous integration service (e.g. Travis CI) to render LaTeX and commit the results to GitHub. CI will deploy a "cloud" worker after each new commit. The worker compiles your document into a PDF and either uses ImageMagick to convert it to an image or uses Pandoc to attempt a LaTeX->HTML conversion, where success may vary depending on your document. The worker then commits the image or HTML to your repository, from where it can be shown in your README.
A sample Travis CI config that builds a PDF, converts it to a PNG and commits it to a static location in your repo is pasted below. You would need to add a line that fetches the tool that converts the PDF to an image (the config calls ImageMagick's convert).
sudo: required
dist: trusty
os: linux
language: generic
services: docker
env:
  global:
  - GIT_NAME: Travis CI
  - GIT_EMAIL: builds@travis-ci.org
  - TRAVIS_REPO_SLUG: your-github-username/your-repo
  - GIT_BRANCH: master
  # I recommend storing your GitHub Access token as a secret key in a Travis CI environment variable, for example $GH_TOKEN.
  - secure: ${GH_TOKEN}
script:
- wget https://raw.githubusercontent.com/blang/latex-docker/master/latexdockercmd.sh
- chmod +x latexdockercmd.sh
- "./latexdockercmd.sh latexmk -cd -f -interaction=batchmode -pdf yourdocument.tex -outdir=$TRAVIS_BUILD_DIR/"
- cd $TRAVIS_BUILD_DIR
- convert -density 300 -quality 90 yourdocument.pdf yourdocument.png
- git checkout --orphan $TRAVIS_BRANCH-pdf
- git rm -rf .
- git add -f yourdoc*.png
- git -c user.name='travis' -c user.email='travis' commit -m "updated PDF"
# note we are again using GitHub access key stored in the CI environment variable
- git push -q -f https://your-github-username:$GH_TOKEN@github.com/$TRAVIS_REPO_SLUG $TRAVIS_BRANCH-pdf
notifications:
  email: false
This Travis CI configuration launches an Ubuntu worker, downloads a LaTeX Docker image, compiles your document to PDF and commits it to a branch called branchname-pdf.
For more examples see this github repo and its accompanying sx discussion, PanDoc example,
https://dfm.io/posts/travis-latex/, and this post on Medium.
I have been looking around and found that this answer to another question works best for me, i.e. use the githubcontent math renderer. For example, to display an equation:
Use this link
Beware that the LaTeX needs to be URL-encoded, but otherwise it works quite well for me.
If you are having issues with https://www.codecogs.com/latex/eqneditor.php, I found that https://alexanderrodin.com/github-latex-markdown/ worked for me. It generates the Markdown code you need, so you just cut and paste it into your README.md document.
You may also take a look at my tool latexMarkdown2Markdown, which converts LaTeX to SVG and generates a table of contents with chapter numbering.
Good news!
According to this blog post, GitHub now supports MathJax in README files.
You can use in-line LaTeX inspired syntax using $ delimiters, or in-blocks using $$ delimiters.
Writing inline expressions:
This sentence uses $ delimiters to show math inline:
$\sqrt{3x-1}+(1+x)^2$
Writing expressions as blocks:
The Cauchy-Schwarz Inequality
$$\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2
\right) \left( \sum_{k=1}^n b_k^2 \right)$$
Source: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/writing-mathematical-expressions
You can use markdowns, e.g.
![equ](https://latex.codecogs.com/gif.latex?log(y)=\beta_0&space;&plus;&space;\beta_1&space;x&space;&plus;&space;u)
Code can be typed here: https://www.codecogs.com/latex/eqneditor.php.
Edit: As germanium pointed out, this does not work in README.md, though it does work on other git pages; no explanation is available.
My quick solution is this
step 1. Add latex to your .md file
$$x=\sqrt{2}$$
Note: math eqns must be in $$...$$ or \\(... \\).
step 2. Add the following to your scripts.html or theme file (append this code at the end)
<script type="text/javascript" async
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
Done! Load the page to see your equation.

Which source control uses a "s." prefix on its filenames?

I found what appears to be an old source repository for some source code that I need to resurrect. But I have no idea what source control tools were used to generate and manage this source repository. In the directory, all of the files have a "s." prefixed to the file name. Without knowing the format in these files, I cannot manually extract the source code with any degree of accuracy. And even if I did, manually extracting the source code would be very time consuming and error prone.
What source/version control system prefixes its source files with "s." when it stores the source file in its repository directory?
How can I effectively extract the latest source code from this repository directory?
The s. prefix is characteristic of SCCS, the Source Code Control System. The code for that is probably still proprietary, but GNU has the CSSC project which can manipulate SCCS files. It tracks changes per-file in revisions, known as 'deltas'.
SCCS is the official revision control system for POSIX; you can find the commands documented on the Open Group site (but the file format is not specified there, AFAICT):
admin
delta
get
prs
rmdel
sact
unget
val
what
The file format is not specified by POSIX. The manual page for get says:
The SCCS files shall be files of an unspecified format.
The original SCCS command set included some extras not recorded by POSIX:
cdc — change delta commentary (for changing the checkin comments for a delta)
comb — combine, effectively for merging deltas
help — no prefix; there wasn't any other help program at the time. Commands generate error codes such as cm3 and help interpreted them.
sccsdiff — difference between two deltas of a file
Most systems now have a single command, sccs, which takes the operation name and then options. Often, the files were placed into an ./SCCS/ subdirectory and extracted from that as required, and the sccs front-end would handle name expansion, adding s. or SCCS/s. to the start of the file names.
For extracting the latest version of the source code, use get.
get s.*
sccs get s.*
These will get the default version of each file, and the default default is the latest version of the file.
If you need to make changes, use:
get -e s.filename.c
...make changes...
delta -y'Why you made the changes' s.filename.c
get s.filename.c
Note that the files 'lose' the s. prefix for the working file names, rather like RCS (Revision Control System) files lose the ,v suffix for the working file names. If you've not come across that, accept that it was different when SCCS and RCS were created, back in the late 70s or early 80s.
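To see which working files a bulk get would produce before running it, the prefix rule can be sketched in a few lines. The helper names are made up; the only thing taken from SCCS is the rule described above that the working file is the history file's name without the s. prefix.

```python
import os


def working_name(sccs_path):
    """Map 's.filename.c' or 'SCCS/s.filename.c' to 'filename.c'."""
    base = os.path.basename(sccs_path)
    if not base.startswith("s.") or len(base) <= 2:
        raise ValueError("not an SCCS history file: {!r}".format(sccs_path))
    return base[2:]


def sccs_files(directory):
    """List the s.-prefixed history files in a directory, sorted by name."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.startswith("s.")
        and len(name) > 2
        and os.path.isfile(os.path.join(directory, name))
    )


print(working_name("SCCS/s.filename.c"))
# filename.c
```

Comparing [working_name(f) for f in sccs_files(".")] with the files already present shows whether a get would collide with anything in the directory.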
SCCS uses an s. prefix. But it might not be the only one!
I never knew this knowledge would come in useful some day!

Uncompress OpenOffice files for better storage in version control

I've heard discussion about how OpenOffice (ODF) files are compressed zip files of XML and other data. So making a tiny change to the file can potentially totally change the data, so delta compression doesn't work well in version control systems.
I've done basic testing on an OpenOffice file, unzipping it and then rezipping it with zero compression. I used the Linux zip utility for my testing. OpenOffice will still happily open it.
So I'm wondering if it's worth developing a small utility to run on ODF files each time just before I commit to version control. Any thoughts on this idea? Possible better alternatives?
Secondly, what would be a good and robust way to implement this little utility? Bash shell that calls zip (probably Linux only)? Python? Any gotchas you can think of? Obviously I don't want to accidentally mangle a file, and there are several ways that could happen.
Possible gotchas I can think of:
Insufficient disk space
Some other permissions issue that prevents writing the file or temporary files
ODF document is encrypted (probably should just leave these alone; the encryption probably also causes large file changes and thus prevents efficient delta compression)
First, the version control system you want to use should support hooks that are invoked to transform a file from the version in the repository to the one in the working area, for example the clean / smudge filters in Git configured through gitattributes.
Second, instead of writing such a filter yourself you can find an existing one, for example rezip from the "Management of opendocument (openoffice.org) files in git" thread on the git mailing list (but see the warning in "Followup: management of OO files - warning about "rezip" approach").
You can also browse answers in "Tracking OpenOffice files/other compressed files with Git" thread, or try to find the answer inside "[PATCH 2/2] Add keyword unexpansion support to convert.c" thread.
Hope That Helps
You may consider storing documents in FODT format, the flat XML format.
This is a relatively new alternative.
The document is simply stored unzipped.
More info is available at https://wiki.documentfoundation.org/Libreoffice_and_subversion.
I've modified the python program in Craig McQueen's answer just a bit. Changes include:
Actually checking the return of testZip (according to the docs, it appears that the original program will happily proceed with a corrupt zip file past the checkzip step).
Rewrite the for-loop to check for already-uncompressed files to be a single if-statement.
Here is the new program:
#!/usr/bin/python
# Note, written for Python 2.6
import sys
import shutil
import zipfile

# Get a single command-line argument containing filename
commandlineFileName = sys.argv[1]
backupFileName = commandlineFileName + ".bak"
inFileName = backupFileName
outFileName = commandlineFileName
checkFilename = commandlineFileName

# Check input file
# First, check it is valid (not corrupted)
checkZipFile = zipfile.ZipFile(checkFilename)
if checkZipFile.testzip() is not None:
    raise Exception("Zip file is corrupted")
# Second, check that it's not already uncompressed
if all(f.compress_type==zipfile.ZIP_STORED for f in checkZipFile.infolist()):
    raise Exception("File is already uncompressed")
checkZipFile.close()

# Copy to "backup" file and use that as the input
shutil.copy(commandlineFileName, backupFileName)
inputZipFile = zipfile.ZipFile(inFileName)
outputZipFile = zipfile.ZipFile(outFileName, "w", zipfile.ZIP_STORED)

# Copy each input file's data to output, making sure it's uncompressed
for fileObject in inputZipFile.infolist():
    fileData = inputZipFile.read(fileObject)
    outFileObject = fileObject
    outFileObject.compress_type = zipfile.ZIP_STORED
    outputZipFile.writestr(outFileObject, fileData)
outputZipFile.close()
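For anyone on Python 3, the same steps can be wrapped in a function with context managers so that all files are closed even when an exception is raised. This is a sketch of a variant rather than a drop-in replacement: store_uncompressed is a made-up name, and it returns False for an already-uncompressed file instead of raising, which may be more convenient when it runs as a pre-commit step.

```python
import shutil
import zipfile


def store_uncompressed(path, backup_suffix=".bak"):
    """Rewrite the zip at `path` with every member stored uncompressed.

    Returns True if the file was rewritten and False if it was already
    uncompressed (in which case nothing is changed).
    """
    with zipfile.ZipFile(path) as check:
        bad_member = check.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile("corrupt member: {}".format(bad_member))
        if all(info.compress_type == zipfile.ZIP_STORED for info in check.infolist()):
            return False

    # Keep the original as a backup and use it as the input.
    backup = path + backup_suffix
    shutil.copy(path, backup)
    with zipfile.ZipFile(backup) as src, \
            zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():  # iterating infolist keeps the member order
            data = src.read(info)
            info.compress_type = zipfile.ZIP_STORED
            dst.writestr(info, data)
    return True
```

Calling store_uncompressed("document.odt") (a placeholder name) leaves document.odt.bak next to the rewritten file, just like the script above.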
Here's another program I stumbled across: store_zippies_uncompressed by Mirko Friedenhagen.
The wiki also shows how to integrate it with Mercurial.
Here is a Python script that I've put together. It's had minimal testing so far. I've done basic testing in Python 2.6. But I prefer the idea of Python in general because it should abort with an exception if any error occurs, whereas a bash script may not.
This first checks that the input file is valid and not already uncompressed. Then it copies the input file to a "backup" file with ".bak" extension. Then it uncompresses the original file, overwriting it.
I'm sure there are things I've overlooked. Please feel free to give feedback.
#!/usr/bin/python
# Note, written for Python 2.6
import sys
import shutil
import zipfile

# Get a single command-line argument containing filename
commandlineFileName = sys.argv[1]
backupFileName = commandlineFileName + ".bak"
inFileName = backupFileName
outFileName = commandlineFileName
checkFilename = commandlineFileName

# Check input file
# First, check it is valid (not corrupted)
checkZipFile = zipfile.ZipFile(checkFilename)
checkZipFile.testzip()
# Second, check that it's not already uncompressed
isCompressed = False
for fileObject in checkZipFile.infolist():
    if fileObject.compress_type != zipfile.ZIP_STORED:
        isCompressed = True
if isCompressed == False:
    raise Exception("File is already uncompressed")
checkZipFile.close()

# Copy to "backup" file and use that as the input
shutil.copy(commandlineFileName, backupFileName)
inputZipFile = zipfile.ZipFile(inFileName)
outputZipFile = zipfile.ZipFile(outFileName, "w", zipfile.ZIP_STORED)

# Copy each input file's data to output, making sure it's uncompressed
for fileObject in inputZipFile.infolist():
    fileData = inputZipFile.read(fileObject)
    outFileObject = fileObject
    outFileObject.compress_type = zipfile.ZIP_STORED
    outputZipFile.writestr(outFileObject, fileData)
outputZipFile.close()
This is in a Mercurial repository in BitBucket.
If you don't need the storage savings, but just want to be able to diff OpenOffice.org files stored in your version control system, you can use the instructions on the oodiff page, which tells how to make oodiff the default diff for OpenDocument formats under git and mercurial. (It also mentions SVN, but it's been so long since I used SVN regularly I'm not sure if those are instructions or limitations.)
(I found this using Mirko Friedenhagen's page (cited by Craig McQueen above))