Can long-running cells in iPython/Jupyter report runtime status info without cluttering the cell output? - ipython

I'm writing some functions in a Jupyter notebook to prepare data, but they can take quite a while to run.
In [*]: print "hello"
df['expensive_col'] = df.apply(myfunc, axis=1)
It would be useful if myfunc would print status information such that I see how far into the execution it is. The downside of this is that it would seriously clutter the Jupyter output cell.
In [*]: print hello
df['expensive_col'] = df.apply(myfunc, axis=1)
hello
ix0
ix1
ix2
...
ix999
...
noisy output
...
Is there a way to print status updates from the long-running Jupyter cell in such a way to keep from cluttering the output cell?
For example, I could wipe everything printed by calling IPython.display.clear_output() but I might want to do a more targeted wipe, keeping stuff from before my expensive (hello, in the above example).

One simple thing you could do is to invoke the print function with the end parameter set to carriage return instead of newline, like this:
for i in range(100):
print('progress %i' % i, end='\r')
# things happen here
It effectively writes over the previous line, without cluttering the output field.
For a more advanced approach (and fancy looking progress bars!) check out the tqdm project.

Related

How do I use re.sub to replace a specific iteration?

I'm making a madlibs python program for a project, and I'm having trouble replacing the specific words in my input.
I want to change the individual iterations, not all of them.
Go easy on me, I'm a newb
# Run a loop to replace the words
print("Replacing Words...")
for loopCount in range(replacedNumber):
currentReplace = replaceList[loopCount]
print(re.finditer(r'<.*?>',storyContent))
print("Replacing", currentReplace)
filteredReplaced = re.sub(r'<.*?>',currentReplace, storyContent, loopCount)
Note: The print() lines where just for debugging
Here is a link to all of the code
Console Output

Benchmarking in BaseX: how to set up

Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings, but more importantly it is also possible to look for similar syntactical dependency structures as the given input, without the need of being proficient in any programming language or corpus annotation style. It may be clear that this tool is intended for linguists.
At the start of the project - before I was engaged in the project - the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well. We are talking about +- 500 millions words. Attempts have been made that in theory ought to improve speed by reducing the search space (see this paper) but this has not been tested yet. The results of this attempt is a new file structure. Let's call this structure B, compared to a non-processed structure A. We expect B to provide faster results when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments. A directory that stores different files. These files each individually store XPaths. Those XPaths are the ones that I have selected to benchmark with. A second argument is the limit of results to return. When set to zero, no limit is set.
Because some parts of the dataset are so incredibly huge, we have divided them in different, equally sized files as well, called treebankparts. They are stored in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use warnings;
$| = 1; # flush every print
# Directory where XPaths are stored
my $directory = shift(#ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(#ARGV);
# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);
# List all files in directory
#xpathfiles = <$directory/*.txt>;
# Read lines of treebank parts into variable
open( my $tfh, "treebankparts.lst" ) or die "cannot open file treebankparts.lst";
chomp( my #tlines = <$tfh> );
close $tfh;
# Loop through all XPaths in $directory
foreach my $xpathfile (#xpathfiles) {
open( my $xfh, $xpathfile ) or die "cannot open file $xpathfile";
chomp( my #xlines = <$xfh> );
close $xfh;
print STDOUT "File = $xpathfile\n";
# Loop through lines from XPath file (= XPath query)
foreach my $xline (#xlines) {
# Loop through the lines of treebank file
foreach my $tline (#tlines) {
my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
QuerySonar( $xline, $treebank );
}
}
}
$session->close();
sub QuerySonar {
my ( $xpath, $db ) = #_;
print STDOUT "Querying $db for $xpath\n";
print STDOUT "Limit = $limit\n";
my $x_limit;
my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
. $xpath . ';';
my $x_open = '<results>';
my $x_totalcount = '<total>{count($results)}</total>';
my $x_loopinit = '{for $node at $limitresults in $results';
# Spaces are important!
if ( $limit > 0 ) {
$x_limit = ' where $limitresults <= ' . $limit . ' ';
}
# Comment needed to prevent `Incomplete FLWOR expression`
else { $x_limit = '(: No limit set :)'; }
my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/#id)
let $sentence := ($node/ancestor::alpino_ds/sentence)
let $begin := ($node//#begin)
let $idlist := ($node//#id)
let $beginlist := (distinct-values($begin))';
# Separate sentence info by tab
my $x_loopexit = 'return <match>{data($sentid)}
{string-join($idlist, "-")}
{string-join($beginlist, "-")}
{data($sentence)}</match>}';
my $x_close = '</results>';
# Concatenate all XQuery parts
my $x_concatquery =
$x_resultsofxp
. $x_open
. $x_totalcount
. $x_loopinit
. $x_limit
. $x_sentenceinfo
. $x_loopexit
. $x_close;
my $querysent = $session->query($x_concatquery);
my $basexoutput = $querysent->execute();
print $basexoutput. "\n\n";
$querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebankparts and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX, and returning the total hits and the results (possibly limited by an argument in the Perl script). So I got that going, but now the question is: how can I improve this script so I can get some benchmarking results out of it.
First of all, I'd start with adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, then call a profiler on both, and run both scripts a number of times and get a mean value and compare? Or would I run each query by both databases in the same script, almost at the same time?
It is important to consider caching that is happening. Therefore I am not entirely sure what build-up for benchmarking of a database this huge is appropriate. First one script, then the other. Both at the same time. Alternating queries between the two. And so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files, and then repeat the whole process again?
(Reading the description of the benchmark-tag I am confident that this - albeit elaborate - post is suited for SO.)
One possible improvement: minimize the number of times you transfer control from Perl to the database -- just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.
A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS each requesting a single search. The first involves two context switches: one from your script or bash to the dbms, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms; it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising [and sobering] to learn than the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the dbms.)
A second suggestion: From what you say, the size of the database and the result sets seem likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done produce results comparable to for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done? If they are comparable, then you have reason to believe it doesn't matter if you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when running one query several times before moving on to the next query, and cache misses to produce higher times if running each query just once. Probably worth measuring.
In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.
A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a Web interface) to the server, translation of the request into a form suitable for transmission to the back end dbms (here BaseX), context switch to BaseX, processing within BaseX, context switch back, handling by web server, transmission to user. It would be useful to have at least rough estimates of the times involved in each of these steps, or at least of the time taken for everything-but-BaseX.
So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of
2 + 3
or just
42
to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of response is the per-query overhead. (Interesting question: should one use many different trivial expressions to prevent caching of results, or should one use the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the Web server won't? ...)
A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you do. This means that you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes it more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)
There are several things we have to separate here: The first issue is that BaseX performance should not be confused with your perl script as your perl script seems to simply construct an XQuery (and not XPath as you suggested in your question and tags). So for testing I would suggest to use some already pre-fined XQueries suitable to your real-world scenarios, as your XQuery construction should be negligible. How you pass your query to BaseX, so via the Perl API or via any other means should not be relevant. Even if your perl performance is relevant, you should test the performance separately.
Hence, your original question whether you should put test both scenarios in the same script or not is not relevant anymore. Instead you simply execute the two separate XQueries for the scenario A and B by themself without the perl script.
You are partly correct to worry about caching, however it is the Java JIT compiler which most likely will be relevant here (as BaseX is written in java, JIT and use caching, not BaseX itself. You should therefore use the Client/Server infrastructure and have a long-running server and warm it up before running performance measurements.
Regarding performance: The BaseX GUI and also the command line already have an included measurement (using command line you can set -V to get run times for parsing, compiling, evaluating and printing). Also, using the -r parameter you can execute a query multiple times and it will give you the average execution times.
In general, if you want to improve the performance of your script you should take a look at the query plan and the optimized query and check whether the appropriate indexes are used. Also, our new Selective Indexing might be very useful to you. If the index isn't used, your query will definitely not perform well for 500 million words.
Full Disclosure: I am with the BaseX team and you might get better help at the BaseX mailing list or might want to reference this questions as our head architect isn't watching SO as regularly as the ML.

Using Expect with Perl and pipe to a file

I'm fairly new to Perl and have been searching the interwebs for documentation for what I'm trying to do. I'm not having any luck.
I have a program that outputs information to stdout with prompts throughout. I need to make a Perl script to pipe that information to a file.
I thought I could use Expect but there seems to be a problem with the pipe after the first prompt.
Here is the part of my code:
# Run program and compare the output to the BASE file
$cmd = "./program arg1 arg2 arg3 arg4 > $outfile";
my $exp = new Expect;
$exp->spawn($cmd);
BAIL_OUT("couldn't create expect object") if (! defined $exp);
$exp->expect(2);
$exp->send("\n");
For this case there is only a single prompt for the user to press "enter". This program is small and very fast - 2 seconds is plenty of time to reach the first prompt.
The output file only contains the first half of the information.
Does anyone have any suggestions on how I can grab the second half as well?
UPDATE:
I've verified that this works with Expect by using a simple script:
spawn ./program arg1 arg2 arg3 arg4
expect "<Return>"
send "\r"
interact
Where "< Return >" is a verbose expression that the Perl script could look for.
Note: I've tried writing my Perl script to expect "< Return >"...it makes no difference.
i.e.
$exp->expect(2, '-re', "<Return>")
Any thoughts?
UPDATE2:
Hazaah! I've found a solution to my problem...completely by accident.
So, I had a mistype in some test code I made...
$exp->expect(2);
$exp->send("\r");
$exp->expect(2);
Note the trailing expect(2)...I accidentally left that in and it worked!
So, I'm trying to understand what is happening. Unix expect does not seem work this way! It appears Expect implemented in Perl "expects" anything...not just prompts?
So, I provided expect another 2 seconds to collect stdout and I am able to get everything.
If anyone can offer some more detailed information as to what is going on here I'd love to understand what is going on.
Try sending \r instead of \n - you're trying to emulate a carriage return, not a newline, and the tty settings might not be translating them.
ALSO:
A suggestion from the Expect docs FAQ section, which seems likely given your accidental solution:
My script fails from time to time without any obvious reason. It
seems that I am sometimes loosing output from the spawned program.
You could be exiting too fast without giving the spawned program
enough time to finish. Try adding $exp->soft_close() to terminate the
program gracefully or do an expect() for 'eof'.
Alternatively, try adding a 'sleep 1' after you spawn() the program.
It could be that pty creation on your system is just slow (but this is
rather improbable if you are using the latest IO-Tty).
Standard unix/tcl expect doesn't exit in interactive mode, which could give your program enough time to finish running.
It's been a while since I've used Expect, but I'm pretty sure you need to provide something for Expect to match the prompt against:
$exp->expect( 2, 'Press enter' );
for example.

How to verify normal termination of R scripts executed from Perl?

I have written a shebang R script and would like to execute it from a Perl script. I currently use system ($my_r_script_path, $r_script_arg1, $r_script_arg2, ...) and my question is how can I verify the R script terminates normally (no errors or warnings).
guess I should make my R script return some true value at the end, only if everything is OK, then catch this value in Perl, but I'm not sure how to do that.
Thanks!
You can set the return value in the command quit(), eg q(status=1). Default is 0, see also ?quit. How to catch that one in Perl, is like catching any other returning value in Perl. It is saved in a special variable $? if I remember right. See also the examples in the perldoc for system, it should be illustrated there.
On a sidenote, I'd just use the R-Perl interface. You can find info and examples here :
http://www.omegahat.org/RSPerl/
Just for completeness :
At the beginning of your script, you can put something like :
options(
warn=2, # This will change all warnings into errors,
# so warnings will also be handled like errors
error= quote({
sink(file="error.txt"); # save the error message in a file
dump.frames();
print(attr(last.dump,"error.message"));
sink();
q("no",status=1,FALSE) # standard way for R to end after errors
})
)
This will save the error message, and break out of the R session without saving, with exit code 1 and without running the .Last.
Still, the R-Perl interface offers a lot more possibilities that are worth checking out if you're going to do this more often.

Limiting the amount of information printed by Perl debugger

One of my pet peeves with debugging Perl code (in command line debbugger, perl -d) is the fact that mistakenly printing (via x command) the contents of a huge datastructure is guaranteed to freeze up your terminal for forever and a half while 100s of pages of data are printed. Epecially if that happens across slowish network.
As such, I'd like to be able to limit the amount of data that x prints.
I see two approaches - I'd be willing to try either if someone knows how to do.
Limit the amount of data any single command in debugger prints.
Better yet, somehow replace the built-in x command with a custom Perl method (which would calculate the "size" of the data structure, and refuse to print its contents without confirmation).
I'm specifically asking "how to replace x with custom code" - building a Good Enough "is the data structure too big" Perl method is something I can likely do on my own without too much effort although I see enough pitfalls preventing the "perfect" one from being a fairly frustrating endeavour. Heck, merely doing Data::Dumper->Dump and taking the length of the string might do the trick :)
Please note that I'm perfectly well aware of how to manually avoid the issue by recursively examining layers of datastructure (e.g. print the ref, print the count of keys/array elements, etc...)... the whole point is I want to be able to avoid thoughtlessly typing x $huge_pile_of_data without thinking - or stumbling on a bug populating said huge pile of data into what should be a scalar.
The x command takes an optional argument for the maximum depth to display. That's not quite the same as limiting the amount of data to N pages, but it's definitely useful to prevent overload.
DB<1> %h = (a => { b => { c => 1 } } )
DB<2> x %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
'c' => 1
DB<3> x 2 %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
You can specify the default depth to print via the o command, e.g.
DB<1>o dumpDepth=1
Add that to your .perldb file to apply it to all debugger sessions.
Otherwise, it looks like the x command invokes DB::dumpit() which is just a wrapper for dumpval.pl (or, more specifically, the main::dumpValue() sub it defines). You could modify/replace that script as you see fit. I'm not sure how you'd make it interactive, though.
The | command in the debugger pipes another command's output to your pager, e.g.
DB<1> |x %huge_datastructure