I'm trying to do some quick-and-dirty querying of my MongoDB database using IPython Notebook.
I have several cells, each with its own query. Since MongoDB can serve several connections at once, I would like to run each query in parallel. I thought an ideal way would be just to do something like:
%%script --bg python
query = collection.find(...)  # stand-in for a real pymongo query
You can imagine several cells, each with its own query. However, I'm not able to access the results returned by the find call.
I understand that this is a subprocess run in the background, but I have no idea how to access the data, since the process is quickly destroyed and its namespace goes away.
I found a similar post for %%bash here, but I'm having trouble translating that approach to a Python namespace.
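A rough translation of that recipe might look like this (untested; the database and collection names are made up), using %%script's --out option to capture the subprocess's stdout pipe:

%%script --bg --out mongo_pipe python
import json
import pymongo

client = pymongo.MongoClient()                        # assumes a local mongod
docs = list(client.mydb.mycoll.find({}, {"_id": 0}))  # hypothetical names
print(json.dumps(docs))                               # captured in mongo_pipe

The hope is that a later cell could then call json.loads(mongo_pipe.read()) to pull the documents back into the notebook's namespace once the subprocess finishes.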
%%script is just a convenience magic; it will not replace writing a full-blown magic.
The only thing I can see is to write your own magic. Basically, if you can do it with a function that takes a string as a parameter, you know how to write a magic.
So how would you (like to) write it in pure Python? (Futures, multiprocessing, a queuing library?) ... then move it to a magic.
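For instance, a minimal pure-Python sketch with concurrent.futures (the database, collection, and query specs below are placeholders; PyMongo's client is thread-safe, so a single client can be shared across workers):

from concurrent.futures import ThreadPoolExecutor

import pymongo

client = pymongo.MongoClient()  # one client, shared across threads

def run_query(spec):
    # Each worker checks a socket out of the driver's connection pool.
    return list(client.mydb.mycoll.find(spec))

queries = {"q1": {"status": "open"}, "q2": {"status": "closed"}}
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    futures = {name: pool.submit(run_query, spec) for name, spec in queries.items()}
    results = {name: f.result() for name, f in futures.items()}

Each notebook cell would then just submit its query and hold on to the future, calling .result() only when the data is actually needed; everything stays in the notebook's own namespace.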
Related
I recently started to work with MongoDB, and I came across the mapReduce method. I understood the theory behind it, but I'm having problems with the practice. I'll try to explain: I'm using Studio 3T as an IDE, and I saw the 'add/edit stored functions' option when right-clicking DBs. I created map and reduce functions with this option, but I don't know how to call them.
This is how I define the map and reduce functions:
And this is how I call them, receiving a ReferenceError:
EDIT 1: I saw this thread, but it doesn't do what I'd like to do; it defines the functions in the MongoDB shell, whereas I'd like to be able to define them in Studio 3T and call them whenever I want.
Instead of using IntelliShell (the smarter mongo shell equivalent in Studio 3T) for map-reduce functions, it would be simpler to use the dedicated Map-Reduce feature (full documentation here), which will spare you the task of defining, storing, and calling separate functions.
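If you do end up invoking map-reduce from driver code instead, a sketch with PyMongo gives the general shape (this assumes a pre-4.0 PyMongo, where Collection.map_reduce still exists; the collection and field names are invented):

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().test
mapper = Code("function () { emit(this.category, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Runs server-side and writes the results to the 'mr_out' collection.
out = db.orders.map_reduce(mapper, reducer, "mr_out")
for doc in out.find():
    print(doc)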
I'm trying to run a SELECT statement on PostgreSQL database and save its result into a file.
The code runs on my environment but fails once I run it on a lightweight server.
I monitored it and saw that the reason it fails after several seconds is due to a lack of memory (the machine has only 512MB RAM). I didn't expect this to be a problem, as all I want to do is to save the whole result set as a JSON file on disk.
I was planning to use fetchrow_array or fetchrow_arrayref functions hoping to fetch and process only one row at a time.
Unfortunately, I discovered that with DBD::Pg there's no difference in the true fetch behaviour between the two above and fetchall_arrayref. My script fails at the $sth->execute() call, even before it has a chance to call any fetch... function.
This suggests to me that the implementation of execute in DBD::Pg actually fetches ALL the rows into memory, leaving only the formatting of the returned data to the fetch... functions.
A quick look at the DBI documentation gives a hint:
If the driver supports a local row cache for SELECT statements, then this attribute holds the number of un-fetched rows in the cache. If the driver doesn't, then it returns undef. Note that some drivers pre-fetch rows on execute, whereas others wait till the first fetch.
So in theory I would just need to set the RowCacheSize parameter. I've tried, but this feature doesn't seem to be implemented by DBD::Pg:
Not used by DBD::Pg
I find this limitation a huge general problem (the execute() call pre-fetches all rows?), and I'm more inclined to believe that I'm missing something here than that this is actually a true limitation of interacting with PostgreSQL databases from Perl.
Update (2014-03-09): My script works now, thanks to the workaround described in my comment on Borodin's answer. The maintainer of the DBD::Pg library got back to me about the issue, saying the root cause lies deeper, within libpq, the internal PostgreSQL library that DBD::Pg uses. Also, I think a very similar issue to the one described here affects pgAdmin: despite being a native PostgreSQL tool, it still gives no option to define a default limit on result-set size. This is probably why its Query tool sometimes waits a good while before presenting results from bulky queries, potentially breaking the app in some cases too.
In the section Cursors, the documentation for the database driver says this
Therefore the "execute" method fetches all data at once into data structures located in the front-end application. This fact must to be considered when selecting large amounts of data!
So your supposition is correct. However the same section goes on to describe how you can use cursors in your Perl application to read the data in chunks. I believe this would fix your problem.
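To show the shape of the chunked-cursor approach, here is the same idea sketched in Python with psycopg2 (purely illustrative, since your code is Perl; the DBD::Pg documentation's Cursors section shows the equivalent DECLARE/FETCH pattern, and the connection string and table name below are made up):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
# A named cursor is a server-side cursor: rows travel to the client in
# batches instead of being materialized in memory on execute().
cur = conn.cursor(name="big_select")
cur.itersize = 1000                     # rows fetched per round trip
cur.execute("SELECT * FROM big_table")
for row in cur:
    pass                                # process one row at a time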
Another alternative is to use OFFSET and LIMIT clauses on your SELECT statement to emulate cursor functionality. If you write
my $select = $dbh->prepare('SELECT * FROM table OFFSET ? LIMIT 1');
then you can say something like (all of this is untested)
my $i = 0;
while (1) {
    $select->execute($i++);
    my @data = $select->fetchrow_array;
    last unless @data;    # no row returned means we are past the end
    # Process @data
}
to read your tables one row at a time.
You may find that you need to increase the chunk size (a larger LIMIT, stepping the offset by the same amount) to get an acceptable level of efficiency.
Short story: a report running against a Progress database (OpenEdge Release 10.1C03) takes hours to complete. I suspect that it does not take advantage of existing data indexes. I would like to understand how it scans the data, so that I can try to add an index that will make it run faster.
Source code of the report is not available. The code is native Progress 4GL, not SQL.
If it were an SQL database, I would try to dump the SQL queries and go from there. With 4GL I did not find any such functionality. Is it possible to somehow peek at what gets executed at the low level?
What else can be done if there is no source code?
Thanks!
There are several things you can do:
If I recall correctly, 10.1C should have the _usertablestat and _userindexstat virtual system tables available. These allow you to observe, at runtime, which tables and indexes are being accessed by a particular session. You can either write your own 4GL program to query them, or you can use the screens in PROMON: R&D, 3 "Other Displays", 5 "I/O Operations by User by Table" and 6 "I/O Operations by User by Index". That will show you which tables and indexes are actually in use and how much use they are getting. If the observed data seems wrong, it will probably give you a clue. (If the VSTs are missing, it might be because the db was upgraded from an older version -- add them with proutil dbname -C updatevsts.)
You could also use the session startup parameters -clientlog "filename" and -logentrytypes QryInfo to obtain more detailed information about the queries being executed.
Keep in mind that Progress is not SQL. Unlike most SQL databases, the 4GL uses a static, compile-time optimizer: index selection happens when the code is compiled. So unless you can recompile (and you seem not to have source, so that seems unlikely), you won't be able to improve things by adding missing indexes. You might, however, at least be able to show the person who does have the source where the problem is.
Another tool that can help is the profiler. This will identify where in the code the time is being spent. That can also be good information to provide to the original vendor if they need help finding the problem. For more information on the profiler: http://dbappraise.com/ppt/profiler.pptx
I'm looking at using some JavaScript in a MongoDb query. I have a couple of choices:
db.system.js.save the function in the db then execute it
db.myCollection.find with a $where clause and send the JS each time
exec_js in MongoEngine (which I imagine uses one of the above)
I plan to use the JavaScript in a regularly used query that's executed as part of a request to a site or API (i.e. not a batch administrative job), so it's important that the query executes with reasonable speed.
I'm looking at a 30ish line function.
Is the JavaScript interpreted fresh each time? Will the performance be OK? Is it a sensible basis upon which to build queries?
Is the JavaScript interpreted fresh each time?
Pretty much. MongoDB has only one "JavaScript instance" per running instance of MongoDB. You'll notice this if you try to run two different Map/Reduces at the same time.
Will the performance be OK?
Obviously, there are different definitions of "OK" here. The $where clause cannot use indexes, though you can combine it with another, indexed query. Either way, each object has to be pushed from BSON over to the JavaScript runtime and then acted on inside that runtime.
The process is definitely not what you would call "performant". Of course, by that measure Map/Reduce is not very performant either, and people use that on production systems.
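To limit the damage, the usual trick is to let an indexed clause shrink the candidate set before the JavaScript ever runs. Roughly, with PyMongo (the collection and field names here are invented):

from pymongo import MongoClient

coll = MongoClient().test.events
coll.create_index("status")

# The indexed equality match narrows the scan; the $where JavaScript
# then runs only against the documents that survive it.
cursor = coll.find({
    "status": "active",
    "$where": "this.finished - this.started > 3600",
})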
Is it a sensible basis upon which to build queries?
The real barrier here isn't the number of lines in the code, it's the number of documents this code will have to interpret. Even though it's "server-side" JavaScript, it's still a bunch of work that the server has to do (in one thread, in an interpreted environment).
If you can test it and scope it correctly, it may well work out. Just don't expect miracles.
What is your point here? Write a JS script and call it regularly through cron. What should be the problem with that?
I'm working on a .NET program that executes arbitrary scripts against a database.
When a colleague started writing the database access code, he simply exposed one command object to the rest of the application, which is re-used (setting CommandText/Type, calling ExecuteNonQuery(), etc.) for each statement.
I imagine this is a big performance hit for repeated, identical statements, because they are parsed anew each time.
What I'm wondering about, though, is: will this also degrade execution speed if each statement is different from the previous one (not only different parameters, but an entirely different statement)? I couldn't easily find an answer on that in the documentation.
Btw, the RDBMS used is Oracle, but I guess this question is not really database specific.
P.S. I know exposing the same Command object is not thread safe, but that's not an issue here.
There is some overhead involved in creating new command objects, and so in certain circumstances it can make sense to re-use the same command. But as the general case enforced for an entire application it seems more than a little odd.
The performance hit usually comes from establishing a connection to the database, but ADO.NET creates a connection pool to help here.
If you wish to avoid parsing statements each time anew, you can put them into stored procedures.
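Another standard way to avoid re-parsing is to keep the SQL text identical and pass values as bind variables, so Oracle can reuse the cached parse (a soft parse). A sketch of the idea in Python with the python-oracledb driver, since the exact .NET code isn't shown (the credentials, DSN, and the classic EMP table are assumptions; ADO.NET's OracleParameter binds work the same way):

import oracledb

conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/xe")
cur = conn.cursor()

# Identical statement text plus a bind variable lets Oracle reuse the
# parsed statement instead of hard-parsing on every call.
for emp_id in (7369, 7499, 7521):
    cur.execute("SELECT ename FROM emp WHERE empno = :id", id=emp_id)
    print(cur.fetchone())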
I imagine your colleague just uses some old-style approach that he inherited from working on other platforms, where reusing a command object did make a difference.