Best practice for MongoDB bulk inserts in Symfony2

In my Symfony2 command, I am running a script that inserts hundreds of thousands of URLs (as strings) into documents.
Here are the basic structures of the two documents I'm using. Before the command runs, there are already thousands of ParentDocuments in MongoDB, but zero ChildDocuments:
ParentDocument:
$id:id
$subDocument:OneToManyReference(ChildDocument)
$etc:everythingelse
ChildDocument:
$id:id
$url:string
$parentDocument:ManyToOneReference(ParentDocument)
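Roughly, in Doctrine ODM annotations the mapping looks like this (a sketch only; class and field names follow the outline above, and everything else is omitted):

/** @ODM\Document */
class ParentDocument
{
    /** @ODM\Id */
    protected $id;

    /** @ODM\ReferenceMany(targetDocument="ChildDocument", mappedBy="parentDocument") */
    protected $subDocument;
}

/** @ODM\Document */
class ChildDocument
{
    /** @ODM\Id */
    protected $id;

    /** @ODM\Field(type="string") */
    protected $url;

    /** @ODM\ReferenceOne(targetDocument="ParentDocument", inversedBy="subDocument") */
    protected $parentDocument;
}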
And my Command code:
$dm = $this->getContainer()->get('doctrine_mongodb.odm.document_manager');
$parentDocuments = $dm->getRepository('MyBundle:ParentDocument')->findAll();

while ($parentDocument = $parentDocuments->getNext()) {
    // Returns an array of hundreds of thousands of URLs
    $urls = $this->somehowFetchUrlsRelatedToTheParentDocument($parentDocument);

    foreach ($urls as $url) {
        $subDocument = new ChildDocument();
        $subDocument->setUrl($url);
        $subDocument->setParentDocument($parentDocument);
        $dm->persist($subDocument);
    }

    $dm->flush();
}
When I run this command, writes are incredibly fast at first. But when inserting millions of documents, write speed degrades significantly, dropping to as little as one write per second after the command has run for ten minutes, which makes the approach unusable.
My first attempt at fixing this was to clear the document manager right after each flush, using $dm->clear();
But this meant the document manager would lose track of the current ParentDocument. So my solution was this:
$dm = $this->getContainer()->get('doctrine_mongodb.odm.document_manager');
$parentDocumentCursors = $dm->getRepository('MyBundle:ParentDocument')->findAll();

// Hydrate every parent into a plain array so it survives $dm->clear()
$parentDocuments = array();
while ($parentDocument = $parentDocumentCursors->getNext()) {
    array_push($parentDocuments, $parentDocument);
}

$dm->clear();
unset($dm);
$dm = $this->getContainer()->get('doctrine_mongodb.odm.document_manager');

foreach ($parentDocuments as $parentDocument) {
    $urls = $this->somehowFetchUrlsRelatedToTheParentDocument($parentDocument);

    foreach ($urls as $url) {
        $subDocument = new ChildDocument();
        $subDocument->setUrl($url);
        $subDocument->setParentDocument($parentDocument);
        $dm->persist($subDocument);
    }

    $dm->flush();
    $dm->clear();
}
This solved the problem: write speeds stayed consistently fast for the entire run, and millions of documents were inserted without the gradual slowdown.
However, this feels like bad practice and a quick hack. What is the best practice for inserting millions of documents in Symfony2 with the document manager, without read/write speeds degrading?

I would avoid Doctrine's document manager here and use the batchInsert() function directly. It is described in the documentation at http://php.net/manual/en/mongocollection.batchinsert.php. It feels to me like Doctrine's ODM is actually hurting you in this case.
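A minimal sketch of that approach, reusing the variables from the question (the getDocumentCollection() lookup, the batch size of 1000, and storing the reference as a raw parent id are assumptions; the ODM would normally store a DBRef there):

// Grab the underlying collection object, bypassing the unit of work.
$collection = $dm->getDocumentCollection('MyBundle:ChildDocument');

$batch = array();
foreach ($urls as $url) {
    // Raw documents: plain arrays instead of managed ChildDocument objects.
    $batch[] = array(
        'url'            => $url,
        'parentDocument' => $parentDocument->getId(),
    );
    if (count($batch) >= 1000) {
        $collection->batchInsert($batch);
        $batch = array();
    }
}
if (count($batch) > 0) {
    $collection->batchInsert($batch); // insert the remainder
}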

To do a bulk insert in Doctrine, move the flush outside of your loop. In the scenario below, you persist inside the foreach and flush once the foreach completes. The only catch is that you will not be able to query any of the data being inserted in the batch until after the flush.
$dm = $this->getContainer()->get('doctrine_mongodb.odm.document_manager');

foreach ($parentDocuments as $parentDocument) {
    $urls = $this->somehowFetchUrlsRelatedToTheParentDocument($parentDocument);

    foreach ($urls as $url) {
        $subDocument = new ChildDocument();
        $subDocument->setUrl($url);
        $subDocument->setParentDocument($parentDocument);
        $dm->persist($subDocument);
    }
}

$dm->flush();
$dm->clear();
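If holding millions of pending documents in one unit of work uses too much memory, a middle ground is to flush and clear in fixed-size batches. A minimal sketch of the inner loop (the batch size is arbitrary, and note that clear() also detaches the parent documents, so they would need to be re-fetched afterwards; the same caveat the asker ran into):

$batchSize = 500; // illustrative value; tune for your workload
$i = 0;

foreach ($urls as $url) {
    $subDocument = new ChildDocument();
    $subDocument->setUrl($url);
    $subDocument->setParentDocument($parentDocument);
    $dm->persist($subDocument);

    if (++$i % $batchSize === 0) {
        $dm->flush();
        $dm->clear(); // detach flushed documents so the identity map stays small
    }
}

$dm->flush(); // flush the final partial batch
$dm->clear();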
Another option is to do a push, pushAll, or addToSet. One issue to consider is that you will need to use stdClass in PHP in order to add an object.
I find this to be the quickest way to update a subdocument.
For example:
$dm->createQueryBuilder('MyBundle:ParentDocument')
    ->update()
    ->field('subDocument')->push( (object) array('url' => $url) )
    ->field('id')->equals($parentDocumentId)
    ->getQuery()
    ->execute();

Related

Is there a "function-like-thing" that can act as a template?

I have code like this that is repeated multiple times in each of my conditional statements/cases. I have 3 conditions for now, and everything works perfectly, but I'm mulling reformatting the script for easier reading.
One of the ways I've thought of is to make a function, but the problem is that I have a while loop in each conditional statement, intended for a specific scenario, that dequeues from a Queue containing some column names from a file.
Based on the code below, which I want to put in some sort of template, I can't think of how this could work because, as you can see, $tb stands for $table, which is what I'm opening prior to the conditional statements in my code.
If I were to include everything regarding the server connection and table in a function, then when I pass the "function" containing the code to the while loops, it would be creating/instantiating the table every iteration, which wouldn't make sense and wouldn't work anyway.
So I am thinking of using something like annotations: a template which won't be expected to return anything or need arguments like a function otherwise would. The question is: does something like that exist?
This is the code that is the same across all my while loops, which I would like to "store" somewhere and just pass to them:
$dqHeader = $csvFileHeadersQueue.Dequeue()
$column = New-Object Microsoft.SqlServer.Management.Smo.Column($tb, $dqHeader, $DataType1)

if ($dqHeader -in $PrimaryKeys)
{
    # We require a primary key.
    $column.Nullable = $false
    #$column.Identity = $true    #not needed with VarChar
    #$column.IdentitySeed = 1    #not needed with VarChar
    $tb.Columns.Add($column)

    $primaryKey = New-Object Microsoft.SqlServer.Management.Smo.Index($tb, "PK_$csvFileBaseName")
    $primaryKey.IndexType = [Microsoft.SqlServer.Management.Smo.IndexType]::ClusteredIndex
    # Referential integrity to prevent data inconsistency: changes in primary keys must be updated in foreign keys.
    $primaryKey.IndexKeyType = [Microsoft.SqlServer.Management.Smo.IndexKeyType]::DriPrimaryKey
    $primaryKey.IndexedColumns.Add((New-Object Microsoft.SqlServer.Management.Smo.IndexedColumn($primaryKey, $dqHeader)))
    $tb.Indexes.Add($primaryKey)
}
else
{
    $tb.Columns.Add($column)
}
Think of it like a puzzle piece that would fit right in when requested in the while loops, completing that "puzzle".
As per comment:
You can share a (hardcoded) [ScriptBlock] ($template = { code in post goes here }) with a while loop (or function) and invoke it with e.g. Invoke-Command $template or the call operator: & $template. Dynamically modifying an expression and using commands like Invoke-Expression or [ScriptBlock]::Create() is not a good idea due to the risk of malicious code injection (see: #1454).
You might even add parameters to your shared [ScriptBlock], like:
$Template = {
    [CmdletBinding()]Param ($DataType)
    $column = New-Object Microsoft.SqlServer.Management.Smo.Column($tb, $dqHeader, $DataType)
    ...
}

ForEach ($MyDataType in @('MyDataType')) {
    Invoke-Command $Template -ArgumentList $MyDataType
}
But the counter-question remains: why not just create a "helper" function?
Function template($DataType) {
    $column = New-Object Microsoft.SqlServer.Management.Smo.Column($tb, $dqHeader, $DataType)
    ...
}

ForEach ($MyDataType in @('MyDataType')) {
    template $MyDataType
}

Perl -> Avoiding unnecessary method calls

I have to read the log files of a store. The log shows the item id with the word "sold" after it. So I made a script to read this file, counting how many times the word "sold" appears for each item id. It turns out that there are many "owners" for the items; that is, there is a relation between "owner_id" (data in my DB) and "item_id". I'm interested in knowing how many items each owner sells per day, so I create a %item_id_owner_map:
my %item_id_sold_times;
my %item_id_owner_map;

open my $infile, "<", $file_location or die("$!: $file_location");
while (<$infile>) {
    if (/item_id:(\d+)\s*,\s*sold/) {
        my $item_id = $1;
        $item_id_sold_times{$item_id}++;

        my $owner_ids =
            Store::Model::Map::ItemOwnerMap->fetch_by_keys( [$item_id] )
            ->entry();
        for my $owner_id (@$owner_ids) {
            $item_id_owner_map{$owner_id}++;
        }
    }
}
close $infile;
The "Store::Model::Map::ItemOwnerMap->fetch_by_keys( [$item_id] )->entry();" method takes item_id or ids as input, and gives back owner_id as output.
Everything looks great but actually, you will see that every time Perl finds a regex match (that is, every time the "if" condition applies), my script will call "Store::Model::Map::ItemOwnerMap->fetch_by_keys" method, which is very expensive, as these log files are very very long.
Is there a way to make my script more efficient? If possible, I only want to call my Model method once.
Best!
Separate your logic into two loops:
while (<$infile>) {
    if (/item_id:(\d+)\s*,\s*sold/) {
        my $item_id = $1;
        $item_id_sold_times{$item_id}++;
    }
}

my @matched_item_ids = keys %item_id_sold_times;

my $owner_ids =
    Store::Model::Map::ItemOwnerMap->fetch_by_keys( \@matched_item_ids )
    ->entry();
for my $owner_id (@$owner_ids) {
    $item_id_owner_map{$owner_id}++;
}
I don't know if the entry() call is correct, but the general shape of that code should do it for you.
In general, databases are good at fetching sets of rows, so you're right to minimise the number of fetch calls to the DB.

Zend framework 1.11 Gdata Spreadsheets insertRow very slow

I'm using insertRow to populate an empty spreadsheet. It starts off taking about 1 second to insert a row, then slows down to around 5 seconds per row after 150 rows or so.
Has anyone experienced this kind of behaviour?
There aren't any calculations on the data in the spreadsheet that could be taking longer as more data is added.
Thanks!
I'll try to be strict.
If you take a look at the class Zend_Gdata_Spreadsheets, you'll see that the method insertRow() is written in a far from optimal way. See:
public function insertRow($rowData, $key, $wkshtId = 'default')
{
    $newEntry = new Zend_Gdata_Spreadsheets_ListEntry();
    $newCustomArr = array();
    foreach ($rowData as $k => $v) {
        $newCustom = new Zend_Gdata_Spreadsheets_Extension_Custom();
        $newCustom->setText($v)->setColumnName($k);
        $newEntry->addCustom($newCustom);
    }

    $query = new Zend_Gdata_Spreadsheets_ListQuery();
    $query->setSpreadsheetKey($key);
    $query->setWorksheetId($wkshtId);

    $feed = $this->getListFeed($query);
    $editLink = $feed->getLink('http://schemas.google.com/g/2005#post');

    return $this->insertEntry($newEntry->saveXML(), $editLink->href, 'Zend_Gdata_Spreadsheets_ListEntry');
}
In short, it loads your whole spreadsheet feed just to learn the value of $editLink->href, so it can post the new row into your spreadsheet.
The cure is to avoid the insertRow method.
Instead, get $editLink->href once in your code, then insert each new row by reproducing the rest of this method's behaviour. That is, instead of $service->insertRow(), use the following:
// Get your $editLink once:
$query = new Zend_Gdata_Spreadsheets_ListQuery();
$query->setSpreadsheetKey($key);
$query->setWorksheetId($wkshtId);
$query->setMaxResults(1); // fetch a single entry instead of the whole feed
$feed = $service->getListFeed($query);
$editLink = $feed->getLink('http://schemas.google.com/g/2005#post');

// ...

// Instead of $service->insertRow():
$newEntry = new Zend_Gdata_Spreadsheets_ListEntry();
$newCustomArr = array();
foreach ($rowData as $k => $v) {
    $newCustom = new Zend_Gdata_Spreadsheets_Extension_Custom();
    $newCustom->setText($v)->setColumnName($k);
    $newEntry->addCustom($newCustom);
}
$service->insertEntry($newEntry->saveXML(), $editLink->href, 'Zend_Gdata_Spreadsheets_ListEntry');
Don't forget to upvote this answer; it cost me a few days to figure out. I think ZF is great, but sometimes you don't want to rely on its code too much when it comes to resource optimization.

memcached - how to use it for multiple records returned from a single SELECT operation?

memcached is useful for caching and looking up single, independent records. For the multiple records returned from a SELECT operation, how can I make good use of memcached for caching and later lookup?
I'm not sure I follow. If you use PHP or .NET (the Enyim client), you can store your result in an object and set it into memcached. (The client will serialize the object and store it in memcached.)
The following example will store one or more records (results) returned by the DB query.
// Init Memcache. Try to avoid multiple inits if possible, because this is a costly operation.
$memcache = new Memcache;
$memcache->connect('localhost', 11211) or die("Could not connect");

mysql_pconnect("localhost", "root", "");
mysql_select_db("YOURDB");

$query = "SELECT * FROM table WHERE `firstname`='dasun';";
$key = md5($query);

$get_result = $memcache->get($key);

if ($get_result) {
    print_r($get_result);
    echo "It's a hit! :)";
}
else {
    $result = mysql_query($query);

    // Collect every row, not just the first one
    $rows = array();
    while ($row = mysql_fetch_assoc($result)) {
        $rows[] = $row;
    }
    print_r($rows);

    // Store the full result set for 5 minutes (compressed)
    $memcache->set($key, $rows, MEMCACHE_COMPRESSED, 300);
    echo "Not a hit. :(";
}

How to get zend_lucene and zend_paginator to work

I've been using Zend Framework for a few months now, so my knowledge is pretty good, but I'm not quite an expert yet. I am trying to use Zend_Search_Lucene with Zend_Paginator, and so far I've not been successful. I can use Zend_Search_Lucene and index data successfully by itself, and I can use Zend_Paginator when querying the database, but I can't seem to combine the two. Here is a sample of what I am doing:
try {
    $searchresults = $index->find($lucenequery);
}
catch (Zend_Search_Lucene_Exception $e) {
    echo "Unable {$e->getMessage()}";
}

$page = $this->_getParam('page', 1);
$paginator = Zend_Paginator::factory($searchresults);
$paginator->setItemCountPerPage(20);
$paginator->setCurrentPageNumber($page);
$this->view->paginator = $paginator;
Is there a different step I need to take with Lucene and Zend_Paginator? I'm really uncertain. The first page of results displays properly, but when I go to the second or third page my results are blank. I'm unsure what might be wrong, as I can't find docs or tutorials on using the two together. Any help would be greatly appreciated.
I think this may work with the iterator adapter:
public function searchAction() {
    $index = Zend_Search_Lucene::open('/path/to/lucene');
    $results = $index->find($this->_getParam('q'));

    $paginator = Zend_Paginator::factory($results);
    $paginator->setCurrentPageNumber($this->_getParam('page', 1));
    $paginator->setItemCountPerPage(10);

    $this->view->results = $paginator;
}
Perhaps the problem you are having is that $paginator doesn't know how many search results there are, so you may need to set that manually:
$paginator->setDefaultPageRange($results->count());
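If the factory is guessing the wrong adapter for the array of hits that find() returns, you could also construct the paginator with an explicit array adapter. A minimal sketch, assuming $results holds the hits from the search action above:

// Wrap the Lucene hits in an explicit array adapter so the paginator
// knows the full result count and can slice pages itself.
$adapter = new Zend_Paginator_Adapter_Array($results);
$paginator = new Zend_Paginator($adapter);
$paginator->setItemCountPerPage(10);
$paginator->setCurrentPageNumber($this->_getParam('page', 1));
$this->view->results = $paginator;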