Iterate over all vertices in a large Titan graph via RexPro

I am using Titan in my Python application (connecting via RexPro & rexpro-python). I would like to perform a few operations that involve iterating over all vertices in the graph, and I'm wondering what would be the best way to do this (if there's even a sensible way at all).
The first idea that comes to mind is to request batches of g.V via the [i..j] range filter, e.g.:
g.V[1..100]
g.V[101..200]
...
g.V[100001..100100]
...
However, the filter will load & iterate over vertices 0 to i, which will be prohibitively expensive for large graphs.
What's the best way to iterate over all vertices via RexPro?

One fairly easy solution is to use a Rexster session variable holding the g.V pipe, and to request batches with Pipe.next:
res = conn.execute("my_iter = g.V; my_iter.next(100);", isolate=False)
while len(res) > 0:
    for d in res:
        yield d
    # get the next batch of 100
    res = conn.execute("my_iter.next(100);", isolate=False)

PyTorch DataLoader using Mongo DB

I would like to know if using a DataLoader connected to a MongoDB is a sensible thing to do and how this could be implemented.
Background
I have about 20 million documents sitting in a (local) MongoDB, way more documents than fit in memory. I would like to train a deep neural net on the data. So far, I have been exporting the data to the file system first, with subfolders named as the classes of the documents. But I find this approach nonsensical: why export first (and later delete) if the data is already well maintained in a DB?
Question 1:
Am I right? Would it make sense to directly connect to the MongoDB? Or are there reasons not to do it (e.g. DBs generally being too slow etc.)? If DBs are too slow (why?), can one prefetch the data somehow?
Question 2:
How would one implement a PyTorch DataLoader?
I have found only very few code snippets online ([1] and [2]) which makes me doubt my approach.
Code snippet
The general way I access MongoDB is shown below. Nothing special about this, I think.
import pymongo

myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient["xyz"]
mycol = mydb["xyz_documents"]

query = {
    # some filters
}

results = mycol.find(query)
# results is now a cursor that can run through all docs
# Assume, for the sake of this example, that each doc contains a class name and some image that I want to train a classifier on
Introduction
This one is a little open-ended, but let's try; also, please correct me if I'm wrong somewhere.
So far, I have been exporting the data to the file system first, with subfolders named as the classes of the documents.
IMO this isn't sensible because:
you are essentially duplicating data
any time you want to train anew, given only the code and the database, this export would have to be repeated
reading from the database directly lets you access multiple datapoints at once and cache them in RAM for later reuse, without hitting the hard drive multiple times (which is quite heavy)
Am I right? Would it make sense to directly connect to the MongoDB?
Given the above, probably yes (especially when it comes to a clear and portable implementation).
Or are there reasons not to do it (e.g. DBs generally being too slow etc.)?
AFAIK the DB shouldn't be slower in this case, as it caches data access, though I'm no DB expert unfortunately. Many tricks for faster access come out of the box with databases.
can one prefetch the data somehow?
Yes, if you just want to fetch data, you could load a larger chunk (say 1024 records) in one go and return batches from it (say batch_size=128).
Implementation
How would one implement a PyTorch DataLoader? I have found only very few code snippets online ([1] and [2]), which makes me doubt my approach.
I'm not sure why you would want to do that. What you should go for is torch.utils.data.Dataset, as shown in the examples you've listed.
I would start with a simple, non-optimized approach similar to the one here, so:
open the connection to the db in __init__ and keep it open as long as it's used (I would create a context manager from torch.utils.data.Dataset so the connection is closed after the epochs are finished)
I would not transform the results to a list (especially since you cannot fit them in RAM for obvious reasons), as that misses the point of generators
I would perform batching inside this Dataset (there is an argument batch_size here)
I am not sure about the __getitem__ function, but it seems it can return multiple datapoints at once; hence I'd use that, and it should allow us to use num_workers>0 (provided that mycol.find(query) returns data in the same order every time)
Given that, something along those lines is what I'd do:
class DatabaseDataset(torch.utils.data.Dataset):
    def __init__(self, query, batch_size, path: str, database: str, collection: str):
        self.batch_size = batch_size
        client = pymongo.MongoClient(path)
        # find() and the count methods live on the collection, not the database,
        # so keep a handle to the collection
        self.collection = client[database][collection]
        self.query = query
        # Or the non-approximate count_documents(); if the approximate method
        # returns a smaller number of items you should be fine
        self.length = self.collection.estimated_document_count()
        self.cursor = None

    def __enter__(self):
        # Ensure that this find() returns the query results in the same order every time;
        # if not, you might get duplicated data.
        # It is rather unlikely (depending on batch size), shouldn't be a problem
        # for 20 million samples anyway
        self.cursor = self.collection.find(self.query)
        return self

    def shuffle(self):
        # Find a way to shuffle the data so it is returned in a different order.
        # If that happens out of the box you might be fine without it actually
        pass

    def __exit__(self, *_, **__):
        # Or any other way to close the connection
        self.cursor.close()

    def __len__(self):
        # Number of batches (drops a possible final partial batch)
        return self.length // self.batch_size

    def __getitem__(self, index):
        # Reads take long, hence loading a batch of documents at once should speed things up.
        # Note: slicing a cursor applies skip/limit to it, so you may instead want to issue
        # a fresh find() with skip()/limit() per batch
        examples = self.cursor[index * self.batch_size : (index + 1) * self.batch_size]
        # Do something with this data
        ...
        # Return the whole batch
        return data, labels
Now batching is taken care of by DatabaseDataset, hence torch.utils.data.DataLoader can have batch_size=1 (you might need to squeeze the extra batch dimension).
As MongoDB uses locks (which is no surprise, but see here), num_workers>0 shouldn't be a problem.
Possible usage (schematically):
with DatabaseDataset(...) as e:
    dataloader = torch.utils.data.DataLoader(e, batch_size=1)
    for epoch in epochs:
        for batch in dataloader:
            # And all the stuff
            ...
        e.shuffle()  # after each epoch
Remember about the shuffling implementation in such a case! (Shuffling can also be done inside the context manager, and then you might want to close the connection manually, or something along those lines.)

What is a more efficient way to compare and filter Sequences (multiple calls to a single call)

I have two sequences of Data objects and I want to establish what has been added, removed and is common between DataSeq1 and DataSeq2 based upon the id in the Data objects within each sequence.
I can achieve this using the following:
val dataRemoved = DataSeq1.filterNot(c => DataSeq2.exists(_.id == c.id))
val dataAdded = DataSeq2.filterNot(c => DataSeq1.exists(_.id == c.id))
val dataCommon = DataSeq1.filter(c => DataSeq2.exists(_.id == c.id))
//Based upon what is common I want to filter DataSeq2
var incomingDataToCompare = List[Data]()
dataCommon.foreach(data => {incomingDataToCompare = DataSeq2.find(_.id == data.id).get :: incomingDataToCompare})
However, as the Data sequences get larger, calling filter three different times may have a performance impact. Is there a more efficient way to achieve the same output (i.e. what has been removed, added and is common) in a single call?
The short answer is: quite possibly not, unless you are going to add some additional features to the system. I would guess that you need to keep a log of operations in order to improve the time complexity, better still if that log is indexed both by the order in which each operation occurred and by the id of the item that was added/removed. I will leave it to you to discover how such a log can be used.
You might also be able to improve the time complexity if you keep the original sequences sorted by id (or keep a separate seq of sorted ids; you should be able to do that at a logN penalty per single operation). That seq should be a Vector or something similar, to allow fast random access. Then you can iterate with two pointers. But this algorithm's efficiency will depend greatly on whether the unique ids are bounded, and also on whether this "establish added/removed/common" operation is called much more frequently than the operations that mutate the sequences.
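To make the two-pointer idea concrete, here is a minimal sketch. It assumes a simplified case class Data(id: Int) standing in for the real type, that both sequences are already kept sorted by id, and diffSorted is just a hypothetical helper name:
case class Data(id: Int) // simplified stand-in for the real Data type

// One pass over two id-sorted sequences, returning (removed, added, common).
def diffSorted(oldSeq: Vector[Data], newSeq: Vector[Data]): (Vector[Data], Vector[Data], Vector[Data]) = {
  val removed = Vector.newBuilder[Data]
  val added   = Vector.newBuilder[Data]
  val common  = Vector.newBuilder[Data]
  var i = 0
  var j = 0
  while (i < oldSeq.length && j < newSeq.length) {
    val o = oldSeq(i)
    val n = newSeq(j)
    if (o.id == n.id) { common += n; i += 1; j += 1 }  // keep the incoming version of common items
    else if (o.id < n.id) { removed += o; i += 1 }     // id present only in the old sequence
    else { added += n; j += 1 }                        // id present only in the new sequence
  }
  removed ++= oldSeq.drop(i) // leftovers exist only in the old sequence
  added ++= newSeq.drop(j)   // leftovers exist only in the new sequence
  (removed.result(), added.result(), common.result())
}
This does one O(n + m) pass instead of three O(n * m) filters, but it only pays off if the sequences can be kept sorted (or sorted once at O(n log n) cost) between mutations.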

What's the advantage of the streaming support in Anorm (Play Scala)?

I have been reading the Streaming results section in the Play docs. What I expected to find was a way to create a Scala Stream based on the results, so if I run a query that returns 10,000 rows that need to be parsed, it would parse them in batches (e.g. 100 at a time) or would just parse the first one and parse the rest as they're needed (so, a Stream).
What I found (from my understanding, I might be completely wrong) is basically a way to parse the results one by one, but in the end it creates a list with all the parsed results (with an arbitrary limit if you like, in this case 100 books). Let's take this example from the docs:
val books: Either[List[Throwable], List[String]] =
  SQL("Select name from Books").foldWhile(List[String]()) { (list, row) =>
    if (list.size == 100) (list -> false) // stop with `list`
    else (list :+ row[String]("name")) -> true // continue with one more name
  }
What advantages does that provide over a basic implementation such as:
val books: List[String] = SQL("Select name from Books").as(SqlParser.str("name").*) // roughly; the exact parser syntax isn't the point here
Parsing a very large number of rows is just inefficient. You might not see it as easily for a simple class, but when you start adding a few joins and have a more complex parser, you will start to see a huge performance hit when the row count gets into the thousands.
From my personal experience, queries that return 5,000 - 10,000 rows (and more) that the parser tries to handle all at once consume so much CPU time that the program effectively hangs indefinitely.
Streaming avoids the problem of trying to parse everything all at once, or even waiting for all the results to make it back to the server over the wire.
What I would suggest is using the Anorm query result as a Source with Akka Streams.
I have successfully streamed hundreds of thousands of rows that way.
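For reference, a rough sketch of what that can look like, assuming the separate anorm-akka module (which provides AkkaStream.source) plus an implicit Materializer and JDBC Connection in scope; treat the exact signatures as an assumption and check that module's documentation:
import java.sql.Connection
import scala.concurrent.Future
import akka.stream.Materializer
import akka.stream.scaladsl.{ Sink, Source }
import anorm._
import anorm.akka.AkkaStream

// Rows are parsed lazily as they are pulled through the stream,
// instead of being materialized as one big List.
def namesSource(implicit m: Materializer, con: Connection): Source[String, Future[Int]] =
  AkkaStream.source(SQL"SELECT name FROM Books", SqlParser.scalar[String], ColumnAliaser.empty)

// Process the names in batches of 100, as in the question.
def processInBatches()(implicit m: Materializer, con: Connection): Future[akka.Done] =
  namesSource.grouped(100).runWith(Sink.foreach { batch =>
    println(s"parsed a batch of ${batch.size} names") // replace with real processing
  })
With this approach backpressure controls how fast rows are pulled from the database, so memory stays bounded no matter how many rows the query returns.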

MATLAB: How to create multiple mapped memory files with a simple "iterator"?

I have files (>100) that each contain recorded sets of data like this:
file0: [no. of data sets in file, no. of data points for recording1, related data to recording1, no. of data points for recording2, related data to recording2, ... , no. of data points for recordingM, related data to recordingM]
file1: [no. of data sets in file, ...] (same as above)
All of the data together may exceed 20 GB, so loading all of it into memory is not an option. Hence, I would like to create memory-mapped files for each of the files BUT hiding from the "user" the complexity of the underlying data, e.g., I would like to be able to operate on the data like this:
for i = 1:TotalNumberOfRecordings
    recording(i) = recording(i) * 10; % some stupid data operation
    % or, even better, something more advanced:
    recording(i).relatedData = 2000;
end
So, no matter whether recording(i) is in file0, file1, or some other file, and no matter its position within the file, I have a list that allows me to access the related data via a memory map.
What I have so far is a list of all files within a certain directory; my idea was to simply create a list like this:
entry1: [memoryMappedFileHandle, dataRangeOfRecording]
entry2: [memoryMappedFileHandle, dataRangeOfRecording]
And then use this list to further abstract files and recordings. I started with this code:
fileList = getAllFiles(directoryName);
list = []; n = 0;
for file = 1:length(fileList)
    m = memmapfile(fileList(file));
    for numberOfTracesInFile
        n = n + 1;
        list = [list; [n, m]];
    end
end
But I do get the error:
Memmapfile objects cannot be concatenated
I'm quite new to MATLAB, so this is probably a bad idea after all. How can I do this better? Is it possible to create a memory-mapped table that contains multiple files?
I'm not sure whether the core of your question is specifically about memory-mapped files, or about whether there is a way to seamlessly process data from multiple large files without the user needing to bother with the details of where the data is.
To address the second question, MATLAB 2014b introduced a new datastore object that is designed to do pretty much this. Essentially, you create a datastore object that refers to your files, and you can then pull data from the datastore without needing to worry about which file it's in. datastore is also designed to work very closely with the new mapreduce functionality that was introduced at the same time, which allows you to easily parallelize map-reduce programming patterns, and even tie in with Hadoop.
To answer the first question: I'm afraid you've found your answer, which is that memmapfile objects cannot be concatenated, so no, it's not straightforward. I think your best approach would be to build your own class, which would contain multiple memmapfile objects in a cell array along with information about which data is in which file, plus some sort of getData method that retrieves the appropriate data from the appropriate file. (This would basically be like writing your own datastore class, but one that works with memory-mapped files rather than plain file reads, so you might be able to copy much of the design and/or implementation details from datastore itself.)
As Horchler said, you could put the memmapfile objects in a cell array:
list = cell(1, 10); % preallocate cell array
for it = 1:10
    memmapfile_object = memmapfile('/path/to/file');
    list{it} = memmapfile_object;
end

How do you perform blocking IO in apache spark job?

What if, when I traverse an RDD, I need to calculate values in the dataset by calling an external (blocking) service? How do you think that could be achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but as RDD is not Traversable, Future.sequence is not suitable.
I just wonder if anyone has had such a problem, and how you solved it?
What I'm trying to achieve is to get parallelism on a single worker node, so I can call that external service 3000 times per second.
Probably there is another solution, more suitable for Spark, like having multiple worker nodes on a single host.
It's interesting to know how you cope with such a challenge. Thanks.
Here is the answer to my own question:
import scala.concurrent._
import scala.concurrent.ExecutionContext.Implicits.global

val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets map { item =>
  Future {
    // call native code
  }
}

val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
  val searchFuture: Future[Iterator[Object]] = Future sequence f
  Await result (searchFuture, JOB_TIMEOUT)
}
The idea here is that we get a collection of partitions, where each partition is sent to a specific worker and is the smallest piece of work. Each piece of work contains the data, which can be processed by calling the native code.
The 'values' collection contains the data returned from the native code, and that work is done across the cluster.
Based on your answer that the blocking call is to compare the provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in Java/Scala so that it can run as part of your Spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your Spark process due to not having to make remote calls will probably make it worth it.
It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.
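A minimal sketch of that in-process approach; Data and compare below are placeholders for your record type and the comparison logic ported to Scala, not anything taken from your existing code:
import org.apache.spark.rdd.RDD

case class Data(id: String, payload: String)        // placeholder for the record type
def compare(query: Data, item: Data): Double = ???  // pure, side-effect-free port of the remote comparison

def scoreAll(query: Data, inputData: RDD[Data]): RDD[Double] =
  inputData.map(item => compare(query, item))       // runs entirely inside the executors, no remote calls
Because compare is pure and local, Spark can evaluate it at full parallelism with no external bottleneck.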
If that is absolutely impossible for some reason, then you might be able to create an RDD transformation which turns your data into an RDD of futures, in pseudo-code:
def callRemote(data: Data): Future[Double] = ...
val inputData: RDD[Data] = ...
val transformed: RDD[Future[Double]] = inputData.map(callRemote)
And then carry on from there, computing on your Future[Double] objects.
If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.
val remoteParallelism: Int = 100 // some constant
def callRemoteBlocking(data: Data): Double = ...
val inputData: RDD[Data] = ...
val transformed: RDD[Double] = inputData.
  coalesce(remoteParallelism).
  map(callRemoteBlocking)
Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.
A final option: if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a data set using your remote service and look them up at Spark job time using a join, as sketched below.
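A rough sketch of that last option, reusing the hypothetical Data type from the sketch above; keyOf and precomputed are assumptions, with the (key, result) pairs produced offline by calling the remote service once per distinct input and saved where the job can read them:
import org.apache.spark.rdd.RDD

def keyOf(d: Data): String = ???                 // whatever uniquely identifies an input
val inputData: RDD[Data] = ???
val precomputed: RDD[(String, Double)] = ???     // precomputed remote results, loaded from storage

val withResults: RDD[(Data, Double)] =
  inputData
    .map(d => (keyOf(d), d))
    .join(precomputed)                           // pairs each item with its precomputed result
    .values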