I would like to know if using a DataLoader connected to a MongoDB is a sensible thing to do and how this could be implemented.
Background
I have about 20 million documents sitting in a (local) MongoDB, way more than fit in memory. I would like to train a deep neural net on the data. So far, I have been exporting the data to the file system first, with subfolders named after the classes of the documents. But I find this approach nonsensical. Why export first (and later delete) if the data is already well maintained in a DB?
Question 1:
Am I right? Would it make sense to directly connect to the MongoDB? Or are there reasons not to do it (e.g. DBs generally being too slow etc.)? If DBs are too slow (why?), can one prefetch the data somehow?
Question 2:
How would one implement a PyTorch DataLoader?
I have found only very few code snippets online ([1] and [2]) which makes me doubt my approach.
Code snippet
The general way I access MongoDB is shown below. Nothing special about it, I think.
import pymongo
myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient["xyz"]
mycol = mydb["xyz_documents"]
query = {
    # some filters
}
results = mycol.find(query)
# results is now a cursor that can run through all docs
# Assume, for the sake of this example, that each doc contains a class name and some image that I want to train a classifier on
Introduction
This one is a little open-ended, but let's try; please also correct me if I'm wrong somewhere.
So far, I have been exporting the data to the file system first, with subfolders named as the classes of the documents.
IMO this isn't sensible because:
you are essentially duplicating data
any time you would like to train anew, given only the code and the database, this export would have to be repeated
with the database you can fetch multiple datapoints at once and cache them in RAM for later reuse, instead of reading from the hard drive repeatedly (which is quite heavy)
Am I right? Would it make sense to directly connect to the MongoDB?
Given the above, probably yes (especially when it comes to a clear and portable implementation).
Or are there reasons not to do it (e.g. DBs generally being too slow etc.)?
AFAIK the DB shouldn't be slower in this case, as it will cache accesses to the data, but I'm no DB expert unfortunately. Many tricks for faster access come out of the box with databases.
can one prefetch the data somehow?
Yes. If you just want to fetch data, you could load a larger chunk (say 1024 records) in one go and return smaller batches from it (say batch_size=128); see the sketch below.
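For illustration, here is a minimal prefetching sketch along those lines, reusing mycol and query from the question; the 'image' and 'label' field names are assumptions:

from itertools import islice

CHUNK_SIZE = 1024  # documents pulled from MongoDB in one go
BATCH_SIZE = 128   # documents handed to the training loop at a time

def batches(cursor, chunk_size=CHUNK_SIZE, batch_size=BATCH_SIZE):
    """Load a large chunk into RAM, then yield smaller training batches from it."""
    while True:
        chunk = list(islice(cursor, chunk_size))  # prefetch chunk_size documents
        if not chunk:
            return
        for i in range(0, len(chunk), batch_size):
            batch = chunk[i:i + batch_size]
            # 'image' and 'label' are hypothetical field names
            yield [d["image"] for d in batch], [d["label"] for d in batch]

cursor = mycol.find(query).batch_size(CHUNK_SIZE)  # hint the driver to fetch in large chunks too
for images, labels in batches(cursor):
    ...  # train on the batch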
Implementation
How would one implement a PyTorch DataLoader? I have found only very few code snippets online ([1] and [2]), which makes me doubt my approach.
I'm not sure why you would want to implement the DataLoader itself. What you should go for is a torch.utils.data.Dataset, as shown in the examples you've listed.
I would start with a simple, non-optimized approach similar to the one here, so:
open the connection to the DB in __init__ and keep it open as long as it's used (I would turn the torch.utils.data.Dataset into a context manager so the connection is closed after the epochs are finished)
I would not transform the results into a list (especially since you cannot fit them in RAM for obvious reasons), as that misses the point of generators
I would perform batching inside this Dataset (hence the batch_size argument in the code below)
I am not sure about the __getitem__ function, but it seems it can return multiple datapoints at once; hence I'd use that, and it should allow us to use num_workers > 0 (given that mycol.find(query) returns data in the same order every time)
Given that, something along these lines is what I'd do:
import math

import pymongo
import torch


class DatabaseDataset(torch.utils.data.Dataset):
    def __init__(self, query, batch_size, path: str, database: str, collection: str):
        self.batch_size = batch_size
        client = pymongo.MongoClient(path)
        self.db = client[database]
        # The collection that actually holds the documents (e.g. "xyz_documents")
        self.collection = self.db[collection]
        self.query = query
        # Or the non-approximate count_documents(); if the approximate method
        # returns a smaller number of items you should be fine
        self.length = self.collection.estimated_document_count()
        self.cursor = None

    def __enter__(self):
        # Ensure that this find returns the same order of results every time.
        # If not, you might get duplicated data.
        # It is rather unlikely (depending on batch size) and shouldn't be a problem
        # for 20 million samples anyway.
        self.cursor = self.collection.find(self.query)
        return self

    def shuffle(self):
        # Find a way to shuffle the data so it is returned in a different order.
        # If that happens out of the box you might be fine without it.
        pass

    def __exit__(self, *_, **__):
        # Or any other way to close the connection/cursor
        self.cursor.close()

    def __len__(self):
        # Number of batches, since __getitem__ returns whole batches
        return math.ceil(self.length / self.batch_size)

    def __getitem__(self, index):
        # Reads take long, hence loading a batch of documents at once should speed things up.
        # Clone the cursor so skip/limit can be applied to a fresh, unused cursor each call.
        batch_cursor = self.cursor.clone()[index * self.batch_size : (index + 1) * self.batch_size]
        examples = list(batch_cursor)
        # Do something with this data
        ...
        # Return the whole batch
        return data, labels
Now batching is taken care of by DatabaseDataset, hence torch.utils.data.DataLoader can have batch_size=1. You might need to squeeze the additional leading dimension the DataLoader adds.
As MongoDB uses locks (which is no surprise, but see here), num_workers>0 shouldn't be a problem.
Possible usage (schematically):
with DatabaseDataset(...) as dataset:
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=1)
    for epoch in epochs:
        for batch in dataloader:
            # And all the stuff
            ...
        dataset.shuffle()  # after each epoch
Remember about the shuffling implementation in such a case! (Shuffling could also be done inside the context manager, and you might want to close the connection manually, or something along those lines.)
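As a side note, if the index-based batching above ever gets in the way, the same idea can also be expressed with a torch.utils.data.IterableDataset, which maps more naturally onto a cursor. A rough sketch, where the 'image'/'label' field names and the decoding step are assumptions:

import pymongo
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class MongoIterableDataset(IterableDataset):
    """Streams (image, label) pairs straight from a MongoDB cursor."""

    def __init__(self, uri, database, collection, query):
        self.uri = uri
        self.database = database
        self.collection = collection
        self.query = query

    def __iter__(self):
        # Open the client inside the worker process; PyMongo clients should
        # not be shared across forked DataLoader workers.
        client = pymongo.MongoClient(self.uri)
        cursor = client[self.database][self.collection].find(self.query)
        info = get_worker_info()
        for i, doc in enumerate(cursor):
            # Naive sharding: each worker keeps every num_workers-th document.
            if info is not None and i % info.num_workers != info.id:
                continue
            # 'image' and 'label' are hypothetical field names; decode the
            # stored image into a tensor here before yielding.
            yield torch.as_tensor(doc["image"]), doc["label"]


# The DataLoader handles batching here; shuffling would have to happen on the
# MongoDB side, since the cursor is consumed sequentially.
dataset = MongoIterableDataset("mongodb://localhost:27017/", "xyz", "xyz_documents", {})
loader = DataLoader(dataset, batch_size=128, num_workers=2)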
Related
Quite frequently, I'd like to retrieve only the first N rows from a query, but I don't know in advance what N will be. For example:
try (var stream = sql.selectFrom(JOB)
        .where(JOB.EXECUTE_AFTER.le(clock.instant()))
        .orderBy(JOB.PRIORITY.asc())
        .forUpdate()
        .skipLocked()
        .fetchSize(100)
        .fetchLazy()) {
    // Now run as many jobs as we can in 10s
    ...
}
Now, without adding an arbitrary LIMIT clause, the PG query planner sometimes decides to do a painfully slow sequential table scan for such queries, AFAIK because it thinks I'm going to fetch every row in the result set. An arbitrary LIMIT kind of works for simple cases like this one, but I don't like it at all because:
the limit's only there to "trick" the query planner into doing the right thing, it's not there because my intent is to fetch at most N rows.
when it gets a little more sophisticated and you have multiple such queries that somehow depend on each other, choosing an N large enough to not break your code can be hard. You don't want to be the next person who has to understand that code.
finding out that the query is unexpectedly slow usually happens in production where your tables contain a few million/billion rows. Totally avoidable if only the DB didn't insist on being smarter than the developer.
I'm getting tired of writing detailed comments that explain why the queries have to look like this-n-that (i.e. explain why the query planner screws up if I don't add this arbitrary limit)
So, how do I tell the query planner that I'm only going to fetch a few rows and getting the first row quickly is the priority here? Can this be achieved using the JDBC API/driver?
(Note: I'm not looking for server configuration tweaks that indirectly influence the query planner, like tinkering with random page costs, nor can I accept a workaround like set seq_scan=off)
(Note 2: I'm using jOOQ in the example code for clarity, but under the hood this is just another PreparedStatement using ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY, so AFAIK we can rule out wrong statement modes)
(Note 3: I'm not in autocommit mode)
PostgreSQL is smarter than you think. All you need to do is to set the fetch size of the java.sql.Statement to a value different from 0 using the setFetchSize method. Then the JDBC driver will create a cursor and fetch the result set in chunks. Any query planned that way will be optimized for fast fetching of the first 10% of the data (this is governed by the PostgreSQL parameter cursor_tuple_fraction). Even if the query performs a sequential scan of a table, not all the rows have to be read: reading will stop as soon as no more result rows are fetched.
I have no idea how to use JDBC methods with your ORM, but there should be a way.
In case the JDBC fetchSize() method doesn't suffice as a hint to get the behaviour you want, you could make use of explicit server side cursors like this:
ctx.transaction(c -> {
    c.dsl().execute("declare c cursor for {0}", dsl
        .selectFrom(JOB)
        .where(JOB.EXECUTE_AFTER.le(clock.instant()))
        .orderBy(JOB.PRIORITY.asc())
        .forUpdate()
        .skipLocked()
    );

    try {
        JobRecord r;
        while ((r = dsl.resultQuery("fetch forward 1 from c")
                .coerce(JOB)
                .fetchOne()) != null) {
            System.out.println(r);
        }
    }
    finally {
        c.dsl().execute("close c");
    }
});
There's a pending feature request to support the above also in the DSL API (see #11407), but the example above shows that this can still be done in a type-safe way using plain SQL templating and the ResultQuery::coerce method.
I have a frequent use case I couldn't solve.
Let's say I have a filepattern like gs://mybucket/mydata/*/files.json where * is supposed to match a date.
Imagine I want to keep 251 dates (this is an example; let's say a large number of dates, but without a meta-pattern such as 2019* to match them).
For now, I have two options :
create a TextIO for every single file, which is overkill and fails almost every time (graph too large)
read ALL the data and then filter it within my job, which is also overkill when you have 10 TB of data while you only need, say, 10 GB
In my case, I would just like to do something like this (pseudo-code):
Read(LIST[uri1,uri2,...,uri251])
And have this instruction spawn a single TextIO task in the graph.
I am sorry if I missed something, but I couldn't find a way to do it.
Thanks
OK, I found it; the naming was misleading me:
Example 2: reading a PCollection of filenames.
Pipeline p = ...;
// E.g. the filenames might be computed from other data in the pipeline, or
// read from a data source.
PCollection<String> filenames = ...;
// Read all files in the collection.
PCollection<String> lines =
    filenames
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles());
(Quoted from Apache Beam documentation https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/io/TextIO.html)
So we need to generate a PCollection of URIs (with Create.of) or read it from the pipeline, then match all the URIs (or patterns, I guess), and then read all the files.
I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
The brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the highest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it's an N-to-N relation)
Rules are boolean functions (predicates), so I can try to take advantage of early cutting, meaning that if I have f() && g() && h() with f() evaluating to false, then I do not have to process g() and h() at all and can return false immediately
the number of Fields in a single Document is always the same (about 5-10)
Filters, Rules and Documents are already in the database
every Filter has at least one Rule
Using the sharing (second assumption), my first idea was to evaluate a Document against every Rule first and then (after finishing), for every Filter, compute the result from the already computed Rules. This way, if a Rule is shared, I compute it only once. However, it doesn't take advantage of early cutting (third assumption).
The second idea is to use early cutting as a slightly optimized brute force, but then it doesn't benefit from Rule sharing.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably be helpful.
I have noticed that the Filter-Rule relation is a bipartite graph, though I'm not quite sure whether that can help me. I have also noticed that I could use reverse sets and store the corresponding Set in every Rule. However, this would create a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and every single one of them is an event that creates a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the shared-nothing architecture.
Sample Filter:
{
    'normalForm': 'CONJUNCTIVE',
    'rules':
    [
        {
            'isNegated': true,
            'field': 'X',
            'relation': 'STARTS_WITH',
            'value': 'G',
        },
        {
            'isNegated': false,
            'field': 'Y',
            'relation': 'CONTAINS',
            'value': 'KEY',
        },
    ],
}
or in a more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
for Document:
{
    'x': 'CAR',
    'y': 'KEYBOARD',
    'z': 'PAPER',
}
evaluates to true.
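To make the memoization-plus-early-cutting trade-off above concrete, here is a minimal single-process sketch that caches Rule results per Document (so shared Rules are evaluated only once) while still short-circuiting inside each conjunctive Filter; the predicate implementations are assumptions for illustration:

# Hypothetical predicate implementations for the relations used above
RELATIONS = {
    "STARTS_WITH": lambda value, arg: value.startswith(arg),
    "CONTAINS": lambda value, arg: arg in value,
}

sample_filter = {
    "normalForm": "CONJUNCTIVE",
    "rules": [
        {"isNegated": True, "field": "X", "relation": "STARTS_WITH", "value": "G"},
        {"isNegated": False, "field": "Y", "relation": "CONTAINS", "value": "KEY"},
    ],
}

def evaluate_filter(document, filt, cache):
    """Evaluate one CONJUNCTIVE filter, memoizing rule results in `cache`."""
    for rule in filt["rules"]:
        key = (rule["field"], rule["relation"], rule["value"])
        if key not in cache:  # a shared rule is computed only once per document
            predicate = RELATIONS[rule["relation"]]
            cache[key] = predicate(document[rule["field"].lower()], rule["value"])
        if cache[key] == rule["isNegated"]:  # rule fails after applying negation
            return False  # early cutting: skip the remaining rules
    return True

document = {"x": "CAR", "y": "KEYBOARD", "z": "PAPER"}
cache = {}  # per-document memo of rule results, shared across all filters
print(evaluate_filter(document, sample_filter, cache))  # True

With many Filters, the same cache dict would be passed to every evaluate_filter call for that Document, which is where the Rule sharing pays off.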
I can slightly change the data model, stream something else instead of Documents (e.g. Filters), and use any NoSQL database and tools to help. Apache Flink (event processing) and MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph), look like they could help me, but I'm not sure about it.
Can it be processed efficiently (with regard to, probably, database queries)? What tools would be appropriate?
I have also been wondering whether I am trying to solve a special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and publishes N functions (as in Function as a Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, i.e. the Filters are just evenly distributed across instances). This way every Filter is fetched from the database only once.
Now, every instance streams all Documents from the cache (one Document should be less than 1 MB, and at the same time I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
I have reduced database read operations (some Rules are still selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore it by using a document database: by selecting a Filter I would also get its Rules. Still, I would have to recalculate their values.
I guess that's what I get for using a shared-nothing scalable architecture?
I realized that although my graph is indeed bipartite (in theory), in practice it's going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules.
This reduces my problem to processing a single connected bipartite graph. Now I can only use the benefits of dynamic programming and share the result of a Rule computation if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing that benefit. So I thought this way: if I have already decided that I will have to recompute some Rules, then let their number be low compared to the disjoint parts that I will get.
This is actually the minimum cut problem, which (fortunately) has a known polynomial-time algorithm.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph; I would like to cut the graph ideally in half (a divide-and-conquer strategy that could be reapplied recursively until the graph is small enough to be processed in seconds on a FaaS instance, which has a time bound).
This means that I am looking for a cut that would create two disjoint bipartite graphs, each with roughly the same number of vertices (or at least a similar number).
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors fewer cut edges.
Currently, this does look like a solution to my problem; however, I would love to hear any suggestions, improvements and other answers.
Maybe it can be done better with another data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it into another (simpler) problem, or at least one that is easier to divide and distribute across nodes?
This idea and analysis strongly suggest using a graph database.
I'm interested in a scenario where a document is fetched from the database, some computations are run based on some external conditions, one of the fields of the document gets updated and then the document gets saved, all in a system that might have concurrent threads accessing the DB.
To make it easier to understand, here's a very simplistic example. Suppose I have the following document:
{
    ...
    items_average: 1234,
    last_10_items: [10, 2187, 2133, ...]
    ...
}
Suppose a new item (X) comes in; five things will need to be done:
read the document from the DB
remove the first (oldest) item in the last_10_items
add X to the end of the array
re-compute the average* and save it in items_average.
write the document to the DB
* NOTE: the average computation was chosen as a very simple example, but the question should take into account more complex operations based on data existing in the document and on new data (i.e. not something solvable with the $inc operator)
This is certainly easy to implement in a single-threaded system, but in a concurrent system, if two threads follow the above steps, inconsistencies might occur, since both will update the last_10_items and items_average values without taking the concurrent changes into account and/or overwriting them.
So, my question is: how can such a scenario be handled? Is there a way to check, or react upon the fact, that the underlying document was changed between steps 1 and 5? Is there such a thing as WATCH from Redis or a 'Concurrent Modification Error' from relational DBs?
Thanks
In a database system, this uses a memory-inspection and rollback scheme similar to transactional memory.
Briefly speaking, it monitors the shared memory regions you specify and does something like compare-and-swap, load-link/store-conditional, or test-and-set.
Therefore, if any memory content is changed during the transaction, it will abort and retry until there is no conflicting operation on that shared memory.
For example, GCC implements the following:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
type __sync_lock_test_and_set (type *ptr, type value, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
For more info about transactional memory,
http://en.wikipedia.org/wiki/Software_transactional_memory
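Applied to the MongoDB scenario in the question, the same compare-and-swap idea is usually implemented as optimistic locking: make the write conditional on the document not having changed since it was read, and retry otherwise. A minimal pymongo sketch, where the 'version' field and the collection/database names are assumptions added for illustration:

import pymongo

collection = pymongo.MongoClient("mongodb://localhost:27017/")["mydb"]["stats"]

def add_item(doc_id, new_item):
    """Optimistically update last_10_items and items_average; retry on conflict."""
    while True:
        doc = collection.find_one({"_id": doc_id})
        items = doc["last_10_items"][1:] + [new_item]  # drop oldest, append X
        average = sum(items) / len(items)  # stand-in for any complex recomputation
        result = collection.update_one(
            # The filter plays the "compare" role: only match if the document
            # still carries the version we read (assumes a 'version' counter field).
            {"_id": doc_id, "version": doc["version"]},
            {"$set": {"last_10_items": items, "items_average": average},
             "$inc": {"version": 1}},  # the "swap"
        )
        if result.modified_count == 1:
            return  # no concurrent change interfered
        # Another thread modified the document between steps 1 and 5; retry.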
I have files (>100) that each contain recorded sets of data like this:
file0: [no. of data sets in file, no. of data points for recording1, related data to recording1, no. of data points for recording2, related data to recording2, ... , no. of data points for recordingM, related data to recordingM]
file1: [no. of data sets in file, ...] (same as above)
All of the data together may exceed 20 GB, so loading all of it into memory is not an option. Hence, I would like to create memory-mapped files for each of the files BUT hide from the "user" the complexity of the underlying data; e.g., I would like to be able to operate on the data like this:
for i = 1:TotalNumberOfRecordings
    recording(i) = recording(i) * 10; % some stupid data operation
    % or, even better / more advanced:
    recording(i).relatedData = 2000;
end
So, no matter whether recording(i) is in file0, file1, or some other file, and no matter its position within the file, I have a list that allows me to access the related data via a memory map.
What I have so far is a list of all files within a certain directory; my idea was to simply create a list like this:
entry1: [memoryMappedFileHandle, dataRangeOfRecording]
entry2: [memoryMappedFileHandle, dataRangeOfRecording]
And then use this list to further abstract files and recordings. I started with this code:
fileList = getAllFiles(directoryName);
list = []; n = 0;
for file = 1:length(fileList)
    m = memmapfile(fileList(file));
    for trace = 1:numberOfTracesInFile
        n = n + 1;
        list = [list; [n, m]];
    end
end
But I do get the error:
Memmapfile objects cannot be concatenated
I'm quite new to MATLAB, so this is probably a bad idea after all. How can I do it better? Is it possible to create a memory-mapped table that contains multiple files?
I'm not sure whether the core of your question is specifically about memory-mapped files, or about whether there is a way to seamlessly process data from multiple large files without the user needing to bother with the details of where the data is.
To address the second question, MATLAB 2014b introduced a new datastore object that is designed to do pretty much this. Essentially, you create a datastore object that refers to your files, and you can then pull data from the datastore without needing to worry about which file it's in. datastore is also designed to work very closely with the new mapreduce functionality that was introduced at the same time, which allows you to easily parallelize map-reduce programming patterns, and even tie in with Hadoop.
To answer the first question: I'm afraid I think you've found your answer, which is that memmapfile objects cannot be concatenated, so no, it's not straightforward. I think your best approach would be to build your own class, which would contain multiple memmapfile objects in a cell array along with information about which data is in which file, along with some sort of getData method that would retrieve the appropriate data from the appropriate file. (This would basically be like writing your own datastore class, but one that works with memory-mapped files rather than ordinary files, so you might be able to copy much of the design and/or implementation details from datastore itself.)
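Purely to illustrate the shape of such a wrapper class, here is a sketch (in Python with numpy.memmap, since that's the language used elsewhere on this page; a MATLAB classdef would follow the same structure). It assumes every file stores a flat array of one numeric type laid out exactly as described in the question, and all names are made up:

import numpy as np

class RecordingStore:
    """Maps a global recording index to (memory-mapped file, offset range)."""

    def __init__(self, file_names, dtype=np.float64):
        self.maps = []   # one np.memmap per file
        self.index = []  # global recording id -> (file position, start, stop)
        for file_pos, name in enumerate(file_names):
            mm = np.memmap(name, dtype=dtype, mode="r+")
            n_recordings = int(mm[0])       # header: number of data sets in the file
            offset = 1
            for _ in range(n_recordings):
                n_points = int(mm[offset])  # per-recording length header
                start, stop = offset + 1, offset + 1 + n_points
                self.index.append((file_pos, start, stop))
                offset = stop
            self.maps.append(mm)

    def __len__(self):
        return len(self.index)

    def get(self, i):
        file_pos, start, stop = self.index[i]
        return self.maps[file_pos][start:stop]  # a view into the mapped file

    def set(self, i, values):
        file_pos, start, stop = self.index[i]
        self.maps[file_pos][start:stop] = values

The user then writes store.set(i, store.get(i) * 10) without knowing which file recording i lives in, which is essentially the getData abstraction described above.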
Like Horchler said, you could put the memmapfile objects in a cell array:
list = cell(1,10); % preallocate cell array
for it = 1:10
    memmapfile_object = memmapfile('/path/to/file');
    list{it} = memmapfile_object;
end