I am trying to store a binary file in OrientDB. I am using pyorient. The file can be large (more than 50 MB) and I could not find a way except storing its hex as a list of strings, which takes a long time. Is there a more elegant way to do this that is also faster?
Unfortunately I think not; I spent a bunch of time on this and OrientDB's ability to store binary files isn't exposed in PyOrient as far as I could tell.
As you may know, PyOrient hasn't received an update since 2017 (I assume it is no longer officially supported), and none of the database's more recent features are available via the PyOrient driver either.
Personally I've reached the conclusion that OrientDB is no longer a viable choice for a Python-based solution even without the binary file limitation, unless you have the time/energy to dig into the driver and bring it up to spec.
I'm working on a video server, and I want to use a database to keep video files.
Since I only need to store simple video files with metadata, I tried MongoDB in Java, using its GridFS mechanism to store the video files and their metadata.
However, there are two major features I need that I couldn't manage with MongoDB:
I want to be able to add to a previously saved video, since saving a video might be performed in chunks. I don't want to delete the binary I have so far, just append bytes at the end of an item.
I want to be able to read from a video item while it is being written. "Thread A" will update the video item, adding more and more bytes, while "Thread B" will read from the item, receiving all the bytes written by "Thread A" as soon as they are written/flushed.
I tried writing the straightforward code to do that, but it failed. It seems MongoDB doesn't allow multi-threaded access to the binary (even if only one thread is doing the writing), nor could I find a way to append to a binary file: the Java GridFS API only gives an InputStream from an already existing GridFSDBFile; I cannot get an OutputStream to write to it.
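For reference, this is roughly what my write/read path looks like with the legacy Java driver (a minimal sketch; the database, bucket, and file names are placeholders):

```java
import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import com.mongodb.gridfs.GridFSInputFile;

import java.io.FileInputStream;
import java.io.InputStream;

public class GridFsVideoSketch {
    public static void main(String[] args) throws Exception {
        DB db = new MongoClient("localhost").getDB("videodb"); // placeholder database name
        GridFS gridFs = new GridFS(db, "videos");               // placeholder bucket name

        // Writing: the whole stream is consumed and saved in one shot;
        // there is no handle left open that would let me append more bytes later.
        GridFSInputFile upload = gridFs.createFile(new FileInputStream("clip.mp4"), "clip.mp4");
        upload.save();

        // Reading: an already stored file only exposes an InputStream, never an OutputStream.
        GridFSDBFile stored = gridFs.findOne("clip.mp4");
        try (InputStream in = stored.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                // hand the n bytes to a client, write them to a socket, etc.
            }
        }
    }
}
```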
Is this possible via MongoDB, and if so how?
If not, do you know of any other DB that might allow this (preferably nothing too complex such as a full relational DB)?
Would I be better off using MongoDB to keep only the metadata of the video files, and manually handle reading and writing the binary data from the filesystem, so I can implement the above requirements on my own?
Thanks,
Al
I've used Mongo GridFS to store media files for a messaging system we built on Mongo, so I can share what we ran into.
Before I get into this: for your use case I would recommend not using GridFS and instead using something like Amazon S3 (which has excellent REST APIs for multipart uploads), storing only the metadata in Mongo. This is the approach we settled on in our project after first implementing it with GridFS. It's not that GridFS isn't good; it's just not well suited to chunking, appending, and rewriting small portions of files. For more info, here's a quick rundown on what GridFS is and isn't good for:
http://www.mongodb.org/display/DOCS/When+to+use+GridFS
Now, if you are set on using GridFS, you need to understand how the driver and read/write concurrency work.
In Mongo (2.2) there is one writer lock per database. This means that while you are writing, other threads are essentially locked out of performing operations on that database. In real-life usage this is still very fast, because the lock yields after each chunk (256 KB) is written, so your reader thread can get some data back. Please look at this concurrency video/presentation for more details:
http://www.10gen.com/presentations/concurrency-internals-mongodb-2-2
So if you look at my two links, question 2 is essentially answered. You should also understand a little bit about how Mongo writes large data sets and how page faults provide a way for reader threads to get information.
Now let's tackle your first question. The Mongo driver does not provide a way to append data to a GridFS file; it is meant to be a fire-and-forget, atomic operation. However, if you understand how the data is stored in chunks and how the checksum is calculated, then you can do it manually by working with the fs.files and fs.chunks collections, as this poster describes here:
Append data to existing gridfs file
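To make that a bit more concrete, here is a rough sketch of what a manual append could look like with the legacy Java driver. This is only a sketch under strong assumptions: it assumes every existing chunk is exactly full (otherwise the last chunk has to be read back, extended, and rewritten), it knowingly leaves the stored md5 stale, and the database and file names are placeholders:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class GridFsAppendSketch {

    // Append one extra chunk (at most chunkSize bytes) to an existing GridFS file
    // by writing to the fs.files / fs.chunks collections directly.
    static void appendChunk(DB db, String filename, byte[] moreBytes) {
        DBCollection files = db.getCollection("fs.files");
        DBCollection chunks = db.getCollection("fs.chunks");

        DBObject fileDoc = files.findOne(new BasicDBObject("filename", filename));
        Object fileId = fileDoc.get("_id");
        long length = ((Number) fileDoc.get("length")).longValue();
        int chunkSize = ((Number) fileDoc.get("chunkSize")).intValue();

        // Index of the next chunk, assuming every existing chunk is exactly chunkSize bytes;
        // if the last chunk is partial it has to be read back, extended and rewritten instead.
        int nextN = (int) ((length + chunkSize - 1) / chunkSize);

        chunks.insert(new BasicDBObject("files_id", fileId)
                .append("n", nextN)
                .append("data", moreBytes));

        // Record the new length. The stored md5 is now stale and would need recomputing
        // if anything downstream checks it.
        files.update(new BasicDBObject("_id", fileId),
                new BasicDBObject("$inc", new BasicDBObject("length", (long) moreBytes.length)));
    }

    public static void main(String[] args) throws Exception {
        DB db = new MongoClient("localhost").getDB("videodb"); // placeholder database name
        appendChunk(db, "clip.mp4", new byte[]{1, 2, 3});
    }
}
```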
So, going through those, you can see that it is possible to do what you want, but my general recommendation is to use a service designed for this type of interaction (such as Amazon S3) instead of doing extra work to make Mongo fit your needs. Of course you can also go to the filesystem directly, which would be the poor man's choice, but you lose the redundancy, sharding, replication, etc. that you get with GridFS or S3.
Hope that helps.
-Prasith
I'm developing a PHP platform that will make heavy use of images, documents, and any other file format that comes to mind, so I was wondering if Cassandra is a good choice for my needs.
If not, can you tell me how I should store files? I'd like to keep using Cassandra because it's fault-tolerant and auto-replicates among nodes.
Thanks for help.
From the Cassandra wiki:
Cassandra's public API is based on Thrift, which offers no streaming abilities: any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265.
As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with (at least one person is using 64 MB) and make a file correspond to a row, with the chunks as column values.
So if your files are < 10 MB you should be fine; just make sure to limit the file size, or break large files up into chunks.
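If you go the chunking route, the idea from the wiki is one row per file with one column per chunk; in CQL terms that becomes one partition per file with one clustering row per chunk. Here's a rough sketch of that pattern, written in Java with the newer DataStax CQL driver for brevity rather than the raw Thrift API the wiki refers to (the keyspace, table, and 1 MB chunk size are arbitrary choices):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ChunkedFileStoreSketch {
    private static final int CHUNK_SIZE = 1024 * 1024; // 1 MB per chunk; pick what fits your heap

    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS filestore WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS filestore.blobs ("
                    + "file_name text, chunk_no int, data blob, "
                    + "PRIMARY KEY (file_name, chunk_no))");

            PreparedStatement insert = session.prepare(
                    "INSERT INTO filestore.blobs (file_name, chunk_no, data) VALUES (?, ?, ?)");

            // One partition per file, one row per chunk, read back in chunk_no order.
            try (InputStream in = new FileInputStream("photo.jpg")) {
                byte[] buf = new byte[CHUNK_SIZE];
                int read, chunkNo = 0;
                while ((read = in.read(buf)) > 0) {
                    session.execute(insert.bind("photo.jpg", chunkNo++,
                            ByteBuffer.wrap(Arrays.copyOf(buf, read))));
                }
            }
        }
    }
}
```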
You should be OK with 10 MB files. In fact, DataStax Brisk puts a filesystem on top of Cassandra, if I'm not mistaken: http://www.datastax.com/products/enterprise.
(I'm not associated with them in any way; this isn't an ad.)
As more recent information: Netflix provides utilities for their Cassandra client, Astyanax, for storing files as chunked objects. A description and examples can be found here. It can be a good starting point to write some tests using Astyanax and evaluate Cassandra as file storage.
I'm developing an open source product and need an embedded dbms.
Can you recommend an embedded open source database that ...
Can handle objects over 10 GB each
Has a license friendly to embedding (LGPL, not GPL).
Is pure Java
Is (preferably) nosql. Sql might work, but prefer nosql
I've looked over some of the document DBMSs, like MongoDB, but they seem to be limited to 4 or 16 MB documents. Berkeley DB looked attractive but has a GPL-like license. SQLite3 is attractive: good license, and you can compile it with whatever maximum blob size you like. But it's not Java. I know JDBC drivers exist, but we need a pure Java system.
Any suggestions?
Thanks
Steve
Although it's an old question, I've been looking into this recently and have come across the following (at least two of which were written after this question was asked). I'm not sure how any of these handle very large objects, and at 10 GB you would probably have to do some serious testing, as I presume few database developers have objects of that size in mind for their products (just a guess). I would definitely consider storing them to disk directly, with just a reference to the file location in your database.
(Opinions below are all pretty superficial, by the way, as I haven't used them in earnest yet).
OrientDB looks like the most mature of the three I found. It appears to be a document and/or graph database and claims to be very fast and light (making use of an "RB+Tree" data structure, a combination of B+ and Red-Black trees), with no external dependencies. There seems to be an active community developing it, with lots of commits over the last few days, for example. It's also compliant with the TinkerPop graph database standard, which adds another layer of features (such as the Gremlin graph querying language). It's ACID compliant, has REST and other external APIs, and even a web-based management app (which presumably could be deployed with your embedded DB, but I'm not sure).
The next two fall more into the simple key-value store camp of the N(ot)O(nly)SQL world.
JDBM3 is an extremely minimal data store: it has a hash map, tree map, tree set, and linked list which are written to disk through memory-mapped files. It claims to be very light and fast, is fully transactional, and is being actively developed.
HawtDB looks similarly simple and fast: a BTree- or hash-based index persisted to disk with memory-mapped files. It's (optionally) fully transactional. There has been no commit in the past seven months (as of the end of March 2012) and there's not much activity on the mailing list. That's not to say it's not a good library, but it's worth mentioning.
JDBM3 and HawtDB are pretty minimal, so you're not going to get any fancy GUIs. But I think they both look very attractive for their speed and simplicity.
Those are all I've found matching your requirements. In addition, Neo4j is great: a graph database which is now pretty mature and works very well in embedded mode. It's GPL/AGPL licensed, though, so it may require a paid license unless you can open source your code too:
http://neotechnology.com/products/price-list/
Of course, you could also use the H2 SQL database with one big table and no indices!
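For what it's worth, a minimal sketch of that H2 route with plain JDBC, streaming the object into a BLOB column so it never has to sit in memory all at once (the database path and file name are just placeholders):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class H2BlobSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.h2.Driver"); // not needed with recent drivers, harmless otherwise

        // Embedded database kept in files under ./data; no server process required.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/blobstore", "sa", "")) {
            try (Statement ddl = conn.createStatement()) {
                ddl.execute("CREATE TABLE IF NOT EXISTS blobs(name VARCHAR PRIMARY KEY, content BLOB)");
            }

            try (PreparedStatement ins = conn.prepareStatement(
                         "INSERT INTO blobs(name, content) VALUES (?, ?)");
                 InputStream in = new FileInputStream("big-object.bin")) {
                ins.setString(1, "big-object.bin");
                // Stream the file into the BLOB so it never has to fit in memory at once.
                ins.setBinaryStream(2, in);
                ins.executeUpdate();
            }
        }
    }
}
```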
As part of my work we get approximately 25 TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived (zipped/tar.gz) while others reside in plain text format.
I am looking for alternatives to an NFS-based system. I looked at MongoDB and CouchDB; the fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as-is.
As for usage, we intend to put a small REST API in front of it and allow people to get a file listing, the latest files, and the files themselves.
The proposed solutions/ideas need to be some form of distributed database or application-level filesystem where one can store log files and scale horizontally and effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety" redundancy, so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at Gluster? It is scalable, provides replication, and has many other features. It also gives you standard file operations, so there is no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large and the access pattern will be a linear scan. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them: your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
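A rough sketch of what the ingest could look like with the Hadoop Java API; the namenode address and paths are placeholders, and dfs.replication/dfs.blocksize are the client-side settings I'd expect to use (the block-size key name varies a bit across Hadoop versions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

public class LogUploaderSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");       // 3-way replication, as suggested above
        conf.set("dfs.blocksize", "67108864");  // 64 MB blocks (older releases use dfs.block.size)

        // "namenode" is a placeholder for your cluster's namenode host.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
             InputStream in = new FileInputStream("/var/log/app/app.log.gz");
             FSDataOutputStream out = fs.create(new Path("/logs/2012/05/01/app.log.gz"))) {
            IOUtils.copyBytes(in, out, 4096); // files are kept byte-for-byte in their current format
        }
    }
}
```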
If you are to choose a document database:
On CouchDB you can use the attachments API (the _attachments field) to attach the file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you will have a REST API for both the documents and the attachments.
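For example, attaching a file as-is to an existing document is just an HTTP PUT against the attachment endpoint. A bare sketch with java.net (the database name, document id, revision, and content type are placeholders; you would fetch the document's current _rev first):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CouchAttachmentSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders: database "media", document "log-2012-05-01", and its current _rev,
        // which you would normally look up first (e.g. with a GET on the document).
        String rev = "1-23202479633c2b380f79506a776743d5";
        URL url = new URL("http://localhost:5984/media/log-2012-05-01/app.log.gz?rev=" + rev);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(64 * 1024);            // stream instead of buffering the file
        conn.setRequestProperty("Content-Type", "application/gzip");

        try (InputStream in = new FileInputStream("app.log.gz");
             OutputStream out = conn.getOutputStream()) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        System.out.println("CouchDB responded with HTTP " + conn.getResponseCode());
    }
}
```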
A similar approach is possible with Mongo's GridFS, but you would have to build the API yourself.
Also HDFS is a very nice choice.
We are trying to duplicate one of our Informix databases on a test server, but without Informix expertise in house we can only guess what we need to do. I am learning this stuff on the fly myself and am nowhere near the expertise level needed to operate Informix efficiently, or even inefficiently. Anyhow...
We managed to copy the .dat and .idx files off the live server. We installed Linux and the latest Informix Dynamic Server on the test machine and have it up and running.
Now what should we do with the .dat and .idx files from the live server? Do we copy them somewhere so they are recognized automatically?
Or is there something equivalent to MS SQL Server's "attach database" to register the database files in the new instance?
I'm at the end of my rope...
You've asked a pretty complicated question without realizing it. Informix is architected as a shared-everything database engine, meaning all resources available to the instance are available to every database in that instance. This means that more than one database can store data in any given dbspace (.dat or .idx file, in your case). Most DBAs know better than to do that, but it's something to be aware of. Given this, you now know that the .dat and .idx files do not belong to a database but to the instance. The dbspaces and files were created to contain your database's data, but they technically belong to the instance. It's worth noting that the .dat and .idx files are known to the database by the logical dbspace name.
Armed with this background info, and assuming that the production and development servers are running the same OS and that your hardware is relatively the same (not a combination of PA-RISC, Itanium, and x86/x64), I'll throw a couple of options out for you:
1. Create the dbspaces that you need in the new instance and use onunload and onload to copy the database from production to development.
2. Use ontape or onbar to back up the entire production instance and restore it over your development instance.
Option 1 requires that you know what the dbspaces are named and how large they are. Use onstat -d on the production instance to find this out. BTW, the numbers listed in onstat -d are in pages; I believe Linux uses a 2K page.
Option 2 simply requires that the paths for the data files are the same on both servers. This means that the ROOTDBS needs to be the same in both instances. It can be found by executing onstat -c | grep ROOTDBS.
There's a lot that has been left out but I hope that this gives you the info that you need to move forward with your task.
The .dat and .idx files are associated with C-ISAM, or, when organized in a directory called dbase.dbs (where dbase is the name of your database), the .dat and .idx files are associated with Informix Standard Engine, aka Informix SE. SE uses C-ISAM to manage its storage. SE is rather different from (and much simpler than) Informix Dynamic Server (IDS). It is not impossible that the .dat and .idx files are associated with IDS; it is just extremely unlikely.
From the information available, it sounds as though your production server is running SE. To get the data from SE to IDS, you will probably want to use DB-Export at the SE end and DB-Import at the Linux/IDS end. Certainly, that is the simplest way to do it.
There are other possible solutions - C-ISAM datablade being one such - but they are more expensive and probably not warranted. There are other possible loading solutions, such as HPL (High-Performance Loader).
For more information about Informix, either use the various web sites already referenced (http://www.informix.com is a link to the Informix section of IBM's web site), or use the International Informix User Group (IIUG) web site. There are mailing lists available (which require you to belong, but membership is free) for discussing Informix in detail.
Those Informix-SE data files (.DAT) and their associated index files (.IDX) are useless unless you also have all the associated catalog files, such as SYSTABLES.DAT and SYSTABLES.IDX, SYSCOLUMNS, SYSINDEXES, etc.
Then you also have to worry about which version of Informix-SE created them, as some have a 2K or 4K index file node size.
Your best approach is to obtain all the .DAT and .IDX files from the source DB, plus the correct version of Standard Engine, installed on the same hardware and operating system they came from.
Long story short: on the source machine, run "dbexport" to unload all the data to ASCII files, and run "dbschema" to generate all the table schemas and indexes. It also wouldn't hurt to run "bcheck" on all the files before unloading them to ASCII flat files.
I don't have any Informix-specific advice but for situations like this you can usually find the answer by looking up how to move a database (a common admin task, and usually well described in the manual) and just skipping the steps that would remove the old database.
Also, be careful of problems caused by different system architectures; some DBs fail spectacularly if you move them from a big-endian system (such as Solaris) to a little-endian system (such as x86 Linux). Again, the manual section on moving a DB should cover any extra steps that are needed.