According to the code segment below, PostgreSQL (version REL_13_STABLE) seems to use "inversion" to describe large objects, but I don't understand what "inversion" means here, or why it is used to describe large objects.
/*
* Read/write mode flags for inversion (large object) calls
*/
#define INV_WRITE 0x00020000
#define INV_READ 0x00040000
The name stems from the Inversion file system, an academic project from the POSTGRES ecosystem at the University of California, Berkeley. Read the original paper for an explanation:
Conventional file systems handle naming and layout of chunks of user data. Users may move around in the file system's namespace, and may typically examine a small set of attributes of any given chunk of data. Most file systems guarantee some degree of consistency of user data. These observations make it possible to categorize conventional file systems as rudimentary database systems.

Conventional database systems, on the other hand, allow users to define objects with new attributes, and to query these attributes easily. Consistency guarantees are typically much stronger than in file systems. Database systems frequently use an underlying file system to store user data. Virtually no commercially-available database system exports a file system interface.

This paper describes the design and implementation of a file system built on top of a database system. This file system, called “Inversion” *because the conventional roles of the file system and database system are inverted*, runs on top of POSTGRES [MOSH92] version 4.0.1. It supports file storage on any device managed by POSTGRES, and provides useful services not found in many conventional file systems.
(The emphasis is mine.)
It seems that (inversion) large objects were originally intended as the basis for a file system implementation.
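For what it's worth, the flags still surface unchanged in client APIs today. Here is a minimal sketch (not production code) using the PostgreSQL JDBC driver's LargeObjectManager, whose READ and WRITE constants carry the same hex values as INV_READ and INV_WRITE; the connection details are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.largeobject.LargeObject;
import org.postgresql.largeobject.LargeObjectManager;

public class InversionFlagsDemo {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "postgres", "secret");
        conn.setAutoCommit(false); // large object calls must run inside a transaction

        LargeObjectManager lom = conn.unwrap(PGConnection.class).getLargeObjectAPI();

        // LargeObjectManager.WRITE (0x00020000) and READ (0x00040000)
        // mirror the server's INV_WRITE / INV_READ mode flags.
        long oid = lom.createLO(LargeObjectManager.READ | LargeObjectManager.WRITE);
        LargeObject lo = lom.open(oid, LargeObjectManager.WRITE);
        lo.write("hello, inversion".getBytes());
        lo.close();

        conn.commit();
        conn.close();
    }
}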
Related
The output of WordCount is being stored in multiple files. However, the developer has no control over where (IP, path) the files end up on the cluster.
In the MapReduce API there is a provision for developers to write a reduce program to address this. How can this be handled in Apache Beam with the DirectRunner or any other runner?
Indeed -- the WordCount example pipeline in Apache Beam writes its output using TextIO.Write, which doesn't (by default) specify the number of output shards.
By default, each runner independently decides how many shards to produce, typically based on its internal optimizations. The user can, however, control this via the .withNumShards() API, which forces a specific number of shards. Of course, forcing a specific number may require more work from a runner, which may or may not result in somewhat slower execution.
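For example, a minimal sketch with the Beam Java SDK; the bucket paths are placeholders and the actual word-count transforms are elided:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FixedShardsWordCount {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*"))
         // ... the usual word-count transforms would go here ...
         .apply("WriteCounts", TextIO.write()
                 .to("gs://my-bucket/output/counts")
                 .withNumShards(1)); // force a single output shard

        p.run().waitUntilFinish();
    }
}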
Regarding "where the files stay on the cluster" -- it is Apache Beam's philosophy that this complexity should be abstracted away from the user. In fact, Apache Beam raises the level of abstraction such that user don't need to worry about this. It is runner's and/or storage system's responsibility to manage this efficiently.
Perhaps to clarify -- we can make an easy parallel with low-level programming (e.g., direct assembly), vs. non-managed programming (e.g., C or C++), vs. managed (e.g., C# or Java). As you go higher in abstraction, you no longer can control data locality (e.g., processor caching), but gain power, ease of use, and portability.
I have some doubts regarding inodes vs. vnodes. As far as my understanding goes, an inode is the representation of a file that is used by the Virtual File System, whereas vnodes are file system specific. Is this correct?
Also, I am confused about whether the inode is a kernel data structure, i.e., whether it is an in-memory data structure or a data structure that exists in blocks on an actual disk.
To add something to this: the best explanation I could find for a vnode is in the FreeBSD docs. For the more academically minded there is also the original paper that introduced the concept, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX", which is a much more in-depth resource.
That said, vnodes were originally created for Sun's UNIX (see the paper above) because of the proliferation of different types of filesystems that needed to be used, like UFS, NFS, etc. Vnodes are meant to provide an abstraction across all possible filesystems so that the OS can interface with them, and so that kernel functions don't have to specifically support every filesystem under the sun; they only have to know how to interact with a file's vnode.
Back to your original question: vnodes, as Allen Luce mentioned, are in-memory abstractions and are not filesystem specific. They can be used interchangeably with UFS, ext4, and whatever else. In contrast, inodes are stored on disk and are specific to the exact filesystem being used. Inodes contain metadata about the file, like size, owner, and pointers to block addresses, among other things. Vnodes contain some data about the file, but only attributes that do not change over the file's lifetime, so an inode would be the place to look if you wanted the most information possible about a file. If you're more interested in inodes, I'd suggest the Wikipedia article on them, which is good.
Typically (like for Linux and BSD mainstream filesystems), an inode is first an on-disk structure that describes the storage of a file in terms of that disk (usually in blocks). A vnode is an in-memory structure that abstracts much of what an inode is (an inode can be one of its data fields) but also captures things like operations on files, locks, etc. This lets it support non-inode based filesystems, in particular networked filesystems.
It depends on your OS. On a Linux system, for instance, there is no vnode, only a generic inode (struct inode) which, although conceptually similar to a vnode, is implemented differently.
For BSD-derived and UNIX kernels, the vnode points to an inode structure specific to the filesystem, along with some additional information, including pointers to functions that operate on the file and metadata not included in the inode. A major difference is that the inode is file system specific while the vnode is not. (In Linux, as mentioned above, there is both a system-independent inode and a file system-dependent inode.)
An inode is not a kernel data structure; the vnode/generic inode is, however, since it is the in-kernel representation of the inode.
The concept of vnode differs to a degree depending on what system you're under, everyone just kind of took the name and ran with it.
Under Linux and Unix, you can consider the abstraction as follows.
Suppose there exists a file f.tmp. You want it to stick around while your program is running, because you're accessing it, but if your program ends or crashes, you want to be sure it goes away.

You can do this by opening f.tmp and then calling unlink() on it. You will still retain a reference to it, even though its inode now has zero directory entries and so has been marked free. The operating system still retains where the file starts and its allocation state until your program ends. This "virtualization" of an inode that no longer exists is a vnode.
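A minimal Java sketch of this trick, assuming a POSIX system where File.delete() maps to unlink() (on Windows, deleting an open file fails instead):

import java.io.File;
import java.io.RandomAccessFile;

public class UnlinkedFileDemo {
    public static void main(String[] args) throws Exception {
        File f = new File("f.tmp");
        RandomAccessFile raf = new RandomAccessFile(f, "rw"); // open first
        if (!f.delete()) {                                    // unlink() under the hood
            throw new IllegalStateException("delete failed; POSIX-only trick");
        }

        // The directory entry is gone and the inode's link count is zero,
        // but the open descriptor (via the vnode) keeps the data alive.
        raf.write("still here".getBytes());
        raf.seek(0);
        byte[] buf = new byte[10];
        raf.readFully(buf);
        System.out.println(new String(buf)); // prints "still here"

        raf.close(); // only now can the kernel reclaim the blocks
    }
}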
Another common situation for this is when you're reading a resource that disappears out from under you. Suppose you're watching a movie, while it is streaming to a temporary location. When the movie is completely downloaded, it will be relocated to another volume for storage. Somehow you can continue watching and scrubbing through the movie so long as it remains open. In this case even though there are again no links, since there is a vnode, this inode can't be cleaned up yet.
This depends on the operating system and the file system you are using or working on. For instance, VxFS and AdvFS inodes are nothing but on-disk data structures called vnodes. In general, both refer to file metadata.
Simply put, the in-memory vnode is just an inode cache that stores information about the file (the inode itself is typically stored on disk) so that it can be accessed more quickly.
I'm developing an open source product and need an embedded dbms.
Can you recommend an embedded open source database that ...
Can handle objects over 10 GB each
Has a license friendly to embedding (LGPL, not GPL).
Is pure Java
Is (preferably) NoSQL. SQL might work, but I'd prefer NoSQL
I've looked over some of the document DBMSs, like MongoDB, but they seem to be limited to 4 or 16 MB documents. Berkeley DB looked attractive but has a GPL-like license. SQLite3 is attractive: good license, and you can compile with whatever max blob size you like. But it's not Java, and while JDBC drivers exist, we need a pure Java system.
Any suggestions?
Thanks
Steve
Although it's an old question, I've been looking into this recently and have come across the following (at least two of which were written after this question was asked). I'm not sure how any of these handle very large objects - and at 10GB you would probably have to do some serious testing, as I presume few database developers would have objects of that size in mind for their products (just a guess). I would definitely consider storing them to disk directly, with just a reference to the file location in your database.
(Opinions below are all pretty superficial, by the way, as I haven't used them in earnest yet).
OrientDB looks like the most mature of the three I found. It appears to be a document and/or graph database and claims to be very fast and light (making use of an "RB+Tree" data structure, a combination of B+ and red-black trees), with no external dependencies. There seems to be an active community developing it, with lots of commits over the last few days, for example. It's also compliant with the TinkerPop graph database standard, which adds another layer of features (such as the Gremlin graph query language). It's ACID compliant, has REST and other external APIs, and even a web-based management app (which presumably could be deployed with your embedded DB, but I'm not sure).
The next two fall more into the simple key-value store camp of the N(ot)O(nly)SQL world.
JDBM3 is an extremely minimal data store: it has a hash map, tree map, tree set and linked list which are written to disk through memory-mapped files. It claims to be very light and fast, is fully transactional, and is being actively developed.
HawtDB looks similarly very simple and fast: a BTree- or hash-based index persisted to disk with memory-mapped files. It's (optionally) fully transactional. There has been no commit in the past seven months (as of end March 2012) and there's not much activity on the mailing list. That's not to say it's not a good library, but it's worth mentioning.
JDBM3 and HawtDB are pretty minimal, so you're not going to get any fancy GUIs. But I think they both look very attractive for their speed and simplicity.
Those are all I've found matching your requirements. In addition, Neo4j is great: a graph database which is now pretty mature and works very well in embedded mode. It's GPL/AGPL licensed, though, so it may require a paid license unless you can open source your code too:
http://neotechnology.com/products/price-list/
Of course, you could also use the H2 SQL database with one big table and no indices!
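If you went that route, here is a minimal sketch of streaming a large object into a BLOB column in embedded H2; the database path, table layout, and file name are all just placeholders:

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class H2BlobStore {
    public static void main(String[] args) throws Exception {
        // Embedded, file-based H2 database.
        Connection conn = DriverManager.getConnection("jdbc:h2:./objstore", "sa", "");
        conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS objects(" +
                "id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, data BLOB)");

        try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO objects(data) VALUES (?)");
             FileInputStream in = new FileInputStream("big-object.bin")) {
            ps.setBinaryStream(1, in); // streamed, so the object needn't fit in the heap
            ps.executeUpdate();
        }
        conn.close();
    }
}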
Many file storage systems use hashes to avoid duplicating the same file content data (among other reasons); e.g., Git addresses content by SHA-1 hash, and Dropbox by SHA-256. The file names and dates can be different, but as long as the content produces the same hash, it never gets stored more than once.
It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
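For illustration, this is the kind of content hashing I mean, as a minimal Java sketch; it hashes whole files in memory, which a real system would avoid, and the directory comes from the command line:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Group files under a directory by a SHA-256 digest of their content;
// files with the same digest are duplicates regardless of name or date.
public class DupFinder {
    public static void main(String[] args) throws Exception {
        List<Path> paths;
        try (Stream<Path> s = Files.walk(Paths.get(args[0]))) {
            paths = s.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        Map<String, List<Path>> byHash = new HashMap<>();
        for (Path p : paths) {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                                         .digest(Files.readAllBytes(p));
            byHash.computeIfAbsent(toHex(digest), k -> new ArrayList<>()).add(p);
        }

        byHash.values().stream()
              .filter(group -> group.size() > 1)
              .forEach(group -> System.out.println("duplicates: " + group));
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}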
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
ZFS has supported deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by Solaris and FreeBSD), it is definitely one worth looking at.
It would save space, but the time cost is prohibitive. The products you mention are already I/O bound, so the computational cost of hashing is not a bottleneck there. If you hashed at the filesystem level, all I/O operations, which are already slow, would get worse.
NTFS has Single Instance Storage.
NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. It is one of the most important features found in enterprise filesystems today (and NetApp stands out because it supports this on primary storage, as compared to other similar products which support it only on backup or secondary storage; those are too slow for primary storage).
The amount of duplicate data in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% of data deduplicated are common, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, you get deduplication for free, no matter what filesystem is created in that LUN. Cheers.
btrfs supports online deduplication of data at the block level; an external tool is needed, and I'd recommend duperemove.
It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy while the other remains intact -- so when you eliminate the duplication, the hard link you create that way would have to provide copy-on-write (COW) semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, the permissions are correctly applied based on the link, not just on the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least to matter). With hard drives currently running less than US$100 per terabyte, I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on the basis of relatively small blocks of a file, not on complete files.

Under such a system, a "file" basically just becomes a collection of pointers to deduplicated blocks. Along with the data, each block will typically have some metadata for the block itself that's separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count). Any block that has a reference count greater than 1 will be treated as copy-on-write. That is, an attempt at writing to that block will typically create a copy, write to the copy, then store that copy of the block to the pool (so if the result comes out the same as some other existing block, deduplication will coalesce it with the existing block with the same content).
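Here is a toy sketch of that bookkeeping: a hash-keyed block pool with reference counts and copy-on-write. All names are invented for illustration, and real concerns (durability, hash collisions, fixed block sizes) are ignored:

import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Toy content-addressed block store: blocks are keyed by a SHA-256 hash
// of their content, and a reference count records how many "files" point
// at each block. Writes to shared blocks are copy-on-write.
public class BlockStore {
    private final Map<String, byte[]> blocks = new HashMap<>();
    private final Map<String, Integer> refCounts = new HashMap<>();

    // Store a block; identical content is coalesced into a single copy.
    public String put(byte[] block) throws Exception {
        String key = hash(block);
        blocks.putIfAbsent(key, block.clone());
        refCounts.merge(key, 1, Integer::sum);
        return key;
    }

    // Copy-on-write: "writing" to a block drops one reference to the old
    // content and stores the new content as a fresh (or coalesced) block.
    public String write(String key, byte[] newContent) throws Exception {
        release(key);
        return put(newContent);
    }

    // Drop a reference; reclaim the block once nothing points at it.
    public void release(String key) {
        Integer remaining = refCounts.merge(key, -1, Integer::sum);
        if (remaining != null && remaining <= 0) {
            refCounts.remove(key);
            blocks.remove(key);
        }
    }

    private static String hash(byte[] block) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(block);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}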
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.
I'm currently trying to pick a database vendor.
I'm just seeking some personal opinions from fellow database developers out there.
My question is especially targeted towards people who:
1) have used a main-memory DB (MMDB) that supports replicating to disk (hybrid) before (e.g., eXtremeDB)
or
2) have used Versant Object Database and/or Objectivity Database and/or Progress ObjectStore
and the question is really: based on your experience, which database vendor would you recommend for my application?
My application is a commercial real-time (read: high-performance) object-oriented C++ GIS kind of app, where we need to do a lot of lat/lon searches (i.e., given an area, find all matching targets within the area... an R-tree index).
The types of data that I would like to store in the database are all modeled as objects, and they make use of std::list and std::vector, so naturally an object database seems to make sense. I have read through enough articles to convince myself that a traditional RDBMS probably isn't what I'm really looking for, in terms of:
performance (joins or multiple tables for dynamic-length data like list/vector)
ease of programming (impedance mismatch)
However, in terms of performance:
Input data is fed into the system at about 40 MB/s.
Hence, the system will also be inserting into the database at roughly 350 inserts per second (where each object varies from 64 KB to 128 KB).
The database will continually be searched and updated by multiple threads.
From my understanding, all of the object DBs I have listed here use a cache for storing database objects. eXtremeDB claims that since it's designed especially for memory, it can avoid the overhead of caching logic, etc. See more by googling: Main Memory vs. RAM-Disk Databases: A Linux-based Benchmark
So... I'm just a bit confused. Can object DBs be used in a real-time system? Are they as "fast" as an MMDB?
Fundamentally, the difference between an MMDB and an OODB is that the MMDB has the expectation that all of its data is based in RAM, but persisted to disk at some point, whereas an OODB is more conventional in that there's no expectation of the entire DB fitting into RAM.
The MMDB can leverage this by giving up on the idea that the persisted data has to "match" the in-RAM data.
The way anything with persistence works is that it has to write the data to disk on update, in some fashion.
Almost all DBs use some kind of log for this. These logs are basically "raw" pages of data, or perhaps individual transactions, appended to a file. When the file gets "too big", a new file is started.
Once the logs are properly consolidated in to the main store, the logs are discarded (or reused).
Now, a crude in-RAM DB can exist simply by appending transactions to a log file; when it's restarted, it just loads the log into RAM. So, in essence, the log file IS the database.
The downside of this technique is that the more transactions you have, the bigger your log/DB is, and thus the longer the DB startup takes. But, ideally, you can also "snapshot" the current state, which eliminates all of the logs up to that point and effectively compresses them.
In this manner, all the DB's routine operations have to manage is appending pages to logs, rather than updating other disk pages, index pages, and so on. And since, ideally, most systems don't need to "start up" that often, startup time is perhaps less of an issue.
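To make the log-structured idea concrete, here is a toy sketch of a "log file IS the database" store; everything here (class and method names, the tab-separated record format) is invented for illustration, and durability and concurrency are ignored:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Toy in-RAM key/value store persisted as an append-only log: every put()
// is appended to the log, and a restart replays the log to rebuild the map.
public class LogStore {
    private final Map<String, String> data = new HashMap<>();
    private final PrintWriter log;

    public LogStore(File logFile) throws IOException {
        if (logFile.exists()) {
            replay(logFile); // startup cost grows with the log, hence snapshots
        }
        log = new PrintWriter(new FileWriter(logFile, true));
    }

    private void replay(File logFile) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(logFile))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) data.put(kv[0], kv[1]);
            }
        }
    }

    public void put(String key, String value) {
        log.println(key + "\t" + value); // append-only: no other disk pages touched
        log.flush();                     // a real system would fsync for durability
        data.put(key, value);
    }

    public String get(String key) {
        return data.get(key);
    }

    // A "snapshot" would rewrite the log with only the live entries,
    // compressing history and shortening the next startup's replay.
}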
So, in this way, an MMDB can be faster than an OODB, which has a different contract with the disk, maintaining logs and disk pages. An OODB can be slower even if the entire DB fits into RAM and is properly cached, simply because it incurs disk operations outside of the log operations during normal use, vs. an MMDB where those operations happen as a "maintenance" task that can be scheduled during down time and/or quiet time.
As to whether either of these systems can meet your actual performance needs, I can't say.
The back ends of databases (reader and writer processes, caching, lock management, transaction log files, ACID semantics) are much the same, so RDBs and OODBs are actually very similar there. The difference is the interface presented to the application programmer.

Is your data model complicated, consisting of lots of classes with real inheritance relationships? Then OO is good. Is it relatively flat and simple? Then go RDB. What is the nature of the relationships? If they are pointer-like and set-like, go RDB; if they are more complicated, like (ordered) lists, arrays, or maps, you should go OO.

Also, do you have a stand-alone application with no need to integrate with other apps? Then OO is OK. Do you have to share data with other apps (i.e., several apps accessing the same database)? Then that's a deal-breaker for OO, and you should stick with an RDB. Is the schema of your database stable, or do you expect it to evolve frequently? OODBs are bad at schema evolution, so if you expect frequent changes, stick with an RDB.