I am building an application based on a single table with a text column. Occasionally, an adjacent column will contain an image. Is it better to store this image as a BLOB in SQLite, or should I store the images on the file system and reference them from my program?
Thanks!
Files will cause you fewer problems in the long run. You really don't want to be serving tons of files from your database server, especially as you scale.
Assuming the images you are going to use are not extremely large and there isn't an exorbitant number of them, I would go with the database.
I am currently using a SQLite database on several different Windows Mobile and WinCE devices, with over 10,000 small images stored as blobs, and it is working great.
I have seen software similar to ours running on the same hardware using file-based image loading, and it was significantly slower. Of course, that was on WinCE and different software, so it is not the best test.
I find the single database is much easier to work with than many image files.
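For what it's worth, the blob approach is only a few lines in most environments. Here's a minimal sketch using Python's sqlite3 module; the table and column names are made up for illustration:

    # Minimal sketch: text rows with an optional image blob in SQLite.
    # Table and column names are hypothetical.
    import sqlite3

    conn = sqlite3.connect("content.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id INTEGER PRIMARY KEY, body TEXT, image BLOB)"
    )

    def save_entry(body: str, image_path: str | None = None) -> None:
        """Insert a text row, attaching the image bytes when one exists."""
        image = None
        if image_path:
            with open(image_path, "rb") as f:
                image = f.read()
        conn.execute("INSERT INTO entries (body, image) VALUES (?, ?)",
                     (body, image))
        conn.commit()

    def load_image(entry_id: int) -> bytes | None:
        """Return the raw image bytes for a row, or None if there is no image."""
        row = conn.execute("SELECT image FROM entries WHERE id = ?",
                           (entry_id,)).fetchone()
        return row[0] if row else None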
EDIT:
Didn't realize you meant the iPhone environment specifically. In that case I would use the DB, just for the simplicity of having all content in one place. You won't have to worry about scalability, because it's not like your iPhone is going to be used as a server or anything.
Original Response:
I don't have any links to back this up, but I do recall reading in several studies that the "cut-off" for blob efficiency is 1 MB, though this can move up to 10 MB with a fast enough disk array. It totally depends on the system.
So basically, any data smaller than your efficiency cutoff is better served by the DB; anything larger, just index it in the DB and leave the bytes in a file cache.
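As a rough illustration of that cutoff rule, here's a hedged sketch in Python that keeps small payloads in SQLite and spills large ones to a file cache; the 1 MB threshold and all names here are just placeholders:

    # Hedged sketch of the cutoff idea: blobs below a size threshold go in
    # the database, larger files go to disk with only their path indexed.
    import os
    import sqlite3

    BLOB_CUTOFF = 1 * 1024 * 1024  # 1 MB, per the studies mentioned above
    FILE_CACHE = "cache"

    conn = sqlite3.connect("store.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assets ("
        "name TEXT PRIMARY KEY, data BLOB, path TEXT)"
    )

    def store_asset(name: str, data: bytes) -> None:
        if len(data) < BLOB_CUTOFF:
            # Small enough: keep the bytes in the database itself.
            conn.execute(
                "INSERT OR REPLACE INTO assets (name, data) VALUES (?, ?)",
                (name, data))
        else:
            # Too big: write to the file cache and index only the path.
            os.makedirs(FILE_CACHE, exist_ok=True)
            path = os.path.join(FILE_CACHE, name)
            with open(path, "wb") as f:
                f.write(data)
            conn.execute(
                "INSERT OR REPLACE INTO assets (name, path) VALUES (?, ?)",
                (name, path))
        conn.commit()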
I like to keep images in the file system, because UIImage can cache image files and dump them from memory automatically when necessary. Just be careful not to change or delete an image file that is loaded into a UIImage, or you will get crashes or other weird bugs.
It really depends on your application. Having the images stored in a database will make your life easier, as you have them readily accessible in a single place instead of in separate files that might go missing. On the other hand, many rather large images might prove too much for a SQLite database. In your situation I would just reference them in the database.
Related
I'm implementing a web scraper that needs to scrape and store about 15GB+ of HTML files a day. The amount of daily data will likely grow as well.
I intend on storing the scraped data as long as possible, but would also like to store the full HTML file of every page for at least a month.
My first implementation wrote the HTML files directly to disk, but that quickly ran into inode limit problems.
The next thing I tried was using Couchbase 2.0 as a key/value store, but the Couchbase server would start to return Temp_OOM errors after 5-8 hours of web scraping writes. Restarting the Couchbase server is the only route for recovery.
Would MongoDB be a good solution? This article makes me worry, but it does sound like their requirements are beyond what I need.
I've also looked a bit into Cassandra and HDFS, but I'm not sure if those solutions are overkill for my problem.
As for querying the data, as long as I can get the page data for a specific URL and date, I'm good. The data is mostly write once, read once, then stored for possible future reads.
Any advice pertaining to storing such a large amount of HTML files would be helpful.
Assuming 50 kB per HTML page, 15 GB daily gives you 300,000+ pages per day, or about 10 million per month.
MongoDB will definitely work well with this data volume. As for its limitations, it all depends on how you plan to read and analyze the data. Given that amount of data, you could take advantage of its map/reduce features.
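To make that concrete, here's one possible document layout with pymongo; the database, collection, and field names are assumptions, not a prescription:

    # Illustrative "one document per scraped page" layout with pymongo.
    from datetime import datetime, timezone
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    pages = client.scraper.pages

    # Compound index so "give me this url on this date" stays fast at
    # ~10 million documents a month.
    pages.create_index([("url", ASCENDING), ("fetched_at", ASCENDING)])

    def store_page(url: str, html: str) -> None:
        pages.insert_one({
            "url": url,
            "fetched_at": datetime.now(timezone.utc),
            "html": html,  # ~50 kB average, well under the 16 MB document limit
        })

    def get_page(url: str, day_start: datetime, day_end: datetime):
        """Fetch the page stored for a given url within a date window."""
        return pages.find_one({
            "url": url,
            "fetched_at": {"$gte": day_start, "$lt": day_end},
        })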
However, if your problem may scale further, you may want to consider other options. It might be worth noting that Google's search engine uses BigTable as its storage for HTML data. In that sense, Cassandra could be a good fit for your use case: it offers excellent, persistent write/read performance and scales horizontally well beyond your data volume.
I'm not sure what deployment scenario produced those errors when you used Couchbase; more investigation may be needed to find what is causing the problem. Trace the errors back to their source, because, per the requirements described above, it should work fine and shouldn't fall over after 5 hours (unless you have a storage problem).
I suggest that you give MongoDB a try; it is very powerful, well suited to what you need, and shouldn't struggle with the requirements you mentioned above.
You can use HDFS, but you don't really need it when MongoDB (or even Cassandra) can do the job.
I believe WordPress stores multiple copies of posts as "revisions", but isn't that a terribly inefficient use of space?
Is there a better way? I think gitit is a wiki that uses Git for version control, but how is that done? E.g., my application is in PHP; must I make it talk to Git to commit and retrieve data?
So, what is a good way of implementing version control in web apps (e.g., in a blog it might be the post content)?
I've recently implemented just such a system, using the concept of superseded records together with previous and current links (a rough sketch follows below). I did a considerable amount of research into how best to achieve this; in the end, the model I arrived at is similar to WordPress (and other systems): store each change as a new record and use that.
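Here's a simplified sketch of the superseded-record idea, shown with SQLite for brevity; the schema and names are illustrative only, not a production design:

    # Each save inserts a new row; the old row is marked superseded and the
    # new row points back at it via previous_id.
    import sqlite3

    conn = sqlite3.connect("posts.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS post_revisions (
        id          INTEGER PRIMARY KEY,
        post_id     INTEGER NOT NULL,
        content     TEXT NOT NULL,
        previous_id INTEGER REFERENCES post_revisions(id),
        is_current  INTEGER NOT NULL DEFAULT 1
    );
    """)

    def save_revision(post_id: int, content: str) -> int:
        """Insert a new current revision and mark the old one as superseded."""
        row = conn.execute(
            "SELECT id FROM post_revisions WHERE post_id = ? AND is_current = 1",
            (post_id,)).fetchone()
        previous_id = row[0] if row else None
        if previous_id is not None:
            conn.execute("UPDATE post_revisions SET is_current = 0 WHERE id = ?",
                         (previous_id,))
        cur = conn.execute(
            "INSERT INTO post_revisions (post_id, content, previous_id) "
            "VALUES (?, ?, ?)", (post_id, content, previous_id))
        conn.commit()
        return cur.lastrowid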
Considering all of the options available, space is really the last concern for authored content such as posts; media files take up far more space and can't be stored as deltas anyway.
In any case, the way Git works is virtually identical, in that it stores the entire content of every revision and only packs things down into deltas eventually (or when you ask it to).
Going back to 1990, we were using SCCS or RCS, and with sometimes only 30 MB of free disk space we really needed version control to be space-efficient to avoid running out of storage.
Using deltas to save space is not really worth the associated aggravation, given the average amount of storage available on modern systems. You could argue it's wasteful of space; I'd argue that storing things uncompressed in their original form is much more efficient in the long run:
- it's faster
- it's easier to search through old versions
- it's quicker to view
- it's easier to jump into the middle of a set of changes without having to process a lot of deltas
- it's a lot easier to implement, because you don't have to write delta-generation algorithms
Also, markup doesn't fare as well as plain text with deltas, especially when edited with a WYSIWYG editor.
Keep one table with only the most recent version of each article (for example).
When a new version is saved, move the current one into an archive table and give it a version number, keeping the most recent version in the first table.
The archive table can be created with ROW_FORMAT=COMPRESSED (a MySQL InnoDB example) to take up less space, and that won't be a performance issue since it is rarely accessed. Yes, storing whole copies rather than changesets carries some overhead, but if you do the math you can keep a huge number of revisions in almost no space, since your articles are highly compressible text anyway.
For example, the source of this entire page is 11 kB compressed. That gives you almost 100 versions per megabyte. Normal articles are quite a bit smaller, and may on average give you 500-1000 articles/versions per megabyte. You can probably afford that.
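A minimal sketch of that two-table scheme, using SQLite for brevity (on MySQL/InnoDB you'd add ROW_FORMAT=COMPRESSED to the archive table as described above); all names are illustrative:

    # "articles" holds only the latest version; "articles_archive"
    # accumulates old ones with a version number.
    import sqlite3

    conn = sqlite3.connect("cms.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS articles (
        id      INTEGER PRIMARY KEY,
        body    TEXT NOT NULL,
        version INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE IF NOT EXISTS articles_archive (
        article_id INTEGER NOT NULL,
        body       TEXT NOT NULL,
        version    INTEGER NOT NULL,
        PRIMARY KEY (article_id, version)
    );
    """)

    def save_new_version(article_id: int, new_body: str) -> None:
        """Move the current row into the archive, then overwrite it in place."""
        conn.execute(
            "INSERT INTO articles_archive (article_id, body, version) "
            "SELECT id, body, version FROM articles WHERE id = ?",
            (article_id,))
        conn.execute(
            "UPDATE articles SET body = ?, version = version + 1 WHERE id = ?",
            (new_body, article_id))
        conn.commit()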
I keep a content revision history for a certain content type. It's stored in MongoDB, but since the data is not frequently accessed I don't really need it there taking up memory; I'd rather put it in a slower, disk-based database.
Which database should I put it in? I'm looking for something really cheap with cloud hosting available, and I don't need speed. I'm looking at SimpleDB, but it doesn't seem very popular. An RDBMS doesn't seem easy to work with, since my data is structured into documents. What are my options?
Thanks
Depends on how often you want to look at this old data:
Why not mongodump it to your local disk and mongorestore it when you want it back?
Documentation here
OR
Set up a local mongo instance and clone the database using the information here
Based on your questions and comments, you might not find the perfect solution. You want free or dirt cheap storage, and you want to have your data available online.
There is only one solution I can see feasible:
Stick with MongoDB. SimpleDB does not allow you to store documents, only key-value pairs.
You could create a separate collection for your history and use a cloud service that gives you a free tier. For example, http://MongoLab.com offers a 240 MB free tier.
If you exceed the free tier, you can look at discarding the oldest data, moving it to offline storage, or start paying for what you are using.
If your data grows a lot, you will have to decide whether to pay for it, keep it available online or offline, or discard it.
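To illustrate the separate-collection idea, here's a hypothetical pymongo sketch that parks old revisions in a history collection so the hot collection stays small; the collection and field names are assumptions:

    # Move all but the newest revisions of one document into history.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client.cms
    live, history = db.revisions, db.revisions_history

    def archive_old_revisions(content_id, keep_latest: int = 1) -> int:
        """Copy stale revisions into the history collection, then delete them
        from the live collection. Returns how many were moved."""
        old = list(live.find({"content_id": content_id})
                       .sort("created_at", -1)
                       .skip(keep_latest))
        if old:
            history.insert_many(old)
            live.delete_many({"_id": {"$in": [doc["_id"] for doc in old]}})
        return len(old)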
If you are dealing with a lot of large objects (BLOBs or CLOBs), you can also store the 'non-indexed' data separately from the database. This keeps the database both cheap and fast, and the large objects can be retrieved from any cheap storage when needed.
Cloudant.com is pretty cool for hosting your DB in the cloud; it uses BigCouch, which is a NoSQL thing. I'm using it for a social site I have in the works, as CouchDB (BigCouch) similarly has an open-ended structure and you talk to it via JSON. It's pretty awesome stuff, though weird to move from SQL to using map/reduce, but once you do, it's worth it. I did some research; I was a .NET guy for a long time, but I'm moving to Linux and Node.js, partly out of boredom and partly for the love of JavaScript. These things just fit together, because Node.js is all JavaScript on the backend and talks seamlessly to CouchDB, and the whole thing scales like crazy.
Still beavering away, slowly sorting out how things work. Today I've been looking at persistent stores and managed objects. I think I understand the basics of it all, but I've noticed something odd. When I save my managed object context and open up the resulting sqlite file in an editor, there are three tables there I don't expect. They're named after objects that I was originally using as managed objects, but later altered so that they weren't any more. I have no idea why they've been retained, since I've completely changed my file saving structure since then. No data gets put into these tables, but they keep cropping up. Is there any way I can remove them, or are they being added for some purpose I'm unaware of?
-Ash
Right, I have finally worked out where the extra tables were coming from, but it's a little bit weird.
Basically, I had tested a previous version of the application on OS 3.2 to ensure backward compatibility. I later changed the format of the database, but by then I had moved to testing exclusively on OS 4. Somehow, the program was checking the SQLite files in both the 4.0 and the 3.2 directories and combining them into one big database. The extra tables were also responsible for causing errors when I tried to upload a desktop version of the database to my device, because the device's copy of the database had a slightly different format, thanks to the unwanted tables not being on there.
So the moral of this story, folks, is: always delete your program files from the simulator when you want to test for problems on a different OS.
-Ash
I'm looking for an efficient means of copying an online file to the iPhone locally. Something like rsync would be ideal, since it transfers only the bytes that differ; does anyone have experience with anything that fits this description?
What I'm specifically trying to accomplish is the transfer of an SQLite database hosted via http to the iPhone. I expect the database to grow in time; hence my search for an efficient means of transferring the data to the iPhone.
I would have the app query the server for changes and then download and store those changes in the local database. Treating the database as an opaque file risks leaving you with a partially downloaded database that is useless, and partial downloads can happen quite frequently on mobile devices.
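One hedged way to do that row-level sync: keep a monotonically increasing revision column on the server and have the app request only rows newer than what it already has. The endpoint, JSON shape, and schema below are all assumptions for illustration:

    # Incremental sync sketch: fetch only rows changed since the highest
    # revision we already hold, then upsert them locally.
    import json
    import sqlite3
    import urllib.request

    SYNC_URL = "https://example.com/changes?since={rev}"  # hypothetical endpoint

    def sync(local_db: str = "local.db") -> None:
        conn = sqlite3.connect(local_db)
        conn.execute("CREATE TABLE IF NOT EXISTS items ("
                     "id INTEGER PRIMARY KEY, body TEXT, rev INTEGER)")
        last_rev = conn.execute(
            "SELECT COALESCE(MAX(rev), 0) FROM items").fetchone()[0]

        # Server is assumed to return [{"id": ..., "body": ..., "rev": ...}]
        with urllib.request.urlopen(SYNC_URL.format(rev=last_rev)) as resp:
            changes = json.load(resp)

        conn.executemany(
            "INSERT OR REPLACE INTO items (id, body, rev) VALUES (:id, :body, :rev)",
            changes)
        conn.commit()

Because each pull is just "rows with rev greater than mine", an interrupted transfer leaves the local database consistent; the next sync simply picks up where the last one stopped.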