I am looking for more metrics that I can use to compare XML and JSON in iPhone applications using Apple's libraries. As of now I am measuring the following:
Serialization and deserialization speed
Energy consumption
Memory usage
CPU usage
Are there more metrics I can use to compare the two?
Note: metrics need to be testable. I can't include size because it's apparent that XML will always be larger than JSON irrespective of environment.
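For the speed metric in particular, the key is making the measurement repeatable: a fixed payload, many iterations, and an averaged timing. The sketch below shows that harness shape in plain Python with a made-up payload; it is not Apple's NSJSONSerialization or NSXMLParser, but the same pattern applies on the device, where you would time the native parsers and cross-check with Instruments.

import json
import time
import xml.etree.ElementTree as ET

# Made-up sample payload and iteration count for illustration only.
payload = {"id": 42, "name": "example", "values": list(range(100))}
RUNS = 1000

def average_seconds(fn):
    # Run the operation many times and report the mean per-call time.
    start = time.perf_counter()
    for _ in range(RUNS):
        fn()
    return (time.perf_counter() - start) / RUNS

json_text = json.dumps(payload)
print("JSON decode:", average_seconds(lambda: json.loads(json_text)))

xml_text = "<values>" + "".join(f"<v>{i}</v>" for i in range(100)) + "</values>"
print("XML parse:  ", average_seconds(lambda: ET.fromstring(xml_text)))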
I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem; it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless since you can fix and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated from consideration Cassandra (due to complex setup) and CouchDB (due to lack of familiar features such as indexing, which I will need initially due to my RDBMS ways of thinking).
So finally here's my actual question...
Is it worthwhile to look into native XML databases, which are not as mature as MongoDB, or should I bite the bullet and convert all the XML to JSON as it arrives and just use MongoDB?
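For the "store it untouched" approach, the loader can be almost trivially simple. A minimal pymongo sketch, assuming a local MongoDB instance and made-up database, collection, and field names:

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()                     # assumes a local MongoDB instance
docs = client["feeds"]["raw_xml"]          # hypothetical database/collection names

def store_raw(xml_text, source):
    # Keep the document exactly as it arrived; all parsing is deferred
    # to the consumers, so loader bugs cannot destroy information.
    docs.insert_one({
        "source": source,
        "received_at": datetime.now(timezone.utc),
        "xml": xml_text,                   # stored as an opaque string
    })

At 100 to 200 data elements per document, the raw XML stays far below MongoDB's 16MB per-document limit.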
You may have a look at BaseX (basex.org), which has a built-in XQuery processor and Lucene-based full-text indexing.
That Data Volume is Small
If there is no need for parallel data processing, there is no need for MongoDB. Especially with small amounts of data like 4GB, the overhead of distributing the work can easily exceed the actual evaluation effort.
4GB / 60k nodes is not large for XML databases, either. After some time getting into it, you will find XQuery to be a great tool for analyzing XML documents.
Is it Really?
Or do you get 4GB daily and have to evaluate it together with all the data you have already stored? Then you will eventually reach a volume you can no longer store and process on one machine, and distributing the work will become necessary. Not within days or weeks, but a single year will already bring you over 1TB.
Converting to JSON
What does your input look like? Does it adhere to any schema, or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are far behind what XML databases provide. On the other hand, if you only need to pull a few fields on well-defined paths and can analyze one input file after the other, MongoDB probably will not suffer much.
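If the input really does have a few well-known paths, the conversion step can stay very small and still preserve the originals. A rough sketch, assuming hypothetical element names (id, total) and MongoDB names:

import xml.etree.ElementTree as ET
from pymongo import MongoClient

coll = MongoClient()["feeds"]["orders"]    # hypothetical database/collection names

def load_one(xml_text):
    root = ET.fromstring(xml_text)
    doc = {
        # Pull only the fields the website and analytics actually need...
        "order_id": root.findtext("id"),
        "total": float(root.findtext("total") or 0),
        # ...and keep the untouched XML alongside them, so a loader bug
        # can always be fixed and re-run without data loss.
        "raw_xml": xml_text,
    }
    coll.insert_one(doc)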
Carrying XML into the Cloud
If you want to use both an XML database's capabilities for analyzing the data and a NoSQL system's capabilities for distributing the work, you could run the XML database from within that system.
BaseX is getting to the cloud with exactly the capabilities you need -- but it will probably still take some time for that feature to get production-ready.
I need a timeseries datastore and visualization platform where I can dump experiment data into hierarchical namespaces and then go back later for analysis. Saving graph templates, linking to graphs, and other features to go from analysis to presentation would be very useful. Initially I was really excited to read about Graphite and Graphiti, because they appear to fit the bill. However, the events I'm tracking are milliseconds apart and I need to keep millisecond precision without aggregation or averaging. It looks like the only way to make Graphite play nice is to aggregate up from statsd to metrics per second, which will obscure the events I'm interested in. Optional aggregation would be fine in some cases, but not always.
Cube takes events with millisecond timestamps, but Cubism appears to be a rich library and not a full-fledged platform like Graphite. It also appears to be heavily real-time oriented. If I can't find a good stack to meet my needs I'll probably use Cube to store my data, but visualizing it with batch scripts that generate piles and piles of matplotlib graphs is not fun.
Am I misinformed, or is there another framework out there which will give me decent analysis/interactivity with an arbitrary time granularity?
Cubism.js is just a front-end for Graphite (and other back-ends, like Cube), so I think it would fit your needs.
You would need to set up a Graphite system to store your metrics (rather than Cube) with the appropriate level of detail (e.g. millisecond), and then use Cubism's Graphite context to display it with the same step value.
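As a rough sketch of the feeding side, this sends a data point over Carbon's plaintext protocol on its default port 2003 (the metric name is made up, and the actual storage resolution is whatever retention you configure in storage-schemas.conf):

import socket
import time

CARBON_HOST, CARBON_PORT = "localhost", 2003   # default plaintext listener

def send_metric(path, value, timestamp=None):
    # Carbon's plaintext protocol: "metric.path value unix_timestamp\n"
    timestamp = int(timestamp if timestamp is not None else time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall(line.encode())

send_metric("experiments.run42.latency", 12.5)  # hypothetical metric name

On the display side, Cubism's Graphite context can then be configured with a step that matches the retention you chose, so one plotted point corresponds to one stored data point.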
I'm developing a PHP platform that will make heavy use of images, documents, and whatever other file formats come to mind, so I was wondering if Cassandra is a good choice for my needs.
If not, can you tell me how I should store files? I'd like to keep using Cassandra because it's fault-tolerant and auto-replicates among nodes.
Thanks for the help.
From the Cassandra wiki:
Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values.
So if your files are under 10MB you should be fine; just make sure to limit the file size, or break large files up into chunks.
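To make the chunking workaround concrete, here is a sketch using the DataStax Python driver and CQL (keyspace, table, and chunk size are made up; the wiki quote predates CQL, but one partition per file with one chunk per clustering row matches the old "row with chunks as column values" model):

from cassandra.cluster import Cluster

CHUNK_SIZE = 1 * 1024 * 1024            # 1MB chunks; pick whatever you are comfortable with

cluster = Cluster(["127.0.0.1"])        # assumes a local Cassandra node
session = cluster.connect("files_ks")   # hypothetical keyspace

# Assumed schema:
#   CREATE TABLE files (file_id text, chunk_index int, data blob,
#                       PRIMARY KEY (file_id, chunk_index));
insert = session.prepare(
    "INSERT INTO files (file_id, chunk_index, data) VALUES (?, ?, ?)")

def store_file(file_id, path):
    # Write the file as a sequence of chunk rows under one partition key.
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            session.execute(insert, (file_id, index, chunk))
            index += 1

def read_file(file_id):
    # Chunks come back ordered by the clustering column chunk_index.
    rows = session.execute(
        "SELECT data FROM files WHERE file_id = %s", (file_id,))
    return b"".join(row.data for row in rows)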
You should be OK with files of 10MB. In fact, DataStax Brisk puts a filesystem on top of Cassandra if I'm not mistaken: http://www.datastax.com/products/enterprise.
(I'm not associated with them in any way; this isn't an ad.)
More recently, Netflix has provided utilities in Astyanax, their Cassandra client, for storing files as chunked objects. Description and examples can be found here. It can be a good starting point for writing some tests with Astyanax and evaluating Cassandra as a file store.
We're in the process of building an internal, Java-based RESTful web services application that exposes domain-specific data in XML format. We want to supplement the architecture and improve performance by leveraging a cache store. We expect to host the cache on separate but collocated servers, and since the web services are Java/Grails, a Java or HTTP API to the cache would be ideal.
As requests come in, unique URI's and their responses would be cached using a simple key/value convention, for example...
KEY                                           VALUE
http://prod1/financials/reports/JAN/2007 -->  XML response of 50MB
http://prod1/legal/sow/9004              -->  XML response of 250KB
Response values for a single request can be quite large, perhaps up to 200MB, but could be as small as 1KB. The number of requests per day is small: not more than 1,000, averaging around 250. We don't have a large number of consumers; again, it's an internal app.
We started looking at MongoDB as a potential cache store, but given that MongoDB has a max document size of 8 or 16MB, we did not feel it was the best fit.
Based on the limited details I provided, any suggestions on other types of stores that could be suitable in this situation?
The way I understand your question, you basically want to cache the files, i.e. you don't need to understand the files' contents, right?
In that case, you can use MongoDB's GridFS to cache the XML as a file. This way, you can smoothly stream the file in and out of the database. You could use the URI as the 'file name', and that should do the job.
There are no (reasonable) file size limits and it is supported by most, if not all, of the drivers.
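A minimal sketch of that idea with pymongo's gridfs module (the database name is made up; overwrite-on-refresh is one possible caching policy, not the only one):

import gridfs
from pymongo import MongoClient

client = MongoClient()                  # assumes a local MongoDB instance
db = client["response_cache"]           # hypothetical database name
fs = gridfs.GridFS(db)

def cache_response(uri, xml_bytes):
    # Drop any previous version cached under the same URI, then store the new one.
    existing = fs.find_one({"filename": uri})
    if existing is not None:
        fs.delete(existing._id)
    fs.put(xml_bytes, filename=uri)

def get_cached(uri):
    # Returns the cached bytes, or None on a cache miss.
    doc = fs.find_one({"filename": uri})
    return doc.read() if doc is not None else None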
Twitter's engineering team just blogged about their SpiderDuck project that does something like what you're describing. They use Cassandra and Scribe+HDFS for their backends.
http://engineering.twitter.com/2011/11/spiderduck-twitters-real-time-url.html
The simplest solution here is just caching these pieces of data in a file system. You can use tmpfs to keep everything in main memory, or any normal file system if you want the cache to be larger than the memory you have. Don't worry: even in the latter case, the OS kernel will efficiently keep frequently used files cached in main memory. You will still have to delete old files via cron if you're using Linux.
It may seem like an old-school solution, but it could be simpler to implement and less error-prone than many others.
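A minimal sketch of such a file-system cache, assuming a hypothetical tmpfs mount at /mnt/cache and using a hash of the URI as the file name:

import hashlib
import os

CACHE_DIR = "/mnt/cache"   # hypothetical path, e.g. a tmpfs mount

def _path_for(uri):
    # Hash the URI so it becomes a safe, fixed-length file name.
    return os.path.join(CACHE_DIR, hashlib.sha256(uri.encode()).hexdigest())

def put(uri, data):
    with open(_path_for(uri), "wb") as f:
        f.write(data)

def get(uri):
    # Returns the cached bytes, or None on a cache miss.
    try:
        with open(_path_for(uri), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None

Stale entries can then be purged periodically, for example with a cron job running something like find /mnt/cache -type f -mtime +7 -delete.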
How do the memory utilization and performance for XML or SQLite compare on the iPhone?
The initial data set for our application is 500 records with no more than 750 characters each.
How well would XML compare with SQLite for accessing, say, record 397 without going through the first 396? I know SQLite3 has better methods for that, but how does the memory utilization compare?
When dealing with XML, you'll probably need to read the entire file into memory to parse it, as well as write out the entire file when you want to save. With SQLite and Core Data, you can query the database to extract only certain records, and can write only the records that have been changed or added. Additionally, Core Data makes it easy to do batched fetching.
These limited reads and writes can make your application much faster if it is using SQLite or Core Data for its data store, particularly if you take advantage of Core Data's batched fetching. As Graham says, specific numbers on performance can only be obtained by testing under your specific circumstances, but in general XML is significantly slower for all but the smallest data sets. Memory usage can also be much greater, due to the need to load and parse records you do not need at that instant.
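The difference is easy to see outside iOS as well. This plain-Python sketch (hypothetical file names and schema, using sqlite3 and ElementTree rather than Core Data) contrasts fetching one record from SQLite with having to parse the whole XML document to reach the same record:

import sqlite3
import xml.etree.ElementTree as ET

# SQLite: only the requested row is read; the rest of the file stays on disk.
conn = sqlite3.connect("records.db")              # hypothetical database file
row = conn.execute(
    "SELECT body FROM records WHERE id = ?", (397,)
).fetchone()

# XML: the entire document must be read and parsed before record 397
# can be reached, so memory use scales with the whole file.
tree = ET.parse("records.xml")                    # hypothetical file
record = tree.getroot()[396]                      # 397th child, zero-based index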
To find out how the memory usage for your application fares, you need to measure your application :). The Instruments tool will help you.