Use NIO with Image IO or Thumbnailator - thumbnails

I am planning to use Thumbnailator to generate thumbnails for large size (0.5-10MB) images.
I looked through their code and found that ImageIO is being used to create thumbnails.
I am a newbie to image files and their technicalities, and to the ImageIO package. What I would like to know is whether ImageIO uses (or can be made to use) NIO to read files and generate thumbnails. This would help increase performance when generating thumbnails, and we do have to generate a lot: 4 thumbnails per image, images ranging from 0.5 MB to 10 MB, at around 30 requests per second on average.

ImageIO uses an abstraction over streams, called ImageInputStream. Multiple implementations exist, backed by InputStream, RandomAccessFile etc.
To answer your question, yes, it's possible to create plugins for ImageIO to provide ImageInputStreams backed by NIO (FileChannel as an example). Have a look at the ImageInputStreamSpi class.
But I'm not sure if this will create much of an improvement compared to the existing implementation based on RandomAccessFile (many existing classes were retrofitted to benefit from NIO when it was introduced).
One thing that could potentially increase performance a lot is calling ImageIO.setUseCache(false), which turns off the disk cache so that in-memory caching is used instead (at the expense of more heap usage).
Unfortunately, I don't know Thumbnailator, so I can't say how these options would affect the performance in your case.
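To make the setUseCache suggestion concrete, here is a minimal sketch using only the JDK (not Thumbnailator's own API). The class name ThumbnailSketch, the thumbnail method and the maxSide parameter are made up for illustration, and bilinear scaling is just one reasonable choice; whether this helps at your request rate is something you would have to measure.

```java
import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ThumbnailSketch {

    /** Reads an image and scales it down so that its longer side is at most maxSide pixels. */
    public static BufferedImage thumbnail(Path source, int maxSide) throws IOException {
        // With the cache disabled, ImageIO buffers InputStream data in memory
        // instead of spooling it to a temporary file on disk.
        ImageIO.setUseCache(false);

        BufferedImage full;
        try (InputStream in = Files.newInputStream(source)) {
            full = ImageIO.read(in); // null if no registered reader understands the data
        }
        if (full == null) {
            throw new IOException("Unsupported image format: " + source);
        }

        // Never scale up; keep the aspect ratio.
        double scale = Math.min(1.0, (double) maxSide / Math.max(full.getWidth(), full.getHeight()));
        int w = Math.max(1, (int) Math.round(full.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(full.getHeight() * scale));

        BufferedImage thumb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = thumb.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(full, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return thumb;
    }
}
```

Since you need 4 thumbnails per image, it is probably worth decoding the source once and scaling the decoded BufferedImage four times, rather than calling a method like this four times.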


Data management in matlab versus other common analysis packages

I am analyzing large amounts of data using an object oriented composition structure for sanity and easy analysis. Often times the highest level of my OO is an object that when saved is about 2 gigs. Loading the data into memory is not an issue always, and populating sub objects then higher objects based on their content is much more java memory efficient than just loading in a lot of mat files directly.
The Problem:
Saving these objects that are > 2 gigs will often fail. It is a somewhat well-known problem that I have gotten around by just deleting a number of sub objects until the total size is below 2-3 gigs. This happens regardless of how boss the computer is; a machine with 16 gigs of RAM, 8 cores, etc. will still fail to save the objects correctly. Back versioning the save also does not help.
Is this a problem that others have solved somehow in MATLAB? Is there an alternative that I should look into that still has a lot of high level analysis and will NOT have this problem?
Questions welcome, thanks.
I am not sure this will help, but here goes: are you making sure to use a recent version of the MAT-file format? Check the documentation for save, for instance. Quoting from the page:
'-v7.3' 7.3 (R2006b) or later Version 7.0 features plus support for data items greater than or equal to 2 GB on 64-bit systems.
'-v7' 7.0 (R14) or later Version 6 features plus data compression and Unicode character encoding. Unicode encoding enables file sharing between systems that use different default character encoding schemes.
Also, could your object by any chance be, or contain, a graphics handle object? In that case, it is wise to use hgsave.

What's a suitable storage (RDBMS, NoSQL) for caching web site responses?

We're in the process of building an internal, Java-based RESTful web services application that exposes domain-specific data in XML format. We want to supplement the architecture and improve performance by leveraging a cache store. We expect to host the cache on separate but collocated servers, and since the web services are Java/Grails, a Java or HTTP API to the cache would be ideal.
As requests come in, unique URIs and their responses would be cached using a simple key/value convention, for example...
http://prod1/financials/reports/JAN/2007 --> XML response of 50 MB
http://prod1/legal/sow/9004 --> XML response of 250 KB
Response values for a single request can be quite large, perhaps up to 200 MB, but could be as small as 1 KB. And the number of requests per day is small: not more than 1000, but averaging 250. We don't have a large number of consumers; again, it's an internal app.
We started looking at MongoDB as a potential cache store, but given that MongoDB has a max document size of 8 or 16 MB, we did not feel it was the best fit.
Based on the limited details I provided, any suggestions on other types of stores that could be suitable in this situation?
The way I understand your question, you basically want to cache the files, i.e. you don't need to understand the files' contents, right?
In that case, you can use MongoDB's GridFS to cache the xml as a file. This way, you can smoothly stream the file in and out of the database. You could use the URI as a 'file name' and, well, that should do the job.
There are no (reasonable) file size limits and it is supported by most, if not all, of the drivers.
Twitter's engineering team just blogged about their SpiderDuck project that does something like what you're describing. They use Cassandra and Scribe+HDFS for their backends.
The simplest solution here is just caching these pieces of data in a file system. You can use tmpfs to ensure everything is in main memory, or any normal file system if you want your cache to be larger than the memory you have. Don't worry, even in the latter case the OS kernel will efficiently keep everything that is used frequently in main memory. You still have to delete the old files yourself, for example via cron if you're using Linux.
It may seem like an old-school solution, but it could be simpler to implement and less error-prone than many others.
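As a rough idea of how little code the file-system approach needs, here is a sketch in plain Java. Everything in it (the FileResponseCache class, the put/get methods, hashing the URI with SHA-256 to get a file name, the .xml suffix) is an assumption made for illustration, not something from the question. Responses are streamed to and from disk, so a 200 MB response never has to sit in the heap.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Optional;

public class FileResponseCache {
    private final Path dir;

    public FileResponseCache(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    /** Hashes the URI so that any URI maps to a safe, fixed-length file name. */
    private Path fileFor(String uri) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(uri.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return dir.resolve(hex + ".xml");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 ships with every Java platform
        }
    }

    /** Streams the response body to disk under the key derived from the URI. */
    public void put(String uri, InputStream body) throws IOException {
        // Write to a temporary file first, then move it into place, so that
        // a reader does not pick up a half-written response.
        Path tmp = Files.createTempFile(dir, "put-", ".tmp");
        Files.copy(body, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, fileFor(uri), StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns the cached file for the URI, if there is one; the caller streams from it. */
    public Optional<Path> get(String uri) {
        Path f = fileFor(uri);
        return Files.exists(f) ? Optional.of(f) : Optional.empty();
    }
}
```

Expiry is deliberately left out here; as noted above, a cron job that deletes files older than some age would cover it.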

Writing a functional and yet functional image processing library in Scala

We are developing a small image processing library for Scala (student project). The library is completely functional (i.e. no mutability). The raster of an image is stored as Stream[Stream[Int]] to exploit the benefits of lazy evaluation with the least effort. However, after performing a few operations on an image, the heap fills up and an OutOfMemoryError is thrown. (For example, up to 4 operations can be performed on a 500 x 400, 35 KB JPEG image before the JVM heap runs out of space.)
The approaches we have thought of are:
Twiddling with JVM options and increasing the heap size. (We don't know how to do this under IDEA, the IDE we are working with.)
Choosing a different data structure than Stream[Stream[Int]], the one which is more suited to the task of image processing. (Again we do not have much idea about the functional data structures beyond the simple List and Stream.)
The last option we have is giving up on immutability and making it a mutable library (like the popular image processing libraries), which we don't really want to do. Please suggest us some way to keep this library functional and still functional, if you know what I mean.
Thank you,
Siddharth Raina.
For an image sized 1024 x 768, the JVM runs out of heap space even for a single mapping operation. Some example code from our test:
val image = Image from "E:/metallica.jpg"
val redded = image.map(_ & 0xff0000)
redded.display(title = "Redded")
And the output:
"C:\Program Files (x86)\Java\jdk1.6.0_02\bin\java" -Didea.launcher.port=7533 "-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA Community Edition 10.0.2\bin" -Dfile.encoding=windows-1252 -classpath "C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\charsets.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\deploy.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\javaws.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\jce.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\jsse.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\management-agent.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\plugin.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\resources.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\rt.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\dnsns.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\localedata.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\sunjce_provider.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\sunmscapi.jar;C:\Program Files (x86)\Java\jdk1.6.0_02\jre\lib\ext\sunpkcs11.jar;C:\new Ph\Phoebe\out\production\Phoebe;E:\Inventory\Marvin.jar;C:\\lib\scala-library.jar;C:\\lib\scala-swing.jar;C:\\lib\scala-dbc.jar;C:\new Ph;C:\\lib\scala-compiler.jar;E:\Inventory\commons-math-2.2.jar;E:\Inventory\commons-math-2.2-sources.jar;E:\Inventory\commons-math-2.2-javadoc.jar;E:\Inventory\jmathplot.jar;E:\Inventory\jmathio.jar;E:\Inventory\jmatharray.jar;E:\Inventory\Javax;E:\Inventory\jai-core-1.1.3-alpha.jar;C:\Program Files (x86)\JetBrains\IntelliJ IDEA Community Edition 10.0.2\lib\idea_rt.jar" com.intellij.rt.execution.application.AppMain phoebe.test.ImageTest
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at scala.collection.Iterator$class.toStream(Iterator.scala:1011)
at scala.collection.IndexedSeqLike$Elements.toStream(IndexedSeqLike.scala:52)
at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1011)
at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1011)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:565)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:557)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:168)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:168)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:565)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:557)
at scala.collection.immutable.Stream$$anonfun$flatten1$1$1.apply(Stream.scala:453)
at scala.collection.immutable.Stream$$anonfun$flatten1$1$1.apply(Stream.scala:453)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:565)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:557)
at scala.collection.immutable.Stream.length(Stream.scala:113)
at scala.collection.SeqLike$class.size(SeqLike.scala:221)
at scala.collection.immutable.Stream.size(Stream.scala:48)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:388)
at scala.collection.immutable.Stream.toArray(Stream.scala:48)
at phoebe.picasso.Image.force(Image.scala:85)
at phoebe.picasso.SimpleImageViewer.<init>(SimpleImageViewer.scala:10)
at phoebe.picasso.Image.display(Image.scala:91)
at phoebe.test.ImageTest$.main(ImageTest.scala:14)
at phoebe.test.ImageTest.main(ImageTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at com.intellij.rt.execution.application.AppMain.main(
Process finished with exit code 1
If I understood correctly, you store each individual pixel in one Stream element, and this can be inefficient. What you can do is create your custom LazyRaster class which contains lazy references to blocks of the image of some size (for instance, 20x20). The first time some block is written, its corresponding array is initialized, and from there on changing a pixel means writing to that array.
This is more work, but may result in better performance. Furthermore, if you wish to support stacking of image operations (e.g. do a map - take - map), and then evaluating the image in "one-go", the implementation could get tricky - stream implementation is the best evidence for this.
Another thing one can do is ensure that the old Streams are being properly garbage collected. I suspect the image object in your example is a wrapper for your streams. If you wish to stack multiple image operations (like mapping) together and be able to gc the references you no longer need, you have to make sure that you don't hold any references to a stream. Note that this is not ensured if:
you have a reference to your image on the stack (image in the example)
your Image wrapper contains such a reference.
Without knowing more about the exact use cases, it's hard to say more.
Personally, I would avoid Streams altogether, and simply use some immutable array-based data structure which is both space-efficient and avoids boxing. The only place where I potentially see Streams being used is in iterative image transformations, like convolution or applying a stack of filters. You wouldn't have a Stream of pixels, but a Stream of images, instead. This could be a nice way to express a sequence of transformations - in this case, the comments about gc in the link given above apply.
If you process large streams, you need to avoid holding onto a reference to the head of the stream. This will prevent garbage collection.
It's possible that calling certain methods on Stream will internally hold onto the head. See the discussion here: Functional processing of Scala streams without OutOfMemory errors
Stream is very unlikely to be the optimum structure here. Given the nature of a JPEG it makes little sense to "stream" it into memory line-by-line.
Stream also has linear access time for reading elements. Again, probably not what you want unless you're streaming data.
I'd recommend using an IndexedSeq[IndexedSeq[Int]] in this scenario. Or (if performance is important) an Array[Array[Int]], which will allow you to avoid some boxing/unboxing costs.
Martin has written a good overview of the 2.8 collections API which should help you understand the inherent trade-offs in the various collection types available.
Even if using Arrays, there's still every reason to use them as immutable structures and maintain a functional programming style. Just because a structure is mutable doesn't mean you have to mutate it!
I recommend also looking at continuous rather than just discrete models for imagery. Continuous is generally more modular/composable than discrete--whether time or space.
As a first step you should take a memory dump and analyze it. It is very possible that you will see the problem immediately.
There is a special command-line option that forces the JVM to write a heap dump on an OutOfMemoryError: -XX:+HeapDumpOnOutOfMemoryError. And there are good tools, like jhat and VisualVM, which can help you with the analysis.
Stream is more about lazy evaluation than immutability. And you're
forcing an insane amount of space and time overhead for each pixel by
doing so. Furthermore, Streams only make sense when you can defer the
determination (calculation or retrieval) of individual pixel values.
And, of course, random access is impossible. I'd have to deem the
Stream an entirely inappropriate data structure for image processing.
I'd strongly recommend that you manage your own raster memory (bonus
points for not fixing a single raster image organization into your
code) and allocate storage for whole channels or planes or bands
thereof (depending on the raster organization in play).
UPDATE: By the foregoing, I mean don't use nested Array or IndexedSeq, but allocate a block and compute which element using the row and column values.
Then take an "immutable after initialization" approach. Once a given
pixel or sample has been established in the raster, you never allow it
to be changed. This might require a one-bit raster plane to track the
established pixels. Alternatively, if you know how you'll be filling
the raster (the sequence in which pixels will be assigned) you can get
away with a much simpler and cheaper representation of how much of the
raster is established and how much remains to be filled.
Then as you perform processing on the raster images, do so in a pipeline
where no image is altered in place, but rather a new image is always
generated as various transforms are applied.
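To show what "allocate a block and compute which element using the row and column values" might look like, here is a small sketch. It is written in Java rather than Scala, and the names (FlatRaster, of, pixel, map) are invented for illustration. It covers only the simplest case, where the whole raster is known at construction time, so no plane of "established" bits is needed.

```java
import java.util.function.IntUnaryOperator;

public final class FlatRaster {
    private final int width;
    private final int height;
    private final int[] pixels; // row-major; never exposed and never modified after construction

    private FlatRaster(int width, int height, int[] pixels) {
        this.width = width;
        this.height = height;
        this.pixels = pixels;
    }

    /** Copies the caller's array so that later changes to it cannot affect this raster. */
    public static FlatRaster of(int width, int height, int[] rgb) {
        if (rgb.length != width * height) {
            throw new IllegalArgumentException("expected " + (width * height) + " pixels");
        }
        return new FlatRaster(width, height, rgb.clone());
    }

    public int width()  { return width; }
    public int height() { return height; }

    /** One flat block: the element index is computed from row and column. */
    public int pixel(int row, int col) {
        return pixels[row * width + col];
    }

    /** Applies f to every pixel and returns a new raster; this one is left untouched. */
    public FlatRaster map(IntUnaryOperator f) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = f.applyAsInt(pixels[i]);
        }
        return new FlatRaster(width, height, out);
    }
}
```

A 1024 x 768 image stored this way takes about 3 MB per raster (4 bytes per pixel), and intermediate rasters in a pipeline become garbage as soon as the next stage has been computed.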
You might consider that for some image transformations (convolution,
e.g.) you must take this approach or you will not get the correct
result.
I strongly recommend Okasaki's Purely Functional Data Structures if you don't have any experience with functional data structures (as you seem to indicate).
To increase your heap size using IntelliJ, you need to add the following to the VM Parameters section of the Run/Debug Configuration:
-Xms256m -Xmx256m
This will increase the maximum heap size to 256MB and also ensure this amount is requested by the VM at startup, which generally represents a performance increase.
Additionally, you're using a relatively old JDK. If possible, I recommend you update to the latest available version, as newer builds enable escape analysis, which can in some cases have a dramatic effect on performance.
Now, in terms of algorithms, I would suggest that you follow the advice above and divide the image into blocks of say, 9x9 (any size will do though). I'd then go and have a look at Huet's Zipper and think about how that might be applied to an image represented as a tree structure, and how that might enable you to model the image as a persistent data structure.
Increasing the heap size in IDEA can be done in the vmoptions file, which can be found in the bin directory of your IDEA installation directory (add -Xmx512m to set the heap size to 512 megabytes, for example).
Apart from that, it is hard to say what causes the out of memory without knowing what operations you exactly perform, but perhaps this question provides some useful tips.
One solution would be to put the image in an array, and make filters like "map" return a wrapper for that array. Basically, you have a trait named Image. That trait requires abstract pixel retrieving operations. When, for example, the "map" function is called, you return an implementation, which delegates the calls to the old Image, and executes the function on it. The only problem with that would be that the transformation could end up being executed multiple times, but since it is a functional library, that is not very important.
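The wrapper idea just described could look roughly like this. It is sketched in Java with an interface standing in for the Scala trait; LazyImage, fromArray and the method names are all invented for illustration.

```java
import java.util.function.IntUnaryOperator;

public interface LazyImage {
    int width();
    int height();
    int pixel(int x, int y);

    /** Returns a view that applies f on every pixel read; no pixel data is copied. */
    default LazyImage map(IntUnaryOperator f) {
        LazyImage source = this;
        return new LazyImage() {
            public int width()  { return source.width(); }
            public int height() { return source.height(); }
            public int pixel(int x, int y) {
                // Delegates to the old image and runs the function on the result.
                return f.applyAsInt(source.pixel(x, y));
            }
        };
    }

    /** The only place where pixel data is actually stored. */
    static LazyImage fromArray(int width, int height, int[] rgb) {
        int[] copy = rgb.clone(); // defensive copy keeps the image immutable
        return new LazyImage() {
            public int width()  { return width; }
            public int height() { return height; }
            public int pixel(int x, int y) { return copy[y * width + x]; }
        };
    }
}
```

Each map call costs one small wrapper object, so a chain of filters uses almost no extra memory. The price is the one already mentioned: the whole chain of functions runs again every time a pixel is read, so for an expensive filter you may want to copy the result into a fresh array at some point.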

How to efficiently process 300+ Files concurrently in scala

I'm going to work on comparing around 300 binary files using Scala, byte by byte, 4 MB each. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took me around 90 sec on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing the differences but processing those files in the same sequential order. Say I have to look at the i-th byte of every file at the same time, then move on to byte i + 1.
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full-speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
1. Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
2. Process those 512 KB using Futures.
3. Go back to step 1, unless you are finished with the files.
4. Get the result back from the processing Futures.
On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading all files on step number 1, you use some of the time spent reading these files doing useful CPU work. You may experiment with lowering the bytes read on step 1 as well.
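The steps above could be sketched roughly as follows, in Java. The names ChunkedLockstep, readChunk and countMatchingPositions are made up, and the "processing" is a stand-in that just counts the positions where every file holds the same byte. To keep it short, the chunk is processed on the calling thread; handing it to a second pool, so that CPU work overlaps with the next round of reads, is the natural next step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedLockstep {

    /** Fills buf as far as possible; returns the bytes read (less than buf.length only at EOF). */
    static int readChunk(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }

    /** Counts the positions at which every file holds the same byte as the first file. */
    public static long countMatchingPositions(List<Path> files, int chunkSize, int parallelReads)
            throws IOException, InterruptedException, ExecutionException {
        if (files.isEmpty()) return 0;
        // The pool size caps how many files are read at the same time (step 1).
        ExecutorService readers = Executors.newFixedThreadPool(parallelReads);
        List<InputStream> streams = new ArrayList<>();
        try {
            for (Path p : files) streams.add(Files.newInputStream(p));
            byte[][] chunks = new byte[files.size()][chunkSize];
            long matches = 0;
            while (true) {
                // Step 1: read the next chunk of every file through Futures.
                List<Future<Integer>> pending = new ArrayList<>();
                for (int i = 0; i < streams.size(); i++) {
                    final int idx = i;
                    pending.add(readers.submit(() -> readChunk(streams.get(idx), chunks[idx])));
                }
                int len = chunkSize;
                for (Future<Integer> f : pending) len = Math.min(len, f.get());

                // Step 2: walk the chunk position by position across all files.
                for (int pos = 0; pos < len; pos++) {
                    boolean same = true;
                    for (int i = 1; i < chunks.length && same; i++) {
                        same = chunks[i][pos] == chunks[0][pos];
                    }
                    if (same) matches++;
                }
                // Step 3: stop once the shortest file is exhausted, otherwise loop.
                if (len < chunkSize) break;
            }
            return matches;
        } finally {
            readers.shutdown();
            for (InputStream in : streams) in.close();
        }
    }
}
```

With chunkSize at 512 * 1024 and parallelReads somewhere between 2 and 8, this matches the numbers suggested above; both are worth benchmarking on your own disks.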
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see if they are the same I would suggest using a hashing algorithm like SHA1 to see if they match.
Here is some java source to make that happen
Many large systems that handle data use SHA1, including the NSA and git.
It's simply more efficient to use a hash instead of a byte compare. The hashes can also be stored for later to see if the data has been altered.
Here is a talk by Linus Torvalds specifically about git, it also mentions why he uses SHA1.
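For reference, a minimal way to hash a file with the JDK's MessageDigest looks something like this (the class and method names FileDigest and sha1Hex are made up for this sketch, and the 64 KB buffer size is arbitrary):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileDigest {

    /** Returns the SHA-1 digest of the file's contents as a lowercase hex string. */
    public static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // Feed the file through the digest in chunks, so it never has to fit in memory.
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Two files whose hex strings differ are definitely different. Note that hashing still reads every byte once, so it pays off mainly when the same files are compared repeatedly or the hashes are stored.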
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file one byte at a time and comparing as you go, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.
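A sketch of the ByteBuffer approach for two files, assuming a hypothetical NioCompare class; the direct buffers and the chunkSize parameter are choices made for the example, not requirements:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioCompare {

    /** Fills the buffer from the channel until it is full or EOF, then flips it for reading. */
    static void fill(FileChannel ch, ByteBuffer buf) throws IOException {
        buf.clear();
        while (buf.hasRemaining() && ch.read(buf) != -1) {
            // keep reading; a single read may return fewer bytes than requested
        }
        buf.flip();
    }

    /** Returns the offset of the first differing byte, or -1 if the files are identical. */
    public static long firstDifference(Path a, Path b, int chunkSize) throws IOException {
        try (FileChannel ca = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel cb = FileChannel.open(b, StandardOpenOption.READ)) {
            ByteBuffer ba = ByteBuffer.allocateDirect(chunkSize);
            ByteBuffer bb = ByteBuffer.allocateDirect(chunkSize);
            long offset = 0;
            while (true) {
                fill(ca, ba);
                fill(cb, bb);
                int common = Math.min(ba.remaining(), bb.remaining());
                for (int i = 0; i < common; i++) {
                    if (ba.get(i) != bb.get(i)) return offset + i;
                }
                // One file ended before the other: they differ where the shorter one stops.
                if (ba.remaining() != bb.remaining()) return offset + common;
                // Both channels are at EOF and nothing differed.
                if (common == 0) return -1;
                offset += common;
            }
        }
    }
}
```

For the 300-file case you would keep one channel and one buffer per file and walk all the buffers in lockstep, but the reading pattern stays the same.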

Optimal way to persist an object graph to flash on the iPhone

I have an object graph in Objective-C on the iPhone platform that I wish to persist to flash when closing the app. The graph has about 100k-200k objects and contains many loops (by design). I need to be able to read/write this graph as quickly as possible.
So far I have tried using NSCoder. This not only struggles with the loops but also takes an age and a significant amount of memory to persist the graph - possibly because an XML document is used under the covers. I have also used an SQLite database but stepping through that many rows also takes a significant amount of time.
I have considered using Core-Data but fear I will suffer the same issues as SQLite or NSCoder as I believe the backing stores to core-data will work in the same way.
So is there any other way I can handle the persistence of this object graph in a lightweight way - ideally I'd like something like Java's serialization? I've been thinking of trying Tokyo Cabinet or writing the memory occupied by bunch of C structs out to disk - but that's going to be a lot of rewrite work.
I would recommend rewriting as C structs. I know it will be a pain, but not only will it be quick to write to disk, it should also perform much better.
Before anyone gets upset, I am not saying people should always use structs, but there are some situations where this is actually better for performance. Especially if you pre-allocate your memory in say 20k contiguous blocks at a time (with pointers into the block), rather than creating/allocating lots of little chunks within a repeated loop.
I.e., if your loop continually allocates objects, that is going to slow it down. If you have preallocated 1000 structs and just have an array of pointers (or a single pointer), then this is much faster.
(I have had situations where even my desktop Mac was too slow and did not have enough memory to cope with those millions of objects being created in a row.)
Rather than rolling your own, I'd highly recommend taking another look at Core Data. Core Data was designed from the ground up for persisting object graphs. An NSCoder-based archive, like the one you describe, requires you to have the entire object graph in memory, and all writes are atomic. Core Data brings objects in and out of memory as needed, and can write only the part of your graph that has changed to disk (via SQLite).
If you read the Core Data Programming Guide or their tutorial guide, you can see that they've put a lot of thought into performance optimizations. If you follow Apple's recommendations (which can seem counterintuitive, like their suggestion to denormalize your data structures at some points), you can squeeze a lot more performance out of your data model than you'd expect. I've seen benchmarks where Core Data handily beat hand-tuned SQLite for data access within databases of the size you're looking at.
On the iPhone, you also get some memory advantages from controlling the batch size of fetches, and a very nice helper class in NSFetchedResultsController.
It shouldn't take that long to build up a proof-of-principle Core Data implementation of your graph to compare it to your existing data storage methods.