Best approach to convert existing video files into MPEG-DASH streaming

We have over 50 million videos, each in an average of 3 different resolutions, e.g. 240, 360, etc.
It is time for us to move to Dynamic Adaptive Streaming, i.e. MPEG-DASH. At the moment our biggest challenge is converting the existing data to MPEG-DASH. Our current approach is to convert all videos one by one and create an MPD file for each, but this could take months.
Is there an alternative approach? I am aware that existing files can be streamed in real time using various tools, but won't that require huge CPU resources? Are there any benchmarks that can help us decide how we should move to MPEG-DASH?

If you want to convert a huge number of videos in a short amount of time, it might be a good approach to use one of the available cloud encoding services, as they can convert multiple files in parallel.
I personally have had good experiences with Bitmovin cloud encoding, as I really like their API clients, which come with lots of examples:
However, there are other services available as well.
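To get a feel for why parallelism matters at this scale, a rough back-of-the-envelope estimate can help. The sketch below assumes a hypothetical average of 60 seconds of processing per video; that figure is not from the question and should be measured on a sample of your own files. The function name `estimated_days` is likewise just illustrative.

```python
def estimated_days(num_videos: int, seconds_per_video: float, workers: int) -> float:
    """Wall-clock days needed to process num_videos with `workers` parallel jobs."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_seconds = num_videos * seconds_per_video / workers
    return total_seconds / 86_400  # 86,400 seconds per day


NUM_VIDEOS = 50_000_000     # from the question ("over 50MM videos")
SECONDS_PER_VIDEO = 60.0    # assumption, not a measured value

# Print the estimate for a few worker counts to show how it scales.
for workers in (1, 100, 1_000):
    days = estimated_days(NUM_VIDEOS, SECONDS_PER_VIDEO, workers)
    print(f"{workers:>5} workers: {days:,.1f} days")
```

Under these assumptions the job only drops from years to roughly a month once about a thousand conversions run concurrently, which is the kind of parallelism a cloud service (or your own worker pool) would have to provide.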


What would be the best distributed storage solution for a heavy use web scraper/crawler?

I'm implementing a web scraper that needs to scrape and store about 15GB+ of HTML files a day. The amount of daily data will likely grow as well.
I intend on storing the scraped data as long as possible, but would also like to store the full HTML file for at least a month for every page.
My first implementation wrote the HTML files directly to disk, but that quickly ran into inode limit problems.
The next thing I tried was using Couchbase 2.0 as a key/value store, but the Couchbase server would start to return Temp_OOM errors after 5-8 hours of web scraping writes. Restarting the Couchbase server is the only route for recovery.
Would MongoDB be a good solution? This article makes me worry, but it does sound like their requirements are beyond what I need.
I've also looked a bit into Cassandra and HDFS, but I'm not sure if those solutions are overkill for my problem.
As for querying the data, as long as I can get the specific page data for a URL and a date, that will be good enough. The data is also mostly write once, read once, and then stored for possible reads in the future.
Any advice pertaining to storing such a large amount of HTML files would be helpful.
Assuming 50 kB per HTML page, 15 GB daily gives us 300,000+ pages per day, or about 10 million per month.
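For reference, the arithmetic behind that estimate can be written out as follows. The 50 kB page size is the assumed average from this answer, decimal units (1 GB = 10^9 bytes) are used, and a 30-day month is assumed.

```python
BYTES_PER_DAY = 15 * 10**9    # 15 GB per day, from the question
BYTES_PER_PAGE = 50 * 10**3   # 50 kB per page, an assumed average

pages_per_day = BYTES_PER_DAY // BYTES_PER_PAGE
pages_per_month = pages_per_day * 30  # assuming a 30-day month

print(f"{pages_per_day:,} pages per day")
print(f"{pages_per_month:,} pages per month")
```

With exactly 15 GB per day this comes out at 9 million pages over 30 days, which is in line with the rough 10 million figure once the "15GB+" growth is taken into account.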
MongoDB will definitely work well with this data volume. As for its limitations, it all depends on how you plan to read and analyze the data. You may be able to take advantage of its map/reduce features given that amount of data.
However, if your problem size may scale further, you may want to consider other options. It might be worth noting that the Google search engine uses BigTable as storage for HTML data. In that sense, Cassandra can be a good fit for your use case. Cassandra offers excellent, persistent write/read performance and scales horizontally well beyond your data volume.
I'm not sure what deployment setup you had when Couchbase gave you those errors; more investigation may be required to find out what is causing the problem. You need to trace the errors back to their source because, given the requirements described above, Couchbase should work fine and shouldn't stop after 5 hours (unless you have a storage problem).
I suggest that you give MongoDB a try; it is very powerful and well suited to what you need, and it shouldn't struggle with the requirements you mentioned above.
You can use HDFS, but you don't really need it when MongoDB (or even Cassandra) can do the job.
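Whichever store you pick, the access pattern in the question (fetch one page by URL and date) maps naturally onto a compound key. The following is only a minimal sketch of that idea: SQLite from the Python standard library is used purely as a stand-in so the example runs without a server, the table layout and function names are hypothetical, and compressing the HTML with zlib is an added assumption rather than something the question asked for.

```python
import sqlite3
import zlib
from datetime import date
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the page store; (url, fetched_on) is the lookup key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT NOT NULL,"
        " fetched_on TEXT NOT NULL,"
        " body BLOB NOT NULL,"
        " PRIMARY KEY (url, fetched_on))"
    )
    return conn


def put_page(conn: sqlite3.Connection, url: str, fetched_on: date, html: str) -> None:
    """Store one compressed page; a second write for the same key replaces it."""
    blob = zlib.compress(html.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, fetched_on, body) VALUES (?, ?, ?)",
        (url, fetched_on.isoformat(), blob),
    )
    conn.commit()


def get_page(conn: sqlite3.Connection, url: str, fetched_on: date) -> Optional[str]:
    """Return the HTML stored for (url, fetched_on), or None if absent."""
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ? AND fetched_on = ?",
        (url, fetched_on.isoformat()),
    ).fetchone()
    return zlib.decompress(row[0]).decode("utf-8") if row else None


conn = open_store()
put_page(conn, "http://example.com/", date(2013, 1, 1), "<html>hello</html>")
print(get_page(conn, "http://example.com/", date(2013, 1, 1)))
```

Keeping many pages in one database rather than one file per page also sidesteps the inode limit mentioned in the question. The same key design should carry over to a document or column store, but how to express it there depends on the product you choose.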

PDF Storage System with REST API

I have hundreds of thousands of PDFs that are presently stored in the filesystem. I have a custom application that, as an afterthought to its actual purpose, provides access to these PDFs. I would like to take the "storage & retrieval" part out of the custom application and use an OpenSource document storage backend.
Access to the PDF Store should be via a REST API, so that users would not need a custom client for basic document browsing and viewing. Programs that store PDFs should also be able to work via the REST API. They would provide the actual binary or ASCII data plus structured meta data, which could later be used in retrieval.
A typical query for retrieval would be "give me all documents that were created between days X and Y with document types A or B".
My research, whether such a storage backend exists, has come up empty. Do any of you know a system that provides these features? OpenSource preferred, reasonably priced systems considered.
I am not looking for advice on how to "roll my own" using available technologies. Rather, I'm trying to find out whether that can be avoided. Many thanks in advance.
What you describe sounds like a document management or asset management system, of which there are many, and many of them work with PDF files. I have some fleeting experience with commercial offerings such as Xinet (now acquired, apparently) or Elvis. Both might fit your requirements, but they're probably too big and likely too expensive.
Have you looked at Alfresco? This is an open source alternative I came into contact with years ago while serving on a selection committee. As far as I remember, it definitely goes in the direction of what you are looking for, and it is open source, so it might fit that angle as well:

Thoughts on Dropbox Sync, Merging CoreData

I have data that I need to organize, and the easiest way to do it would be with CoreData. I also want to sync this data to Dropbox so that it will be synced across multiple iOS devices and Macs. I looked at this post, and now I am kind of concerned:
You want to look at this pessimistic
take on cloud sync: Why Cloud Sync
Will Never Work. It covers a lot of
the issues that you are wrestling
with. Many of them are largely
It is very, very, very difficult to
synchronize information period. Adding
in different devices, different
operating systems, different data
structures, etc snowballs the
complexity often fatally. People have
been working on variants of this
problem since the 70s and things
really haven't improved much.
I am especially concerned because I am pretty new to iOS and programming in general, and I was hoping it would be easier. I was wondering if anyone had some tips/tutorials/experience with doing this. I could use property lists (or a different method) to store the data, but that would make it harder later in case I wanted to change any of the attributes of the data I am storing. Is this really as complicated as they are making it sound, and should I just try to find some other way to sync the data (e.g. email, drag and drop in iTunes, etc.)?
I don't have any experience with cloud sync, but I do have experience with data management. Plist files are not at all bad in terms of data manipulation. The main problem with plist files is speed when handling large amounts of data, but for what you are intending to do they should work fine. It is difficult to give more of an answer because your question does not say what kind of data, how much data, or how often this data will be changed or accessed. If you are a beginner in iPhone development or programming in general, I will just say that Core Data has a very steep learning curve. When I first started programming for the iPhone, all I used were plists because they are simple and versatile.
Also, from reading the article that was linked in your question, it seems that the author was condemning cloud providers for the way they handle data storage and the services they offer to users. That article was written in 2009; since then, great strides in "cloud" storage and syncing have been made. Also, you are not actually creating a cloud sync service, you are simply using one that already exists, so almost none of those problems apply to you.
Syncing is rather easy. You just have to keep track of file creation and deletion.
I wrote this blog post about how to sync a local data store with a remote one: Basic Syncing Algorithm
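To make "keep track of file creation and deletion" concrete, here is a minimal sketch of one way to do it. This is not the algorithm from the blog post; it is an illustration that assumes you keep a record of the file names that existed after the last successful sync, and it deliberately ignores modified files and conflicts. All names (`SyncPlan`, `plan_sync`) are hypothetical.

```python
from typing import NamedTuple, Set


class SyncPlan(NamedTuple):
    upload: Set[str]         # created locally since the last sync
    download: Set[str]       # created remotely since the last sync
    delete_remote: Set[str]  # deleted locally since the last sync
    delete_local: Set[str]   # deleted remotely since the last sync


def plan_sync(last_synced: Set[str], local: Set[str], remote: Set[str]) -> SyncPlan:
    """Compare both sides against the last synced snapshot of file names."""
    return SyncPlan(
        upload=local - last_synced - remote,
        download=remote - last_synced - local,
        delete_remote=(last_synced - local) & remote,
        delete_local=(last_synced - remote) & local,
    )


plan = plan_sync(
    last_synced={"a.plist", "b.plist", "c.plist"},
    local={"a.plist", "b.plist", "d.plist"},   # c deleted, d created locally
    remote={"a.plist", "c.plist", "e.plist"},  # b deleted, e created remotely
)
print(plan)
```

After applying the plan, the new snapshot to remember is simply the set of files present on both sides. Handling edits to the same file on two devices needs extra bookkeeping (timestamps or version counters) that this sketch leaves out.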
In the comments, tell me what (in general) you are using CoreData to manage. I need more information.
Now there is a product that syncs your CoreData across devices, with the data stored in your own Dropbox, Box, or Google Drive account. It's called NimbusBase.
You can use your CoreData directly: import our libraries, and your data will be saved straight to your Dropbox. We handle authentication as well as moving the data back and forth.
Feel free to email me if you have questions.
Disclosure: I am a programmer at NimbusBase

I need suggestions for a distributed media storage data store

I want to develop a multimedia system that needs to store millions of videos and images, so I want to choose a distributed storage subsystem. Can anyone give me some suggestions? Thanks!
I guess the best option for 'millions of videos and images' is a content distribution/delivery network (CDN):
CDN is a server setup which allows for
faster, more efficient delivery of
your media files. It does this by
maintaining copies of your media at
different points of presence (POPs)
along a global network to ensure quick
client access and the fastest delivery
If you use a CDN, you don't need to worry about many of these problems (distribution, fast access). Integration with a CDN should also be very simple.
You can configure your writes to be replicated to multiple nodes before they return to the client. Whether or not that is needed is, of course, up to the use case, and it definitely involves a performance hit. So if you are implementing a write-heavy analytical database, it will have a significant impact on write throughput.
All the other points you make about the question, in terms of lack of requirements etc., I second.
Having a replicated file system with metadata in a NoSQL database is a very common way of doing things. Why did you consider this kind of approach?
Have you taken a look at MongoDB GridFS? I have never used it, but it is something I would take a look at to see if it gives you any ideas.
You gave us (near) zero information about what your requirements are. E.g.:
Do you want atomic transactions?
Is the system read or write heavy?
Do you need fast queries or want to batch-process the data set?
How big are the videos?
Do you want to distribute data locally (on a LAN) or spanning multiple data centers / continents?
How are we supposed to pick the right tool if we don't know what it needs to support?
Without any knowledge of the system I would advise using some kind of FS replication for the videos and images and then storing the metadata associated with the items either in MongoDB, MySQL Master-Master or MySQL Cluster.
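As a rough illustration of that split (bulk files on a file system, metadata in a database), consider the sketch below. It rests on several assumptions that are not part of the advice above: files are named by their SHA-256 digest, directories are sharded by the first two hex characters, and SQLite stands in for the metadata database only so the example runs without a server. Replicating the file system itself is out of scope here.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_metadata(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata table; SQLite is only a stand-in."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS media ("
        " digest TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " size INTEGER NOT NULL)"
    )
    return conn


def store_media(root: Path, conn: sqlite3.Connection, name: str, data: bytes) -> Path:
    """Write the bytes under root and record their metadata; return the file path."""
    digest = hashlib.sha256(data).hexdigest()
    # Shard by the first two hex characters so no single directory grows huge.
    target = root / digest[:2] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    conn.execute(
        "INSERT OR REPLACE INTO media (digest, name, size) VALUES (?, ?, ?)",
        (digest, name, len(data)),
    )
    conn.commit()
    return target


with tempfile.TemporaryDirectory() as tmp:
    conn = open_metadata()
    path = store_media(Path(tmp), conn, "clip.mp4", b"fake video bytes")
    print(path.relative_to(tmp))
    print(conn.execute("SELECT name, size FROM media").fetchall())
```

The metadata rows stay small and queryable, while the large binaries live wherever the replicated file system puts them.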
Distributed related to what?
If you are talking of replication to distribute:
MongoDB is restricted to master-slave replication, so only one node is able to read/write, which leaves you with a single point of failure for a truly distributed system.
CouchDB is able to replicate peer-to-peer.
Find a very good comparison here and here, also compared with HBase.
With CouchDB you also have to be aware that you are going to talk HTTP to the database and have built-in web services.
An alternative is to use MongoDB's GridFS, serving as a (very easily manageable) redundant and distributed filesystem.
Some will say that it's slow on reads (and it is, mostly because of the nature of its design), but that doesn't have to be a dealbreaker for your system as a whole: if you need performance later on, you could always put Varnish or Squid in front of the filesystem tier.
As far as I know, Squid also supports an on-disk cache for all the less-hot files.

Event-based analytics package that won't break the bank with high volume

I'm using an application that is very interactive and is now at the point of requiring a real analytics solution. We generate roughly 2.5-3 million events per month (and growing), and would like to build reports to analyze cohorts of users, funneling, etc. The reports are standard enough that it would seem feasible to use an existing service.
However, given the volume of data, I am worried that a hosted analytics solution like MixPanel will become very expensive very quickly. I've also looked into building a traditional star-schema data warehouse with offline background processes (though I know very little about data warehousing).
This is a Ruby application with a PostgreSQL backend.
What are my options, both build and buy, to answer such questions?
Why not build your own?
Check out this open source project as an example:
It is very basic, and you will have to build the data mart features you need for your case.
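To give an idea of what "building your own" involves for one of the reports mentioned in the question, here is a minimal funnel sketch. It is not taken from the project referred to above. The event tuple layout, the step names, and the function `funnel_counts` are all hypothetical, and in practice the events would come from your PostgreSQL tables rather than an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, int]  # (user_id, event_name, unix_timestamp)


def funnel_counts(events: Iterable[Event], steps: List[str]) -> List[int]:
    """For each funnel step, count the users who reached it in the given order."""
    progress: Dict[str, int] = defaultdict(int)  # user_id -> steps completed so far
    for user_id, name, _ts in sorted(events, key=lambda e: e[2]):
        done = progress[user_id]
        # Only the next expected step advances a user through the funnel.
        if done < len(steps) and name == steps[done]:
            progress[user_id] = done + 1
    return [sum(1 for d in progress.values() if d > i) for i in range(len(steps))]


events = [
    ("u1", "signup", 1), ("u1", "add_to_cart", 2), ("u1", "purchase", 3),
    ("u2", "signup", 1), ("u2", "purchase", 2),   # skipped add_to_cart
    ("u3", "add_to_cart", 1),                     # never signed up
]
print(funnel_counts(events, ["signup", "add_to_cart", "purchase"]))
```

At a few million events per month, a nightly batch job running this kind of aggregation (or its SQL equivalent) and writing the results to summary tables is one plausible starting point before investing in a full data warehouse.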