Play Framework with Spark MLlib vs PredictionIO - scala

Good morning,
Currently I'm exploring my options for building an internal platform for the company I work for. Our team is responsible for the company's data warehouse and reporting.
As we evolve, we'll be developing an intranet to address some of the company's needs, and for some time now I've been considering Scala (and the Play Framework) as the way to go.
This will also involve a lot of machine learning to cluster clients, predict sales trends, and so on. That's when I started thinking about Spark ML and came across PredictionIO.
As we shift our skills towards data science, which option will benefit and teach us/the company the most:
building everything on top of Play and Spark, keeping both the platform and the machine learning in the same project
using Play and PredictionIO, where most of the work is already prepared
I'm not trying to open an opinion-based question; rather, I want to learn from your experience / architectures / solutions.
Thank you

Both are good options:
1. Use PredictionIO if you are new to ML: it is easy to start with, but it will limit you in the long run.
2. Use Spark if you have confidence in your data science and data engineering team: Spark has an excellent, easy-to-use API along with an extensive ML library. That said, in order to put things into production you will need some distributed Spark knowledge and experience, and it can be tricky at times to make it efficient and reliable.
Here are the options:
Spark on Databricks Cloud: expensive, but easy-to-use Spark with no data engineering needed
PredictionIO: if you are certain that its ML can solve all your business cases
Spark on Google Dataproc: an easily managed cluster for about 60% less than AWS, though some engineering is still required
In summary: PredictionIO for a quick fix, and Spark for long-term data science / engineering development. You can start with Databricks to minimise expertise overheads and move to Dataproc as you go along to minimise costs.
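For the client-clustering use case mentioned in the question, a minimal Spark MLlib sketch in Scala might look like the following. The input path, column names and k=5 are assumptions for illustration, not recommendations:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("client-clustering").getOrCreate()

    // Hypothetical table: one row per client with numeric behaviour features.
    val clients = spark.read.parquet("/warehouse/client_features")

    // Assemble the feature columns into the single vector column MLlib expects.
    val assembled = new VectorAssembler()
      .setInputCols(Array("monthly_spend", "orders_per_month", "tenure_months"))
      .setOutputCol("features")
      .transform(clients)

    // Cluster clients into 5 segments; k is something you would tune.
    val model = new KMeans().setK(5).setFeaturesCol("features").fit(assembled)

    // Adds a "prediction" column holding each client's cluster id.
    val segmented = model.transform(assembled)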

PredictionIO uses Spark's MLlib for the majority of its engine templates.
I'm not sure why you're separating the two?
PredictionIO is as flexible as Spark is, and can alternatively use other libraries such as deeplearning4j & H2O to name a few.

Related

Building a recommender system for videos from scratch vs. using a SAAS

I have been tasked with developing a recommender system for a video app and am relatively new to data science.
I was wondering whether, given a short timescale of about a month, it would be wiser to turn to a software-as-a-service recommender engine like Recombee or to build the recommender algorithms from scratch using open-source software like Apache Spark.
My main hesitation with the first option is that there might not be as much freedom when using a SaaS, so the recommender system might not be as accurate as one built from scratch.
However, I am concerned about the feasibility of creating a recommender system from scratch, especially given my lack of experience. Could I create something within a month that is as accurate and as scalable as a SaaS?
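For reference, the "from scratch with Spark" option can be fairly small for a first version. Below is a minimal sketch using Spark's built-in ALS recommender; the column names and the implicit-feedback setup are assumptions, not part of the original question:

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("video-recommender").getOrCreate()
    import spark.implicits._

    // Hypothetical implicit-feedback data: (userId, videoId, watchCount).
    val ratings = Seq((1, 10, 3.0), (1, 11, 1.0), (2, 10, 5.0))
      .toDF("userId", "videoId", "watchCount")

    val als = new ALS()
      .setImplicitPrefs(true)      // watch counts are implicit feedback, not explicit ratings
      .setUserCol("userId")
      .setItemCol("videoId")
      .setRatingCol("watchCount")

    val model = als.fit(ratings)

    // Top 10 recommended videos per user.
    val top10PerUser = model.recommendForAllUsers(10)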

Best suited NoSQL database for Content Recommender

I am currently working on a project that involves migrating a content recommender from MySQL to a NoSQL database for performance reasons. Our team has been evaluating some alternatives like MongoDB, CouchDB, HBase and Cassandra. The idea is to choose a database that is capable of running on a single server or in a cluster.
So far we have discarded HBase due to its dependency on a distributed environment. Even though we plan to scale horizontally, we need to run the DB on a single server in production for a little while. MongoDB was also discarded because it does not support map/reduce features.
We still have 2 alternatives and no solid background on which to decide. Any guidance or help is appreciated.
NOTE: I do not intend to start a religion-like discussion with unfounded arguments. It is a strictly technical question to be discussed in the problem's context.
Graph databases are usually considered best suited for recommendation engines, since a lot of recommendation algorithms are actually graph-based. I recommend looking into Neo4j: it can handle billions of nodes/edges on a single machine, and it supports a so-called high-availability mode, which is a master-slave setup with automatic master election.
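As an illustration of why a graph model fits, a collaborative-filtering-style query against Neo4j from Scala might look roughly like this. The labels, relationships and property names are made up, and the sketch assumes the official Neo4j Java driver:

    import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

    val driver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"))
    val session = driver.session()

    // "Items liked by users who liked the same items as user 42", scored by co-occurrence.
    val result = session.run(
      """MATCH (u:User {id: $userId})-[:LIKED]->(:Item)<-[:LIKED]-(:User)-[:LIKED]->(rec:Item)
        |WHERE NOT (u)-[:LIKED]->(rec)
        |RETURN rec.id AS item, count(*) AS score
        |ORDER BY score DESC LIMIT 10""".stripMargin,
      Values.parameters("userId", Integer.valueOf(42)))

    while (result.hasNext) {
      val record = result.next()
      println(s"${record.get("item")} -> ${record.get("score")}")
    }

    session.close()
    driver.close()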

What would be a good application for an enhanced version of MapReduce that shares information between Mappers?

I am building an enhancement to the Spark framework (http://www.spark-project.org/). Spark is a project out of UC Berkeley that does MapReduce quickly in RAM. Spark is built in Scala.
The enhancement I'm building allows some data to be shared between the mappers while they are computing. This can be useful, for example, if each of the mappers is looking for an optimal solution, and they all want to share the current best solution (to prune out bad solutions early). The shared value may be slightly out of date as it propagates, but this should still speed up the computation. In general, this is called the branch-and-bound approach.
We can share monotonically increasing numbers, but also arrays and dictionaries.
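To make the pruning idea concrete, here is a single-JVM Scala sketch of a shared best bound. The names are made up; the actual enhancement would propagate this value between mappers across machines:

    import java.util.concurrent.atomic.AtomicLong

    object SharedBoundSketch {
      // The shared, monotonically improving "current best" cost.
      val bestCost = new AtomicLong(Long.MaxValue)

      final case class Candidate(lowerBound: Long, fullCost: () => Long)

      // One "mapper" working through its partition of the search space.
      def explorePartition(partition: Seq[Candidate]): Unit =
        partition.foreach { c =>
          if (c.lowerBound < bestCost.get()) { // prune branches that cannot beat the shared best
            val cost = c.fullCost()            // expensive full evaluation
            var best = bestCost.get()
            while (cost < best && !bestCost.compareAndSet(best, cost))
              best = bestCost.get()            // publish an improved bound
          }
        }
    }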
We are also looking at machine learning applications where the mappers describe local natural-gradient information, and a new current best solution is then shared among all nodes.
What are some other good real-world applications of this kind of enhancement? What kinds of real, useful applications might benefit from a MapReduce computation with just a little bit of information sharing between mappers? What applications use MapReduce or Hadoop right now but are just a little too slow because of the independence restriction of the Map phase?
The benefit can be to either speed up the map phase, or improve the solution.
The enhancement I'm building allows some data to be shared between the mappers while they are computing.
Apache Giraph is based on Google Pregel, which is based on BSP (Bulk Synchronous Parallel) and is used for graph processing. In BSP, there is data sharing between the processes during the communication phase.
Giraph depends on Hadoop for its implementation. In general there is no communication between the mappers in MapReduce, but in Giraph the mappers communicate with each other during the communication phase of BSP.
You might be also interested in Apache Hama which implements BSP and can be used for more than graph processing.
There might be good reasons why mappers don't communicate in MapReduce. Have you considered these factors in your enhancement?
What are some other good real-world applications of this kind of enhancement?
Graph processing is one thing I can think of, similar to Giraph. Check out the different use cases for BSP; some might be applicable to this kind of enhancement. I am also very interested in what others have to say on this.
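To make the BSP idea above concrete, here is a toy single-machine Scala sketch of one superstep (illustrative only, not the Giraph or Hama API). It propagates minimum values through a graph: a compute phase, then a communication phase, with a barrier implied between supersteps:

    final case class Vertex(id: Int, var value: Double, neighbours: Seq[Int])

    // One BSP superstep: every vertex computes from its inbox, then sends messages.
    def superstep(vertices: Map[Int, Vertex],
                  inbox: Map[Int, Seq[Double]]): Map[Int, Seq[Double]] = {
      // Compute phase: each vertex folds its incoming messages into its value.
      vertices.values.foreach { v =>
        val msgs = inbox.getOrElse(v.id, Nil)
        if (msgs.nonEmpty) v.value = math.min(v.value, msgs.min)
      }
      // Communication phase: each vertex sends its value to its neighbours;
      // a barrier would separate this superstep from the next one.
      vertices.values.toSeq
        .flatMap(v => v.neighbours.map(n => n -> v.value))
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2) }
    }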

NodeJS vs Play Framework for large project

I am really torn between two different stacks with which to build a large application. On the one hand there is this option:
Node.js
Express
CoffeeScript
CoffeeKup
Mongoose/MongoDB
or
persistence.js/MySQL
Play Framework w/ Scala
Anorm w/ MySQL
or MongoDB
The Node.js path is appealing to me because I can write all of the server-side code, views and client-side code in CoffeeScript, which I already know. If I go down this road I am still not 100% sure which DB path I would take. Mongoose makes storing data quick and easy, but the lack of true relationships might be more difficult to work with given the data model I have in mind (very SQL-ish).
The Play Framework path is also appealing because I know the framework well when using Java, but I don't know much about Scala, so there would be a hit to productivity as I work through learning that language. The Anorm database access layer is appealing because I can write the SQL by hand, which I would prefer, and have the results mapped to objects automatically, which saves a lot of effort.
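For what it's worth, the Anorm part described above (hand-written SQL mapped to objects) tends to stay quite small. A sketch, with a made-up table and columns:

    import anorm._
    import anorm.SqlParser._

    case class Client(id: Long, name: String, email: String)

    // Parser that maps a result row onto the case class.
    val clientParser: RowParser[Client] =
      long("id") ~ str("name") ~ str("email") map {
        case id ~ name ~ email => Client(id, name, email)
      }

    // Hand-written SQL, results mapped automatically (needs an implicit JDBC connection,
    // e.g. from Play's DB.withConnection { implicit c => ... }).
    def findClients()(implicit conn: java.sql.Connection): List[Client] =
      SQL("SELECT id, name, email FROM clients ORDER BY name").as(clientParser.*)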
I keep leaning towards Node.js, but I'm not sold on the best DB access layer to use. Does anyone have experience with any of this and can share some insight?
The stack you choose should depend upon the needs of your application. Let's look at Play vs. Node for their strengths:
Node
Real-time applications (chat, feeds)
Event-driven architecture
Can perform client-server duties (e.g. serve files), but not well-suited for this
Database management, testing tools, etc, available as additional packages
Play!
Client-server applications (website, services)
Share-nothing architecture
Can perform real-time duties (e.g. WebSockets), but not well-suited for this
Database management (including migrations!), testing tools, etc, built into core
If your application more closely matches a traditional web-based model, Play is probably your best choice. If you need immediate feedback and real-time dynamic messaging, Node is the better choice.
For large traditional applications, seriously consider the Play! Framework because of the built-in unit and functional testing along with database migrations. If incorporated into the development process, these go a long way toward an end product that works as expected and is stable and error-free.
There are 10 major categories you should consider when comparing web frameworks:
Learn: getting started, ramp up, overall learning curve.
Develop: routing, templates, i18n, forms, json, xml, data store access, real time web.
Test: unit tests, functional tests, integration tests, test coverage.
Secure: CSRF, XSS, code injection, headers, authentication, security advisories.
Build: compile, run tests, preprocess static content (sass/less/CoffeeScript), package.
Deploy: hosting, monitoring, configuration.
Debug: step-by-step debugger, profilers, logging.
Scale: throughput, latency, concurrency.
Maintain: code reuse, stability, maturity, type safety, IDEs.
Share: open source activity, mailing lists, popularity, plugins, commercial support, jobs.
Check out my talk Node.js vs Play Framework for a detailed breakdown of how these two frameworks compare across these 10 dimensions.

Looking for a mature, scalable GraphDB with .NET or C++ binding

My basic requirements from a GraphDB:
Mature (production-ready)
Native .NET or C++ language binding
Horizontal scalability, meaning both:
Automated data redundancy and sharding
Distributed graph algorithms / query execution
Currently I disqualified the following:
InfiniteGraph: no C++ / .NET language binding
HyperGraphDB: no C++ / .NET language binding
Microsoft Trinity: Not mature
Neo4j: not distributed
I'm not sure about the scalability of the following:
Sparsity DEX
Franz Inc. AllegroGraph
Sones GraphDB
I found the available information about horizontal scalability capabilities quite general. I guess there are good reasons for this.
Any information would be appreciated.
Unfortunately, your basic requirements already go beyond today's general understanding of graphs, even in academia. No pure graph database listed will be able to satisfy all your needs. Distributed graph algorithms that are aware of large, distributed but interconnected graphs are still a big research issue. So for your application it might be best to find a well-matching graph database, graph processing stack or RDF store and implement the missing parts on your own.
If your application is mostly Online Transactional Graph Processing (OLTP, read/write heavy) with a focus on the vertices, and you can forgo the distributed algorithms for the moment, then use one of these:
Neo4j
OrientDB
DEX
HyperGraphDB
InfiniteGraph
InfoGrid
Microsoft Horton
If it is more Online Analytical Processing (OLAP, mostly read), still with a focus on the vertices, and distribution really matters, then:
Apache Hama (early stage project)
Microsoft Trinity (research project)
Golden Orb (good, but Java only)
Signal/Collect (http://www.ifi.uzh.ch/ddis/research/sc , but a research project)
Or, if the focus is more on the edges and on logical reasoning/pattern matching, and you need, or can at least live with, distribution at the edge level as in the Semantic Web, then use one of these RDF/triple/quad stores:
AllegroGraph (okay, they are a graphdb/rdf store hybrid ;)
Jena
Sesame
Stardog
Virtuoso
...and many more RDF stores
Good starting points might be DEX or Neo4j: if you're looking for a good and really fast graph DB kernel for C++, DEX might be best, but you would have to implement a lot of the networking and distribution on your own. Neo4j offers a lot of distribution and fault tolerance, but at the moment more at the vertex-sharding level, and its kernel is Java. For ideas and inspiration on implementing distributed graph algorithms, perhaps take a look at Golden Orb and Signal/Collect.
An alternative approach might be to start with AllegroGraph or Stardog. AllegroGraph in particular might be a bit tricky in the beginning until you adapt to its way of thinking. Stardog is still young and Java-based, but fast and already quite mature.