I need some help confirming my choice... and I would like to learn more if you can give me some information.
My storage backend is TitanDB with Cassandra.
I have a very large graph. My goal is to use MLlib on the graph later.
My first idea: use Titan with GraphX, but I did not find anything working or in progress... TinkerPop is not ready yet.
So I had a look at Giraph. Titan can communicate with Rexster from TinkerPop.
My question is:
What is the benefit of using Giraph? Gremlin seems to do the same thing and is distributed.
Thank you very much for explaining this to me. I don't think I really understand the difference between Gremlin and Giraph (or GraphX).
Have a nice day.
Gremlin is a graph traversal language, while Giraph and GraphX are graph processing systems.
I believe you're asking about the difference between GraphX or Giraph and Titan. To be more specific: why should you use a graph processing system when you already have your data in a graph database?
So it is essentially the difference between a graph database and a graph processing system.
A graph database is the right choice when your application requires frequently querying the data. E.g., for a Facebook-like application: given a user, return all of his/her friends. This is suitable for a graph database, and you can use Gremlin for the query.
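For that friends query, Gremlin is usually written in the Gremlin console (Groovy), but the same traversal can also be driven from code through the TinkerPop API. A minimal sketch in Scala against TinkerPop 3's bundled toy graph (the vertex name, the "knows" edge label, and the TinkerPop 3 API are assumptions; Titan versions from the Rexster era used TinkerPop 2, whose API differs):

```scala
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory
import scala.jdk.CollectionConverters._

object FriendsQuerySketch {
  def main(args: Array[String]): Unit = {
    // TinkerPop's bundled "modern" toy graph; with Titan you would open
    // the Cassandra-backed graph instead of an in-memory TinkerGraph.
    val graph = TinkerFactory.createModern()
    val g = graph.traversal()

    // "Given a user, return all of his/her friends": start at the vertex
    // named "marko" and follow its outgoing "knows" edges.
    val friends = g.V().has("name", "marko").out("knows").toList.asScala

    friends.foreach(v => println(v.value[String]("name")))
  }
}
```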
Now, if you want to compute a rank for each user in Facebook, you need to run the PageRank algorithm over the whole graph. In other words, PageRank processes your whole graph and returns you a map from user to rank. This is a suitable application for a graph processing system. Yes, you can write queries using the Gremlin framework to do this, but 1. it won't be as user-friendly as the underlying Pregel model used by Giraph or GraphX, and 2. it won't be efficient.
To summarize, it really depends on your application. If your application is query-like, don't bother loading and unloading data into a graph processing system. If your application is more like PageRank (which requires processing the whole graph) and you have a large graph (at least 1M edges), go for Giraph or GraphX.
Giraph and GraphX each have their own graph input format. You can dump your data into that form in a file and feed it into one of these systems, or you can write your own input format.
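For example, GraphX's simplest built-in input format is a plain text edge list ("srcId dstId" per line). A minimal sketch (the file path and local master are invented for illustration) that loads such a file and runs PageRank over the whole graph, as described above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))

    // The edge-list file holds one "srcId dstId" pair of numeric
    // vertex ids per line, e.g.:
    //   1 2
    //   2 3
    //   3 1
    val graph = GraphLoader.edgeListFile(sc, "/tmp/titan-export/edges.txt")

    // Run PageRank to convergence over the whole graph and keep the
    // resulting (vertexId, rank) pairs -- the "map" mentioned above.
    val ranks = graph.pageRank(0.0001).vertices

    ranks.top(10)(Ordering.by(_._2)).foreach(println)
    sc.stop()
  }
}
```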
P.S. It would be good to have an input format added to Giraph/GraphX that accepts data stored in Titan.
Interesting question. I am on the same track.
First, your question about MLlib. I assume that you mean Apache Spark MLlib, the machine learning (ML) implementation on top of Apache Spark. So my conclusion is: you want to run ML algorithms for purposes such as clustering and classification using the data in your Titan/Cassandra-based graph database.
Please note that you could also use graph processing algorithms like the PageRank mentioned by spidy to do things like clustering on top of your Titan/Cassandra graph database. In other words: you don't need ML to do clustering when your starting point is a graph database.
Apache Spark MLlib seems to be future-proof and widely supported; their most recent announcements were about new ML algorithms. Apache Mahout, another Apache ML project, is more mature in terms of the number of supported ML algorithms. Apache Mahout has also adopted Apache Spark as an execution backend, which is why I mention it in this post.
Apache Spark offers, in addition to in-memory computing, the mentioned MLlib for machine learning, Spark SQL (which is like Hive on Spark), GraphX (which is a graph processing system, as explained by spidy), and Spark Streaming for processing streaming data.
I consider Apache Spark itself a logical data layer, represented as RDDs (Resilient Distributed Datasets) on top of storage layers such as Cassandra, Hadoop/HCatalog, and HBase. There is a Spark connector for Cassandra. Note that RDDs are immutable: you cannot alter data using Spark; you can only process and analyze the data in Spark.
Regarding the Apache Spark logical storage layer, the RDD: you could compare an RDD to a view in the good old SQL days; RDDs give you a view on, for example, a table in Cassandra or HBase. Note also that Apache Spark offers APIs for three development environments: Scala, Java, and Python.
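To make that "RDD as a view on a Cassandra table" idea concrete, here is a minimal sketch using the DataStax Spark Cassandra Connector; the keyspace, table, and column names are invented for this example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext

object CassandraViewSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-rdd-sketch")
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)

    // An RDD as a "view" on a Cassandra table; keyspace, table and column
    // names are made up for illustration.
    val users = sc.cassandraTable("my_keyspace", "user_profiles")

    // The RDD is immutable: we only read, filter and count -- the
    // underlying Cassandra table is never modified from here.
    val over30 = users.filter(row => row.getInt("age") > 30).count()
    println(s"users over 30: $over30")

    sc.stop()
  }
}
```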
Apache Giraph is also a graph processing toolset, functionally equivalent to Apache Spark's GraphX. Apache Giraph uses Hadoop as its data storage layer. You are using Titan/Cassandra, so you will probably face data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with a question regarding ML using MLlib, and Apache Giraph is not an ML solution.
Your conclusion regarding Giraph and Gremlin is not correct: they are not the same, although both can work with a graph database. Giraph is a solution for graph processing, as spidy explained. Using Giraph you can execute graph analysis algorithms such as PageRank (e.g., who has the most followers), whilst Gremlin is meant for traversal, i.e., querying the graph database along the complex relationships (edges) between entities (vertices) and obtaining result sets of vertex and edge properties.
Related
I am exploring PyFlink and I wonder if it is possible to use PyFlink together with all the ML libs that ML engineers normally use: PyTorch, TensorFlow, scikit-learn, XGBoost, LightGBM, etc.
According to this SO thread, PySpark cannot use scikit-learn directly inside a UDF, because scikit-learn algorithms are not implemented to be distributed, while Spark runs in a distributed fashion.
Given that PyFlink is similar to PySpark, I guess the answer may be "no". But I would love to double-check, and to see what I would need to do to make PyFlink able to define UDFs using these ML libs.
Thanks for investigating PyFlink together with all these ML libs. IMO, you could refer to the flink-ai-extended project, which supports TensorFlow on Flink, PyTorch on Flink, etc.; its repository URL is https://github.com/alibaba/flink-ai-extended. Flink AI Extended is a project extending Flink to various machine learning scenarios and can be used together with PyFlink. You can also join the group by scanning the QR code in the README file.
I have to replace MapReduce code written in Pig and Java with Apache Spark & Scala as much as possible, and reuse or find an alternative where that is not possible.
I can find most of the Pig-to-Spark conversions. Now I have encountered Java Cascading code, of which I have minimal knowledge.
I have researched Cascading and understood how its plumbing works, but I cannot come to a conclusion on whether to replace it with Spark. Here are a few basic doubts:
Can Cascading Java code be completely rewritten in Apache Spark?
If possible, should Cascading code be replaced with Apache Spark? Is it more optimal and faster? (Considering that RAM is not an issue for RDDs.)
Scalding is a Scala library built on top of Cascading. Should this be used to convert the Java code to Scala, which would remove the Java source code dependency? Will this be more optimal?
Cascading works on MapReduce, which reads from I/O streams, whereas Spark reads from memory. Is this the only difference, or are there limitations or special features that can only be performed by one of them?
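For concreteness, the kind of conversion I have in mind looks roughly like this: a simple Cascading flow (source tap, tokenizing Each, GroupBy, Count, sink tap) rewritten as Spark RDD operations in Scala. This is only my own sketch with invented paths, not actual production code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cascading-to-spark-sketch").setMaster("local[*]"))

    // Roughly the Spark equivalent of a Cascading flow of
    // Hfs source -> Each(tokenize) -> GroupBy -> Every(Count) -> Hfs sink
    val counts = sc.textFile("hdfs:///input/docs")      // source tap
      .flatMap(_.toLowerCase.split("\\W+"))             // Each + tokenize function
      .filter(_.nonEmpty)
      .map(word => (word, 1L))
      .reduceByKey(_ + _)                               // GroupBy + Count aggregator

    counts.saveAsTextFile("hdfs:///output/wordcounts")  // sink tap
    sc.stop()
  }
}
```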
I am very new to the Big Data field, and not yet comfortable with the concepts and comparisons of all the Big Data terminology: Hadoop, Spark, MapReduce, Hive, Flink, etc. I took on this Big Data responsibility with my new job profile and with minimal senior knowledge/experience to lean on. Please provide an explanatory answer if possible. Thanks.
Is it possible to use Spark libraries such as Spark.ml within a Beam pipeline?
From my understanding, you write your pipeline in "Beam syntax" and let Beam execute it on Spark, using Spark as a runner.
Hence, I don't see how you could use spark.ml within Beam.
But maybe I am getting something wrong here?
Has someone already tried it? If not, do other ML libraries exist for native usage in Beam (apart from TensorFlow Transform)?
Many Thanks,
Jonathan
Apache Beam unifies stream and batch data processing. It is portable, meaning SDKs can be written in any language and pipelines can be executed on any data processing framework with sufficient capabilities (see: runners). ML is not its main concern, so its programming model does not define any unified API to work with ML.
But that does not mean you cannot use it with ML libraries to preprocess the data needed by your ML model for training or inference. It is well suited to do that for you. Beam comes with a set of built-in IOs, which may help you get data from many sources.
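As an illustration of that preprocessing role, here is a minimal sketch using Scio, Spotify's Scala API for Beam (Scio is not mentioned above and is just one way to write a Beam pipeline; the bucket paths are invented). The actual model training would still live outside the pipeline, e.g. in Spark MLlib:

```scala
import com.spotify.scio._

object PreprocessSketch {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    // Read raw records with one of Beam's built-in IOs, clean them up,
    // and write them back out for a separate training job to consume.
    sc.textFile("gs://my-bucket/raw/*.csv")
      .map(_.trim.toLowerCase)
      .filter(_.nonEmpty)
      .saveAsTextFile("gs://my-bucket/clean")

    sc.run()
  }
}
```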
I am working on implementing both individual and stereotype user models in a recommendation system. I came across Apache Mahout, but it seems that it only works with an individual user model.
My question is: how can I work with a stereotype user model in Apache Mahout Taste?
My understanding of a recommendation engine is that you have these core parameters:
Method of information acquisition (Implicit or explicit)
User model (Individual or stereotype)
Recommendation technique (Collaborative or content-based)
Taste is being deprecated. Mahout has undergone a major reboot and no longer accepts Hadoop MapReduce code. Many of the Hadoop MapReduce algorithms have been rewritten on the Mahout Samsara codebase, which virtualizes a great deal of linear-algebra-style operations so they can run on multiple compute engines. The most complete backend is Spark, which runs something like 10x faster than Hadoop MapReduce.
With that as a preamble: the new "recommender" implementations, although they include ALS, also have code for item and row similarity, which for recommender data means item and user similarity.
See the description of "spark-rowsimilarity" here: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html#2-spark-rowsimilarity
The example there is not exactly your case, but the same mechanism works to compute user similarities by feeding in user interaction vectors.
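To make the idea concrete, here is a rough plain-Spark sketch of what "user similarity from interaction vectors" means; this is not the Mahout Samsara API, and the interaction data is invented. Mahout's spark-rowsimilarity does the same thing at scale, with smarter LLR-based weighting:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UserSimilaritySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("user-similarity-sketch").setMaster("local[*]"))

    // (user, item) interaction pairs -- invented sample data.
    val interactions = sc.parallelize(Seq(
      ("alice", "item1"), ("alice", "item2"),
      ("bob",   "item1"), ("bob",   "item3"),
      ("carol", "item2"), ("carol", "item3")))

    // Each user's interaction "vector", here simply the set of items touched.
    val userVectors = interactions.groupByKey().mapValues(_.toSet)

    // Pairwise Jaccard similarity between users -- fine for a toy sketch.
    val similarities = userVectors.cartesian(userVectors)
      .filter { case ((u1, _), (u2, _)) => u1 < u2 }
      .map { case ((u1, items1), (u2, items2)) =>
        val jaccard = (items1 & items2).size.toDouble / (items1 | items2).size
        (u1, u2, jaccard)
      }

    similarities.collect().foreach(println)
    sc.stop()
  }
}
```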
Another way to do this is to put the user interaction vectors into a Lucene-based similarity engine such as Solr or Elasticsearch, then query with a specific user's data and you will get back similar users.
Are there any open-source graph databases around that are able to store binary data, scale horizontally, and optionally provide versioning of stored data?
I am overwhelmed by the sheer number of DBs out there, but none of them seems to have all the desired features.
Look at OrientDB: open source (Apache 2 license), very fast. It supports SQL and the Gremlin graph query language.
The binary storage, horizontal scale, and versioning requirements all sound like good candidates for a BigTable model like Cassandra or HBase. If you really need a graph database, those may not be a good fit, however. If you can expand a bit more on what the requirements are, we could make a better suggestion.
http://en.wikipedia.org/wiki/NoSQL
for example:
InfiniteGraph - High-performance, scalable, distributed Graph Database
For horizontal scaling, look at Titan (it uses Cassandra underneath): Titan homepage, Titan presentation video
For versioning your graph (if that's what you really need), you could try using Antiquity on top of a graph store.
From the Titan site:
Titan is a highly scalable graph database optimized for storing and querying massive-scale graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals.
In addition, Titan provides the following features:
Elastic and linear scalability for a growing data and user base.
Data distribution and replication for performance and fault tolerance.
Multi-datacenter high availability and hot backups.
Support for ACID and eventual consistency.
Support for various storage backends:
Apache Cassandra
Apache HBase
Oracle BerkeleyDB
Support for geo, numeric range, and full-text search via:
ElasticSearch
Apache Lucene
Native integration with the TinkerPop graph stack:
Gremlin graph query language
Frames object-to-graph mapper
Rexster graph server
Blueprints standard graph API
Open source with the liberal Apache 2 license.