Is it possible to use Spark libraries such as Spark.ml within a Beam pipeline?
From my understanding, you write your pipeline in “Beam syntax” and let Beam
execute it on Spark, using Spark as a runner.
Hence, I don't see how you could use spark.ml within Beam.
But maybe I'm getting something wrong here?
Has someone already tried this? If not, do other ML libraries exist for native use in Beam (apart from TensorFlow Transform)?
Many Thanks,
Jonathan
Apache Beam unifies stream and batch data processing. It's portable, meaning SDKs can be written in any language and pipelines can be executed on any data processing framework with enough capabilities (see: runners). ML is not its main concern, so its programming model does not define any unified API to work with ML.
But that does not mean you cannot use it with ML libraries to preprocess the data needed by your ML model for training or inference; it is well suited to do that for you. Beam comes with a set of built-in IOs, which may help you get data from many sources.
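For example, a pre-trained model from a single-node library such as scikit-learn can be applied element-wise inside a ParDo. A minimal sketch using the Beam Python SDK, assuming a model pickled at model.pkl (a hypothetical path):

```python
# Sketch: applying a pre-trained scikit-learn model inside a Beam pipeline.
# 'model.pkl' is a hypothetical path to a pickled, already-trained model.
import pickle

import apache_beam as beam


class PredictDoFn(beam.DoFn):
    def __init__(self, model_path):
        self._model_path = model_path
        self._model = None

    def setup(self):
        # setup() runs once per DoFn instance, so the model is not
        # re-loaded for every element.
        with open(self._model_path, 'rb') as f:
            self._model = pickle.load(f)

    def process(self, features):
        # Per-element inference; Beam distributes the data, not the model.
        yield self._model.predict([features])[0]


with beam.Pipeline() as p:
    (p
     | beam.Create([[1.0, 2.0], [3.0, 4.0]])  # toy feature vectors
     | beam.ParDo(PredictDoFn('model.pkl'))
     | beam.Map(print))
```

This pattern runs unchanged on the Spark runner, because the DoFn is plain Python; what you cannot do is call spark.ml's distributed training from inside a Beam pipeline.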
Related
I am exploring PyFlink and I wonder if it is possible to use PyFlink together with all the ML libs that ML engineers normally use: PyTorch, TensorFlow, scikit-learn, XGBoost, LightGBM, etc.
According to this SO thread, PySpark cannot use scikit-learn directly inside a UDF, because scikit-learn algorithms are not implemented to be distributed, while Spark runs distributed.
Given PyFlink is similar to PySpark, I guess the answer may be "no". But I would love to double-check, and to see what I need to do to make PyFlink able to define UDFs using these ML libs.
Thanks for investigating PyFlink together with all these ML libs. IMO, you could refer to the flink-ai-extended project, which supports TensorFlow on Flink, PyTorch on Flink, etc.; its repository URL is https://github.com/alibaba/flink-ai-extended. Flink AI Extended is a project extending Flink to various machine learning scenarios and can be used together with PyFlink. You can also join the group by scanning the QR code in the README file.
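For plain per-row inference (as opposed to distributed training), a PyFlink scalar UDF can use such a library directly, much like the PySpark case above. A minimal sketch, assuming a pickled scikit-learn model at model.pkl (a hypothetical path) that is available on the workers:

```python
# Sketch: per-row inference with a scikit-learn model in a PyFlink UDF.
# 'model.pkl' is a hypothetical path to a pickled, pre-trained model.
import pickle

from pyflink.table import DataTypes
from pyflink.table.udf import udf

_model = None


def _get_model():
    # Lazily load the model once per Python worker process.
    global _model
    if _model is None:
        with open('model.pkl', 'rb') as f:
            _model = pickle.load(f)
    return _model


@udf(input_types=[DataTypes.DOUBLE(), DataTypes.DOUBLE()],
     result_type=DataTypes.DOUBLE())
def predict(f1, f2):
    # scikit-learn itself stays single-node; Flink only parallelizes
    # the rows the model is applied to.
    return float(_get_model().predict([[f1, f2]])[0])
```

Distributed training is the part that needs a dedicated bridge such as flink-ai-extended.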
I have to replace MapReduce code written in Pig and Java with Apache Spark & Scala as much as possible, and reuse or find an alternative where that is not possible.
I can find most of the Pig-to-Spark conversions. Now I have encountered Java Cascading code, of which I have minimal knowledge.
I have researched Cascading and understood how its plumbing works, but I cannot come to a conclusion on whether to replace it with Spark. Here are a few basic doubts:
Can Cascading Java code be completely rewritten in Apache Spark?
If possible, should Cascading code be replaced with Apache Spark? Is it more optimal and faster? (Assuming RAM is not an issue for RDDs.)
Scalding is a Scala library built on top of Cascading. Should it be used to convert the Java code to Scala, removing the Java source-code dependency? Would this be more optimal?
Cascading works on MapReduce, which reads from I/O streams, whereas Spark reads from memory. Is this the only difference, or are there limitations or special features that can only be performed by one or the other?
I am very new to the Big Data space, and still immature with the concepts and comparisons of all the Big Data terminologies: Hadoop, Spark, MapReduce, Hive, Flink, etc. I took on this Big Data responsibility with my new job profile and minimal senior knowledge/experience to draw on. Please provide an explanatory answer if possible. Thanks.
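To give a feel for how a typical Cascading pipe assembly maps onto Spark, here is a minimal word-count sketch, shown in PySpark for brevity (the Scala RDD API is essentially line-for-line the same); the HDFS paths are hypothetical placeholders:

```python
# Sketch: a classic Cascading flow (source Tap -> Each/tokenizer ->
# GroupBy -> Count -> sink Tap) expressed with Spark's RDD API.
from pyspark import SparkContext

sc = SparkContext(appName="cascading-port-sketch")

counts = (sc.textFile("hdfs:///input/docs")        # source Tap
            .flatMap(lambda line: line.split())    # Each + tokenizer
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))      # GroupBy + Count

counts.saveAsTextFile("hdfs:///output/wordcount")  # sink Tap
sc.stop()
```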
I'm currently building a streaming module in my system that reads from and writes to Kafka. It's done using Spark Streaming. My need is to structure the code in a clean and modular way: I have several chained steps and I would like to compose them. Is there a library that already provides the right abstractions for Spark Streaming? Otherwise, I would like to read something on a design pattern that solves this problem, and I could write the abstractions myself.
(Also, is it just me, or are good resources about idiomatic design patterns in Scala hard to find on the internet?)
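One simple pattern, sketched here in PySpark (in Scala the same idea is composing DStream => DStream functions with andThen): model each step as a function from DStream to DStream and compose them. The socket source is a stand-in for the Kafka input, and all step names are made up for illustration:

```python
# Sketch: composable Spark Streaming steps as DStream -> DStream functions.
from functools import reduce

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="composable-streaming-sketch")
ssc = StreamingContext(sc, 5)  # 5-second batches


def parse(stream):
    return stream.map(lambda line: line.split(','))


def keep_valid(stream):
    return stream.filter(lambda fields: len(fields) == 3)


def enrich(stream):
    return stream.map(lambda fields: fields + ['processed'])


def compose(*steps):
    # Chain the steps left to right into a single DStream -> DStream function.
    return lambda stream: reduce(lambda s, step: step(s), steps, stream)


lines = ssc.socketTextStream("localhost", 9999)  # stand-in for the Kafka source
compose(parse, keep_valid, enrich)(lines).pprint()

ssc.start()
ssc.awaitTermination()
```

Each step stays independently testable, and a new step drops into the compose() call without touching the others.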
I'm writing a Spark application and would like to use algorithms in MLlib. In the API doc I found two different classes for the same algorithm.
For example, there is a LogisticRegression in org.apache.spark.ml.classification and also a LogisticRegressionWithSGD in org.apache.spark.mllib.classification.
The only difference I can find is that the one in org.apache.spark.ml inherits from Estimator and can be used in cross-validation. I was quite confused that they are placed in different packages. Does anyone know the reason for it? Thanks!
It's this JIRA ticket.
And from the design doc:
MLlib now covers a basic selection of machine learning algorithms, e.g., logistic regression, decision trees, alternating least squares, and k-means. The current set of APIs contains several design flaws that prevent us from moving forward to:
- address practical machine learning pipelines,
- make MLlib itself a scalable project.
The new set of APIs will live under org.apache.spark.ml, and o.a.s.mllib will be deprecated once we migrate all features to o.a.s.ml.
The Spark MLlib guide says:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.
and
Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
I think the doc explains it very well.
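To make the difference concrete, here is a minimal PySpark sketch of logistic regression through both APIs (assuming an existing SparkContext sc and SparkSession spark, and Spark 2.x naming):

```python
# spark.mllib: the original RDD-based API; the algorithm is invoked directly.
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

rdd = sc.parallelize([LabeledPoint(1.0, [1.0, 0.0]),
                      LabeledPoint(0.0, [0.0, 1.0])])
old_model = LogisticRegressionWithSGD.train(rdd, iterations=10)

# spark.ml: a DataFrame-based Estimator; only this version plugs into
# Pipeline and CrossValidator.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(1.0, Vectors.dense([1.0, 0.0])),
     (0.0, Vectors.dense([0.0, 1.0]))],
    ["label", "features"])
new_model = LogisticRegression(maxIter=10).fit(df)
```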
I need some help to confirm my choice... and to learn more, if you can give me some information.
My storage database is TitanDb with Cassandra.
I have a very large graph. My goal is to use MLlib on the graph later.
My first idea: use Titan with GraphX, but I did not find anything except work in progress... TinkerPop is not ready yet.
So I had a look at Giraph. Titan can communicate with Rexster from TinkerPop.
My question is:
What are the benefits of using Giraph? Gremlin seems to do the same thing and is distributed.
Thank you very much for explaining this to me. I think I don't really understand the difference between Gremlin and Giraph (or GraphX).
Have a nice day.
Gremlin is a graph traversal language, while Giraph or GraphX is a graph processing system.
I believe you're asking for the difference between GraphX or Giraph and Titan. To be more specific: why should you use a graph processing system when you already have your data in a graph database?
So it is essentially the difference between a graph database and a graph processing system.
A graph database is your friend when your application requires frequently querying the data, e.g., for a Facebook-like application: given a user, return all of his/her friends. This is a suitable job for a graph database, and you can use Gremlin to query it.
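As a sketch of that query style, using the modern gremlin-python client against a hypothetical endpoint (Titan-era deployments would go through Rexster instead), with a made-up schema:

```python
# Sketch: the "given a user, return all friends" query as a Gremlin traversal.
# The server URL, vertex label, and property/edge names are hypothetical.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

friends = (g.V().has('user', 'name', 'alice')  # start from one vertex
            .out('friend')                     # follow 'friend' edges
            .values('name')
            .toList())
print(friends)
conn.close()
```

The traversal touches only the neighborhood of one vertex, which is exactly what a graph database is optimized for.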
Now, if you want to compute the rank of each user on Facebook, you need to run the PageRank algorithm over the whole graph. In other words, PageRank processes your whole graph and returns you a map from user to rank. This is a suitable application for a graph processing system. Yes, you can write queries using the Gremlin framework to do this, but 1. it won't be as user-friendly as the underlying Pregel model used by Giraph or GraphX, and 2. it won't be efficient.
To summarize, it really depends on your application. If your application is query-like, don't bother loading and unloading your data into any graph processing system. If your application is more like PageRank (which requires processing the whole graph) and you have a large graph (at least 1M edges), go for Giraph or GraphX.
Giraph and GraphX each have their own graph input format. You can dump your data into that form in a file and feed it into one of these systems, or you can write your own input format.
P.S. It'd be good to have an input format added to Giraph/GraphX that accepts data stored in Titan.
Interesting question. I am on the same track.
First, your question about MLlib. I assume that you mean Apache Spark MLlib, the machine learning (ML) implementation on top of Apache Spark. So my conclusion is: you want to run ML algorithms, for purposes such as clustering and classification, on the data in your Titan/Cassandra-based graph database.
Please note that you could also use graph processing algorithms like PageRank, mentioned by spidy, to do things like clustering on top of your Titan/Cassandra graph database. In other words: you don't need ML to do clustering when your starting point is a graph database.
Apache Spark MLlib seems to be future-proof and widely supported; its most recent announcements were about new ML algorithms, although Apache Mahout, another Apache ML project, is more mature in terms of the number of supported ML algorithms. Apache Mahout has also adopted Apache Spark as an execution backend, which is why I mention it in this post.
Apache Spark offers, in addition to in-memory computing, the mentioned MLlib for machine learning, Spark SQL (which is like Hive on Spark), GraphX (a graph processing system, as explained by spidy) and Spark Streaming for processing streaming data.
I consider Apache Spark itself a logical data layer, represented as RDDs (Resilient Distributed Datasets) on top of storage layers such as Cassandra, Hadoop/HCatalog and HBase. Apache Spark offers a connector to Cassandra. Note that RDDs are immutable: you cannot alter data using Spark; you can only process and analyze the data in Spark.
Regarding the Apache Spark logical storage layer: you could compare an RDD to a view from the good old SQL days; RDDs give you a view on, for example, a table in Cassandra or HBase. Note also that Apache Spark offers an API for three development environments: Scala, Java and Python.
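A tiny PySpark sketch of that immutability (assuming an existing SparkContext sc):

```python
# Sketch: transformations never alter an RDD; they always return a new one.
rdd = sc.parallelize([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)  # 'rdd' itself is unchanged

print(rdd.collect())      # [1, 2, 3]
print(doubled.collect())  # [2, 4, 6]
```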
Apache Giraph is also a graph processing toolset, functionally equivalent to Apache Spark's GraphX. Apache Giraph uses Hadoop as the data storage layer. You are using Titan/Cassandra, so you will probably face data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with a question regarding ML using MLlib, and Apache Giraph is not an ML solution.
Your conclusion regarding Giraph and Gremlin is not correct: they are not the same, although both work with graphs. Giraph is a solution for graph processing, as spidy explained. Using Giraph you can execute graph analysis algorithms such as PageRank, e.g., who has the most followers, whilst Gremlin is meant for traversing, e.g., querying the graph database along the complex relationships (edges) between entities (vertices) and obtaining result sets of vertex and edge properties.