Everything in Spark is created in the form of an RDD (key-value pairs). Is this necessary? Which types of analytics can be performed on an RDD dataset?
Please provide an example and the uses of converting data to an RDD.
Thanks,
Aditya
Spark is used to solve problems that involve huge datasets where data transformation is required. Spark is built with a functional programming language (Scala) rather than an imperative language (C or C++), because functional programming makes it possible to split work lazily across multiple nodes in a cluster, which imperative programming paradigms cannot do without depending on an external data store for distributed algorithms to work. Spark also ships many libraries that run distributed machine learning algorithms, which is not feasible with standalone R or Python scripts.
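To address the key-value part of the question directly: an RDD does not have to hold key-value pairs; pairs only matter once you use pair operations such as reduceByKey. Here is a minimal, self-contained Scala sketch (the data and names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session just for illustration.
    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD does not have to hold key-value pairs; here it holds plain strings.
    val lines = sc.parallelize(Seq("spark makes rdds", "rdds are distributed", "spark is lazy"))

    // Key-value pairs only come into play for pair operations such as reduceByKey.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))   // lazy transformation
      .map(word => (word, 1))     // now a pair RDD
      .reduceByKey(_ + _)         // shuffles and aggregates per key

    wordCounts.collect().foreach(println)  // action: triggers execution

    spark.stop()
  }
}
```

Everything up to collect() is a lazy transformation; only the action at the end triggers distributed execution.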
The fundamental problem is that I am attempting to use Spark to generate data and then work with that data internally. I.e., I have a program that does a thing and generates "rows" of data - can I leverage Spark to parallelize that work across the worker nodes and have each of them contribute back to the underlying store?
The reason I want to use Spark is that it seems to be a very popular framework, and I know this request is a little outside the defined range of functions Spark should offer. However, the alternatives of MapReduce or Storm are dreadfully old and there isn't much support for them anymore.
I have a feeling there has to be a way to do this, has anyone tried to utilize Spark in this way?
To be honest, I don't think adopting Spark just because it's popular is the right decision. Also, it's not obvious from the question why this problem would require a framework for distributed data processing (that comes along with a significant coordination overhead).
The key consideration should be how you are going to process the generated data in the next step. If it's all about dumping it immediately into a data store I would really discourage using Spark, especially if you don't have the necessary infrastructure (Spark cluster) at hand.
Instead, write a simple program that generates the data. Then run it on a modern resource scheduler such as Kubernetes and scale it out by running as many instances of it as necessary.
If you absolutely want to use Spark for this (and unnecessarily burn resources), it's not difficult. Create a distributed "seed" dataset / stream and simply flatMap that. Using flatMap you can generate as many new rows for each seed input row as you like (obviously limited by the available memory).
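As a rough sketch of that approach in Scala (generateRows and the output path are placeholders for your actual generator and store, so treat this as an assumption-laden outline rather than a recipe):

```scala
import org.apache.spark.sql.SparkSession

object GenerateRows {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("generate-rows").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Distributed "seed" dataset: one element per unit of generation work.
    val seeds = sc.parallelize(1 to 1000, numSlices = 100)

    // Stand-in for "the program that does a thing": each seed yields some rows.
    def generateRows(seed: Int): Iterator[String] =
      (1 to 10).iterator.map(i => s"row-$seed-$i")

    // flatMap lets each seed produce any number of output rows on the workers.
    val rows = seeds.flatMap(generateRows)

    // Write the generated rows back to a shared store (path is illustrative).
    rows.saveAsTextFile("hdfs:///tmp/generated-rows")

    spark.stop()
  }
}
```

The number of seed partitions controls how widely the generation work is spread across the workers.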
I have to port MapReduce code written in Pig and Java to Apache Spark & Scala as much as possible, and reuse or find an alternative where that is not possible.
I can find Spark equivalents for most of the Pig code. Now I have encountered Java Cascading code, of which I have minimal knowledge.
I have researched Cascading and understood how its plumbing works, but I cannot come to a conclusion on whether to replace it with Spark. Here are a few basic doubts:
Can Cascading Java code be completely rewritten in Apache Spark?
If possible, should the Cascading code be replaced with Apache Spark? Is it more optimal and faster? (Assuming RAM is not an issue for RDDs.)
Scalding is a Scala library built on top of Cascading. Should it be used to convert the Java code to Scala, removing the Java source-code dependency? Would this be more optimal?
Cascading runs on MapReduce, which reads from I/O streams, whereas Spark reads from memory. Is this the only difference, or are there limitations or special features that only one of them can provide?
I am very new to the Big Data space, and not yet confident with the concepts and comparisons of all the Big Data terminology: Hadoop, Spark, MapReduce, Hive, Flink, etc. I took on these Big Data responsibilities with my new job profile and with minimal senior knowledge/experience to draw on. Please provide an explanatory answer if possible. Thanks.
Hello everyone,
I have started to learn about the Apache Spark architecture, and I understand how the data flow works at a high level.
What I have learned is that Spark jobs consist of stages, which contain tasks that operate on RDDs, and that RDDs are created with lazy transformations starting from the Spark console (correct me if I'm wrong).
What I didn't get:
There are other types of data structures in Spark, DataFrame and Dataset, and there are functions to manipulate them,
so what is the relation between those functions and the tasks applied on RDDs?
Coding in Scala gives me operations on RDDs, which makes sense as far as I know, but there are also other data structures that I can operate on and manipulate, like List, Stream, Vector, etc., so my question is:
how can Spark execute these operations if they are not applied on RDDs?
I have estimates of the time complexity of each operation on every Scala data structure from the official documentation, but I can't find time-complexity estimates for operations on RDDs, for example count() or reduceByKey() applied to an RDD.
Why can't we evaluate the complexity of a Spark application exactly, and is it possible to evaluate the complexities of elementary tasks?
More formally, what are RDDs and what is the relation between them and everything else in Spark?
If someone can clear up this confusion for me, I'd be grateful.
so what is the relation between those functions and the tasks applied on RDDs?
DataFrames, Datasets, and RDDs are three APIs offered by Spark. Check out this link.
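As a small illustration of that relation (the data and column names below are invented), DataFrame and Dataset operations are compiled by the Catalyst optimizer into a physical plan that is still executed as stages of tasks over RDDs, and you can inspect both sides from Scala:

```scala
import org.apache.spark.sql.SparkSession

object ApiRelation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("api-relation").master("local[*]").getOrCreate()
    import spark.implicits._

    // A DataFrame built from an in-memory sequence (names are illustrative).
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    val grouped = df.groupBy("key").sum("value")

    // The Catalyst optimizer turns these untyped operations into a physical
    // plan, which is ultimately executed as stages of tasks over RDDs.
    grouped.explain()

    // You can even reach the underlying RDD of rows directly.
    println(grouped.rdd.toDebugString)

    spark.stop()
  }
}
```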
How can Spark execute these operations if they are not applied on RDDs?
RDDs are Spark's fundamental data structure; the actions and transformations defined by Spark can be applied to them. Within an RDD action or transformation, we can apply native Scala operations. Each Spark API has its own set of operations. Read the link mentioned previously to get a better idea of how parallelism is achieved in these operations.
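A short, hypothetical Scala sketch of what "native Scala operations inside an RDD transformation" means in practice (the sample data is made up): Spark distributes the transformation as tasks, and inside each task's closure you are free to use ordinary Scala collections:

```scala
import org.apache.spark.sql.SparkSession

object NativeOpsInsideRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("native-ops").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("banana", "apple", "cherry", "apple"))

    // The closure passed to mapPartitions runs inside a single task; within it
    // we are free to use ordinary Scala collections (List, Vector, Map, ...).
    val partitionSummaries = words.mapPartitions { iter =>
      val local: List[String] = iter.toList       // plain Scala List
      val sorted = local.sorted                   // ordinary Scala operation
      Iterator(sorted.mkString(","))              // one summary string per partition
    }

    partitionSummaries.collect().foreach(println)

    spark.stop()
  }
}
```

The Scala collection work happens locally inside each task; only the RDD-level transformation itself is distributed and scheduled by Spark.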
Why can't we evaluate the complexity of a Spark application exactly, and is it possible to evaluate the complexities of elementary tasks?
This paper explains MapReduce complexity:
https://web.stanford.edu/~ashishg/papers/mapreducecomplexity.pdf
I'll try my best to describe my situation and then I'm hoping another user on this site can tell me if the course I'm taking makes sense or if I need to reevaluate my approach/options.
Background:
I use PySpark since I am most familiar with Python, as opposed to Scala, Java, or R. I have a Spark DataFrame that was constructed from a Hive table using pyspark.sql to query the table. In this DataFrame I have many different 'files'. Each file consists of time-series data. I need to perform a rolling regression on a subset of the data, across the entire range of time values for each 'file'. After doing a good bit of research I was planning on creating a window object, making a UDF that specified how I wanted my linear regression to occur (using the Spark ML linear regression inside the function), and then returning the data to the DataFrame. This would happen inside the context of a .withColumn() operation. This made sense and I feel like this approach is correct. What I discovered is that currently PySpark does not support the ability to create a UDAF (see the linked JIRA). So here is what I'm currently considering doing.
It is shown here and here that it is possible to create a UDAF in Scala and then reference said function within the context of PySpark. Furthermore, it is shown here that a UDAF (written in Scala) is able to take multiple input columns (a necessary feature, since I will be doing multiple linear regression with 3 parameters). What I am unsure of is whether my UDAF can use org.apache.spark.ml.regression, which I plan on using for my regression. If this can't be done, I could manually execute the operation using matrices (I believe, if Scala allows that). I have virtually no experience using Scala but am certainly motivated to learn enough to write this one function.
I'm wondering if anyone has insight or suggestions about this task ahead. I feel like after the research I've done, this is both possible and the appropriate course of action to take. However, I'm scared of burning a ton of time trying to make this work when it is fundamentally impossible or way more difficult than I could imagine.
Thanks for your insight.
After doing a good bit of research I was planning on creating a window object, making a UDF that specified how I wanted my linear regression to occur (using the Spark ML linear regression inside the function)
This cannot work, regardless of whether PySpark supports UDAFs or not. You are not allowed to use distributed algorithms from inside a UDF / UDAF.
The question is a bit vague, and it is not clear how much data you have, but I'd consider using plain RDDs with scikit-learn (or a similar tool), or implementing the whole thing from scratch.
I read
What is the difference between Spark DataSet and RDD
Difference between DataSet API and DataFrame
http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
In Spark 1.6 a Dataset seems to be more like an improved DataFrame ("Conceptually Spark DataSet is just a DataFrame with additional type safety"). In Spark 2.0 it seems a lot more like an improved RDD. The former has a relational model, the latter is more like a list. For Spark 1.6 it was said that Datasets are an extension of DataFrames, while in Spark 2.0 a DataFrame is just a Dataset of type Row, making DataFrames a special case of Datasets. Now I am a little confused. Are Datasets in Spark 2.0 conceptually more like RDDs or like DataFrames? What is the conceptual difference between an RDD and a Dataset in Spark 2.0?
I think they are very similar from a user's perspective, but they are implemented quite differently under the hood. The Dataset API now seems almost as flexible as the RDD API, but it adds the whole optimization story (Catalyst & Tungsten).
Citing from http://www.agildata.com/apache-spark-2-0-api-improvements-rdd-dataframe-dataset-sql/
RDDs can be used with any Java or Scala class and operate by manipulating those objects directly with all of the associated costs of object creation, serialization and garbage collection.

Datasets are limited to classes that implement the Scala Product trait, such as case classes. There is a very good reason for this limitation. Datasets store data in an optimized binary format, often in off-heap memory, to avoid the costs of deserialization and garbage collection. Even though it feels like you are coding against regular objects, Spark is really generating its own optimized byte-code for accessing the data directly.
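To make the quoted distinction concrete, here is a minimal Scala sketch (the case class and data are invented for illustration): the same case class can back both an RDD of plain JVM objects and a Dataset stored in Spark's optimized binary format, with a DataFrame being just the untyped Dataset[Row] special case:

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataset {
  // A case class implements Product, so it can back a typed Dataset.
  final case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-dataset").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bob", 19), Person("Cid", 52))

    // RDD: plain JVM objects, no Catalyst/Tungsten optimization.
    val rdd = spark.sparkContext.parallelize(people)
    val adultsRdd = rdd.filter(_.age >= 21)

    // Dataset: same typed, lambda-style API, but the data lives in Spark's
    // optimized binary (Tungsten) format and queries go through Catalyst.
    val ds = people.toDS()
    val adultsDs = ds.filter(_.age >= 21)

    // A DataFrame is just Dataset[Row]: the untyped special case.
    val df = ds.toDF()

    println(s"${adultsRdd.count()} ${adultsDs.count()} ${df.count()}")
    spark.stop()
  }
}
```

From the user's side the RDD and Dataset code look nearly identical; the difference is in how the data is stored and how the query is planned under the hood.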