In Spark Scala, is there a way to create a local dataframe on the executors, like pandas in PySpark? Inside a mapPartitions call I want to convert the iterator to a local dataframe (like a pandas dataframe in Python) so that dataframe features can be used instead of hand-coding them on iterators.
That is not possible.
A DataFrame is a distributed collection in Spark, and DataFrames can only be created on the driver node (i.e. outside of transformations/actions).
Additionally, in Spark you cannot execute operations on RDDs/DataFrames/Datasets inside other operations; e.g. the following code will produce errors:
rdd.map(v => rdd1.filter(e => e == v))
DataFrames and Datasets also have RDDs underneath, so the same behavior applies there.
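That said, a common workaround inside mapPartitions is to materialise the partition's iterator into a local Scala collection and use ordinary collection methods on it (assuming each partition fits in executor memory). A minimal sketch, where the input path, the column name "someColumn", and the aggregation are assumptions for illustration only:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/path/to/data")   // hypothetical input

val perPartitionCounts = df
  .mapPartitions { iter =>
    // Materialise the partition locally and use plain Scala collection methods
    // instead of trying to build a DataFrame on the executor.
    iter.toSeq
      .groupBy(_.getAs[String]("someColumn"))
      .iterator
      .map { case (key, group) => (key, group.size) }
  }
  .toDF("someColumn", "countInPartition")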
I'm quite new to PySpark and I'm confused about the difference between a Spark DataFrame (created, for example, from an RDD) and a pandas-on-Spark DataFrame.
Are those the same object? Looking at the type, it seems they are different classes.
What's the core difference, if any? (I know that with a pandas-on-Spark DataFrame you can use almost the same syntax as pandas on a distributed DataFrame, but I'm wondering whether that is the only difference.)
Thanks
Answering directly:
Are those the same object? Looking at the type it seems they are different classes.
No, they are completely different objects (classes).
What's the core difference, if any?
A PySpark DataFrame is an object from the PySpark library, with its own API, and it can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
A pandas-on-Spark DataFrame and a pandas DataFrame are similar. However, the former is distributed while the latter lives on a single machine. When converting between them, the data is transferred between the machines of the cluster and the single client machine.
A pandas DataFrame is an object from the pandas library, also with its own API, and it too can be constructed from a wide range of sources.
Also, I recommend checking this documentation about Pandas on Spark
Is there another library that has a 2D matrix (table) structure like a Spark dataframe and allows SQL queries on it?
I don't want to use Spark because I am not going to run it in parallel, and it is heavy.
How can I write a Spark dataframe to DynamoDB using emr-dynamodb-connector and Python?
I can't find how to create a new JobConf with PySpark.
I am new to Scala, and I have to use Scala and Spark's SQL, MLlib, and GraphX in order to perform some analysis on a huge data set. The analyses I want to do are:
Customer lifetime value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located in a Hadoop cluster.
My question is about the optimal approach to accessing the data and performing the above calculations:
Should I load the data from the CSV file into a dataframe and work on the dataframe? Or
Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? Or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
DataFrames give you SQL-like syntax to work with the data, whereas RDDs give you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark system will optimise your queries, just like SQL query optimisation; this is not available for RDDs.
As you are new to Scala, it is highly recommended to start with the DataFrame API and pick up the RDD API later as needed.
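For example, a small illustrative sketch (the spark SparkSession, the Txn case class, and its fields are assumptions, not taken from the question) showing the same aggregation in both styles:

case class Txn(customerId: String, amount: Double)

val txns = Seq(Txn("c1", 10.0), Txn("c1", 2.5), Txn("c2", 5.0))

// DataFrame / SQL-like style: Catalyst can optimise this query.
val txnDf = spark.createDataFrame(txns)
val totalsDf = txnDf.groupBy("customerId").sum("amount")

// RDD / Scala-collection style: hand-written, no query optimisation.
val txnRdd = spark.sparkContext.parallelize(txns)
val totalsRdd = txnRdd
  .map(t => (t.customerId, t.amount))
  .reduceByKey(_ + _)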
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It automatically infers data types. If the file has a header it can automatically use that for the schema; otherwise you can construct a schema using StructType.
https://github.com/databricks/spark-csv
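A minimal sketch of how that reader is typically used on Spark 1.x (the package version and the HDFS path are assumptions):

// spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // use the first line as column names
  .option("inferSchema", "true")   // infer column types automatically
  .load("hdfs:///path/to/transactions.csv")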
Update:
If you are using Spark 2.0, it supports the CSV data source by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for usage details:
https://github.com/databricks/spark-csv/issues/367
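A minimal sketch for Spark 2.0+, where CSV is a built-in source (the HDFS path and the field names in the explicit schema are assumptions):

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///path/to/transactions.csv")

// Or supply an explicit schema instead of inferring it:
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

val schema = StructType(Seq(
  StructField("customerId", StringType),
  StructField("amount", DoubleType),
  StructField("txnDate", StringType)
))
val typedDf = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("hdfs:///path/to/transactions.csv")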
I am new to Apache Ignite as well as to Spark... Can anyone help with an example of converting an Ignite RDD to a Spark RDD in Scala?
Update:
Use case:
I will receive dataframes of HBase tables. I will execute some logic to build a report out of each one and save it to an Ignite RDD, and the same Ignite RDD will be updated for each table. Once all the tables are processed, the final Ignite RDD will be converted to a Spark or Java RDD and a last rule will be executed on that RDD. To run that rule, I need the RDD to be converted into a dataframe, and that dataframe will be saved as a final report in Hive.
What do you mean by converting? An IgniteRDD is a Spark RDD; technically it's a subtype of the RDD trait.
Spark internally has many types of RDDs: MappedRDD, HadoopRDD, LogicalRDD, etc. IgniteRDD is only one possible type of RDD, and after some transformations it will also be wrapped by another RDD type, e.g. MappedRDD.
You can also write your own RDD :)
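Just to illustrate the subtyping (the cache name and the key/value types are assumptions), an IgniteRDD can be assigned or passed anywhere a Spark RDD is expected:

import org.apache.spark.rdd.RDD

val igniteRdd = igniteContext.fromCache[String, String]("partitioned")
val asSparkRdd: RDD[(String, String)] = igniteRdd   // compiles because IgniteRDD[K, V] extends RDD[(K, V)]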
Example from documentation:
val cache = igniteContext.fromCache("partitioned")
val result = cache.filter(_._2.contains("Ignite")).collect()
After filtering the cache RDD, the type will be different: the IgniteRDD will be wrapped in a FilteredRDD. However, it is still an implementation of the RDD trait.
Update after comment:
First, have you imported the implicits? import spark.implicits._
In SparkSession you've got various createDataFrame methods that will convert your RDD into a DataFrame / Dataset.
If it still doesn't help, please provide the error that you're getting while creating the DataFrame, together with a code example.
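For reference, a minimal sketch of the usual pattern (the ReportRow case class, its fields, the mapping from the cache values, and the Hive table name are all assumptions about your data, not a known API):

import spark.implicits._   // needed for rdd.toDF and the built-in encoders

case class ReportRow(key: String, value: Double)

// Map the Ignite pair RDD into a case-class RDD, then build a DataFrame from it.
val reportRdd = igniteRdd.map { case (k, v) => ReportRow(k.toString, v.toString.toDouble) }
val reportDf  = spark.createDataFrame(reportRdd)   // or equivalently: reportRdd.toDF()

// Save the final report as a Hive table (assumes the SparkSession was built
// with .enableHiveSupport()).
reportDf.write.mode("overwrite").saveAsTable("final_report")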