spark Dataframe vs pandas-on-spark Dataframe - pyspark

I'm quite new to PySpark, but I'm confused about the difference between a Spark DataFrame (created, for example, from an RDD) and a pandas-on-Spark DataFrame.
Are they the same object? Looking at the type, it seems they are different classes.
What's the core difference, if any? (I know that with a pandas-on-Spark DataFrame you can use almost the same syntax as pandas on a distributed DataFrame, but I'm wondering if that is the only difference.)
Thanks

Answering directly:
Are they the same object? Looking at the type, it seems they are different classes.
No, they are completely different objects (classes).
What's the core difference, if any?
A PySpark DataFrame is an object from the PySpark library, with its own API, and it can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
A pandas-on-Spark DataFrame and a pandas DataFrame are similar. However, the former is distributed across the cluster, while the latter lives on a single machine. When converting between the two, the data is transferred between the cluster's machines and the single client machine.
A pandas DataFrame is an object from the pandas library, also with its own API, and it can also be constructed by a wide range of methods.
Also, I recommend checking this documentation about Pandas on Spark
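To make the difference concrete, here is a minimal sketch (assuming Spark 3.2+, where pandas-on-Spark ships as pyspark.pandas) showing that the two are different classes and how to convert between them:

from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()

# A "classic" PySpark DataFrame
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
print(type(sdf))   # <class 'pyspark.sql.dataframe.DataFrame'>

# A pandas-on-Spark DataFrame: pandas-like API, still distributed
psdf = ps.DataFrame({"id": [1, 2], "letter": ["a", "b"]})
print(type(psdf))  # <class 'pyspark.pandas.frame.DataFrame'>

# Different classes, but you can convert between them
psdf_from_sdf = sdf.pandas_api()  # Spark DataFrame -> pandas-on-Spark
sdf_from_psdf = psdf.to_spark()   # pandas-on-Spark -> Spark DataFrame

Both of these stay distributed; only sdf.toPandas() / psdf.to_pandas() pulls everything onto the client machine.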

Related

Polars DataFrame save to sql

Is there a way to save Polars DataFrame into a database, MS SQL for example?
ConnectorX library doesn’t seem to have that option.
Polars doesn't support writing directly to a database. You can proceed in two ways:
Export the DataFrame in an intermediate format (such as .csv using .write_csv()), then import it into the database.
Process it in memory: you can convert the DataFrame into a simpler data structure using .to_dicts(). The result will be a list of dictionaries, each of them containing a row in key/value format. At this point it is easy to insert them into the database using SQLAlchemy or any specific library for your database of choice (a rough sketch follows below).
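Here is a rough sketch of the second option, using .to_dicts() plus SQLAlchemy; the connection string and table name are placeholders you would adapt to your MS SQL setup:

import polars as pl
from sqlalchemy import create_engine, text

df = pl.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# placeholder connection string; adjust the driver/credentials for your server
engine = create_engine("mssql+pyodbc://user:password@my_dsn")

rows = df.to_dicts()  # list of {"column": value} dicts, one per row

with engine.begin() as conn:
    conn.execute(
        text("INSERT INTO my_table (id, name) VALUES (:id, :name)"),
        rows,  # passing a list of dicts performs an executemany-style bulk insert
    )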

Ways of Pyspark tabular data manipulation

I'm very new to PySpark and I'm kind of confused about data manipulation. What I learned lately is that we can manipulate (tabular) data with SQL queries or with the built-in methods of PySpark DataFrames. My questions are:
Is there another way to manipulate tabular data in PySpark other than SQL queries or the DataFrames' built-in methods?
Why do some people manipulate the data with SQL and others with the built-in methods? I mean, it's mentioned that Spark DataFrames can act like SQL tables, so why use the built-in functions?
As a best practice, when should data be manipulated with SQL queries and when with the built-in DataFrame methods?
I'm sorry if this is a basic question, but I'm very new at this and I have been looking for articles to answer these questions, to no avail.
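For reference, the two approaches mentioned in the question look like this side by side; this is a minimal sketch with made-up column names, and both forms go through the same Catalyst optimizer:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

# 1) SQL query against a temporary view
df.createOrReplaceTempView("t")
sql_result = spark.sql("SELECT key, SUM(value) AS total FROM t WHERE value > 1 GROUP BY key")

# 2) The same manipulation with the DataFrame's built-in methods
api_result = df.filter(F.col("value") > 1).groupBy("key").agg(F.sum("value").alias("total"))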

What changes do I have to do to migrate an application from Spark 1.5 to Spark 2.1?

I have to migrate to Spark 2.1 an application written in Scala 2.10.4 using Spark 1.6.
The application processes text files of around 7 GB and contains several RDD transformations.
I was told that recompiling it with Scala 2.11 should be enough to make it work with Spark 2.1. This sounds strange to me, as I know that Spark 2 introduced some relevant changes, such as:
Introduction of SparkSession object
Merger of the Dataset and DataFrame APIs
I managed to recompile the application on Spark 2 with Scala 2.11 with only minor changes, due to Kryo serializer registration.
I still have some runtime errors that I am trying to solve, and I am trying to figure out what will come next.
My question is about which changes are "necessary" to make the application work as before, which changes are "recommended" in terms of performance optimization (I need to keep at least the same level of performance), and whatever else you think could be useful for a newbie in Spark :).
Thanks in advance!
I did the same migration a year ago; there are not many changes you need to make. What comes to mind (a short code sketch of the first few points follows this list):
If your code is cluttered with sparkContext/sqlContext usages, just extract those variables from the SparkSession instance at the beginning of your code.
df.map switched to the RDD API in Spark 1.6; in Spark 2.x you stay in the DataFrame API (which now has a map method). To get the same functionality as before, replace df.map with df.rdd.map. The same is true for df.foreach, df.mapPartitions, etc.
unionAll in Spark 1.6 is just union in Spark 2.x.
The Databricks CSV library is now included in Spark.
When you insert into a partitioned Hive table, the partition columns must now come last in the schema; in Spark 1.6 they had to come first.
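The question is about Scala, but the same API changes exist in PySpark; here is a minimal, hedged sketch of the first three points (names are illustrative only):

from pyspark.sql import SparkSession

# Spark 2.x entry point; the old contexts can be pulled out of it if legacy code needs them
spark = SparkSession.builder.appName("migrated-app").getOrCreate()
sc = spark.sparkContext  # replaces a hand-built SparkContext
# spark.sql(...), spark.read and spark.createDataFrame(...) cover the old sqlContext uses

df = spark.range(5)

# Spark 1.6: df.map(...) dropped down to the RDD API implicitly.
# Spark 2.x: go through .rdd explicitly to keep the old behaviour.
doubled = df.rdd.map(lambda row: row.id * 2).collect()

# Spark 1.6: df.unionAll(other)  ->  Spark 2.x: df.union(other)
combined = df.union(spark.range(5, 10))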
What you should consider (but would require more work):
migrate RDD-Code into Dataset-Code
enable CBO (cost based optimizer)
collect_list can now be used with structs; in Spark 1.6 it could only be used with primitives. This can simplify some things.
The Datasource API was improved/unified.
The left anti join was introduced.

Spark : how can i create local dataframe in each executor

In Spark with Scala, is there a way to create a local DataFrame on the executors, like pandas in PySpark? Inside a mapPartitions method I want to convert the iterator into a local DataFrame (like a pandas DataFrame in Python), so that DataFrame features can be used instead of hand-coding them over iterators.
That is not possible.
A DataFrame is a distributed collection in Spark, and DataFrames can only be created on the driver node (i.e. outside of transformations/actions).
Additionally, in Spark you cannot execute operations on RDDs/Dataframes/Datasets inside other operations:
e.g. the following code will produce an error:
rdd.map(v => rdd1.filter(e => e == v))
DataFrames and Datasets also have RDDs underneath, so the same behaviour applies there.
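Here is a hedged PySpark sketch of the same point: referencing one distributed collection inside a transformation of another fails, and the supported route is a join (or collecting/broadcasting the small side):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd1 = spark.sparkContext.parallelize([2, 3, 4])

# Fails at runtime: an RDD cannot be referenced inside another RDD's transformation
# rdd.map(lambda v: rdd1.filter(lambda e: e == v).collect())

# Works: express the lookup as a join between the two distributed collections
matches = rdd.map(lambda v: (v, None)).join(rdd1.map(lambda v: (v, None))).keys().collect()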

Data Analysis Scala on Spark

I am new to Scala, and I have to use Scala and Spark's SQL, MLlib and GraphX in order to perform some analysis on a huge data set. The analyses I want to do are:
Customer lifetime value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located on a Hadoop cluster.
My question is about the optimal approach to access the data and perform the above calculations:
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? or
Should I load the data from the CSV file, convert it into an RDD and then work on the RDD? or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
DataFrames give you SQL-like syntax to work with the data, whereas RDDs give you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine will optimise your queries, just like a SQL query optimiser would. This is not available for RDDs.
As you are new to Scala, it's highly recommended to use the DataFrame API initially and pick up the RDD API later, based on your requirements.
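To illustrate the contrast (shown here in PySpark for brevity; the Scala API reads much the same), here is the same aggregation with the DataFrame API and with the lower-level RDD API:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("c1", 10.0), ("c1", 5.0), ("c2", 7.0)], ["customer", "amount"])

# DataFrame API: declarative, goes through the query optimiser
df_totals = df.groupBy("customer").agg(F.sum("amount").alias("total"))

# RDD API: collection-style methods, no query optimisation
rdd_totals = df.rdd.map(lambda r: (r.customer, r.amount)).reduceByKey(lambda a, b: a + b)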
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It automatically infers data types. If the file has a header it can automatically use that as the schema; otherwise you can construct the schema using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0, it supports the CSV data source by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use it:
https://github.com/databricks/spark-csv/issues/367
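With Spark 2.x, reading the CSV with the built-in data source looks roughly like this (shown in PySpark; the path and options are placeholders for your own setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", "true")       # use the first line of the file as column names
    .option("inferSchema", "true")  # let Spark infer column types (costs an extra pass over the data)
    .csv("hdfs:///data/transactions/*.csv")
)
df.printSchema()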