Importing spark.sql.catalyst library in spark - pyspark

I am trying to find a way to extract the table names from a Spark SQL query.
The answer given here is in Scala: "How to get table names from SQL query?"
I want to port this to PySpark.
For that I want to import the library
org.apache.spark.sql.catalyst.analysis.UnresolvedRelation (or its equivalent) into pyspark.
Can this be done?

Related

Polars DataFrame save to sql

Is there a way to save Polars DataFrame into a database, MS SQL for example?
ConnectorX library doesn’t seem to have that option.
Polars doesn't support direct writing to a database. You can proceed in two ways:
Export the DataFrame in an intermediate format (such as .csv using .write_csv()), then import it into the database.
Process it in memory: you can convert the DataFrame into a simpler data structure using .to_dicts(). The result is a list of dictionaries, each containing one row in key/value format. At that point it is easy to insert them into a database using SQLAlchemy or a library specific to your database of choice.
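A minimal sketch of the in-memory approach, using only the standard library. The rows list is a stand-in for the output of df.to_dicts(), and sqlite3 stands in for the target database; the people table and the insert_rows helper are hypothetical names chosen for illustration. For MS SQL you would need a suitable driver, and the placeholder syntax may differ.

```python
import sqlite3

# Stand-in for the output of df.to_dicts(): one dict per row.
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]


def insert_rows(conn, table, rows):
    """Insert a list of row dicts into an existing table; return the row count.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    if not rows:
        return 0
    columns = list(rows[0].keys())
    placeholders = ", ".join(":" + c for c in columns)  # named parameters
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, rows)
    return len(rows)


# Hypothetical in-memory database and table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
inserted = insert_rows(conn, "people", rows)
```

Passing the values as bound parameters, rather than formatting them into the SQL string, leaves quoting and escaping to the driver.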

spark Dataframe vs pandas-on-spark Dataframe

I'm quite new to PySpark and I'm confused about the difference between a Spark DataFrame (created, for example, from an RDD) and a pandas-on-Spark DataFrame.
Are those the same object? Looking at the type, it seems they are different classes.
What's the core difference, if any? (I know that with a pandas-on-Spark DataFrame you can use almost the same syntax as pandas on a distributed DataFrame, but I'm wondering whether that is the only difference.)
Thanks
Answering directly:
Are those the same object? Looking at the type, it seems they are different classes.
No, they are completely different objects (classes).
What's the core difference, if any?
A PySpark DataFrame is an object from the PySpark library with its own API. It can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
A pandas-on-Spark DataFrame and a pandas DataFrame are similar. However, the former is distributed and the latter lives on a single machine. When converting one to the other, the data is transferred between the cluster's machines and the single client machine.
A pandas DataFrame is an object from the pandas library, also with its own API, and it can likewise be constructed in a wide range of ways.
Also, I recommend checking this documentation about Pandas on Spark

Ways of Pyspark tabular data manipulation

I'm very new to PySpark and I'm somewhat confused about data manipulation. What I learned lately is that we can manipulate tabular data either with SQL queries or with the built-in methods of PySpark DataFrames. My questions are:
Is there another way to manipulate tabular data in PySpark, other than SQL queries or the built-in DataFrame methods?
Why do some people manipulate the data with SQL and others with the built-in methods? It's mentioned that Spark DataFrames can act like SQL tables, so why use the built-in functions?
As a best practice, when should the data be manipulated with SQL queries and when with the built-in DataFrame methods?
I'm sorry if this is a basic question but I'm very new at this and I have been looking for articles to answer the questions I have but to no avail.

Pycharm does not auto suggest spark dataframe methods

I have been using Spark Scala for a long time, new to PySpark.
I am trying to set up PyCharm for a Spark project. Everything is set up from a dependencies point of view (pip install pyspark, for example). I can create a new Python file and write Spark code, and everything is resolved. Here's a snippet of the code:
from pyspark.sql import SparkSession
# enableHiveSupport is a method and must be called
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.sql('select * from db.tbl')
At this point should I expect data to be a DataFrame? When I type data. I expect PyCharm to tell me the possible methods like filter, join etc as a drop down, but it does not.
Is there anything more I need to do for this to work? I am using python 2.7 (have to, since that's what our hadoop cluster supports)
Python variables are dynamically typed, so you declare them without a type, and the IDE cannot always work out what spark.sql returns.
From Python 3.6 onward, you can annotate the variable's type like this:
from pyspark.sql import DataFrame
data: DataFrame = spark.sql('select * from db.tbl')
This tells PyCharm the type of data, so it can suggest the methods available on that object.
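Variable annotations are not available on Python 2.7, which the question says it has to use. A minimal sketch of an alternative, assuming the IDE honors PEP 484 type comments: the count_rows function is a hypothetical helper, and the import is placed under TYPE_CHECKING so that it is only evaluated by the IDE or a type checker. On Python 2.7 the typing module is a separate backport package rather than part of the standard library.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by the IDE or a type checker, never at runtime.
    from pyspark.sql import DataFrame


def count_rows(data):
    # type: (DataFrame) -> int
    """Return the number of rows in data.

    The type comment above declares data as a DataFrame, which is the
    information an IDE needs to offer method completion on it.
    """
    return data.count()


# The same idea for a plain variable (needs a live SparkSession, so it is
# shown here as a comment only):
# data = spark.sql('select * from db.tbl')  # type: DataFrame
```

Type comments are ordinary comments to the interpreter, so this form carries no runtime cost and stays valid on both Python 2 and Python 3.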

Data Analysis Scala on Spark

I am new to Scala, and I have to use Scala with Spark SQL, MLlib and GraphX to perform some analysis on a huge data set. The analyses I want to do are:
Customer lifetime value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located in a Hadoop cluster.
My question is about the optimal approach to access the data and perform the above calculations:
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? or
Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help..
A DataFrame gives you SQL-like syntax for working with the data, whereas an RDD gives you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine optimises your queries, much like SQL query optimisation. This is not available for RDDs.
As you are new to Scala, it's highly recommended to start with the DataFrame API and pick up the RDD API later as the need arises.
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It can infer data types automatically (the inferSchema option). If the file has a header, it can use that for the column names; otherwise you can construct the schema yourself using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0 or later, CSV is supported as a built-in data source; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use it.
https://github.com/databricks/spark-csv/issues/367