I want to convert a Scala DataFrame into a pandas DataFrame.
val collection = spark.read.sqlDB(config)
collection.show()
// should be something like df = collection
You are asking for a way to use a Python library from Scala, which seems a bit odd. Are you sure you have to do that? You may already know this, but Spark's Scala DataFrame API will probably give you the functionality you need from pandas.
If you still need to use pandas, I would suggest writing the data you need to a file (a CSV, for example). Then, from a Python application, you can load that file into a pandas DataFrame and work from there.
Trying to create a pandas object from Scala is probably overcomplicating things (and I am not sure it is currently possible).
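For example, the bridge can be as simple as writing a CSV from Scala (something like collection.coalesce(1).write.option("header", "true").csv("/tmp/export")) and then loading the part files into pandas on the Python side. A minimal sketch of the Python half, assuming that path and a header row:
import glob
import pandas as pd

# gather the CSV part files written by the Scala/Spark job
files = glob.glob("/tmp/export/part-*.csv")
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
print(df.head())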
If you want to use a pandas-based API in Spark code, you can install the Koalas Python library. Then, whatever functions you want from the pandas API, you can embed them directly in your Spark code.
To install Koalas:
pip install koalas
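A rough sketch of what that looks like, assuming you already have a Spark DataFrame named sdf (note that from Spark 3.2 onwards Koalas has been merged into Spark itself as pyspark.pandas):
import databricks.koalas as ks

kdf = sdf.to_koalas()    # importing koalas adds to_koalas() to Spark DataFrames
kdf = kdf.dropna()       # from here on, pandas-style methods work on distributed data
print(kdf.head())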
I'm quite new to PySpark, but I'm confused about the difference between a Spark DataFrame (created, for example, from an RDD) and a pandas-on-Spark DataFrame.
Are those the same object? Looking at the type, it seems they are different classes.
What's the core difference, if any? (I know that with a pandas-on-Spark DataFrame you can use almost the same syntax as pandas on a distributed DataFrame, but I'm wondering whether that is the only difference.)
Thanks
Answering directly:
Are those the same object? Looking at the type it seems they are different classes.
No, they are completely different objects (classes).
What's the core difference, if any?
A PySpark DataFrame is an object from the PySpark library, with its own API, and it can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
A pandas-on-Spark DataFrame and a pandas DataFrame are similar. However, the former is distributed while the latter lives on a single machine. When converting between them, data is transferred between the cluster machines and the single client machine.
A pandas DataFrame is an object from the pandas library, also with its own API, and it can likewise be constructed in a wide range of ways.
Also, I recommend checking this documentation about Pandas on Spark
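A small sketch that makes the difference visible, assuming Spark 3.2+ where pandas-on-Spark ships as pyspark.pandas:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sdf = spark.range(5)       # pyspark.sql.DataFrame (the classic Spark DataFrame)
psdf = sdf.pandas_api()    # pyspark.pandas.DataFrame (distributed, pandas-like API)
pdf = psdf.to_pandas()     # plain pandas.DataFrame, collected to the driver

print(type(sdf), type(psdf), type(pdf))   # three different classes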
I'm converting a process from PostgreSQL over to Databricks / Apache Spark.
The PostgreSQL process uses the following SQL function to get the point on a map from an X and Y value: ST_Transform(ST_SetSrid(ST_MakePoint(x, y), 4326), 3857)
Does anyone know how I can achieve this same logic in Spark SQL on Databricks?
To achieve this you need to use a library such as Apache Sedona, GeoMesa, or something similar. Sedona, for example, has the ST_Transform function, and it may have the rest as well.
The only thing you need to take care of is that if you're using pure SQL, then on Databricks you will need to:
install the Sedona libraries using an init script, so the libraries are there before Spark starts
set Spark configuration parameters, as described in the following pull request
Update, June 2022: Databricks has developed the Mosaic library, which is heavily optimized for geospatial analysis on Databricks and is compatible with the standard ST_ functions.
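As a rough illustration with Sedona from PySpark (a sketch only: function signatures vary between Sedona versions, spark is assumed to be an active SparkSession, and points_table with its x/y columns is a placeholder):
from sedona.register import SedonaRegistrator

SedonaRegistrator.registerAll(spark)   # assumes the Sedona jars are already installed on the cluster

projected = spark.sql("""
    SELECT ST_Transform(ST_SetSRID(ST_Point(x, y), 4326),
                        'epsg:4326', 'epsg:3857') AS geom
    FROM points_table
""")
projected.show()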
I have been using Spark with Scala for a long time and am new to PySpark.
I am trying to set up PyCharm for a Spark project. Everything is set up from a dependencies point of view (pip install pyspark, for example). I can create a new Python file and write Spark code, and everything is resolved. Here's a snippet of the code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.sql('select * from db.tbl')
At this point, should I expect data to be a DataFrame? When I type data. I expect PyCharm to suggest the possible methods like filter, join, etc. in a drop-down, but it does not.
Is there anything more I need to do for this to work? I am using Python 2.7 (I have to, since that's what our Hadoop cluster supports).
In Python, variables are dynamically typed, so you declare them without their types.
But starting from Python 3.6, you can annotate a variable with its type like this:
from pyspark.sql import DataFrame
data: DataFrame = spark.sql('select * from db.tbl')
This way you let PyCharm know the type of data, and it will suggest the possible methods for that object.
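One caveat: variable annotations require Python 3.6+, so on Python 2.7 (as in your setup) a PEP 484 type comment should give PyCharm the same hint:
from pyspark.sql import DataFrame
data = spark.sql('select * from db.tbl')  # type: DataFrame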
I am trying to find a way to extract the table names from a Spark SQL query.
The answer given here is in Scala: How to get table names from SQL query?
I want to convert this to PySpark.
For that, I want to import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation (or its equivalent) into PySpark.
Can this be done?
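One possible direction (a rough, untested sketch): you cannot import a JVM class into Python directly, but you can reach the Catalyst classes through Spark's py4j gateway (spark._jvm / spark._jsparkSession) and walk the parsed plan. Method names such as tableName() differ between Spark versions, so treat the details below as assumptions:
query = "SELECT * FROM db.tbl1 t1 JOIN db.tbl2 t2 ON t1.id = t2.id"

# parse the query into a Catalyst logical plan (a JVM object)
plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)

# leaf nodes of an unresolved plan include the UnresolvedRelation entries
leaves = plan.collectLeaves()
for i in range(leaves.size()):
    node = leaves.apply(i)
    if node.nodeName() == "UnresolvedRelation":
        print(node.tableName())   # may be named differently on some Spark versions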
I am trying to save a DataFrame as a table using saveAsTable, and it works, but I want to save the table to a database other than the default one. Does anyone know if there is a way to set the database to use? I tried hiveContext.sql("use db_name"), and this did not seem to do it. There is a saveAsTable overload that takes some options. Is there a way I can do it with the options?
It does not look like you can set the database name yet... if you read the HiveContext.scala code, you see a lot of comments like...
// TODO: Database support...
So I am guessing that it's not supported yet.
Update:
In Spark 1.5.1 this works, which did not work in earlier versions. In earlier versions you had to use a USE statement, as in deformitysnot's answer.
df.write.format("parquet").mode(SaveMode.Append).saveAsTable("databaseName.tablename")
This was fixed in Spark 1.5, and you can do it using:
hiveContext.sql("USE sparkTables");
dataFrame.saveAsTable("tab3", "orc", SaveMode.Overwrite);
By the way, in Spark 1.5 you can read Spark-saved DataFrames from the Hive command line (beeline, ...), something that was impossible in earlier versions.
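For reference, on current Spark versions (2.x and 3.x) the database-qualified table name works from PySpark as well; a minimal sketch, with my_db and my_table as placeholder names and df an existing DataFrame:
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
df.write.mode("overwrite").format("parquet").saveAsTable("my_db.my_table")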