I have a dataframe in the Databricks environment, and I need to download it to my personal machine. The dataframe contains 10,000 rows. So I tried the following:
df_test.coalesce(1).write.csv("dbfs:/FileStore/tables/df_test", header=True, mode='overwrite')
However, I'm not able to run the cell. The following error message appears:
org.apache.spark.SparkException: Job aborted.
Could someone help me solve the problem?
If you can't resolve the error, you can try this alternative to save your PySpark dataframe to your local machine as a CSV file.
With display(dataframe):
Here I created a dataframe with 10,000 rows for your reference. With display(), Databricks lets you download up to 1 million rows.
Code:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("firstname", StringType(), True)
])

data2 = [(1, "Rakesh")]
for i in range(2, 10001):
    data2.append((i, "Rakesh"))

df = spark.createDataFrame(data=data2, schema=schema)
df.show(5)
display(df)
display(df) output:
By default, display() shows the first 1,000 rows. To download the whole dataframe, click the down arrow next to the download button and then click "Download full results". Then click "Re-execute and download", and you can save the dataframe to your local machine as a CSV file.
I have queries that work in Impala but not Hive. I am creating a simple PySpark file such as:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
sconf = SparkConf()
sc = SparkContext.getOrCreate(conf=sconf)
sqlContext = HiveContext(sc)
sqlContext.sql('use db1')
...
When I run this script, its queries get the same errors I get when I run them in the Hive editor (they work in the Impala editor). Is there a way to fix this so that I can run these queries in the script using Impala?
You can use Impala or HiveServer2 in Spark SQL via the JDBC data source. That requires you to install the Impala JDBC driver and configure the connection to Impala in your Spark application. But "you can" doesn't mean "you should", because it incurs overhead and creates extra dependencies without any particular benefit.
Typically (and this is what your current application is trying to do), Spark SQL runs directly against the underlying file system, without going through HiveServer2 or the Impala coordinators. In this scenario, Spark only (re)uses the Hive Metastore to retrieve metadata -- database and table definitions.
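To sketch that JDBC route anyway: the helper below builds the option map you would pass to spark.read.format("jdbc"). The driver class is the Cloudera Impala JDBC 4.1 driver, which you must install separately; the host, port, and table names are illustrative placeholders, not values from your cluster.

```python
def impala_jdbc_options(host, port, database, table):
    """Build the option map for reading an Impala table over JDBC.

    Assumes the Cloudera Impala JDBC 4.1 driver jar is on the Spark
    classpath; host/port/database/table here are placeholders.
    """
    return {
        "url": "jdbc:impala://{}:{}/{}".format(host, port, database),
        "driver": "com.cloudera.impala.jdbc41.Driver",
        "dbtable": table,
    }

opts = impala_jdbc_options("impala-host.example.com", 21050, "db1", "my_table")
# On a live cluster you would then run (not executable here):
# df = spark.read.format("jdbc").options(**opts).load()
```

Port 21050 is Impala's default JDBC port; adjust it to match your deployment.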
I am creating a Hive table from Scala using the following code:
val spark = SparkSession
.builder()
.appName("self service")
.enableHiveSupport()
.master("local")
.getOrCreate()
spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must be getting created successfully, because if I run this code twice I receive an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any table in Hive, so it seems it's being saved in a different path than the one used by Hive in HUE to get this information.
Do you know what I should do to see the tables I create from my code in the HUE/Hive web GUI?
Any help would be much appreciated.
Thank you very much.
It seems to me that you have not added hive-site.xml to the proper path.
hive-site.xml has the properties that Spark needs to connect to Hive successfully, and you should add it to the directory
SPARK_HOME/conf/
You can also add this file by using spark.driver.extraClassPath, pointing at the directory where the file lives. For example, with spark-submit:
/usr/bin/spark2-submit \
  --conf spark.driver.extraClassPath=/path/to/dir/with/hive-site.xml/ \
  --master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
  --executor-cores n myScript.py
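As a quick sanity check before submitting (a hypothetical helper, not part of Spark), you can verify that hive-site.xml is actually where Spark will look for it:

```python
import os

def hive_site_present(spark_home):
    """Return True if hive-site.xml exists under SPARK_HOME/conf/."""
    return os.path.isfile(os.path.join(spark_home, "conf", "hive-site.xml"))

# e.g. hive_site_present("/opt/spark") before launching spark-submit
```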
I am trying to implement the following use case:
Spark reads files from HDFS (secured with Kerberos) in Parquet format
Spark writes those files out in CSV format
If I write to HDFS, it works perfectly. If I try to write to the local filesystem, it fails with: Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
I am using Spark 1.6.2.
To summarize, my code is:
val dfIn = sqc.read.parquet(pathIsilon)
dfIn.coalesce(1).write.format("com.databricks.spark.csv").save(pathFilesystem)
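One workaround to try (a sketch with hypothetical paths, not a confirmed fix): write the CSV to HDFS, where your Kerberos delegation tokens are valid, and then pull it down to the local filesystem with the hadoop CLI:

```python
def getmerge_command(hdfs_dir, local_file):
    """Build the hadoop CLI call that merges a coalesced CSV directory
    on HDFS into a single local file. Run it with subprocess.check_call
    on a host that holds a valid Kerberos ticket (kinit)."""
    return ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]

# Hypothetical paths:
cmd = getmerge_command("hdfs:///tmp/export_csv", "/tmp/export.csv")
```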
I have set up Cassandra and Spark with the cassandra-spark connector. I am able to create RDDs using Scala, but I would like to run complex SQL queries (aggregation/analytical functions/window functions) using Spark SQL on Cassandra tables. Could you help with how I should proceed? I am getting an error.
The following is the query used:
sqlContext.sql(
"""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test",
| cluster "Test Cluster",
| pushdown "true"
|)""".stripMargin)
Below are the errors (screenshots omitted; they show sc.sql(...) being used).
The first thing I noticed from your post is that sqlContext.sql(...) is used in your query, but your screenshot shows sc.sql(...).
I take the screenshot content as your actual issue. In the Spark shell, once the shell has loaded, both the SparkContext (sc) and the SQLContext (sqlContext) are already created and ready to go. sql(...) doesn't exist on SparkContext, so you should try sqlContext.sql(...).
Most probably your spark-shell started the context as a SparkSession, whose variable is spark. Try your commands with spark instead of sqlContext.
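If you are unsure which entry point your shell defines, a small helper (a hypothetical sketch, not Spark API) can select whichever one exists in the session:

```python
def pick_sql_entry_point(namespace):
    """Return the SQL entry point from a shell namespace dict.

    Spark 2.x+ shells define `spark` (a SparkSession); Spark 1.x shells
    define `sqlContext`. `sc` (the SparkContext) has no .sql() method.
    """
    for name in ("spark", "sqlContext"):
        if name in namespace:
            return namespace[name]
    raise RuntimeError("no Spark SQL entry point found in this shell")

# In a shell: entry = pick_sql_entry_point(globals()); entry.sql("SELECT 1")
```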
I have downloaded the Spark 1.3.1 release, package type "Pre-built for Hadoop 2.6 and later".
Now I want to run the Scala code below using the Spark shell, so I followed these steps:
1. bin/spark-shell
2. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Now the problem is: if I verify it in the Hue browser with
select * from src;
then I get a "table not found" exception, which means the table was not created. How do I configure Hive with the Spark shell to make this work? I want to use Spark SQL, and I also need to read from and write to Hive.
I have heard that we need to copy the hive-site.xml file somewhere into the Spark directory.
Can someone please explain the steps for Spark SQL and Hive configuration?
Thanks
Tushar
Indeed, the hive-site.xml direction is correct. Take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables.
Also, it sounds like you want to create a Hive table from Spark; for that, look at "Saving to Persistent Tables" in the same document.