How can I change SparkContext.sparkUser() setting (in pyspark)? - scala

I am new to Spark and pyspark.
I use pyspark, and after my RDD processing I tried to save the result to HDFS using the saveAsTextFile() function.
But I get a 'permission denied' error message because pyspark tries to write to HDFS
using my local account, 'kjlee', which does not exist on the HDFS system.
I can check the spark user name with SparkContext().sparkUser(), but I can't find out how to change it.
How can I change the spark user name?

There is an environment variable for this: HADOOP_USER_NAME.
So simply use export HADOOP_USER_NAME=anyuser, or in pyspark you can use os.environ["HADOOP_USER_NAME"] = "anyuser"
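For example, a minimal pyspark sketch (the user name and HDFS output path are placeholders); the variable has to be set before the SparkContext is created:
import os
os.environ["HADOOP_USER_NAME"] = "anyuser"  # placeholder: a user that exists on HDFS

from pyspark import SparkConf, SparkContext

# The JVM started for the SparkContext inherits the environment set above.
sc = SparkContext(conf=SparkConf().setAppName("ChangeSparkUser").setMaster("local[*]"))
print(sc.sparkUser())  # should now print "anyuser"

rdd = sc.parallelize(["a", "b", "c"])
rdd.saveAsTextFile("hdfs://<namenode>:8020/user/anyuser/output")  # placeholder path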

In Scala it can be done with System.setProperty:
import org.apache.spark.sql.SparkSession

System.setProperty("HADOOP_USER_NAME", "newUserName")

val spark = SparkSession
  .builder()
  .appName("SparkSessionApp")
  .master("local[*]")
  .getOrCreate()

println(spark.sparkContext.sparkUser)

Related

How can we implement multiple core division with pyspark in local machine?

My PC is an 18-core machine. I want to make use of all the cores at once to take advantage of the parallelism in pyspark.
spark = (SparkSession
    .builder
    .master("local[18]")
    .appName("sparkexample")
    .getOrCreate())
I used the above configuration. Are all cores working together, or do I have to set something in a configuration file for this?
local[*] will use all the available cores.
spark = (SparkSession
    .builder
    .master("local[*]")
    .appName("sparkexample")
    .getOrCreate())
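If you want to confirm how many cores the local master actually picked up, a quick check (a sketch assuming the session above) is:
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .master("local[*]")
    .appName("sparkexample")
    .getOrCreate())

# With local[*] this normally equals the number of cores visible to the JVM,
# and it is the default number of partitions for new RDDs.
print(spark.sparkContext.defaultParallelism)
print(spark.sparkContext.parallelize(range(1000)).getNumPartitions())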

Not able to execute Pyspark script using spark action in Oozie - Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'

I am facing the below error while running a spark action through an Oozie workflow on an EMR 5.14 cluster:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'"
My Pyspark script runs fine when executed as a normal spark job, but fails when executed via Oozie.
Pyspark program:
from pyspark import SparkContext
from pyspark.sql import SparkSession, HiveContext

spark = SparkSession.builder.appName("PysparkTest").config("hive.support.quoted.identifiers", "none").enableHiveSupport().getOrCreate()
sc = SparkContext.getOrCreate()
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
I have created a workflow.xml and job.properties taking reference from the LINK.
I copied all the spark and hive related configuration files under the same directory ($SPARK_CONF_DIR/).
Hive is also configured to use MySQL for the metastore.
It would be great if you could help me figure out the problem I am facing when running this Pyspark program as a jar file in an Oozie spark action.
Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog': this means the catalog jar it is trying to find is not in the Oozie sharelib spark directory.
Please add the following property in your job.properties file.
oozie.action.sharelib.for.spark=hive,spark,hcatalog
Also, can you please post the whole log?
And if possible, could you please run the same on EMR 5.29? I have faced a few jar issues on 5.26 and lower versions while running PySpark.

End of file exception while reading a file from remote hdfs cluster using spark

I am new to working with HDFS. I am trying to read a csv file stored on a hadoop cluster using spark. Every time I try to access it I get the following error:
End of File Exception between local host
I have not set up hadoop locally since I already had access to the hadoop cluster.
I may be missing some configuration, but I don't know which. Would appreciate the help.
I tried to debug it using this: link
It did not work for me.
This is the code using spark:
val conf = new SparkConf()
  .setAppName("Read")
  .setMaster("local")
  .set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://<some-ip>/abc.csv")
I expect it to read the csv and convert it into an RDD.
Getting this error:
Exception in thread "main" java.io.EOFException: End of File Exception between local host is:
Run your spark jobs on the hadoop cluster. Use the code below:
val spark = SparkSession.builder().master("local[1]").appName("Read").getOrCreate()
val data = spark.sparkContext.textFile("<filePath>")
Or you can use spark-shell as well.
If you want to access hdfs from your local machine, follow this: link
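For reference, a minimal pyspark sketch of the same read; the namenode host is a placeholder, and 8020 is only the usual default RPC port (pointing at a non-RPC port such as the web UI port is a common cause of this EOFException):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Read").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Fully qualified path: hdfs://<namenode-host>:<namenode-rpc-port>/<file-path>
data = sc.textFile("hdfs://<namenode-host>:8020/abc.csv")
print(data.take(5))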

How can I read a csv file stored in an HDFS file system on a server using Spark in IntelliJ on my local system?

I am using IntelliJ to write Spark code, and I want to access files stored in an HDFS file system on a server. How can I access the HDFS file in the Scala Spark code so that it can be loaded as a dataframe?
val spark = SparkSession.builder().appName("CSV_Import_Example")
  .config("spark.hadoop.yarn.resourcemanager.hostname", "XXX")
  .config("spark.hadoop.yarn.resourcemanager.address", "XXX:8032")
  .config("spark.yarn.access.namenodes", "hdfs://XXXX:8020,hdfs://XXXX:8020")
  .config("spark.yarn.stagingDir", "hdfs://XXXX:8020/user/hduser/")
  .getOrCreate()
The entry point into all functionality in Spark is the SparkSession class.
val sourceDF = spark.read.format("csv").option("header", "true").load("hdfs://192.168.1.1:8020/user/cloudera/example_csvfile.csv")
hdfs://192.168.1.1:8020 here points at the HDFS cluster; port 8020 is the namenode port.
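For completeness, a pyspark sketch of the same read (the YARN-related configs above would carry over unchanged as .config(...) calls on the builder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV_Import_Example").getOrCreate()

# header=True uses the first line of the file as column names.
sourceDF = (spark.read
    .format("csv")
    .option("header", "true")
    .load("hdfs://192.168.1.1:8020/user/cloudera/example_csvfile.csv"))
sourceDF.show(5)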

Why do I get a “Hive support is required to CREATE Hive TABLE (AS SELECT)” error when creating a table?

I'm running into an issue when trying to create a table.
Here is the code to create the table, where the exception is occurring:
sparkSession.sql(s"CREATE TABLE IF NOT EXISTS mydatabase.students(" +
s"name string," + s"age int)")
Here is the spark session configuration:
lazy val sparkSession = SparkSession
  .builder()
  .appName("student_mapping")
  .enableHiveSupport()
  .getOrCreate()
And this is the exception:
org.apache.spark.sql.AnalysisException: Hive support is required to
CREATE Hive TABLE (AS SELECT);;'CreateTable `mydatabase`.`students`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
My question is: Why is this exception occurring? I have several other spark programs running with the same session configurations, running flawlessly. I'm using Scala 2.11 and Spark 2.3.
SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.
SessionState is the state separation layer between Spark SQL sessions, including SQL configuration, tables, functions, UDFs, SQL parser, and everything else that depends on a SQLConf. SessionState is available as the sessionState property of a SparkSession.
Internally, sessionState clones the optional parent SessionState (if given when creating the SparkSession) or creates a new SessionState using BaseSessionStateBuilder as defined by the spark.sql.catalogImplementation configuration property:
in-memory (default) for org.apache.spark.sql.internal.SessionStateBuilder
hive for org.apache.spark.sql.hive.HiveSessionStateBuilder
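A quick way to check which of the two implementations a given session ended up with (this should work on Spark 2.x; treat it as a sketch):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expected to return "hive" when Hive support is enabled, "in-memory" otherwise.
print(spark.conf.get("spark.sql.catalogImplementation"))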
To use Hive you should use the class org.apache.spark.sql.hive.HiveSessionStateBuilder, and according to the documentation this can be done by setting the property spark.sql.catalogImplementation to hive when creating the SparkSession object:
val conf = new SparkConf()
  .set("spark.sql.warehouse.dir", "hdfs://namenode/sql/metadata/hive")
  .set("spark.sql.catalogImplementation", "hive")
  .setMaster("local[*]")
  .setAppName("Hive Example")

val spark = SparkSession.builder()
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()
or you can pass the property --conf spark.sql.catalogImplementation=hive when you submit your job to the cluster.
Citing:
By default, Spark SQL uses the embedded deployment mode of a Hive metastore with an Apache Derby database.
In other words, by default Spark's sql context doesn't know about any tables managed by Hive on your cluster.
You need to use Hive's metastore (the storage that knows about the databases and tables in Hive) in Spark to be able to manipulate them from a Spark application.
To do so you need to either set spark.hadoop.hive.metastore.warehouse.dir if you are using an embedded metastore, or hive.metastore.uris to access the metastore via the Thrift protocol in case of a metastore in a remote database.
It worked by passing --conf spark.sql.catalogImplementation=hive to spark-submit.
This was not needed in Spark 1.6, but it seems to be required since Spark 2.0.
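For reference, a sketch of the same configuration set directly in pyspark code (the warehouse directory is the placeholder used above, and the database mydatabase is assumed to already exist in the metastore):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("Hive Example")
    .master("local[*]")
    # Equivalent of passing --conf spark.sql.catalogImplementation=hive
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.warehouse.dir", "hdfs://namenode/sql/metadata/hive")  # placeholder
    .enableHiveSupport()
    .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS mydatabase.students(name string, age int)")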