Unable to connect to S3 from Spark locally - Scala

I am trying to access S3 files from Spark locally, but I am getting this error:
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: s3n://bucketname/folder
I am also passing the jars hadoop-aws-2.7.3.jar, aws-java-sdk-1.7.4.jar, and hadoop-auth-2.7.1.jar when submitting the Spark job from the command line.
Below is my code:
package org.test.snow
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Utils
import org.apache.spark.sql._
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
object SnowS3 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("IDV4")
    val sc = new SparkContext(conf)
    val spark = new org.apache.spark.sql.SQLContext(sc)
    import spark.implicits._
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "A*******************A")
    sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", "A********************A")
    val cus_1 = spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://tb-us-east/working/customer.csv")
    cus_1.show()
  }
}
Any help would be appreciated.
FYI: I am using Spark 2.1.

You shouldn't set that fs.s3a.impl option; that's a superstition which seems to persist in Spark examples.
Instead, use the S3A connector just by:
- using the s3a:// prefix
- with consistent versions of the hadoop-* JARs (yes, hadoop-aws-2.7.3 needs hadoop-common-2.7.3)
- setting the S3A-specific authentication options, fs.s3a.access.key and fs.s3a.secret.key
If that doesn't work, look at the S3A troubleshooting docs.
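For reference, a minimal sketch of the question's job with those changes applied (same bucket path as the question; the credential values are placeholders):

import org.apache.spark._
import org.apache.spark.sql.SQLContext

object SnowS3 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("IDV4")
    val sc = new SparkContext(conf)
    // No fs.s3a.impl setting: the s3a:// prefix alone selects the S3A connector.
    // Placeholder credentials; in practice prefer environment variables or instance profiles.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<your-access-key>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your-secret-key>")
    val sqlContext = new SQLContext(sc)
    val cus_1 = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://tb-us-east/working/customer.csv")   // s3a:// picks the S3A filesystem
    cus_1.show()
  }
}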

Related

How do I create SparkSession from SparkContext in PySpark?

I have a SparkContext sc with a highly customised SparkConf(). How do I use that SparkContext to create a SparkSession? I found this post: https://stackoverflow.com/a/53633430/201657 that shows how to do it using Scala:
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
but when I try and apply the same technique using PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config(sc.getConf()).enableHiveSupport().getOrCreate()
It fails with the error:
AttributeError: 'SparkConf' object has no attribute '_get_object_id'
As I say, I want to use the same SparkConf in my SparkSession as is used in the SparkContext. How do I do it?
UPDATE
I've done a bit of fiddling about:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc.getConf().getAll() == spark.sparkContext.getConf().getAll()
returns
True
so the SparkConf of both the SparkContext and the SparkSession is the same. My assumption from this is that SparkSession.builder.getOrCreate() will use an existing SparkContext if one exists. Am I correct?

This SparkContext is an existing one

I am setting up a SparkSession using
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp').getOrCreate()
But I am getting an error:
# This SparkContext may be an existing one.

Unable to read Kafka messages through Spark Streaming

We are writing a Spark Streaming application to read Kafka messages using the createStream method, with a batch interval of 180 seconds.
The code works and creates files in the S3 bucket every 180 seconds, but there are no messages in the files. Below is the environment:
Spark 2.3.0
Kafka 1.0
Please go through the code and let me know if anything is wrong here.
#import dependencies
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
from pyspark.sql import *
# Creating context variables
sc = SparkContext(appName="SparkStreamingwithPython").getOrCreate()
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 180)
topic = "thirdtopic"
ZkQuorum = "localhost:2181"
# Connect to Kafka and create the stream
kafkaStream = KafkaUtils.createStream(ssc, ZkQuorum, "Spark-Streaming-Consumer", {topic: 1})
def WritetoS3(rdd):
    rdd.saveAsTextFile("s3://BucketName/thirdtopic/SparkOut")
kafkaStream.foreachRDD(WritetoS3)
ssc.start()
ssc.awaitTermination()
Thanks in advance.

Not able to load Hive table into Spark

I am trying to load data from a Hive table using Spark SQL. However, it doesn't return anything. I tried executing the same query in Hive and it prints the result. Below is the code I am trying to execute in Scala.
sc.setLogLevel("ERROR")
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
val data = sqlContext.sql("select `websitename` from db1.table1 limit 10").toDF
Kindly let me know what the possible reason could be.
Spark version: 1.6.2
Scala: 2.10
It depends on how the table was created in the first place. If it was created by an external application and you have Hive running as a separate service, make sure that the settings in SPARK_HOME/conf/hive-site.xml are correct.
If it's an internal Spark SQL table, Spark sets up the metastore in a folder on the master node, which in your case might have been deleted or moved.
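One way to narrow it down, as a sketch assuming the same spark-shell session as the question (Spark 1.6, HiveContext): list what Spark can actually see in the metastore. If db1 or table1 does not appear, Spark is reading a different metastore (often a local Derby one) than the Hive CLI, which points back at hive-site.xml.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// If db1/table1 don't show up here, Spark isn't talking to the same metastore as Hive.
sqlContext.sql("show databases").show()
sqlContext.sql("show tables in db1").show()
// For an ORC-backed table, also check the location and SerDe Spark has recorded for it.
sqlContext.sql("describe formatted db1.table1").collect().foreach(println)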

Apache Kudu with Apache Spark NoSuchMethodError: exportAuthenticationCredentials

I have this function with Spark and Scala:
import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.kudu.spark.kudu._
def save(df: DataFrame): Unit = {
  val kuduContext: KuduContext = new KuduContext("quickstart.cloudera:7051")
  kuduContext.createTable(
    "test_table", df.schema, Seq("anotheKey", "id", "date"),
    new CreateTableOptions()
      .setNumReplicas(1))
  kuduContext.upsertRows(df, "test_table")
}
But trying to create the KuduContext raises an exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kudu.client.KuduClient.exportAuthenticationCredentials()[B
at org.apache.kudu.spark.kudu.KuduContext.<init>(KuduContext.scala:63)
at com.mypackge.myObject$.save(myObject.scala:24)
at com.mypackge.myObject$$anonfun$main$1.apply$mcV$sp(myObject.scala:59)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$$anonfun$main$1.apply(myObject.scala:57)
at com.mypackge.myObject$.time(myObject.scala:17)
at com.mypackge.myObject$.main(myObject.scala:57)
at com.mypackge.myObject.main(myObject.scala)
Spark works without any problem. I have installed the Kudu VM as described in the official docs, and I have logged in from bash to the Impala instance without a problem.
Does anyone have an idea of what I am doing wrong?
The problem was that a dependency of the project was using an old version of kudu-client (1.2.0) while I was using kudu-spark 1.3.0 (which includes kudu-client 1.3.0). Excluding the old kudu-client in pom.xml was the solution.
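For anyone on sbt rather than Maven, a rough sketch of the equivalent fix (the offending dependency's coordinates are placeholders; in Maven this corresponds to adding an exclusion for org.apache.kudu:kudu-client on whichever dependency drags in the 1.2.0 client):

// build.sbt (sketch): exclude the stale transitive kudu-client so kudu-spark's 1.3.0 client is used
libraryDependencies += "com.example" % "library-pulling-old-kudu-client" % "1.0" exclude("org.apache.kudu", "kudu-client")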