java.lang.NullPointerException while reading data from MSSQL server with spark - scala

I am having issues reading data from an MSSQL server using Cloudera Spark. I am not sure where the problem is or what is causing it.
Here is my build.sbt:
val sparkversion = "1.6.0-cdh5.10.1"
name := "SimpleSpark"
organization := "com.huff.spark"
version := "1.0"
scalaVersion := "2.10.5"
mainClass in Compile := Some("com.huff.spark.example.SimpleSpark")
assemblyJarName in assembly := "mssql.jar"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided",
"org.apache.spark" % "spark-core_2.10" % sparkversion % "provided", // to test in cluseter
"org.apache.spark" % "spark-sql_2.10" % sparkversion % "provided" // to test in cluseter
)
resolvers += "Confluent IO" at "http://packages.confluent.io/maven"
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos"
And here is my Scala source:
package com.huff.spark.example
import org.apache.spark.sql._
import java.sql.{Connection, DriverManager}
import java.util.Properties
import org.apache.spark.{SparkContext, SparkConf}
object SimpleSpark {
def main(args: Array[String]) {
val sourceProp = new java.util.Properties
val conf = new SparkConf().setAppName("SimpleSpark").setMaster("yarn-cluster") //to test in cluster
val sc = new SparkContext(conf)
var SqlContext = new SQLContext(sc)
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
jdbcDF.show(5)
}
}
And this is the error I see:
17/05/24 04:35:20 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:155)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
at com.huff.spark.example.SimpleSpark$.main(SimpleSpark.scala:16)
at com.huff.spark.example.SimpleSpark.main(SimpleSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:552)
17/05/24 04:35:20 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NullPointerException)
I know the problem is in line 16 which is:
val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
But I can't pinpoint what exactly the problem is. Is it something to do with access (which is doubtful), a problem with the connection parameters (the error message would say so), or something else I am not aware of? Thanks in advance :-)

If you are using Azure SQL Server, copy the JDBC connection string from the Azure portal. I tried that and it worked for me.
Azure Databricks, using Scala:
import com.microsoft.sqlserver.jdbc.SQLServerDriver
import java.sql.DriverManager
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
// MS SQL JDBC Connection String ...
val jdbcSqlConn = "jdbc:sqlserver://***.database.windows.net:1433;database=**;user=***;password=****;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Loading the MS SQL table via the SQL context into a DataFrame
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> jdbcSqlConn,
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"dbtable" -> "***")).load()
// Registering the temp table so that we can run SQL queries against it
jdbcDF.registerTempTable("yourtablename")
// selecting only top 10 rows here but you can use any sql statement
val yourdata = sqlContext.sql("SELECT * FROM yourtablename LIMIT 10")
// display the data
yourdata.show()
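Note that on Spark 2.x registerTempTable is deprecated in favour of createOrReplaceTempView; if your cluster runs Spark 2.x, the equivalent registration would be:
// Spark 2.x replacement for registerTempTable
jdbcDF.createOrReplaceTempView("yourtablename")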

The NPE occurs when Spark tries to close the database Connection, which indicates that it could not obtain a working connection via JdbcUtils.createConnectionFactory. You should check your connection URL and the logs for failures.
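One way to narrow this down is to try opening the connection outside Spark with plain JDBC. This is only a sketch that reuses the URL and driver from the question (adjust the values to your environment, and make sure the MSSQL JDBC jar is on the classpath); if it fails, the problem is the URL, credentials, or driver jar rather than Spark itself:
import java.sql.DriverManager
object JdbcSanityCheck {
  def main(args: Array[String]): Unit = {
    // Same driver and URL as in the question; values are placeholders to adjust
    Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
    val url = "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl"
    val conn = DriverManager.getConnection(url)
    try {
      val rs = conn.createStatement().executeQuery("SELECT 1")
      while (rs.next()) println("Connected, SELECT 1 returned " + rs.getInt(1))
    } finally {
      conn.close()
    }
  }
}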

Related

Exception while running hive support on Spark: Unable to instantiate SparkSession with Hive support because Hive classes are not found

Hello, I am trying to use Hive with Spark, but when I try executing it, it shows this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
This is my source code
package com.spark.hiveconnect
import java.io.File
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
object sourceToHIve {
case class Record(key: Int, value: String)
def main(args: Array[String]){
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
sql("LOAD DATA LOCAL INPATH '/usr/local/spark3/examples/src/main/resources/kv1.txt' INTO TABLE src")
sql("SELECT * FROM src").show()
spark.close()
}
}
This is my build.sbt file.
name := "SparkHive"
version := "0.1"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
And I also have Hive running in the console.
Can anyone help me with this?
Thank you.
Try adding:
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.5"
The major problem is that the class "org.apache.hadoop.hive.conf.HiveConf" cannot be loaded.
You can insert the following code for testing:
Class.forName("org.apache.hadoop.hive.conf.HiveConf",true,
Thread.currentThread().getContextClassLoader);
An error will occur on this line.
To be exact, the fundamental problem is that your build may not include Hive support for Spark.
You can check for the following dependency (shown here in Maven form):
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.3</version>
</dependency>
The class "org.apache.hadoop.hive.conf.HiveConf" locates in this dependcy.

spark streaming save base64 rdd to json on s3

The Scala application below cannot save an RDD in JSON format to S3.
I have:
A Kinesis stream that has complex objects placed on it. Each object has had JSON.stringify() applied to it before being placed on the stream as part of the Kinesis PutRecord method.
A Scala Spark streaming job that reads these items off the stream.
I cannot seem to save the RDD records that come off the stream as JSON files in an S3 bucket.
In the code I've attempted to convert the RDD[Array[Byte]] to RDD[String] and then load it with spark.read.json, but no luck. I've tried various other combinations and can't seem to output the data to S3 in its raw format.
import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
object ScalaStream {
def main(args: Array[String]): Unit = {
val appName = "ScalaStreamExample"
val batchInterval = Milliseconds(2000)
val outPath = "s3://xxx-xx--xxx/xxxx/"
val spark = SparkSession
.builder()
.appName(appName)
.getOrCreate()
val sparkContext = spark.sparkContext
val streamingContext = new StreamingContext(sparkContext, batchInterval)
// Populate the appropriate variables from the given args
val checkpointAppName = "xxx-xx-xx--xx"
val streamName = "cc-cc-c--c--cc"
val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
val regionName = "cc-xxxx-xxx"
val initialPosition = new Latest()
val checkpointInterval = batchInterval
val storageLevel = StorageLevel.MEMORY_AND_DISK_2
val kinesisStream = KinesisInputDStream.builder
.streamingContext(streamingContext)
.endpointUrl(endpointUrl)
.regionName(regionName)
.streamName(streamName)
.initialPosition(initialPosition)
.checkpointAppName(checkpointAppName)
.checkpointInterval(checkpointInterval)
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
kinesisStream.foreachRDD { rdd =>
if (!rdd.isEmpty()){
//**************** . <---------------
// This is where i'm trying to save the raw json object to s3 as json file
// tried various combinations here but no luck.
val dataFrame = rdd.map(record=>new String(record)) // convert bytes to string
dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString())
//**************** <----------------
}
}
// Start the streaming context and await termination
streamingContext.start()
streamingContext.awaitTermination()
}
}
Any ideas what I'm missing?
So the apparent cause of the failure was a complete red herring. It turned out to be a Scala version conflict with what is available on EMR.
Many similar questions on SO suggested this might be the issue, but whilst the Spark documentation lists Scala 2.12.4 as compatible with Spark 2.4.4, the EMR instance does not appear to support Scala 2.12.4. So I've updated my build.sbt and deploy script from:
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
to:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
deploy.sh
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE
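If you want to confirm which Scala and Spark versions the cluster actually runs before rebuilding, a quick check (assuming you can open a spark-shell on the EMR master node) is to print both from the shell:
// Run these two lines inside spark-shell on the cluster; the assembly must match these versions
println("Scala version: " + scala.util.Properties.versionString)
println("Spark version: " + spark.version)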

Proxy Issues while Dumping Data into CosmosDB from Dataframe using intelliJ

I have to dump JSON data into Cosmos DB from a Spark DataFrame using Scala and IntelliJ.
I am reading a CSV file from my local machine and converting it into JSON format. Now I have to dump this JSON data into a Cosmos DB collection.
The Spark version is 2.2.0 and the Scala version is 2.11.8.
Below is the code I wrote in IntelliJ with Scala to fetch a CSV file from my local machine and convert it into JSON.
import org.apache.spark.sql.SparkSession
import com.microsoft.azure.cosmosdb.spark.config.Config
object DataLoadConversion {
def main(args: Array[String]): Unit = {
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()
val df = spark.read.format("com.databricks.spark.csv")
.option("quote", "\"")
.option("escape", "\"")
.option("delimiter", ",")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema","true")
.load("file:///C:/Users/an/Desktop/ct_temp.csv")
val finalDf = df.select(df("history_temp_id").as("NUM"),df("history_temp_time").as("TIME"))
val jsonData = finalDf.select("NUM", "TIME").toJSON
jsonData.show(2)
// COSMOS DB Write configuration
val writeConfig = Config(Map(
"Endpoint" -> "https://cosms.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE", //provided primary key
"Database" -> "DBName", //provided with DB name
"Collection" -> "Collection", //provided with collection name
))
// Write to Cosmos DB from the DataFrame
import org.apache.spark.sql.SaveMode
jsonData.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
}
}
Below is the build.sbt file
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"com.databricks" %% "spark-csv" % "1.5.0",
)
libraryDependencies += "com.microsoft.azure" % "azure-cosmosdb-spark_2.2.0_2.11" % "1.1.1" % "provided" exclude("org.apache.spark", "spark-core_2.10")
I added the Cosmos DB dependency to the build.sbt file.
I am new to Spark and Scala. Please let me know what steps need to be followed to connect to Cosmos DB from IntelliJ with Spark and Scala.
The build is successful, but I am getting the error below while running the code.
19/07/10 16:32:41 INFO DocumentClient: Initializing DocumentClient with serviceEndpoint [https://cosms.documents.azure.com/], ConnectionPolicy [ConnectionPolicy [requestTimeout=60, mediaRequestTimeout=300, connectionMode=Gateway, mediaReadMode=Buffered, maxPoolSize=400, idleConnectionTimeout=60, userAgentSuffix= SparkConnector/2.2.0_2.11-1.1.1, retryOptions=com.microsoft.azure.documentdb.RetryOptions#1ef5cde4, enableEndpointDiscovery=true, preferredLocations=[Japan East]]], ConsistencyLevel [Session]
19/07/10 16:33:03 WARN DocumentClient: Failed to retrieve database account information. org.apache.http.conn.HttpHostConnectException: Connect to cosms.documents.azure.com:443 [cosms.documents.azure.com/13.78.51.35] failed: Connection timed out: connect
......
Exception in thread "main" java.lang.IllegalStateException: Http client execution failed.
at com.microsoft.azure.documentdb.internal.GatewayProxy.performGetRequest(GatewayProxy.java:244)
at com.microsoft.azure.documentdb.internal.GatewayProxy.doRead(GatewayProxy.java:93)
If I connect from outside my office network this works, but when my machine is connected to the office network I get the above error.
I have tried configuring proxy settings under Settings >>> Proxy settings.
If I try the same endpoint in Chrome, I get the error below.
{"code":"Unauthorized","message":"Required Header authorization is missing. Ensure a valid Authorization token is passed.\r\nActivityId: 54999e41-179e-4877-b8bf-f2c2a33280fd, Microsoft.Azure.Documents.Common/2.5.1"}
How do I resolve this? How can I get through the proxy from the office network?
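The Unauthorized response in Chrome suggests the endpoint is reachable through the browser's proxy, while the JVM connection simply times out, so one thing worth trying (a sketch only, assuming you know your office proxy host and port, and noting that whether the Cosmos DB connector's HTTP client honours these properties depends on its configuration) is setting the standard JVM proxy system properties before the SparkSession is created:
// Hypothetical proxy host and port; replace with your office proxy values
System.setProperty("http.proxyHost", "proxy.mycompany.example")
System.setProperty("http.proxyPort", "8080")
System.setProperty("https.proxyHost", "proxy.mycompany.example")
System.setProperty("https.proxyPort", "8080")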

File not found exception while loading a properties file on a Scala SBT project

I am trying to learn a Scala-Spark JDBC program in IntelliJ IDEA. In order to do that, I have created a Scala SBT project, and the project structure looks like this:
Before writing the JDBC connection parameters in the class, I first tried loading a properties file which contains all my connection properties, and displaying the values to check whether they load properly, as below:
connection.properties content:
devUserName=username
devPassword=password
gpDriverClass=org.postgresql.Driver
gpDevUrl=jdbc:url
Code:
package com.yearpartition.obj
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
object PartitionRetrieval {
var conf = new SparkConf().setAppName("Spark-JDBC")
val properties = new Properties()
properties.load(new FileInputStream("connection.properties"))
val connectionUrl = properties.getProperty("gpDevUrl")
val devUserName=properties.getProperty("devUserName")
val devPassword=properties.getProperty("devPassword")
val gpDriverClass=properties.getProperty("gpDriverClass")
println("connectionUrl: " + connectionUrl)
Class.forName(gpDriverClass).newInstance()
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().enableHiveSupport().config(conf).master("local[2]").getOrCreate()
println("connectionUrl: " + connectionUrl)
}
}
Content of build.sbt:
name := "YearPartition"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= {
val sparkCoreVer = "2.2.0"
val sparkSqlVer = "2.2.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkCoreVer % "provided" withSources(),
"org.apache.spark" %% "spark-sql" % sparkSqlVer % "provided" withSources(),
"org.json4s" %% "json4s-jackson" % "3.2.11" % "provided",
"org.apache.httpcomponents" % "httpclient" % "4.5.3"
)
}
Since I am not writing or saving data to any file and am only trying to display the values from the properties file, I executed the code using the following:
SPARK_MAJOR_VERSION=2 spark-submit --class com.yearpartition.obj.PartitionRetrieval yearpartition_2.11-0.1.jar
But I am getting a FileNotFoundException, as below:
Caused by: java.io.FileNotFoundException: connection.properties (No such file or directory)
I tried to fix it, in vain. Could anyone let me know what mistake I am making here and how I can correct it?
You must use the full path to your connection.properties file (file:///full_path/connection.properties). With this option, if you submit the job to a cluster and want to read the file from local disk, you must place connection.properties at the same path on every server in the cluster. Alternatively, you can read the file from HDFS. Here is a little example of reading a file from HDFS:
@throws[IOException]
def readFileFromHdfs(file: String): org.apache.hadoop.fs.FSDataInputStream = {
val conf = new org.apache.hadoop.conf.Configuration
conf.set("fs.default.name", "HDFS_HOST")
val fileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
val path = new org.apache.hadoop.fs.Path(file)
if (!fileSystem.exists(path)) {
println("File (" + path + ") does not exists.")
null
} else {
val in = fileSystem.open(path)
in
}
}
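The properties from the question could then be loaded through that helper, for example (the HDFS path here is illustrative):
// Load connection.properties from HDFS instead of the local file system
val in = readFileFromHdfs("/user/yourname/connection.properties")
if (in != null) {
  properties.load(in)
  in.close()
}
println("connectionUrl: " + properties.getProperty("gpDevUrl"))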

Spark different behavior between spark-submit and spark-shell

I'm using Spark 1.3.1 standalone (on Ubuntu 14.04) with sbt 0.13.10, and I'm trying to execute the following script:
package co.some.sheker
import java.sql.Date
import org.apache.spark.{SparkContext, SparkConf}
import SparkContext._
import org.apache.spark.sql.{Row, SQLContext}
import com.datastax.spark.connector._
import java.sql._
import org.apache.spark.sql._
import org.apache.spark.sql.cassandra.CassandraSQLContext
import java.io.PushbackReader
import java.lang.{ StringBuilder => JavaStringBuilder }
import java.io.StringReader
import com.datastax.spark.connector.cql.CassandraConnector
import org.joda.time.{DateTimeConstants}
case class TableKey(key1: String, key2: String)
object myclass{
def main(args: scala.Array[String]) {
val conf = ...
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val csc = new CassandraSQLContext(sc)
val data_x = csc.sql("select distinct key1, key2 from keyspace.table where key1 = 'sheker'").map(row => (row(0).toString, row(1).toString))
println("Done cross mapping")
val snapshotsFiltered = data_x.map(x => TableKey(x._1,x._2)).joinWithCassandraTable("keyspace", "table")
println("Done join")
val jsons = snapshotsFiltered.map(_._2.getString("json"))
...
sc.stop()
println("Done.")
}
}
By using:
/home/user/spark-1.3.1/bin/spark-submit --master spark://1.1.1.1:7077 --driver-class-path /home/user/spark-cassandra-connector-java-assembly-1.3.1-FAT.jar --properties-file prop.conf --class "myclass" "myjar.jar"
The prop.conf file is:
spark.cassandra.connection.host myhost
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled true
spark.eventLog.dir /var/tmp/eventLog
spark.executor.extraClassPath /home/ubuntu/spark-cassandra-connector-java-assembly-1.3.1-FAT.jar
And I get this exception:
Done cross mapping
Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.spark.connector.mapper.ColumnMapper$.defaultColumnMapper(Lscala/reflect/ClassTag;Lscala/reflect/api/TypeTags$TypeTag;)Lcom/datastax/spark/connector/mapper/ColumnMapper;
at co.crowdx.aggregation.CassandraToElasticTransformater$.main(CassandraToElasticTransformater.scala:79)
at co.crowdx.aggregation.CassandraToElasticTransformater.main(CassandraToElasticTransformater.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Done Sending Signal aggregation job to Spark
And the strange part is that when I try to run the commands from the script in the shell, they work fine. I'm using:
/home/user/spark-1.3.1/bin/spark-shell --master spark://1.1.1.1:7077 --driver-class-path /home/ubuntu/spark-cassandra-connector-java-assembly-1.3.1-FAT.jar --properties-file prop.conf
The Build.scala file is:
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._
object AggregationsBuild extends Build {
lazy val buildSettings = Defaults.defaultSettings ++ Seq(
version := "1.0.0",
organization := "co.sheker",
scalaVersion := "2.10.4"
)
lazy val app = Project(
"geo-aggregations",
file("."),
settings = buildSettings ++ assemblySettings ++ Seq(
parallelExecution in Test := false,
libraryDependencies ++= Seq(
"com.datastax.spark" %% "spark-cassandra-connector" % "1.2.1",
// spark will already be on classpath when using spark-submit.
// marked as provided, so that it isn't included in assembly.
"org.apache.spark" %% "spark-core" % "1.2.1" % "provided",
"org.apache.spark" %% "spark-catalyst" % "1.2.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.2.1" % "provided",
"org.scalatest" %% "scalatest" % "2.1.5" % "test",
"org.postgresql" % "postgresql" % "9.4-1201-jdbc41",
"com.github.nscala-time" %% "nscala-time" % "2.4.0",
"org.elasticsearch" % "elasticsearch-hadoop" % "2.2.0" % "provided"
),
resolvers += "conjars.org" at "http://conjars.org/repo",
resolvers += "clojars" at "https://clojars.org/repo"
)
)
}
What is wrong? Why does it fail on spark-submit but not in the shell?
You said that you are using Spark 1.3, but your build contains Spark 1.2.1 dependencies.
As I said in the comment, I believe that your Spark driver's version is different from the one in your application, which leads to the error you are getting.
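A minimal sketch of what the dependency section could look like once it is aligned with the Spark 1.3.1 runtime and the 1.3.1 FAT connector jar already passed to spark-submit (the exact versions are assumptions to adjust to what your cluster actually provides):
libraryDependencies ++= Seq(
  // Connector version matching the spark-cassandra-connector 1.3.1 assembly used at submit time
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.1",
  // Spark artifacts aligned with the 1.3.1 runtime, still marked provided for spark-submit
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-catalyst" % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.3.1" % "provided"
)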