Proxy Issues while Dumping Data into CosmosDB from Dataframe using intelliJ - scala

I need to load JSON data into Cosmos DB from a Spark DataFrame using Scala and IntelliJ.
I read a CSV file from my local machine and convert it to JSON. Now I need to write that JSON data to a Cosmos DB collection.
The Spark version is 2.2.0 and the Scala version is 2.11.8.
Below is the code I wrote in IntelliJ with Scala to read the CSV file from my local machine and convert it to JSON.
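import org.apache.spark.sql.SparkSession
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.schema._ // implicit cosmosDB extension methods
import com.microsoft.azure.cosmosdb.spark.config.Config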
object DataLoadConversion {
def main(args: Array[String]): Unit = {
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")
val spark = SparkSession.builder().master("local").appName("DataConversion").getOrCreate()
val df = spark.read.format("com.databricks.spark.csv")
.option("quote", "\"")
.option("escape", "\"")
.option("delimiter", ",")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema","true")
.load("file:///C:/Users/an/Desktop/ct_temp.csv")
val finalDf = df.select(df("history_temp_id").as("NUM"),df("history_temp_time").as("TIME"))
val jsonData = finalDf.select("NUM", "TIME").toJSON
jsonData.show(2)
// COSMOS DB Write configuration
val writeConfig = Config(Map(
"Endpoint" -> "https://cosms.documents.azure.com:443/",
"Masterkey" -> "YOUR-KEY-HERE", //provided primary key
"Database" -> "DBName", //provided with DB name
"Collection" -> "Collection" //provided with collection name (no trailing comma: Scala 2.11 rejects it)
))
// Write to Cosmos DB from the DataFrame
import org.apache.spark.sql.SaveMode
jsonData.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
}
}
Below is the build.sbt file
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"com.databricks" %% "spark-csv" % "1.5.0"
)
libraryDependencies += "com.microsoft.azure" % "azure-cosmosdb-spark_2.2.0_2.11" % "1.1.1" % "provided" exclude("org.apache.spark", "spark-core_2.10")
I added the Cosmos DB dependency to the build.sbt file.
I am new to Spark and Scala. What steps do I need to follow to connect to Cosmos DB from IntelliJ with Spark and Scala?
The build succeeds, but I get the error below when running the code.
19/07/10 16:32:41 INFO DocumentClient: Initializing DocumentClient with serviceEndpoint [https://cosms.documents.azure.com/], ConnectionPolicy [ConnectionPolicy [requestTimeout=60, mediaRequestTimeout=300, connectionMode=Gateway, mediaReadMode=Buffered, maxPoolSize=400, idleConnectionTimeout=60, userAgentSuffix= SparkConnector/2.2.0_2.11-1.1.1, retryOptions=com.microsoft.azure.documentdb.RetryOptions#1ef5cde4, enableEndpointDiscovery=true, preferredLocations=[Japan East]]], ConsistencyLevel [Session]
19/07/10 16:33:03 WARN DocumentClient: Failed to retrieve database account information. org.apache.http.conn.HttpHostConnectException: Connect to cosms.documents.azure.com:443 [cosms.documents.azure.com/13.78.51.35] failed: Connection timed out: connect
......
Exception in thread "main" java.lang.IllegalStateException: Http client execution failed.
at com.microsoft.azure.documentdb.internal.GatewayProxy.performGetRequest(GatewayProxy.java:244)
at com.microsoft.azure.documentdb.internal.GatewayProxy.doRead(GatewayProxy.java:93)
When I connect from outside my office network this works, but when my machine is on the office network I get the error above.
I have tried configuring the proxy under Settings > Proxy settings.
If I open the same endpoint in Chrome, I get the response below.
{"code":"Unauthorized","message":"Required Header authorization is missing. Ensure a valid Authorization token is passed.\r\nActivityId: 54999e41-179e-4877-b8bf-f2c2a33280fd, Microsoft.Azure.Documents.Common/2.5.1"}
How can I resolve this? How can I bypass the proxy on the office network?
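The Unauthorized response in Chrome suggests the endpoint is reachable through the browser's proxy settings, so one possibility is that the JVM running the Spark job is not using the same proxy. Here is a minimal sketch that sets the standard JVM proxy properties before the SparkSession is created. The host proxy.example.com and port 8080 are placeholders, and whether the HTTP client inside the Cosmos DB connector honours these properties is an assumption that is not confirmed by anything in the question.

```scala
// Sketch only: sets the standard JVM networking properties for a proxy.
// proxy.example.com and 8080 are placeholder values, not real settings.
object ProxySettings {
  val proxyHost = "proxy.example.com"
  val proxyPort = "8080"

  // Call this at the start of main, before any client is created.
  def install(): Unit = {
    System.setProperty("https.proxyHost", proxyHost)
    System.setProperty("https.proxyPort", proxyPort)
    System.setProperty("http.proxyHost", proxyHost)
    System.setProperty("http.proxyPort", proxyPort)
  }
}
```

The same properties can be passed as VM options in the run configuration instead, for example -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080.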

Related

spark streaming save base64 rdd to json on s3

The Scala application below cannot save an RDD in JSON format to S3.
I have:
a Kinesis stream that has complex objects placed on it. Each object has had JSON.stringify() applied to it before being placed on the stream by the Kinesis PutRecord method.
a Scala Spark streaming job that reads these items off the stream.
I cannot seem to save the RDD records that come off the stream as a JSON file in an S3 bucket.
In the code I've attempted to convert the RDD[Array[Byte]] to RDD[String] and then load it with spark.read.json, but no luck. I've tried various other combinations and can't seem to write the records to S3 in their raw format.
import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
object ScalaStream {
def main(args: Array[String]): Unit = {
val appName = "ScalaStreamExample"
val batchInterval = Milliseconds(2000)
val outPath = "s3://xxx-xx--xxx/xxxx/"
val spark = SparkSession
.builder()
.appName(appName)
.getOrCreate()
val sparkContext = spark.sparkContext
val streamingContext = new StreamingContext(sparkContext, batchInterval)
// Populate the appropriate variables from the given args
val checkpointAppName = "xxx-xx-xx--xx"
val streamName = "cc-cc-c--c--cc"
val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
val regionName = "cc-xxxx-xxx"
val initialPosition = new Latest()
val checkpointInterval = batchInterval
val storageLevel = StorageLevel.MEMORY_AND_DISK_2
val kinesisStream = KinesisInputDStream.builder
.streamingContext(streamingContext)
.endpointUrl(endpointUrl)
.regionName(regionName)
.streamName(streamName)
.initialPosition(initialPosition)
.checkpointAppName(checkpointAppName)
.checkpointInterval(checkpointInterval)
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
kinesisStream.foreachRDD { rdd =>
if (!rdd.isEmpty()){
//**************** . <---------------
// This is where i'm trying to save the raw json object to s3 as json file
// tried various combinations here but no luck.
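import spark.implicits._ // brings toDS() into scope for RDD[String]
val jsonStrings = rdd.map(record => new String(record, "UTF-8")).toDS() // convert bytes to string
val dataFrame = spark.read.json(jsonStrings) // parse each record as one JSON document
dataFrame.write.mode(SaveMode.Append).json(outPath + rdd.id.toString)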
//**************** <----------------
}
}
// Start the streaming context and await termination
streamingContext.start()
streamingContext.awaitTermination()
}
}
Any ideas what I'm missing?
So the reason it failed turned out to be a complete red herring: it was a Scala version conflict with what is available on EMR.
Many similar questions on SO suggested this might be the issue. Although the Spark documentation lists Scala 2.12.4 as compatible with Spark 2.4.4, the EMR instance does not appear to support Scala 2.12.4. So I've updated my build.sbt and deploy script from
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
to:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
deploy.sh
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE
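To make the build.sbt change above easier to follow: with %%, sbt appends the Scala binary version to the artifact name, so changing scalaVersion to 2.11.12 switches every %% dependency to its _2.11 artifact. A rough sketch of that naming rule for Scala 2.x follows. The object and function names are hypothetical and only illustrate the rule; they are not sbt's API.

```scala
// Illustration of how a %% dependency name is derived from scalaVersion (Scala 2.x).
// CrossVersion, binaryVersion and artifactName are made-up names for this sketch.
object CrossVersion {
  // "2.11.12" -> "2.11": for Scala 2.x the binary version is major.minor.
  def binaryVersion(scalaVersion: String): String =
    scalaVersion.split('.').take(2).mkString(".")

  // "spark-sql" with Scala 2.11.12 -> "spark-sql_2.11"
  def artifactName(name: String, scalaVersion: String): String =
    s"${name}_${binaryVersion(scalaVersion)}"
}
```

The suffix in the --packages argument of the deploy script has to match the same binary version, which is why it reads spark-streaming-kinesis-asl_2.11.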

spark-hbase-connector : ClusterId read in ZooKeeper is null

I'm trying to run a simple program that copies the contents of an RDD into an HBase table. I'm using spark-hbase-connector by nerdammer https://github.com/nerdammer/spark-hbase-connector. I'm running the code using spark-submit on a local cluster on my machine. The Spark version is 2.1.
This is the code I'm trying to run:
import org.apache.spark.{SparkConf, SparkContext}
import it.nerdammer.spark.hbase._
object HbaseConnect {
def main(args: Array[String]) {
val sparkConf = new SparkConf()
sparkConf.set("spark.hbase.host", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(1 to 100)
.map(i => (i.toString, i+1, "Hello"))
rdd.toHBaseTable("mytable").toColumns("column1", "column2")
.inColumnFamily("mycf")
.save()
sc.stop
}}
Here is my build.sbt:
name := "HbaseConnect"
version := "0.1"
scalaVersion := "2.11.8"
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first}
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3")
The execution gets stuck, showing the following log output:
17/11/22 10:20:34 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/11/22 10:20:34 INFO TableOutputFormat: Created table instance for mytable
I am unable to identify the problem with ZooKeeper. HBase clients discover the running HBase cluster using the following two properties:
1. hbase.zookeeper.quorum: used to connect to the ZooKeeper cluster
2. zookeeper.znode.parent: tells which znode keeps the data (and the address of the HMaster) for the cluster
I overrode these two properties in the code with:
sparkConf.set("spark.hbase.host", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
Another question: there is no spark-hbase-connector_2.11. Can the provided spark-hbase-connector_2.10 support Scala 2.11?
The problem is solved. I had to override the HMaster port to 16000 (which is my HMaster port number; I'm using Ambari). The default value that sparkConf uses is 60000.
sparkConf.set("hbase.master", "hostname"+":16000")
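Since the hang came down to the client using the wrong HMaster port, a quick reachability check can help confirm which port is actually open before changing the Spark configuration. This is a minimal sketch using only the JDK. PortCheck and canConnect are hypothetical names, and the 2000 ms timeout is an arbitrary choice.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper: returns true if a TCP connection to host:port succeeds.
object PortCheck {
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    finally socket.close()
  }
}
```

For example, PortCheck.canConnect("hostname", 16000) and PortCheck.canConnect("hostname", 60000) can be compared, with "hostname" replaced by the actual HMaster host.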

java.lang.NullPointerException while reading data from MSSQL server with spark

I am having issues reading data from MSSQL Server using Cloudera Spark. I am not sure where the problem is or what is causing it.
Here is my build.sbt:
val sparkversion = "1.6.0-cdh5.10.1"
name := "SimpleSpark"
organization := "com.huff.spark"
version := "1.0"
scalaVersion := "2.10.5"
mainClass in Compile := Some("com.huff.spark.example.SimpleSpark")
assemblyJarName in assembly := "mssql.jar"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided",
"org.apache.spark" % "spark-core_2.10" % sparkversion % "provided", // to test in cluster
"org.apache.spark" % "spark-sql_2.10" % sparkversion % "provided" // to test in cluster
)
resolvers += "Confluent IO" at "http://packages.confluent.io/maven"
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos"
And here is my scala source:
package com.huff.spark.example
import org.apache.spark.sql._
import java.sql.{Connection, DriverManager}
import java.util.Properties
import org.apache.spark.{SparkContext, SparkConf}
object SimpleSpark {
def main(args: Array[String]) {
val sourceProp = new java.util.Properties
val conf = new SparkConf().setAppName("SimpleSpark").setMaster("yarn-cluster") //to test in cluster
val sc = new SparkContext(conf)
var SqlContext = new SQLContext(sc)
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
jdbcDF.show(5)
}
}
And this is the error I see:
17/05/24 04:35:20 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:155)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
at com.huff.spark.example.SimpleSpark$.main(SimpleSpark.scala:16)
at com.huff.spark.example.SimpleSpark.main(SimpleSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:552)
17/05/24 04:35:20 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NullPointerException)
I know the problem is in line 16, which is:
val jdbcDF = SqlContext.read.format("jdbc").options(Map("url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl","driver" -> driver,"dbtable" -> "StgS")).load()
But I can't pinpoint what exactly the problem is. Is it something to do with access (which is doubtful), a problem with the connection parameters (I would expect the error message to say so), or something else I am not aware of? Thanks in advance :-)
If you are using Azure SQL Server, copy the JDBC connection string from the Azure portal. I tried that and it worked for me.
Azure Databricks, using Scala mode:
import com.microsoft.sqlserver.jdbc.SQLServerDriver
import java.sql.DriverManager
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
// MS SQL JDBC Connection String ...
val jdbcSqlConn = "jdbc:sqlserver://***.database.windows.net:1433;database=**;user=***;password=****;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Loading the ms sql table via spark context into dataframe
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> jdbcSqlConn,
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"dbtable" -> "***")).load()
// Registering the temp table so that we can SQL like query against the table
jdbcDF.registerTempTable("yourtablename")
// selecting only top 10 rows here but you can use any sql statement
val yourdata = sqlContext.sql("SELECT * FROM yourtablename LIMIT 10")
// display the data
yourdata.show()
The NPE occurs when Spark tries to close the database connection, which indicates that it could not obtain a proper connection via JdbcUtils.createConnectionFactory. Check your connection URL and the logs for failures.
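One thing worth ruling out when no connection can be obtained is that the JDBC driver class is missing from the classpath of the process that runs the job. That is a possible cause, not something the stack trace confirms. A minimal sketch of such a check follows, with DriverCheck and driverAvailable as hypothetical names:

```scala
import scala.util.Try

// Hypothetical helper: reports whether a class can be loaded from the current classpath.
object DriverCheck {
  def driverAvailable(className: String): Boolean =
    Try(Class.forName(className)).isSuccess
}
```

Logging DriverCheck.driverAvailable("com.microsoft.sqlserver.jdbc.SQLServerDriver") at the start of main, before the read, would show whether the driver jar made it onto the classpath in yarn-cluster mode.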

How does DataStax Spark Cassandra connector create SparkContext?

I have run the following Spark test program successfully. In this program I noticed the "cassandraTable" and "getOrCreate" methods being called on SparkContext, but I am not able to find them in the Spark Scala API docs for this class. What am I missing in understanding this code? I am trying to understand how this SparkContext is different when the DataStax connector is in sbt.
Code:
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
object CassandraInt {
def main(args:Array[String]){
val SparkMasterHost = "127.0.0.1"
val CassandraHost = "127.0.0.1"
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CassandraHost)
.set("spark.cleaner.ttl", "3600")
.setMaster("local[12]")
.setAppName(getClass.getSimpleName)
// Connect to the Spark cluster:
lazy val sc = SparkContext.getOrCreate(conf)
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.map(_.getInt("value")).sum)
}}
The build.sbt file I used is -
name := "Test Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
addCommandAlias("c1", "run-main CassandraInt")
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"
fork in run := true
It is not different. Spark supports only one active SparkContext, and getOrCreate is a method defined on the SparkContext companion object. From its documentation:
This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.
This method allows not passing a SparkConf (useful if just retrieving).
To summarize:
If there is an active context it returns it.
Otherwise it creates a new one.
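The get-or-create behaviour summarized above can be sketched with a plain companion object. SharedContext is a hypothetical stand-in for illustration only; it is not how Spark implements SparkContext.getOrCreate.

```scala
// Hypothetical stand-in for a context class with a single shared instance.
final class SharedContext(val appName: String)

object SharedContext {
  private var active: Option[SharedContext] = None

  // Returns the active instance if there is one, otherwise creates and registers it.
  def getOrCreate(appName: String): SharedContext = synchronized {
    active.getOrElse {
      val created = new SharedContext(appName)
      active = Some(created)
      created
    }
  }
}
```

A second call returns the first instance, so the argument passed to it is ignored.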
cassandraTable is a method of SparkContextFunctions, which the connector exposes on SparkContext through an implicit conversion (brought into scope by import com.datastax.spark.connector._).
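Here is a small self-contained sketch of how an implicit conversion makes a method appear on a class that does not define it. PlainContext, PlainContextFunctions and the connector object are hypothetical names that mimic the pattern; they are not the connector's real classes.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for SparkContext: it has no table method of its own.
class PlainContext(val name: String)

// Hypothetical stand-in for SparkContextFunctions: wraps a context and adds a method.
class PlainContextFunctions(ctx: PlainContext) {
  def table(keyspace: String, table: String): String =
    s"${ctx.name}:$keyspace.$table"
}

// Importing connector._ brings the conversion into scope.
object connector {
  implicit def toContextFunctions(ctx: PlainContext): PlainContextFunctions =
    new PlainContextFunctions(ctx)
}

object Demo {
  import connector._
  // The compiler rewrites this call to toContextFunctions(ctx).table("test", "kv").
  def describe(): String = new PlainContext("local").table("test", "kv")
}
```

Without the import, the call to table does not compile, which is why the method does not show up in the API docs of the class itself.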

Apache spark-streaming application output not forwarded to the master

I'm trying to run the FlumeEventCount example, which is the following:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import org.apache.spark.streaming.flume.FlumeUtils
object FlumeEventCount {
def main(args: Array[String]) {
val batchInterval = Milliseconds(2000)
// Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumeEventCount")
.set("spark.cleaner.ttl","3")
val ssc = new StreamingContext(sparkConf, batchInterval)
// Create a flume stream
var stream = FlumeUtils.createStream(ssc, "192.168.1.5",3564, StorageLevel.MEMORY_ONLY_SER_2)
// Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received flume events." + cnt ).print()
stream.count.print()
stream.print()
ssc.start()
ssc.awaitTermination()
}
}
My sbt file is the following:
import AssemblyKeys._
assemblySettings
name := "flume-test"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.0.0" exclude("org.apache.spark","spark-core") exclude("org.apache.spark", "spark-streaming_2.10")
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
I run the program with the following command:
/tmp/spark-1.0.0-bin-hadoop2/bin/spark-submit --class FlumeEventCount --master local --deploy-mode client /tmp/fooproj/target/scala-2.10/cert-log-manager-assembly-1.0.jar
On the other side, the Flume application is sending everything correctly and I can see in the logs that it's received.
I haven't made any changes to Spark's configuration or set up any environment variables; I just downloaded and unpacked the program.
Can someone tell me what I am doing wrong?
//edit: When I execute Spark's FlumeEventCount example, it works
//edit2: If I remove the awaitTermination and add an ssc.stop, it prints everything one single time. I guess this happens because something is getting flushed
...I should have learned to RTFM more carefully by now.
Quoting from this page: https://spark.apache.org/docs/latest/streaming-programming-guide.html
// Spark Streaming needs at least two working thread
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
I had been launching Spark with only one thread (--master local).
The following also works fine:
stream.map(event=>"Event: header:"+ event.event.get(0).toString+" body:"+ new String(event.event.getBody.array) ).print
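To illustrate why a single thread is not enough: the receiver occupies one thread for as long as the stream runs, so with only one thread nothing is left to process the batches. The sketch below reproduces that effect with a plain JDK thread pool. It is an analogy under that assumption, not Spark's scheduler, and ThreadDemo and processingRuns are hypothetical names.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object ThreadDemo {
  // Returns true if the "processing" task gets to run while the "receiver" is still active.
  def processingRuns(threads: Int): Boolean = {
    val pool = Executors.newFixedThreadPool(threads)
    val stopReceiver = new CountDownLatch(1)
    val processed = new CountDownLatch(1)

    // Long-running receiver: holds its thread until told to stop.
    pool.submit(new Runnable { def run(): Unit = stopReceiver.await() })
    // Batch processing: needs a free thread to run at all.
    pool.submit(new Runnable { def run(): Unit = processed.countDown() })

    val ran = processed.await(500, TimeUnit.MILLISECONDS)
    stopReceiver.countDown()
    pool.shutdown()
    ran
  }
}
```

With one thread, processingRuns(1) returns false because the receiver never releases the thread; with processingRuns(2) the processing task runs. For the command above, the equivalent change is passing --master local[2] instead of --master local.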