Issues with Spark and Salesforce Connection - scala

I am trying to load a table from Salesforce using Spark. This is the code I invoked:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
object Sample {
  def main(arg: Array[String]) {
    val spark = SparkSession.builder()
      .appName("salesforce")
      .master("local[*]")
      .getOrCreate()

    val tableName = "Opportunity"
    val outputPath = "output/result" + tableName

    val salesforceDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:datadirect:sforce://login.salesforce.com;")
      .option("driver", "com.ddtek.jdbc.sforce.SForceDriver")
      .option("dbtable", tableName)
      .option("user", "")
      .option("password", "xxxxxxxxx")
      .option("securitytoken", "xxxxx")
      .load()

    salesforceDf.createOrReplaceTempView("Opportunity")
    spark.sql("select * from Opportunity").collect.foreach(println)

    // save the result
    salesforceDf.write.save(outputPath)
  }
}
And the docs I was referring to said to start a spark shell as:
spark-shell --jars /path_to_driver/sforce.jar
This printed a lot of lines in the terminal, and this was the last one:
22/07/12 14:57:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a12b060d-5c82-4283-b2b9-53f9b3863b53
And then to submit the Spark job:
spark-submit --jars sforce.jar --class <Your class name> your jar file
However, I am not sure where this jar file is, whether it was actually generated, or how to submit it. Any help is appreciated, thank you.
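For reference, a rough sketch of how the pieces usually fit together when the project is built with sbt (the jar name and Scala version below are illustrative): running sbt package produces the application jar under target/scala-2.xx/, and that jar, together with the DataDirect driver jar, is what gets passed to spark-submit:
sbt package
spark-submit \
  --jars /path_to_driver/sforce.jar \
  --class Sample \
  target/scala-2.12/your-project_2.12-0.1.jar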

Related

Parquet file being read as empty

I'm trying to read a parquet file that I downloaded from HDFS into my Jupyter notebook, but it is showing up as empty. I know it is not empty because I had worked on it prior to saving it to HDFS. Does anyone know why it is being read as empty?
The size of the file on HDFS, and a count run in the cluster environment:
hadoop fs -du -s -h /user/some/test.parquet
1.2 M 3.5 M /user/some/test.parquet
val test = spark.read.parquet("hdfs:///user/some/test.parquet")
test.count()
res0: Long = 10
On an almond kernel in a Jupyter notebook (to work in Scala):
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)
import org.apache.spark.sql._
val spark = {
  SparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}
def sc = spark.sparkContext
val test = spark.read.parquet("/Users/me/some/test.parquet")
test: DataFrame = [UnitId: string, GeoId: string ... 26 more fields]
test.count()
res28: Long = 0L
For anyone who's interested, I figured out the issue.
I had downloaded the parquet file from HDFS using "hadoop fs -getmerge", which corrupted the file.
The right approach when dealing with a parquet file is to use "hadoop fs -get".
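To illustrate with the path from above (Spark typically writes Parquet as a directory of part files plus metadata, so concatenating the parts produces an invalid Parquet file, while copying the directory keeps it readable):
hadoop fs -getmerge /user/some/test.parquet test.parquet   # concatenates part files -> not a valid Parquet file
hadoop fs -get /user/some/test.parquet .                   # copies the whole directory as-is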

Spark Hbase connector (SHC) is not returning any data from HBase table

I am following the Spark HBase Connector (SHC) basic example to read an HBase table in the spark2 shell, version 2.2.0. It looks like the code is working, but when I run the df.show() command, I do not see any results, and it seems to run forever.
import org.apache.spark.sql.{ DataFrame, Row, SQLContext }
import org.apache.spark.sql.execution.datasources.hbase._
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
def catalog = s"""{
|"table":{"namespace":"default", "name":"testmeta"},
|"rowkey":"vgil",
|"columns":{
|"id":{"cf":"rowkey", "col":"vgil", "type":"string"},
|"col1":{"cf":"pp", "col":"dtyp", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
val df = withCatalog(catalog)
df.show()
df.show() will neither give any output nor any error; it just keeps running forever.
Also, how can I run a query for a range of row keys?
Here is the scan of the HBase test table.
hbase(main):001:0> scan 'testmeta'
ROW COLUMN+CELL
fmix column=pp:dtyp, timestamp=1541714925380, value=ss1
fmix column=pp:lati, timestamp=1541714925371, value=41.50
fmix column=pp:long, timestamp=1541714925374, value=-81.61
fmix column=pp:modm, timestamp=1541714925377, value=ABC
vgil column=pp:dtyp, timestamp=1541714925405, value=ss2
vgil column=pp:lati, timestamp=1541714925397, value=41.50
I have followed some of the solutions on the web, but unfortunately I am not able to get the data from HBase.
Thanks in advance for help!
Posting my answer after lots of trial and error: I found that adding the --conf option when starting the Spark shell helped me connect to HBase.
spark2-shell --master yarn --deploy-mode client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11,it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --repositories http://repo.hortonworks.com/content/groups/public/ --conf spark.hbase.host=192.168.xxx.xxx --files /mnt/fs1/opt/cloudera/parcels/CDH-5.13.0-1.cdh5.13.0.p0.29/share/doc/hbase-solr-doc-1.5+cdh5.13.0+71/demo/hbase-site.xml
Then the following code snippet could fetch a value for one column qualifier.
// The nerdammer connector's implicits provide sc.hbaseTable (pulled in by the --packages option above)
import it.nerdammer.spark.hbase._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val hBaseRDD_iacp = sc.hbaseTable[(String)]("testmeta").select("lati").inColumnFamily("pp").withStartRow("vg").withStopRow("vgz")

object myschema {
  val column1 = StructField("column1", StringType)
  val struct = StructType(Array(column1))
}

val rowRDD = hBaseRDD_iacp.map(x => Row(x))
val myDf = sqlContext.createDataFrame(rowRDD, myschema.struct)
myDf.show()

Executing Spark scala program after compilation

I have compiled a Spark Scala program on the command line, and now I want to execute it. I don't want to use Maven or sbt.
I used this command to execute the program:
scala -cp ".:sparkDIrector/jars/*" wordcount
But I am getting this error
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
import org.apache.spark._
import org.apache.spark.SparkConf

/** Create a RDD of lines from a text file, and keep count of
  * how often each word appears.
  */
object wordcount1 {
  def main(args: Array[String]) {
    // Set up a SparkContext named WordCount that runs locally using
    // all available cores.
    println("before conf")
    val conf = new SparkConf().setAppName("WordCount")
    conf.setMaster("local[*]")
    val sc = new SparkContext(conf)
    println("after the textfile")

    // Create a RDD of lines of text in our book
    val input = sc.textFile("book.txt")
    println("after the textfile")

    // Use flatMap to convert this into an rdd of each word in each line
    val words = input.flatMap(line => line.split(' '))

    // Convert these words to lowercase
    val lowerCaseWords = words.map(word => word.toLowerCase())

    // Count up the occurrence of each unique word
    println("before text file")
    val wordCounts = lowerCaseWords.countByValue()

    // Print the first 20 results
    val sample = wordCounts.take(20)
    for ((word, count) <- sample) {
      println(word + " " + count)
    }

    sc.stop()
  }
}
It shows that the error is at this location:
val conf = new SparkConf().setAppName("WordCount").
Any help?
Starting from Spark 2.0 the entry point is the SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder
  .appName("App Name")
  .getOrCreate()
Then you can access the SparkContext and read the file with:
spark.sparkContext.textFile(yourFileOrURL)
Remember to stop your session at the end:
spark.stop()
I suggest you have a look at these examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
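For instance, your word count could be rewritten against the SparkSession entry point roughly like this (just a sketch; it assumes book.txt sits in the working directory, as in your original code):
import org.apache.spark.sql.SparkSession

object wordcount1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // Read the book, split into lowercase words and count each one.
    val input = spark.sparkContext.textFile("book.txt")
    val wordCounts = input
      .flatMap(_.split(' '))
      .map(_.toLowerCase)
      .countByValue()

    // Print the first 20 results.
    wordCounts.take(20).foreach { case (word, count) => println(word + " " + count) }

    spark.stop()
  }
}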
Then, to launch your application, you have to use spark-submit:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
In your case, it will be something like:
./bin/spark-submit \
--class wordcount1 \
--master local \
/path/to/your.jar
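Since you said you already compiled the program and want to avoid Maven and sbt, one possible sketch is to package the compiled class files into a jar with the JDK's jar tool and hand that to spark-submit (the classes directory name is illustrative; use wherever your .class files ended up):
jar cf wordcount1.jar -C classes .
./bin/spark-submit --class wordcount1 --master "local[*]" wordcount1.jar
Also note that a NoSuchMethodError on scala.Predef$.refArrayOps usually points to a Scala binary-version mismatch, so the Scala version you compile with should match the one your Spark distribution was built against.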

Spark MLLib unable to write out to S3 : path already exists

I have data in an S3 bucket in the directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 in the directory /data/spark.
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)

    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)

    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)

    sc.stop()
  }
}
When I run this (as a Spark EMR job in AWS), the step fails with exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1, etc.) and making sure they do not exist before the step is run. Even then it is not working.
What am I doing wrong? I am new to Scala and Spark so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite mode enabled (calling toDF on an RDD needs a SparkSession in scope and import spark.implicits._):
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or, if you insist on using RDD methods, you can do as already described in this answer.
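For completeness, a common RDD-side workaround is to delete the output path before saving. A sketch, assuming the Hadoop FileSystem API is available on the cluster (the bucket and path are the ones from the error above):
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove any previous run's output so saveAsTextFile can create the directory fresh.
val fs = FileSystem.get(new URI("s3a://mybucket"), sc.hadoopConfiguration)
fs.delete(new Path("s3a://mybucket/data/spark"), true) // recursive delete

coalescedSparkData.saveAsTextFile("s3a://mybucket/data/spark")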

spark scala datastax csv load file and print schema

Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
import org.apache.log4j.{Level, Logger}
import java.util.Calendar
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._
object csvtocassandra {
  def main(args: Array[String]): Unit = {
    val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
    val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")

    // Cassandra Part
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    println(Calendar.getInstance.getTime)

    // Scala Read CSV Part
    val spark1 = org.apache.spark.sql.SparkSession.builder()
      .master("local")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example")
      .getOrCreate()

    val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
    val df_csv = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load(csv_input)
    df_csv.printSchema()
  }
}
Why am I not able to run this program as a job by submitting it to Spark? When I run this program from IntelliJ it works.
But when I create a JAR and run it, I get the following error.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
I am getting the following error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it prepending the dsefs://127.0.0.1 part, even though I am giving just the path /Users/Desktop/csv/example.csv when asked?
I tried giving the --master option as well; however, I get the same error. I am running DataStax Spark on my local machine, not a cluster.
Please correct me where I am going wrong.
Got it, never mind, sorry about that.
The input path should be given as file:///file_name.
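For example, the read in the session above would then look something like this, with an explicit local-file URI (the path is the one from the error message):
val df_csv = spark1.read.format("csv")
  .option("header", "true")
  .option("inferschema", "true")
  .load("file:///Users/Desktop/csv/example.csv")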