Unable to pass parameters to connection class using Scala-Spark on IntelliJ - scala

I have recently finished studying Scala and Spark. As an exercise, I am trying to read data from a table in a Postgres database over a JDBC connection. I created a Scala SBT project and a properties file to store all the connection properties.
I have the following properties in the connections.properties file:
devHost=xx.xxx.xxx.xxx
devPort=xxxx
devDbName=base
devUserName=username
devPassword=password
gpDriverClass=org.postgresql.Driver
I created a DBManager class where I initialize the connection properties:
import java.io.FileInputStream
import java.util.Properties

class DBManager {
  val dbProps = new Properties()
  val connectionProperties = new Properties()
  dbProps.load(new FileInputStream(connections.properties))

  val jdbcDevHostname = dbProps.getProperty("devHost")
  val jdbcDevPort = dbProps.getProperty("devPort")
  val jdbcDevDatabase = dbProps.getProperty("devDbName")
  val jdbcDevUrl = s"jdbc:postgresql://${jdbcDevHostname}:${jdbcDevPort}/${jdbcDevDatabase}?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory" + s",${uname},${pwd}"

  connectionProperties.setProperty("Driver", dbProps.getProperty("gpDriverClass"))
  connectionProperties.put("user", dbProps.getProperty("devUserName"))
  connectionProperties.put("password", dbProps.getProperty("devPassword"))
}
In a Scala Object I am trying to use all these details as below:
import org.apache.spark.sql.SparkSession
import com.gphive.connections.DBManager

object PartitionRetrieval {
  def main(args: Array[String]): Unit = {
    val dBManager = new DBManager()
    val spark = SparkSession.builder().enableHiveSupport().appName("GP_YEARLY_DATA").getOrCreate()
    val tabData = spark.read.jdbc(dBManager.jdbcDevUrl, "tableName", connectionProperties)
  }
}
I referred to this link to create the above code.
When I build the project, I see that the connections.properties file is not loaded properly in the code I have written.
Could anyone let me know how I can correct the mistake?
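For reference, a minimal sketch of how the class could be wired up, assuming the properties file sits in the application's working directory and following the question's own property names. The file name has to be a String literal, the credentials belong in connectionProperties rather than appended to the URL, and the uname/pwd values referenced in the question are never defined:

import java.io.FileInputStream
import java.util.Properties

class DBManager {
  val dbProps = new Properties()
  val connectionProperties = new Properties()

  // Load the properties file from the working directory (quoted file name).
  dbProps.load(new FileInputStream("connections.properties"))

  val jdbcDevHostname = dbProps.getProperty("devHost")
  val jdbcDevPort = dbProps.getProperty("devPort")
  val jdbcDevDatabase = dbProps.getProperty("devDbName")

  // The URL carries only host, port, database and the SSL options; user and
  // password go into connectionProperties below instead of onto the URL.
  val jdbcDevUrl =
    s"jdbc:postgresql://${jdbcDevHostname}:${jdbcDevPort}/${jdbcDevDatabase}" +
      "?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory"

  // Spark's JDBC reader normally takes the driver class under the "driver" key.
  connectionProperties.setProperty("driver", dbProps.getProperty("gpDriverClass"))
  connectionProperties.put("user", dbProps.getProperty("devUserName"))
  connectionProperties.put("password", dbProps.getProperty("devPassword"))
}

In PartitionRetrieval the properties would then be read off the instance, for example spark.read.jdbc(dBManager.jdbcDevUrl, "tableName", dBManager.connectionProperties); the bare connectionProperties in the question does not resolve to anything in that scope.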

Related

How to Run Apache Tika on Apache Spark

I am trying to run Apache Tika on Apache Spark on AWS EMR to perform distributed text extraction on a large collection of documents. I have built the Tika JAR with shaded dependencies as explained in https://forums.databricks.com/questions/28378/trying-to-use-apache-tika-on-databricks.html, and the job works correctly in local mode. However, when running the job in cluster mode, the extracted text always comes out as an empty string. This problem is outlined in Tika's documentation (https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-NoContentExtracted), but I haven't been able to debug the issue. Since the code works for me in local mode, it has to be something with the classpath or the JARs, and I can't figure it out.
Here is sample Scala code for my Spark job:
/* TikaTest.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.tika.parser._
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata
import java.io.DataInputStream

// The first argument must be an S3 path to a directory with documents for text extraction.
// The second argument must be an S3 path to a directory where extracted text will be written.
object TikaTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Tika Test")
    val sc = new SparkContext(conf)

    val binRDD = sc.binaryFiles(args(0))
    val textRDD = binRDD.map(file => parseFile(file._2.open()))
    textRDD.saveAsTextFile(args(1))

    sc.stop()
  }

  def parseFile(stream: DataInputStream): String = {
    val parser = new AutoDetectParser()
    val handler = new BodyContentHandler()
    val metadata = new Metadata()
    val context = new ParseContext()
    parser.parse(stream, handler, metadata, context)
    handler.toString()
  }
}
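No answer is recorded here, but the question's own hypothesis (classpath or JARs) points at a known Tika pitfall: AutoDetectParser discovers its parsers through META-INF/services files, and a fat-JAR build whose merge strategy drops or overwrites those files silently falls back to an empty parser for most formats, which matches the "empty string" symptom. A sketch of an sbt-assembly merge strategy that preserves them, assuming sbt-assembly is used to build the shaded JAR (with Maven Shade, the equivalent is a ServicesResourceTransformer):

// build.sbt (sketch) -- keep Tika's parser service registrations in the fat JAR
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}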

Setting UP Intellij to run apache spark with remote master

I have a project set up with H2O. I am able to run the code with Apache Toree when I set the Spark master to spark://xxxx.yyyy.zzzz:port.
It works fine and I can see the output in the Spark UI.
When I try to run the same code as an application in IntelliJ, I get the error java.lang.ClassNotFoundException: org.apache.spark.h2o.utils.NodeDesc, although I do see the application in the Spark UI for a short amount of time.
I tried running a simple hello-world application and that worked too; I was able to see the application in the Spark UI.
import java.io.File

import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters
import org.apache.spark.h2o.{StringHolder, H2OContext}
import org.apache.spark.{SparkFiles, SparkContext, SparkConf}
import water.fvec.H2OFrame

/**
 * Example of Sparkling Water based application.
 */
object SparklingWaterDroplet {

  def main(args: Array[String]) {
    // Create Spark Context
    val conf = configure("Sparkling Water Droplet")
    val sc = new SparkContext(conf)

    // Create H2O Context
    val h2oContext = H2OContext.getOrCreate(sc)
    import h2oContext.implicits._

    // Register file to be available on all nodes
    sc.addFile(this.getClass.getClassLoader.getResource("iris.csv").getPath)

    // Load data and parse it via the H2O parser
    val irisTable = new H2OFrame(new File(SparkFiles.get("iris.csv")))

    // Build GBM model
    val gbmParams = new GBMParameters()
    gbmParams._train = irisTable
    gbmParams._response_column = 'class
    gbmParams._ntrees = 5

    val gbm = new GBM(gbmParams)
    val gbmModel = gbm.trainModel.get

    // Make prediction on train data
    val predict = gbmModel.score(irisTable)('predict)

    // Compute number of mispredictions with help of the Spark API
    val trainRDD = h2oContext.asRDD[StringHolder](irisTable('class))
    val predictRDD = h2oContext.asRDD[StringHolder](predict)

    // Make sure that both RDDs have the same number of elements
    assert(trainRDD.count() == predictRDD.count)

    val numMispredictions = trainRDD.zip(predictRDD).filter(i => {
      val act = i._1
      val pred = i._2
      act.result != pred.result
    }).collect()

    println(
      s"""
         |Number of mispredictions: ${numMispredictions.length}
         |
         |Mispredictions:
         |
         |actual X predicted
         |------------------
         |${numMispredictions.map(i => i._1.result.get + " X " + i._2.result.get).mkString("\n")}
       """.stripMargin)

    // Shutdown application
    sc.stop()
  }

  def configure(appName: String = "Sparkling Water Demo"): SparkConf = {
    val conf = new SparkConf().setAppName(appName)
      .setMaster("spark://xxx.yyy.zz.aaaa:oooo")
    conf
  }
}
I also tried exporting the JARs as 'compile' from the dependencies menu.
Is there anything I am missing in the IntelliJ setup? It looks like the external libraries are not getting pushed to the master.
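No answer is recorded in this thread, but the symptom (the class resolves locally yet is missing on the executors) matches the dependency JARs not being shipped to the cluster. A minimal sketch of listing them explicitly on the SparkConf when launching from the IDE; the paths are placeholders, not from the original post:

def configure(appName: String = "Sparkling Water Demo"): SparkConf = {
  new SparkConf()
    .setAppName(appName)
    .setMaster("spark://xxx.yyy.zz.aaaa:oooo")
    // When running from IntelliJ there is no spark-submit step to distribute JARs,
    // so ship the application JAR and the Sparkling Water assembly explicitly.
    // Both paths are placeholders.
    .setJars(Seq(
      "/path/to/this-project.jar",
      "/path/to/sparkling-water-assembly.jar"
    ))
}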

Compilation error saving model written in Scala, Apache Spark

I am running the example source code provided by Apache Spark to create an FPGrowth model. I want to save the model for future use, so I added the final line of this code (model.save):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.mllib.util._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import java.io._
import scala.collection.mutable.Set

object App {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("prediction").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.textFile("FPFeatureSeries.txt")
    val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

    val fpg = new FPGrowth()
      .setMinSupport(0.1)
      .setNumPartitions(10)
    val model = fpg.run(transactions)

    val minConfidence = 0.8
    model.generateAssociationRules(minConfidence).collect().foreach { rule =>
      if (rule.confidence > minConfidence) {
        println(
          rule.antecedent.mkString("[", ",", "]")
            + " => " + rule.consequent.mkString("[", ",", "]")
            + ", " + rule.confidence)
      }
    }

    model.save(sc, "FPGrowthModel")
  }
}
The problem is that I get a compilation error: value save is not a member of org.apache.spark.mllib.fpm.FPGrowth
I have tried including libraries and copying the exact examples from the documentation but I am still getting the same error.
I am using Spark 2.0.0 and Scala 2.10.
I had the same issue. I used this to save the model:
sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
and this to load it:
val linRegModel = sc.objectFile[LinearRegressionModel]("path").first()
This might help: what-is-the-right-way-to-save-load-models-in-spark-pyspark
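Applied to the FPGrowth example in the question, a minimal sketch of the same workaround (FPGrowthModel[String] matches transactions of type RDD[Array[String]]; the path is illustrative):

import org.apache.spark.mllib.fpm.FPGrowthModel

// Save: wrap the trained model in a one-element RDD and write it as an object file.
sc.parallelize(Seq(model), 1).saveAsObjectFile("FPGrowthModel")

// Load: read the object file back and take its single element.
val loadedModel = sc.objectFile[FPGrowthModel[String]]("FPGrowthModel").first()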

Wiki xml parser - org.apache.spark.SparkException: Task not serializable

I am a newbie to both Scala and Spark and am trying some of the tutorials; this one is from Advanced Analytics with Spark. The following code is supposed to work:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._

val path = "/home/petr/Downloads/wiki/wiki"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
  classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)

import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._

def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
  val page = new EnglishWikipediaPage()
  WikipediaPage.readPage(page, xml)
  if (page.isEmpty) None
  else Some((page.getTitle, page.getContent))
}

val plainText = rawXmls.flatMap(wikiXmlToPlainText)
But it gives
scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:295)
...
Running Spark v1.3.0 locally (and I have loaded only about 21 MB of the wiki articles, just to test it).
None of the results from https://stackoverflow.com/search?q=org.apache.spark.SparkException%3A+Task+not+serializable gave me any clue...
Thanks.
try
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._

val path = "/home/terrapin/Downloads/enwiki-20150304-pages-articles1.xml-p000000010p000010000"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
  classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)

import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._

val plainText = rawXmls.flatMap { line =>
  val page = new EnglishWikipediaPage()
  WikipediaPage.readPage(page, line)
  if (page.isEmpty) None
  else Some((page.getTitle, page.getContent))
}
The first guess that comes to mind: all your code is wrapped in the object where the SparkContext is defined. Spark tries to serialize this object in order to ship the wikiXmlToPlainText function to the nodes. Try creating a separate object that contains only the wikiXmlToPlainText function.
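A minimal sketch of that suggestion, with a made-up object name:

import edu.umd.cloud9.collection.wikipedia.WikipediaPage
import edu.umd.cloud9.collection.wikipedia.language.EnglishWikipediaPage

// Standalone object holding only the parsing function, so the closure passed to
// flatMap does not drag in the enclosing object that owns the SparkContext.
object WikiParsing {
  def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
    val page = new EnglishWikipediaPage()
    WikipediaPage.readPage(page, xml)
    if (page.isEmpty) None
    else Some((page.getTitle, page.getContent))
  }
}

// Usage: val plainText = rawXmls.flatMap(WikiParsing.wikiXmlToPlainText)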

What are All the ways we can run a scala code in Apache-Spark?

I know there are two ways to run Scala code in Apache Spark:
1- Using spark-shell
2- Making a JAR file from our project and using spark-submit to run it
Is there any other way to run Scala code in Apache Spark? For example, can I run a Scala object (e.g. object.scala) in Apache Spark directly?
Thanks
1. Using spark-shell
2. Making a JAR file from our project and using spark-submit to run it
3. Running a Spark job programmatically
String sourcePath = "hdfs://hdfs-server:54310/input/*";

SparkConf conf = new SparkConf().setAppName("TestLineCount");
conf.setJars(new String[] { App.class.getProtectionDomain()
        .getCodeSource().getLocation().getPath() });
conf.setMaster("spark://spark-server:7077");
conf.set("spark.driver.allowMultipleContexts", "true");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> log = sc.textFile(sourcePath);
JavaRDD<String> lines = log.filter(x -> {
    return true;
});
System.out.println(lines.count());
Scala version:
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val logFile = "/tmp/logs.txt"
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache

    println("line count: " + logData.count())
  }
}
For more detail, refer to this blog post.
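As an aside not raised in the thread: Spark also ships a programmatic launcher, org.apache.spark.launcher.SparkLauncher, which starts a packaged application as a child process the same way spark-submit would. A minimal sketch with placeholder paths and class names:

import org.apache.spark.launcher.SparkLauncher

object LaunchJob {
  def main(args: Array[String]): Unit = {
    // All paths and class names below are placeholders.
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar") // the packaged application JAR
      .setMainClass("SimpleApp")             // main class inside that JAR
      .setMaster("local[*]")
      .launch()                              // spawns spark-submit as a child process
    process.waitFor()
  }
}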