I have the transactor below:
val transactor: Resource[IO, HikariTransactor[IO]] =
for {
ce <- ExecutionContexts.fixedThreadPool[IO](32) // our connect EC
be <- Blocker[IO] // our blocking EC
xa <- HikariTransactor.newHikariTransactor[IO](
"com.mysql.cj.jdbc.Driver",
"jdbc:mysql://localhost:3306/ems",
"username",
"password",
ce,
be
)
} yield xa
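As an aside, the Resource above is usually acquired once near the application's entry point and the same pooled transactor reused for every query, rather than opened per call. A minimal sketch, assuming a cats-effect IOApp and a hypothetical program(xa) that runs the application's queries:
import cats.effect.{ExitCode, IO, IOApp}
object Main extends IOApp {
  // hypothetical placeholder for the queries the application runs against the shared pool
  def program(xa: HikariTransactor[IO]): IO[Unit] = IO.unit
  def run(args: List[String]): IO[ExitCode] =
    transactor.use { xa =>
      program(xa).map(_ => ExitCode.Success)
    }
}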
I am querying MySQL with the code below:
val table = "companies"
val keyCol = "id"
val columns = List("address",
"city",
"companyname",
"email",
"mobile",
"id",
"registerdate",
"registrationexp")
val queryString =
s"""SELECT ${columns.mkString(", ")}
FROM $table WHERE $keyCol = ? """
log.debug(s"$queryString")
transactor.use { xa =>
  Query[Int, Company](queryString).option(id).transact(xa)
}
Company is a case class with fields matching the column names above, but I am getting the error below:
java.lang.ClassCastException: Cannot cast scala.Some to Company
Where am I going wrong? Thanks in advance.
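For reference, Query[Int, Company](queryString).option(id) describes a ConnectionIO[Option[Company]], so the value coming out of transact is an IO[Option[Company]] and has to be unwrapped as an Option rather than cast to Company. A minimal sketch, assuming a hypothetical Company case class whose field types and order line up with the SELECT list (nullable columns as Option):
case class Company(address: String,
                   city: String,
                   companyname: String,
                   email: String,
                   mobile: String,
                   id: Int,
                   registerdate: java.sql.Date,
                   registrationexp: Option[java.sql.Date])

val result: IO[Option[Company]] =
  transactor.use { xa =>
    Query[Int, Company](queryString).option(id).transact(xa)
  }

val logged: IO[Unit] = result.map {
  case Some(company) => log.debug(s"found company: $company")
  case None          => log.debug(s"no company found for id $id")
}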
I have the data below, with return type Array[Array[AnyRef]].
The second row contains the data types of the underlying dataset.
I want to read the data into a Spark DataFrame.
However, I am getting the exception below while doing that.
Exception in thread "main" scala.MatchError: StringType (of class java.lang.String)
Any suggestions to overcome this?
import java.util
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val data = getData
val df = getTable(data, spark)
df.printSchema()
df.show()
def getData: Array[Array[AnyRef]] =
Array[Array[AnyRef]](Array("MS_FUND_ID", "MS_INVESTMENT_TYPE", "CURRENCY", "MS_CAT_ID", "date", "MS_NETFLOWS_FUND", "MS_AUM_FUND"),
Array("StringType", "StringType", "StringType", "StringType", "StringType", "DoubleType", "DoubleType"),
Array("F00000MLKR", "OE", "USD", "C1", "2017-10-31", "10", "15"),
Array("F00000MLKS", "OE", "USD", "C1", "2017-10-31", "-10", "10"),
Array("F00000MLKS", "OX", "USD", "C1", "2017-10-31", "-10", "10"),
Array("F00000MLKT", "INS", "USD", "C1", "2017-10-31", "30", "50"))
def getTable(table: Array[Array[AnyRef]], spark: SparkSession): DataFrame = {
  val fields = new util.ArrayList[StructField]
  val fieldNames = table(0)
  var fieldTypes = table(1)
  // this pattern match is where the MatchError is thrown: row 1 contains the
  // Strings "StringType"/"DoubleType", not DataType instances
  fieldTypes = fieldTypes.map {
    case x: DoubleType => x.asInstanceOf[DoubleType]
    case x: StringType => x.asInstanceOf[StringType]
  }
  // build the schema from the header row and the (supposed) type row
  for (f <- table(0).indices) {
    val fn = fieldNames(f).toString
    val ft = fieldTypes(f).asInstanceOf[DataType]
    fields.add(StructField(fn, ft, nullable = true, Metadata.empty))
  }
  val schema1 = new StructType(fields.toArray(new Array[StructField](fields.size)))
  println(schema1)
  // wrap the remaining rows as Rows carrying that schema
  val rows = new util.ArrayList[Row]
  for (r <- 2 until table.length) {
    rows.add(new GenericRowWithSchema(table(r).asInstanceOf[Array[Any]], schema1))
  }
  for (i <- 0 until rows.size) {
    println("Rows : " + rows.get(i))
  }
  spark.createDataFrame(rows, schema1)
}
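For what it's worth, the MatchError indicates that table(1) holds the type names as plain Strings ("StringType", "DoubleType") rather than DataType instances, so the pattern match on DoubleType/StringType never succeeds. A possible fix, sketched under that assumption with a hypothetical toDataType helper:
import org.apache.spark.sql.types.{DataType, DoubleType, StringType}

// hypothetical helper: translate the type-name Strings in row 1 into Spark DataType instances
def toDataType(name: AnyRef): DataType = name.toString match {
  case "StringType" => StringType
  case "DoubleType" => DoubleType
  case other        => throw new IllegalArgumentException(s"unsupported type name: $other")
}

val fieldTypes: Array[DataType] = table(1).map(toDataType)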
Doobie
Without much experience in either Scala or Doobie, I am trying to select data from a DB2 database. The following query works fine and prints the expected 5 employees.
import doobie.imports._, scalaz.effect.IO
object ScalaDoobieSelect extends App {
val urlPrefix = "jdbc:db2:"
val schema = "SCHEMA"
val obdcName = "ODBC"
val url = urlPrefix + obdcName + ":" +
"currentSchema=" + schema + ";" +
"currentFunctionPath=" + schema + ";"
val driver = "com.ibm.db2.jcc.DB2Driver"
val username = "username"
val password = "password"
implicit val han = LogHandler.jdkLogHandler // (ii)
val xa = DriverManagerTransactor[IO](
driver, url, username, password
)
case class User(id: String, name: String)
def find(): ConnectionIO[List[User]] =
sql"SELECT ID, NAME FROM EMPLOYEE FETCH FIRST 10 ROWS ONLY"
.query[User]
.process
.take(5) // (i)
.list
find()
.transact(xa)
.unsafePerformIO
.foreach(e => println("ID = %s, NAME = %s".format(e.id, e.name)))
}
Issue
When I want to read all selected rows and remove take(5), so that I have .process.list instead of .process.take(5).list, I get the following error. (i)
com.ibm.db2.jcc.am.SqlException: [jcc][t4][10120][10898][3.64.133] Invalid operation: result set is closed. ERRORCODE=-4470, SQLSTATE=null
I am wondering what take(5) changes so that it does not return an error. To get more information about the invalid operation, I tried to enable logging. (ii) Unfortunately, logging is not supported for streaming. How can I get more information about which operation causes this error?
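Since, as noted above, logging is not applied to streaming queries, one way to at least see the logged statement is to run a non-streaming variant. A sketch under the assumption that Query0 in this doobie version exposes .list directly (accumulating all rows without going through the stream):
def findAll(): ConnectionIO[List[User]] =
  sql"SELECT ID, NAME FROM EMPLOYEE FETCH FIRST 10 ROWS ONLY"
    .query[User]
    .list // accumulates every row without the streaming interpreter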
Plain JDBC
The plain JDBC query below, which is in my opinion equivalent, works as expected and returns all 10 rows.
import java.sql.{Connection,DriverManager}
object ScalaJdbcConnectSelect extends App {
val urlPrefix = "jdbc:db2:"
val schema = "SCHEMA"
val obdcName = "ODBC"
val url = urlPrefix + obdcName + ":" +
"currentSchema=" + schema + ";" +
"currentFunctionPath=" + schema + ";"
val driver = "com.ibm.db2.jcc.DB2Driver"
val username = "username"
val password = "password"
var connection:Connection = _
try {
Class.forName(driver)
connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement
val rs = statement.executeQuery(
"SELECT ID, NAME FROM EMPLOYEE FETCH FIRST 10 ROWS ONLY"
)
while (rs.next) {
val id = rs.getString("ID")
val name = rs.getString("NAME")
println("ID = %s, NAME = %s".format(id,name))
}
} catch {
case e: Exception => e.printStackTrace
}
connection.close
}
Environment
As can be seen in the error message, I am using db2jcc.jar version 3.64.133. DB2 itself is version 11.
I am new to Scala and Spark and am trying to build on some samples I found. Essentially, I am trying to call a function from within a DataFrame to get the state from a zip code using the Google API.
I have the code working separately, but not together.
Here is the piece of code that is not working:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
...
code:
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols:_*).show()
}
def getstate = udf {(zipcode:String) => {
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val shortnames = for {
JObject(address_components) <- address
JField("short_name", short_name) <- address_components
} yield short_name
val state = shortnames(3)
//return state.toString()
val stater = state.toString()
}
}
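For reference, the block above ends with a val definition, so the lambda's result type is Unit, which is what "Schema for type Unit is not supported" refers to. A minimal sketch of a String-returning variant of the same function (same assumed json4s parsing, otherwise unchanged):
def getstate = udf { (zipcode: String) =>
  val url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + zipcode
  val result = scala.io.Source.fromURL(url).mkString
  val address = parse(result)
  val shortnames = for {
    JObject(address_components) <- address
    JField("short_name", short_name) <- address_components
  } yield short_name
  shortnames(3).toString // make the String the last expression so the UDF's return type is String
}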
Thanks for the responses. I think I figured it out; here is the code that works. One thing to note: the Google API has restrictions, so some valid zip codes don't return state info. That is not an issue for me, though.
private def loaduserdata(spark: SparkSession): Unit = {
import spark.implicits._
// Create an RDD of User objects from a text file, convert it to a Dataframe
val userDF = spark.sparkContext
.textFile("examples/src/main/resources/users.csv")
.map(_.split("::"))
.map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
.toDF()
// Register the DataFrame as a temporary view
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc") // zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
val geoDF = zipcodesDF.select(mappedCols:_*)//.show()
geoDF.createOrReplaceTempView("Geo")
}
val getstate = udf {(zipcode: String) =>
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val statenm = for {
JObject(statename) <- address
JField("types", JArray(types)) <- statename
JField("short_name", JString(short_name)) <- statename
if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
// if types.head.equals("JString(administrative_area_level_1)")
} yield short_name
if (statenm.isEmpty) "N/A" else statenm.head // return the value so the UDF's result type is String
}
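As a side note, the column-mapping loop above can usually be collapsed into a single withColumn call; a minimal sketch using the same getstate UDF (an alternative, not what the original code does):
val geoDF = zipcodesDF.withColumn("state", getstate(zipcodesDF("zipcode")))
geoDF.createOrReplaceTempView("Geo")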
I process a set of files using Spark. The results, after conversion to a Spark DataFrame, should be saved to a database. The following code works when Spark runs in "local[*]" mode. But when I run it on a cluster in YARN mode, processing ends without errors (apart from some errors at the very beginning), yet the database remains empty.
import java.sql.{Connection, DriverManager, Timestamp, SQLException}
import java.util.Properties
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import java.util.Calendar
import scala.collection.mutable.ListBuffer
import com.qbeats.cortex.library.{PartialDateTime, TimeExtractor}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
object CommoncrawlExtractor extends App {
var driver: String = null
var connectionString: String = null
var helper: Helper = null
var sc: SparkContext = null
var pte = sc.broadcast(new TimeExtractor)
def uncertainty = 60 * 60 * 12
case class SectionData(warcinfoID: String, recordID: String, sectionName: Int,
timestamp: Timestamp, uncertainty: Int, wordsets: Array[Array[String]])
case class Word(word: String)
case class Wordset(section_id: Int, wordset: Seq[Int])
def dropFirst(iterator: Iterator[String]): Iterator[String] = {
if (iterator.hasNext) {
iterator.next
}
iterator
}
def extractSentences(entity: String) = {
val result = ListBuffer[(String, String, Int, Timestamp, Int, Array[Array[String]])]()
val warcinfoIDPattern = """WARC-Warcinfo-ID: <urn:uuid:(.+)>""".r
val warcinfoID = warcinfoIDPattern.findFirstMatchIn(entity).map(_ group 1).getOrElse("")
val recordIDPattern = """WARC-Record-ID: <urn:uuid:(.+)>""".r
val recordID = recordIDPattern.findFirstMatchIn(entity).map(_ group 1).getOrElse("")
val requestTimePattern = """WARC-Date: (.+)""".r
val requestTimeString = requestTimePattern.findFirstMatchIn(entity).map(_ group 1).getOrElse("")
val requestTimeFormat = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
val requestTime = requestTimeFormat.parse(requestTimeString)
var cal: Calendar = Calendar.getInstance()
cal.setTime(requestTime)
val referenceDate1 = new PartialDateTime(cal, null)
val contentPattern = """(?s)\r\nHTTP/1\.. 200(.+?)(\r\n){2,}(.+)WARC/1.0\r\nWARC-Type: metadata""".r
val contentString = contentPattern.findFirstMatchIn(entity).map(_ group 3).getOrElse("")
try {
val de = pte.value.extractTimes(contentString)
if (de.getEntries != null) {
for (entry <- de.getEntries.asScala) {
val pdt = entry.resolve(12 * 3600, referenceDate1)
if (pdt != null) {
val sectionWordsets = entry.getSentences.asScala.map(x => x.getTokens.asScala.toArray[String]).toArray
val sectionData = (
warcinfoID, recordID, entry.getId,
new Timestamp(pdt.secondsSinceEpoch * 1000), pdt.uncertaintyInterval.toInt, sectionWordsets
)
result += sectionData
}
}
}
} catch {
case e: Exception => println("\n" + "-" * 100 + "\n" + entity)
}
result
}
def initDB() = {
driver = "org.postgresql.Driver"
connectionString = "jdbc:postgresql://lv-ws10.lviv:5432/commoncrawl?user=postgres&password=postgres"
Class.forName(driver)
}
def prepareDB() = {
var conn: Connection = null
try {
conn = DriverManager.getConnection(connectionString)
val statement = conn.createStatement()
val tableResultSet = statement.executeQuery(
"""
|SELECT table_name
| FROM information_schema.tables
| WHERE table_schema='public'
| AND table_type='BASE TABLE';
""".stripMargin)
val tablesToDelete = ListBuffer[String]()
while (tableResultSet.next()) {
tableResultSet.getString("table_name") match {
case "warcinfo" => tablesToDelete.append("warcinfo")
case "record" => tablesToDelete.append("record")
case "section" => tablesToDelete.append("section")
case "word" => tablesToDelete.append("word")
case "wordset" => tablesToDelete.append("wordset")
case _ =>
}
}
for (tableName <- tablesToDelete) statement.executeUpdate("DROP TABLE " + tableName + ";")
val storedProcedureResultSet = statement.executeQuery(
"""
|SELECT proname, prosrc
|FROM pg_catalog.pg_namespace n
|JOIN pg_catalog.pg_proc p
|ON pronamespace = n.oid
|WHERE nspname = 'public';
""".stripMargin)
val storedProcedureDeletions = ListBuffer[String]()
while (storedProcedureResultSet.next()) {
storedProcedureResultSet.getString("proname") match {
case "update_word_ids" =>
storedProcedureDeletions.append("DROP FUNCTION update_word_ids();")
case _ =>
}
}
statement.executeUpdate("DROP TRIGGER IF EXISTS update_word_ids_trigger ON wordset_occurrence;")
for (storedProcedureDeletion <- storedProcedureDeletions) statement.executeUpdate(storedProcedureDeletion)
statement.executeUpdate(
"""
|CREATE TABLE warcinfo (
| warcinfo_id serial PRIMARY KEY,
| batch_name varchar NOT NULL,
| warcinfo_uuid char(36) NOT NULL
|);
""".stripMargin)
statement.executeUpdate(
"""
|CREATE TABLE record (
| record_id serial PRIMARY KEY,
| record_uuid char(36) NOT NULL
|);
""".stripMargin)
statement.executeUpdate(
"""
|CREATE TABLE section (
| section_id serial PRIMARY KEY,
| record_id integer NOT NULL,
| section_name integer NOT NULL,
| timestamp timestamp NOT NULL,
| uncertainty integer NOT NULL
|);
""".stripMargin)
statement.executeUpdate(
"""
|CREATE TABLE word (
| word_id serial PRIMARY KEY,
| word varchar NOT NULL
|);
""".stripMargin)
statement.executeUpdate(
"""
|CREATE TABLE wordset (
| section_id integer NOT NULL,
| wordset integer ARRAY
|);
""".stripMargin)
} catch {
case e: SQLException => println("exception caught: " + e)
} finally {
if (conn != null) conn.close()
}
}
def processFile(fileNames: Array[String], accessKeyId: String = "", secretAccessKey: String = ""): Unit = {
val delimiter = "WARC/1.0\r\nWARC-Type: request\r\n"
pte = sc.broadcast(new TimeExtractor)
val spark = SparkSession
.builder()
.appName("CommoncrawlExtractor")
.getOrCreate()
import spark.implicits._
val connString = "jdbc:postgresql://lv-ws10.lviv:5432/commoncrawl"
val prop = new Properties()
prop.put("user", "postgres")
prop.put("password", "postgres")
val entities = sc.
textFile(fileNames.mkString(",")).
mapPartitions(dropFirst).
map(delimiter + _).
flatMap(extractSentences).
map(x => SectionData(x._1, x._2, x._3, x._4, x._5, x._6)).toDF().
cache()
val warcinfo = entities.select("warcinfoID").distinct().
withColumnRenamed("warcinfoID", "warcinfo_uuid").
withColumn("batch_name", lit("June 2016, batch 1"))
val warcinfoWriter = warcinfo.write.mode("append")
println("Saving warcinfo.")
println(Calendar.getInstance().getTime)
warcinfoWriter.jdbc(connString, "warcinfo", prop)
println(Calendar.getInstance().getTime)
val record = entities.select("recordID").distinct().
withColumnRenamed("recordID", "record_uuid")
val recordWriter = record.write.mode("append")
println("Saving records.")
println(Calendar.getInstance().getTime)
recordWriter.jdbc(connString, "record", prop)
println(Calendar.getInstance().getTime)
val recordFull = spark.read.
format("jdbc").
options(Map("url" -> connString, "dbtable" -> "public.record", "user" -> "postgres", "password" -> "postgres")).
load().cache()
val section = entities.
join(recordFull, entities.col("recordID").equalTo(recordFull("record_uuid"))).
select("record_id", "sectionName", "timestamp", "uncertainty").distinct().
withColumnRenamed("sectionName", "section_name")
val sectionWriter = section.write.mode("append")
println("Saving sections.")
println(Calendar.getInstance().getTime)
sectionWriter.jdbc(connString, "section", prop)
println(Calendar.getInstance().getTime)
val sectionFull = spark.read.
format("jdbc").
options(Map("url" -> connString, "dbtable" -> "public.section", "user" -> "postgres", "password" -> "postgres")).
load()
val word = entities.
select("wordsets").
flatMap(r => r.getAs[Seq[Seq[String]]]("wordsets").flatten).
distinct().
map(Word(_))
val wordWriter = word.write.mode("append")
wordWriter.jdbc(connString, "word", prop)
val wordFull = spark.read.
format("jdbc").
options(Map("url" -> connString, "dbtable" -> "public.word", "user" -> "postgres", "password" -> "postgres")).
load().
map(row => (row.getAs[String]("word"), row.getAs[Int]("word_id"))).
collect().
toMap
val wordsetTemp = entities.
join(recordFull, entities.col("recordID").equalTo(recordFull("record_uuid"))).
withColumnRenamed("sectionName", "section_name")
val wordset = wordsetTemp.
join(sectionFull, Seq("record_id", "section_name")).
select("section_id", "wordsets").
flatMap(r => r.getAs[Seq[Seq[String]]]("wordsets").map(x => Wordset(r.getAs[Int]("section_id"), x.map(wordFull))))
val wordsetWriter = wordset.write.mode("append")
println("Saving wordsets.")
println(Calendar.getInstance().getTime)
wordsetWriter.jdbc(connString, "wordset", prop)
println(Calendar.getInstance().getTime)
// entities.saveAsTextFile(helper.outputDirectory + "xyz")
sc.stop
}
override def main(args: Array[String]): Unit = {
if (args.length >= 2) {
initDB()
prepareDB()
helper = new Helper
val files =
if (args(0).startsWith("hdfs://")) helper.getHDFSFiles(args(0)).slice(0, args(3).toInt)
else helper.getLocalFiles(args(0))
val appName = "CommoncrawlExtractor"
val conf = new SparkConf().setAppName(appName)
if (args(0).startsWith("hdfs://")) {
conf.set("spark.executor.instances", args(1))
conf.set("spark.executor.cores", args(2))
} else conf.setMaster(args(1))
sc = new SparkContext(conf)
val delimiter = "WARC/1.0\r\nWARC-Type: request"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimiter)
processFile(files)
}
}
}
I copied postgresql-9.4.1209.jre7.jar to /home/user/Programs/libs on every machine in the cluster and use the following command (run from Spark's directory):
./bin/spark-submit --master yarn --deploy-mode client --driver-class-path /home/user/Programs/libs/postgresql-9.4.1209.jre7.jar --jars /home/user/Programs/libs/postgresql-9.4.1209.jre7.jar --conf "spark.driver.extraClassPath=/home/user/Programs/libs/postgresql-9.4.1209.jre7.jar" --conf "spark.executor.extraClassPath=/home/user/Programs/libs/postgresql-9.4.1209.jre7.jar" spark-cortex-fat.jar hdfs://LV-WS10.lviv:9000/commoncrawl 2 4 8
Please suggest how I can make it work on the cluster.
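One difference that sometimes shows up between local[*] and YARN runs is JDBC driver resolution on the executors. As a hedged suggestion, the driver class can be named explicitly in the connection properties passed to the JDBC writer (the "driver" option belongs to Spark's JDBC data source; everything else below mirrors the code above):
val prop = new Properties()
prop.put("user", "postgres")
prop.put("password", "postgres")
prop.put("driver", "org.postgresql.Driver") // make sure the executors load the Postgres driver

warcinfo.write.mode("append").jdbc(connString, "warcinfo", prop)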
ADDED LATER:
I discovered that these lines
val warcinfo = entities.select("warcinfoID").
withColumnRenamed("warcinfoID", "warcinfo_uuid").
withColumn("batch_name", lit("June 2016, batch 1"))
val warcinfoWriter = warcinfo.write.mode("append")
println("Saving warcinfo.")
println(Calendar.getInstance().getTime)
warcinfoWriter.jdbc(connString, "warcinfo", prop)
println(Calendar.getInstance().getTime)
lead to this exception:
16/09/01 17:31:51 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 1.0 (TID 5, LV-WS09): org.apache.spark.storage.BlockFetchException: Failed to fetch block after 1 fetch failures. Most recent failure cause:
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565)
at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1203)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:554)
... 31 more
Caused by: java.io.IOException: Failed to connect to ubuntu-cluster-4/192.168.100.139:36378
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.net.ConnectException: Connection refused: ubuntu-cluster-4/192.168.100.139:36378
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
However, some records are stored in the database.
What would you suggest?
ADDED LATER:
I looked at the YARN logs on the node that stopped responding, but they weren't helpful: logs.
I have a Scala class that accesses a database through JDBC:
class DataAccess {
def select = {
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/db"
val username = "root"
val password = "xxxx"
var connection:Connection = null
try {
// make the connection
Class.forName(driver)
connection = DriverManager.getConnection(url, username, password)
// create the statement, and run the select query
val statement = connection.createStatement()
val resultSet = statement.executeQuery("SELECT name, descrip FROM table1")
while ( resultSet.next() ) {
val name = resultSet.getString(1)
val descrip = resultSet.getString(2)
println("name, descrip = " + name + ", " + descrip)
}
} catch {
case e => e.printStackTrace
}
connection.close()
}
}
I access this class in my Play application, like so:
def testSql = Action {
val da = new DataAccess
da.select()
Ok("success")
}
The method testSql may be invoked by several users. Question is: could there be a race condition in the while ( resultSet.next() ) loop (or in any other part of the class)?
Note: I need to use JDBC as the SQL statement will be dynamic.
No, there cannot.
Each thread works with a distinct local instance of ResultSet, so there cannot be concurrent access to the same object.
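A small related point: since each call builds its own Connection, closing it in a finally block (rather than after the catch) avoids leaking the connection, and avoids a NullPointerException when getConnection itself fails. A minimal sketch of the same select method's outer structure:
var connection: Connection = null
try {
  Class.forName(driver)
  connection = DriverManager.getConnection(url, username, password)
  // ... create the statement, run the query, iterate the ResultSet as above ...
} catch {
  case e: Exception => e.printStackTrace()
} finally {
  if (connection != null) connection.close() // runs even when the query throws
}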