Scala mapPartitions: collect on partition does nothing - scala

I'm trying to move data from an RDD into a Postgres table, using:
def copyIn(reader: java.io.Reader, columnStmt: String = "") = {
  // connect to the Postgres database on localhost
  val driver = "org.postgresql.Driver"
  var connection: Connection = null
  Class.forName(driver)
  connection = DriverManager.getConnection()
  try {
    connection.unwrap(classOf[PGConnection]).getCopyAPI.copyIn(s"COPY my_table ($columnStmt) FROM STDIN WITH CSV", reader)
  } catch {
    case se: SQLException => println(se.getMessage)
    case t: Throwable => println(t.getMessage)
  } finally {
    connection.close()
  }
}
myRdd.mapPartitions(iter => {
  val sb = new StringBuilder()
  var n_iter = iter.map(row => {
    val mapRequest = Utils.getMyRowMap(myMap, row)
    sb.append(mapRequest.values.mkString(", ")).append("\n")
  })
  copyIn(new StringReader(sb.toString), geoSelectMap.keySet.mkString(", "))
  sb.clear
  n_iter
}).collect
The script keeps reaching the copyIn function with no data to insert. I think it may be because iter.map only defines a mapping over the partition and never actually iterates it? I tried collecting the whole myRdd object and still got no data into the copyIn function.
How can I iterate over an RDD so that the StringBuilder actually gets appended, and why doesn't the snippet above work?
Anybody have a clue?

iter is an Iterator. So iter.map creates a new Iterator, but since you never actually iterate it, it does nothing. You probably want foreach instead. Except then iter will be empty by the time you return it, and the result of collect will be an empty RDD.
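To see the laziness in isolation, here is a plain-Scala sketch (nothing Spark-specific; the names are just for illustration):
val it = Iterator(1, 2, 3)
val mapped = it.map { x => println(s"mapping $x"); x } // builds a new Iterator, prints nothing yet
mapped.toList // only now is the iterator consumed and "mapping 1", "mapping 2", "mapping 3" printed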
The actual method you want is foreachPartition:
myRdd.foreachPartition(iter => {
  val sb = new StringBuilder()
  iter.foreach(row => {
    val mapRequest = Utils.getMyRowMap(myMap, row)
    sb.append(mapRequest.values.mkString(", ")).append("\n")
  })
  copyIn(new StringReader(sb.toString), geoSelectMap.keySet.mkString(", "))
  sb.clear
})
and then myRdd.collect if you want to collect it as well. (Persist myRdd if you want to use it twice without recalculating it.)
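A minimal sketch of that combination, assuming the same copyIn and helper names as in the question:
import org.apache.spark.storage.StorageLevel

myRdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the partitions so the two actions below don't recompute them
myRdd.foreachPartition(iter => {
  val sb = new StringBuilder()
  iter.foreach(row => sb.append(Utils.getMyRowMap(myMap, row).values.mkString(", ")).append("\n"))
  copyIn(new StringReader(sb.toString), geoSelectMap.keySet.mkString(", "))
})
val collected = myRdd.collect // second action reuses the persisted partitions
myRdd.unpersist()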

Related

Create Spark DataFrame from list row keys

I have a list of HBase row keys in the form of an Array[Row] and want to create a Spark DataFrame out of the rows that are fetched from HBase using these row keys.
I am thinking of something like:
def getDataFrameFromList(spark: SparkSession, rList: Array[Row]): DataFrame = {
  val conf = HBaseConfiguration.create()
  val mlRows: ListBuffer[RDD[String]] = new ListBuffer[RDD[String]] // scala.collection.mutable.ListBuffer
  conf.set("hbase.zookeeper.quorum", "dev.server")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set("zookeeper.znode.parent", "/hbase-unsecure")
  conf.set(TableInputFormat.INPUT_TABLE, "hbase_tbl1")
  rList.foreach(r => {
    val rStr = r.toString()
    conf.set(TableInputFormat.SCAN_ROW_START, rStr)
    conf.set(TableInputFormat.SCAN_ROW_STOP, rStr + "_")
    // read one row
    val recsRdd = readHBaseRdd(spark, conf)
    mlRows.append(recsRdd)
  })
  // This works, but it is only one row
  //val resourcesDf = spark.read.json(recsRdd)
  var resourcesDf = <Code here to convert List[RDD[String]] to DataFrame>
  //resourcesDf
  spark.emptyDataFrame
}
I can do recsRdd.collect() in the for loop, convert it to a string, and append that JSON to an ArrayList[String], but I am not sure whether it is efficient to call collect() in a for loop like this.
readHBaseRdd uses newAPIHadoopRDD to get data from HBase:
def readHBaseRdd(spark: SparkSession, conf: Configuration) = {
  val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])
  hBaseRDD.map {
    case (_: ImmutableBytesWritable, value: Result) =>
      Bytes.toString(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("jsonCol")))
  }
}
}
Use spark.sparkContext.union(Seq(mainRdd, recsRdd)) instead of a list of RDDs (mlRows).
And why read only one row from HBase per scan? Try to use the largest interval possible.
Always avoid calling collect(); do it only for debugging/tests.
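A rough sketch of what that could look like, reusing readHBaseRdd and the HBase settings from the question (the per-key Configuration copy and the final spark.read.json call are assumptions, not tested code):
import org.apache.hadoop.conf.Configuration
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

def getDataFrameFromList(spark: SparkSession, rList: Array[Row]): DataFrame = {
  val baseConf = HBaseConfiguration.create()
  // ... same quorum/port/znode/INPUT_TABLE settings as in the question ...
  val perKeyRdds: Seq[RDD[String]] = rList.toSeq.map { r =>
    val keyConf = new Configuration(baseConf) // copy so each RDD keeps its own scan range
    keyConf.set(TableInputFormat.SCAN_ROW_START, r.toString)
    keyConf.set(TableInputFormat.SCAN_ROW_STOP, r.toString + "_")
    readHBaseRdd(spark, keyConf)
  }
  val allRows = spark.sparkContext.union(perKeyRdds) // one RDD, no collect() needed
  spark.read.json(allRows) // allRows holds JSON strings; this overload may be deprecated in newer Spark
}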

Unable to convert a ResultSet into a List in Cassandra datastax driver

In the following Cassandra code, I am querying a database and expect multiple values. The function takes an id and should return Option[List[M]], where M is my model. I have a function rowToModel(row: Row): MyModel which can take a row from the ResultSet and convert it into an instance of my model.
My issue is that the List I am returning is always empty even though the ResultSet has data. I checked this by adding debug prints in rowToModel.
def getRowsByPartitionKeyId(id: I): Option[List[M]] = {
  val whereClause = whereConditions(tablename, id)
  val resultSet = session.execute(whereClause) // resultSet is an iterator
  val it = resultSet.iterator()
  val resultList: List[M] = List()
  if (it.hasNext) {
    while (it.hasNext) {
      val item: M = rowToModel(it.next())
      resultList.:+(item)
    }
    Some(resultList) // THIS IS ALWAYS List()
  }
  else
    None
}
I suspect that because resultList is a val, its value is not getting changed in the while loop. I probably should use yield or something else, but I don't know what or how.
Solved it by converting the Java Iterator to Scala and then using toList:
import collection.JavaConverters._

val it = resultSet.iterator()
if (it.hasNext) {
  val resultSetAsList: List[Row] = asScalaIterator(it).toList
  val resultSetAsModelList = resultSetAsList.map((row: Row) => rowToModel(row))
  Some(resultSetAsModelList)
} else
  None
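For reference, with the DataStax 3.x driver the whole lookup can usually be collapsed further, since ResultSet.all() returns a java.util.List[Row] (a sketch reusing the helper names from the question):
import scala.collection.JavaConverters._

def getRowsByPartitionKeyId(id: I): Option[List[M]] = {
  val rows = session.execute(whereConditions(tablename, id)).all().asScala.toList
  if (rows.isEmpty) None else Some(rows.map(rowToModel))
}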

not able to store result in HDFS when the code runs for a second iteration

Well, I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks one column for missing values, stores them in outputrdd, and runs loops to calculate each missing value. The code works well when there is only one missing value in the file. Since HDFS does not allow writing to the same location again, it fails if there is more than one missing value. Can you please assist in writing finalrdd to a particular location once the missing values for all occurrences have been calculated?
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }
  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  def cleaningData(file: String) = {
    // headers has column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      // Checks for missing values in dataframe and stores missing values in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()
        for (i <- strings) {
          if (Coddition check) {
            // Calculates missing value and stores in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
It is not clear how (Coddition check) works in your example.
In any case, .saveAsTextFile("/output") should be called only once.
So I would rewrite your example like this:
val finalrdd = outputdatetimerdd
  .map(row => row.mkString)
  .filter(str => Coddition check str) // don't know how this Coddition works
  .map(x => x.mkString("\t"))

// this part is called only once, not inside a loop; keeping finalrdd as an RDD
// (no collect()) is what makes saveAsTextFile available
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")
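If the outer loop over input files still has to produce one result per file, a common workaround (just a sketch; the path layout is an assumption) is to write each result to its own directory instead of reusing /output:
import org.apache.hadoop.fs.Path

// Derive a distinct output directory per input file so HDFS never sees a
// second write to the same location.
val outDir = s"/output/${new Path(file).getName}"
finalrdd.saveAsTextFile(outDir)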

Scala Tail Recursion java.lang.StackOverflowError

I am iteratively querying a mysql table called txqueue that is growing continuously.
Each successive query only considers rows that were inserted into the txqueue table after the query executed in the previous iteration.
To achieve this, each successive query selects rows from the table where the primary key (seqno field in my example below) exceeds the maximum seqno observed in the previous query.
Any newly inserted rows identified in this way are written into a csv file.
The intention is for this process to run indefinitely.
The tail-recursive function below works OK, but after a while it runs into a java.lang.StackOverflowError. The results of each iterative query contain two to three rows, and results are returned every second or so.
Any ideas on how to avoid the java.lang.StackOverflowError?
Is this actually something that can/should be achieved with streaming?
Many thanks for any suggestions.
Here's the code that works for a while:
object TXQImport {
  val driver = "com.mysql.jdbc.Driver"
  val url = "jdbc:mysql://mysqlserveraddress/mysqldb"
  val username = "username"
  val password = "password"
  var connection: Connection = null

  def txImportLoop(startID: BigDecimal): Unit = {
    try {
      Class.forName(driver)
      connection = DriverManager.getConnection(url, username, password)
      val statement = connection.createStatement()
      val newMaxID = statement.executeQuery("SELECT max(seqno) as maxid from txqueue")
      val maxid = new Iterator[BigDecimal] {
        def hasNext = newMaxID.next()
        def next() = newMaxID.getBigDecimal(1)
      }.toStream.max
      val selectStatement = statement.executeQuery("SELECT seqno,someotherfield " +
        " from txqueue where seqno >= " + startID + " and seqno < " + maxid)
      if (startID != maxid) {
        val ts = System.currentTimeMillis
        val file = new java.io.File("F:\\txqueue " + ts + ".txt")
        val bw = new BufferedWriter(new FileWriter(file))
        // Iterate over the ResultSet
        while (selectStatement.next()) {
          bw.write(selectStatement.getString(1) + "," + selectStatement.getString(2))
          bw.newLine()
        }
        bw.close()
      }
      connection.close()
      txImportLoop(maxid)
    }
    catch {
      case e => e.printStackTrace
    }
  }

  def main(args: Array[String]) {
    txImportLoop(0)
  }
}
Your function is not tail-recursive: the recursive call sits inside the try block, so it is not in tail position, and the catch at the end means each call's frame has to stay on the stack.
That's why you end up with a stack overflow.
You should always annotate functions you intend to be tail-recursive with @scala.annotation.tailrec; it will fail compilation when tail recursion is impossible, so you won't be surprised by it at run time.
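A minimal sketch of one way to restructure the loop so the compiler accepts @tailrec: keep the JDBC work inside try/catch, but make the recursive call the last expression of the method, outside any try block (doOneImport stands in for the query-and-export logic from the question):
import scala.annotation.tailrec

// Inside object TXQImport. Placeholder for the body of the original function:
// open the connection, find max(seqno), export rows in [startID, maxid) to a
// file, close the connection, and return maxid as the next starting point.
def doOneImport(startID: BigDecimal): BigDecimal = startID

@tailrec
def txImportLoop(startID: BigDecimal): Unit = {
  val nextID =
    try doOneImport(startID)
    catch { case e: Exception => e.printStackTrace(); startID } // retry from the same point
  txImportLoop(nextID) // last expression, not wrapped in try/catch, so it compiles into a loop
}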

using spark to read specific columns data from hbase

I have a table in HBase named "orders". It has column family 'o' and columns {id, fname, lname, email}, with id as the row key. I am trying to get the values of fname and email only from HBase using Spark. Currently what I am doing is given below:
override def put(params: scala.collection.Map[String, Any]): Boolean = {
  var sparkConfig = new SparkConf().setAppName("Connector")
  var sc: SparkContext = new SparkContext(sparkConfig)
  var hbaseConfig = HBaseConfiguration.create()
  hbaseConfig.set("hbase.zookeeper.quorum", ZookeeperQourum)
  hbaseConfig.set("hbase.zookeeper.property.clientPort", zookeeperPort)
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, schemdto.tableName)
  hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname,o:email")
  var hBaseRDD = sc.newAPIHadoopRDD(hbaseConfig, classOf[TableInputFormat],
    classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
    classOf[org.apache.hadoop.hbase.client.Result])
  try {
    hBaseRDD.map(tuple => tuple._2).map(result => result.raw())
      .map(f => KeyValueToString(f)).saveAsTextFile(sink)
    true
  } catch {
    case _: Exception => false
  }
}

def KeyValueToString(keyValues: Array[KeyValue]): String = {
  var it = keyValues.iterator
  var res = new StringBuilder
  while (it.hasNext) {
    res.append(Bytes.toString(it.next.getValue()) + ",")
  }
  res.substring(0, res.length - 1)
}
But nothing is returned, and if I try to fetch only one column, such as
hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname")
then it returns all the values of the fname column.
So my question is: how do I get multiple columns from HBase using Spark?
Any help will be appreciated.
The list of columns to scan needs to be space-delimited, according to the documentation:
hbaseConfig.set(TableInputFormat.SCAN_COLUMNS, "o:fname o:email");
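For completeness, the same column restriction can also be expressed with a Scan object and serialized into the configuration (a sketch; it assumes the standard TableMapReduceUtil helper is on the classpath):
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

val scan = new Scan()
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("fname"))
scan.addColumn(Bytes.toBytes("o"), Bytes.toBytes("email"))
// TableInputFormat reads the serialized scan from the TableInputFormat.SCAN property
hbaseConfig.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))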