I am trying to do some transformations on a data set. After reading the data set when performing df.show() operations, I am getting the rows listed in spark shell. But when I try to do df.count or any aggregate functions, I am getting
java.lang.ArrayIndexOutOfBoundsException: 1.
val itpostsrow = sc.textFile("/home/jayk/Downloads/spark-data")
import scala.util.control.Exception.catching
import java.sql.Timestamp
implicit class StringImprovements(val s:String) {
def toIntSafe = catching(classOf[NumberFormatException])
opt s.toInt
def toLongsafe = catching(classOf[NumberFormatException])
opt s.toLong
def toTimeStampsafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
}
case class Post(commentcount:Option[Int],lastactivitydate:Option[java.sql.Timestamp],ownerUserId:Option[Long],body:String,score:Option[Int],creattiondate:Option[java.sql.Timestamp],viewcount:Option[Int],title:String,tags:String,answerCount:Option[Int],acceptedanswerid:Option[Long],posttypeid:Option[Long],id:Long)
def stringToPost(row:String):Post = {
val r = row.split("~")
Post(r(0).toIntSafe,
r(1).toTimeStampsafe,
r(2).toLongsafe,
r(3),
r(4).toIntSafe,
r(5).toTimeStampsafe,
r(6).toIntSafe,
r(7),
r(8),
r(9).toIntSafe,
r(10).toLongsafe,
r(11).toLongsafe,
r(12).toLong)
}
val itpostsDFcase1 = itpostsrow.map{x=>stringToPost(x)}
val itpostsDF = itpostsDFcase1.toDF()
Your function stringToPost() might cause a Java error ArrayIndexOutOfBoundsException if the text file contains some empty row or if the number of fields after the split is not 13.
Due to Spark's lazy evaluation one notices such errors only when performing an action like count.
Related
When I execute a function in a mapPartition of dataset (executeStrategy()) it returns a result which I could check by debug but when I use dataset.show () it shows me an empty table and I do not know why this happens
This is for a data mining job at my school. I'm using windows 10, scala 2.11.12 and spark-2.2.0, which work without problems.
case class MyState(code: util.ArrayList[Object], evaluation: util.ArrayList[java.lang.Double])
private def executeStrategy(iter: Iterator[Row]): Iterator[(String,MyState)] = {
val listBest = new util.ArrayList[State]
Predicate.fuzzyValues = iter.toList
for (i <- 0 until conf.runNumber) {
Strategy.executeStrategy(conf.iterByRun, 1, conf.algorithm("algorithm").asInstanceOf[GeneratorType])
listBest.addAll(Strategy.getStrategy.listBest)
}
val result = postMining(listBest)
result.map(x => (x.getCode.toString, MyState(x.getCode,x.getEvaluation))).iterator
}
def run(sparkSession: SparkSession, n: Int): Unit = {
import sparkSession.implicits._
var data0 = conf.dataBase.repartition(n).persist(StorageLevel.MEMORY_AND_DISK_SER)
var listBest = new util.ArrayList[State]
implicit def enc1 = Encoders.bean(classOf[(String,MyState)])
val data1 = data0.mapPartitions(executeStrategy)
data1.show(3)
}
I expect that the dataset has the results of the processing of each partition, which I can see when I debug, but I get an empty dataset.
I have tried rdd with the same function executeStrategy() and this one returns an rdd with the results. What is the problem with the dataset?
I have a large query that seems to be a prime candidate for streaming results.
I would like to make a call to a function, which returns an object which I can apply additional map transformations on, and then ultimately convert the entire result into a list. This is because the conversions will results in a set of objects much smaller than the results in the database and there are many different transformations that must take place sequentially. Processing each result at a time will save me significant memory.
For example, if the results from the database were a stream (though the correct thing is likely an AkkaStream or an Iteratee), then I could do something like:
def outer(converter1[String, Int}, converter2[Int,Double]) {
val sqlIterator = getSqlIterator()
val mappedIterator1 = sqlIterator.map(x => converter1(x.bigColumn))
val mappedIterator2 = sqlIterator.map(x => converter2(x))
val retVal = mappedIterator.toList
retVal
}
def getSqlIterator() {
val selectedObjects = SQL( """SELECT * FROM table""").map { x =>
val id = x[Long]("id")
val tinyColumn = x[String]("tiny_column")
val bigColumn = x[String]("big_column")
NewObject(id, tinyColumn, bigColumn)
}
val transformed = UNKNOWN_FUNCTION(selectedObjects)
transformed
}
Most of the documentation appears to provide the mechanism to apply a "reduce" function to the results, rather than a "map" function, but the resulting mapped functions will be much smaller, saving me significant memory. What should I do for UNKNOWN_FUNCTION?
The following is a simple example of using Anorm's Akka Streams support to read the values from a single column of type String, applying two transformations to each element, and placing the results in a Seq. I'll leave it as an exercise for you to retrieve the values from multiple columns at a time, if that's what you need.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import anorm._
import scala.collection.immutable.Seq
import scala.concurrent.Future
implicit val system = ActorSystem("MySystem")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher
val convertStringToInt: String => Int = ???
val convertIntToDouble: Int => Double = ???
val result: Future[Seq[Double]] =
AkkaStream.source(SQL"SELECT big_column FROM table", SqlParser.scalar[String])
.map(convertStringToInt)
.map(convertIntToDouble)
.runWith(Sink.seq[Double])
I have a function that I want to apply to a every row of a .csv file:
def convert(inString: Array[String]) : String = {
val country = inString(0)
val sellerId = inString(1)
val itemID = inString(2)
try{
val minidf = sqlContext.read.json( sc.makeRDD(inString(3):: Nil) )
.withColumn("country", lit(country))
.withColumn("seller_id", lit(sellerId))
.withColumn("item_id", lit(itemID))
val finalString = minidf.toJSON.collect().mkString(",")
finalString
} catch{
case e: Exception =>println("AN EXCEPTION "+inString.mkString(","))
("this is an exception "+e+" "+inString.mkString(","))
}
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throw a java.lang.NullPointerException.
I included a try catch clause so see where exactly is this failing and it's failing for every single row.
What am I doing wrong here?
You cannot put sqlContext or sparkContext in a Spark map, since that object can only exist on the driver node. Essentially they are in charge of distributing your tasks.
You could rewite the JSON parsing bit using one of these libraries in pure scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}
final class HBaseSC(sc: SparkContext) extends Serializable {
def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
data map { case (cf, columns) =>
val content = columns map { column =>
val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
column -> interpret(CellUtil.cloneValue(cell))
} toMap
cf -> content
}
def makeConf(table: String) = {
val conf = HBaseConfiguration.create()
conf.setBoolean("hbase.cluster.distributed", true)
conf.setInt("hbase.client.scanner.caching", 10000)
conf.set(TableInputFormat.INPUT_TABLE, table)
conf
}
def hbase[A](table: String, data: Map[String, List[String]])
(interpret: Array[Byte] => A) =
sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
Bytes.toString(key.get) -> extract(data, row, interpret)
}
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose key are column families and the values are maps whose keys are columns and whose content are the cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicit, even using a function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}
final class TsvRDD(val sc: SparkContext) extends Serializable {
def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
val contents = line.split(separator).toList
(fields, contents).zipped.toMap
}
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add #transient annotation to sc variable:
final class HBaseSC(#transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within extract function, since it won't be available on workers.
If it's necessary to access Spark context from within distributed computation, rdd.context function might be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
case (k, v) =>
val ctx = rdd.context
....
}