java.lang.ArrayIndexOutOfBoundsException: 0 : If Directory Does Not Have Files - scala

Please assist me with the following scenario. I'm scanning the folders for the last two hours and then taking the most recent CSV file from each, building a single list.
If both hour folders contain files, the code below works as expected, but if either folder contains no files, it throws "ArrayIndexOutOfBoundsException: 0".
Code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.language.postfixOps
val hdfsConf = new Configuration();
var path="/user/hdfs/test/input"
var finalFiles = List[String]()
val currentTs = java.time.LocalDateTime.now
val hours=2
var paths = (0 until hours.toInt).map(h => currentTs.minusHours(h))
.map(ts=>s"${path}/partition_date=${ts.toLocalDate}/hour=${ts.toString.substring(11, 13)}")
.toList
// paths: List[String] = List(/user/hdfs/test/input/partition_date=2022-11-30/hour=19,
// /user/hdfs/test/input/partition_date=2022-11-30/hour=18)
for (eachfolder <- paths) {
  var New_Folder_Path: String = eachfolder
  var fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
  var pathstatus = fs.listStatus(new Path(New_Folder_Path))
  var currpathfiles = pathstatus.map(x => Row(x.getPath.toString, x.getModificationTime))
  var latestFile = spark.sparkContext.parallelize(currpathfiles)
    .map(row => (row.getString(0), row.getLong(1)))
    .toDF("FilePath", "ModificationTime")
    .filter(col("FilePath").like("%.csv%"))
    .sort($"ModificationTime".desc)
    .select(col("FilePath")).limit(1)
    .map(row => row.getString(0)).collectAsList.get(0)
  finalFiles = latestFile :: finalFiles
}
Error:
java.lang.ArrayIndexOutOfBoundsException: 0

You're running into this because the code tries to take the 0th element from an empty result. You can avoid it by using headOption on the collected result, together with foreach on the resulting Option.
spark.sparkContext.parallelize(currpathfiles)
.map(row => (row.getString(0), row.getLong(1)))
...
.map(row => row.getString(0))
.collect().headOption
.foreach(latestFile => finalFiles = latestFile :: finalFiles)
Also note that instead of assigning latestFile to a var, this version just prepends it to finalFiles inside the Option's foreach (the foreach only runs when the collected result actually contains an element).
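Putting it together, a minimal sketch of the loop with this change applied (assuming the same spark-shell session, imports, paths and finalFiles as in the question; the fs.exists check is an optional extra guard for hour folders that do not exist at all):
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)

for (eachfolder <- paths) {
  val folderPath = new Path(eachfolder)
  if (fs.exists(folderPath)) {                       // optional: skip hour folders that were never created
    val currpathfiles = fs.listStatus(folderPath)
      .map(x => (x.getPath.toString, x.getModificationTime))
    spark.sparkContext.parallelize(currpathfiles)
      .toDF("FilePath", "ModificationTime")
      .filter(col("FilePath").like("%.csv%"))
      .sort($"ModificationTime".desc)
      .select(col("FilePath")).limit(1)
      .map(row => row.getString(0))
      .collect()                                     // Array[String], possibly empty
      .headOption                                    // None when the folder has no CSV files
      .foreach(latestFile => finalFiles = latestFile :: finalFiles)
  }
}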

Related

not able to store result in hdfs when code runs for second iteration

Well, I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks for missing values in one column, stores them in outputrdd, and runs a loop to calculate the missing values. The code works well when there is only one missing value in the file, but since HDFS does not allow writing to the same location again, it fails when there is more than one missing value. Can you please assist in writing finalrdd to a particular location once the missing values have been calculated for all occurrences?
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }
  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  def cleaningData(file: String) = {
    //headers has column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      //Checks for missing values in dataframe and stores missing values in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()
        for (i <- strings) {
          if (Coddition check) {
            //Calculates missing value and stores in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
It is not clear how (Coddition check) works in your example.
In any case, .saveAsTextFile("/output") should be called only once.
So I would rewrite your example like this:
val strings = outputdatetimerdd
.map(row => row.mkString)
.collect() // perhaps '.collect()' is redundant
val finalrdd = strings
.filter(str => Coddition check str) //don't know how this Coddition works
.map (x => x.mkString("\t"))
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

Looping through Map Spark Scala

Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains the tweet messages. We want to find, for every single line in twitter.test, the name that matches a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to check every one of those names against every line in the test file.
import scala.io.{Source, Codec}
import java.nio.charset.CodingErrorAction
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object twitterAthlete {
def loadAthleteNames() : Map[String, String] = {
// Handle character encoding issues:
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
// Create a Map of Ints to Strings, and populate it from u.item.
var athleteInfo:Map[String, String] = Map()
//var movieNames:Map[Int, String] = Map()
val lines = Source.fromFile("../athletes.csv").getLines()
for (line <- lines) {
var fields = line.split(',')
if (fields.length > 1) {
athleteInfo += (fields(1) -> fields(7))
}
}
return athleteInfo
}
def parseLine(line:String): (String)= {
var athleteInfo = loadAthleteNames()
var hello = new String
for((k,v) <- athleteInfo){
if(line.toString().contains(k)){
hello = k
}
}
return (hello)
}
def main(args: Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val sc = new SparkContext("local[*]", "twitterAthlete")
val lines = sc.textFile("../twitter.test")
var athleteInfo = loadAthleteNames()
val splitting = lines.map(x => x.split(";")).map(x => if(x.length == 4 && x(2).length <= 140)x(2))
var hello = new String()
val container = splitting.map(x => for((key,value) <- athleteInfo)if(x.toString().contains(key)){key}).cache
container.collect().foreach(println)
// val mapping = container.map(x => (x,1)).reduceByKey(_+_)
//mapping.collect().foreach(println)
}
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
but the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return values from the for comprehension inside your map:
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
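To see why the original version printed empty tuples, here is a small local illustration with toy data: without yield, the for comprehension evaluates to Unit, so the mapped RDD is full of () values.
// Toy data, hypothetical names -- purely to show the difference yield makes.
val athleteInfo = Map("Michael" -> "USA", "Martin" -> "GBR")
val tweet = "great race by Michael today"

val withoutYield = for ((k, _) <- athleteInfo) if (tweet.contains(k)) k
// withoutYield: Unit = ()              <- this is what the original map produced

val withYield = for ((k, _) <- athleteInfo if tweet.contains(k)) yield (k, 1)
// withYield: Map[String, Int] = Map(Michael -> 1)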
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the data with athlete names is, you'll end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
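If the athletes table is small and you want the broadcast join to be explicit rather than left to the optimizer's size estimate, Spark's broadcast hint can be added to the same join (reusing the exploded and athletes DataFrames from above):
import org.apache.spark.sql.functions.broadcast

// Explicitly mark the small athletes side for broadcasting to avoid a shuffle.
val withAthleteBroadcast = exploded.join(broadcast(athletes), 'word === 'name)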

Best way to convert online csv to dataframe scala

I am trying to figure out the most efficient way to load this online CSV file into a DataFrame in Scala.
To save you a download, the CSV file in the code looks like this:
"Symbol","Name","LastSale","MarketCap","ADR
TSO","IPOyear","Sector","Industry","Summary Quote"
"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
....
From my research, I start by downloading the CSV and placing it into a ListBuffer (you can't do this with a List because it's immutable):
import scala.collection.mutable.ListBuffer
val sc = new SparkContext(conf)
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource = Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
val stockInfoNYSE_List = stockInfoNYSE_ListBuffer.toList
So we have a list. You can basically get each value like this:
// SYMBOL : stockInfoNYSE_List(1).split(",")(0)
// COMPANY NAME : stockInfoNYSE_List(1).split(",")(1)
// IPOYear : stockInfoNYSE_List(1).split(",")(5)
// Sector : stockInfoNYSE_List(1).split(",")(6)
// Industry : stockInfoNYSE_List(1).split(",")(7)
Here is where I get stuck: how do I get this into a DataFrame? Below are the wrong approaches I have taken (I didn't put all the values in just yet; it was a simple test).
case class StockMap(Symbol: String, Name: String)
val caseClassDS = Seq(StockMap(stockInfoNYSE_List(1).split(",")(0),
                               stockInfoNYSE_List(1).split(",")(1))).toDS()
caseClassDS.show()
The problem with the approach above: I can only figure out how to add one sequence (row) by hard coding it. I want every Row in the list.
My second failed attempt:
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val test = stockInfoNYSE_List.toDF
This will just give you the array, and I want to divide up the values.
Array(["Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"], ["DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"], ["MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"],.......
case class TestClass(Symbol: String, Name: String, LastSale: String, MarketCap: String, ADR_TSO: String, IPOyear: String, Sector: String, Industry: String, Summary_Quote: String)
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
scala> demoDS.toDS.show
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
|Symbol| Name|LastSale| MarketCap| ADR_TSO|IPOyear| Sector| Industry| Summary_Quote|
+------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
| DDD|3D Systems Corpor...| 18.09| 2058834640.41| n/a| n/a| Technology|Computer Software...|http://www.nasdaq...|
| MMM| 3M Company| 211.68|126423673447.68| n/a| n/a| Health Care|Medical/Dental In...|http://www.nasdaq...|
In case anyone is trying to get this example working, here is the code using the above solution:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import scala.collection.mutable.ListBuffer
import sqlContext.implicits._
var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
import scala.io.Source
val bufferedSource =
Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close
case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String )
var stockDF= stockInfoNYSE_ListBuffer.drop(1)
val demoDS = stockDF.map(line => {
val fields = line.replace("\"","").split(",")
TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
})
demoDS.toDF().show
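For what it's worth, a hedged alternative on Spark 2.2+ is to hand the downloaded lines straight to the CSV reader as a Dataset[String]; the reader then derives the column names from the header row, so no case class is needed. The sketch below assumes a SparkSession named spark and the stockInfoNYSE_ListBuffer built above:
import spark.implicits._

// Turn the in-memory lines into a Dataset[String] and let the CSV reader parse them.
val csvLines = spark.createDataset(stockInfoNYSE_ListBuffer.toList)
val stocksDF = spark.read
  .option("header", "true")   // the first downloaded line is the header row
  .csv(csvLines)              // DataFrameReader.csv(Dataset[String]) is available since Spark 2.2
stocksDF.show()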

Task not serializable in scala

In my application, I'm using the parallelize method to save an Array to a file.
The code is as follows:
val sourceRDD = sc.textFile(inputPath + "/source")
val destinationRDD = sc.textFile(inputPath + "/destination")
val source_primary_key = sourceRDD.map(rec => (rec.split(",")(0).toInt, rec))
val destination_primary_key = destinationRDD.map(rec => (rec.split(",")(0).toInt, rec))
val extra_in_source = source_primary_key.subtractByKey(destination_primary_key)
val extra_in_destination = destination_primary_key.subtractByKey(source_primary_key)
val source_subtract = source_primary_key.subtract(destination_primary_key)
val Destination_subtract = destination_primary_key.subtract(source_primary_key)
val exact_bestmatch_src = source_subtract.subtractByKey(extra_in_source).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_Dest = Destination_subtract.subtractByKey(extra_in_destination).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_src_p = exact_bestmatch_src.map(rec => (rec.split(",")(0).toInt))
val primary_key_distinct = exact_bestmatch_src_p.distinct.toArray()
for (i <- primary_key_distinct) {
  var dummyVar: String = ""
  val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
  var dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray
  for (print1 <- src) {
    var sourceArr: Array[String] = print1.split(",")
    var exactbestMatchCounter: Int = 0
    var index: Array[Int] = new Array[Int](1)
    println(print1 + "source")
    for (print2 <- dest) {
      var bestMatchCounter = 0
      var i: Int = 0
      println(print1 + "source + destination" + print2)
      for (i <- 0 until sourceArr.length) {
        if (print1.split(",")(i).equals(print2.split(",")(i))) {
          bestMatchCounter += 1
        }
      }
      if (exactbestMatchCounter < bestMatchCounter) {
        exactbestMatchCounter = bestMatchCounter
        dummyVar = print2
        index +:= exactbestMatchCounter //9,8,9
      }
    }
    var z = index.zipWithIndex.maxBy(_._1)._2
    if (exactbestMatchCounter >= 0) {
      var samparr: Array[String] = new Array[String](4)
      samparr +:= print1 + " BEST_MATCH " + dummyVar
      var deletedest: Array[String] = new Array[String](1)
      deletedest = dest.take(z) ++ dest.drop(1)
      dest = deletedest
      val myFile = sc.parallelize(samparr).saveAsTextFile(outputPath)
    }
  }
}
I have used the parallelize method, and I even tried the method below to save it as a file:
val myFile = sc.textFile(samparr.toString())
val finalRdd = myFile
finalRdd.coalesce(1).saveAsTextFile(outputPath)
but it keeps throwing the error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
You can't treat an RDD like a local collection. All operations against it happen over a distributed cluster, and every function you run against that RDD must be serializable.
The line
for (print1 <- src) {
iterates over the RDD src, so everything inside the loop must be serializable, because it will run on the executors.
Inside that loop, however, you call sc.parallelize(...). SparkContext is not serializable: working with RDDs and the SparkContext is something you do on the driver, and cannot do from within an RDD operation.
I'm not entirely sure what you are trying to accomplish, but it looks like some sort of hand-coded join between the source and destination. You can't work with loops over RDDs the way you can with local collections. Make use of the APIs map, join, groupBy, and the like to create your final RDD, and then save that.
If you absolutely feel you must use a foreach loop over the RDD like this, then you can't use sc.parallelize().saveAsTextFile(). Instead, open an output stream using the Hadoop file API and write your array to the file manually.
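A minimal sketch of that suggestion, using only the standard Hadoop FileSystem API (the output path is hypothetical, and samparr stands for the local Array[String] built in the loop); everything here runs on the driver, so nothing needs to be serialized:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.io.PrintWriter

val hadoopConf = new Configuration()                       // picks up core-site.xml / hdfs-site.xml from the classpath
val fs = FileSystem.get(hadoopConf)

// Hypothetical output path; samparr is the local array of "BEST_MATCH" lines.
val out = fs.create(new Path("/user/output/best_match.txt"))
val writer = new PrintWriter(out)
try {
  samparr.foreach(line => writer.println(line))            // plain driver-side write, no RDD involved
} finally {
  writer.close()
}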
Finally, this piece of code helped me save an array to a file.
new PrintWriter(outputPath) { write(array.mkString(" ")); close }

Passing list to sc.textFile -scala-

I'm looking for how to pass a list of paths to sc.textFile (in scala), without using foreach.
Example :
myList: Seq[String] = ArrayBuffer(path1, path2, path3)
Is there a way to do :
var data = sc.textFile(myList)
Try
var data = sc.textFile(myList.mkString(","))
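This works because sc.textFile accepts a comma-separated list of paths (the underlying Hadoop FileInputFormat splits on commas), so all of them end up in a single RDD. A quick illustration with hypothetical paths:
// Hypothetical paths, for illustration only.
val myList: Seq[String] = Seq("/data/logs/2022-11-28", "/data/logs/2022-11-29", "/data/logs/2022-11-30")

// Joining the Seq with mkString is enough to read every directory into one RDD.
val data = sc.textFile(myList.mkString(","))
println(data.count())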
Alternatively, we can read each text file, then union the resulting rdds:
import scala.util.{Try, Success}
val rdds = myList.flatMap { f =>
Try(sc.textFile(f)) match {
case Success(rdd) => Some(rdd)
case _ => None
}
}
val rdd = sc.union(rdds)