Null values in SCALA in an Array of String - ArrayIndexOutOfBoundsException - scala

I have an Array of String as follows:
res17: Array[String] = Array(header - skip me, blk1|X|||||, a|b|c||||, d|e|f||||, x|y|z||||, blk2|X|||||, h|h|h|h|h|h|h, j|j|j|j|j|j|j, k|k|k|k|k|k|k, m|m|m|m|m|m|m, blk3|X|||||, 7|7|||||)
This comes from a plain Scala program (not Spark):
for (line <- Source.fromFile(filename).getLines().drop(1).toVector) {
val values = line.split("\\|").map(_.trim)
...
When I perform:
...
println(values(0), values(1), values(2)) // fails with ArrayIndexOutOfBoundsException at index 2 (or even 1) when the trailing fields are empty
}
I.e. it fails when there is nothing between the pipes (|).
getOrElse does not help. How can I substitute the "nulls" when retrieving or saving the values? I cannot find it in the documentation, though it must be quite simple.
Note that I am using plain Scala, not Spark.
Thanks in advance.

Well, that's not the behaviour I am experiencing; I may be doing something different.
Anyway, if you want to get rid of the empty values, you can filter them out like this:
val values = s.split("\\|").map(_.trim).filterNot(_.isEmpty)
If you don't want to drop them but turn them into something else, you can map each field to an Option:
val values = s.split("\\|").map{x => val trimmed = x.trim; if (trimmed.isEmpty) None else Some(trimmed)}
EDIT:
val values = s.split("\\|").map{x => if (x == null) "" else x.trim}
EDIT (again):
I can finally reproduce it; sorry for the inconvenience, I misunderstood something. The problem is the split function, which by default drops trailing empty strings. Pass a negative limit as the second argument to keep them, as explained in the String.split API documentation:
val values = line.split("\\|", -1).map(_.trim)
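To make the difference concrete, here is a minimal sketch in plain Scala. The object name SplitDemo and the helper fieldOrElse are made up for illustration and are not part of the original post; the default value passed in is likewise an assumption.

```scala
object SplitDemo {
  // Split a pipe-delimited line; the limit of -1 keeps trailing empty fields.
  def fields(line: String): Array[String] =
    line.split("\\|", -1).map(_.trim)

  // Return field i, or the given default when the field is missing or empty.
  // lift turns an out-of-range index into None instead of throwing.
  def fieldOrElse(values: Array[String], i: Int, default: String): String =
    values.lift(i).filter(_.nonEmpty).getOrElse(default)
}
```

With the sample data, "blk1|X|||||".split("\\|") has only 2 elements, while SplitDemo.fields("blk1|X|||||") has 7, so values(2) no longer throws. Calling SplitDemo.fieldOrElse(values, 2, "n/a") is one way to substitute a placeholder for an empty field.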

Related

Matching Column name from Csv file in spark scala

I want to take the headers (column names) from my CSV file and match them against my existing header.
I am using the code below:
val cc = sparksession.read.csv(filepath).take(1)
It gives me a value like:
Array([id,name,salary])
I have also created a static schema, which gives me a value like this:
val ss=Array("id","name","salary")
Then I try to compare the column names using an if condition:
if(cc==ss){
println("matched")
} else{
println("not matched")
}
I guess that because of the [] and () mismatch it always goes to the else branch. Is there any other way to compare these values without considering [] and ()?
First, for convenience, set the header option to true when reading the file:
val df = sparksession.read.option("header", true).csv(filepath)
Get the column names and define the expected column names:
val cc = df.columns
val ss = Array("id", "name", "salary")
To check if the two match (not considering the ordering):
if (cc.toSet == ss.toSet) {
println("matched")
} else {
println("not matched")
}
If the order is relevant, compare them as sequences instead (== on two Arrays compares references, whereas on Seq it compares contents):
cc.toSeq == ss.toSeq
or use a deep array comparison:
cc.deep == ss.deep
First of all, I think you are trying to compare an Array[org.apache.spark.sql.Row] with an Array[String]. I believe you should change how you load the headers to something like: val cc = spark.read.format("csv").option("header", "true").load(fileName).columns.toArray.
Then you could compare using cc.deep == ss.deep.
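The comparisons discussed above can be tried without Spark. The following is a small sketch on plain arrays; the object and method names are hypothetical. It uses sameElements for the order-sensitive check, which is a substitute for deep rather than what the answers used: as far as I know, deep is no longer available on arrays from Scala 2.13 onwards.

```scala
object HeaderCompare {
  // Order-sensitive: same names in the same positions.
  def sameOrder(actual: Array[String], expected: Array[String]): Boolean =
    actual.sameElements(expected)

  // Order-insensitive: same set of names, ignoring position and duplicates.
  def sameNames(actual: Array[String], expected: Array[String]): Boolean =
    actual.toSet == expected.toSet
}
```

For example, HeaderCompare.sameOrder(df.columns, Array("id", "name", "salary")) would be true only if the file's header lists exactly those columns in that order.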
The code below worked for me.
val cc= spark.read.csv("filepath").take(1)(0).toString
The above code gave the output as a String: [id,name,salary].
I then created a static schema as
val ss="[id,name,salary]"
and wrote the if/else condition.

Scala add element to Nil list in scala

I'm solving a problem on LeetCode:
https://leetcode.com/problems/minimum-absolute-difference/
I can't seem to understand why in the code below the result list is not correctly appended after resetting it to nil.
I looked online of course but could not fathom the concept behind this behavior. Can someone explain why after result is assigned Nil, no value can get added to that list? How do I reset the list?
I tried with ListBuffer and clear() but got the same issue: at the end of the run the result is Nil.
Expected behavior:
Input: arr = [4,2,1,3]
Output: [[1,2],[2,3],[3,4]]
Actual behavior:
Input: arr = [4,2,1,3]
Output: List()
def minimumAbsDifference(arr: Array[Int]): List[List[Int]] = {
val sortedInput = arr.sorted
var min = Integer.MAX_VALUE
var result = Seq[List[Int]]()
for(i <- 0 until sortedInput.length - 1){
val diff = sortedInput(i+1) - sortedInput(i)
if(min > diff){
result = Nil
min = diff
}
if(min == diff){
result :+ List(sortedInput(i),sortedInput(i+1))
}
}
result.toList
}
You're assigning Nil to result and then never assigning anything else.
Because List is immutable, result :+ List(...) returns a new list, which is then thrown away. You need to assign the new list back to result.
A couple of other notes:
It is extremely inefficient (decidedly not "leet") to append to a list. It's much more efficient to prepend (building the result in reverse) and then reverse at the end.
It is also extremely inefficient to access List items by index.
Use of var should generally be avoided in Scala, though this particular usage (contained locally to an otherwise pure function) is not beyond the pale.
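Putting these notes together, one possible corrected version looks like the sketch below. It keeps the structure of the question's code, reassigns result on every update, and prepends and reverses rather than appending. The wrapping object MinDiff is only there to make the snippet self-contained. It assumes the input values are small enough that the subtraction cannot overflow an Int.

```scala
object MinDiff {
  def minimumAbsDifference(arr: Array[Int]): List[List[Int]] = {
    val sorted = arr.sorted            // Array, so indexed access is cheap
    var min = Int.MaxValue
    var result = List.empty[List[Int]]
    for (i <- 0 until sorted.length - 1) {
      val diff = sorted(i + 1) - sorted(i)
      if (diff < min) {                // new smallest gap: start over
        min = diff
        result = Nil
      }
      if (diff == min) {
        // Prepend and reassign; the new list is no longer thrown away.
        result = List(sorted(i), sorted(i + 1)) :: result
      }
    }
    result.reverse                     // restore ascending order
  }
}
```

Indexing is done on the sorted Array, not on a List, so the note about slow index access does not apply here.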

Error while finding lines starting with H or I using Scala

I am trying to learn Spark and Scala. I am working on a scenario to identify the lines that start with H or I. Below is my code:
def startWithHorI(s:String):String=
{
if(s.startsWith("I")
return s
if(s.startsWith("H")
return s
}
val fileRDD=sc.textFile("wordcountsample.txt")
val checkRDD=fileRDD.map(startWithHorI)
checkRDD.collect
It throws an error when the function is defined: Found: Unit, Required: Boolean.
From my research, I understand that it cannot recognize the return value because Unit means void. Could someone help me?
There are a few things wrong with your def, so we will start there.
It throws the error because, in the code as posted, the syntax is incomplete and the def is defined improperly:
def startWithHorI(s:String): String=
{
if(s.startsWith("I")) // missing extra paren char in original post
s // do not need return statement
if(s.startsWith("H")) // missing extra paren char in original post
s // do not need return statement
}
This will still produce an error, because the declared return type is String but the compiler infers Any: an if without an else yields Unit when the condition is false, so the type of the whole expression widens to Any. (What should be returned when s does not start with H or I?) The fix is to add an else branch so that every path returns a String. Note also that only the last expression of the body is returned, so the two checks have to be combined into a single expression:
def startWithHorI(s: String): String = {
  // one expression, so the result of the "I" check is not discarded
  if (s.startsWith("I") || s.startsWith("H")) s else "no H or I"
}
If you don't want to return anything, then an Option is worth looking at for a return type.
Finally we can achieve what you are doing via filter - no need to map with a def:
val fileRDD = sc.textFile("wordcountsample.txt")
val checkRDD = fileRDD.filter(s => s.startsWith("H") || s.startsWith("I"))
checkRDD.collect
When passing a function to rdd.map(fn), make sure that fn covers all possible inputs.
If you want to drop the strings which do not start with either H or I, use flatMap and return an Option[String] from your function.
Example:
def startWithHorI(s:String): Option[String]=
{
if(s.startsWith("I") || s.startsWith("H")) Some(s)
else None
}
Then,
sc.textFile("wordcountsample.txt").flatMap(startWithHorI)
This will remove all rows not starting with H or I.
In general, to minimize run-time errors, try to write total functions, which handle all possible values of their arguments.
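The same idea can be checked without a Spark context. The sketch below applies the Option-returning function to an ordinary Seq; the object name LineFilter and the helper keepHorI are invented for this illustration. flatMap on a local collection drops the None results in the same way as described for the RDD.

```scala
object LineFilter {
  // Total function: every input, including the empty string, has a result.
  def startWithHorI(s: String): Option[String] =
    if (s.startsWith("H") || s.startsWith("I")) Some(s) else None

  // flatMap keeps the contents of each Some and discards each None.
  def keepHorI(lines: Seq[String]): Seq[String] =
    lines.flatMap(s => startWithHorI(s))
}
```

Because startsWith is safe on an empty string, blank lines are simply filtered out instead of raising an exception.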
Would something like the code below work for you?
val fileRDD=sc.textFile("wordcountsample.txt")
fileRDD.collect
Array[String] = Array("Hello ", Hello World, Instragram, Good Morning)
val filterRDD=fileRDD.filter( x=> x.nonEmpty && (x(0) == 'H'||x(0) == 'I')) // nonEmpty guards against empty lines, where x(0) would throw
filterRDD.collect()
Array[String] = Array("Hello ", Hello World, Instragram)

Scala: Better way for String Formatting to PhoneNumber and using Java's MessageFormat

I'm looking for a better way to implement the following:
I have imported import java.text.MessageFormat to set the format I would like.
val szphoneFrmt holds the format I would like.
val szInitialString is set to the value I pulled from the database.
val szWorkString breaks up the string via substring.
val szWorkPhone is the final string with the formatted string.
Now the problem that I am seeing is that sometimes the value in the database is null and so szInitialString is null, so I have put in a check to prevent an out of bounds exception. Now this code is working and I am able to format the string properly, but I don't think this is a good solution.
Does anyone have any suggestions for tidying this code up? I would be completely okay with dropping the use of Java's MessageFormat, but I have not seen any other reasonable solutions.
val szphoneFrmt = new MessageFormat("1 {0}{1}-{2}")
val szInitialString = applicant.ApplicantContact.Phone1.toString
val szWorkString = {
if (szInitialString != null) {
Array(szInitialString.substring(0,3),
szInitialString.substring(3,6),
szInitialString.substring(6))
} else { null }
}
val szWorkPhone = szphoneFrmt.format(szWorkString)
Since we don't like nulls, I wrapped your possibly-null value in an Option. If the Option has a value, I create the array from it and then pass it to the formatter. If it doesn't have a value, the result is the empty string "".
val szPhoneFrmt = new MessageFormat("1 {0}{1}-{2}")
val szInitialString = Option(applicant.ApplicantContact.Phone1.toString)
val szWorkString = szInitialString
.map { s => Array(s.substring(0,3), s.substring(3,6), s.substring(6)) }
.map(szPhoneFrmt.format)
.getOrElse("")
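Since the question is open to dropping MessageFormat, here is a sketch of the same Option-based approach using string interpolation instead. The names PhoneFormat and formatPhone are hypothetical, and the minimum length of 7 is an assumption added here so that substring cannot go out of bounds on short values; the original code has no such check.

```scala
object PhoneFormat {
  // Produces the "1 {0}{1}-{2}" layout from the question, or "" when the
  // input is null or too short to split into the three parts.
  def formatPhone(raw: String): String =
    Option(raw)                  // null becomes None
      .map(_.trim)
      .filter(_.length >= 7)     // assumed minimum length, see lead-in
      .map(s => s"1 ${s.substring(0, 3)}${s.substring(3, 6)}-${s.substring(6)}")
      .getOrElse("")
}
```

A call such as PhoneFormat.formatPhone(szInitialString) would then replace both the null check and the formatter. Whether an empty string is the right fallback depends on how the value is displayed.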

Generate Option[T] in ScalaCheck

I am trying to generate optional parameters in ScalaCheck, without success.
There seems to be no direct mechanism for this. Gen.containerOf[Option, Thing](thingGenerator) fails because it cannot find an implicit Buildable[Thing, Option].
I tried
for {
thing <- Gen.listOfN[Thing](1, thingGenerator)
} yield thing.headOption
But this doesn't work because listOfN produces a list that is always of length N. As a result I always get a Some[Thing]. Similarly, listOf1 does not work, because (a) it doesn't produce empty lists, but also (b) it is inefficient because I can't set a max limit on the number of elements.
How can I generate Option[Thing] that includes Nones?
EDIT: I have found a solution, but it is not succinct. Is there a better way than this?
for {
thing <- for {
qty <- Gen.choose(0,1)
things <- Gen.listOfN[Thing](qty, thingGenerator)
} yield things.headOption
} yield thing
EDIT 2: I generalised this to
def optional[T](g: Gen[T]) =
for (qty <- Gen.choose(0, 1); xs <- Gen.listOfN[T](qty, g)) yield xs.headOption
So I don't have to write it more than once. But surely this is in the library already and I just missed it?
Now you can just use:
Gen.option(yourGen)
You can use Gen.oneOf to randomly choose between a Some and a None generator:
val someThing = thingGenerator.map( Some.apply )
val noThing = Gen.const( None:Option[Thing] ) // Gen.value in older ScalaCheck releases
val optThing = Gen.oneOf( someThing, noThing )