Scala - Get function result in array inside foreach

I want to list the folders from the provided source paths. It should first go to the 2021 directory, get the paths and store them in an array, and then do the same for the 2022 folder.
Then I need to access this array outside the foreach and pass it to some other function. I'm unable to figure out how to go about this. Please help.
val SourcePaths: Array[String] = Array(
  "abfss://cont#mystorage.dfs.core.windows.net/testdata/2021/",
  "abfss://cont#mystorage.dfs.core.windows.net/testdata/2022/")

SourcePaths.foreach(path => {
  var allDirPaths: Array[String] = listDirectories(path, true)
})

Use map instead of foreach (also, don't use arrays, they are bad):
val subfolders: Seq[Seq[String]] = sourcePaths.toSeq.map(listDirectories(_, true).toSeq)
Or flatMap if you want all subfolders as one flat list:
val subfolders: Seq[String] = sourcePaths.toSeq.flatMap(listDirectories(_, true))
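For completeness, here's a minimal, self-contained sketch of that approach. listDirectories is stubbed out below because its real implementation isn't shown in the question, and processDirs is a made-up stand-in for the "other function" you want to pass the result to:
def listDirectories(path: String, recursive: Boolean): Array[String] =
  Array(s"${path}sub1", s"${path}sub2")   // placeholder; the real listing logic isn't shown

def processDirs(dirs: Seq[String]): Unit =
  dirs.foreach(println)                   // stand-in for the "other function"

val sourcePaths = Seq("/testdata/2021/", "/testdata/2022/")

// All subfolders as one flat list, available outside the loop:
val allDirPaths: Seq[String] = sourcePaths.flatMap(p => listDirectories(p, true))

processDirs(allDirPaths)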

How to get an element from a List with multiple data types in Spark?

I know this may be a basic question, but I searched the web and still cannot find the right answer.
I have a list in Spark that looks like:
List[(String,Timestamp,Timestamp)]
I want to retrieve the second element within the first element (i.e. the first Timestamp that appears in the list above). My understanding is to use something like the following syntax:
a(0)(1)
However, it seems it's not a multi-dimensional List, so I can't use this syntax.
How to get the element I want out of this list?
As mentioned, the example list is not multidimensional, so one option to get the value is to iterate through the list.
For example, using the list below:
List((StringType,DoubleType,DoubleType,IntegerType))
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType}

def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder()
    .config("spark.master", "local[1]")
    .appName("")
    .getOrCreate()

  val dataTypeList = List((StringType, DoubleType, DoubleType, IntegerType))
  dataTypeList.foreach { x => println(x._2) }   // print the second element of each tuple
}
Output
DoubleType
List[(String,Timestamp,Timestamp)] is a list of Tuple3.
To get the timestamp, you should:
a(0)._2
instead of:
a(0)(1)
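For a concrete (made-up) example of that access pattern:
import java.sql.Timestamp

val a = List(("id-1",
              Timestamp.valueOf("2021-01-01 00:00:00"),
              Timestamp.valueOf("2021-06-01 00:00:00")))

val firstTimestamp = a(0)._2   // the second field of the first tuple
// equivalently: a.head._2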

Scala append Seq in with a string in a List Map or Foreach

I'm looking to append to a val adminEmailSeq = Seq.empty[String] from a list of object attributes.
My List[User] is called 'admins', and I'm trying to do this, but it doesn't work:
admins.foreach(
  admin => {
    adminEmailSeq :+ admin.email
  }
)
Although admin.email contains the right information, adminEmailSeq.isEmpty is always true.
From the description, I assume that you need the emails from the admins:
val adminEmailSeq = admins.map(_.email)
The append operator ':+' doesn't actually append to the existing Seq; it returns a new copy with the element added.
With your approach you would need the following (and adminEmailSeq would have to be a var, not a val).
admins.foreach(
  admin => {
    adminEmailSeq = adminEmailSeq :+ admin.email
  }
)
But I think the right solution would be to use map.
Almost forgot: I'm a fan of immutable values rather than variables; it makes the code much easier to understand. That's why I wouldn't reassign a variable here and suggested map instead, as some of the other answers also show.
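Put together as a runnable sketch (the User case class below is just a guess at the shape; only the email field matters here):
case class User(name: String, email: String)   // hypothetical shape of the User class

val admins = List(User("Ada", "ada@example.com"), User("Grace", "grace@example.com"))

val adminEmailSeq: Seq[String] = admins.map(_.email)
// adminEmailSeq == List("ada@example.com", "grace@example.com")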

Why can I not modify a map in foreach?

I am new to Scala and use Spark to process data. Why does the following code fail to change the categoryMap?
import scala.collection.mutable.LinkedHashMap

val catFile = sc.textFile(inputFile)
var categoryMap = LinkedHashMap[Int, Tuple2[String, Int]]()
catFile.foreach(line => {
  val strs = line.split("\001")
  categoryMap += (strs(0).toInt -> (strs(2), strs(3).toInt))
})
It's good practice to try to stay away from both mutable data structures and vars. Sometimes they are needed, but mostly this kind of processing is easy to do by chaining transformation operations on collections. Also, .toMap is handy for converting a Seq of Tuple2s into a Map (for an RDD of pairs, collectAsMap() does the same and brings the result back to the driver).
Here's one way (that I didn't test properly):
val categoryMap = catFile
  .map(_.split("\001"))
  .map(array => (array(0).toInt, (array(2), array(3).toInt)))
  .collectAsMap()   // an RDD has no .toMap; collectAsMap() returns the pairs to the driver as a Map
Note that if there is more than one record for a given key, only the last one will be present in the resulting map.
Edit: I didn't actually answer your original question. Based on a quick test, it results in a similar map to what my code above produces. As a blind guess, make sure that your catFile actually contains data to process.
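To see the transform-then-toMap pattern in isolation (plain Scala collections, no Spark), here's a small sketch with made-up lines that use the question's \001 separator:
val sep = "\u0001"   // same separator as "\001" in the question
val lines = Seq(
  Seq("1", "x", "books", "10").mkString(sep),
  Seq("2", "y", "music", "20").mkString(sep))

val categoryMap: Map[Int, (String, Int)] =
  lines
    .map(_.split(sep))
    .map(fields => (fields(0).toInt, (fields(2), fields(3).toInt)))
    .toMap
// categoryMap == Map(1 -> ("books", 10), 2 -> ("music", 20))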

Fastest way to append sequence objects in loop

I have a for loop within which I get a Seq[Seq[(String, Int)]] on every run. I currently iterate through the Seq[Seq[(String, Int)]] to get each Seq[(String, Int)] and then append its elements to a ListBuffer[(String, Int)].
Here is the following code:
import scala.collection.mutable.ListBuffer

var lis: Seq[Seq[(String, Int)]] = Seq.empty   // set on every run of someLoop
var matches = new ListBuffer[(String, Int)]

someLoop.foreach(k =>
  // someLoop gives a lis object on every run,
  // and that needs to be added to the matches list
  lis.foreach(j => matches.appendAll(j))
)
Is there a better way to do this without looping through the Seq[Seq[(String, Int)]], say by directly adding all the inner seq objects to the ListBuffer?
I tried the ++ operator by adding matches and lis directly, but it didn't work either. I'm using Scala 2.10.2.
Try this:
matches.appendAll(lis.flatten)
This way you can avoid the mutable ListBuffer altogether. lis.flatten will be a Seq[(String, Int)], so you can shorten your code like this:
val lis = ... //whatever that is Seq[Seq[(String, Int)]]
val flatLis = lis.flatten // Seq[(String, Int)]
Avoid vars and mutable structures like ListBuffer as much as you can.
You don't need to append to an empty ListBuffer, just create it directly:
import collection.breakOut
import scala.collection.mutable.ListBuffer

val matches: ListBuffer[(String, Int)] =
  lis.flatMap(identity)(breakOut)
breakOut is the magic here. Flattening a Seq[Seq[T]] with flatMap(identity) would usually create a Seq[T] that you'd then have to convert to a ListBuffer. Passing breakOut as the CanBuildFrom argument makes the call look at the expected result type and build that kind of collection directly instead (flatMap accepts a CanBuildFrom, while plain flatten does not, which is why flatMap(identity) is used here).
Of course... You were only using ListBuffer for mutability anyway, so a Seq[T] is probably exactly what you really want. In which case, just let the inferencer do its thing:
val matches = lis.flatten
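A quick sanity check of that with made-up data:
val lis = Seq(Seq(("a", 1), ("b", 2)), Seq(("c", 3)))
val matches = lis.flatten
// matches == Seq(("a", 1), ("b", 2), ("c", 3))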

How to create a collection of unique values based on existing array values

The code below prints an Array of file names.
val pdfFileArray = getFiles()
for (fileName <- pdfFileArray) {
  println(fileName)
}
I'm trying to convert this Array (pdfFileArray) into an array which contains unique file name extensions.
Is something like the code below the correct way of doing this in Scala?
import scala.collection.mutable

val fileNameSet = mutable.HashSet[String]()
val pdfFileArray = getFiles()
for (fileName <- pdfFileArray) {
  val extension = fileName.substring(fileName.lastIndexOf('.'))
  fileNameSet.add(extension)
}
This will properly handle files with no extension (by ignoring them):
val extensions = getFiles().map{_.split('.').tail.lastOption}.flatten.distinct
so
Array("foo.jpg", "bar.jpg", "baz.png", "foobar")
becomes
Array("jpg", "png")
You could do this:
val fileNameSet = pdfFileArray.groupBy(_.split('.').last).keys
This assumes that all your filenames have an extension and that you only want the last extension, i.e. something.html.erb has the extension 'erb'.
There's a method on Scala's collections called distinct, which removes all duplicate entries from the collection. So for instance:
scala> List(1, 2, 3, 1, 2).distinct
res3: List[Int] = List(1, 2, 3)
Is that what you're looking for?
For the sake of completeness:
List("foo.jpg", "bar.jpg").map(_.takeRight(3)).toSet
Here I'm assuming that all extensions are 3 characters long. The conversion to Set, just like the .distinct method in the other answers (which uses a mutable set underneath, by the way), gives you unique items.
You can also do it with regex, which gives a more general solution because you can redefine the expression to match anything you want:
val R = """.*\.(.+)""".r
getFiles.collect{ case R(x) => x }.distinct
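A quick check of that with hypothetical file names (getFiles isn't shown, so a literal list stands in for it; R is the regex defined above):
val files = Seq("foo.jpg", "bar.jpg", "baz.png", "archive.tar.gz", "README")
val extensions = files.collect { case R(ext) => ext }.distinct
// extensions == Seq("jpg", "png", "gz") -- names without a '.' are simply skipped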