Way to extract a list of elements from a Scala list

I have a standard list of objects which is used for some analysis. The analysis generates a list of Strings, and I need to look through the standard list of objects and retrieve the objects with the same names.
import scala.collection.mutable.ListBuffer

case class TestObj(name: String, positions: List[Int], present: Boolean)

val stdLis: List[TestObj] = ???
// analysis generates a list of strings
var generatedLis: List[String] = ???
// list to save objects found in the standard list
val lisBuf = new ListBuffer[TestObj]()

// my current way
generatedLis.foreach { i =>
  val temp = stdLis.filter(p => p.name.equalsIgnoreCase(i))
  if (temp.size == 1) {
    lisBuf.append(temp(0))
  }
}
Is there any other way to achieve this? For example, a custom indexOf method that overrides equality and looks for the name instead of the whole object, or something similar. I have not tried that approach as I am not sure about it.

stdLis.filter(testObj => generatedLis.exists(_.equalsIgnoreCase(testObj.name)))
Use filter to keep the elements of stdLis that satisfy the predicate; use exists to check whether generatedLis contains the current element's name, ignoring case.
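If the lists are large, a small variant worth considering (a sketch; it trades the repeated scan of generatedLis for a set lookup, using lowercasing as an approximation of equalsIgnoreCase):

val wanted = generatedLis.map(_.toLowerCase).toSet
val found = stdLis.filter(t => wanted.contains(t.name.toLowerCase))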

Don't use mutable containers to filter sequences.
Naive solution:
val lisBuf =
  for {
    str <- generatedLis
    temp = stdLis.filter(_.name.equalsIgnoreCase(str))
    if temp.size == 1
  } yield temp(0)
If we can discard the temp.size == 1 condition (I'm not sure whether that is legitimate for your use case):
val lisBuf = stdLis.filter(s => generatedLis.exists(_.equalsIgnoreCase(s.name)))
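For illustration, here is that one-liner on some made-up sample data:

val stdLis = List(
  TestObj("Alpha", List(1), present = true),
  TestObj("Beta", List(2), present = false),
  TestObj("Gamma", List(3), present = true)
)
val generatedLis = List("alpha", "GAMMA")

val lisBuf = stdLis.filter(s => generatedLis.exists(_.equalsIgnoreCase(s.name)))
// lisBuf == List(TestObj("Alpha", List(1), true), TestObj("Gamma", List(3), true))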


Need help - How to loop through a list and/or a map

Scala is pretty new to me, and I run into problems as soon as I leave the Gatling DSL.
In my case I call an API (Mailhog) which responds with a lot of mails in JSON format, and I can't grab all the values at once.
I extract some of them with jsonPath and some with a regex.
That leaves me with a map and a list, which I need to iterate through, saving each value.
.check(jsonPath("$[*]").ofType[Map[String,Any]].findAll.saveAs("id_map"))
.check(regex("href=3D\\\\\"(.*?)\\\\\"").findAll.saveAs("url_list"))
At first I wanted to loop over the checks, but I didn't find a way to repeat them without repeating the get request too. So I end up with a map and a list.
1) I need every value of the map and was able to solve the problem with the following foreach loop.
.foreach("${id_map}", "idx") {
  exec(session => {
    val idMap = session("idx").as[Map[String, Any]]
    val ID = idMap("ID")
    session.set("ID", ID)
  })
  .exec(http("Test")
    .get("/${ID}"))
}
2) I need every third value of the list and want to make a get request for each of them. Before I can do this, I need to replace part of each string. I tried to replace parts of the strings while checking for them, but that doesn't work with findAll.
.check(regex("href=3D\\\\\"(.*?)\\\\\"").findAll.transform(raw => raw.replace("""=\r\n""","")).saveAs("url"))
How can I replace a part of every string in my list?
Also, how can I make a get request for every third element in the list?
I can't get it to work with the same foreach structure as above.
I was able to solve the problem myself. First, I made a little change to my check(regex ...) part.
.check(regex("href=3D\\\\\"(.*?)\\\\\"").findAll.transform(_.map(raw => raw.replace("""=\r\n""",""))).saveAs("url_list"))
Then I wanted to make a get request only for every third element of my list (because each URL I extracted appeared three times per mail).
.exec(session => {
  val url_list =
    session("url_list").as[List[Any]].grouped(3).map(_.head).toList
  session.set("url_list", url_list)
})
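This works because grouped(3) splits the list into consecutive chunks of three and map(_.head) keeps the first element of each chunk. As a plain-Scala illustration (sample values made up):

val urls = List("a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3")
urls.grouped(3).map(_.head).toList // List("a1", "b1", "c1")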
At the end I iterate through my final list with a foreach-loop.
foreach("${url_list}", "urls") {
  exec(http("Activate User")
    .get("${urls}")
  )
}

How to update a value in a DataFrame and drop a row based on a given value in Scala

I need to update the value and if the value is zero then drop that row. Here is the snapshot.
val net = sc.accumulator(0.0)
df1.foreach(x => { net += calculate(df2, x) })

def calculate(df2: DataFrame, x: Row): Double = {
  var pro: Double = 0.0
  df2.foreach(y => {
    if (xxx) { /* do some stuff and update the y.getLong(2) value */ }
    else if (yyy) { /* do some stuff and update the y.getLong(2) value */ }
    if (y.getLong(2) == 0) { /* drop this row from df2 */ }
  })
  pro
}
Any suggestions? Thanks.
You cannot change a DataFrame or RDD; they are read-only for a reason. But you can create a new one and use all the available transformations. So when you want to change, for example, the contents of a column in a DataFrame, just add a new column with the updated contents using a function like this:
df.withColumn(...)
DataFrames are immutable: you cannot update a value in place, but rather create a new DataFrame each time.
Can you reframe your use case? It's not very clear what you are trying to achieve with the above snippet (I'm not able to understand the use of the accumulator).
You can instead try df2.withColumn(...) and use your UDF there.
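To make that concrete, a minimal sketch of the withColumn-plus-filter approach; the column names ("amount", "flag") and conditions are invented for illustration:

import org.apache.spark.sql.functions._

// hypothetical columns "amount" and "flag"; the conditions stand in for xxx/yyy
val updated = df2
  .withColumn("amount",
    when(col("flag") === "xxx", col("amount") + 1)    // stand-in for the first "do some stuff" branch
      .when(col("flag") === "yyy", col("amount") - 1) // stand-in for the second branch
      .otherwise(col("amount")))
  .filter(col("amount") =!= 0) // "dropping" rows whose updated value is zero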

Get the first elements (take function) of a DStream

I am looking for a way to retrieve the first elements of a DStream created as:
val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble))
Unfortunately, there is no take function on a DStream as there is on an RDD, so dstream.take(2) does not work.
Does anyone have an idea how to do this? Thanks.
You can use the transform method of the DStream: take n elements of the input RDD and save them to a list, then filter the original RDD to the elements contained in this list. This returns a new DStream containing n elements.
val n = 10
val partOfResult = dstream.transform(rdd => {
  val list = rdd.take(n)
  rdd.filter(list.contains)
})
partOfResult.print
The previously suggested solution did not compile for me, as the take() method returns an Array, which is not serializable, so Spark Streaming fails with a java.io.NotSerializableException.
A simple variation on the previous code that worked for me:
val n = 10
val partOfResult = dstream.transform(rdd => {
  rdd.filter(rdd.take(n).toList.contains)
})
Sharing a Java-based solution that is working for me. The idea is to use a custom function that can select the top row from a sorted RDD.
someData.transform(rdd -> {
  JavaRDD<CryptoDto> result =
    rdd.keyBy(Recommendations.volumeAsKey)
       .sortByKey(new CryptoComparator()).values().zipWithIndex()
       .map(row -> {
         CryptoDto purchaseCrypto = new CryptoDto();
         purchaseCrypto.setBuyIndicator(row._2 + 1L);
         purchaseCrypto.setName(row._1.getName());
         purchaseCrypto.setVolume(row._1.getVolume());
         purchaseCrypto.setProfit(row._1.getProfit());
         purchaseCrypto.setClose(row._1.getClose());
         return purchaseCrypto;
       })
       .filter(Recommendations.selectTopInSortedRdd);
  return result;
}).print();
The custom function selectTopInSortedRdd looks like this:
public static Function<CryptoDto, Boolean> selectTopInSortedRdd = new Function<CryptoDto, Boolean>() {
  private static final long serialVersionUID = 1L;

  @Override
  public Boolean call(CryptoDto value) throws Exception {
    if (value.getBuyIndicator() == 1L) {
      System.out.println("Value of buyIndicator: " + value.getBuyIndicator());
      return true;
    } else {
      return false;
    }
  }
};
It basically compares all incoming elements, and returns true only for the first record from the sorted RDD.
This seems to always be an issue with DStreams as well as regular RDDs.
If you don't want to (or can't) use .take() (especially with DStreams), you can think outside the box here and just use reduce instead, which is a valid function for both DStreams and RDDs.
Think about it. If you use reduce like this (Python example):
.reduce(lambda x, y: x)
Then what happens is this: for every two elements passed in, it always returns only the first. So if you have a million elements in your RDD or DStream, it will shrink to a single element in the end, which is the very first one in your RDD or DStream.
Simple and clean.
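Since this thread is about Scala, here is the same trick there, as a minimal sketch (DStream.reduce takes the same kind of two-argument function):

// collapses each batch RDD to a single element: the one the reduce happens to keep
val firstOnly = dstream.reduce((x, y) => x)
firstOnly.print()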
However, keep in mind that .reduce() does not take order into consideration. You can easily overcome this with a custom function instead.
Example: let's assume your data looks like x = (1, [1,2,3]) and y = (2, [1,2]), a tuple whose second element is a list. If you are selecting by the longest list, for example, then your code could look like this (adapt as needed):
def your_reduce(x, y):
    if len(x[1]) > len(y[1]):
        return x
    else:
        return y

yourNewRDD = yourOldRDD.reduce(your_reduce)
Accordingly, you will get (1, [1,2,3]), as that has the longer list. There you go!
This caused me some headaches in the past until I finally tried this. Hopefully it helps.

How to use reduce or fold to avoid mutable state

I have a mutable variable in my code that I want to eliminate by using some aggregation function. Unfortunately, I couldn't find a solution for the following pseudocode.
def someMethods(someArgs) = {
  var someMutableVariable = factory
  val resources = getResourcesForVariable(someMutableVariable)
  resources foreach (resource => {
    val localTempVariable = getSomeOtherVariable(resource)
    someMutableVariable = chooseBetterVariable(someMutableVariable, localTempVariable)
  })
  someMutableVariable
}
I have two places in my code where I need to build some value, then compare it in a loop with other possibilities, and if it is worse, replace it with the new possibility.
If your resources variable supports it:
// "cur" is the current best, "next" the next element of the list being folded over
resources.foldLeft(factory) { (cur, next) =>
  val local = getSomeOther(next)
  // since this function returns the better of the two, the fold keeps the best so far
  chooseBetter(local, cur)
}
and then you don't have mutable state.
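To make the pattern concrete, here is a self-contained toy version (all names are stand-ins), where "better" simply means larger:

def getSomeOther(n: Int): Int = n * 2           // stand-in for the real lookup
def chooseBetter(a: Int, b: Int): Int = a max b // stand-in for the real comparison

val resources = List(3, 9, 4, 7)
val best = resources.foldLeft(0)((cur, next) => chooseBetter(getSomeOther(next), cur))
// best == 18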

Preferred way of processing this data with parallel arrays

Imagine a sequence of java.io.File objects. The sequence is not in any particular order; it gets populated after a directory traversal. The names of the files can look like this:
/some/file.bin
/some/other_file_x1.bin
/some/other_file_x2.bin
/some/other_file_x3.bin
/some/other_file_x4.bin
/some/other_file_x5.bin
...
/some/x_file_part1.bin
/some/x_file_part2.bin
/some/x_file_part3.bin
/some/x_file_part4.bin
/some/x_file_part5.bin
...
/some/x_file_part10.bin
Basically, I can have 3 types of files. The first type is the simple one, which only has a .bin extension. The second type is formed from the parts _x1.bin through _x5.bin. And the third type can be formed of 10 smaller parts, from _part1 through _part10.
I know the naming may be strange, but this is what I have to work with :)
I want to group the files together (all the pieces of a file should be processed together), and I was thinking of using parallel arrays to do this. The thing I'm not sure about is how I can perform the reduce/accumulation part, since all the threads will be working on the same array.
val allBinFiles = allBins.toArray // array of java.io.File
I was thinking of handling it something like this:
// a concurrent map so that parallel threads can share it; note that the
// ListBuffer append itself is still not synchronized
val mapAcumulator = scala.collection.concurrent.TrieMap[String, ListBuffer[File]]()
allBinFiles.par.foreach { file =>
  file match {
    // for something like /some/x_file_x4.bin, nameTillPart will be /some/x_file
    case ComposedOf5Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    case ComposedOf10Name(nameTillPart) =>
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    // simple file, without any pieces
    case _ =>
      mapAcumulator.getOrElseUpdate(file.toString, new ListBuffer[File]()) += file
  }
}
I was thinking of doing it as shown in the code above: having extractors for the files and using part of the path as the key in the map. For example, /some/x_file can hold as values /some/x_file_x1.bin through /some/x_file_x5.bin. I also think there could be a better way of handling this; I would be interested in your opinions.
The alternative is to use groupBy:
val mp = allBinFiles.par.groupBy {
  case ComposedOf5Name(x) => x
  case ComposedOf10Name(x) => x
  case f => f.toString
}
This returns a parallel map of parallel arrays of files (ParMap[String, ParArray[File]]). If you want a sequential map of sequential sequences of files from this point:
val sqmp = mp.map { case (k, v) => (k, v.seq) }.seq
To ensure that the parallelism kicks in, make sure you have enough elements in your parallel array (10k+).
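The extractors ComposedOf5Name and ComposedOf10Name are referenced above but never defined; based on the naming patterns in the question, they might look like this (a hypothetical sketch):

import java.io.File

// hypothetical extractor: /some/other_file_x1.bin .. _x5.bin yields /some/other_file
object ComposedOf5Name {
  private val Pattern = """(.*)_x[1-5]\.bin""".r
  def unapply(f: File): Option[String] = Pattern.unapplySeq(f.toString).map(_.head)
}

// hypothetical extractor: /some/x_file_part1.bin .. _part10.bin yields /some/x_file
object ComposedOf10Name {
  private val Pattern = """(.*)_part(?:[1-9]|10)\.bin""".r
  def unapply(f: File): Option[String] = Pattern.unapplySeq(f.toString).map(_.head)
}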