Finding length of string in Scala - scala

I am a newbie to scala
I have a list of strings -
List[String] (“alpha”, “gamma”, “omega”, “zeta”, “beta”)
I want to count all the strings with length = 4
i.e I want to get output = 2.

You can do like this:
val data = List("alpha", "gamma", "omega", "zeta", "beta")
data.filter(x => x.length == 4).size
res8: Int = 2

You can just use count function as
val list = List[String] ("alpha", "gamma", "omega", "zeta", "beta")
println(list.count(x => x.length == 4))
//2 is printed
I hope the answer is helpful

Related

Spark Scala : how to slice an array in array of array

for an array of array:
val arrarr = Array(Array(0.37,1),Array(145.38,100),Array(149.37,100),Array(149.37,300),Array(119.37,5),Array(144.37,100))
For example, if the input value is 149.37, I want to do some sort of indexing to get 300. 149.37 occurs two times in arrarr(Array(149.37,100),Array(149.37,300). I want to return the last value using spark scala.
Can you please help? Thanks!
you can do it like this :
val result : Doulbe = arrarr.filter(_(0) == 149.37).last(1)
or
val result: Option[Double] = arrarr.reverse.find(_ (0) == 149.37).map(_ (1))
val index = arrarr.lastIndexWhere(x => x(0) == input)
val result = arrarr(index)(1)
Test it here

replace multiple occurrence of duplicate string in Scala with empty

I have a string as
something,'' something,nothing_something,op nothing_something,'' cat,cat
I want to achieve my output as
'' something,op nothing_something,cat
Is there any way to achieve it?
If I understand your requirement correctly, here's one approach with the following steps:
Split the input string by "," and create a list of indexed-CSVs and convert it to a Map
Generate 2-combinations of the indexed-CSVs
Check each of the indexed-CSV pairs and capture the index of any CSV which is contained within the other CSV
Since the CSVs corresponding to the captured indexes are contained within some other CSV, removing these indexes will result in remaining indexes we would like to keep
Use the remaining indexes to look up CSVs from the CSV Map and concatenate them back to a string
Here is sample code applying to a string with slightly more general comma-separated values:
val str = "cats,a cat,cat,there is a cat,my cat,cats,cat"
val csvIdxList = (Stream from 1).zip(str.split(",")).toList
val csvMap = csvIdxList.toMap
val csvPairs = csvIdxList.combinations(2).toList
val csvContainedIdx = csvPairs.collect{
case List(x, y) if x._2.contains(y._2) => y._1
case List(x, y) if y._2.contains(x._2) => x._1
}.
distinct
// csvContainedIdx: List[Int] = List(3, 6, 7, 2)
val csvToKeepIdx = (1 to csvIdxList.size) diff csvContainedIdx
// csvToKeepIdx: scala.collection.immutable.IndexedSeq[Int] = Vector(1, 4, 5)
val strDeduped = csvToKeepIdx.map( csvMap.getOrElse(_, "") ).mkString(",")
// strDeduped: String = cats,there is a cat,my cat
Applying the above to your sample string something,'' something,nothing_something,op nothing_something would yield the expected result:
strDeduped: String = '' something,op nothing_something
First create an Array of words separated by commas using split command on the given String, and do other operations using filter and mkString as below:
s.split(",").filter(_.contains(' ')).mkString(",")
In Scala REPL:
scala> val s = "something,'' something,nothing_something,op nothing_something"
s: String = something,'' something,nothing_something,op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res27: String = '' something,op nothing_something
As per Leo C comment, I tested it as below with some other String:
scala> val s = "something,'' something anything anything anything anything,nothing_something,op op op nothing_something"
s: String = something,'' something anything anything anything anything,nothing_something,op op op nothing_something
scala> s.split(",").filter(_.contains(' ')).mkString(",")
res43: String = '' something anything anything anything anything,op op op nothing_something

Compare two columns which is list with n number of values and save the data as a list in scala

I need to perform comparison operation (like greater than or less than) on two columns which is list with n number of values (values are nothing but timestamp) and my result should also be in list.
How can I do this operation?
Input:
Date1 Date2
["2016-11-24 12:06:47"] ["2017-10-04 03:30:23"]
["null"] []
["2017-01-25 10:07:25","2018-01-25 10:07:25"] ["2017-09-15 03:30:16","2017-09-15 03:30:16"]
Output should be:
Result
["Less"]
["Not Okay"]
["Less","Great"]
I need to perform comparison operation
It seems you are looking for the .compareTo operator:
scala> "a".compareTo("b")
res: Int = -1
scala> "a".compareTo("a")
res: Int = 0
scala> "b".compareTo("a")
res: Int = 1
Using the first example mentioned:
val date1 = "2016-11-24 12:06:47"
val date2 = "2017-10-04 03:30:23"
scala> date1.compareTo(date2)
res: Int = -1
If we ignore for a moment the "Not Okay" case, we could implement the "Less" or "Great" cases with a function like:
def compareLexicographically(s1: String, s2: String): String = s1.compareTo(s2) match {
case -1 => "Less"
case _ => "Great"
}
Looking at your example, I am assuming the rows are tuples of list of Strings:
val rows: List[(List[String], List[String])] =
List((
List("2016-11-24 12:06:47"),
List("2017-10-04 03:30:23")
),
(
List("2017-01-25 10:07:25", "2018-01-25 10:07:25"),
List("2017-09-15 03:30:16", "2017-09-15 03:30:16")
))
I would first zip the elements from the columns to get List[(String, String)]
rows.flatMap(r => r._1.zip(r._2))
Then simple map with compareLexicographically:
scala> rows.flatMap(r => r._1.zip(r._2)).map((compareLexicographically _).tupled)
res: List[String] = List(Less, Great, Great)

Splitting string in dataset Apache Spark

I am absolutely new in Spark
I have a txt dataset with cathegorical attributes, looking like this:
10000,5,0,1,0,0,5,3,2,2,1,0,1,0,4,3,0,2,0,0,1,0,0,0,0,10,0,1,0,1,0,1,4,2,2,3,0,2,0,2,1,4,3,0,0,0,3,1,0,3,22,0,3,0,1,0,1,0,0,0,5,0,2,1,1,0,11,1,0
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,1,0,2,1,1,0,5,1,0
10002,3,1,2,0,0,7,4,2,2,0,0,1,0,4,4,0,1,0,1,0,0,0,0,0,1,0,2,0,4,0,10,4,1,2,4,0,2,0,2,1,4,2,2,0,0,0,1,0,2,10,0,6,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10003,4,1,2,0,0,1,3,2,2,0,0,3,0,3,3,0,1,0,0,0,0,0,0,1,4,0,2,0,2,0,1,4,1,2,2,0,2,0,2,1,2,2,0,0,0,1,1,0,2,10,0,4,0,1,0,1,0,0,0,1,0,1,1,1,0,10,1,0
10004,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,22,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,5,6,0
10005,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,0,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10006,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,0,0,2,0,121,0,0,1,0,10,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,4,0,0
10007,4,1,2,0,0,6,0,2,2,0,0,4,0,5,5,0,2,1,0,0,0,0,0,0,9,0,2,0,0,0,11,4,1,2,3,0,2,0,2,1,2,3,1,0,0,0,1,0,3,10,0,1,0,1,0,1,0,0,0,0,0,2,1,1,0,11,1,0
10008,6,1,1,0,0,1,0,2,2,0,0,7,0,1,0,0,1,0,0,0,0,0,0,0,7,0,2,2,0,0,0,4,1,2,6,0,2,0,2,1,2,2,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,1,1,2,0,10,1,0
10009,3,1,12,0,0,1,0,2,2,0,0,0,0,3,0,0,1,0,0,0,0,0,0,0,4,0,2,2,4,0,0,2,1,2,6,0,2,0,2,1,0,2,2,0,0,0,3,0,2,10,0,6,1,1,1,0,0,0,1,0,0,1,1,2,0,8,1,1
10010,5,11,1,0,0,1,3,2,2,0,0,0,0,3,3,0,3,0,0,0,0,0,0,0,6,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,0,4,0,0,0,1,1,0,4,21,0,1,0,1,0,0,0,0,0,4,0,2,1,1,0,11,1,0
10011,4,0,1,0,0,1,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,0,0,0,7,0,2,0,0,0,1,4,1,2,1,0,2,0,2,1,3,2,1,0,0,1,1,0,2,10,0,1,0,1,0,1,0,0,0,2,0,2,1,1,0,10,1,0
10012,1,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,2,0,0,0,2,0,112,0,0,1,0,10,1,0,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0
10013,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,4,0,0,0,1,0,0,0,0,0,2,1,4,0,2,0,121,0,0,1,0,10,1,0,0,2,0,1,0,0,0,0,0,0,0,0,0,4,0,0
10014,3,11,1,0,0,6,4,2,2,0,0,1,0,2,2,0,0,1,0,0,0,0,0,0,3,0,2,0,3,0,1,4,2,2,5,0,2,0,1,2,4,2,10,0,0,1,1,0,2,10,0,5,0,1,0,1,0,0,0,3,0,1,1,1,0,7,1,0
10015,4,3,1,0,0,1,3,2,2,1,0,0,0,3,5,0,3,0,0,1,0,0,0,0,4,0,1,0,0,1,1,2,2,2,2,0,2,0,2,0,0,4,0,0,0,1,1,0,4,10,0,1,3,1,1,0,0,0,0,3,0,2,1,1,0,11,1,1
10016,4,11,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2,2,4,0,0,4,1,1,0,0,1,0,0,2,0,0,12,0,0,0,6,0,2,23,0,6,0,1,0,0,0,0,3,0,0,0,2,0,0,5,7,0
10017,7,1,1,0,0,0,0,2,2,0,0,3,0,0,0,0,0,0,0,1,1,0,1,0,0,0,2,2,0,0,0,4,1,2,0,0,2,0,2,1,4,0,1,0,0,0,6,0,2,10,0,1,0,1,0,1,0,0,3,0,0,0,2,2,0,6,6,0
The task is to get the number of strings, where numeral on 57th position
10001,6,1,1,0,0,7,5,2,2,0,0,3,0,1,1,0,1,0,0,0,0,1,0,0,4,0,2,0,0,0,1,4,1,2,2,0,2,0,2,2,4,2,1,0,0,1,1,0,2,10,0,1,0,1,0,((1)),0,0,0,1,0,2,1,1,0,5,1,0
is 1 . The problem is that strings are the elements of the RDD, so I need to split each string and make an array(x,y) to get the position i need.
I tried to use
val censusText = sc.textFile("USCensus1990.data.txt")
val splitRDD = censusText.map(line=>line.split(","))
but It didn't help
But I have no idea how to do it.
Can you please help me
You can try:
censusText.filter(l => l.split(",")(56) == "1").count
// res5: Long = 12
Or you can first split the RDD then do the filter / count:
val splitRDD = censusText.map(l => l.split(","))
splitRDD.filter(r => r(56) == "1").count
// res7: Long = 12

Sum values of PairRDD

I have an RDD of type:
dataset :org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionRDD[26]
Which is equivalent to (Pedro, 0.0833), (Hello, 0.001828) ...
I'd like to sum all the value , 0.0833+0.001828.. but I can't find a proper
solution.
Considering your input data, you can do the following :
// example
val datasets = sc.parallelize(List(("Pedro", 0.0833), ("Hello", 0.001828)))
datasets.map(_._2).sum()
// res3: Double = 0.085128
// or
datasets.map(_._2).reduce(_ + _)
// res4: Double = 0.085128
// or even
datasets.values.sum()
// res5: Double = 0.085128
like this?:
map(_._2).reduce((x, y) => x + y)
breakdown: map the tuple to just the double values, then reduce the RDD by summing.