Saving to a custom output format in Spark / Hadoop - scala

I have one RDD which contains multiple data structures, one of which is a Map[String, Int].
To visualize it easily I get the following after a map transformation:
val data = ... // This is a RDD[Map[String, Int]]
In one of the elements of this RDD, the Map contains the following:
*key value*
map_id -> 7753
Oscar -> 39
Jaden -> 13
Thomas -> 1
Chris -> 52
The other elements of the RDD contain other names and numbers, and each map contains its own map_id. Anyhow, if I simply do data.saveAsTextFile(path), I will get the following output in my file:
Map(map_id -> 7753, Oscar -> 39, Jaden -> 13, Thomas -> 1, Chris -> 52)
Map(...)
Map(...)
However, I would like to format it as the following:
---------------------------
map_id: 7753
---------------------------
Oscar: 39
Jaden: 13
Thomas: 1
Chris: 52
---------------------------
map_id: <some other id>
---------------------------
Name: nbr
Name2: nbr2
Basically, the map_id acts as a kind of header, followed by the contents, then one blank line, and then the next element.
To my question: the data RDD only seems to have two save options, as a text file or as an object file, and neither, as far as I can see, lets me customize the formatting. How could I go about doing this?

You can just map to String and write the result. For example:
def format(map: Map[String, Int]): String = {
  val id = map.get("map_id").map(_.toString).getOrElse("unknown")
  val content = map.collect {
    case (k, v) if k != "map_id" => s"$k: $v"
  }.mkString("\n")
  s"""|---------------------------
      |map_id: $id
      |---------------------------
      |$content
      |""".stripMargin
}

data.map(format(_)).saveAsTextFile(path)
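As a quick sanity check outside Spark, the same formatting logic can be exercised on a plain Scala Map. A minimal, self-contained sketch (the format function is repeated here so the snippet runs on its own, with a shortened sample map):

```scala
// Self-contained check of the formatting logic (no Spark required)
def format(map: Map[String, Int]): String = {
  val id = map.get("map_id").map(_.toString).getOrElse("unknown")
  val content = map.collect {
    case (k, v) if k != "map_id" => s"$k: $v"
  }.mkString("\n")
  s"""|---------------------------
      |map_id: $id
      |---------------------------
      |$content
      |""".stripMargin
}

val sample = Map("map_id" -> 7753, "Oscar" -> 39, "Jaden" -> 13)
println(format(sample))
```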

Related

Scala: find character occurrences from a file

Problem:
Suppose I have a text file containing data like
TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT
And I want to find occurrences of 'A', 'T', 'AAA', etc. in it.
My Approach
val source = scala.io.Source.fromFile(filePath)
val lines = source.getLines().filter(char => char != '\n')
for (line <- lines) {
  val aList = line.filter(ele => ele == 'A')
  println(aList)
}
This will give me output like
AAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAA
My Question
How can I find the total count of occurrences of 'A', 'T', 'AAA', etc. here? Can I use map/reduce functions for that? How?
There is even a shorter way:
lines.map(_.count(_ == 'A')).sum
This counts all A of each line, and sums up the result.
By the way there is no filter needed here:
val lines = source.getLines()
And as Leo C mentioned in his comment, if you start with Source.fromFile(filePath) it can be just like this:
source.count(_ == 'A')
As SoleQuantum mentions in his comment, he wants to call count more than once. The problem here is that source is a BufferedSource, which is not a Collection, but just an Iterator, which can only be used (iterated) once.
So if you want to use the source more than once, you have to turn it into a Collection first.
Your example:
val stream = Source.fromResource("yourdata").mkString
stream.count(_ == 'A') // 48
stream.count(_ == 'T') // 65
Remark: String is a Collection of Chars.
For more information check: iterators
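The single-use nature of an Iterator is easy to demonstrate. A minimal sketch:

```scala
// An Iterator is consumed by the first traversal
val it = Iterator('A', 'T', 'A', 'A')
val first = it.count(_ == 'A')   // 3
val second = it.count(_ == 'A')  // 0 -- the iterator is already exhausted
```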
And here is the solution to get the count for all Chars:
stream.toSeq
  .filterNot(_ == '\n')     // filter new lines
  .groupBy(identity)        // group by each char, e.g. T -> "TT...T", A -> "AA...A"
  .view.mapValues(_.length) // count each group
  .toMap                    // Map(T -> 65, A -> 48, G -> 36, C -> 61)
Or as suggested by jwvh:
stream
  .filterNot(_ == '\n')
  .groupMapReduce(identity)(_ => 1)(_ + _)
This is Scala 2.13, let me know if you have problems with your Scala version.
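For Scala versions before 2.13, where groupMapReduce is not available, the same per-character count can be sketched with groupBy (using a short inline stand-in for the file contents):

```scala
// stand-in for the String read from the file
val stream = "TATTGC\nAAGAGT"
val counts: Map[Char, Int] =
  stream
    .filterNot(_ == '\n')  // drop newlines
    .groupBy(identity)     // Char -> String of its occurrences
    .map { case (c, occurrences) => c -> occurrences.length }
```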
Ok after the last update of the question:
stream.toSeq
  .filterNot(_ == '\n')    // filter new lines
  .foldLeft(("", Map.empty[String, Int])) { case ((run, m), c) =>
    if (run.isEmpty || run.head == c)
      (run + c, m)         // same character: extend the current run
    else                   // different character: close the run, start a new one
      (s"$c", m.updated(run, m.get(run).map(_ + 1).getOrElse(1)))
  } match {                // close the final run; you only want the Map
    case (run, m) => m.updated(run, m.get(run).map(_ + 1).getOrElse(1))
  }
// e.g. HashMap(A -> 25, AA -> 4, AAA -> 5, CC -> 9, CCC -> 1, ...)
Short explanation:
The solution uses a foldLeft.
The initial value is a pair:
a String that holds the actual characters (none to start)
a Map with the Strings and their count (empty at the start)
We have 2 main cases:
the character is the same as the one in the current String: just add the character to the actual String.
the character is different: update the Map with the actual String; the new character now starts the actual String.
Quite complex, let me know if you need more help.
Since scala.io.Source.fromFile(filePath) produces a stream of characters, you can use the count(Char => Boolean) function directly on your source object.
val source = scala.io.Source.fromFile(filePath)
val result = source.count(_ == 'A')
You can use the partition method and then just call length on the first part of the result.
val y = x.partition(_ == 'A')._1.length
You can get the count by doing the following:
lines.flatten.filter(_ == 'A').size
In general regular expressions are a very good tool to find sequences of characters in a string.
You can use the r method, defined with an implicit conversion over strings, to turn a string into a pattern, e.g.
val pattern = "AAA".r
Using it is then fairly easy. Assuming your sample input
val input =
"""TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT"""
Counting the number of occurrences of a pattern is straightforward and very readable:
pattern.findAllIn(input).size // returns 4
The iterator returned by regular expressions operations can also be used for more complex operations using the matchData method, e.g. printing the index of each match:
pattern.            // this code would print the following lines:
  findAllIn(input). // 98
  matchData.        // 125
  map(_.start).     // 131
  foreach(println)  // 165
You can read more on Regex in Scala on the API docs (here for version 2.13.1)
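One caveat: findAllIn yields non-overlapping matches, so a run like "AAAA" counts as a single "AAA". If overlapping occurrences should be counted, a zero-width lookahead pattern can be used instead. A minimal sketch on a hypothetical input:

```scala
val sample = "AAAAGTAAA"
// non-overlapping: "AAA" matches at positions 0 and 6
val nonOverlapping = "AAA".r.findAllIn(sample).size     // 2
// overlapping: the lookahead matches at positions 0, 1 and 6
val overlapping = "(?=AAA)".r.findAllIn(sample).size    // 3
```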

how to sort each line on the file in scala?

I want a sorted list of the names, last name first, one per line. Names will be ordered by the number of characters in the first name, in ascending order, shortest first. Within each group of first names of the same length, they will be ordered by the number of characters in the last name, in ascending order, shortest first.
Example:
xxx xxxxx
xxx xxxxx
xxx xxxxxx
xxx xxxxxxx
xxxx xxx
xxxx xxxxx
xxxx xxxxxxx
I append the names to a list of lists like this:
List(List(Samantha, Sanderfur), List(Kathlene, Lamonica), List(Dixie, Crooker), List(Domitila, Rutigliano))
and I want to sort this list of lists. I don't know how I should sort it, or whether there is some other way to solve this problem.
This is how I would do it:
val input = List(List("aa","bbb"), List("a", "bb"), List("aaaa", "b"), List("aa", "bb"))
val tupleInput = input.map{case List(a,b) => (a,b)}
// List((aa,bbb), (a,bb), (aaaa,b), (aa,bb))
val sortedMapValues = tupleInput.groupBy(_._1).mapValues(_.sorted)
// Map(a -> List((a,bb)), aaaa -> List((aaaa,b)), aa -> List((aa,bb), (aa,bbb)))
val sortedMapKeys = scala.collection.immutable.TreeMap(sortedMapValues.toArray:_*)
// Map(a -> List((a,bb)), aa -> List((aa,bb), (aa,bbb)), aaaa -> List((aaaa,b)))
val result = sortedMapKeys.map{case (_, a) => a}
// result = List(List((a,bb)), List((aa,bb), (aa,bbb)), List((aaaa,b)))
You can play with it here
Another one-liner solution can be like this (here) - thanks to #Anupam Kumar (a little adjustment needs to be made to fit his solution to the required input):
val result = input.sortBy{case List(x,y) => (x.length(),y.length())}
Thanks #jwvh for making it even shorter.
Try the code below:
val names = List(("Jack","Wilson"),("Alex","Jao"),("Jack","Wildorsowman"),
("Jack","Wiliamson"),("Alex","Joan"),("Alex","J."))
println(names.sortBy( x => (x._1.length(),x._2.length())))
Result:
List((Alex,J.), (Alex,Jao), (Alex,Joan), (Jack,Wilson), (Jack,Wiliamson), (Jack,Wildorsowman))
UPDATED with suggestion from #GalNaor -
val names = List(List("Jack","Wilson"), List("Alex","Jao"), List("Jack","Wildorsowman"),
  List("Jack","Wiliamson"), List("Alex","Joan"), List("Alex","J."))
println(names.sortBy { case List(x, y) => (x.length(), y.length()) })
Result:
List(List(Alex, J.), List(Alex, Jao), List(Alex, Joan), List(Jack, Wilson), List(Jack, Wiliamson), List(Jack, Wildorsowman))
Thank you guys for helping me. I used sortWith to sort the list of lists.
Here is my code:
val s = v.sortWith((x, a) =>
  if (x(0).length == a(0).length)
    x(1).length < a(1).length
  else
    x(0).length < a(0).length
)
var z = ""
for (i <- s) {
  z = i(0) + " " + i(1)
  println(z)
}

Scala - have List of String, must split by comma, and put into Map

I have a List of the form: ("string,num", "string,num", ...)
I have found solutions online for how to do this with a single string, but have not been able to adapt it to a list of strings.
Also, numerical values should be cast to Int/Double before being mapped.
I would appreciate any help.
This is a perfect job for a fold
// Your input
val lines = List("a,1", "b,2", "gretzky,99", "tyler,11")
// Fold over the lines, and insert them into a map
val map = lines.foldLeft(Map.empty[String, Int]) {
  case (map, line) =>
    // Split the line on the comma and separate the two parts
    val Array(string, num) = line.split(",")
    // Add the new entry to the map
    map + (string -> num.toInt)
}
println(map)
Output:
Map(a -> 1, b -> 2, gretzky -> 99, tyler -> 11)
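An equivalent without the fold, for comparison: map each line to a key/value pair and collect with toMap (same input as above):

```scala
val lines = List("a,1", "b,2", "gretzky,99", "tyler,11")
val map = lines.map { line =>
  val Array(key, num) = line.split(",")
  key -> num.toInt
}.toMap
// Map(a -> 1, b -> 2, gretzky -> 99, tyler -> 11)
```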
Probably not the best way to do it, but it should meet your needs (note the .toInt to get numeric values):
yourlist.groupBy(_.split(',')(0)).mapValues(v => v(0).split(',')(1).toInt)

Scala - map function - Only returned last element of a Map

I am new to Scala and trying out the map function on a Map.
Here is my Map:
scala> val map1 = Map ("abc" -> 1, "efg" -> 2, "hij" -> 3)
map1: scala.collection.immutable.Map[String,Int] =
Map(abc -> 1, efg -> 2, hij -> 3)
Here is a map function and the result:
scala> val result1 = map1.map(kv => (kv._1.toUpperCase, kv._2))
result1: scala.collection.immutable.Map[String,Int] =
Map(ABC -> 1, EFG -> 2, HIJ -> 3)
Here is another map function and the result:
scala> val result1 = map1.map(kv => (kv._1.length, kv._2))
result1: scala.collection.immutable.Map[Int,Int] = Map(3 -> 3)
The first map function returns all the members as expected; however, the second map function returns only the last member of the Map. Can someone explain why this is happening?
Thanks in advance!
In Scala, a Map cannot have duplicate keys. When you add a new key -> value pair to a Map, if that key already exists, you overwrite the previous value. If you're creating maps from functional operations on collections, then you're going to end up with the value corresponding to the last instance of each unique key. In the example you wrote, each string key of the original map map1 has the same length, and so all your string keys produce the same integer key 3 for result1. What's happening under the hood to calculate result1 is:
A new, empty map is created
You map "abc" -> 1 to 3 -> 3 and add it to the map. Result now contains 1 -> 3.
You map "efg" -> 2 to 3 -> 2 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. Result now contains 2 -> 3.
You map "hij" -> 3 to 3 -> 3 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. Result now contains 3 -> 3.
Return the result, which is Map(3 -> 3).
Note: I made a simplifying assumption that the order of the elements in the map iterator is the same as the order you wrote in the declaration. The order is determined by hash bin and will probably not match the order you added elements, so don't build anything that relies on this assumption.
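If the intent is to keep every entry rather than collapse on the duplicate key, mapping to a list of pairs sidesteps the overwrite. A minimal sketch:

```scala
val map1 = Map("abc" -> 1, "efg" -> 2, "hij" -> 3)
// Convert to a List first, so duplicate keys survive as separate pairs
val pairs = map1.toList.map { case (k, v) => (k.length, v) }
// pairs has three (3, _) entries instead of one collapsed key
```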

Scala Saddle Filtering Column Values

I am new to Scala Saddle. I have three columns (customer name, age and Status) in a frame. I have to apply a filter on the age column: if a customer's age is more than 18, I need to set the Status to "eligible"; otherwise I need to put "noteligible".
Code:
f.col("age").filterAt(x => x > 18) //but how to update Status column
Frames are immutable containers, so it is probably better to build your frame with the values fully initialised than start with a partially initialised Frame.
import org.saddle._

object Test {
  def main(args: Array[String]): Unit = {
    val names: Vec[Any] = Vec("andy", "bruce", "cheryl", "dino", "edgar", "frank", "gollum", "harvey")
    val ages: Vec[Any] = Vec(4, 89, 7, 21, 14, 18, 23004, 65)

    def status(age: Any): Any = if (age.asInstanceOf[Int] >= 18) "eligible" else "noteligible"

    def mapper(indexAge: (Int, Any)): (Int, _) = indexAge match {
      case (index, age) => (index, status(age))
    }

    val nameAge: Frame[Int, String, Any] = Frame("name" -> names, "age" -> ages)
    val ageCol: Series[Int, Any] = nameAge.colAt(1)
    val eligible: Series[Int, Any] = ageCol.map(mapper)
    println("" + nameAge)
    println("" + eligible)

    val nameAgeStatus: Frame[Int, String, _] = nameAge.joinSPreserveColIx(eligible, how = index.LeftJoin, "status")
    println("" + nameAgeStatus)
  }
}
If you really need to start from a partially initialised Frame, you can always drop the uninitialised column and add it back with the correctly calculated values.
Although I would prefer to strongly type the data columns, I think a Frame only contains data of one type, and the common type for "Int" and "String" is "Any". This also affects the type signatures of the methods, although you might want to inline them without the type information anyway.
I found that looking at the scaladoc helped a lot.
This is the output from the final println call:
[8 x 3]
name age status
------ ----- -----------
0 -> andy 4 noteligible
1 -> bruce 89 eligible
2 -> cheryl 7 noteligible
3 -> dino 21 eligible
4 -> edgar 14 noteligible
5 -> frank 18 eligible
6 -> gollum 23004 eligible
7 -> harvey 65 eligible