Read input from input file given in specified format in scala - scala

I am given a input file in following format
(a,[b,c,d])
(b,[d,a])
How can format this input to get values in form
key => List()
Following code is used to split lines on basis of space.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
How to store this kind of formatted input ?

To tackle this I started with multiple data elements with and without whitespace separation.
%> cat junk.txt
(a,[b,c,d,e]) (w,[x,y,z])
(q,[wert,cv])(xx,[aa])
Then I opened the file and split the input on every leading ( paren without consuming the character.
val input = io.Source.fromFile("junk.txt")
.getLines()
.flatMap(_.split("(?=\\()"))
I also need a way to recognize the pattern I'm looking for.
val dataRE = "\\(([^,]+),\\[([^\\]]+)]".r.unanchored
Now to transform the data from Strings to Maps.
input.collect{case dataRE(k,v) => k -> v.split(",").toList}.toMap
Result: Map[String,List[String]]
Map(a -> List(b, c, d, e), w -> List(x, y, z), q -> List(wert, cv), xx -> List(aa))

Related

calculate cosine similarity in scala

I have a file (tags.csv) that contains UserId, MovieId,tags.I want to use a domain-based method to calculate the cosine similarity between tags. I want to show the relevant tags for comedy only and measure similarity for each tag relevant to the comedy tag.
dataset
My code is:
val rows = sc.textFile("/usr/local/comedy")
val vecData = rows.map(line => Vectors.dense(line.split(", ").map(_.toDouble)))
val mat = new RowMatrix(vecData)
val exact = mat.columnSimilarities()
val approx = mat.columnSimilarities(0.07)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
val MAE = exactEntries.leftOuterJoin(approxEntries).values.map {
case (u, Some(v)) =>
math.abs(u - v)
case (u, None) =>
math.abs(u)
}.mean()
but this error appear:
java.lang.NumberFormatException: For input string: "[1,898,"black comedy"]"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
What's wrong?
The error message is full of pertinent info.
NumberFormatException: For input string: "[1,898,"black comedy"]"
It looks like the input String isn't being split into separate column data. So .split(", ") isn't doing its job and it's easy to see why, there are no comma-space sequences to split on.
We could take out the space and split on just the comma but that would still leave a non-digit [ in the 1st column data and the 3rd column data has no digit characters at all.
There are a few different ways to attack this. I'd be tempted to use a regex parser.
val twoNums = "(\\d+),(\\d+),".r.unanchored
val vecData = rows.collect{ case twoNums(a, b) =>
Vectors.dense(Array(a.toDouble, b.toDouble))
}

Reading one file as Map(K,V) and pass V as keys while reading the second file as Map

I have two files. One is a text file and another one is CSV. I want to read the text file as Map(keys, values) and pass these values from the first file as key in Map when I read the second file (CSV file).
I am able to read the first file and get Map(key, value). From this Map, I have extracted the values and passed these values as keys in the second file but didn't get the desired result.
first file - text file
sdp:field(0)
meterNumber:field(1)
date:field(2)
time:field(3)
value:field(4),field(5),field(6),field(7),field(8),field(9),
field(10),field(11),field(12),field(13),field(14),
field(15),field(16),field(17)
second file - csv file
SDP,METERNO,READINGDATE,TIME,Reset Count.,Kilowatt-Hour Last Reset .,Kilowatt-Hour Rate A Last Reset.,Kilowatt-Hour Rate B Last Reset.,Kilowatt-Hour Rate C Last Reset.,Max Kilowatt Rate A Last Reset.,Max Kilowatt Rate B Last Reset.,Max Kilowatt Rate C Last Reset.,Accumulate Kilowatt Rate A Current.,Accumulate Kilowatt Rate B Current.,Accumulate Kilowatt Rate C Current.,Total Kilovar-Hour Last Reset.,Max Kilovar Last Reset.,Accumulate Kilovar Last Reset.
9000000001,500001,02-09-2018,00:00:00,2,48.958,8.319333333,24.31933333,16.31933333,6,24,15,10,9,6,48.958,41,40
this is what I have done to read the first file.
val lines = scala.io.Source.fromFile("D:\\JSON_READER\\dailymapping.txt", "UTF8")
.getLines
.map(line=>line.split(":"))
.map(fields => (fields(0),fields(1))).toMap;
val sdp = lines.get("sdp").get;
val meterNumber = lines.get("meterNumber").get;
val date = lines.get("date").get;
val time = lines.get("time").get;
val values = lines.get("value").get;
now I can see sdp has field(0), meterNumber has field(1), date has field(2), time has field(3) and values has field(4) .. to field(17).
Second file which I m reading using below code
val keyValuePairs = scala.io.Source.fromFile("D:\\JSON_READER\\Daily.csv")
.getLines.drop(1).map(_.stripLineEnd.split(",", -1))
.map{field => ((field(0),field(1),field(2),field(3)) -> (field(4),field(5)))}.toList
val map = Map(keyValuePairs : _*)
System.out.println(map);
above code giving me the following output which is desired output.
Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
But I want to replace field(0), field(1), field(2), field(3) with sdp, meterNumber, date, time in the above code. So, I don't have to mention keys when I read the second file, keys will come from the first file.
I tried to replace but I got below output which is not desired output.
Map((field(0),field(1),field(2),field(3)) -> (,))
Can somebody please guide me on how can I achieve the desired output.
This might get you close to what you're after. The first Map is used to lookup the correct index into the CSV data.
val fieldRE = raw"field\((\d+)\)".r
val idx = io.Source
.fromFile(<txt_file>, "UTF8")
.getLines
.map(_.split(":"))
.flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
.split(",")
.map(fields(0) -> _.toInt))
.toMap
val resMap = io.Source
.fromFile(<csv_file>)
.getLines
.drop(1)
.map(_.stripLineEnd.split(",", -1))
.map{ fld =>
(fld(idx("sdp")),fld(idx("meterNumber")),fld(idx("date")),fld(idx("time"))) ->
(fld(4),fld(5)) }
.toMap
//resMap: Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
UPDATE
Changing the Map of (String identifiers -> Int index values) into a Map of (String identifiers -> collection of Int index values) can be done. I'm not sure what that buys you, but it's doable.
val fieldRE = raw"field\((\d+)\)".r
val idx = io.Source
.fromFile(<txt_file>, "UTF8")
.getLines
.map(_.split(":"))
.flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
.split(",")
.map(fields(0) -> _.toInt))
.foldLeft(Map[String,Seq[Int]]()){ case (m,(k,v)) =>
m + (k -> (m.getOrElse(k,Seq()) :+ v))
}
val resMap = io.Source
.fromFile(<csv_file>)
.getLines
.drop(1)
.map(_.stripLineEnd.split(",", -1))
.map{fld => (fld(idx("sdp").head)
,fld(idx("meterNumber").head)
,fld(idx("date").head)
,fld(idx("time").head)) -> (fld(4),fld(5))}
.toMap

Scala File lines to Map

I'm Currently opening my files and utilizing .getLines to retrieve each lines from the file with a word and its phonetic pronunciation separated by two white spaces, i'm confused as to how would i go about Mapping the word and its pronunciation in Scala as i'm fairly new to the language.
i've previously though to utilize split and separate the words and their sounds into different lines,but, i'm lost
Currently i Started with
def words(filename: String, word: String): Unit = {
val file = Source.fromFile(filename).getLines().drop(56)
for(x <- file){
}
}
EX:
ARTI AA1 R T IY2
AASE AA1 S
ABAIR AH0 B EH1 R
AB AE1 B
Result:
Map("AARTI -> "AA1 R T IY2","AASE" -> "AA1 S", "ABAIR" -> " AH0 B EH1 R")
iterate each line
split by 2 white spaces " "
create a tuple of (a -> b)
convert Array[Tuple[A, B]] => Map[A, B]
example,
val data =
"""
ARTI AA1 R T IY2
AASE AA1 S
ABAIR AH0 B EH1 R
AB AE1 B
""".stripMargin
val lines: Array[String] = data.split("\n").filter(_.trim.nonEmpty)
// if you are reading from file
// val lines = Source.fromFile("src/test/resources/my_filename.txt").getLines()
val res: Array[Tuple2[String, String]] = lines.map { line =>
line.split(" ") match { case Array(a, b) => a -> b }
}
println(res.toMap)
output:
Map(ARTI -> AA1 R T IY2, AASE -> AA1 S, ABAIR -> AH0 B EH1 R, AB -> AE1 B)
Running example - https://scastie.scala-lang.org/prayagupd/jBCnEhUPQJCMPKP9TXlgWA
How to read entire file in Scala?
If your lines are in file then this will create a Map from the first word to the rest of the string:
val res: Map[String, String] = file.map(_.span(_.isLetter))(collection.breakOut)
The values in the Map will contain leading space characters so you may want to call trim on them before using them.
The map call processes each line in turn.
The span method splits the line into a tuple where the first value is your word and the second is the rest of the line.
Using collection.breakOut tells map to put the results directly into a Map rather than going through an intermediate array or list.

Multiline Spark sliding window

I am learning Apache Spark with Scala and would like to use it to process a DNA data set that spans multiple lines like this:
ATGTAT
ACATAT
ATATAT
I want to map this into groups of a fixed size k and count the groups. So for k=3, we would get groups of each character with the next two characters:
ATG TGT GTA TAT ATA TAC
ACA CAT ATA TAT ATA TAT
ATA TAT ATA TAT
...then count the groups (like word count):
(ATA,5), (TAT,5), (TAC,1), (ACA,1), (CAT,1), (ATG,1), (TGT,1), (GTA,1)
The problem is that the "words" span multiple lines, as does TAC in the example above. It spans the line wrap. I don't want to just count the groups in each line, but in the whole file, ignoring line endings.
In other words, I want to process the entire sequence as a sliding window of width k over the entire file as though there were no line breaks. The problem is looking ahead (or back) to the next RDD row to complete a window when I get to the end of a line.
Two ideas I had were:
Append k-1 characters from the next line:
ATATATAC
ACATATAT
ATATAT
I tried this with the Spark SQL lead() function, but when I tried executing a flatMap, I got a NotSerializableException for WindowSpec. Is there any other way to reference the next line? Would I need to write a custom input format?
Read the entire sequence in as a single line (or join lines after reading):
ATATATACATATATATAT
Is there a way to read multiple lines so they can be processed as one? If so, would it all need to fit into the memory of a single machine?
I realize either of these could be done as a pre-processing step. I was wondering the best way is to do it within Spark. Once I have it in either of these formats, I know how to do the rest, but I am stuck here.
You can make a rdd of single character string instead of join them as one line, since that will make the result a string which can not be distributed:
val rdd = sc.textFile("gene.txt")
// rdd: org.apache.spark.rdd.RDD[String] = gene.txt MapPartitionsRDD[4] at textFile at <console>:24
So simply use flatMap to split the lines into List of characters:
rdd.flatMap(_.split("")).collect
// res4: Array[String] = Array(A, T, G, T, A, T, A, C, A, T, A, T, A, T, A, T, A, T)
A more complete solution borrowed from this answer:
val rdd = sc.textFile("gene.txt")
// create the sliding 3 grams for each partition and record the edges
val rdd1 = rdd.flatMap(_.split("")).mapPartitionsWithIndex((i, iter) => {
val slideList = iter.toList.sliding(3).toList
Iterator((slideList, (slideList.head, slideList.last)))
})
// collect the edge values, concatenate edges from adjacent partitions and broadcast it
val edgeValues = rdd1.values.collect
val sewedEdges = edgeValues zip edgeValues.tail map { case (x, y) => {
(x._2 ++ y._1).drop(1).dropRight(1).sliding(3).toList
}}
val sewedEdgesMap = sc.broadcast(
(0 until rdd1.partitions.size) zip sewedEdges toMap
)
// sew the edge values back to the result
rdd1.keys.mapPartitionsWithIndex((i, iter) => iter ++ List(sewedEdgesMap.value.getOrElse(i, Nil))).
flatMap(_.map(_ mkString "")).collect
// res54: Array[String] = Array(ATG, TGT, GTA, TAT, ATA, TAC, ACA, CAT, ATA, TAT, ATA, TAT, ATA, TAT, ATA, TAT)

How to create a map from a RDD[String] using scala?

My file is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows & 5 columns(0,1,2,3,4)
I want the output as,
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be the type of [Map[Int,Set[String]]]
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
Split each row into an array of strings, then change this into an
array of sets of the single string value for each column - this can
all be done with one map, and distributed.
Now reduce this using an
operation that merges the set for each column in turn. This also can
be distributed
turn the single row that results into a Map
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
Zip the list with the index, since that's what we need as the key of
the map.
The elements of each tuple are unfortunately in the wrong
order for .toMap, so swap them.
Then we have a list of (key, value)
pairs which .toMap will turn into the desired result.
These don't need to change AT ALL to work with Spark. We just need to use a RDD, instead of the List. Let's convert data into an RDD just to demo this:
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
val rdd = sc.makeRDD(data)
val row = rdd.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
(This can be converted into a Map as before)
An earlier oneliner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row)
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
Maybe this do the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val b = new Array[Map[String, Set[String]]](5)
for (i <- 0 to 4)
b(i) = Map(i.toString -> (Set() ++ (for (s <- a) yield s.split(",")(i))) )
println(b.mkString("\n"))