How to read CSV columns into vectors in Scala

I have a CSV file:
[screenshot: my CSV file]
I want to create a map as such:
Map(A -> Vector(10.75, 10.75, 10.47, ...), B -> Vector(164.56, 164.99, 160.98, ...), C -> Vector(7.1, 7.4, 9.4, ...), D -> Vector(14.2, 14.8, 18.8, ...))
This is what I have so far (not much):
val source = Source.fromFile("train3.csv")
var firstLine = source.getLines.find(_ => true).get
println(firstLine)
source.getLines().foreach { line =>
  val cols = line.split(",").map(_.trim).toVector
  println(cols)
}
source.close()
This code prints this:
[screenshot: what my code prints]
My Vectors contain the rows; I need them to contain the columns of the CSV, not the rows.

First off, please don't post important text information as an image. We can't copy-paste from images.
I created a smaller test file as follows:
A,B,C
10.75,10.75,10.47
164.56,164.99,160.98
7.1,7.4,9.4
14.2,14.8,18.8
With that in place I ran it through this code.
util.Using(io.Source.fromFile("train3.csv")) {
  _.getLines()
   .map(_.split(","))
   .toVector
   .transpose
   .groupMapReduce(_.head)(_.tail.map(_.toFloat))(_ ++ _)
} //file is auto-closed
//res0: Try[Map[String,Vector[Float]]] =
// Success(Map(A -> Vector(10.75, 164.56, 7.1, 14.2)
// , B -> Vector(10.75, 164.99, 7.4, 14.8)
// , C -> Vector(10.47, 160.98, 9.4, 18.8)))
This will return Failure() if the file can't be opened, or the row lengths aren't consistent, or the number strings can't be converted. (Which might not be required. I just assumed it would be desirable.)
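Since util.Using wraps the result in a Try, one way to consume it is to pattern match. A minimal sketch, assuming the block above is bound to a val named result (my name, not part of the original answer):

import scala.util.{Failure, Success}

result match {
  case Success(columns) => println(columns("A"))  // Vector(10.75, 164.56, 7.1, 14.2)
  case Failure(err)     => println(s"Could not read train3.csv: ${err.getMessage}")
}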

Once you have read the lines, and have something like
val lines: Seq[Seq[String]] = ???, just do
val columns = "ABC..."
lines
.iterator
.map { row => columns.zip(row) }
.groupBy { _._1 }
.mapValues { _._2 }
.toMap
If you want the values specifically to be Vectors for some weird reason, just add .toVector after .map(_._2).
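For completeness, a minimal sketch of the same idea wired up to the file from the question, taking the labels from the header row instead of a hard-coded string (only the file name train3.csv comes from the question; the val names here are mine):

import scala.io.Source

val source  = Source.fromFile("train3.csv")
val allRows = source.getLines().map(_.split(",").map(_.trim).toVector).toVector
source.close()

val header = allRows.head   // e.g. Vector("A", "B", "C", "D")
val body   = allRows.tail

val columnMap: Map[String, Vector[String]] =
  body
    .flatMap(row => header.zip(row))   // pair each cell with its column name
    .groupBy(_._1)                     // group by column name
    .map { case (name, pairs) => name -> pairs.map(_._2) }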

Related

Reading one file as Map(K,V) and pass V as keys while reading the second file as Map

I have two files. One is a text file and another one is CSV. I want to read the text file as Map(keys, values) and pass these values from the first file as key in Map when I read the second file (CSV file).
I am able to read the first file and get Map(key, value). From this Map, I have extracted the values and passed these values as keys in the second file but didn't get the desired result.
first file - text file
sdp:field(0)
meterNumber:field(1)
date:field(2)
time:field(3)
value:field(4),field(5),field(6),field(7),field(8),field(9),
field(10),field(11),field(12),field(13),field(14),
field(15),field(16),field(17)
second file - csv file
SDP,METERNO,READINGDATE,TIME,Reset Count.,Kilowatt-Hour Last Reset .,Kilowatt-Hour Rate A Last Reset.,Kilowatt-Hour Rate B Last Reset.,Kilowatt-Hour Rate C Last Reset.,Max Kilowatt Rate A Last Reset.,Max Kilowatt Rate B Last Reset.,Max Kilowatt Rate C Last Reset.,Accumulate Kilowatt Rate A Current.,Accumulate Kilowatt Rate B Current.,Accumulate Kilowatt Rate C Current.,Total Kilovar-Hour Last Reset.,Max Kilovar Last Reset.,Accumulate Kilovar Last Reset.
9000000001,500001,02-09-2018,00:00:00,2,48.958,8.319333333,24.31933333,16.31933333,6,24,15,10,9,6,48.958,41,40
This is what I have done to read the first file:
val lines = scala.io.Source.fromFile("D:\\JSON_READER\\dailymapping.txt", "UTF8")
  .getLines
  .map(line => line.split(":"))
  .map(fields => (fields(0), fields(1))).toMap;
val sdp = lines.get("sdp").get;
val meterNumber = lines.get("meterNumber").get;
val date = lines.get("date").get;
val time = lines.get("time").get;
val values = lines.get("value").get;
now I can see sdp has field(0), meterNumber has field(1), date has field(2), time has field(3) and values has field(4) .. to field(17).
This is the second file, which I'm reading using the code below:
val keyValuePairs = scala.io.Source.fromFile("D:\\JSON_READER\\Daily.csv")
.getLines.drop(1).map(_.stripLineEnd.split(",", -1))
.map{field => ((field(0),field(1),field(2),field(3)) -> (field(4),field(5)))}.toList
val map = Map(keyValuePairs : _*)
System.out.println(map);
The above code gives me the following output, which is the desired output:
Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
But I want to replace field(0), field(1), field(2), field(3) with sdp, meterNumber, date, time in the above code, so I don't have to hard-code the keys when I read the second file; the keys should come from the first file.
I tried to replace them, but I got the output below, which is not the desired output.
Map((field(0),field(1),field(2),field(3)) -> (,))
Can somebody please guide me on how I can achieve the desired output?
This might get you close to what you're after. The first Map is used to lookup the correct index into the CSV data.
val fieldRE = raw"field\((\d+)\)".r

val idx = io.Source
            .fromFile(<txt_file>, "UTF8")
            .getLines
            .map(_.split(":"))
            .flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
                                      .split(",")
                                      .map(fields(0) -> _.toInt))
            .toMap
val resMap = io.Source
               .fromFile(<csv_file>)
               .getLines
               .drop(1)
               .map(_.stripLineEnd.split(",", -1))
               .map { fld =>
                 (fld(idx("sdp")), fld(idx("meterNumber")), fld(idx("date")), fld(idx("time"))) ->
                   (fld(4), fld(5))
               }
               .toMap
//resMap: Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
UPDATE
Changing the Map of (String identifiers -> Int index values) into a Map of (String identifiers -> collection of Int index values) can be done. I'm not sure what that buys you, but it's doable.
val fieldRE = raw"field\((\d+)\)".r

val idx = io.Source
            .fromFile(<txt_file>, "UTF8")
            .getLines
            .map(_.split(":"))
            .flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
                                      .split(",")
                                      .map(fields(0) -> _.toInt))
            .foldLeft(Map[String, Seq[Int]]()) { case (m, (k, v)) =>
              m + (k -> (m.getOrElse(k, Seq()) :+ v))
            }
val resMap = io.Source
               .fromFile(<csv_file>)
               .getLines
               .drop(1)
               .map(_.stripLineEnd.split(",", -1))
               .map { fld =>
                 (fld(idx("sdp").head)
                 ,fld(idx("meterNumber").head)
                 ,fld(idx("date").head)
                 ,fld(idx("time").head)) -> (fld(4), fld(5))
               }
               .toMap
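One thing the Seq-valued map does buy you: the value entry holds all of field(4) through field(17), so it can pull every reading column from a row at once. A small hypothetical sketch against the sample row from the question, reusing the idx map built above:

val sampleRow =
  "9000000001,500001,02-09-2018,00:00:00,2,48.958,8.319333333,24.31933333,16.31933333,6,24,15,10,9,6,48.958,41,40"
    .split(",", -1)

val readings = idx("value").map(sampleRow(_))   // all of field(4) .. field(17), as strings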

How to create Map[Int,Set[String]] in scala by reading a CSV file?

I want to create Map[Int,Set[String]] in scala by reading input from a CSV file.
My file.csv is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
I want the output as,
var Attributes: Map[Int,Set[String]] = Map()
Attributes += (0 -> Set("sunny","overcast","rainy"))
Attributes += (1 -> Set("hot","mild","cool"))
Attributes += (2 -> Set("high","normal"))
Attributes += (3 -> Set("false","true"))
Attributes += (4 -> Set("yes","no"))
Here 0,1,2,3,4 represent the column numbers, and each Set contains the distinct values in that column.
I want to add each (Int -> Set[String]) entry to my map Attributes, i.e. if we print Attributes.size, it displays 5 (in this case).
Use one of the existing answers to read in the CSV file. You'll have a two dimensional array or vector of strings. Then build your map.
// row vectors
val rows = io.Source.fromFile("file.csv").getLines.map(_.split(",")).toVector
// column vectors
val cols = rows.transpose
// convert each vector to a set
val sets = cols.map(_.toSet)
// convert vector of sets to map
val attr = sets.zipWithIndex.map(_.swap).toMap
The last line is a bit ugly because zipWithIndex produces (value, index) pairs, which are in the wrong order for .toMap, hence the swap. You could also write
val attr = Vector.tabulate(sets.size)(i => (i, sets(i))).toMap
Or you could do the last two steps in one go:
val attr = cols.zipWithIndex.map { case (xs, i) =>
  (i, xs.toSet)
} (collection.breakOut): Map[Int,Set[String]]
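As an aside (an assumption about your setup, not part of the original answer): collection.breakOut was removed in Scala 2.13, so on 2.13+ a rough equivalent of that last variant is:

val attr: Map[Int, Set[String]] =
  cols.zipWithIndex.map { case (xs, i) => i -> xs.toSet }.toMap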

How to create a map from a RDD[String] using scala?

My file is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows and 5 columns (0,1,2,3,4).
I want the output as,
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be of type Map[Int,Set[String]].
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
1. Split each row into an array of strings, then change this into an array of sets, each holding the single string value for one column - this can all be done with one map, and distributed.
2. Now reduce this using an operation that merges the set for each column in turn. This also can be distributed.
3. Turn the single row that results into a Map.
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
  "sunny,hot,high,FALSE,no",
  "sunny,hot,high,TRUE,no",
  "overcast,hot,high,FALSE,yes",
  "rainy,mild,high,FALSE,yes",
  "rainy,cool,normal,FALSE,yes",
  "rainy,cool,normal,TRUE,no",
  "overcast,cool,normal,TRUE,yes")

val row = data.map(_.split("\\W+").map(s => Set(s)))
              .reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
- Zip the list with the index, since that's what we need as the key of the map.
- The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
- Then we have a list of (key, value) pairs which .toMap will turn into the desired result.
These don't need to change AT ALL to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
val rdd = sc.makeRDD(data)
val row = rdd.map(_.split("\\W+").map(s => Set(s)))
             .reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }
(This can be converted into a Map as before)
An earlier one-liner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row):
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
- Split each string into words.
- Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
- Convert each of those to a set.
Maybe this will do the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val b = new Array[Map[String, Set[String]]](5)
for (i <- 0 to 4)
  b(i) = Map(i.toString -> (Set() ++ (for (s <- a) yield s.split(",")(i))))
println(b.mkString("\n"))

Removing duplicates

I would like to remove duplicates from my data in my CSV file.
The first column is the year, and the second is the sentence. I would like to remove any duplicates of a sentence, regardless of the year information.
Is there a command that I can insert in val text = { } to remove these dupes?
My script is:
val source = CSVFile("science.csv");
val text = {
source ~>
Column(2) ~>
TokenizeWith(tokenizer) ~>
TermCounter() ~>
TermMinimumDocumentCountFilter(30) ~>
TermDynamicStopListFilter(10) ~>
DocumentMinimumLengthFilter(5)
}
Thank you!
Essentially you want a version of distinct where you can specify what makes an object (row) unique (the second column).
Given the code: (modified SeqLike.distinct)
import scala.collection.mutable

type Row = (Int, String)

def distinct(rows: Seq[Row], f: Row => AnyRef): Seq[Row] = {
  val b = Seq.newBuilder[Row]
  val seen = mutable.HashSet[AnyRef]()
  for (x <- rows) {
    val key = f(x)
    if (!seen(key)) {
      b += x
      seen += key
    }
  }
  b.result()
}
If you had a list of rows (where a row is a tuple) you could get the filtered/unique ones based on the second column with
distinct(rows, _._2)
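For example, with a couple of made-up rows (the sentence in the second position decides uniqueness):

val rows = Seq(
  (2001, "the cat sat"),
  (2003, "the cat sat"),          // same sentence, later year: dropped
  (2002, "a different sentence"))

distinct(rows, _._2)
// Seq((2001, "the cat sat"), (2002, "a different sentence"))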
Do you need to have your code reproducible? If not, then in Excel, click on the "Data" tab, click the little box directly above "1" and to the left of "A" to highlight everything, and click "Remove Duplicates". Make sure "My data has headers" is selected if you have headers, then untick the column that has the years, keeping only the column that has the sentence checked. This will remove duplicate sentences but keep the first instance of each year occurring.
As sets naturally eliminate duplicates, a simple approach would be to fill the rows into a TreeSet, using a custom ordering which only takes into account the text part of each row.
Update
Here is a sample script to demonstrate the above:
import collection.immutable.TreeSet
import scala.io.Source
val lines = Source.fromFile("science.csv").getLines()
val uniques = lines.foldLeft(TreeSet[String]()(Ordering.by(_.split(',')(1)))) {
  (s, l) =>
    if (s contains l) s
    else s + l
}
uniques.toList.sorted foreach println
The script folds the sequence of lines into a treeset with a custom ordering based on the 2nd part of the comma-separated line. The simplest fold function would be (s, l) => s + l; however, that would result in the lines with later year overwriting lines with the same text of earlier years. This is why I had to test for containment first.
Now we are almost ready; we just need to reorder the collection by year again before printing (this assumes the input was ordered by year).

Storing the contents of a file in an immutable Map in scala

I am trying to implement a simple word count in Scala using an immutable map (this is intentional), and the way I am trying to accomplish it is as follows:
Create an empty immutable map
Create a scanner that reads through the file.
While the scanner.hasNext() is true:
Check if the Map contains the word; if it doesn't, initialize the count to zero
Create a new entry with the key=word and the value=count+1
Update the map
At the end of the iteration, the map is populated with all the values.
My code is as follows:
val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while (input.hasNext()) {
  val token = input.next()
  val currentCount = wordMap.getOrElse(token,0) + 1
  val wordMap = wordMap + (token,currentCount)
}
The idea is that wordMap will have all the word counts at the end of the iteration...
Whenever I try to run this snippet, I get the following error:
recursive value wordMap needs type.
Can somebody point out why I am getting this exception and what can I do to remedy it?
Thanks
val wordMap = wordMap + (token,currentCount)
This line is redefining an already-defined variable. If you want to do this, you need to define wordMap with var and then just use
wordMap = wordMap + (token,currentCount)
Though how about this instead?
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
.toStream // make a Stream so we can groupBy
.groupBy(_._1).mapValues(_.map(_._2).sum) // combine all the per-line counts
.toList
Note that the per-line pre-aggregation is used to try and reduce the memory required. Counting across the entire file at once might be too big.
If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala's parallel collections or Hadoop (using one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
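As a rough sketch of the parallel-collections route (my addition, not part of the original answer; it assumes Scala 2.13 with the scala-parallel-collections module on the classpath - on 2.12 and earlier, .par works without the import):

import scala.collection.parallel.CollectionConverters._

val counts: Map[String, Int] =
  io.Source.fromFile("textfile.txt")
    .getLines()
    .toVector                       // materialize so the work can be split across threads
    .par                            // switch to a parallel collection
    .flatMap(_.split("\\s+"))
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.size }
    .seq
    .toMap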
EDIT: Detailed explanation:
Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:
val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)
Now identity is a function that just returns its argument, so if we groupBy(identity), we map each distinct word type to all of its word tokens:
val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))
And finally, we want to count up the number of tokens for each type:
val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)
Since we map this over all the lines in the file, we end up with token counts for each line.
So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
Assume the file is:
a b c b
b c d d d
e f c
Then we get:
val countsByLine =
io.Source.fromFile("textfile.txt") // read from the file
.getLines.flatMap{ line => // for each line
line.split("\\s+") // split the line into tokens
.groupBy(identity).mapValues(_.size) // count each token in the line
} // this produces an iterator of token counts
println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))
So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. Then the groupBy groups all counts of the same word type together.
val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))
And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:
val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList)
// List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))
And there you have it.
You have a few mistakes: you've defined wordMap twice (val is to declare a value). Also, Map is immutable, so you either have to declare it as a var or use a mutable map (I suggest the former).
Try this:
var wordMap = Map.empty[String,Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while (input.hasNext()) {
  val token = input.next()
  wordMap += token -> (wordMap(token) + 1)
}
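Since the question specifically asks to keep the Map immutable, here is a minimal sketch without the var, folding over the tokens instead (my addition; it assumes whitespace-separated tokens, much like the Scanner default):

val counts: Map[String, Int] =
  io.Source.fromFile("textfile.txt")
    .getLines()
    .flatMap(_.split("\\s+"))
    .foldLeft(Map.empty[String, Int]) { (m, token) =>
      m + (token -> (m.getOrElse(token, 0) + 1))   // bump this token's count in a new map
    }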