Convert csv to RDD - scala

I tried the accepted solution in "How do I convert csv file to rdd". I want to print out all the users except "om":
val csv = sc.textFile("file.csv") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim)) //lines in rows
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header with the first line
val rows = data.filter(line => header(line,"user") != "om") // filter the header out
val users = rows.map(row => header(row,"user"))
users.collect().map(user => println(user))
but I got an error:
java.util.NoSuchElementException: key not found: user
I tried to debug it and found that the index attribute in header looks like this:
Since I'm new to Spark and Scala, does this mean that user is already a key in the Map? Then why the "key not found" error?

I found out my mistake. It's not related to Spark/Scala. When I created the example csv, I used this command in R:
df <- data.frame(user=c('om','daniel','3754978'),topic=c('scala','spark','spark'),hits=c(120,80,1))
write.csv(df, "df.csv",row.names=FALSE)
but write.csv adds quotes around factors by default, which is why the map can't find the key user: the real key is "user" (including the quotes). Using
write.csv(df, "df.csv",quote=FALSE, row.names=FALSE)
will solve this problem.
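Alternatively, if you can't regenerate the file, you could strip the surrounding quotes while parsing. A minimal sketch (my own addition, assuming fields contain no embedded commas or quotes):
def unquote(s: String): String =
  if (s.length >= 2 && s.startsWith("\"") && s.endsWith("\"")) s.substring(1, s.length - 1)
  else s
val data = csv.map(line => line.split(",").map(elem => unquote(elem.trim)))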

I've rewritten the sample code to remove the header method. IMO, this example provides a step-by-step walkthrough that is easier to follow. Here is a more detailed explanation.
def main(args: Array[String]): Unit = {
  val csv = sc.textFile("/path/to/your/file.csv")
  // split / clean data
  val headerAndRows = csv.map(line => line.split(",").map(_.trim))
  // get header
  val header = headerAndRows.first
  // filter out header
  val data = headerAndRows.filter(_(0) != header(0))
  // splits to map (header/value pairs)
  val maps = data.map(splits => header.zip(splits).toMap)
  // filter out the 'om' user
  val result = maps.filter(map => map("user") != "om")
  // print result
  result.foreach(println)
}
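For reference, assuming the CSV was written without quotes, running this against the three-row sample from the R snippet above should print something like the following (the order may differ because foreach runs on the executors):
Map(user -> daniel, topic -> spark, hits -> 80)
Map(user -> 3754978, topic -> spark, hits -> 1)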

Related

How to add the elements into the Map where key is String and Value is List[String] in Scala

I have a text file which contains information about senders and messages. The format is sender, message.
I've loaded the file into an RDD, split each line on ",", and created key/value pairs where the key is the sender and the value is the message: RDD[(String, String)].
Then I did a groupByKey() to group the messages by sender, and I got an RDD[(String, Iterable[String])]:
Array[(String, Iterable[String])] = Array((Key,CompactBuffer(value1,value2,value3,....)))
Now I want to iterate over the value part and store the values one by one in a List, so I've created an empty Map where the key is a String and the value is a List[String].
First I should check whether the Map is empty; if it is empty, I should add the first value to the List that lives inside the Map.
Below is what I've tried, but I could not get it to work; when I check the Map it shows None.
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)
    val inputFile = "D:\\MyData.txt"
    val data = sc.textFile(inputFile)
    val data2 = data.map(line => {
      val arr = line.split(",")
      (arr(0), arr(1))
    })
    val grpData = data2.groupByKey()
    val myMap = scala.collection.mutable.Map.empty[String, List[String]]
    for (value <- grpData.values) {
      val list = ListBuffer[String]()
      if (myMap.isEmpty) {
        list += value
        myMap.put("G1", list.toList)
      }
    }
  }
}
In the for loop, I used grpData.values because I need only the value part. I don't want the keys from my file (the senders); I just used them to group the messages by sender. In the Map[String,List[String]] my keys should be Group1, Group2 and so on, and the values are the messages I get one by one from the CompactBuffer.
First, I should check whether the Map is empty. If it is empty, I should add the first message to the List inside the Map; the key should be "Group1" and the value should be that message stored in a List[String].
For the second iteration the Map will not be empty, so the condition will go to the else part. In the else part I should use the Levenshtein distance algorithm to compare the messages. The first message was already added to the List, so now I should take the first message from the Map and compare it with the second message using Levenshtein distance with a threshold of 70%. If the two messages meet the threshold, I should add the second message to the List; if not, I should add the second message to a separate list and use the key "G2", and so on.
You can use aggregateByKey to get a combined list of strings for each key:
val data = sc.textFile(inputFile)
val data2 = data.map(line => {val arr = line.split(","); (arr(0),arr(1))})
val result = data2.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
// or to prepend rather than append,
// val result = data2.aggregateByKey(List[String]())((x, y) => y :: x, _ ++ _)
If you want the result as a local Map on the driver, you can collect it:
val resultMap = result.collectAsMap() // or result.collect().toMap
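For example, with hypothetical input lines (sender names invented for illustration)
alice,hi
alice,how are you
bob,hello
data2 becomes ("alice","hi"), ("alice","how are you"), ("bob","hello"), and result holds ("alice", List("hi", "how are you")) and ("bob", List("hello")); the order inside each list depends on partitioning.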
I'm assuming that you are trying to cluster based on some distance function, in which case this might be what you are looking for:
def isWithinThreshold(s1: String, s2: String): Boolean = ???
//two sets are grouped when there exist elements in both sets that are close to each other
def combine(acc: Vector[Vector[String]], s: Vector[String]) = {
val (near, far) = acc.partition(_.exists(str => s.exists(isWithinThreshold(str, _))))
near.fold(s)(_ ++ _) +: far
}
val preClusteringGroups = grpData.values.map(_.toVector) //this is already pre-grouped with the key from data2 (`arr(0)`)
val res = preClusteringGroups.aggregate(Vector.empty[Vector[String]])(combine, { case (v1, v2) =>
(v1 ++ v2).foldLeft(Vector.empty[Vector[String]])(combine)
}).zipWithIndex.map { case (v, i) => s"G$i" -> v }.toMap //.mapValues(_.toList) if you actually need a list
preClusteringGroups is based on grpData, which is already pre-grouped by the original key and might not fulfill your distance requirements. If that's the case, redefine preClusteringGroups with:
val preClusteringGroups = data2.values.map(Vector(_))
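The isWithinThreshold function above is left as ???. A minimal, hand-rolled sketch of a Levenshtein-based check (my interpretation of the 70% threshold, i.e. the edit distance is at most 30% of the longer string's length):
def levenshtein(a: String, b: String): Int = {
  // classic dynamic-programming edit distance
  val dist = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (j == 0) i else if (i == 0) j else 0)
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dist(i)(j) = math.min(math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1), dist(i - 1)(j - 1) + cost)
  }
  dist(a.length)(b.length)
}
def isWithinThreshold(s1: String, s2: String): Boolean = {
  val maxLen = math.max(s1.length, s2.length)
  maxLen == 0 || levenshtein(s1, s2).toDouble / maxLen <= 0.3
}
You may prefer an existing implementation from a string-metrics library over this sketch.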

not able to store result in hdfs when code runs for second iteration

Well, I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks one column for missing values, stores them in outputrdd, and runs a loop to calculate the missing values. The code works well when there is only one missing value in the file, but since HDFS does not allow writing to the same location again, it fails when there is more than one missing value. Can you please assist in writing finalrdd to a particular location once the missing values for all occurrences have been calculated?
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }
  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  def cleaningData(file: String) = {
    //headers has column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      //Checks for missing values in dataframe and stores missing values in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()
        for (i <- strings) {
          if (Coddition check) {
            //Calculates missing value and stores in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
It is not clear how (Coddition check) works in your example.
In any case, the function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example like this:
val strings = outputdatetimerdd
  .map(row => row.mkString)
  .collect() // perhaps '.collect()' is redundant
val finalrdd = strings
  .filter(str => Coddition check str) //don't know how this Coddition works
  .map(x => x.mkString("\t"))
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

Streaming CSV Source with AKKA-HTTP

I am trying to stream data from MongoDB using reactivemongo-akkastream 0.12.1 and return the result as a CSV stream in one of the routes (using Akka HTTP).
I implemented that following the example here:
http://doc.akka.io/docs/akka-http/10.0.0/scala/http/routing-dsl/source-streaming-support.html#simple-csv-streaming-example
and it seems to work fine.
The only problem I am facing now is how to add the headers to the output CSV file. Any ideas?
Thanks
Aside from the fact that that example isn't really a robust way of generating CSV (it doesn't provide proper escaping), you'll need to rework it a bit to add headers. Here's what I would do:
make a Flow to convert a Source[Tweet] to a source of CSV rows, e.g. a Source[List[String]]
concatenate it to a source containing your headers as a single List[String]
adapt the marshaller to render a source of rows rather than tweets
Here's some example code:
case class Tweet(uid: String, txt: String)

def getTweets: Source[Tweet, NotUsed] = ???

val tweetToRow: Flow[Tweet, List[String], NotUsed] =
  Flow[Tweet].map { t =>
    List(
      t.uid,
      t.txt.replaceAll(",", "."))
  }

// provide a marshaller from a row (List[String]) to a ByteString
implicit val tweetAsCsv = Marshaller.strict[List[String], ByteString] { row =>
  Marshalling.WithFixedContentType(ContentTypes.`text/csv(UTF-8)`, () =>
    ByteString(row.mkString(","))
  )
}

// enable csv streaming
implicit val csvStreaming = EntityStreamingSupport.csv()

val route = path("tweets") {
  val headers = Source.single(List("uid", "text"))
  val tweets: Source[List[String], NotUsed] = getTweets.via(tweetToRow)
  complete(headers.concat(tweets))
}
Update: if your getTweets method returns a Future, you can just map over its value and prepend the headers that way, e.g.:
val route = path("tweets") {
  val headers = Source.single(List("uid", "text"))
  val rows: Future[Source[List[String], NotUsed]] = getTweets
    .map(tweets => headers.concat(tweets.via(tweetToRow)))
  complete(rows)
}
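As noted above, replacing commas with dots is not real CSV escaping. If you need it, a small RFC 4180-style helper (an illustrative sketch, not part of the original answer) wraps a field in double quotes when it contains a comma, quote or newline, and doubles any embedded quotes:
def escapeCsvField(field: String): String =
  if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else field

val tweetToRow: Flow[Tweet, List[String], NotUsed] =
  Flow[Tweet].map(t => List(escapeCsvField(t.uid), escapeCsvField(t.txt)))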

How to skip line in spark rdd map action based on if condition

I have a file and I want to feed it to an MLlib algorithm. So I am following the example and doing something like:
val data = sc.textFile(my_file)
  .map { line =>
    val parts = line.split(",")
    Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray)
  }
and this works, except that sometimes I have a missing feature. That is, sometimes one column of some row does not have any data, and I want to throw away rows like that.
So I want to do something like this: map{line => if(containsMissing(line) == true){ skipLine } else { ... //same as before }}
How can I do this skipLine action?
You can use the filter function to filter out such lines:
val data = sc.textFile(my_file)
  .filter(_.split(",").length == cols)
  .map { line =>
    // your code
  }
Assuming variable cols holds number of columns in a valid row.
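If "missing" can also mean an empty field between two commas rather than a shorter row, a variant of the same filter (my own sketch; splitting with a negative limit keeps trailing empty fields):
import org.apache.spark.mllib.linalg.Vectors // assuming the RDD-based MLlib API, as in the snippets above

val data = sc.textFile(my_file)
  .filter { line =>
    val parts = line.split(",", -1) // -1 keeps trailing empty fields
    parts.length == cols && parts.forall(_.trim.nonEmpty)
  }
  .map { line =>
    val parts = line.split(",")
    Vectors.dense(parts.slice(1, parts.length).map(_.toDouble))
  }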
You can use flatMap, Some and None for this:
def missingFeatures(parts: Array[String]): Boolean = ??? // determine if features are missing

val data = sc.textFile(my_file)
  .flatMap { line =>
    val parts = line.split(",")
    if (missingFeatures(parts)) None
    else Some(Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray))
  }
This way you avoid mapping over the rdd more than once.
Java code to skip empty lines / header from Spark RDD:
First the imports:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
Now the filter keeps rows that have exactly 17 columns and are not the header row (which starts with VendorID):
Function<String, Boolean> isValid = row -> (row.split(",").length == 17 && !(row.startsWith("VendorID")));
JavaRDD<String> taxis = sc.textFile("datasets/trip_yellow_taxi.data")
        .filter(isValid);

How to group by a select number of fields in an RDD looking for duplicates based on those fields

I am new to Scala and Spark. I am working in the Spark shell.
I need to group by and sort on the first three fields of this file, looking for duplicates. If I find duplicates within a group, I need to append a counter to the third field, starting at "1" and incrementing by "1" for each record in the duplicate group, resetting the counter back to "1" when a new group starts. When no duplicates are found, I just append the counter, which would be "1".
CSV File contains the following:
("00111","00111651","4444","PY","MA")
("00111","00111651","4444","XX","MA")
("00112","00112P11","5555","TA","MA")
val csv = sc.textFile("file.csv")
val recs = csv.map(line => line.split(","))
If I apply the logic properly to the above example, the resulting RDD recs would look like this:
("00111","00111651","44441","PY","MA")
("00111","00111651","44442","XX","MA")
("00112","00112P11","55551","TA","MA")
How about grouping the data, changing it and putting it back:
val csv = sc.parallelize(List(
  "00111,00111651,4444,PY,MA",
  "00111,00111651,4444,XX,MA",
  "00112,00112P11,5555,TA,MA"
))
val recs = csv.map(_.split(","))
val grouped = recs.groupBy(line => (line(0), line(1), line(2)))
val numbered = grouped.mapValues(dataList =>
  dataList.zipWithIndex.map { case (data, idx) => data match {
    case Array(fst, scd, thd, rest @ _*) => Array(fst, scd, thd + (idx + 1)) ++ rest
  }}
)
numbered.flatMap { case (key, values) => values }
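Collecting the flattened result for the three sample rows should give arrays equivalent to the expected output in the question (group order may vary):
numbered.flatMap { case (_, values) => values }
  .collect()
  .foreach(arr => println(arr.mkString(",")))
// e.g. 00112,00112P11,55551,TA,MA
//      00111,00111651,44441,PY,MA
//      00111,00111651,44442,XX,MA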
Also grouping the data, changing it and putting it back, this time on plain Scala collections.
val lists = List(("00111","00111651","4444","PY","MA"),
  ("00111","00111651","4444","XX","MA"),
  ("00112","00112P11","5555","TA","MA"))
val grouped = lists.groupBy { case (a, b, c, d, e) => (a, b, c) }
val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case ((a, b, c, d, e), idx) => (a, b, c + (idx + 1).toString, d, e) })
val unwrapped = indexed.flatMap(_._2)
//List((00112,00112P11,55551,TA,MA),
//     (00111,00111651,44442,XX,MA),
//     (00111,00111651,44441,PY,MA))
A version working on Arrays (of arbitrary length >= 3):
val lists = List(Array("00111","00111651","4444","PY","MA"),
  Array("00111","00111651","4444","XX","MA"),
  Array("00112","00112P11","5555","TA","MA"))
// group by the first three fields; toList gives the key structural equality
// (Arrays compare by reference, so two equal Array keys would never match)
val grouped = lists.groupBy(_.take(3).toList)
val indexed = grouped.mapValues(
  _.zipWithIndex
    .map { case (Array(a, b, c, rest @ _*), idx) => Array(a, b, c + (idx + 1).toString) ++ rest })
val unwrapped = indexed.flatMap(_._2)
// List(Array(00112, 00112P11, 55551, TA, MA),
//      Array(00111, 00111651, 44441, PY, MA),
//      Array(00111, 00111651, 44442, XX, MA))