I have a data set of crimes that happened from 2001 up till now. I want to calculate the number of crimes per year. The code I have tried is:
val inp = SparkConfig.spark.sparkContext.textFile("file:\\C:\\Users\\M1047320\\Desktop\\Crimes_-_2001_to_present.csv")
val header = inp.first()
val data = inp.filter( line => line(0) != header(0))
val splitRDD = data.map( line =>{
val temp = line.split(",(?![^\\(\\[]*[\\]\\)])")
(temp(0),temp(1),temp(2),temp(3),temp(4),temp(5),
temp(6),temp(7),temp(8),temp(9),temp(10),temp(11),
temp(12),temp(13),temp(14),temp(15),temp(16),temp(17))
})
val crimesPerYear = splitRDD.map( line => (line._18,1)).reduceByKey(_+_)// line._18 represents year column
crimesPerYear.take(20).foreach(println)
The expected result is:
(2001,54)
(2002,100)
(2003,24)
and so on.
But I am getting results like:
(1175860,1)
(1176964,4)
(1178665,123)
(1171273,3)
(1938926,1)
(1141621,8)
(1136278,2)
I am totally confused about what I am doing wrong. Why are the years summing up like this? Please help me.
I have created an RDD where the first column is the key and the rest of the columns are values for that key. Every row has a unique key. I want to find the average of the values for every key. I created a key-value pair and tried the following code, but it is not producing the desired results. My code is here:
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows,cols)(math.random)
lazy val li2 = (1 to rows).toList
lazy val li = (li1, li2).zipped.map(_ :: _)
val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(li,partitions)
val gr = rdd.map( x => (x(0) , x.drop(1)))
val gr1 = gr.values.reduce((x,y) => x.zip(y).map(x => x._1 +x._2 )).foldLeft(0)(_+_)
gr1.take(3).foreach(println)
I want the result to be displayed like
1 => 1.1,
2 => 2.7
and so on for all keys.
First, I am not sure what this line is doing:
lazy val li = (li1, li2).zipped.map(_ :: _)
Instead, you could do this:
lazy val li = li2 zip li1
This will create a List of tuples of type (Int, List[Double]).
And the solution to find the average of the values for each key could be as below:
rdd.map{ x => (x._1, x._2.fold(0.0)(_ + _) / x._2.length) }
   .collect
   .foreach(x => println(x._1 + " => " + x._2))
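Putting both suggestions together, a minimal end-to-end sketch (reusing the rows, cols and partitions values from your snippet) could look like this:
import org.apache.spark.{SparkConf, SparkContext}

val rows = 10
val cols = 6
val partitions = 4

// test data: each key (1 to rows) paired with a list of random doubles
val li1 = List.fill(rows, cols)(math.random)
val li2 = (1 to rows).toList
val li  = li2 zip li1                        // List[(Int, List[Double])]

val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(li, partitions)     // RDD[(Int, List[Double])]

// average the values for each key and print "key => average"
rdd.map { case (k, vs) => (k, vs.sum / vs.length) }
   .collect
   .foreach { case (k, avg) => println(k + " => " + avg) }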
I have two files. One is a text file and the other is a CSV. I want to read the text file as a Map(key, value) and use the values from the first file as the keys when I read the second (CSV) file.
I am able to read the first file and get a Map(key, value). From this Map, I have extracted the values and passed them as keys for the second file, but I didn't get the desired result.
First file (text file):
sdp:field(0)
meterNumber:field(1)
date:field(2)
time:field(3)
value:field(4),field(5),field(6),field(7),field(8),field(9),
field(10),field(11),field(12),field(13),field(14),
field(15),field(16),field(17)
Second file (CSV file):
SDP,METERNO,READINGDATE,TIME,Reset Count.,Kilowatt-Hour Last Reset .,Kilowatt-Hour Rate A Last Reset.,Kilowatt-Hour Rate B Last Reset.,Kilowatt-Hour Rate C Last Reset.,Max Kilowatt Rate A Last Reset.,Max Kilowatt Rate B Last Reset.,Max Kilowatt Rate C Last Reset.,Accumulate Kilowatt Rate A Current.,Accumulate Kilowatt Rate B Current.,Accumulate Kilowatt Rate C Current.,Total Kilovar-Hour Last Reset.,Max Kilovar Last Reset.,Accumulate Kilovar Last Reset.
9000000001,500001,02-09-2018,00:00:00,2,48.958,8.319333333,24.31933333,16.31933333,6,24,15,10,9,6,48.958,41,40
This is what I have done to read the first file:
val lines = scala.io.Source.fromFile("D:\\JSON_READER\\dailymapping.txt", "UTF8")
.getLines
.map(line=>line.split(":"))
.map(fields => (fields(0),fields(1))).toMap;
val sdp = lines.get("sdp").get;
val meterNumber = lines.get("meterNumber").get;
val date = lines.get("date").get;
val time = lines.get("time").get;
val values = lines.get("value").get;
Now I can see that sdp has field(0), meterNumber has field(1), date has field(2), time has field(3), and values has field(4) to field(17).
I am reading the second file using the code below:
val keyValuePairs = scala.io.Source.fromFile("D:\\JSON_READER\\Daily.csv")
.getLines.drop(1).map(_.stripLineEnd.split(",", -1))
.map{field => ((field(0),field(1),field(2),field(3)) -> (field(4),field(5)))}.toList
val map = Map(keyValuePairs : _*)
System.out.println(map);
The above code gives me the following output, which is the desired output:
Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
But I want to replace field(0), field(1), field(2), field(3) with sdp, meterNumber, date, time in the above code, so that I don't have to hard-code the keys when I read the second file; the keys will come from the first file.
I tried to replace them, but I got the output below, which is not the desired output:
Map((field(0),field(1),field(2),field(3)) -> (,))
Can somebody please guide me on how I can achieve the desired output?
This might get you close to what you're after. The first Map is used to look up the correct index into the CSV data.
val fieldRE = raw"field\((\d+)\)".r

val idx = io.Source
  .fromFile(<txt_file>, "UTF8")
  .getLines
  .map(_.split(":"))
  .flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
                            .split(",")
                            .map(fields(0) -> _.toInt))
  .toMap

val resMap = io.Source
  .fromFile(<csv_file>)
  .getLines
  .drop(1)
  .map(_.stripLineEnd.split(",", -1))
  .map{ fld =>
    (fld(idx("sdp")), fld(idx("meterNumber")), fld(idx("date")), fld(idx("time"))) ->
      (fld(4), fld(5)) }
  .toMap

//resMap: Map((9000000001,500001,02-09-2018,00:00:00) -> (2,48.958))
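For reference, with the text file above idx works out to Map(sdp -> 0, meterNumber -> 1, date -> 2, time -> 3, value -> 17); the value entry contributes many pairs, but toMap keeps only the last one. A lookup against resMap then looks like this:
// look up the reading for one sdp/meter/date/time combination
val reading = resMap(("9000000001", "500001", "02-09-2018", "00:00:00"))
// reading: (String, String) = (2,48.958)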
UPDATE
Changing the Map of (String identifiers -> Int index values) into a Map of (String identifiers -> collection of Int index values) can be done. I'm not sure what that buys you, but it's doable.
val fieldRE = raw"field\((\d+)\)".r

val idx = io.Source
  .fromFile(<txt_file>, "UTF8")
  .getLines
  .map(_.split(":"))
  .flatMap(fields => fieldRE.replaceAllIn(fields(1), _.group(1))
                            .split(",")
                            .map(fields(0) -> _.toInt))
  .foldLeft(Map[String,Seq[Int]]()){ case (m, (k, v)) =>
    m + (k -> (m.getOrElse(k, Seq()) :+ v))
  }

val resMap = io.Source
  .fromFile(<csv_file>)
  .getLines
  .drop(1)
  .map(_.stripLineEnd.split(",", -1))
  .map{ fld => (fld(idx("sdp").head)
               ,fld(idx("meterNumber").head)
               ,fld(idx("date").head)
               ,fld(idx("time").head)) -> (fld(4), fld(5)) }
  .toMap
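One thing the Seq form does buy you is that idx("value") now holds all of the value indices (4 through 17, assuming the value entry of the text file is read as a single line), so every value column can be pulled in one go. A sketch, reusing the fld name from the map above:
// inside the .map over the CSV rows: collect all columns listed under "value"
val valueCols: Seq[String] = idx("value").map(fld(_))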
I'm trying to copy a column of a matrix into an array; I also want to make this matrix public.
Here's my code:
val years = Array.ofDim[String](1000, 1)
val bufferedSource = io.Source.fromFile("Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val i=0;
//println("THEME, TITLE, ARTIST, YEAR, SPOTIFY_URL")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
years(i)=cols(3)(i)
}
I want cols to be a global matrix and to copy column 3 into years, but because of the way I obtain cols I don't know how to define it.
There are three different problems in your attempt:
Your regexp will fail for this dataset. I suggest you change it to:
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
This will capture the blocks that are wrapped in double quotes but contain commas (courtesy of Luke Sheppard on regexr).
This val i=0; is not very Scala-ish / functional. We can replace it with a zipWithIndex in the for comprehension:
for ((line, count) <- bufferedSource.getLines.zipWithIndex)
You can create the "global matrix" by extracting elements from each line (val Array (...)) and returning them as the value of the for-comprehension block (yield):
It looks like this:
for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
  val Array(theme, title, artist, year, spotify_url) = line....
  ...
  (theme, title, artist, year, spotify_url)
}
And here is the complete solution:
val bufferedSource = io.Source.fromFile("/tmp/Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val years = Array.ofDim[String](1000, 1)
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
val iteratorMatrix = for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
  val Array(theme, title, artist, year, spotify_url) = line.split(regex, -1).map(_.trim)
  years(count) = Array(year)
  (theme, title, artist, year, spotify_url)
}
// will actually consume the iterator and fill in globalMatrix AND years
val globalMatrix = iteratorMatrix.toList
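A quick way to check both results once the iterator has been consumed:
// print the first few (theme, title, artist, year, spotify_url) tuples
globalMatrix.take(3).foreach(println)
// and the year stored for the first line read
println(years(0)(0))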
Here's a function that will get column col from the CSV. There is no error handling here for empty rows or other conditions; this is a proof of concept, so add your own error handling as you see fit.
val years = (fileName: String, col: Int) => scala.io.Source.fromFile(fileName)
.getLines()
.map(_.split(",")(col).trim())
Here's a suggestion if you are looking to keep the contents of the file in a map. Again, there's no error handling; this is just a proof of concept.
val yearColumn = 3
val fileName = "Top_1_000_Songs_To_Hear_Before_You_Die.csv"
def lines(file: String) = scala.io.Source.fromFile(file).getLines()
val mapRow = (row: String) => row.split(",").zipWithIndex.foldLeft(Map[Int, String]()){
case (acc, (v, idx)) => acc.updated(idx,v.trim())}
def mapColumns = (values: Iterator[String]) =>
values.zipWithIndex.foldLeft(Map[Int, Map[Int, String]]()){
case (acc, (v, idx)) => acc.updated(idx, mapRow(v))}
val parser = lines _ andThen mapColumns
val matrix = parser(fileName)
val years = matrix.flatMap(_.swap._1.get(yearColumn))
This will build a Map[Int,Map[Int, String]] which you can use elsewhere. The first index of the map is the row number and the index of the inner map is the column number. years is an Iterable[String] that contains the year values.
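For example, individual cells can be read by row and column index, and the years can be materialized into a concrete collection:
// the year field of the first row in the file
val firstRowYear = matrix(0)(yearColumn)
// turn the Iterable[String] of years into a List
val yearList = years.toList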
Consider adding contents to a collection at the same time as it is created, in contrast to allocating space first and then updating it; for instance like this:
val rawSongsInfo = io.Source.fromFile("Top_Songs.csv").getLines
val cols = for (rsi <- rawSongsInfo) yield rsi.split(",").map(_.trim)
val years = cols.map(_(3))
I am learning how to use Apache Spark and I am trying to get the average temperature for each hour from a data set. The data set I am trying to use comes from weather information stored in a CSV. I am having trouble figuring out how to first read in the CSV file and then calculate the average temperature for each hour.
From the Spark documentation, I am using the example Scala line to read in a file:
val textFile = sc.textFile("README.md")
I have given the link for the data file below. I am using the file called JCMB_2014.csv as it is the latest one with all months covered.
Weather Data
Edit:
The code I have tried so far is:
class SimpleCSVHeader(header:Array[String]) extends Serializable {
val index = header.zipWithIndex.toMap
def apply(array:Array[String], key:String):String = array(index(key))
}
val csv = sc.textFile("JCMB_2014.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header
val rows = data.filter(line => header(line,"date-time") != "date-time")
val users = rows.map(row => header(row,"date-time"))
val usersByHits = rows.map(row => header(row,"date-time") -> header(row,"surface temperature (C)").toInt)
Here is sample code for calculating averages on an hourly basis.
Step 1: Read the file, filter out the header, and extract the time and temperature columns.
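The snippets below assume lines has already been created from the CSV, for example:
scala> val lines = sc.textFile("JCMB_2014.csv")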
scala> val hourlyTemps = lines.map(line=>line.split(",")).filter(entries=>(!"time".equals(entries(3)))).map(entries=>(entries(3).toInt/60,(entries(8).toFloat,1)))
scala> hourlyTemps.take(1)
res25: Array[(Int, (Float, Int))] = Array((9,(10.23,1)))
(time/60) discards minutes and keeps only hours
Step 2: Aggregate the temperatures and the number of occurrences.
scala> val aggregateTemps=hourlyTemps.reduceByKey((a,b)=>(a._1+b._1,a._2+b._2))
scala> aggregateTemps.take(1)
res26: Array[(Int, (Double, Int))] = Array((34,(8565.25,620)))
Step 3: Calculate the averages using the totals and the number of occurrences.
Find the final result below.
val avgTemps=aggregateTemps.map(tuple=>(tuple._1,tuple._2._1/tuple._2._2))
scala> avgTemps.collect
res28: Array[(Int, Float)] = Array((34,13.814922), (4,11.743354), (16,14.227251), (22,15.770312), (28,15.5324545), (30,15.167026), (14,13.177828), (32,14.659948), (36,12.865237), (0,11.994799), (24,15.662579), (40,12.040322), (6,11.398838), (8,11.141323), (12,12.004652), (38,12.329914), (18,15.020147), (20,15.358524), (26,15.631921), (10,11.192643), (2,11.848178), (13,12.616284), (19,15.198371), (39,12.107664), (15,13.706351), (21,15.612191), (25,15.627121), (29,15.432097), (11,11.541124), (35,13.317129), (27,15.602408), (33,14.220147), (37,12.644306), (23,15.83412), (1,11.872819), (17,14.595772), (3,11.78971), (7,11.248139), (9,11.049844), (31,14.901464), (5,11.59693))
You may want to provide a structure definition for your CSV file and convert your RDD to a DataFrame, as described in the documentation. DataFrames provide a whole set of useful predefined statistical functions as well as the possibility to write simple custom functions. You will then be able to compute the average with:
dataFrame.groupBy(<your columns here>).agg(avg(<column to compute average>))
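A minimal sketch of that route for an older, SQLContext-based Spark setup is below; the column positions (0 for the date-time string, 8 for the surface temperature) and the "yyyy/MM/dd HH:mm" timestamp layout are assumptions about JCMB_2014.csv rather than verified facts:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, substring}

// assumed layout: column 0 = "date-time", column 8 = "surface temperature (C)"
case class Reading(dateTime: String, surfaceTempC: Double)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val readingsDF = sc.textFile("JCMB_2014.csv")
  .map(_.split(",").map(_.trim))
  .filter(cols => cols(0) != "date-time")            // drop the header row
  .map(cols => Reading(cols(0), cols(8).toDouble))
  .toDF()

// take the "HH" part of the timestamp (characters 12-13) and average per hour
val hourlyAvg = readingsDF
  .withColumn("hour", substring($"dateTime", 12, 2))
  .groupBy("hour")
  .agg(avg($"surfaceTempC").as("avg_temp"))

hourlyAvg.show()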
val sc = new SparkContext("local[4]", "wc")
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val errors = lines.filter(line => line.contains("ERROR"))
// Count all the errors
println(errors.count())
The above snippet counts the number of lines that contain the word ERROR. Is there a simplified function, similar to "contains", that would return the number of occurrences of the word instead?
Say the file is gigabytes in size and I want to parallelize the effort using Spark clusters.
Just count the instances per line and sum those together:
val errorCount = lines.map{line => line.split("[\\p{Punct}\\p{Space}]").filter(_ == "ERROR").size}.reduce(_ + _)
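An equivalent formulation that flattens the tokens first and lets Spark do the counting (a sketch against the same lines RDD):
// split every line into tokens and count the tokens equal to "ERROR"
val errorCount = lines
  .flatMap(_.split("[\\p{Punct}\\p{Space}]"))
  .filter(_ == "ERROR")
  .count()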