Spark map reduce with condition - scala

Suppose this is my CSV file:
attr1;attr2
11111;MOC
22222;MTC
11111;MOC
22222;MOC
33333;MMS
I want to count the occurrences of each value in the first column where attr2 = MOC, like this:
(11111,2)
(22222,1)
I've tried:
val sc = new SparkContext(conf)
val textFile = sc.textFile(args(0))
val data = textFile.map(line => line.split(";").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0))
val rows = data.filter(line => header(line,"attr1") != "attr1")
val attr1 = rows.map(row => header(row,"attr1"))
val attr2 = rows.map(row => header(row,"attr2"))
attr1.map( k => (k,1) ).reduceByKey(_+_)
attr1.foreach (println)
How can I add the condition to my code?
The result of my code is:
(11111,2)
(22222,2)
(33333,1)

Use filter (again):
val rows = data
.filter(line => header(line,"attr1") != "attr1")
.filter(line => header(line,"attr2") == "MOC")
And then continue as before...
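For completeness, a minimal sketch of the whole pipeline with the extra filter in place (reusing the same SimpleCSVHeader helper and input as above); with the sample file it prints (11111,2) and (22222,1):
val data = textFile.map(line => line.split(";").map(_.trim))
val header = new SimpleCSVHeader(data.take(1)(0))

val counts = data
  .filter(line => header(line, "attr1") != "attr1")   // drop the header row
  .filter(line => header(line, "attr2") == "MOC")     // keep only MOC rows
  .map(line => (header(line, "attr1"), 1))            // key by attr1
  .reduceByKey(_ + _)                                  // count per key

counts.foreach(println)
Note that the result of reduceByKey has to be kept (here in counts) and printed, rather than printing attr1 as in the original snippet.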

Related

How to transpose many files, given linesWithFileNames: RDD[(Path, Text)], where each Text contains a matrix?

I want to read many files and construct a pair (Array[String], index) for each column, where the index is "file-i" and i is the local column counter.
For example:
tableA.txt:00 01 02\n10 11 12
tableB.txt:03 04\n13 14
Target (each column with its filename and index):
RDD[(Array[String], String)]: (Array("00","10"),"tableA.txt-0"), (Array("01","11"),"tableA.txt-1"), (Array("02","12"),"tableA.txt-2"), (Array("03","13"),"tableB.txt-0"), (Array("04","14"),"tableB.txt-1")
My code:
val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val text = sc.newAPIHadoopFile(path, fc ,kc, vc, sc.hadoopConfiguration)
val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tup => (file.getPath, tup._2))
  })
val columnsData = linesWithFileNames.flatMap(p => {
  val filename = p._1.toString
  val lines = p._2.toString.split("\n")
  lines.map(l => l.split(" "))
    .toSeq.transpose.zipWithIndex
    .map(pair => (pair._1, filename + "-" + pair._2.toString))
})
My wrong result:
("00","tableA.txt-0"),("10","tableA.txt-0")...
One easy way to achieve what you want is to use wholeTextFiles, which generates an RDD that associates each file path with its entire content. (Your original code receives one line per record, so the transpose sees a one-row matrix and each "column" ends up holding a single value, which is why you get the wrong result.) The code would look like this:
val result: RDD[(Array[String], String)] = sc
  .wholeTextFiles("data1")
  .flatMap { case (path, content) =>
    val fileName = path.split("/").last
    content.split("\\n")                    // one entry per row
      .map(_.split("\\s+").toSeq).toSeq
      .transpose                            // rows -> columns
      .zipWithIndex
      .map { case (col, i) => (col.toArray, fileName + "-" + i) }
  }
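The trade-off is that wholeTextFiles reads each file as a single record, which is exactly what lets the per-file transpose see the whole matrix at once, but it also means each file has to fit in memory on one executor, so this approach is best suited to many reasonably small files.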

Understanding the operation of map function

I came across the following example from the book "Fast Data Processing with Spark" by Holden Karau. I did not understand what the following lines of code do in the program:
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
The program is:
package pandaspark.examples

import spark.SparkContext
import spark.SparkContext._
import spark.SparkFiles
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader

object LoadCsvExample {
  def main(args: Array[String]) {
    if (args.length != 2) {
      System.err.println("Usage: LoadCsvExample <master> <inputfile>")
      System.exit(1)
    }
    val master = args(0)
    val inputFile = args(1)

    val sc = new SparkContext(master, "Load CSV Example",
      System.getenv("SPARK_HOME"), Seq(System.getenv("JARS")))
    sc.addFile(inputFile)

    val inFile = sc.textFile(inputFile)
    val splitLines = inFile.map(line => {
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    })
    val numericData = splitLines.map(line => line.map(_.toDouble))
    val summedData = numericData.map(row => row.sum)

    println(summedData.collect().mkString(","))
  }
}
I broadly understand the functionality of the above program: it parses the input CSV and sums all the rows. But how exactly those three lines of code achieve that is what I am unable to understand.
Also, could anyone explain how the output would change if those lines were replaced with flatMap? Like:
val splitLines = inFile.flatMap(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.flatMap(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)
So this code is basically reading CSV data and summing its values.
Suppose your CSV file is something like:
10,12,13
1,2,3,4
1,2
Here we fetch the data from the CSV file into inFile:
val inFile = sc.textFile("your CSV file path")
Here inFile is an RDD[String] containing the raw text lines.
If you apply collect on it, it will look like this:
Array[String] = Array(10,12,13 , 1,2,3,4 , 1,2)
When you apply map over it, each element is passed in as line:
line = 10,12,13
line = 1,2,3,4
line = 1,2
To read this data in CSV format, it uses:
val reader = new CSVReader(new StringReader(line))
reader.readNext()
After parsing the data as CSV, splitLines looks like:
Array(
Array(10,12,13),
Array(1,2,3,4),
Array(1,2)
)
On splitLines it then applies
splitLines.map(line => line.map(_.toDouble))
Here line is one of those arrays, e.g. Array(10,12,13), and
line.map(_.toDouble)
converts every element from String to Double. So in numericData you get the same arrays,
Array(Array(10.0, 12.0, 13.0), Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 2.0))
but with all elements now of type Double.
Finally, it sums each individual row (array), so the answer is something like
Array(35.0, 10.0, 3.0)
which you get when you apply summedData.collect().
First of all, there is no flatMap operation in your code sample, so the title is misleading. But in general, map called on a collection returns a new collection with the function applied to each element.
Going line by line through your code snippet:
val splitLines = inFile.map(line => {
val reader = new CSVReader(new StringReader(line))
reader.readNext()
})
The type of inFile is RDD[String]. You take every such string, create a CSV reader from it and call readNext (which returns an array of strings). So at the end you get RDD[Array[String]].
val numericData = splitLines.map(line => line.map(_.toDouble))
A slightly trickier line, with two nested map operations. Again, you take each element of the RDD (which is now an Array[String]) and apply _.toDouble to every element of that array. At the end you get RDD[Array[Double]].
val summedData = numericData.map(row => row.sum)
You take the elements of the RDD and apply sum to them. Since every element is an Array[Double], sum produces a single Double value. At the end you get RDD[Double].
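As for the flatMap part of the question, here is a minimal sketch (plain Scala collections rather than the book's code) of how map and flatMap differ when the function returns a collection:
val lines = Seq("10,12,13", "1,2,3,4")

// map keeps one output element per input element, so the nesting is preserved:
lines.map(_.split(","))
// contents: Seq(Array("10", "12", "13"), Array("1", "2", "3", "4"))

// flatMap flattens the returned collections into a single sequence:
lines.flatMap(_.split(","))
// contents: Seq("10", "12", "13", "1", "2", "3", "4")
So with flatMap, splitLines would become a flat RDD of individual field strings instead of an RDD of per-line arrays, and the subsequent per-row steps would no longer be operating on rows at all (line would be a single field and row a single number), so the example as written would stop making sense.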

Splitting into new columns as many times as the separator between column values has occurred

I have a DataFrame where some columns hold multiple values, always separated by ^:
phone|contact|
ERN~58XXXXXX7~^EPN~5XXXXX551~|C~MXXX~MSO~^CAxxE~~~~~~3XXX5|
The desired output is:
phone1|phone2|contact1|contact2|
ERN~5XXXXXXX7|EPN~58XXXX91551~|C~MXXXH~MSO~|CAxxE~~~~~~3XXX5|
How can this be achieved with a loop, given that the number of ^-separated values in a column is not constant?
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("charset", "UTF-8")
  .load("test.txt")

val columnList = df.columns
val xx = columnList.map(x => x -> 0).toMap

// First pass: for every column, find the maximum number of ^-separated values
// that appears in any row.
val opMap = df.rdd.flatMap { row =>
  columnList.foldLeft(xx) { case (y, col) =>
    val s = row.getAs[String](col).split("\\^").length
    if (y(col) < s) y.updated(col, s) else y
  }.toList
}
val colMaxSizeMap = opMap.groupBy(x => x._1)
  .map(x => x._2.toList.maxBy(x => x._2))
  .collect().toMap

// Second pass: split every cell on ^, pad it with empty strings up to the
// column's maximum size, and flatten into a single wide Row.
val x = df.rdd.map { x =>
  val op = columnList.flatMap { y =>
    val op = x.getAs[String](y).split("\\^")
    op ++ List.fill(colMaxSizeMap(y) - op.size)("")
  }
  Row.fromSeq(op)
}

// Build a schema with one StructField per generated column.
val structFieldList = columnList.flatMap { colName =>
  List.range(0, colMaxSizeMap(colName), 1).map { i =>
    StructField(s"$colName$i", StringType)
  }
}
val schema = StructType(structFieldList)
val da = spark.createDataFrame(x, schema)
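Note that List.range(0, colMaxSizeMap(colName), 1) starts the generated column names at 0, so with the sample data this produces columns phone0, phone1, contact0, contact1; if you want names starting at 1 (phone1, phone2, ...), shift the index by one when building the StructFields.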

Read line from file apply regex and write to parquet file scala spark

Hi, I have a log file containing log events. I need to read each line, apply a regex to extract the elements from the line, and write them to a Parquet file. I have an Avro schema with the column definitions.
Could someone guide me on how to proceed?
val spark = SparkSession
  .builder()
  .appName("SparkApp")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val rdd = sc.textFile(args(0))
val schemaString = args(1)
val pattern = new Regex(args(2))

val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

val matches = rdd.map { x => pattern.findFirstMatchIn(x) }.map { x => x.map { x => x.subgroups } }
val values = matches.map { x => x.map { x => Row(x.toArray) } }
In values I'm getting RDD[Option[Row]].
Any suggestions?
You are getting RDD[Option[Row]] because of the regex: findFirstMatchIn returns an Option, as its definition shows:
def findFirstMatchIn(source: CharSequence): Option[Match]
To avoid this:
val matches = rdd.map { x => pattern.findFirstMatchIn(x) }.map { x => x.map { x => x.subgroups }.get }
val values = matches.map { x => x.map { x => Row(x.toArray) } }
Result: RDD[List[Row]]
To be defensive, you can use getOrElse instead of get.
You can also use flatMap if you want just RDD[Row].
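For the flatMap route, a minimal sketch (assuming the same rdd, pattern and schema as in the question): lines that don't match are simply dropped, and each match becomes one Row built from its subgroups, which can then be written out as Parquet.
import org.apache.spark.sql.Row

val rows = rdd.flatMap { line =>
  // Option[Match] -> zero or one Row per line
  pattern.findFirstMatchIn(line).map(m => Row(m.subgroups: _*))
}

// rows: RDD[Row]; combine it with the StructType schema and write it out
// ("output/path" is just a placeholder)
spark.createDataFrame(rows, schema).write.parquet("output/path")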

obtain a specific value from a RDD according to another RDD

I want to map an RDD by looking values up in another RDD, with this code:
val product = numOfT.map{case((a,b),c)=>
val h = keyValueRecords.lookup(b).take(1).mkString.toInt
(a,(h*c))
}
a and b are Strings and c is an Integer. keyValueRecords is like this: RDD[(String, String)].
I got a type mismatch error: how can I fix it?
What is my mistake?
This is a sample of data:
userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
2,3,2.0,859046959
3,7,3.0,8414840873
I'm trying with this code:
val lines = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1), substrings(1)))
})
val shoppingList = lines.groupByKey()
val coOccurence = shoppingList.flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfT = coOccurence.reduceByKey((a, b) => (a + b)) // (((item,rate),(item,rate)), cooccurrence)

// produce recommendations for a particular user
val keyValueRecords = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1), substrings(2)))
}).filter { case (k, v) => k == "1" }.groupByKey().flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfTForaUser = keyValueRecords.reduceByKey((a, b) => (a + b))

val joined = numOfT.join(numOfTForaUser)
  .map { case (k, v) => (k._1._1, (k._2._2.toFloat * v._1.toFloat)) }
  .collect.foreach(println)
The last RDD won't be produced. Is it wrong?
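One thing worth noting about the first snippet: lookup is an RDD action, and Spark does not support invoking RDD operations inside a transformation on another RDD, so calling keyValueRecords.lookup(b) inside numOfT.map cannot work. A minimal sketch of a common workaround, assuming the simpler types stated in the question (keyValueRecords: RDD[(String, String)], a and b are Strings, c is an Int), is to collect the small lookup RDD to the driver and broadcast it:
// Sketch only: assumes keyValueRecords is RDD[(String, String)] and that its
// values parse as Int, as described in the question.
val lookupMap = sc.broadcast(keyValueRecords.collectAsMap())

val product = numOfT.map { case ((a, b), c) =>
  val h = lookupMap.value.getOrElse(b, "0").toInt
  (a, h * c)
}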