Create multiple RDDs from a single file based on row value (header record in sample file) using Spark Scala

I am trying to create multiple RDDs, to be processed independently, from the file below, splitting on the format of the data.
The file contains rows in several different formats:
custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
custid,relstatus
fg3-03,R
dfsdf4-01,V
56fbfg,R
devid,reg,hold,devbrn,lname,lcon
CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370
The file above contains three different data formats, and I want to split it into three different RDDs, one per format.
Could you please suggest how to implement this using Spark (Scala)?

Your file looks like it has three different CSV files concatenated in it.
You can read it as a single file and extract three RDDs from it based on the number of fields in each row.
// Caching because you'll be filtering it thrice
val topRdd = sc.textFile("file").cache
topRdd.count
//res0: Long = 10
val rdd1 = topRdd.filter(_.split(",", -1).length == 10 )
val rdd2 = topRdd.filter(_.split(",", -1).length == 2 )
val rdd3 = topRdd.filter(_.split(",", -1).length == 6 )
rdd1.collect.foreach(println)
// custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
// 4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
// X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
// XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
rdd2.collect.foreach(println)
// custid,relstatus
// fg3-03,R
// dfsdf4-01,V
// 56fbfg,R
rdd3.collect.foreach(println)
// devid,reg,hold,devbrn,lname,lcon
// CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370
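If the header rows should not flow into downstream parsing, a minimal sketch (an addition to the answer above, not part of it) drops the known first line of each filtered RDD:
// Drop the header line from each RDD before parsing the fields.
// first() is an action, but it only fetches a single row per RDD.
val header1 = rdd1.first
val data1 = rdd1.filter(_ != header1)
val header2 = rdd2.first
val data2 = rdd2.filter(_ != header2)
val header3 = rdd3.first
val data3 = rdd3.filter(_ != header3)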

Related

Faster way to get single cell value from Dataframe (using just transformation)

I have the following code, where I want to get a DataFrame dfDateFiltered from dfBackendInfo containing all rows with RowCreationTime greater than the timestamp latestRowCreationTime:
val latestRowCreationTime = dfVersion.agg(max("BackendRowCreationTime")).first.getTimestamp(0)
val dfDateFiltered = dfBackendInfo.filter($"RowCreationTime" > latestRowCreationTime)
The problem I see is that the first line triggers a job on the Databricks cluster, making it slower.
Is there a better way to filter (for example, using only transformations instead of an action)?
Below are the schemas of the two DataFrames:
case class Version(BuildVersion: String,
                   MainVersion: String,
                   Hotfix: String,
                   BackendRowCreationTime: Timestamp)
case class BackendInfo(SerialNumber: Integer,
                       NumberOfClients: Long,
                       BuildVersion: String,
                       MainVersion: String,
                       Hotfix: String,
                       RowCreationTime: Timestamp)
The code below worked (DefaultTime is a fallback Timestamp defined elsewhere, used when dfVersion is empty):
val dfLatestRowCreationTime1 = dfVersion.agg(max($"BackendRowCreationTime").as("BackendRowCreationTime")).limit(1)
// Keep the max timestamp, falling back to DefaultTime when it is null
val latestRowCreationTime = dfLatestRowCreationTime1.withColumn("BackendRowCreationTime", when($"BackendRowCreationTime".isNull, DefaultTime).otherwise($"BackendRowCreationTime"))
val dfDateFiltered = dfBackendInfo.join(latestRowCreationTime, dfBackendInfo.col("RowCreationTime").gt(latestRowCreationTime.col("BackendRowCreationTime")))
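A slightly more compact variant of the same idea (a sketch, assuming a SparkSession named spark is in scope; defaultTime is a hypothetical fallback) uses coalesce instead of when/otherwise, and still triggers no action until the result is written or collected:
import java.sql.Timestamp
import org.apache.spark.sql.functions.{coalesce, col, lit, max}

// Hypothetical fallback used when dfVersion is empty; adjust as needed.
val defaultTime = Timestamp.valueOf("1970-01-01 00:00:00")

// Single-row DataFrame holding the max timestamp (or the fallback); transformations only.
val dfLatest = dfVersion
  .agg(max(col("BackendRowCreationTime")).as("BackendRowCreationTime"))
  .select(coalesce(col("BackendRowCreationTime"), lit(defaultTime)).as("BackendRowCreationTime"))

// Inequality join keeps only the rows newer than the latest backend row creation time.
val dfDateFiltered = dfBackendInfo.join(
  dfLatest,
  dfBackendInfo("RowCreationTime") > dfLatest("BackendRowCreationTime")
)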

Filtering data from an RDD using a value sequence in Spark

I need help with the use case below.
Question 1: My RDD is in the format below. From this RDD, I want to exclude the rows where airport.code is in ("PUN","HAR","KAS").
case class airport(code:String,city:String,airportname:String)
val airportRdd=sparkSession.sparkContext.textFile("src/main/resources/airport_data.csv").
map(x=>x.split(","))
val airPortRddTransformed=airportRdd.map(x=>airport(x(0),x(1),x(2)))
val seqValues = Seq("PUN","HAR","KAS")
val trasnformedRdd=airPortRddTransformed.filter(air=>!(air.code.contains(seqValues:_*)))
But ! is not working. It says "cannot resolve symbol !". Can someone please help me? How do I negate this in an RDD? I have to use the RDD approach only.
Also, another question:
Question 2: The data file has 70 columns. I have a column sequence:
val seqColumns=List("lat","longi","height","country")
I want to exclude these columns while loading the RDD. How can I do it? My production RDD has 70 columns; I only know the names of the columns to exclude, not the index of every column. Again, I'm looking for an RDD approach; I am aware of how to do it with the DataFrame approach.
Question 1
Use a broadcast variable to pass the list of values to the filter function. It seems _* does not work inside filter, so I changed the condition to !seqValues.value.contains(air.code).
Data sample: airport_data.csv
C001,Pune,Pune Airport
C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport
C003,New York,New York Airport
C004,Delhi,Delhi Airport
Code snippet
case class airport(code:String,city:String,airportname:String)
val seqValues=spark.sparkContext.broadcast(List("C001","C003"))
val airportRdd = spark.sparkContext.textFile("D:\\DataAnalysis\\airport_data.csv").map(x=>x.split(","))
val airPortRddTransformed = airportRdd.map(x=>airport(x(0),x(1),x(2)))
//airPortRddTransformed.foreach(println)
val trasnformedRdd = airPortRddTransformed.filter(air => !seqValues.value.contains(air.code))
trasnformedRdd.foreach(println)
Output ->
airport(C002,Mumbai,Chhatrapati Shivaji Maharaj International Airport)
airport(C004,Delhi,Delhi Airport)
Things I would change:
1- You are reading the .csv as a text file and then splitting each line on ",". You can skip that step by reading the file directly:
val df = spark.read.csv("src/main/resources/airport_data.csv")
2- Reverse the contains check: test whether the sequence contains the code, not whether the code contains the sequence
val trasnformedRdd = airPortRddTransformed.filter(air => !(seqValues.contains(air.code)))
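If the DataFrame read from point 1 is acceptable, both suggestions can be combined into a short sketch (the column names are assumptions, since the sample file has no header):
import org.apache.spark.sql.functions.col

// Codes to exclude, as in the broadcast example above.
val excluded = Seq("C001", "C003")

// Read the CSV directly and name the columns.
val airportDf = spark.read
  .csv("src/main/resources/airport_data.csv")
  .toDF("code", "city", "airportname")

// Keep only the rows whose code is NOT in the exclusion list.
val filteredDf = airportDf.filter(!col("code").isin(excluded: _*))
filteredDf.show(false)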

Efficiently load CSV coordinate format (COO) input to a local matrix in Spark

I want to convert CSV coordinate format (COO) data into a local matrix. Currently I first convert it to a CoordinateMatrix and then to a local matrix. Is there a better way to do this?
Example data:
0,5,5.486978435
0,3,0.438472867
0,0,6.128832321
0,7,5.295923198
0,1,7.738270234
Code:
var loadG = sqlContext.read.option("header", "false").csv("file.csv").rdd.map("mapfunctionCreatingMatrixEntryOutOfRow")
var G = new CoordinateMatrix(loadG)
var matrixG = G.toBlockMatrix().toLocalMatrix()
A LocalMatrix will be stored on a single machine and hence not make use of Spark's strengths. In other words, using Spark seems a bit wasteful, although still possible.
The easiest way to get the CSV file to a LocalMatrix is to first read the CSV with Scala, not Spark:
val entries = Source.fromFile("data.csv").getLines()
.map(_.split(","))
.map(a => (a(0).toInt, a(1).toInt, a(2).toDouble))
.toSeq
The SparseMatrix variant of the local matrix has a factory method, fromCOO, for building a matrix from COO-formatted data. The number of rows and columns needs to be specified. Since the matrix is sparse, the dimensions should in most cases be supplied by hand, but it is possible to derive them from the maximum indices in the data:
val numRows = entries.map(_._1).max + 1
val numCols = entries.map(_._2).max + 1
Then create the matrix:
val matrixG = SparseMatrix.fromCOO(numRows, numCols, entries)
The matrix will be stored in CSC format on the machine. Printing the example input above will yield the following output:
1 x 8 CSCMatrix
(0,0) 6.128832321
(0,1) 7.738270234
(0,3) 0.438472867
(0,5) 5.486978435
(0,7) 5.295923198
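For completeness, a minimal sketch of the imports the snippets above rely on, plus an optional dense conversion (only reasonable while numRows * numCols stays small):
import scala.io.Source
import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix}

// Dense view of the local matrix built by fromCOO; toDense materializes every cell, zeros included.
val denseG: DenseMatrix = matrixG.toDense
println(denseG)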

Spark Streaming - use previously calculated DataFrame in next iteration

I have a streaming app that takes a DStream, runs a SQL manipulation over the DStream, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to be able to take the previous calculation (from an earlier batch) into account in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculation result in memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
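As a minimal sketch (dstream comes from the question; the storage level is an assumption), persisting the DStream looks like this:
import org.apache.spark.storage.StorageLevel

// Persist every RDD generated by this DStream in memory so later batches can reuse it.
dstream.persist(StorageLevel.MEMORY_ONLY)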
If you only need one or two previously calculated DataFrames, you should look into Spark Streaming window operations.
The snippet below is from the Spark documentation.
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or even simpler: if we want to generate a word count over the last 20 seconds of data, every 10 seconds, we have to apply the reduceByKey operation on the (word, 1) pairs DStream over the last 20 seconds of data. This is done using the reduceByKeyAndWindow operation.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
More details and examples at:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Processing multiple files separately in a single spark-submit job

I have the following directory structure:
/data/modelA
/data/modelB
/data/modelC
..
Each of these files has data in the format (id,score), and I have to do the following for each of them separately:
1) group by score and count, sorting the scores in descending order (DF_1: score, count)
2) from DF_1, compute the cumulative frequency for each sorted group of scores (DF_2: score, count, cumFreq)
3) from DF_2, select the cumulative frequencies that lie between 5 and 10 (DF_3: score, cumFreq)
4) from DF_3, select the minimum score (DF_4: score)
5) from the file, select all ids that have a score greater than the score in DF_4, and save
I am able to do this by reading the directory with wholeTextFiles and creating a common DataFrame for all the models, then using a group by on the model.
Instead, I want to do:
val scores_file = sc.wholeTextFiles("/data/*/")
val scores = scores_file.map{ line =>
//step 1
//step 2
//step 3
//step 4
//step 5 : save as line._1
}
This would let me deal with each file separately and avoid the group by.
Assuming that your models are discrete values and you know them, you can define all the models in a list:
val models = List("modelA", "modelB", "modelC", ... )
and then take the following approach:
models.foreach { model =>
  val scoresPerModel = sc.textFile(s"/data/$model")
  scoresPerModel.map { line =>
    // business logic here
  }
}
If you don't know the models before computing the business logic, you have to list the directory using the Hadoop FileSystem API and extract the models from there.
import org.apache.hadoop.fs.{FileSystem, Path}

private val fs = {
  val conf = new org.apache.hadoop.conf.Configuration()
  FileSystem.get(conf)
}
// Recursively lists the files under hdfsPath; returns a RemoteIterator[LocatedFileStatus]
fs.listFiles(new Path(hdfsPath), true)
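As a hedged sketch (the base path /data and the loop body are assumptions), the listing can then drive the same per-model processing as above:
import org.apache.hadoop.fs.{FileSystem, Path}

// List the model directories under /data and process each one separately.
val hadoopFs = FileSystem.get(sc.hadoopConfiguration)
val modelPaths = hadoopFs.listStatus(new Path("/data"))
  .filter(_.isDirectory)
  .map(_.getPath.toString)

modelPaths.foreach { path =>
  val scoresPerModel = sc.textFile(path)
  scoresPerModel.map { line =>
    // business logic here (steps 1 to 5 from the question)
    line
  }
}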