Read a csv file using scala and generate analytics - scala

I have started learning Scala and I have tried to solve the scenario below. I have an input file with multiple transactions whose fields are separated by ','. Below are my sample values:
transactionId, accountId, transactionDay, category, transactionAmount
A11,A45,1,SA,340
A12,A2,1,FD,567
I have to calculate the total transaction value for all transactions for each day, along with other statistics. Below is my initial snippet:
import scala.io.Source
val fileName = "<path of input file>"
case class Transaction(
  transactionId: String, accountId: String,
  transactionDay: Int, category: String,
  transactionAmount: Double)

val transactionslines = Source.fromFile(fileName).getLines().drop(1)

val transactions: List[Transaction] = transactionslines.map { line =>
  val split = line.split(',')
  Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
}.toList

You can do it as below:
val sd = transactions.groupBy(_.transactionDay).mapValues(_.map(_.transactionAmount).sum)
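The question also asks for other statistics per day; here is a minimal sketch along the same lines (which statistics you actually need is an assumption here):
val statsPerDay = transactions
  .groupBy(_.transactionDay)
  .mapValues { dayTxs =>
    val amounts = dayTxs.map(_.transactionAmount)
    // (total, average, maximum) per day -- adjust to whatever statistics you need
    (amounts.sum, amounts.sum / amounts.size, amounts.max)
  }
statsPerDay.foreach(println)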
Further, you can do more complex analytics by converting it into a DataFrame.
val scalatoDF = spark.sparkContext.parallelize(transactions).toDF("transactionId","accountId","transactionDay","category","transactionAmount")
scalatoDF.show()
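Note that calling toDF on an RDD of a case class needs the SparkSession implicits in scope (and, in a compiled application, the case class defined outside the method). A rough sketch of the equivalent DataFrame aggregation, assuming a SparkSession named spark:
import spark.implicits._
import org.apache.spark.sql.functions.sum

val txDF = spark.sparkContext.parallelize(transactions).toDF()
txDF.groupBy("transactionDay")
  .agg(sum("transactionAmount").alias("totalAmount"))
  .show()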
Hope this helps!

Related

scala spark rdd joining two tables with the same id

I have the following rdds:
case class Rating(user_ID: Integer, movie_ID: Integer, rating: Integer, timestamp: String)
case class Movie(movie_ID: Integer, title: String, genre: String)
I join them together in Scala, like:
val m = datamovie.keyBy(_.movie_ID)
val r = data.keyBy(_.movie_ID)
val mr = m.join(r)
I get back my result as RDD[(Int, (Movie, Rating))].
How can I print the title of the movies that have a rating of 5, for example? I am not quite sure how to work with the new RDD that was created by the join!
Convert them to Spark DataFrames and perform the join there. Is there a specific reason you wanted to keep them as RDDs?
val m = datamovie.toDF
val r = data.toDF
val mr = m.join(r, Seq("movie_ID"), "left").where($"rating" === 5).select($"title")
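If you would rather stay with RDDs, the RDD[(Int, (Movie, Rating))] produced by the question's m.join(r) on the keyed RDDs can be filtered and mapped directly; a minimal sketch:
// keep only the pairs whose rating is 5, then pull out the movie title
val titlesWithRating5 = mr
  .filter { case (_, (_, rating)) => rating.rating == 5 }
  .map { case (_, (movie, _)) => movie.title }

titlesWithRating5.collect().foreach(println)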

Spark SQL convert dataset to dataframe

How do I convert a Dataset object to a DataFrame? In my example, I am converting a JSON file to a DataFrame and then converting it to a Dataset. In the Dataset, I have added an additional attribute (newColumn) and then convert it back to a DataFrame. Here is my example code:
val empData = sparkSession.read.option("header", "true").option("inferSchema", "true").option("multiline", "true").json(filePath)
.....
import sparkSession.implicits._
val res = empData.as[Emp]
//for (i <- res.take(4)) println(i.name + " ->" + i.newColumn)
val s = res.toDF();
s.printSchema()
}
case class Emp(name: String, gender: String, company: String, address: String) {
val newColumn = if (gender == "male") "Not-allowed" else "Allowed"
}
But I expected the new column newColumn to show up in the s.printSchema() output. It is not happening. Why? How can I achieve this?
The schema of the output with a Product encoder is determined solely by its constructor signature. Therefore, anything that happens in the body is simply discarded.
You can map the derived value out explicitly, for example:
res.map(x => (x, x.newColumn)).toDF("value", "newColumn")
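Alternatively, if all you need is the derived column in the schema, you can compute it with the DataFrame API instead of the case class body; a sketch, assuming empData really has a gender column:
import org.apache.spark.sql.functions.{col, when}

val withFlag = empData.withColumn(
  "newColumn",
  when(col("gender") === "male", "Not-allowed").otherwise("Allowed"))
withFlag.printSchema() // now includes newColumn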

How to apply sum and groupby on a list in Scala?

I am trying to use the groupBy method on transactionDay, take the sum of the transactionAmount, and print the output.
case class Transaction(
transactionId: String,
accountId: String,
transactionDay: Int,
category: String,
transactionAmount: Double)
I created a list like this:
val transactions: List[Transaction] = transactionslines.map { line =>
val split = line.split(',')
Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
}.toList
Can anyone help with using the groupBy method?
If you have any documents to share, that would be really helpful.
The following code should work to get the result you require:
val transactions = transactionslines.map( line => line.split(","))
.map(split => Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble))
transactions.groupBy(_.transactionDay)
  .mapValues(trans => trans.map(_.transactionAmount).sum)
  .foreach(println)
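With the sample rows from the first question above (A11 and A12, both on day 1), this prints (1,907.0), i.e. 340 + 567 for day 1.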

How to handle dates in Spark using Scala?

I have a flat file that looks like the one below.
id,name,desg,tdate
1,Alex,Business Manager,2016-01-01
I am using the Spark Context to read this file as follows.
val myFile = sc.textFile("file.txt")
I want to generate a Spark DataFrame from this file and I am using the following code to do so.
case class Record(id: Int, name: String,desg:String,tdate:String)
val myFile1 = myFile.map(x=>x.split(",")).map {
case Array(id, name,desg,tdate) => Record(id.toInt, name,desg,tdate)
}
myFile1.toDF()
This is giving me a DataFrame with id as int and rest of the columns as String.
I want the last column, tdate, to be casted to date type.
How can I do that?
You just need to convert the String to a java.sql.Date object. Then, your code can simply become:
import java.sql.Date

case class Record(id: Int, name: String, desg: String, tdate: Date)

val myFile1 = myFile.map(x => x.split(",")).map {
  case Array(id, name, desg, tdate) => Record(id.toInt, name, desg, Date.valueOf(tdate))
}

myFile1.toDF()
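An alternative, if you would rather keep tdate as a String in the case class, is to cast the column after building the DataFrame; a minimal sketch, assuming the original Record definition from the question:
import org.apache.spark.sql.functions.col

// cast the string column to DateType at the DataFrame level
val withDate = myFile1.toDF().withColumn("tdate", col("tdate").cast("date"))
withDate.printSchema()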

How to convert datatypes in Spark SQL to specific types but get the RDD result as a specific class

I am reading a CSV file and need to create an RDD with a schema.
I read the file by using sqlContext.csvFile:
val testfile = sqlContext.csvFile("file")
testfile.registerTempTable("testtable")
I wanted to pick some of the fields and return an RDD of those fields.
For example: class Test(ID: String, order_date: Date, Name: String, value: Double)
Using sqlContext.sql("Select col1, col2, col3, col4 FROM ..."):
val testfile = sqlContext.sql("Select col1, col2, col3, col4 FROM testtable").collect
testfile.getClass
Class[_ <: Array[org.apache.spark.sql.Row]] = class [Lorg.apache.spark.sql.Row;
So I want to change col1 to a double, col2 to a date, and col3 to a string.
Is there a way to do this in the sqlContext.sql call, or do I have to run a map function on the result and then turn it back into an RDD?
I tried to do it in one statement and I got this error:
val old_rdd : RDD[Test] = sqlContext.sql("SELECT col, col2, col3,col4 FROM testtable").collect.map(t => (t(0) : String ,dateFormat.parse(dateFormat.format(1)),t(2) : String, t(3) : Double))
The issue I am having is that the assignment does not result in an RDD[Test], where Test is a class I have defined.
The error says that the map is producing an Array, not an RDD:
found : Array[edu.model.Test]
[error] required: org.apache.spark.rdd.RDD[edu.model.Test]
Let's say you have a case class like this:
case class Test(
ID: String, order_date: java.sql.Date, Name: String, value: Double)
Since you load your data with csvFile with default parameters, it doesn't perform any schema inference and your data is stored as plain strings. Let's assume that there are no other fields:
val df = sc.parallelize(
("ORD1", "2016-01-02", "foo", "2.23") ::
("ORD2", "2016-07-03", "bar", "9.99") :: Nil
).toDF("col1", "col2", "col3", "col4")
Your attempt to use map is wrong for more than one reason:
the function you use annotates individual values with incorrect types. Not only is Row.apply of type Int => Any, but your table shouldn't contain any Double values in the first place, since everything was loaded as a String;
since you collect (which doesn't make sense here), you fetch all the data to the driver and the result is a local Array, not an RDD;
finally, even if all the previous issues were resolved, (String, Date, String, Double) is clearly not a Test.
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val casted = df.select(
$"col1".alias("ID"),
$"col2".cast("date").alias("order_date"),
$"col3".alias("name"),
$"col4".cast("double").alias("value")
)
val tests: RDD[Test] = casted.map {
case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
Test(id, date, name, value)
}
You can also try to use the new Dataset API, but it is far from stable:
casted.as[Test].rdd
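For the as[Test] line to compile you need the sqlContext implicits in scope; a hedged sketch of that route, with a rename so the column name lines up exactly with the Test field:
import sqlContext.implicits._

val testsViaDS: RDD[Test] = casted
  .withColumnRenamed("name", "Name") // match the Test constructor parameter exactly
  .as[Test]
  .rdd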