Spark mapping a flat file to a class - scala

I have a flat file on HDFS containing a list of companies
CompanyA
CompanyA Description
April '12
San Fran
11-50
CompanyB
...
and I want to map this into a Company class
case class Company(company: String,
desc: String,
founded: Date,
location: String,
employees: String)
I have tried the following but it doesn't seem to map properly
val companiesText = sc.textFile(...)
val companies = companiesText.map(
  lines => Company(
    lines(0).toString.replaceAll("\"", ""),
    lines(1).toString.replaceAll("\"", ""),
    lines(2).toString.replaceAll("\"", ""),
    lines(3).toString.replaceAll("\"", ""),
    lines(4).toString.replaceAll("\"", "")
  )
)
I know I am not handling the date properly here, but that is not the issue.
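One rough way forward (a sketch, not the definitive fix): textFile gives one element per line, so the lines(0)…lines(4) calls above index characters of a single line rather than fields of a record. If each company really occupies exactly five consecutive lines with no blank lines in between, the lines can be grouped by position; the file path and the "MMMM ''yy" date pattern below are placeholders/guesses based on the sample.
import java.text.SimpleDateFormat
import java.util.Locale

val companiesText = sc.textFile("hdfs:///path/to/companies.txt")  // hypothetical path

val companies = companiesText
  .zipWithIndex()                                   // pair each line with its position in the file
  .map { case (line, idx) => (idx / 5, (idx % 5, line)) }
  .groupByKey()                                     // gather the five lines belonging to one company
  .map { case (_, numbered) =>
    val fields = numbered.toSeq.sortBy(_._1).map(_._2.replaceAll("\"", ""))
    val df = new SimpleDateFormat("MMMM ''yy", Locale.US)  // guessed format for e.g. April '12
    Company(fields(0), fields(1), df.parse(fields(2)), fields(3), fields(4))
  }

companies.take(2).foreach(println)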

Address parser in spark scala

I have a csv file containing addressId and address of the customers as below
addressId,address
ADD001,"123, Maccolm Street, Copenhagen, Denmark"
ADD002,"384, East Avenue Street, New York, USA"
I want to parse the address column to get the number, street, city and country. I was given the following initial code to build on to get the necessary output:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object Address extends App {
  val spark = SparkSession.builder().master("local[*]").appName("CustomerAddress").getOrCreate()
  import spark.implicits._
  Logger.getRootLogger.setLevel(Level.WARN)

  case class AddressRawData(
    addressId: String,
    address: String
  )

  case class AddressData(
    addressId: String,
    address: String,
    number: Option[Int],
    road: Option[String],
    city: Option[String],
    country: Option[String]
  )

  def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
    unparsedAddress.map(address => {
      val split = address.address.split(", ")
      address.copy(
        number = Some(split(0).toInt),
        road = Some(split(1)),
        city = Some(split(2)),
        country = Some(split(3))
      )
    })
  }

  val addressDF: DataFrame = spark.read.option("header", "true").csv("src/main/resources/address_data.csv")
  val addressDS: Dataset[AddressRawData] = addressDF.as[AddressRawData]
}
I assume that I need to use the addressParser function to parse my addressDS information. However, the parameter to the function is of type Seq. I am not sure how I should convert addressDS into an input for the function to parse the raw data. Some guidance on how to solve this would be appreciated.
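One way to move forward, as a sketch rather than a definitive answer: either collect the Dataset, feed it through the given addressParser and turn the result back into a Dataset, or skip the Seq detour and parse each row directly with a map on the Dataset. Both assume every address has exactly four comma-separated parts.
// Option 1: reuse addressParser by collecting to the driver (fine for small files),
// converting the raw rows to AddressData with empty optional fields first.
val rawSeq: Seq[AddressData] = addressDS.collect().toSeq.map(raw =>
  AddressData(raw.addressId, raw.address, None, None, None, None))
val parsedFromSeq: Dataset[AddressData] = addressParser(rawSeq).toDS()

// Option 2: stay distributed and parse inside a map on the Dataset itself.
val parsedDS: Dataset[AddressData] = addressDS.map { raw =>
  val split = raw.address.split(", ")
  AddressData(raw.addressId, raw.address,
    Some(split(0).toInt), Some(split(1)), Some(split(2)), Some(split(3)))
}

parsedDS.show(truncate = false)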

Failed to read HTTP message error due to null value passed to id: spring-boot, kotlin, mongodb

I'm completely new to Kotlin programming and MongoDB. I'm defining a data class in which all the fields are non-nullable and all the fields are val:
data class Order(
@Id
val id: String,
val id: String,
val customerId: String,
val externalTransactionId : String,
val quoteId :String,
val manifestItems : List<ManifestItem>,
val externalTokenNumber : String,
val deliveryId : String,
val quoteCreatedTime: String,
val deliveryCreatedTime: String,
val status : String,
val deliveryInfo: DeliveryInfo,
val pickupInfo: PickupInfo,
val riderId : String,
val currency : String,
val expiryTime : String,
val trackingUrl : String,
val complete:Boolean,
val updated:String
)
and I'm sending a http request with following body
{
"pickupAddress":"101 Market St, San Francisco, CA 94105",
"deliveryAddress":"101 Market St, San Francisco, CA 94105",
"deliveryExpectedTime":"2018-07-25T23:31:38Z",
"deliveryAddressLatitude":7.234,
"deliveryAddressLongitude":80.000,
"pickupLatitude":7.344,
"pickupLongitude":8.00,
"pickupReadyTime":""
}
In my router class I get the request body as an Order object and send it to the service class:
val request = serverRequest.awaitBody<Order>()
val quoteResponse = quoteService.createQuote(request,customerId)
In my service class I'm saving the order to the database:
suspend fun createQuote(order: Order,customerId:String):QuoteResponse {
ordersRepo.save(order).awaitFirst()
//generating quote response here
return quoteResponse
}
The id is generated at the database, and I'm getting this kind of error when sending the request:
org.springframework.web.server.ServerWebInputException: 400 BAD_REQUEST "Failed to read HTTP message"; nested exception is org.springframework.core.codec.DecodingException: JSON decoding error: Instantiation of [simple type, class basepackage.repo.Order] value failed for JSON property id due to missing (therefore NULL) value for creator parameter id which is a non-nullable type; nested exception is com.fasterxml.jackson.module.kotlin.MissingKotlinParameterException: Instantiation of [simple type, class basepackage.repo.Order] value failed for JSON property id due to missing (therefore NULL) value for creator parameter id which is a non-nullable type
at [Source: (org.springframework.core.io.buffer.DefaultDataBuffer$DefaultDataBufferInputStream); line: 12, column: 1] (through reference chain: basepackage.repo.Order["id"])
How do I overcome this problem?
If you are using MongoDB, you need to use ObjectId instead of String for the id; it is not generated by itself, so you would otherwise have to create it yourself every time.
The following code is probably a solution:
import com.fasterxml.jackson.annotation.JsonProperty
import com.fasterxml.jackson.databind.annotation.JsonSerialize
import com.fasterxml.jackson.databind.ser.std.ToStringSerializer
import org.bson.types.ObjectId
import org.springframework.data.annotation.Id

data class Order constructor(
@Id
@field:JsonSerialize(using = ToStringSerializer::class)
@field:JsonProperty(access = JsonProperty.Access.READ_ONLY)
var id: ObjectId?,
val customerId: String,
val externalTransactionId: String,
val quoteId: String,
val manifestItems: List<ManifestItem>,
val externalTokenNumber: String,
val deliveryId: String,
val quoteCreatedTime: String,
val deliveryCreatedTime: String,
val status: String,
val deliveryInfo: DeliveryInfo,
val pickupInfo: PickupInfo,
val riderId: String,
val currency: String,
val expiryTime: String,
val trackingUrl: String,
val complete: Boolean,
val updated: String
)
I converted val to var so that a value can be set on it afterwards, and set the access to READ_ONLY so that Jackson no longer expects id as input.
ToStringSerializer is optional; use it if you want the id serialized as a String.

Read a csv file using scala and generate analytics

I have started learning Scala and I have tried to solve the scenario below. I have an input file with multiple transactions whose fields are separated by ','. Below are my sample values:
transactionId, accountId, transactionDay, category, transactionAmount
A11,A45,1,SA,340
A12,A2,1,FD,567
and I have to calculate the total transaction value for all transactions for each day, along with other statistics. Below is my initial snippet:
import scala.io.Source

val fileName = "<path of input file>"

case class Transaction(
  transactionId: String, accountId: String,
  transactionDay: Int, category: String,
  transactionAmount: Double)

val transactionslines = Source.fromFile(fileName).getLines().drop(1)
val transactions: List[Transaction] = transactionslines.map { line =>
  val split = line.split(',')
  Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
}.toList
You can do it as below:
val sd=transactions.groupBy(_.transactionDay).mapValues(_.map(_.transactionAmount).sum)
Further, you can do more complex analytics by converting it into a DataFrame:
import spark.implicits._
val scalatoDF = spark.sparkContext.parallelize(transactions).toDF("transactionId","accountId","transactionDay","category","transactionAmount")
scalatoDF.show()
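With the data in a DataFrame, the same per-day total can also be computed with the DataFrame API; a small sketch using the column names defined above:
import org.apache.spark.sql.functions._

// Total transaction amount per day, the DataFrame equivalent of the groupBy above.
scalatoDF
  .groupBy("transactionDay")
  .agg(sum("transactionAmount").alias("totalDailyAmount"))
  .orderBy("transactionDay")
  .show()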
Hope this helps!

How to write a toCSV method for scala case class that would generate the csv string for the class?

I have Scala case classes that are something like below:
case class Address(
city: Option[String] = None,
country: Option[String] = None
)
case class Student(
id: Option[String] = None,
name: Option[String] = None,
address: Option[Seq[Address]] = None,
phone: Option[Seq[String]] = None
)
Now I want to write a toCSV method for Student that would generate a list of CSV strings/lines for each student. I am unable to determine how to generate the string format for fields that can have multiple values, such as address and phone.
For a student,
val student_1 = Student(
id = Some("1"),
name = Some("john"),
address = Some(Seq(
Address(Some("Newyork"),Some("USA")),
Address(Some("Berlin"),Some("Germany")),
Address(Some("Tokyo"),Some("Japan")),
)),
phone = Some(Seq(
"1111","9999","8888"
))
)
So, student_1.toCSV must result in the following CSV string:
id, name, address.city, address.country, phone
1 , John, Newyork , USA , 1111/9999/8888
, , Berlin , Germany ,
, , Tokyo , Japan ,
This is the CSV list of strings, where the first string represents the first row and so on. I need to generate this list of strings for each student. Note that there could be multiple lines for each student because address and phone can have multiple values. In this case, there are 3 addresses and 3 phones for student John.
How do I achieve this in Scala?
Addition:
So far, I am working to produce a list of CSV lines, i.e. a list of lists where each inner list stores one row.
So, the list would look like below:
List(
List("id","name","address.city","address.country","phone"),
List("1" ,"John","Newyork" ,"USA" ,"11111/22222"),
List("" ,"" ,"Berlin" ,"Germany" ,""),
List("" ,"" ,"Tokyo" ,"Japan" ,"")
)
In this, I simplified your Address type to just a String, and I kept the phone layout as you had it originally (i.e. the one I complained about in the comments). So this is more of a proof of concept than a finished product.
val student = Student(Some("1")
, Some("John")
, Some(Seq("Newyork", "Berlin", "Tokyo"))
, Some(Seq("1111","9999"))
)
student match {
case Student(i,n,a,p) =>
val maxLen = a.getOrElse(Seq("")).length max p.getOrElse(Seq("")).length
Seq( Seq(i.getOrElse("")).padTo(maxLen,"")
, Seq(n.getOrElse("")).padTo(maxLen,"")
, a.getOrElse(Seq()).padTo(maxLen,"")
, p.getOrElse(Seq()).padTo(maxLen,"")
).transpose
}
// res0: Seq[Seq[String]] = List( List(1, John, Newyork, 1111)
// , List(, , Berlin, 9999)
// , List(, , Tokyo, ))
Based on the answer provided by @jwvh above,
I have come up with a solution which is working for me now:
val student_1 = Student(
id = Some("1"),
name = Some("john"),
address = Some(Seq(
Address(Some("Newyork"),Some("USA")),
Address(Some("Berlin"),Some("Germany")),
Address(Some("Tokyo"),Some("Japan"))
)),
phone = Some(Seq(
"1111","9999","8888"
))
)
// These methods are defined inside the Student case class, so id, name,
// address and phone below refer to its fields.
def csvHeaders: List[String] = {
  List("StudentId", "Name", "Address.City", "Address.Country", "Phones")
}

def toCSV: List[List[String]] = {
  val maximumLength = address.getOrElse(Seq.empty[Address]).length max 1
  // phone.getOrElse(Seq.empty[String]).length was used in the earlier case where
  // phones were kept in separate rows, as done by @jwvh above
  val idList = List.tabulate(maximumLength)(k => " ").updated(0, id.getOrElse(""))
  val nameList = List.tabulate(maximumLength)(k => " ").updated(0, name.getOrElse(""))
  val addressCityList = if (address.isDefined) {
    address.get.map {
      k => k.city.getOrElse(" ")
    }.toList.padTo(maximumLength, " ")
  } else {
    List.tabulate(maximumLength)(k => " ")
  }
  val addressCountryList = if (address.isDefined) {
    address.get.map {
      k => k.country.getOrElse(" ")
    }.toList.padTo(maximumLength, " ")
  } else {
    List.tabulate(maximumLength)(k => " ")
  }
  val phoneList = if (phone.isDefined) {
    List.tabulate(maximumLength)(k => " ").updated(0, phone.get.padTo(maximumLength, " ").mkString("/"))
  } else {
    List.tabulate(maximumLength)(k => " ")
  }
  val transposedList: List[List[String]] = List(idList, nameList, addressCityList, addressCountryList, phoneList).transpose
  transposedList.+:(csvHeaders)
}
So, now student_1.toCSV will return:
/* List(
List(StudentId, Name, Address.City, Address.Country, Phones),
List(1, john, Newyork, USA, 1111/9999/8888),
List( , , Berlin, Germany, ),
List( , , Tokyo, Japan, )
) */
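If actual CSV text is needed rather than nested lists, each row can simply be joined with commas, for example:
// Turn the List[List[String]] into CSV lines, one string per row.
val csvLines: List[String] = student_1.toCSV.map(_.mkString(","))
csvLines.foreach(println)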

How to convert a datatype in Spark SQL to a specific datatype and map the RDD result to a specific class

I am reading a csv file and need to create an RDD with a schema.
I read the file using sqlContext.csvFile:
val testfile = sqlContext.csvFile("file")
testfile.registerTempTable("testtable")
I want to pick some of the fields and return an RDD of those fields.
For example: class Test(ID: String, order_date: Date, Name: String, value: Double)
Using sqlContext.sql("Select col1, col2, col3, col4 FROM ..."):
val testfile = sqlContext.sql("Select col1, col2, col3, col4 FROM testtable").collect
testfile.getClass
Class[_ <: Array[org.apache.spark.sql.Row]] = class [Lorg.apache.spark.sql.Row;
So I want to change col1 to a double, col2 to a date, and col3 to a string?
Is there a way to do this in sqlContext.sql, or do I have to run a map function over the result and then turn it back into an RDD?
I tried to do it in one statement and I got this error:
val old_rdd : RDD[Test] = sqlContext.sql("SELECT col, col2, col3,col4 FROM testtable").collect.map(t => (t(0) : String ,dateFormat.parse(dateFormat.format(1)),t(2) : String, t(3) : Double))
The issue I am having is that the assignment does not result in RDD[Test], where Test is a defined class.
The error is saying that the map command is producing an Array class and not an RDD class:
found : Array[edu.model.Test]
[error] required: org.apache.spark.rdd.RDD[edu.model.Test]
Let's say you have a case class like this:
case class Test(
ID: String, order_date: java.sql.Date, Name: String, value: Double)
Since you load your data with csvFile using default parameters, it doesn't perform any schema inference and your data is stored as plain strings. Let's assume that there are no other fields:
val df = sc.parallelize(
("ORD1", "2016-01-02", "foo", "2.23") ::
("ORD2", "2016-07-03", "bar", "9.99") :: Nil
).toDF("col1", "col2", "col3", "col4")
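A quick check (a sketch) confirms that every column is a plain string at this point:
// All four columns are StringType here, since no schema inference was performed.
df.printSchema()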
Your attempt to use map is wrong for more than one reason:
the function you use annotates individual values with incorrect types. Not only is Row.apply of type Int => Any, but your data shouldn't contain any Double values at this point anyway
since you collect (which doesn't make sense here), you fetch all the data to the driver and the result is a local Array, not an RDD
finally, if all previous issues were resolved, (String, Date, String, Double) is clearly not a Test
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val casted = df.select(
$"col1".alias("ID"),
$"col2".cast("date").alias("order_date"),
$"col3".alias("name"),
$"col4".cast("double").alias("value")
)
val tests: RDD[Test] = casted.map {
case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
Test(id, date, name, value)
}
You can also try to use the new Dataset API, but it is far from stable:
casted.as[Test].rdd
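For completeness, a quick usage sketch of that Dataset route (assuming the encoder implicits are in scope, e.g. via import sqlContext.implicits._):
// Typed Dataset first, then back to an RDD[Test] for downstream use.
val testsFromDS: RDD[Test] = casted.as[Test].rdd
testsFromDS.take(2).foreach(println)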