I have a CSV file containing the addressId and address of customers, as below:
addressId,address
ADD001,"123, Maccolm Street, Copenhagen, Denmark"
ADD002,"384, East Avenue Street, New York, USA"
I want to parse the address column into number, street, city, and country. I was given initial code to build on to produce the necessary output:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object Address extends App {
  val spark = SparkSession.builder().master("local[*]").appName("CustomerAddress").getOrCreate()
  import spark.implicits._

  Logger.getRootLogger.setLevel(Level.WARN)

  case class AddressRawData(
    addressId: String,
    address: String
  )

  case class AddressData(
    addressId: String,
    address: String,
    number: Option[Int],
    road: Option[String],
    city: Option[String],
    country: Option[String]
  )

  def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
    unparsedAddress.map { address =>
      val split = address.address.split(", ")
      address.copy(
        number = Some(split(0).toInt),
        road = Some(split(1)),
        city = Some(split(2)),
        country = Some(split(3))
      )
    }
  }

  val addressDF: DataFrame = spark.read.option("header", "true").csv("src/main/resources/address_data.csv")
  val addressDS: Dataset[AddressRawData] = addressDF.as[AddressRawData]
}
I assume that I need to use the addressParser function to parse my addressDS data. However, the parameter to the function is of type Seq, and I am not sure how to convert addressDS into a suitable input for the function. Any guidance on solving this is appreciated.
Related
I want to parse a column into split values using a Seq of an object.
case class RawData(rawId: String, rawData: String)

case class SplitData(
  rawId: String,
  rawData: String,
  split1: Option[Int],
  split2: Option[String],
  split3: Option[String],
  split4: Option[String]
)

def rawDataParser(unparsedRawData: Seq[RawData]): Seq[SplitData] = {
  unparsedRawData.map { rawData =>
    val split = rawData.rawData.split(", ")
    SplitData(
      rawId = rawData.rawId,
      rawData = rawData.rawData,
      split1 = Some(split(0).toInt),
      split2 = Some(split(1)),
      split3 = Some(split(2)),
      split4 = Some(split(3))
    )
  }
}
val rawDataDF = Seq[(String, String)](
  ("001", "1, Split2, Split3, Split4"),
  ("002", "2, Split2, Split3, Split4")
).toDF("rawId", "rawData")

val rawDataDS: Dataset[RawData] = rawDataDF.as[RawData]
I need to use the rawDataParser function to parse my rawData. However, the parameter to the function is of type Seq, and I am not sure how to convert rawDataDS into a suitable input for the function. Any guidance on solving this is appreciated.
Each Dataset is further divided into partitions. You can use mapPartitions with a mapping Iterator[T] => Iterator[U] to convert a Dataset[T] into a Dataset[U].
So, you can just use your addressParser as the argument to mapPartitions.
val rawAddressDataDS =
  spark.read
    .option("header", "true")
    .csv(csvFilePath)
    .as[AddressRawData]

val addressDataDS =
  rawAddressDataDS
    .map { rad =>
      AddressData(
        addressId = rad.addressId,
        address = rad.address,
        number = None,
        road = None,
        city = None,
        country = None
      )
    }
    .mapPartitions { unparsedAddresses =>
      addressParser(unparsedAddresses.toSeq).iterator
    }
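The bridging step in that mapPartitions call is just an Iterator → Seq → Iterator round trip. A minimal plain-Scala sketch of the same pattern, with doubling standing in for the address parsing:

```scala
// A Seq-based function, playing the role of addressParser:
def seqBasedParser(xs: Seq[Int]): Seq[Int] = xs.map(_ * 2)

// mapPartitions hands over an Iterator[T] and expects an Iterator[U] back,
// so we materialize the partition, apply the Seq function, and re-wrap:
val partition: Iterator[Int] = Iterator(1, 2, 3)
val result: Iterator[Int] = seqBasedParser(partition.toSeq).iterator

// result.toList == List(2, 4, 6)
```

Note that .toSeq materializes a whole partition in memory at once, which is fine for modest partition sizes but worth keeping in mind for very large ones.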
I'm completely new to Kotlin programming and MongoDB. I'm defining a data class in which all the fields are non-nullable vals:
data class Order(
    @Id
    val id: String,
    val customerId: String,
    val externalTransactionId: String,
    val quoteId: String,
    val manifestItems: List<ManifestItem>,
    val externalTokenNumber: String,
    val deliveryId: String,
    val quoteCreatedTime: String,
    val deliveryCreatedTime: String,
    val status: String,
    val deliveryInfo: DeliveryInfo,
    val pickupInfo: PickupInfo,
    val riderId: String,
    val currency: String,
    val expiryTime: String,
    val trackingUrl: String,
    val complete: Boolean,
    val updated: String
)
and I'm sending an HTTP request with the following body:
{
"pickupAddress":"101 Market St, San Francisco, CA 94105",
"deliveryAddress":"101 Market St, San Francisco, CA 94105",
"deliveryExpectedTime":"2018-07-25T23:31:38Z",
"deliveryAddressLatitude":7.234,
"deliveryAddressLongitude":80.000,
"pickupLatitude":7.344,
"pickupLongitude":8.00,
"pickupReadyTime":""
}
In my router class I read the request body into an Order object and send it to the service class:
val request = serverRequest.awaitBody<Order>()
val quoteResponse = quoteService.createQuote(request,customerId)
In my service class I save the order to the database:
suspend fun createQuote(order: Order, customerId: String): QuoteResponse {
    ordersRepo.save(order).awaitFirst()
    // generating quote response here
    return quoteResponse
}
The id is generated at the database, and I'm getting this kind of error when sending the request:
org.springframework.web.server.ServerWebInputException: 400 BAD_REQUEST "Failed to read HTTP message"; nested exception is org.springframework.core.codec.DecodingException: JSON decoding error: Instantiation of [simple type, class basepackage.repo.Order] value failed for JSON property id due to missing (therefore NULL) value for creator parameter id which is a non-nullable type; nested exception is com.fasterxml.jackson.module.kotlin.MissingKotlinParameterException: Instantiation of [simple type, class basepackage.repo.Order] value failed for JSON property id due to missing (therefore NULL) value for creator parameter id which is a non-nullable type
at [Source: (org.springframework.core.io.buffer.DefaultDataBuffer$DefaultDataBufferInputStream); line: 12, column: 1] (through reference chain: basepackage.repo.Order["id"])
How do I overcome this problem?
If you are using MongoDB, you need to use ObjectId instead of String for the id; it is not generated by itself on deserialization, so you need to account for it every time. The following code is probably a solution:
data class Order constructor(
    @Id
    @field:JsonSerialize(using = ToStringSerializer::class)
    @field:JsonProperty(access = JsonProperty.Access.READ_ONLY)
    var id: ObjectId?,
    val customerId: String,
    val externalTransactionId: String,
    val quoteId: String,
    val manifestItems: List<ManifestItem>,
    val externalTokenNumber: String,
    val deliveryId: String,
    val quoteCreatedTime: String,
    val deliveryCreatedTime: String,
    val status: String,
    val deliveryInfo: DeliveryInfo,
    val pickupInfo: PickupInfo,
    val riderId: String,
    val currency: String,
    val expiryTime: String,
    val trackingUrl: String,
    val complete: Boolean,
    val updated: String
)
I converted val to var so that you can set a value on it after deserialization, and set access to READ_ONLY so that Jackson no longer expects id as input.
ToStringSerializer is optional, in case you prefer the id serialized as a String.
I have started learning Scala and I have tried to solve the scenario below. I have an input file with multiple transactions separated by ','. Below are my sample values:
transactionId, accountId, transactionDay, category, transactionAmount
A11,A45,1,SA,340
A12,A2,1,FD,567
and I have to calculate the total transaction value for all transactions for each day, along with other statistics. Below is my initial snippet:
import scala.io.Source

val fileName = "<path of input file>"

case class Transaction(
  transactionId: String, accountId: String,
  transactionDay: Int, category: String,
  transactionAmount: Double)

val transactionLines = Source.fromFile(fileName).getLines().drop(1)

val transactions: List[Transaction] = transactionLines.map { line =>
  val split = line.split(',')
  Transaction(split(0), split(1), split(2).toInt, split(3), split(4).toDouble)
}.toList
You can do it as below:
val sd = transactions.groupBy(_.transactionDay).mapValues(_.map(_.transactionAmount).sum)
Further, you can do more complex analytics by converting it into a DataFrame:
// requires a SparkSession in scope plus `import spark.implicits._`
val scalatoDF = spark.sparkContext.parallelize(transactions).toDF("transactionId", "accountId", "transactionDay", "category", "transactionAmount")
scalatoDF.show()
Hope this helps!
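With the two sample rows above, the per-day grouping can be checked without Spark; a self-contained sketch (using .map over the grouped entries, which avoids the deprecated .mapValues view in newer Scala versions):

```scala
case class Transaction(
  transactionId: String, accountId: String,
  transactionDay: Int, category: String,
  transactionAmount: Double)

val transactions = List(
  Transaction("A11", "A45", 1, "SA", 340),
  Transaction("A12", "A2", 1, "FD", 567)
)

// Total transaction value per day.
val totalsByDay: Map[Int, Double] =
  transactions.groupBy(_.transactionDay)
    .map { case (day, txs) => day -> txs.map(_.transactionAmount).sum }

// totalsByDay == Map(1 -> 907.0)
```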
I have a Scala case class like the one below.
case class Address(
city: Option[String] = None,
country: Option[String] = None
)
case class Student(
id: Option[String] = None,
name: Option[String] = None,
address: Option[Seq[Address]] = None,
phone: Option[Seq[String]] = None
)
Now I want to write a toCSV method for student that would generate a list of csv strings/lines for each student. I am unable to determine how to generate string format for fields that can have multiple values such as: address and phone.
For a student,
val student_1 = Student(
id = Some("1"),
name = Some("john"),
address = Some(Seq(
Address(Some("Newyork"),Some("USA")),
Address(Some("Berlin"),Some("Germany")),
Address(Some("Tokyo"),Some("Japan"))
)),
phone = Some(Seq(
"1111","9999","8888"
))
)
So, student_1.toCSV must result in the following CSV string:
id, name, address.city, address.country, phone
1 , John, Newyork , USA , 1111/9999/8888
, , Berlin , Germany ,
, , Tokyo , Japan ,
This is the CSV output as a list of strings, where the first string represents the first row and so on. I need to generate this list of strings for each student. Note that there can be multiple lines per student because address and phone can have multiple values. In this case, there are 3 addresses and 3 phones for student John.
How do I achieve this in Scala?
Addition:
So far, I am working to produce a list of CSV lines, i.e. a list of lists where each inner list stores one row.
So, the list would look like below:
List(
List("id","name","address.city","address.country","phone"),
List("1" ,"John","Newyork" ,"USA" ,"1111/9999/8888"),
List("" ,"" ,"Berlin" ,"Germany" ,""),
List("" ,"" ,"Tokyo" ,"Japan" ,"")
)
Here I simplified your Address type to just a String, and I kept the phone layout as you had it originally (i.e. the one I complained about in the comments). So this is more of a proof-of-concept than a finished product.
val student = Student(Some("1")
, Some("John")
, Some(Seq("Newyork", "Berlin", "Tokyo"))
, Some(Seq("1111","9999"))
)
student match {
case Student(i,n,a,p) =>
val maxLen = a.getOrElse(Seq("")).length max p.getOrElse(Seq("")).length
Seq( Seq(i.getOrElse("")).padTo(maxLen,"")
, Seq(n.getOrElse("")).padTo(maxLen,"")
, a.getOrElse(Seq()).padTo(maxLen,"")
, p.getOrElse(Seq()).padTo(maxLen,"")
).transpose
}
// res0: Seq[Seq[String]] = List( List(1, John, Newyork, 1111)
// , List(, , Berlin, 9999)
// , List(, , Tokyo, ))
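The core of this answer is the padTo-then-transpose trick: every column is padded to the longest column's length so that transpose, which requires equal-length inner sequences, can flip columns into rows. In isolation:

```scala
// Each column is padded to the longest column's length,
// then transpose turns the columns into rows.
val maxLen = 3
val cols = Seq(
  Seq("1").padTo(maxLen, ""),        // id
  Seq("John").padTo(maxLen, ""),     // name
  Seq("Newyork", "Berlin", "Tokyo")  // cities, already maxLen long
)
val rows = cols.transpose
// rows == Seq(Seq("1", "John", "Newyork"),
//             Seq("", "", "Berlin"),
//             Seq("", "", "Tokyo"))
```

transpose throws an IllegalArgumentException when the inner sequences differ in length, which is exactly why each column is padded first.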
Based on the answer provided by @jwvh above, I came up with a solution that works for me:
val student_1 = Student(
id = Some("1"),
name = Some("john"),
address = Some(Seq(
Address(Some("Newyork"),Some("USA")),
Address(Some("Berlin"),Some("Germany")),
Address(Some("Tokyo"),Some("Japan"))
)),
phone = Some(Seq(
"1111","9999","8888"
))
)
// These methods are defined inside the Student case class.
def csvHeaders: List[String] =
  List("StudentId", "Name", "Address.City", "Address.Country", "Phones")

def toCSV: List[List[String]] = {
  val maximumLength = address.getOrElse(Seq.empty[Address]).length max 1
  // (Earlier, phones were kept in separate rows, as in @jwvh's answer above,
  // using phone.getOrElse(Seq.empty[String]).length as well.)

  val idList = List.tabulate(maximumLength)(_ => " ").updated(0, id.getOrElse(""))
  val nameList = List.tabulate(maximumLength)(_ => " ").updated(0, name.getOrElse(""))

  val addressCityList =
    if (address.isDefined)
      address.get.map(_.city.getOrElse(" ")).toList.padTo(maximumLength, " ")
    else
      List.tabulate(maximumLength)(_ => " ")

  val addressCountryList =
    if (address.isDefined)
      address.get.map(_.country.getOrElse(" ")).toList.padTo(maximumLength, " ")
    else
      List.tabulate(maximumLength)(_ => " ")

  val phoneList =
    if (phone.isDefined)
      List.tabulate(maximumLength)(_ => " ").updated(0, phone.get.padTo(maximumLength, " ").mkString("/"))
    else
      List.tabulate(maximumLength)(_ => " ")

  val transposedList: List[List[String]] =
    List(idList, nameList, addressCityList, addressCountryList, phoneList).transpose

  csvHeaders +: transposedList
}
So, now student_1.toCSV will return:
/* List(
List(StudentId, Name, Address.City, Address.Country, Phones),
List(1, john, Newyork, USA, 1111/9999/8888),
List( , , Berlin, Germany, ),
List( , , Tokyo, Japan, )
) */
I have a flat file on HDFS containing a list of companies
CompanyA
CompanyA Decription
April '12
San Fran
11-50
CompanyB
...
and I want to map this into a Company class:
case class Company(company: String,
desc: String,
founded: Date,
location: String,
employees: String)
I have tried the following, but it doesn't seem to map properly:
val companiesText = sc.textFile(...)
val companies = companiesText.map(
  lines => Company(
    lines(0).toString.replaceAll("\"", ""),
    lines(1).toString.replaceAll("\"", ""),
    lines(2).toString.replaceAll("\"", ""),
    lines(3).toString.replaceAll("\"", ""),
    lines(4).toString.replaceAll("\"", "")
  )
)
I know I am not handling the date properly here, but that is not the issue.
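One way to form the records, assuming each company occupies exactly five consecutive lines, is to group the lines before mapping them. A minimal sketch with plain Scala collections (founded kept as a String here since the date parsing is a separate concern, and the second company's values are invented for illustration):

```scala
case class Company(
  company: String,
  desc: String,
  founded: String, // kept as a String in this sketch
  location: String,
  employees: String
)

val lines = Seq(
  "CompanyA", "CompanyA Description", "April '12", "San Fran", "11-50",
  "CompanyB", "CompanyB Description", "May '13", "NYC", "51-200"
)

// Group every five consecutive lines into one record, then map to the case class.
val companies = lines.grouped(5).collect {
  case Seq(name, desc, founded, loc, emp) =>
    Company(
      name.replaceAll("\"", ""),
      desc.replaceAll("\"", ""),
      founded.replaceAll("\"", ""),
      loc.replaceAll("\"", ""),
      emp.replaceAll("\"", "")
    )
}.toList

// companies.head.company == "CompanyA"
```

On Spark, a plain sc.textFile splits the file into lines with no grouping guarantee across partitions, so for multi-line records it is common to read each file whole (e.g. via sc.wholeTextFiles) and split it into lines yourself before grouping.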