How to group the dataframe for transformation - pyspark

I have the following dataframe with schema:
<bound method DataFrame.printSchema of DataFrame[outer_value: string, value01: string, value02: string, value03: string, value04: string, value05: string, value06: string, value07: string, value08: string, value09: string, value10: string, value11: string, value12: string, value13: string, value14: string, value15: string, value16: string, value17: string, value18: string, value19: string, value20: string, value21: string, value22: string, value23: string, value24: string, value25: string, value26: string, value27: string, value28: string, value29: string, value30: string, value31: string, value32: string, value33: string, value34: string, value35: string, value36: string, value37: string, value38: string, value39: string, value40: string, value41: string, value42: string, value43: string, value44: string, value45: string, value46: string, value47: string, value48: string, value49: string, value50: string, value51: string, value52: string, value53: string, value54: string, value55: string]>
I would like to group the columns by 5 (divide 55 by 5 to create 11 column groups) and populate another dataframe with the following schema.
> [outer_value: string, value01: string, value02: string, value03:
> string, value04: string, value05:string]
The first group of 5 columns (value01 to value05) populates the target schema. Likewise, the next group is formed as: outer_value, value06, value07, value08, value09, value10.
This continues until all 11 groups are processed and each group has populated the target dataframe.
Eventually, one row in the source data frame will end up as 11 rows in the target data frame.
How to achieve this?
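One possible approach, sketched here under assumptions and not tested against the real data, is Spark SQL's stack generator, which spreads a flat list of expressions over a given number of rows. The snippet below only builds the expression string from the column names in the schema above; the names valueCols, groupSize and stackExpr are illustrative.

```scala
// Column names value01 .. value55, matching the schema in the question.
val valueCols: Seq[String] = (1 to 55).map(i => f"value$i%02d")

// Number of value columns per output row (taken from the question).
val groupSize = 5
val groups: List[Seq[String]] = valueCols.grouped(groupSize).toList

// The target columns reuse the names of the first group: value01 .. value05.
val targetCols: Seq[String] = groups.head

// stack(n, e1, ..., ek) spreads the k expressions over n rows of k/n columns.
val stackExpr: String =
  s"stack(${groups.size}, ${valueCols.mkString(", ")}) as (${targetCols.mkString(", ")})"

println(stackExpr)
```

The expression is plain Spark SQL, so in PySpark the same string (built with an equivalent list comprehension) could be passed as df.selectExpr("outer_value", stack_expr), which should turn each source row into 11 target rows.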

Related

Convert spark scala dataset of one type to another

I have a dataset with following case class type:
case class AddressRawData(
addressId: String,
customerId: String,
address: String
)
I want to convert it to:
case class AddressData(
addressId: String,
customerId: String,
address: String,
number: Option[Int], //i.e. it is optional
road: Option[String],
city: Option[String],
country: Option[String]
)
Using a parser function:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
unparsedAddress.map(address => {
val split = address.address.split(", ")
address.copy(
number = Some(split(0).toInt),
road = Some(split(1)),
city = Some(split(2)),
country = Some(split(3))
)
}
)
}
I am new to scala and spark. Could anyone please let me know how can this be done?
You were on the right path! There are of course multiple ways of doing this. But since you have already created some case classes and started on a parsing function, an elegant solution is to use the Dataset's map function. From the docs, the signature of map is the following:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
Where T is the starting type (AddressRawData in your case) and U is the type you want to get to (AddressData in your case). So the input of this map function is a function that transforms an AddressRawData into an AddressData. That could perfectly well be the addressParser you've started making!
Now, your current addressParser has the following signature:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData]
To be able to feed it to that map function, we need to change it to this signature, which handles one element at a time:
def newAddressParser(unparsedAddress: AddressRawData): AddressData
Knowing all of this, we can work further! An example would be the following:
import spark.implicits._
import scala.util.Try
// Your case classes
case class AddressRawData(addressId: String, customerId: String, address: String)
case class AddressData(
addressId: String,
customerId: String,
address: String,
number: Option[Int],
road: Option[String],
city: Option[String],
country: Option[String]
)
// Your addressParser function, adapted to be able to feed into the Dataset.map
// function
def addressParser(rawAddress: AddressRawData): AddressData = {
val addressArray = rawAddress.address.split(", ")
AddressData(
rawAddress.addressId,
rawAddress.customerId,
rawAddress.address,
Try(addressArray(0).toInt).toOption,
Try(addressArray(1)).toOption,
Try(addressArray(2)).toOption,
Try(addressArray(3)).toOption
)
}
// Creating a sample dataset
val rawDS = Seq(
AddressRawData("1", "1", "20, my super road, beautifulCity, someCountry"),
AddressRawData("1", "1", "badFormat, some road, cityButNoCountry")
).toDS
val parsedDS = rawDS.map(addressParser)
parsedDS.show
+---------+----------+--------------------+------+-------------+----------------+-----------+
|addressId|customerId| address|number| road| city| country|
+---------+----------+--------------------+------+-------------+----------------+-----------+
| 1| 1|20, my super road...| 20|my super road| beautifulCity|someCountry|
| 1| 1|badFormat, some r...| null| some road|cityButNoCountry| null|
+---------+----------+--------------------+------+-------------+----------------+-----------+
As you can see, because you had already foreseen that parsing can go wrong (the Option fields), it was easy to use scala.util.Try to extract the pieces of the raw address and add some robustness: the second row contains null values where the address string could not be parsed.
Hope this helps!
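Since the adapted addressParser is an ordinary function, it can also be checked on a plain Scala collection without a Spark session. The sketch below repeats the case classes and parser from above in compact form and runs them over the same two sample rows; the name parsed is illustrative.

```scala
import scala.util.Try

case class AddressRawData(addressId: String, customerId: String, address: String)
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)

// Same logic as the parser above: each piece is wrapped in Try, so a missing
// or malformed piece becomes None instead of throwing.
def addressParser(raw: AddressRawData): AddressData = {
  val parts = raw.address.split(", ")
  AddressData(
    raw.addressId,
    raw.customerId,
    raw.address,
    Try(parts(0).toInt).toOption,
    Try(parts(1)).toOption,
    Try(parts(2)).toOption,
    Try(parts(3)).toOption
  )
}

// Plain Seq.map, no Spark involved.
val parsed: Seq[AddressData] = Seq(
  AddressRawData("1", "1", "20, my super road, beautifulCity, someCountry"),
  AddressRawData("1", "1", "badFormat, some road, cityButNoCountry")
).map(addressParser)

parsed.foreach(println)
```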

Scala Option and Some mismatch

I want to parse province into a case class, but it throws a MatchError:
scala.MatchError: Some(USA) (of class scala.Some)
val result = EntityUtils.toString(entity,"UTF-8")
val address = JsonParser.parse(result).extract[Address]
val value.province = Option(address.province)
val value.city = Option(address.city)
case class Access(
device: String,
deviceType: String,
os: String,
event: String,
net: String,
channel: String,
uid: String,
nu: Int,
ip: String,
time: Long,
version: String,
province: Option[String],
city: Option[String],
product: Option[Product]
)
This:
val value.province = Option(address.province)
val value.city = Option(address.city)
doesn't do what you think it does. Scala parses each line as a pattern definition, so value.province and value.city are treated as patterns to match the right-hand side against; the match fails, hence the scala.MatchError exception. It doesn't mutate value as I believe you intended (value apparently has no such setters).
Since value is (apparently) an instance of the Access case class, it is immutable and you can only obtain an updated copy:
val value2 = value.copy(
province = Option(address.province),
city = Option(address.city)
)
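A minimal sketch of both behaviours, assuming Scala 2 and using a reduced, hypothetical Access with only three fields: the val form fails at runtime with a MatchError because the existing field value does not equal the right-hand side, while copy returns a new instance and leaves the original untouched.

```scala
import scala.util.Try

// Reduced, hypothetical version of the Access case class from the question.
case class Access(uid: String, province: Option[String], city: Option[String])

val value = Access("u1", None, None)

// This is a pattern definition, not an assignment: the right-hand side is
// matched against the current value of value.province (None here).
val attempt: Try[Unit] = Try {
  val value.province = Option("USA")
  ()
}

// copy builds a new Access with the given fields replaced.
val value2 = value.copy(province = Option("USA"), city = Option("NYC"))

println(attempt)
println(value2)
```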
Assuming the starting point:
val province: Option[String] = ???
You can get the string with simple pattern matching:
province match {
case Some(stringValue) => JsonParser.parse(stringValue).extract[Province] // use the parser to go from string to a case class
case None => ??? // maybe provide a default value, depending on your context
}
Note: Without knowing what extract[T] returns it's hard to recommend a follow-up

Scala Play JSON trouble converting List[CaseClass] to Json String

I'm not sure where I am going wrong with this. It is returning this error:
No Json serializer as JsObject found for type List[QM_Category].
Try to implement an implicit OWrites or OFormat for this type.
[error] Json.stringify(Json.toJsObject(a.categories))
Is there a way to define a format for just List[QM_Category]? I thought the format for QM_Category would handle the case class, and Play is supposed to handle Lists...
All I really want to do is take my List and convert it to a JSON string. Pretty straightforward, but I am not sure why Play JSON doesn't like my format.
Here is my code:
case class QM_Answer (
answerid: String,
answerstring: String,
answerscore: Int
);
case class QM_Question (
questionid: String,
questionscore: Int,
questiongoal: Int,
questionstring: String,
questiontype: String,
questioncomments: String,
questionisna: Boolean,
questionishidden: Boolean,
failcategory: Boolean,
failform: Boolean,
answers: List[QM_Answer]
);
case class QM_Category (
categoryid: String,
categoryname: String,
categoryscore: Int,
categorygoal: Int,
categorycomments: String,
categoryishidden: Boolean,
failcategory: Boolean,
questions: List[QM_Question]
);
case class SurveySourceRaw (
ownerid: String,
formid: String,
formname: String,
sessionid: String,
evaluator: String,
userid: String,
timelinekey: Long,
surveyid: String,
submitteddate: Long,
month: String,
channel: String,
categories: List[QM_Category]
);
case class SurveySource (
ownerid: String,
formid: String,
formname: String,
sessionid: String,
evaluator: String,
userid: String,
timelinekey: Long,
surveyid: String,
submitteddate: Long,
month: String,
channel: String,
categories: String
);
implicit val qmAnswerFormat = Json.format[QM_Answer];
implicit val qmQuestionFormat = Json.format[QM_Question];
implicit val qmCategoryFormat = Json.format[QM_Category];
implicit val surveySourceRawFormat = Json.format[SurveySourceRaw];
var surveySourceRaw = sc
.cassandraTable[SurveySourceRaw]("mykeyspace", "mytablename")
.select("ownerid",
"formid",
"formname",
"sessionid",
"evaluator",
"userid",
"timelinekey",
"surveyid",
"submitteddate",
"month",
"channel",
"categories")
var surveyRelational = surveySourceRaw
.map(a => SurveySource
(
a.ownerid,
a.formid,
a.formname,
a.sessionid,
a.evaluator,
a.userid,
a.timelinekey,
a.surveyid,
a.submitteddate,
a.month,
a.channel,
Json.stringify(Json.toJsObject(a.categories))
))
The Play JSON format for a List[A], given a format for A, encodes/decodes a JSON array, e.g. for a List[String] [ "foo", "bar", "baz" ]. A JSON array is not a JSON object.
So if you want the List[QM_Category] to be a stringified JSON (but not necessarily a JSON object, e.g. it could be a string, array, etc.), you can use toJson:
Json.stringify(Json.toJson(a.categories))
Alternatively, if you want it to be a JSON object, you would need to define an OFormat (or an OReads/OWrites combination) for List[QM_Category]: an OFormat is a Format which requires that the JSON be an object with string attributes and JSON values (and so forth for OReads/OWrites).
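A minimal sketch of the two options, assuming play-json is on the classpath and using a reduced, hypothetical Category class in place of QM_Category. Wrapping the list under a key with Json.obj is one simple way to obtain a JSON object without writing a custom OFormat; the key name "categories" is an assumption.

```scala
import play.api.libs.json._

// Hypothetical, reduced stand-in for QM_Category.
case class Category(categoryid: String, categoryscore: Int)

implicit val categoryFormat: Format[Category] = Json.format[Category]

val categories = List(Category("c1", 10), Category("c2", 7))

// A List[A] serialises to a JSON array, so toJson (not toJsObject) applies.
val asArrayString: String = Json.stringify(Json.toJson(categories))

// If a JSON object is required, wrap the array under a key.
val asObjectString: String = Json.stringify(Json.obj("categories" -> categories))

println(asArrayString)
println(asObjectString)
```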
I'm almost embarrassed to answer this, but sometimes I make things overly complicated.
The answer was to just read the column from Cassandra as a String instead of a List[QM_Category]. The column in Cassandra was defined as:
categories list<FROZEN<qm.category>>,
I wrongly assumed I would need to read it in from Cassandra as a list of custom objects, then use Play JSON to format that class into JSON, and then stringify it:
case class SurveySourceRaw (
ownerid: String,
formid: String,
formname: String,
sessionid: String,
evaluator: String,
userid: String,
timelinekey: Long,
surveyid: String,
submitteddate: Long,
month: String,
channel: String,
categories: List[QM_Category]
);
In reality, all I needed to do was read it from Cassandra as a String, and it came in as stringified JSON. Well played, Spark Cassandra connector, well played.
case class SurveySourceRaw (
ownerid: String,
formid: String,
formname: String,
sessionid: String,
evaluator: String,
userid: String,
timelinekey: Long,
surveyid: String,
submitteddate: Long,
month: String,
channel: String,
categories: String
);

KSQL all messages failing in stream and there is data in topic

I am relatively new to Kafka Stream and am trying to print out the messages in the stream I created.
Can someone tell me why all the messages fail?
When I use the print command I get this:
print 'main' from beginning limit 10;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: KAFKA_STRING
rowtime: 2021/05/29 14:17:57.375 Z, key: <null>, value: A-1,2,5/21/2019 8:29,5/21/2019 9:29,34.808868,-82.269157,34.808868,-82.269157,0,Accident on Tanner Rd at Pennbrooke Ln.,439,Tanner Rd,R,Greenville,Greenville,SC,29607-6027,US,US/Eastern,KGMU,5/21/2019 8:53,76,76,52,28.91,10,N,7,0,Fair,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,Day,Day,Day,Day
When I run describe extended of the stream I created I get the below:
Name : ACCIDENTS_ORIGINAL
Type : STREAM
Timestamp field : START_TIME
Key format : KAFKA
Value format : DELIMITED
Kafka topic : main (partitions: 1, replication: 1)
Statement : CREATE STREAM ACCIDENTS_ORIGINAL (ID STRING, SEVERITY INTEGER, START_TIME STRING,
END_TIME STRING, START_LAT DOUBLE, START_LNG DOUBLE, END_LAT DOUBLE, END_LNG DOUBLE, DISTANCE DOUBLE,
DESCRIPTION STRING, NUMBER DOUBLE, STREET STRING, SIDE STRING, CITY STRING, COUNTY STRING,
STATE STRING, ZIPCODE STRING, COUNTRY STRING, TIMEZONE STRING, AIRPORT_CODE STRING,
WEATHER_TIME STRING, TEMPERATURE DOUBLE, WIND_CHILL DOUBLE, HUMIDITY DOUBLE,
PRESSURE DOUBLE, VISIBILITY DOUBLE, WIND_DIRECTION STRING, WIND_SPEED STRING,
PRECIPITATION DOUBLE, WEATHER_CONDITION STRING, AMENITY BOOLEAN, BUMP BOOLEAN,
CROSSING BOOLEAN, GIVE_WAY BOOLEAN, JUNCTION BOOLEAN, NO_EXIT BOOLEAN, RAILWAY BOOLEAN,
ROUNDABOUT BOOLEAN, STATION BOOLEAN, STOP BOOLEAN, TRAFFIC_CALMING BOOLEAN,
TRAFFIC_SIGNAL BOOLEAN, TURNING_LOOP BOOLEAN, SUNRISE_SUNSET STRING,
CIVIL_TWILIGHT STRING, NAUTICAL_TWILIGHT STRING, ASTRONOMICAL_TWILIGHT STRING)
WITH (KAFKA_TOPIC='main', KEY_FORMAT='KAFKA', TIMESTAMP='Start_Time',
TIMESTAMP_FORMAT='yyyy-MM-dd HH:mm:ss', VALUE_FORMAT='DELIMITED');
Can anyone take a look and tell me what I am doing wrong here?
Small update: I also tried setting the key format to NONE, and all the messages still fail.
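One thing that may be worth checking is whether the declared TIMESTAMP_FORMAT matches the START_TIME values in the topic. The sketch below assumes that TIMESTAMP_FORMAT is interpreted as a java.time.format.DateTimeFormatter pattern, and tries the declared pattern against the sample value 5/21/2019 8:29 from the print output. The candidate pattern M/d/yyyy H:mm is only a guess based on that single row.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// START_TIME value copied from the printed message above.
val sample = "5/21/2019 8:29"

// Pattern declared in the CREATE STREAM statement.
val declared = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Candidate pattern (an assumption inferred from the sample row).
val candidate = DateTimeFormatter.ofPattern("M/d/yyyy H:mm")

val withDeclared = Try(LocalDateTime.parse(sample, declared))
val withCandidate = Try(LocalDateTime.parse(sample, candidate))

println(s"declared pattern parses sample: ${withDeclared.isSuccess}")
println(s"candidate pattern parses sample: ${withCandidate.isSuccess}")
```

If the declared pattern cannot parse the timestamp column, that alone could plausibly make every record fail, independently of the key format.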

Scala, create a form with more than 22 fields

I'm using Scala and the Play framework, and I want to create a form with more than 22 fields, so I split my fields into 3 tuples like this:
val firstMapping = tuple(
"f1" -> text, "f2" -> text, ... "f18" -> text
)
val secondMapping = tuple(
"f19" -> text, "f20" -> text, ... "f25" -> text
)
val thirdMapping = tuple(
"f26" -> text, ... "f29" -> text
)
After that, I regroup them in a form:
val createForm = Form(tuple(
"general" -> firstMapping,
"specific" -> secondMapping,
"more_specific" -> thirdMapping
))
I think this is a good solution, but my question is about the view file (I'm using an MVC architecture).
In that view I want to pass my form like this:
@(formCreate: Form[])
But I don't know what I need to put in the square brackets (the type parameter), or how to create my fields in HTML.
Usually I use this kind of HTML form:
@helper.form() {
<input type="text" name="id_metier" id="id_metier" maxlength="255"/>
}
So can I keep using that kind of field, or do I need to use specific fields from the Play framework? And what is the type parameter for @(formCreate: Form[])?
Thank you for your help.
Your form is of type Tuple3 with other tuples nested inside, which is painful to read, write, use, and maintain:
Form[((String, String, String, String, String, String, String, String, String, String, String, String, String, String, String, String, String, String), (String, String, String, String, String, String, String), (String, String, String, String))]
Refer to the docs: https://www.playframework.com/documentation/2.5.x/ScalaForms
and just create a case class that contains 3 nested case classes for your data, naming the fields appropriately.
Here is the example from the docs for a nested case class:
case class AddressData(street: String, city: String)
case class UserAddressData(name: String, address: AddressData)
val userFormNested: Form[UserAddressData] = Form(
mapping(
"name" -> text,
"address" -> mapping(
"street" -> text,
"city" -> text
)(AddressData.apply)(AddressData.unapply)
)(UserAddressData.apply)(UserAddressData.unapply)
)
When creating the form in the view, you refer to nested fields with dot notation:
@helper.inputText(userFormNested("name"))
@helper.inputText(userFormNested("address.street"))
@helper.inputText(userFormNested("address.city"))