Spark Scala: join a Dataset with a Seq[case class]

I have been trying to create a new Dataset by combining an existing Dataset with a Seq[case class], keeping only the rows whose cid is present in both. A single cid may have multiple tid values, and I am trying to find the total transaction amount (money) per cid.
case class Data1(
  cid: String,
  fname: String,
  add: String
)
case class Data2(
  cid: String,
  tid: String,
  money: Long
)
I have read the CSV files into data1DF and data2DF, and then created data1DS and data2DS using the case classes:
val data1DS: Dataset[Data1] = data1DF.as[Data1]
val data2DS: Dataset[Data2] = data2DF.as[Data2]
I tried collecting data2DS into a Seq[Data2] and joining it with data1DS, but it throws an error:
val dd : Seq[Data2] = data2DS.collect().toSeq
val ansDS = data1DS.join(dd, data1DS("cid") === dd("cid"), "leftouter")
<console>:36: error: type mismatch;
found : String("cid")
required: Int
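A likely fix, offered as my own sketch rather than something from the original post: the error comes from dd("cid"), because indexing a Seq takes an Int, and join in any case expects another Dataset/DataFrame rather than a Seq. Keeping data2 as a Dataset, joining on cid and aggregating should give the per-cid totals.
import org.apache.spark.sql.functions.sum

// Inner join keeps only the cids present in both Datasets.
val joined = data1DS.join(data2DS, Seq("cid"), "inner")

// Total transaction amount per cid.
val totalPerCid = joined
  .groupBy("cid")
  .agg(sum("money").as("total_money"))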

How can I convert Query[MappedProjection[Example, (Option[String], Int, UUID, UUID)], Example, Seq] to Query[Examples, Example, Seq]?
Details
I am trying to drop a column from an existing table (Examples in this case) and move the data to another table (Examples2 in this case). I don't want to change the existing code base, so I plan to join these two tables and map the results to Example.
import slick.lifted.Tag
import slick.driver.PostgresDriver.api._
import java.util.UUID

case class Example(
  field1: Option[String] = None,
  field2: Int,
  someForeignId: UUID,
  id: UUID,
)
object Example
class Examples(tag: Tag) extends Table[Example](tag, "entityNotes") {
  def field1 = column[Option[String]]("field1")
  def field2 = column[Int]("field2")
  def someForeignId = column[UUID]("someForeignId")
  def id = column[UUID]("id", O.PrimaryKey)

  def someForeignKey = foreignKey(
    "someForeignIdToExamples2",
    someForeignId,
    Examples2.query,
  )(
    _.id.?
  )

  def * =
    (
      field1.?,
      field2,
      someForeignId,
      id,
    ) <> ((Example.apply _).tupled, Example.unapply)
}

object Examples {
  val query = TableQuery[Examples]
}
Basically, all the functions in the codebase call Examples.query. If I update that query by joining two tables, the problem will be solved (of course with a performance shortcoming because of one extra join for each call).
To use the query with the existing code base, we need to keep the type the same. For example, we can use filter as follows:
val query_ = TableQuery[Examples]
val query: Query[Examples, Example, Seq] = query_.filter(_.field2 > 5)
Everything will work without a problem since we keep the type of the query as it is supposed to be.
However, I cannot do that with a join if I want to use data from the second table.
val query_ = TableQuery[Examples]
val query = query_
  .join(Examples2.query)
  .on(_.someForeignId === _.id)
  .map {
    case (e, e2) =>
      (
        e2.value.?,
        e.field2,
        e2.id,
        e.id
      ) <> ((Example.apply _).tupled, Example.unapply)
  }
This is where I got stuck. Its type is Query[MappedProjection[Example, (Option[String], Int, UUID, UUID)], Example, Seq].
Can anyone help? Btw, we don't have to use map. This is just what I got so far.
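One hedged direction, assuming Examples2 really has the value column used above and that the existing call sites only need the result type (Example) rather than the lifted Examples table: expose the joined query with an existential element type.
object Examples {
  val query_ = TableQuery[Examples]

  // Only the packed result type (Example) survives in the public signature;
  // the element type of the join is a MappedProjection, not Examples.
  val query: Query[_, Example, Seq] =
    query_
      .join(Examples2.query)
      .on(_.someForeignId === _.id)
      .map { case (e, e2) =>
        (e2.value.?, e.field2, e2.id, e.id) <> ((Example.apply _).tupled, Example.unapply)
      }
}
The trade-off is that callers which filter or sort on Examples columns through Examples.query will no longer compile against the existential element type, so this only helps where the query is consumed as a plain source of Example values.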

PostgreSQL Error with Doobie: PSQLException: The column index is out of range: 3, number of columns: 2

I am practicing with Scala, Doobie and PostgreSQL. The database runs inside a Docker container. I am able to post and update jobs but unable to GET all posts; I keep getting the error below.
I have researched other similar questions, but mine differs in that I am just trying to GET everything from the database, so I don't understand this column issue.
I am starting to think I need a circe decoder to read the JSON out of the database. The Get instance below throws an error on leftMap and on show:
implicit val get: Get[JobPostDetails] =
Get[Json].temap(_.as[JobPostDetails].leftMap(_.show))
case class JobPost(id: String, details: JobPostDetails)
case class JobPostDetails(title: String, description: String, salary: Double, employmentType: String, employer: String)
def createTable: doobie.Update0 = {
  sql"""
     |CREATE TABLE IF NOT EXISTS jobs (
     |  id TEXT PRIMARY KEY,
     |  details JSON NOT NULL
     |)
  """.stripMargin.update
}

def getAll: doobie.Query0[JobPost] = {
  sql"""
     |SELECT id, details FROM jobs
  """.stripMargin.query[JobPost]
}

def getPosts: IO[List[JobPost]] = {
  JobQueries.getAll.to[List].transact(xa)
}
09:01:50.452 [ioapp-compute-7] ERROR org.http4s.server.service-errors - Error servicing request: GET /api/v1/posts from 0:0:0:0:0:0:0:1
org.postgresql.util.PSQLException: The column index is out of range: 3, number of columns: 2.
at org.postgresql.jdbc.PgResultSet.checkColumnIndex(PgResultSet.java:2760) ~[postgresql-42.2.12.jar:42.2.12]
at org.postgresql.jdbc.PgResultSet.checkResultSet(PgResultSet.java:2780) ~[postgresql-42.2.12.jar:42.2.12]
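A hunch, offered as a sketch rather than a confirmed fix: the leftMap/show compile errors suggest some imports are missing, and if the Get[JobPostDetails] instance does not compile (or is not in implicit scope where .query[JobPost] is derived), doobie falls back to flattening JobPostDetails into its five fields and tries to read six columns from the two-column result, which would match the out-of-range error. Under those assumptions, the instance with its imports would look like:
import cats.syntax.either._                    // provides leftMap on Either
import cats.syntax.show._                      // provides .show on the DecodingFailure
import doobie.postgres.circe.json.implicits._  // provides Get[Json]/Put[Json] for JSON columns
import io.circe.Json
import io.circe.generic.auto._                 // derives Decoder[JobPostDetails]

implicit val get: Get[JobPostDetails] =
  Get[Json].temap(_.as[JobPostDetails].leftMap(_.show))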

How do I insert JSON into a postgres table using Anorm?

I'm getting a runtime exception when trying to insert a JSON string into a JSON column. The string I have looks like """{"Events": []}""", and the table has a column defined as status JSONB NOT NULL. I can insert the string into the table from the command line without a problem. I've defined a method to do the insert as:
import play.api.libs.json._
import anorm._
import anorm.postgresql._

def createStatus(
    status: String,
    created: LocalDateTime = LocalDateTime.now())(implicit c: SQLConnection): Unit = {
  SQL(s"""
     |INSERT INTO status_feed
     |  (status, created)
     |VALUES
     |  ({status}, {created})
     |""".stripMargin)
    .on(
      'status -> Json.parse("{}"), // n.b. would be Json.parse(status) but this provides a concise error message
      'created -> created)
    .execute()
}
and calling it gives the following error:
TypeDoesNotMatch(Cannot convert {}: org.postgresql.util.PGobject to String for column ColumnName(status_feed.status,Some(status)))
anorm.AnormException: TypeDoesNotMatch(Cannot convert {}: org.postgresql.util.PGobject to String for column ColumnName(status_feed.status,Some(status)))
I've done loads of searching for this issue but couldn't find anything about this specific use case; most of it is about pulling JSON columns out into case classes. I've tried slightly different formats using spray-json's JsValue, Play's JsValue, and simply passing the string as-is and casting in the query with ::JSONB, and they all give the same error.
Update: here is the SQL which created the table:
CREATE TABLE status_feed (
  id SERIAL PRIMARY KEY,
  status JSONB NOT NULL,
  created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW()
)
The error is not about the values given to .executeInsert, but about the parsing of the INSERT result (the inserted key).
import java.sql._
// postgres=# CREATE TABLE test(foo JSONB NOT NULL);
val jdbcUrl = "jdbc:postgresql://localhost:32769/postgres"
val props = new java.util.Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "mysecretpassword")
implicit val con = DriverManager.getConnection(jdbcUrl, props)
import anorm._, postgresql._
import play.api.libs.json._
SQL"""INSERT INTO test(foo) VALUES(${Json.obj("foo" -> 1)})""".
executeInsert(SqlParser.scalar[JsValue].singleOpt)
// Option[play.api.libs.json.JsValue] = Some({"foo":1})
/*
postgres=# SELECT * FROM test ;
foo
------------
{"foo": 1}
*/
BTW, the plain string interpolation in SQL(s"""...""") is useless there, since nothing is actually interpolated.
Turns out cchantep was right, it was the parser I was using. The test framework I am using swallowed the stack trace and I assumed the problem was on the insert, but what's actually blowing up is the next line in the test where I use the parser.
The case class and parser were defined as:
case class StatusFeed(
status: String,
created: LocalDateTime) {
val ItemsStatus: Status = status.parseJson.convertTo[Status]
}
object StatusFeed extends DefaultJsonProtocol {
val fields: String = sqlFields[StatusFeed]() // helper function that results in "created, status"
// used in SQL as RETURNING ${StatusFeed.fields}
val parser: RowParser[StatusFeed] =
Macro.namedParser[StatusFeed](Macro.ColumnNaming.SnakeCase)
// json formatter for Status
}
As defined, the parser attempts to read a JSONB column from the result set into the String field status. Changing fields to val fields: String = "created, status::TEXT" resolves the issue, though the cast may be expensive. Alternatively, defining status as a JsValue instead of a String and providing an implicit Column for Anorm (adapted from this answer to use spray-json) also fixes the issue:
implicit def columnToJsValue: Column[JsValue] =
  anorm.Column.nonNull[JsValue] { (value, meta) =>
    val MetaDataItem(qualified, nullable, clazz) = meta
    value match {
      case json: org.postgresql.util.PGobject => Right(json.getValue.parseJson)
      case _ =>
        Left(TypeDoesNotMatch(
          s"Cannot convert $value: ${value.asInstanceOf[AnyRef].getClass} to Json for column $qualified"))
    }
  }
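For completeness, my guess (not spelled out above, and reusing the imports already in scope) at how the adjusted case class would look with that Column[JsValue] in implicit scope, so the macro-generated parser reads the JSONB column straight into a spray-json JsValue:
case class StatusFeed(
  status: JsValue, // spray-json JsValue, read via the implicit Column above
  created: LocalDateTime) {
  // ItemsStatus can now be derived from the already-parsed JSON, e.g. status.convertTo[Status]
}

object StatusFeed extends DefaultJsonProtocol {
  val parser: RowParser[StatusFeed] =
    Macro.namedParser[StatusFeed](Macro.ColumnNaming.SnakeCase)
}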

How to pass join key as variable in Spark Data Frame using Scala

I am trying to keep the join keys of two DataFrames in two variables and then pass them into a join. Right now each variable contains one key. Can I also pass more than one key?
Ex:
1st key :
scala> val primary_key_col = scd_table_keys_df.first().getString(2)
primary_key_col: String = acct_nbr
2nd key :
scala> val delta_primary_key_col = "delta_"+primary_key_col
delta_primary_key_col: String = delta_acct_nbr
My Python code, which works:
cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(col(delta_primary_key_col) == col(primary_key_col)) ,'left_outer' ).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
I want to achieve the same in Scala. Please suggest.
I tried multiple ways:
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df({primary_key_col.mkstring(",")}) == hist_tgt_tbl_Y_df({primary_key_col.mkstring(",")}),"left_outer" )).where(hist_tgt_tbl_Y_df[primary_key_col].isNull())
:121: error: value mkstring is not a member of String
val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,(delta_src_rename_df(delta_primary_key_col.map(c => col(c))) == hist_tgt_tbl_Y_df(primary_key_col.map(c => col(c))),"left_outer" ))
:123: error: type mismatch;
found : Array[org.apache.spark.sql.Column]
required: String
I am not able to solve this. Please suggest.
I was missing the variable substitution. This is working for me.
scala> val cdc_new_acct_df = delta_src_rename_df.join(hist_tgt_tbl_Y_df ,delta_src_rename_df(s"$delta_primary_key_col") === hist_tgt_tbl_Y_df(s"$primary_key_col"),"left_outer" )
cdc_new_acct_df: org.apache.spark.sql.DataFrame = [delta_acct_nbr: string, delta_primary_state: string, delta_zip_code: string, delta_load_tm: string, delta_load_date: string, hash_key_col: string, delta_hash_key: int, delta_eff_start_date: string, acct_nbr: string, account_sk_id: bigint, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, load_tm: string, hash_key: string, eff_flag: string]
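To answer the "more than one key" part, here is my own sketch (with hypothetical key names; in the real job they would come from scd_table_keys_df): build one Column condition by combining the per-key equalities with &&.
// Hypothetical key lists for illustration.
val primaryKeyCols = Seq("acct_nbr", "primary_state")
val deltaKeyCols = primaryKeyCols.map("delta_" + _)

// One equality per key pair, folded into a single join condition.
val joinCond = primaryKeyCols.zip(deltaKeyCols)
  .map { case (k, dk) => delta_src_rename_df(dk) === hist_tgt_tbl_Y_df(k) }
  .reduce(_ && _)

val cdc_new_acct_df = delta_src_rename_df
  .join(hist_tgt_tbl_Y_df, joinCond, "left_outer")
  .where(hist_tgt_tbl_Y_df(primaryKeyCols.head).isNull)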

TypedPipe can't coerce strings to DateTime even when given implicit function

I've got a Scalding data flow that starts with a bunch of pipe-separated value files. The first column is a DateTime in a slightly non-standard format. I want to use the strongly typed TypedPipe API, so I've specified a tuple type and a case class to contain the data:
type Input = (DateTime, String, Double, Double, String)
case class LatLonRecord(date : DateTime, msidn : String, lat : Double, lon : Double, cellname : String)
However, Scalding doesn't know how to coerce a String into a DateTime, so I tried adding an implicit function to do the dirty work:
implicit def stringToDateTime(dateStr: String): DateTime =
  DateTime.parse(dateStr, DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S"))
However, I still get a ClassCastException:
val lines: TypedPipe[Input] = TypedPipe.from(TypedPsv[Input](args("input")))
lines.map(x => x._1).dump
//cascading.flow.FlowException: local step failed at java.lang.Thread.run(Thread.java:745)
//Caused by: cascading.pipe.OperatorException: [FixedPathTypedDelimite...][com.twitter.scalding.RichPipe.eachTo(RichPipe.scala:509)] operator Each failed executing operation
//Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.joda.time.DateTime
What do I need to do to get Scalding to call my conversion function?
So I ended up doing this:
case class LatLonRecord(date : DateTime, msisdn : String, lat : Double, lon : Double, cellname : String)
object dateparser {
  val format = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S")
  def parse(s: String): DateTime = DateTime.parse(s, format)
}
//changed first column to a String, yuck
type Input = (String, String, Double, Double, String)
val lines: TypedPipe[Input] = TypedPipe.from( TypedPsv[Input]( args("input")) )
val recs = lines.map(v => LatLonRecord(dateparser.parse(v._1), v._2, v._3,v._4, v._5))
But I feel like it's a sub-optimal solution. I welcome better answers from people who have been using Scala for more than, say, the one week I have.
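One reason the implicit never fires (my understanding, not from the original thread): the String-to-DateTime cast happens at runtime inside the Cascading tuple conversion, where Scala implicit conversions, a compile-time feature, are never consulted, so parsing explicitly in a map is about as good as it gets. A slightly tidier variant of the same approach (an untested sketch) moves the conversion into a factory on the companion object:
object LatLonRecord {
  private val format = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.S")

  // Convert one raw row (first column still a String) into the typed record.
  def fromRow(row: (String, String, Double, Double, String)): LatLonRecord =
    LatLonRecord(DateTime.parse(row._1, format), row._2, row._3, row._4, row._5)
}

val recs: TypedPipe[LatLonRecord] =
  TypedPipe.from(TypedPsv[Input](args("input"))).map(LatLonRecord.fromRow)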