Scala slick join table with grouped query - scala

I have two tables:
Shop
class ShopTable(tag: Tag) extends GenericTable[Shop, UUID] {
  def * = (id.?, name, address) <> (Shop.tupled, Shop.unapply _)
  def name = column[String]("name")
  def address = column[String]("address")
}
val shopTable = TableQuery[ShopDAO.ShopTable]
Order
class OrderTable(tag: Tag) extends GenericTable[Order, UUID] {
  def * = (id.?, shopId, amount) <> (Order.tupled, Order.unapply _)
  def shopId = column[UUID]("shop_id")
  def amount = column[Double]("amount")
}
val orderTable = TableQuery[OrderDAO.OrderTable]
I can get statistics (orders count, orders amount sum) for shops:
def getStatisticForShops(shopIds: List[UUID]): Future[Seq[(UUID, Int, Double)]] = {
  searchStatisticsByShopIds(shopIds).map(orderStatistics =>
    shopIds.map(shopId => {
      val o = orderStatistics.find(_._1 == shopId)
      (
        shopId,
        o.map(_._2).getOrElse(0),
        o.map(_._3).getOrElse(0.0)
      )
    })
  )
}
def searchStatisticsByShopIds(shopIds: List[UUID]): Future[Seq[(UUID, Int, Double)]] =
  db.run(searchStatisticsByShopIdsCompiled(shopIds).result)

private val searchStatisticsByShopIdsCompiled = Compiled((shopIds: Rep[List[UUID]]) =>
  orderTable.filter(_.shopId === shopIds.any)
    .groupBy(_.shopId)
    .map { case (shopId, row) =>
      (shopId, row.length, row.map(_.amount).sum.get)
    }
)
But I need to sort and filter shopTable by orders count.
How can I join the grouped orderTable to shopTable with zero values for the missing shops?
I want to get a result like this:
| id(shopId) | name | address | ordersCount | ordersAmount |
| id1 | name | address | 4 | 200.0 |
| id2 | name | address | 0 | 0.0 |
| id3 | name | address | 2 | 300.0 |
I use Scala 2.12.6, Slick 3.0.1 (slick_2.12), play-slick 3.0.1 and slick-pg 0.16.3.
P.S. I may have found a solution
val shopsOrdersQuery: Query[(Rep[UUID], Rep[Int], Rep[Double]), (UUID, Int, Double), Seq] =
  searchShopsOrdersCompiled.extract

// Query shops with orders for sorting and filtering
val allShopsOrdersQueryQuery = shopTable.joinLeft(shopsOrdersQuery).on(_.id === _._1)
  .map(s => (s._1, s._2.map(_._2).getOrElse(0), s._2.map(_._3).getOrElse(0.0)))

private val searchShopsOrdersCompiled = Compiled(
  orderTable.groupBy(_.shopId)
    .map { case (shopId, row) =>
      (shopId, row.length, row.map(_.amount).sum.get)
    }
)

Yes, this solution works fine:
val shopsOrdersQuery: Query[(Rep[UUID], Rep[Int], Rep[Double]), (UUID, Int, Double), Seq] =
  searchShopsOrdersCompiled.extract

// Query shops with orders for sorting and filtering
val allShopsOrdersQueryQuery = shopTable.joinLeft(shopsOrdersQuery).on(_.id === _._1)
  .map(s => (s._1, s._2.map(_._2).getOrElse(0), s._2.map(_._3).getOrElse(0.0)))

private val searchShopsOrdersCompiled = Compiled(
  orderTable.groupBy(_.shopId)
    .map { case (shopId, row) =>
      (shopId, row.length, row.map(_.amount).sum.get)
    }
)
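With the left join in place, the shops can be sorted and filtered by order count directly on the joined query before it is run. A minimal sketch (the threshold and ordering are only examples):

val shopsSortedByOrders = allShopsOrdersQueryQuery
  .filter { case (_, ordersCount, _) => ordersCount > 0 }    // e.g. keep only shops with at least one order
  .sortBy { case (_, ordersCount, _) => ordersCount.desc }

db.run(shopsSortedByOrders.result)  // Future[Seq[(Shop, Int, Double)]]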

Related

How to write a generic function to evaluate column values inside withcolumn of spark dataframe

Hi, I have the below dataframe which has a countries column along with multiple other columns and more than a lakh (100,000) rows. I want to write a generic function (because it is used in multiple places) which can be used inside withColumn to create a new column.
input
| countries |
|------------|
| RFRA |
| BRES |
| EAST |
| RUSS |
| .... |
output
| countries |
|-----------|
| FRA |
| BRA |
| POL |
| RUS |
| ... |
Below is my code. When I pass the countries column to the function, I am not able to evaluate the column against a string. How can I extract the value from the column, compare it with the specified string values, and return the result as a column?
val df = sample.withColumn("renamedcountries", replace($"countries"))

def replace(countries: Column): Column = {
  // This does not work as intended: the Column is compared with plain Strings,
  // and the match produces a String rather than a Column.
  val Updated = countries match {
    case "RFRA" => "FRA"
    case "BRES" => "BRA"
    case "RESP" => "ESP"
    case "RBEL" => "BEL"
    case "RGRB" => "GBR"
    case "RALL" => "DEU"
    case "MARO" => "MAR"
    case "RPOR" => "PRT"
    case _ => "unknown"
  }
  Updated
}
Wrap the function logic you have as a UDF and call this UDF from the various places in your code.
import org.apache.spark.sql.functions._

val df = Seq(("RFRA"), ("BRES"), ("RUSS")).toDF("countries")

val mapCountries = udf[String, String](country => {
  val Updated = country match {
    case "RFRA" => "FRA"
    case "BRES" => "BRA"
    case "RESP" => "ESP"
    case "RBEL" => "BEL"
    case "RGRB" => "GBR"
    case "RALL" => "DEU"
    case "MARO" => "MAR"
    case "RPOR" => "PRT"
    case _ => "unknown"
  }
  Updated
})
df.withColumn("renamedCountries", mapCountries($"countries")).show()
+---------+----------------+
|countries|renamedCountries|
+---------+----------------+
| RFRA| FRA|
| BRES| BRA|
| RUSS| unknown|
+---------+----------------+
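Since the same logic is meant to be reused from various places, the mapping can also be registered as a named UDF so that it is callable from Spark SQL. A sketch (the view name countries_view is made up for illustration, and only a few cases are shown):

// Register the same mapping under a name that SQL queries can use.
spark.udf.register("mapCountries", (country: String) => country match {
  case "RFRA" => "FRA"
  case "BRES" => "BRA"
  case _ => "unknown"
})

df.createOrReplaceTempView("countries_view")
spark.sql("SELECT countries, mapCountries(countries) AS renamedCountries FROM countries_view").show()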
Here is an approach with typedLit, so whenever there is a change you only need to update the map:
val df = Seq("RFRA","BRES","EAST", "RUSS").toDF("countries")
val replaceMap = typedLit(Map("RFRA" -> "FRA",
"BRES" -> "BRA",
"RESP" -> "ESP",
"RBEL" -> "BEL",
"RGRB" -> "GBR",
"RALL" -> "DEU",
"MARO" -> "MAR",
"RPOR" -> "PRT"))
def replace(countries: Column): Column = {
when(replaceMap($"$countries").isNotNull,replaceMap($"$countries"))
.otherwise(lit("unknown"))
}
val res = df.withColumn("modified_countries", replace($"countries"))
res.show(false)
+---------+------------------+
|countries|modified_countries|
+---------+------------------+
|RFRA |FRA |
|BRES |BRA |
|EAST |unknown |
|RUSS |unknown |
+---------+------------------+
You should define it as a reusable expression:
def replace(c: Column): Column = {
  when(c === "RFRA", "FRA")
    .when(c === "BRES", "BRA")
    .when(c === "RESP", "ESP")
    .when(c === "RBEL", "BEL")
    // add more here
    .otherwise("unknown")
}

df
  .withColumn("countries", replace($"countries"))
  .show()
You can also pack the modifications inside a map and use it in this expression:
val replaceMap = Map("RFRA" -> "FRA",
"BRES" -> "BRA",
"RESP" -> "ESP",
"RBEL" -> "BEL",
"RGRB" -> "GBR",
"RALL" -> "DEU",
"MARO" -> "MAR",
"RPOR" -> "PRT")
def replace(countries: Column): Column = {
replaceMap.foldLeft(when(lit(false),countries)){case (acc,(k,v)) => acc.when(countries === k,v)}
.otherwise("unknown")
}
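Assuming the same sample dataframe as in the answers above, the map-driven version is used exactly like the explicit when chain:

val df = Seq("RFRA", "BRES", "EAST", "RUSS").toDF("countries")
df.withColumn("renamedcountries", replace($"countries")).show(false)
// RFRA and BRES map to FRA and BRA; EAST and RUSS fall through to "unknown".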

Add additional columns to Spark dataframe

I build Spark dataframes from file paths, but now I would like to add the path to the resulting dataframe, along with the time as a separate column too. Here is the current solution (pathToDF is a helper method):
val paths = pathsDF
  .orderBy($"time")
  .select($"path")
  .as[String]
  .collect()

if (paths.nonEmpty) {
  paths
    .grouped(groupsNum.getOrElse(paths.length))
    .map(_.map(pathToDF).reduceLeft(_ union _))
} else {
  Seq.empty[DataFrame]
}
I am trying to do something like this, but I am not sure how to also add the time column using withColumn:
val orderedPaths = pathsDF
  .orderBy($"time")
  .select($"path")
  //.select($"path", $"time") for both columns

val paths = orderedPaths
  .as[String]
  .collect()

if (paths.nonEmpty) {
  paths
    .grouped(groupsNum.getOrElse(paths.length))
    .map(group => group.map(pathToDataDF).reduceLeft(_ union _)
      .withColumn("path", orderedPaths("path")))
    //.withColumn("time", orderedPaths("time")) something like this
} else {
  Seq.empty[DataFrame]
}
What would be a better way to implement it?
Input DF:
time Long
path String
Current result:
resultDF schema
field1 Int
field2 String
....
fieldN String
Expected result:
resultDF schema
field1 Int
field2 String
....
path String
time Long
Please check the code below.
1. Change grouped to the par function for parallel data loading.
2. Change

// The code below adds the same path for the content of multiple files.
paths.grouped(groupsNum.getOrElse(paths.length))
  .map(group => group.map(pathToDataDF).reduceLeft(_ union _)
    .withColumn("path", orderedPaths("path")))

to

// The code below adds the matching path for each file's content.
paths
  .grouped(groupsNum.getOrElse(paths.length))
  .flatMap(group => {
    group.map(path => {
      pathToDataDF(path).withColumn("path", lit(path))
    })
  })
  .reduceLeft(_ union _)

As an example I have used both par and grouped to show you.
Note: ignore some of the methods like pathToDataDF; I have tried to replicate your methods.
scala> val orderedPaths = Seq(("/tmp/data/foldera/foldera.json","2020-05-29 01:30:00"),("/tmp/data/folderb/folderb.json","2020-05-29 02:00:00"),("/tmp/data/folderc/folderc.json","2020-05-29 03:00:00")).toDF("path","time")
orderedPaths: org.apache.spark.sql.DataFrame = [path: string, time: string]
scala> def pathToDataDF(path: String) = spark.read.format("json").load(path)
pathToDataDF: (path: String)org.apache.spark.sql.DataFrame
//Sample File content I have taken.
scala> "cat /tmp/data/foldera/foldera.json".!
{"name":"Srinivas","age":29}
scala> "cat /tmp/data/folderb/folderb.json".!
{"name":"Ravi","age":20}
scala> "cat /tmp/data/folderc/folderc.json".!
{"name":"Raju","age":25}
Using par
scala> val paths = orderedPaths.orderBy($"time").select($"path").as[String].collect
paths: Array[String] = Array(/tmp/data/foldera/foldera.json, /tmp/data/folderb/folderb.json, /tmp/data/folderc/folderc.json)
scala> val parDF = paths match {
case p if !p.isEmpty => {
p.par
.map(path => {
pathToDataDF(path)
.withColumn("path",lit(path))
}).reduceLeft(_ union _)
}
case _ => spark.emptyDataFrame
}
parDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 1 more field]
scala> parDF.show(false)
+---+--------+------------------------------+
|age|name |path |
+---+--------+------------------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|
|20 |Ravi |/tmp/data/folderb/folderb.json|
|25 |Raju |/tmp/data/folderc/folderc.json|
+---+--------+------------------------------+
// With time column.
scala> val paths = orderedPaths.orderBy($"time").select($"path",$"time").as[(String,String)].collect
paths: Array[(String, String)] = Array((/tmp/data/foldera/foldera.json,2020-05-29 01:30:00), (/tmp/data/folderb/folderb.json,2020-05-29 02:00:00), (/tmp/data/folderc/folderc.json,2020-05-29 03:00:00))
scala> val parDF = paths match {
case p if !p.isEmpty => {
p.par
.map(path => {
pathToDataDF(path._1)
.withColumn("path",lit(path._1))
.withColumn("time",lit(path._2))
}).reduceLeft(_ union _)
}
case _ => spark.emptyDataFrame
}
parDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 2 more fields]
scala> parDF.show(false)
+---+--------+------------------------------+-------------------+
|age|name |path |time |
+---+--------+------------------------------+-------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|2020-05-29 01:30:00|
|20 |Ravi |/tmp/data/folderb/folderb.json|2020-05-29 02:00:00|
|25 |Raju |/tmp/data/folderc/folderc.json|2020-05-29 03:00:00|
+---+--------+------------------------------+-------------------+
Using grouped
scala> val paths = orderedPaths.orderBy($"time").select($"path").as[String].collect
paths: Array[String] = Array(/tmp/data/foldera/foldera.json, /tmp/data/folderb/folderb.json, /tmp/data/folderc/folderc.json)
scala> val groupedDF = paths match {
case p if !p.isEmpty => {
paths
.grouped(groupsNum.getOrElse(paths.length))
.flatMap(group => {
group
.map(path => {
pathToDataDF(path)
.withColumn("path", lit(path))
})
}).reduceLeft(_ union _)
}
case _ => spark.emptyDataFrame
}
groupedDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 1 more field]
scala> groupedDF.show(false)
+---+--------+------------------------------+
|age|name |path |
+---+--------+------------------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|
|20 |Ravi |/tmp/data/folderb/folderb.json|
|25 |Raju |/tmp/data/folderc/folderc.json|
+---+--------+------------------------------+
// with time column.
scala> val paths = orderedPaths.orderBy($"time").select($"path",$"time").as[(String,String)].collect
paths: Array[(String, String)] = Array((/tmp/data/foldera/foldera.json,2020-05-29 01:30:00), (/tmp/data/folderb/folderb.json,2020-05-29 02:00:00), (/tmp/data/folderc/folderc.json,2020-05-29 03:00:00))
scala> val groupedDF = paths match {
case p if !p.isEmpty => {
paths
.grouped(groupsNum.getOrElse(paths.length))
.flatMap(group => {
group
.map(path => {
pathToDataDF(path._1)
.withColumn("path",lit(path._1))
.withColumn("time",lit(path._2))
})
}).reduceLeft(_ union _)
}
case _ => spark.emptyDataFrame
}
groupedDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 2 more fields]
scala> groupedDF.show(false)
+---+--------+------------------------------+-------------------+
|age|name |path |time |
+---+--------+------------------------------+-------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|2020-05-29 01:30:00|
|20 |Ravi |/tmp/data/folderb/folderb.json|2020-05-29 02:00:00|
|25 |Raju |/tmp/data/folderc/folderc.json|2020-05-29 03:00:00|
+---+--------+------------------------------+-------------------+

Create a recursive object graph from a tuple with Scala

I've got a very simple database table called regions where each region may have a parent region.
mysql> describe region;
+---------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+-------+
| region_code | char(3) | NO | PRI | NULL | |
| region_name | varchar(50) | NO | | NULL | |
| parent_region | char(3) | YES | MUL | NULL | |
+---------------+-------------+------+-----+---------+-------+
Now I'd like to hydrate this data into a Scala object graph of case classes that each have a parent of the same type.
case class Region(code: String, name: String, parent: Option[Region])
I do this with the following code. It works but it creates duplicate objects which I'd like to avoid if possible.
class RegionDB @Inject() (db: Database) {
  def getAll(): Seq[Region] = {
    Logger.debug("Getting all regions.")
    db.withConnection { implicit conn =>
      val parser = for {
        code <- str("region_code")
        name <- str("region_name")
        parent <- str("parent_region").?
      } yield (code, name, parent)
      val results = SQL("SELECT region_code, region_name, parent_region from region").as(parser.*)
      // TODO: Change this so it doesn't create duplicate records
      def toRegion(record: (String, String, Option[String])): Region = {
        val (code, name, parent) = record
        val parentRecord = parent.map(p => results.find(_._1 == p)).getOrElse(None)
        new Region(code, name, parentRecord.map(toRegion).orElse(None))
      }
      val regions = results map toRegion
      regions.foreach(r => Logger.debug("region: " + r))
      regions
    }
  }
}
I know how to do this in the imperative way but not the functional way. I know there has got to be an expressive way to do this with recursion but I can't seem to figure it out. Do you know how? Thanks!
I was able to solve this issue by restructuring the Region case class so that the parent region is a var and by adding a collection of children. It would be nice to do this without a var but oh well.
case class Region(code: String, name: String, subRegions: Seq[Region]) {
  var parentRegion: Option[Region] = None
  subRegions.foreach(_.parentRegion = Some(this))
}
The recursion is more natural going from the root down.
def getAll(): Seq[Region] = {
  Logger.debug("Getting all regions.")
  db.withConnection { implicit conn =>
    val parser = for {
      code <- str("region_code")
      name <- str("region_name")
      parent <- str("parent_region").?
    } yield (code, name, parent)
    val results = SQL("SELECT region_code, region_name, parent_region from region").as(parser.*)
    def toRegion(record: (String, String, Option[String])): Region = {
      val (regionCode, name, parent) = record
      val children = results.filter(_._3 == Some(regionCode)).map(toRegion)
      Region(regionCode, name, children)
    }
    val rootRegions = results filter (_._3 == None) map toRegion
    rootRegions.foreach(r => Logger.debug("region: " + r))
    rootRegions
  }
}

scala slick or and query

My question may sound very banal but I still haven't resolved it.
I have the Products Table implemented like
class ProductsTable(tag: Tag) extends Table[Product](tag, "PRODUCTS") {
  def id = column[Int]("PRODUCT_ID", O.PrimaryKey, O.AutoInc)
  def title = column[String]("NAME")
  def description = column[String]("DESCRIPTION")
  def style = column[String]("STYLE")
  def price = column[Int]("PRICE")
  def category_id = column[Int]("CATEGORY_ID")
  def size_id = column[Int]("SIZE_ID")
  def brand_id = column[Int]("BRAND_ID")

  def * = (id.?, title, description, style, price, category_id, size_id, brand_id) <> (Product.tupled, Product.unapply _)
}
and its representation in
val Products = TableQuery[ProductsTable]
How can I implement a query equivalent to this SQL query:
select * from products where (category_id = 1 or category_id = 2 or category_id = 3) and (price between min and max)
Try something like this:
val query = Products filter { p => (p.category_id inSet List(1,2,3)) && p.price > min && p.price < max }
val result = db.run(query.result)
You can use println(query.result.statements) to see what the query looks like.
EDIT:
Answer for additional question. You can make a function for your query that accepts optional min and max values:
def getProductsQuery(maybeMin: Option[Int] = None, maybeMax: Option[Int] = None) = {
  val initialQuery = Products filter { p => (p.category_id inSet List(1, 2, 3)) }
  val queryWithMin = maybeMin match {
    case Some(min) => initialQuery filter { _.price > min }
    case None => initialQuery
  }
  val queryWithMax = maybeMax match {
    case Some(max) => queryWithMin filter { _.price < max }
    case None => queryWithMin
  }
  queryWithMax
}
And then you could do any of these:
val q1 = getProductsQuery() // without min or max
val q2 = getProductsQuery(maybeMin = Option(3)) // only min
val q3 = getProductsQuery(maybeMax = Option(10)) // only max
val q4 = getProductsQuery(maybeMin = Option(3), maybeMax = Option(10)) // both
and run any of these as needed...
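For reference, the same optional chaining can be written a bit more compactly with Option.fold; this is just a sketch equivalent to the version above:

def getProductsQuery(maybeMin: Option[Int] = None, maybeMax: Option[Int] = None) = {
  val base = Products filter { p => (p.category_id inSet List(1, 2, 3)) }
  val withMin = maybeMin.fold(base)(min => base.filter(_.price > min))
  maybeMax.fold(withMin)(max => withMin.filter(_.price < max))
}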

How to populate User defined objects from list of lists in scala

I am new to Scala.
Currently I am trying to write a program which fetches table metadata from a database as a list of lists in the format below and converts it to objects of the user defined type TableFieldMetadata.
The getAllTableMetaDataAsVO function does this conversion.
Can you tell me if I can write this function in a better, more functional way?
| **Table Name** | **FieldName** | **Data type** |
| table xxxxxx | field yyyyyy | type zzzz |
| table xxxxxx | field wwwww| type mmm|
| table qqqqqq| field nnnnnn| type zzzz |
Note: Here table name can repeat as it usually has multiple columns.
User defined classes:
1. TableFieldMetadata:
/**
 * Class to hold the meta data for a table
 */
class TableFieldMetadata(name: String) {
  var tableName: String = name
  var fieldMetaDataList: List[FieldMetaData] = List()

  def add(fieldMetadata: FieldMetaData) {
    fieldMetaDataList = fieldMetadata :: fieldMetaDataList
  }
}
2. FieldMetaData :
/**
 * Class to hold the meta data for a field
 */
class FieldMetaData(fieldName: String, fieldsDataType: String) {
  var name: String = fieldName
  var dataType: String = fieldsDataType
}
Function:
/**
 * Function to convert list of lists to user defined objects
 */
def getAllTableMetaDataAsVO(allTableMetaData: List[List[String]]): List[TableFieldMetadata] = {
  var currentTableName: String = null
  var currentTable: TableFieldMetadata = null
  var tableFieldMetadataList: List[TableFieldMetadata] = List()

  allTableMetaData.foreach { tableFieldMetadataItem =>
    val tableName = tableFieldMetadataItem.head
    if (currentTableName == null || !currentTableName.equals(tableName)) {
      currentTableName = tableName
      currentTable = new TableFieldMetadata(tableName)
      tableFieldMetadataList = currentTable :: tableFieldMetadataList
    }
    if (currentTableName.equals(tableName)) {
      val tableField = tableFieldMetadataItem.tail
      currentTable.add(new FieldMetaData(tableField(0), tableField(1)))
    }
  }
  return tableFieldMetadataList
}
Here is one solution. NOTE: I just used Table and Field for ease of typing into the REPL. You should use case classes.
scala> case class Field( name: String, dType : String)
defined class Field
scala> case class Table(name : String, fields : List[Field])
defined class Table
scala> val rawData = List( List("table1", "field1", "string"), List("table1", "field2", "int"), List("table2", "field1", "string") )
rawData: List[List[String]] = List(List(table1, field1, string), List(table1, field2, int), List(table2, field1, string))
scala> val grouped = rawData.groupBy(_.head)
grouped: scala.collection.immutable.Map[String,List[List[String]]] = Map(table1 -> List(List(table1, field1, string), List(table1, field2, int)), table2 -> List(List(table2, field1, string)))
scala> val result = grouped.map { case(k,v) => { val fields = for { r <- v } yield { Field( r(1), r(2) ) }; Table( k, fields ) } }
result: scala.collection.immutable.Iterable[Table] = List(Table(table1,List(Field(field1,string), Field(field2,int))), Table(table2,List(Field(field1,string))))
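For completeness, the same groupBy idea can be mapped back onto the question's own TableFieldMetadata and FieldMetaData classes. A sketch that keeps the original mutable add method:

def getAllTableMetaDataAsVO(allTableMetaData: List[List[String]]): List[TableFieldMetadata] =
  allTableMetaData.groupBy(_.head).map { case (tableName, rows) =>
    val table = new TableFieldMetadata(tableName)
    rows.foreach(row => table.add(new FieldMetaData(row(1), row(2))))
    table
  }.toList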