Null value Left Outer Join in Spark - scala

I have two RDD and got the left join
left join p1.leftOuterJoin(p2) the result is like:
Array[((String, String), (Int, Option[Int]))] =
Array(((1001-150329-002-0-04624,5060567),(1,None)), ((1002-141105-008-0-01934,10145500),(1,None)), ((1013-150324-009-0-02270,15750046),(1,None)), ((1005-150814-005-0-05885,5060656),(1,Some(1))), ((1009-150318-004-0-02537,5060583),(1,None)))
I want to replace all None with 0 and get a clean data set like:
Array(((1001-150329-002-0-04624,5060567),0), ((1002-141105-008-0-01934,10145500),0), ((1013-150324-009-0-02270,15750046),0), ((1005-150814-005-0-05885,5060656),1)), ((1009-150318-004-0-02537,5060583),0))
Basically replace all (1,None) with 0 and (1,Some(1)) with 1

If you are looking for the value of the Option when Some or 0 when None, I would implement it with .map and .getOrElse:
a.map { case (k, (_, o)) => (k, o.getOrElse(0)) }
The result match the expected one:
Array(
((1001-150329-002-0-04624,5060567),0),
((1002-141105-008-0-01934,10145500),0),
((1013-150324-009-0-02270,15750046),0),
((1005-150814-005-0-05885,5060656),1)),
((1009-150318-004-0-02537,5060583),0))

Related

How to get value of previous row in scala apache rdd[row]?

I need to get value from previous or next row while Im iterating through RDD[Row]
(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)
I need to sum strings for rows where difference between 1st value is not higher than 3. 2nd value is ID. So the result should be:
(1, string1string2)
(1, string3string4)
I tried use groupBy, reduce, partitioning but still I can't achieve what I want.
I'm trying to make something like this(I know it's not proper way):
rows.groupBy(row => {
row(1)
}).map(rowList => {
rowList.reduce((acc, next) => {
diff = next(0) - acc(0)
if(diff <= 3){
val strings = acc(2) + next(2)
(acc(1), strings)
}else{
//create new group to aggregatre strings
(acc(1), acc(2))
}
})
})
I wonder if my idea is proper to solve this problem.
Looking for help!
I think you can use sqlContext to Solve your problem by using lag function
Create RDD:
val rdd = sc.parallelize(List(
(10, 1, "string1"),
(11, 1, "string2"),
(21, 1, "string3"),
(22, 1, "string4"))
)
Create DataFrame:
val df = rdd.map(rec => (rec._1.toInt, rec._2.toInt, rec._3.toInt)).toDF("a", "b", "c")
Register your Dataframe:
df.registerTempTable("df")
Query the result:
val res = sqlContext.sql("""
SELECT CASE WHEN l < 3 THEN ROW_NUMBER() OVER (ORDER BY b) - 1
ELSE ROW_NUMBER() OVER (ORDER BY b)
END m, b, c
FROM (
SELECT b,
(a - CASE WHEN lag(a, 1) OVER (ORDER BY a) is not null
THEN lag(a, 1) OVER (ORDER BY a)
ELSE 0
END) l, c
FROM df) A
""")
Show the Results:
res.show
I Hope this will Help.

How to get optional result in insert statements with Doobie?

I have an optional insert query:
val q = sql"insert into some_table (some_field) select 42 where ...(some condition)"
Running this query with:
q.update.withUniqueGeneratedKeys[Option[Long]]("id")
fails with
Result set exhausted: more rows expected
then condition is false.
How to get Optional[Long] result from insert statements with Doobie?
UPD
.withGeneratedKeys[Long]("id") gives just Long in for comprehension
val q = sql"insert into some_table (some_field) select 42 where ...(some condition)"
for {
id <- q.update.withGeneratedKeys[Long]("id") // id is long
_ <- if (<id is present>) <some other inserts> else <nothing>
} yield id
How to check id?
As #Thilo commented, you can use the use withGeneratedKeys which gives you back a Stream[F, Long] (where F is your effect type)
val result = q.update.withGeneratedKeys[Long]("id")
Here is the doc.

Left outer join of three tables with two conditions each not working in Slick

I have the following left outer join that I'd like to represent in Slick:
select * from tableD d
left outer join tableE e on e.ds_sk=d.ds_sk and e.ds_type = d.ds_type
left outer join tableV v on v.es_sk=e.sk and v.var_name=d.var_name
where e.sk=30;
Note that each relationship contains two join conditions. The first join conditions are between tableD and tableE, and the second join conditions are between tableV, tableE and tableD. Finally, there's also a where condition.
This is my attempt (that throws a compilation error):
val query = for {
((d, e), v) <- tableE joinLeft tableD on ((x,y) => x.dsType === y.dsType && x.dsSk === y.dsSk)
joinLeft tableV on ((x,y) => x._1.sk === y.esSk && x._2.varName === y.varName)
if (e.sk === 30)
} yield (d,e,v)
The error I get is:
â—¾value varName is not a member of slick.lifted.Rep[Option[tableD]]
What is the error and how to fix this code?
There seems to be discrepancies between your SQL query (D leftjoin E leftjoin V) and Slick query (E leftjoin D leftjoin V).
Assuming your SQL query is the correct version, the Slick query should look like the following:
val query = for {
((d, e), v) <- tableD joinLeft tableE on ( (x, y) =>
x.dsType === y.dsType && x.dsSk === y.dsSk )
joinLeft tableV on ( (x, y) =>
x._2.map(_.sk) === y.esSk && x._1.varName) === y.varName )
if (e.sk === 30)
} yield (d,e,v)
Note that in the second joinLeft, you need to use map to access fields in tableE which is wrapped in Option after the first joinLeft. That also explains why you're getting the error message about Option[tableD] in your Slick query.
You can compare doing map by varName, example:
... && x._2.map(_.varName) === y.varName ...
Here I leave a similar example:
val withAddressesQuery = for {
(((person, phone), _), address) <- withPhonesQuery.
joinLeft(PersonAddress).on(_._1.personId === _.personId).
joinLeft(Address).on(_._2.map(_.addressId) === _.addressId)
} yield (person, phone, address)
Notice: ... on(_._2.map( _.addressId) === ...see complete
There's some more information about joinLeft in Chapter 6.4.3 of Essential Slick.

How to make aggregations with slick

I want to force slick to create queries like
select max(price) from coffees where ...
But slick's documentation doesn't help
val q = Coffees.map(_.price) //this is query Query[Coffees.type, ...]
val q1 = q.min // this is Column[Option[Double]]
val q2 = q.max
val q3 = q.sum
val q4 = q.avg
Because those q1-q4 aren't queries, I can't get the results but can use them inside other queries.
This statement
for {
coffee <- Coffees
} yield coffee.price.max
generates right query but is deprecated (generates warning: " method max in class ColumnExtensionMethods is deprecated: Use Query.max instead").
How to generate such query without warnings?
Another issue is to aggregate with group by:
"select name, max(price) from coffees group by name"
Tried to solve it with
for {
coffee <- Coffees
} yield (coffee.name, coffee.price.max)).groupBy(x => x._1)
which generates
select x2.x3, x2.x3, x2.x4 from (select x5."COF_NAME" as x3, max(x5."PRICE") as x4 from "coffees" x5) x2 group by x2.x3
which causes obvious db error
column "x5.COF_NAME" must appear in the GROUP BY clause or be used in an aggregate function
How to generate such query?
As far as I can tell is the first one simply
Query(Coffees.map(_.price).max).first
And the second one
val maxQuery = Coffees
.groupBy { _.name }
.map { case (name, c) =>
name -> c.map(_.price).max
}
maxQuery.list
or
val maxQuery = for {
(name, c) <- Coffees groupBy (_.name)
} yield name -> c.map(_.price).max
maxQuery.list

scalaquery retrieve values

I have few tables, lets say 2 for simplicity. I can create them in this way,
...
val tableA = new Table[(Int,Int)]("tableA"){
def a = column[Int]("a")
def b = column[Int]("b")
}
val tableB = new Table[(Int,Int)]("tableB"){
def a = column[Int]("a")
def b = column[Int]("b")
}
Im going to have a query to retrieve value 'a' from tableA and value 'a' from tableB as a list inside the results from 'a'
my result should be:
List[(a,List(b))]
so far i came upto this point in query,
def createSecondItr(b1:NamedColumn[Int]) = for(
b2 <- tableB if b1 === b1.b
) yield b2.a
val q1 = for (
a1 <- tableA
listB = createSecondItr(a1.b)
) yield (a1.a , listB)
i didn't test the code so there might be errors in the code. My problem is I cannot retrieve data from the results.
to understand the question, take trains and classes of it. you search the trains after 12pm and you need to have a result set where the train name and the classes which the train have as a list inside the train's result.
I don't think you can do this directly in ScalaQuery. What I would do is to do a normal join and then manipulate the result accordingly:
import scala.collection.mutable.{HashMap, Set, MultiMap}
def list2multimap[A, B](list: List[(A, B)]) =
list.foldLeft(new HashMap[A, Set[B]] with MultiMap[A, B]){(acc, pair) => acc.addBinding(pair._1, pair._2)}
val q = for (
a <- tableA
b <- tableB
if (a.b === b.b)
) yield (a.a, b.a)
list2multimap(q.list)
The list2multimap is taken from https://stackoverflow.com/a/7210191/66686
The code is written without assistance of an IDE, compiler or similar. Consider the debugging free training :-)