Let's say I have some sql that is going to return a result set that looks like this:
ID    Value
A1    Val1
A1    Val2
A1    Val3
B1    Val4
B1    Val5
B1    Val6
val query = sql"""select blah""".query[(ID, VALUE)]

val result: ConnectionIO[(ID, List[VALUE])] = for {
  tuples <- query.to[List]
} yield tuples.traverse(t => t._1 -> t._2)
This is the closest I can get, but I get a compiler error:
Could not find an instance of Applicative for [+T2](ID, T2)
What I want is to turn this into a Map[ID, List[VALUE]]
Here, .traverse isn't the most helpful method, try this instead:
val result: ConnectionIO[Map[ID, List[VALUE]]] = for {
  tuples <- query.to[List]
} yield tuples.groupMap(_._1)(_._2)
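For reference, this is what groupMap yields on the sample data (a minimal sketch with plain strings standing in for ID and VALUE, requires Scala 2.13+):

val tuples = List(("A1", "Val1"), ("A1", "Val2"), ("A1", "Val3"),
                  ("B1", "Val4"), ("B1", "Val5"), ("B1", "Val6"))

tuples.groupMap(_._1)(_._2)
// Map("A1" -> List("Val1", "Val2", "Val3"), "B1" -> List("Val4", "Val5", "Val6"))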
If you are on a Scala version older than 2.13, you can try:
val result: ConnectionIO[Map[ID, List[VALUE]]] = for {
  tuples <- query.to[List]
} yield tuples
  .groupBy(_._1)          // Map[ID, List[(ID, VALUE)]]
  .mapValues(_.map(_._2))
I don't know which DB you are using, but if it has array functions like PostgreSQL does, you can use GROUP BY with array_agg. Afterwards you can just call .toMap (or .groupBy) on the resulting List[(ID, List[VALUE])].
val query =
  sql"""select id, array_agg(value) as values from your_table group by id"""
    .query[(ID, List[VALUE])]

val result = query.to[List].map(_.toMap)
Given this statement 1:
val aggDF3 = aggDF2.select(cols.map { col =>
  when(size(aggDF2(col)) === 0, lit(null)).otherwise(aggDF2(col)).as(s"$col")
}: _*)
Given this statement 2:
aggDF.select(colsToSelect.head, colsToSelect.tail: _*).show()
Can I combine the when logic from statement 1 with the colsToSelect.tail: _* from statement 2 in a single statement, so that the first field is selected as-is and the logic only applies to the tail of the DataFrame's columns? I have tried various approaches, but I'm on thin ice here.
This should work:
val aggDF: DataFrame = ???
val colsToSelect: Seq[String] = ???

aggDF
  .select((col(colsToSelect.head) +: colsToSelect.tail.map(c =>
    when(size(aggDF(c)) === 0, lit(null)).otherwise(aggDF(c)).as(s"$c")
  )): _*)
  .show()
Remember that select is overloaded and works differently with String and Column: with cols: Seq[String] you need select(cols.head, cols.tail: _*), whereas with cols: Seq[Column] you need select(cols: _*). The solution above uses the second variant.
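For illustration, a minimal sketch of the two overloads (the df value and column names are assumptions, not from the question):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def demo(df: DataFrame): Unit = {
  val cols: Seq[String] = Seq("a", "b", "c")

  // String overload: select(col: String, cols: String*)
  df.select(cols.head, cols.tail: _*).show()

  // Column overload: select(cols: Column*)
  df.select(cols.map(col): _*).show()
}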
Consider I have the following tables: A, B and AB.
Tables AB is a link table between A and B.
When I execute a simple insertOrUpdate action, it succeeds. I have 1 row inserted in the table.
val a = TableQuery[A]
val b = TableQuery[B]
val ab = TableQuery[AB]
Await.result(db.run(ab.insertOrUpdate(ABLink(1, 1))), Duration.Inf)
println(Await.result(db.run(ab.length.result, Duration.Inf)))
//prints 1
But when I read from tables A and B, get the ids, and then insertOrUpdate into table AB using a for comprehension, the row is not inserted. The program completes without any errors.
val a = TableQuery[A]
val b = TableQuery[B]
val ab = TableQuery[AB]
val action = for {
  aId <- a.map(_.id).result.headOption
  bId <- b.map(_.id).result.headOption
} yield ab.insertOrUpdate(ABLink(aId.get, bId.get))
Await.result(db.run(action),Duration.Inf)
println(Await.result(db.run(ab.length.result, Duration.Inf)))
//prints 0
Can someone throw light on this behavior?
Because the yield performs a map over the last value of the for comprehension, the insertOrUpdate statement was never executed by run. When I use flatMap, it works.
When using map, it creates a nested DBIOAction.
The inner SqlAction is not executed:
DBIOAction[FixedSqlAction[Int, NoStream, Effect.Write], NoStream, Effect.Read with Effect.Read]
When using flatMap, it creates a flattened DBIOAction which gets executed:
DBIOAction[Int, NoStream, Effect.Read with Effect.Read with Effect.Write]
val a = TableQuery[A]
val b = TableQuery[B]
val ab = TableQuery[AB]
val action = for {
  aId <- a.map(_.id).result.headOption
  bId <- b.map(_.id).result.headOption
} yield ABLink(aId.get, bId.get)

Await.result(db.run(action.flatMap(ab.insertOrUpdate(_))), Duration.Inf)
println(Await.result(db.run(ab.length.result, Duration.Inf)))
//prints 1
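Equivalently (a sketch, assuming the same table definitions as above), the insertOrUpdate can be bound inside the for comprehension itself, since the comprehension's <- arrows desugar to flatMap:

val action = for {
  aId <- a.map(_.id).result.headOption
  bId <- b.map(_.id).result.headOption
  n   <- ab.insertOrUpdate(ABLink(aId.get, bId.get)) // flatMapped, so it runs as part of `action`
} yield n

Await.result(db.run(action), Duration.Inf)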
I have an RDD with 3 columns (road_idx, snodeidx, enodeidx).
It looks like this:
(roadidx_995, 1138, 1145)
(roadidx_996, 1138, 1139)
(roadidx_997, 2740, 1020)
(roadidx_998, 2762, 2740)
(roadidx_999, 3251, 3240)
.........
How can I group together the road_idx values which have either snodeidx or enodeidx in common, and give each group a number starting from 1?
expected output:
(1,[roadidx_995,roadidx_996])
(2,[roadidx_997,roadidx_998])
(3,[roadidx_999])
As shown above,
roadidx_995 and roadidx_996 have the same snodeidx, 1138.
roadidx_997 has the same snodeidx as the enodeidx of roadidx_998, which is 2740.
roadidx_999 is in a group of its own.
Scala code or Python code are both fine, as long as you can show me the logic of using the RDD APIs to get the expected output.
Much appreciated!
This can be implemented as follows:
Split the original RDD into two RDDs, grouped by the "start" and "end" node respectively.
Join the original dataset with the values from step 1 several times, to get four columns like:
|------------------|----------------|--------------|----------------|
| start join start | start join end | end join end | end join start |
|------------------|----------------|--------------|----------------|
Merge the values from the four columns into one.
In Scala this can be implemented as:
val data = List(
("roadidx_995", 1138, 1145),
("roadidx_996", 1138, 1139),
("roadidx_997", 2740, 1020),
("roadidx_998", 2762, 2740),
("roadidx_999", 3251, 3240)
)
val original = sparkContext.parallelize(data)
val groupedByStart = original.map(v => (v._1, v._2)).groupBy(_._2).mapValues(_.map(_._1))
val groupedByEnd = original.map(v => (v._1, v._3)).groupBy(_._2).mapValues(_.map(_._1))
val indexesOnly = original.map(allRow => (allRow._2, allRow._3))
// join by start value
val startJoinsStart = indexesOnly.keyBy(_._1).join(groupedByStart)
val startJoinsEnd = startJoinsStart.leftOuterJoin(groupedByEnd)
// join by end value
val endKeys = startJoinsEnd.values.keyBy(_._1._1._2)
val endJoinsEnd = endKeys.join(groupedByEnd)
val endJoinsStart = endJoinsEnd.leftOuterJoin(groupedByStart)
// flatten to output format
val result = endJoinsStart
.values
.map(v => (v._1._1._1._2, v._1._1._2, v._1._2, v._2))
.map(v => v._1 ++ v._2.getOrElse(Seq()) ++ v._3 ++ v._4.getOrElse(Seq()))
.map(_.toSet)
.distinct()
result.foreach(println)
Output is:
Set(roadidx_995, roadidx_996)
Set(roadidx_998, roadidx_997)
Set(roadidx_999)
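To attach group numbers starting from 1, as in the expected output, a minimal follow-up sketch (assuming result is the RDD[Set[String]] produced above; the ordering of the groups is whatever the partition ordering gives):

val numbered = result
  .zipWithIndex()
  .map { case (group, idx) => (idx + 1, group.toList) }

numbered.foreach(println)
// e.g. (1,List(roadidx_995, roadidx_996))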
I have a DataFrame in which I want to remove duplicates based on a key; the catch is that, among the records sharing the same key, the one to keep should be selected based on some columns, not just any record.
For example, my DF looks like:
+------+-------+-----+
|animal|country|color|
+------+-------+-----+
| Cat|america|white|
| dog|america|brown|
| dog| canada|white|
| dog| canada|black|
| Cat| canada|black|
| bear| canada|white|
+------+-------+-----+
Now I want to remove duplicates based on the column animal, preferring the rows which have country 'america'.
My desired output should be:
+------+-------+-----+
|animal|country|color|
+------+-------+-----+
| Cat|america|white|
| dog|america|brown|
| bear| canada|white|
+------+-------+-----+
Since there is no reduceByKey in the DataFrame API, I convert this to a key-value pair RDD and then do a reduceByKey. I'm stuck on the function which will do this preference-based selection amongst the duplicates.
I'd prefer the sample code in Scala.
Provided that for every animal kind (dog, cat, ...) there is at least one entry for the country "america", and that you don't mind losing duplicate matching animals within America, you can use reduceByKey:
val animals = sc.parallelize(
  ("cat", "america", "white") :: ("dog", "america", "brown") ::
  ("dog", "canada", "white") :: ("dog", "canada", "black") ::
  ("cat", "canada", "black") :: ("bear", "canada", "white") :: Nil)

val animalsKV = animals.map { case (k, a, b) => k -> (a, b) }

animalsKV.reduceByKey {
  case (a @ ("america", _), _) => a
  case (_, b) => b
}
In case you might have animals with no entries in "america", the code above will keep one of the duplicates: the last one. You can improve it by keeping all the duplicates in those cases, e.g.:
animalsKV.combineByKey(
  (entry: (String, String)) => Map(entry), // First-met entries are kept, wrapped in a map from country to color
  (m: Map[String, String], entry: (String, String)) =>
    if (entry._1 == "america") Map(entry) // If an animal in "america" is found, it should be the answer value
    else m + entry,                       // Otherwise, we keep track of the duplicates
  (a: Map[String, String], b: Map[String, String]) => // When combining maps...
    if (a contains "america") a           // If one of them contains "america"
    else if (b contains "america") b      // ...then we keep that map
    else a ++ b                           // Otherwise, we accumulate the duplicates
)
That code can be modified to keep track of duplicated "american" animals too.
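A minimal sketch of that modification (an assumption on my part, reusing the animalsKV RDD from above; groupByKey is used for clarity even though it shuffles all values):

// Keep every "america" entry per animal when at least one exists,
// otherwise keep all the duplicates.
animalsKV.groupByKey().mapValues { entries =>
  val american = entries.filter(_._1 == "america")
  if (american.nonEmpty) american else entries
}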
I believe you can do what you want with DataFrames in Spark versions >= 1.4 using window functions (at least I think that's what they're called).
However, using RDDs:
val input: RDD[(String, String, Row)] = ???
val keyedByAnimal: RDD[(String, (String, Row))] =
input.map{case (animal, country, other) => (animal, (country, other)) }
val result: RDD[(String, (String, Row))] = keyedByAnimal.reduceByKey{(x, y) =>
if(x._1 == "america") x else y
}
The above gives you a single distinct value for each animal value. The choice of which value is non-deterministic. All that can be said is that if there exists a value for the animal with "america", one of them will be chosen.
Regarding your comment:
val df: DataFrame = ???
val animalCol: String = ???
val countryCol: String = ???
val otherCols = df.columns.filter(c => c != animalCol && c != countryCol)

val rdd: RDD[(String, String, Row)] =
  df.select(animalCol, (countryCol +: otherCols): _*).rdd.map(r => (r.getString(0), r.getString(1), r))
The select reorders the columns so the getString methods pull out the expected values.
Honestly though, look into window aggregations. I don't know much about them, as I don't use DataFrames or Spark beyond 1.3.
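For completeness, a sketch of the window-function approach alluded to above (my assumption, not part of the original answer; it assumes a Spark version new enough to have row_number and window support, and that df is the question's DataFrame): rank rows within each animal partition so that "america" sorts first, then keep the top row.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Prefer "america" within each animal partition; otherwise any row may win.
val w = Window
  .partitionBy("animal")
  .orderBy(when(col("country") === "america", 0).otherwise(1))

val deduped = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")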
I have an input file of the form
(id | column_name | value)
...
column_name can take on some 50 names, and the list of ids can be huge.
I want to build a tall-and-skinny matrix whose (i,j) coefficient corresponds to the value found at (id, column_name) with id mapped to i and column name mapped to j.
So far, here's my approach
I load the file
val f = sc.textFile("example.txt")
val data = f.map(_.split('|') match {
  case Array(id, column_name, score) =>
    (id.toInt, column_name.toString, score.toDouble)
})
Then I will build the column_name and ids lists
val column_name_list = data.map(x => x._2).distinct.collect.zipWithIndex
val ids_list = data.map(x=>x._1).distinct.collect.zipWithIndex
val nCols = column_name_list.length
val nRows = ids_list.length
and then I build a CoordinateMatrix, defining the entries using the mapping I just created:
val broadcastcolumn_name = sc.broadcast(column_name_list.toMap)
val broadcastIds = sc.broadcast(ids_list.toMap)

val matrix_entries_tmp = data.map {
  case (id, column_name, score) =>
    (broadcastIds.value.getOrElse(id, 0), broadcastcolumn_name.value.getOrElse(column_name, 0), score)
}

val matrix_entries = matrix_entries_tmp.map {
  e => MatrixEntry(e._1, e._2, e._3)
}
val coo_matrix = new CoordinateMatrix(matrix_entries)
This works fine on small examples. However, I get a memory error when the id list gets huge. The problem seems to be:
val ids_list = data.map(x=>x._1).distinct.collect.zipWithIndex
that induces a memory error
What would be a workaround? I actually don't really need the id mapping. What is important are the column names and that each row corresponds to some (lost) id. I was thinking about using an IndexedRowMatrix but I am stuck on how to do it.
Thanks for the help!!
CoordinateMatrix
Too ugly to be a decent solution but it should give you some place to start.
First, let's create a mapping between column name and index:
val colIdxMap = sc.broadcast(data.
map({ case (row, col, value) => col }).
distinct.
zipWithIndex.
collectAsMap)
Group columns by row id and map values to pairs (colIdx, value):
val values = data.
groupBy({ case (row, col, value) => row }).
mapValues({ _.map { case (_, col, value) => (colIdxMap.value(col), value)}}).
values
Generate entries:
val entries = values.
zipWithIndex.
flatMap { case (vals, row) =>
vals.map {case (col, value) => MatrixEntry(row, col, value)}
}
Create a final matrix:
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
RowMatrix
If row ids are not important at all you can use a RowMatrix as follows:
First, let's group the data by row:
val dataByRow = data.groupBy { case (row, col, value) => row }
Generate sparse vector for each row:
val rows = dataByRow.mapValues((vals) => {
val cols = vals.map {
case (_, col, value) => (colIdxMap.value(col).toInt, value)
}
Vectors.sparse(colIdxMap.value.size, cols.toSeq)
}).values
Create a matrix:
val mat: RowMatrix = new RowMatrix(rows)
You can use zipWithIndex on rows to create a RDD[IndexedRow] and IndexedRowMatrix as well.
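A minimal sketch of that last variant (assuming the rows RDD of sparse vectors built in the RowMatrix section; the row indices here are synthetic, not the original ids):

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Assign a synthetic row index to each sparse vector and wrap it as an IndexedRow.
val indexedRows = rows.zipWithIndex.map { case (vector, idx) => IndexedRow(idx, vector) }
val indexedMat: IndexedRowMatrix = new IndexedRowMatrix(indexedRows)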