Data filtering in Spark - scala

I am trying to do a certain kind of filtering using Spark. I have a data frame that looks like the following:
ID   Property#1   Property#2   Property#3
------------------------------------------
01   a            b            c
01   a            X            c
02   d            e            f
03   i            j            k
03   i            j            k
I expect the properties for a given ID to be the same. In the example above, I would like to filter out the following:
ID   Property#2
---------------
01   b
01   X
Note that it is okay for IDs to be repeated in the data frame as long as the properties are the same (e.g. ID '03' in the first table). The code needs to be as efficient as possible, as I am planning to apply it to datasets with >10k rows. I tried extracting the distinct rows using the distinct function in the DataFrame API, grouping them on the ID column using groupBy, and aggregating the results with the countDistinct function, but unfortunately I couldn't get a working version of the code. The way I implemented it also seems quite slow. I was wondering if anyone could provide some pointers on how to approach this problem.
Thanks!

You can for example aggregate and join. First you'll have to create a lookup table:
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val df = Seq(
  ("01", "a", "b", "c"), ("01", "a", "X", "c"),
  ("02", "d", "e", "f"), ("03", "i", "j", "k"),
  ("03", "i", "j", "k")
).toDF("id", "p1", "p2", "p3")

// number of distinct property combinations per id
val lookup = df.distinct.groupBy($"id").count
Then filter the records:
df.join(broadcast(lookup), Seq("id"))
  .where($"count" =!= 1)  // ids whose rows have more than one distinct property combination
  .show
// +---+---+---+---+-----+
// | id| p1| p2| p3|count|
// +---+---+---+---+-----+
// | 01| a| b| c| 2|
// | 01| a| X| c| 2|
// +---+---+---+---+-----+
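A hedged alternative, closer to the groupBy/countDistinct route the question attempted (same df as above): flag the ids whose property combinations are not all identical, then pull back their original rows.
import org.apache.spark.sql.functions.{broadcast, countDistinct}
// ids with more than one distinct (p1, p2, p3) combination
val inconsistentIds = df
  .groupBy($"id")
  .agg(countDistinct($"p1", $"p2", $"p3").as("variants"))
  .where($"variants" > 1)
  .select("id")
// original rows belonging to those ids
df.join(broadcast(inconsistentIds), Seq("id")).show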

Related

how to perform complicated manipulations on scala datasets

I am fairly new to Scala, and coming from a SQL and pandas background, the Dataset objects in Scala are giving me a bit of trouble.
I have a dataset that looks like the following...
+-------+------+
|car_num|colour|
+-------+------+
|    145|     c|
|    132|     p|
|    104|     u|
|    110|     c|
|    110|     f|
|    113|     c|
|    115|     c|
|     11|     i|
|    117|     s|
|    118|     a|
+-------+------+
I have loaded it as a dataset using a case class that looks like the following
case class carDS(carNum: String, Colour: String)
Each car_num is unique to a car, and many of the cars have multiple entries. The colour column refers to the colour the car was painted. I would like to know how to add a column that gives, for example, the longest run of consecutive paint jobs a car has had without being painted green (g).
So far I have tried this.
carDS
  .map(x => (x.carNum, x.Colour))
  .groupBy("_1")
  .count()
  .orderBy($"count".desc)
  .show()
But I believe it just gives me a count column with the number of times each car was painted, not the longest consecutive run of paint jobs without the car being green.
I think I might need to use a function in my query like the following
def colourrun(sq: String): Int = {
  println(sq)
  sq.mkString(" ")
    .split("g")
    .filter(_.nonEmpty)
    .map(_.trim)
    .map(s => s.split(" ").length)
    .max
}
but I am unsure where it should go.
Ultimately if car 102 had been painted r, b, g, b, o, y, r, g
I would want the count column to give 4 as the answer.
How would I do this?
thanks
Here's one approach: group the paint jobs for a given car into monotonically numbered groups separated by paint jobs of colour "g", then apply a couple of groupBy/agg steps to get the maximum count of paint jobs between "g" paint jobs.
(Note that a timestamp column is being added to ensure a deterministic ordering of the rows in the dataset.)
import org.apache.spark.sql.functions._
import spark.implicits._

val ds = Seq(
  ("102", "r", 1), ("102", "b", 2), ("102", "g", 3), ("102", "b", 4),
  ("102", "o", 5), ("102", "y", 6), ("102", "r", 7), ("102", "g", 8),
  ("145", "c", 1), ("145", "g", 2), ("145", "b", 3), ("145", "r", 4),
  ("145", "g", 5), ("145", "c", 6), ("145", "g", 7)
).toDF("car_num", "colour", "timestamp").as[(String, String, Long)]
import org.apache.spark.sql.expressions.Window
val win = Window.partitionBy("car_num").orderBy("timestamp")
ds.
  withColumn("group", sum(when($"colour" === "g", 1).otherwise(0)).over(win)).
  groupBy("car_num", "group").agg(
    // group 0 has no leading "g" row; every later group starts with one, so subtract it
    when($"group" === 0, count("group")).otherwise(count("group") - 1).as("count")
  ).
  groupBy("car_num").agg(max("count").as("max_between_g")).
  show
// +-------+-------------+
// |car_num|max_between_g|
// +-------+-------------+
// | 102| 4|
// | 145| 2|
// +-------+-------------+
An alternative to using the DataFrame API is to apply groupByKey to the Dataset followed by mapGroups like below:
ds.
  map(t => (t._1, t._2)).  // (car_num, colour)
  groupByKey(_._1).mapGroups { case (k, iter) =>
    val (lastRun, maxRun) = iter.map(_._2).foldLeft((0, 0)) { case ((cnt, mx), c) =>
      if (c == "g") (0, math.max(cnt, mx)) else (cnt + 1, mx)
    }
    // math.max covers a trailing run that is not terminated by a "g"
    (k, math.max(lastRun, maxRun))
  }.
  show
// +---+---+
// | _1| _2|
// +---+---+
// |102| 4|
// |145| 2|
// +---+---+
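One caveat with the groupByKey/mapGroups version: the iterator handed to mapGroups does not guarantee any particular row order, while the run length depends on it. A hedged variant that keeps the timestamp and sorts each group explicitly (same ds as above):
ds.
  groupByKey(_._1).mapGroups { case (k, iter) =>
    // order this car's paint jobs by timestamp before scanning for runs
    val colours = iter.toSeq.sortBy(_._3).map(_._2)
    val (lastRun, maxRun) = colours.foldLeft((0, 0)) { case ((cnt, mx), c) =>
      if (c == "g") (0, math.max(cnt, mx)) else (cnt + 1, mx)
    }
    (k, math.max(lastRun, maxRun))
  }.
  show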

PySpark - Convert list of JSON objects to rows

I want to convert a list of JSON objects into rows and store their attributes as columns.
{
  "heading": 1,
  "columns": [
    {
      "col1": "a",
      "col2": "b",
      "col3": "c"
    },
    {
      "col1": "d",
      "col2": "e",
      "col3": "f"
    }
  ]
}
Final Result
heading | col1 | col2 | col3
1       | a    | b    | c
1       | d    | e    | f
I am currently flattening my data (and excluding the columns column)
df = target_table.relationalize('roottable', temp_path)
However, for this use case, I will need the columns column. I saw examples where arrays_zip and explode were used. Would I need to iterate through each object, or is there an easier way to extract each object and convert it into a row?
Using the Spark SQL built-in function inline (or inline_outer) is probably the easiest way to handle this (use inline_outer when NULLs are allowed in columns).
From the Apache Hive documentation:
Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.)
df.selectExpr('heading', 'inline_outer(columns)').show()
+-------+----+----+----+
|heading|col1|col2|col3|
+-------+----+----+----+
| 1| a| b| c|
| 1| d| e| f|
+-------+----+----+----+
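For comparison, a minimal sketch of the same reshaping in the Scala DataFrame API, assuming the JSON above has been loaded into a DataFrame named df:
import org.apache.spark.sql.functions.explode
import spark.implicits._
df.select($"heading", explode($"columns").as("c"))
  .select($"heading", $"c.col1", $"c.col2", $"c.col3")
  .show()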

Looping through dataframe columns to form a nested dataframe - Spark

I have a dataframe as below,
val x = Seq(("A", "B", "C", "D")).toDF("DOC", "A1", "A2", "A3")
+---+---+---+---+
|DOC| A1| A2| A3|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
Here the A columns can go up to 100, so I want to loop over all of them and nest them under a common structure, as below:
+---+---------+
|DOC|   A LIST|
+---+---------+
|  A|[B, C, D]|
+---+---------+
I want to create the dataframe by generating dynamic column names like A1, A2.. (looping from 1 to 100) and doing a select.
How can I do this?
Cheers!
Simply assemble the list of columns to be combined into an array: transform the column names into Columns via col and apply the array method to the resulting list:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "a", "b", "c", 10.0),
  (2, "d", "e", "f", 20.0)
).toDF("id", "a1", "a2", "a3", "b")

// names of the columns to be combined, and the remaining columns
val selectedColNames = df.columns.filter(_.startsWith("a"))
val selectedCols = selectedColNames.map(col)
val otherCols = df.columns.diff(selectedColNames).map(col)

df.select((otherCols :+ array(selectedCols: _*).as("a_list")): _*).show
// +---+----+---------+
// | id| b| a_list|
// +---+----+---------+
// | 1|10.0|[a, b, c]|
// | 2|20.0|[d, e, f]|
// +---+----+---------+
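If the column names really are A1 through A100 as in the question, a hedged variant (assuming columns A1..A100 all exist in the dataframe x from the question) can generate the names directly instead of filtering by prefix:
// hypothetical: build the A1..A100 column list explicitly
val aCols = (1 to 100).map(i => col(s"A$i"))
x.select($"DOC", array(aCols: _*).as("A LIST")).show(false)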

Join two dataframes with different records and size in Spark

It seems this issue has been asked a couple of times, but the solutions suggested in previous questions are not working for me.
I have two dataframes with different dimensions, as shown in the picture below. The second table was originally part of the first, but after some processing on it I added one more column, column4. Now I want to join these two tables so that I get the third (required) table after joining.
Things that I tried:
I tried a couple of different solutions, but none of them worked.
I tried
val required =first.join(second, first("PDE_HDR_CMS_RCD_NUM") === second("PDE_HDR_CMS_RCD_NUM") , "left_outer")
Also I tried
val required = first.withColumn("SEQ", when(second.col("PDE_HDR_FILE_ID") === (first.col("PDE_HDR_FILE_ID").alias("PDE_HDR_FILE_ID1")), second.col("uniqueID")).otherwise(lit(0)))
In the second attempt I used .alias after I got an error that says:
Error occured during extract process. Error:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) uniqueID#775L missing from.
Thanks for taking time to read my question
To generate the wanted result, you should join the two tables on column(s) that are row-identifying in your first table. Assuming c1 + c2 + c3 uniquely identifies each row in the first table, here's an example using a partial set of your sample data:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(
  (1, "e", "o"),
  (4, "d", "t"),
  (3, "f", "e"),
  (2, "r", "r"),
  (6, "y", "f"),
  (5, "t", "g"),
  (1, "g", "h"),
  (4, "f", "j"),
  (6, "d", "k"),
  (7, "s", "o")
).toDF("c1", "c2", "c3")

val df2 = Seq(
  (3, "f", "e", 444),
  (5, "t", "g", 555),
  (7, "s", "o", 666)
).toDF("c1", "c2", "c3", "c4")

df1.join(df2, Seq("c1", "c2", "c3"), "left_outer").show
// +---+---+---+----+
// | c1| c2| c3| c4|
// +---+---+---+----+
// | 1| e| o|null|
// | 4| d| t|null|
// | 3| f| e| 444|
// | 2| r| r|null|
// | 6| y| f|null|
// | 5| t| g| 555|
// | 1| g| h|null|
// | 4| f| j|null|
// | 6| d| k|null|
// | 7| s| o| 666|
// +---+---+---+----+
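As a side note on the design choice: joining with Seq("c1", "c2", "c3") is a "using" join, so each key column appears only once in the result. A sketch of the equivalent join with explicit column equality (the style attempted in the question) shows why it is clumsier, since both copies of the keys survive and have to be dropped:
df1.join(df2,
    df1("c1") === df2("c1") && df1("c2") === df2("c2") && df1("c3") === df2("c3"),
    "left_outer")
  .drop(df2("c1")).drop(df2("c2")).drop(df2("c3"))
  .show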

Aggregating similar records in Spark 1.6.2

I have a data set where, after some transformations using Spark SQL (1.6.2) in Scala, I got the following data (part of the data):
home   away   count
a      b      90
b      a      70
c      d      50
e      f      45
f      e      30
Now I want to get the final result by aggregating similar home/away pairs, i.e. a and b appear twice. Similar home and away values may not always come in consecutive rows:
home   away   count
a      b      160
c      d      50
e      f      75
Can someone help me out with this?
You can create a temporary column using array and sort_array, which you can then groupBy on to solve this. Here I assumed there can be at most two rows with the same values in the home/away columns, and that it doesn't matter which value is in home and which is in away:
val df = Seq(("a", "b", 90),
("b", "a", 70),
("c", "d", 50),
("e", "f", 45),
("f", "e", 30)).toDF("home", "away", "count")
val df2 = df.withColumn("home_away", sort_array(array($"home", $"away")))
.groupBy("home_away")
.agg(sum("count").as("count"))
.select($"home_away"(0).as("home"), $"home_away"(1).as("away"), $"count")
.drop("home_away")
Will give:
+----+----+-----+
|home|away|count|
+----+----+-----+
| e| f| 75|
| c| d| 50|
| a| b| 160|
+----+----+-----+