pyspark join two rdds and flatten the results

Environment is pyspark, Spark Version 2.2.
We have two RDDs, test1 and test2; below is some sample data:
test1 = [('a', 20), ('b', 10), ('c', 2)]
test2 = [('a', 2), ('b', 3)]
Now we want to generate output1 as below; any help is appreciated.
[('a', 20, 2), ('b', 10, 3)]

You can accomplish this with a simple join followed by a call to map to flatten the values.
test1.join(test2).map(lambda kv: (kv[0],) + kv[1]).collect()
#[('a', 20, 2), ('b', 10, 3)]
To explain, the result of the join is the following:
test1.join(test2).collect()
#[('a', (20, 2)), ('b', (10, 3))]
This is almost the desired output, but you want to flatten the results. We can accomplish this by calling map and returning a new tuple in the desired format: the expression (kv[0],) creates a one-element tuple containing just the key, which we then concatenate with the tuple of joined values. (We index into the (key, values) pair because unpacking a tuple directly in the lambda signature is only supported in Python 2.)
You can also use the DataFrame API by calling toDF() on each RDD to convert it to a DataFrame:
test1.toDF(["key", "value1"]).join(test2.toDF(["key", "value2"]), on="key").show()
#+---+------+------+
#|key|value1|value2|
#+---+------+------+
#|  b|    10|     3|
#|  a|    20|     2|
#+---+------+------+
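If you need the result back as a flat Python list of tuples rather than a DataFrame, you can collect the joined DataFrame and convert each Row (Row objects behave like tuples). A minimal sketch building on the DataFrame version above (the names joined and flat are just for illustration):
joined = test1.toDF(["key", "value1"]).join(test2.toDF(["key", "value2"]), on="key")
flat = [tuple(row) for row in joined.collect()]  # e.g. [('b', 10, 3), ('a', 20, 2)]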

Related

How to create a column with the maximum number in each row of another column in PySpark?

I have a PySpark dataframe, each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set, 469 for this row. I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
|  a|               d|maxd|
+---+----------------+----+
|  1|  [1, 772, 3, 4]| 772|
|  2|[5, 6, 44, 8, 9]|  44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning your string into an array and then operating on that array as described above. You can use the regexp_replace function to quickly remove the braces, and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame([(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a', 'd'))
df2 = (df
    .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))       # strip the surrounding braces
    .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))  # split into an array of longs
    .withColumn("maxd", F.array_max(F.col("as_arr")))
    .drop("as_str"))
df2.show()
+---+---------+------------+----+
|  a|        d|      as_arr|maxd|
+---+---------+------------+----+
|  1|{1,2,3,4}|[1, 2, 3, 4]|   4|
|  2|{5,6,7,8}|[5, 6, 7, 8]|   8|
+---+---------+------------+----+
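Note that array_max is only available from Spark 2.4 onwards. If you are on an older release (for example the Spark 2.2 environment mentioned at the top of this page), one workaround is to sort the array in descending order and take its first element. A rough sketch under that assumption, reusing the as_arr column built above:
import pyspark.sql.functions as F
# sort the array descending and pick element 0 as the maximum
df2 = df2.withColumn("maxd", F.sort_array(F.col("as_arr"), asc=False).getItem(0))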

Selecting subset spark dataframe by months

I have this dataset:
I want to take a 3-month subset of it (e.g. the months April, May and August) using pyspark.
So far I haven't found anything that would let me get close to this with pyspark.
You can extract the month using month() and then apply an isin filter to find the rows matching your criteria.
from pyspark.sql import functions as F
data = [(1, "2021-01-01", ), (2, "2021-04-01", ), (3, "2021-05-01", ), (4, "2021-06-01", ), (5, "2021-07-01", ), (6, "2021-08-01", ), ]
df = spark.createDataFrame(data, ("cod_item", "date_emissao", )).withColumn("date_emissao", F.to_date("date_emissao"))
df.filter(F.month("date_emissao").isin(4, 5, 8)).show()
"""
+--------+------------+
|cod_item|date_emissao|
+--------+------------+
|       2|  2021-04-01|
|       3|  2021-05-01|
|       6|  2021-08-01|
+--------+------------+
"""

How to find change occurrence points in a Spark dataframe

I have a simple dataframe as an example:
val someDF = Seq(
  (1, "A"),
  (2, "A"),
  (3, "A"),
  (4, "B"),
  (5, "B"),
  (6, "A"),
  (7, "A"),
  (8, "A")
).toDF("t", "state")
// this part is half pseudocode
someDF.aggregate((acc, cur) => {
  if (acc.last.state != cur.state) {
    acc.add(cur)
  }
}, List()).show(truncate=false)
"t" column represents points in time and "state" column represents the state at that point in time.
What I wish to find is the first time where each change occurs plus the first row, as in:
(1, "A")
(4, "B")
(6, "A")
I looked at SQL solutions too, but they involve complex self-joins and window functions that I don't completely understand; still, an SQL solution is fine as well.
There are numerous functions in Spark (fold, aggregate, reduce, ...) that I feel could do this, but I couldn't grasp the differences between them since I'm new to Spark concepts like partitioning, and it's a bit tricky if partitioning could affect the results.
You can use the window function lag for comparing with the previous row, and row_number for checking whether it's the first row:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val result = someDF.withColumn(
  "change",
  lag("state", 1).over(Window.orderBy("t")) =!= col("state") ||
    row_number().over(Window.orderBy("t")) === 1
).filter("change").drop("change")
result.show
+---+-----+
|  t|state|
+---+-----+
|  1|    A|
|  4|    B|
|  6|    A|
+---+-----+
For an SQL solution:
someDF.createOrReplaceTempView("mytable")
val result = spark.sql("""
  select t, state
  from (
    select
      t, state,
      lag(state) over (order by t) != state or row_number() over (order by t) = 1 as change
    from mytable
  )
  where change
""")

Weighted mean, median and quartiles in Spark

I have a Spark SQL dataframe:
+---+-----+-------+
| id|Value|Weights|
+---+-----+-------+
|  1|    2|      4|
|  1|    5|      2|
|  2|    1|      4|
|  2|    6|      2|
|  2|    9|      4|
|  3|    2|      4|
+---+-----+-------+
I need to group by 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number: the number is repeated as many times as specified in the column 'Weights' (casting to int is necessary, because array_repeat expects this column to be of int type). After this step, the first value of 2 is transformed into [2, 2, 2, 2].
Then, explode creates a row for every element in the array, so the array [2, 2, 2, 2] becomes 4 rows, each containing the integer 2.
Then you can calculate the statistics; the results will have the weights applied, as your dataframe has now been expanded according to the weights.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 4),
     (1, 5, 2),
     (2, 1, 4),
     (2, 6, 2),
     (2, 9, 4),
     (3, 2, 4)],
    ['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
    .groupBy('id')
    .agg(F.mean('col').alias('weighted_mean'),
         F.expr('percentile(col, 0.5)').alias('weighted_median'),
         F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
         F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#|  1|          3.0|            2.0|                    2.0|                   4.25|
#|  2|          5.2|            6.0|                    1.0|                    9.0|
#|  3|          2.0|            2.0|                    2.0|                    2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
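If you only need the weighted mean and not the median or quartiles, you can skip the explode entirely and compute it directly from the weights. A minimal sketch, assuming df_raw holds the original id/Value/Weights dataframe (a hypothetical name, since df is reassigned in the example above):
# df_raw: the original id / Value / Weights dataframe
weighted = (df_raw
    .groupBy('id')
    .agg((F.sum(F.col('Value') * F.col('Weights')) / F.sum('Weights'))
         .alias('weighted_mean')))
weighted.show()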

How to convert a SQL query output (dataframe) into an array list of key value pairs in Spark Scala?

I created a dataframe in the spark scala shell for SFPD incidents. I queried the data for the count per Category and the result is a dataframe. I want to plot this data as a graph using Wisp. Here is my dataframe:
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT|  362266|
|OTHER OFFENSES|  257197|
|  NON-CRIMINAL|  189857|
|       ASSAULT|  157529|
| VEHICLE THEFT|  109733|
| DRUG/NARCOTIC|  108712|
|     VANDALISM|   91782|
|      WARRANTS|   85837|
|      BURGLARY|   75398|
|SUSPICIOUS OCC|   64452|
+--------------+--------+
I want to convert this dataframe into an array list of key-value pairs. So I want the result to look like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this dataframe (t) into an RDD with val rddt = t.rdd, and then used flatMapValues:
rddt.flatMapValues(x=>x).collect()
but still couldn't get the required result.
Or is there a way to feed the dataframe output directly into Wisp?
In pyspark it'd be as below. Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Mapping the test data, reformatting from an RDD of Rows to an RDD of tuples. Then, using collect to extract all the tuples as a list.
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
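As a side note (not part of the original answer), pyspark Row objects are a subclass of tuple, so the same result can be obtained a bit more compactly; a small sketch on the same test data:
df.rdd.map(tuple).collect()
# [(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]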
Here's the Scala Spark Row documentation, which should help you convert this to Scala Spark code.