How to sample groups in pyspark?

I have a dataframe with more than 1M groups, and each group contains roughly 100 records (rows). How do I sample based on the distinct groups in pyspark so that the selected groups keep all of their rows?
A much smaller example:
+-----+---+
|group| x |
+-----+---+
| 1 |0.1|
| 1 |0.2|
| 2 |0.1|
| 2 |0.5|
| 2 |0.3|
| 3 |0.5|
| 4 |0.8|
| 4 |0.5|
+-----+---+
I want to sample so that if groups 1 and 3 are selected, I get the complete records for them:
+-----+---+
|group| x |
+-----+---+
| 1 |0.1|
| 1 |0.2|
| 3 |0.5|
+-----+---+
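One possible approach (a minimal sketch, not an answer from the original thread): sample the distinct group ids first, then keep every row belonging to a sampled group with a left-semi join. The fraction and seed below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 0.1), (1, 0.2), (2, 0.1), (2, 0.5), (2, 0.3), (3, 0.5), (4, 0.8), (4, 0.5)],
    ["group", "x"],
)

# Sample ~50% of the distinct groups, then pull back all rows for those groups.
sampled_groups = df.select("group").distinct().sample(fraction=0.5, seed=42)
sampled_df = df.join(sampled_groups, on="group", how="left_semi")
sampled_df.show()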

Related

Ranking a pivoted column

I want to pivot a column and then rank the data from the pivoted column. Here is sample data:
| id | objective | metric | score |
|----|-----------|-------------|-------|
| 1 | Sales | Total Sales | 10 |
| 1 | Marketing | Total Reach | 4 |
| 2 | Sales | Total Sales | 2 |
| 2 | Marketing | Total Reach | 11 |
| 3 | Sales | Total Sales | 9 |
This would be my expected output after pivot + rank:
| id | Sales | Marketing |
|----|--------|-----------|
| 1 | 1 | 2 |
| 2 | 3 | 1 |
| 3 | 2 | 3 |
The ranking is based on sum(score) for each objective. An objective can also have multiple metrics, but that isn't included in the sample for simplicity.
I have been able to successfully pivot and sum the scores like so:
import pyspark.sql.functions as sf

pivot = (
    spark.table('scoring_table')
    .select('id', 'objective', 'metric', 'score')
    .groupBy('id')
    .pivot('objective')
    .agg(
        sf.sum('score').alias('score')
    )
)
This then lets me see the total score per objective, but I'm unsure how to rank these. I have tried the following after aggregation:
.withColumn('rank', rank().over(Window.partitionBy('id', 'objective').orderBy(sf.col('score').desc())))
However, objective is no longer available at this point, since it has been pivoted away. I then tried this instead:
.withColumn('rank', rank().over(Window.partitionBy('id', 'Sales', 'Marketing').orderBy(sf.col('score').desc())))
But the score column is also no longer available. How can I rank these scores after pivoting the data?
You just need to order by the score after the pivot:
from pyspark.sql import functions as F, Window

df2 = df.groupBy('id').pivot('objective').agg(F.sum('score')).fillna(0)
df3 = df2.select(
    'id',
    *[F.rank().over(Window.orderBy(F.desc(c))).alias(c) for c in df2.columns[1:]]
)
df3.show()
df3.show()
+---+---------+-----+
| id|Marketing|Sales|
+---+---------+-----+
| 2| 1| 3|
| 1| 2| 1|
| 3| 3| 2|
+---+---------+-----+

How can I add one column to other columns in PySpark?

I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've tried naively running a UDF on individual columns, but it takes 30s, 50s, 80s... (the time keeps increasing) per column, so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
    df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this transformation of adding one column to many others?
Using the RDD API, it can be done this way. Given this input:
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
    .toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+
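For what it's worth, a single select with a list comprehension keeps this in the DataFrame API and produces all the shifted columns in one projection (a sketch, assuming Average is a column of df and every other column is a series to shift):
from pyspark.sql import functions as F

# Subtract the Average column from every other column in one pass.
value_cols = [c for c in df.columns if c != "Average"]
df2 = df.select(
    *[(F.col(c) - F.col("Average")).alias(c) for c in value_cols],
    "Average",
)
df2.show()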

Transform structure of Spark DF. Create one column or row for each value in a column. Impute values [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I have a Spark DF with the following structure:
+--------------------------------------+
| user| time | counts |
+--------------------------------------+
| 1 | 2018-06-04 16:00:00.0 | 5 |
| 1 | 2018-06-04 17:00:00.0 | 7 |
| 1 | 2018-06-04 17:30:00.0 | 7 |
| 1 | 2018-06-04 18:00:00.0 | 8 |
| 1 | 2018-06-04 18:30:00.0 | 10 |
| 1 | 2018-06-04 19:00:00.0 | 9 |
| 1 | 2018-06-04 20:00:00.0 | 7 |
| 2 | 2018-06-04 17:00:00.0 | 4 |
| 2 | 2018-06-04 18:00:00.0 | 4 |
| 2 | 2018-06-04 18:30:00.0 | 5 |
| 2 | 2018-06-04 19:30:00.0 | 7 |
| 3 | 2018-06-04 16:00:00.0 | 6 |
+--------------------------------------+
It was obtained from an event-log table using the following code:
ranked.groupBy($"user", sql.functions.window($"timestamp", "30 minutes"))
  .agg(sum("id").as("counts"))
  .withColumn("time", $"window.start")
As can be seen looking at the time column, not all 30-min intervals registered events for each user, i.e. not all user groups of frames are of equal lengths. I'd like to impute (possibly with NA's or 0's) missing time values and create a table (or RDD) like the following:
+-----------------------------------------------------------------------------+
| user| 2018-06-04 16:00:00 | 2018-06-04 16:30:00 | 2018-06-04 17:00:00 | ... |
+-----------------------------------------------------------------------------+
| 1 | 5 | NA (or 0) | 7 | ... |
| 2 | NA (or 0) | NA (or 0) | 4 | ... |
| 3 | 6 | NA (or 0) | NA (or 0) | ... |
+-----------------------------------------------------------------------------+
The transpose of the table above (with a time column and a column for the counts of each user) would theoretically work too, but I am not sure it would be optimal Spark-wise, as I have almost a million different users.
How can I perform a table restructuring like the one described?
If each time window appears for at least one user, a simple pivot will do the trick (and put null for missing values). With millions of rows, that should be the case.
val reshaped_df = df.groupBy("user").pivot("time").agg(sum('counts))
In case a column is still missing, you could access the list of the columns with reshaped_df.columns and then add the missing ones. You would need to generate the list of columns that you expect (expected_columns) and then generate the missing ones as follows:
val expected_columns = ???
var result = reshaped_df
expected_columns.foreach { c =>
  if (!result.columns.contains(c))
    result = result.withColumn(c, lit(null))
}
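Since the rest of this page is pyspark, the equivalent pivot in Python looks roughly like this (a sketch, assuming df has user, time, and counts columns; fillna(0) turns missing slots into zeros instead of nulls):
from pyspark.sql import functions as F

reshaped_df = (
    df.groupBy("user")
      .pivot("time")
      .agg(F.sum("counts"))
      .fillna(0)
)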

PySpark: How can I join one more column to a dataFrame?

I'm working on a dataframe with two initial columns, id and colA.
+---+-----+
|id |colA |
+---+-----+
| 1 | 5 |
| 2 | 9 |
| 3 | 3 |
| 4 | 1 |
+---+-----+
I need to merge one more column, colB, into that dataFrame. I know that colB fits perfectly at the end of the dataFrame; I just need some way to join them together.
+-----+
|colB |
+-----+
| 8 |
| 7 |
| 0 |
| 6 |
+-----+
As a result, I need to obtain a new dataFrame like the one below:
+---+-----+-----+
|id |colA |colB |
+---+-----+-----+
| 1 | 5 | 8 |
| 2 | 9 | 7 |
| 3 | 3 | 0 |
| 4 | 1 | 6 |
+---+-----+-----+
This is the pyspark code to obtain the first DataFrame:
l=[(1,5),(2,9), (3,3), (4,1)]
names=["id","colA"]
db=sqlContext.createDataFrame(l,names)
db.show()
How can I do it? Could anyone help me, please? Thanks
Done! I solved it by adding a temporary column with the row indices and then dropping it.
Code:
from pyspark.sql import Row
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

# row_number() needs an ordered window; order by a generated id.
w = Window.orderBy(monotonically_increasing_id())

l = [(1, 5), (2, 9), (3, 3), (4, 1)]
names = ["id", "colA"]
db = sqlContext.createDataFrame(l, names)
db.show()

l = [8, 7, 0, 6]
rdd = sc.parallelize(l).map(lambda x: Row(x))
test_df = rdd.toDF()
test_df2 = test_df.selectExpr("_1 as colB")
dbB = test_df2.select("colB")

db = db.withColumn("columnindex", row_number().over(w))
dbB = dbB.withColumn("columnindex", row_number().over(w))

testdf_out = (
    db.join(dbB, db.columnindex == dbB.columnindex, 'inner')
      .drop(db.columnindex)
      .drop(dbB.columnindex)
)
testdf_out.show()
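A hedged alternative sketch (not from the original answer): pair the rows positionally with RDD zipWithIndex, which avoids pushing everything through a single unpartitioned window.
db_indexed = (
    db.rdd.zipWithIndex()
      .map(lambda pair: (*pair[0], pair[1]))
      .toDF(db.columns + ["columnindex"])
)
dbB_indexed = (
    dbB.rdd.zipWithIndex()
       .map(lambda pair: (*pair[0], pair[1]))
       .toDF(dbB.columns + ["columnindex"])
)
result = db_indexed.join(dbB_indexed, "columnindex", "inner").drop("columnindex")
result.show()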

Dataframe in pyspark - How to apply aggregate functions to two columns?

I'm using Dataframe in pyspark. I have one table like Table 1 below, and I need to obtain Table 2, where:
num_category - the number of different categories for each id
sum(count) - the sum of the third column in Table 1 for each id.
Example:
Table 1
id |category | count
1 | 4 | 1
1 | 3 | 2
1 | 1 | 2
2 | 2 | 1
2 | 1 | 1
Table 2
id |num_category| sum(count)
1 | 3 | 5
2 | 2 | 2
I tried:
table1 = data.groupBy("id","category").agg(count("*"))
cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
Error:
1 table1 = data.groupBy("id","category").agg(count("*"))
---> 2 cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
TypeError: 'DataFrame' object is not callable
You can apply multiple column aggregations to a single grouped DataFrame:
(data.groupby('id')
     .agg({'category': 'count', 'count': 'sum'})
     .withColumnRenamed('count(category)', 'num_category')
     .show())
+---+------------+----------+
| id|num_category|sum(count)|
+---+------------+----------+
|  1|           3|         5|
|  2|           2|         2|
+---+------------+----------+
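For reference, a sketch of the same aggregation with explicit functions (assumed, not from the original answer); countDistinct guards against repeated (id, category) rows, which the dict form would count twice:
from pyspark.sql import functions as F

table2 = data.groupBy('id').agg(
    F.countDistinct('category').alias('num_category'),
    F.sum('count').alias('sum(count)'),
)
table2.show()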