pyspark dataframe: add a new indicator column with random sampling

I have a Spark dataframe with the following schema:
StructType(List(StructField(email_address,StringType,true), StructField(subject_line,StringType,true)))
I want to randomly sample 50% of the population into a control group and the rest into a test group. Currently I am doing it the following way:
df_segment_ctl = df_segment.sample(False, 0.5, seed=0)
df_segment_tmt = df_segment.join(df_segment_ctl, ["email_address"], "leftanti")
But I am certain there must be a better way: instead of two dataframes, I would rather add an indicator column, like the following:
+--------------------+----------+---------+
|       email_address|segment_id|    group|
+--------------------+----------+---------+
|xxxxxxxxxx#gmail.com|       1.1|treatment|
|   xxxxxxx#gmail.com|       1.6|  control|
+--------------------+----------+---------+
Any help is appreciated. I am new to this world.
UPDATE:
I don't want to split the dataframe into two; I just want to add an indicator column.
UPDATE:
Is it possible to have multiple splits elegantly? Suppose that instead of two groups I want a single control group and two treatment groups:
+--------------------+----------+-------+
|       email_address|segment_id|  group|
+--------------------+----------+-------+
|xxxxxxxxxx#gmail.com|       1.1|treat_1|
|   xxxxxxx#gmail.com|       1.6|control|
|     xxxxx#gmail.com|       1.6|treat_2|
+--------------------+----------+-------+

You can split the Spark dataframe into two dataframes using randomSplit, like below:
df_segment_ctl, df_segment_tmt = df_segment.randomSplit(weights=[0.5,0.5], seed=0)
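If you just want an indicator column on a single dataframe (per the updates in the question), one alternative sketch, my own suggestion rather than part of this answer, is to bucket a seeded rand() value with when/otherwise. The thresholds below assume a 50/25/25 control/treat_1/treat_2 split and reuse df_segment from the question:
from pyspark.sql import functions as F

# rand(seed) assigns each row a uniform value in [0, 1); bucket it into groups
df_grouped = (df_segment
    .withColumn("_r", F.rand(seed=0))
    .withColumn("group",
                F.when(F.col("_r") < 0.5, "control")
                 .when(F.col("_r") < 0.75, "treat_1")
                 .otherwise("treat_2"))
    .drop("_r"))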

Related

Spark Column merging all list into 1 single list

I want the below column to be merged into a single list for an n-gram calculation. I am not sure how I can merge all the lists in a column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit) Some more info:
I would like this as a Spark dataframe column, with all the words (including the repeated ones) in a single list. The data is fairly big, so I want to avoid methods like collect.
The OP wants to aggregate all the arrays/lists into a single row.
values = [(['Justin','Lee'],), (['Chatbots','were'],), (['Our','hopes','were'],),
          (['And','why','wouldn'],), (['At','the','Mobile'],)]
df = sqlContext.createDataFrame(values, ['author'])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
This step suffices.
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
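Note, as my own addition rather than part of the original answer: collect_list here yields a list of lists (one entry per row). If what you want is a single flat list of words, two sketches using the same df as above would be flatten (Spark 2.4+) or an explode followed by collect_list:
from pyspark.sql import functions as F

# Option 1 (Spark 2.4+): flatten the collected list of arrays into one array
flat = df.groupby().agg(F.flatten(F.collect_list('author')).alias('all_words'))

# Option 2: explode to one word per row, then collect into a single list
flat = df.select(F.explode('author').alias('word')) \
         .agg(F.collect_list('word').alias('all_words'))

flat.show(truncate=False)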
DataFrames, like other distributed data structures, are not iterable; they can be accessed only through dedicated higher-order functions and/or SQL methods.
Suppose your DataFrame is DF1 and the output is DF2. You need something like:
from pyspark.sql import functions as F

values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author'])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)
Upvote if it works.

Concatenate Dataframe rows based on timestamp value

I have a Dataframe with text messages and a timestamp value for each row.
Like so:
+--------------------------+---------------------+
| message | timestamp |
+--------------------------+---------------------+
| some text from message 1 | 2019-08-03 01:00:00 |
+--------------------------+---------------------+
| some text from message 2 | 2019-08-03 01:01:00 |
+--------------------------+---------------------+
| some text from message 3 | 2019-08-03 01:03:00 |
+--------------------------+---------------------+
I need to concatenate the messages by creating time windows of X minutes, so that for example they look like this:
+---------------------------------------------------+
| message |
+---------------------------------------------------+
| some text from message 1 some text from message 2 |
+---------------------------------------------------+
| some text from message 3 |
+---------------------------------------------------+
After doing the concatenation I have no use for the timestamp column so I can drop it or keep it with any value.
I have been able to do this by iterating through the entire dataframe, adding timestamp diffs and inserting into a new dataframe when the time window is reached. It works, but it's ugly, and I am looking for pointers on how to accomplish this in Scala in a more functional/elegant way.
I looked at Window functions, but since I am not doing aggregations it appears that I have no way to access the content of the groups once the WindowSpec is created, so I didn't get very far.
I also looked at the lead and lag functions, but I couldn't figure out how to use them without also resorting to a for loop.
I appreciate any ideas or pointers on how to accomplish this.
You can use the window datetime function (not to be confused with Window functions) to generate time windows, followed by a groupBy to aggregate messages using concat_ws:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  ("message1", "2019-08-03 01:00:00"),
  ("message2", "2019-08-03 01:01:00"),
  ("message3", "2019-08-03 01:03:00")
).toDF("message", "timestamp")

val duration = "2 minutes"

df.
  groupBy(window($"timestamp", duration)).
  agg(concat_ws(" ", collect_list($"message")).as("message")).
  show(false)
// +------------------------------------------+-----------------+
// |window |message |
// +------------------------------------------+-----------------+
// |[2019-08-03 01:00:00, 2019-08-03 01:02:00]|message1 message2|
// |[2019-08-03 01:02:00, 2019-08-03 01:04:00]|message3 |
// +------------------------------------------+-----------------+
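For readers working in PySpark rather than Scala, a rough equivalent of the same approach (my own sketch, assuming an existing SparkSession named spark) would be:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("message1", "2019-08-03 01:00:00"),
     ("message2", "2019-08-03 01:01:00"),
     ("message3", "2019-08-03 01:03:00")],
    ["message", "timestamp"])

# cast to timestamp, bucket into 2-minute windows, and concatenate the messages per window
(df.withColumn("timestamp", F.to_timestamp("timestamp"))
   .groupBy(F.window("timestamp", "2 minutes"))
   .agg(F.concat_ws(" ", F.collect_list("message")).alias("message"))
   .show(truncate=False))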

Define custom window functions in pySpark

I have a dataframe in Spark for which I would like to create a column defined recursively, like this:
new_column_row = f(last_column_row, other_parameters)
The best way to do that would be to define my own custom window function, but I cannot find a way to do it in PySpark. Has anybody encountered the same problem?
The case I am working on is reconstructing an order book from a list of orders.
I have a dataframe like this (output is what I would like to calculate):
size | price | output
1    | 1     | {1:1}
1.2  | 1.1   | {1:1, 1.2:1.1}
1.3  | - 1.  | {1.2:1.1}
Output is updated at each row like this (in Python pseudocode):
if price not in Output:
    Output[price] = size
else:
    Output[price] = Output[price] + size
    if Output[price] == 0:
        del Output[price]
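To make the intended per-row update concrete, here is a minimal plain-Python restatement of the pseudocode above (my own illustration, not a distributed solution; book maps price to cumulative size):
def update_book(book, size, price):
    # copy so each row keeps its own snapshot of the order book
    book = dict(book)
    book[price] = book.get(price, 0) + size
    if book[price] == 0:
        del book[price]
    return book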

Reshaping RDD from an array of array to unique columns in pySpark

I want to use PySpark to restructure my data so that I can use it for MLlib models. Currently, for each user, I have an array of arrays in one column, and I want to convert it into unique columns with the counts.
Users | column1 |
user1 | [[name1, 4], [name2, 5]] |
user2 | [[name1, 2], [name3, 1]] |
should get converted to:
Users | name1 | name2 | name3 |
user1 | 4.0 | 5.0 | 0.0 |
user2 | 2.0 | 0.0 | 1.0 |
I came up with a method that uses for loops but I am looking for a way that can utilize spark because the data is huge. Could you give me any hints? Thanks.
Edit:
All of the unique names should come as individual columns with the score corresponding to each user. Basically, a sparse matrix.
I am working with pandas right now, and the code I'm using to do this is:
data = data.applymap(lambda x: dict(x))  # convert the array of arrays into a dictionary
columns = list(data)
for i in columns:
    # for each column, use the dictionary to make a new Series and append it to the current dataframe
    data = pd.concat([data.drop([i], axis=1), data[i].apply(pd.Series)], axis=1)
Figured out the answer:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# First explode column1; this makes each element a separate row
df = df.withColumn('column1', F.explode_outer(F.col('column1')))
# Then separate the new column1 into two columns: the name and the count
df = df.withColumn('column1_separated', F.col('column1')[0])
df = df.withColumn('count', F.col('column1')[1].cast(IntegerType()))
# Then pivot the df
df = df.groupby('Users').pivot('column1_separated').sum('count')
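A small caveat of my own, not from the original answer: the pivot leaves null where a user has no entry for a given name, so to get the 0.0 values shown in the expected output you would likely want to fill those afterwards:
df = df.na.fill(0)  # replace the nulls produced by the pivot with 0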

Spark explode multiple columns of row in multiple rows

I have a problem with converting one row with 3 columns into 3 rows.
For example:
ID | String   | colA | colB | colC
1  | sometext | 1    | 2    | 3
I need to convert it into:
ID | String   | resultColumn
1  | sometext | 1
1  | sometext | 2
1  | sometext | 3
I just have a DataFrame that matches the first schema (table):
val df: DataFrame
Note: I can do it using RDDs, but is there another way? Thanks.
Assuming that df has the schema of your first snippet, I would try:
df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))
If you further want to keep the column names, you can use a trick that consists in creating a column of arrays containing both the value and the column name. First create your expression:
val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))
then use getItem (since you cannot use a generator on nested expressions, you need 2 selects here):
df.select($"ID", $"String", expr.as("tmp")).select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))
It is a bit verbose, though; there might be a more elegant way to do this.
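For completeness, a PySpark sketch of the same two ideas, my own rough equivalent rather than part of the original answer, assuming df is the dataframe from the question; the second variant uses a struct so the exploded value keeps its numeric type:
from pyspark.sql import functions as F

# Variant 1: just the values
df.select("ID", "String",
          F.explode(F.array("colA", "colB", "colC")).alias("resultColumn"))

# Variant 2: also keep the name of the originating column
pairs = F.array(*[F.struct(F.col(c).alias("resultColumn"), F.lit(c).alias("columnName"))
                  for c in ["colA", "colB", "colC"]])
df.select("ID", "String", F.explode(pairs).alias("tmp")) \
  .select("ID", "String", "tmp.resultColumn", "tmp.columnName")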