Updating dataframes in a dictionary - pyspark

I have a spark dataframe like the following,
+-------------------+----------+--------------------+----------------+--------------+
| placekey|naics_code| visits_by_day|date_range_start|date_range_end|
+-------------------+----------+--------------------+----------------+--------------+
|zzy-222#627-wby-z9f| 445120|[41,126,72,96,110...| 2018-12-31| 2019-01-07|
|zzw-223#627-s6k-fzz| 722410|[25,22,92,74,98,5...| 2018-12-31| 2019-01-07|
|223-222#627-s8r-8gk| 722410|[70,82,58,80,106,...| 2018-12-31| 2019-01-07|
| ...| ...| ...| ...| ...|
|22j-222#627-vty-5cq| 722511| [11,5,9,5,4,6,5]| 2019-01-28| 2019-02-04|
+-------------------+----------+--------------------+----------------+--------------+
This dataframe has 9 unique naics_code values. My goal is to add a few more columns derived from the existing ones, then partition by naics_code to create 9 different CSV files. I am trying to partition the dataframe first and then add the columns, because I think grouping the data by the key up front will make the work more efficient (let me know if this is a bad idea). So I created a dictionary of dataframes and tried to add a new column to each partitioned dataframe in a loop:
for x in df_dict.values():
    x = x.withColumn('new_col', udf_some_func(x['col1'], x['col2']))
But when I look at the dataframes, the new column is not there; yet when I print x.show() inside the loop, it does show the new column. Why is this the case?

Within each iteration of the loop, x is a local variable. Reassigning x makes it point to the new dataframe you created, but the old dataframe in the dict remains untouched. You probably mean to do something like:
for k, v in df_dict.items():
    df_dict[k] = v.withColumn('new_col', udf_some_func(v['col1'], v['col2']))
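The underlying issue is plain Python name binding, not Spark. A minimal sketch with ordinary strings standing in for dataframes (withColumn likewise returns a new object rather than mutating the old one):

```python
# Rebinding a loop variable never changes the container it came from.
d = {"a": "df_a", "b": "df_b"}

for x in d.values():
    x = x + "+new_col"   # x now points at a brand-new object; d is untouched

print(d)  # {'a': 'df_a', 'b': 'df_b'}

# Assigning back through the key updates the dict entry itself.
for k, v in d.items():
    d[k] = v + "+new_col"

print(d)  # {'a': 'df_a+new_col', 'b': 'df_b+new_col'}
```

The same pattern applies to any immutable-style API: capture the returned object back into the container you are iterating over.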

Related

pyspark dataframe: add a new indicator column with random sampling

I have a spark dataframe containing the following schema:
StructType(List(StructField(email_address,StringType,true), StructField(subject_line,StringType,true)))
I want to randomly sample 50% of the population into control and test groups. Currently I am doing it the following way:
df_segment_ctl = df_segment.sample(False, 0.5, seed=0)
df_segment_tmt = df_segment.join(df_segment_ctl, ["email_address"], "leftanti")
But I am certain there must be a better way, such as creating a group column like the following:
+--------------------+---------------+---------+
|       email_address|     segment_id|group    |
+--------------------+---------------+---------+
|xxxxxxxxxx#gmail.com|            1.1|treatment|
|   xxxxxxx#gmail.com|            1.6|control  |
+--------------------+---------------+---------+
Any help is appreciated. I am new to this world.
UPDATE:
I don't want to split the dataframe into two; I just want to add an indicator column.
UPDATE:
Is it possible to have multiple splits elegantly? Suppose instead of two groups I want a single control and two treatment groups:
+--------------------+---------------+---------+
|       email_address|     segment_id|group    |
+--------------------+---------------+---------+
|xxxxxxxxxx#gmail.com|            1.1|treat_1  |
|   xxxxxxx#gmail.com|            1.6|control  |
|     xxxxx#gmail.com|            1.6|treat_2  |
+--------------------+---------------+---------+
You can split the Spark dataframe using randomSplit like below:
df_segment_ctl, df_segment_tmt = df_segment.randomSplit(weights=[0.5,0.5], seed=0)
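If you specifically want an indicator column (and three groups) rather than separate dataframes, one common approach is to bucket each row on a single uniform random draw. This is an assumption on my part, not part of the answer above; the bucketing logic itself, sketched in plain Python with a seeded generator:

```python
import random

def assign_group(rng):
    """Bucket one row by a uniform draw: 1/3 control, 1/3 treat_1, 1/3 treat_2."""
    r = rng.random()          # uniform in [0, 1)
    if r < 1 / 3:
        return "control"
    elif r < 2 / 3:
        return "treat_1"
    return "treat_2"

rng = random.Random(0)        # fixed seed, so the assignment is reproducible
emails = ["a#gmail.com", "b#gmail.com", "c#gmail.com"]
groups = [assign_group(rng) for _ in emails]
```

In PySpark the same idea would be (untested sketch) df.withColumn('r', F.rand(seed)) followed by chained F.when thresholds on the materialized 'r' column; materializing the draw first guarantees every branch compares against the same value.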

Spark column: merging all lists into a single list

I want the below column merged into a single list for n-gram calculation. I am not sure how I can merge all the lists in a column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit) Some more info: I would like this as a Spark df column, with all the words, including the repeated ones, in a single list. The data is fairly big, so I want to avoid methods like collect.
The OP wants to aggregate all the arrays/lists into a single row.
values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = sqlContext.createDataFrame(values, ['author'])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
This step suffices.
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
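Note that collect_list over an array column yields an array of arrays (the WrappedArray nesting above). For n-gram work you likely want one flat list; Spark 2.4+ has an F.flatten function for that (or you can explode before collecting). The flattening itself, shown in plain Python:

```python
from itertools import chain

nested = [["Justin", "Lee"], ["Chatbots", "were"], ["Our", "hopes", "were"]]

# One flat list, keeping duplicates and original order.
flat = list(chain.from_iterable(nested))
print(flat)  # ['Justin', 'Lee', 'Chatbots', 'were', 'Our', 'hopes', 'were']
```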
DataFrames, like other distributed data structures, are not iterable; they can be accessed only through dedicated higher-order functions and/or SQL methods.
Suppose your input dataframe is DF1 and the output is DF2. You need something like:
from pyspark.sql import functions as F

values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
          (['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author'])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)
Upvote if it works.

How to combine several dataframes together in Scala?

I have several dataframes, each containing a single column. Let's say I have 4 such dataframes, all with one column. How can I form a single dataframe by combining all of them?
val df = xmldf.select(col("UserData.UserValue._valueRef"))
val df2 = xmldf.select(col("UserData.UserValue._title"))
val df3 = xmldf.select(col("author"))
val df4 = xmldf.select(col("price"))
To combine them, I am trying this, but it doesn't work:
var newdf = df
newdf = newdf.withColumn("col1", df2.col("UserData.UserValue._title"))
newdf.show()
It errors out saying that the field of one dataframe is not present in the other. I am not sure how to combine these 4 dataframes; they don't have any common column.
df2 looks like this:
+---------------+
| _title|
+---------------+
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
+---------------+
and df looks like this:
+-----------+
|_valuegiven|
+-----------+
| qwe|
| dfdfrt|
| dfdf|
+-----------+
df3 and df4 are also in the same format. I want a dataframe like the one below:
+-----------+---------------+
|_valuegiven| _title|
+-----------+---------------+
| qwe|_CONFIG_CONTEXT|
| dfdfrt|_CONFIG_CONTEXT|
| dfdf|_CONFIG_CONTEXT|
+-----------+---------------+
I used this:
val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") )
newdf.show()
But I am getting the column names on the go, and as such I would need to append them on the go, so I don't know exactly how many columns I will get. That is why I cannot use the above command.
Your goal is a little unclear. You may be asking how to join these dataframes, but perhaps you just want to select those 4 columns:
val newdf = xmldf.select($"UserData.UserValue._valueRef", $"UserData.UserValue._title", $"author", $"price")
newdf.show
If you really want to join all these dataframes, you'll need to join them all and select the appropriate fields.
If the goal is to get 4 columns from xmldf into a new dataframe you shouldn't be splitting it into 4 dataframes in the first place.
You can select multiple columns from a dataframe by providing additional column names in the select function.
val newdf = xmldf.select(
  col("UserData.UserValue._valueRef"),
  col("UserData.UserValue._title"),
  col("author"),
  col("price"))
newdf.show()
So I looked at various ways, and finally Ram Ghadiyaram's answer in Solution 2 does what I wanted. With this approach you can combine any number of columns on the go. Basically, you create an index column by which to join the dataframes together, and after joining, drop the index column altogether.
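The index-join idea described above, sketched in plain Python for brevity (in Spark you would typically generate the index with zipWithIndex, or row_number over monotonically_increasing_id; all names here are illustrative):

```python
# Give each single-column "dataframe" a row index, join on the index, drop it.
values_col = ["qwe", "dfdfrt", "dfdf"]
title_col = ["_CONFIG_CONTEXT", "_CONFIG_CONTEXT", "_CONFIG_CONTEXT"]

indexed_values = dict(enumerate(values_col))   # index -> _valuegiven
indexed_titles = dict(enumerate(title_col))    # index -> _title

# Inner join on the shared index; the index itself never reaches the output.
combined = [(indexed_values[i], indexed_titles[i])
            for i in sorted(indexed_values.keys() & indexed_titles.keys())]
print(combined)
# [('qwe', '_CONFIG_CONTEXT'), ('dfdfrt', '_CONFIG_CONTEXT'), ('dfdf', '_CONFIG_CONTEXT')]
```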

Dropping entries of close timestamps

I would like to drop all records which are duplicate entries except for a difference in the timestamp. The offset could be any amount of time, but for simplicity I will use 2 minutes.
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:21|ABC |DEF |
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I would like my dataframe to have only these rows:
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I tried something like this but this does not remove duplicates.
val joinedDfNoDuplicates = joinedDFTransformed.as("df1").join(joinedDFTransformed.as("df2"), col("df1.ColA") === col("df2.ColA") &&
col("df1.ColB") === col("df2.ColB") &&
&& abs(unix_timestamp(col("Date")) - unix_timestamp(col("Date"))) > offset
)
For now, I am just selecting distinct, or taking a group-by min as in Find minimum for a timestamp through Spark groupBy dataframe, but I would like a more robust solution, because data outside of that interval may be valid data. Also, the offset could be changed, so it might be within 5 seconds or 5 minutes depending on requirements.
Somebody mentioned creating a UDF that compares dates when all the other columns are the same, but I am not sure exactly how to do that so that I either filter out the rows, or add a flag and then remove those rows. Any help would be greatly appreciated.
Similar SQL question here: Duplicate entries with different timestamp
Thanks!
I would do it like this:
1. Add a dummy column with a constant value.
2. Define a Window partitioned by the dummy column and ordered by Date.
3. Add a new column containing the date of the previous record.
4. Calculate the difference between the date and the previous date.
5. Filter your records based on the value of the difference.
The code can be something like the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val w = Window.partitionBy("dummy").orderBy("Date")  // step 2
df.withColumn("dummy", lit(1))  // step 1
  .withColumn("previousDate", lag($"Date", 1) over w)  // step 3
  .withColumn("difference", unix_timestamp($"Date") - unix_timestamp($"previousDate"))  // step 4
  .filter($"difference".isNull || $"difference" > offset)  // step 5
The solution above is valid if records that are close in time come in pairs. If more than two records can be close together, you can compare each record to the first record (not the previous one) in the window: instead of lag($"Date", 1), use first($"Date"). In that case the difference column contains the time difference between the current record and the first record in the window.
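The lag-and-filter logic above, sketched in plain Python: partition by the non-date columns, order by Date inside each partition, and keep a row only when it starts a new partition or is far enough from the previous row. The 120-second offset is the question's 2-minute example; note this keeps the earliest record of each close pair.

```python
from datetime import datetime

rows = [
    ("2017-07-04 18:50:21", "ABC", "DEF"),
    ("2017-07-04 18:50:26", "ABC", "DEF"),
    ("2017-07-04 18:50:21", "ABC", "KLM"),
]
offset = 120  # seconds; the "2 minutes" from the question

def ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timestamp()

# Sort by the partition key (ColA, ColB) and then by Date, like the Window.
rows_sorted = sorted(rows, key=lambda r: (r[1], r[2], r[0]))

kept = []
prev_key, prev_ts = None, None
for date, a, b in rows_sorted:
    key = (a, b)
    # Keep the row if it starts a new partition or is farther than `offset`
    # from the immediately previous row (lag semantics).
    if key != prev_key or ts(date) - prev_ts > offset:
        kept.append((date, a, b))
    prev_key, prev_ts = key, ts(date)

print(kept)
# [('2017-07-04 18:50:21', 'ABC', 'DEF'), ('2017-07-04 18:50:21', 'ABC', 'KLM')]
```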

Spark - extracting single value from DataFrame

I have a Spark DataFrame query that is guaranteed to return single column with single Int value. What is the best way to extract this value as Int from the resulting DataFrame?
You can use head
df.head().getInt(0)
or first
df.first().getInt(0)
Check DataFrame scala docs for more details
This could solve your problem:
df.map { row => row.getInt(0) }.first()
In PySpark, you can simply take the first element if the dataframe has a single row with one column; otherwise a whole Row is returned and you have to index into it dimension-wise, i.e. a 2-dimensional access like df.head()[0][0].
df.head()[0]
If we have a Spark dataframe such as:
+----------+
|_c0 |
+----------+
|2021-08-31|
+----------+
x = df.first()[0]
print(x)
2021-08-31