I have some data source that comes in with certain rows missing when they are invalid.
In the following DataFrame, each user should have 3 rows, from index 0 to index 2.
val df = sc.parallelize(List(
  ("user1",0,true),
  ("user1",1,true),
  ("user1",2,true),
  ("user2",1,true),
  ("user3",0,true),
  ("user3",2,true)
)).toDF("user_id","index","is_active")
For instance:
user1 has all the necessary indexes.
user2 is missing index 0 and index 2.
user3 is missing index 1.
Like this:
+-------+-----+---------+
|user_id|index|is_active|
+-------+-----+---------+
| user1| 0| true|
| user1| 1| true|
| user1| 2| true|
| user2| 1| true|
| user3| 0| true|
| user3| 2| true|
+-------+-----+---------+
I'd like to insert default rows for the missing indexes and produce the following DataFrame.
+-------+-----+---------+
|user_id|index|is_active|
+-------+-----+---------+
| user1| 0| true|
| user1| 1| true|
| user1| 2| true|
| user2| 0| false|
| user2| 1| true|
| user2| 2| false|
| user3| 0| true|
| user3| 1| false|
| user3| 2| true|
+-------+-----+---------+
I've seen a separate, similar question where the answer was to pivot the table so each user has 3 columns. But that relies on every index from 0 to 2 existing for at least some user; in my real case the index covers a very large range, so I cannot guarantee that all columns would be present after the pivot. It also seems like a pretty expensive operation to pivot and then un-pivot just to get the second DataFrame.
I also tried to create a new DataFrame like this:
val indexDF = df.select("user_id").distinct.crossJoin(sc.parallelize(Seq.range(0, 3)).toDF("index"))
val result = indexDF.join(df, Seq("user_id", "index"), "left").na.fill(false)
This actually works, but when I ran it on real data (millions of records and hundreds of index values) it took very long, so I suspect it could be done more efficiently.
Thanks in advance!
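A hedged sketch of one possible variant of the join approach above (not a benchmarked answer; it assumes Spark 2.4+ for the sequence function, and maxIndex stands in for the real upper bound of the range): generate the full (user_id, index) grid with explode(sequence(...)) instead of joining against a separately parallelized range, then left-join and fill only the is_active column.

import org.apache.spark.sql.functions._

// maxIndex is a placeholder for the real upper bound of the index range (assumption).
val maxIndex = 2

// One row per (user_id, index) over the full range.
val fullGrid = df
  .select("user_id").distinct()
  .withColumn("index", explode(sequence(lit(0), lit(maxIndex))))

// Left-join the original data and default the missing rows to is_active = false.
val filled = fullGrid
  .join(df, Seq("user_id", "index"), "left")
  .na.fill(false, Seq("is_active"))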
I have a Spark DataFrame like this:
+-------+------+-----+---------------+
|Account|nature|value| time|
+-------+------+-----+---------------+
| a| 1| 50|10:05:37:293084|
| a| 1| 50|10:06:46:806510|
| a| 0| 50|11:19:42:951479|
| a| 1| 40|19:14:50:479055|
| a| 0| 50|16:56:17:251624|
| a| 1| 40|16:33:12:133861|
| a| 1| 20|17:33:01:385710|
| b| 0| 30|12:54:49:483725|
| b| 0| 40|19:23:25:845489|
| b| 1| 30|10:58:02:276576|
| b| 1| 40|12:18:27:161290|
| b| 0| 50|12:01:50:698592|
| b| 0| 50|08:45:53:894441|
| b| 0| 40|17:36:55:827330|
| b| 1| 50|17:18:41:728486|
+-------+------+-----+---------------+
I want to compare the nature column of one row to other rows with the same Account and value, looking forward, and add a new column named Repeated. The new column gets true for both rows if nature changed from 1 to 0 or vice versa. For example, the above dataframe should look like this:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| true |
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| true |
| a| 0| 50|16:56:17:251624| true |
| b| 0| 50|08:45:53:894441| true |
| b| 0| 50|12:01:50:698592| false|
| b| 1| 50|17:18:41:728486| true |
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| true |
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| true |
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
My idea is to group by (or window over) the Account and value columns; then, within each group, compare the nature of each row to the nature of the other rows and, as a result of that comparison, populate the Repeated column.
I did this calculation with Spark Window functions. Like this:
from pyspark.sql import functions as f
from pyspark.sql.functions import lead, lit, coalesce
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("Account", "value").orderBy("time")
df.withColumn("Repeated", coalesce(f.when(lead(df['nature']).over(windowSpec) != df['nature'], lit(True)).otherwise(False))).show()
The result was like this, which is not the result I wanted:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| false|
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| false|
| a| 0| 50|16:56:17:251624| false|
| b| 0| 50|08:45:53:894441| false|
| b| 0| 50|12:01:50:698592| true|
| b| 1| 50|17:18:41:728486| false|
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| false|
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| false|
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
UPDATE:
To explain more: if we suppose the first Spark DataFrame is named "df", then what I want to do in each group of "Account" and "value" is:
a = df.withColumn('repeated', lit(False))
# pseudocode, applied within each group:
for i in range(len(group)):
    for j in range(i + 1, len(group)):
        if a.loc[i, 'nature'] != a.loc[j, 'nature'] and a.loc[j, 'repeated'] == False:
            a.loc[i, 'repeated'] = True
            a.loc[j, 'repeated'] = True
Would you please guide me on how to do that using a PySpark window?
Any help is really appreciated.
You actually need to guarantee that the order you see in your dataframe is the actual order. Can you do that? You need a column that records the sequence, so you know that what happened did happen in that order; inserting new data into a dataframe doesn't guarantee its order.
A window and lag/lead will allow you to look at a neighbouring row's value and make the required adjustment.
FYI: I use coalesce here because the boundary row of each window has no neighbouring value to compare with. Consider using the second parameter of coalesce as you see fit for what should happen with that row in the account.
If you need an ordering column, look at the monotonically_increasing_id() function. It may help you create the ordering that is required for us to look at this data deterministically.
from pyspark.sql.functions import lag, lead, lit, coalesce
from pyspark.sql.window import Window
spark.sql("create table nature (Account string,nature int, value int, order int)");
spark.sql("insert into nature values ('a', 1, 50,1), ('a', 1, 40,2),('a',0,50,3),('b',0,30,4),('b',0,40,5),('b',1,30,6),('b',1,40,7)")
windowSpec = Window.partitionBy("Account").orderBy("order")
nature = spark.table("nature");
nature.withColumn("Repeated", coalesce( lead(nature['nature']).over(windowSpec) != nature['nature'], lit(True)) ).show()
+-------+------+-----+-----+--------+
|Account|nature|value|order|Repeated|
+-------+------+-----+-----+--------+
| b| 0| 30| 4| false|
| b| 0| 40| 5| true|
| b| 1| 30| 6| false|
| b| 1| 40| 7| true|
| a| 1| 50| 1| false|
| a| 1| 40| 2| true|
| a| 0| 50| 3| true|
+-------+------+-----+-----+--------+
EDIT:
It's not clear from your description whether I should look forward or backward. I have changed my code to look forward a row, as this is consistent with account 'B' in your output. However, it doesn't seem like the logic for account 'A' is identical to the logic for 'B' in your sample output. (Or I don't understand a subtlety of starting on '1' instead of starting on '0'.) If you want to look forward a row use lead; if you want to look back a row use lag.
Problem solved.
Even though this approach is expensive, it works.
# 'df' must already contain a 'repeated' column initialised to False,
# e.g. df = df.withColumn('repeated', lit(False))
def check(part):
    df = part
    size = len(df)
    for i in range(size):
        if df.loc[i, 'repeated'] == True:
            continue
        else:
            for j in range(i + 1, size):
                if (df.loc[i, 'nature'] != df.loc[j, 'nature']) & (df.loc[j, 'repeated'] == False):
                    df.loc[j, 'repeated'] = True
                    df.loc[i, 'repeated'] = True
                    break
    return df

df.groupby("Account", "value").applyInPandas(check, schema="Account string, nature int, value long, time string, repeated boolean").show()
Update1:
Another solution without any iterations.
def check(df):
    df = df.sort_values('verified_time')
    df['index'] = df.index
    df['IS_REPEATED'] = 0
    # XOR of nature sorted ascending vs descending flags exactly
    # 2 * min(#zeros, #ones) rows in each group.
    df1 = df.sort_values(['nature'], ascending=[True]).reset_index(drop=True)
    df2 = df.sort_values(['nature'], ascending=[False]).reset_index(drop=True)
    df1['IS_REPEATED'] = df1['nature'] ^ df2['nature']
    df3 = df1.sort_values(['index'], ascending=[True])
    df = df3.drop(['index'], axis=1)
    return df

df = df.groupby("account", "value").applyInPandas(check, schema=gf.get_schema('trx'))
UPDATE2:
Solution with Spark window:
import pyspark.sql.functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

def is_repeated_feature(df):
    # Rank rows within each (account, value, nature) group.
    windowPartition = Window.partitionBy("account", "value", "nature").orderBy("nature")
    df_1 = df.withColumn('rank', F.row_number().over(windowPartition))
    # Count of rows and sum of nature per (account, value) group.
    w = (Window
         .partitionBy('account', 'value')
         .orderBy('nature')
         .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
    df_1 = df_1.withColumn("count_nature", F.count('nature').over(w))
    df_1 = df_1.withColumn('sum_nature', F.sum('nature').over(w))
    df_1 = df_1.select('*')
    # min_val = min(#ones, #zeros), i.e. the number of (0, 1) pairs in the group.
    df_2 = df_1.withColumn('min_val',
                           when((df_1.sum_nature > (df_1.count_nature - df_1.sum_nature)),
                                (df_1.count_nature - df_1.sum_nature)).otherwise(df_1.sum_nature))
    df_2 = df_2.withColumn('more_than_one', when(df_2.count_nature > 1, '1').otherwise('0'))
    df_2 = df_2.withColumn('is_repeated',
                           when(((df_2.more_than_one == 1) & (df_2.count_nature > df_2.sum_nature) &
                                 (df_2.rank <= df_2.min_val)), '1')
                           .otherwise('0'))
    return df_2
I have a dataset that has userid and index columns.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and assign an index to the newly added rows.
The userid is unique, and the existing data frame will not contain the user ids of the second DataFrame.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by passing the max index value and starting the index of the second DataFrame from that value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
EDIT: After discussions in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null columns as the index
df1=df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%%Define a window to arrange the newly added rows at the last and order them by userid
#%% The user id, even though random strings, can be ordered
w= Window.orderBy(F.col('index').asc_nulls_last(),F.col('userid'))# if possible add a partition column here, otherwise all your data will come in one partition, consider salting
#%% For the newly added rows, define index as the maximum value + increment of number of rows in main dataframe
df_final = df_merge.withColumn("index_new",F.when(~F.col('index').isNull(),F.col('index')).otherwise((F.last(F.col('index'),ignorenulls=True).over(w))+F.sum(F.lit(1)).over(w)))
#%% If number of rows in main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
There is a dataframe containing 2 columns: [time: Timestamp, value: Double].
Here we define a rule to find outliers of value. In addition, we want to pick up the neighboring rows of the rows containing outliers.
For example, suppose row 7 of the dataframe contains the outlier value we defined; then we want to get rows 4~10 (the 3 rows on either side of row 7).
How can I implement that? I guess rowsBetween might be the way to go, but I do not know how.
Thanks!
Yes, you can use window functions with rowsBetween like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

val df = Seq(
(1,220),
(2,220),
(3,220),
(4,220),
(5,220),
(6,230),
(7,220),
(8,220),
(9,220),
(10,220)
).toDF("time","value")
df
.withColumn("is_outlier",$"value">220)
.withColumn("outlier_region",max($"is_outlier").over(Window.orderBy($"time").rowsBetween(-3L,3L)))
.show()
gives:
+----+-----+----------+--------------+
|time|value|is_outlier|outlier_region|
+----+-----+----------+--------------+
| 1| 220| false| false|
| 2| 220| false| false|
| 3| 220| false| true|
| 4| 220| false| true|
| 5| 220| false| true|
| 6| 230| true| true|
| 7| 220| false| true|
| 8| 220| false| true|
| 9| 220| false| true|
| 10| 220| false| false|
+----+-----+----------+--------------+
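If you then want to keep only the neighbouring rows themselves rather than just display the flag, a small follow-up sketch: filter on the computed outlier_region and drop the helper columns. (Note that Window.orderBy without a partitionBy moves all rows into a single partition, which is fine for small data but worth keeping in mind.)

val w = Window.orderBy($"time").rowsBetween(-3L, 3L)

val neighbours = df
  .withColumn("is_outlier", $"value" > 220)
  .withColumn("outlier_region", max($"is_outlier").over(w))
  .filter($"outlier_region")
  .drop("is_outlier", "outlier_region")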
I have a Spark data frame as shown below -
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
scala> myDF.show
+-------+-------+---------+-------------+------+
|visitor|channel|timestamp|purchase_flag|amount|
+-------+-------+---------+-------------+------+
| 1| A| 100| 0| 0|
| 1| E| 200| 0| 0|
| 1| | 300| 1| 49|
| 2| A| 200| 0| 0|
| 2| C| 300| 0| 0|
| 2| D| 100| 0| 0|
+-------+-------+---------+-------------+------+
I would like to create a sequence dataframe for every visitor from myDF that traces the visitor's path to purchase, ordered by the timestamp dimension.
The output dataframe should look like below (-> can be any delimiter):
+-------+---------------------+
|visitor|channel sequence |
+-------+---------------------+
| 1| A->E->purchase |
| 2| D->A->C->no_purchase|
+-------+---------------------+
To make things clear, visitor 2 has been exposed to channel D, then A and then C, and he does not make a purchase.
Hence the sequence is to be formed as D->A->C->no_purchase.
NOTE: Whenever a purchase happens, the channel value goes blank and purchase_flag is set to 1.
I want to do this using a Scala UDF in Spark so that I can re-apply the method on other datasets.
Here's how it can be done using a udf function:
val myDF = Seq(
(1,"A",100,0,0),
(1,"E",200,0,0),
(1,"",300,1,49),
(2,"A",200,0,0),
(2,"C",300,0,0),
(2,"D",100,0,0)
).toDF("visitor","channel","timestamp","purchase_flag","amount")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

def sequenceUdf = udf((struct: Seq[Row], purchased: Seq[Int]) =>
  struct.map(row => (row.getAs[String]("channel"), row.getAs[Int]("timestamp")))
    .sortBy(_._2)
    .map(_._1)
    .filterNot(_ == "")
    .mkString("->") + { if (purchased.contains(1)) "->purchase" else "->no_purchase" })
myDF.groupBy("visitor").agg(collect_list(struct("channel", "timestamp")).as("struct"), collect_list("purchase_flag").as("purchased"))
.select(col("visitor"), sequenceUdf(col("struct"), col("purchased")).as("channel sequence"))
.show(false)
which should give you
+-------+--------------------+
|visitor|channel sequence |
+-------+--------------------+
|1 |A->E->purchase |
|2 |D->A->C->no_purchase|
+-------+--------------------+
You can make it as generic as you need; this is just a demo of how you could proceed.
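For comparison, a hedged sketch of a UDF-free variant (assuming Spark 2.4+ for the higher-order functions filter/transform and for concat on arrays): sort the collected (timestamp, channel) structs, drop the blank channel that marks the purchase row, and append the purchase marker with built-in functions only.

import org.apache.spark.sql.functions._

val builtinDF = myDF
  .groupBy("visitor")
  .agg(
    // sort_array on an array of structs orders by the first field, i.e. timestamp
    sort_array(collect_list(struct("timestamp", "channel"))).as("path"),
    max("purchase_flag").as("purchased"))
  // keep only the non-blank channel names, in timestamp order
  .withColumn("channels", expr("filter(transform(path, p -> p.channel), c -> c != '')"))
  .withColumn("channel sequence",
    concat_ws("->",
      concat(col("channels"),
        array(when(col("purchased") === 1, "purchase").otherwise("no_purchase")))))
  .drop("path", "purchased", "channels")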
I would like to compute the mutual information (MI) between two variables x and y that I have in a Spark dataframe which looks like this:
scala> df.show()
+---+---+
| x| y|
+---+---+
| 0| DO|
| 1| FR|
| 0| MK|
| 0| FR|
| 0| RU|
| 0| TN|
| 0| TN|
| 0| KW|
| 1| RU|
| 0| JP|
| 0| US|
| 0| CL|
| 0| ES|
| 0| KR|
| 0| US|
| 0| IT|
| 0| SE|
| 0| MX|
| 0| CN|
| 1| EE|
+---+---+
In my case, x happens to be whether an event is occurring (x = 1) or not (x = 0), and y is a country code, but these variables could represent anything. To compute the MI between x and y I would like to have the above dataframe grouped by x, y pairs with the following three additional columns:
The number of occurrences of x
The number of occurrences of y
The number of occurrences of x, y
In the short example above, it would look like
x, y, count_x, count_y, count_xy
0, FR, 17, 2, 1
1, FR, 3, 2, 1
...
Then I would just have to compute the mutual information term for each x, y pair and sum them.
So far, I have been able to group by x, y pairs and aggregate a count(*) column but I couldn't find an efficient way to add the x and y counts. My current solution is to convert the DF into an array and count the occurrences and cooccurrences manually. It works well when y is a country but it takes forever when the cardinality of y gets big. Any suggestions as to how I could do it in a more Sparkish way?
Thanks in advance!
I would go with RDDs: generate a key for each use case, count by key, and join the results. That way I know exactly what the stages are.
import org.apache.spark.rdd.RDD

rdd.cache() // rdd is your data: RDD[(Int, String)] of (x, y) pairs

val xCnt: RDD[(Int, Long)] = rdd.map { case (x, _) => (x, 1L) }.reduceByKey(_ + _)
val yCnt: RDD[(String, Long)] = rdd.map { case (_, y) => (y, 1L) }.reduceByKey(_ + _)
val xyCnt: RDD[((Int, String), Long)] = rdd.map { case (x, y) => ((x, y), 1L) }.reduceByKey(_ + _)

val tmp = xCnt.cartesian(yCnt).map { case ((x, xc), (y, yc)) => ((x, y), (xc, yc)) }
val miReady = tmp.join(xyCnt).map { case ((x, y), ((xc, yc), xyc)) => ((x, y), xc, yc, xyc) }
Another option would be to use mapPartitions, working directly on the iterators and merging the results across partitions, as sketched below.
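A rough sketch of that mapPartitions idea (with the assumption that the set of distinct (x, y) pairs is small enough for the merged map to fit in driver memory):

// Build a partial count map per partition, then merge the maps with reduce.
val partialCounts = rdd.mapPartitions { it =>
  val acc = scala.collection.mutable.Map.empty[(Int, String), Long]
  it.foreach { case (x, y) => acc((x, y)) = acc.getOrElse((x, y), 0L) + 1L }
  Iterator(acc.toMap)
}

val xyCounts: Map[(Int, String), Long] = partialCounts.reduce { (a, b) =>
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
}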
I am also new to Spark, but I have an idea of what to do. I do not know if this is the perfect solution, but I thought sharing it wouldn't harm.
What I would do is probably filter() for the value 1 to create one Dataframe and filter() for the value 0 to create a second Dataframe.
You would get something like
1st Dataframe
DO 1
DO 1
FR 1
In the next step I would groupBy(y).
So for the 1st Dataframe you would get
DO 1 1
FR 1
As GroupedData https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/GroupedData.html
This also has a count() function, which should count the rows per group. Unfortunately I do not have the time to try this out myself right now, but I wanted to try and help anyway; a rough sketch of the idea follows.
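A hedged sketch of that idea (column names x and y as in the question; the per-y counts would still have to be joined back onto the (x, y) pairs to get the table the question asks for):

import org.apache.spark.sql.functions.col

// Split by the value of x, then count rows per y within each split.
val onesByY  = df.filter(col("x") === 1).groupBy("y").count()
val zerosByY = df.filter(col("x") === 0).groupBy("y").count()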
Edit: Please let me know if this helped, otherwise I'll delete the answer so other people still take a look at this!
Recently, I had the same task to compute probabilities and here I would like to share my solution based on Spark's window aggregation functions:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{col, sum}

// data is your DataFrame with two columns [x,y]
val cooccurrDF: DataFrame = data
.groupBy(col("x"), col("y"))
.count()
.toDF("x", "y", "count-x-y")
val windowX: WindowSpec = Window.partitionBy("x")
val windowY: WindowSpec = Window.partitionBy("y")
val countsDF: DataFrame = cooccurrDF
.withColumn("count-x", sum("count-x-y") over windowX)
.withColumn("count-y", sum("count-x-y") over windowY)
countsDF.show()
First you group on every combination of the two columns and use count to get the number of co-occurrences. The windowed aggregates windowX and windowY then allow summing over the aggregated rows, so you get the counts for column x and for column y.
+---+---+---------+-------+-------+
| x| y|count-x-y|count-x|count-y|
+---+---+---------+-------+-------+
| 0| MK| 1| 17| 1|
| 0| MX| 1| 17| 1|
| 1| EE| 1| 3| 1|
| 0| CN| 1| 17| 1|
| 1| RU| 1| 3| 2|
| 0| RU| 1| 17| 2|
| 0| CL| 1| 17| 1|
| 0| ES| 1| 17| 1|
| 0| KR| 1| 17| 1|
| 0| US| 2| 17| 2|
| 1| FR| 1| 3| 2|
| 0| FR| 1| 17| 2|
| 0| TN| 2| 17| 2|
| 0| IT| 1| 17| 1|
| 0| SE| 1| 17| 1|
| 0| DO| 1| 17| 1|
| 0| JP| 1| 17| 1|
| 0| KW| 1| 17| 1|
+---+---+---------+-------+-------+
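The remaining step from the question is to turn these counts into the mutual information itself. A hedged sketch (it uses the natural logarithm; divide by log(2), or use log2, if you want the result in bits):

import org.apache.spark.sql.functions.{col, log, sum}

// N is the total number of rows, i.e. the sum of all co-occurrence counts.
val n = cooccurrDF.agg(sum("count-x-y")).first().getLong(0).toDouble

// MI = sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
val mi = countsDF
  .withColumn("pxy", col("count-x-y") / n)
  .withColumn("px", col("count-x") / n)
  .withColumn("py", col("count-y") / n)
  .agg(sum(col("pxy") * log(col("pxy") / (col("px") * col("py")))))
  .first().getDouble(0)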