I am trying to compute the weekly occurrences of a word, that is, whether each word is more frequent this week than in the previous week. I am stuck on the last step. I did the following:
from dateutil.parser import parse  # parse() is used below to read the dates
m = sc.parallelize(["oded,12-12-2018", "oded,12-03-2018", "oded,12-12-2018", "oded,12-06-2018", "oded2,12-02-2018", "oded2,12-02-2018"])
m = m.map(lambda line: line.split(','))
weekly = m.map(lambda line: (line[0], (parse(line[1]).strftime("%V%y"))))
s = sql.createDataFrame(weekly)
s.groupby("_1", "_2").count().sort("_2")
the result is:
+-----+----+-----+
| _1| _2|count|
+-----+----+-----+
|oded2|4818| 2|
| oded|4918| 2|
| oded|5018| 2|
+-----+----+-----+
How can I get the week-over-week difference, i.e. oded: 0 = (2 - 2) and oded2: 2 = (2 - 0)?
Hi, you can use the lag window function to find the value from the previous week, after you count the words per week. For weeks that have no previous value, lag returns null; you can fill it with zero (na.fill(0)) or use na.drop() to remove those lines completely.
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window
w = Window().partitionBy("_1").orderBy(col("_2"))
s.select("*", lag("count").over(w).alias("prev_week")).na.fill(0).show()
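The lag-plus-fill arithmetic can be checked with a plain-Python analogue (no Spark needed; the counts are hard-coded from the table above):

```python
from collections import defaultdict

# word counts per week ("%V%y" week keys), taken from the table above
counts = {("oded", "4918"): 2, ("oded", "5018"): 2, ("oded2", "4818"): 2}

# group the weekly counts per word, in week order
per_word = defaultdict(list)
for (word, week), n in sorted(counts.items()):
    per_word[word].append(n)

# delta for the latest week; a missing previous week counts as 0 (na.fill(0))
deltas = {}
for word, series in per_word.items():
    prev = 0
    for n in series:
        deltas[word] = n - prev
        prev = n

print(deltas)  # {'oded': 0, 'oded2': 2}
```

The same subtraction applied to the lagged column (count - prev_week) reproduces the 0 and 2 asked for in the question.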
Related
I have a dataframe containing the id of some person and the date on which he performed a certain action:
+----+----------+
| id| date|
+----+----------+
| 1|2022-09-01|
| 1|2022-10-01|
| 1|2022-11-01|
| 2|2022-07-01|
| 2|2022-10-01|
| 2|2022-11-01|
| 3|2022-09-01|
| 3|2022-10-01|
| 3|2022-11-01|
+----+----------+
I need to determine whether this person performed the action in every month over a certain period (say, the last 3 months). In this example, person number 2 missed months 08 and 09, so the condition is not met. I expect the following result:
+----+------------------------------------+------+
| id| dates|3month|
+----+------------------------------------+------+
| 1|[2022-09-01, 2022-10-01, 2022-11-01]| true|
| 2|[2022-07-01, 2022-10-01, 2022-11-01]| false|
| 3|[2022-09-01, 2022-10-01, 2022-11-01]| true|
+----+------------------------------------+------+
I understand that I should group by person ID and get an array of dates that correspond to it.
data.groupBy(col("id")).agg(collect_list("date") as "dates").withColumn("3month", ???)
However, I'm at a loss writing a function that would check compliance with the requirement. I have an option using recursion, but it does not suit me due to low performance (there may be more than a thousand dates). I would be very grateful if someone could help me with my problem.
A simple trick is to use a set instead of a list in your aggregation, in order to have distinct values, and then check the size of that set.
Here are some possible solutions:
Solution 1
Assuming you have a list of months of interest on which you want to check, you can perform a preliminary filter on the required months, then aggregate and validate.
import org.apache.spark.sql.{functions => F}
import java.time.LocalDate
// in an application, also: import spark.implicits._ (for the $"col" syntax)
val requiredMonths = Seq(
LocalDate.parse("2022-09-01"),
LocalDate.parse("2022-10-01"),
LocalDate.parse("2022-11-01")
);
df
.filter(F.date_trunc("month", $"date").isInCollection(requiredMonths))
.groupBy($"id")
.agg(F.collect_set(F.date_trunc("month", $"date")) as "months")
.withColumn("is_valid", F.size($"months") === requiredMonths.size)
date_trunc is used to truncate the date column to month.
Solution 2
Similar to the previous one, with a preliminary filter, but here assuming you have a range of months.
import java.time.temporal.ChronoUnit
val firstMonth = LocalDate.parse("2022-09-01");
val lastMonth = LocalDate.parse("2022-11-01");
val requiredNumberOfMonths = ChronoUnit.MONTHS.between(firstMonth, lastMonth) + 1;
df
.withColumn("month", F.date_trunc("month", $"date"))
.filter($"month" >= firstMonth && $"month" <= lastMonth)
.groupBy($"id")
.agg(F.collect_set($"month") as "months")
.withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Solution 3
Both solutions 1 and 2 have a problem: ids whose dates have no intersection with the dates of interest are excluded entirely from the final result.
This is caused by the filter applied before grouping.
Here is a solution, based on solution 2, that does not filter and therefore avoids this problem.
df
.withColumn("month", F.date_trunc("month", $"date"))
.groupBy($"id")
.agg(F.collect_set(F.when($"month" >= firstMonth && $"month" <= lastMonth, $"month")) as "months")
.withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Now the filter is performed using a conditional collect_set.
It is still worth considering solutions 1 and 2, because the preliminary filter can have performance advantages, and in some cases the filtered result may be exactly what you want.
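The conditional collect_set logic of Solution 3 can be sanity-checked with a plain-Python analogue (no Spark; the rows are hard-coded from the question, ids 1 and 2 only):

```python
import datetime
from collections import defaultdict

# months of interest, as first-of-month dates
required = {datetime.date(2022, m, 1) for m in (9, 10, 11)}

# (id, date) rows from the question
rows = [
    (1, datetime.date(2022, 9, 1)), (1, datetime.date(2022, 10, 1)),
    (1, datetime.date(2022, 11, 1)), (2, datetime.date(2022, 7, 1)),
    (2, datetime.date(2022, 10, 1)), (2, datetime.date(2022, 11, 1)),
]

months = defaultdict(set)
for pid, d in rows:
    m = d.replace(day=1)          # date_trunc("month", date)
    if m in required:             # the F.when(...) condition
        months[pid].add(m)        # collect_set keeps distinct values

# valid only if the id covered every required month
is_valid = {pid: len(s) == len(required) for pid, s in months.items()}
print(is_valid)  # {1: True, 2: False}
```

Person 2 contributes only two distinct required months (10 and 11), so the set-size check fails, matching the expected output above.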
I have a PySpark DF with ID and Date columns, looking like this:
+---+----------+
| ID|      Date|
+---+----------+
|  1|2021-10-01|
|  2|2021-10-01|
|  1|2021-10-02|
|  3|2021-10-02|
+---+----------+
I want to count the number of unique IDs that did not exist on the day before. So here the result would be 1, as there is only one new unique ID on 2021-10-02.
+---+----------+-----+
| ID|      Date|Count|
+---+----------+-----+
|  1|2021-10-01|    -|
|  2|2021-10-01|    -|
|  1|2021-10-02|    1|
|  3|2021-10-02|    1|
+---+----------+-----+
I tried following this solution but it does not work on date type values. Any help would be highly appreciated.
Thank you!
If you want to avoid a self-join (e.g. for performance reasons), you could work with Window functions:
from pyspark.sql import Row, Window
from pyspark.sql import functions as F
import datetime
df = spark.createDataFrame([
Row(ID=1, date=datetime.date(2021,10,1)),
Row(ID=2, date=datetime.date(2021,10,1)),
Row(ID=1, date=datetime.date(2021,10,2)),
Row(ID=2, date=datetime.date(2021,10,2)),
Row(ID=1, date=datetime.date(2021,10,3)),
Row(ID=3, date=datetime.date(2021,10,3)),
])
First, add the number of days since the ID was last seen (None if it never appeared before):
df = df.withColumn('days_since_last_occurrence', F.datediff('date', F.lag('date').over(Window.partitionBy('ID').orderBy('date'))))
Second, we add a column marking rows where this number of days is not 1, i.e. the ID did not appear the day before. We put a 1 in this column so that we can later sum over it to count the rows:
df = df.withColumn('is_new', F.when(F.col('days_since_last_occurrence') == 1, None).otherwise(1))
Now we sum over all rows with the same date and then remove the columns we do not require anymore:
(
df
.withColumn('count', F.sum('is_new').over(Window.partitionBy('date'))) # sum over all rows with the same date
.drop('is_new', 'days_since_last_occurrence')
.sort('date', 'ID')
.show()
)
# Output:
+---+----------+-----+
| ID| date|count|
+---+----------+-----+
| 1|2021-10-01| 2|
| 2|2021-10-01| 2|
| 1|2021-10-02| null|
| 2|2021-10-02| null|
| 1|2021-10-03| 1|
| 3|2021-10-03| 1|
+---+----------+-----+
Collect the id list for the current day and the previous day, then take the size of the set difference between the two to get the final result.
Update: a solution that eliminates the join.
df = df.select('date', F.expr('collect_set(id) over (partition by date) as id_arr')).dropDuplicates() \
.select('*', F.expr('size(array_except(id_arr, lag(id_arr,1,id_arr) over (order by date))) as count')) \
.select(F.explode('id_arr').alias('id'), 'date', 'count')
df.show(truncate=False)
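The array_except-of-lagged-sets idea can be checked with a plain-Python analogue (no Spark; note that on the first date lag defaults to the day's own set, so its count is 0):

```python
import datetime

# (ID, date) rows from the example dataframe above
rows = [
    (1, datetime.date(2021, 10, 1)), (2, datetime.date(2021, 10, 1)),
    (1, datetime.date(2021, 10, 2)), (2, datetime.date(2021, 10, 2)),
    (1, datetime.date(2021, 10, 3)), (3, datetime.date(2021, 10, 3)),
]

# collect_set(id) over (partition by date)
by_date = {}
for i, d in rows:
    by_date.setdefault(d, set()).add(i)

# size(array_except(id_arr, lag(id_arr, 1, id_arr) over (order by date)))
counts = {}
dates = sorted(by_date)
for idx, d in enumerate(dates):
    prev = by_date[dates[idx - 1]] if idx > 0 else by_date[d]
    counts[d] = len(by_date[d] - prev)

print(counts[datetime.date(2021, 10, 3)])  # 1
```

Only ID 3 is new on 2021-10-03, so the set difference with the previous day has size 1.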
I am working with PySpark and trying to figure out how to do complex calculations with previous rows. I think there are generally two ways to do calculations with previous rows: window functions and mapPartitions. I think my problem is too complex to solve with windows, and I want the result as a separate row, not a column. So I am trying to use mapPartitions, but I am having trouble with its syntax.
For instance, here is a rough draft of the code.
def change_dd(rows):
prev_rows = []
prev_rows.append(rows)
for row in rows:
new_row=[]
for entry in row:
# Testing to figure out syntax, things would get more complex
new_row.append(entry + prev_rows[0])
yield new_row
updated_rdd = select.rdd.mapPartitions(change_dd)
However, I can't access the individual rows inside prev_rows. It seems prev_rows[0] is an itertools.chain. How do I iterate over prev_rows[0]?
edit
neighbor = sc.broadcast(df_sliced.where(df_sliced.id == neighbor_idx).collect()[0][:-1]).value
current = df_sliced.where(df_sliced.id == i)
def oversample_dt(dataframe):
for row in dataframe:
new_row = []
for entry, neigh in zip(row, neighbor):
if isinstance(entry, str):
if scale < 0.5:
new_row.append(entry)
else:
new_row.append(neigh)
else:
if isinstance(entry, int):
new_row.append(int(entry + (neigh - entry) * scale))
else:
new_row.append(entry + (neigh - entry) * scale)
yield new_row
sttt = time.time()
sample = current.rdd.mapPartitions(oversample_dt).toDF(schema)
In the end, I ended up doing it like this for now, but I really don't want to use collect in the first line. If someone knows how to fix this, or can point out any problems with my use of PySpark, please tell me.
edit2
--Suppose Alice, and its neighbor Alice_2
scale = 0.4
+---+-------+--------+
|age| name | height |
+---+-------+--------+
| 10| Alice | 170 |
| 11|Alice_2| 175 |
+---+-------+--------+
Then, I want a row
+----------+---------+-------------+
|       age|     name|       height|
+----------+---------+-------------+
|  10+1*0.4|  Alice_2|  170 + 5*0.4|
+----------+---------+-------------+
Why not use dataframes?
Add a column to the dataframe with the previous values using window functions like this:
from pyspark.sql import SparkSession, functions
from pyspark.sql.window import Window
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([{'name': 'Alice', 'age': 1}, {'name': 'Alice_2', 'age': 2}])
df.show()
+---+-------+
|age| name|
+---+-------+
| 1| Alice|
| 2|Alice_2|
+---+-------+
window = Window.partitionBy().orderBy('age')
df = df.withColumn("age-1", functions.lag(df.age).over(window))
df.show()
+---+-------+-----+
|age|   name|age-1|
+---+-------+-----+
|  1|  Alice| null|
|  2|Alice_2|    1|
+---+-------+-----+
You can use this lag for every column you need, and then just do your calculation.
And if you want to use an RDD afterwards, just use df.rdd.
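For the interpolation in edit2, the per-row arithmetic itself can be sketched in plain Python, independent of Spark (the dict rows below are hypothetical, built from the question's example data):

```python
scale = 0.4  # from the question

# a row and its neighbour, as plain dicts
alice = {"age": 10, "name": "Alice", "height": 170}
neighbor = {"age": 11, "name": "Alice_2", "height": 175}

new_row = {}
for key, a in alice.items():
    b = neighbor[key]
    if isinstance(a, str):
        # keep the original string below scale 0.5, otherwise take the neighbour's
        new_row[key] = a if scale < 0.5 else b
    else:
        new_row[key] = a + (b - a) * scale  # linear blend between the two rows

print(new_row)  # {'age': 10.4, 'name': 'Alice', 'height': 172.0}
```

This matches the numeric columns of the expected row (10 + 1*0.4 and 170 + 5*0.4); note that with scale = 0.4 the question's own string rule keeps "Alice", not "Alice_2".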
I am attempting the following in Scala Spark.
I'm hoping someone can give me some guidance on how to tackle this problem or provide me with some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to each date. I would like to randomly select a certain number of entries for each dateCountDF.month from another dataframe entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate, and then place all the results into a new dataframe. See below for a data example.
I'm not at all sure how to approach this problem from a Spark-SQL or Spark-MapReduce perspective. The furthest I got was a naive approach, where I use a foreach on a dataframe and then refer to the other dataframe within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
val count:Int = x(1).toString().toInt
val result = entitiesDF.take(count)
return result
})
DataFrames
**dateCountDF**
| Date | Count |
+----------+----------------+
|2016-08-31| 4|
|2015-12-31| 1|
|2016-09-30| 5|
|2016-04-30| 5|
|2015-11-30| 3|
|2016-05-31| 7|
|2016-11-30| 2|
|2016-07-31| 5|
|2016-12-31| 9|
|2014-06-30| 4|
+----------+----------------+
only showing top 10 rows
**entitiesDF**
| ID | FirstDate | LastDate |
+----------+-----------------+----------+
| 296| 2014-09-01|2015-07-31|
| 125| 2015-10-01|2016-12-31|
| 124| 2014-08-01|2015-03-31|
| 447| 2017-02-01|2017-01-01|
| 307| 2015-01-01|2015-04-30|
| 574| 2016-01-01|2017-01-31|
| 613| 2016-04-01|2017-02-01|
| 169| 2009-08-23|2016-11-30|
| 205| 2017-02-01|2017-02-01|
| 433| 2015-03-01|2015-10-31|
+----------+-----------------+----------+
only showing top 10 rows
Edit:
For clarification.
My inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF and for each row I want to select a random number of entities in entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate
To select random rows you can do something like this (note this snippet is PySpark/Python rather than Scala):
import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()
    # Create random sample from range
    vals = random.sample(range(1, colmax), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Select the random number of rows you want, store them in a dataframe, and then add this data to the other dataframe; for this you can use unionAll.
You can also refer to this answer.
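The per-date sampling logic itself can be sketched in plain Python (no Spark; the ids and date ranges are a hypothetical subset of the tables above, and min() guards against asking for more entities than are eligible):

```python
import datetime
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# (ID, FirstDate, LastDate) -- a few rows of entitiesDF
entities = [
    (296, datetime.date(2014, 9, 1), datetime.date(2015, 7, 31)),
    (125, datetime.date(2015, 10, 1), datetime.date(2016, 12, 31)),
    (574, datetime.date(2016, 1, 1), datetime.date(2017, 1, 31)),
    (169, datetime.date(2009, 8, 23), datetime.date(2016, 11, 30)),
]

# (Date, Count) -- a few rows of dateCountDF
date_counts = [
    (datetime.date(2016, 8, 31), 2),
    (datetime.date(2015, 12, 31), 1),
]

# for each date, sample `count` ids among entities whose interval contains it
samples = {}
for d, n in date_counts:
    eligible = [eid for eid, first, last in entities if first < d <= last]
    samples[d] = random.sample(eligible, min(n, len(eligible)))
```

In Spark terms this corresponds to a range join between the two dataframes followed by a per-group random sample, rather than a driver-side foreach.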
I am using PySpark and have set up my dataframe with two columns representing a daily asset price, as follows:
ind = sc.parallelize(range(1,5))
prices = sc.parallelize([33.3,31.1,51.2,21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data,["day","price"])
Upon applying df.show(), I get:
+---+-----+
|day|price|
+---+-----+
| 1| 33.3|
| 2| 31.1|
| 3| 51.2|
| 4| 21.3|
+---+-----+
Which is fine and all. I would like to have another column that contains the day-to-day returns of the price column, i.e., something like
(price(day2)-price(day1))/(price(day1))
After much research, I am told that this is most efficiently accomplished through applying the pyspark.sql.window functions, but I am unable to see how.
You can bring in the previous day's price using the lag function, then add a column that computes the actual day-to-day return from the two columns. You may have to tell Spark how to partition and/or order your data to do the lag, something like this:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
dfu = df.withColumn('user', lit('tmoore'))
df_lag = dfu.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Window.partitionBy("user")))
result = df_lag.withColumn('daily_return',
          (df_lag['price'] - df_lag['prev_day_price']) / df_lag['price'])
# note: this divides by the current day's price; the formula in the question,
# (price(day2) - price(day1)) / price(day1), would divide by prev_day_price
>>> result.show()
+---+-----+-------+--------------+--------------------+
|day|price| user|prev_day_price| daily_return|
+---+-----+-------+--------------+--------------------+
| 1| 33.3| tmoore| null| null|
| 2| 31.1| tmoore| 33.3|-0.07073954983922816|
| 3| 51.2| tmoore| 31.1| 0.392578125|
| 4| 21.3| tmoore| 51.2| -1.403755868544601|
+---+-----+-------+--------------+--------------------+
Here is longer introduction into Window functions in Spark.
The lag function can help you resolve your use case.
from pyspark.sql.window import Window
import pyspark.sql.functions as func
### Defining the window
Windowspec=Window.orderBy("day")
### Calculating lag of price at each day level
prev_day_price = df.withColumn('prev_day_price',
                               func.lag(df['price'])
                                   .over(Windowspec))
### Calculating the daily return
result = prev_day_price.withColumn('daily_return',
             (prev_day_price['price'] - prev_day_price['prev_day_price']) /
             prev_day_price['prev_day_price'])
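The arithmetic of the question's own formula, which divides by the previous day's price, can be checked with a plain-Python analogue (prices hard-coded from the question; no Spark needed):

```python
# prices per day, from the question's example dataframe
prices = [33.3, 31.1, 51.2, 21.3]

# daily_return_t = (price_t - price_{t-1}) / price_{t-1}; there is no
# previous price on day 1, so the first entry is None (lag yields null there)
returns = [None]
for prev, cur in zip(prices, prices[1:]):
    returns.append((cur - prev) / prev)
```

This is exactly the value the lag-based column produces when the denominator is prev_day_price.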