I have data like this:
TagID,ListnerID,Timestamp,Sum_RSSI
2,101,1496745906,90
3,102,1496745907,70
3,104,1496745906,80
2,101,1496745909,60
4,106,1496745908,60
My expected output would be
2,101,1496745906,90
3,104,1496745906,80
4,106,1496745908,60
I tried like this
val high_window = Window.partitionBy($"tagShortID")
val prox = averageDF
.withColumn("rank", row_number().over(window.orderBy($"Sum_RSSI".desc)))
.filter($"rank" === 1)
But it prints all the rows. Any help would be appreciated.
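The usual fix for this top-1-per-group pattern is to define a single window that both partitions and orders, and to reference that same window in row_number(). For comparison, a minimal PySpark sketch of the idea (assuming averageDF is the dataframe holding the sample data above, with columns named as shown):

from pyspark.sql import Window
from pyspark.sql import functions as F

# one window that partitions by tag and orders by Sum_RSSI descending
w = Window.partitionBy("TagID").orderBy(F.col("Sum_RSSI").desc())

prox = (averageDF
        .withColumn("rank", F.row_number().over(w))
        .filter(F.col("rank") == 1)   # keep only the highest Sum_RSSI per tag
        .drop("rank"))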
Sorry, I need your help to do this manipulation in PySpark.
I have this dataframe
import pandas as pd

data = [['tom', True, False], ['nick', True, False], ['juli', False, True]]
df = pd.DataFrame(data, columns=['Name', 'cond1', 'cond2'])
and I want to have
data = [['tom', True, False, 1], ['nick', True, False, 1], ['juli', False, True, 2]]
df = pd.DataFrame(data, columns=['Name', 'cond1', 'cond2', 'stat'])
That is, if cond1 == True then stat = 1, and if cond2 == True then stat = 2.
Thanks in advance for your help.
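In PySpark this can be expressed with chained when conditions. A minimal sketch, assuming the data lives in a Spark dataframe (built inline here) with boolean columns cond1 and cond2:

from pyspark.sql import functions as F

sdf = spark.createDataFrame(
    [('tom', True, False), ('nick', True, False), ('juli', False, True)],
    ['Name', 'cond1', 'cond2'])

# cond1 is checked first, so it takes precedence when both are True
result = sdf.withColumn('stat',
                        F.when(F.col('cond1'), 1)
                         .when(F.col('cond2'), 2)
                         .otherwise(None))
result.show()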
Hello StackOverflowers.
I have a pyspark dataframe that consists of a time_column and a column with values.
E.g.
+----------+--------------------+
| snapshot| values|
+----------+--------------------+
|2005-01-31| 0.19120256617637743|
|2005-01-31| 0.7972692479278891|
|2005-02-28|0.005236883665445502|
|2005-02-28| 0.5474099672222935|
|2005-02-28| 0.13077227571485905|
+----------+--------------------+
I would like to perform a KS test of each snapshot value with the previous one.
I tried to do it with a for loop.
import numpy as np
from scipy.stats import ks_2samp
import pyspark.sql.functions as F
def KS_for_one_snapshot(temp_df, snapshots_list, j, var="values"):
    sample1 = temp_df.filter(F.col("snapshot") == snapshots_list[j])
    sample2 = temp_df.filter(F.col("snapshot") == snapshots_list[j-1])  # pick the last snapshot as the one to compare with
    if sample1.count() == 0 or sample2.count() == 0:
        ks_value = -1  # previously "0 observations" which gave type error
    else:
        ks_value, p_value = ks_2samp(np.array(sample1.select(var).collect()).reshape(-1),
                                     np.array(sample2.select(var).collect()).reshape(-1),
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value
results = []
snapshots_list = df.select('snapshot').dropDuplicates().sort('snapshot').rdd.flatMap(lambda x: x).collect()
for j in range(len(snapshots_list) - 1):
    results.append(KS_for_one_snapshot(df, snapshots_list, j + 1))
results
But the data is huge in reality, so it takes forever. I am using Databricks and PySpark, so I wonder what a more efficient way to run it would be, avoiding the for loop and utilizing the available workers.
I tried to do it using a UDF, but in vain.
Any ideas?
PS. you can generate the data with the following code.
import pyspark.sql.types as T
import pyspark.sql.functions as F

df = (spark.createDataFrame(range(1, 1000), T.IntegerType())
      .withColumn('snapshot', F.array(F.lit("2005-01-31"), F.lit("2005-02-28"), F.lit("2005-03-30")).getItem((F.rand() * 3).cast("int")))
      .withColumn('values', F.rand())
      .drop('value'))
Update:
I tried the following using a UDF.
from pyspark.sql.window import Window

var_used = 'values'
data_input_1 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list'))
data_input_2 = df.groupBy('snapshot').agg(F.collect_list(var_used).alias('value_list_2'))

windowSpec = Window.orderBy('snapshot')
data_input_2 = data_input_2.withColumn('snapshot_2', F.lag('snapshot', 1).over(windowSpec)).filter('snapshot_2 is not NULL')

data_input_final = data_input_1.join(data_input_2, data_input_1.snapshot == data_input_2.snapshot_2)
def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
    if len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0:
        ks_value = -1  # previously "0 observations" which gave type error
    else:
        print('something')
        ks_value, p_value = ks_2samp(sample_in_list_1,
                                     sample_in_list_2,
                                     alternative="two-sided",
                                     mode="auto")
    return ks_value
import pyspark.sql.types as T
KS_one_snapshot_general_udf = F.udf(KS_one_snapshot_general, T.FloatType())
data_input_final.select( KS_one_snapshot_general_udf('value_list', 'value_list_2')).display()
This works fine if the dataset (per snapshot) is small, but if I increase the number of rows I end up with an error:
PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
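One likely cause of that PickleException is that ks_2samp returns NumPy scalars, which Spark cannot pickle for a FloatType column. Converting the result to a plain Python float inside the UDF usually resolves it; a sketch of that adjustment:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from scipy.stats import ks_2samp

def KS_one_snapshot_general(sample_in_list_1, sample_in_list_2):
    if len(sample_in_list_1) == 0 or len(sample_in_list_2) == 0:
        return -1.0
    ks_value, p_value = ks_2samp(sample_in_list_1, sample_in_list_2,
                                 alternative="two-sided", mode="auto")
    return float(ks_value)  # cast the NumPy scalar to a plain Python float

KS_one_snapshot_general_udf = F.udf(KS_one_snapshot_general, T.FloatType())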
My dataframe returns the result below as a String.
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}
I just want
"cbcnt":0 <-- Numeric part of this
Expected Output
col
----
0
2021-07-30
Tried:
.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d+)""", 1))
Output
col
----
0
"2021-07-30 00:00:00 --<--additional " is coming
Using the Pyspark function regexp_extract:
from pyspark.sql import functions as F
df = <dataframe with a column "text" that contains the input data>
df.withColumn("col", F.regexp_extract("text", """"cbcnt":(\d+)""", 1)).show()
Extract via regex:
val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |"
val regex = """"cbcnt":(\d+)""".r.unanchored
val s"${regex(result)}" = value
println(result)
Output:
0
I am joining multiple dataframes and calculating the output by multiplying two columns from two different dataframes and dividing by a column belonging to another dataframe.
I get a "grouping sequence expression is empty" error and "no_order is not an aggregate function".
What is wrong with the code?
df = df1.join(df2,df2["Code"] == df1["Code"],how = 'left')\
.join(df3, df3["ID"] == df1["ID"],how = 'left')\
.join(df4, df4["ID"] == df1["ID"],how = 'left')\
.join(df5, df5["Scenario"] == df1["Status"],how='left')\
.withColumn("Country",when(df1.Ind == 1,"WI"))\
.withColumn("Country",when(df1.Ind == 0,"AA"))\
.withColumn("Year",when(df1.Year == "2020","2021"))\
.agg((sum(df5["amt"] * df1["cost"]))/df2["no_order"]).alias('output')
.groupby('Country','Year','output')
The error tells you that df2["no_order"] should be within some aggregation function, for example the sum you are already using for df5["amt"] * df1["cost"].
Also, move .groupby() above .agg().
If I understood correctly what you are trying to achieve, the code should look like this:
df = df1\
    .join(df2, on='Code', how='left')\
    .join(df3, on='ID', how='left')\
    .join(df4, on='ID', how='left')\
    .join(df5, df5.Scenario == df1.Status, how='left')\
    .withColumn('Country', when(df1.Ind == 1, "WI").when(df1.Ind == 0, "AA"))\
    .withColumn('Year', when(df1.Year == "2020", "2021"))\
    .groupby('Country', 'Year')\
    .agg(sum(df5["amt"] * df1["cost"] / df2["no_order"]).alias('output'))
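Alternatively, if the intent is to divide the summed numerator by a single no_order value per group rather than row by row, no_order can itself be wrapped in an aggregate such as first(), assuming it is constant within each Country/Year group. A rough sketch of only that final step (joined is a hypothetical name for the joined and enriched dataframe built above):

from pyspark.sql import functions as F

output_df = joined.groupby('Country', 'Year').agg(
    (F.sum(df5["amt"] * df1["cost"]) / F.first(df2["no_order"])).alias('output'))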
I have an RDD of type RDD[String]. As an example, here is a part of it:
1990,1990-07-08
1994,1994-06-18
1994,1994-06-18
1994,1994-06-22
1994,1994-06-22
1994,1994-06-26
1994,1994-06-26
1954,1954-06-20
2002,2002-06-26
1954,1954-06-23
2002,2002-06-29
1954,1954-06-16
2002,2002-06-30
...
result:
(1982,52)
(2006,64)
(1962,32)
(1966,32)
(1986,52)
(2002,64)
(1994,52)
(1974,38)
(1990,52)
(2010,64)
(1978,38)
(1954,26)
(2014,64)
(1958,35)
(1998,64)
(1970,32)
I group it nicely, but my problem is the v.size part; I do not know how to calculate that length.
Just to put it in perspective, here are the expected results:
It is not a mistake that 2002 appears twice. But ignore that.
define date format:
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
and order:
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
create a function that receives "v" and returns MAX(date_of_matching_year) - MIN(date_of_matching_year) = LENGTH (in days):
def f(v: Iterable[Array[String]]): Int = {
  val parsedDates = v.map(LocalDate.parse(_(1), formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
Then replace v.size with f(v).
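For reference, the end-to-end idea (split each line, group by year, then take max date minus min date in days) can also be sketched in PySpark; this is only an illustration of the same technique, assuming the RDD is named rdd and contains lines like the sample above:

from datetime import date

def length_in_days(date_strings):
    # length = latest match date minus earliest match date, in days
    parsed = [date.fromisoformat(d) for d in date_strings]
    return (max(parsed) - min(parsed)).days

result = (rdd.map(lambda line: line.split(","))
             .map(lambda parts: (parts[0], parts[1]))
             .groupByKey()
             .mapValues(length_in_days))
result.collect()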