I am trying to write an R script to calculate average ride length for each day by user type using the aggregate function

I am trying to write an R script to calculate average ride length for each day and group by user type. The data structure has:
day_of_week (chr), Ride_length (num), user_type (chr)
The script is the following:
aggregate(divvy_2022$ride_length ~ divvy_2022$user_type ~ divvy_2022$day_of_week, FUN = mean)
I get this error:
Error in model.frame.default(formula = divvy_2022$ride_length ~ divvy_2022$user_type ~ :
object is not a matrix
What could be wrong? I can't understand why this error comes up.
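One likely fix, sketched under the assumption that the columns are really named ride_length, user_type, and day_of_week as in the call above (the description says Ride_length, so adjust the case to match your data): a formula takes a single ~, so the two grouping variables belong on the right-hand side joined by +, and the data frame goes in the data argument rather than being repeated with $.
# one ~ only: the left side is the value to average, the right side
# lists the grouping variables joined with +
avg_ride <- aggregate(ride_length ~ user_type + day_of_week,
                      data = divvy_2022,
                      FUN = mean)
avg_ride
The second ~ in the original call makes R parse a formula that contains another formula, which model.frame.default() cannot evaluate as a variable, hence the error.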

Related

how to download precipitation data for latitude-longitude coordinates from NOAA in R

I'm trying to download precipitation data for a list of latitude-longitude coordinates in R. I've come across this question, which gets me most of the way there, but over half of the weather stations don't have precipitation data. I've pasted the code below up to this point.
I'm now trying to figure out how to only get data from the closest station with precipitation data, or run a second function on the sites with missing data to get data from the second closest station. However, I haven't been able to figure out how to do this. Any suggestions or resources that might help?
library(rnoaa)
library(dplyr)  # for %>% and filter()
# load station data - takes some minutes
station_data <- ghcnd_stations() %>% filter(element == "PRCP")
# add id column for each location (necessary for next function)
sites_df$id <- 1:nrow(sites_df)
# retrieve all stations in radius (e.g. 20 km) using lapply
stations <- lapply(1:nrow(sites_df), function(i)
  meteo_nearby_stations(sites_df[i, ],
                        lat_colname = 'Lattitude',
                        lon_colname = 'Longitude',
                        radius = 20,
                        station_data = station_data)[[1]])
# pull data for nearest stations - x$id[1] selects ID of closest station
stations_data <- lapply(stations, function(x)
  meteo_pull_monitors(x$id[1], date_min = "2022-05-01",
                      date_max = "2022-05-31", var = c("prcp")))
stations_data
Below is my (admittedly poor) attempt to rerun the lookup for the second-closest station for the sites with missing data. I know this isn't working, but I don't know how to get lapply to run over only a subset of a list, or understand exactly how the function runs to code it another way.
for (i in c(1, 2, 3, 7, 9, 10, 11, 14, 16, 17, 19, 20)) {
  stations_data[[i]] <- lapply(stations, function(x)
    meteo_pull_monitors(x$id[2], date_min = "2022-05-01",
                        date_max = "2022-05-31", var = c("prcp")))
}
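One way to sidestep the index bookkeeping entirely, sketched under the assumption that stations holds the distance-ordered station data frames built above (pull_first_prcp is a hypothetical helper name, and the NA check is one plausible test for "no precipitation data"): walk down each site's station list and keep the first station that actually returns precipitation values.
# try stations in order of distance; return data from the first one
# that reports any non-missing prcp values
pull_first_prcp <- function(station_ids) {
  for (id in station_ids) {
    dat <- meteo_pull_monitors(id, date_min = "2022-05-01",
                               date_max = "2022-05-31", var = c("prcp"))
    if ("prcp" %in% names(dat) && any(!is.na(dat$prcp))) return(dat)
  }
  NULL  # no station within the radius had precipitation data
}
stations_data <- lapply(stations, function(x) pull_first_prcp(x$id))
This replaces both the original lapply() call and the follow-up loop, since the fallback to the second (or third, ...) closest station happens inside the helper.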

Is there a function in R to create a new column in a tibble that depends on values from a previous row?

First time poster and quite new to R.
I'm trying to add a new variable to a tibble ("joined") that takes the value from row n-1 of column 22 ("NurseID") if the value of column 3 ("AccountID") in row n matches the one in row n-1.
I can do it with a sorted loop, but this is a large dataset, it takes a long time to run, and I wonder if there is a faster/easier way to do this:
library(dplyr)  # for arrange()
joined <- arrange(joined, AccountID, date_day, shift)  # arrange() returns a copy, so assign it
tie <- "."
for (i in 2:nrow(joined)) {
  # same account as the previous row? then carry its NurseID down
  if (joined[i, 3] == joined[i - 1, 3]) {
    temp <- joined[i - 1, 22]
  } else {
    temp <- "."
  }
  tie <- c(tie, temp)
}
temptie <- as.numeric(tie)  # "." becomes NA
joined <- as_tibble(cbind(joined, temptie))
Any help / input is much appreciated. Please kindly let me know if you need more information on the tibble.
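A vectorised alternative, sketched with dplyr's lag() and assuming the column names described above plus a numeric NurseID: comparing each AccountID with the previous row's value replaces the whole loop.
library(dplyr)
joined <- joined %>%
  arrange(AccountID, date_day, shift) %>%
  mutate(temptie = ifelse(AccountID == lag(AccountID),  # same account as the row above?
                          lag(NurseID),                 # yes: carry its NurseID down
                          NA_real_))                    # no: NA, like "." in the loop
lag() returns NA for the first row, so the comparison is NA there and temptie comes out NA, matching what as.numeric(".") produced in the loop.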

Method executing again and again in Scala

Suppose I have a list of zip codes of stores and I want to find out which one is at minimum distance from a given user's zip code. For this, I've written some code in Scala:
case class UserCityDistance(city: String, zip: String, distance: Double)

private def getUserLocation(user_loc: String, store_loc: String) = {
  println(s"# Calculating distance between user_loc $user_loc AND warehouse_loc $store_loc #")
  // Code here to calculate distance between users and store
  // This will give 3 values - store city, store zip and distance from user's zip
  UserCityDistance(city, zip, distance)
}
I'm calling this method from the same class. productStoreMap contains the list of stores carrying a product (the key), and user_zip contains the user's zip code.
val userDistanceList = productStoreMap.mapValues(_.map(x => getUserLocation(user_zip, x))) // Get distance of each store from user's location
println(s"# Distance lists obtained - $userDistanceList #")
val minStore = userDistanceList.mapValues(_.minBy(_.distance)) // Selecting the minimum-distance store
println(s"# Minimum distant warehouse - ${minStore} #")
It works, but after printing # Distance lists obtained - SOME VALUES HERE #, it starts calculating the distance between the user zip and each of the warehouse zips all over again, just like lazy evaluation.
Is there any problem with my code, or anything I'm missing?
EDIT -
productStoreMap is of type Map[List[String], List[String]].
Is this causing the trouble?

Pyspark label points aggregation

I am performing a binary classification using LabeledPoint. I then attempt to sum() the number of labeled points with 1.0 to verify the classification.
I have labelled an RDD as follows
lp_RDD = RDD.map(lambda x: LabeledPoint(1 if (flag in x[0]) else 0,x[1]))
I thought perhaps I could get a count of how many have been labelled with 1 using:
cnt = lp_RDD.map(lambda x: x[0]).sum()
But I get the following error :
'LabeledPoint' object does not support indexing
I have verified that the labeled RDD is correct by printing the entire RDD and then searching for the string "LabeledPoint(1.0". I was simply wondering if there is a shortcut using sum()?
LabeledPoint has a label member, which can be used to find the count or sum. Please try:
cnt = lp_RDD.map(lambda x: x.label).sum()

Partial distance-based RDA - centroids vanished from plot

I am trying to fit a partial db-RDA with field.ID as a condition to correct for the repeated-measurements character of the samples. However, including Condition(field.ID) makes the centroids of the main factor of interest disappear from the plot.
The design: 12 fields were sampled for species data in two consecutive years, repeatedly. Additionally, every year three samples from reference fields were taken. These three fields changed in the second year, due to unavailability of the former fields.
Some environmental variables were also measured (nitrogen, soil moisture, temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seems to erroneously remove the F1 factor, whereas using sampling campaign (SC) as Condition does not. Is the latter the right way to correct for repeated measurements in a partial db-RDA?
library(vegan)  # capscale() and ordiplot()

set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12, 13, 14, 15, 1:12, 16, 17, 18)),
                     SC = factor(rep(c(1, 2), each = 15)),
                     F1 = factor(rep(rep(c("A", "B", "C", "D", "E"), each = 3), 2)),
                     Nitrogen = rnorm(30, mean = 0.16, sd = 0.07),
                     Temp = rnorm(30, mean = 13.5, sd = 3.9),
                     Moist = rnorm(30, mean = 19.4, sd = 5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
                     Spec2 = rpois(30, 1),
                     Spec3 = rpois(30, 4.5),
                     Spec4 = rpois(30, 3),
                     Spec5 = rpois(30, 7),
                     Spec6 = rpois(30, 7),
                     Spec7 = rpois(30, 5))
data <- cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out the variation due to ID and then try to explain a variable aliased to that ID, but it was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1 contrasts were constant within ID, which was already partialled out, so nothing was left to explain.