Movie recommendation with ALS in PySpark

I have a MovieLens dataset. I want to build a recommendation system based on a movie ID entered by the user, but the call to model.transform fails with the error below.
# get the recommendations for the users
recommendations_df = model.transform(rated_users)
IllegalArgumentException: movieId does not exist. Available: userId
My code looks like this:
from pyspark.ml.recommendation import ALS
# create the ALS model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
# fit the model to the training data
model = als.fit(training)
# get the movie input
idmovie = int(input("Enter movie ID: "))
# get the users who have rated the input movie (note: only the userId column is kept)
rated_users = ratings.filter(movie_ratings["movieId"] == idmovie).select("userId")
The error occurs at:
# get the recommendations for the users
recommendations_df = model.transform(rated_users)
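One possible reading of the error: ALSModel.transform scores explicit (user, item) pairs, so its input needs both the userId and the movieId column, while rated_users only carries userId. The plain-Python sketch below uses made-up IDs and a hypothetical helper, build_candidate_pairs, to show the two-column shape that would have to be built first. In PySpark the same shape could presumably be produced with a crossJoin between rated_users and a DataFrame of candidate movie IDs.

```python
from itertools import product


def build_candidate_pairs(user_ids, movie_ids, exclude_movie=None):
    """Return one {"userId", "movieId"} row per user and candidate movie.

    Hypothetical helper: it only illustrates the two-column shape that a
    scoring step over (user, item) pairs needs.
    """
    candidates = [m for m in movie_ids if m != exclude_movie]
    return [{"userId": u, "movieId": m} for u, m in product(user_ids, candidates)]


# Made-up IDs for illustration only (not taken from MovieLens).
rated_user_ids = [1, 2]
all_movie_ids = [10, 20, 30]

# Skip the input movie itself, since these users have already rated it.
pairs = build_candidate_pairs(rated_user_ids, all_movie_ids, exclude_movie=10)
```

Each row now has both columns, which is what the error message says is missing. Alternatively, model.recommendForUserSubset(rated_users, numItems) accepts a DataFrame that has only the user column and returns the top items per user, which may be closer to what the question is after.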

How to download precipitation data for latitude-longitude coordinates from NOAA in R

I'm trying to download precipitation data for a list of latitude-longitude coordinates in R. I came across this question, which gets me most of the way there, but over half of the weather stations don't have precipitation data. I've pasted my code up to this point below.
I'm now trying to figure out how to get data only from the closest station that has precipitation data, or how to run a second function on the sites with missing data to get data from the second-closest station. However, I haven't been able to figure out how to do this. Any suggestions or resources that might help?
library(rnoaa)
library(dplyr)  # provides %>% and filter()
# load station data - takes some minutes
station_data <- ghcnd_stations() %>% filter(element == "PRCP")
# add id column for each location (necessary for next function)
sites_df$id <- 1:nrow(sites_df)
# retrieve all stations in radius (e.g. 20km) using lapply
stations <- lapply(1:nrow(sites_df),
  function(i) meteo_nearby_stations(sites_df[i,], lat_colname = 'Lattitude', lon_colname = 'Longitude', radius = 20, station_data = station_data)[[1]])
# pull data for nearest stations - x$id[1] selects ID of closest station
stations_data <- lapply(stations, function(x) meteo_pull_monitors(x$id[1], date_min = "2022-05-01", date_max = "2022-05-31", var = c("prcp")))
stations_data
# Poor attempt (the site is making me include one): trying to rerun the pull for the second-closest station on the subset of sites with missing data. I know this isn't working, but I don't know how to get lapply to run over a subset of a list, or understand the function well enough to code it another way.
for (i in c(1,2,3,7,9,10,11,14,16,17,19,20)){
stations_data[[i]] <- lapply(stations,function(x) meteo_pull_monitors(x$id[2], date_min = "2022-05-01", date_max = "2022-05-31", var = c("prcp")))
}

How to predict the outcome variables using a saved pipeline when the data set does not contain the actual outcome?

I have a data set that contains the following columns: outcome (this is the outcome that we want to predict), and raw (a column that consists of text). I want to develop an ML model that will predict the outcome from the raw column. I have trained an ML model in Databricks using the following pipeline:
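from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import CountVectorizer, IndexToString, RegexTokenizer, StringIndexer
# split the raw text into words on non-word characters
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")
countVec = CountVectorizer(inputCol="words", outputCol="features")
# the indexer is fitted up front so its labels can be passed to the inverter
indexer = StringIndexer(inputCol="outcome", outputCol="label").setHandleInvalid("skip").fit(trainDF)
inverter = IndexToString(inputCol="prediction", outputCol="prediction_label", labels=indexer.labels)
nb = NaiveBayes(labelCol="label", featuresCol="features", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[regexTokenizer, indexer, countVec, nb, inverter])
model = pipeline.fit(trainDF)
model.write().overwrite().save("/FileStore/project")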
In another notebook, I load the model and try to predict the values for a new data set. This data set does not contain the outcome variable ("outcome" in this case):
model = PipelineModel.load("/FileStore/project")
score_output_df = model.transform(score_this)
When I try to predict the values for the new data set, I get an error message saying that the column "outcome" cannot be found. I suspect this is because some stages in the pipeline depend on this column (the indexer converts the outcome column to numbers, and the inverter converts the predictions back to string labels).
My question is: how can I load a saved model and use it to predict values when the original pipeline contains stages that take this column as input?
Instead of using
model.write().overwrite().save("/FileStore/project")
save it like this:
model.write().overwrite().save("/FileStore/project/model.sav")
and then load it like this:
model = PipelineModel.load("/FileStore/project/model.sav")
score_output_df = model.transform(score_this)
I have found a solution to the problem and will post it here so that if someone faces the same problem they can benefit from it. The solution was simply to extract the stages that I want to use in the prediction and save them to the model as such:
model = PipelineModel.load("/FileStore/project")
stages1 = []
stages1 += [model.stages[0]]
stages1 += [model.stages[2]]
stages1 += [model.stages[3]]
stages1 += [model.stages[4]]
model.stages = stages1
score_output_df = model.transform(score_this)
In this code, I exclude the second step ([1]) because it contains the indexer. Once I do this, I can predict values when the "outcome" column is not available.
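The same idea can be expressed without hard-coded indices by dropping the unwanted stage by type. The sketch below uses empty stand-in classes instead of real Spark stages so that it runs on its own, and drop_stages is a hypothetical helper. With a real PipelineModel the check would presumably be against pyspark.ml.feature.StringIndexerModel.

```python
# Stand-in classes that only mimic the names of the fitted pipeline stages.
# They are NOT the real Spark classes; they exist so the sketch is runnable.
class RegexTokenizer: ...
class StringIndexerModel: ...
class CountVectorizerModel: ...
class NaiveBayesModel: ...
class IndexToString: ...


def drop_stages(stages, unwanted_types):
    """Return the stages in their original order, minus the unwanted types."""
    return [stage for stage in stages if not isinstance(stage, unwanted_types)]


# Same order as in the question: tokenizer, indexer, vectorizer, classifier, inverter.
fitted_stages = [
    RegexTokenizer(),
    StringIndexerModel(),
    CountVectorizerModel(),
    NaiveBayesModel(),
    IndexToString(),
]

# Remove the stage that needs the "outcome" column at scoring time.
scoring_stages = drop_stages(fitted_stages, (StringIndexerModel,))
kept = [type(stage).__name__ for stage in scoring_stages]
```

Filtering by type keeps working if the stage order changes later, whereas the index-based version silently drops the wrong stage in that case.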

Trying to use Distributed data parallel on GANs but getting runtime error about an inplace operation

I am trying to train a GAN on a machine with 3 GPUs using DistributedDataParallel (DDP).
Before wrapping my model in DDP everything works fine, but once I wrap it, I get the following runtime error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 5; expected version 4 instead.
I cloned every tensor involved in the gradient computation to eliminate the in-place operation (if there is one), but I could not find it.
The part of the code with the problem is as follows:
Tensor = torch.cuda.FloatTensor
# ----------
# Training
# ----------
def train_gan(rank, world_size, opt):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)
    if rank == 0:
        get_dataloader(rank, opt)
    dist.barrier()
    print(f"Rank {rank}/{world_size} training process passed data download barrier.\n")
    dataloader = get_dataloader(rank, opt)
    # Loss function
    adversarial_loss = torch.nn.BCELoss()
    # Initialize generator and discriminator
    generator = Generator()
    discriminator = Discriminator()
    # Initialize weights
    generator.apply(weights_init_normal)
    discriminator.apply(weights_init_normal)
    generator.to(rank)
    discriminator.to(rank)
    generator_d = DDP(generator, device_ids=[rank])
    discriminator_d = DDP(discriminator, device_ids=[rank])
    # Optimizers
    # Since we are computing the average of several batches at once (an effective batch size of
    # world_size * batch_size) we scale the learning rate to match.
    optimizer_G = torch.optim.Adam(generator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))
    optimizer_D = torch.optim.Adam(discriminator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))
    losses = []
    for epoch in range(opt.n_epochs):
        for i, (imgs, _) in enumerate(dataloader):
            # Adversarial ground truths
            valid = Variable(Tensor(imgs.shape[0], 1).fill_(1.0), requires_grad=False).to(rank)
            fake = Variable(Tensor(imgs.shape[0], 1).fill_(0.0), requires_grad=False).to(rank)
            # Configure input
            real_imgs = Variable(imgs.type(Tensor)).to(rank)
            # -----------------
            # Train Generator
            # -----------------
            optimizer_G.zero_grad()
            # Sample noise as generator input
            z = Variable(Tensor(np.random.normal(0, 1, (imgs.shape[0], opt.latent_dim)))).to(rank)
            # Generate a batch of images
            gen_imgs = generator_d(z)
            # Loss measures generator's ability to fool the discriminator
            g_loss = adversarial_loss(discriminator_d(gen_imgs), valid)
            g_loss.backward()
            optimizer_G.step()
            # ---------------------
            # Train Discriminator
            # ---------------------
            optimizer_D.zero_grad()
            # Measure discriminator's ability to classify real from generated samples
            real_loss = adversarial_loss(discriminator_d(real_imgs), valid)
            fake_loss = adversarial_loss(discriminator_d(gen_imgs.detach()), fake)
            d_loss = ((real_loss + fake_loss) / 2).to(rank)
            d_loss.backward()
            optimizer_D.step()
I encountered a similar error when trying to train a GAN with DistributedDataParallel.
I noticed the problem was coming from BatchNorm layers in my discriminator.
Indeed, DistributedDataParallel synchronizes the batchnorm parameters at each forward pass (see the doc), thereby modifying the variable inplace, which causes problems if you have multiple forward passes in a row.
Converting my BatchNorm layers to SyncBatchNorm did the trick for me:
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)
discriminator = DDP(discriminator)
You probably want to do it anyway when using DistributedDataParallel.
Alternatively, if you don't want to use SyncBatchNorm, you can set the broadcast_buffers parameter to False, but I don't think you really want to do that, as it means your batch norm stats will not be synchronized among processes.
discriminator = DDP(discriminator, device_ids=[rank], broadcast_buffers=False)

Text mining : Bad predictions of toxic comments using Word2Vec

I have a dataset containing sentences and boolean columns (0 or 1) that classify the type of comment (toxic|severe_toxic|obscene|threat|insult|identity_hate).
You can download the dataset here: https://ufile.io/nqns7
I filtered the words with spaCy to keep only the useful ones (adjectives, adverbs, verbs, and nouns) using this function:
def filter_words(words):
    vec = []
    conditions = ('ADV', 'NOUN', 'ADJ', 'VERB')
    for token in nlp(words):
        if not token.is_stop and token.pos_ in conditions:
            vec.append(token.lemma_)
    return vec
Then I converted the dataframe to a Parquet file to speed things up.
I ended up with a dataframe which looks like this:
I applied Word2Vec to this dataframe to create a features column, then used a RandomForestClassifier to see whether the model predicts well.
Here is the code:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import Word2Vec
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
# learn word embeddings and average them into one feature vector per comment
word2vec = Word2Vec(inputCol="vector_words", outputCol="features")
model = word2vec.fit(sentences)
result = model.transform(sentences)
result = result.withColumn("toxic", result["toxic"].cast(IntegerType()))
rf = RandomForestClassifier(labelCol="toxic", featuresCol="features")
result = result.dropna()
(trainingSet, testSet) = result.randomSplit([0.7, 0.3])
model_toxic = rf.fit(trainingSet)
predictions = model_toxic.transform(testSet)
The problem is that only 16 comments are predicted as toxic, of which 13 really are toxic, while there are about 4000 toxic comments in the set.
I don't understand why. Is it because the filter I applied to the words is too restrictive (though I don't see why it would be), or is it because the parameters of my Word2Vec and RandomForestClassifier aren't tuned well enough?
I'm new to PySpark and I couldn't find any information about poorly performing models; people on the internet mostly seem happy with their results. Any help would be appreciated.
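To make the numbers in the question concrete, here is a small sketch that computes precision and recall from them. It assumes the roughly 4000 toxic comments are all in the evaluated set, which the question does not state. If that figure refers to the full dataset, recall on the 30% test split would be somewhat higher but still very low. The helper name precision_recall is made up for this illustration.

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    """Compute precision and recall from raw counts, guarding against zero division."""
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Counts from the question: 16 predicted toxic, 13 of them correct,
# and about 4000 toxic comments overall (assumed here to be in the evaluated set).
precision, recall = precision_recall(
    true_positives=13,
    predicted_positives=16,
    actual_positives=4000,
)

print(f"precision = {precision:.2%}, recall = {recall:.2%}")
```

Under that assumption, precision is about 81% while recall is well under 1%, so the model is mostly right when it does predict "toxic" but almost never does. One common cause of this pattern is class imbalance, with the classifier defaulting to the majority class; that is a guess, not something the question confirms.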

Save PageRank output in Neo4j

I am running the Pregel PageRank algorithm on Twitter data in Spark using Scala. The algorithm runs fine and produces the correct output, finding the highest PageRank score. But I am unable to save the graph to Neo4j.
The inputs and outputs are shown below.
Input file: (The numbers are twitter userIDs)
86566510 15647839
86566510 197134784
86566510 183967095
15647839 11272122
15647839 10876852
197134784 34236703
183967095 20065583
11272122 197134784
34236703 18859819
20065583 91396874
20065583 86566510
20065583 63433165
20065583 29758446
Output of the graph vertices:
(11272122,0.75)
(34236703,1.0)
(10876852,0.75)
(18859819,1.0)
(15647839,0.6666666666666666)
(86566510,0.625)
(63433165,0.625)
(29758446,0.625)
(91396874,0.625)
(183967095,0.6666666666666666)
(197134784,1.1666666666666665)
(20065583,1.0)
I tried saving the graph with the Scala code below, but nothing is saved. Please help me solve this.
Neo4jGraph.saveGraph(sc, pagerankGraph, nodeProp = "twitterId", relProp = "follows")
Thanks.
Did you load the graph originally from Neo4j? Currently saveGraph saves the graph data back to Neo4j nodes via their internal id's.
It actually runs this statement:
UNWIND {data} as row
MATCH (n) WHERE id(n) = row.id
SET n.$nodeProp = row.value return count(*)
As a short-term mitigation, I added optional labelIdProp parameters that are used instead of the internal ids, and a match/merge flag. You'll have to build the library yourself to use them, though. I'm going to push the update in the next few days.
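To illustrate why matching on the internal id is a problem for this data: the vertex ids in the PageRank output are Twitter user IDs, not Neo4j internal ids, so id(n) = row.id is unlikely to match the intended nodes. The sketch below is plain Python with no Neo4j driver. It builds the data parameter rows from a few lines of the output above and pairs them with a Cypher statement that merges on a property instead. The User label and the pagerank property name are assumptions for illustration; only twitterId comes from the question. The statement uses the $data parameter syntax rather than the older {data} form shown above.

```python
# A few (vertexId, rank) pairs copied from the output in the question.
pagerank_output = [
    (11272122, 0.75),
    (34236703, 1.0),
    (197134784, 1.1666666666666665),
    (20065583, 1.0),
]

# Hypothetical statement: the label "User" and the property "pagerank" are
# assumptions. It looks nodes up by the twitterId property, not by id(n).
MERGE_BY_PROPERTY = """
UNWIND $data AS row
MERGE (n:User {twitterId: row.id})
SET n.pagerank = row.value
RETURN count(*)
"""


def to_rows(vertices):
    """Turn (vertexId, rank) tuples into the row maps the statement unwinds."""
    return [{"id": vertex_id, "value": rank} for vertex_id, rank in vertices]


params = {"data": to_rows(pagerank_output)}

# Sanity check on the payload: the vertex with the highest rank.
top = max(params["data"], key=lambda row: row["value"])
```

The statement and params could then be handed to whichever Neo4j client is in use; how that call looks depends on the client and is not shown here.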
Something you can try is Neo4jDataFrame.mergeEdgeList.
Here is the test code for it.
You basically have a dataframe with the data, and it saves that to a Neo4j graph (including the relationships).
val rows = sc.makeRDD(Seq(Row("Keanu", "Matrix")))
val schema = StructType(Seq(StructField("name", DataTypes.StringType), StructField("title", DataTypes.StringType)))
val df = new SQLContext(sc).createDataFrame(rows, schema)
Neo4jDataFrame.mergeEdgeList(sc, df, ("Person",Seq("name")),("ACTED_IN",Seq.empty),("Movie",Seq("title")))
val edges : RDD[Edge[Long]] = sc.makeRDD(Seq(Edge(0,1,42L)))
val graph = Graph.fromEdges(edges,-1)
assertEquals(2, graph.vertices.count)
assertEquals(1, graph.edges.count)
Neo4jGraph.saveGraph(sc,graph,null,"test")
val it: ResourceIterator[Long] = server.graph().execute("MATCH (:Person {name:'Keanu'})-[:ACTED_IN]->(:Movie {title:'Matrix'}) RETURN count(*) as c").columnAs("c")
assertEquals(1L, it.next())
it.close()