How can pyspark remember something in memory like class attributes in mapreduce?

I have a table with 2 columns: image_url, comment. The same image may have many comments, and the data are sorted by image_url within the files.
I need to crawl each image and convert it to binary. This takes a long time, so for the same image I want to do it only once.
In MapReduce, I can remember the last row and its result in memory:
class Mapper:
    def __init__(self):
        self.image_url = None
        self.image_bin = None

    def run(self, image_url, comment):
        if image_url != self.image_url:
            self.image_url = image_url
            self.image_bin = process(image_url)
        return self.image_url, self.image_bin, comment
How can I do this in pyspark? Either the RDD or the DataFrame API is fine.

I would advise you to simply process a grouped version of your dataframe. Something like this:
from pyspark.sql import functions as F
# Assuming df is your dataframe
df = df.groupBy("image_url").agg(F.collect_list("comment").alias("comments"))
df = df.withColumn("image_bin", process(F.col("image_url")))
df.select(
    "image_url",
    "image_bin",
    F.explode("comments").alias("comment"),
).show()
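One caveat: for the withColumn line above to work, process has to be a Spark UDF rather than a plain Python function. A minimal sketch, assuming a hypothetical crawl_image helper that downloads the URL and returns its bytes:
import urllib.request
from pyspark.sql import functions as F
from pyspark.sql.types import BinaryType

def crawl_image(image_url):
    # hypothetical crawler: download the image and return its raw bytes
    with urllib.request.urlopen(image_url) as resp:
        return resp.read()

# wrap the crawler so it can be applied to a column
process = F.udf(crawl_image, BinaryType())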

I found that mapPartitions works. The code looks like this.
def do_cover_partition(partitionData):
    last_url = None
    last_bin = None
    for row in partitionData:
        data = row.asDict()
        if data['cover_url'] != last_url:
            last_url = data['cover_url']
            last_bin = url2bin(last_url)
        data['frames'] = last_bin
        yield data

columns = ["id", "cover_url", "comment", "frames"]
df = df.rdd.mapPartitions(do_cover_partition).map(lambda x: [x[c] for c in columns]).toDF(columns)
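Note that this only avoids repeated downloads when rows with the same cover_url sit next to each other inside a partition. If the input files don't already guarantee that, it can be enforced before the mapPartitions call; a small sketch:
# group and order rows by cover_url inside each partition before scanning them
df = df.repartition("cover_url").sortWithinPartitions("cover_url")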

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery read. The "wherePart" has a large number of records, and hence the BQ call is invoked again and again. Keeping the filter outside of the BQ read would help. The idea is: first read the "mainTable" from BQ, store it in a Spark view, then apply the "wherePart" filter to this view in Spark.
["subDate" is a function to subtract one date from another and return the number of days in between]
val Df = getFb(config, mainTable, ds)

def getFb(config: DataFrame, mainTable: String, ds: String): DataFrame = {
  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect
  val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1)))
    .map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')")
    .mkString(" OR ")
  val q = new Q()
  val tempView = "tempView"
  spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
  val Df = q.mainTableLogs(tempView)
  Df
}
Could someone please help me here.
Are you using the spark-bigquery-connector? If so, the right syntax is:
spark.read.format("bigquery")
  .load(mainTable)
  .where(wherePart)
  .createOrReplaceTempView(tempView)
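The same pattern can be sketched from PySpark if you use the connector there as well (main_table and where_part stand in for the question's mainTable and wherePart):
# read the table once through the connector, then let Spark apply the filter
(spark.read.format("bigquery")
    .load(main_table)
    .where(where_part)
    .createOrReplaceTempView("tempView"))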

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like the one below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this dataframe into the below dataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each key inside 'values' (0.2, 0.4 and 0.6) is multiplied by 100, prefixed with the letter 'v', and its array is extracted into a separate column.
What would the code look like to achieve this? I have tried withColumn but couldn't achieve it.
Try the code below; see the inline comments for the explanation.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp dataframe for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Add one column per nested field
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`"))
    }).drop("inputs") // Drop the original nested column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }
}
I would split the column-renaming logic into two cases: names that are numeric values, and names that don't change.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
and then a single function that transforms the name according to each case:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it as-is
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically transform all the columns inside inputs.values to their new names and put them next to id.
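For reference, the same renaming idea can be sketched in PySpark (assuming the JSON structure from the question; the file path is a placeholder):
from pyspark.sql import functions as F

df = spark.read.json("dyamicCol.json")  # placeholder path

# field names under inputs.values, e.g. ["0.2", "0.4", "0.6"]
value_names = df.select("inputs.values.*").schema.names

# build one aliased column per nested field: 0.2 -> v20, 0.4 -> v40, ...
renamed = [F.col(f"inputs.values.`{name}`").alias(f"v{int(float(name) * 100)}")
           for name in value_names]

final_df = df.select("id", "id1", *renamed)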

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path)
val studentdetailsRDD = sc.textFile(file path)
val studentB = sc.broadcast(studentdetailsRDD.collect)

// Generating CSV
studentRDD.map { student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions { data =>
  val stringWriter = new StringWriter()
  val csvWriter = new CSVWriter(stringWriter)
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)

// Functions defined to get details
def getName(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.StudentName }
}

def getCourse(studentId: String) {
  studentB.value.map { stud => if (studentId == stud.StudentId) stud.Course }
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is in the lookup functions: getName and getCourse basically do nothing, because their return type is Unit; using an if without an else means that for some inputs there's no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, which makes lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())

// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD
  .map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc.; you can also use CSVWriter as you did

// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joining and writing to file. A join only takes one line of code, and so does the write.
Hand-writing that code can be error-prone, hard to read, and most likely much slower.
val df1 = Seq((101,"NDLS"),
(102,"Mumbai")
).toDF("id", "city")
val df2 = Seq((101,"ABC","C001"),
(102,"XYZ","C002")
).toDF("id", "name", "course")
val dfResult = df1.join(df2, "id").select("id", "city", "name")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created with a single file inside, which is the final result.
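If you want to start from the CSV files themselves rather than hard-coded Seqs, roughly the same join can be sketched in PySpark (paths are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read both files with their headers (paths are placeholders)
students = spark.read.csv("Student.csv", header=True)
details = spark.read.csv("StudentDetails.csv", header=True)

# join on StudentId and keep only the requested columns
result = details.join(students, "StudentId").select("StudentName", "Course", "City")

# write a single CSV file: ABC,C001,NDLS / XYZ,C002,Mumbai
result.coalesce(1).write.csv("output_dir")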

How to efficiently extract a value from HiveContext Query

I am running a query through my HiveContext
Query:
val hiveQuery = s"""SELECT post_domain, post_country, post_geo_city, post_geo_region
  FROM $database.$table
  WHERE year=$year and month=$month and day=$day and hour=$hour and event_event_id='$uniqueIdentifier'"""
val hiveQueryObj: DataFrame = hiveContext.sql(hiveQuery)
Originally, I was extracting each value from the column with:
hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
However, I was told to avoid this because it makes too many connections to Hive. I am pretty new to this area so I'm not sure how to extract the column values efficiently. How can I perform the same logic in a more efficient way?
I plan to implement this in my code:
val arr = Array("post_domain", "post_country", "post_geo_city", "post_geo_region")
arr.foreach(column => {
  // expected is a Map
  val ex = expected.get(column).get
  val actual = hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
  assert(actual.equals(ex))
})
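Since the query should return a single row, one way to cut down the round trips is to collect that row once and read every column from it locally, instead of running a collect per column. A PySpark sketch of the idea (hive_query_df and expected are stand-ins for hiveQueryObj and the expected Map):
columns = ["post_domain", "post_country", "post_geo_city", "post_geo_region"]

# bring the single matching row back to the driver once
row = hive_query_df.select(*columns).first()

for column in columns:
    actual = str(row[column])
    assert actual == expected[column]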

Spark 2.0 ALS Recommendation how to recommend to a user

I have followed the guide given in the link
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
But it is outdated, as it uses the Spark MLlib RDD-based approach. The new Spark 2.0 has a DataFrame-based approach.
Now my problem is that I have the updated code:
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
  .map(parseRating)
  .toDF()
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

// Build the recommendation model using ALS on the training data
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(training)

// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
Now here is the problem: in the old code, the model obtained was a MatrixFactorizationModel; now it has its own model (ALSModel).
With MatrixFactorizationModel you could directly do
val recommendations = bestModel.get
  .predict(userID)
which would give the list of products with the highest probability of the user liking them.
But now there is no .predict method. Any idea how to recommend a list of products given a user id?
Use the transform method on the model:
import spark.implicits._
val dataFrameToPredict = sparkContext.parallelize(Seq((111, 222)))
  .toDF("userId", "productId")
val predictionsOfProducts = model.transform(dataFrameToPredict)
There's a JIRA ticket to implement a recommend(User|Product) method, but it's not yet on the default branch.
Now you have a DataFrame with scores for the user.
You can simply use orderBy and limit to show N recommended products:
// the where is for the case when we have a big DataFrame with many users
model.transform(dataFrameToPredict.where('userId === givenUserId))
  .select('productId, 'prediction)
  .orderBy('prediction.desc)
  .limit(N)
  .map { case Row(productId: Int, prediction: Double) => (productId, prediction) }
  .collect()
The DataFrame dataFrameToPredict can be some large user-product DataFrame, for example all users x all products.
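A sketch of how such a DataFrame could be built in PySpark (users and products are assumed single-column DataFrames of ids; crossJoin needs Spark 2.1+):
# all (userId, productId) pairs to score
data_frame_to_predict = (users.select("userId")
                              .crossJoin(products.select("productId")))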
The ALS Model in Spark contains the following helpful methods:
recommendForAllItems(int numUsers)
Returns top numUsers users recommended for each item, for all items.
recommendForAllUsers(int numItems)
Returns top numItems items recommended for each user, for all users.
recommendForItemSubset(Dataset<?> dataset, int numUsers)
Returns top numUsers users recommended for each item id in the input data set.
recommendForUserSubset(Dataset<?> dataset, int numItems)
Returns top numItems items recommended for each user id in the input data set.
e.g. Python
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import explode
alsEstimator = ALS()
(alsEstimator.setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop"))
alsModel = alsEstimator.fit(productRatings)
recommendForSubsetDF = alsModel.recommendForUserSubset(TargetUsers, 40)
recommendationsDF = (recommendForSubsetDF
.select("user_id", explode("recommendations")
.alias("recommendation"))
.select("user_id", "recommendation.*")
)
display(recommendationsDF)
e.g. Scala:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.functions.explode
val alsEstimator = new ALS().setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop")
val alsModel = alsEstimator.fit(productRatings)
val recommendForSubsetDF = alsModel.recommendForUserSubset(sampleTargetUsers, 40)
val recommendationsDF = recommendForSubsetDF
.select($"user_id", explode($"recommendations").alias("recommendation"))
.select($"user_id", $"recommendation.*")
display(recommendationsDF)
Here is what I did to get recommendations for a specific user with spark.ml:
import com.github.fommil.netlib.BLAS.{getInstance => blas}

userFactors.lookup(userId).headOption.fold(Map.empty[String, Float]) { user =>
  val ratings = itemFactors.map { case (id, features) =>
    val rating = blas.sdot(features.length, user, 1, features, 1)
    (id, rating)
  }
  // sort descending so the highest-scored items are taken
  ratings.sortBy(_._2, ascending = false).take(numResults).toMap
}
Both userFactors and itemFactors in my case are RDD[(String, Array[Float])] but you should be able to do something similar with DataFrames.
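For the DataFrame route, the fitted ALSModel exposes its factors as the userFactors and itemFactors DataFrames (columns id and features), so a similar scoring can be sketched in PySpark with a dot-product UDF (user_id and num_results are placeholders; crossJoin needs Spark 2.1+):
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# dot product of two float arrays
dot = F.udf(lambda a, b: float(sum(x * y for x, y in zip(a, b))), DoubleType())

user_vec = (model.userFactors
            .filter(F.col("id") == user_id)
            .select(F.col("features").alias("user_features")))

top_items = (model.itemFactors
             .crossJoin(user_vec)
             .select(F.col("id").alias("itemId"),
                     dot("features", "user_features").alias("score"))
             .orderBy(F.col("score").desc())
             .limit(num_results))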