Spark RDD: manipulating data in Scala

I have sample data like below:
UserId,ProductId,Category,Action
1,111,Electronics,Browse
2,112,Fashion,Click
3,113,Kids,AddtoCart
4,114,Food,Purchase
5,115,Books,Logout
6,114,Food,Click
7,113,Kids,AddtoCart
8,115,Books,Purchase
9,111,Electronics,Click
10,112,Fashion,Purchase
3,112,Fashion,Click
I need to generate a list of users who are interested in either the "Fashion" category or the "Electronics" category, but not in both. A user is interested if he/she has performed any of these actions: Click / AddtoCart / Purchase. Using Spark/Scala, this is what I have done so far:
val rrd1 = sc.textFile("/user/harshit.kacker/datametica_logs.csv")
val rrd2 = rrd1.map(x => {
  val c = x.split(",")
  (c(0).toInt, x)
})
val rrd3 = rrd1.filter(x=> x.split(",")(2) == "Fashion" || x.split(",")(2) == "Electronics")
val rrd4 = rrd3.filter(x=> x.split(",")(3)== "Click" || x.split(",")(3)=="Purchase" || x.split(",")(3)=="AddtoCart")
rrd4.collect.foreach(println)
2,112,Fashion,Click
9,111,Electronics,Click
10,112,Fashion,Purchase
3,112,Fashion,Click
4,111,Electronics,Click
19,112,Fashion,Click
9,112,Fashion,Purchase
2,112,Fashion,Click
2,111,Electronics,Click
1,112,Fashion,Purchase
Now I have to work on the part "generate a list of users who are interested in either the 'Fashion' category or the 'Electronics' category but not in both categories" and get the desired output:
10,Fashion
3,Fashion
4,Electronics
19,Fashion
1,Fashion
meaning a userId having both Fashion and Electronics as categories should be eliminated. How can I achieve this?

Start by parsing the input text file into tuples:
val srcPath = "/user/harshit.kacker/datametica_logs.csv"
// parse the text file into tuples:
val rdd = spark.sparkContext.textFile(srcPath)
val rows = rdd.map(line => line.split(",")).map(row => (row(0), row(1), row(2), row(3)))
val header = rows.first
// drop the header:
val logs = rows.filter(row => row != header)
Filter the RDD by interest criteria:
val interests = logs.filter(log =>
List("Click", "AddtoCart", "Purchase").contains(log._4)
)
Filter for fashion and electronics separately:
val fashion = interests.filter(row => row._3 == "Fashion")
val electronics = interests.filter(row => row._3 == "Electronics")
Find the common user IDs between fashion and electronics:
val fashionIds = fashion.map(_._1).distinct
val electronicsIds = electronics.map(_._1).distinct
val commonIds = fashionIds.intersection(electronicsIds).collect()
Combine the fashion and electronics rows and filter out the ids common to both:
val finalRdd = (fashion ++ electronics)
.filter(log => !commonIds.contains(log._1))
.map(log => (log._1, log._3))
.distinct()
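To verify, collect and print finalRdd; the pairs match the desired output from the question (the ordering may differ, since distinct triggers a shuffle):
finalRdd.collect().foreach { case (userId, category) => println(s"$userId,$category") }
// e.g. 10,Fashion
//      3,Fashion
//      4,Electronics
//      ...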
Edit: Using DataFrame
// using dataframes:
val df = spark.read.option("header", "true").csv(srcPath)
val interestDf = df.where($"Action".isin("Click", "Purchase", "AddtoCart"))
val fashionDf = interestDf.where($"Category" === "Fashion")
val electronicsDf = interestDf.where($"Category" === "Electronics")
val joinDf = electronicsDf.alias("e").join(fashionDf.alias("f"), Seq("UserId"), "outer")
.where($"e.Category".isNull || $"f.Category".isNull)
val finalDf = joinDf.select($"UserId", when($"e.Category".isNull, $"f.Category").otherwise($"e.Category").as("Category")).distinct
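Note that the DataFrame version assumes the usual SparkSession setup; outside spark-shell you would also need these imports before it compiles:
import spark.implicits._                     // enables the $"colName" column syntax
import org.apache.spark.sql.functions.when   // the when(...) conditional column
finalDf.show()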

Related

Combining files

I am new to Scala. I have two RDDs and I need to separate out my training and testing data. In one file I have all the data and in another just the testing data. I need to remove the testing data from my complete data set.
The complete data file is of the format (userID, MovID, Rating, Timestamp):
res8: Array[String] = Array(1, 31, 2.5, 1260759144)
The test data file is of the format (userID, MovID):
res10: Array[String] = Array(1, 1172)
How do I generate ratings_train so that it does not contain the cases matched with the testing dataset?
I am using the following function, but the returned list is empty:
def create_training(data: RDD[String], ratings_test: RDD[String]): ListBuffer[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(",")).collect()
  var ratings_train = new ListBuffer[Array[String]]()
  ratings_split.foreach(x => {
    ratings_testing.foreach(y => {
      if (x(0) != y(0) || x(1) != y(1)) {
        ratings_train += x
      }
    })
  })
  return ratings_train
}
EDIT: I changed the code but am running into memory issues. This may work:
def create_training(data: RDD[String], ratings_test: RDD[String]): Array[Array[String]] = {
  val ratings_split = dropheader(data).map(line => line.split(","))
  val ratings_testing = dropheader(ratings_test).map(line => line.split(","))
  ratings_split.filter(x => {
    ratings_testing.exists(y =>
      (x(0) == y(0) && x(1) == y(1))
    ) == false
  })
}
The code snippets you posted are not logically correct. A row should be part of the final data only if it has no match in the test data, but your code adds a row as soon as it fails to match a single test row. You need to check that it matches none of the test rows before deciding it is a valid row.
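A sketch of one way to express that check with plain RDDs, assuming your dropheader helper and comma-separated lines as in the samples: key both datasets by (userID, MovID) and use subtractByKey, which removes every row whose key appears in the test set and avoids the nested loop entirely.
// key the full ratings by (userID, MovID), keeping the split columns as the value
val ratingsKeyed = dropheader(data)
  .map(_.split(","))
  .map(cols => ((cols(0).trim, cols(1).trim), cols))
// key the test set the same way; the value is irrelevant
val testKeyed = dropheader(ratings_test)
  .map(_.split(","))
  .map(cols => ((cols(0).trim, cols(1).trim), ()))
// keep only rows whose (userID, MovID) key does not appear in the test set
val ratings_train = ratingsKeyed.subtractByKey(testKeyed).values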
You are using RDDs, but not exploring their full power. I guess you are reading the input from a CSV file; then you can structure your data properly instead of splitting the string on commas and manually processing the fields as rows. You can take a look at the DataFrame API of Spark. These links may help: https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm , http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
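For illustration, a minimal DataFrame sketch of the same idea (assuming Spark 2.x, that both CSV files have headers, and that the columns are named userID and MovID) uses a left_anti join:
val ratingsDf = spark.read.option("header", "true").csv("ratings.csv")    // userID, MovID, Rating, Timestamp
val testDf = spark.read.option("header", "true").csv("ratings_test.csv")  // userID, MovID
// keep only the ratings whose (userID, MovID) pair is absent from the test set
val trainDf = ratingsDf.join(testDf, Seq("userID", "MovID"), "left_anti")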
Using Regex:
def main(args: Array[String]): Unit = {
  // create the test data sets
  val data = spark.sparkContext.parallelize(Seq(
    // "userID, MovID, Rating, Timestamp",
    "1, 31, 2.5, 1260759144",
    "2, 31, 2.5, 1260759144"))
  val ratings_test = spark.sparkContext.parallelize(Seq(
    // "userID, MovID",
    "1, 31",
    "2, 30",
    "30, 2"
  ))
  val result = getData(data, ratings_test).collect()
  // the result will only contain "2, 31, 2.5, 1260759144"
}

def getData(data: RDD[String], ratings_test: RDD[String]): RDD[String] = {
  val ratings = dropheader(data)
  val ratings_testing = dropheader(ratings_test)
  // Broadcast the test rating data to all Spark nodes, since we collect it beforehand.
  // The reason we collect the test data is to avoid calling collect inside the filter logic.
  val ratings_testing_bc = spark.sparkContext.broadcast(ratings_testing.collect.toSet)
  ratings.filter(rating => {
    ratings_testing_bc.value.exists(testRating => regexMatch(rating, testRating)) == false
  })
}

def regexMatch(data: String, testData: String): Boolean = {
  // Regular expression to extract the first two columns
  val regex = """^([^,]*), ([^,\r\n]*),?""".r
  val (dataCol1, dataCol2) = regex findFirstIn data match {
    case Some(regex(col1, col2)) => (col1, col2)
  }
  val (testDataCol1, testDataCol2) = regex findFirstIn testData match {
    case Some(regex(col1, col2)) => (col1, col2)
  }
  (dataCol1 == testDataCol1) && (dataCol2 == testDataCol2)
}

Spark DF: Schema for type Unit is not supported

I am new to Scala and Spark and trying to build on some samples I found. Essentially I am trying to call a function from within a DataFrame to get the State from a zip code using the Google API.
I have the code working separately but not together ;(
Here is the error I get from the piece of code that is not working:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
...
code:
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
  if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols:_*).show()
}
def getstate = udf { (zipcode: String) => {
  val url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + zipcode
  val result = scala.io.Source.fromURL(url).mkString
  val address = parse(result)
  val shortnames = for {
    JObject(address_components) <- address
    JField("short_name", short_name) <- address_components
  } yield short_name
  val state = shortnames(3)
  //return state.toString()
  val stater = state.toString()
}
}
Thanks for the responses. I think I figured it out. Here is the code that works. One thing to note: the Google API has restrictions, so some valid zip codes don't have state info. Not an issue for me though.
private def loaduserdata(spark: SparkSession): Unit = {
  import spark.implicits._
  // Create an RDD of User objects from a text file, convert it to a DataFrame
  val userDF = spark.sparkContext
    .textFile("examples/src/main/resources/users.csv")
    .map(_.split("::"))
    .map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
    .toDF()
  // Register the DataFrame as a temporary view
  userDF.createOrReplaceTempView("Users")
  // SQL statements can be run by using the sql methods provided by Spark
  val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc")
  // zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
  val colNames = zipcodesDF.columns
  val cols = colNames.map(cName => zipcodesDF.col(cName))
  val theColumn = zipcodesDF("state")
  val mappedCols = cols.map(c =>
    if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
  val geoDF = zipcodesDF.select(mappedCols:_*) //.show()
  geoDF.createOrReplaceTempView("Geo")
}
val getstate = udf { (zipcode: String) =>
  val url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + zipcode
  val result = scala.io.Source.fromURL(url).mkString
  val address = parse(result)
  val statenm = for {
    JObject(statename) <- address
    JField("types", JArray(types)) <- statename
    JField("short_name", JString(short_name)) <- statename
    if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
    // if types.head.equals("JString(administrative_area_level_1)")
  } yield short_name
  // the lambda must end with a String expression; ending on a val assignment would make
  // the UDF return Unit, which is exactly what triggers "Schema for type Unit is not supported"
  if (statenm.isEmpty) "N/A" else statenm.head
}
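For reference, a shorter way to apply the UDF (a sketch, assuming the same zipcodesDF as above and spark.implicits._ in scope) is withColumn, which replaces the placeholder column without rebuilding the column list by hand:
// overwrite the "state" column with the value returned by the UDF
val geoDF = zipcodesDF.withColumn("state", getstate($"zipcode"))
geoDF.createOrReplaceTempView("Geo")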

Spark map reduce with condition

Suppose this is my CSV file:
attr1;attr2
11111;MOC
22222;MTC
11111;MOC
22222;MOC
33333;MMS
I want the number of occurrences of each value in the first column when attr2 = MOC, like this:
(11111,2)
(22222,1)
I've tried:
val sc = new SparkContext(conf)
val textFile = sc.textFile(args(0))
val data = textFile.map(line => line.split(";").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0))
val rows = data.filter(line => header(line,"attr1") != "attr1")
val attr1 = rows.map(row => header(row,"attr1"))
val attr2 = rows.map(row => header(row,"attr2"))
attr1.map( k => (k,1) ).reduceByKey(_+_)
attr1.foreach (println)
How can I add the condition to my code?
The result of my code is:
(11111,2)
(22222,2)
(33333,1)
Use filter (again):
val rows = data
.filter(line => header(line,"attr1") != "attr1")
.filter(line => header(line,"attr2") == "MOC")
And then continue as before...
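Putting it together (reusing the SimpleCSVHeader helper from the question), the counting step stays the same; the extra filter only changes which rows get counted:
val rows = data
  .filter(line => header(line, "attr1") != "attr1")   // drop the header row
  .filter(line => header(line, "attr2") == "MOC")     // keep only MOC rows
val counts = rows
  .map(row => (header(row, "attr1"), 1))
  .reduceByKey(_ + _)
counts.foreach(println)   // (11111,2) and (22222,1), in some order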

Obtain a specific value from an RDD according to another RDD

I want to map an RDD by looking up values in another RDD with this code:
val product = numOfT.map { case ((a, b), c) =>
  val h = keyValueRecords.lookup(b).take(1).mkString.toInt
  (a, (h * c))
}
a and b are Strings and c is an Integer. keyValueRecords is like this: RDD[(String, String)].
I got a type mismatch error. How can I fix it?
What is my mistake?
This is a sample of the data:
userId,movieId,rating,timestamp
1,16,4.0,1217897793
1,24,1.5,1217895807
1,32,4.0,1217896246
2,3,2.0,859046959
3,7,3.0,8414840873
I'm trying with this code:
val lines = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1), substrings(1)))
})
val shoppingList = lines.groupByKey()
val coOccurence = shoppingList.flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfT = coOccurence.reduceByKey((a, b) => (a + b)) // (((item,rate),(item,rate)),cooccurrence)
// produce recommendations for a particular user
val keyValueRecords = sc.textFile("ratings.txt").map(s => {
  val substrings = s.split(",")
  (substrings(0), (substrings(1), substrings(2)))
}).filter { case (k, v) => k == "1" }.groupByKey().flatMap { case (k, v) =>
  val arry1 = v.toArray
  val arry2 = v.toArray
  val pairs = for (pair1 <- arry1; pair2 <- arry2) yield ((pair1, pair2), 1)
  pairs.iterator
}
val numOfTForaUser = keyValueRecords.reduceByKey((a, b) => (a + b))
val joined = numOfT.join(numOfTForaUser).map { case (k, v) => (k._1._1, (k._2._2.toFloat * v._1.toFloat)) }.collect.foreach(println)
The last RDD won't be produced. Is it wrong?
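One likely culprit, besides the type mismatch: Spark does not support nested RDD operations, so calling keyValueRecords.lookup(b) inside numOfT.map will fail even once the types line up. A common workaround, sketched here under the assumption that keyValueRecords is an RDD[(String, String)] as described above and small enough to collect, is to turn it into a local Map and broadcast it:
// collect the small lookup RDD on the driver and ship it to the executors
val lookupMap = sc.broadcast(keyValueRecords.collectAsMap())
val product = numOfT.map { case ((a, b), c) =>
  // fall back to "0" when the key is missing, so toInt does not throw
  val h = lookupMap.value.getOrElse(b, "0").toInt
  (a, h * c)
}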

What is the difference between using a DataFrame and an RDD in Spark 1.5.2?

I read data from MongoDB and then map it to InteractionItem.
val df = filterByParams(startTs, endTs, widgetIds, documents)
  .filter(item => {
    item._2.get("url") != "localhost" && !EXCLUDED_TRIGGERS.contains(item._2.get("trigger"))
  })
  .flatMap(item => {
    var res = Array[InteractionItem]()
    try {
      val widgetId = item._2.get("widgetId").toString
      val timestamp = java.lang.Long.parseLong(item._2.get("time").toString)
      val extra = item._2.get("extra").toString
      val extras = parseExtra(extra)
      val c = parseUserAgent(extras.userAgent.getOrElse(""))
      val os = c.os.family
      val osVersion = c.os.major
      val device = c.device.family
      val browser = c.userAgent.family
      val browserVersion = c.userAgent.major
      val adUnit = extras.adunit.get
      val gUid = extras.guid.get
      val trigger = item._2.get("trigger").toString
      val objectName = item._2.get("object").toString
      val response = item._2.get("response").toString
      val ts: Long = timestamp - timestamp % 3600
      val interaction = interactionConfiguration.filter(interaction =>
        interaction.get("trigger") == trigger &&
        interaction.get("object") == objectName &&
        interaction.get("response") == response).head
      val clickThrough = interaction.get("clickThrough").asInstanceOf[Boolean]
      val interactionId = interaction.get("_id").toString
      adUnitPublishers.filter(x => x._2._2.toString == widgetId && x._1.toString == adUnit).foreach(publisher => {
        res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2._1.toString, os, osVersion, device, browser, browserVersion,
          interactionId, clickThrough, 1L, gUid)
      })
      bdPublishers.filter(x => x._1.toString == widgetId).foreach(publisher => {
        res = res :+ InteractionItem(widgetId, ts, adUnit, publisher._2.toString, os, osVersion, device, browser, browserVersion,
          interactionId, clickThrough, 1L, gUid)
      })
    }
    catch {
      case e: Exception => {
        log.info(e.getMessage)
        res = res :+ InteractionItem.invalid()
      }
    }
    res
  }).filter(i => i.interactionCount > 0)
With the RDD approach, I map again and reduceByKey:
.map(i => ((i.widgetId, i.date, i.section, i.publisher, i.os, i.device, i.browser, i.clickThrough, i.id), i.interactionCount))
.reduceByKey((a, b) => a + b)
With the DataFrame approach, I convert:
.toDF()
df.registerTempTable("interactions")
df.cache()
val v = sqlContext.sql("SELECT id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount" +
" FROM interactions GROUP BY id, clickThrough, widgetId, date, section, publisher, os, device, browser, interactionCount")
From what I see in the Spark UI:
For the DataFrame it takes 210 stages?
For the RDD it is only 20 stages.
What am I doing wrong here?
The operations you've applied to the RDD and the DataFrame are not the same.
The reason the DataFrame has a longer processing time is the following extra work:
registerTempTable()
cache()
While the RDD version only reduces a single given expression, the DataFrame version processes the entire data as a table and also prepares the cache, which consumes additional CPU and storage resources.
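It is also worth noting that the SQL above contains no aggregate function: grouping by every selected column, including interactionCount, merely deduplicates rows, so it is not equivalent to the reduceByKey sum. A sketch of an equivalent DataFrame aggregation, with column names assumed to match the InteractionItem fields used in the RDD version:
import org.apache.spark.sql.functions.sum
// group by the same key columns as the RDD version and sum the counts
val aggregated = df
  .groupBy("widgetId", "date", "section", "publisher", "os", "device", "browser", "clickThrough", "id")
  .agg(sum("interactionCount").as("interactionCount"))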