How can I filter a joined RDD to select only some fields of the table in Apache Spark? - scala

I'm trying to filter two fields from two datasets stored in CSV files. I've already applied an INNER JOIN between dataset1.csv and dataset2.csv.
This is my starting code:
case class Customer(
  Customer_ID: Int,
  Customer_Name: String,
  Account_Number: Double,
  Marital_Status: String,
  Age: Int,
  Contact: Double,
  Location: String,
  Monthly_Income_USD: Double,
  Yearly_Balance_USD: Double,
  Job_type: String,
  Credit_Card: String
)
case class HouseLoan(
  Customer_ID: Int,
  Account_Number: Double,
  House_Loan_Amount_USD: Double,
  Total_Installment: Int,
  Installment_Pending: Int,
  Loan_Defaulter: String
)
// Pair RDDs keyed by Customer_ID (as a string) so they can be joined
val custRDD = sc.textFile("dataset1.csv")
  .map(_.split(","))
  .map(r => (
    r(0),
    Customer(r(0).toInt, r(1), r(2).toDouble, r(3), r(4).toInt, r(5).toDouble, r(6), r(7).toDouble, r(8).toDouble, r(9), r(10))
  ))
val houseRDD = sc.textFile("dataset2.csv")
  .map(_.split(","))
  .map(r => (
    r(0),
    HouseLoan(r(0).toInt, r(1).toDouble, r(2).toDouble, r(3).toInt, r(4).toInt, r(5))
  ))
val joinTab = custRDD.join(houseRDD)
joinTab.collect().foreach(println)
Up to here everything is okay; the result is shown in the image.
Now, I need:
the Customer_ID field (the key of both tables)
the Job_type
for those whose Job_type is "Doctor" and
whose House_Loan_Amount_USD is more than 1000000,
using the join.
I tried something like
val joinTab = custRDD.filter{record => (record.split(",")(9) == "Doctor").join(houseRDD)filter{record => (record.split(",")(2) > 100000}
but it is obviously wrong, and I'm still a noob at Apache Spark.
Note: I can't use Spark SQL because I'm learning this topic at my university (Spark Core - RDD), so I must do it with an RDD join.
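One minimal RDD-only sketch of what is being asked for (assuming the custRDD and houseRDD pair RDDs above, both keyed by Customer_ID as a string, and that the CSV fields parse cleanly): filter each side before the join, then project the joined result down to the two wanted fields.
// Keep only doctors on the customer side and loans above 1,000,000 on the loan side,
// then join on the Customer_ID key and keep just (Customer_ID, Job_type)
val doctors = custRDD.filter { case (_, c) => c.Job_type.trim == "Doctor" }
val bigLoans = houseRDD.filter { case (_, h) => h.House_Loan_Amount_USD > 1000000 }
val result = doctors.join(bigLoans)
  .map { case (_, (c, h)) => (c.Customer_ID, c.Job_type) }
result.collect().foreach(println)
Filtering before the join is just a choice here; filtering the already-joined joinTab with joinTab.filter { case (_, (c, h)) => c.Job_type.trim == "Doctor" && h.House_Loan_Amount_USD > 1000000 } gives the same rows.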

Related

How to filter composite primary key in Slick 3

I'm new to Slick 3. I want to filter on a composite primary key, writing the following query with Slick.
SELECT * FROM test_table WHERE (pk1, pk2) IN (("a1", "a2"), ("b1", "b2"));
I know that Slick can filter multiple conditions like
TestTableQuery.filter(t => t.pk1 === "a1" && t.pk2 === "a2")
Additionally, I know that Slick can filter on multiple values (that is, an IN clause) like
val pkSeq = Seq("a1", "b1")
TestTableQuery.filter(t => t.pk1.inSet(pkSeq))
So, I want to write something like the following, if it's possible.
val pk1AndPk2Seq = Seq(("a1", "a2"), ("b1", "b2"))
TestTableQuery.filter(t => (t.pk1, t.pk2).inSet(pk1AndPk2Seq))
Is there some way to do this? Thank you.
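As far as I know, Slick 3 has no tuple-valued inSet, but a common workaround is to expand the pair list into an OR of per-pair equality conditions. A minimal sketch, assuming a hypothetical TestTable mapping with two String columns behind TestTableQuery:
import slick.jdbc.H2Profile.api._  // substitute your own profile's api

// Hypothetical mapping for the question's test_table
class TestTable(tag: Tag) extends Table[(String, String)](tag, "test_table") {
  def pk1 = column[String]("pk1")
  def pk2 = column[String]("pk2")
  def * = (pk1, pk2)
}
val TestTableQuery = TableQuery[TestTable]

// Emulate WHERE (pk1, pk2) IN (...) by OR-ing one equality pair per tuple
val pk1AndPk2Seq = Seq(("a1", "a2"), ("b1", "b2"))
val query = TestTableQuery.filter { t =>
  pk1AndPk2Seq
    .map { case (v1, v2) => t.pk1 === v1 && t.pk2 === v2 }
    .reduceLeft(_ || _)
}
This builds the WHERE clause client-side, so it is only practical for a reasonably small list of key pairs.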

Scala Spark Filter RDD using Cassandra

I am new to Spark-Cassandra and Scala. I have an existing RDD of tuples, let's say:
(url_hash, url, created_timestamp)
I want to filter this RDD based on url_hash. If url_hash exists in the Cassandra table then I want to filter it out from the RDD so I can do processing only on the new urls.
Cassandra Table looks like following:
url_hash| url | created_timestamp | updated_timestamp
Any pointers will be great.
I tried something like this:
import java.util.Date

case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
def timestamp = new java.util.Date()

// key the incoming RDD and the Cassandra rows by url_hash, then subtract
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
I am getting a Cassandra error:
java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info.If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper
There are no null values in the Cassandra table.
Thanks, The Archetypal Paul! I hope somebody finds this useful. I had to add Option to the case class.
Looking forward to better solutions.
import java.util.Date

case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])
def timestamp = new java.util.Date()

val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)

Spark join is not working

I am trying to write a Spark join using two text data files, but my join isn't working as I expect.
val sc = new SparkContext("local[*]", "employeedata")
val employees = sc.textFile("../somewhere/employee.data")
val reputations = sc.textFile("../somewhere/reputations.data")
val employeesRdd = employees.map(x => (x.split(",")(0), x))
val reputationsRdd = reputations.map(y => (y.split(",")(0), y))
val joineddata = employeesRdd.join(reputationsRdd).map(_._2)
employee.data would be like below
emp_id, firstname,lastname,age,country,Education
reputations.data would be like below
emp_id, reputation
But the results I get look like this:
(empid,first name,last name,age,country,education,,employeeid,reputation)
whereas I need the output below:
(empid,first name,last name,age,country,education,reputation)
The extra comma after education should be removed, and the employee id in front of the reputation should also be removed.
Can anybody please help me?
Here is some pseudo-code (it might compile and even work if we're lucky!) to give you a bit of help:
// split the fields and key by id
// you could map the arrays to case classes here (see the sketch below)
val employeesRdd = employees.map(x => x.split(","))
  .keyBy(e => e(0))
val reputationsRdd = reputations.map(y => y.split(","))
  .keyBy(r => r(0))
val joineddata = employeesRdd.join(reputationsRdd)
  .map { case (key, (Array(emp_id, firstname, lastname, age, country, education), Array(employee_id, reputation))) =>
    (emp_id, firstname, lastname, age, country, education, reputation)
  }
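If you do want case classes instead of raw arrays, a minimal sketch along the same lines (the Employee and Reputation classes here are hypothetical, matching the sample file layouts above):
case class Employee(empId: String, firstName: String, lastName: String,
                    age: String, country: String, education: String)
case class Reputation(empId: String, reputation: String)

val employeesRdd = employees.map(_.split(","))
  .map(a => (a(0), Employee(a(0), a(1), a(2), a(3), a(4), a(5))))
val reputationsRdd = reputations.map(_.split(","))
  .map(a => (a(0), Reputation(a(0), a(1))))

// join on emp_id and flatten each match into one tuple per employee
val joineddata = employeesRdd.join(reputationsRdd)
  .map { case (_, (e, r)) =>
    (e.empId, e.firstName, e.lastName, e.age, e.country, e.education, r.reputation)
  }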

How to filter the data in spark-shell using scala?

I have the below data, which needs to be filtered using Spark (Scala) in such a way that I only get the id of the person who visited "Walmart" but not "Bestbuy". A store might be repeated because a person can visit a store any number of times.
Input Data:
id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy
Output Expected:
3, Walmart
I have got the output using DataFrames and running SQL queries on the Spark context. But is there any way to do this using groupByKey/reduceByKey etc. without DataFrames? Can someone help me with the code? After map -> groupByKey, a ShuffledRDD has been formed, and I am facing difficulty in filtering the CompactBuffer!
The code with which I got it using sqlContext is below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Person(id: Int, store: String)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1).trim))
people.registerTempTable("people")
val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in ('Walmart','Bestbuy') group by id having count(distinct store) = 1) sample on people.id = sample.id and people.store = 'Walmart'")
The code which I am trying now is this, but I am stuck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .map(x => (x.split(",")(0), x.split(",")(1)))
  .filter(x => !x._1.contains("id"))  // drop the header row
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map{case (x,y) =>
val url = y.flatMap(x=> x.split(",")).toList
if (!url.contains("Bestbuy") && url.contains("Walmart")){
x.map(x=> (x,y))}}
If I do dataFiltered.collect(), I am getting
Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me with how to extract the output after this step.
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
  // keep only lists that contain Walmart but do not contain Bestbuy:
  case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))
// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3, Walmart)
I also tried it another way and it worked out:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .filter(!_.contains("id"))  // drop the header line
  .map(x => (x.split(",")(0), x.split(",")(1)))
data.cache()
val dataWalmart = data.filter { case (x, y) => y.contains("Walmart") }.distinct()
val dataBestbuy = data.filter { case (x, y) => y.contains("Bestbuy") }.distinct()
val result = dataWalmart.subtractByKey(dataBestbuy)
data.unpersist()

Deduping events using hiveContext in Spark with Scala

I am trying to dedupe event records, using the hiveContext in Spark with Scala.
Converting the df to an rdd gives a compilation error saying "object Tuple23 is not a member of package scala". There is a known issue that a Scala tuple can't have 23 or more elements.
Is there any other way to dedupe?
val events = hiveContext.table("default.my_table")
val valid_events = events.select(
events("key1"),events("key2"),events("col3"),events("col4"),events("col5"),
events("col6"),events("col7"),events("col8"),events("col9"),events("col10"),
events("col11"),events("col12"),events("col13"),events("col14"),events("col15"),
events("col16"),events("col17"),events("col18"),events("col19"),events("col20"),
events("col21"),events("col22"),events("col23"),events("col24"),events("col25"),
events("col26"),events("col27"),events("col28"),events("col29"),events("epoch")
)
//events are deduped based on latest epoch time
val valid_events_rdd = valid_events.rdd.map(t => {
((t(0),t(1)),(t(2),t(3),t(4),t(5),t(6),t(7),t(8),t(9),t(10),t(11),t(12),t(13),t(14),t(15),t(16),t(17),t(18),t(19),t(20),t(21),t(22),t(23),t(24),t(25),t(26),t(28),t(29)))
})
// reduce by key so we will only get one record for every primary key
val reducedRDD = valid_events_rdd.reduceByKey((a,b) => if ((a._29).compareTo(b._29) > 0) a else b)
//Get all the fields
reducedRDD.map(r => r._1 + "," + r._2._1 + "," + r._2._2).collect().foreach(println)
Off the top of my head:
use case classes, which no longer have a size limit. Just keep in mind that case classes won't work correctly in the Spark REPL,
use Row objects directly and extract only the keys (see the sketch at the end of this answer),
operate directly on a DataFrame:
import org.apache.spark.sql.functions.{col, max}
val maxs = df
.groupBy(col("key1"), col("key2"))
.agg(max(col("epoch")).alias("epoch"))
.as("maxs")
df.as("df")
.join(maxs,
col("df.key1") === col("maxs.key1") &&
col("df.key2") === col("maxs.key2") &&
col("df.epoch") === col("maxs.epoch"))
.drop(maxs("epoch"))
.drop(maxs("key1"))
.drop(maxs("key2"))
or with a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import hiveContext.implicits._  // for the $"..." column syntax

val w = Window.partitionBy($"key1", $"key2").orderBy($"epoch".desc)  // latest epoch first
df.withColumn("rn", rowNumber.over(w)).where($"rn" === 1).drop("rn")
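And for the second suggestion (Row objects, extracting only the keys), a minimal sketch that never builds a wide tuple; it assumes epoch is a Long column in the valid_events DataFrame above (adjust the getAs type if it is a timestamp or string):
import org.apache.spark.sql.Row

// key each Row by (key1, key2) and keep the Row with the largest epoch
val dedupedRows = valid_events.rdd
  .map(row => ((row.getAs[Any]("key1"), row.getAs[Any]("key2")), row))
  .reduceByKey { (a, b) =>
    if (a.getAs[Long]("epoch") >= b.getAs[Long]("epoch")) a else b
  }
  .values

dedupedRows.take(10).foreach(println)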