I have a dataset containing a Customer_ID and the movies that each customer has seen.
I am analyzing patterns across the movies, e.g. if customer X sees movie Y then he also sees movie Z.
I have already grouped my dataset by Customer_ID; here is a sample of the data:
Customer_ID,Movie_ID
1, 2,1,3
2, 1
3, 3,6,8
What I want is to ignore the Customer_ID column and keep only the lists of movies, like this:
2,1,3
1
3,6,8
How can I do this? My code is:
val data = sc.textFile("FILE")
// These imports are needed when running outside spark-shell
import spark.implicits._
import org.apache.spark.sql.functions._

case class Movies(Customer_ID: String, Movie_ID: String)
def csvToMyClass(line: String) = {
  val split = line.split(',')
  Movies(split(0), split(1))
}

val df = data.map(csvToMyClass).toDF("Customer_ID", "Movie_ID")
df.show
val movies = df.groupBy(col("Customer_ID"))
  .agg(collect_list(col("Movie_ID")) as "Movie_ID")
  .withColumn("Movie_ID", concat_ws(",", col("Movie_ID")))
  .rdd
Thanks
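For what it's worth, one way to end up with only the comma-separated movie lists is simply not to select the Customer_ID column. A minimal sketch building on the df defined above:
// Keep only the collected Movie_ID lists; Customer_ID is dropped by the select.
val movieLists = df.groupBy(col("Customer_ID"))
  .agg(collect_list(col("Movie_ID")) as "Movie_ID")
  .select(concat_ws(",", col("Movie_ID")) as "Movie_ID")

movieLists.show()  // 2,1,3 / 1 / 3,6,8 (row order may vary)

// If a plain RDD[String] is needed instead of a DataFrame:
val movieListRdd = movieLists.rdd.map(_.getString(0))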
Related
I am trying to implement the below logic.
1. Taking some records from one table.
2. Based on the resultant data, using one loop.
3. Then, inside the loop, taking data from other tables into two different dataframes.
4. Joining these two dataframes and loading the data into a 3rd table.
val id_chck1 = s"select distinct id, id1, id2 from table WHERE type = 'N'"
val id_chck = hive.executeQuery(id_chck1)
for (data <- id_chck) {
  var id = data(0)
  var id1 = data(1)
  var id2 = data(2)
  val values_1 = s"select distinct bill, bil_num, id_num, bill_date, process_date from table l WHERE id2 = '222'"
  val values_1_data = hive.executeQuery(values_1)
  for (row <- values_1_data.collect) {
    val bill = row.mkString(",").split(",")(0)
    val bil_num = row.mkString(",").split(",")(1)
    val id_num = row.mkString(",").split(",")(2)
    val bill_date = row.mkString(",").split(",")(3)
    val df1 = s"select column name from tablename where id=222"
    val df1_data = hive.executeQuery(df1)
    val df2 = s"select column name from tablename2 where id=222"
    val df2_data = hive.executeQuery(df2)
    val df3 = df1_data.join(df2_data) // "joining df1 and df2" -- join condition not shown
    df3.write.format("orc").mode("Append").save("hdfslocation")
  }
}
val load1 = s"load data inpath 'hdfslocation' into table tablename"
val load1_data = hive.executeUpdate(load1)
But this process is taking 6+ hours. Is there any other way of doing the same thing so it can complete in a shorter time, e.g. using RDDs or setting some Spark/Hive properties to improve performance?
I have 500,000 records in the test1 table.
Could you please add the input and expected output as an example?
It's hard to see what exactly you're trying to achieve.
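As a general sketch only (the table and column names below are placeholders, and it assumes hive.executeQuery returns a DataFrame as in the snippet above): the usual cause of runtimes like this is collecting rows to the driver and issuing one query per row; reading each table once and joining lets Spark do the work in a single distributed job.
// Placeholder queries: read each table once instead of once per row.
val ids     = hive.executeQuery("select distinct id, id1, id2 from table where type = 'N'")
val values1 = hive.executeQuery("select distinct bill, bil_num, id_num, bill_date, process_date, id2 from table_l")
val t1      = hive.executeQuery("select * from tablename")
val t2      = hive.executeQuery("select * from tablename2")

// One set of distributed joins instead of N driver-side queries (join keys are placeholders).
val joined = ids
  .join(values1, Seq("id2"))
  .join(t1, Seq("id"))
  .join(t2, Seq("id"))

joined.write.format("orc").mode("Append").save("hdfslocation")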
I am new to Scala and Spark. Can someone optimize the Scala code below for finding the maximum marks scored by students each year?
val m = sc.textFile("marks.csv")
val SumOfMarks = m.map(_.split(","))
  .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  .map(l => ((l(0), l(1)), l(3).toInt))
  .reduceByKey(_ + _)
  .sortBy(line => (line._1._1, line._2), ascending = false)
var s: Int = 0
var y: String = "0"
for (i <- SumOfMarks) {
  if ((i._1._1 != y) || (i._2 == s && i._1._1 == y)) { println(i); s = i._2; y = i._1._1 }
}
Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98
Output :
2017,raghu,288
2017,rajesh,288
2016,raj,278
I am not sure what you mean exactly by "Optimised", but a more "scala-y" and "spark-y" way of doing this might be as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header","true").csv("marks.csv")
// Calculate the total marks for each student in each year. The new total mark column will be called "totMark"
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))
// Rank the marks within each year. Highest Mark will get rank 1, second highest rank 2 and so on.
// A benefit of rank is that if two scores have the same mark, they will both get the
// same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))
// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student")
topStudents.show
This will produce the following output in Spark-shell:
+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016| raj| 278.0| 1|
|2017| raghu| 288.0| 1|
|2017| rajesh| 288.0| 1|
+----+-------+-------+----+
If you need a CSV displayed as per your question, you can use:
topStudents.collect.map(_.mkString(",")).foreach(println)
which produces:
2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
I have broken the process up into individual steps. This allows you to see what is going on at each step by simply running show on an intermediate result. For example, to see what the spark.read.option... line does, simply enter marksDF.show into spark-shell.
Since OP wanted an RDD version, here is one example. Probably it is not optimal, but it does give the correct result:
import org.apache.spark.rdd.RDD
// A Helper function which makes it slightly easier to view RDD content.
def dump[R] (rdd : RDD[R]) = rdd.collect.foreach(println)
val marksRdd = sc.textFile("marks.csv")
// A case class to annotate the content in the RDD
case class Report(year:Int, student:String, sub:String, mark:Int)
// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}.map(r => Report(r(0).toInt,r(1),r(2),r(3).toInt))
// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))
// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map{ case (key, marks:Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark))}
// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map{ case (key, score:Int) => (key._1, score) }.reduceByKey(math.max(_, _))
// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key, which is a Tuple2(year, score).
// To achieve this, both RDD's must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map{ case (key, score) => ((key._1, score), key._2)}.join (yearScoreHighest.map (k => (k, "x")))
// Finally extract the year, student name and score from the joined RDD
// Sort by year and name.
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))
// Show the final result.
dump(result)
The result of the above is:
(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)
As before, you can view the intermediate RDDs simply by dumping them using the dump function. NB: the dump function takes an RDD. If you want to show the content of a DataFrame or Dataset, use its show method.
There is probably a more optimal solution than the one above, but it does the job.
Hopefully the RDD version will encourage you to use DataFrames and/or DataSets if you can. Not only is the code simpler, but:
Spark analyses DataFrame and DataSet transformations and can optimise the overall transformation process. RDD operations are not optimised this way (they are executed one after another as written), so DataFrame and DataSet based processes will likely run faster (assuming you don't manually optimise the RDD equivalent).
DataSets and DataFrames enforce schemas to varying degrees (e.g. named columns and data typing).
DataFrames and DataSets can be queried using SQL.
DataFrame and DataSet operations/methods are more aligned with SQL constructs.
DataFrames and DataSets are easier to use than RDDs.
DataSets (and RDDs) offer compile-time error detection (see the short sketch after this list).
DataSets are the future direction.
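As a small illustration of the compile-time checking point, here is a sketch of the same aggregation with the typed Dataset API, reusing the marks.csv layout above:
import spark.implicits._

// Columns read from a header CSV are strings; the case class gives them checked names and types.
case class MarkRow(year: String, student: String, sub: String, marks: String)

val marksDS = spark.read.option("header", "true").csv("marks.csv").as[MarkRow]

// A typo such as m.mark instead of m.marks is a compile-time error here,
// whereas a misspelled column name in a DataFrame expression only fails at runtime.
val totalsDS = marksDS
  .groupByKey(m => (m.year, m.student))
  .mapValues(_.marks.toInt)
  .reduceGroups(_ + _)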
Check out these couple of links for more information:
https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/
https://medium.com/@sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9
or simply google "spark should i use rdd or dataframe"
All the best with your project.
Try it in the Scala spark-shell:
scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.registerTempTable("record")
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show
It will generate the following table
+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016| raj|278.0|
|2017| raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
//Finding Max sum of marks each year
object Marks2 {
def getSparkContext() = {
val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
val sc = new SparkContext(conf)
sc
}
def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)
def main(args: Array[String]): Unit = {
// System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
val sc = getSparkContext()
val inpRDD = sc.textFile("marks.csv")
val head = inpRDD.first()
val marksRdd = inpRDD.filter(record=> !record.equals(head)).map(rec => rec.split(","))
val marksByNameyear = marksRdd.map(rec =>((rec(0).toInt,rec(1)),rec(3).toInt))
//marksByNameyear.cache()
val aggMarksByYearName = marksByNameyear.reduceByKey(_+_)
val maxMarksByYear = aggMarksByYearName.map(s => (s._1._1, s._2)).reduceByKey(math.max(_, _))
val markYearName = aggMarksByYearName.map(s => (s._2.toInt,s._1._2))
val marksAndYear = maxMarksByYear.map(s => (s._2.toInt,s._1))
val tt = sc.broadcast(marksAndYear.collect().toMap)
marksAndYear.flatMap {case(key,value) => tt.value.get(key).map {other => (other,value, key) } }
val yearMarksName = marksAndYear.leftOuterJoin(markYearName)
val result = yearMarksName.map(s =>(s._2._1,s._2._2,s._1)).sortBy(f=>f._3, true)
// dump(markYearName)
dump(result)
}
}
I have a spark dataset like this:
+--------+--------------------+
| uid| recommendations|
+--------+--------------------+
|41344966|[[2133, red]...|
|41345063|[[11353, red...|
|41346177|[[2996, yellow]...|
|41349171|[[8477, green]...|
res98: org.apache.spark.sql.Dataset[userItems] = [uid: int, recommendations: array<struct<iid:int,color:string>>]
I want to filter each recommendations array so that it contains only the first two of each color. Pseudo example:
[(13,'red'), (4,'green'), (8,'red'), (2,'red'), (10, 'yellow')]
would become
[(13,'red'), (4,'green'), (8,'red'), (10, 'yellow')]
How can I do this efficiently in Scala with Datasets? Is there an elegant solution using something like reduceGroups?
What I have so far:
case class itemData (iid: Int, color: String)
val filterList = (recs: Array[itemData], filterAttribute: String, maxCount: Int) => {
  // filter the list somehow... using the max count and attribute
}
dataset.map(d => filterList(d.recommendations, "color", 2))
You can explode the recommendations, then create a row number partitioned by uid and color, and finally filter out the row numbers greater than 2. The code should look like below. I hope it helps.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Creating test data
val df = Seq((13,"red"), (4,"green"), (8,"red"), (2,"red"), (10, "yellow")).toDF("iid", "color")
.withColumn("uid", lit(41344966))
.groupBy("uid").agg(collect_list(struct("iid", "color")).as("recommendations"))
df.show(false)
+--------+----------------------------------------------------+
|uid |recommendations |
+--------+----------------------------------------------------+
|41344966|[[13,red], [4,green], [8,red], [2,red], [10,yellow]]|
+--------+----------------------------------------------------+
val filterDF = df.withColumn("rec", explode(col("recommendations")))
.withColumn("iid", col("rec.iid"))
.withColumn("color", col("rec.color"))
.drop("recommendations", "rec")
.withColumn("rownum",
row_number().over(Window.partitionBy("uid", "color").orderBy(col("iid").desc)))
.filter(col("rownum") <= 2)
.groupBy("uid").agg(collect_list(struct("iid", "color")).as("recommendations"))
filterDF.show(false)
+--------+-------------------------------------------+
|uid |recommendations |
+--------+-------------------------------------------+
|41344966|[[4,green], [13,red], [8,red], [10,yellow]]|
+--------+-------------------------------------------+
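If the requirement is literally the first two of each color in the original array order, a typed Dataset map is another option. This is only a sketch: it assumes case classes like the ones below and a Dataset[userItems] named ds, matching the shape shown in the question.
import spark.implicits._

case class itemData(iid: Int, color: String)
case class userItems(uid: Int, recommendations: Seq[itemData])

// Keep at most maxCount items per color, preserving the original array order.
def firstNPerColor(recs: Seq[itemData], maxCount: Int): Seq[itemData] = {
  val seen = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
  recs.filter { item =>
    seen(item.color) += 1
    seen(item.color) <= maxCount
  }
}

// ds is assumed to be a Dataset[userItems] like the one in the question.
val filtered = ds.map(u => userItems(u.uid, firstNPerColor(u.recommendations, 2)))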
I have two DataFrames in my code with exactly the same dimensions, say 1,000,000 x 50. I need to add the corresponding values in both DataFrames. How can I achieve that?
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other, more elegant way?
Thanks.
Your approach is good. Another option can be to take the RDDs, zip them together and then iterate over them to sum the columns and create a new DataFrame using either of the original DataFrame schemas.
Assuming the data types for all the columns are integer, this code snippet should work. Please note that this was done in Spark 2.1.0.
import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._
val a: DataFrame = spark.sparkContext.parallelize(Seq(
(1, 2),
(3, 6)
)).toDF("column_1", "column_2")
val b: DataFrame = spark.sparkContext.parallelize(Seq(
(3, 4),
(1, 5)
)).toDF("column_1", "column_2")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for(i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
ab.show()
Update:
So, I tried to experiment with the performance of my solution vs yours. I tested with 100,000 rows, each with 50 columns. In the case of your approach there are 51 columns, the extra one being the ID column. On a single machine (no cluster), my solution seems to work a bit faster.
The union and group-by approach takes about 5598 milliseconds.
Whereas my solution takes about 5378 milliseconds.
My assumption is that the first solution takes a bit more time because of the union operation on the two DataFrames.
Here are the methods which I created for testing the approaches.
def option_1()(implicit spark: SparkSession): Unit = {
import spark.implicits._
val a: DataFrame = getDummyData(withId = true)
val b: DataFrame = getDummyData(withId = true)
val allData = a.union(b)
val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
println(result.count())
// result.show()
}
def option_2()(implicit spark: SparkSession): Unit = {
val a: DataFrame = getDummyData()
val b: DataFrame = getDummyData()
// Merge rows
val rows = a.rdd.zip(b.rdd).map {
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for (i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val result: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
println(result.count())
// result.show()
}
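The getDummyData helper is not shown above; purely as an assumption about its shape, a minimal sketch might look like this (50 random integer columns and an optional id column):
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical helper: 100,000 rows of 50 integer columns, plus an optional id column.
def getDummyData(withId: Boolean = false)(implicit spark: SparkSession): DataFrame = {
  val numCols = 50
  val valueCols = (1 to numCols).map(c => (rand() * 100).cast("int").as(s"column_$c"))
  val cols = if (withId) col("id") +: valueCols else valueCols
  spark.range(100000).select(cols: _*) // spark.range supplies the "id" column
}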
I have two DataFrames, a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more DataFrames) into something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
An operation like this is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work you have to match both the number of partitions and the number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If the above conditions are not met, the only option that comes to mind is adding an index and joining:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's implementation of DataFrames, there is no simple way to concatenate two DataFrames into one. We can simply work around this limitation by adding indices to each row of the DataFrames. Then, we can do an inner join on these indices. This is my stub code of this implementation:
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id",monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id",monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala but if, like me, you need to know how to do this in pyspark then try the Python code below. Like @zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
CombinedRow = Row(*left.columns + right.columns)
def flattenRow(row):
left = row[0]
right = row[1]
combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
return CombinedRow(*combinedVals)
zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
combinedSchema = StructType(left.schema.fields + right.schema.fields)
return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)