I am new to Spark. Could someone help me find a way to combine two RDDs into a final RDD as per the logic below, in Scala, preferably without using SQLContext (DataFrames)?
RDD1 = column1, column2, column3 (362825 records)
RDD2 = column2_distinct (the same column2 as in RDD1, but with distinct values), column4 (2621 records)
Final RDD = column1, column2, column3, column4
Example:
RDD1 =
userid | progid | rating
a      | 001    | 5
b      | 001    | 3
b      | 002    | 4
c      | 003    | 2
RDD2 =
progid (distinct) | id
001               | 1
002               | 2
003               | 3
Final RDD =
userid | progid | id | rating
a      | 001    | 1  | 5
b      | 001    | 1  | 3
b      | 002    | 2  | 4
c      | 003    | 3  | 2
Code:
val rawRdd1 = pairrdd1.map(x => x._1.split(",")(0) + "," + x._1.split(",")(1) + "," + x._2) //362825 records
val rawRdd2 = pairrdd2.map(x => x._1 + "," + x._2) //2621 records
val schemaString1 = "userid programid rating"
val schemaString2 = "programid id"
val fields1 = schemaString1.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val fields2 = schemaString2.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema1 = StructType(fields1)
val schema2 = StructType(fields2)
val rowRDD1 = rawRdd1.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2)))
val rowRDD2 = rawRdd2.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1)))
val DF1 = sparkSession.createDataFrame(rowRDD1, schema1)
val DF2 = sparkSession.createDataFrame(rowRDD2, schema2)
DF1.createOrReplaceTempView("df1")
DF2.createOrReplaceTempView("df2")
val resultDf = DF1.join(DF2, Seq("programid"))
val DF3 = sparkSession.sql("""SELECT df1.userid, df1.programid, df2.id, df1.rating FROM df1 JOIN df2 on df1.programid == df2.programid""")
println(DF1.count()) //362825 records
println(DF2.count()) //2621 records
println(DF3.count()) //only 297 records
I am expecting the same number of records as DF1, with a new column (id) attached from DF2 holding the id value that corresponds to each programid.
It is a bit ugly but should work (Spark 2.0):
val rdd1 = sparkSession.sparkContext.parallelize(List("a,001,5", "b,001,3", "b,002,4","c,003,2"))
val rdd2 = sparkSession.sparkContext.parallelize(List("001,1", "002,2", "003,3"))
val groupedRDD1 = rdd1.map(x => (x.split(",")(1),x))
val groupedRDD2 = rdd2.map(x => (x.split(",")(0),x))
val joinRDD = groupedRDD1.join(groupedRDD2)
// convert back to String
val cleanJoinRDD = joinRDD.map(x => x._1 + "," + x._2._1.replace(x._1 + ",","") + "," + x._2._2.replace(x._1 + ",",""))
cleanJoinRDD.collect().foreach(println)
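A slightly tidier variant of the same join (just a sketch on the same sample RDDs) keeps the parsed fields as tuples instead of rebuilding comma-separated strings:
// Key both RDDs by progid after parsing each line once
val byProg1 = rdd1.map { line =>
  val Array(userid, progid, rating) = line.split(",")
  (progid, (userid, rating))
}
val byProg2 = rdd2.map { line =>
  val Array(progid, id) = line.split(",")
  (progid, id)
}
// Join on progid, then reorder into (userid, progid, id, rating)
val finalRdd = byProg1.join(byProg2).map {
  case (progid, ((userid, rating), id)) => (userid, progid, id, rating)
}
finalRdd.collect().foreach(println)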
I think the better option is to use Spark SQL.
First of all, why do you split, concatenate and split the row again? You could just do it in one step:
val rowRdd1 = pairrdd1.map { x =>
  val Array(userid, progid) = x._1.split(",")  // pattern match on the Array returned by split, not on a tuple
  val rating = x._2
  Row(userid, progid, rating)
}
My guess is that your problem might be that there are some additional characters in your keys, so they don't match in the join. A simple approach would be to do a left join and inspect the rows where they don't match, as in the sketch below.
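For example, a quick sketch (reusing DF1 and DF2 from the question) that surfaces the keys that fail to match:
import org.apache.spark.sql.functions.col
// Keep every row of DF1; rows with no match in DF2 end up with a null id
val inspectDf = DF1.join(DF2, Seq("programid"), "left")
// These are the programid values that did not find a partner in DF2
inspectDf.filter(col("id").isNull)
  .select("programid")
  .distinct()
  .show(50, false)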
It could be something like an extra space in the rows, which you could fix like this for both RDDs:
val rowRdd1 = pairrdd1.map { x =>
  val Array(userid, progid) = x._1.split(",").map(_.trim)
  val rating = x._2
  Row(userid, progid, rating)
}
Related
How do I take the average of the columns in an array cols over the non-null values in a DataFrame df? I can do this over all columns, but it gives null when any of the values is null.
val cols = Array($"col1", $"col2", $"col3")
df.withColumn("avgCols", cols.foldLeft(lit(0)){(x, y) => x + y} / cols.length)
I don't want to use na.fill because I want to preserve the true average.
I guess you can do something like this:
val cols = Array("col1", "col2", "col3")
def countAvg =
  udf((data: Row) => {
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / notNullIndices.length
  })
df.withColumn("seqNull", struct(cols.map(col): _*))
  .withColumn("avg", countAvg(col("seqNull")))
  .show(truncate = false)
But be careful: here the average is computed only over the non-null elements.
If you need the denominator to be the total number of columns, as in your original code:
val cols = Array("col1", "col2", "col3")
def countAvg =
udf((data: Row) => {
val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
notNullIndices.map(i => data.getDouble(i)).sum / cols.lenght
})
df.withColumn("seqNull", struct(cols.map(col): _*))
.withColumn("avg", countAvg(col("seqNull")))
.show(truncate = false)
The aggregate function (available in the Scala API since Spark 3.0) can do it without a UDF:
val cols = Array($"col1", $"col2", $"col3")
df.withColumn(
"avgCols",
aggregate(
cols,
struct(lit(0).alias("sum"), lit(0).alias("count")),
(acc, x) => struct((acc("sum") + coalesce(x, lit(0))).alias("sum"), (acc("count") + coalesce(x.cast("boolean").cast("int"), lit(0))).alias("count")),
(s) => s("sum") / s("count")
)
)
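A minimal way to sanity-check it (a sketch with made-up data, assuming spark.implicits._ is in scope):
// Hypothetical test frame: one row without nulls, one with a null in col2
val testDf = Seq[(Option[Double], Option[Double], Option[Double])](
  (Some(1.0), Some(2.0), Some(3.0)),
  (Some(4.0), None, Some(2.0))
).toDF("col1", "col2", "col3")
// Applying the withColumn expression above to testDf gives avgCols = 2.0 for the first row
// and 3.0 for the second: the null is excluded from both the sum and the count.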
I am trying to fetch rows from a lookup table (3 rows and 3 columns), iterate row by row, and pass the values in each row to Spark SQL as parameters.
DB | TBL | COL
----------------
db | txn | ID
db | sales | ID
db | fee | ID
I tried this in the Spark shell for one row and it worked, but I am finding it difficult to iterate over the rows.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val db_name:String = "db"
val tbl_name:String = "transaction"
val unique_col:String = "transaction_number"
val dupDf = sqlContext.sql(s"select count(*), transaction_number from $db_name.$tbl_name group by $unique_col having count(*)>1")
Please let me know how I can iterate over the rows and pass the values as parameters.
The other two approaches here may be right in general, but somehow I don't like collecting the data, for performance reasons, especially if the data is huge.
org.apache.spark.util.CollectionAccumulator is the right candidate for this kind of requirement (see the docs).
Also, if the data is huge, foreachPartition is the right candidate, again for performance reasons.
Below is the implementation:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.CollectionAccumulator
import scala.collection.JavaConversions._
import scala.collection.mutable
object TableTest extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
val spark = SparkSession.builder.appName(getClass.getName)
.master("local[*]").getOrCreate
import spark.implicits._
val lookup =
Seq(("db", "txn", "ID"), ("db", "sales", "ID")
, ("db", "fee", "ID")
).toDF("DB", "TBL", "COL")
val collAcc: CollectionAccumulator[String] = spark.sparkContext.collectionAccumulator[String]("mySQL Accumulator")
lookup.foreachPartition { partition =>
  partition.foreach { record =>
    val selectString = s"select count(*), transaction_number from ${record.getAs[String]("DB")}.${record.getAs[String]("TBL")} group by ${record.getAs[String]("COL")} having count(*)>1"
    collAcc.add(selectString)
    println(selectString)
  }
}
val mycollectionOfSelects: mutable.Seq[String] = asScalaBuffer(collAcc.value)
val finaldf = mycollectionOfSelects.map { x => spark.sql(x)
}.reduce(_ union _)
finaldf.show
}
Sample Result :
[2019-08-13 12:11:16,458] WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (org.apache.hadoop.util.NativeCodeLoader:62)
select count(*), transaction_number from db.txn group by ID having count(*)>1
select count(*), transaction_number from db.sales group by ID having count(*)>1
select count(*), transaction_number from db.fee group by ID having count(*)>1
Note: since those are pseudo tables, I have not displayed the resulting DataFrame.
val lookup =
Seq(("db", "txn", "ID"), ("db", "sales", "ID")).toDF("DB", "TBL", "COL")
val data = lookup
.collect()
.map(
x =>
(x.getAs[String]("DB"), x.getAs[String]("TBL"), x.getAs[String]("COL"))
)
.map(
y =>
sparkSession.sql(
s"select count(*), transaction_number from ${y._1}.${y._2} group by ${y._3} having count(*)>1"
)
)
.reduce(_ union _)
Change the DF into arrays. From that point you can iterate through the string objects and build the string input query for the spark.sql command. Below I give a quick excerpt of how you would do it; however, it is fairly complex.
//Pull in the needed columns, remove all duplicates
val inputDF = spark.sql("select * from " + dbName + "." + tableName). selectExpr("DB", "TBL", "COL").distinct
//Hold all of the columns as arrays
////dbArray(0) is the first element of the DB column
////dbArray(n-1) is the last element of the DB column
val dbArray = inputDF.selectExpr("DB").rdd.map(x=>x.mkString).collect
val tableArray = inputDF.selectExpr("TBL").rdd.map(x=>x.mkString).collect
val colArray = inputDF.selectExpr("COL").rdd.map(x=>x.mkString).collect
//Need to hold all the dataframe objects and values as we build insert and union them as we progress through loop
var dupDF = spark.sql("select 'foo' as bar")
var interimDF = dupDF
var initialDupDF = dupDF
var iterator = 1
//Run until we reach end of array
while (iterator <= dbArray.length)
{
//on each run insert the array elements into string call
initialDupDF = spark.sql("select count(*), transaction_number from " + dbArray(iterator - 1) + "." + tableArray(iterator - 1) + " group by " + colArray(iterator - 1) + " having count(*)>1")
//on run 1 overwrite the variable, else union
if (iterator == 1) {
interimDF = initialDupDF
} else {
interimDF = dupDF.unionAll(initialDupDF)
}
//Carry the accumulated result forward so the next iteration unions onto it
dupDF = interimDF
iterator = iterator + 1
}
I have data in one RDD and the data is as follows:
scala> c_data
res31: org.apache.spark.rdd.RDD[String] = /home/t_csv MapPartitionsRDD[26] at textFile at <console>:25
scala> c_data.count()
res29: Long = 45212
scala> c_data.take(2).foreach(println)
age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
I want to split the data into another RDD and I am using:
scala> val csv_data = c_data.map{x=>
| val w = x.split(";")
| val age = w(0)
| val job = w(1)
| val marital_stat = w(2)
| val education = w(3)
| val default = w(4)
| val balance = w(5)
| val housing = w(6)
| val loan = w(7)
| val contact = w(8)
| val day = w(9)
| val month = w(10)
| val duration = w(11)
| val campaign = w(12)
| val pdays = w(13)
| val previous = w(14)
| val poutcome = w(15)
| val Y = w(16)
| }
That returns:
csv_data: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[28] at map at <console>:27
When I query csv_data it returns Array((),....).
How can I get the data with the first row as a header and the rest as data?
Where am I going wrong?
Thanks in advance.
Your mapping function returns Unit, so you map to an RDD[Unit]. You can get a tuple of your values by changing your code to
val csv_data = c_data.map{x=>
val w = x.split(";")
...
val Y = w(16)
(w, age, job, marital_stat, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome, Y)
}
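As for the header part of the question, a common pattern (a sketch, not specific to this dataset) is to grab the first line and filter it out before parsing:
// The first element of the RDD is the header line
val header = c_data.first()
// Drop it, then build csv_data from the remaining lines as above
val rows = c_data.filter(line => line != header)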
I can only think of one way, using withColumn():
val df = sc.dataFrame.withColumn('newcolname',{ lambda row: row + 1 } )
but how would I generalize this to text data? For example, if my DataFrame had
string values, say "This is an example of a string", and I wanted to extract the
first and last word, as in val arraystring: Array[String] = Array(first, last)?
Is this the thing you're looking for?
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val extractFirstWord = udf((sentence: String) => sentence.split(" ").head)
val extractLastWord = udf((sentence: String) => sentence.split(" ").reverse.head)
val sentences = sc.parallelize(Seq("This is an example", "And this is another one", "One_word", "")).toDF("sentence")
val splits = sentences
.withColumn("first_word", extractFirstWord(col("sentence")))
.withColumn("last_word", extractLastWord(col("sentence")))
splits.show()
Then the output is:
+--------------------+----------+---------+
| sentence|first_word|last_word|
+--------------------+----------+---------+
| This is an example| This| example|
|And this is anoth...| And| one|
| One_word| One_word| One_word|
| | | |
+--------------------+----------+---------+
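If you would rather avoid UDFs, roughly the same result can be had with the built-in split function (a sketch; element_at requires Spark 2.4 or later):
import org.apache.spark.sql.functions.{col, element_at, split}
val splitsNoUdf = sentences
  .withColumn("words", split(col("sentence"), " "))
  .withColumn("first_word", element_at(col("words"), 1))   // element_at is 1-based
  .withColumn("last_word", element_at(col("words"), -1))   // a negative index counts from the end
  .drop("words")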
# Create a simple DataFrame, stored into a partition directory
df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))\
.map(lambda i: Row(single=i, double=i * 2)))
df1.save("data/test_table/key=1", "parquet")
# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
.map(lambda i: Row(single=i, triple=i * 3)))
df2.save("data/test_table/key=2", "parquet")
# Read the partitioned table
df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html
I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more DataFrames) into something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
An operation like this is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work you have to match both the number of partitions and the number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
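For the sample a and b above, the zipped frame should then look like this:
ab.show()
// +--------+--------+--------+
// |column_1|column_2|column_3|
// +--------+--------+--------+
// |     abc|     123|       1|
// |     cde|      23|       2|
// +--------+--------+--------+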
If the above conditions are not met, the only option that comes to mind is adding an index and joining:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's implementation of DataFrames, there is no simple way to concatenate two DataFrames into one. We can work around this limitation by adding an index to each row of the DataFrames and then doing an inner join on these indices. This is my stub code for this implementation:
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id", monotonically_increasing_id())
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonically_increasing_id())
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala, but if, like me, you need to know how to do this in PySpark, then try the Python code below. Like @zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
    CombinedRow = Row(*left.columns + right.columns)

    def flattenRow(row):
        left = row[0]
        right = row[1]
        combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
        return CombinedRow(*combinedVals)

    zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
    combinedSchema = StructType(left.schema.fields + right.schema.fields)
    return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)