Append a row to a pair RDD in Spark - Scala

I have a pair RDD with existing values such as:
(1,2)
(3,4)
(5,6)
I want to append a row (7,8) to this RDD.
How can I append to an existing RDD in Spark?

You can use the union operation.
scala> val rdd1 = sc.parallelize(List((1,2), (3,4), (5,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List((7, 8)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> val unionOfTwo = rdd1.union(rdd2)
unionOfTwo: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[2] at union at <console>:28
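Note that RDDs are immutable, so union does not modify rdd1 in place; it returns a new RDD that you keep a reference to. For completeness, collecting the result (reusing unionOfTwo from above) shows the appended row:
unionOfTwo.collect()
// Array((1,2), (3,4), (5,6), (7,8))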

Related

Spark: Scala - apply function to a list in a DataFrame [duplicate]

I am trying to apply a sum function to each cell of a column of a DataFrame in Spark. Each cell contains a list of integers that I would like to add up.
However, the error I am getting is:
console:357: error: value sum is not a member of
org.apache.spark.sql.ColumnName
for the example script below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, List(1,2,3)),
  (1, List(2,2,3)),
  (2, List(3,2,3)))).toDF("Id", "col_1")
val test = df.withColumn( "col_2", $"col_1".sum )
test.show()
You can define a UDF.
scala> def sumFunc(a: Seq[Int]): Int = a.sum
sumFunc: (a: Seq[Int])Int
scala> val sumUdf = udf(sumFunc(_: Seq[Int]))
sumUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(ArrayType(IntegerType,false))))
scala> val test = df.withColumn( "col_2", sumUdf($"col_1") )
test: org.apache.spark.sql.DataFrame = [Id: int, col_1: array<int> ... 1 more field]
scala> test.collect
res0: Array[org.apache.spark.sql.Row] = Array([0,WrappedArray(1, 2, 3),6], [1,WrappedArray(2, 2, 3),7], [2,WrappedArray(3, 2, 3),8])
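If you are on Spark 2.4 or later, a UDF is not strictly required: the built-in aggregate higher-order function can sum the array column through a SQL expression (a minimal sketch, assuming the same df as above):
import org.apache.spark.sql.functions.expr
val test2 = df.withColumn("col_2", expr("aggregate(col_1, 0, (acc, x) -> acc + x)"))
test2.show()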

Scala - How to get the common part of two RDDs?

There are two RDDs:
val rdd1 = sc.parallelize(List(("aaa", 1), ("bbb", 4), ("ccc", 3)))
val rdd2 = sc.parallelize(List(("aaa", 2), ("bbb", 5), ("ddd", 2)))
If I want to join them by the first field and get a result like:
List(("aaa", 1, 2), ("bbb", 4, 5))
What should I write? Thanks!
You can join the RDDs and map the result to the wanted data structure:
val resultRDD = rdd1.join(rdd2).map {
  case (k: String, (v1: Int, v2: Int)) => (k, v1, v2)
}
// resultRDD: org.apache.spark.rdd.RDD[(String, Int, Int)] = MapPartitionsRDD[53] at map at <console>:32
resultRDD.collect
// res1: Array[(String, Int, Int)] = Array((aaa,1,2), (bbb,4,5))
As both RDDs are of type RDD[(String, Int)], you can simply use join to join them, and you will get an RDD[(String, (Int, Int))]. Now, as you want a List[(String, (Int, Int))], you need to collect the joined RDD (not recommended if the joined RDD is huge) and convert it to a List. Try the following code:
val rdd1: RDD[(String, Int)] = sc.parallelize(List(("aaa", 1), ("bbb", 4), ("ccc", 3)))
val rdd2: RDD[(String, Int)] = sc.parallelize(List(("aaa", 2), ("bbb", 5), ("ddd", 2)))
//simply join two RDDs
val joinedRdd: RDD[(String, (Int, Int))] = rdd1.join(rdd2)
//only if you want List then collect it (It is not recommended for huge RDDs)
val lst: List[(String, (Int, Int))] = joinedRdd.collect().toList
println(lst)
//output
//List((bbb,(4,5)), (aaa,(1,2)))
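Note that the collected shape here is (String, (Int, Int)); if you want the flat (String, Int, Int) tuples from the question, map before collecting, as in the first answer (a small sketch reusing joinedRdd from above):
val flatLst: List[(String, Int, Int)] = joinedRdd.map { case (k, (v1, v2)) => (k, v1, v2) }.collect().toList
println(flatLst)
//List((bbb,4,5), (aaa,1,2)) -- ordering may vary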

Filtering RDD1 on the basis of RDD2

I have 2 RDDs in the below format:
RDD1
178,1
156,1
23,2
RDD2
34
178
156
Now I want to filter rdd1 on the basis of the values in rdd2, i.e. if 178 is present in rdd1 and also in rdd2, then it should return those tuples from rdd1.
I have tried:
val out = reversedl1.filter({ case(x,y) => x.contains(lines)})
where lines is my second RDD and reversedl1 is the first, but it's not working.
I also tried:
val abce = reversedl1.subtractByKey(lines)
val defg = reversedl1.subtractByKey(abce)
This is also not working.
Any suggestions?
You can convert rdd2 to key-value pairs and then join with rdd1 on the keys:
val rdd1 = sc.parallelize(Seq((178, 1), (156, 1), (23, 2)))
val rdd2 = sc.parallelize(Seq(34, 178, 156))
(rdd1.join(rdd2.distinct().map(k => (k, null)))
  // create a dummy value to form a pair RDD so you can join
  .map { case (k, (v1, v2)) => (k, v1) } // drop the dummy value
).collect
// res11: Array[(Int, Int)] = Array((156,1), (178,1))
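If rdd2 is small enough to fit in memory, an alternative is to broadcast its values as a Set and filter rdd1 without a shuffle (a sketch, assuming the same rdd1 and rdd2 as above):
val keys = sc.broadcast(rdd2.collect().toSet) // ship the small key set to all executors
val filtered = rdd1.filter { case (k, _) => keys.value.contains(k) }
filtered.collect()
// Array((178,1), (156,1))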

How to convert a DataFrame to an RDD without changing the partitioning?

For some reason I have to convert an RDD to a DataFrame, then do something with the DataFrame, but my interface is an RDD, so I have to convert the DataFrame back to an RDD. When I use df.rdd, the number of partitions changes to 1, so I have to repartition and sortBy the RDD. Is there any cleaner solution? Thanks!
This is my attempt:
val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val partition=rdd.getNumPartitions
val sqlContext = new SQLContext(m_sparkCtx)
import sqlContext.implicits._
val df=rdd.toDF()
df.rdd.zipWithIndex().sortBy(x => {x._2}, true, partition).map(x => {x._1})
Partitions should remain the same when you convert the DataFrame to an RDD.
For example, when an RDD with 4 partitions is converted to a DataFrame and back to an RDD, the number of partitions remains the same, as shown below.
scala> val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:27
scala> val partition=rdd.getNumPartitions
partition: Int = 4
scala> val df=rdd.toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.rdd.getNumPartitions
res1: Int = 4
scala> df.withColumn("col2", lit(10)).rdd.getNumPartitions
res2: Int = 4
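If the partition count does change in your pipeline, it is usually because a DataFrame operation introduced a shuffle (a sort, groupBy, join, ...), in which case the result typically picks up spark.sql.shuffle.partitions partitions. A rough sketch, reusing df and partition from above:
// a shuffle generally changes the partition count (commonly to the 200 default, unless AQE coalesces it)
val sorted = df.sort("value")
sorted.rdd.getNumPartitions
// if the downstream RDD interface needs the original layout, restore it explicitly
val restored = sorted.rdd.coalesce(partition)
restored.getNumPartitions // 4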

How to "negative select" columns in spark's dataframe

I can't figure it out, but I guess it's simple. I have a Spark dataframe df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:
column_names = Array("A","B","C")
I'd like to do a df.select() in such a way that I can specify which columns not to select.
Example: let's say I do not want to select column "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame
cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame
cannot be applied to (Seq[String]).
What am I doing wrong?
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is treated in other SO questions. The signature of select is there to ensure your list of selected columns is not empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
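An alternative to the head/tail split is to map the remaining names to Columns and use the Column* overload of select (a small sketch reusing twocol and selectedCols from above):
import org.apache.spark.sql.functions.col
twocol.select(selectedCols.map(col): _*) // DataFrame with only column a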
For Spark v1.4 and higher, use drop(*cols) -
Returns a new DataFrame without the specified column(s).
Example -
df.drop('age').collect()
For Spark v2.3 and higher, you could also do it using colRegex(colName) -
Selects column based on the column name specified as a regex and returns it as Column.
Example -
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
Reference - colRegex, drop
For older versions of Spark, take the list of columns in the dataframe, remove the columns you want to drop (maybe using set operations), and then use select with the resulting list:
val columns = Seq("A","B","C")
import org.apache.spark.sql.functions.col
df.select(columns.diff(Seq("B")).map(col): _*)
In PySpark you can do
df.select(list(set(df.columns) - set(["B"])))
Using more than one line you can also do
cols = df.columns
cols.remove("B")
df.select(cols)
It is also possible to do this using Spark's ability to select columns with regular expressions, with a negative look-ahead expression (?!).
In this case the dataframe has columns a, b, c, and the regex excludes column b from the list.
Note: you need to enable regex column-name lookups with the spark.sql.parser.quotedRegexColumnNames=true session setting, and it requires Spark 2.3+.
select `^(?!b).*`
from (
select 1 as a, 2 as b, 3 as c
)
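For reference, a rough sketch of the same approach from a Scala session (assuming Spark 2.3+ and a SparkSession named spark):
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
spark.sql("select `^(?!b).*` from (select 1 as a, 2 as b, 3 as c)").show()
// only columns a and c are returned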