How to convert a DataFrame to an RDD without changing the number of partitions? - scala

For some reason I have to convert an RDD to a DataFrame, do some work on the DataFrame, and then convert it back to an RDD, because my interface is RDD-based. When I use df.rdd, the number of partitions changes to 1, so I have to repartition and sortBy the RDD. Is there a cleaner solution? Thanks!
This is my attempt:
import org.apache.spark.sql.SQLContext

val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val partition = rdd.getNumPartitions
val sqlContext = new SQLContext(m_sparkCtx)
import sqlContext.implicits._
val df = rdd.toDF()
df.rdd.zipWithIndex().sortBy(x => {x._2}, true, partition).map(x => {x._1})

Partitions should remain the same when you convert the DataFrame to an RDD.
For example, when an RDD with 4 partitions is converted to a DataFrame and back to an RDD, the number of partitions of the RDD remains the same, as shown below.
scala> val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:27
scala> val partition=rdd.getNumPartitions
partition: Int = 4
scala> val df=rdd.toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.rdd.getNumPartitions
res1: Int = 4
scala> df.withColumn("col2", lit(10)).rdd.getNumPartitions
res2: Int = 4
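If the DataFrame operations you apply in between do introduce a shuffle (a join, a window function, or an aggregation, for example), the partition count of the resulting RDD can change. Below is a minimal sketch of restoring the original count afterwards, assuming you only need the partition count back and not a particular row order; it reuses the rdd, partition and df values from the question.
// Suppose some transformation changed the partitioning, simulated here with an explicit repartition
val shuffled = df.repartition(8).rdd
// coalesce merges partitions without a full shuffle and brings the count back down to the original
val restored = shuffled.coalesce(partition)
println(restored.getNumPartitions)   // 4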

Related

Filling null values from a CSV file issue - spark

I'm using Scala and Apache Spark 2.3.0 with a CSV file. I'm doing this because when I try to use the CSV for k-means it tells me that I have null values, and the same issue keeps appearing even when I try to fill those nulls:
scala> val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter",";")
.schema(schema).load("33.csv")
scala> df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
scala> val featureCols = Array("LONGITUD","LATITUD")
featureCols: Array[String] = Array(LONGITUD, LATITUD)
scala> val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_440117601217
scala> val df2 = assembler.transform(df)
df2: org.apache.spark.sql.DataFrame = [ID_CALLE: int, TIPO: int ... 6 more fields]
scala> df2.show
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null
Looks like you did na.fill() but didn't assign the result to a DataFrame. DataFrames are immutable, so na.fill() returns a new DataFrame instead of modifying df in place.
Try val nonullDF = df.na.fill(...) and use nonullDF from then on.
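A minimal sketch of the corrected flow, assuming the schema value, sqlContext and CSV path from the question, with the same mean-fill map that is already shown there:
import org.apache.spark.sql.functions.mean
import org.apache.spark.ml.feature.VectorAssembler

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(schema)
  .load("33.csv")

// na.fill returns a new DataFrame; keep the result instead of discarding it
val nonullDF = df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)

val featureCols = Array("LONGITUD", "LATITUD")
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
// assemble the features from the filled DataFrame, not the original df
val df2 = assembler.transform(nonullDF)
df2.show()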

Spark: Scala - apply function to a list in a DataFrame [duplicate]

This question already has answers here:
How to slice and sum elements of array column?
I am trying to apply a sum function to each cell of a column of a DataFrame in Spark. Each cell contains a list of integers which I would like to add up.
However, the error I am getting is:
console:357: error: value sum is not a member of
org.apache.spark.sql.ColumnName
for the example script below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().getOrCreate()
val df = spark.createDataFrame(Seq(
(0, List(1,2,3)),
(1, List(2,2,3)),
(2, List(3,2,3)))).toDF("Id", "col_1")
val test = df.withColumn( "col_2", $"col_1".sum )
test.show()
You can define a UDF.
scala> def sumFunc(a: Seq[Int]): Int = a.sum
sumFunc: (a: Seq[Int])Int
scala> val sumUdf = udf(sumFunc(_: Seq[Int]))
sumUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(ArrayType(IntegerType,false))))
scala> val test = df.withColumn( "col_2", sumUdf($"col_1") )
test: org.apache.spark.sql.DataFrame = [Id: int, col_1: array<int> ... 1 more field]
scala> test.collect
res0: Array[org.apache.spark.sql.Row] = Array([0,WrappedArray(1, 2, 3),6], [1,WrappedArray(2, 2, 3),7], [2,WrappedArray(3, 2, 3),8])
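As a follow-up, on Spark 2.4+ you can also avoid the UDF with the built-in aggregate higher-order function via expr. A minimal sketch, reusing the df from the question (this is an alternative, not part of the original answer):
import org.apache.spark.sql.functions.expr

// aggregate(array, start, (acc, x) -> ...) folds the array column without a UDF (Spark 2.4+)
val test2 = df.withColumn("col_2", expr("aggregate(col_1, 0, (acc, x) -> acc + x)"))
test2.show()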

Append a row to a pair RDD in spark

I have a pair RDD of existing values such as :
(1,2)
(3,4)
(5,6)
I want to append a row (7,8) to the same RDD
How can I append to the same RDD in Spark?
You can use the union operation. RDDs are immutable, so you cannot append to an existing RDD in place; union returns a new RDD containing the elements of both.
scala> val rdd1 = sc.parallelize(List((1,2), (3,4), (5,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List((7, 8)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> val unionOfTwo = rdd1.union(rdd2)
unionOfTwo: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[2] at union at <console>:28
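As a usage note, when the two RDDs do not share a partitioner, union simply concatenates their partitions, so the result has rdd1's partition count plus rdd2's. A minimal sketch of appending the row and restoring the original partition count, assuming that count matters for your job:
// union concatenates partitions; coalesce back if you need the original count
val appended = rdd1.union(rdd2).coalesce(rdd1.getNumPartitions)
appended.collect().foreach(println)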

Can I keep the number of partitions unchanged when I use a window function in Spark/Scala?

I have an RDD. The number of partitions of the result changes to 1 when I use a window function. Can I keep the partitioning unchanged when I use a window?
This is my code:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val rdd = sc.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
val sqlContext = new SQLContext(m_sparkCtx)
import sqlContext.implicits._
val result = rdd.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
  .rdd
println(result.getNumPartitions + " rdd2")
My input has 4 partitions and I want my result to have 4 partitions as well. Is there a cleaner solution?
The result RDD has 1 partition as expected, because you are applying a window function whose Window specification has no partitionBy clause, so all the data has to be pulled into a single partition.
When we include a partitionBy clause in the window specification, the number of partitions in the result RDD is no longer 1, as shown below. In the example below we add another column, col1, to the original DataFrame and apply the same window function with a partitionBy clause on col1.
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val rdd = spark.sparkContext.parallelize(List((1,1),(3,1),(2,2),(4,2),(5,2),(6,3),(7,3),(8,3)),4)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[49] at parallelize at <console>:28
scala> val result = rdd.toDF("values", "col1").withColumn("csum", sum(col("values")).over(Window.partitionBy("col1").orderBy("values"))).rdd
result: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[58] at rdd at <console>:30
scala> result.getNumPartitions
res6: Int = 200
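The 200 here comes from spark.sql.shuffle.partitions, the default number of partitions Spark SQL uses for shuffles. If you need the result to have exactly 4 partitions, a minimal sketch of two options (pick whichever fits your job):
// Option 1: lower the shuffle partition count before applying the window function
spark.conf.set("spark.sql.shuffle.partitions", "4")

// Option 2: keep the default and coalesce the result RDD afterwards
val result4 = result.coalesce(4)
println(result4.getNumPartitions)   // 4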

How to "negative select" columns in spark's dataframe

I can't figure it out, but I guess it's simple. I have a Spark DataFrame df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns of this df:
val column_names = Array("A","B","C")
I'd like to do a df.select() in such a way, that I can specify which columns not to select.
Example: let's say I do not want to select columns "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame
cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame
cannot be applied to (Seq[String]).
What am I doing wrong?
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myOtherRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is covered in other SO questions. The signature of select (a first column name plus a vararg of further names) is there to ensure your list of selected columns is not empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
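For reference, a minimal sketch of the two equivalent ways to pass a dynamic list of column names to select, reusing cols and twocol from the session above:
import org.apache.spark.sql.functions.col

val keep = cols.filter(_ != "b")
// head/tail satisfies select(col: String, cols: String*)
val viaStrings = twocol.select(keep.head, keep.tail: _*)
// mapping the names to Column satisfies select(cols: Column*)
val viaColumns = twocol.select(keep.map(col): _*)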
For Spark v1.4 and higher, you can use drop(*cols), which returns a new DataFrame without the specified column(s). Example (PySpark):
df.drop('age').collect()
For Spark v2.3 and higher you can also use colRegex(colName), which selects columns whose names match the specified regex and returns them as a Column. Example (PySpark):
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
Reference - colRegex, drop
For older versions of Spark, take the list of columns in the DataFrame, remove the columns you want to drop (for example with set operations), and then use select to pick the resulting list:
import org.apache.spark.sql.functions.col

val columns = Seq("A", "B", "C")
// select takes Column varargs (or a head String plus String varargs), not a Seq[String]
df.select(columns.diff(Seq("B")).map(col): _*)
In PySpark you can do
df.select(list(set(df.columns) - set(["B"])))
though note that going through a set does not preserve the original column order. Using more than one line, you can also do
cols = df.columns
cols.remove("B")
df.select(cols)
It is also possible to do this using Spark's ability to select columns with regular expressions, combined with the negative look-ahead expression (?!).
In this case the DataFrame has columns a, b and c, and the regex excludes column b from the selection.
Note: you need to enable regex column-name lookups with the spark.sql.parser.quotedRegexColumnNames=true session setting, which requires Spark 2.3+.
select `^(?!b).*`
from (
select 1 as a, 2 as b, 3 as c
)
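For completeness, a minimal sketch of running the same thing from Scala in spark-shell (the inline select stands in for your table):
// enable quoted-regex column names for this session (Spark 2.3+)
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

spark.sql("""
  select `^(?!b).*`
  from (select 1 as a, 2 as b, 3 as c)
""").show()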