Filling null values from a CSV file in Spark (Scala)

I'm using Scala and Apache Spark 2.3.0 with a CSV file. When I try to use the CSV for k-means, Spark tells me that I have null values, and the same error keeps appearing even after I try to fill those nulls.
scala>val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter",";")
.schema(schema).load("33.csv")
scala> df.na.fill(df.columns.zip(
df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
scala> val featuresCols = Array("LONGITUD","LATITUD")
featuresCols: Array[String] = Array(LONGITUD, LATITUD)
scala> val featureCols = Array("LONGITUD","LATITUD")
featureCols: Array[String] = Array(LONGITUD, LATITUD)
scala> val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_440117601217
scala> val df2 = assembler.transform(df)
df2: org.apache.spark.sql.DataFrame = [ID_CALLE: int, TIPO: int ... 6 more fields]
scala> df2.show
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null

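It looks like you called na.fill() but did not assign the result to a DataFrame. na.fill returns a new DataFrame and leaves df unchanged.
Try val nonullDF = df.na.fill(...) and pass nonullDF to the assembler instead of df.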
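To make the point concrete, here is a minimal sketch of the fix. It assumes a local SparkSession and a small made-up DataFrame standing in for 33.csv; the sample rows and the names means and filled are illustrative and not taken from the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.mean

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("fill-nulls").getOrCreate()
import spark.implicits._

// Made-up sample data containing nulls, standing in for the CSV file.
val df = Seq[(Option[Double], Option[Double])](
  (Some(1.0), Some(10.0)),
  (None, Some(20.0)),
  (Some(3.0), None)
).toDF("LONGITUD", "LATITUD")

// Column name -> column mean, computed the same way as in the question.
val means: Map[String, Any] =
  df.columns.zip(df.select(df.columns.map(mean(_)): _*).first.toSeq).toMap

// DataFrames are immutable: keep the DataFrame that na.fill returns.
val filled = df.na.fill(means)

val assembler = new VectorAssembler()
  .setInputCols(Array("LONGITUD", "LATITUD"))
  .setOutputCol("features")

// Transform the filled DataFrame, not the original df.
val df2 = assembler.transform(filled)
df2.show()
```

Passing df to the assembler instead of filled would reproduce the "Values to assemble cannot be null" error, because df still holds the original nulls.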

Spark: Scala - apply function to a list in a DataFrame [duplicate]

I am trying to apply a sum-function to each cell of a column of a dataframe in spark. Each cell contains a list of integers which I would like to add up.
However, the error I am getting is:
console:357: error: value sum is not a member of org.apache.spark.sql.ColumnName
for the example script below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().getOrCreate()
val df = spark.createDataFrame(Seq(
(0, List(1,2,3)),
(1, List(2,2,3)),
(2, List(3,2,3)))).toDF("Id", "col_1")
val test = df.withColumn( "col_2", $"col_1".sum )
test.show()
You can define a UDF.
scala> def sumFunc(a: Seq[Int]): Int = a.sum
sumFunc: (a: Seq[Int])Int
scala> val sumUdf = udf(sumFunc(_: Seq[Int]))
sumUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(ArrayType(IntegerType,false))))
scala> val test = df.withColumn( "col_2", sumUdf($"col_1") )
test: org.apache.spark.sql.DataFrame = [Id: int, col_1: array<int> ... 1 more field]
scala> test.collect
res0: Array[org.apache.spark.sql.Row] = Array([0,WrappedArray(1, 2, 3),6], [1,WrappedArray(2, 2, 3),7], [2,WrappedArray(3, 2, 3),8])
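As an alternative to the UDF above, Spark 2.4 and later ship a built-in aggregate SQL function that folds over an array column. The sketch below is not what the answer used; it assumes Spark 2.4+ and a local SparkSession, and reuses the column names from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("array-sum").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, List(1, 2, 3)),
  (1, List(2, 2, 3)),
  (2, List(3, 2, 3))
).toDF("Id", "col_1")

// aggregate(array, start, merge) folds over each array inside Spark SQL,
// starting from 0 and adding every element to the accumulator.
val test = df.withColumn("col_2", expr("aggregate(col_1, 0, (acc, x) -> acc + x)"))
test.show()
```

Because the sum is expressed in Spark SQL rather than in a Scala closure, the optimizer can see it, which is usually preferable to a UDF when a built-in function is available.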

Azure Databricks broadcast variables not serializable

I am trying to create an extremely simple Spark notebook using Azure Databricks and would like to make use of a simple RDD map call.
This is just for experimenting, so the example is a bit contrived, but I cannot get a value to work in the RDD map call unless it is a static constant value.
I have tried using a broadcast variable.
Here is a simple example using an Int, which I broadcast and then try to use in the RDD map:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
val multiplierBroadcast = sparkContext.broadcast(multiplier)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => multiplierBroadcast.value)
val df = mappedRdd.toDF
df.show()
Here is another example, where I use a simple serializable singleton object with an Int field, which I broadcast and then try to use in the RDD map:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
object Foo extends Serializable { val theMultiplier: Int = multiplier}
val fooBroadcast = sparkContext.broadcast(Foo)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => fooBroadcast.value.theMultiplier)
val df = mappedRdd.toDF
df.show()
And finally a List[Int] with a single element, which I broadcast and then try to use in the RDD map:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
val listBroadcast = sparkContext.broadcast(List(multiplier))
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => listBroadcast.value.head)
val df = mappedRdd.toDF
df.show()
However, all of the examples above fail with the error below, which points to the closure passed to the RDD map not being serializable. I cannot see the issue; I would expect an Int value to be serializable in all of the examples above.
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2375)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:379)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:378)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:371)
at org.apache.spark.rdd.RDD.map(RDD.scala:378)
However, if I make the value in the RDD map a regular Int literal, like this:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => 6)
val df = mappedRdd.toDF
df.show()
everything works fine and I see my simple DataFrame shown as expected.
Any ideas, anyone?
From your code, I assume that you are on Spark 2+. There may be no need to drop down to the RDD level; you can work with DataFrames instead.
The code below shows how to join two DataFrames and explicitly broadcast the first one (dataDF).
import spark.implicits._
import org.apache.spark.sql.functions._
val data = Seq(1, 2, 3, 4, 5)
val dataDF = data.toDF("id")
val largeDataDF = Seq((0, "Apple"), (1, "Pear"), (2, "Banana")).toDF("id", "value")
val df = largeDataDF.join(broadcast(dataDF), Seq("id"))
df.show()
Typically, small DataFrames are good candidates for broadcasting, an optimization whereby they are sent to all executors. spark.sql.autoBroadcastJoinThreshold is a configuration setting that limits the size of DataFrames eligible for automatic broadcast. Additional details can be found in the official Spark documentation.
Note also that DataFrames give you a handy explain method, which shows the physical plan and can be useful for debugging.
Running explain() on our example confirms that Spark performs a BroadcastHashJoin.
df.explain()
== Physical Plan ==
*Project [id#11, value#12]
+- *BroadcastHashJoin [id#11], [id#3], Inner, BuildRight
:- LocalTableScan [id#11, value#12]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- LocalTableScan [id#3]
If you need additional help with DataFrames, I provide an extensive list of examples at http://allaboutscala.com/big-data/spark/
So the answer was that you should not capture the Spark context in a val and then use that val for the broadcast. This is the working code:
import sqlContext.implicits._
val multiplier = 3
val multiplierBroadcast = spark.sparkContext.broadcast(multiplier)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = spark.sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => multiplierBroadcast.value)
val df = mappedRdd.toDF
df.show()
Thanks to @nadim Bahadoor for this answer.
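One plausible reading of why this helps, not confirmed in the thread, is that notebook cells are compiled into wrapper classes, so a closure that reads a cell-level val can drag the whole wrapper, including a non-serializable SparkContext reference, into the task. The plain Scala sketch below illustrates that capture mechanism without Spark; Cell, handle, overField, overLocal and canSerialize are hypothetical names made up for the illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Hypothetical stand-in for a notebook cell wrapper that also holds a
  // non-serializable handle, playing the role of a SparkContext val.
  class Cell {
    val handle = new Object // not java.io.Serializable
    val multiplier = 3

    // Reads the field through `this`, so the closure captures the whole Cell.
    def overField: Int => Int = x => x * multiplier

    // Copies the value into a local first, so only an Int is captured.
    def overLocal: Int => Int = {
      val m = multiplier
      x => x * m
    }
  }

  // True if Java serialization accepts the object, which is roughly what
  // Spark's closure check attempts before shipping a task.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val cell = new Cell
    println(s"closure over field serializable: ${canSerialize(cell.overField)}")
    println(s"closure over local serializable: ${canSerialize(cell.overLocal)}")
  }
}
```

The closure over the field is expected to fail the check because it references the enclosing Cell, while the closure over the local copy only carries an Int. This is the same kind of failure that "Task not serializable" reports.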

How to union multiple CSV files into a single CSV file

I wrote the code below to union multiple CSV files and write the combined data into a new file, but I am getting an error.
val filesData=List("file1", "file2")
val dataframes = filesData.map(spark.read.option("header", true).csv(_))
val combined = dataframes.reduce(_ union _)
val data = combined.rdd
val head :Array[String]= data.first()
val memberDataRDD = data.filter(_(0) != head(0))
type mismatch; found : org.apache.spark.sql.Row required: Array[String]
There will not be any issue as long as both CSV DataFrames have the same schema:
val df = spark.read.option("header", "true").csv("C:/maheswara/learning/big data/spark/sample_data/tmp")
val df1 = spark.read.option("header", "true").csv("C:/maheswara/learning/big data/spark/sample_data/tmp1")
val dfs = List(df, df1)
val dfUnion = dfs.reduce(_ union _)
You can just read multiple paths directly with Spark:
spark.read.option("header", true).csv(filesData:_*)

How to convert a DataFrame to an RDD without changing the partitioning?

For some reason I have to convert an RDD to a DataFrame and then do something with the DataFrame, but my interface is RDD-based, so I have to convert the DataFrame back to an RDD. When I use df.rdd, the number of partitions changes to 1, so I have to repartition and sortBy the RDD. Is there a cleaner solution? Thanks!
This is my attempt:
val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val partition=rdd.getNumPartitions
val sqlContext = new SQLContext(m_sparkCtx)
import sqlContext.implicits._
val df=rdd.toDF()
df.rdd.zipWithIndex().sortBy(x => {x._2}, true, partition).map(x => {x._1})
Partitions should remain the same when you convert the DataFrame to an RDD.
For example, when an RDD with 4 partitions is converted to a DataFrame and back to an RDD, the number of partitions remains the same, as shown below.
scala> val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:27
scala> val partition=rdd.getNumPartitions
partition: Int = 4
scala> val df=rdd.toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.rdd.getNumPartitions
res1: Int = 4
scala> df.withColumn("col2", lit(10)).rdd.getNumPartitions
res1: Int = 4

How to "negative select" columns in spark's dataframe

I can't figure it out, but guess it's simple. I have a spark dataframe df. This df has columns "A","B" and "C". Now let's say I have an Array containing the name of the columns of this df:
column_names = Array("A","B","C")
I'd like to do a df.select() in such a way, that I can specify which columns not to select.
Example: let's say I do not want to select column "B". I tried
df.select(column_names.filter(_!="B"))
but this does not work, as
org.apache.spark.sql.DataFrame
cannot be applied to (Array[String])
So, here it says it should work with a Seq instead. However, trying
df.select(column_names.filter(_!="B").toSeq)
results in
org.apache.spark.sql.DataFrame
cannot be applied to (Seq[String]).
What am I doing wrong?
Since Spark 1.4 you can use the drop method:
Scala:
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
df.drop("y")
Python:
df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
df.drop("y")
## DataFrame[x: bigint]
I had the same problem and solved it this way (oaffdf is a dataframe):
val dropColNames = Seq("col7","col121")
val featColNames = oaffdf.columns.diff(dropColNames)
val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
val featsdf = oaffdf.select(featCols: _*)
https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
OK, it's ugly, but this quick spark shell session shows something that works:
scala> val myRDD = sc.parallelize(List.range(1,10))
myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
scala> val myDF = myRDD.toDF("a")
myDF: org.apache.spark.sql.DataFrame = [a: int]
scala> val myOtherRDD = sc.parallelize(List.range(1,10))
myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
scala> val myotherDF = myRDD.toDF("b")
myotherDF: org.apache.spark.sql.DataFrame = [b: int]
scala> myDF.unionAll(myotherDF)
res2: org.apache.spark.sql.DataFrame = [a: int]
scala> myDF.join(myotherDF)
res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val twocol = myDF.join(myotherDF)
twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> val cols = Array("a", "b")
cols: Array[String] = Array(a, b)
scala> val selectedCols = cols.filter(_!="b")
selectedCols: Array[String] = Array(a)
scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
res4: org.apache.spark.sql.DataFrame = [a: int]
Providing varargs to a function that requires them is covered in other Stack Overflow questions. The signature of select is there to ensure your list of selected columns is not empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
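The head/tail split used above can be shown without Spark. In the sketch below, select is a hypothetical stand-in with the same shape as the select(col: String, cols: String*) overload; it only echoes its arguments so the varargs expansion is visible.

```scala
// Hypothetical stand-in with the same shape as select(col: String, cols: String*).
// It simply returns the names it was given.
def select(col: String, cols: String*): Seq[String] = col +: cols

val columnNames = Array("A", "B", "C")
val kept = columnNames.filter(_ != "B")

// Passing `kept` directly would not compile: an Array[String] is not a String.
// Split it into head and tail, and expand the tail into varargs with `: _*`.
val selected = select(kept.head, kept.tail: _*)
println(selected.mkString(", "))
```

Note that kept.head throws on an empty array, so this pattern relies on at least one column remaining after the filter.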
For Spark v1.4 and higher, using drop(*cols) -
Returns a new DataFrame without the specified column(s).
Example -
df.drop('age').collect()
For Spark v2.3 and higher you could also do it using colRegex(colName) -
Selects column based on the column name specified as a regex and returns it as Column.
Example-
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
Reference - colRegex, drop
For older versions of Spark, take the list of columns in the DataFrame, remove the columns you want to drop (for example with set operations), and then use select on the resulting list.
import org.apache.spark.sql.functions.col
val columns = Seq("A","B","C")
df.select(columns.diff(Seq("B")).map(c => col(c)): _*)
In pyspark you can do
df.select(list(set(df.columns) - set(["B"])))
Using more than one line you can also do
cols = df.columns
cols.remove("B")
df.select(cols)
This can be done as follows, using Spark's ability to select columns with regular expressions together with a negative look-ahead expression (?!...).
In this case the DataFrame has columns a, b and c, and the regex excludes column b from the list.
Note: you need to enable regex column name lookups with the session setting spark.sql.parser.quotedRegexColumnNames=true. This requires Spark 2.3+.
select `^(?!b).*`
from (
select 1 as a, 2 as b, 3 as c
)
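To see what the look-ahead pattern accepts, the plain Scala sketch below applies it to a list of column names, assuming Spark matches column names with Java regular expression semantics. The extra column name bb and the variable names are illustrative. It shows that ^(?!b).* rejects every name starting with b, not only the column b itself.

```scala
// Illustrative column names; "bb" is added to show the difference between the patterns.
val columns = Seq("a", "b", "bb", "c")

// The pattern from the query: rejects any name that starts with "b".
val notStartingWithB = "^(?!b).*"
// A stricter variant: rejects only the exact name "b".
val notExactlyB = "^(?!b$).*"

// String.matches requires the whole name to match the pattern.
val keptLoose = columns.filter(_.matches(notStartingWithB))
val keptExact = columns.filter(_.matches(notExactlyB))

println(keptLoose.mkString(", "))
println(keptExact.mkString(", "))
```

If other columns share a prefix with the one being excluded, anchoring the look-ahead with $ as in the second pattern avoids dropping them by accident.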