Replace null values in Spark DataFrame - scala

I saw a solution here, but when I tried it, it didn't work for me.
First I import a cars.csv file:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("/usr/local/spark/cars.csv")
Which looks like the following :
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla| S| No comment| |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null| null|
Then I do this :
df.na.fill("e",Seq("blank"))
But the null values didn't change.
Can anyone help me?

This is basically very simple. You'll need to create a new DataFrame. I'm using the DataFrame df that you have defined earlier.
val newDf = df.na.fill("e",Seq("blank"))
DataFrames are immutable structures.
Each time you perform a transformation whose result you need to keep, you'll need to assign the transformed DataFrame to a new value.
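To make the immutability point concrete, here is a minimal sketch. It assumes Spark 2.x or later with a local SparkSession, and the two-row DataFrame is a hypothetical stand-in for cars.csv, not the real file. The original df still contains the null afterwards; only the DataFrame returned by na.fill has it replaced.

```scala
import org.apache.spark.sql.SparkSession

object NaFillReturnsNewDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("na-fill-demo")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for cars.csv; Option[String] gives a nullable string column.
    val df = Seq(
      ("2012", "Tesla", Option("")),
      ("2015", "Chevy", Option.empty[String])
    ).toDF("year", "make", "blank")

    // The result of this call is thrown away, so df is not changed.
    df.na.fill("e", Seq("blank"))
    assert(df.filter($"blank".isNull).count() == 1)

    // Keeping the returned DataFrame is what makes the fill visible.
    val newDf = df.na.fill("e", Seq("blank"))
    assert(newDf.filter($"blank".isNull).count() == 0)

    newDf.show()
    spark.stop()
  }
}
```

Note that the empty string in the first row is not a null, so na.fill leaves it alone.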

You can achieve the same in Java this way:
Dataset<Row> filteredData = dataset.na().fill(0);

If the column were of string type,
val newdf= df.na.fill("e",Seq("blank"))
would work.
Since it's of float type (as the image shows), you need to use
val newdf= df.na.fill(0.0, Seq("blank"))
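The type matching described above can be checked with a small sketch. It assumes Spark 2.x or later and uses a made-up DataFrame with one string column (comment) and one double column (score); these names are illustrative only. A string fill value only touches string columns, a numeric one only numeric columns, and a mismatched column is silently skipped, which is why the call can look like it did nothing.

```scala
import org.apache.spark.sql.SparkSession

object NaFillTypeMatching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("na-fill-types")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: "comment" is a nullable string, "score" a nullable double.
    val df = Seq(
      ("Tesla", Option.empty[String], Option.empty[Double]),
      ("Ford", Option("ok"), Option(1.5))
    ).toDF("make", "comment", "score")

    // A string value fills string columns only.
    val stringFilled = df.na.fill("e")
    assert(stringFilled.filter($"comment".isNull).count() == 0)
    assert(stringFilled.filter($"score".isNull).count() == 1)

    // A numeric value fills numeric columns only.
    val numericFilled = df.na.fill(0.0)
    assert(numericFilled.filter($"score".isNull).count() == 0)
    assert(numericFilled.filter($"comment".isNull).count() == 1)

    // A string value aimed at a double column is ignored without an error.
    val mismatched = df.na.fill("e", Seq("score"))
    assert(mismatched.filter($"score".isNull).count() == 1)

    // A Map lets each column get a value of its own type.
    val both = df.na.fill(Map("comment" -> "e", "score" -> 0.0))
    assert(both.filter($"comment".isNull || $"score".isNull).count() == 0)

    both.show()
    spark.stop()
  }
}
```

If you are unsure which case applies, df.printSchema() shows the actual column types.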

Related

PySpark input_file_name() into a variable NOT df

I want to store the value from input_file_name() in a variable instead of a dataframe. This variable will then be used for logging, troubleshooting, etc.
You can create a new column on the data frame using withColumn and input_file_name() and then use collect() operation, something like below:
df = spark.read.csv("/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv")
df.show()
+-----+
| _c0|
+-----+
|43368|
+-----+
from pyspark.sql.functions import input_file_name
df1 = df.withColumn("file_name", input_file_name())
df1.show(truncate=False)
+-----+---------------------------------------------------------------------------------------------------------+
|_c0 |file_name |
+-----+---------------------------------------------------------------------------------------------------------+
|43368|dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv|
+-----+---------------------------------------------------------------------------------------------------------+
Now create a variable holding the file name by collecting the first row, taking its file_name column, and splitting it on /
file_name = df1.collect()[0][1].split("/")[3]
print(file_name)
Output
part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv
Please note that in your case the indices used after collect() and after split() might differ.

I need to skip three rows from the dataframe while loading from a CSV file in scala

I am loading my CSV file into a data frame, and that works, but I need to skip the first three lines of the file.
I tried the .option() command with header set to true, but it only ignores the first line.
val df = spark.sqlContext.read
.schema(Myschema)
.option("header",true)
.option("delimiter", "|")
.csv(path)
I thought of specifying the header as 3 lines, but I couldn't find a way to do that.
Alternative thought: skip those 3 lines from the data frame.
Please help me with this. Thanks in advance.
A generic way to handle your problem would be to index the dataframe and filter the indices that are greater than 2.
Straightforward approach:
As suggested in another answer, you may try adding an index with monotonically_increasing_id.
df.withColumn("Index",monotonically_increasing_id())
.filter('Index > 2)
.drop("Index")
Yet, that's only going to work if the first 3 rows are in the first partition. Moreover, as mentioned in the comments, this is the case today, but this code may break completely with future versions of Spark, and that would be very hard to debug. Indeed, the contract in the API is just "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive". It is therefore not very safe to assume that the IDs will always start from zero. There might even be other cases in the current version in which that does not work (I'm not sure though).
To illustrate my first concern, have a look at this:
scala> spark.range(4).withColumn("Index",monotonically_increasing_id()).show()
+---+----------+
| id| Index|
+---+----------+
| 0| 0|
| 1| 1|
| 2|8589934592|
| 3|8589934593|
+---+----------+
We would only remove two rows...
Safe approach:
The previous approach will work most of the time, but to be safe, you can use zipWithIndex from the RDD API to get consecutive indices.
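import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}
// Appends a consecutive, zero-based index column called `name` to df.
def zipWithIndex(df : DataFrame, name : String) : DataFrame = {
  val rdd = df.rdd.zipWithIndex
    .map{ case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  val newSchema = df.schema
    .add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}
// The 'index symbol syntax needs import spark.implicits._ outside spark-shell.
zipWithIndex(df, "index").where('index > 2).drop("index")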
We can check that it's safer:
scala> zipWithIndex(spark.range(4).toDF("id"), "index").show()
+---+-----+
| id|index|
+---+-----+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
+---+-----+
You can try this option
df.withColumn("Index",monotonically_increasing_id())
.filter(col("Index") > 2)
.drop("Index")
You may need to adapt this to your schema.
import org.apache.spark.sql.Row
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Read CSV
val file = sc.textFile("csvfilelocation")
//Remove first 3 lines
val data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(3) else iter }
//Create RowRDD by splitting each line on the "|" delimiter and mapping it to the required fields
val rowRdd = data.map(_.split("\\|")).map(x=>Row(x(0), x(1)))
//create dataframe by calling sqlcontext.createDataframe with rowRdd and your schema
val df = sqlContext.createDataFrame(rowRdd, schema)
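A variation on the ideas above is to drop the lines before the CSV parser ever sees them. The following is a minimal sketch, assuming Spark 2.2 or later (where DataFrameReader.csv accepts a Dataset[String]). The in-memory lines and the two-column schema are made up for illustration; with a real file you would start from spark.read.textFile(path) instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SkipLeadingLines {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("skip-leading-lines")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical file contents: three lines to skip, then pipe-delimited data.
    // For a real file, use spark.read.textFile(path) to get the same Dataset[String].
    val lines = Seq("report v1", "generated by tool", "id|name", "1|a", "2|b").toDS()

    // zipWithIndex assigns consecutive indices starting at 0 across partitions,
    // so keeping index > 2 drops exactly the first three lines.
    val dataLines = lines.rdd
      .zipWithIndex()
      .filter { case (_, index) => index > 2 }
      .map { case (line, _) => line }
      .toDS()

    // Hypothetical schema standing in for Myschema from the question.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)
    ))

    val df = spark.read
      .schema(schema)
      .option("delimiter", "|")
      .csv(dataLines)

    assert(df.count() == 2)
    df.show()
    spark.stop()
  }
}
```

The trade-off is one extra pass over the raw text, but the CSV options (delimiter, quoting, schema) keep working as usual, which the manual Row mapping does not give you.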

Adding StringType column to existing Spark DataFrame and then applying default values

Scala 2.10 here using Spark 1.6.2. I have a similar (but not the same) question as this one, however, the accepted answer is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark; and therefore I can't reproduce it or make sense of it. More importantly, that question is also just limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
jsonDF.show()
When I run that I get this following as output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
From that other question I have pieced the following pseudo-code together:
val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, I get an error on that .withColumn(...) method:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
You can use the lit function. First you have to import it
import org.apache.spark.sql.functions.lit
and use it as shown below
jsonDF.withColumn("z", lit("red"))
The type of the column will be inferred automatically.
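For a complete version of the lit approach, here is a minimal sketch. Note the assumption: it uses SparkSession and spark.read.json on a Dataset[String], which need Spark 2.2 or later, not the Spark 1.6.2 setup from the question. The check on the schema shows that a string literal does give a StringType column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

object AddDefaultColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("add-default-column")
      .getOrCreate()
    import spark.implicits._

    val json = """{ "x": true, "y": "not true" }"""
    val jsonDF = spark.read.json(Seq(json).toDS())

    // lit wraps a constant in a Column, so every existing row gets the same value.
    val newDF = jsonDF.withColumn("z", lit("red"))

    // The column type follows from the literal: a String gives StringType.
    assert(newDF.schema("z").dataType == StringType)
    assert(newDF.filter($"z" === "red").count() == 1)

    newDF.show()
    spark.stop()
  }
}
```

The original attempt failed because jsonDF("col") looks up an existing column named "col"; lit does not refer to any column at all.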

Creating a Spark DataFrame from a single string

I'm trying to take a hardcoded String and turn it into a 1-row Spark DataFrame (with a single column of type StringType) such that:
String fizz = "buzz"
Would result with a DataFrame whose .show() method looks like:
+-----+
| fizz|
+-----+
| buzz|
+-----+
My best attempt thus far has been:
val rawData = List("fizz")
val df = sqlContext.sparkContext.parallelize(Seq(rawData)).toDF()
df.show()
But I get the following error:
java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:413)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
Any ideas as to where I'm going awry? Also, how do I set "buzz" as the row value for the fizz column?
Update:
Trying:
sqlContext.sparkContext.parallelize(rawData).toDF()
I get a DF that looks like:
+----+
| _1|
+----+
|buzz|
+----+
Try:
sqlContext.sparkContext.parallelize(rawData).toDF()
In Spark 2.0 you can:
import spark.implicits._
rawData.toDF
Optionally provide a sequence of names for toDF:
sqlContext.sparkContext.parallelize(rawData).toDF("fizz")
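Putting the pieces together, a minimal sketch for Spark 2.x or later (assuming a local SparkSession) could look like this. The column name goes into toDF and the value into the collection. The original attempt failed because Seq(rawData) is a sequence of lists, so each row was an array rather than a string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StringType

object SingleStringDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("single-string-df")
      .getOrCreate()
    import spark.implicits._

    // "buzz" is the row value, "fizz" is the column name.
    val df = Seq("buzz").toDF("fizz")

    assert(df.columns.sameElements(Array("fizz")))
    assert(df.schema("fizz").dataType == StringType)
    assert(df.first().getString(0) == "buzz")

    df.show()
    spark.stop()
  }
}
```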
In Java, the following works:
List<String> textList = Collections.singletonList("yourString");
SQLContext sqlContext = new SQLContext(sparkContext);
Dataset<Row> data = sqlContext
.createDataset(textList, Encoders.STRING())
.withColumnRenamed("value", "text");

Dataframe creation in Scala

wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
This is a way of creating a dataframe from a list of tuples in Python. How can I do this in Scala? I'm new to Scala and I'm having trouble figuring it out.
Any help will be appreciated!
One simple way,
val df = sc.parallelize(List( (1,"a"), (2,"b") )).toDF("key","value")
and then df.show gives
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
+---+-----+
Refer to the worked example in Programmatically Specifying the Schema for constructing a DataFrame with createDataFrame.
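As a rough Scala counterpart of the Python snippet from the question, here is a minimal sketch using createDataFrame with an explicit schema. It assumes Spark 2.x or later with a local SparkSession; the word list is taken from the question, everything else is illustrative.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object WordsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("words-df")
      .getOrCreate()

    // One Row per word, mirroring the one-element tuples in the Python version.
    val rows = Seq("cat", "elephant", "rat", "rat", "cat").map(word => Row(word))

    // The schema names the single column and fixes its type.
    val schema = StructType(Seq(StructField("word", StringType, nullable = false)))

    val wordsDF = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    assert(wordsDF.columns.sameElements(Array("word")))
    assert(wordsDF.count() == 5)

    wordsDF.show()
    spark.stop()
  }
}
```

For a single string column, Seq("cat", "elephant").toDF("word") with spark.implicits._ imported is shorter; the explicit schema is mostly useful when the types or nullability need to be controlled.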
To create a DataFrame, you need to create a SQLContext.
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame; after importing it you can use the .toDF method
import sqlContext.implicits._
Now you can create DataFrames:
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
learn more about creating dataframes here