Creating an empty DF and adding a column is NOT working - Scala

I'm trying to create an empty dataframe and append a new column. I tried two options. Option A works but Option B does not. Please help!
Option A:
var initialDF1 = Seq(("test")).toDF("M")
initialDF1 = initialDF1.withColumn("P", lit("P"))
initialDF1.show
+----+---+
| M| P|
+----+---+
|test| P|
+----+---+
Option B: (Not working)
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
val schema = StructType(List(StructField("N", StringType, true)))
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
initialDF = initialDF.withColumn("P", lit("P"))
initialDF.show
+---+---+
| N| P|
+---+---+
+---+---+

It is working as intended: withColumn only changes the schema and sets a value (a lit or some other expression) for rows that already exist, so it is only applied to existing rows. In your second case you created an empty dataframe; withColumn iterates over it and would add "P" to every existing row, of which there are none.
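If the goal is to actually see a value in the new column, the dataframe needs at least one row before calling withColumn. A minimal sketch, reusing the schema from Option B and assuming the same spark session and sc as in the question (the union is just one way to add a row):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(List(StructField("N", StringType, true)))
val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

// Give the dataframe a row first, so the literal column has something to attach to
val oneRowDF = spark.createDataFrame(sc.parallelize(Seq(Row("test"))), schema)
emptyDF.union(oneRowDF).withColumn("P", lit("P")).show()
// Shows a single row with N = "test" and P = "P"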

Related

How to add header and column to dataframe spark?

I have got a dataframe to which I want to add a header and a first column manually. Here is the dataframe:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema",true).csv("C:\\gg.csv").cache()
The content of the dataframe:
12,13,14
11,10,5
3,2,45
The expected output is
define,col1,col2,col3
c1,12,13,14
c2,11,10,5
c3,3,2,45
What you want to do is:
df.withColumn("columnName", column) //here "columnName" should be "define" for you
Now you just need to create the said column (this might help)
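One way to build such a column, as a rough sketch rather than a definitive answer, is to derive the label from a generated row id. Note that monotonically_increasing_id gives unique, increasing but not necessarily consecutive ids, so on a multi-partition dataframe the labels may not come out exactly as c1, c2, c3:
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

// Add a generated "define" label column to the existing dataframe
val withDefine = df.withColumn(
  "define",
  concat(lit("c"), (monotonically_increasing_id() + 1).cast("string"))
)
withDefine.show()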
Here is a solution that depends on Spark 2.4:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.Row
// First, the dataframe needs to be loaded with the expected schema
val spark = SparkSession.builder().appName("my-spark-app").getOrCreate()
val schema = new StructType()
  .add("col1", IntegerType, true)
  .add("col2", IntegerType, true)
  .add("col3", IntegerType, true)
val df = spark.read.format("csv").schema(schema).load("C:\\gg.csv").cache()
val rddWithId = df.rdd.zipWithIndex
// Prepend a "define" column (a String label built from the row index)
val newSchema = StructType(Array(StructField("define", StringType, false)) ++ df.schema.fields)
val dfZippedWithId = spark.createDataFrame(rddWithId.map {
  case (row, index) => Row.fromSeq(Array("c" + index) ++ row.toSeq)
}, newSchema)
// Show results
dfZippedWithId.show
Displays:
+------+----+----+----+
|define|col1|col2|col3|
+------+----+----+----+
| c0| 12| 13| 14|
| c1| 11| 10| 5|
| c2| 3| 2| 45|
+------+----+----+----+
This is a mix of the documentation here and this example.

Overwrite Spark dataframe schema

LATER EDIT:
Based on this article it seems that Spark cannot edit an RDD or column. A new one has to be created with the new type and the old one dropped. The for loop and .withColumn method suggested below seem to be the easiest way to get the job done.
ORIGINAL QUESTION:
Is there a simple way (for both human and machine) to convert multiple columns to a different data type?
I tried to define the schema manually, then load the data from a parquet file using this schema and save it to another file but I get "Job aborted."..."Task failed while writing rows" every time and on every DF. Somewhat easy for me, laborious for Spark ... and it does not work.
Another option is using:
df = df.withColumn("new_col", df("old_col").cast(type)).drop("old_col").withColumnRenamed("new_col", "old_col")
A bit more work for me as there are close to 100 columns and, if Spark has to duplicate each column in memory, then that doesn't sound optimal either. Is there an easier way?
Depending on how complicated the casting rules are, you can accomplish what you are asking with this loop:
scala> var df = Seq((1,2),(3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> df.columns.foreach{ c => df = df.withColumn(c, df(c).cast(DoubleType)) }
scala> df.show
+---+---+
| a| b|
+---+---+
|1.0|2.0|
|3.0|4.0|
+---+---+
This should be as efficient as any other column operation.
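If the mutable var bothers you, a rough equivalent sketch (not from the original answer) casts every column in a single select instead of looping:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// One projection that casts all columns to DoubleType, keeping the original names
val casted = df.select(df.columns.map(c => col(c).cast(DoubleType).as(c)): _*)
casted.show()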

How to handle the null/empty values on a dataframe Spark/Scala

I have a CSV file and I am processing its data.
I am working with data frames, and I calculate average, min, max, mean, sum of each column based on some conditions. The data of each column could be empty or null.
I have noticed that in some cases I got a null value as max or sum instead of a number, or max() returned a number that is less than what min() returns.
I do not want to replace the null/empty values with other.
The only thing I have done is to use these 2 options in CSV:
.option("nullValue", "null")
.option("treatEmptyValuesAsNulls", "true")
Is there any way to handle this issue? Has anyone faced this problem before? Is it a problem of data types?
I run something like this:
data.agg(mean("col_name"), stddev("col_name"),count("col_name"),
min("col_name"), max("col_name"))
Otherwise I can consider that it is a problem in my code.
I have done some research on this question, and the results show that the mean, max, and min functions ignore null values. Below are the experiment code and results.
Environment: Scala, Spark 1.6.1, Hadoop 2.6.0
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.{max, mean, min}
import sqlContext.implicits._ // for the $"column" syntax
val row1 =Row("1", 2.4, "2016-12-21")
val row2 = Row("1", None, "2016-12-22")
val row3 = Row("2", None, "2016-12-23")
val row4 = Row("2", None, "2016-12-23")
val row5 = Row("3", 3.0, "2016-12-22")
val row6 = Row("3", 2.0, "2016-12-22")
val theRdd = sc.makeRDD(Array(row1, row2, row3, row4, row5, row6))
val schema = StructType(StructField("key", StringType, false) ::
StructField("value", DoubleType, true) ::
StructField("d", StringType, false) :: Nil)
val df = sqlContext.createDataFrame(theRdd, schema)
df.show()
df.agg(mean($"value"), max($"value"), min($"value")).show()
df.groupBy("key").agg(mean($"value"), max($"value"), min($"value")).show()
Output:
+---+-----+----------+
|key|value| d|
+---+-----+----------+
| 1| 2.4|2016-12-21|
| 1| null|2016-12-22|
| 2| null|2016-12-23|
| 2| null|2016-12-23|
| 3| 3.0|2016-12-22|
| 3| 2.0|2016-12-22|
+---+-----+----------+
+-----------------+----------+----------+
| avg(value)|max(value)|min(value)|
+-----------------+----------+----------+
|2.466666666666667| 3.0| 2.0|
+-----------------+----------+----------+
+---+----------+----------+----------+
|key|avg(value)|max(value)|min(value)|
+---+----------+----------+----------+
| 1| 2.4| 2.4| 2.4|
| 2| null| null| null|
| 3| 2.5| 3.0| 2.0|
+---+----------+----------+----------+
From the output you can see that the mean, max, and min functions on column 'value' for group key='1' return '2.4' instead of null, which shows that the null values were ignored by these functions. However, if a column contains only null values then these functions return null.
Contrary to one of the comments, it is not true that nulls are ignored. Here is an approach:
max(coalesce(col_name,Integer.MinValue))
min(coalesce(col_name,Integer.MaxValue))
This will still have an issue if there were only null values: you will need to convert Min/MaxValue to null or whatever you want to use to represent "no valid/non-null entries".
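As a rough Scala sketch of the pseudocode above, using the df with the nullable "value" column from the previous answer (the Int sentinels are illustrative and should match the column's actual type and range):
import org.apache.spark.sql.functions.{coalesce, lit, max, min}

// Substitute sentinel values for nulls before aggregating
df.groupBy("key").agg(
  max(coalesce($"value", lit(Int.MinValue))).alias("max_value"),
  min(coalesce($"value", lit(Int.MaxValue))).alias("min_value")
).show()
// Groups that contain only nulls now report the sentinels, so map
// Int.MinValue / Int.MaxValue back to null (or another marker) afterwards.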
To add to other answers:
Remember that null and NaN are different things to Spark:
NaN is "not a number", and numeric aggregations on a column that contains NaN result in NaN.
null is a missing value, and numeric aggregations on a column that contains null ignore it as if the row weren't there.
# PySpark example; assumes numpy imported as np and pyspark.sql.functions as F
import numpy as np
from pyspark.sql import functions as F

df_ = spark.createDataFrame([(1, np.nan), (None, 2.0), (3, 4.0)], ("a", "b"))
df_.show()
+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|2.0|
|   3|4.0|
+----+---+
df_.agg(F.mean("a"), F.mean("b")).collect()
[Row(avg(a)=2.0, avg(b)=nan)]
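If you want NaN to be treated like a null in Scala aggregations, here is one sketch, assuming a DataFrame df with a double column "b" that may contain NaN (when without an otherwise yields null for non-matching rows):
import org.apache.spark.sql.functions.{col, isnan, mean, when}

// Replace NaN with null so the aggregation skips it like any other missing value
val cleaned = df.withColumn("b", when(!isnan(col("b")), col("b")))
cleaned.agg(mean("b")).show()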

How to convert a dataframe column to sequence

I have a dataframe as below:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe by taking all terms as a sequence, so that I can use them with Word2vec. That is:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
As a result I want to apply this sample code as given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec
So far I have tried to convert the df to an RDD and map it, but then I could not manage to convert it back to a df.
Thanks in advance.
EDIT:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
val df = sqlContext.load("jdbc",Map(
"url" -> "jdbc:oracle:thin:...",
"dbtable" -> "table"))
df.show(20)
df.groupBy($"label").agg(collect_list($"term").alias("term"))
You can use collect_list or collect_set functions:
import org.apache.spark.sql.functions.{collect_list, collect_set}
df.groupBy($"label").agg(collect_list($"term").alias("term"))
In Spark < 2.0 it requires HiveContext, and in Spark 2.0+ you have to enable Hive support in the SparkSession builder. See Use collect_list and collect_set in Spark SQL.
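Since the end goal is Word2vec, here is a rough follow-up sketch along the lines of the linked ml-features example; the vector size and min count values are just illustrative, not tuned:
import org.apache.spark.ml.feature.Word2Vec

// collect_list produces an array<string> column, which is the input type Word2Vec expects
val grouped = df.groupBy($"label").agg(collect_list($"term").alias("term"))

val word2Vec = new Word2Vec()
  .setInputCol("term")
  .setOutputCol("features")
  .setVectorSize(3)
  .setMinCount(0)

val model = word2Vec.fit(grouped)
model.transform(grouped).show()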

How to add columns into org.apache.spark.sql.Row inside of mapPartitions

I am a newbie at Scala and Spark, please keep that in mind :)
Actually, I have three questions:
How should I define a function to pass into df.rdd.mapPartitions if I want to create a new Row with a few additional columns?
How can I add a few columns to a Row object (or create a new one)?
How do I create a DataFrame from the resulting RDD?
Thank you in advance.
Usually there should be no need for that and it is better to use UDFs, but here you are:
How should I define a function to pass into df.rdd.mapPartitions if I want to create a new Row with a few additional columns?
It should take an Iterator[Row] and return an Iterator[T], so in your case you should use something like this:
import org.apache.spark.sql.Row
def transformRows(iter: Iterator[Row]): Iterator[Row] = ???
How can I add a few columns to a Row object (or create a new one)?
There are multiple ways of accessing Row values, including the Row.get* methods, Row.toSeq, etc. A new Row can be created using Row.apply, Row.fromSeq, Row.fromTuple, or RowFactory. For example:
def transformRow(row: Row): Row = Row.fromSeq(row.toSeq ++ Array[Any](-1, 1))
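For completeness, a small sketch of the accessors and constructors mentioned above, on a standalone Row (values chosen arbitrarily):
import org.apache.spark.sql.Row

val row = Row(1.0, 2.0)                       // Row.apply
val first = row.getDouble(0)                  // positional getter
val values = row.toSeq                        // all values as Seq[Any]
val extended = Row.fromSeq(values :+ "extra") // rebuild with an appended value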
How do I create a DataFrame from the resulting RDD?
If you have an RDD[Row] you can use SQLContext.createDataFrame and provide a schema.
Putting this all together:
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val df = sc.parallelize(Seq(
(1.0, 2.0), (0.0, -1.0),
(3.0, 4.0), (6.0, -2.3))).toDF("x", "y")
def transformRows(iter: Iterator[Row]): Iterator[Row] = iter.map(transformRow)
val newSchema = StructType(df.schema.fields ++ Array(
StructField("z", IntegerType, false), StructField("v", IntegerType, false)))
sqlContext.createDataFrame(df.rdd.mapPartitions(transformRows), newSchema).show
// +---+----+---+---+
// | x| y| z| v|
// +---+----+---+---+
// |1.0| 2.0| -1| 1|
// |0.0|-1.0| -1| 1|
// |3.0| 4.0| -1| 1|
// |6.0|-2.3| -1| 1|
// +---+----+---+---+