How to create DataFrame with nulls using toDF? - scala

How do you create a dataframe containing nulls from a sequence using .toDF ?
This works:
val df = Seq((1,"a"),(2,"b")).toDF("number","letter")
but I'd like to do something along the lines of:
val df = Seq((1, NULL),(2,"b")).toDF("number","letter")

In addition to Ramesh's answer, it's worth noting that since toDF uses reflection to infer the schema, it's important for the provided sequence to have the correct type. And if Scala's type inference isn't enough, you need to specify the type explicitly.
For example, if you want the 2nd column to be a nullable integer, then neither of the following works:
Seq((1, null)) has inferred type Seq[(Int, Null)]
Seq((1, null), (2, 2)) has inferred type Seq[(Int, Any)]
In this case you need to explicitly specify the type for the 2nd column. There are at least two ways to do it. You can explicitly specify the generic type for the sequence:
Seq[(Int, Integer)]((1, null)).toDF
or create a case class for the row:
case class MyRow(x: Int, y: Integer)
Seq(MyRow(1, null)).toDF
Note that I used Integer instead of Int, as the latter, being a primitive type, cannot accommodate nulls.
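To make the difference concrete, here is a minimal sketch (not from the original answer; it assumes a SparkSession named spark with spark.implicits._ imported) showing the schema the case-class approach produces:
import spark.implicits._

// y is a boxed java.lang.Integer, so it can hold null; x is a primitive Int and cannot
case class MyRow(x: Int, y: Integer)

val df = Seq(MyRow(1, null), MyRow(2, 2)).toDF("number", "value")
df.printSchema()
// root
//  |-- number: integer (nullable = false)
//  |-- value: integer (nullable = true)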

NULL is not defined anywhere in the API, but null is, so you can write:
val df2 = Seq((1, null), (2, "b")).toDF("number","letter")
And you should get output like:
+------+------+
|number|letter|
+------+------+
|1     |null  |
|2     |b     |
+------+------+
The trick is to include at least one non-null value in the column with nulls, so Spark SQL can infer the type it should use.
The following then won't work:
val df = Seq((1, null)).toDF("number","letter")
Spark has no way of knowing what the type of letter is in this case.
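If you only have a single row, a hedged workaround (the same idea as the explicit-type answer above) is to fix the element type of the sequence yourself so Spark doesn't have to infer it from the values:
// Sketch, assuming spark.implicits._ is in scope: the explicit (Int, String)
// element type tells Spark the schema even though the only letter value is null.
val df = Seq[(Int, String)]((1, null)).toDF("number", "letter")
df.show()
// +------+------+
// |number|letter|
// +------+------+
// |     1|  null|
// +------+------+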

Related

Collecting two values from a DataFrame, and using them as parameters for a case class; looking for less verbose solution

I've got some data in Spark, result: DataFrame = ..., where two integer columns are of interest: week and year. The values of these columns are identical for all rows.
I want to extract these two integer values, and pass them as parameters to create a WeekYear:
case class WeekYear(week: Int, year: Int)
Below is my current solution, but I'm thinking there must be a more elegant way to do this. How can this be done without the intermediate step of creating temp?
val temp = result
  .select("week", "year")
  .first
  .toSeq
  .map(_.toString.toInt)
val resultWeekYear = WeekYear(temp(0), temp(1))
The best way to utilize a case class with DataFrames is to let Spark convert the DataFrame to a Dataset with the .as() method. As long as your case class has attributes that match all of the column names, it should work very easily.
case class WeekYear(week: Int, year: Int)
val df = spark.createDataset(Seq((1, 1), (2, 2), (3, 3))).toDF("week", "year")
val ds = df.as[WeekYear]
ds.show()
Which provides a Dataset[WeekYear] that looks like this:
+----+----+
|week|year|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
+----+----+
You can utilize some more complicated nested classes, but you have to start working with Encoders for that, so that Spark knows how to convert back and forth.
Spark does some implicit conversions, so ds may still look like a DataFrame, but it is actually a strongly typed Dataset[WeekYear], instead of a Dataset[Row] with arbitrary columns. You operate on it much like an RDD. Then just grab the .first() row and you'll already have the type you need.
val resultWeekYear = ds.first
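Putting it together for the original question, a minimal sketch (assuming result really has integer columns week and year, and spark.implicits._ is in scope) avoids the intermediate temp entirely:
import spark.implicits._

case class WeekYear(week: Int, year: Int)

// Select only the columns the case class needs, convert to a typed Dataset,
// and take the first row directly as a WeekYear.
val resultWeekYear: WeekYear = result
  .select("week", "year")
  .as[WeekYear]
  .first()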

Process all columns / the entire row in a Spark UDF

For a dataframe containing a mix of string and numeric datatypes, the goal is to create a new features column that is a minhash of all of them.
While this could be done by performing a dataframe.toRDD it is expensive to do that when the next step will be to simply convert the RDD back to a dataframe.
So is there a way to do a udf along the following lines:
val wholeRowUdf = udf( (row: Row) => computeHash(row))
Row is not a spark sql datatype of course - so this would not work as shown.
Update/clarification: I realize it is easy to create a full-row UDF that runs inside withColumn. What is not so clear is what can be used inside a Spark SQL statement:
val featurizedDf = spark.sql("select wholeRowUdf( what goes here? ) as features from mytable")
Row is not a spark sql datatype of course - so this would not work as shown.
I am going to show that you can use Row to pass all the columns, or selected columns, to a udf function by using the struct built-in function.
First, I define a DataFrame:
val df = Seq(
  ("a", "b", "c"),
  ("a1", "b1", "c1")
).toDF("col1", "col2", "col3")
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |a   |b   |c   |
// |a1  |b1  |c1  |
// +----+----+----+
Then I define a function that turns all the elements of a row into one string separated by , (standing in for your computeHash function):
import org.apache.spark.sql.Row
def concatFunc(row: Row) = row.mkString(", ")
Then I wrap it in a udf function:
import org.apache.spark.sql.functions._
def combineUdf = udf((row: Row) => concatFunc(row))
Finally, I call the udf function with withColumn, using the struct built-in function to combine the selected columns into one column that is passed to the udf function:
df.withColumn("contcatenated", combineUdf(struct(col("col1"), col("col2"), col("col3")))).show(false)
// +----+----+----+-------------+
// |col1|col2|col3|contcatenated|
// +----+----+----+-------------+
// |a |b |c |a, b, c |
// |a1 |b1 |c1 |a1, b1, c1 |
// +----+----+----+-------------+
So you can see that Row can be used to pass the whole row as an argument.
You can even pass all of the columns in a row without listing them explicitly:
val columns = df.columns
df.withColumn("contcatenated", combineUdf(struct(columns.map(col): _*)))
Updated
You can achieve the same with SQL queries too; you just need to register the udf function:
df.createOrReplaceTempView("tempview")
sqlContext.udf.register("combineUdf", combineUdf)
sqlContext.sql("select *, combineUdf(struct(`col1`, `col2`, `col3`)) as concatenated from tempview")
It will give you the same result as above
Now, if you don't want to hardcode the column names, you can select the column names you want and build them into a string:
val columns = df.columns.map(x => "`"+x+"`").mkString(",")
sqlContext.sql(s"select *, combineUdf(struct(${columns})) as concatenated from tempview")
I hope the answer is helpful
I came up with a workaround: drop the column names into any existing spark sql function to generate a new output column:
concat(${df.columns.tail.mkString(",'-',")}) as Features
In this case the first column in the dataframe is a target and was excluded. That is another advantage of this approach: the actual list of columns may be manipulated.
This approach avoids unnecessary restructuring of the RDD/dataframes.
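For illustration, here is a hedged sketch of that workaround end to end (assuming the DataFrame is registered as mytable and its first column is the target to exclude):
// Build an argument list like "col2,'-',col3" from all non-target columns.
val featureCols = df.columns.tail.mkString(",'-',")

// Splice it into plain SQL; concat is an existing Spark SQL function,
// so no UDF registration is needed.
val featurizedDf = spark.sql(s"select concat($featureCols) as Features from mytable")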

How do I combine two columns in a Spark SchemaRDD containing WrappedArrays into a 3rd column with the combined WrappedArray?

I have a DataFrame with two columns ( "features1" and "features2" ) containing WrappedArrays.
I need to combine the two columns into a third column containing the merged contents of the first two columns as a WrappedArray.
How do I do this?
I'm using Scala not PySpark
I didn't find another way than a udf, surprisingly
import org.apache.spark.sql.functions.udf

def catArray[A](a: Seq[A], b: Seq[A]): Seq[A] = a ++ b
val catArrayUdf = udf { catArray[Int] _ }
Then
scala> sc.parallelize(List((Seq(1,2), Seq(3,4))))
         .toDF("A", "B")
         .withColumn("cat", catArrayUdf('A, 'B))
         .show(false)
+------+------+------------+
|A |B |cat |
+------+------+------------+
|[1, 2]|[3, 4]|[1, 2, 3, 4]|
+------+------+------------+
Maybe there is a shorter way to define the UDF based on ++ though.
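As a hedged aside beyond the original answer: in Spark 2.4 and later the built-in concat function also accepts array columns, so the UDF can be avoided entirely:
import org.apache.spark.sql.functions.{col, concat}

// concat merges the two array columns into one combined array (Spark 2.4+).
val combined = df.withColumn("cat", concat(col("A"), col("B")))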

how to create an extra column in DataFrame based on a simple condition in spark

I have a dataframe and I'd like to add an extra column to it based on a simple condition, which basically says whether the value of another column is equal to a given string or not. I know I can create a UDF, register it and then use it, however I think there must be an easier way of doing it. This is the pseudocode of what I'm about to do:
df.withColumn("extra", if (col("a") == "str") 1 else 2)
You are pretty much there:
scala> val df = Seq((1,2), (3,3), (4,5)).toDF("a", "b")
scala> df.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  3|
|  4|  5|
+---+---+
scala> df.withColumn("New", when($"a" === $"b", "equal").otherwise("not")).show
+---+---+-----+
|  a|  b|  New|
+---+---+-----+
|  1|  2|  not|
|  3|  3|equal|
|  4|  5|  not|
+---+---+-----+
Note that you will need import org.apache.spark.sql.functions._ (for when) and import spark.implicits._ (for the $ syntax) for the above to work.
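Mapped back onto the original pseudocode, a hedged sketch (assuming a is a string column being compared to a literal) looks like:
import org.apache.spark.sql.functions.{col, when}

// 1 when column "a" equals the literal string, 2 otherwise
val withExtra = df.withColumn("extra", when(col("a") === "str", 1).otherwise(2))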

Scala spark Select as not working as expected

Hope someone can help. Fairly certain this is something I'm doing wrong.
I have a dataframe called uuidvar with 1 column called 'uuid' and another dataframe, df1, with a number of columns, one of which is also 'uuid'. I would like to select from df1 all of the rows which have a uuid that appears in uuidvar. Now, having the same column names is not ideal, so I tried to do it with
val uuidselection=df1.join(uuidvar, df1("uuid") === uuidvar("uuid").as("another_uuid"), "right_outer").select("*")
However when I show uuidselection I have 2 columns called "uuid". Furthermore, if I try and select the specific columns I want, I am told
cannot resolve 'uuidvar' given input columns
or similar depending on what I try and select.
I have tried to make it simpler and just do
val uuidvar2=uuidvar.select("uuid").as("uuidvar")
and this doesn't rename the column in uuidvar.
Does 'as' not operate as I am expecting it to, am I making some other fundamental error or is it broken?
I'm using Spark 1.5.1 and Scala 2.10.
Answer
You can't use as when specifying the join criterion.
First, use withColumnRenamed to rename the column before the join.
Second, use the generic col function for accessing columns by name (instead of using the dataframe's apply method, e.g. df1(<columnname>)):
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

case class UUID1(uuid: String)
case class UUID2(uuid: String, b: Int)

class UnsortedTestSuite2 extends SparkFunSuite {

  configuredUnitTest("SO - uuid") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val uuidvar = sc.parallelize(Seq(
      UUID1("cafe-babe-001"),
      UUID1("cafe-babe-002"),
      UUID1("cafe-babe-003"),
      UUID1("cafe-babe-004")
    )).toDF()

    val df1 = sc.parallelize(Seq(
      UUID2("cafe-babe-001", 1),
      UUID2("cafe-babe-002", 2),
      UUID2("cafe-babe-003", 3)
    )).toDF()

    val uuidselection = df1.join(
      uuidvar.withColumnRenamed("uuid", "another_uuid"),
      col("uuid") === col("another_uuid"),
      "right_outer")
    uuidselection.show()
  }
}
delivers
+-------------+----+-------------+
|         uuid|   b| another_uuid|
+-------------+----+-------------+
|cafe-babe-001|   1|cafe-babe-001|
|cafe-babe-002|   2|cafe-babe-002|
|cafe-babe-003|   3|cafe-babe-003|
|         null|null|cafe-babe-004|
+-------------+----+-------------+
Comment
.select("*") does not have any effect. So
df.select("*") =^= df
I've always used the withColumnRenamed API to rename columns.
Take this table as an example:
| Name | Age |
df.withColumnRenamed("Age", "newAge").show()
| Name | newAge |
So to make it work with your code, something like this should work:
val uuidvar_another = uuidvar.withColumnRenamed("uuid", "another_uuid")
val uuidselection = df1.join(uuidvar_another, df1("uuid") === uuidvar_another("another_uuid"), "right_outer").select("*")
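As a further hedged alternative not shown in the answers above, you can alias the DataFrames themselves (rather than the columns) and disambiguate with qualified names after the join:
import org.apache.spark.sql.functions.col

// Alias each DataFrame, then refer to columns through the alias.
val joined = df1.as("d1")
  .join(uuidvar.as("uv"), col("d1.uuid") === col("uv.uuid"), "right_outer")
  .select(col("d1.uuid"), col("d1.b"), col("uv.uuid").as("another_uuid"))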