How to handle nulls in SparkSQL Dataframes - scala

This is the code that I am following:
val ebayds = sc.textFile("/user/spark/xbox.csv")
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Int, openbid: Float, price: Float)
val ebay = ebayds.map(a=>a.split(",")).map(p=>Auction(p(0),p(1).toFloat,p(2).toFloat,p(3),p(4).toInt,p(5).toFloat,p(6).toFloat)).toDF()
ebay.select("auctionid").distinct.count
The error that I am getting is:
For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

Use DataFrameNaFunctions
DataFrame fill(double value)
Returns a new DataFrame that replaces null values in numeric columns with value.
DataFrame fill(double value, scala.collection.Seq<String> cols)
(Scala-specific) Returns a new DataFrame that replaces null values in the specified numeric columns.
Example usage:
df.na.fill(0.0, Seq("your column name"))
Null values in that column will be replaced with 0.0 (or whatever default value you pass).
replace is also useful for replacing empty strings with default values:
public DataFrame replace(String col, java.util.Map replacement)
Replaces values matching keys in the replacement map with the corresponding values. Key and value of the replacement map must have the same type, and can only be doubles or strings. If col is "*", then the replacement is applied to all string columns or numeric columns.
import com.google.common.collect.ImmutableMap;
// Replaces all occurrences of 1.0 with 2.0 in column "height".
df.replace("height", ImmutableMap.of(1.0, 2.0));
// Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name".
df.replace("name", ImmutableMap.of("UNKNOWN", "unnamed"));
// Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns.
df.replace("*", ImmutableMap.of("UNKNOWN", "unnamed"));
Parameters: col - name of the column to apply the value replacement; replacement - value replacement map, as explained above.
Returns: (undocumented)
Since: 1.3.1
For example (keys and values in the map must have the same type, so an empty string is replaced with a string default):
df.na.replace("your column", Map("" -> "0.0"))

This worked for me and returned a DataFrame. Here A and B are column names, and "unknown" and 1.0 are the values used to fill nulls in those columns.
df.na.fill(Map("A" -> "unknown", "B" -> 1.0))
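Putting these together, here is a minimal sketch (the column names and defaults are illustrative, not from the question) that builds a small DataFrame containing nulls and applies both na.fill and na.replace. Note that these functions operate on an existing DataFrame; the NumberFormatException in the question is thrown earlier, when toFloat is called on an empty string, so empty fields have to be handled before or instead of that conversion.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data: a null price and an empty bidder name
val df = Seq(
  ("auction1", Some(10.5), "alice"),
  ("auction2", None, "")
).toDF("auctionid", "price", "bidder")

// Fill nulls in the numeric column, then replace empty strings in "bidder"
val cleaned = df
  .na.fill(Map("price" -> 0.0))
  .na.replace("bidder", Map("" -> "unknown"))

cleaned.show()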

Related

Comparing Column Object Values in Spark with Scala

I'm writing methods in Scala that take in Column arguments and return a column. Within them, I'm looking to compare the value of the columns (ranging from integers to dates) using logic similar to the below, but have been encountering an error message.
The lit() is for example purposes only. In truth I'm passing columns from a DataFrame.select() into a method to do computation. I need to compare using those columns.
val test1 = lit(3)
val test2 = lit(4)
if (test1 > test2) {
print("tuff")
}
Error message.
Error : <console>:96: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
if (test1 > test2) {
What is the correct way to compare Column objects in Spark? The column documentation lists the > operator as being valid for comparisons.
Edit: Here's a very contrived example of usage, assuming the columns passed into the function are dates that need to be compared for business reasons with a returned integer value that also has some business significance.
someDataFrame.select(
$"SomeColumn",
computedColumn($"SomeColumn1", $"SomeColumn2").as("MyComputedColumn")
)
Where computedColumn would be
def computedColumn(col1 : Column, col2: Column) : Column = {
val returnCol : Column = lit(0)
if (col1 > col2) {
returnCol = lit(4)
}
}
Except in actual usage there is a lot more if/else logic that needs to happen in computedColumn, with the final result being a returned Column that will be added to the select's output.
You can use when to do a conditional comparison:
someDataFrame.select(
$"SomeColumn",
when($"SomeColumn1" > $"SomeColumn2", 4).otherwise(0).as("MyComputedColumn")
)
If you prefer to write a function:
def computedColumn(col1 : Column, col2: Column) : Column = {
when(col1 > col2, 4).otherwise(0)
}
someDataFrame.select(
$"SomeColumn",
computedColumn($"SomeColumn1", $"SomeColumn2").as("MyComputedColumn")
)
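Since the real computedColumn involves more if/else logic, note that when calls can be chained to express else-if branches. A rough sketch (the extra condition and the returned values are illustrative):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

def computedColumn(col1: Column, col2: Column): Column =
  when(col1 > col2, 4)        // first branch
    .when(col1 === col2, 2)   // else-if branch
    .otherwise(0)             // final else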

How to extract the first digit from an integer in a transformer stage in IBM DataStage?

I have an incoming integer field and I want to extract its first digit. How can I do that? I cannot cast the field since the data is coming from a dataset. Is there a way to extract the first digit in the transformer stage in IBM DataStage?
Example:
Input:
ABC = 1234
Output: 1
Can anyone please help me with the same?
Thanks!
Use a transformer, define a stage variable as varchar, and use this formula to get the substring:
ABC[1,1]
Alternatively, you can convert your numeric value first by using DecimalToString.
You CAN convert to string within the context of your expression, and back again if the result needs to be an integer.
AsInteger(Left(ln_jn_ENCNTR_DTL.CCH, 1))
This solution has used implicit conversion from integer to string. It assumes that the value of CCH is always an integer.
If ABC has type int, you can define a stage variable of type char with length 1.
You then need to convert the number to a string first and use the Left function to extract the first character:
Left(DecimalToString(ABC), 1)
If you are getting ABC as a string, you can apply the Left function directly.
You can first define a stage variable (say SV) of varchar type, to convert the input integer column into a varchar.
Now assign the input integer column to stage variable SV and derive the output integer column as AsInteger(SV[1,1]).
That is: input integer => (type conversion to varchar) stage variable => substring [1,1] => conversion back to integer using AsInteger.
DecimalToString is an implicit conversion, so all you need is the Left() function. Left(MyString,1)

Why does $ not work with values of type String (and only with the string literals directly)?

I have the following object, which mimics an enumeration:
object ColumnNames {
val JobSeekerID = "JobSeekerID"
val JobID = "JobID"
val Date = "Date"
val BehaviorType = "BehaviorType"
}
Then I want to group a DF by a column. The following does not compile:
userJobBehaviourDF.groupBy($(ColumnNames.JobSeekerID))
If I change it to
userJobBehaviourDF.groupBy($"JobSeekerID")
It works.
How can I use $ and ColumnNames.JobSeekerID together to do this?
$ is a Scala feature called a string interpolator.
Starting in Scala 2.10.0, Scala offers a new mechanism to create strings from your data: String Interpolation. String Interpolation allows users to embed variable references directly in processed string literals.
Spark leverages string interpolators in Spark SQL to convert $"col name" into a column.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
scala> :type $"hello"
org.apache.spark.sql.ColumnName
ColumnName type is a subtype of Column type and that's why you can use $-prefixed strings as column references where values of Column type are expected.
import org.apache.spark.sql.Column
val c: Column = $"columnName"
scala> :type c
org.apache.spark.sql.Column
How can I use $ and ColumnNames.JobSeekerID together to do this?
You cannot.
You should either map the column names (in the "enumerator") to the Column type using $ directly (that would require changing their types to Column, as sketched below), or use the col or column functions when Columns are required.
col(colName: String): Column Returns a Column based on the given column name.
column(colName: String): Column Returns a Column based on the given column name.
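For example, here is a minimal sketch of the first option, where the enumeration holds Column values built with col (assuming userJobBehaviourDF is in scope as in the question):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

object ColumnNames {
  val JobSeekerID: Column = col("JobSeekerID")
  val JobID: Column = col("JobID")
  val Date: Column = col("Date")
  val BehaviorType: Column = col("BehaviorType")
}

userJobBehaviourDF.groupBy(ColumnNames.JobSeekerID)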
$s Elsewhere
What's interesting is that Spark MLlib uses $-prefixed strings for ML parameters, but in this case $ is just a regular method.
protected final def $[T](param: Param[T]): T = getOrDefault(param)
It's also worth mentioning that (another) $ string interpolator is used in Catalyst DSL to create logical UnresolvedAttributes that could be useful for testing or Spark SQL internals exploration.
import org.apache.spark.sql.catalyst.dsl.expressions._
scala> :type $"hello"
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
String Interpolator in Scala
The string interpolator feature works (is resolved to a proper value) at compile time, so either it is applied to a string literal or the code fails to compile.
$ is akin to the s string interpolator:
Prepending s to any string literal allows the usage of variables directly in the string.
Scala provides three string interpolation methods out of the box: s, f and raw and you can write your own interpolator as Spark did.
You can only use $ with string literals (values). If you want to use ColumnNames, you can do it as below:
userJobBehaviourDF.groupBy(userJobBehaviourDF(ColumnNames.JobSeekerID))
userJobBehaviourDF.groupBy(col(ColumnNames.JobSeekerID))
From the Spark Docs for Column, here are different ways of representing a column:
df("columnName") // On a specific `df` DataFrame.
col("columnName") // A generic column no yet associated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
Hope this helps!

Spark SQL change format of the number

After the show command, Spark prints the following:
+-----------------------+---------------------------+
|NameColumn |NumberColumn |
+-----------------------+---------------------------+
|name |4.3E-5 |
+-----------------------+---------------------------+
Is there a way to change NumberColumn format to something like 0.000043?
You can use the format_number function as follows:
import org.apache.spark.sql.functions.format_number
df.withColumn("NumberColumn", format_number($"NumberColumn", 5))
Here 5 is the number of decimal places you want to show.
Note that the format_number function returns a string column:
format_number(Column x, int d)
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
If you don't want the comma (,), you can call the regexp_replace function, which is defined as
regexp_replace(Column e, String pattern, String replacement)
Replace all substrings of the specified string value that match regexp with rep.
and use it as
import org.apache.spark.sql.functions.regexp_replace
df.withColumn("NumberColumn", regexp_replace(format_number($"NumberColumn", 5), ",", ""))
This removes the commas that format_number inserts for large numbers.
You can use cast operation as below:
val df = sc.parallelize(Seq(0.000043)).toDF("num")
df.createOrReplaceTempView("data")
spark.sql("select CAST (num as DECIMAL(8,6)) from data")
Adjust the precision and scale accordingly.
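If you prefer the DataFrame API over SQL, the same cast can be written roughly like this (a sketch; it assumes spark.implicits._ is imported and reuses the same illustrative precision and scale):
import org.apache.spark.sql.types.DecimalType

df.withColumn("num", $"num".cast(DecimalType(8, 6))).show()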
In newer versions of PySpark you can use the round() or bround() functions.
These functions return a numeric column and avoid the problem with ",".
It would look like:
df.withColumn("NumberColumn", bround("NumberColumn",5))

variable not binding value in Spark

I am passing a variable, but it is not passing its value.
I populate the variable value here:
val Temp = sqlContext.read.parquet("Tabl1.parquet")
Temp.registerTempTable("temp")
val year = sqlContext.sql("""select value from Temp where name="YEAR"""")
year.show()
Here year.show() displays the proper value.
I am passing the parameter in the code below:
val data = sqlContext.sql("""select count(*) from Table where Year='$year' limit 10""")
data.show()
The value year is a DataFrame, not a specific value (Int or Long). So when you use it inside a string interpolation, you get the result of DataFrame.toString, which isn't something you can compare values to (toString returns a string representation of the DataFrame's schema).
If you can assume the year DataFrame has a single Row with a single column of type Int, and you want to get the value of that column, you can use first().getAs[Int](0) to get that value and then use it to construct your next query:
val year: DataFrame = sqlContext.sql("""select value from Temp where name="YEAR"""")
// get the first column of the first row:
val actualYear: Int = year.first().getAs[Int](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")
If the value column in the Temp table has a different type (String, Long), just replace Int with that type.