Spark - Getting Type mismatch when assigning a string label to null values - scala

I have a dataset with a StringType column that contains nulls. I want to replace each null value with a string. I tried the following:
val renameDF = DF
  .withColumn("code", when($"code".isNull, lit("NON")).otherwise($"code"))
But I am getting the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN
(del.code IS NULL) THEN 'NON' ELSE del.code END' due to
data type mismatch: THEN and ELSE expressions should all be same type
or coercible to a common type;
How can I make the string literal's type compatible with the $"code" column?

This is weird; I just tried this snippet:
val df = Seq("yoyo","yaya",null).toDF("code")
df.withColumn("code", when($"code".isNull,lit("NON")).otherwise($"code")).show
And it works fine. Can you share your Spark version? Did you import the Spark implicits? Are you sure your column is a StringType?
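As a side note, here is a minimal sketch of two equivalent ways to do the same replacement, assuming DF is the frame from the question, "code" is a StringType column, and the Spark implicits are imported for the $ syntax:

import org.apache.spark.sql.functions.{coalesce, lit}

val renamedDF1 = DF.withColumn("code", coalesce($"code", lit("NON")))
val renamedDF2 = DF.na.fill("NON", Seq("code"))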

Related

pyspark nested when throws exception

I have a simple dataframe whose three most important columns are ID, CAT and SUBCAT.
I want to group results, including only rows with certain SUBCAT values.
Everything works fine with the code below:
subcategories = ["AA", "AB", "BA", "BB"]
df_grouped = df \
    .groupby("ID") \
    .agg(
        collect_set(when(col("SUBCAT").isin(subcategories),
                         struct(*[df[columnName] for columnName in restOfColumns]))))
When I add a "nested" when to differentiate the list of allowed SUBCATs per CAT:
df_grouped = df \
    .groupby("ID") \
    .agg(
        collect_set(when(col("SUBCAT").isin(
            when(col("CAT") == lit("A"), array([lit("AA"), lit("AB")]))
            .otherwise(array([lit("BA"), lit("BB")]))
        ), struct(*[df[columnName] for columnName in restOfColumns]))))
I start receiving this exception:
cannot resolve '(df.SUBCAT IN (CASE WHEN (df.CAT = 'A') THEN array('AA', 'AB') ELSE array('BA', 'BB') END))' due to data type mismatch: Arguments must be same type but were: string != array<string>
I have read similar topics here and people got similar errors, but not with the same type of query. Is such a "nested" when a limitation of PySpark, or is my query wrong?
In the first snippet, the string in the column is compared with the strings in the list.
But in the second case, since when only supports returning literals and column expressions (according to this documentation), you are returning an array of string literals from the nested when.
cannot resolve '(df.SUBCAT IN (CASE WHEN (df.CAT = 'A') THEN array('AA', 'AB') ELSE array('BA', 'BB') END))' due to data type mismatch: Arguments must be same type but were: string != array<string>
That is why it gives the above data type mismatch exception, which arises from comparing the string in the column with an array of strings.
In cases like this, it is better to use other methods rather than a nested when, as commented.
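For example, one way around the nested when (sketched here in Scala to match the rest of this page; the same boolean logic carries over to PySpark) is to express the per-category whitelist as a plain condition, assuming only categories A and B exist and restOfColumns is the Seq[String] of remaining column names from the question:

import org.apache.spark.sql.functions.{col, collect_set, struct, when}

val allowed =
  (col("CAT") === "A" && col("SUBCAT").isin("AA", "AB")) ||
  (col("CAT") === "B" && col("SUBCAT").isin("BA", "BB"))

val grouped = df
  .groupBy("ID")
  .agg(collect_set(when(allowed, struct(restOfColumns.map(col): _*))))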

In a Scala notebook on Apache Spark Databricks how do you correctly cast an array to type decimal(30,0)?

I am trying to cast an array as Decimal(30,0) for dynamic use in a select, as in:
WHERE array_contains(myArrayUDF(), someTable.someColumn)
However when casting with:
val arrIds = someData.select("id").withColumn("id", col("id")
.cast(DecimalType(30, 0))).collect().map(_.getDecimal(0))
Databricks accepts that, but the resulting signature already looks wrong:
intArrSurrIds: Array[java.math.BigDecimal] = Array(2181890000000, ...) // i.e., a BigDecimal
Which results in the below error:
Error in SQL statement: AnalysisException: cannot resolve.. due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<decimal(38,18)>, decimal(30,0)]
How do you correctly cast to decimal(30,0) in a Spark Databricks Scala notebook, instead of getting decimal(38,18)?
Any help appreciated!
You can make arrIds an Array[Decimal] using the code below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{Decimal, DecimalType}
val arrIds = someData.select("id")
.withColumn("id", col("id").cast(DecimalType(30, 0)))
.collect()
.map(row => Decimal(row.getDecimal(0), 30, 0))
However, it will not solve your problem, because the precision and scale are lost once you create your user-defined function, as I explain in this answer.
To solve your problem, you need to cast the column someTable.someColumn to a Decimal with the same precision and scale as the UDF's return type. So your WHERE clause should be:
WHERE array_contains(myArray(), cast(someTable.someColumn as Decimal(38, 18)))
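As an aside, a hedged sketch of why the UDF side comes back as decimal(38,18): when a UDF returns a Scala/Java BigDecimal (or Spark Decimal), Spark infers the default decimal(38,18) type for it, so the other side of the comparison has to be cast to match. Assuming arrIds is the Array[Decimal] built above and spark is the active SparkSession:

spark.udf.register("myArrayUDF", () => arrIds)

spark.sql("""
  SELECT * FROM someTable
  WHERE array_contains(myArrayUDF(), CAST(someTable.someColumn AS DECIMAL(38, 18)))
""")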

RandomForestClassifier for multiclass classification Spark 2.x

I'm trying to use a random forest for multiclass classification with Spark 2.1.1.
After defining my pipeline as usual, it fails during the indexing stage.
I have a dataframe with many string-typed columns. I have created a StringIndexer for each of them.
I am creating a Pipeline by chaining the StringIndexers with a VectorAssembler and finally a RandomForestClassifier, followed by a label converter.
I've checked all my columns with distinct().count() to make sure I do not have too many categories, and so on.
After some debugging, I understand that whenever the indexing of some of the columns starts, I get the following errors.
When calling:
val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
    labelToIndex(label)
  } else {
    throw new SparkException(s"Unseen label: $label.")
  }
}
Error evaluating method: 'labelToIndex'
Error evaluating method: 'labels'
Then inside the transformation, there is this error when defining the metadata:
Error evaluating method: org$apache$spark$ml$feature$StringIndexerModel$$labelToIndex
Method threw 'java.lang.NullPointerException' exception. Cannot evaluate org.apache.spark.sql.types.Metadata.toString()
This is happening because I have nulls in some of the columns that I'm indexing.
I could reproduce the error with the following example:
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq(("asd2s", "1e1e", 1.1, 0), ("asd2s", "1e1e", 0.1, 0),
      (null, "1e3e", 1.2, 0), ("bd34t", "1e1e", 5.1, 1),
      ("asd2s", "1e3e", 0.2, 0), ("bd34t", "1e2e", 4.3, 1))
).toDF("x0", "x1", "x2", "x3")

val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")
indexer.fit(df).transform(df).show
// java.lang.NullPointerException
https://issues.apache.org/jira/browse/SPARK-11569
https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala
The solution presented here can be used; as of Spark 2.2.0, the issue is fixed upstream.
You can use
df.na.fill(Map("colName1" -> val1, "colName2" -> val2, ...))
where df is your DataFrame, "colName" is the name of a column, and val is the value used to replace any nulls found in that column.
Apply the feature transformations after filling all the nulls.
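As a minimal sketch, using the reproduction DataFrame df from above: fill the null in the string column first, then fit the StringIndexer (the fill value "unknown" is just an illustrative placeholder):

import org.apache.spark.ml.feature.StringIndexer

val filled = df.na.fill(Map("x0" -> "unknown"))
val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")
indexer.fit(filled).transform(filled).show()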
You can check the number of nulls in each column as follows (note that comparing with === null never matches anything; use isNull instead, and isNaN only applies to float/double columns):
for (column <- df.columns) {
  println(column + ": " + df.filter(df(column).isNull).count())
}
OR
df.count() gives the total number of rows in the DataFrame; the number of nulls can then be judged from df.describe(), whose count statistic excludes nulls.
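As another sketch, the per-column null counts can also be computed in a single pass, assuming the frame is called df:

import org.apache.spark.sql.functions.{col, sum}

val nullCounts = df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*)
nullCounts.show()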

Datatype validation of Spark columns in for loop - Spark DataFrame

I'm trying to validate the datatypes of DataFrame columns before entering a loop in which I do a SQL calculation, but the datatype validation is not passing and execution never gets inside the loop. The operation needs to be performed only on numeric columns.
How can this be solved? Is this the right way to handle datatype validation?
// get the datatypes of the dataframe fields
val datatypes = parquetRDD_subset.schema.fields
// check if the datatype of a column is String and enter the loop for calculations
for (val_datatype <- datatypes if val_datatype.dataType == "StringType") {
  val dfs = x.map(field => spark.sql(s"select * from table"))
  val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
}
You are comparing the dataType to a string, which will never be true (for me, the compiler complains that they are unrelated). dataType is an object whose class is a subtype of org.apache.spark.sql.types.DataType.
Try replacing your for with
for (val_datatype <- datatypes if val_datatype.dataType.isInstanceOf[StringType])
In any case, your for loop does nothing but declare the vals; it never does anything with them.
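As a sketch of the datatype check itself, assuming the goal stated in the question is to pick out the numeric (or string) columns of parquetRDD_subset by name:

import org.apache.spark.sql.types.{NumericType, StringType}

val numericCols = parquetRDD_subset.schema.fields
  .collect { case f if f.dataType.isInstanceOf[NumericType] => f.name }

val stringCols = parquetRDD_subset.schema.fields
  .collect { case f if f.dataType == StringType => f.name }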

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n columns and I want to replace the empty strings in all of these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from replacing values with nulls; see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once, on the driver, and not per record. You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
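Since the question mentions n columns, the same expression can also be folded over every column, as a sketch that reuses the functions._ import above and assumes all of the columns are string-typed:

val cleanedDF = rawDF.columns.foldLeft(rawDF) { (df, c) =>
  df.withColumn(c, when(col(c) === "", lit(null)).otherwise(col(c)))
}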