I'm trying to use a random forest for multiclass classification with Spark 2.1.1.
After defining my pipeline as usual, it fails during the indexing stage.
I have a dataframe with many string-type columns, and I have created a StringIndexer for each of them.
I am creating a Pipeline by chaining the StringIndexers with a VectorAssembler and finally a RandomForestClassifier followed by a label converter.
I've checked all my columns with distinct().count() to make sure I do not have too many categories, and so on.
After some debugging, I understood that whenever the indexing of some of the columns starts, I get the following errors.
When calling:
val indexer = udf { label: String =>
  if (labelToIndex.contains(label)) {
    labelToIndex(label)
  } else {
    throw new SparkException(s"Unseen label: $label.")
  }
}
Error evaluating method: 'labelToIndex'
Error evaluating method: 'labels'
Then inside the transformation, there is this error when defining the metadata:
Error evaluating method: org$apache$spark$ml$feature$StringIndexerModel$$labelToIndex
Method threw 'java.lang.NullPointerException' exception. Cannot evaluate org.apache.spark.sql.types.Metadata.toString()
This is happening because I have nulls in some of the columns that I'm indexing.
I could reproduce the error with the following example.
val df = spark.createDataFrame(
  Seq(("asd2s", "1e1e", 1.1, 0), ("asd2s", "1e1e", 0.1, 0),
      (null, "1e3e", 1.2, 0), ("bd34t", "1e1e", 5.1, 1),
      ("asd2s", "1e3e", 0.2, 0), ("bd34t", "1e2e", 4.3, 1))
).toDF("x0", "x1", "x2", "x3")

val indexer = new StringIndexer().setInputCol("x0").setOutputCol("x0idx")
indexer.fit(df).transform(df).show
// java.lang.NullPointerException
https://issues.apache.org/jira/browse/SPARK-11569
https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala
The solution presented here can be used, and as of Spark 2.2.0 the issue is fixed upstream.
You can use
DataFrame.na.fill(Map("colName1" -> val1, "colName2" -> val2, ...))
where DataFrame is the DataFrame object, "colName1"/"colName2" are the column names, and val1/val2 are the values used to replace any nulls found in those columns.
Apply the feature transformations only after filling all the nulls.
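As a rough sketch of that fix applied to the reproduction above (df and indexer are the ones defined earlier; "missing" is just a placeholder label, any string that is not a real category would do):
val filledDf = df.na.fill(Map("x0" -> "missing", "x1" -> "missing"))
indexer.fit(filledDf).transform(filledDf).show
// the indexer now treats "missing" as just another category instead of hitting the NPE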
You can check the number of nulls in every column like this:
for (column <- DataFrame.columns) {
  println(column + ": " + DataFrame.filter(DataFrame(column).isNull || DataFrame(column).isNaN).count())
}
OR
DataFrame.count() gives the total number of rows in the DataFrame; the number of nulls can then be judged from DataFrame.describe().
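A minimal sketch of that route (using the same DataFrame placeholder as above): describe()'s "count" row excludes nulls, so comparing it with the total row count shows how many nulls each described column contains.
val totalRows = DataFrame.count()
DataFrame.describe().show()  // compare each column's "count" value against totalRows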
Related
I have this code:
val o = p_value.alias("d1").join(t_d.alias("d2"),
(col("d1.origin_latitude")===col("d2.origin_latitude")&&
col("d1.origin_longitude")===col("d2.origin_longitude")),"left").
filter(col("d2.origin_longitude").isNull)
val c = p_value2.alias("d3").join(o.alias("d4"),
(col("d3.origin_latitude")===col("d4.origin_latitude") &&
col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
filter(col("d3.origin_longitude").isNull)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
On this line
(col("d3.origin_latitude")===col("d4.origin_latitude") && col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
Any idea? Thank you.
You are aliasing the DataFrame, not the columns; the alias is only used to qualify column references within that DataFrame.
So the first join results in another DataFrame that contains the same column names twice (origin_latitude as well as origin_longitude). Once you try to access one of these columns in the resulting DataFrame, you get the ambiguity error.
So you need to make sure that the DataFrame contains each column only once.
You can rewrite the first join as below:
p_value
  .join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
  .filter(t_d.col("origin_longitude").isNull)
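If the goal of the null filter is simply to keep the p_value rows that have no match in t_d, a left anti join expresses that directly and sidesteps the duplicated columns altogether. A hedged sketch, assuming the same DataFrames as above and Spark 2.0+ (where the "left_anti" join type is available):
val o = p_value.join(t_d, Seq("origin_latitude", "origin_longitude"), "left_anti")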
I have the following code, with which df3 is created. I want to get the minimum value of distance_n and also the entire row containing that minimum value.
// it gives just the min value, but I want the entire row containing that min value
To get the entire row, I converted df3 to a table so that I can query it with spark.sql.
If I do this:
spark.sql("select latitude,longitude,speed,min(distance_n) from table1").show()
// it throws an error
and if I do this:
spark.sql("select latitude,longitude,speed,min(distance_nd) from table180").show()
// replacing distance_n with distance_nd, it also throws an error
How can I resolve this to get the entire row corresponding to the min value?
Before using a custom UDF in a SQL query, you have to register it in Spark's SQL context.
e.g.:
spark.sqlContext.udf.register("strLen", (s: String) => s.length())
After the UDF is registered, you can access it in your Spark SQL like this:
spark.sql("select strLen(some_col) from some_table")
Reference: https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
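For completeness, a hedged end-to-end sketch of that registration flow (the table and column names here are illustrative, not taken from the question):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf-demo").master("local[*]").getOrCreate()
import spark.implicits._

val someDF = Seq("a", "bb", "ccc").toDF("some_col")
someDF.createOrReplaceTempView("some_table")

// register the Scala function so SQL text can reference it by name
spark.udf.register("strLen", (s: String) => s.length())

spark.sql("select some_col, strLen(some_col) as len from some_table").show()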
I have a data frame with n columns and I want to replace the empty strings in all of them with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from replacing values with nulls; see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver, not per record. You want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column object:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
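Since the question mentions n columns, here is a hedged sketch that folds the same when/otherwise pattern over every string column (rawDF is the DataFrame from the question; restricting to StringType is an assumption, since only string columns can hold empty strings):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val cleanedDF = rawDF.schema.fields
  .filter(_.dataType == StringType)   // only touch string columns
  .map(_.name)
  .foldLeft(rawDF) { (df, c) =>
    df.withColumn(c, when(col(c) === "", lit(null)).otherwise(col(c)))
  }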
I am working with Spark 1.6.1 using Scala and facing an unusual issue. When creating a new column using an existing column created during the same execution, I get an "org.apache.spark.sql.AnalysisException".
WORKING:
val resultDataFrame = dataFrame.withColumn("FirstColumn",lit(2021)).withColumn("SecondColumn",when($"FirstColumn" - 2021 === 0, 1).otherwise(10))
resultDataFrame.printSchema()
NOT WORKING:
val resultDataFrame = dataFrame.withColumn("FirstColumn",lit(2021)).withColumn("SecondColumn",when($"FirstColumn" - max($"FirstColumn") === 0, 1).otherwise(10))
resultDataFrame.printSchema()
Here I am creating SecondColumn using the FirstColumn created during the same execution. The question is why it does not work when using avg/max functions. Please let me know how I can resolve this problem.
If you want to use aggregate functions together with "normal" columns, the functions should come after a groupBy or be used with a Window definition clause. Outside those cases they make no sense. Examples:
val result = df.groupBy($"col1").agg(max($"col2").as("max")) // This works
In the above case, the resulting DataFrame will have both "col1" and "max" as columns.
val minMax = df.select(min("col2"), max("col2"))
This works because there are only aggregate functions in the query. However, the following will not work:
val result = df.filter($"col1" === max($"col2"))
because I am trying to mix a non-aggregated column with an aggregated column.
If you want to compare a column with an aggregated value, you can try a join:
val maxDf = df.select(max("col2").as("maxValue"))
val joined = df.join(maxDf)
val result = joined.filter($"col1" === $"maxValue").drop("maxValue")
Or even use the simple value:
val maxValue = df.select(max("col2")).first.get(0)
val result = df.filter($"col1" === maxValue)
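For reference, a hedged sketch of the Window route mentioned above, applied to the original example: the aggregate is computed over an explicit window spanning the whole DataFrame, so it can sit next to normal columns. This assumes Spark 2.x (or a HiveContext on 1.6), that dataFrame is the one from the question, and that the usual implicits for $ are imported:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy()  // a single window covering the whole DataFrame
val resultDataFrame = dataFrame
  .withColumn("FirstColumn", lit(2021))
  .withColumn("SecondColumn", when($"FirstColumn" - max($"FirstColumn").over(w) === 0, 1).otherwise(10))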
In my Spark 1.6 application, I have some code to choose a partition and query only that partition. I do this using:
val rdd = df.rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator(), true)
val newDF = sqlContext.createDataFrame(rdd, df.schema)
If I then call a UDF together with a new mapPartitions call on a field, as in
newDF.withColumn("newField", myUDF(df("oldField"))).mapPartitions(...)
I get a
resolved attribute(s) oldField#36 missing from idField#51L,oldField#52 in operator !Project [idField#51L,oldField#52,UDF(oldField#36) AS newField#53];
To me it seems as if the field "oldField" is somehow present, but (maybe because I created a new DataFrame?) with a wrong id (compare oldField#52 and oldField#36). If I print the schemas of my old DataFrame and newDF, both look the same.
What can I do to avoid this error (except changing the order of the operations in the code, which I do not really like to do as the current structure seems pretty useful to me)?
Don't resolve the column through the old DataFrame, whose attribute ids no longer match the new plan. You can use the col function (from org.apache.spark.sql.functions):
newDF.withColumn("newField", myUDF(col("oldField")))
implicit conversions:
newDF.withColumn("newField", myUDF($"oldField"))
or the current DataFrame:
newDF.withColumn("newField", myUDF(newDF("oldField")))
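A hedged end-to-end sketch of the col-based variant (myUDF's body is an illustrative stand-in; newDF, "newField" and "oldField" are the names from the question):
import org.apache.spark.sql.functions.{col, udf}

// hypothetical UDF body, only to make the snippet self-contained
val myUDF = udf { s: String => if (s == null) null else s.toUpperCase }
val withNewField = newDF.withColumn("newField", myUDF(col("oldField")))
withNewField.printSchema()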