I have a modified version of the original dataframe, on which I did clustering.
Now I want to bring the predicted column back to the original DF (the index is OK, so it matches). How am I supposed to do this?
With this code I get an error.
println("Predicted:")
dfWithOutput.show
println("Original:")
originalDF = originalDF.withColumn("cluster", dfWithOutput.col("prediction"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) prediction#2121 missing from (list of columns in the original df)
You need to join the two dataframes and then select the columns you're interested in.
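For example, a minimal Scala sketch of that approach, assuming both dataframes share a key column (hypothetically called "id" here); if no such column exists, one can be added to both sides first, e.g. with monotonically_increasing_id:
// "id" is a hypothetical shared key column present in both dataframes.
val withCluster = originalDF
  .join(dfWithOutput.select("id", "prediction"), "id")
  .withColumnRenamed("prediction", "cluster")
withCluster.show()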
I have dataframe(s) on which Spark SQL queries will run. The problem is that the SQL query may reference columns that are not available in the dataframe, or, even if the column is present, some of the rows in that column may be null.
So the question is: with Spark 2.2 and Scala 2.12, how do I achieve this? An example of a nested column is a.b.c (where b can be a Map/Array), and the default values can be String/Int.
In the table above I have 5 columns; I selected 2 of them and saved the result as a new dataframe. When I try to retrieve information from an unselected column on the new dataframe, it returns a result instead of throwing an error saying the column is not present in the dataframe.
Sample Code:
df1 = df.select('id', 'subject1')
df1.filter('subject2' > 50).show()
The dataframe above doesn't have subject2, but it's returning a result instead of throwing an error. How do I completely drop a list of columns from memory?
Since PySpark is lazily evaluated, whenever you apply .select or .drop it is not performed right away; it is added to the DAG and only executed later, once an action is applied to the dataframe.
So you can also filter on the non-selected column, as long as no action has been applied to the dataframe.
As for the memory concern, Spark doesn't read anything into memory until an action is performed; it only creates the DAG, and only once you perform the action do things start getting loaded into memory.
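A minimal Scala sketch of this behavior (the dataframe and column names are hypothetical):
val selected = df.select("id", "subject1")      // transformation: only extends the plan, nothing runs
val filtered = selected.filter("subject1 > 50") // still just a plan, no data is read
filtered.show()                                 // action: only now does Spark execute the plan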
I cannot reproduce your case. In general, when you select some columns, only those will be available for filtering. You should get a type error saying the condition should be a string or Column. Please ensure that earlier in the code you didn't define subject2 as an object.
Also try:
df1 = df1.drop('subject2', 'subject3', 'subject4')
Hope this helps.
I have two dataframes aaa_01 and aaa_02 in Apache Spark 2.1.0.
I perform an inner join on these two dataframes, selecting a few columns from both to appear in the output.
The join works perfectly fine, but the output dataframe has the column names as they were present in the input dataframes. This is where I get stuck: I need new column names instead of the same column names in my output dataframe.
Sample code is given below for reference:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4")
I am getting the output dataframe with column names "col1, col2, col4". I tried to modify the code as below, but in vain:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select("a.col1","a.col2","b.col4" as "New_Col")
Any help is appreciated. Thanks in advance.
Edited
I browsed and found similar posts, which are given below, but I do not see an answer to my question.
Updating a dataframe column in spark
Renaming Column names of a Data frame in spark scala
The answers in this post: Spark Dataframe distinguish columns with duplicated name are not relevant to me, as it relates more to PySpark than Scala, and it explains how to rename all the columns of a dataframe, whereas my requirement is to rename only one or a few columns.
You want to rename columns of the dataset; the fact that your dataset comes from a join does not change anything. You can try any example from this answer, for instance:
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner")
.select("a.col1","a.col2","b.col4")
.withColumnRenamed("col4","New_col")
You can use .as to alias the columns, as in:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".as("first"),$"a.col2".as("second"),$"b.col4".as("third"))
Or you can use .alias, as in:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1".alias("first"),$"a.col2".alias("second"),$"b.col4".alias("third"))
If you are looking to rename only one column, then you can do:
import sqlContext.implicits._
DF1.alias("a").join(DF2.alias("b"),DF1("primary_col") === DF2("primary_col"), "inner").select($"a.col1", $"a.col2", $"b.col4".alias("third"))
I have a dataframe, and for each row I want to add new_col = max(some_column0) grouped by some other column1:
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
And on the second line I get an error:
AnalysisException: u'Detected cartesian product for INNER join between logical plans
Project ... Use the CROSS JOIN syntax to allow cartesian products between these relations.;'
What am I not understanding: why does Spark detect a cartesian product here?
A possible way to get around this error: I save the DF to a Hive table, then init the DF again as a select from that table (or replace these 2 lines with a Hive query; it doesn't matter). But I don't want to save the DF.
As described in Why does spark think this is a cross/cartesian join, it may be caused by:
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.
As for how the cartesian product was generated, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
Try to persist the dataframes before joining them. Worked for me.
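A minimal Scala sketch of that suggestion, reusing the names from the question (whether this helps can depend on the Spark version):
import org.apache.spark.sql.functions.max

val maxs = df0.groupBy("catalog")
  .agg(max("row_num").alias("max_num"))
  .withColumnRenamed("catalog", "catalogid")
  .persist()                                  // persist before joining, as suggested above

df0.persist()
df0.join(maxs, df0("catalog") === maxs("catalogid")).take(4)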
I've faced the same cartesian product problem with my join.
In order to overcome it, I used aliases on the DataFrames. See the example:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))
Is it possible to add a column to a DataFrame, and what would be the most efficient and neat method for doing so?
More specifically, the column may serve as row IDs for the existing DataFrame.
In a simplified case, reading from a file and not tokenizing it, I can think of something like the code below (in Scala), but it completes with errors (at line 3) and anyway doesn't look like the best possible route:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identifiers (basically, a sequence 1 to numRows) to any given data frame, so that row order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sc.textFile(file).                                    // read the file as an RDD of lines
  zipWithIndex().                                     // pair each line with its index
  map { case (d, i) => i.toString + delimiter + d }.  // prepend the index to the line
  map(_.split(delimiter)).
  map(s => Row.fromSeq(s.toSeq))                      // turn each split line into a Row
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in Spark API are withColumn and withColumnRenamed. According to Scala docs, the former Returns a new DataFrame by adding a column. In my opinion, this is a bit confusing and incomplete definition. Both of these functions can operate on this data frame only, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing dataframe to the shape you need, you can't use withColumn or withColumnRenamed for appending arbitrary columns (standalone columns or columns from other data frames).
As was commented above, the workaround may be to use a join - pretty messy, although possible - attaching unique keys like above with zipWithIndex to both data frames or columns might work; a rough sketch follows. Although the efficiency is ...
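A rough sketch of that join-based workaround, assuming two dataframes df1 and df2 with the same number of rows in the same order; all names are illustrative, and sqlContext is assumed to be in scope as in the other snippets:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a synthetic row index to a dataframe via zipWithIndex.
def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField("_row_idx", LongType, nullable = false))
  sqlContext.createDataFrame(indexed, schema)
}

// Join on the synthetic index to "append" df2's col to df1, then drop the index.
val appended = withRowIndex(df1)
  .join(withRowIndex(df2).select("_row_idx", "col"), "_row_idx")
  .drop("_row_idx")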
It's clear that appending a column to a data frame is not an easy piece of functionality in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance warnings.
Not sure if it works in Spark 1.3, but in Spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need to use a value that is not related to the existing columns of the dataframe.
This is similar to @NehaM's answer, but simpler.
I took help from the answer above. However, I found it incomplete for when we want to change a DataFrame, and the current APIs are a little different in Spark 1.6.
zipWithIndex() returns a tuple of (Row, Long) which contains each row and its corresponding index. We can use it to create a new Row according to our needs.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val rdd = df.rdd.zipWithIndex()
  .map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("Row number", StringType, true)).++(df.schema.fields))
sqlContext.createDataFrame(rdd, newstructure).show
I hope this will be helpful.
You can use row_number with a Window function, as below, to get a distinct ID for each row in a dataframe.
df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))
You can also use monotonically_increasing_id for the same purpose:
df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.