Update multiple column values of a dataframe unconditionally with different variables - scala

I have dataframe with some 10 columns. I have selected 4 columns out of these 10 and cleaned their values(by calling some external API and using its response). I would like to create new dataframe now (as old one cannot be updated) and update these 4 columns with its cleaned value(as returned by the API) and keep other 6 as is.
Tried exploring .na.replace and .withColumn but they all work on some condition for the columns.
val newdf = df.withColumn("col1", when(col("col1") === "XYZ", cleanedcol1)
.otherwise(col("col1")));
And
val newdf = df.na.replace("col1", Map("col1" -> cleanedcol1))
The first snippet matches col1 value with XYZ and then replaces it. I want unconditional change.
The second one actually tries to look for String "col1" for col1 column and hence does not replace anything.
What is the optimum approach to achieve this? The source of the df is Kafka and hence traffic would be fast.

You can have unconditional change with withCoulumn, just write
val newdf = df.withColumn("col1", newColumnValue)
Where newColumnValue is the new value you want to set for the column.

Related

drop all df2.columns from another df (pyspark.sql.dataframe.DataFrame specific)

I have a large DF (pyspark.sql.dataframe.DataFrame) that is a result of multiple joins, plus new columns being created by using a combination of inputs from different DFS, including DF2.
I want to drop all DF2 columns from DF after I'm done with the join/creating new columns based on DF2 input.
drop() doesn't accept list - only a string or a Column.
I know that df.drop("col1", "col2", "coln") will work but I'd prefer not to crowd the code (if I can) by listing those 20 columns.
Is there a better way of doing this in pyspark dataframe specifically?
drop_cols = df2.columns
df = df.drop(*drop_cols)

Spark Scala, grabbing the max value of 1 column, but keep all columns

I have a dataframe with 3 columns (customer, associations, timestamp).
I want to grab the latest customer by looking at timestamps.
Attempt
val rdd = readRdd.select(col("value"))
val val_columns = Seq("value.timestamp").map(x => last(col(x)).alias(x))
rdd.orderBy("value.timestamp")
.groupBy("value.customer")
.agg(val_columns.head, val_columns.tail: _*)
.show()
I believe the above code is working, but trying to figure out how to include all columns (ie. associations). If I understand correctly, adding it into the groupby would mean I'm grabbing the latest combination of customer and associations combined, but I only want to grab latest off the customer column and not look at multiple columns together.
Edit:
I might be onto something by adding:
val val_columns = Seq("value.lastRefresh", "value.associations")
.map(x => last(col(x)).alias(x))
Curious on thoughts.
If you want to return the latest customer data by the timestamp column, you can just order your dataframe by value.timestamp and apply limit(1):
import org.apache.spark.sql.functions._
df.orderBy(desc("value.timestamp")).limit(1).show()

Spark Persist method odd behavior

I am exploring spark persist function. It seems for some dataframe it is persisting whereas for others it is not, even though I have used the persist method on all the dataframes
Here is my code with explaination
// loading csv as dataframe and creating a view
val src_data=spark.read.option("header",true).csv("sources/data.csv")
src_data.createTempView("src_data")
**There is alreading a table called test in hive**
Here I am creating 3 dataframes using src and test and using persist on all 3 for later use
//dataframe 1
val changed_data= spark.sql("select sc.* from src_data sc inner join default.test t on sc.id=t.id where t.value!=sc.value or t.description!=sc.description ")
changed_data.persist().show()
changed_data.createOrReplaceTempView("changed_data")
// dataframe 2
val new_data= spark.sql("select * from src_data where id not in (select distinct id from default.test)")
println("new_data")
new_data.persist().show()
new_data.createOrReplaceTempView("new_data")
// dataframe 3
val unchanged_data= spark.sql("select * from test where id not in (select id from changed_data)")
unchanged_data.persist().show()
unchanged_data.createTempView("unchanged_data")
**then I truncate the table test**
spark.sql("truncate table test")
***Then i print the 3 dataframes I persisted***
new_data.show()
unchanged_data.show()
changed_data.show()
Before truncating test I can see data for all 3 dataframes using show but after I see only data for one dataframe....
I get data for only new_data(which is dataframe 2) eventhough I persisted all 3 dataframes and all three use table test??
Why this odd behaviour
The Dataframes will only be persisted if you invoke an action that goes through every record of the Dataframe.
Remember, that show() only shows the Top 20 rows as documented in the ScalaDocs.
In contrast, you could apply something like count() but that obviously will have some negative impact on your performance.

Spark: transform dataframe

I work with Spark 1.6.1 in Scala.
I have one dataframe, and I want to create different dataframe and only want to read 1 time.
For example one dataframe have two columns ID and TYPE, and I want to create two dataframe one with the value of type = A and other with type value = B.
I've checked another posts on stackoverflow, but found only the option to read the dataframe 2 times.
However, I would like another solution with the best performance possible.
Kinds regards.
Spark will read from the data source multiple times if you perform multiple actions on the data. The way to aviod this is to use cache(). In this way, the data will be saved to memory after the first action, which will make subsequent actions using the data faster.
Your two dataframes can be created in this way, requiring only one read of the data source.
val df = spark.read.csv(path).cache()
val dfA = df.filter($"TYPE" === "A").drop("TYPE")
val dfB = df.filter($"TYPE" === "B").drop("TYPE")
The "TYPE" column is dropped as it should be unnecessary after the separation.

Append a column to Data Frame in Apache Spark 1.3

Is it possible and what would be the most efficient neat method to add a column to Data Frame?
More specifically, column may serve as Row IDs for the existing Data Frame.
In a simplified case, reading from file and not tokenizing it, I can think of something as below (in Scala), but it completes with errors (at line 3), and anyways doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to DataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identificators (basically, a sequence 1 to numRows) to any given data frame, so the rows order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sqlContext.textFile(file).
zipWithIndex().
map(case(d, i)=>i.toString + delimiter + d).
map(_.split(delimiter)).
map(s=>Row.fromSeq(s.toSeq))
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in Spark API are withColumn and withColumnRenamed. According to Scala docs, the former Returns a new DataFrame by adding a column. In my opinion, this is a bit confusing and incomplete definition. Both of these functions can operate on this data frame only, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing dataframe to the shape you need, you can't use withColumn or withColumnRenamed for appending arbitrary columns (standalone or other data frames).
As it was commented above, the workaround solution may be to use a join - this would be pretty messy, although possible - attaching the unique keys like above with zipWithIndex to both data frames or columns might work. Although efficiency is ...
It's clear that appending a column to the data frame is not an easy functionality for distributed environment and there may not be very efficient, neat method for that at all. But I think that it's still very important to have this core functionality available, even with performance warnings.
not sure if it works in spark 1.3 but in spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need to use a value that is not related to existing columns of the dataframe
This is similar to #NehaM's answer but simpler
I took help from above answer. However, I find it incomplete if we want to change a DataFrame and current APIs are little different in Spark 1.6.
zipWithIndex() returns a Tuple of (Row, Long) which contains each row and corresponding index. We can use it to create new Row according to our need.
val rdd = df.rdd.zipWithIndex()
.map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("Row number", StringType, true)).++(df.schema.fields))
sqlContext.createDataFrame(rdd, newstructure ).show
I hope this will be helpful.
You can use row_number with Window function as below to get the distinct id for each rows in a dataframe.
df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))
You can also use monotonically_increasing_id for the same as
df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.