Spark Dataframe - Add new Column from List[String] - scala

I have a List[String] and want to add these Strings as column names to an existing DataFrame.
Is there a way to do it other than iterating over the List? If iterating over the List is the only way, how best can I achieve it?

Must be dumb... should have tried this before.
Got the answer after a little experimenting:
val test: DataFrame = useCaseTagField_l.foldLeft(ds_segments)((df, tag) => df.withColumn(tag._2, lit(null)))
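For reference, a minimal, self-contained sketch of the same foldLeft pattern; the session setup, column list, and starting DataFrame below are invented for illustration:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("add-columns").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder starting DataFrame and column list.
val base: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "value")
val newColumns = List("tag1", "tag2", "tag3")

// foldLeft threads the accumulated DataFrame through withColumn,
// so no mutable variable or explicit loop is needed. Casting the null
// literal gives the new columns a concrete StringType instead of NullType.
val withTags = newColumns.foldLeft(base)((df, name) => df.withColumn(name, lit(null).cast("string")))
withTags.printSchema()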

Related

How to apply Spark's col() to Array[String] containing the columns name?

If I have an Array[String] that contains the column names I need to use in the select() function, how can I apply them in the cleanest way?
.select(from_json(col("value").cast("string"), schema).as("data"), col("oneColumn"))
I'd like to put several columns with names from the array in the place of col("oneColumn").
The answers from here can't help me, as they deal with Lists of Strings, while I already have a Column object and can't pass a collection of columns as a parameter of select().
Preparing the list of columns:
val cols: List[Column] = headers.toList.map(name => col(name))
val cols1 = cols :+ from_json(col("value").cast("string"), schema).as("data")
and then
.select(cols1: _*)
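Putting it together, here is a self-contained sketch of that approach; the schema, headers, and input data below are placeholders rather than the asker's real ones:
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("select-cols").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder headers, schema, and data.
val headers = Array("oneColumn", "anotherColumn")
val schema = StructType(Seq(StructField("field", StringType)))
val df = Seq(("""{"field":"x"}""", "a", "b")).toDF("value", "oneColumn", "anotherColumn")

// Build the Column list, append the parsed JSON column, then expand with :_*
val cols: List[Column] = headers.toList.map(col)
val cols1 = cols :+ from_json(col("value").cast("string"), schema).as("data")
df.select(cols1: _*).show(false)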

Reverse contents of a field within a dataframe using scala

I'm using Scala.
I have a dataframe with millions of rows and multiple fields. One of the fields is a string field containing things like this:
"Snow_KC Bingfamilies Conference_610507"
How do I reverse the contents of just this field for all the rows in the dataframe?
Thanks.
Doing a quick search on the Scaladoc, I found this reverse function which does exactly that.
import org.apache.spark.sql.{functions => sqlfun}
val df1 = ...
val df2 = df1.withColumn("columnName", sqlfun.reverse($"columnName"))
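A runnable sketch of that answer, with an invented column name and a single placeholder row:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{functions => sqlfun}

val spark = SparkSession.builder().appName("reverse-field").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder data mirroring the question's example string.
val df1 = Seq("Snow_KC Bingfamilies Conference_610507").toDF("columnName")
// sqlfun.reverse flips the character order of every value in the column.
val df2 = df1.withColumn("columnName", sqlfun.reverse($"columnName"))
df2.show(false)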

Spark Scala Dynamic column selection from DataFrame

I have a DataFrame with different types of columns. From among those, I need to retrieve specific columns from that DataFrame.
A hard-coded DataFrame select statement would look like this:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"),
col("FEATURE_COL1"), col("FEATURE_COL2"), col("FEATURE_COL3"), col("FEATURE_COL4"))
Where LEBEL_COLUMN and FEATURE_COLs will be dynamic.
I have an Array (or Seq) for those FEATURE columns, like this:
val FEATURE_COL_ARR = Array("FEATURE_COL1","FEATURE_COL2","FEATURE_COL3","FEATURE_COL4")
I need to use this Array of columns with that SELECT statement in the 2nd part.
In the select, the 1st column will be the label (LEBEL_COLUMN) and the rest will be the dynamic list.
Can you please help me make the select statement work in Scala?
Note:
The sample code given below works, but I need to add the column array in the 2nd part of the SELECT:
val colNames = FEATURE_COL_ARR.map(name => col(name))
val logRegrDF = myDF.select(colNames:_*) // it is not the requirement
I was thinking the code for the 2nd part would be like this, but it is not working:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"), colNames:_*)
If I understand your question correctly, I hope this is what you are looking for:
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
result.select("LEBEL_COLUMN", allColumnsArr: _*)
.withColumnRenamed("LEBEL_COLUMN", "label")
Hope this helps!
Thanks a lot @Shankar.
Though your suggestion did not work as given, I got an idea from it and solved the issue this way:
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
val colNames = allColumnsArr.map(name => col(name))
myDF.select(colNames:_*).withColumnRenamed("LEBEL_COLUMN", "label")
It also works this way, without building Column objects:
result.select(LEBEL_COLUMN, FEATURE_COL_ARR: _*).withColumnRenamed(LEBEL_COLUMN, "label")
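A minimal, self-contained sketch of the accepted pattern; the column names and data below are invented:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("dynamic-select").master("local[*]").getOrCreate()
import spark.implicits._

// Placeholder DataFrame with a label column and feature columns.
val myDF = Seq((1.0, 0.1, 0.2)).toDF("LEBEL_COLUMN", "FEATURE_COL1", "FEATURE_COL2")
val FEATURE_COL_ARR = Array("FEATURE_COL1", "FEATURE_COL2")

// Prepend the label column name, map everything to Column, then expand.
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
val logRegrDF = myDF
  .select(allColumnsArr.map(col): _*)
  .withColumnRenamed("LEBEL_COLUMN", "label")
logRegrDF.printSchema()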

remove a column from a dataframe spark

I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?
Use the drop and withColumnRenamed methods.
Example:
val initialDf = ...
val dfAfterDrop = initialDf.drop("column1").drop("column2")
val dfAfterColRename = dfAfterDrop.withColumnRenamed("oldColumnName", "newColumnName")
Try this:
val initialDf = ...
val dfAfterDropCols = initialDf.drop("column1", "column2")
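For reference, a short runnable sketch of the multi-column drop; the DataFrame and column names are placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("drop-cols").master("local[*]").getOrCreate()
import spark.implicits._

val initialDf = Seq((1, "a", "b", "c")).toDF("id", "column1", "column2", "keep_me")

// drop accepts several column names in a single call (Spark 2.0+),
// avoiding chained .drop(...) calls.
val dfAfterDropCols = initialDf.drop("column1", "column2")
dfAfterDropCols.printSchema() // only "id" and "keep_me" remain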

How to convert DataFrame to RDD in Scala?

Can someone please share how one can convert a dataframe to an RDD?
Simply:
val rows: RDD[Row] = df.rdd
Use df.rdd.map(row => ...) if you want to map each Row to a different RDD element. For example,
df.rdd.map(row => (row(0), row(1)))
gives you a pair RDD where the first column of the df is the key and the second column of the df is the value.
I was just looking for my answer and found this post.
Jean's answer is absolutely correct. Adding on to that, "df.rdd" will return an RDD[Row]. I needed to apply split() once I got the RDD, so I had to convert the RDD[Row] to an RDD[String]:
val opt = spark.sql("select tags from cvs").map(x => x.toString).rdd
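A self-contained sketch tying both answers together; the table contents and column names below are invented stand-ins:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("df-to-rdd").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("scala,spark", 1), ("sql,rdd", 2)).toDF("tags", "id")

// 1) Straight conversion: each record becomes a generic Row.
val rows: RDD[Row] = df.rdd

// 2) Extract one string column and split it, yielding RDD[Array[String]].
val tagTokens: RDD[Array[String]] =
  df.select("tags").rdd.map(_.getString(0).split(","))
tagTokens.collect().foreach(arr => println(arr.mkString(" | ")))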