remove a column from a dataframe spark - scala

I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?

Use drop method and withColumnRenamed methods.
Example:
val initialDf= ....
val dfAfterDrop=initialDf.drop("column1").drop("coumn2")
val dfAfterColRename= dfAfterDrop.withColumnRenamed("oldColumnName","new ColumnName")

Try this:
val initialDf = ...
val dfAfterDropCols = initialDf.drop("column1", "coumn2")

Related

Reverse contents of a field within a dataframe using scala

I'm using scala.
I have a dataframe with millions of rows and multiple fields. One of the fields is a string field containing thing like this:
"Snow_KC Bingfamilies Conference_610507"
How do I reverse the contents of just this field for all the rows in the dataframe?
Thanks.
Doing a quick search on the Scaladoc, I found this reverse function which does exactly that.
import org.apache.spark.sql.{functions => sqlfun}
val df1 = ...
val df2 = df1.withColumn("columnName", sqlfun.reverse($"columnName"))

Filter Spark Dataframe with list of values in Scala

I am trying to create a dataframe from hive table using SparkSession like below. Once created I am filtering the rows by a list of Ids.
val myDF = spark.sql("select * from myhivetable")
val someDF = mfiDF.where(mfiDF("id").isin(myList:_*))
Instead of this approach is there a way I can query the hive table as below:
val myDF = spark.sql("select * from myhivetable").where (("id").isin(myList:_*))
When I try like this I am getting a compilation error.
Could someone suggest a best approach for this. Thanks.
You could also do an inner join to remove unwanted ids, something like below may work.
val ids = sc.parallelize(myList).toDF("id")
someDF.join(ids, ids.id === someDF.id)

spark scala reducekey dataframe operation

I'm trying to do a count in scala with dataframe. My data has 3 columns and I've already loaded the data and split by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in dataframe, and having some trouble on the syntax
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
spark needs to know the schema of the df
there are many ways to specify the schema, here is one option:
val df = file
.map(line=>line.split("\t"))
.map(l => (l(0), l(1).toInt)) //at this point spark knows the number of columns and their types
.toDF("a", "b") //give the columns names for ease of use
df
.groupby('a)
.count()

Spark scala copying dataframe column to new dataframe

I have an empty dataframe with schema already created.
I'm trying to add the columns to this dataframe from a new dataframe to the existing columns in a for loop.
k schema - |ID|DATE|REPORTID|SUBMITTEDDATE|
for(data <- 0 to range-1){
val c = df2.select(substring(col("value"), str(data)._2, str(data)._3).alias(str(data)._1)).toDF()
//c.show()
k = c.withColumn(str(data)._1, c(str(data)._1))
}
k.show()
But the k dataframe has just one column, but it should have all the 4 columns populated with values.
I think the last line in for loop is replacing exisitng columns in the dataframe.
Can somebody help me with this?
Thanks!!
Add your logic and conditions and create new dataframe
val dataframe2 = dataframe1.select("A","B",C)
Copying few columns of a dataframe to another one is not possible in spark.
Although there are few alternatives to attain the same
1. You need to join both the dataframe based on some join condition.
2. Convert bot the data frame to json and do RDD Union
val rdd = df1.toJSON.union(df2.toJSON)
val dfFinal = spark.read.json(rdd)

Spark Scala Dynamic column selection from DataFrame

I have a DataFrame which have different type of columns. Among those column, i need to retrieve specific column from that DataFrame.
Hard coded DataFrame select statement will be like this:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"),
col("FEATURE_COL1"), col("FEATURE_COL2"), col("FEATURE_COL3"), col("FEATURE_COL4"))
Where LEBEL_COLUMN and FEATURE_COLs will be dynamic.
I have Array or Seq for those FEATURE Columns like this:
val FEATURE_COL_ARR = Array("FEATURE_COL1","FEATURE_COL2","FEATURE_COL3","FEATURE_COL4")
I need to use this Array of column collection with that SELECT statement in the 2nd part.
In the select, 1st column will be one (LABEL_COLUMN) and rest will be dynamic list.
Can you please help me to make the select statement working in SCALA.
Note:
The sample code given bellow is working, but i need to add column array in the 2nd part of the SELECT
val colNames = FEATURE_COL_ARR.map(name => col(name))
val logRegrDF = myDF.select(colNames:_*) // it is not the requirement
I am thinking for 2nd part code will be like this, but it is not working:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"), colNames:_*)
If I understand your question, I hope this is what you are looking for
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
result.select("LEBEL_COLUMN", allColumnsArr: _*)
.withColumnRenamed("LEBEL_COLUMN", "label")
Hope this helps!
Thanks a lot #Shankar.
Though your given suggestion is not working, but i got an idea from your suggestion and solved the issue by this way
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
val colNames = allColumnsArr.map(name => col(name))
myDF.select(colNames:_*).withColumnRenamed("LEBEL_COLUMN", "label")
Also this way without creating DataFrame column:
result.select(LEBEL_COLUMN, FEATURE_COL_ARR: _*) .withColumnRenamed(LEBEL_COLUMN, "label")