Reverse contents of a field within a DataFrame using Scala

I'm using scala.
I have a dataframe with millions of rows and multiple fields. One of the fields is a string field containing things like this:
"Snow_KC Bingfamilies Conference_610507"
How do I reverse the contents of just this field for all the rows in the dataframe?
Thanks.

A quick search of the Scaladoc turns up the reverse function, which does exactly that:
import org.apache.spark.sql.{functions => sqlfun}
val df1 = ...
val df2 = df1.withColumn("columnName", sqlfun.reverse($"columnName"))
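For reference, here is a minimal self-contained sketch of that pattern; the local SparkSession setup and the column name "description" are assumptions for illustration.
import org.apache.spark.sql.{SparkSession, functions => sqlfun}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("reverse-example")
  .getOrCreate()
import spark.implicits._

val df1 = Seq("Snow_KC Bingfamilies Conference_610507").toDF("description")
// Overwrite the column with its reversed string contents
val df2 = df1.withColumn("description", sqlfun.reverse($"description"))
df2.show(false) // prints "705016_ecnerefnoC seilimafgniB CK_wonS"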

Related

scala: select column where not contains elements into dataframe

I have this line of code that should create a DataFrame from the list of table names that do not contain a given string. I tried this but it doesn't work:
val exemple = hiveObj.sql("show tables in database").select("tableName")!==="ABC".collect()
Try using the filter method:
import org.apache.spark.sql.functions._
import spark.implicits._
val exemple = hiveObj.sql("your query here").filter($"columnToFilter" =!= "ABC").show
NOTE: the inequality operator =!= is only available for Spark 2.0.0+. If you're using an older version, you must use !==. See the Spark documentation for details.
If you need to filter several columns you can do so:
.filter($"columnToFilter" =!= "ABC" and $"columnToFilter2" =!= "ABC")
Another alternative answer to my question:
val exemple1 = hiveObj.sql("show tables in database").filter(!$"tableName".contains("ABC")).show()
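Putting the two approaches together, a minimal sketch might look like this (it assumes a SparkSession named spark is in scope; the sample table names are invented for illustration):
import org.apache.spark.sql.functions._
import spark.implicits._

val tables = Seq("ABC", "sales_2020", "ABC_archive", "customers").toDF("tableName")

// Keep rows whose tableName is not exactly "ABC"
tables.filter($"tableName" =!= "ABC").show()

// Keep rows whose tableName does not contain "ABC" anywhere
tables.filter(!$"tableName".contains("ABC")).show()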

Spark Dataframe - Add new Column from List[String]

I have a List[String] and want to add each of these Strings as a column name to an existing DataFrame.
Is there a way to do it without iterating over the List? If iterating over the List is the only way, how best can I achieve it?
Must be dumb.. should have tried this before.
Got the answer after a little experimenting:
val test: DataFrame = useCaseTagField_l.foldLeft(ds_segments)((df, tag) => df.withColumn(tag, lit(null)))
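For illustration, a minimal sketch of that foldLeft pattern over a plain List[String] (the sample DataFrame and column names are made up; it assumes a SparkSession named spark is in scope):
import org.apache.spark.sql.functions.lit
import spark.implicits._

val base = Seq((1, "a"), (2, "b")).toDF("id", "value")
val newCols = List("col_x", "col_y", "col_z")

// Add each name in the list as a new column filled with nulls;
// casting gives the new columns a concrete type instead of NullType
val withNewCols = newCols.foldLeft(base)((df, name) => df.withColumn(name, lit(null).cast("string")))
withNewCols.printSchema()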

spark scala reducekey dataframe operation

I'm trying to do a count in scala with dataframe. My data has 3 columns and I've already loaded the data and split by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)
I want to put the data in a DataFrame, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use
df
  .groupBy("a")
  .count()
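A fuller sketch of the whole pipeline, assuming a tab-separated text file whose first column is the key and whose third column is an integer value (the file path is hypothetical, and a SparkSession named spark is assumed to be in scope):
import org.apache.spark.sql.functions.sum
import spark.implicits._

val df = spark.read.textFile("/path/to/data.tsv") // Dataset[String], one line per record
  .map { line =>
    val l = line.split("\t")
    (l(0), l(2).toInt)                            // keep the key and the integer column
  }
  .toDF("key", "value")

// Equivalent of reduceByKey(_ + _): sum the values per key
df.groupBy("key").agg(sum("value")).show()

// Or, if only the number of rows per key is needed:
df.groupBy("key").count().show()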

Spark Scala Dynamic column selection from DataFrame

I have a DataFrame which has different types of columns. Among those columns, I need to retrieve specific columns from that DataFrame.
A hard-coded DataFrame select statement would look like this:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"),
col("FEATURE_COL1"), col("FEATURE_COL2"), col("FEATURE_COL3"), col("FEATURE_COL4"))
Where LEBEL_COLUMN and FEATURE_COLs will be dynamic.
I have an Array or Seq for those FEATURE columns like this:
val FEATURE_COL_ARR = Array("FEATURE_COL1","FEATURE_COL2","FEATURE_COL3","FEATURE_COL4")
I need to use this Array of columns in the 2nd part of that select statement.
In the select, the first column will be the one (LEBEL_COLUMN) and the rest will come from the dynamic list.
Can you please help me make the select statement work in Scala?
Note:
The sample code given below works, but I need to add the column array in the 2nd part of the select:
val colNames = FEATURE_COL_ARR.map(name => col(name))
val logRegrDF = myDF.select(colNames:_*) // it is not the requirement
I am thinking the code for the 2nd part will be like this, but it is not working:
val logRegrDF = myDF.select(myDF("LEBEL_COLUMN").as("label"), colNames:_*)
If I understand your question, I hope this is what you are looking for
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
result.select("LEBEL_COLUMN", allColumnsArr: _*)
.withColumnRenamed("LEBEL_COLUMN", "label")
Hope this helps!
Thanks a lot @Shankar.
Though your suggestion does not work as given, I got an idea from it and solved the issue this way:
val allColumnsArr = "LEBEL_COLUMN" +: FEATURE_COL_ARR
val colNames = allColumnsArr.map(name => col(name))
myDF.select(colNames:_*).withColumnRenamed("LEBEL_COLUMN", "label")
Also this way, without first mapping the names to Column objects:
result.select(LEBEL_COLUMN, FEATURE_COL_ARR: _*) .withColumnRenamed(LEBEL_COLUMN, "label")
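For illustration, a minimal end-to-end sketch of that pattern (the sample data is invented; it assumes a SparkSession named spark is in scope):
import org.apache.spark.sql.functions.col
import spark.implicits._

val myDF = Seq((1.0, 0.1, 0.2, 0.3, 0.4))
  .toDF("LEBEL_COLUMN", "FEATURE_COL1", "FEATURE_COL2", "FEATURE_COL3", "FEATURE_COL4")
val FEATURE_COL_ARR = Array("FEATURE_COL1", "FEATURE_COL2", "FEATURE_COL3", "FEATURE_COL4")

// Put the renamed label column first, then expand the feature names into Columns
val selectedCols = myDF("LEBEL_COLUMN").as("label") +: FEATURE_COL_ARR.map(col)
val logRegrDF = myDF.select(selectedCols: _*)
logRegrDF.printSchema() // label, FEATURE_COL1 .. FEATURE_COL4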

remove a column from a dataframe spark

I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?
Use the drop and withColumnRenamed methods.
Example:
val initialDf = ...
val dfAfterDrop = initialDf.drop("column1").drop("column2")
val dfAfterColRename = dfAfterDrop.withColumnRenamed("oldColumnName", "newColumnName")
Try this:
val initialDf = ...
val dfAfterDropCols = initialDf.drop("column1", "column2")
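A minimal sketch of the multi-column drop (the column names are placeholders; it assumes a SparkSession named spark is in scope):
import spark.implicits._

val initialDf = Seq((1, "a", "b", "c")).toDF("id", "column1", "column2", "keep_me")

// Drop several columns in one call; names that don't exist are silently ignored
val dfAfterDropCols = initialDf.drop("column1", "column2")
dfAfterDropCols.printSchema() // id, keep_me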