Selecting several columns from spark dataframe with a list of columns as a start - scala

Assuming that I have a list of Spark columns and a Spark dataframe df, what is the appropriate snippet of code to select a sub-dataframe containing only the columns in the list?
Something similar to, maybe:
var needed_columns: List[Column] = List[Column](new Column("a"), new Column("b"))
df(needed_columns)
I wanted to get the column names and then select them using the following line of code.
Unfortunately, the column name seems to be write-only.
df.select(needed_columns.head.as(String), needed_columns.tail: _*)

Your needed_columns is of type List[Column], hence you can simply use needed_columns: _* as the arguments for select:
val df = Seq((1, "x", 10.0), (2, "y", 20.0)).toDF("a", "b", "c")
import org.apache.spark.sql.Column
val needed_columns: List[Column] = List(new Column("a"), new Column("b"))
df.select(needed_columns: _*)
// +---+---+
// |  a|  b|
// +---+---+
// |  1|  x|
// |  2|  y|
// +---+---+
Note that select takes two types of arguments:
def select(cols: Column*): DataFrame
def select(col: String, cols: String*): DataFrame
If you have a list of column names of String type, you can use the latter select:
val needed_col_names: List[String] = List("a", "b")
df.select(needed_col_names.head, needed_col_names.tail: _*)
Or, you can map the list of Strings to Columns to use the former select (col comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.col
df.select(needed_col_names.map(col): _*)

I understand that you want to select only the columns given in a separate list, rather than all of the dataframe's columns. Below is an example where I select the first name and last name using a separate list. Check this out:
scala> val df = Seq((101,"Jack", "wright" , 27, "01976", "US")).toDF("id","fname","lname","age","zip","country")
df: org.apache.spark.sql.DataFrame = [id: int, fname: string ... 4 more fields]
scala> df.columns
res20: Array[String] = Array(id, fname, lname, age, zip, country)
scala> val needed =Seq("fname","lname")
needed: Seq[String] = List(fname, lname)
scala> val needed_df = needed.map( x=> col(x) )
needed_df: Seq[org.apache.spark.sql.Column] = List(fname, lname)
scala> df.select(needed_df:_*).show(false)
+-----+------+
|fname|lname |
+-----+------+
|Jack |wright|
+-----+------+
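If the list might also contain names that are not actually columns of df, a small sketch of one way to guard against that is to keep only the names present in df.columns before selecting (the extra "salary" name below is hypothetical and not part of the original example):
import org.apache.spark.sql.functions.col

// "salary" is not a column of df, so it is filtered out before the select
val wanted = Seq("fname", "lname", "salary")
val present = wanted.filter(c => df.columns.contains(c))
df.select(present.map(col): _*).show(false)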

Related

Spark create a dataframe from multiple lists/arrays

So, I have 2 lists in Spark (Scala). They both contain the same number of values. The first list a contains all Strings and the second list b contains all Longs.
a: List[String] = List("a", "b", "c", "d")
b: List[Long] = List(17625182, 17625182, 1059731078, 100)
I also have a schema defined as follows:
val schema2 = StructType(
  Array(
    StructField("check_name", StringType, true),
    StructField("metric", DecimalType(38,0), true)
  )
)
What is the best way to convert my lists to a single dataframe, that has schema schema2 and the columns are made from a and b respectively?
You can create an RDD[Row] and convert to Spark dataframe with the given schema:
import org.apache.spark.sql.Row

val df = spark.createDataFrame(
  sc.parallelize(a.zip(b).map(x => Row(x._1, BigDecimal(x._2)))),
  schema2
)
df.show
+----------+----------+
|check_name| metric|
+----------+----------+
| a| 17625182|
| b| 17625182|
| c|1059731078|
| d| 100|
+----------+----------+
Alternatively, using a Dataset (note that the column names and types then come from the case class, i.e. a: string and b: bigint, rather than from schema2):
import spark.implicits._
case class Schema2(a: String, b: Long)

val el = (a zip b).map { case (a, b) => Schema2(a, b) }
val df = spark.createDataset(el).toDF()
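If the column names and types should line up with schema2, one possible sketch (assuming the usual spark session and its implicits are in scope) is to name the columns in toDF and cast the metric afterwards:
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

// zip the two lists, name the columns to match schema2, then cast the metric column
val df2 = a.zip(b)
  .toDF("check_name", "metric")
  .withColumn("metric", $"metric".cast(DecimalType(38, 0)))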

Dynamic dataframe with n columns and m rows

I am reading data from JSON (dynamic schema) and loading it into a dataframe.
Example Dataframe:
scala> import spark.implicits._
import spark.implicits._
scala> val DF = Seq(
(1, "ABC"),
(2, "DEF"),
(3, "GHIJ")
).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> DF.show
+---+----+
| id|word|
+---+----+
|  1| ABC|
|  2| DEF|
|  3|GHIJ|
+---+----+
Requirement:
The column count and the column names can be anything. I want to read the rows in a loop and fetch each column one by one, because I need to process each value in subsequent flows. I need both the column name and the value. I'm using Scala.
Python:
for i, j in df.iterrows():
    print(i, j)
I need the same functionality in Scala, and the column name and value should be fetched separately.
Kindly help.
df.iterrows is not from PySpark but from pandas. In Spark, you can use foreach:
import org.apache.spark.sql.Row

DF.foreach { _ match { case Row(id: Int, word: String) => println(id, word) } }
Result :
(2,DEF)
(3,GHIJ)
(1,ABC)
If you don't know the number of columns, you cannot use unapply on Row; in that case just do:
DF.foreach(row => println(row))
Result :
[1,ABC]
[2,DEF]
[3,GHIJ]
and operate on each row using its methods, such as getAs, etc.
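If you also need the column name next to each value, as asked, one sketch, assuming the rows carry their schema (which DataFrame rows do), is to zip the row's field names with its values; like the foreach above, the println output ends up on the executors when running on a cluster:
DF.foreach { row =>
  // pair each column name with its value, without knowing the columns up front
  row.schema.fieldNames.zip(row.toSeq).foreach { case (name, value) =>
    println(s"$name = $value")
  }
}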

Use rlike with regex column in spark 1.5.1

I want to filter a dataframe by applying the regex stored in one of its columns to another column.
Example:
Id Column1 RegexColumm
1 Abc A.*
2 Def B.*
3 Ghi G.*
Filtering the dataframe using RegexColumm should give the rows with id 1 and 3.
Is there a way to do this in Spark 1.5.1? I don't want to use a UDF, as this might cause scalability issues; I'm looking for a Spark native API.
You can convert the df to an rdd, then traverse each row, match the regex, and keep only the matching data, without using any UDF.
Example:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
df.show()
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 2| Def| B.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
//creating new schema to add new boolean field
val sch = StructType(df.schema.fields ++ Array(StructField("bool_col", BooleanType, false)))
//convert df to rdd and match the regex using .map
val rdd = df.rdd.map(row => {
  val regex = row.getAs[String]("regexCol")
  val matched = row.getAs[String]("column1").matches(regex)
  Row.fromSeq(row.toSeq ++ Array(matched))
})
//convert rdd back to a dataframe and keep only the rows where bool_col is true
val final_df = sqlContext.createDataFrame(rdd, sch).where(col("bool_col")).drop("bool_col")
final_df.show(10)
//+---+-------+--------+
//| id|column1|regexCol|
//+---+-------+--------+
//| 1| Abc| A.*|
//| 3| Ghi| G.*|
//+---+-------+--------+
UPDATE:
Instead of .map we can use .mapPartitions (map vs mapPartitions):
val rdd = df.rdd.mapPartitions(partitions => {
  partitions.map(row => {
    val regex = row.getAs[String]("regexCol")
    val matched = row.getAs[String]("column1").matches(regex)
    Row.fromSeq(row.toSeq ++ Array(matched))
  })
})
scala> val df = Seq((1,"Abc","A.*"),(2,"Def","B.*"),(3,"Ghi","G.*")).toDF("id","Column1","RegexColumm")
df: org.apache.spark.sql.DataFrame = [id: int, Column1: string ... 1 more field]
scala> val requiredDF = df.filter(x=> x.getAs[String]("Column1").matches(x.getAs[String]("RegexColumm")))
requiredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, Column1: string ... 1 more field]
scala> requiredDF.show
+---+-------+-----------+
| id|Column1|RegexColumm|
+---+-------+-----------+
| 1| Abc| A.*|
| 3| Ghi| G.*|
+---+-------+-----------+
You can use it like above; I think this is what you are looking for. Please do let me know if it helps you.
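For completeness, recent Spark versions also let you express this with the SQL rlike operator through expr, which keeps everything in the native API; I have not verified this on Spark 1.5.1, so treat it there as an assumption:
import org.apache.spark.sql.functions.expr

// keep the rows where Column1 matches the regex held in RegexColumm
val filteredDF = df.filter(expr("Column1 rlike RegexColumm"))
filteredDF.show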

Convert spark dataframe to sequence of sequences and vice versa in Scala [duplicate]

This question already has an answer here:
How to get Array[Seq[String]] from DataFrame?
(1 answer)
Closed 3 years ago.
I have a DataFrame and I want to convert it into a sequence of sequences and vice versa.
Now the thing is, I want to do it dynamically, and write something which runs for DataFrame with any number/type of columns.
In summary, these are the questions:
How to convert Seq[Seq[String]] to a DataFrame?
How to convert a DataFrame to Seq[Seq[String]]?
How to do the conversion to a DataFrame while also letting it infer the schema and decide the column types by itself?
UPDATE 1
This is not a duplicate of the linked question, because the solution provided in its answer is not dynamic; it works for two columns, or the number of columns has to be hardcoded. I am trying to find a dynamic solution.
This is how you can dynamically create a dataframe from Seq[Seq[String]]:
scala> val seqOfSeq = Seq(Seq("a","b", "c"),Seq("3","4", "5"))
seqOfSeq: Seq[Seq[String]] = List(List(a, b, c), List(3, 4, 5))
scala> val lengthOfRow = seqOfSeq(0).size
lengthOfRow: Int = 3
scala> val tempDf = sc.parallelize(seqOfSeq).toDF
tempDf: org.apache.spark.sql.DataFrame = [value: array<string>]
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias(s"col$i")): _*)
requiredDf: org.apache.spark.sql.DataFrame = [col0: string, col1: string ... 1 more field]
scala> requiredDf.show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| 3| 4| 5|
+----+----+----+
How to convert a DataFrame to Seq[Seq[String]]:
val newSeqOfSeq = requiredDf.collect().map(row => row.toSeq.map(_.toString).toSeq).toSeq
To use custom column names:
scala> val myCols = Seq("myColA", "myColB", "myColC")
myCols: Seq[String] = List(myColA, myColB, myColC)
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias( myCols(i) )): _*)
requiredDf: org.apache.spark.sql.DataFrame = [myColA: string, myColB: string ... 1 more field]
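For the third point (letting Spark decide the column types by itself), one sketch, assuming Spark 2.2+ and values that contain no commas, is to feed the rows to the CSV reader with schema inference turned on; the resulting columns come out as _c0, _c1, ... and can be renamed with toDF if needed:
import spark.implicits._

// join each inner Seq into a CSV line and let the CSV reader infer the types
val csvLines = seqOfSeq.map(_.mkString(",")).toDS()
val inferredDf = spark.read.option("inferSchema", "true").csv(csvLines)
inferredDf.printSchema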

Replace all occurrences of a String in all columns in a dataframe in scala

I have a dataframe with 20 columns, and in these columns there is a value XX which I want to replace with an empty string. How do I achieve that in Scala? The withColumn function works on a single column, but I want to pass all 20 columns and replace the values that have XX across the entire frame with an empty string. Can someone suggest a way?
Thanks
You can gather all the StringType columns in a list and use foldLeft to apply a removeXX UDF to each of the columns, as follows:
val df = Seq(
  (1, "aaXX", "bb"),
  (2, "ccXX", "XXdd"),
  (3, "ee", "fXXf")
).toDF("id", "desc1", "desc2")

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf

val stringColumns = df.schema.fields.collect {
  case StructField(name, StringType, _, _) => name
}

val removeXX = udf((s: String) =>
  if (s == null) null else s.replaceAll("XX", "")
)

val dfResult = stringColumns.foldLeft(df)((acc, c) =>
  acc.withColumn(c, removeXX(df(c)))
)
dfResult.show
+---+-----+-----+
| id|desc1|desc2|
+---+-----+-----+
| 1| aa| bb|
| 2| cc| dd|
| 3| ee| ff|
+---+-----+-----+
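If you would rather avoid the UDF entirely, the same foldLeft pattern works with the built-in regexp_replace; a sketch reusing the stringColumns list from above:
import org.apache.spark.sql.functions.{col, regexp_replace}

// same fold, but with the built-in function instead of a UDF
val dfNoXX = stringColumns.foldLeft(df)((acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), "XX", ""))
)
dfNoXX.show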
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

def clearValueContains(dataFrame: DataFrame, token: String, columnsToBeUpdated: List[String]) = {
  columnsToBeUpdated.foldLeft(dataFrame) { (dataset, columnName) =>
    dataset.withColumn(columnName, when(col(columnName).contains(token), "").otherwise(col(columnName)))
  }
}
You can use this function, passing "XX" as the token. columnsToBeUpdated is the list of columns in which you need to search for the particular value. Note that it blanks out any cell that contains the token rather than stripping the token out of the string.
dataset.withColumn(columnName, when(col(columnName) === token, "").otherwise(col(columnName)))
You can use the above line instead to replace on an exact match.
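For example, calling the contains-based helper above with the df from the first answer (this call is only an illustration):
// blanks out every desc1/desc2 cell that contains "XX"; it does not strip the token
val cleaned = clearValueContains(df, "XX", List("desc1", "desc2"))
cleaned.show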
We can also do it like this in Scala.
// Getting all the column names
val columns: Seq[String] = df.columns
// Using DataFrameNaFunctions to achieve this
val changedDF = df.na.replace(columns, Map("XX" -> ""))
Note that na.replace only swaps out cells whose value is exactly "XX"; it does not remove "XX" occurring inside longer strings.
Hope this helps.