Spark create a dataframe from multiple lists/arrays

Spark create a dataframe from multiple lists/arrays - scala

So, I have 2 lists in Spark(scala). They both contain the same number of values. The first list a contains all strings and the second list b contains all Long's.
a: List[String] = List("a", "b", "c", "d")
b: List[Long] = List(17625182, 17625182, 1059731078, 100)
I also have a schema defined as follows:
val schema2=StructType(
Array(
StructField("check_name", StringType, true),
StructField("metric", DecimalType(38,0), true)
)
)
What is the best way to convert my lists to a single dataframe, that has schema schema2 and the columns are made from a and b respectively?

You can create an RDD[Row] and convert to Spark dataframe with the given schema:
val df = spark.createDataFrame(
sc.parallelize(a.zip(b).map(x => Row(x._1, BigDecimal(x._2)))),
schema2
)
df.show
+----------+----------+
|check_name| metric|
+----------+----------+
| a| 17625182|
| b| 17625182|
| c|1059731078|
| d| 100|
+----------+----------+

Using Dataset:
import spark.implicits._
case class Schema2(a: String, b: Long)
val el = (a zip b) map { case (a, b) => Schema2(a, b)}
val df = spark.createDataset(el).toDF()

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

myFunc(Row): String = {
//process row
//returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
inputDF.withColumn("newcol",myFunc(Row))
inputDF
}
But no new column got created in my case. My myFunc passes this row to a knowledgebasesession object and that returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr() sqlfunc(col(udf(x)) and other techniques but here my newcol is not derived directly from existing column.

Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
.rdd
.map{x =>
val y = myFunc (x)
Row.fromSeq (x.toSeq ++ Seq(y) )
}
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType =StringType, nullable =false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
case class testData(id: Int, col1: String)
case class transformedData(id: Int, col1: String, col2: String)
val test: Dataset[testData] = List(testData(1, "abc"), testData(2, "def"), testData(3, "ghi")).toDS
val transformedData: Dataset[transformedData] = test
.map { x: testData =>
val newCol = x.col1 + "xyz"
transformedData(x.id, x.col1, newCol)
}
transformedData.show
As you can see datasets is more readable, plus provides strong type casting.
Since I'm unaware of your spark version, providing both solutions here. However if you're using spark v>=1.6, you should look into Datasets. Playing with rdd is fun, but can quickly devolve into longer job runs and a host of other issues that you wont foresee

Spark scala dataframe get value for each row and assign to variables

I have a dataframe like below :
val df=spark.sql("select * from table")
row1|row2|row3
A1,B1,C1
A2,B2,C2
A3,B3,C3
i want to iterate for loop to get values like this :
val value1="A1"
val value2="B1"
val value3="C1"
function(value1,value2,value3)
Please help me.
emphasized text

You have 2 options :
Solution 1- Your data is big, then you must stick with dataframes. So to apply a function on every row. We must define a UDF.
Solution 2- Your data is small, then you can collect the data to the driver machine and then iterate with a map.
Example:
val df = Seq((1,2,3), (4,5,6)).toDF("a", "b", "c")
def sum(a: Int, b: Int, c: Int) = a+b+c
// Solution 1
import org.apache.spark.sql.Row
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show
//Solution 2
df.collect.map(r=> sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
Output for both cases:
+---+
|sum|
+---+
| 6|
| 15|
+---+
EDIT:
val myUDF = udf((r: Row) => {
val value1 = r.getAs[Int](0)
val value2 = r.getAs[Int](1)
val value3 = r.getAs[Int](2)
myFunction(value1, value2, value3)
})

Add new column containing an Array of column names sorted by the row-wise values

Given a dataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names sorted by decreasing order, based on the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a var names:Array[String]
What approach should I go for?

Using UDF is most simple way to achieve custom tasks here.
val df = spark.createDataFrame(Seq((1,4,3), (4,1,3))).toDF("a", "b", "c")
val names=df.schema.fieldNames
val sortNames = udf((v: Seq[Int]) => {v.zip(names).sortBy(_._1).map(_._2)})
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show

Something like this can be an approach using Dataset:
case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])
def function1()(implicit spark: SparkSession) = {
import spark.implicits._
val df0: DataFrame =
spark.createDataFrame(spark.sparkContext
.parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
StructType(Seq(StructField("a", IntegerType, false),
StructField("b", IntegerType, false),
StructField("c", IntegerType, false))))
val df1 = df0
.flatMap(row => Seq(Columns(row.getAs[Int]("a"),
row.getAs[Int]("b"),
row.getAs[Int]("c"),
Array(Element("a", row.getAs[Int]("a")),
Element("b", row.getAs[Int]("b")),
Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
df1
}
def main(args: Array[String]) : Unit = {
implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
function1().show()
}
gives:
+---+---+---+---------+
| a| b| c| elements|
+---+---+---+---------+
| 1| 2| 3|[a, b, c]|
| 4| 1| 3|[b, c, a]|
+---+---+---+---------+

Try something like this:
val sorted_column_names = udf((column_map: Map[String, Int]) =>
column_map.toSeq.sortBy(- _._2).map(_._1)
)
df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c")
.withColumn("newcol", sorted_column_names($"column_map"))

Selecting several columns from spark dataframe with a list of columns as a start

Assuming that I have a list of spark columns and a spark dataframe df, what is the appropriate snippet of code in order to select a subdataframe containing only the columns in the list?
Something similar to maybe:
var needed_column: List[Column]=List[Column](new Column("a"),new Column("b"))
df(needed_columns)
I wanted to get the columns names then select them using the following line of code.
Unfortunately, the column name seems to be in write mode only.
df.select(needed_columns.head.as(String),needed_columns.tail: _*)

Your needed_columns is of type List[Column], hence you can simply use needed_columns: _* as the arguments for select:
val df = Seq((1, "x", 10.0), (2, "y", 20.0)).toDF("a", "b", "c")
import org.apache.spark.sql.Column
val needed_columns: List[Column] = List(new Column("a"), new Column("b"))
df.select(needed_columns: _*)
// +---+---+
// | a| b|
// +---+---+
// | 1| x|
// | 2| y|
// +---+---+
Note that select takes two types of arguments:
def select(cols: Column*): DataFrame
def select(col: String, cols: String*): DataFrame
If you have a list of column names of String type, you can use the latter select:
val needed_col_names: List[String] = List("a", "b")
df.select(needed_col_names.head, needed_col_names.tail: _*)
Or, you can map the list of Strings to Columns to use the former select
df.select(needed_col_names.map(col): _*)

I understand that you want to select only those columns from a list(A)other than the dataframe columns. I have a below example, where I select the firstname and lastname using a separate list. check this out
scala> val df = Seq((101,"Jack", "wright" , 27, "01976", "US")).toDF("id","fname","lname","age","zip","country")
df: org.apache.spark.sql.DataFrame = [id: int, fname: string ... 4 more fields]
scala> df.columns
res20: Array[String] = Array(id, fname, lname, age, zip, country)
scala> val needed =Seq("fname","lname")
needed: Seq[String] = List(fname, lname)
scala> val needed_df = needed.map( x=> col(x) )
needed_df: Seq[org.apache.spark.sql.Column] = List(fname, lname)
scala> df.select(needed_df:_*).show(false)
+-----+------+
|fname|lname |
+-----+------+
|Jack |wright|
+-----+------+
scala>

Replace all occurrences of a String in all columns in a dataframe in scala

I have a dataframe with 20 Columns and in these columns there is a value XX which i want to replace with Empty String. How do i achieve that in scala. The withColumn function is for a single column, But i want to pass all 20 columns and replace values that have XX in the entire frame with Empty String , Can some one suggest a way.
Thanks

You can gather all the stringType columns in a list and use foldLeft to apply your removeXX UDF to each of the columns as follows:
val df = Seq(
(1, "aaXX", "bb"),
(2, "ccXX", "XXdd"),
(3, "ee", "fXXf")
).toDF("id", "desc1", "desc2")
import org.apache.spark.sql.types._
val stringColumns = df.schema.fields.collect{
case StructField(name, StringType, _, _) => name
}
val removeXX = udf( (s: String) =>
if (s == null) null else s.replaceAll("XX", "")
)
val dfResult = stringColumns.foldLeft( df )( (acc, c) =>
acc.withColumn( c, removeXX(df(c)) )
)
dfResult.show
+---+-----+-----+
| id|desc1|desc2|
+---+-----+-----+
| 1| aa| bb|
| 2| cc| dd|
| 3| ee| ff|
+---+-----+-----+

def clearValueContains(dataFrame: DataFrame,token :String,columnsToBeUpdated : List[String])={
columnsToBeUpdated.foldLeft(dataFrame){
(dataset ,columnName) =>
dataset.withColumn(columnName, when(col(columnName).contains(token), "").otherwise(col(columnName)))
}
}
You can use this function .. where you can put token as "XX" . Also the columnsToBeUpdated is the list of columns in which you need to search for the particular column.
dataset.withColumn(columnName, when(col(columnName) === token, "").otherwise(col(columnName)))
you can use the above code to replace on exact match.

We can do like this as well in scala.
//Getting all columns
val columns: Seq[String] = df.columns
//Using DataFrameNaFunctions to achieve this.
val changedDF = df.na.replace(columns, Map("XX"-> ""))
Hope this helps.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark create a dataframe from multiple lists/arrays - scala

Using Dataset: import spark.implicits._ case class Schema2(a: String, b: Long) val el = (a zip b) map { case (a, b) => Schema2(a, b)} val df = spark.createDataset(el).toDF()

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

Spark scala dataframe get value for each row and assign to variables

Add new column containing an Array of column names sorted by the row-wise values

Selecting several columns from spark dataframe with a list of columns as a start

Replace all occurrences of a String in all columns in a dataframe in scala

Categories

Resources