Fetch columns based on a list in Spark - Scala

I have a list List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13) and a dataframe that reads its input from a text file with no headers. I want to fetch the columns mentioned in my list from that dataframe (inputFile). My input file has more than 20 columns, but I only want to fetch the columns mentioned in my list.
val inputFile = spark.read
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("delimiter", "|")
.load("C:\\demo.txt")

You can get the required columns using the following:
import org.apache.spark.sql.functions.col

val fetchIndex = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
val fetchCols = inputFile.columns.zipWithIndex
  .filter { case (colName, idx) => fetchIndex.contains(idx) }
  .map { case (colName, idx) => col(colName) }
inputFile.select(fetchCols : _*)
Basically, zipWithIndex pairs each element of the collection with its index, the filter keeps only the columns whose index appears in the list, and the map turns the remaining names into Column objects. On a small example dataframe df, with a being the list of wanted indexes, you get something like this:
df.columns.zipWithIndex.filter { case (colName, idx) => a.contains(idx) }.map { case (colName, idx) => col(colName) }
res8: Array[org.apache.spark.sql.Column] = Array(companyid, event, date_time)
And then you can just use the : _* splat syntax to pass the generated array as varargs to the select function.
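Note that this keeps the columns in the dataframe's own order, not in the order given by fetchIndex (which lists 10 before 8). If the list order matters, a small sketch that indexes into the columns array directly (assuming every index in the list is within bounds) would be:
val orderedCols = fetchIndex.map(i => col(inputFile.columns(i)))
inputFile.select(orderedCols : _*)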

You can use the following steps to get the columns whose indexes you have defined in a list.
You can get the column names by doing the following
val names = df.schema.fieldNames
And you have a list of column indexes as
val list = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
Now you can pick out the column names at the indexes contained in the list
val selectCols = list.map(x => names(x))
The last step is to select only those columns
import org.apache.spark.sql.functions.col
val selectedDataFrame = df.select(selectCols.map(col): _*)
You should now have a dataframe containing only the columns at the indexes mentioned in the list.
Note: the indexes in the list must not be greater than the highest column index present in the dataframe
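If you want to guard against indexes that exceed the number of columns (see the note above), you could filter the list before mapping it to names; a minimal sketch:
val safeCols = list.filter(_ < names.length).map(names(_))
val selectedDataFrame = df.select(safeCols.map(col): _*)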

Related

How to add a column with a duplicated sequence number to a Spark dataframe in Scala?

I need to add a column to a Spark dataframe that should be a duplicated sequence number, such as [1, 1, 1, 2, 2, 2, 3, 3, 3, ..., 10000, 10000, 10000]. I know that we can use monotonically_increasing_id to get a sequence number as a new column.
val df_new = df.withColumn("id", monotonically_increasing_id)
Then, what is the solution to extend this function to get the duplicate sequence number? Thanks!
You can calculate a row number, subtract 1, divide by 3, cast to integer type, and add 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val df_new = df.withColumn(
  "id",
  ((row_number().over(Window.orderBy(monotonically_increasing_id)) - 1) / 3).cast("int") + 1
)
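As a quick sanity check (a sketch, assuming spark.implicits._ is in scope for toDF), the first nine rows get ids 1, 1, 1, 2, 2, 2, 3, 3, 3:
val demo = (1 to 9).toDF("value")
demo.withColumn(
  "id",
  ((row_number().over(Window.orderBy(monotonically_increasing_id)) - 1) / 3).cast("int") + 1
).show()
// id column: 1, 1, 1, 2, 2, 2, 3, 3, 3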

Calculate date difference for a specific column ID in Scala

I need to calculate a date difference for a column, considering a specific ID shown in a different column and the first date for that specific ID, using Scala.
I have the following dataset:
The column ID shows the specific ID previously mentioned, the column date shows the date of the event and the column rank shows the chronological positioning of the different event dates for each specific ID.
I need to calculate for ID 1, the date difference for ranks 2 and 3 compared to rank 1 for that same ID, the same for ID 2 and so forth.
The expected result is the following:
Does somebody know how to do it?
Thanks!!!
Outside of using a library like Spark to reason about your data in SQL-esque terms, this can be accomplished using the Collections API by first finding the minimum date for each ID and then comparing the dates in the original collection:
import java.time.temporal.ChronoUnit.DAYS
import java.time.LocalDate

case class Input(id: Int, date: LocalDate, rank: Int)
case class Output(id: Int, date: LocalDate, rank: Int, diff: Long)

val inData = Seq(Input(1, LocalDate.of(2020, 12, 10), 1),
  Input(1, LocalDate.of(2020, 12, 12), 2),
  Input(1, LocalDate.of(2020, 12, 16), 3),
  Input(2, LocalDate.of(2020, 12, 11), 1),
  Input(2, LocalDate.of(2020, 12, 13), 2),
  Input(2, LocalDate.of(2020, 12, 14), 3))

// earliest date per id (groupMapReduce is available from Scala 2.13 onwards)
val minDates = inData.groupMapReduce(_.id)(identity) { (a, b) =>
  if (a.date.isBefore(b.date)) a else b
}
// minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))

// difference in days between each row's date and the minimum date for its id
val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
// outData: Seq[Output] = List(
//   Output(1, 2020-12-10, 1, 0L),
//   Output(1, 2020-12-12, 2, 2L),
//   Output(1, 2020-12-16, 3, 6L),
//   Output(2, 2020-12-11, 1, 0L),
//   Output(2, 2020-12-13, 2, 2L),
//   Output(2, 2020-12-14, 3, 3L))
You can get the required output by performing the steps shown below:
// Creating the sample data
import spark.implicits._   // for toDF and the $ column syntax (assumes a SparkSession named spark)
val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
  .toDF("ID","Date","Rank").withColumn("Date",$"Date".cast("Date"))
//adding column with just the value for the rank = 1 column
import org.apache.spark.sql.functions._
val df1 = sampledf.withColumn("Basedate",when($"Rank" === 1 ,$"Date"))
//Doing GroupBy based on ID and basedate column and filtering the records with null basedate
val groupedDF = df1.groupBy("ID","basedate").min("Rank").filter($"min(Rank)" === 1)
//joining the two dataframes and selecting the required columns.
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"),"left").select("ID","Date","Rank","t.basedate")
//Applying the inbuilt datediff function to get the required output.
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date",$"basedate"))
finalDF.show(false)
//If using databricks you can use display method.
display(finalDF)
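With the sample data above, the date differences come out as follows (a quick check of the expected values, not output copied from a run):
// ID 1: 2020-12-10 -> 0, 2020-12-12 -> 2, 2020-12-16 -> 6
// ID 2: 2020-12-08 -> 0, 2020-12-11 -> 3, 2020-12-13 -> 5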

Spark Dataframe from all combinations of Array column

Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and columns value_1 and value_2 that each contain an integer value. For example, with k = 3:
d1 =
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
| (1, 4, 3)  | (3, 4, 5)  |
| (2, 1, 3)  | (1, 0, 2)  |
| (4, 3, 1)  | (3, 5, 6)  |
+------------+------------+
I need to create a new column, combinations, that contains, for each pair of sets in elements_1 and elements_2, a list of the sets built from all possible combinations of their elements. These sets must have the following properties:
Their size must be k+1
They must contain either the set in elements_1 or the set in elements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1) and (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because it is not of length 3+1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.
You need to create a custom user-defined function to perform the transformation, create a Spark-compatible UserDefinedFunction from it, then apply it using withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to create a new column in a DataFrame using a user-defined function.
Here's a first shot at the set logic, let me know if it does what you're looking for:
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
a.diff(b).map(b+_) ++ b.diff(a).map(a+_)
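As a quick check against the example in the question (not part of the original answer), combo(Set(1, 2, 3), Set(3, 4, 5)) produces exactly the four sets listed there:
// a.diff(b) = Set(1, 2) -> add each to b: Set(3, 4, 5, 1), Set(3, 4, 5, 2)
// b.diff(a) = Set(4, 5) -> add each to a: Set(1, 2, 3, 4), Set(1, 2, 3, 5)
combo(Set(1, 2, 3), Set(3, 4, 5))
// contains Set(1, 2, 3, 4), Set(1, 2, 3, 5), Set(3, 4, 5, 1), Set(3, 4, 5, 2) (in some order)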
Now create the UDF wrapper. Note that under the hood these sets are all represented by WrappedArrays, so we need to handle this. There's probably a more elegant way to deal with this by defining some implicit conversions, but this should work:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray
val comboUDF = udf(comboWrap)
Finally, apply it to the DataFrame by creating a new column:
import org.apache.spark.sql.functions.col

// assumes spark.implicits._ is in scope for toDF
val data = Seq((Set(1,2,3), Set(3,4,5))).toDF("elements_1","elements_2")
val result = data.withColumn("result",
  comboUDF(col("elements_1"), col("elements_2")))
result.show
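Depending on your Spark and Scala versions, the array columns may not come back as WrappedArray; typing the UDF parameters as Seq[Int] sidesteps that (a sketch equivalent to the wrapper above, with a hypothetical name comboSeqUDF):
import org.apache.spark.sql.functions.udf

val comboSeqUDF = udf { (x: Seq[Int], y: Seq[Int]) =>
  combo(x.toSet, y.toSet).map(_.toArray).toArray
}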

Values of a Dataframe Column into an Array in Scala Spark

Say I have the dataframe
val df1 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A3", 45, "9", 1, 450),
("A4", 26, "7", 1, 333)
)).toDF("CID","age", "children", "marketplace_id","value")
Now I want all the values of the column "children" in a separate array, in the same order.
The code below works for a smaller dataset with only one partition:
val list1 = df1.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code no longer preserves the original order once the data is repartitioned:
val partitioned = df1.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
Is there a way to get all the values of a column into an array without changing the order?
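One way to keep the original order (not from the original post, just a sketch) is to capture an index while the data is still in its original order, then sort by it after repartitioning:
import org.apache.spark.sql.functions.monotonically_increasing_id
// assumes spark.implicits._ is in scope for $ and the String encoder
val withIdx = df1.withColumn("idx", monotonically_increasing_id)
val partitioned2 = withIdx.repartition($"CID")
val orderedList = partitioned2.orderBy("idx")
  .select("children")
  .map(r => r.getString(0))
  .collect()
// back to the original order: Array(5, 1, 9, 7)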

How to filter a few rows in a table using Scala

Using Scala:
I have an emp table as below
id, name, dept, address
1, a, 10, hyd
2, b, 10, blr
3, a, 5, chn
4, d, 2, hyd
5, a, 3, blr
6, b, 2, hyd
Code:
val inputFile = sc.textFile("hdfs:/user/edu/emp.txt");
val inputRdd = inputFile.map(iLine => (iLine.split(",")(0),
iLine.split(",")(1),
iLine.split(",")(3)
));
// keeping only a few columns; now I want to pull the complete data of the hyd-addressed employees
Problem: I don't want to print all emp details, only the details of the employees who are from hyd.
I have loaded this emp dataset into an RDD.
I have split this RDD on ','.
Now I want to print only the hyd-addressed employees.
I think the solution below will help solve your problem.
val fileName = "/path/stact_test.txt"
val strRdd = sc.textFile(fileName).map { line =>
val data = line.split(",")
(data(0), data(1), data(3))
}.filter(rec=>rec._3.toLowerCase.trim.equals("hyd"))
After splitting the data, filter on the location using the third item of the tuple RDD.
Output:
(1, a, hyd)
(4, d, hyd)
(6, b, hyd)
You may try to use a dataframe:
import spark.implicits._   // for the $ column syntax
import org.apache.spark.sql.functions.{split, trim}

val viewsDF = spark.read.text("hdfs:/user/edu/emp.txt")
val splitedViewsDF = viewsDF.withColumn("id", split($"value", ",").getItem(0))
  .withColumn("name", split($"value", ",").getItem(1))
  .withColumn("address", split($"value", ",").getItem(3))
  .drop($"value")
  .filter(trim($"address") === "hyd")   // use === (not .equals); trim because the sample rows have a space after each comma
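The question mentions wanting the complete employee data; if you also need the dept column (index 2 after the split), the same pattern extends naturally (a sketch building on the code above):
val withDept = viewsDF
  .withColumn("id", trim(split($"value", ",").getItem(0)))
  .withColumn("name", trim(split($"value", ",").getItem(1)))
  .withColumn("dept", trim(split($"value", ",").getItem(2)))
  .withColumn("address", trim(split($"value", ",").getItem(3)))
  .drop($"value")
  .filter($"address" === "hyd")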