Select Columns in Spark Dataframe based on Column name pattern - scala

I have a spark dataframe with the following column structure:
UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY
In the above column structure, I will get new columns with subsequent quarters like 2019 1Q, 2019 2Q, etc.
I want to select UT_LVL_17_CD, UT_LVL_20_CD and the columns that match the pattern year<space>quarter, like 2017 1Q.
Basically I want to avoid selecting columns like 2017 FY and 2018 FY, and this has to be dynamic, as I will get new FY data each year.
I am using spark 2.4.4

Like I stated in my comment, this can be done with plain Scala using a Regex, since the DataFrame exposes its column names as an Array[String]:
scala> val columns = df.columns
// columns: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY)
scala> val regex = """^((?!FY).)*$""".r
// regex: scala.util.matching.Regex = ^((?!FY).)*$
scala> val selection = columns.filter(s => regex.findFirstIn(s).isDefined)
// selection: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q)
You can check that the selected columns do not contain the unwanted columns:
scala> columns.diff(selection)
// res2: Array[String] = Array(2017 FY, 2018 FY)
Now you can use the selection:
scala> df.select(selection.head, selection.tail : _*)
// res3: org.apache.spark.sql.DataFrame = [UT_LVL_17_CD: int, UT_LVL_20_CD: int ... 8 more fields]
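If you prefer to match the wanted columns explicitly instead of excluding everything containing FY, a positive year<space>quarter pattern works too. A minimal sketch (the ID column names are taken from the question; selectQuarterCols is just an illustrative helper):
import org.apache.spark.sql.DataFrame

// keep the two ID columns plus anything shaped like "<year> <n>Q", e.g. "2019 1Q"
val quarterPattern = """^\d{4} \dQ$""".r
val idCols = Seq("UT_LVL_17_CD", "UT_LVL_20_CD")

def selectQuarterCols(df: DataFrame): DataFrame = {
  val wanted = df.columns.filter(c => idCols.contains(c) || quarterPattern.findFirstIn(c).isDefined)
  df.select(wanted.head, wanted.tail: _*)
}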

You could use the desc SQL command to get the list of column names:
import java.util
val fyStringList = new util.ArrayList[String]()
spark.sql("desc <table_name>").select("col_name")
  .filter(row => row.getString(0).toLowerCase.contains("fy"))
  .collect().foreach(row => fyStringList.add(row.getString(0)))
println(fyStringList)
Use the above snippet to get the list of column names which contain "fy".
You can update the filter logic with a regex, and also update the logic in foreach for storing the string columns.

You can try this snippet, assuming DF is your dataframe which consists of all those columns:
val DF1 = DF.select(DF.columns.filter(x => !x.contains("FY")).map(DF(_)) : _*)
This will remove those FY-related columns. Hope this works for you.
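If dropping columns is more natural than selecting, the same filter should also work with drop (a small sketch; drop with varargs has been available since Spark 2.0, DF2 is just an illustrative name):
// drop every column whose name contains "FY"
val DF2 = DF.drop(DF.columns.filter(_.contains("FY")): _*)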

Related

Split date into day of the week, month,year using Pyspark

I have very little experience in Pyspark and I am trying, with no success, to create 3 new columns from a column that contains the timestamp of each row.
The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy.
So it looks like this:
+--------------------+
| timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+
The 3 columns have to contain: the day of the week as an integer (so 0 for monday, 1 for tuesday...), the number of the month and the year.
What is the most effective way to create these additional 3 columns and append them to the pyspark dataframe? Thanks in advance!!
Spark has many built-in date processing functions (year and month since 1.5, dayofweek since 2.3). Here are some that may be useful for you. Note that your column is a string, so it needs to be parsed to a timestamp first, and that dayofweek returns 1 (Sunday) through 7 (Saturday), so you would still need to remap it if you want 0 for Monday:
from pyspark.sql.functions import col, to_timestamp, year, month, dayofweek

# parse the string column using the format given in the question
df = df.withColumn('ts', to_timestamp(col('your_date_column'), 'EEE MMM dd HH:mm:ss Z yyyy'))
df = df.withColumn('dayOfWeek', dayofweek(col('ts')))
df = df.withColumn('month', month(col('ts')))
df = df.withColumn('year', year(col('ts')))

Spark SQL Dataframes - replace function from DataFrameNaFunctions does not work if the Map is created with RDD.collectAsMap()

I am using the replace function from DataFrameNaFunctions to replace the values of a column in a dataframe with those from a Map.
The keys and values of the Map are available as a delimited file. These are read into an RDD, then transformed to a pair RDD and converted to a Map.
For example, a text file of month number and month name, as shown below:
01,January
02,February
03,March
... ...
... ...
val mRDD1 = sc.textFile("file:///.../monthlist.txt")
When this data is transformed into a Map using RDD.collect().toMap as given below, the dataframe.na.replace function works fine; I am referring to this as Method 1.
val monthMap1= mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collect().toMap
monthMap1: scala.collection.immutable.Map[String,String] = Map(12 -> December, 08 -> August, 09 -> September, 11 -> November, 05 -> May, 04 -> April, 10 -> October, 03 -> March, 06 -> June, 02 -> February, 07 -> July, 01 -> January)
val df2 = df1.na.replace("monthname", monthMap1)
df2: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 13 more fields]
However, when this data is transformed into a Map using RDD.collectAsMap() as shown below (which I am calling Method 2), it does not work, since the result is not an immutable Map.
Is there a simple way to convert this scala.collection.Map into a scala.collection.immutable.Map so that it does not give this error?
val monthMap2= mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collectAsMap()
monthMap2: scala.collection.Map[String,String] = Map(12 -> December, 09 -> September, 03 -> March, 06 -> June, 11 -> November, 05 -> May, 08 -> August, 02 -> February, 01 -> January, 10 -> October, 04 -> April, 07 -> July)
val df3 = df1.na.replace("monthname", monthMap2)
<console>:30: error: overloaded method value replace with alternatives:
[T](cols: Seq[String], replacement: scala.collection.immutable.Map[T,T])org.apache.spark.sql.DataFrame <and>
[T](col: String, replacement: scala.collection.immutable.Map[T,T])org.apache.spark.sql.DataFrame <and>
[T](cols: Array[String], replacement: java.util.Map[T,T])org.apache.spark.sql.DataFrame <and>
[T](col: String, replacement: java.util.Map[T,T])org.apache.spark.sql.DataFrame
cannot be applied to (String, scala.collection.Map[String,String])
val cdf3 = cdf2.na.replace("monthname", monthMap2)
^
Method 1 mentioned above is working fine.
However, to use Method 2, I would like to know the simple and direct way to convert a scala.collection.Map into a scala.collection.immutable.Map, and which libraries I need to import as well.
Thanks
You can try this:
val monthMap2 = mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collectAsMap()
// create an immutable map from monthMap2
val monthMap = collection.immutable.Map(monthMap2.toSeq: _*)
val df3 = df1.na.replace("monthname", monthMap)
The replace method also accepts a Java map, so you can convert to one instead (scala.jdk.CollectionConverters is available on Scala 2.13; on Scala 2.12 and earlier the equivalent import is scala.collection.JavaConverters._):
import scala.jdk.CollectionConverters._
val df3 = df1.na.replace("monthname", monthMap2.asJava)
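Another option, assuming you just want an immutable Scala Map: calling toMap on the result of collectAsMap should also work, since the collected entries are already key/value tuples (monthMap3 and df4 below are just illustrative names):
// collectAsMap returns scala.collection.Map; toMap builds an immutable copy of it
val monthMap3 = mRDD1.map(_.split(",")).map(line => (line(0), line(1))).collectAsMap().toMap
val df4 = df1.na.replace("monthname", monthMap3)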

I have a dataframe. I need to add an array [a,a,b,b,c,c,d,d,] in pyspark

I have a data frame df and an array arr = [1,1,2,2,3,3,4,4]. I need to add this array to the existing data frame df.
My code is as follows:
low_limit = 2011
upper_limit = 2017
arr = np.repeat(np.arange(low_limit,upper_limit),2)
df = df.withColumn('arrayYear',F.array(F.lit(arr))).show()
I am getting this error Py4JJavaError:
An error occurred while calling z:org.apache.spark.sql.functions.lit. :
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [2011, 2011, 2012, 2012, 2013, 2013, 2014, 2014, 2015, 2015, 2016, 2016] at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:80)

How to iterate through rows after group by in spark scala dataframe?

I have a dataframe, df1, with the columns below:
Project_end_date  I_date       Project_start_date  id
Jan 30 2017       Jan 10 2017  Jan 1 2017          1
Jan 30 2017       Jan 15 2017  Jan 1 2017          1
Jan 30 2017       Jan 20 2017  Jan 1 2017          1
Here you would first find the differences between I_date and the start date, which would be 10, 15, and 20 days. Then you would express those as a percentage of the project's duration, so 100*10/30=33%, 100*15/30=50%, 100*20/30=67%. Then you would obtain the mean (50%), min (33%), max (67%), etc. of these.
How do I achieve this after doing a group by on id?
df.groupby("id"). ?
The easiest way would be to add the value you care about just before the groupBy. Since the date columns here are strings, they are parsed with to_date and the difference is taken with datediff:
import org.apache.spark.sql.{functions => F}
import spark.implicits._

df.withColumn("ival",
    // fraction of the project duration elapsed at I_date (multiply by 100 for a percentage)
    F.datediff(F.to_date($"I_date", "MMM d yyyy"), F.to_date($"Project_start_date", "MMM d yyyy")).cast("double") /
      F.datediff(F.to_date($"Project_end_date", "MMM d yyyy"), F.to_date($"Project_start_date", "MMM d yyyy")))
  .groupBy($"id").agg(
    F.min($"ival").as("min"),
    F.max($"ival").as("max"),
    F.avg($"ival").as("avg")
  )
If you want to avoid the withColumn, you can just put the expression for ival directly inside F.min, F.max and F.avg, but that's more verbose, as sketched below.
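A rough sketch of that inlined variant, under the same assumptions as above (ivalExpr is just an illustrative name for the reused column expression):
// same ratio as above, defined once and reused inside each aggregate
val ivalExpr =
  F.datediff(F.to_date($"I_date", "MMM d yyyy"), F.to_date($"Project_start_date", "MMM d yyyy")).cast("double") /
    F.datediff(F.to_date($"Project_end_date", "MMM d yyyy"), F.to_date($"Project_start_date", "MMM d yyyy"))

df.groupBy($"id").agg(
  F.min(ivalExpr).as("min"),
  F.max(ivalExpr).as("max"),
  F.avg(ivalExpr).as("avg")
)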

How to replace epoch column in DataFrame with date in scala

I am writing a Spark application which receives an Avro record. I am converting that Avro record into a Spark DataFrame (df) object.
The df contains a timestamp attribute which is in seconds (epoch time).
I want to replace the seconds column with the date column.
How to do that?
My code snippet is:
val df = sqlContext.read.avro("/root/Work/PixelReporting/input_data/pixel.avro")
val pixelGeoOutput = df.groupBy("current_time", "pixel_id", "geo_id", "operation_type", "is_piggyback").count()
pixelGeoOutput.write.json("/tmp/pixelGeo")
"current_time" is in seconds right now. I want to convert it into date.
Since Spark 1.5, there's a built-in SQL function called from_unixtime, so you can do:
import org.apache.spark.sql.functions.{col, from_unixtime}
import spark.implicits._  // for toDF
val df = Seq(Tuple1(1462267668L)).toDF("epoch")
df.withColumn("date", from_unixtime(col("epoch")))
Thanks guys, I used the withColumn method to solve my problem.
The code snippet is:
import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat

val newdf = df.withColumn("date", epochToDateUDF(df("current_time")))
def epochToDateUDF = udf((current_time: Long) => {
  DateTimeFormat.forPattern("YYYY-MM-dd").print(current_time * 1000)
})
This should give you an idea:
import java.util.Date
val df = sc.parallelize(List(1462267668L, 1462267672L, 1462267678L)).toDF("current_time")
// note: DataFrame.map returning an RDD like this is Spark 1.x behaviour; in Spark 2.x use df.rdd.map
val dfWithDates = df.map(row => new Date(row.getLong(0) * 1000))
dfWithDates.collect()
Output:
Array[java.util.Date] = Array(Tue May 03 11:27:48 CEST 2016, Tue May 03 11:27:52 CEST 2016, Tue May 03 11:27:58 CEST 2016)
You might also wrap this in a UDF and use withColumn to replace just that single column.