How do I check if temp view exists in PySpark? - pyspark

I understand how to check for table existence in PySpark:
>>> spark.catalog.setCurrentDatabase("staging")
>>> 'test_table' in sqlContext.tableNames()
True
But what about views?
If I create it like this:
df = sqlContext.sql("SELECT * FROM staging.test_table")
df.createOrReplaceTempView("test_view")
df.persist(p.persistLevel)
How do I check if "test_view" exists later in my code?

You can use sqlContext.tableNames and sqlContext.tables
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> "table1" in sqlContext.tableNames()
True
>>> "table1" in sqlContext.tableNames("default")
True

"default" is the context where views are defined.
>>> spark.catalog.setCurrentDatabase("staging")
>>> 'test_view' in sqlContext.tableNames()
False
>>> spark.catalog.setCurrentDatabase("default")
>>> 'test_view' in sqlContext.tableNames()
True
This takes a bit of time (>3 sec).
A faster option is a try/except:
try:
    _ = spark.read.table('test_view')
    print('Exists!')
except Exception:
    print('Does not exist.')

This is what works for me using spark.catalog.tableExists():
from datetime import datetime, date
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
# create a view
df.createTempView('demo')
# check if this view exists
print(spark.catalog.tableExists('demo'))
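If your Spark version does not have Catalog.tableExists yet (it is a fairly recent addition to the Python API), a sketch of a fallback is to look the name up in the catalog listing:
# Fallback: check the list of tables/temp views known to the catalog
print('demo' in [t.name for t in spark.catalog.listTables()])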

Related

How to add a column with duplicate sequence number for spark dataframe in scala?

I need to add a column to a Spark dataframe, which should be a duplicated sequence number, such as [1, 1, 1, 2, 2, 2, 3, 3, 3, ..., 10000, 10000, 10000]. I know that we can use monotonically_increasing_id to get a sequence number as a new column.
val df_new = df.withColumn("id", monotonically_increasing_id)
Then, what is the solution to extend this function to get the duplicate sequence number? Thanks!
You can calculate a row number, subtract 1, divide by 3, cast to integer type, and add 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, monotonically_increasing_id}

val df_new = df.withColumn(
  "id",
  ((row_number().over(Window.orderBy(monotonically_increasing_id())) - 1) / 3).cast("int") + 1
)
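Since this page is tagged pyspark, here is a rough PySpark equivalent of the same idea (a sketch only; df stands for whatever DataFrame you already have):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows globally, then map every block of 3 consecutive rows to one id
w = Window.orderBy(F.monotonically_increasing_id())
df_new = df.withColumn("id", ((F.row_number().over(w) - 1) / 3).cast("int") + 1)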

Calculate date difference for a specific column ID Scala

I need to calculate a date difference for a column, considering a specific ID shown in a different column and the first date for that specific ID, using Scala.
I have a dataset with three columns: ID, date and rank.
The column ID shows the specific ID previously mentioned, the column date shows the date of the event, and the column rank shows the chronological position of the event dates for each specific ID.
I need to calculate for ID 1, the date difference for ranks 2 and 3 compared to rank 1 for that same ID, the same for ID 2 and so forth.
The expected result is an extra column with the difference in days between each row's date and the rank-1 date for the same ID.
Does somebody know how to do it?
Thanks!!!
Outside of using a library like Spark to reason about your data in SQL-esque terms, this can be accomplished using the Collections API by first finding the minimum date for each ID and then comparing the dates in the original collection:
import java.time.temporal.ChronoUnit.DAYS
import java.time.LocalDate

case class Input(id: Int, date: LocalDate, rank: Int)
case class Output(id: Int, date: LocalDate, rank: Int, diff: Long)

val inData = Seq(
  Input(1, LocalDate.of(2020, 12, 10), 1),
  Input(1, LocalDate.of(2020, 12, 12), 2),
  Input(1, LocalDate.of(2020, 12, 16), 3),
  Input(2, LocalDate.of(2020, 12, 11), 1),
  Input(2, LocalDate.of(2020, 12, 13), 2),
  Input(2, LocalDate.of(2020, 12, 14), 3))

// Earliest date per id
val minDates = inData.groupMapReduce(_.id)(identity) { (a, b) =>
  if (a.date.isBefore(b.date)) a else b
}
// minDates: Map[Int, Input] = Map(1 -> Input(1, 2020-12-10, 1), 2 -> Input(2, 2020-12-11, 1))

// Difference in days of each row's date against the earliest date for its id
val outData = inData.map(a => Output(a.id, a.date, a.rank, DAYS.between(minDates(a.id).date, a.date)))
// outData: Seq[Output] = List(
//   Output(1, 2020-12-10, 1, 0), Output(1, 2020-12-12, 2, 2), Output(1, 2020-12-16, 3, 6),
//   Output(2, 2020-12-11, 1, 0), Output(2, 2020-12-13, 2, 2), Output(2, 2020-12-14, 3, 3))
You can get the required output by performing the steps below:
//Creating the sample data
import org.apache.spark.sql.types._
import spark.implicits._
val sampledf = Seq((1,"2020-12-10",1),(1,"2020-12-12",2),(1,"2020-12-16",3),(2,"2020-12-08",1),(2,"2020-12-11",2),(2,"2020-12-13",3))
  .toDF("ID","Date","Rank").withColumn("Date",$"Date".cast("Date"))
//adding column with just the value for the rank = 1 column
import org.apache.spark.sql.functions._
val df1 = sampledf.withColumn("Basedate",when($"Rank" === 1 ,$"Date"))
//Doing GroupBy based on ID and basedate column and filtering the records with null basedate
val groupedDF = df1.groupBy("ID","basedate").min("Rank").filter($"min(Rank)" === 1)
//joining the two dataframes and selecting the required columns.
val joinedDF = df1.join(groupedDF.as("t"), Seq("ID"),"left").select("ID","Date","Rank","t.basedate")
//Applying the inbuilt datediff function to get the required output.
val finalDF = joinedDF.withColumn("DateDifference", datediff($"Date",$"basedate"))
finalDF.show(false)
//If using databricks you can use display method.
display(finalDF)
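Since the page is tagged pyspark, here is a rough PySpark sketch of the same result that uses a window with first() instead of the groupBy/join (the sample data mirrors the Scala snippet above):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "2020-12-10", 1), (1, "2020-12-12", 2), (1, "2020-12-16", 3),
     (2, "2020-12-08", 1), (2, "2020-12-11", 2), (2, "2020-12-13", 3)],
    ["ID", "Date", "Rank"]).withColumn("Date", F.col("Date").cast("date"))

# Take the rank-1 date per ID with an ordered window, then diff against it
w = Window.partitionBy("ID").orderBy("Rank")
result = (df.withColumn("BaseDate", F.first("Date").over(w))
            .withColumn("DateDifference", F.datediff("Date", "BaseDate")))
result.show()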

How can I apply boolean indexing in a Spark-Scala dataframe?

I have two Spark-Scala dataframes and I need to use one boolean column from one dataframe to filter the second dataframe. Both dataframes have the same number of rows.
In pandas I would do it like this:
import pandas as pd
df1 = pd.DataFrame({"col1": ["A", "B", "A", "C"], "boolean_column": [True, False, True, False]})
df2 = pd.DataFrame({"col1": ["Z", "X", "Y", "W"], "col2": [1, 2, 3, 4]})
filtered_df2 = df2[df1['boolean_column']]
# Expected filtered_df2:
# pd.DataFrame({"col1": ["Z", "Y"], "col2": [1, 3]})
How can I do the same operation in Spark-Scala in the most time-efficient way?
My current solution is to add "boolean_column" from df1 to df2, then filter df2 by selecting only the rows with a true value in the newly added column and finally removing "boolean_column" from df2, but I'm not sure it is the best solution.
Any suggestion is appreciated.
Edit:
The expected output is a Spark-Scala dataframe (not a list or a column) with the same schema as the second dataframe, and only the subset of rows from df2 that satisfy the boolean mask from the "boolean_column" of df1.
The schema of df2 presented above is just an example. I'm expecting to receive df2 as a parameter, with any number of columns of different (and not fixed) schemas.
You can zip both datasets (here as RDDs) and filter on the resulting tuples:
val ints = sparkSession.sparkContext.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val bools = sparkSession.sparkContext.parallelize(List(true, false, true, false, true, false, true, false, true, false))
val filtered = ints.zip(bools).filter { case (int, bool) => bool }.map { case (int, bool) => int }
println(filtered.collect().toList) //List(1, 3, 5, 7, 9)
I managed to solve this with the following code:
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
val spark = SparkSession.builder().appName(sc.appName).master(sc.master).getOrCreate()
val sqlContext = spark.sqlContext
def addColumnIndex(df: DataFrame, sqlContext: SQLContext) = sqlContext.createDataFrame(
// Add Column index
df.rdd.zipWithIndex.map{case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex)},
// Create schema
StructType(df.schema.fields :+ StructField("columnindex", LongType, nullable = false))
)
import spark.implicits._
val DF1 = Seq(
("A", true),
("B", false),
("A", true),
("C", false)
).toDF("col1", "boolean_column")
val DF2 = Seq(
("Z", 1),
("X", 2),
("Y", 3),
("W", 4)
).toDF("col_1", "col_2")
// Add index
val DF1WithIndex = addColumnIndex(DF1, sqlContext)
val DF2WithIndex = addColumnIndex(DF2, sqlContext)
// Join
val joinDF = DF2WithIndex
.join(DF1WithIndex, Seq("columnindex"))
.drop("columnindex", "col1")
// Filter
val filteredDF2 = joinDF.filter(joinDF("boolean_column")).drop("boolean_column")
The filtered dataframe will be the following:
+-----+-----+
|col_1|col_2|
+-----+-----+
| Z| 1|
| Y| 3|
+-----+-----+
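For reference, a rough PySpark version of the same index-and-join idea (a sketch only, since the page is tagged pyspark):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("A", True), ("B", False), ("A", True), ("C", False)],
                            ["col1", "boolean_column"])
df2 = spark.createDataFrame([("Z", 1), ("X", 2), ("Y", 3), ("W", 4)],
                            ["col_1", "col_2"])

def add_index(df):
    # Attach a stable row index via zipWithIndex, then rebuild the DataFrame
    rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    return spark.createDataFrame(rdd, df.columns + ["idx"])

filtered_df2 = (add_index(df2)
                .join(add_index(df1).select("idx", "boolean_column"), "idx")
                .filter(F.col("boolean_column"))
                .drop("idx", "boolean_column"))
filtered_df2.show()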

Pyspark isin with column in argument doesn't exclude rows

I need to exclude rows which don't have a True value in the column status.
In my opinion this filter(isin() == False) structure should solve my problem, but it doesn't.
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
df_t = df[df.status == "True"]
from pyspark.sql import functions as sf
df_f = df.filter(df.status.isin(df_t.name)== False)
I expect to get only this row:
B | False
any help is greatly appreciated!
First, I think in your last statement, you meant to use df.name instead of df.status.
df_f = df.filter(df.status.isin(df_t.name)== False)
Second, even if you use df.name, it still won't work, because it mixes columns (Column type) from two different DataFrames, df_t and df, in the final statement. I don't think this works in PySpark.
However, you can achieve the same effect using other methods.
If I understand correctly, you want to first select 'A' and 'C' through the 'status' column, and then select the rows excluding ['A', 'C']. The trick is to extend the selection to the second row of 'A', which can be achieved with a Window. See below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame([( "A", "True"), ( "A", "False"), ( "B", "False"), ("C", "True")], ( "name", "status"))
df.registerTempTable("df")
# create an auxiliary column satisfying the condition
df = df.withColumn("flag", F.when(df['status']=="True", 1).otherwise(0))
df.show()
# extend the selection to other rows with the same 'name'
df = df.withColumn('flag', F.max(df['flag']).over(Window.partitionBy('name')))
df.show()
#filter is now easy
df_f = df.filter(df.flag==0)
df_f.show()
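An alternative sketch, using df and df_t exactly as defined in the question: a left anti join against the names whose status is True removes every row of those names in one step.
# Keep only the rows whose name never appears among the "True" names
df_f = df.join(df_t.select("name"), on="name", how="left_anti")
df_f.show()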

Fetch columns based on list in Spark

I have a list List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13) and I have a dataframe which reads input from a text file with no headers. I want to fetch the columns mentioned in my list from that dataframe (inputFile). My input file has more than 20 columns, but I want to fetch only the columns mentioned in my list.
val inputFile = spark.read
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("delimiter", "|")
.load("C:\\demo.txt")
You can get the required columns using the following:
import org.apache.spark.sql.functions.col

val fetchIndex = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
val fetchCols = inputFile.columns.zipWithIndex
  .filter { case (colName, idx) => fetchIndex.contains(idx) }
  .map(x => col(x._1))
inputFile.select(fetchCols : _*)
Basically, zipWithIndex adds a continuous index to each element of the collection, so you get something like this:
df.columns.zipWithIndex.filter { case (data, idx) => a.contains(idx) }.map(x => col(x._1))
res8: Array[org.apache.spark.sql.Column] = Array(companyid, event, date_time)
And then you can just use the splat operator to pass the generated array as varargs to the select function.
You can use the following steps to get the columns whose indexes you have defined in a list.
You can get the column names by doing the following
val names = df.schema.fieldNames
And you have a list of column indexes as
val list = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
Now you can select the column names at the indexes contained in the list:
val selectCols = list.map(x => names(x))
The last step is to select only those columns:
import org.apache.spark.sql.functions.col
val selectedDataFrame = df.select(selectCols.map(col): _*)
You should now have a dataframe containing only the columns at the indexes mentioned in the list.
Note: indexes in the list must not exceed the highest column index present in the dataframe.
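For completeness, since the page is tagged pyspark, the same index-based selection can be sketched in PySpark in a couple of lines (the file path and delimiter are taken from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
inputFile = (spark.read
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .csv("C:\\demo.txt"))

fetch_index = [0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13]
# Map each index to its column name and select only those columns
selected = inputFile.select([inputFile.columns[i] for i in fetch_index])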