Scala code to label rows of a DataFrame based on another DataFrame

I just started learning Scala to do data analytics, and I ran into a problem when trying to label the rows of one data frame based on another data frame.
Suppose I have a DataFrame df1 with columns "date", "id", "value", and "label", where "label" is initially set to "F" for every row. I also have a smaller DataFrame df2 with columns "date", "id", "value". I want to change a row's label in df1 from "F" to "T" if that row appears in df2, i.e. some row in df2 has the same combination of ("date", "id", "value") as that row in df1.
I tried df.filter and df.join, but neither seems to solve my problem.

I think this is what you are looking for.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

// Create DataFrame 1
val df1 = spark.sparkContext.parallelize(Seq(
  ("2016-01-01", 1, "abcd", "F"),
  ("2016-01-01", 2, "efg", "F"),
  ("2016-01-01", 3, "hij", "F"),
  ("2016-01-01", 4, "klm", "F")
)).toDF("date", "id", "value", "label")

// Create DataFrame 2
val df2 = spark.sparkContext.parallelize(Seq(
  ("2016-01-01", 1, "abcd"),
  ("2016-01-01", 3, "hij")
)).toDF("date1", "id1", "value1")

val condition = $"date" === $"date1" && $"id" === $"id1" && $"value" === $"value1"

// Join the two DataFrames with the above condition
val result = df1.join(df2, condition, "left")

// Check whether both sides matched, then drop the joined columns
val finalResult = result.withColumn("label", condition)
  .drop("date1", "id1", "value1")

// Update the label column from true/false to "T"/"F"
finalResult.withColumn("label", when(col("label") === true, "T").otherwise("F")).show
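With the sample data above, the final show should produce output along these lines (exact formatting may differ):

+----------+---+-----+-----+
|      date| id|value|label|
+----------+---+-----+-----+
|2016-01-01|  1| abcd|    T|
|2016-01-01|  2|  efg|    F|
|2016-01-01|  3|  hij|    T|
|2016-01-01|  4|  klm|    F|
+----------+---+-----+-----+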

The basic idea is to join the two DataFrames and then calculate the result. Something like this:
val df2Mod = df2.withColumn("tmp", lit(true))
val joined = df1.join(df2Mod, df1("date") <=> df2Mod("date") && df1("id") <=> df2Mod("id") && df1("value") <=> df2Mod("value"), "left_outer")
joined.withColumn("label", when(joined("tmp").isNull, "F").otherwise("T"))
The idea is that we add the "tmp" column and then do a left_outer join. "tmp" would be null for everything not in df2 and therefore we can use that to calculate the label.
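For reference, a runnable sketch of this idea using the df1 and df2 defined in the first answer above (note that df2's columns there are named date1, id1 and value1, so no renaming is needed to avoid duplicate column names):

import org.apache.spark.sql.functions.{lit, when}

// Flag every df2 row, then left-outer join on the three key columns.
val df2Mod = df2.withColumn("tmp", lit(true))
val joined = df1.join(
  df2Mod,
  df1("date") <=> df2Mod("date1") &&
    df1("id") <=> df2Mod("id1") &&
    df1("value") <=> df2Mod("value1"),
  "left_outer")

// Rows without a match keep a null "tmp", so they get "F"; matched rows get "T".
val labeled = joined
  .withColumn("label", when(joined("tmp").isNull, "F").otherwise("T"))
  .drop("date1", "id1", "value1", "tmp")

labeled.show()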

Related

Pyspark isin with column in argument doesn't exclude rows

I need to exclude rows which don't have the value True in the column status.
In my opinion, this filter(isin() == False) structure should solve my problem, but it doesn't.
df = sqlContext.createDataFrame([("A", "True"), ("A", "False"), ("B", "False"), ("C", "True")], ("name", "status"))
df.registerTempTable("df")
df_t = df[df.status == "True"]
from pyspark.sql import functions as sf
df_f = df.filter(df.status.isin(df_t.name) == False)
I expect this row:
B | False
Any help is greatly appreciated!
First, I think in your last statement, you meant to use df.name instead of df.status.
df_f = df.filter(df.status.isin(df_t.name)== False)
Second, even if you use df.name, it still won't work.
That's because it mixes columns (Column type) from two DataFrames, i.e. df_t and df, in the final statement. I don't think this works in PySpark.
However, you can achieve the same effect using other methods.
If I understand correctly, you want to select 'A' and 'C' first through the 'status' column, then select the rows excluding ['A', 'C']. The tricky part is to extend the selection to the second row of 'A', which can be achieved with a Window. See below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame([("A", "True"), ("A", "False"), ("B", "False"), ("C", "True")], ("name", "status"))
df.registerTempTable("df")
# create an auxiliary column satisfying the condition
df = df.withColumn("flag", F.when(df['status'] == "True", 1).otherwise(0))
df.show()
# extend the selection to other rows with the same 'name'
df = df.withColumn('flag', F.max(df['flag']).over(Window.partitionBy('name')))
df.show()
# filtering is now easy
df_f = df.filter(df.flag == 0)
df_f.show()

Spark Scala: select column names from another dataframe

There are two JSON files, and the first one always has more columns; it is a superset of the second.
val df1 = spark.read.json(sqoopJson)
val df2 = spark.read.json(kafkaJson)
Except operation:
I'd like to apply the except operation on df1 and df2, but df1 has 10 columns and df2 has only 8 columns.
If I manually drop 2 columns from df1, then except works. But I have 50+ tables/JSON files and need to do EXCEPT for all 50 sets of tables/JSON files.
Question:
How can I select from df1 only the (8) columns available in df2 and create a new df3? Then df3 will have data from df1 with a limited set of columns that matches df2's columns.
For the question: how to select from df1 only the (8) columns available in df2 and create a new df3?
import org.apache.spark.sql.functions.col

// Get the 8 column names from df2
val columns = df2.schema.fieldNames.map(col(_))
// Select only those columns from df1
val df3 = df1.select(columns: _*)
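For the broader EXCEPT use case with 50+ tables, the same idea can be wrapped in a small helper; a sketch, where dfPairs is a hypothetical sequence of (superset, subset) DataFrame pairs:

import org.apache.spark.sql.DataFrame

// Restrict the superset DataFrame to the subset's columns, then run except.
def exceptOnCommonColumns(big: DataFrame, small: DataFrame): DataFrame = {
  val cols = small.schema.fieldNames.map(col(_))
  big.select(cols: _*).except(small)
}

// Hypothetical usage over all table pairs:
// val diffs = dfPairs.map { case (big, small) => exceptOnCommonColumns(big, small) }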
Hope this helps!

[Spark SQL]: Lookup functionality given two DataFrames and creating a new DataFrame

I am using Scala with Spark 1.5.
Given two DataFrames, DataFrame1 and DataFrame2, I want to look up values in DataFrame2 for the keys in DataFrame1 and create DataFrame3 with the result. What makes this tricky is that DataFrame1 has several keys in each row, and the output DataFrame should have the keys and values populated in the same order, as shown in the output DataFrame below. I'm looking for a distributed solution, if possible, as this functionality should be implemented on millions of records (~10 million records). Any directions on how to proceed and information on useful methods would be of great help. Thanks in advance!
Input: DataFrame1 (contract_id along with maximum of 4 customers associated)
contract_id, cust1_id, cust2_id, cust3_id, cust4_id
500001,100000001,100000002,100000003,100000004
500305,100000001,100000002,100000007
500303,100000021
500702,110000045
500304,100000021,100000051,120000051
503001,540000012,510000012,500000002,510000002
503051,880000045
Input: DataFrame2 (Customer master lookup information)
cust_id,date_of_birth
100000001,1988-11-04
100000002,1955-11-16
100000003,1980-04-14
100000004,1980-09-26
100000007,1942-03-07
100000021,1964-06-22
100000051,1920-03-12
120000051,1973-11-17
110000045,1955-11-16
880000045,1980-04-14
540000012,1980-09-26
510000012,1973-03-15
500000002,1958-08-18
510000002,1942-03-07
Output: DataFrame3
contract_id, cust1_id, cust2_id, cust3_id, cust4_id, cust1_dob, cust2_dob, cust3_dob, cust4_dob
500001,100000001,100000002,100000003,100000004,1988-11-04,1955-11-16,1980-04-14,1980-09-26
500305,100000001,100000002,100000007,,1988-11-04,1955-11-16,1942-03-07,
500303,100000021,,,,1964-06-22,,,
500702,110000045,,,,1955-11-16,,,
500304,100000021,100000051,120000051,,1964-06-22,1920-03-12,1973-11-17,
503001,540000012,510000012,500000002,510000002,1980-09-26,1973-03-15,1958-08-18,1942-03-07
503051,880000045,,,,1980-04-14,,,
This may not be the most efficient solution, but it works for your case.
import spark.implicits._

val df1 = spark.sparkContext.parallelize(Seq(
  ("500001", "100000001", "100000002", "100000003", "100000004"),
  ("500305", "100000001", "100000002", "100000007", ""),
  ("500303", "100000021", "", "", ""),
  ("500702", "110000045", "", "", ""),
  ("500304", "100000021", "100000051", "120000051", ""),
  ("503001", "540000012", "510000012", "500000002", "510000002"),
  ("503051", "880000045", "", "", "")
)).toDF("contract_id", "cust1_id", "cust2_id", "cust3_id", "cust4_id")

val df2 = spark.sparkContext.parallelize(Seq(
  ("100000001", "1988-11-04"),
  ("100000002", "1955-11-16"),
  ("100000003", "1980-04-14"),
  ("100000004", "1980-09-26"),
  ("100000007", "1942-03-07"),
  ("100000021", "1964-06-22"),
  ("100000051", "1920-03-12"),
  ("120000051", "1973-11-17"),
  ("110000045", "1955-11-16"),
  ("880000045", "1980-04-14"),
  ("540000012", "1980-09-26"),
  ("510000012", "1973-03-15"),
  ("500000002", "1958-08-18"),
  ("510000002", "1942-03-07")
)).toDF("cust_id", "date_of_birth")

// Join df2 once per customer column, dropping the lookup key and renaming the dob column each time
val finalDF = df1
  .join(df2, df1("cust1_id") === df2("cust_id"), "left")
  .drop("cust_id")
  .withColumnRenamed("date_of_birth", "cust1_dob")
  .join(df2, df1("cust2_id") === df2("cust_id"), "left")
  .drop("cust_id")
  .withColumnRenamed("date_of_birth", "cust2_dob")
  .join(df2, df1("cust3_id") === df2("cust_id"), "left")
  .drop("cust_id")
  .withColumnRenamed("date_of_birth", "cust3_dob")
  .join(df2, df1("cust4_id") === df2("cust_id"), "left")
  .drop("cust_id")
  .withColumnRenamed("date_of_birth", "cust4_dob")

finalDF.na.fill("").show()
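The four nearly identical join blocks can also be folded into a loop. A sketch of the same approach (not from the original answer), folding over the customer columns so the join logic is written only once:

val custCols = Seq("cust1_id", "cust2_id", "cust3_id", "cust4_id")

// Join df2 once per customer column, dropping the lookup key and renaming the dob column each time.
val finalDF2 = custCols.zipWithIndex.foldLeft(df1) { case (acc, (c, i)) =>
  acc.join(df2, acc(c) === df2("cust_id"), "left")
    .drop("cust_id")
    .withColumnRenamed("date_of_birth", s"cust${i + 1}_dob")
}

finalDF2.na.fill("").show()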

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that are at certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only the columns at indexes (10, 12, 13, 14, 15), how do I do that?
The line below selects all columns from dataframe df whose names appear in the Array colNames:
df = df.select(colNames.head, colNames.tail: _*)
If there is a similar colNos array, which has
colNos = Array(10,20,25,45)
how do I transform the above df.select to fetch only the columns at those specific indexes?
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
@user6910411's answer above works like a charm, and the number of tasks/logical plan is similar to my approach below. But my approach is a bit faster.
So, I would suggest going with column names rather than column numbers. Column names are much safer and much lighter than numbers. You can use the following solution:
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write out all 100 column names, there is a shortcut too:
val colNames = df.schema.fieldNames
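To connect this shortcut back to index-based selection, a small sketch using the colNames defined above (the colNos values are the hypothetical indexes from the question):

val colNos = Array(10, 20, 25, 45)
// Look up the names at the requested indexes, then select those columns by name
val selected = df.select(colNos.map(i => df.col(colNames(i))): _*)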
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name => col(name)): _*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert the Array[String] into an Array[org.apache.spark.sql.Column] (and expand it with : _*) in order for select to work.
Or wrap it in a function using currying (high five to my colleague for this):
import org.apache.spark.sql.DataFrame
// Subsets the DataFrame to the columns between the beg_val and end_val indexes.
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}
// Get first 25 columns as a subsetted dataframe
val subset_df: DataFrame = df.transform(subset_frame(0, 25))

How to compare records in a dataframe in scala

For example I have a dataframe as below:
var tmp_df = sqlContext.createDataFrame(Seq(
  ("One", "Sagar", 1),
  ("Two", "Ramesh", 2),
  ("Three", "Suresh", 3),
  ("One", "Sagar", 5)
)).toDF("ID", "Name", "Balance")
Now I want to write all records from the above dataframe that share the same ID into one file per ID. Please advise.
// Find the IDs that occur more than once and rename the ID column to idstowrite
val idsMoreThanOne = tmp_df.groupBy('ID).count.filter('count.gt(1)).withColumnRenamed("ID", "idstowrite")
idsMoreThanOne.show
// Join back with the original dataframe
val joinedDf = idsMoreThanOne.join(tmp_df, tmp_df("ID") === idsMoreThanOne("idstowrite"), "left")
joinedDf.show
// Select only the columns we want
val dfToWrite = joinedDf.select("ID", "Name", "Balance")
dfToWrite.show
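To actually get one output directory per ID, a hedged sketch using partitionBy on write, assuming Spark 2.x for the built-in csv writer (the output path is just an example; each partition directory may still contain several part files unless you coalesce first):

// Write the records partitioned by ID: one directory per distinct ID value
dfToWrite.write
  .partitionBy("ID")
  .mode("overwrite")
  .csv("output/records_by_id")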