Remove a duplicate column from a DataFrame using Scala

I have a DataFrame in which two columns share the same name. I need to remove only one of them and keep the other for further use.
For example, given this input DF:
sno | age | psk | psk
---------------------
1 | 12 | a4 | a4
I would like to obtain this output DF:
sno | age | psk
----------------
1 | 12 | a4

One way is to go through the RDD API, though you need to know the index of the duplicate column so that you can leave it out before converting back to a DataFrame.
If you have a DataFrame with duplicate columns such as
+---+---+---+---+
|sno|age|psk|psk|
+---+---+---+---+
|1 |12 |a4 |a4 |
+---+---+---+---+
you know that the last two columns (indices 2 and 3) are duplicates.
The next step is to build the list of column names with duplicates removed and form a schema from it:
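import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val columns = df.columns.distinct // distinct keeps the original column order; toSet does not guarantee it
val schema = StructType(columns.map(name => StructField(name, StringType, true)))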
The key step is to convert the DataFrame to an RDD and leave out the unwanted column (here the 4th column, index 3). Every value is converted to a String so that it matches the all-StringType schema:
val rdd = df.rdd.map(row => Row.fromSeq(Seq(row(0).toString, row(1).toString, row(2).toString)))
The final step is to convert the RDD back to a DataFrame using the schema:
sqlContext.createDataFrame(rdd, schema).show(false)
which should give you
+---+---+---+
|sno|age|psk|
+---+---+---+
|1 |12 |a4 |
+---+---+---+
I hope the answer is helpful
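If the RDD round trip feels heavy, another option is to give every column a unique name by position and then drop the extra one. The following is only a sketch under assumptions: it creates a local SparkSession, rebuilds the example frame by hand, and the "_dup" suffix is an illustrative naming choice that does not come from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("drop-duplicate-column").getOrCreate()
import spark.implicits._

// Recreate the example input; toDF accepts repeated column names.
val df = Seq((1, 12, "a4", "a4")).toDF("sno", "age", "psk", "psk")

val names = df.columns
// A column is a repeat if the same name already appeared at an earlier index.
val isRepeat = names.indices.map(i => names.indexOf(names(i)) != i)
// Keep the first occurrence as is; give later repeats a positional suffix (hypothetical naming).
val uniqueNames = names.indices.map(i => if (isRepeat(i)) s"${names(i)}_dup$i" else names(i))
val toDrop = names.indices.filter(i => isRepeat(i)).map(i => uniqueNames(i))

// toDF renames columns by position, so the ambiguous name is never looked up by name.
val deduped = df.toDF(uniqueNames: _*).drop(toDrop: _*)
deduped.show(false)
```

Unlike the RDD version, this keeps the original column types (sno and age stay integers) and does not require hard-coding the row indices.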

Related

export many files from a table

I have a SQL query that generates a table in the format below:
|sex |country|popularity|
|null |null | x |
|null |value | x |
|value|null | x |
|value|null | x |
|null |value | x |
|value|value | x |
The value in the sex column can be woman or man.
The value in the country column can be Italy, England, US, etc.
x is an int.
Now I would like to save four files based on the (value, null) pattern of the sex and country columns: file1 contains the rows with (value, value), file2 the rows with (value, null), file3 the rows with (null, value), and file4 the rows with (null, null).
I have searched a lot but couldn't find anything useful. I also tried the following:
val df1 = data.withColumn("combination",concat(col("sex") ,lit(","), col("country")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
but I get more files than I want, because this generates one file for every distinct (sex, country) pair of values.
The same happens with:
val df1 = data.withColumn("combination",concat(col("sex")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
Is there anything similar to partitionBy that splits on the (value, null) pattern of the pair rather than on the actual column values?
You can convert each column into a Boolean that says whether it is null, and concatenate the two into a string, which will look like "true_true", "true_false" etc. (true meaning the column is null).
val tagged = df.withColumn("coltype", concat(col("sex").isNull.cast("string"), lit("_"), col("country").isNull.cast("string")))
tagged.coalesce(1)
.write
.partitionBy("coltype")
.format("csv")
.option("header", "true")
.mode("overwrite")
.save("output")
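To see which partitions this produces before writing anything, the partition key can be computed on a small sample. The sketch below assumes a local SparkSession and made-up rows in the shape of the question's table; it labels each column as "null" or "value" instead of true/false, which is only a readability choice.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, when}

val spark = SparkSession.builder().master("local[1]").appName("null-pattern").getOrCreate()
import spark.implicits._

// Made-up sample rows; None becomes a SQL null.
val data = Seq[(Option[String], Option[String], Int)](
  (None, None, 1),
  (None, Some("Italy"), 2),
  (Some("woman"), None, 3),
  (Some("man"), None, 4),
  (Some("man"), Some("US"), 5)
).toDF("sex", "country", "popularity")

// Label a column by whether it holds a value, ignoring what the value is.
def tag(name: String): Column = when(col(name).isNull, lit("null")).otherwise(lit("value"))

val tagged = data.withColumn("combination", concat(tag("sex"), lit("_"), tag("country")))

// At most four distinct keys can occur, whatever values sex and country take.
val combos = tagged.select("combination").distinct().as[String].collect().toSet
println(combos)
```

Passing "combination" to partitionBy in the same write chain as above then yields one output directory per pattern.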

Spark Dataframes: Add Conditional column to dataframe

I want to add a conditional column Flag to dataframe A. Flag should be 1 when the following two conditions are both satisfied, otherwise 0:
num from dataframe A is between numStart and numEnd from dataframe B.
If the above condition is satisfied, include is 1.
DataFrame A (it's a very big dataframe, containing millions of rows):
+----+------+-----+------------------------+
|num |food |price|timestamp |
+----+------+-----+------------------------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|
+----+------+-----+------------------------+
DataFrame B (it's a very small DF, containing only 100 rows):
+----------+-----------+-------+
|numStart |numEnd |include|
+----------+-----------+-------+
|0 |200 |1 |
|250 |1050 |0 |
|2000 |3000 |1 |
|10001 |15001 |1 |
+----------+-----------+-------+
Expected output:
+----+------+-----+------------------------+----------+
|num |food |price|timestamp |Flag |
+----+------+-----+------------------------+----------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|0 |
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|1 |
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|1 |
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|0 |
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|0 |
+----+------+-----+------------------------+----------+
You can left-join dfB to dfA on the first condition you described, then build a Flag column using withColumn and the coalesce function to "default" to 0:
Records for which a match was found use the include value of the matching dfB record.
Records for which there was no match have include=null, and per your requirement such records should get Flag=0. coalesce returns its first non-null argument, so passing lit(0) as the second argument supplies that default.
Lastly, get rid of the dfB columns which are of no interest to you:
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming "spark" is your SparkSession
dfA.join(dfB, $"num".between($"numStart", $"numEnd"), "left")
.withColumn("Flag", coalesce($"include", lit(0)))
.drop(dfB.columns: _*)
.show()
// +----+------+-----+--------------------+----+
// | num| food|price| timestamp|Flag|
// +----+------+-----+--------------------+----+
// |1275|tomato| 1.99|2018-07-21T00:00:...| 0|
// | 145|carrot| 0.45|2018-07-21T00:00:...| 1|
// |2678| apple| 0.99|2018-07-21T01:00:...| 1|
// |6578|banana| 1.29|2018-07-20T01:11:...| 0|
// |1001| taco| 2.59|2018-07-21T01:00:...| 0|
// +----+------+-----+--------------------+----+
Join the two dataframes on the first condition while keeping all rows of dataframe A (i.e. with a left join, see the code below). After the join, the include column can be renamed to Flag and any null values in it set to 0. The two extra columns, numStart and numEnd, are dropped.
The code can thus be written as follows:
A.join(B, $"num" >= $"numStart" && $"num" <= $"numEnd", "left")
.withColumnRenamed("include", "Flag")
.drop("numStart", "numEnd")
.na.fill(Map("Flag" -> 0))
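Since dataframe B is tiny and A is very large, it may help to hint that B should be broadcast to every executor. This is a sketch rather than a tuned solution: it assumes a local SparkSession, rebuilds the sample data without the timestamp column, and uses the broadcast function from org.apache.spark.sql.functions. Whether the hint improves things depends on your cluster and Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, coalesce, lit}

val spark = SparkSession.builder().master("local[1]").appName("range-flag").getOrCreate()
import spark.implicits._

// Sample data from the question (timestamp omitted for brevity).
val dfA = Seq(
  (1275, "tomato", 1.99), (145, "carrot", 0.45), (2678, "apple", 0.99),
  (6578, "banana", 1.29), (1001, "taco", 2.59)
).toDF("num", "food", "price")

val dfB = Seq(
  (0, 200, 1), (250, 1050, 0), (2000, 3000, 1), (10001, 15001, 1)
).toDF("numStart", "numEnd", "include")

// Left join keeps every row of dfA; rows outside all ranges get include = null.
val flagged = dfA
  .join(broadcast(dfB), $"num".between($"numStart", $"numEnd"), "left")
  .withColumn("Flag", coalesce($"include", lit(0)))
  .drop("numStart", "numEnd", "include")

val flags = flagged.select("num", "Flag").as[(Int, Int)].collect().toMap
println(flags)
```

Note that if the ranges in B overlapped, a row of A could match more than once and appear several times in the result; the sample ranges are disjoint, so that does not happen here.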

Scala: how to match two DataFrames and, on a match, update the key in the first one

I have data in two dataframes:
selectedPersonDF:
ID key
1
2
3
4
5
selectedDetailsDF:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 10 10 765
10 12 19 909
11 2 20 708
Code :
val personDF = spark.read.option("header", "true").option("inferSchema", "false").csv("person.csv")
val detailsDF = spark.read.option("header", "true").option("inferSchema", "false").csv("details.csv")
val selectedPersonDF = personDF.select(col("ID"), col("key"))
val selectedDetailsDF = detailsDF.select(col("first"), col("second"), col("third"), col("key"))
selectedPersonDF.show()
selectedDetailsDF.show()
I have to match the ID column of selectedPersonDF against all the columns (first, second, third) of selectedDetailsDF. If any of those columns matches the person's ID, the key value from selectedDetailsDF should be written into the key column of selectedPersonDF.
Expected output (in selectedPersonDF):
ID key
1 777
2 708
3
4
5
After removing the rows of the person DataFrame that matched the details DataFrame, the remaining rows should be stored in another DataFrame.
You can use a left join with the individual equality checks combined by ||:
val finalDF = selectedPersonDF.join(selectedDetailsDF.withColumnRenamed("key", "key2"), $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
.select($"ID", $"key2".as("key"))
finalDF.show(false) // show returns Unit, so call it separately instead of assigning its result
so finalDF should give you
+---+----+
|ID |key |
+---+----+
|1 |777 |
|2 |708 |
|3 |null|
|4 |null|
|5 |null|
+---+----+
We can call .na.fill("") on the above dataframe (the key column has to be of StringType) to get
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+
After that you can use filter on the key column to separate the filled dataframe into non-matching rows (empty key) and matching rows (key with a value):
val notMatchingDF = finalDF.filter($"key" === "")
val matchingDF = finalDF.except(notMatchingDF)
Update: if the column names of selectedDetailsDF are unknown, apart from the key column
If the column names of the second dataframe are unknown, you will have to combine all the unknown columns into a single array column:
val columnsToCheck = selectedDetailsDF.columns.filterNot(_ == "key").toList
import org.apache.spark.sql.functions._
val tempSelectedDetailsDF = selectedDetailsDF.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
Now tempSelectedDetailsDF dataframe has two columns: combined column of all the unknown columns as an array column and the key column renamed as key2.
After that you would need a udf function for checking the condition while joining
val arrayContains = udf((array: Seq[String], value: String) => array.contains(value))
And then you join the dataframes using the call to the defined udf function as
val finalDF = selectedPersonDF.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
.select($"ID", $"key2".as("key"))
.na.fill("")
Rest of the process is already defined above.
I hope the answer is helpful and understandable.
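The unknown-columns case can also be handled without a UDF by building the join condition from the column list. This is a sketch under assumptions: it uses a local SparkSession, recreates the sample data with string columns (as inferSchema is off in the question), and gives the person frame only an ID column so that the joined key is unambiguous.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("match-any-column").getOrCreate()
import spark.implicits._

// Sample data from the question, all values as strings.
val person = Seq("1", "2", "3", "4", "5").toDF("ID")
val details = Seq(
  ("1", "9", "9", "777"), ("9", "8", "8", "878"), ("8", "10", "10", "765"),
  ("10", "12", "19", "909"), ("11", "2", "20", "708")
).toDF("first", "second", "third", "key")

// Every column except key takes part in the match, whatever it is called.
val columnsToCheck = details.columns.filterNot(_ == "key")
// ID === first || ID === second || ... built from the list.
val matchAny = columnsToCheck.map(c => col("ID") === col(c)).reduce(_ || _)

val joined = person.join(details, matchAny, "left").select(col("ID"), col("key"))

// Split on null directly instead of filling with "" first.
val matching = joined.filter(col("key").isNotNull)
val notMatching = joined.filter(col("key").isNull)

val matched = matching.as[(String, String)].collect().toSet
val unmatched = notMatching.select("ID").as[String].collect().toSet
println(matched)
println(unmatched)
```

Keeping the condition as plain column expressions lets Spark see the whole predicate, whereas a UDF is opaque to the optimizer.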

How to fetch the value and type of each column of each row in a dataframe?

How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo | bar |
+-------+-------+
| 12345 | fnord |
| 42 | baz |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.
If I understand your question correctly, the following should work for you.
val df = Seq(
(12345, "fnord"),
(42, "baz"))
.toDF("foo", "bar")
This recreates the dataframe you already have:
+-----+-----+
| foo| bar|
+-----+-----+
|12345|fnord|
| 42| baz|
+-----+-----+
The next step is to extract the dataType of each field from the dataframe's schema, giving a sequence of types in column order:
val fieldTypesList = df.schema.map(struct => struct.dataType)
The next step is to go through the rows as an RDD and pair each value with the dataType at the same position. Zipping by position avoids parsing the row's string form, which breaks when a value contains a comma or when two columns in a row hold the same value:
val tuples = df.rdd.map(row => row.toSeq.zip(fieldTypesList).toList)
Now if we print it
tuples.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
which you can iterate over and work with programmatically.
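For the type-dependent part, pattern matching on the field's dataType is one way to branch. The sketch below assumes a local SparkSession and collects the rows to the driver, which is only reasonable for small frames; the "Integer" and "String" labels are chosen to mirror the shape asked for in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType}

val spark = SparkSession.builder().master("local[1]").appName("value-and-type").getOrCreate()
import spark.implicits._

val df = Seq((12345, "fnord"), (42, "baz")).toDF("foo", "bar")
val fields = df.schema.fields

// For each row, pair every value (as a string) with a label derived from its column type.
val typed = df.collect().toSeq.map { row =>
  row.toSeq.zip(fields).map { case (value, field) =>
    val label = field.dataType match {
      case IntegerType => "Integer"
      case StringType  => "String"
      case other       => other.toString // fall back to Spark's own name for other types
    }
    (String.valueOf(value), label)
  }
}

typed.foreach(println)
```

The same match expression is where any per-type handling would go, for example calling row.getInt for IntegerType columns.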

Apache Spark DataFrame apply custom operation after GroupBy

I have two columns, id and v: id is of type Int and v is of type List[String].
The ids repeat, so to make them unique I apply groupBy("id") on my DataFrame. My problem is that I want to merge the v lists of each id into one list, and the merged list must contain only distinct values.
Example: I have data like
+---+---+
| id| v |
+---+---+
| 1|[a]|
| 1|[b]|
| 1|[a]|
| 2|[e]|
| 2|[b]|
+---+---+
and I want my output to look like this:
+---+-----+
| id|  v  |
+---+-----+
|  1|[a,b]|
|  2|[e,b]|
+---+-----+
I tried this:
val uniqueDF = df.groupBy("id").agg(collect_list("v"))
uniqueDF.map { row => (row.getInt(0),
row.getSeq[Seq[String]](1).flatten.toList.distinct) }
Can I do the same inside agg() after groupBy(), or in some similar way? I do not want to apply a map operation.
Thanks
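val uniqueDF = df.groupBy("id").agg(collect_set("v"))
collect_set keeps only unique values. Note that because v is itself an array column, the result here is an array of arrays (for example [[a], [b]]); to get a flat [a, b], flatten the result or explode v before aggregating.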
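One way to get a flat, de-duplicated array per id without a map step is to explode v before grouping. This is a sketch assuming a local SparkSession and the sample data from the question; the intermediate column name item is made up for the example. The order of elements produced by collect_set is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, explode}

val spark = SparkSession.builder().master("local[1]").appName("collect-distinct").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Seq("a")), (1, Seq("b")), (1, Seq("a")), (2, Seq("e")), (2, Seq("b"))
).toDF("id", "v")

// explode turns each array element into its own row, so collect_set sees plain strings.
val uniqueDF = df
  .select($"id", explode($"v").as("item"))
  .groupBy("id")
  .agg(collect_set("item").as("v"))

uniqueDF.show(false)

// Compare as sets because collect_set does not promise any element order.
val result = uniqueDF.as[(Int, Seq[String])].collect().map { case (id, vs) => id -> vs.toSet }.toMap
```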