select best record possible - scala

I have different files in a directory, as below:
f1.txt
id FName Lname Adrress sex levelId
t1 Girish Hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t3 sara chauhan 15nvi f 6666
f2.txt
t4 girish hm 11oak m 1111
t5 Kiran Kumar 5wren f 2222
t6 Prakash Jha 18nvi f 3333
f3.txt
t7 Kiran Kumar 5wren f 2222
t8 Girish Hm 10oak m 1111
t9 Prakash Jha 18nvi m 3333
f4.txt
t10 Kiran Kumar 5wren f 2222
t11 girish hm 10oak m 1111
t12 Prakash Jha 18nvi f 3333
Only the first name and last name are constant here (case should be ignored); the other columns, Adrress, sex and levelId, can change between files.
The data should first be grouped by first name and last name:
t1 Girish Hm 10oak m 1111
t4 girish hm 11oak m 1111
t8 Girish Hm 10oak m 1111
t11 girish hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t5 Kiran Kumar 5wren f 2222
t7 Kiran Kumar 5wren f 2222
t10 Kiran Kumar 5wren f 2222
t3 sara chauhan 15nvi f 6666
t6 Prakash Jha 18nvi f 3333
t9 Prakash Jha 18nvi m 3333
t12 Prakash Jha 18nvi f 3333
Then we need to choose the appropriate first record from each group, based on the frequency of the values in the columns Adrress, sex and levelId.
Example: for the person Girish Hm,
10oak has the maximum frequency for address,
m has the maximum frequency for sex,
1111 has the maximum frequency for levelId,
so the record with id t1 is the correct one (we need to choose the first appropriate record from the group).
The final output should be:
t1 Girish Hm 10oak m 1111
t5 Kiran Kumar 5wren f 2222
t3 sara chauhan 15nvi f 6666
t6 Prakash Jha 18nvi f 3333

Scala solution:
First define columns of interest:
val cols = Array("Adrress", "sex", "levelId")
Then, for each column of interest, add an array column holding that column's frequency (per person and value) together with its value:
df.select(
  col("*") +: cols.map(
    x => array(
      count(x).over(
        Window.partitionBy(
          lower(col("FName")),
          lower(col("LName")),
          col(x)
        )
      ),
      col(x)
    ).alias(x ++ "_freq")
  ): _*
)
Then group by each person and aggregate to get the value with the maximum frequency (ignore the dummy aggregate; it is only there because agg requires a first argument before the varargs, and it is dropped afterwards):
.groupBy(
  lower(col("FName")).alias("FName"),
  lower(col("LName")).alias("LName"))
.agg(
  count($"*").alias("dummy"),
  cols.map(
    x => max(col(x ++ "_freq"))(1).alias(x)
  ): _*
)
.drop("dummy")
Overall code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val cols = Array("Adrress", "sex", "levelId")
val df = spark.read.option("header", "true").option("delimiter", " ").option("inferSchema", "true").csv("names.txt")
val df2 = (df
.select(col("*") +: cols.map(x => array(count(x).over(Window.partitionBy(lower(col("FName")), lower(col("LName")), col(x))), col(x)).alias(x ++ "_freq")): _*)
.groupBy(lower(col("FName")).alias("FName"), lower(col("LName")).alias("LName"))
.agg(count($"*").alias("dummy"), cols.map(x => max(col(x ++ "_freq"))(1).alias(x)): _*)
.drop("dummy"))
df2.show
+-------+-------+-------+---+-------+
| FName| LName|Adrress|sex|levelId|
+-------+-------+-------+---+-------+
| sara|chauhan| 15nvi| f| 6666|
|prakash| jha| 18nvi| f| 3333|
| girish| hm| 10oak| m| 1111|
| kiran| kumar| 5wren| f| 2222|
+-------+-------+-------+---+-------+
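The df2 above returns the most frequent values but drops the id column that the question's expected output keeps (t1, t5, t3, t6). A minimal sketch of one way to also recover the id, assuming the original load order can be captured with monotonically_increasing_id (an assumption, not part of the answer above): join the mode values in df2 back onto the raw rows and keep the earliest match per person.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rename df2's columns so the join condition below stays unambiguous.
val modes = df2.select(
  col("FName").as("m_fname"), col("LName").as("m_lname"),
  col("Adrress").as("m_addr"), col("sex").as("m_sex"), col("levelId").as("m_level"))

val firstMatch = df
  .withColumn("ord", monotonically_increasing_id())   // assumed stand-in for load order
  .join(modes,
    lower(col("FName")) === col("m_fname") && lower(col("LName")) === col("m_lname") &&
      col("Adrress") === col("m_addr") && col("sex") === col("m_sex") &&
      col("levelId") === col("m_level"))
  .withColumn("rn", row_number().over(
    Window.partitionBy(col("m_fname"), col("m_lname")).orderBy(col("ord"))))
  .filter(col("rn") === 1)
  .select("id", "FName", "Lname", "Adrress", "sex", "levelId")
With the sample data read in file order this keeps t1, t5, t3 and t6, matching the expected output.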

Related

finding the max in a dataframe with multiple columns in pyspark

Please, I need help. I'm new to PySpark and I have this problem:
I have a dataframe with 4 columns like this:
A   B  C   D
O1  2  E1  2
O1  3  E1  1
O1  2  E1  0
O1  5  E2  2
O1  2  E2  3
O1  2  E2  2
O1  5  E2  1
O2  8  E1  2
O2  8  E1  0
O2  0  E1  1
O2  2  E1  4
O2  9  E1  2
O2  2  E2  1
O2  9  E2  4
O2  2  E2  2
and I want to have this (the max of D for each (A, C) pair):
A   B  C   D
O1  2  E1  2
O1  2  E2  3
O2  2  E1  4
O2  9  E2  4
I tried
table.groupby("A","C").agg(round(max("D")))
It did work, but the column B is missing.
Why not use partitionBy instead of groupBy? That way you keep all your columns and retain all your records.
Edit: if you want the distinct values per (A, C), just select the columns you want and take the unique rows.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

table1 = table.withColumn("max_D", F.round(F.max('D').over(Window.partitionBy('A', 'C'))))
table1.select('A', 'B', 'C', 'max_D').distinct().show()
You'd want to apply max to the pair of D and B so you won't lose B when aggregating.
from pyspark.sql import functions as F

(df
    .groupBy('a', 'c')
    .agg(F.max(F.array('d', 'b')).alias('max_d'))
    .select(
        F.col('a'),
        F.col('c'),
        F.col('max_d')[1].alias('b'),
        F.col('max_d')[0].alias('d'),
    )
    .show()
)
+---+---+---+---+
| a| c| b| d|
+---+---+---+---+
| O1| E1| 2| 2|
| O1| E2| 2| 3|
| O2| E1| 2| 4|
| O2| E2| 9| 4|
+---+---+---+---+
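For reference, the same pair trick can be written in Scala with a struct instead of an array. This is a sketch, not taken from the answers above, and it assumes the lowercase a/b/c/d column names used in the answer's df; struct ordering compares fields left to right, so max(struct(d, b)) keeps the b that belongs to the maximal d.
import org.apache.spark.sql.functions.{col, max, struct}

val result = df
  .groupBy(col("a"), col("c"))
  .agg(max(struct(col("d"), col("b"))).alias("max_db"))   // max by d, with b carried along
  .select(
    col("a"),
    col("c"),
    col("max_db.b").alias("b"),
    col("max_db.d").alias("d"))

result.show()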

How to add new columns to a dataframe in a loop using scala on Azure Databricks

I have imported a CSV file into a dataframe in Azure Databricks using Scala.
--------------
A B C D E
--------------
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
--------------
Now I want to compute a hash of some selected columns and add the results as new columns to that dataframe.
--------------------------------
A B B2 C D D2 E
--------------------------------
a1 b1 hash(b1) c1 d1 hash(d1) e1
a2 b2 hash(b2) c2 d2 hash(d2) e2
--------------------------------
This is the code I have:
val data_df = spark.read.format("csv").option("header", "true").option("sep", ",").load(input_file)
...
...
for (col <- columns) {
  if (columnMapping.keys.contains(col)) {
    val newColName = col + "_token"
    // Now here I want to add a new column to data_df whose content is the hash of the current value
  }
}
// And here I would like to upload the selected columns (B, B2, D, D2) to a SQL database
Any help will be highly appreciated.
Thank you!
Try this -
val colsToApplyHash = Array("B","D")

val hashFunction: String => String = <ACTUAL HASH LOGIC>
val hash = udf(hashFunction)

val finalDf = colsToApplyHash.foldLeft(data_df){
  case (acc, colName) => acc.withColumn(colName + "2", hash(col(colName)))
}
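If a standard hash such as SHA-256 is acceptable, the <ACTUAL HASH LOGIC> placeholder can be avoided entirely by using Spark's built-in sha2 function instead of a UDF. A minimal sketch under that assumption (the final column selection for the SQL upload is also an assumption based on the question):
import org.apache.spark.sql.functions.{col, sha2}

// Same foldLeft pattern as above, but with the built-in SHA-256 hash.
val hashedDf = colsToApplyHash.foldLeft(data_df) {
  case (acc, colName) => acc.withColumn(colName + "2", sha2(col(colName), 256))
}

// Keep only the columns to be written to the SQL database (B, B2, D, D2 here).
val toWrite = hashedDf.select("B", "B2", "D", "D2")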

How to properly join two DataFrames for my case?

I use Spark 2.2.0 and Scala 2.11.8. I have some problems with joining two DataFrames.
df1 =
product1_PK product2_PK
111 222
333 111
...
and:
df2 =
product_PK product_name
111 AAA
222 BBB
333 CCC
I want to get this result:
product1_PK product2_PK product1_name product2_name
111 222 AAA BBB
333 111 CCC AAA
...
How can I do it?
This is what I tried as a partial solution, but I don't know how to efficiently join on both product1_PK and product2_PK and rename the columns:
val result = df1.as("left")
.join(df2.as("right"), $"left.product1_PK" === $"right.product_PK")
.drop($"left.product_PK")
.withColumnRenamed("right.product_name","product1_name")
You need to use two joins: the first for product1_name and the second for product2_name.
df1.join(df2.withColumnRenamed("product_PK", "product1_PK").withColumnRenamed("product_name", "product1_name"), Seq("product1_PK"), "left")
.join(df2.withColumnRenamed("product_PK", "product2_PK").withColumnRenamed("product_name", "product2_name"), Seq("product2_PK"), "left")
.show(false)
You should get your desired output:
+-----------+-----------+-------------+-------------+
|product2_PK|product1_PK|product1_name|product2_name|
+-----------+-----------+-------------+-------------+
|222 |111 |AAA |BBB |
|111 |333 |CCC |AAA |
+-----------+-----------+-------------+-------------+
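A sketch of the same two-join idea using DataFrame aliases instead of renaming (assumed equivalent for this schema), which also puts the columns in the order shown in the question:
import org.apache.spark.sql.functions.col

// Join df2 twice under different aliases, once per product key.
val result = df1
  .join(df2.as("p1"), col("product1_PK") === col("p1.product_PK"), "left")
  .join(df2.as("p2"), col("product2_PK") === col("p2.product_PK"), "left")
  .select(
    col("product1_PK"),
    col("product2_PK"),
    col("p1.product_name").alias("product1_name"),
    col("p2.product_name").alias("product2_name"))

result.show(false)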

Split multiple fields or columns of a single row and create multiple rows using Scala

I have a data-frame with 4 fields as mentioned below :
Field1 , Field2 , Field3 , Field4
I have values in the fields as below :
A1 , B1 , C1 , D1
A2 , B2,B3 , C2,C3 , D2,D3
A1 , B4,B5,B6 , C4,C5,C6 , D4,D5,D6
I have to convert it into the below format :
A1 , B1 , C1 , D1
A2 , B2 , C2 , D2
A2 , B3 , C3 , D3
A1 , B4 , C4 , D4
A1 , B5 , C5 , D5
A1 , B6 , C6 , D6
Basically I have to split the comma-separated values in multiple columns and form new rows based on the values, keeping the same order.
You can consider all of them to be of type String. Can you suggest a way to do this splitting and form new rows from the resulting values?
I have already seen a similar question:
How to flatmap a nested Dataframe in Spark
But this question is different, as I have to split multiple columns in this case and the values should not repeat.
You can convert the DataFrame to a Dataset[(String, Seq[String])] and flatMap:
import scala.util.Try

val df = Seq(
  ("A1", "B1", "C1", "D1"),
  ("A2", "B2,B3", "C2,C3", "D2,D3"),
  ("A1", "B4,B5,B6", "C4,C5,C6", "D4,D5,D6")
).toDF("x1", "x2", "x3", "x4")

// A simple sequence of expressions which allows us to flatten the results
val exprs = (0 until df.columns.size).map(i => $"value".getItem(i))

df.select($"x1", array($"x2", $"x3", $"x4")).as[(String, Seq[String])].flatMap {
  case (x1, xs) =>
    Try(xs.map(_.split(",")).transpose).map(_.map(x1 +: _)).getOrElse(Seq())
}.toDF.select(exprs: _*)
// +--------+--------+--------+--------+
// |value[0]|value[1]|value[2]|value[3]|
// +--------+--------+--------+--------+
// | A1| B1| C1| D1|
// | A2| B2| C2| D2|
// | A2| B3| C3| D3|
// | A1| B4| C4| D4|
// | A1| B5| C5| D5|
// | A1| B6| C6| D6|
// +--------+--------+--------+--------+
or use a UDF:
val splitRow = udf((xs: Seq[String]) =>
  Try(xs.map(_.split(",")).transpose).toOption)

// Same as before but we exclude the first column
val exprs = (0 until df.columns.size - 1).map(i => $"xs".getItem(i))

df
  .withColumn("xs", explode(splitRow(array($"x2", $"x3", $"x4"))))
  .select($"x1" +: exprs: _*)
You can use posexplode to solve this quickly. See http://allabouthadoop.net/hive-lateral-view-explode-vs-posexplode/
So your code will be like below:
select
    Field1,
    f2_val as Field2,
    f3_val as Field3,
    f4_val as Field4
from temp_table
lateral view posexplode(split(Field2, ',')) t2 as f2_pos, f2_val
lateral view posexplode(split(Field3, ',')) t3 as f3_pos, f3_val
lateral view posexplode(split(Field4, ',')) t4 as f4_pos, f4_val
where
    f2_pos == f3_pos and f3_pos == f4_pos
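In the DataFrame API, the same position-based alignment can be sketched with arrays_zip and a single explode (this needs Spark 2.4+ and assumes the x1..x4 column names from the first answer):
import org.apache.spark.sql.functions.{arrays_zip, explode, split}

// Split each comma-separated column into an array, zip the arrays element-wise,
// then explode once so the values stay aligned by position.
val result = df
  .withColumn("x2a", split($"x2", ","))
  .withColumn("x3a", split($"x3", ","))
  .withColumn("x4a", split($"x4", ","))
  .withColumn("z", explode(arrays_zip($"x2a", $"x3a", $"x4a")))
  .select($"x1", $"z.x2a".as("x2"), $"z.x3a".as("x3"), $"z.x4a".as("x4"))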

How to parse empty columns getting displaced within TSV?

I have a TSV file that I want to parse. There are empty fields in all columns resulting in displacement of the order of columns, so that not all the values I get using a certain column number actually come from that column.
Some fields contain long strings with empty space inside them. Also, some columns contain potential delimiters like ; | :
Input file

columnA   columnB   columnC   columnD
A1        B1        C1        D1
B2        C2        D2
A3        D3
A4        B4        D4

Desired output

columnA   columnB   columnC   columnD
A1        B1        C1        D1
          B2        C2        D2
A3                            D3
A4        B4                  D4
$ file myfile
ASCII English text, with very long lines
$ awk -F'\t' '{print NF}' myfile | sort | uniq -c | tail -n
247871 136
I have found this code posted in reply to a similar question (https://unix.stackexchange.com/questions/29023/how-to-display-tsv-csv-in-console-when-empty-cells-are-missed-by-column-t), but I cannot make this work for my file:
sed ':x s/\(^\|\t\)\t/\1 \t/; t x' < file.tsv | column -t -s $'\t'
(The problem persists after importing into Excel.)
FieldEmpty='           '    # one field width of spaces (11 here, see below)
FieldSize=${#FieldEmpty}
sed "
s/A/&/
t B
s/^ */ ${FieldEmpty}/
t B
: B
s/B/&/
t C
s/^ .\{${FieldSize}\}/&${FieldEmpty}/
t C
: C
s/C/&/
t D
s/^ \(.\{${FieldSize}\}\)\{2\}/&${FieldEmpty}/
t D
: D
s/D/&/
t
s/^ \(.\{${FieldSize}\}\)\{3\}/&${FieldEmpty}/
" YourFile
If more columns are used, an iterative approach should be used instead (same concept of test/insert).
On my AIX/ksh (so it should be the same as GNU sed with --posix -e):
$ cat YourFile
columnA    columnB    columnC    columnD
A1         B1         C1         D1
B2         C2         D2
A3         D3
A4         B4         D4
$ FieldEmpty='           ';FieldSize=${#FieldEmpty};echo $FieldSize
11
$ sed "..." YourFile
columnA    columnB    columnC    columnD
A1         B1         C1         D1
           B2         C2         D2
A3                                D3
A4         B4                     D4
If your file is tab separated, you should use tab as the field separator (in column, awk, etc.). For example:
$ column -t -s $'\t' file
columnA  columnB  columnC  columnD
A1 1     B1 2     C1 3     D1 4
         B2 2     C2 4     D2 4
A3 1                       D3 4
A4 1     B4 2              D4 4
$ xxd file
0000000: 636f 6c75 6d6e 4109 636f 6c75 6d6e 4209 columnA.columnB.
0000010: 636f 6c75 6d6e 4309 636f 6c75 6d6e 440a columnC.columnD.
0000020: 4131 2031 0942 3120 3209 4331 2033 0944 A1 1.B1 2.C1 3.D
0000030: 3120 340a 0942 3220 3209 4332 2034 0944 1 4..B2 2.C2 4.D
0000040: 3220 340a 4133 2031 0909 0944 3320 340a 2 4.A3 1...D3 4.
0000050: 4134 2031 0942 3420 3209 0944 3420 340a A4 1.B4 2..D4 4.
$ awk -F'\t' '{
    for (i=1; i<=NF; i++) {
        printf "%-8s ", $i
    }
    print ""
}' file
columnA  columnB  columnC  columnD
A1 1     B1 2     C1 3     D1 4
         B2 2     C2 4     D2 4
A3 1                       D3 4
A4 1     B4 2              D4 4