Spark: Extract domain from email address in a dataframe - Scala

I am having difficulty extracting email domains. I have the dataframe below:
+---+----------------+
|id |email           |
+---+----------------+
|1  |ii@koko.com     |
|2  |lol@fsa.org     |
|3  |kokojambo@mon.eu|
+---+----------------+
Now I want to add a new column containing the domain, so that I get:
+---+----------------+------+
|id |email           |domain|
+---+----------------+------+
|1  |ii@koko.com     |koko  |
|2  |lol@fsa.org     |fsa   |
|3  |kokojambo@mon.eu|mon   |
+---+----------------+------+
I tried to do something like this:
val test = df_short.withColumn("email", split($"email", "@."))
But I got the wrong output. Can anybody point me in the right direction?

You can simply use the built-in regexp_extract function to get the domain name from the email address.
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

//create an example dataframe
val df = Seq(
  (1, "ii@koko.com"),
  (2, "lol@fsa.org"),
  (3, "kokojambo@mon.eu")
).toDF("id", "email")

//original dataframe
df.show(false)
//output
// +---+----------------+
// |id |email           |
// +---+----------------+
// |1  |ii@koko.com     |
// |2  |lol@fsa.org     |
// |3  |kokojambo@mon.eu|
// +---+----------------+

//using a regex, get the domain name
df.withColumn("domain", regexp_extract($"email", "(?<=@)[^.]+(?=\\.)", 0))
  .show(false)
//output
// +---+----------------+------+
// |id |email           |domain|
// +---+----------------+------+
// |1  |ii@koko.com     |koko  |
// |2  |lol@fsa.org     |fsa   |
// |3  |kokojambo@mon.eu|mon   |
// +---+----------------+------+
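If you prefer a capture group over the lookarounds, an equivalent variation of the same regexp_extract call (a small sketch, not part of the original answer) is:
//capture the text between "@" and the first "."
df.withColumn("domain", regexp_extract($"email", "@([^.]+)\\.", 1)).show(false)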

You can do it like this:
import org.apache.spark.sql.functions._
df.withColumn("domain", split(split(df.col("email"), "@")(1), "\\.")(0)).show
Sample Input:
+---------------+
|          email|
+---------------+
|manoj@gmail.com|
|      abc@ac.in|
+---------------+
Sample Output:
+---------------+------+
|          email|domain|
+---------------+------+
|manoj@gmail.com| gmail|
|      abc@ac.in|    ac|
+---------------+------+
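On Spark 2.4 or later, a similar variation (a sketch, assuming the same df as above) can use element_at, which is 1-based, instead of positional indexing:
import org.apache.spark.sql.functions.{col, element_at, split}
df.withColumn("domain", element_at(split(element_at(split(col("email"), "@"), 2), "\\."), 1)).show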

Related

Spark SQL max function not producing the right value

I'm trying to find the max of a column grouped by spark partition id. I'm getting the wrong value when applying the max function though. Here is the code:
// uuid() is assumed to be a helper that generates a random column name
// (see the generated column names in the output below)
val partitionCol = uuid()
val localRankCol = "test"
df = df.withColumn(partitionCol, spark_partition_id())
val windowSpec = Window.partitionBy(partitionCol).orderBy(sortExprs: _*)
val rankDF = df.withColumn(localRankCol, dense_rank().over(windowSpec))
val rankRangeDF = rankDF.agg(max(localRankCol))
rankRangeDF.show(false)
sortExprs applies an ascending sort on sales.
The result with some dummy data is as follows (partitionCol is the 5th column):
+--------------+------+-----+---------------------------------+--------------------------------+----+
|title         |region|sales|r6bea781150fa46e3a0ed761758a50dea|5683151561af407282380e6cf25f87b5|test|
+--------------+------+-----+---------------------------------+--------------------------------+----+
|Die Hard      |US    |100.0|1                                |0                               |1   |
|Rambo         |US    |100.0|1                                |0                               |1   |
|Die Hard      |AU    |200.0|1                                |0                               |2   |
|House of Cards|EU    |400.0|1                                |0                               |3   |
|Summer Break  |US    |400.0|1                                |0                               |3   |
|Rambo         |EU    |100.0|1                                |1                               |1   |
|Summer Break  |APAC  |200.0|1                                |1                               |2   |
|Rambo         |APAC  |300.0|1                                |1                               |3   |
|House of Cards|US    |500.0|1                                |1                               |4   |
+--------------+------+-----+---------------------------------+--------------------------------+----+
+---------+
|max(test)|
+---------+
|5        |
+---------+
"test" column has a max value of 4 but 5 is being returned.

How to get the number of rows resulting from a join in Spark

Consider these two Dataframes:
+---+
|id |
+---+
|1  |
|2  |
|3  |
+---+
+---+-----+
|idz|word |
+---+-----+
|1  |bat  |
|1  |mouse|
|2  |horse|
+---+-----+
I am doing a Left join on ID=IDZ:
val r = df1.join(df2, df1("id") === df2("idz"), "left_outer")
  .withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= "null", col("word")).otherwise(null))
  .drop("word")
r.show(false)
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1  |1   |mouse             |
|1  |1   |bat               |
|2  |2   |horse             |
|3  |null|null              |
+---+----+------------------+
But what if I only want to keep the rows whose id matches exactly one idz? Otherwise, I would like to have null in ID_EMPLOYE_VENDEUR. The desired output is:
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1  |1   |null              | -- because the join produced two different rows
|2  |2   |horse             |
|3  |null|null              |
+---+----+------------------+
I should mention that I am working on a large DataFrame, so the solution should not be too expensive in time.
Thank you
Since, as you mentioned, your data is too large, groupBy followed by a join is not a good option; you can use a window function instead, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def windowSpec = Window.partitionBy("idz")

val newDF = df2.withColumn("count", count("idz").over(windowSpec)).dropDuplicates("idz")
  .withColumn("word", when(col("count") >= 2, lit(null)).otherwise(col("word"))).drop("count")

val r = df1.join(newDF, df1("id") === newDF("idz"), "left_outer")
  .withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= "null", col("word")).otherwise(null)).drop("word")

r.show
+---+----+------------------+
| id| idz|ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|  1|   1|              null|
|  3|null|              null|
|  2|   2|             horse|
+---+----+------------------+
You can easily retrieve the information that more than one of df2's idz values matched a single df1 id with a groupBy and a join.
r.join(r.groupBy("id").count().as("g"), $"g.id" === r("id"))
  .withColumn("ID_EMPLOYE_VENDEUR", expr("if(count != 1, null, ID_EMPLOYE_VENDEUR)"))
  .drop($"g.id").drop("count")
  .distinct()
  .show()
Note: neither the groupBy nor the join triggers an additional exchange step (network shuffle), because the dataframe r is already partitioned on id (it is the result of a join on id).
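If you want to verify that on your own data, you can print the physical plan and look for Exchange nodes; a quick check using the same r as above:
r.join(r.groupBy("id").count().as("g"), $"g.id" === r("id")).explain()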

Map values of a column with ArrayType based on values from another dataframe in PySpark

What I have:
|ids     |items   |item_id|value|timestamp|
+--------+--------+-------+-----+---------+
|[A,B,C] |1.0     |1      |5    |100      |
|[A,B,D] |1.0     |2      |6    |90       |
|[D]     |0.0     |3      |7    |80       |
|[C]     |0.0     |4      |8    |80       |
+--------+--------+-------+-----+---------+
|ids     |id_num  |
+--------+--------+
|A       |1       |
|B       |2       |
|C       |3       |
|D       |4       |
+--------+--------+
What I want:
|ids     |
+--------+
|[1,2,3] |
|[1,2,4] |
|[3]     |
|[4]     |
+--------+
Is there a way to do this without an explode? Thank you for your help!
You can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# Suppose this is the dictionary you want to map
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map(array_col):
    # If you prefer a list comprehension: return [map_dict[k] for k in array_col]
    return list(map(map_dict.get, array_col))

array_map_udf = udf(array_map, ArrayType(IntegerType()))

df = df.withColumn("mapped_array", array_map_udf(col("ids")))
I can't think of a different method, but to get a parallelized dictionary, you can just use the toJSON method. It will require further processing depending on the kind of reference df you have:
import json
df_json = df.toJSON().map(lambda x: json.loads(x))

How to reverse the order of rows in a DataFrame in Apache Spark

How can I reverse this DataFrame using Scala?
I saw the sort functions, but they require a specific column; I only want to reverse the row order.
+---+--------+-----+
|id |name    |note |
+---+--------+-----+
|1  |james   |any  |
|3  |marry   |some |
|2  |john    |some |
|5  |tom     |any  |
+---+--------+-----+
to:
+---+--------+-----+
|id |name    |note |
+---+--------+-----+
|5  |tom     |any  |
|2  |john    |some |
|3  |marry   |some |
|1  |james   |any  |
+---+--------+-----+
You can add a column with an increasing id using monotonically_increasing_id() and sort by it in descending order:
val dff = Seq(
  (1, "james", "any"),
  (3, "marry", "some"),
  (2, "john", "some"),
  (5, "tom", "any")
).toDF("id", "name", "note")

dff.withColumn("index", monotonically_increasing_id())
  .sort($"index".desc)
  .drop($"index")
  .show(false)
Output:
+---+-----+----+
|id |name |note|
+---+-----+----+
|5  |tom  |any |
|2  |john |some|
|3  |marry|some|
|1  |james|any |
+---+-----+----+
You could do something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val reverseDf = df.withColumn("row_num", row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1))))
  .orderBy($"row_num".desc)
  .drop("row_num")
Or refer to this instead of row_number.

Filtering out rows of a table based on a column

I am trying to filter out table rows based on a column value.
I have a dataframe:
+---+-----+
|id |value|
+---+-----+
|3  |0    |
|3  |1    |
|3  |0    |
|4  |1    |
|4  |0    |
|4  |0    |
+---+-----+
I want to create a new dataframe by deleting all rows with value != 0:
+---+-----+
|id |value|
+---+-----+
|3  |0    |
|3  |0    |
|4  |0    |
|4  |0    |
+---+-----+
I figured the syntax should be something like this, but I couldn't get it right:
val newDataFrame = OldDataFrame.filter($"value"==0)
The correct way is as follows. You just forgot to add one more = sign:
val newDataFrame = OldDataFrame.filter($"value" === 0)
There are various ways in which you can do the filtering:
val newDataFrame = OldDataFrame.filter($"value"===0)
val newDataFrame = OldDataFrame.filter(OldDataFrame("value") === 0)
val newDataFrame = OldDataFrame.filter("value === 0")
You can also use the where function instead of filter.
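For example, a minimal equivalent using where (it has the same semantics as filter):
val newDataFrame = OldDataFrame.where($"value" === 0)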