Mapping a value into a specific column based on another column - scala

I have the following problem:
A DataFrame containing a column col1 with the strings A, B, or C, a second column col2 with an Integer, and three further columns col3, col4 and col5 (these columns are also named A, B, and C).
Thus,
col1 - col2 - A (col3) - B (col4) - C (col5)
--------------------------------------------
A      6
B      5
C      6
should become
col1 - col2 - A (col3) - B (col4) - C (col5)
--------------------------------------------
A      6      6
B      5                 5
C      6                            6
Now I would like to go through each row and assign the integer in col2 to the column A, B or C, based on the entry in col1.
How do I achieve this?
df.withColumn() I cannot use (or at least I do not see how), and the same holds for val df2 = df.map(x => x).
Looking forward to your help and thanks in advance!
Best, Ken

Create a mapping between key and target column:
val mapping = Seq(("A", "col3"), ("B", "col4"), ("C", "col5"))
Use it to generate a sequence of columns:
import org.apache.spark.sql.functions.when
val exprs = mapping.map { case (key, target) =>
  when($"col1" === key, $"col2").alias(target)
}
Prepend star and select:
val df = Seq(("A", 6), ("B", 5), ("C", 6)).toDF("col1", "col2")
df.select($"*" +: exprs: _*)
The result is:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A| 6| 6|null|null|
| B| 5|null| 5|null|
| C| 6|null|null| 6|
+----+----+----+----+----+
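The same result can also be reached with withColumn by folding the mapping over the DataFrame, which may be what the question was aiming for; a minimal sketch reusing the mapping and df defined above:
val df2 = mapping.foldLeft(df) { case (acc, (key, target)) =>
  acc.withColumn(target, when($"col1" === key, $"col2"))
}
df2.show()
Each step adds one of the target columns, filled only on the rows whose col1 matches the corresponding key.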


adding sequence number to a dataframe

I have added a new column seq_col containing a unique sequence using
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}
val df2 = dfFromRDD1.withColumn("monotonically_increasing_id", monotonically_increasing_id())
val window = Window.orderBy(col("monotonically_increasing_id"))
val df3_consecutiveval = df2.withColumn("seq_col", row_number().over(window)).drop(col("monotonically_increasing_id"))
df3_consecutiveval.show()
dataframe:
col1 col2 seq_col
a aa 1
b ff 2
c rr 3
d yy 4
e tt 5
Now I want the values in that new column to be generated from a specified start and increment, as in the example below
Ex:
Start = 100
increment = 3
dataframe:
col1 col2 seq_col
a aa 100
b ff 103
c rr 106
d yy 109
e tt 112
You can define a udf that calculates the id with the given logic, for instance in this case:
import org.apache.spark.sql.functions.{row_number, udf}
val step = 3 // increment by 3
val startOffset = 100 // you want it to start at 100
val calculateId = udf((rowNum: Int) => startOffset + (rowNum - 1) * step) // row_number starts at 1
df.withColumn("seq_col", calculateId(row_number().over(window)))
This worked for me using some random dataframe.
The above answer is technically correct, but you should avoid using udfs whenever possible for performance reasons. This case is so simple that basic arithmetic will do the trick (note that monotonically_increasing_id() only yields the consecutive values 0, 1, 2, ... when the DataFrame sits in a single partition):
scala> val df = Seq(("a", "aa"), ("b", "ff"), ("c", "rr"), ("d", "yy"), ("e", "tt")).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> val start = 100
start: Int = 100
scala> val increment = 3
increment: Int = 3
scala> df.withColumn("seq_col", monotonically_increasing_id() * increment + start).show
+----+----+-------+
|col1|col2|seq_col|
+----+----+-------+
| a| aa| 100|
| b| ff| 103|
| c| rr| 106|
| d| yy| 109|
| e| tt| 112|
+----+----+-------+
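Since monotonically_increasing_id() is only consecutive within a partition, the arithmetic above can leave gaps on multi-partition data. Combining the question's row_number() window with the same arithmetic keeps the sequence consecutive without a udf; a minimal sketch reusing start and increment, with a temporary helper column mono_id (a name chosen here for illustration):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

val window = Window.orderBy(col("mono_id"))
val result = df
  .withColumn("mono_id", monotonically_increasing_id())  // roughly preserves the current row order
  .withColumn("seq_col", (row_number().over(window) - 1) * increment + start)
  .drop("mono_id")
result.show()
Note that a window without partitionBy pulls all rows into a single partition for the ranking, which is fine for small data but costly at scale.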

Scala filter out rows where any column2 matches column1

Hi Stackoverflow,
I want to remove all rows in a dataframe where column A matches any of the distinct values in column B. I would expect the code block below to do exactly that, but it seems to also remove rows where column B is null, which is odd since the filter should only consider column A. How can I fix this code so that it removes all rows where column A matches any of the distinct values in column B?
import spark.implicits._
val df = Seq(
  (scala.math.BigDecimal(1), null),
  (scala.math.BigDecimal(2), scala.math.BigDecimal(1)),
  (scala.math.BigDecimal(3), scala.math.BigDecimal(4)),
  (scala.math.BigDecimal(4), null),
  (scala.math.BigDecimal(5), null),
  (scala.math.BigDecimal(6), null)
).toDF("A", "B")

// correct, has 1, 4
val to_remove = df
  .filter(df.col("B").isNotNull)
  .select(df("B"))
  .distinct()
// incorrect, returns 2, 3 instead of 2, 3, 5, 6
val result = df.filter(!df.col("A").isin(to_remove.col("B")))
// 4 != 2
assert(4 === result.collect().length)
The isin function accepts a list of values; in your code, however, you're passing a Dataset[Row]. As per the documentation https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.Column#isin%28scala.collection.Seq%29
it's declared as
def isin(list: Any*): Column
You first need to extract the values into a sequence and then use that in the isin function. Please note that this may have performance implications, since the distinct values of B are collected to the driver.
scala> val to_remove = df.filter(df.col("B").isNotNull).select(df("B")).distinct().collect.map(_.getDecimal(0))
to_remove: Array[java.math.BigDecimal] = Array(1.000000000000000000, 4.000000000000000000)
scala> val finaldf = df.filter(!df.col("A").isin(to_remove:_*))
finaldf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [A: decimal(38,18), B: decimal(38,18)]
scala> finaldf.show
+--------------------+--------------------+
| A| B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000| null|
|6.000000000000000000| null|
+--------------------+--------------------+
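If the set of distinct values in B is too large to collect to the driver, a left anti join achieves the same filtering entirely on the cluster; a minimal sketch on the same df (the B_val alias is only there to avoid ambiguous column references):
// keep rows of df whose A value never appears among B's non-null values
val bVals = df.select($"B".as("B_val")).where($"B_val".isNotNull).distinct()
val finaldf = df.join(bVals, df("A") === bVals("B_val"), "left_anti")
finaldf.show()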
Change filter condition !df.col("A").isin(to_remove.col("B")) to !df.col("A").isin(to_remove.collect.map(_.getDecimal(0)):_*)
Check below code.
val finaldf = df
.filter(!df
.col("A")
.isin(to_remove.map(_.getDecimal(0)).collect:_*)
)
scala> finaldf.show
+--------------------+--------------------+
| A| B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000| null|
|6.000000000000000000| null|
+--------------------+--------------------+

Get the number of null per row in PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1 col2 col3
null 1 a
1 2 b
2 3 null
Should in the end be:
col1 col2 col3 number_of_null
null 1 a 1
1 2 b 0
2 3 null 1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1 col2 col3 number_of_ABC
ABC 1 a 1
1 2 b 0
2 ABC ABC 2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want to have a pandas solution.
EDIT3: The solution explained with sums or means does not work, as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

val df = List(
  ("ABC", "1", "a"),
  ("1", "2", "b"),
  ("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")
val expected = "ABC"
val complexColumn: Column = df.schema.fieldNames.map(c => when(col(c) === lit(expected), 1).otherwise(0)).reduce((a, b) => a + b)
df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
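The original question asks for the number of nulls per row; the same map-and-reduce pattern covers that by swapping the predicate to isNull. A minimal sketch, applicable to any DataFrame df:
import org.apache.spark.sql.functions.{col, when}

// 1 for each null cell in the row, summed across all columns
val nullCount = df.schema.fieldNames
  .map(c => when(col(c).isNull, 1).otherwise(0))
  .reduce(_ + _)
df.withColumn("number_of_null", nullCount).show(false)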
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
import pyspark.sql.functions as func

dfvals = [
    (None, "1", "a"),
    ("1", "2", "b"),
    ("2", None, None)
]
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])
new_df = df.withColumn('null_cnt',
                       reduce(lambda x, y: x + y,
                              map(lambda x: func.when(func.isnull(func.col(x)), 1).otherwise(0),
                                  df.schema.names)))
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+

How to find intersection of dataframes based on multiple columns?

I have two dataframes as below. I'm trying to find the intersection of two dataframes based on either of the two columns, not only both of them.
So in this case, I want to return dataframe C, which has row 1 of A (its Col1 value 1 appears in Col1 of B), row 2 of A (its Col2 value 3 appears in Col2 of B), row 4 of A (its Col1 value 5 appears in Col1 of B), and row 5 of A. But if I do an intersect of A and B, it will only return row 5 of A, as that is the only row that matches on both columns. How do I do this? Many thanks. Let me know if I'm not explaining the question very well.
A:
Col1 Col2
1 2
2 3
3 7
5 4
1 3
B:
Col1 Col2
1 3
5 1
C:
1 2
2 3
5 4
1 3
With the following data:
val df1 = sc.parallelize(Seq(1->2, 2->3, 3->7, 5->4, 1->3)).toDF("col1", "col2")
val df2 = sc.parallelize(Seq(1->3, 5->1)).toDF("col1", "col2")
Then you can join your datasets with an or condition:
val cols = df1.columns
df1.join(df2, cols.map(c => df1(c) === df2(c)).reduce(_ || _))
  .select(cols.map(df1(_)): _*)
  .distinct
  .show
+----+----+
|col1|col2|
+----+----+
| 2| 3|
| 1| 2|
| 1| 3|
| 5| 4|
+----+----+
The join condition is generic and would work for any number of columns. The code maps each column to an equality between that column in df1 and the same one in df2 (cols.map(c => df1(c) === df2(c))). Then the reduce takes the logical or of all these equalities, which is what you want.
The select is there because otherwise the columns of both dataframes would be kept; here I simply keep the ones from df1. I also added a distinct in case several lines of df2 match a line of df1 or vice versa; indeed, the or-join can match many rows against many rows and produce duplicates.
Note that this method does not need any collection to the driver, so it will work regardless of the size of the datasets. Yet, if df2 is small enough to be collected to the driver and broadcast, you would get faster results with a method like this:
// to each column name, we map the set of values in df2.
val valueMap = df2.rdd
  .flatMap(row => cols.map(name => name -> row.getAs[Any](name)))
  .distinct
  .groupByKey
  .mapValues(_.toSet)
  .collectAsMap

// we create a udf that looks up in valueMap
val filter = udf((name: String, value: Any) => valueMap(name).contains(value))

// Finally we apply the filter.
df1.where(cols.map(c => filter(lit(c), df1(c))).reduce(_ || _))
  .show
With this method, no shuffling of df1 and no cartesian product. If df2 is small, this is definitely the way to go.
You should perform two join operations individually on each of the join columns, and then perform a union of the two resulting Dataframes:
val dfA = List((1,2),(2,3),(3,7),(5,4),(1,3)).toDF("Col1", "Col2")
val dfB = List((1,3),(5,1)).toDF("Col1", "Col2")
val res1 = dfA.join(dfB, dfA.col("Col1")===dfB.col("Col1"))
val res2 = dfA.join(dfB, dfA.col("Col2")===dfB.col("Col2"))
val res = res1.union(res2)
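Note that res still carries dfB's columns and can contain duplicate rows when both columns match; to reproduce C from the question, select dfA's columns and de-duplicate. A minimal sketch:
val resC = res1.select(dfA.columns.map(dfA(_)): _*)
  .union(res2.select(dfA.columns.map(dfA(_)): _*))
  .distinct()
resC.show()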

How to avoid duplicate columns after join?

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access these using the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As for now aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string:
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
I had been stuck with this for a while, and only recently came up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the columns from dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do fullouter join then the result looks like this
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
try this,
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal behavior in SQL. What I do for this is:
Drop or rename the source columns
Do the join
Drop the renamed column, if any
Here I am replacing the "fullname" column:
Some code in Java:
this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
    .drop("fullname")
    .registerTempTable("data_original");

this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
    .registerTempTable("data_v2");

this
    .sqlContext
    .sql(etlQuery)
    .repartition(1)
    .write()
    .mode(SaveMode.Overwrite)
    .parquet(outputPath);
Where the query is:
SELECT
    d.*,
    concat_ws('_', product_name, product_module, name) AS fullname
FROM
    {table_source} d
LEFT OUTER JOIN
    {table_updates} u ON u.id = d.id
This (dropping a column from the DataFrame before the join) is something I believe you can only do with Spark, and it is very helpful!
Inner join is the default join in Spark; below is the simple syntax for it.
leftDF.join(rightDF, "commonColumnName")
For other joins you can follow the syntax below
leftDF.join(rightDF, Seq("commonCol1", "commonCol2"), "jointype")
If the column names are not common, then
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "jointype")
Best practice is to make the column names different in both DataFrames before joining them, and to drop accordingly.
df1.columns = [id, age, income]
df2.columns = [id, age_group]
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will return an error for duplicate columns.
Try this instead:
df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')
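A Scala sketch of the same rename-then-drop idea, assuming two hypothetical frames df1 and df2 that share an id column:
val df2Renamed = df2.withColumnRenamed("id", "id_2")
val joined = df1
  .join(df2Renamed, df1("id") === df2Renamed("id_2"), "inner")
  .drop("id_2")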
If anyone is using Spark SQL and wants to achieve the same thing, you can use the USING clause in the join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []

    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break

    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.
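For completeness, a rough Scala equivalent of the same rename-on-collision idea (a sketch written for this compilation, not taken from the answer above):
import org.apache.spark.sql.DataFrame

def deDupeDfCols(df: DataFrame, separator: String = ""): DataFrame = {
  // walk the columns left to right, appending a numeric suffix to any name already seen
  val newCols = df.columns.foldLeft(Vector.empty[String]) { (seen, c) =>
    if (!seen.contains(c)) seen :+ c
    else seen :+ Iterator.from(2).map(i => s"$c$separator$i").find(n => !seen.contains(n)).get
  }
  df.toDF(newCols: _*)
}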