Get distinct values of specific column with max of different columns - scala

I have the following DataFrame
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
| B|null|null| 4|
| B|null| 2|null|
| B|null| 1|null|
| A| 4|null|null|
+----+----+----+----+
What I would like to do in Spark is return the rows (or just their col1 entries) that hold the maximum value of one of the columns col2, col3 or col4.
This snippet won't do what I want:
df.groupBy("col1").max("col2","col3","col4").show()
And this one gives the max for only one column (1):
df.groupBy("col1").max("col2").show()
I even tried to merge the single outputs by this:
//merge rows
val rows = test1.rdd.zip(test2.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
//merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create new df
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)
where test1 and test2 are DataFrames created with queries like (1).
So how do I achieve this nicely?
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
+----+----+----+----+
Or even only the distinct values:
+----+
|col1|
+----+
| A|
| B|
| C|
+----+
Thanks in advance! Best

You can use something like the following (assuming the DataFrame is registered as a table named table_name):
sqlContext.sql("""select x.* from table_name x,
(select max(col2) as a, max(col3) as b, max(col4) as c from table_name) temp
where a = x.col2 or b = x.col3 or c = x.col4""")
This will give the desired result.

It can be solved like this (createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0):
df.createOrReplaceTempView("temp")
spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").createOrReplaceTempView("max_temp")
spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show
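The logic both answers rely on (compute one global maximum per column, then keep the rows that hit any of them) can be checked outside Spark. The following is a minimal plain-Python sketch, not Spark code: the rows list mirrors the question's DataFrame with None standing in for null, and the helper name rows_hitting_a_max is made up for illustration.

```python
# Plain-Python illustration of the filter logic (not Spark code).
# None stands in for SQL null; the rows mirror the question's DataFrame.
rows = [
    ("A", 6, None, None),
    ("B", None, 5, None),
    ("C", None, None, 7),
    ("B", None, None, 4),
    ("B", None, 2, None),
    ("B", None, 1, None),
    ("A", 4, None, None),
]


def rows_hitting_a_max(rows):
    """Keep rows whose col2, col3 or col4 equals that column's global maximum."""
    # Like SQL max(), ignore nulls when computing each column's maximum.
    maxima = []
    for idx in (1, 2, 3):
        values = [r[idx] for r in rows if r[idx] is not None]
        maxima.append(max(values) if values else None)
    # Like SQL "=", a null never matches, so None values are skipped.
    return [
        r for r in rows
        if any(r[i + 1] is not None and r[i + 1] == m for i, m in enumerate(maxima))
    ]


matching = rows_hitting_a_max(rows)
distinct_col1 = sorted({r[0] for r in matching})
print(matching)
print(distinct_col1)
```

Note that this keeps every row that ties for a maximum, so the distinct step is what guarantees one entry per col1 value.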

Apache Spark calculating column value on the basis of distinct value of columns

I am processing the following tables and I would like to compute a new column (outcome) based on the distinct value of 2 other columns.
| id1 | id2 | outcome
| 1 | 1 | 1
| 1 | 1 | 1
| 1 | 3 | 2
| 2 | 5 | 1
| 3 | 1 | 1
| 3 | 2 | 2
| 3 | 3 | 3
The outcome should begin in incremental order starting from 1 based on the combined value of id1 and id2. Any hints on how this can be accomplished in Scala? row_number doesn't seem to be useful in this case.
The logic here is that for each unique value of id1 we will start numbering the outcome with min(id2) for corresponding id1 being assigned a value of 1.
You could try dense_rank().
With your example:
val df = sqlContext
.read
.option("sep","|")
.option("header", true)
.option("inferSchema",true)
.csv("/home/cloudera/files/tests/ids.csv") // Here we read the .csv files
.cache()
df.show()
df.printSchema()
df.createOrReplaceTempView("table")
sqlContext.sql(
"""
|SELECT id1, id2, DENSE_RANK() OVER(PARTITION BY id1 ORDER BY id2) AS outcome
|FROM table
|""".stripMargin).show()
output
+---+---+-------+
|id1|id2|outcome|
+---+---+-------+
| 2| 5| 1|
| 1| 1| 1|
| 1| 1| 1|
| 1| 3| 2|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
+---+---+-------+
Use a Window function to partition the rows by the first id and order each partition by the second id.
Then you just need to assign a dense rank (dense_rank) within each window partition.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
df
.withColumn("outcome", dense_rank().over(Window.partitionBy("id1").orderBy("id2")))
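To see why dense_rank rather than row_number gives the wanted numbering, here is a minimal plain-Python sketch of what the window computes. It is an illustration only, not Spark code; the function name dense_rank_by_partition is hypothetical, and the input pairs are the (id1, id2) values from the question.

```python
from itertools import groupby


def dense_rank_by_partition(pairs):
    """Return (id1, id2, outcome) where outcome is the dense rank of id2 within id1."""
    result = []
    ordered = sorted(pairs)  # sort by id1, then by id2
    for id1, group in groupby(ordered, key=lambda p: p[0]):
        rank = 0
        previous = object()  # sentinel that is equal to no id2 value
        for _, id2 in group:
            # Ties share a rank; the rank grows by one per distinct id2.
            if id2 != previous:
                rank += 1
                previous = id2
            result.append((id1, id2, rank))
    return result


pairs = [(1, 1), (1, 1), (1, 3), (2, 5), (3, 1), (3, 2), (3, 3)]
ranked = dense_rank_by_partition(pairs)
for row in ranked:
    print(row)
```

row_number would instead number the two (1, 1) rows 1 and 2, which is why it does not fit here.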

End-dating records using window functions in Spark SQL

I have a dataframe like below
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2999-12-31|
| b| 3|2011-12-14|2999-12-31|
| a| 4|2013-12-17|2999-12-31|
| b| 8|2011-12-19|2999-12-31|
| a| 6|2013-12-23|2999-12-31|
+----+----+----------+----------+
I need to group the records by colA, rank them within each group by colC (the most recent date gets the highest rank), and then update colD to one day before the colC value of the next-ranked record. The last record in each group keeps its colD.
The final dataframe should look like below
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
+----+----+----------+----------+
You can get it using window functions:
scala> val df = Seq(("a",2,"2013-12-12","2999-12-31"),("b",3,"2011-12-14","2999-12-31"),("a",4,"2013-12-17","2999-12-31"),("b",8,"2011-12-19","2999-12-31"),("a",6,"2013-12-23","2999-12-31")).toDF("colA","colB","colC","colD")
df: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> val df2 = df.withColumn("colc",'colc.cast("date")).withColumn("cold",'cold.cast("date"))
df2: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> df2.createOrReplaceTempView("yash")
scala> spark.sql(""" select cola,colb,colc,cold, rank() over(partition by cola order by colc) c1, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold2 from yash """).show
+----+----+----------+----------+---+----------+
|cola|colb| colc| cold| c1| cold2|
+----+----+----------+----------+---+----------+
| b| 3|2011-12-14|2999-12-31| 1|2011-12-18|
| b| 8|2011-12-19|2999-12-31| 2|2999-12-31|
| a| 2|2013-12-12|2999-12-31| 1|2013-12-16|
| a| 4|2013-12-17|2999-12-31| 2|2013-12-22|
| a| 6|2013-12-23|2999-12-31| 3|2999-12-31|
+----+----+----------+----------+---+----------+
Removing the unnecessary columns
scala> spark.sql(""" select cola,colb,colc, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold from yash """).show
+----+----+----------+----------+
|cola|colb| colc| cold|
+----+----+----------+----------+
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
+----+----+----------+----------+
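You can compute row_number over a window partitioned by colA and ordered by colC, then self-join the dataframe so that each row is matched with the next-ranked row of the same group. The code should look like this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// outside spark-shell, also: import spark.implicits._ (needed for the $"..." syntax)
// rank the rows of each colA group by colC; rnkminusone lets a row find its successor
val rnkDF = df.withColumn("rnk", row_number().over(Window.partitionBy("colA").orderBy($"colC".asc)))
.withColumn("rnkminusone", $"rnk" - lit(1))
// left join: the last row of each group has no successor and keeps its original colD
val joinDF = rnkDF.alias('A).join(rnkDF.alias('B), ($"A.colA" === $"B.colA").and($"A.rnk" === $"B.rnkminusone"),"left")
.select($"A.colA".as("colA")
, $"A.colB".as("colB")
, $"A.colC".as("colC")
, when($"B.colC".isNull, $"A.colD").otherwise(date_sub($"B.colC", 1)).as("colD"))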
The results are below. I hope this helps.
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
+----+----+----------+----------+
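Both answers boil down to the same rule: within each colA group ordered by colC, a record ends one day before the next record starts, and the last record keeps its original end date. The following is a minimal plain-Python sketch of that rule, not Spark code; the function name end_date is hypothetical and the records are the rows from the question.

```python
from collections import defaultdict
from datetime import date, timedelta


def end_date(records):
    """records: (colA, colB, colC, colD) tuples with date objects in colC and colD."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[rec[0]].append(rec)

    result = []
    for key in sorted(by_key):
        group = sorted(by_key[key], key=lambda r: r[2])  # order by colC
        # Pair each record with its successor (the "lead"); the last one gets None.
        for current, following in zip(group, group[1:] + [None]):
            col_a, col_b, col_c, col_d = current
            if following is not None:
                col_d = following[2] - timedelta(days=1)
            result.append((col_a, col_b, col_c, col_d))
    return result


records = [
    ("a", 2, date(2013, 12, 12), date(2999, 12, 31)),
    ("b", 3, date(2011, 12, 14), date(2999, 12, 31)),
    ("a", 4, date(2013, 12, 17), date(2999, 12, 31)),
    ("b", 8, date(2011, 12, 19), date(2999, 12, 31)),
    ("a", 6, date(2013, 12, 23), date(2999, 12, 31)),
]
end_dated = end_date(records)
for rec in end_dated:
    print(rec[0], rec[1], rec[2].isoformat(), rec[3].isoformat())
```

The zip with a trailing None plays the role of lead(colC) combined with coalesce in the SQL version.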

pyspark: counting number of occurrences of each distinct values

I think the question is related to: Spark DataFrame: count distinct values of every column
So basically I have a Spark dataframe whose column A has the values 1,1,2,2,1.
I want to count how many times each distinct value (in this case, 1 and 2) appears in column A, and print something like
distinct_values | number_of_appearances
1 | 3
2 | 2
I am posting this because I think the other answer with the alias could be confusing. What you need are the groupBy and count methods:
from pyspark.sql.types import IntegerType
l = [1, 1, 2, 2, 1]
# a list of plain integers yields a single column named "value"
df = spark.createDataFrame(l, IntegerType())
df.groupBy('value').count().show()
+-----+-----+
|value|count|
+-----+-----+
| 1| 3|
| 2| 2|
+-----+-----+
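As a sanity check on small data, the same count can be reproduced without Spark. This is a minimal sketch using only the standard library; the values list is the column content from the question, and the printed header is just the layout the question asked for.

```python
from collections import Counter

values = [1, 1, 2, 2, 1]

# Counter maps each distinct value to its number of occurrences,
# which is what groupBy('value').count() computes per group.
counts = Counter(values)

print("distinct_values | number_of_appearances")
for value, count in sorted(counts.items()):
    print(f"{value} | {count}")
```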
I am not sure whether this is the solution you are looking for, but here are my thoughts. Suppose you have a dataframe like this.
>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])
>>> df.show()
+---+----+-------+
| id|name|country|
+---+----+-------+
| 1| AAA| USA|
| 2| XXX| CHN|
| 3| KKK| USA|
| 4| PPP| USA|
| 5| EEE| USA|
| 5| HHH| THA|
+---+----+-------+
I want to know which distinct country codes appear in this dataframe and how often, with the result columns printed under alias names.
import pyspark.sql.functions as func
df.groupBy('country').count().select(func.col("country").alias("distinct_country"),func.col("count").alias("country_count")).show()
+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
| THA| 1|
| USA| 4|
| CHN| 1|
+----------------+-------------+
Were you looking for something similar to this?

Pyspark : select specific column with its position

I would like to know how to select a specific column in a dataframe by its position rather than by its name.
Like this in Pandas:
df = df.iloc[:,2]
Is it possible?
You can always get the name of the column with df.columns[n] and then select it:
df = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
To select column at position n:
n = 1
df.select(df.columns[n]).show()
+---+
| b|
+---+
| 2|
| 4|
+---+
To select all but column n:
n = 1
You can either use drop:
df.drop(df.columns[n]).show()
+---+
| a|
+---+
| 1|
| 3|
+---+
Or select with manually constructed column names:
df.select(df.columns[:n] + df.columns[n+1:]).show()
+---+
| a|
+---+
| 1|
| 3|
+---+
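Since df.columns is an ordinary Python list of names, the position handling can be tried on a plain list first. The sketch below is an illustration under that assumption; the helper names column_at and all_but and the three-column list are made up. It also shows one pitfall of the slicing variant: a negative n does not fail but silently produces a wrong list.

```python
def column_at(columns, n):
    """Return the column name at position n, rejecting negative or too-large n."""
    if not 0 <= n < len(columns):
        raise IndexError(f"column position {n} out of range")
    return columns[n]


def all_but(columns, n):
    """Return every column name except the one at position n."""
    if not 0 <= n < len(columns):
        raise IndexError(f"column position {n} out of range")
    return columns[:n] + columns[n + 1:]


columns = ["a", "b", "c"]  # stand-in for df.columns
print(column_at(columns, 1))
print(all_but(columns, 1))

# Without the range check, n = -1 duplicates names instead of dropping one:
n = -1
unchecked = columns[:n] + columns[n + 1:]
print(unchecked)
```

The resulting lists can be passed to df.select in the same way as in the answer above.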
Same solution as mirkhosro:
For a dataframe df, you can select the column n using df[n], where n is the index of the column.
Example:
df = df.filter(df[3]!=0)
will remove the rows of df where the value in the fourth column is 0.
Note that you can check the columns using df.printSchema().

Aggregate rows of Spark DataFrame to String after groupby

I'm quite new to both Spark and Scala and could really use a hint to solve my problem. I have two DataFrames, A (columns id and name) and B (columns id and text), and would like to join them, group by id and combine all rows of text into a single String:
A
+--------+--------+
| id| name|
+--------+--------+
| 0| A|
| 1| B|
+--------+--------+
B
+--------+--------+
| id| text|
+--------+--------+
| 0| one|
| 0| two|
| 1| three|
| 1| four|
+--------+--------+
desired result:
+--------+--------+----------+
| id| name| texts|
+--------+--------+----------+
| 0| A| one two|
| 1| B|three four|
+--------+--------+----------+
So far I'm trying the following:
var C = A.join(B, "id")
var D = C.groupBy("id", "name").agg(collect_list("text") as "texts")
This works quite well besides that my texts column is an Array of Strings instead of a String. I would appreciate some help very much.
I am just adding a few functions to your code to get the desired result:
A.join(B, Seq("id"), "left").orderBy("id").groupBy("id", "name").agg(concat_ws(" ", collect_list("text")) as "texts")
It's quite simple:
val bCollected = b.groupBy('id).agg(collect_list('text).as("texts"))
val ab = a.join(bCollected, a("id") === bCollected("id"), "left")
The first DataFrame, bCollected, is an intermediate result: the b DataFrame with the texts collected for every id. You then join it with a. bCollected should be smaller than b itself, so the join will probably shuffle less data.
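The collect-then-join idea can be sketched in plain Python to make the expected result concrete. This is an illustration only, not Spark code: the function name join_texts is hypothetical, the rows are the A and B tables from the question, and returning an empty string for an id without texts is an assumption of this sketch.

```python
a_rows = [(0, "A"), (1, "B")]
b_rows = [(0, "one"), (0, "two"), (1, "three"), (1, "four")]


def join_texts(a_rows, b_rows, sep=" "):
    """Collect the texts of B per id, then attach them to A as one joined string."""
    # Step 1: group B by id (the role of groupBy + collect_list).
    texts = {}
    for id_, text in b_rows:
        texts.setdefault(id_, []).append(text)
    # Step 2: left join onto A and join the list with a separator
    # (the role of concat_ws). Ids without texts get "" here by assumption.
    return [(id_, name, sep.join(texts.get(id_, []))) for id_, name in a_rows]


combined = join_texts(a_rows, b_rows)
for row in combined:
    print(row)
```

The plain-Python version keeps the texts in input order. In Spark the order of elements produced by collect_list depends on row order, which may change after a shuffle, so the word order in the joined string should not be relied on without an explicit sort.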