How to transform row-information into columns?

How to transform row-information into columns? - scala

I have the following DataFrame in Spark 2.2 and Scala 2.11.8:
+--------+---------+-------+-------+----+-------+
|event_id|person_id|channel| group|num1| num2|
+--------+---------+-------+-------+----+-------+
| 560| 9410| web| G1| 0| 5|
| 290| 1430| web| G1| 0| 3|
| 470| 1370| web| G2| 0| 18|
| 290| 1430| web| G2| 0| 5|
| 290| 1430| mob| G2| 1| 2|
+--------+---------+-------+-------+----+-------+
Here is the equivalent DataFrame in Scala:
df = sqlCtx.createDataFrame(
[(560,9410,"web","G1",0,5),
(290,1430,"web","G1",0,3),
(470,1370,"web","G2",0,18),
(290,1430,"web","G2",0,5),
(290,1430,"mob","G2",1,2)],
["event_id","person_id","channel","group","num1","num2"]
)
The column group can only have two values: G1 and G2. I need to transform these values of the column group into new columns as follows:
+--------+---------+-------+--------+-------+--------+-------+
|event_id|person_id|channel| num1_G1|num2_G1| num1_G2|num2_G2|
+--------+---------+-------+--------+-------+--------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 0|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| web| 0| 0| 0| 5|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+--------+-------+--------+-------+
How can I do it?

AFAIK (at least i couldn't find a way to perform PIVOT without aggregation) we must use aggregation function when doing pivoting in Spark
Scala version:
scala> df.groupBy("event_id","person_id","channel")
.pivot("group")
.agg(max("num1") as "num1", max("num2") as "num2")
.na.fill(0)
.show
+--------+---------+-------+-------+-------+-------+-------+
|event_id|person_id|channel|G1_num1|G1_num2|G2_num1|G2_num2|
+--------+---------+-------+-------+-------+-------+-------+
| 560| 9410| web| 0| 5| 0| 0|
| 290| 1430| web| 0| 3| 0| 5|
| 470| 1370| web| 0| 0| 0| 18|
| 290| 1430| mob| 0| 0| 1| 2|
+--------+---------+-------+-------+-------+-------+-------+

Related

Is there any generic function to finding the column names in pyspark

how to print column names in generic way. I want col1,col2,… instead of _1,_2,…
+---+---+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|_11|_12|
+---+---+---+---+---+---+---+---+---+---+---+---+
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |

assuming df is your dataframe, you can juste rename :
for col in df.columns:
df = df.withColumnRenamed(col, col.replace("_", "col"))

Is there any generic functions to assign column names in pyspark?

is there any generic functions to assign column names in pyspark ?instead of _1,_2,_3....... it has to give col_1,col_2,col_3
+---+---+---+---+---+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|_10|_11|_12|
+---+---+---+---+---+---+---+---+---+---+---+---+
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 0| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
| 0| 0| 0| 0| 0| 1| 1| 0| 1| 1| 1| |
+---+---+---+---+---+---+---+---+---+---+---+---+
only showing top 20 rows

Try this-
df.toDF(*["col_{}".format(i) for i in range(1,len(df.columns)+1)])

Filtering on multiple columns in Spark dataframes

Suppose I have a dataframe in Spark as shown below -
val df = Seq(
(0,0,0,0.0),
(1,0,0,0.1),
(0,1,0,0.11),
(0,0,1,0.12),
(1,1,0,0.24),
(1,0,1,0.27),
(0,1,1,0.30),
(1,1,1,0.40)
).toDF("A","B","C","rate")
Here is how it looks like -
scala> df.show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 0| 0| 0.0|
| 1| 0| 0| 0.1|
| 0| 1| 0|0.11|
| 0| 0| 1|0.12|
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
| 1| 1| 1| 0.4|
+---+---+---+----+
A,B and C are the advertising channels in this case. 0 and 1 represent absence and presence of channels respectively. 2^3 shows 8 combinations in the data-frame.
I want to filter records from this data-frame that shows presence of 2 channels at a time( AB, AC, BC) . Here is how I want my output to be -
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+
I can write 3 statements to get the output by doing -
scala> df.filter($"A" === 1 && $"B" === 1 && $"C" === 0).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
+---+---+---+----+
scala> df.filter($"A" === 1 && $"B" === 0 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 0| 1|0.27|
+---+---+---+----+
scala> df.filter($"A" === 0 && $"B" === 1 && $"C" === 1).show()
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 0| 1| 1| 0.3|
+---+---+---+----+
However, I want to achieve this using either a single statement that does my job or a function that helps me get the output.
I was thinking of using a case statement to match the values. However in general my dataframe might consist of more than 3 channels -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 0| 0| 0.0|
| 0| 0| 0| 1| 0.1|
| 0| 0| 1| 0| 0.1|
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 0| 0.1|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 0| 1| 1| 1| 0.4|
| 1| 0| 0| 0| 0.0|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 0| 1| 1| 0.1|
| 1| 1| 0| 0|0.79|
| 1| 1| 0| 1| 0.1|
| 1| 1| 1| 0| 0.1|
| 1| 1| 1| 1| 0.1|
+---+---+---+---+----+
In this scenario I would want my output as -
scala> df.show()
+---+---+---+---+----+
| A| B| C| D|rate|
+---+---+---+---+----+
| 0| 0| 1| 1|0.59|
| 0| 1| 0| 1|0.89|
| 0| 1| 1| 0|0.39|
| 1| 0| 0| 1|0.99|
| 1| 0| 1| 0|0.49|
| 1| 1| 0| 0|0.79|
+---+---+---+---+----+
which shows rates for paired presence of channels => (AB, AC, AD, BC, BD, CD).
Kindly help.

One way could be to sum the columns and then filter only when the result of the sum is 2.
import org.apache.spark.sql.functions._
df.withColumn("res", $"A" + $"B" + $"C").filter($"res" === lit(2)).drop("res").show
The output is:
+---+---+---+----+
| A| B| C|rate|
+---+---+---+----+
| 1| 1| 0|0.24|
| 1| 0| 1|0.27|
| 0| 1| 1| 0.3|
+---+---+---+----+

Pyspark - Ranking columns keeping ties

I'm looking for a way to rank columns of a dataframe preserving ties. Specifically for this example, I have a pyspark dataframe as follows where I want to generate ranks for colA & colB (though I want to support being able to rank N number of columns)
+--------+----------+-----+----+
| Entity| id| colA|colB|
+-------------------+-----+----+
| a|8589934652| 21| 50|
| b| 112| 9| 23|
| c|8589934629| 9| 23|
| d|8589934702| 8| 21|
| e| 20| 2| 21|
| f|8589934657| 2| 5|
| g|8589934601| 1| 5|
| h|8589934653| 1| 4|
| i|8589934620| 0| 4|
| j|8589934643| 0| 3|
| k|8589934618| 0| 3|
| l|8589934602| 0| 2|
| m|8589934664| 0| 2|
| n| 25| 0| 1|
| o| 67| 0| 1|
| p|8589934642| 0| 1|
| q|8589934709| 0| 1|
| r|8589934660| 0| 1|
| s| 30| 0| 1|
| t| 55| 0| 1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 21| 2| 3|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
from pyspark import Row
# This takes a dataframe and a list of columns to rank
# If no list is provided, it ranks *all* columns
# returns a new dataframe
def addRank(ranked_rdd, col, ascending):
# This assumes an RDD of the form (Row(...), list[...])
# it orders the rdd by col, finds the order, then adds that to the
# list
myrdd = ranked_rdd.sortBy(lambda (row, ranks): row[col],
ascending=ascending).zipWithIndex()
return myrdd.map(lambda ((row, ranks), index): (row, ranks +
[index+1]))
myrdd = mydf.rdd
fields = myrdd.first().__fields__
ranked_rdd = myrdd.map(lambda x: (x, []))
if (cols is None):
cols = fields
for col in cols:
ranked_rdd = addRank(ranked_rdd, col, ascending)
rank_names = [x + "_rank" for x in cols]
# Hack to make sure columns come back in the right order
ranked_rdd = ranked_rdd.map(lambda (row, ranks): Row(*row.__fields__ +
rank_names)(*row + tuple(ranks)))
return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 3| 3|
| d|8589934702| 8| 21| 4| 4|
| e| 20| 2| 21| 5| 5|
| f|8589934657| 2| 5| 6| 6|
| g|8589934601| 1| 5| 7| 7|
| h|8589934653| 1| 4| 8| 8|
| i|8589934620| 0| 4| 9| 9|
| j|8589934643| 0| 3| 10| 10|
| k|8589934618| 0| 3| 11| 11|
| l|8589934602| 0| 2| 12| 12|
| m|8589934664| 0| 2| 13| 13|
| n| 25| 0| 1| 14| 14|
| o| 67| 0| 1| 15| 15|
| p|8589934642| 0| 1| 16| 16|
| q|8589934709| 0| 1| 17| 17|
| r|8589934660| 0| 1| 18| 18|
| s| 30| 0| 1| 19| 19|
| t| 55| 0| 1| 20| 20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This stackoverflow post is the closest solution I've found:
rank-users-by-column But it appears to only handle 1 column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.

You're looking for dense_rank
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize([["a",8589934652,21,50],["b",112,9,23],["c",8589934629,9,23],
["d",8589934702,8,21],["e",20,2,21],["f",8589934657,2,5],
["g",8589934601,1,5],["h",8589934653,1,4],["i",8589934620,0,4],
["j",8589934643,0,3],["k",8589934618,0,3],["l",8589934602,0,2],
["m",8589934664,0,2],["n",25,0,1],["o",67,0,1],["p",8589934642,0,1],
["q",8589934709,0,1],["r",8589934660,0,1],["s",30,0,1],["t",55,0,1]]
), ["Entity","id","colA","colB"])
We'll define two windowSpec:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
"colA_rank",
psf.dense_rank().over(wA)
).withColumn(
"colB_rank",
psf.dense_rank().over(wB)
)
+------+----------+----+----+---------+---------+
|Entity| id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+------+----------+----+----+---------+---------+

I'll also pose an alternative:
for cols in data.columns[2:]:
lookup = (data.select(cols)
.distinct()
.orderBy(cols, ascending=False)
.rdd
.zipWithIndex()
.map(lambda x: x[0] + (x[1], ))
.toDF([cols, cols+"_rank_lookup"]))
name = cols + "_ranks"
data = data.join(lookup, [cols]).withColumn(name,col(cols+"_rank_lookup")
+ 1).drop(cols + "_rank_lookup")
Not as elegant as dense_rank() and I'm uncertain as to performance implications.

Spark: Transpose DataFrame Without Aggregating

I have looked at a number of questions online, but they don't seem to do what I'm trying to achieve.
I'm using Apache Spark 2.0.2 with Scala.
I have a dataframe:
+----------+-----+----+----+----+----+----+
|segment_id| val1|val2|val3|val4|val5|val6|
+----------+-----+----+----+----+----+----+
| 1| 100| 0| 0| 0| 0| 0|
| 2| 0| 50| 0| 0| 20| 0|
| 3| 0| 0| 0| 0| 0| 0|
| 4| 0| 0| 0| 0| 0| 0|
+----------+-----+----+----+----+----+----+
which I want to transpose to
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+-----+----+----+----+
I've tried using pivot() but I couldn't get to the right answer. I ended up looping through my val{x} columns, and pivoting each as per below, but this is proving to be very slow.
val d = df.select('segment_id, 'val1)
+----------+-----+
|segment_id| val1|
+----------+-----+
| 1| 100|
| 2| 0|
| 3| 0|
| 4| 0|
+----------+-----+
d.groupBy('val1).sum().withColumnRenamed('val1', 'vals')
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
+----+-----+----+----+----+
Then using union() on each iteration of val{x} to my first dataframe.
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val2| 0| 50| 0| 0|
+----+-----+----+----+----+
Is there a more efficient way of a transpose where I do not want to aggregate data?
Thanks :)

Unfortunately there is no case when:
Spark DataFrame is justified considering amount of data.
Transposition of data is feasible.
You have to remember that DataFrame, as implemented in Spark, is a distributed collection of rows and each row is stored and processed on a single node.
You could express transposition on a DataFrame as pivot:
val kv = explode(array(df.columns.tail.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
df
.withColumn("kv", kv)
.select($"segment_id", $"kv.k", $"kv.v")
.groupBy($"k")
.pivot("segment_id")
.agg(first($"v"))
.orderBy($"k")
.withColumnRenamed("k", "vals")
but it is merely a toy code with no practical applications. In practice it is not better than collecting data:
val (header, data) = df.collect.map(_.toSeq.toArray).transpose match {
case Array(h, t # _*) => {
(h.map(_.toString), t.map(_.collect { case x: Int => x }))
}
}
val rows = df.columns.tail.zip(data).map { case (x, ys) => Row.fromSeq(x +: ys) }
val schema = StructType(
StructField("vals", StringType) +: header.map(StructField(_, IntegerType))
)
spark.createDataFrame(sc.parallelize(rows), schema)
For DataFrame defined as:
val df = Seq(
(1, 100, 0, 0, 0, 0, 0),
(2, 0, 50, 0, 0, 20, 0),
(3, 0, 0, 0, 0, 0, 0),
(4, 0, 0, 0, 0, 0, 0)
).toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
both would you give you the desired result:
+----+---+---+---+---+
|vals| 1| 2| 3| 4|
+----+---+---+---+---+
|val1|100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+---+---+---+---+
That being said if you need an efficient transpositions on distributed data structure you'll have to look somewhere else. There is a number of structures, including core CoordinateMatrix and BlockMatrix, which can distribute data across both dimensions and can be transposed.

In python, this can be done in a simple way
I normally use transpose function in Pandas by converting the spark DataFrame
spark_df.toPandas().T

Here is the solution for Pyspark
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
Here is the solution code for your problem:
Step1: Choose columns
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
This code part can form the data frame like this:
+----------+-----+----+----+----+----+----+
| val1|val2|val3|val4|val5|val6|segment_id
+----------+-----+----+----+----+----+----+
| 100| 0| 0| 0| 0| 0| 1 |
| 0| 50| 0| 0| 20| 0| 2 |
| 0| 0| 0| 0| 0| 0| 3 |
| 0| 0| 0| 0| 0| 0| 4 |
+----------+-----+----+----+----+----+----+
Step 2: Transpose the whole table.
d_transposed = d.T.sort_index()
This code part can form the data frame like this:
+----+-----+----+----+----+----+-
|segment_id| 1| 2| 3| 4|
+----+-----+----+----+----+----+-
|val1 | 100| 0| 0| 0|
|val2 | 0| 50| 0| 0|
|val3 | 0| 0| 0| 0|
|val4 | 0| 0| 0| 0|
|val5 | 0| 20| 0| 0|
|val6 | 0| 0| 0| 0|
+----+-----+----+----+----+----+-
Step 3: You need to rename the segment_id to vals:
d_transposed.withColumnRenamed("segment_id","vals")
+----+-----+----+----+----+----+-
|vals | 1| 2| 3| 4|
+----+-----+----+----+----+----+-
|val1 | 100| 0| 0| 0|
|val2 | 0| 50| 0| 0|
|val3 | 0| 0| 0| 0|
|val4 | 0| 0| 0| 0|
|val5 | 0| 20| 0| 0|
|val6 | 0| 0| 0| 0|
+----+-----+----+----+----+----+-
Here is your full code:
d = df.select('val1','val2','val3','val4','val5','val6','segment_id')
d_transposed = d.T.sort_index()
d_transposed.withColumnRenamed("segment_id","vals")

This should be a perfect solution.
val seq = Seq((1,100,0,0,0,0,0),(2,0,50,0,0,20,0),(3,0,0,0,0,0,0),(4,0,0,0,0,0,0))
val df1 = seq.toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
df1.show()
val schema = df1.schema
val df2 = df1.flatMap(row => {
val metric = row.getInt(0)
(1 until row.size).map(i => {
(metric, schema(i).name, row.getInt(i))
})
})
val df3 = df2.toDF("metric", "vals", "value")
df3.show()
import org.apache.spark.sql.functions._
val df4 = df3.groupBy("vals").pivot("metric").agg(first("value"))
df4.show()

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to transform row-information into columns? - scala

Related

Is there any generic function to finding the column names in pyspark

Is there any generic functions to assign column names in pyspark?

Filtering on multiple columns in Spark dataframes

Pyspark - Ranking columns keeping ties

Spark: Transpose DataFrame Without Aggregating

Categories

Resources