How i can transpose the dataframe in pyspark.
My Table is shown as below -
|two|0.6|1.2|1.7|1.5|1.4|2.0|
|one|0.3|1.2|1.3|1.5|1.4|1.0|
i want to transpose to
+---+---+
|two|one|
+---+---+
|0.6|0.3|
|1.2|1.2|
|1.7|1.3|
|1.5|1.5|
|1.4|1.4|
|2.0|1.0|
+---+---+
Related
Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
I do not understand what goes wrong in the following code.
Hence I am looking forward for a solution and an explanation what fails in my minimal example.
A minimal example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 2| b|
| 3| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf as a spark data frame
val row = sdf.limit(1)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
As row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
the exspected result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
But I get :
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
| 2| b|
+---+---+
Question:
What confused me is the following: Using val row = sdf.limit(1) I thought I created a permanent/ unchangeable/ well defined object. Such that when I print it once and add it to something, I get the same results.
Remark: (thanks a lot to Daniel's remarks)
I know that in the distributed world of scala there is no well defined notion of "first row". I put it there for simplicity and I hope that people struggling with something similar will "accidentially" use the term "first".
What I try to achieve is the following: (in a simplified example)
I have a data frame with 2 columns A and B. Column A is partially ordered and column B is totally ordered.
I want to filter the data w.r.t. the columns. So the idea is some kind of divide and conquer: split the data frame, such that into pieces both columns are totally ordered and than filter as usual. (and do the obvious iterations)
To achieve this I need to pick a well defined row and split the date w.r.t. row.A. But as the minimal example shows my comands do not produce a well defined object.
Thanks a lot
Spark is distributed, so the notion of 'first' is not something we can rely on. Dependently on partitioning we can get a different result when calling limit or first.
To have consistent results your data has to have an underlying order which we can use - what makes a lot of sense, since unless there is logical ordering to your data, we can't really say what does it mean to take the first row.
Assuming you want to take the first row with respect to column A, you can just run orderBy("A").first()(*) . Although if column A has more than one row with same smallest value there is no guarantee which row you will get.
(* I assume scala API has the same naming as Python so please correct me if they are differently named)
#Christian you can achieve this result by using take function.
take(num) Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
here the code snippet.
scala> import org.apache.spark.sql.types._
scala> val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
scala> import org.apache.spark.sql._
scala> var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
scala> var first1 =sdf.rdd.take(1)
scala> val first_row = spark.createDataFrame(sc.parallelize(first1), sdf.schema)
scala> sdfEmpty.union(first_row).show
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
for more about take() and first() function just read spark Documentation.let me know if you have any query related to this.
I am posting this answer as it contains the solution suggested by Daniel. Once I am through literature provided mahesh gupta or some more testing I'll update this answer and give remarks on the runtimes of the different approaches in "real life".
Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
As in the distributed world of spark there is not a well defined notion of first, but something similar might be achieved due to orderBy.
A minimal working example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 2| b|
| 3| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf as a spark data frame
val row = sdf.limit(1).collect()
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
The row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
** and the result for sdfEmpty is:**
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
Remark: Explanation given by Daniel (see comments above) .limit(n) is a transformation - it does not get evaluated until an action runs like show or collect. Hence depending on the context it can return different value. To use the result of .limit consistently one can .collect it to driver and use it as a local variable.
I have a dataframe which is like :
df:
col1 col2
a [p1,p2,p3]
b [p1,p4]
Desired output is that:
df_out:
col1 col2 col3
p1 p2 a
p1 p3 a
p2 p3 a
p1 p4 b
I did some research and i think that converting df to rdd and then flatMap with cartesian product are ideal for the problem. However i could not combine them together.
Thanks,
It looks like you are trying to do combination rather than cartesian. Please check my understanding.
This is in PySpark but the only python thing is the UDF, the rest is just DataFrame operations.
process is
Create dataframe
define UDF to get all pairs of combinations ignoring order
use UDF to convert array into array of pairs of structs, one for each element of the combination
explode the results to get rows of pair of structs
select each struct and original column 1 into desired result columns
from itertools import combinations
from pyspark.sql import functions as F
df = spark.createDataFrame([
("a", ["p1", "p2", "p3"]),
("b", ["p1", "p4"])
],
["col1", "col2"]
)
# define and register udf that takes an array and returns an array of struct of two strings
#udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
return combinations(x, 2)
resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| p1| p2| a|
| p1| p3| a|
| p2| p3| a|
| p1| p4| b|
+----+----+----+
Is there a way to select the entire row as a column to input into a Pyspark filter udf?
I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:
my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*"))
But
col("*")
throws an error because that's not a valid operation.
I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back into a dataframe. My DataFrame has complex nested types, so the schema inference fails when I try to convert the RDD into a dataframe again.
You should write all columns staticly. For example:
from pyspark.sql import functions as F
# create sample df
df = sc.parallelize([
(1, 'b'),
(1, 'c'),
]).toDF(["id", "category"])
#simple filter function
#F.udf(returnType=BooleanType())
def my_filter(col1, col2):
return (col1>0) & (col2=="b")
df.filter(my_filter('id', 'category')).show()
Results:
+---+--------+
| id|category|
+---+--------+
| 1| b|
+---+--------+
If you have so many columns and you are sure to order of columns:
cols = df.columns
df.filter(my_filter(*cols)).show()
Yields the same output.
I have a dataframe like this:
+---+---+
|_c0|_c1|
+---+---+
|1.0|4.0|
|1.0|4.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|3.0|6.0|
|4.0|5.0|
|4.0|5.0|
|4.0|5.0|
+---+---+
and I would like to shuffle all the rows using Spark in Scala.
How can I do this without going back to RDD?
You need to use orderBy method of the dataframe:
import org.apache.spark.sql.functions.rand
val shuffledDF = dataframe.orderBy(rand())
LATER EDIT:
Based on this article it seems that Spark cannot edit and RDD or column. A new one has to be created with the new type and the old one deleted. The for loop and .withColumn method suggested below seem to be the easiest way to get the job done.
ORIGINAL QUESTION:
Is there a simple way (for both human and machine) to convert multiple columns to a different data type?
I tried to define the schema manually, then load the data from a parquet file using this schema and save it to another file but I get "Job aborted."..."Task failed while writing rows" every time and on every DF. Somewhat easy for me, laborious for Spark ... and it does not work.
Another option is using:
df = df.withColumn("new_col", df("old_col").cast(type)).drop("old_col").withColumnRenamed("new_col", "old_col")
A bit more work for me as there are close to 100 columns and, if Spark has to duplicate each column in memory, then that doesn't sound optimal either. Is there an easier way?
Depending on how complicated the casting rules are, you can accomplish what you are asking a with this loop:
scala> var df = Seq((1,2),(3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> > df.columns.foreach{c => df = df.withColumn(c, df(c).cast(DoubleType))}
scala> df.show
+---+---+
| a| b|
+---+---+
|1.0|2.0|
|3.0|4.0|
+---+---+
This should be as efficient as any other column operation.