I have a dataframe like this:
+---+---+
|_c0|_c1|
+---+---+
|1.0|4.0|
|1.0|4.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|3.0|6.0|
|4.0|5.0|
|4.0|5.0|
|4.0|5.0|
+---+---+
and I would like to shuffle all the rows using Spark in Scala.
How can I do this without going back to an RDD?
You need to use the orderBy method of the DataFrame:
import org.apache.spark.sql.functions.rand
val shuffledDF = dataframe.orderBy(rand())
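If the shuffle needs to be reproducible across runs, rand also accepts a seed; a minimal sketch against the example DataFrame above (note that the result still depends on how the data is partitioned):
import org.apache.spark.sql.functions.rand

val seededShuffle = dataframe.orderBy(rand(42))  // fixed seed gives the same ordering for the same partitioning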
Suppose I have a big Spark DataFrame and I don't know how many columns it has.
(The solution has to be in PySpark using a pandas UDF, not a different approach.)
I want to perform an action on all columns, so it's OK to loop over the columns inside the UDF.
But I don't want to loop through rows; I want it to act on a whole column at once.
I couldn't find how this could be done anywhere.
Suppose I have this DataFrame:
A B C
5 3 2
1 7 0
Now I want to send it to a pandas UDF to get the sum of each row:
Sum
10
8
The number of columns is not known.
I can do it inside the UDF by looping one row at a time, but I don't want to. I want it to act on all rows without looping, and I allow looping through columns if needed.
One option I tried is combining all columns into an array column:
ARR
[5,3,2]
[1,7,0]
But even here it doesn't work for me without looping.
I send this column to the UDF, and then inside I need to loop through its rows and sum the values of each list.
It would be nice if I could separate out each column and act on the whole column at once.
How do I act on a column all at once, without looping through the rows?
If I loop through the rows, I guess it's no better than a regular Python UDF.
I wouldn't go to pandas UDFs, or resort to UDFs at all, unless it can't be done in native PySpark. Anyway, code for both approaches is below:
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import array, expr, lit

df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True, format='csv')
sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))
#sf.show()
columns = ["id", "ArrDelay", "DepDelay"]
data = [("a", 81.0, 3),
        ("b", 36.2, 5),
        ("c", 12.0, 5),
        ("d", 81.0, 5),
        ("e", 36.3, 5),
        ("f", 12.0, 5),
        ("g", 111.7, 5)]
sf = spark.createDataFrame(data=data, schema=columns)
sf.show()

# Option 1: use the SQL higher-order function aggregate (no UDF needed)
new = (sf.withColumn('sums', array(*[x for x in ['ArrDelay', 'DepDelay']]))  # array of the desired column values per row
         .withColumn('sums', expr("aggregate(sums, cast(0 as double), (c, i) -> c + i)"))  # use aggregate to sum
      ).show()

# Option 2: use a pandas UDF via mapInPandas
sch = sf.withColumn('v', lit(90.087654623)).schema

def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf.assign(v=pdf.sum(1))  # row-wise sum of the numeric columns

sf.mapInPandas(sum_s, schema=sch).show()
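Since the question says the number of columns is not known in advance, the same aggregate pattern can be built from the column list at run time; here is a minimal sketch, assuming every column to be summed is (or can be cast to) numeric and that excluding 'id' is just an example:
from pyspark.sql.functions import array, col, expr

sum_cols = [c for c in sf.columns if c != 'id']  # columns to sum, chosen at run time
generalized = (sf.withColumn('sums', array(*[col(c).cast('double') for c in sum_cols]))
                 .withColumn('sums', expr("aggregate(sums, cast(0 as double), (acc, x) -> acc + x)")))
generalized.show()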
Here's a simple way to do it:
from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        (5, 3, 2),
        (1, 7, 0),
    ],
    ["A", "B", "C"],
)

cols = df.columns
# build the expression A + B + C by folding over the column list
calculate_sum = reduce(lambda a, x: a + x, map(F.col, cols))

df = df.withColumn("sum", calculate_sum)
df.show()
output:
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 5| 3| 2| 10|
| 1| 7| 0| 8|
+---+---+---+---+
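On Spark 3.1 and later, the same fold can also be expressed with the built-in aggregate higher-order function instead of functools.reduce; a minimal sketch (the Spark version requirement and the new column name are the only assumptions):
# assumes Spark >= 3.1, where pyspark.sql.functions.aggregate is available
summed = df.withColumn(
    "sum2",
    F.aggregate(
        F.array(*[F.col(c).cast("double") for c in cols]),  # pack the row values into one array
        F.lit(0.0),                                          # start value; double to match the casts
        lambda acc, x: acc + x,                              # running sum
    ),
)
summed.show()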
How can I transpose a DataFrame in PySpark?
My table is shown below:
|two|0.6|1.2|1.7|1.5|1.4|2.0|
|one|0.3|1.2|1.3|1.5|1.4|1.0|
I want to transpose it to:
+---+---+
|two|one|
+---+---+
|0.6|0.3|
|1.2|1.2|
|1.7|1.3|
|1.5|1.5|
|1.4|1.4|
|2.0|1.0|
+---+---+
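One common way to do this is to unpivot the value columns with posexplode and then pivot the label column back into column names; a minimal sketch, in which the column names (label, c1 ... c6) are assumptions because the original table has no header:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("two", 0.6, 1.2, 1.7, 1.5, 1.4, 2.0),
     ("one", 0.3, 1.2, 1.3, 1.5, 1.4, 1.0)],
    ["label", "c1", "c2", "c3", "c4", "c5", "c6"],
)

value_cols = df.columns[1:]
# one output row per (label, position, value)
unpivoted = df.select(
    "label",
    F.posexplode(F.array(*[F.col(c) for c in value_cols])).alias("pos", "value"),
)
# turn each label into a column, keeping the original value order via pos;
# pivot sorts the new columns alphabetically, so reorder with select if needed
transposed = (unpivoted.groupBy("pos")
                       .pivot("label")
                       .agg(F.first("value"))
                       .orderBy("pos")
                       .drop("pos"))
transposed.show()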
Is there a way to select the entire row as a column to pass into a PySpark filter UDF?
I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:
my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*")))
But col("*") throws an error because that's not a valid operation.
I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back into a dataframe. My DataFrame has complex nested types, so the schema inference fails when I try to convert the RDD into a dataframe again.
You should write out all the columns explicitly. For example:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# create sample df
df = sc.parallelize([
    (1, 'b'),
    (1, 'c'),
]).toDF(["id", "category"])

# simple filter function
@F.udf(returnType=BooleanType())
def my_filter(col1, col2):
    return (col1 > 0) & (col2 == "b")

df.filter(my_filter('id', 'category')).show()
Results:
+---+--------+
| id|category|
+---+--------+
| 1| b|
+---+--------+
If you have many columns and you are sure of their order:
cols = df.columns
df.filter(my_filter(*cols)).show()
Yields the same output.
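If you really do want to pass the whole row as a single value, as the question asks, the columns can also be packed into a struct and handed to the UDF. This variant is a sketch of my own rather than part of the answer above; it assumes the fields are accessed by name inside the UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

@F.udf(returnType=BooleanType())
def my_row_filter(row):
    # the struct arrives in Python as a Row; fields are addressed by name
    return (row["id"] > 0) and (row["category"] == "b")

df.filter(my_row_filter(F.struct(*df.columns))).show()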
LATER EDIT:
Based on this article, it seems that Spark cannot edit an RDD or column in place. A new one has to be created with the new type and the old one deleted. The for loop and .withColumn method suggested below seem to be the easiest way to get the job done.
ORIGINAL QUESTION:
Is there a simple way (for both human and machine) to convert multiple columns to a different data type?
I tried to define the schema manually, then load the data from a parquet file using this schema and save it to another file, but I get "Job aborted."..."Task failed while writing rows" every time, on every DataFrame. Somewhat easy for me, laborious for Spark ... and it does not work.
Another option is using:
df = df.withColumn("new_col", df("old_col").cast(type)).drop("old_col").withColumnRenamed("new_col", "old_col")
A bit more work for me as there are close to 100 columns and, if Spark has to duplicate each column in memory, then that doesn't sound optimal either. Is there an easier way?
Depending on how complicated the casting rules are, you can accomplish what you are asking with this loop:
scala> var df = Seq((1,2),(3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> df.columns.foreach{c => df = df.withColumn(c, df(c).cast(DoubleType))}
scala> df.show
+---+---+
| a| b|
+---+---+
|1.0|2.0|
|3.0|4.0|
+---+---+
This should be as efficient as any other column operation.
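If you prefer to avoid reassigning the DataFrame inside a loop, a single select that casts every column works too; a minimal sketch, assuming all columns should become DoubleType:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// one projection that casts every column while keeping the original names
val casted = df.select(df.columns.map(c => col(c).cast(DoubleType).as(c)): _*)
casted.printSchema()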
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
This is a way of creating a DataFrame from a list of tuples in Python. How can I do this in Scala? I'm new to Scala and I'm having trouble figuring it out.
Any help will be appreciated!
One simple way:
val df = sc.parallelize(List( (1,"a"), (2,"b") )).toDF("key","value")
and then df.show gives:
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
+---+-----+
Refer to the worked example in Programmatically Specifying the Schema for constructing a DataFrame with createDataFrame.
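For reference, here is a minimal sketch of that programmatic-schema route with the same key/value data, assuming a spark-shell session where sc and sqlContext are already available (the column names and types are illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// build Rows and an explicit schema, then hand both to createDataFrame
val rows = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))
val schema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true)
))
val df2 = sqlContext.createDataFrame(rows, schema)
df2.show()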
To create a DataFrame, you first need an SQLContext:
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this import implicitly converts an RDD to a DataFrame; after it you can use the .toDF method
import sqlContext.implicits._
Now you can create DataFrames:
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
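Applied to the single-column example from the question, a minimal sketch (it relies on the implicits import above):
// Scala equivalent of the Python wordsDF example
val wordsDF = Seq("cat", "elephant", "rat", "rat", "cat").toDF("word")
wordsDF.show()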
Learn more about creating DataFrames here.