Adding column to pyspark dataframe based on conditions from other pyspark dataframes - pyspark

I am currently learning PySpark and working on adding columns to PySpark dataframes based on multiple conditions.
I have tried working with UDFs but am getting errors like:
TypeError: 'object' object has no attribute '__getitem__'
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType, StringType, FloatType
from pyspark.sql.functions import pandas_udf, PandasUDFType
#first dataframe (superset)
df1 = simple_example1
#second dataframe
df = diff_cols.dropna()
def func(x, y):
    z = df1[(df1['a'] == x) & (df1['b'] <= (y + 10000000000)) & (df1['b'] >= (y - 10000000000))]
    z = z[(z["c"] == 1) | (z["d"] == 1)]
    z = z[(z["e"] != 0) | (z["f"] != 0) | (z["g"] != 0) | (z["h"] != 0)]
    return 1 if z.count() > 3 else 0
udf_func = udf(func, IntegerType())
df = df.withColumn('status', udf_func(df['a'],df['b']))
What I am trying to do is as follows:
1. For each row of df, filter the data from df1 where parameter a is equal to the parameter in df and parameter b is between b-10 and b+10.
2. Then filter that data further to rows where either c or d = 1.
3. Then filter that data further to rows where any of the parameters e, f, g, h is non-zero.
4. Then count the number of rows in the subset and assign 0/1.
5. Return this 0/1 in the status column of df (see the join-based sketch after this list).
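These steps can also be expressed without a UDF (a Python UDF cannot reference another Spark dataframe such as df1, which is one reason the approach above breaks). Below is a minimal, untested sketch that pre-filters df1 and uses a range join plus an aggregation; it assumes df has columns a and b, df1 has columns a through h as in the code above, and (a, b) identifies a row of df:
from pyspark.sql import functions as F

delta = 10000000000  # same window used in the UDF above

# steps 2 and 3: conditions that only involve df1
df1_f = df1.filter(((F.col("c") == 1) | (F.col("d") == 1)) &
                   ((F.col("e") != 0) | (F.col("f") != 0) |
                    (F.col("g") != 0) | (F.col("h") != 0)))

# step 1: match on a and the b window; step 4: count matches per (a, b) of df
matches = (df.alias("l")
             .join(df1_f.alias("r"),
                   (F.col("l.a") == F.col("r.a")) &
                   F.col("r.b").between(F.col("l.b") - delta, F.col("l.b") + delta),
                   "left")
             .groupBy("l.a", "l.b")
             .agg(F.count("r.c").alias("n_matches")))

# step 5: 1 if more than 3 matching rows, else 0
df = (df.join(matches, on=["a", "b"], how="left")
        .withColumn("status", F.when(F.col("n_matches") > 3, 1).otherwise(0))
        .drop("n_matches"))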

Related

Convert each row of pyspark DataFrame column to a Json string

How to create a column with json structure based on other columns of a pyspark dataframe.
For example, I want to achieve the result below in a PySpark dataframe. I am able to do this on a pandas dataframe as shown below, but how do I do the same on a PySpark dataframe?
df = {'Address': ['abc', 'dvf', 'bgh'], 'zip': [34567, 12345, 78905], 'state':['VA', 'TN', 'MA']}
df = pd.DataFrame(df, columns = ['Address', 'zip', 'state'])
lst = ['Address', 'zip']
df['new_col'] = df[lst].apply(lambda x: x.to_json(), axis = 1)
Expected output
Assuming your pyspark dataframe is named df, use the struct function to construct a struct, and then use the to_json function to convert it to a json string.
import pyspark.sql.functions as F
....
lst = ['Address', 'zip']
df = df.withColumn('new_col', F.to_json(F.struct(*[F.col(c) for c in lst])))
df.show(truncate=False)
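With the three example rows above, new_col should then hold the selected fields serialized as JSON strings, along the lines of:
{"Address":"abc","zip":34567}
{"Address":"dvf","zip":12345}
{"Address":"bgh","zip":78905}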

Split the PySpark dataframes against number of records

I'm working with a PySpark dataframe having around 100000 records and I want to create new dataframes of around 20000 records each. How can I achieve it?
It can be made dynamic, but here is a lazy way to do it:
# Creates a random DF with 100000 rows
from pyspark.sql import functions as F
df = spark.range(0, 100000).withColumn('rand_col', F.rand()).drop('id')
from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window
w = Window().orderBy(lit('A'))
df = df.withColumn("index", row_number().over(w)) #creates a index column to split the DF
df1 = df.filter(F.col('index') < 20001)
df2 = df.filter((F.col('index') >= 20001) & (F.col('index') < 40001))
df3 = df.filter((F.col('index') >= 40001) & (F.col('index') < 60001))
df4 = df.filter((F.col('index') >= 60001) & (F.col('index') < 80001))
df5 = df.filter((F.col('index') >= 80001) & (F.col('index') < 100001))
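If the boundaries shouldn't be hard-coded, the same index column can instead be bucketed by a chunk size; here is a rough sketch of that idea, reusing the index column created above:
from pyspark.sql import functions as F

chunk_size = 20000
n_chunks = (df.count() + chunk_size - 1) // chunk_size  # ceiling division

# assign each row to a bucket based on its index, then build one dataframe per bucket
df = df.withColumn('bucket', F.floor((F.col('index') - 1) / chunk_size))
dfs = [df.filter(F.col('bucket') == i).drop('bucket') for i in range(n_chunks)]
If the exact chunk sizes don't matter, df.randomSplit([1.0] * 5) is another option.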

spark/scala drop row with nan in any column

I am using Zeppelin and df is the Spark DataFrame. I am trying to filter out rows that contain NaN in any column, however it doesn't filter them out for some reason.
val df = df_labeled("df_Germany")
df: org.apache.spark.sql.DataFrame = [Kik: string, Lak: string ... 15 more fields]
df.count()
res66: Long = 455
df.na.drop().count
res66: Long = 455
How do I filter NaNs all at once?
Generally the following should work:
df.na.drop
But there is an alternative: use the isNaN function on each column that can be NaN. Since NaN values are only possible for Floats and Doubles, we need to get the column names that have DoubleType or FloatType as their dataType and filter as:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, FloatType}
val nan_columns = df.schema.filter(x => x.dataType == DoubleType || x.dataType == FloatType).map(_.name)
df.filter(!nan_columns.map(c => col(c).isNaN).reduce(_ or _))
or you can use isnan inbuilt function as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, FloatType}
val nan_columns = df.schema.filter(x => x.dataType == DoubleType || x.dataType == FloatType).map(_.name)
df.filter(!nan_columns.map(x => isnan(col(x))).reduce(_ or _))
Assuming df is your dataframe: if you want to drop all rows that have NaN values in any column, you can use
df.na.drop
If you want to fill all the NaN values with some values, you can use
df.na.fill(your_value)
To do this on only specific columns:
val cols = Seq("col1","col2")
df.na.drop(cols)
But if you want to do this column-wise, you can do
df.filter(!$"col_name".isNaN)
Or
df.filter(!isnan($"your_column"))

Convert Spark Data Frame to org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]

I'm very new to scala and spark 2.1.
I'm trying to calculate correlation between many elements in a data frame which looks like this:
item_1 | item_2 | item_3 | item_4
1 | 1 | 4 | 3
2 | 0 | 2 | 0
0 | 2 | 0 | 1
Here is what I've tried:
val df = sqlContext.createDataFrame(Seq(
  (1, 1, 4, 3),
  (2, 0, 2, 0),
  (0, 2, 0, 1)
)).toDF("item_1", "item_2", "item_3", "item_4")
val items = df.select(array(df.columns.map(col(_)): _*)).rdd.map(_.getSeq[Double](0))
And calculate the correlation between elements:
val correlMatrix: Matrix = Statistics.corr(items, "pearson")
With the following error message:
<console>:89: error: type mismatch;
found : org.apache.spark.rdd.RDD[Seq[Double]]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val correlMatrix: Matrix = Statistics.corr(items, "pearson")
I don't know how to create the org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] from a data frame.
This might be a really easy task but I kinda struggle with it and I'm happy for any advice.
You can, for example, use VectorAssembler to assemble vectors and then convert to an RDD:
import org.apache.spark.ml.feature.VectorAssembler
val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
.transform(df)
.select("vs")
.rdd
Extract Vectors from Row:
Spark 1.x:
rows.map(_.getAs[org.apache.spark.mllib.linalg.Vector](0))
Spark 2.x:
rows
.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
.map(org.apache.spark.mllib.linalg.Vectors.fromML)
Regarding your code:
You have Integer columns, not Double.
The data is not an array, so you cannot use _.getSeq[Double](0).
If your goal is to perform Pearson correlations, you don't really have to use RDDs and Vectors. Here's an example of performing Pearson correlations directly on DataFrame columns (the columns in question are of Double type).
Code:
import org.apache.spark.sql.{SQLContext, Row, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType}
import org.apache.spark.sql.functions._
val rb = spark.read.option("delimiter","|").option("header","false").option("inferSchema","true").format("csv").load("rb.csv").toDF("name","beerId","brewerId","abv","style","appearance","aroma","palate","taste","overall","time","reviewer").cache()
rb.agg(
corr("overall","taste"),
corr("overall","aroma"),
corr("overall","palate"),
corr("overall","appearance"),
corr("overall","abv")
).show()
In this example, I'm importing a dataframe (with a custom delimiter, no header, and inferred data types), and then simply performing an agg function against the dataframe which has multiple correlations inside it.
Output:
+--------------------+--------------------+---------------------+-------------------------+------------------+
|corr(overall, taste)|corr(overall, aroma)|corr(overall, palate)|corr(overall, appearance)|corr(overall, abv)|
+--------------------+--------------------+---------------------+-------------------------+------------------+
| 0.8762432795943761| 0.789023067942876| 0.7008942639550395| 0.5663593891357243|0.3539158620897098|
+--------------------+--------------------+---------------------+-------------------------+------------------+
As you can see from the results, the (overall, taste) columns are highly correlated, while (overall, abv) not so much.
Here's a link to the Scala Docs DataFrame page which has the Aggregation Correlation Function.

Merge multiple Dataframes into one Dataframe in Spark [duplicate]

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
An operation like this is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work you have to match both the number of partitions and the number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If the above conditions are not met, the only option that comes to mind is adding an index and joining:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's implementation of DataFrames, there is no simple way to concatenate two dataframes into one. We can simply work around this limitation by adding indices to each row of the dataframes. Then, we can do an inner join on these indices. This is my stub code for this implementation:
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id",monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id",monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala but if, like me, you need to know how to do this in pyspark then try the Python code below. Like #zero323's first solution it relies on RDD.zip() and will therefore fail if both DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
    CombinedRow = Row(*left.columns + right.columns)
    def flattenRow(row):
        left = row[0]
        right = row[1]
        combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
        return CombinedRow(*combinedVals)
    zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
    combinedSchema = StructType(left.schema.fields + right.schema.fields)
    return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)