Split row and flip colum value

Split row and flip colum value - pyspark

How could I best stplit a single row into 2 and flip the values of the src_* to dest_* and dest_* to src_* for the second row?
|source_ip |destination_ip|src_port|dst_port|
|192.168.0.1|10.0.0.1 |5000 |22 |
into
|ip |src_port|dst_port|
|192.168.0.1|5000 |22 |
|10.0.0.1 |22 |5000 |

You can select source_ip,src_port,dest_port then union it with dest_ip,**dst_port_src_port**.
df = spark.createDataFrame([['192.168.0.1','10.0.0.1',5000,22]],['source_ip','destination_ip','src_port','dst_port'])
df2 = df.select(['source_ip','src_port','dst_port']).union(df.select(['destination_ip','dst_port','src_port']))
df2 = df2.withColumnRenamed("source_ip", "ip")
df2.show()
Output:
+-----------+--------+--------+
| ip|src_port|dst_port|
+-----------+--------+--------+
|192.168.0.1| 5000| 22|
| 10.0.0.1| 22| 5000|
+-----------+--------+--------+
Folllowup: when you have n columns. all constant columns can be kept in a variable and the swap ones in another variable.
df = spark.createDataFrame([['1','xxx','192.168.0.1','10.0.0.1',5000,22]],['colno','somothercol','source_ip','destination_ip','src_port','dst_port'])
constant_columns = ['colno','somothercol']
swap_columns1 = ['source_ip','src_port','dst_port']
swap_columns2 = ['destination_ip','dst_port','src_port']
df2 = df.select([*constant_columns,*swap_columns1]).union(df.select([*constant_columns,*swap_columns2]))
df2 = df2.withColumnRenamed("source_ip", "ip")
df2.show()
Output:
+-----+-----------+-----------+--------+--------+
|colno|somothercol| ip|src_port|dst_port|
+-----+-----------+-----------+--------+--------+
| 1| xxx|192.168.0.1| 5000| 22|
| 1| xxx| 10.0.0.1| 22| 5000|
+-----+-----------+-----------+--------+--------+

Related

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter out what was in table_a to only the IDs that are in the table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any other imports other than pyspark.sql.functions import col

You can join the two tables and specify how = 'left_semi'
A left semi-join returns values from the left side of the relation that has a match with the right.
result_table = table_a.join(table_b, (table_a.AID == table_b.BID), \
how = "left_semi").drop("BID")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+

In case you have duplicates or Multiple values in the second dataframe and you want to take only distinct values, below approach can be useful to tackle such use cases -
Create the Dataframe
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
get all the unique values of val column in dataframe two and take in a set/list variable
df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1][0]
print(df_lookup_var)
df = df.withColumn("case_col", F.when((F.col("col1").isin([1,2])), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+

This should work too:
table_a.where( col(AID).isin(table_b.BID.tolist() ) )

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 7.5| 0.1| 2.0|
| 2| 0.3| 3.5| 10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|. id| col_1| col_2| col_n |
+-----+-------+-------+-------+
| 1| 1.0| 0.013| 0.26|
| 2| 0.028| 0.33| 1.0|
+-----+-------+-------+-------+
Any help would be appreciated.

You can find the maximum value from an array of columns and loop your dataframe to replace the normalized column value.
cols = df.columns[1:]
import builtins as p
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols]))) \
for c in cols:
df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+

How to get the set of rows which contains null values from dataframe in scala using filter

I'm new to spark and have a question regarding filtering dataframe based on null condition.
I have gone through many answers which has solution like
df.filter(($"col2".isNotNULL) || ($"col2" !== "NULL") || ($"col2" !== "null") || ($"col2".trim !== "NULL"))
But in my case, I can not write hard coded column names as my schema is not fixed. I am reading csv file and depending upon the columns in it, I have to filter my dataframe for null values and want it in another dataframe. In short, any column which has null value, that complete row should come under a different dataframe.
for example :
Input DataFrame :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
| n4| 4|n4#c2.com|[c2,2,d2]|
| n6| 6|n6#c2.com|[c2,2,d2]|
Output :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
Thank you in advance.

Try this-
val df1 = spark.sql("select col1, col2 from values (null, 1), (2, null), (null, null), (1,2) T(col1, col2)")
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* |1 |2 |
* +----+----+
*/
df1.show(false)
df1.filter(df1.columns.map(col(_).isNull).reduce(_ || _)).show(false)
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* +----+----+
*/

Thank you so much for your answers. I tried below logic and it worked for me.
var arrayColumn = df.columns;
val filterString = String.format(" %1$s is null or %1$s == '' "+ arrayColumn(0));
val x = new StringBuilder(filterString);
for(i <- 1 until arrayColumn.length){
if (x.toString() != ""){
x ++= String.format("or %1$s is null or %1$s == '' ", arrayColumn(i))
}
}
val dfWithNullRows = df.filter(x.toString());

To deal with null values and dataframes spark has some useful functions.
I will show some dataframes examples with distinct number of columns.
val schema = StructType(List(StructField("id", IntegerType, true), StructField("obj",DoubleType, true)))
val schema1 = StructType(List(StructField("id", IntegerType, true), StructField("obj",StringType, true), StructField("obj",IntegerType, true)))
val t1 = sc.parallelize(Seq((1,null),(1,1.0),(8,3.0),(2,null),(3,1.4),(3,2.5),(null,3.7))).map(t => Row(t._1,t._2))
val t2 = sc.parallelize(Seq((1,"A",null),(2,"B",null),(3,"C",36),(null,"D",15),(5,"E",25),(6,null,7),(7,"G",null))).map(t => Row(t._1,t._2,t._3))
val tt1 = spark.createDataFrame(t1, schema)
val tt2 = spark.createDataFrame(t2, schema1)
tt1.show()
tt2.show()
// To clean all rows with null values
val dfWithoutNull = tt1.na.drop()
dfWithoutNull.show()
val df2WithoutNull = tt2.na.drop()
df2WithoutNull.show()
// To fill null values with another value
val df1 = tt1.na.fill(-1)
df1.show()
// to get new dataframes with the null values rows
val nullValues = tt1.filter(row => row.anyNull == true)
nullValues.show()
val nullValues2 = tt2.filter(row => row.anyNull == true)
nullValues2.show()
output
// input dataframes
+----+----+
| id| obj|
+----+----+
| 1|null|
| 1| 1.0|
| 8| 3.0|
| 2|null|
| 3| 1.4|
| 3| 2.5|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
| 3| C| 36|
|null| D| 15|
| 5| E| 25|
| 6|null| 7|
| 7| G|null|
+----+----+----+
// Dataframes without null values
+---+---+
| id|obj|
+---+---+
| 1|1.0|
| 8|3.0|
| 3|1.4|
| 3|2.5|
+---+---+
+---+---+---+
| id|obj|obj|
+---+---+---+
| 3| C| 36|
| 5| E| 25|
+---+---+---+
// Dataframe with null values replaced
+---+----+
| id| obj|
+---+----+
| 1|-1.0|
| 1| 1.0|
| 8| 3.0|
| 2|-1.0|
| 3| 1.4|
| 3| 2.5|
| -1| 3.7|
+---+----+
// Dataframes which the rows have at least one null value
+----+----+
| id| obj|
+----+----+
| 1|null|
| 2|null|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
|null| D| 15|
| 6|null| 7|
| 7| G|null|
+----+----+----+

Filter one data frame using other data frame in spark scala

I am going to demonstrate my question using following two data frames.
val datF1= Seq((1,"everlasting",1.39),(1,"game", 2.7),(1,"life",0.69),(1,"learning",0.69),
(2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")
datF1.show()
+---+-----------+-----+
| ID| token|value|
+---+-----------+-----+
| 1|everlasting| 1.39|
| 1| game| 2.7|
| 1| life| 0.69|
| 1| learning| 0.69|
| 2| living| 1.38|
| 2| worth| 1.38|
| 2| life| 0.69|
| 3| learning| 0.69|
| 3| never| 1.38|
+---+-----------+-----+
val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
dataF2.show()
+--------+----+
| token1|val2|
+--------+----+
| life |0.71|
|learning|0.75|
+--------+----+
I want to filter the ID and value of dataF1 based on the token1 of dataF2. For the each word in token1 of dataF2 , if there is a word token then value should be equal to the value of dataF1 else value should be zero.
In other words my desired output should be like this
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Since learning is not presented in ID equals 2 , the val has equal to zero. Similarly since life is not there for ID equal 3, val2 equlas zero.
I did it manually as follows ,
val newQ61=datF1.filter($"token"==="learning")
val newQ7 =Seq(1,2,3).toDF("ID")
val newQ81 =newQ7.join(newQ61, Seq("ID"), "left")
val tf2=newQ81.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val" )
val newQ62=datF1.filter($"token"==="life")
val newQ71 =Seq(1,2,3).toDF("ID")
val newQ82 =newQ71.join(newQ62, Seq("ID"), "left")
val tf3=newQ82.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val2" )
val tf4 =tf2.join(tf3 ,Seq("ID"), "left")
tf4.show()
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Instead of doing this manually , is there a way to do this more efficiently by accessing indexes of one data frame within the other data frame ? because in real life situations, there can be more than 2 words so manually accessing each word may be very hard thing to do.
Thank you
UPDATE
When i use leftsemi join my output is like this :
datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
+---+--------+-----+
| ID| token|value|
+---+--------+-----+
| 1|learning| 0.69|
| 3|learning| 0.69|
+---+--------+-----+

I believe a left outer join and then pivoting on token can work here:
val ans = df1.join(df2, $"token" === $"token1", "LEFT_OUTER")
.filter($"token1".isNotNull)
.select("ID","token","value")
.groupBy("ID")
.pivot("token")
.agg(first("value"))
.na.fill(0)
The result (without the null handling):
ans.show
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 3| 0.69|0.0 |
| 2| 0.0 |0.69|
+---+--------+----+
UPDATE: as the answer by Lamanus suggest, an inner join is possibly a better approach than an outer join + filter.

I think the inner join is enough. Btw, I found the typo in your test case, which makes the result wrong.
val dataF1= Seq((1,"everlasting",1.39),
(1,"game", 2.7),
(1,"life",0.69),
(1,"learning",0.69),
(2,"living",1.38),
(2,"worth",1.38),
(2,"life",0.69),
(3,"learning",0.69),
(3,"never",1.38)).toDF("ID","token","value")
dataF1.show
// +---+-----------+-----+
// | ID| token|value|
// +---+-----------+-----+
// | 1|everlasting| 1.39|
// | 1| game| 2.7|
// | 1| life| 0.69|
// | 1| learning| 0.69|
// | 2| living| 1.38|
// | 2| worth| 1.38|
// | 2| life| 0.69|
// | 3| learning| 0.69|
// | 3| never| 1.38|
// +---+-----------+-----+
val dataF2= Seq(("life",0.71), // "life " -> "life"
("learning",0.75)).toDF("token1","val2")
dataF2.show
// +--------+----+
// | token1|val2|
// +--------+----+
// | life|0.71|
// |learning|0.75|
// +--------+----+
val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
resultDF.show
// +---+--------+-----+--------+----+
// | ID| token|value| token1|val2|
// +---+--------+-----+--------+----+
// | 1| life| 0.69| life|0.71|
// | 1|learning| 0.69|learning|0.75|
// | 2| life| 0.69| life|0.71|
// | 3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+
resultDF.groupBy("ID").pivot("token").agg(first("value"))
.na.fill(0).orderBy("ID").show
This will give you the result such as
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 2| 0.0|0.69|
| 3| 0.69| 0.0|
+---+--------+----+

Seems like you need "left semi-join". It will filter one dataframe, based on another one.
Try using it like
datF1.join(datF2, $"token"===$"token2", "leftsemi")
You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5

Fill null values in dataframe column with next value

I have to fill the first null values with immediate value of the same column in dataframe. This logic applies only on first consecutive null values only of the column.
I have a dataframe with similar to below
//I replaced null to 0 in value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this dataframe I am expecting as below
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // Change the value 0 to 16 at value column
|16 |exB |22 | // Change the value 0 to 16 at value column
|16 |exC |19 | // Change the value 0 to 16 at value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 | // value should not be change here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.

You can use Window function for this purpose
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.orderBy($"col2")
.show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
Expression df.orderBy($"col2") is needed only to show final results in right order. You can skip it if you don't care about final order.
UPDATE
To get exactly what you need you should you a little bit more complicated code
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
.orderBy($"col2")
.show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+

I think you need to take the 1st not null or non-zero value based on col2 's order. Please find the script below. I have created a table in spark's memory to write sql.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
df.registerTempTable("table_df")
spark.sql("with cte as(select *,row_number() over(order by col2) rno from table_df) select case when value = 0 and rno<(select min(rno) from cte where value != 0) then (select value from cte where rno=(select min(rno) from cte where value != 0)) else value end value,col2,col3 from cte").show(df.count.toInt,false)
Please let me know if you have any questions.

I added a new column with incremental id to your DF
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
filter DF to have non-zero values
val df_2 = df_1.filter("value != 0")
create a variable "limit" to limit first N row that we need and variable Nvar for the first non-zero value
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
create DF with a column with the same name ("value") with a condition
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Split row and flip colum value - pyspark

How could I best stplit a single row into 2 and flip the values of the src_* to dest_* and dest_* to src_* for the second row? |source_ip |destination_ip|src_port|dst_port| |192.168.0.1|10.0.0.1 |5000 |22 | into |ip |src_port|dst_port| |192.168.0.1|5000 |22 | |10.0.0.1 |22 |5000 |

Related

Pyspark filter where value is in another dataframe

Unsure how to apply row-wise normalization on pyspark dataframe

How to get the set of rows which contains null values from dataframe in scala using filter

Filter one data frame using other data frame in spark scala

Fill null values in dataframe column with next value

Categories

Resources