Parse Dataframe and store output in a single file [duplicate] - scala

This question already has an answer here: Spark split a column value into multiple rows (1 answer).
I have a data frame using Spark SQL in Scala with columns A and B with values:
A  B
1  a|b|c
2  b|d
3  d|e|f
I need to store the output to a single text file in the following format:
1 a
1 b
1 c
2 b
2 d
3 d
3 e
3 f
How can I do that?

You can get the desired DataFrame with an explode and a split:
val resultDF = df.withColumn("B", explode(split($"B", "\\|")))
Result
+---+---+
| A| B|
+---+---+
| 1| a|
| 1| b|
| 1| c|
| 2| b|
| 2| d|
| 3| d|
| 3| e|
| 3| f|
+---+---+
Then you can save it as a single text file with coalesce(1). Map each Row to a space-separated string first, otherwise saveAsTextFile writes the Row.toString form (e.g. [1,a]):
resultDF.coalesce(1).rdd.map(_.mkString(" ")).saveAsTextFile("desiredPath")

You can do something like this:
import org.apache.spark.sql.functions.{col, explode, split}
val df = ???
val resDF = df.withColumn("B", explode(split(col("B"), "\\|")))
resDF.coalesce(1).write.option("delimiter", " ").csv("path/to/file")
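For reference, here is a minimal end-to-end sketch that combines both steps. The sample data matches the question; the SparkSession setup and the output path /tmp/exploded-output are illustrative assumptions:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

val spark = SparkSession.builder().appName("explode-example").getOrCreate()
import spark.implicits._

// Sample data matching the question
val df = Seq((1, "a|b|c"), (2, "b|d"), (3, "d|e|f")).toDF("A", "B")

// Escape the pipe: split takes a regex, and | is a metacharacter
val resultDF = df.withColumn("B", explode(split(col("B"), "\\|")))

// coalesce(1) forces a single part file; the output path is illustrative
resultDF.coalesce(1)
  .write
  .option("delimiter", " ")
  .csv("/tmp/exploded-output")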

Related

Change value on duplicated rows using Pyspark, keeping the first record as is

How can I change the status column value on rows that contain duplicate records in specific columns, keeping the first one (the one with the lowest id) as A? For example:
logic:
if the account_id and user_id pair already exists, the status is E; the first record (lowest id) is A
if the user_id exists but the account_id is different, the status is I; the first record (lowest id) is A
input sample:
id  account_id  user_id
1   a           1
2   a           1
3   b           1
4   c           2
5   c           2
6   c           2
7   d           3
8   d           3
9   e           3
output sample:
id  account_id  user_id  status
1   a           1        A
2   a           1        E
3   b           1        I
4   c           2        A
5   c           2        E
6   c           2        E
7   d           3        A
8   d           3        E
9   e           3        I
I think I need to group into multiple datasets, join them back, compare, and change the values, but I think I'm overthinking it. Any help?
Thanks!
Two window functions will help you detect the duplicates and rank them.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
    # Distinguishes between "first occurrence" vs "2nd occurrence" and so on
    .withColumn('rank', F.rank().over(W.partitionBy('account_id', 'user_id').orderBy('id')))
    # Detecting if there is no duplication per pair of 'account_id' and 'user_id'
    .withColumn('count', F.count('*').over(W.partitionBy('account_id', 'user_id')))
    # Building status based on the conditions
    .withColumn('status', F
        .when(F.col('count') == 1, 'I')  # if there is only one record, status is 'I'
        .when(F.col('rank') == 1, 'A')   # if there is more than one record, the first occurrence is 'A'
        .otherwise('E')                  # finally, the other occurrences are 'E'
    )
    .orderBy('id')
    .show()
)
# Output
# +---+----------+-------+----+-----+------+
# | id|account_id|user_id|rank|count|status|
# +---+----------+-------+----+-----+------+
# | 1| a| 1| 1| 2| A|
# | 2| a| 1| 2| 2| E|
# | 3| b| 1| 1| 1| I|
# | 4| c| 2| 1| 3| A|
# | 5| c| 2| 2| 3| E|
# | 6| c| 2| 3| 3| E|
# | 7| d| 3| 1| 2| A|
# | 8| d| 3| 2| 2| E|
# | 9| e| 3| 1| 1| I|
# +---+----------+-------+----+-----+------+
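Since the rest of this page is Scala-oriented, here is a hedged Scala sketch of the same window logic, assuming a DataFrame df with the columns id, account_id and user_id from the samples above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, rank, when}

val byPair = Window.partitionBy("account_id", "user_id")

val result = df
  // first vs. later occurrence within each (account_id, user_id) pair
  .withColumn("rank", rank().over(byPair.orderBy("id")))
  // number of rows sharing the same (account_id, user_id) pair
  .withColumn("count", count(lit(1)).over(byPair))
  .withColumn("status",
    when(col("count") === 1, "I")    // only one record for the pair
      .when(col("rank") === 1, "A")  // first occurrence of a duplicated pair
      .otherwise("E"))               // remaining duplicates
  .orderBy("id")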

pyspark - how to add a column where value of new column is searched from the dataframe:

How do I add a column whose value is looked up from the dataframe itself?
eg.
A B newCol
1 a a
2 b null
3 c null
4 d b
5 e null
6 f null
7 g null
8 h null
9 i c
The value of newCol in this case is based on the square root of the value in A. It is a lookup in the current dataframe, though not in the same row.
pseudocode:
df[newCol] = df[sqrt(df[A])]
The sqrt is just an example; the lookup could be based on the value in column B or something else. I added the sqrt example to rule out lead/lag answers.
There may be no positional relationship between the current element and what is being looked up.
Instead of sqrt, which produces a float column, you can calculate the square of column A, create a lookup dataframe from it, and then join it back onto the original dataframe:
lookup = df.withColumn('A', (df.A ** 2).cast('int')).withColumnRenamed('B', 'newCol')
df.join(lookup, on=['A'], how='left').show()
+---+---+------+
| A| B|newCol|
+---+---+------+
| 7| g| null|
| 6| f| null|
| 9| i| c|
| 5| e| null|
| 1| a| a|
| 3| c| null|
| 8| h| null|
| 2| b| null|
| 4| d| b|
+---+---+------+
Or without type casting:
lookup = df.withColumn('A', df.A * df.A).withColumnRenamed('B', 'newCol')
df.join(lookup, on=['A'], how='left').show()
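If you need the same lookup in Scala, a hedged equivalent sketch (the column names A, B and newCol come from the example above; in Scala the integer multiplication already yields an integer, so no cast is required):
import org.apache.spark.sql.functions.col

// Build a lookup keyed on A squared and rename B to newCol,
// then left-join it back onto the original dataframe.
val lookup = df
  .withColumn("A", col("A") * col("A"))
  .withColumnRenamed("B", "newCol")

val result = df.join(lookup, Seq("A"), "left")
result.show()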

Split multiple array columns into rows

This is a question identical to
Pyspark: Split multiple array columns into rows
but I want to know how to do it in Scala.
For a dataframe like this:
+---+---------+---------+---+
| a| b| c| d|
+---+---------+---------+---+
| 1|[1, 2, 3]|[, 8, 9] |foo|
+---+---------+---------+---+
I want it in the following format:
+---+---+-------+------+
| a| b| c | d |
+---+---+-------+------+
| 1| 1| None | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+-------+------+
In Scala, I know there's an explode function, but I don't think it's applicable here.
I tried
import org.apache.spark.sql.functions.arrays_zip
but I get an error saying arrays_zip is not a member of org.apache.spark.sql.functions, although it is clearly listed as a function in https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html
The answer below might be helpful to you:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

val arrayData = Seq(Row(1, List(1, 2, 3), List(0, 8, 9), "foo"))
val arraySchema = new StructType()
  .add("a", IntegerType)
  .add("b", ArrayType(IntegerType))
  .add("c", ArrayType(IntegerType))
  .add("d", StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)

// Note: explode($"b", $"c") does not compile; explode takes a single column,
// so zip the two arrays first with a UDF and explode the result.
val zip = udf((x: Seq[Int], y: Seq[Int]) => x.zip(y))
df.withColumn("vars", explode(zip($"b", $"c")))
  .select($"a", $"d", $"vars._1".alias("b"), $"vars._2".alias("c"))
  .show()
/*
+---+---+---+---+
| a| d| b| c|
+---+---+---+---+
| 1|foo| 1| 0|
| 1|foo| 2| 8|
| 1|foo| 3| 9|
+---+---+---+---+
*/
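If you are on Spark 2.4 or later, arrays_zip can replace the UDF. A hedged sketch, assuming the zipped struct keeps the source column names b and c as its field names:
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// arrays_zip pairs up the i-th elements of b and c into an array of structs;
// exploding that array yields one row per pair.
val result = df
  .withColumn("vars", explode(arrays_zip(col("b"), col("c"))))
  .select(col("a"), col("vars.b").alias("b"), col("vars.c").alias("c"), col("d"))
result.show()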

spark withcolumn create a column duplicating values from existing column

I am having a problem figuring this out. Here is the problem statement:
Let's say I have a dataframe. I want to select the value of column C where column B's value is foo, then create a new column D and repeat that value "3" for all rows:
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
You will have to extract the column C value (i.e. 3) from the row with foo in column B:
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then use that value with withColumn to create a new column D via the lit function:
df.withColumn("D", lit(value)).show(false)
You should get your desired output.
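Putting it together, a minimal runnable sketch using the sample data from the question (it assumes a spark-shell style session where spark and spark.implicits._ are available for toDF):
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._  // requires an active SparkSession named spark

// Sample data matching the question (null is allowed for the String column)
val df = Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5))
  .toDF("A", "B", "C")

// Grab the single C value from the row where B is "foo" ...
val value = df.filter(col("B") === "foo").select("C").first().getInt(0)

// ... and repeat it on every row as the new column D
df.withColumn("D", lit(value)).show(false)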

Exploding pipe separated data in spark

I have a Spark dataframe (input_dataframe); the data in this dataframe looks like below:
id value
1 a
2 x|y|z
3 t|u
I want an output_dataframe with the pipe-separated fields exploded; it should look like below:
id value
1 a
2 x
2 y
2 z
3 t
3 u
Please help me achieve the desired solution using PySpark. Any help will be appreciated.
We can first split and then explode the value column using the built-in functions, as below:
>>> l=[(1,'a'),(2,'x|y|z'),(3,'t|u')]
>>> df = spark.createDataFrame(l,['id','val'])
>>> df.show()
+---+-----+
| id| val|
+---+-----+
| 1| a|
| 2|x|y|z|
| 3| t|u|
+---+-----+
>>> from pyspark.sql import functions as F
>>> df.select('id',F.explode(F.split(df.val,'[|]')).alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| x|
| 2| y|
| 2| z|
| 3| t|
| 3| u|
+---+-----+
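Since this page is otherwise Scala-oriented, here is a hedged Scala equivalent of the same split-then-explode, using the question's input_dataframe and java.util.regex.Pattern.quote as an alternative way of escaping the pipe:
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, explode, split}

// Same idea as the Scala answer at the top of this page;
// Pattern.quote("|") escapes the pipe so split treats it literally.
val output_dataframe = input_dataframe
  .withColumn("value", explode(split(col("value"), Pattern.quote("|"))))
output_dataframe.show()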