creating data-frame from pipe & comma delimited file - pyspark

I am trying to create a data frame from a data feed which has the following format:
ABC,13:10,23| PQR,01:20,2| XYZ,07:30,14
BCD,11:40,13| ABC,05:50,9| RST,17:20,5
Each line is pipe delimited and contains a batch of 3 records, and each record consists of 3 comma-separated sub-records.
I intend to have each sub-record as a column and each record as one row of the data frame, so the above would result in 3 columns and 6 rows:
col1 col2 col3
ABC 13:10 23
PQR 01:20 2

# this explodes each line on "|", giving a single column of comma-separated strings
from pyspark.sql.functions import split, explode
df = spark.read.text("/path/to/data.csv")
df.select(explode(split(df["value"], r"\|"))).show()
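One way to take this the rest of the way (a sketch, not a posted answer; it assumes the same file path and an active SparkSession named spark): explode each line on the pipe, then split each exploded record on the comma into the three columns.

from pyspark.sql.functions import split, explode, trim, col

df = spark.read.text("/path/to/data.csv")

# one row per pipe-delimited record
records = df.select(explode(split(col("value"), r"\|")).alias("record"))

# split each trimmed record on the comma into three columns
parts = split(trim(col("record")), ",")
result = records.select(
    parts.getItem(0).alias("col1"),
    parts.getItem(1).alias("col2"),
    parts.getItem(2).alias("col3"),
)
result.show()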

Related

splitting string column into multiple columns based on key value item using spark scala

I have a dataframe where one column contains several pieces of information in a 'key=value' format.
There are almost 30 different 'key=value' pairs that can appear in that column; I will use 4 of them here
for illustration (_age, _city, _sal, _tag).
id name properties
0 A {_age=10, _city=A, _sal=1000}
1 B {_age=20, _city=B, _sal=3000, tag=XYZ}
2 C {_city=BC, tag=ABC}
How can I convert this string column into multiple columns?
I need to use a Spark Scala dataframe for it.
The expected output is:
id  name  _age  _city  _sal  tag
0   A     10    A      1000
1   B     20    B      3000  XYZ
2   C           BC           ABC
Short answer (this works when properties is already a struct column):
df
  .select(
    col("id"),
    col("name"),
    col("properties.*"),
    ..
  )
Try this:
val s = df.withColumn(
  "dummy",
  explode(split(regexp_replace($"properties", "\\{|\\}", ""), ","))
)
val result = s
  .drop("properties")
  .withColumn("col1", split($"dummy", "=")(0))
  .withColumn("col1-value", split($"dummy", "=")(1))
  .drop("dummy")
result.groupBy("id", "name").pivot("col1").agg(first($"col1-value")).orderBy($"id").show
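Since the parent question on this page is tagged pyspark, here is a rough PySpark sketch of the same explode-and-pivot idea (an adaptation, not part of the original answer; it assumes df has the string column properties plus id and name):

from pyspark.sql import functions as F

exploded = (
    df.withColumn("kv", F.explode(F.split(F.regexp_replace(F.col("properties"), "[{}]", ""), ",")))
      .withColumn("key", F.trim(F.split(F.col("kv"), "=")[0]))
      .withColumn("value", F.trim(F.split(F.col("kv"), "=")[1]))
)

result = (
    exploded.groupBy("id", "name")
            .pivot("key")
            .agg(F.first("value"))
            .orderBy("id")
)
result.show()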

Pyspark dataframe split and pad delimited column value into Array of N index

There is a PySpark source dataframe with a column named X. Column X consists of values delimited by '-', and there can be any number of delimited values in that column.
Example of source dataframe given below:
X
A123-B345-C44656-D4423-E3445-F5667
X123-Y345
Z123-N345-T44656-M4423
X123
Now I need to split this column on the delimiter and pull exactly N=4 separate delimited values. If there are more than 4 delimited values, we need the first 4 and discard the rest. If there are fewer than 4, we need to keep the existing ones and pad the rest with the empty string "".
Resulting output should be like below:
X                                   Col1  Col2  Col3    Col4
A123-B345-C44656-D4423-E3445-F5667  A123  B345  C44656  D4423
X123-Y345                           X123  Y345
Z123-N345-T44656-M4423              Z123  N345  T44656  M4423
X123                                X123
I have accomplished this easily in plain Python with the code below, but I am looking for a PySpark approach:
from itertools import chain, islice, repeat

def pad_infinite(iterable, padding=None):
    return chain(iterable, repeat(padding))

def pad(iterable, size, padding=None):
    return islice(pad_infinite(iterable, padding), size)

colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))
You can split the string into an array, separate the elements of the array into columns and then fill the null values with an empty string:
from pyspark.sql import functions as F

df = ...
df.withColumn("arr", F.split("X", "-")) \
  .selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
  .na.fill("") \
  .show(truncate=False)
Output:
+----------------------------------+----+----+------+-----+
|X |Col1|Col2|Col3 |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345 |X123|Y345| | |
|Z123-N345-T44656-M4423 |Z123|N345|T44656|M4423|
|X123 |X123| | | |
+----------------------------------+----+----+------+-----+
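As a side note, if N ever differs from 4, a small comprehension keeps the same idea generic (a sketch, not part of the original answer, reusing the df from the answer above):

from pyspark.sql import functions as F

N = 4  # number of target columns
arr = F.split("X", "-")
# out-of-range indexes yield null, which coalesce turns into ""
cols = [F.coalesce(arr.getItem(i), F.lit("")).alias(f"Col{i + 1}") for i in range(N)]
df.select("X", *cols).show(truncate=False)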

spark dataframe collect malforming data that contains commas

I am attempting to perform a collect on a dataframe, however my data contains commas and the output object is malformed, resulting in more columns than expected (it is split at each comma).
My dataframe contains the data:
col_a |col_b
------------------
1,2,3,4,5|1
2,3,4 |2
I then perform this:
val ct = configTable.collect()
ct.foreach(row => {
println(row(0))
})
Output is -
1
2
When it should be the string -
1,2,3,4,5
2,3,4
How do I get the expected results?
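A likely cause (an assumption, not stated in the question) is that the pipe-delimited file was read with Spark's default comma separator, so col_a was already split apart at read time. A PySpark sketch of reading the file with an explicit separator (the path and header option here are hypothetical):

config_table = (
    spark.read
         .option("sep", "|")        # treat "|" as the column separator, not ","
         .option("header", True)    # assuming the file has a header row
         .csv("/path/to/config_table.csv")   # hypothetical path
)

for row in config_table.collect():
    print(row[0])   # prints the full strings "1,2,3,4,5" and "2,3,4"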

spark Group By data-frame columns without aggregation [duplicate]

This question already has answers here:
How to aggregate values into collection after groupBy?
(3 answers)
Closed 4 years ago.
I have a csv file in HDFS (/hdfs/test.csv). I would like to group the data below using Spark and Scala.
I want to group the A1...AN columns based on the A1 column, so that all the rows are grouped as below.
OUTPUT:
JACK , ABCD, ARRAY("0,1,0,1", "2,9,2,9")
JACK , LMN, ARRAY("0,1,0,3", "0,4,3,T")
JACK, HBC, ARRAY("1,T,5,21", "E7,4W,5,8")
Input:
name  A1    A1  A2  A3..AN
--------------------------------
JACK  ABCD  0   1   0   1
JACK  LMN   0   1   0   3
JACK  ABCD  2   9   2   9
JAC   HBC   1   T   5   21
JACK  LMN   0   4   3   T
JACK  HBC   E7  4W  5   8
You can achieve this by having the columns as an array.
import org.apache.spark.sql.functions.{collect_set, concat_ws, array, col}
val aCols = 1.to(250).map(x => col(s"A$x"))
val concatCol = concat_ws(",", array(aCols: _*))

val groupedDf = df.withColumn("aConcat", concatCol)
  .groupBy("name", "A")
  .agg(collect_set("aConcat"))
If you're okay with duplicates you can also use collect_list instead of collect_set.
Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1.
If you load the data into a DataFrame, you can do this to achieve the output specified:
import org.apache.spark.sql.functions.{collect_set, concat_ws}
val grouped = someDF
.groupBy($"name", $"A")
.agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
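For readers coming from the pyspark-tagged parent question, a rough PySpark equivalent of the collect_set answer above (a sketch; someDF is assumed to be already loaded, with the group column named A and the value columns A1..A4):

from pyspark.sql import functions as F

grouped = (
    someDF.groupBy("name", "A")
          .agg(F.collect_set(F.concat_ws(",", "A1", "A2", "A3", "A4")).alias("grouped"))
)
grouped.show(truncate=False)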

Removing blank rows or rows with half blank and half null in Spark

I have a Spark dataframe like this:
id name address
1 xyz nc
null
..blank line....
3 pqr stw
I need to remove rows 2 and 3 from the dataframe and need the following output:
id name address
1 xyz nc
3 pqr stw
I have tried using
df1.filter(($"id" =!= "") && ($"id".isNotNull)).filter(($"name" =!= "") && ($"name".isNotNull))
But this way I need to repeat it for every single column, iterating column by column. Is there a way to do it at the row level instead of iterating over the columns?
You can use the following logic:
import scala.collection.mutable
import org.apache.spark.sql.functions._

val filterEmpty = udf((cols: mutable.WrappedArray[String]) => cols.map(_.equals("")).contains(true))

df.na.fill("").filter(filterEmpty(array("id", "name", "address")) =!= true).show(false)
where filterEmpty is a udf that returns true if any of the columns contains an empty value.
na.fill("") replaces all null values with empty strings in the dataframe,
and the filter then drops the unwanted rows.
I hope the answer is helpful
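A UDF-free way to express the same row-level check in PySpark (a sketch, not from the thread; it treats null and "" the same and keeps only rows where every listed column is non-empty):

from functools import reduce
from pyspark.sql import functions as F

cols = ["id", "name", "address"]

# build one boolean expression: every column, with nulls coalesced to "", must be non-empty
non_empty = reduce(
    lambda acc, c: acc & (F.coalesce(F.col(c), F.lit("")) != ""),
    cols,
    F.lit(True),
)

df1.filter(non_empty).show(truncate=False)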