I am attempting to perform a collect on a dataframe, but my data contains commas, which malforms the output object and results in more columns than expected (split at each comma).
My dataframe contains the data:
col_a     |col_b
----------+-----
1,2,3,4,5 |1
2,3,4     |2
I then perform this:
val ct = configTable.collect()
ct.foreach(row => {
println(row(0))
})
Output is -
1
2
When it should be the strings -
1,2,3,4,5
2,3,4
How do I get the expected results?
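If configTable is loaded from a delimited text file, a likely cause is that the reader splits rows on the default comma delimiter instead of the pipe. A minimal sketch, assuming a pipe-delimited file (the path config.txt is hypothetical):
val configTable = spark.read
  .option("header", "true")  // first line holds col_a|col_b
  .option("sep", "|")        // split fields on the pipe, not the default comma
  .csv("config.txt")         // hypothetical path; point this at the real source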
There is a PySpark source dataframe with a column named X. Column X consists of values delimited by '-', and there can be any number of delimited values in that column.
Example of source dataframe given below:
X
A123-B345-C44656-D4423-E3445-F5667
X123-Y345
Z123-N345-T44656-M4423
X123
Now I need to split this column on the delimiter and pull exactly N=4 separate delimited values. If there are more than 4 delimited values, we need the first 4 and should discard the rest. If there are fewer than 4, we need to pick the existing ones and pad the rest with the empty string "".
Resulting output should be like below:
+----------------------------------+----+----+------+-----+
|X                                 |Col1|Col2|Col3  |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345                         |X123|Y345|      |     |
|Z123-N345-T44656-M4423            |Z123|N345|T44656|M4423|
|X123                              |X123|    |      |     |
+----------------------------------+----+----+------+-----+
I have accomplished this easily in Python with the code below, but I'm looking for a PySpark approach:
from itertools import chain, islice, repeat

def pad_infinite(iterable, padding=None):
    return chain(iterable, repeat(padding))

def pad(iterable, size, padding=None):
    return islice(pad_infinite(iterable, padding), size)

colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))
You can split the string into an array, separate the elements of the array into columns and then fill the null values with an empty string:
from pyspark.sql import functions as F

df = ...
df.withColumn("arr", F.split("X", "-")) \
    .selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
    .na.fill("") \
    .show(truncate=False)
Output:
+----------------------------------+----+----+------+-----+
|X |Col1|Col2|Col3 |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345 |X123|Y345| | |
|Z123-N345-T44656-M4423 |Z123|N345|T44656|M4423|
|X123 |X123| | | |
+----------------------------------+----+----+------+-----+
I need to add a new column to dataframe DF1, but the new column's value should be calculated using the values of other columns present in that DF. Which of the other columns are to be used is given in another dataframe DF2.
eg. DF1
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+------------+------------+
|Product1 | AB |testMethod1 | TP1 |
|Product2 | CD |testMethod2 | TP2 |
DF2-
|action|type |value                    |exploded    |
+------+-----+-------------------------+------------+
|append|hash |[protocolNo]             |protocolNo  |
|append|text |_                        |_           |
|append|hash |[serialNum,testProperty] |serialNum   |
|append|hash |[serialNum,testProperty] |testProperty|
Now the value of the exploded column in DF2 will be a column name of DF1 whenever the value of the type column is hash.
Required -
A new column should be created in DF1, and its value should be calculated as below:
hash[protocolNo]_hash[serialNumTestProperty], where in place of the column names their corresponding row values should be used.
e.g. for Row 1 of DF1, the column value should be
hash[Product1]_hash[ABTP1]
which will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using map and withColumn with a UDF on DF1. But inside the UDF the outer dataframe's values are not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
|protocolNo|serialNum|testMethod |testProperty| newColumn |
+----------+---------+------------+------------+----------------+
|Product1 | AB |testMethod1 | TP1 | abc-df_egh-4je |
|Product2 | CD |testMethod2 | TP2 | dfg-df_ijk-r56 |
The newColumn values shown are the result after hashing.
Instead of DF2, you can translate DF2 into a case class such as Spec, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can process the columns as below:
val transformed = specifications
  .foldLeft(dtFrm)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.types.foldLeft(df)((df: DataFrame, t: String) => {
    t match {
      case "append" =>
        // have a case match on spec.action here, then append the result with df.withColumn
        df
    }
  })
}
Syntax may not be correct
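As a rough illustration only, the hash action inside that branch might look like the hypothetical helper below, assuming Spark's built-in hash function is the intended hashing (appendHash is not part of the original answer):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash}

// Hypothetical helper: append a column holding the hash of the configured input columns.
def appendHash(spec: Spec)(df: DataFrame): DataFrame =
  df.withColumn(spec.columnName, hash(spec.inputColumns.map(col): _*))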
Since DF2 has the column names that will be used to calculate the new column from DF1, I have made the assumption that DF2 will not be a huge dataframe.
The first step would be to filter DF2 and get the column names that we want to pick from DF1.
import org.apache.spark.sql.functions._

val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
Now, hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array of Row. We need this to become a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f=>hash(col(f.getString(0)))).reduce(concat_ws("_",_,_))
The above line converts each Row to a Column with the hash function applied to it, and we reduce them while concatenating with _. Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
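Putting those pieces together, a minimal end-to-end sketch (the sample rows mirror the DF1 and DF2 shown in the question, spark is assumed to be the active SparkSession, and the actual hash values will differ from the abc-df_egh-4je style placeholders above):
import org.apache.spark.sql.functions._
import spark.implicits._

val DF1 = Seq(
  ("Product1", "AB", "testMethod1", "TP1"),
  ("Product2", "CD", "testMethod2", "TP2")
).toDF("protocolNo", "serialNum", "testMethod", "testProperty")

val DF2 = Seq(
  ("append", "hash", "[protocolNo]", "protocolNo"),
  ("append", "text", "_", "_"),
  ("append", "hash", "[serialNum,testProperty]", "serialNum"),
  ("append", "hash", "[serialNum,testProperty]", "testProperty")
).toDF("action", "type", "value", "exploded")

// Filter the hash rows, turn each referenced DF1 column into hash(col), and join them with "_".
val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))

DF1.withColumn("newColumn", newColumnHash).show(false)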
Hope this helps!
I have a table with two columns, one is an id and the other a value. My value column contains 1488 characters. I have to split this column into multiple rows with 12 characters each. Example:
Dataframe:
ID Value
1 123456789987653ABCDEFGHI
Expected output:
ID Value
1 123456789987
1 653ABCDEFGHI
How can this be done in Spark?
Create a UDF that splits a string into equal-length parts using grouped. Then use explode on the resulting list of strings to flatten it.
import org.apache.spark.sql.functions._
def splitOnLength(len: Int) = udf((str: String) => {
str.grouped(len).toSeq
})
df.withColumn("Value", explode(splitOnLength(12)($"Value")))
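For reference, a small usage sketch with the sample row from the question, with splitOnLength defined as above (assuming spark is the active SparkSession):
import spark.implicits._

val df = Seq((1, "123456789987653ABCDEFGHI")).toDF("ID", "Value")

// The 24-character value becomes the two 12-character rows shown in the question:
// (1, "123456789987") and (1, "653ABCDEFGHI")
df.withColumn("Value", explode(splitOnLength(12)($"Value"))).show(false)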
Every row in the dataframe contains a CSV-formatted string (line) plus another simple string, so what I'm trying to get in the end is a dataframe composed of the fields extracted from the line string together with the category.
So I proceeded as follows to explode the line string
val df = stream.toDF("line","category")
.map(x => x.getString(0))......
In the end I manage to get a new dataframe composed of the line fields, but I can't carry the category over to the new dataframe.
I can't join the new dataframe with the initial one since the common field id was not a separate column at first.
Sample of input :
line | category
"'1';'daniel';'dan#gmail.com'" | "premium"
Sample of output:
id | name | email | category
1 | "daniel"| "dan#gmail.com"| "premium"
Any suggestions, thanks in advance.
If the structure of the strings in the line column is fixed as mentioned in the question, then the following simple solution should work: the split built-in function splits the string into an array, and then the elements of the array are selected and aliased to get the final dataframe.
import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
.select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
.show(false)
which should give you
+---+--------+---------------+--------+
|id |name |email |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'dan#gmail.com'|premium |
+---+--------+---------------+--------+
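If the surrounding single quotes are not wanted (the desired output in the question shows 1 rather than '1'), they could be stripped first with regexp_replace, for example:
df.withColumn("line", split(regexp_replace(col("line"), "'", ""), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)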
I hope the answer is helpful
Using the statement:
scala> val intento2 = sql("SELECT _CreationDate FROM tablaTemporal" )
intento2: org.apache.spark.sql.DataFrame = [_CreationDate: string]
scala> intento2.show(5, false)
I receive this output:
+-----------------------+
|_CreationDate |
+-----------------------+
|2008-07-31T00:00:00.000|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.317|
+-----------------------+
only showing top 5 rows
but the result I need is the same data without the symbols added by Scala/Spark:
2005-07-31T14:20:19.239
2007-07-31T14:20:31.287
2009-07-31T14:21:33.287
2005-07-31T14:23:36.287
2009-07-31T14:20:38.317
How can I print a clean output like the above?
Here, you're printing the dataframe.
What you want to do is print each record of the dataframe:
intento2.collect().map(_.getString(0)).foreach(println)
collect transforms the dataframe into an array of Row objects.
Then we map each Row to its first element with getString(0); in fact each Row contains only one element, the date.
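Alternatively, since the dataframe has a single string column, it could be collected as a Dataset of strings (assuming spark is the active SparkSession):
import spark.implicits._

intento2.as[String].collect().foreach(println)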