How to parse column (with list data) within a DataFrame? - scala

There is a column in a DataFrame that contains a list and I want to parse that list for the first element and replace that column with it. So for example:
col1
[elem1, elem2]
[elem3, elem4]
I want to make this:
col1
elem1
elem3
I've tried dataFrameName.withColumn("col1", explode($"col1")) but it gives me a NoSuchElementException. What's the right way to do this?

To replace the ArrayType column col1 with its first element, explode would not be useful. You can simply replace it with $"col1"(0) (or $"col1".getItem(0)), as shown below:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
Seq("elem1", "elem2"),
Seq("elem3", "elem4")
).toDF("col1")
df.withColumn("col1", $"col1"(0)).show
// +-----+
// | col1|
// +-----+
// |elem1|
// |elem3|
// +-----+
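Equivalently, getItem (mentioned above) does the same thing, and on Spark 2.4+ the element_at function (which uses a 1-based index) is another option; a small variant of the snippet above:
df.withColumn("col1", $"col1".getItem(0)).show
// or, on Spark 2.4+, element_at with a 1-based index:
df.withColumn("col1", element_at($"col1", 1)).show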
Note that you may have a separate issue with the encountered NoSuchElementException, as explode-ing an ArrayType column normally wouldn't generate such an exception.

Related

PySpark input_file_name() into a variable NOT df

I want to store the value from input_file_name() into a variable instead of a dataframe. This variable will then be used for logging and troubleshooting, etc.
You can create a new column on the data frame using withColumn and input_file_name(), and then use the collect() operation, something like below:
df = spark.read.csv("/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv")
df.show()
+-----+
| _c0|
+-----+
|43368|
+-----+
from pyspark.sql.functions import *
df1 = df.withColumn("file_name", input_file_name())
df1.show(truncate=False)
+-----+---------------------------------------------------------------------------------------------------------+
|_c0 |file_name |
+-----+---------------------------------------------------------------------------------------------------------+
|43368|dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv|
+-----+---------------------------------------------------------------------------------------------------------+
Now, create a variable with the file name using collect and then split it on /:
file_name = df1.collect()[0][1].split("/")[3]
print(file_name)
Output
part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv
Please note that in your case the indexes used for collect as well as after the split might differ.

How to pivot/transpose rows of a column into individual columns in spark-scala without using the pivot method

Please check the image below for a reference to my use case.
You can get the same result without using pivot by adding the columns manually, if you know all the names of the new columns:
import org.apache.spark.sql.functions.{col, when}
dataframe
.withColumn("cheque", when(col("ttype") === "cheque", col("tamt")))
.withColumn("draft", when(col("ttype") === "draft", col("tamt")))
.drop("tamt", "ttype")
As this solution does not trigger a shuffle, your processing will be faster than using pivot.
It can be generalized if you don't know the names of the columns. However, in this case you should benchmark to check whether pivot is more performant:
import org.apache.spark.sql.functions.{col, when}
val newColumnNames = dataframe.select("ttype").distinct.collect().map(_.getString(0))
newColumnNames
.foldLeft(dataframe)((df, columnName) => {
df.withColumn(columnName, when(col("ttype") === columnName, col("tamt")))
})
.drop("tamt", "ttype")
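Note that, without a groupBy, this generalized version keeps one row per input row, with nulls in the non-matching columns. If you need a single row per tdate (as pivot would give), one possible follow-up, assuming the result above is assigned to a val named wide, is to aggregate the generated columns; this does reintroduce a shuffle, so benchmark it against pivot:
import org.apache.spark.sql.functions.first
// keep the first non-null value of each generated column, per tdate
val aggs = newColumnNames.map(c => first(col(c), ignoreNulls = true).as(c))
wide.groupBy("tdate").agg(aggs.head, aggs.tail: _*)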
Use the groupBy, pivot & agg functions. Check the code below; inline comments added.
scala> df.show(false)
+----------+------+----+
|tdate |ttype |tamt|
+----------+------+----+
|2020-10-15|draft |5000|
|2020-10-18|cheque|7000|
+----------+------+----+
scala> df
.groupBy($"tdate") // Grouping data based on tdate column.
.pivot("ttype", Seq("cheque", "draft")) // pivot on ttype; "cheque" and "draft" become the new column names
.agg(first("tamt")) // take the first "tamt" value for each group
.show(false)
+----------+------+-----+
|tdate |cheque|draft|
+----------+------+-----+
|2020-10-18|7000 |null |
|2020-10-15|null |5000 |
+----------+------+-----+

check data size spark dataframes

I have the following question:
Actually I am working with the following csv file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe following the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
val fields = line.toString.split(";")
fields(0).size
fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types, I don't have any idea about how to implement it as we are using dataframes. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but since there are multiple double quotes you can read the file as a textFile, remove them, and convert it to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file")
  .map(_.replaceAll("\"", ""))   // strip all the double quotes
val header = raw.first           // the first line holds the column names
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn and the length function, and adjust as you need:
data.withColumn("jobSize", length($"job"))
.withColumn("maritalSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the columns are of type String.
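If you also need to verify the inferred types programmatically, one simple option (using only the DataFrame's own schema information; just a quick sketch) is:
data.dtypes.foreach { case (name, dataType) => println(s"$name: $dataType") }
// or, keeping the full schema objects:
data.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))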
Hope this helps!
You are using a DataFrame, so when you use the map method you are processing a Row in your lambda.
So line is a Row.
Row.toString will return a string representing the Row, in your case two StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getAs[String] or getString.
Usually when you use DataFrames, you have to work with column logic as in SQL, using select, where..., or directly the SQL syntax.
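As a minimal sketch of that map-based approach (assuming the columns end up named job and marital, for example after the quote cleanup shown in the other answer):
import spark.implicits._
val lengths = data.map { row =>
  (row.getAs[String]("job").length, row.getAs[String]("marital").length)
}
lengths.show(false)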

Process all columns / the entire row in a Spark UDF

For a dataframe containing a mix of string and numeric datatypes, the goal is to create a new features column that is a minhash of all of them.
While this could be done by performing a dataframe.toRDD it is expensive to do that when the next step will be to simply convert the RDD back to a dataframe.
So is there a way to do a udf along the following lines:
val wholeRowUdf = udf( (row: Row) => computeHash(row))
Row is not a spark sql datatype of course - so this would not work as shown.
Update/clarification: I realize it is easy to create a full-row UDF that runs inside withColumn. What is not so clear is what can be used inside a spark sql statement:
val featurizedDf = spark.sql("select wholeRowUdf( what goes here? ) as features from mytable")
Row is not a spark sql datatype of course - so this would not work as shown.
I am going to show that you can use Row to pass all the columns, or selected columns, to a udf function using the struct inbuilt function.
First I define a dataframe:
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")
).toDF("col1", "col2", "col3")
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |a |b |c |
// |a1 |b1 |c1 |
// +----+----+----+
Then I define a function that joins all the elements of a row into one string separated by , (standing in for your computeHash function):
import org.apache.spark.sql.Row
def concatFunc(row: Row) = row.mkString(", ")
Then I use it in a udf function:
import org.apache.spark.sql.functions._
def combineUdf = udf((row: Row) => concatFunc(row))
Finally I call the udf using withColumn and the struct inbuilt function, which combines the selected columns into one column that is passed to the udf:
df.withColumn("concatenated", combineUdf(struct(col("col1"), col("col2"), col("col3")))).show(false)
// +----+----+----+------------+
// |col1|col2|col3|concatenated|
// +----+----+----+------------+
// |a |b |c |a, b, c |
// |a1 |b1 |c1 |a1, b1, c1 |
// +----+----+----+------------+
So you can see that Row can be used to pass the whole row as an argument.
You can even pass all the columns of a row at once:
val columns = df.columns
df.withColumn("concatenated", combineUdf(struct(columns.map(col): _*)))
Updated
You can achieve the same with SQL queries too; you just need to register the udf function:
df.createOrReplaceTempView("tempview")
sqlContext.udf.register("combineUdf", combineUdf)
sqlContext.sql("select *, combineUdf(struct(`col1`, `col2`, `col3`)) as concatenated from tempview")
It will give you the same result as above
Now, if you don't want to hardcode the column names, you can select the columns you want and build the string dynamically:
val columns = df.columns.map(x => "`"+x+"`").mkString(",")
sqlContext.sql(s"select *, combineUdf(struct(${columns})) as concatenated from tempview")
I hope the answer is helpful
I came up with a workaround: drop the column names into any existing spark sql function to generate a new output column:
concat(${df.columns.tail.mkString(",'-',")}) as Features
In this case the first column in the dataframe is a target and was excluded. That is another advantage of this approach: the actual list of columns may be manipulated.
This approach avoids unnecessary restructuring of the RDD/dataframes.
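For completeness, a hedged sketch of how that fragment could be wired into a full query (the mytable view name and the target-first column layout are assumptions carried over from the question):
df.createOrReplaceTempView("mytable")
val cols = df.columns.tail.mkString(",'-',")   // e.g. col2,'-',col3
val featurizedDf = spark.sql(s"select concat($cols) as features from mytable")
featurizedDf.show(false)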

Create new DataFrame with new rows depending in number of a column - Spark Scala

I have a DataFrame with the following data:
num_cta | n_lines
110000000000| 2
110100000000| 3
110200000000| 1
With that information, I need to create a new DF with a different number of rows depending on the value in the n_lines column.
For example, for the first row of my DF (110000000000), the value of the n_lines column is 2. The result would have to be something like the following:
num_cta
110000000000
110000000000
For the whole example DataFrame shown above, the result would have to be something like this:
num_cta
110000000000
110000000000
110100000000
110100000000
110100000000
110200000000
Is there a way to do that, i.e. multiply a row n times depending on the value of a column?
Regards.
One approach would be to expand n_lines into an array with a UDF and explode it:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, udf}
val df = Seq(
("110000000000", 2),
("110100000000", 3),
("110200000000", 1)
).toDF("num_cta", "n_lines")
def fillArr = udf(
(n: Int) => Array.fill(n)(1)
)
val df2 = df.withColumn("arr", fillArr($"n_lines")).
withColumn("a", explode($"arr")).
select($"num_cta")
df2.show
+------------+
| num_cta|
+------------+
|110000000000|
|110000000000|
|110100000000|
|110100000000|
|110100000000|
|110200000000|
+------------+
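If you are on Spark 2.4 or later, a built-in alternative to the UDF (not part of the original answer, just a hedged variant) is array_repeat, which avoids defining fillArr altogether:
import org.apache.spark.sql.functions.{array_repeat, explode}
df.select(explode(array_repeat($"num_cta", $"n_lines")).as("num_cta")).show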
There is no off-the-shelf way of doing this. However, you can iterate over the dataframe and return a list of num_cta values where the number of elements is equal to the corresponding n_lines.
Something like:
import spark.implicits._
case class AccountOut(num_cta: String)              // output dataframe schema (class name is a placeholder)
case class AccountIn(num_cta: String, n_lines: Int) // input dataframe 'df' schema (class name is a placeholder)
val result = df.as[AccountIn].flatMap(x =>
List.fill(x.n_lines)(AccountOut(x.num_cta))
).toDF