Create DataFrame from a tuple list with dynamic schema in PySpark

I am trying to create a DataFrame with a dynamic schema from a tuple list in PySpark.
Here is my code for building the tuple list:
outputlist = []
for row in df2.collect():
    tmpList = row
    temptuple = ()
    id = tmpList[0]
    temptuple = temptuple + (id,)
    print(id)
    for val in range(1, len(tmpList)):
        if tmpList[val] is None:
            break
        else:
            value = tmpList[val]
            index = val
            if index > 1:
                index = 1
            temptuple = temptuple + (value,)
            temptuple = temptuple + (index,)
    outputlist.append(temptuple)
print(outputlist)
[('44038:4132', '324772', 1), ('44038:4291', '772122995105', 1, '477212299170', 1)]
Up to here it is okay. Now I have to create a DataFrame with a dynamic schema using the values above.
For example, when the DataFrame is built from the first tuple, the value 324772 should come out as a field (column) name.
And when the DataFrame is built from the second tuple, the values 772122995105 and 477212299170 should come out as the field names, and so on.

See the following code:
from pyspark.sql import Row

tuples = [('44038:4132', '324772', 1), ('44038:4291', '772122995105', 1, '477212299170', 1)]
for tup in tuples:
    id = tup[0]
    tmp_tuple = tup[1:]
    cols = {}
    # pair up each (column name, value) that follows the id
    for i in range(int(len(tmp_tuple) / 2)):
        j = i * 2
        cols[tmp_tuple[j]] = tmp_tuple[j + 1]
    tmp_dict = {
        "id": id,
        **cols
    }
    cols_keys = cols.keys()
    df = spark.createDataFrame([Row(**tmp_dict)])
    df = df.select("id", *cols_keys)
    df.show()
Here is the sample output:
+----------+------+
| id|324772|
+----------+------+
|44038:4132| 1|
+----------+------+
+----------+------------+------------+
| id|772122995105|477212299170|
+----------+------------+------------+
|44038:4291| 1| 1|
+----------+------------+------------+

Related

Is it possible to register a string as a UDF?

In Spark (Scala), after the application jar is submitted to Spark, is it possible for the jar to fetch several strings from a database table, convert each string to a Catalyst Expression, then convert that expression to a UDF, use that UDF to filter rows in another DataFrame, and finally union the results of each UDF?
(The expression needs some or all columns of the DataFrame, but which columns are needed is unknown at the time the code of the jar is written; the schema of the DataFrame is known at development time.)
An example:
expression 1: "id == 1"
expression 2: "name == \"andy\""
DataFrame:
row 1: id = 1, name = "red", age = null
row 2: id = 2, name = "andy", age = 20
row 3: id = 3, name = "juliet", age = 21
the final result should be the first two rows
Note: it is not acceptable to simply concatenate the two expressions with an or, because I need to track which expression produced each result row.
Edited: filter for each argument and unionAll the results.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("test1.csv")
val args = Array("id == 1", "name == \"andy\"")
val filters = args.zipWithIndex
var dfs = Array[DataFrame]()
filters.foreach {
  case (filter, index) =>
    val tempDf = df.filter(filter).withColumn("index", lit(index))
    dfs = dfs :+ tempDf
}
val resultDF = dfs.reduce(_ unionAll _)
resultDF.show(false)
+---+----+----+-----+
|id |name|age |index|
+---+----+----+-----+
|1 |red |null|0 |
|2 |andy|20 |1 |
+---+----+----+-----+
Original: why not just put the string into the filter?
val df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
val condition = "id == 1 or name == \"andy\""
df.filter(condition).show(false)
+---+----+----+
|id |name|age |
+---+----+----+
|1 |red |null|
|2 |andy|20 |
+---+----+----+
Have I missed something?
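As a side note, since the question asks about turning the strings into expressions: Spark can parse a SQL predicate string into a Column with org.apache.spark.sql.functions.expr, so the same loop can also be written against Column objects. A minimal sketch, reusing df and the same predicate strings (the predicates and tagged names are just illustrative, not from the original post):
import org.apache.spark.sql.functions.{expr, lit}

// each predicate string is parsed into a Column with expr(),
// and the index of the matching predicate is carried along, as in the answer above
val predicates = Array("id == 1", "name == \"andy\"")
val tagged = predicates.zipWithIndex.map { case (p, i) =>
  df.filter(expr(p)).withColumn("index", lit(i))
}
val result = tagged.reduce(_ union _)
result.show(false)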

How can I check for empty values on a Spark DataFrame using user-defined functions

I have this user-defined function to check whether the text rows are empty:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "", "Mongo"),
  (1, "World", "sql"),
  (2, "", "")
).toDF("id", "text", "Source")

// Define a "regular" Scala function
val checkEmpty: String => Boolean = x => {
  var test = false
  if (x.isEmpty) {
    test = true
  }
  test
}

val upper = udf(checkEmpty)
df.withColumn("isEmpty", upper('text)).show
I'm actually getting this dataframe:
+---+-----+------+-------+
| id| text|Source|isEmpty|
+---+-----+------+-------+
| 0| | Mongo| true|
| 1|World| sql| false|
| 2| | | true|
+---+-----+------+-------+
How could I check all the rows for empty values and return a message like:
id 0 has the text column with empty values
id 2 has the text,source column with empty values
A UDF that receives the nullable columns as a Row can be used to get the names of the empty columns. Then rows that have no empty columns can be filtered out:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{format_string, size, struct, udf}

val emptyColumnList = (r: Row) => r
  .toSeq
  .zipWithIndex
  .filter(_._1.toString().isEmpty)
  .map(pair => r.schema.fields(pair._2).name)

val emptyColumnListUDF = udf(emptyColumnList)

val columnsToCheck = Seq($"text", $"Source")
val result = df
  .withColumn("EmptyColumns", emptyColumnListUDF(struct(columnsToCheck: _*)))
  .where(size($"EmptyColumns") > 0)
  .select(format_string("id %s has the %s columns with empty values", $"id", $"EmptyColumns").alias("description"))
Result:
+----------------------------------------------------+
|description |
+----------------------------------------------------+
|id 0 has the [text] columns with empty values |
|id 2 has the [text,Source] columns with empty values|
+----------------------------------------------------+
You could do something like this:
case class IsEmptyRow(id: Int, description: String) // case class defining the output columns

val isEmptyDf = df.map { row =>
  row.getInt(row.fieldIndex("id")) -> row // the id of the row becomes the first element of the pair
    .toSeq                                // for the second element, turn all row values into a Seq
    .zip(df.columns)                      // zip it with the column names
    .collect {                            // keep the column names whose value is an empty string
      case (value: String, column) if value.isEmpty => column
    }
}.map { // then we create the description string and pack the results into the case class
  case (id, Nil) => IsEmptyRow(id, s"id $id has no columns with empty values")
  case (id, List(column)) => IsEmptyRow(id, s"id $id has the $column column with empty values")
  case (id, columns) => IsEmptyRow(id, s"id $id has the ${columns.mkString(", ")} columns with empty values")
}
Then running isEmptyDf.show(truncate = false) will show:
+---+---------------------------------------------------+
|id |description |
+---+---------------------------------------------------+
|0 |id 0 has the text columns with empty values |
|1 |id 1 has no columns with empty values |
|2 |id 2 has the text, Source columns with empty values|
+---+---------------------------------------------------+
You can also join back with original dataset:
df.join(isEmptyDf, "id").show(truncate = false)

Calculating edit distance on successive rows of a Spark DataFrame

I have a data frame as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
// some data...
val df = Seq(
  (1, "AA", "BB", ("AA", "BB")),
  (2, "AA", "BB", ("AA", "BB")),
  (3, "AB", "BB", ("AB", "BB"))
).toDF("id", "name", "surname", "array")
df.show()
and I am looking to calculate the edit distance between the 'array' column in successive rows. As an example, I want to calculate the edit distance between the 'array' entry in row 1 ("AA", "BB") and the 'array' entry in row 2 ("AA", "BB"). Here is the edit distance function I am using:
def editDist2[A](a: Iterable[A], b: Iterable[A]): Int = {
  val startRow = (0 to b.size).toList
  a.foldLeft(startRow) { (prevRow, aElem) =>
    (prevRow.zip(prevRow.tail).zip(b)).scanLeft(prevRow.head + 1) {
      case (left, ((diag, up), bElem)) => {
        val aGapScore = up + 1
        val bGapScore = left + 1
        val matchScore = diag + (if (aElem == bElem) 0 else 1)
        List(aGapScore, bGapScore, matchScore).min
      }
    }
  }.last
}
I know I need to create a UDF for this function, but I can't seem to manage it. If I use the function as is, using Spark windowing to get at the previous row:
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
i get the following error:
<console>:245: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Iterable[?]
df.withColumn("edit-d", editDist2(($"array"), lag("array", 1).over(window))).show()
I figured out you can use Spark's own levenshtein function for this. This function takes two strings to compare, so it can't be used with the array column directly.
// creating window - ordered by ID
val window = Window.orderBy("id")
// using the window with lag function to compare to previous value in each column
df.withColumn("edit-d", levenshtein(($"name"), lag("name", 1).over(window)) + levenshtein(($"surname"), lag("surname", 1).over(window))).show()
giving the desired output:
+---+----+-------+--------+------+
| id|name|surname| array|edit-d|
+---+----+-------+--------+------+
| 1| AA| BB|[AA, BB]| null|
| 2| AA| BB|[AA, BB]| 0|
| 3| AB| BB|[AB, BB]| 1|
+---+----+-------+--------+------+
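For completeness, the custom editDist2 can also be wrapped in a UDF if the pair of strings is packed into a real array column instead of a tuple/struct. A rough sketch, reusing df and editDist2 from above (the arr column name is just illustrative, not from the original post):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, col, lag, udf}

// null-safe wrapper: the lagged value is null on the first row
val editDistUDF = udf((a: Seq[String], b: Seq[String]) =>
  if (a == null || b == null) None else Some(editDist2(a, b))
)
val window = Window.orderBy("id")
df.withColumn("arr", array(col("name"), col("surname")))
  .withColumn("edit-d", editDistUDF(col("arr"), lag("arr", 1).over(window)))
  .show()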

Iterate rows and columns in Spark dataframe

I have the following Spark dataframe that is created dynamically:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sf1 = StructField("name", StringType, nullable = true)
val sf2 = StructField("sector", StringType, nullable = true)
val sf3 = StructField("age", IntegerType, nullable = true)
val fields = List(sf1, sf2, sf3)
val schema = StructType(fields)
val row1 = Row("Andy","aaa",20)
val row2 = Row("Berta","bbb",30)
val row3 = Row("Joe","ccc",40)
val data = Seq(row1,row2,row3)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
Now, I need to iterate over each row and column in sqlDF to print each column. This is my attempt:
sqlDF.foreach { row =>
  row.foreach { col => println(col) }
}
row is of type Row, but it is not iterable, which is why this code throws a compilation error on row.foreach. How can I iterate over each column in a Row?
Consider that you have a DataFrame like the one below:
+-----+------+---+
| name|sector|age|
+-----+------+---+
| Andy| aaa| 20|
|Berta| bbb| 30|
| Joe| ccc| 40|
+-----+------+---+
To loop over your DataFrame and extract its elements, you can choose one of the approaches below.
Approach 1 - Loop using foreach
Looping over a DataFrame directly using a foreach loop is not possible. To do this, first you have to define the schema of the DataFrame using a case class, and then you have to apply this schema to the DataFrame.
import spark.implicits._
import org.apache.spark.sql._
case class cls_Employee(name:String, sector:String, age:Int)
val df = Seq(cls_Employee("Andy","aaa", 20), cls_Employee("Berta","bbb", 30), cls_Employee("Joe","ccc", 40)).toDF()
df.as[cls_Employee].take(df.count.toInt).foreach(t => println(s"name=${t.name},sector=${t.sector},age=${t.age}"))
The result prints one line per row, for example: name=Andy,sector=aaa,age=20
Approach 2 - Loop using rdd
Use rdd.collect on top of your DataFrame. The row variable will contain each row of the DataFrame as an rdd Row type. To get each element from a row, use row.mkString(","), which will contain the values of the row as comma-separated values. Using the split function (a built-in function) you can then access each column value of the rdd row by index.
for (row <- df.rdd.collect) {
  var name = row.mkString(",").split(",")(0)
  var sector = row.mkString(",").split(",")(1)
  var age = row.mkString(",").split(",")(2)
}
Note that there are two drawbacks to this approach.
1. If there is a , in a column value, the data will be wrongly split into the adjacent column.
2. rdd.collect is an action that returns all the data to the driver's memory, and the driver's memory might not be large enough to hold it, which can cause the application to fail.
I would recommend using Approach 1.
Approach 3 - Using where and select
You can directly use where and select, which will internally loop over and find the data. Since it should not throw an index-out-of-bounds exception, an if condition is used:
if(df.where($"name" === "Andy").select(col("name")).collect().length >= 1)
name = df.where($"name" === "Andy").select(col("name")).collect()(0).get(0).toString
Approach 4 - Using temp tables
You can register the DataFrame as a temp table, which will be stored in Spark's memory. Then you can use a select query, as with any other database, to query the data, and then collect it and save it in a variable:
df.registerTempTable("student")
name = sqlContext.sql("select name from student where name='Andy'").collect()(0).toString().replace("[","").replace("]","")
You can convert a Row to a Seq with toSeq. Once turned into a Seq you can iterate over it as usual with foreach, map, or whatever you need:
sqlDF.foreach { row =>
  row.toSeq.foreach { col => println(col) }
}
Output:
Berta
bbb
30
Joe
Andy
aaa
20
ccc
40
You should use mkString on your Row:
sqlDF.foreach { row =>
  println(row.mkString(","))
}
But note that this will be printed inside the executors' JVMs, so normally you won't see the output (unless you work with master = local).
sqlDF.foreach did not work for me, but Approach 1 from @Sarath Avanavu's answer works; however, it was also sometimes shuffling the order of the records.
I found one more way that works:
df.collect().foreach { row =>
  println(row.mkString(","))
}
You should iterate over the partitions, which allows the data to be processed by Spark in parallel, and you can do a foreach on each row inside a partition.
You can further group the data in a partition into batches if need be:
// Model1 is your row/case-class type and numberOfRowsPerBatch is up to you
sqlDF.foreachPartition { partitionedRows: Iterator[Model1] =>
  if (partitionedRows.take(1).nonEmpty) {
    partitionedRows.grouped(numberOfRowsPerBatch).foreach { batch =>
      batch.foreach { row =>
        // ... process each row here
      }
    }
  }
}
This worked fine for me:
sqlDF.collect().foreach(row => row.toSeq.foreach(col => println(col)))
Simply collect the result and then apply foreach:
df.collect().foreach(println)
My solution uses a for loop, because that is what I needed:
Solution 1:
case class campos_tablas(name: String, sector: String, age: Int)

for (row <- df.as[campos_tablas].take(df.count.toInt)) {
  print(row.name.toString)
}
Solution 2:
for (row <- df.take(df.count.toInt)) {
  print(row(0).toString)
}
Let's assume resultDF is the DataFrame.
val resultDF = ??? // your DataFrame
var itr = 0
val resultRow = resultDF.count
val resultSet = resultDF.collectAsList

var load_id = 0
var load_dt = ""
var load_hr = 0L

while (itr < resultRow) {
  load_id = resultSet.get(itr).getInt(0)
  load_dt = resultSet.get(itr).getString(1) // if the column holds a String value
  load_hr = resultSet.get(itr).getLong(2)   // if the column holds a Long value
  // write other logic for your code here
  itr = itr + 1
}

DataFrame to RDD[Row], replacing blanks with nulls

I am converting a Spark DataFrame to RDD[Row] so I can map it to the final schema to write into a Hive ORC table. I want to convert any blank value in the input to an actual null so the Hive table can store an actual null instead of an empty string.
Input DataFrame (a single column with pipe delimited values):
col1
1|2|3||5|6|7|||...|
My code:
inputDF.rdd
  .map { x: Row => x.get(0).asInstanceOf[String].split("\\|", -1) }
  .map { x => Row(nullConverter(x(0)), nullConverter(x(1)), nullConverter(x(2)), ..., nullConverter(x(200))) }

def nullConverter(input: String): String = {
  if (input.trim.length > 0) input.trim
  else null
}
Is there any clean way of doing this rather than calling the nullConverter function 200 times?
Update, based on the single input column:
Going with your approach, I would do something like this:
inputDf.rdd.map((row: Row) => {
  // split with -1 to keep trailing empty fields, then convert blanks to null
  val values = row.get(0).asInstanceOf[String].split("\\|", -1).map(nullConverter)
  Row(values: _*) // one column per value, rather than a single array column
})
Make your nullConverter or any other logic a udf:
import org.apache.spark.sql.functions._

val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})
Now, use the udf on your df and apply to all columns:
val convertedDf = inputDf.select(inputDf.columns.map(c => nullConverter(col(c)).alias(c)):_*)
Now, you can do your RDD logic.
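As a small follow-up (a sketch, not from the original answer): the cleaned DataFrame already exposes the RDD[Row] the question asks for via .rdd, and it could even be written to the ORC table without the RDD detour; the table name below is only a placeholder.
// the cleaned DataFrame exposes the RDD[Row] directly
val cleanedRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = convertedDf.rdd

// or skip the RDD step entirely; "target_orc_table" is a placeholder table name
convertedDf.write.format("orc").mode("append").saveAsTable("target_orc_table")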
This would be easier to do using the DataFrame API before converting to an RDD. First, split the data:
val df = Seq(("1|2|3||5|6|7|8||")).toDF("col0") // Example dataframe
val df2 = df.withColumn("col0", split($"col0", "\\|")) // Split on "|"
Then find out the length of the array:
val numCols = df2.first.getAs[Seq[String]](0).length
Now, for each element in the array, use the nullConverter UDF and then assign it to its own column.
val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})

val df3 = df2.select((0 until numCols).map(i => nullConverter($"col0".getItem(i)).as("col" + i)): _*)
The result using the example dataframe:
+----+----+----+----+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|col6|col7|col8|col9|
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3|null| 5| 6| 7| 8|null|null|
+----+----+----+----+----+----+----+----+----+----+
Now convert it to an RDD or continue using the data as a DataFrame depending on your needs.
There is no point in converting the DataFrame to an RDD:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, "foo bar"), (2, "foobar "), (3, " ")).toDF("k", "v")

// regexp_replace can only substitute one string for another, it cannot produce real nulls,
// so when/trim is used instead to turn blank-only values into actual nulls in every column
val cleaned = df.select(df.columns.map(c =>
  when(trim(col(c)) === "", lit(null)).otherwise(col(c)).alias(c)
): _*)