Scala Spark Replace empty String with NULL - scala

What I want here is to replace a value in a specific column with null if it is an empty String.
The reason is that I am using org.apache.spark.sql.functions.coalesce to fill one of the DataFrame's columns based on other columns, but I have noticed that in some rows the value is an empty String instead of null, so the coalesce function doesn't work as expected.
val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz")
val resolvedDf = df.select(
  df("a"),
  df("b"),
  lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)
In the above example, I expected resolved_id to be filled first with column xx if it is not null, and if it is null, with column yy, and so on. But since column xx is sometimes filled with "" instead of null, I get "" in resolved_id.
I have tried to fix it with
resolvedDf.na.replace("resolved_id", Map("" -> null))
But based on the na.replace documentation, it only works if both key and value are Boolean, String, or Double, so I cannot use null here.
I don't want to use a UDF because of the performance impact; I just want to know whether there is any other trick to solve this issue.
One other way I can fix this is by using when, but I am not sure about the performance:
resolvedDf
  .withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))

This is the right way, with better performance:
resolvedDf.withColumn("resolved_id", when($"resolved_id" =!= "", $"resolved_id"))
Basically, there is no need to use the otherwise method: when the condition does not match and no otherwise is defined, null is returned.
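For example, the same trick could be applied to every column that feeds the coalesce, roughly like this (a sketch based on the question's snippet; adjust and myCoalesceColumnorder are the same names used there):

import org.apache.spark.sql.functions.{coalesce, lower, when}

val resolvedDf = df.select(
  df("a"),
  df("b"),
  lower(coalesce(myCoalesceColumnorder.map { x =>
    val c = adjust(x)
    // when without otherwise yields null whenever the condition is false,
    // so empty strings no longer short-circuit the coalesce
    when(c =!= "", c)
  }: _*)).as("resolved_id")
)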
You can check the source: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L507
/**
 * Evaluates a list of conditions and returns one of multiple possible result expressions.
 * If otherwise is not defined at the end, null is returned for unmatched conditions.
 *
 * {{{
 *   // Example: encoding gender string column into integer.
 *
 *   // Scala:
 *   people.select(when(people("gender") === "male", 0)
 *     .when(people("gender") === "female", 1)
 *     .otherwise(2))
 *
 *   // Java:
 *   people.select(when(col("gender").equalTo("male"), 0)
 *     .when(col("gender").equalTo("female"), 1)
 *     .otherwise(2))
 * }}}
 *
 * @group expr_ops
 * @since 1.4.0
 */
def when(condition: Column, value: Any): Column = this.expr match {
  case CaseWhen(branches, None) =>
    withExpr { CaseWhen(branches :+ ((condition.expr, lit(value).expr))) }
  case CaseWhen(branches, Some(_)) =>
    throw new IllegalArgumentException(
      "when() cannot be applied once otherwise() is applied")
  case _ =>
    throw new IllegalArgumentException(
      "when() can only be applied on a Column previously generated by when() function")
}

Related

pyspark how to get the count of records which are not matching with the given date format

I have a CSV file that contains (FileName, ColumnName, Rule and RuleDetails) as headers.
As per the RuleDetails, I need to get the count of values in the column (INSTALLDATE) that do not match the RuleDetails date format.
I have to pass ColumnName and RuleDetails dynamically.
I tried the below code:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])

DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
Expected output is the count of records which do not match the given date format.
For example, if 4 out of 10 records are not in (YYYY-MM-DD) format, then the count should be 4.
I get the below error message when I run the above code.

Spark / Scala / SparkSQL dataframes filter issue "data type mismatch"

My problem is that I have code that takes the filter column and values in a list as parameters:
val vars = "age IN ('0')"
val ListPar = "entered_user,2014-05-05,2016-10-10;"
//val ListPar2 = "entered_user,2014-05-05,2016-10-10;revenue,0,5;"
val ListParser : List[String] = ListPar.split(";").map(_.trim).toList
val myInnerList : List[String] = ListParser(0).split(",").map(_.trim).toList
if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action"){
responses.filter(vars +" AND " + responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
}else{
responses.filter(vars +" AND " + responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
}
Well, for all the fields except the ones that contain dates the function works flawlessly, but for fields that have dates it throws an error.
Note: I'm working with Parquet files.
Here is the error:
When I try to write it manually I get the same.
Here is how the query is sent to SparkSQL:
The first one, where there is revenue, works, but the second one doesn't.
And when I try to just filter with dates, without the value of "vars" which contains other columns, it works.
Well, my issue was that I was mixing SQL and Spark: when I tried to concatenate the SQL query in my variable "vars" with df.filter(), and especially when I used the between operator, it produced an output format unrecognised by SparkSQL, which is
age IN ('0') AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
It might seem correct, but after looking at the SQL documentation, it was missing parentheses (around vars); it needed to be
(age IN ('0')) AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
Well, the solution is that I needed to concatenate those correctly; to do that I had to wrap the variable vars in expr, which produces the desired syntax:
responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
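Applied to the original branching code, that fix might look like this (a sketch using the same names as the question's snippet):

import org.apache.spark.sql.functions.expr

// expr() parses the raw SQL fragment into a Column, and && combines it with the
// between() predicate, so Spark builds one well-formed filter expression.
val filtered =
  if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action") {
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
  } else {
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
  }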

Inputting an array from spark dataframe to postgreSQL through JDBC

So, I need to maintain a table with results and insert information into it every certain amount of time. As JDBC and Spark have no built-in option for UPSERT, and as I cannot allow the table to be empty while I insert the results, nor allow them to be duplicated, I built an UPSERT function of my own. The problem is that I have a WrappedArray of Ints in my DataFrame and I cannot seem to translate it to a Java object that will let me insert it into the PreparedStatement.
The relevant part from my code looks like this:
import java.sql._

val st: PreparedStatement = dbc.prepareStatement("""
  INSERT INTO """ + table + """ as tb """ + sliced_columns + """
  VALUES""" + "(" + "?, " * (columns.size - 1) + "?)" + """
  ON CONFLICT (id)
  DO UPDATE SET """ + column_name + """= CAST (? AS _int4), count_win=?, occurrences=?, "sumOccurrences"=?, win_rate=? Where tb.id=?;
""")
As you can see I tried to write the WrappedArray as a string and then cast it in the SQL code itself, but that feels like a very bad solution.
I made this as the input part, doing different actions depending on which column type it is:
for (single_type <- types) {
  single_type._2 match {
    case "IntegerType" => st.setInt(counter + 1, x.getInt(counter))
    case "StringType"  => st.setString(counter + 1, x.getString(counter))
    case "DoubleType"  => st.setDouble(counter + 1, x.getDouble(counter))
    case "LongType"    => st.setLong(counter + 1, x.getLong(counter))
    case _             => st.setArray(counter + 1, x.getList(counter).toArray().asInstanceOf[Array])
  }
}
This returns an error that Ljava.lang.Object; cannot be cast to java.sql.Array. I'd really appreciate any help!
Array is a type constructor, not a type:
import org.apache.spark.sql.Row
Row(Seq(1, 2, 3)).getList(0).toArray.asInstanceOf[Array[_]]
but toArray (with the type parameter) should be sufficient:
Row(Seq(1, 2, 3)).getList[Int](0).toArray
The problem was eventually solved by using createArrayOf:
st.setArray(counter + 1, conn.createArrayOf("int4", x.getList[Int](4).toArray()))
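For context, the array branch of the type match above could then use createArrayOf instead of the asInstanceOf cast, roughly like this (a sketch; conn is assumed to be the open java.sql.Connection, and counter replaces the hard-coded index 4):

case _ =>
  // createArrayOf turns the JVM array into a java.sql.Array of PostgreSQL element type int4,
  // which PreparedStatement.setArray can bind directly
  st.setArray(counter + 1, conn.createArrayOf("int4", x.getList[Int](counter).toArray()))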

Slick drop and take throwing error

I'm learning Play Scala with Slick and ran into a problem. The query generated by Slick throws an error when using drop and take (it works fine without drop and take, but I need pagination).
val categories = TableQuery[Categories]

def list(page: Int = 0, pageSize: Int = 10, orderBy: Int = 1)(implicit s: Session): Page[Category] = {
  val offset = pageSize * page
  val totalRows = count
  /* Error here */
  val result = categories.drop(offset).take(pageSize).list
  /* This works fine */
  /* val result = categories.list */
  Page(result, page, offset, totalRows)
}
The query generated and error stacktrace:
[JdbcSQLException: Syntax error in SQL statement "SELECT X2.""id"", X2.""name"", X2.""description"" FROM (SELECT X3.""id"" AS ""id"", X3.""name"" AS ""name"", X3.""description"" AS ""description"" FROM ""core_category"" X3 FETCH[*] NEXT 10 ROW ONLY) X2 "; expected "RIGHT, LEFT, FULL, INNER, JOIN, CROSS, NATURAL, ,, WHERE, GROUP, HAVING, UNION, MINUS, EXCEPT, INTERSECT, ORDER, LIMIT, FOR, )"; SQL statement:
select x2."id", x2."name", x2."description" from (select x3."id" as "id", x3."name" as "name", x3."description" as "description" from "core_category" x3 fetch next 10 row only) x2 [42001-175]]
Any ideas how I could implement pagination without getting this error?
I was importing JdbcDriver instead of the database-specific driver (H2Driver in my case), which is needed to get the correct SQL dialect.
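In other words, the fix is roughly this change of import (a sketch assuming Slick 2.x, where the drivers live under scala.slick.driver, matching the Session/.list style used above):

// Generic driver: no database-specific dialect, so drop/take can produce SQL that H2 rejects
// import scala.slick.driver.JdbcDriver.simple._

// Database-specific driver: Slick emits H2's own pagination (LIMIT/OFFSET) syntax
import scala.slick.driver.H2Driver.simple._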

use joins in zend framework query

I want to use joins in Zend. Below is my query:
$select = $this->_db->select()
    ->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
        array('ScoringCriteriaID','ScoringCriteriaWeight'))
    ->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
        'scoringcriteria. TenderId=evaluationcriteria.TenderId')
    ->join(array('tenderapplications' => 'procurement_tbltenderapplications','tendersupplier' => 'tblsupplier'),
        'tenderapplications. TenderInvitationContractorID=tendersupplier.UserID');
I have UserID in the tendersupplier table, but it's giving the following error:
Column not found: 1054 Unknown column 'tendersupplier.UserID' in 'on clause
I think it's not the right way to include more than one table in the same array of a join.
Try code like this:
->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
    array('ScoringCriteriaID','ScoringCriteriaWeight'))
->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
    'scoringcriteria. TenderId=evaluationcriteria.TenderId')
->join(array('tenderapplications' => 'procurement_tbltenderapplications'),
    'tenderapplications.TenderInvitationContractorID=tblsupplier.UserID');
I am not sure whether you are planning to join values from the tblsupplier table as well.
The where condition is not written the way you did it.
If tblsupplier is a table, then it should be in an array.
This code is not tested!
$select = $this->_db->select()
    ->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
        array('ScoringCriteriaID','ScoringCriteriaWeight'))
    ->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
        'scoringcriteria. TenderId=evaluationcriteria.TenderId')
    ->join(array('tenderapplications' => 'procurement_tbltenderapplications'), array('tendersupplier' => 'tblsupplier'))
    ->where('tenderapplications. TenderInvitationContractorID=tendersupplier.UserID');
Looks like you are trying to join 2 tables in one ->join, I don't think you can do that.
join code from Zend_Db_Select()
/**
 * Adds a JOIN table and columns to the query.
 *
 * The $name and $cols parameters follow the same logic
 * as described in the from() method.
 *
 * @param array|string|Zend_Db_Expr $name The table name.
 * @param string $cond Join on this condition.
 * @param array|string $cols The columns to select from the joined table.
 * @param string $schema The database name to specify, if any.
 * @return Zend_Db_Select This Zend_Db_Select object.
 */
public function join($name, $cond, $cols = self::SQL_WILDCARD, $schema = null)
{
    return $this->joinInner($name, $cond, $cols, $schema);
}
Here is the comment block from from()
/**
 * Adds a FROM table and optional columns to the query.
 *
 * The first parameter $name can be a simple string, in which case the
 * correlation name is generated automatically. If you want to specify
 * the correlation name, the first parameter must be an associative
 * array in which the key is the correlation name, and the value is
 * the physical table name. For example, array('alias' => 'table').
 * The correlation name is prepended to all columns fetched for this
 * table.
 *
 * The second parameter can be a single string or Zend_Db_Expr object,
 * or else an array of strings or Zend_Db_Expr objects.
 *
 * The first parameter can be null or an empty string, in which case
 * no correlation name is generated or prepended to the columns named
 * in the second parameter.
 *
 * @param array|string|Zend_Db_Expr $name The table name or an associative array
 *                                        relating correlation name to table name.
 * @param array|string|Zend_Db_Expr $cols The columns to select from this table.
 * @param string $schema The schema name to specify, if any.
 * @return Zend_Db_Select This Zend_Db_Select object.
 */
Maybe try something like this instead:
$select = $this->_db->select()
    // FROM table procurement_tbltenderevaluationcriteria AS evaluationcriteria, SELECT FROM
    // COLUMNS ScoringCriteriaID and ScoringCriteriaWeight
    ->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
        array('ScoringCriteriaID','ScoringCriteriaWeight'))
    // JOIN TABLE procurement_tbltenderscoringcriteria AS scoringcriteria WHERE
    // TenderId FROM TABLE scoringcriteria == TenderId FROM TABLE evaluationcriteria
    ->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
        'scoringcriteria.TenderId=evaluationcriteria.TenderId')
    // JOIN TABLE procurement_tbltenderapplications AS tenderapplications
    ->join(array('tenderapplications' => 'procurement_tbltenderapplications'))
    // JOIN TABLE tblsupplier AS tendersupplier WHERE TenderInvitationContractorID FROM TABLE
    // tenderapplications == UserID FROM TABLE tendersupplier
    ->join(array('tendersupplier' => 'tblsupplier'),
        'tenderapplications.TenderInvitationContractorID=tendersupplier.UserID');
you may also need to alter your select() definition to allow joins:
//this will lock the tables to prevent data corruption
$this->_db->select(Zend_Db_Table::SELECT_WITHOUT_FROM_PART)->setIntegrityCheck(FALSE);
I hope I'm reading your intent correctly, this should get you closer if not all the way there. (one hint, use shorter aliases...)