Scala Spark Replace empty String with NULL - scala

What I want here is to replace a value in a specific column with null if it is an empty String.
The reason is that I am using org.apache.spark.sql.functions.coalesce to fill one of the DataFrame's columns based on other columns, but I have noticed that in some rows the value is an empty String instead of null, so the coalesce function doesn't work as expected.
val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz")
val resolvedDf = df.select(
  df("a"),
  df("b"),
  lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)
In the above example, I expected resolved_id to be filled first with column xx if it is not null, and if it is null, with column yy, and so on. But since column xx is sometimes filled with "" instead of null, I get "" in resolved_id.
I have tried to fix it with
resolvedDf.na.replace("resolved_id", Map("" -> null))
But based on the na.replace documentation, it only works if both key and value are Boolean, String, or Double, so I cannot use null here.
I don't want to use a UDF because of the performance impact; I just want to know whether there is any other trick to solve this issue.
One other way I can fix this is by using when, but I am not sure about the performance:
resolvedDf
  .withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))

This is the right way, with better performance:
resolvedDf.withColumn("resolved_id", when($"resolved_id" =!= "", $"resolved_id"))
Basically, there is no need to use the otherwise method: when the condition does not match and no otherwise is defined, null is returned.
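For example, the same trick could be applied to every column that feeds the coalesce, roughly like this (a sketch based on the question's snippet; adjust and myCoalesceColumnorder are the same names used there):

import org.apache.spark.sql.functions.{coalesce, lower, when}

val resolvedDf = df.select(
  df("a"),
  df("b"),
  lower(coalesce(myCoalesceColumnorder.map { x =>
    val c = adjust(x)
    // when without otherwise yields null whenever the condition is false,
    // so empty strings no longer short-circuit the coalesce
    when(c =!= "", c)
  }: _*)).as("resolved_id")
)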
You can check the source: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L507
/**
 * Evaluates a list of conditions and returns one of multiple possible result expressions.
 * If otherwise is not defined at the end, null is returned for unmatched conditions.
 *
 * {{{
 *   // Example: encoding gender string column into integer.
 *
 *   // Scala:
 *   people.select(when(people("gender") === "male", 0)
 *     .when(people("gender") === "female", 1)
 *     .otherwise(2))
 *
 *   // Java:
 *   people.select(when(col("gender").equalTo("male"), 0)
 *     .when(col("gender").equalTo("female"), 1)
 *     .otherwise(2))
 * }}}
 *
 * @group expr_ops
 * @since 1.4.0
 */
def when(condition: Column, value: Any): Column = this.expr match {
  case CaseWhen(branches, None) =>
    withExpr { CaseWhen(branches :+ ((condition.expr, lit(value).expr))) }
  case CaseWhen(branches, Some(_)) =>
    throw new IllegalArgumentException(
      "when() cannot be applied once otherwise() is applied")
  case _ =>
    throw new IllegalArgumentException(
      "when() can only be applied on a Column previously generated by when() function")
}

Related

pyspark how to get the count of records which are not matching with the given date format

I have a CSV file that contains (FileName, ColumnName, Rule and RuleDetails) as headers.
As per the RuleDetails, I need to get the count of values in the column (INSTALLDATE) that do not match the RuleDetails date format.
I have to pass ColumnName and RuleDetails dynamically.
I tried the below code:
from pyspark.sql.functions import *

DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])

DateColsString = ",".join([str(elem) for elem in DateFields])

output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
Expected output is the count of records which do not match the given date format.
For example, if 4 out of 10 records are not in (YYYY-MM-DD) format, then the count should be 4.
I get the below error message when I run the above code.

Spark / Scala / SparkSQL dataframes filter issue "data type mismatch"

My problem is that I have code that takes the filter column and values in a list as parameters:
val vars = "age IN ('0')"
val ListPar = "entered_user,2014-05-05,2016-10-10;"
//val ListPar2 = "entered_user,2014-05-05,2016-10-10;revenue,0,5;"
val ListParser : List[String] = ListPar.split(";").map(_.trim).toList
val myInnerList : List[String] = ListParser(0).split(",").map(_.trim).toList
if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action"){
responses.filter(vars +" AND " + responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
}else{
responses.filter(vars +" AND " + responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
}
Well, for all the fields except the ones that contain dates the function works flawlessly, but for fields that have dates it throws an error.
Note: I'm working with Parquet files.
Here is the error:
When I try to write it manually I get the same.
Here is how the query is sent to SparkSQL:
The first one, where there is revenue, works, but the second one doesn't.
And when I try to just filter with dates, without the value of "vars" which contains other columns, it works.
Well, my issue was that I was mixing SQL and Spark: when I tried to concatenate the SQL query in my variable "vars" with df.filter(), and especially when I used the between operator, it produced an output format unrecognised by SparkSQL, which is
age IN ('0') AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
It might seem correct, but after looking at the SQL documentation, it was missing parentheses (around vars); it needed to be
(age IN ('0')) AND ((entered_user >= 2015-01-01) AND (entered_user <= 2015-05-01))
Well, the solution is that I needed to concatenate those correctly; to do that I had to wrap the variable vars in expr, which produces the desired syntax:
responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
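Applied to the original branching code, that fix might look like this (a sketch using the same names as the question's snippet):

import org.apache.spark.sql.functions.expr

// expr() parses the raw SQL fragment into a Column, and && combines it with the
// between() predicate, so Spark builds one well-formed filter expression.
val filtered =
  if (myInnerList(0) == "entered_user" || myInnerList(0) == "date" || myInnerList(0) == "dt_action") {
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1), myInnerList(2)))
  } else {
    responses.filter(expr(vars) && responses(myInnerList(0)).between(myInnerList(1).toInt, myInnerList(2).toInt))
  }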

Inputting an array from spark dataframe to postgreSQL through JDBC

So, I need to maintain a table with results and insert information into it every certain amount of time. As JDBC and Spark have no built-in option for UPSERT, and as I cannot allow the table to be empty while I insert the results, nor allow them to be duplicated, I built an UPSERT function of my own. The problem is that I have a WrappedArray of Ints in my DataFrame and I cannot seem to translate it to a Java object that will let me insert it into the PreparedStatement.
The relevant part from my code looks like this:
import java.sql._

val st: PreparedStatement = dbc.prepareStatement("""
  INSERT INTO """ + table + """ as tb """ + sliced_columns + """
  VALUES""" + "(" + "?, " * (columns.size - 1) + "?)" + """
  ON CONFLICT (id)
  DO UPDATE SET """ + column_name + """= CAST (? AS _int4), count_win=?, occurrences=?, "sumOccurrences"=?, win_rate=? Where tb.id=?;
""")
As you can see I tried to write the WrappedArray as a string and then cast it in the SQL code itself, but that feels like a very bad solution.
I made this as the input part, doing different actions depending on which column type it is:
for (single_type <- types) {
  single_type._2 match {
    case "IntegerType" => st.setInt(counter + 1, x.getInt(counter))
    case "StringType"  => st.setString(counter + 1, x.getString(counter))
    case "DoubleType"  => st.setDouble(counter + 1, x.getDouble(counter))
    case "LongType"    => st.setLong(counter + 1, x.getLong(counter))
    case _             => st.setArray(counter + 1, x.getList(counter).toArray().asInstanceOf[Array])
  }
}
This returns an error that Ljava.lang.Object; cannot be cast to java.sql.Array. I'd really appreciate any help!
Array is a type constructor, not a type:
import org.apache.spark.sql.Row
Row(Seq(1, 2, 3)).getList(0).toArray.asInstanceOf[Array[_]]
but toArray (with the type parameter) should be sufficient:
Row(Seq(1, 2, 3)).getList[Int](0).toArray
The problem was eventually solved by using createArrayOf:
st.setArray(counter + 1, conn.createArrayOf("int4", x.getList[Int](4).toArray()))
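For context, the array branch of the type match above could then use createArrayOf instead of the asInstanceOf cast, roughly like this (a sketch; conn is assumed to be the open java.sql.Connection, and counter replaces the hard-coded index 4):

case _ =>
  // createArrayOf turns the JVM array into a java.sql.Array of PostgreSQL element type int4,
  // which PreparedStatement.setArray can bind directly
  st.setArray(counter + 1, conn.createArrayOf("int4", x.getList[Int](counter).toArray()))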

Slick drop and take throwing error

I'm learning Play Scala with Slick and ran into a problem. The query generated by Slick throws an error when using drop and take (it works fine without drop and take, but I need pagination).
val categories = TableQuery[Categories]

def list(page: Int = 0, pageSize: Int = 10, orderBy: Int = 1)(implicit s: Session): Page[Category] = {
  val offset = pageSize * page
  val totalRows = count
  /* Error here */
  val result = categories.drop(offset).take(pageSize).list
  /* This works fine */
  /* val result = categories.list */
  Page(result, page, offset, totalRows)
}
The query generated and error stacktrace:
[JdbcSQLException: Syntax error in SQL statement "SELECT X2.""id"", X2.""name"", X2.""description"" FROM (SELECT X3.""id"" AS ""id"", X3.""name"" AS ""name"", X3.""description"" AS ""description"" FROM ""core_category"" X3 FETCH[*] NEXT 10 ROW ONLY) X2 "; expected "RIGHT, LEFT, FULL, INNER, JOIN, CROSS, NATURAL, ,, WHERE, GROUP, HAVING, UNION, MINUS, EXCEPT, INTERSECT, ORDER, LIMIT, FOR, )"; SQL statement:
select x2."id", x2."name", x2."description" from (select x3."id" as "id", x3."name" as "name", x3."description" as "description" from "core_category" x3 fetch next 10 row only) x2 [42001-175]]
Any ideas how I could implement pagination without getting this error?
I was importing JdbcDriver instead of the database-specific driver (H2Driver in my case), which is needed to get the correct SQL dialect.
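In other words, the fix is roughly this change of import (a sketch assuming Slick 2.x, where the drivers live under scala.slick.driver, matching the Session/.list style used above):

// Generic driver: no database-specific dialect, so drop/take can produce SQL that H2 rejects
// import scala.slick.driver.JdbcDriver.simple._

// Database-specific driver: Slick emits H2's own pagination (LIMIT/OFFSET) syntax
import scala.slick.driver.H2Driver.simple._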

use joins in zend framework query

I want to use joins in Zend. Below is my query:
$select = $this->_db->select()
    ->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
        array('ScoringCriteriaID','ScoringCriteriaWeight'))
    ->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
        'scoringcriteria. TenderId=evaluationcriteria.TenderId')
    ->join(array('tenderapplications' => 'procurement_tbltenderapplications','tendersupplier' => 'tblsupplier'),
        'tenderapplications. TenderInvitationContractorID=tendersupplier.UserID');
I have UserID in the tendersupplier table, but it's giving the following error:
Column not found: 1054 Unknown column 'tendersupplier.UserID' in 'on clause
I think it's not the right way to include more than one table in the same array of a join.
Try code like this:
->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
    array('ScoringCriteriaID','ScoringCriteriaWeight'))
->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
    'scoringcriteria. TenderId=evaluationcriteria.TenderId')
->join(array('tenderapplications' => 'procurement_tbltenderapplications'),
    'tenderapplications.TenderInvitationContractorID=tblsupplier.UserID');
I am not sure whether you are planning to join values from the tblsupplier table as well.
The where condition is not written the way you did it.
If tblsupplier is a table, then it should be in an array.
This code is not tested!
$select = $this->_db->select()
    ->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
        array('ScoringCriteriaID','ScoringCriteriaWeight'))
    ->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
        'scoringcriteria. TenderId=evaluationcriteria.TenderId')
    ->join(array('tenderapplications' => 'procurement_tbltenderapplications'), array('tendersupplier' => 'tblsupplier'))
    ->where('tenderapplications. TenderInvitationContractorID=tendersupplier.UserID');
Looks like you are trying to join 2 tables in one ->join, I don't think you can do that.
join code from Zend_Db_Select()
/**
 * Adds a JOIN table and columns to the query.
 *
 * The $name and $cols parameters follow the same logic
 * as described in the from() method.
 *
 * @param array|string|Zend_Db_Expr $name The table name.
 * @param string $cond Join on this condition.
 * @param array|string $cols The columns to select from the joined table.
 * @param string $schema The database name to specify, if any.
 * @return Zend_Db_Select This Zend_Db_Select object.
 */
public function join($name, $cond, $cols = self::SQL_WILDCARD, $schema = null)
{
    return $this->joinInner($name, $cond, $cols, $schema);
}
Here is the comment block from from()
/**
 * Adds a FROM table and optional columns to the query.
 *
 * The first parameter $name can be a simple string, in which case the
 * correlation name is generated automatically. If you want to specify
 * the correlation name, the first parameter must be an associative
 * array in which the key is the correlation name, and the value is
 * the physical table name. For example, array('alias' => 'table').
 * The correlation name is prepended to all columns fetched for this
 * table.
 *
 * The second parameter can be a single string or Zend_Db_Expr object,
 * or else an array of strings or Zend_Db_Expr objects.
 *
 * The first parameter can be null or an empty string, in which case
 * no correlation name is generated or prepended to the columns named
 * in the second parameter.
 *
 * @param array|string|Zend_Db_Expr $name The table name or an associative array
 *                                        relating correlation name to table name.
 * @param array|string|Zend_Db_Expr $cols The columns to select from this table.
 * @param string $schema The schema name to specify, if any.
 * @return Zend_Db_Select This Zend_Db_Select object.
 */
Maybe try something like this instead:
$select = $this->_db->select()
    // FROM table procurement_tbltenderevaluationcriteria AS evaluationcriteria, SELECT FROM
    // COLUMNS ScoringCriteriaID and ScoringCriteriaWeight
    ->from(array('evaluationcriteria' => 'procurement_tbltenderevaluationcriteria'),
        array('ScoringCriteriaID','ScoringCriteriaWeight'))
    // JOIN TABLE procurement_tbltenderscoringcriteria AS scoringcriteria WHERE
    // TenderId FROM TABLE scoringcriteria == TenderId FROM TABLE evaluationcriteria
    ->join(array('scoringcriteria' => 'procurement_tbltenderscoringcriteria'),
        'scoringcriteria.TenderId=evaluationcriteria.TenderId')
    // JOIN TABLE procurement_tbltenderapplications AS tenderapplications
    ->join(array('tenderapplications' => 'procurement_tbltenderapplications'))
    // JOIN TABLE tblsupplier AS tendersupplier WHERE TenderInvitationContractorID FROM TABLE
    // tenderapplications == UserID FROM TABLE tendersupplier
    ->join(array('tendersupplier' => 'tblsupplier'),
        'tenderapplications.TenderInvitationContractorID=tendersupplier.UserID');
you may also need to alter your select() definition to allow joins:
//this will lock the tables to prevent data corruption
$this->_db->select(Zend_Db_Table::SELECT_WITHOUT_FROM_PART)->setIntegrityCheck(FALSE);
I hope I'm reading your intent correctly, this should get you closer if not all the way there. (one hint, use shorter aliases...)