There are a number of records in the table, and there is a column called AssignedTo whose value is a comma-separated string. Its possible values look something like:
"1"
"2"
"3"
"11"
"12"
"1,2"
"1,3"
"2,3"
"1,2,3"
"1,3,11"
"1,3,12"
If I use the following LINQ query to search, with value = 1:
records = records.Where(x => x.AssignedTo.Contains(value) || search == null);
It returns the records with AssignedTo values
"1", "11", "12", "1,2", "1,3", "1,2,3", "1,3,11", "1,3,12"
I really only want the records whose AssignedTo contains "1" as a whole value,
which are "1", "1,2", "1,3", "1,2,3", "1,3,11", and "1,3,12"; I do not want "11" or "12".
If I use the following LINQ query to return only the qualifying records, still with value = 1:
records = records.Where(x => x.AssignedTo.Contains("," + value + ",") ||
x.AssignedTo.StartsWith(value + ",") ||
x.AssignedTo.EndsWith("," + value) ||
value == null);
It returns the records with AssignedTo values "1,2", "1,3", "1,2,3", "1,3,11", and "1,3,12", but misses the record with AssignedTo value "1".
Since something like this is likely a search filter, doing the operation in-memory likely isn't a very good option unless the row count is guaranteed to be manageable. Ideally something like this should be re-factored to use a proper relational structure rather than a comma-delimited string.
However, your example was mostly there; it was just missing an equality check to catch the value on its own. I would also take the 'value == null' check out of the LINQ expression and into a conditional that decides whether to add the WHERE clause at all. The difference is that with the condition inside the LINQ expression it gets generated into the SQL, whereas by pre-checking you can avoid the SQL conditions altogether when no value is specified.
if (!string.IsNullOrEmpty(value))
records = records.Where(x => x.AssignedTo.Contains("," + value + ",") ||
x.AssignedTo.StartsWith(value + ",") ||
x.AssignedTo.EndsWith("," + value) ||
x.AssignedTo == value);
This would catch "n,...", "...,n,...", "...,n", and "n".
A better method would be to split the string and search the results:
records = records.Where(x => x.AssignedTo.Split(',').Contains(value) || search == null);
Note that you can't use this method directly in an EF query since there's no way to translate it to standard SQL. So you may want to filter using your Contains as a starting spot (to reduce the number of false positives) and then filter in-memory:
records = records.Where(x => x.AssignedTo.Contains(value) || search == null)
.AsEnumerable() // do subsequent filtering in-memory
.Where(x => x.AssignedTo.Split(',').Contains(value) || search == null);
Or redesign your database to use related tables rather than storing a comma-delimited list of strings...
If you are building a LINQ expression against a database, the Split function will throw an error. You can use the expression below instead.
if (!string.IsNullOrEmpty(value))
{
    records = records.Where(x => ("," + x.AssignedTo + ",").Contains("," + value + ","));
}
I have an array of tuples and I want to generate a join condition (OR) from it.
e.g.
input --> [("leftId", "rightId"), ("leftAltId", "rightAltId")]
output --> leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
method signature:
def inner(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String,String)]): Unit = {
}
I tried using a reduce operation on the array, but the output of my reduce is a Column, not a String, so it can't be fed back in as input. I could do it recursively, but I'm hoping there's a simpler way to initialize an empty column variable and build up the condition. Thoughts?
You can do something like this:
val cond = fieldsToJoin.map(x => col(x._1) === col(x._2)).reduce(_ || _)
leftDF.join(rightDF, cond)
Basically, you first turn the array into an array of conditions (col transforms the string into a column, and === does the comparison), and then the reduce puts an "or" between them. The result is a column you can use.
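Putting that together with the method signature from the question, a minimal sketch could look like the following. It assumes each tuple is (left column name, right column name), references the columns through leftDF and rightDF instead of col so they stay unambiguous when both frames share column names, and returns the joined DataFrame rather than Unit so the result is usable:
import org.apache.spark.sql.DataFrame
def inner(leftDF: DataFrame, rightDF: DataFrame,
          fieldsToJoin: Array[(String, String)]): DataFrame = {
  // One equality condition per tuple, OR-ed together into a single Column.
  val cond = fieldsToJoin
    .map { case (l, r) => leftDF(l) === rightDF(r) }
    .reduce(_ || _)
  leftDF.join(rightDF, cond)
}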
I am trying to write a condition statement for joining two Spark DataFrames in Scala:
val joinCondition = when($"filteredRESULT.key" == $"allDataUSE.key" && $"allDataUSE.timestamp" >= $"filteredRESULT.tripStart" && $"allDataUSE.timestamp" <= $"filteredRESULT.tripEND", $"allDataUSE.tripid" === $"filteredRESULT.tripid").otherwise($"allDataUSE.tripid" === 0)
The filteredRESULT df is very small, and includes a tripID, tripStart time, tripEnd time. My goal is to use filteredRESULT as a lookup table, where a row from the allDataUSE df is compared against the entries in filteredRESULT. For example:
If in allDataUSE, a row matches filteredRESULT key, a timestamp >= a trip's start time, and <= a trip's end time, then the tripid column in allDataUSE should receive the value of tripid in the filteredRESULT df.
I am getting a boolean error when I run the above conditional statement. How can I perform this operation? Thank you!!
You are getting the boolean error because when expects its condition to be a Column, but Scala's == operator returns a Boolean rather than a Column; Spark's === operator (equalTo) returns a Column, which is what is needed here.
Below is a link to the Spark documentation:
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Column.html#equals(java.lang.Object)
public Column equalTo(Object other)
Equality test.
// Scala:
df.filter( df("colA") === df("colB") )
// Java
import static org.apache.spark.sql.functions.*;
df.filter( col("colA").equalTo(col("colB")) );
Parameters:
other - (undocumented)
Returns:
(undocumented)
So replace == with === throughout the condition and it will work.
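As a rough sketch, the condition from the question with == swapped for === might look like this (assuming the two DataFrames are aliased as filteredRESULT and allDataUSE, and that spark.implicits._ is in scope for the $ syntax):
import org.apache.spark.sql.functions.when
// Same structure as in the question, but every comparison now uses === so
// that it yields a Column instead of a Scala Boolean.
val joinCondition = when(
  $"filteredRESULT.key" === $"allDataUSE.key" &&
    $"allDataUSE.timestamp" >= $"filteredRESULT.tripStart" &&
    $"allDataUSE.timestamp" <= $"filteredRESULT.tripEND",
  $"allDataUSE.tripid" === $"filteredRESULT.tripid"
).otherwise($"allDataUSE.tripid" === 0)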
Joining on a common key column with Spark leads to that column being duplicated in the final Dataset:
val result = ds1.join(ds2, ds1("key") === ds2("key"))
// result now has two "key" columns
This is avoidable by passing a Seq of column names instead of the comparison, similar to the USING keyword in SQL:
val result = ds1.join(ds2, Seq("key"))
// result now has only one "key" column
However, this doesn't work when joining with a common key + another condition, like:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
// result has two "key" columns
val result = ds1.join(ds2, Seq("key") && ds1("foo") < ds2("foo"))
// compile error: value && is not a member of Seq[String]
Currently one way of getting out of this is to drop the duplicated column afterwards, but this is quite cumbersome:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
.drop(ds1("key"))
Is there a more natural, cleaner way to achieve the same goal?
You can separate the equi-join component from the additional filter:
ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
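For example, with some made-up sample data (assuming both Datasets have key and foo columns, and spark.implicits._ is in scope for toDF):
// Hypothetical data just to illustrate the shape of the result.
val ds1 = Seq((1, 10), (2, 20)).toDF("key", "foo")
val ds2 = Seq((1, 15), (2, 5)).toDF("key", "foo")
// The equi-join on "key" keeps a single "key" column, and the extra
// condition becomes an ordinary filter on the joined result.
val result = ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
result.show()  // only the key = 1 row survives (10 < 15), with one "key" column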
I have a data frame with n columns and I want to replace the empty strings in all of them with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver, not per record. You should replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which merely compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
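Since the question mentions n columns, the same replacement can be applied to every column with a fold. A rough sketch, assuming all the affected columns are string-typed:
import org.apache.spark.sql.functions.{col, lit, when}
// Apply the empty-string -> null replacement to every column of rawDF.
val cleanedDF = rawDF.columns.foldLeft(rawDF) { (df, c) =>
  df.withColumn(c, when(col(c) === "", lit(null)).otherwise(col(c)))
}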