PySpark: count the number of string occurrences - pyspark

I am new to PySpark. I want to count how many times the string "NA" appears in each column. The code below has a problem: it also counts values such as "NAN" and "NANN". Is there a way to match only values that are exactly equal to "NA"?
names = weatherData.schema.names
for name in names:
    # contains() matches substrings, so "NAN" and "NANN" are counted too
    print(name + ': ' + str(weatherData.where(weatherData[name].contains("NA")).count()))
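The difference the question is after, substring containment versus exact equality, can be sketched in plain Python. In PySpark the exact comparison would be the column expression weatherData[name] == "NA" in place of the substring match. The sample rows and the two helper functions below are made up for illustration and are not part of the original code.

```python
# Hypothetical sample rows standing in for weatherData (not from the question).
rows = [
    {"station": "NA", "wind": "NAN"},
    {"station": "A1", "wind": "NA"},
    {"station": "NANN", "wind": "NA"},
]


def count_containing(rows, column, needle):
    """Count values that contain needle anywhere (substring match)."""
    return sum(1 for row in rows if needle in row[column])


def count_equal(rows, column, needle):
    """Count values that are exactly equal to needle."""
    return sum(1 for row in rows if row[column] == needle)


for name in rows[0]:
    # Exact equality ignores "NAN" and "NANN".
    print(name + ": " + str(count_equal(rows, name, "NA")))
```

With these rows the substring count for "station" is 2, while the exact count is 1.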

Related

In Spark Scala, how to create a column with substring() using locate() as a parameter?

I have a dataset that is like the following:
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
I want to create a column that will be a string containing only the values after Id. The result would be something like:
val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")
For that, I am trying to use substring. For the starting-position parameter of substring, I am trying to use locate, but I get an error saying that the argument should be an Int and not a Column.
What I am trying is like:
df
  .withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
The error I get is:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
^
How can I fix this and continue using locate() as a parameter?
UPDATE
Updating to give an example in which @wBob's answer doesn't work for my real-world data, which is a bit more complicated than the examples above.
It is something like this:
val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-").toDF("list")
The values are very long strings that don't have a specific pattern.
Somewhere in each string there is always a part written as ID XXXXX. The XXXXX varies, but it always has the same size (5 characters) and always comes right after "ID ".
I have not been able to use either split or regexp_extract to get something in this pattern.
It is not clear if you want the third item or the first number from the list, but here are a couple of examples which should help:
// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
df
  .withColumn("t1", split($"list", "\\ ")(2))
  .withColumn("t2", regexp_extract($"list", "\\d+", 0))
  .withColumn("t3", regexp_extract($"list", "(id )(\\d+)", 2))
  .withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
  .show()
You can use functions like split and regexp_extract with withColumn to create new columns based on existing values. split splits the string into an array on the delimiter you pass in. I have used a space here, escaped with two backslashes. The array is zero-based, hence specifying 2 gets the third item in the array.
regexp_extract uses regular expressions to extract from strings. For t2 I've used \\d, which represents a digit, and +, which matches one or more of them. The third column, t3, again uses regexp_extract with a similar regular expression, but with brackets to group sections and 2 to get the second group from the regex, i.e. the (\\d+). The fourth column, t4, looks for "ID ", one uppercase letter and then five digits, and returns the digits (group 1). NB the extra backslash in \\d escapes the backslash of \d inside the Scala string literal.
If your real data is more complicated please post a few simple examples where this code does not work and explain why.
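To see what the t4 pattern does on the longer strings from the question's update, here is a minimal sketch using Python's re module rather than Spark. The regular expression is the same one used for t4; the function name extract_id and the choice to return an empty string when nothing matches are my own assumptions for the example.

```python
import re

# Sample strings taken from the question's update.
samples = [
    ":option car, lorem :ipsum: :ison, ID R21234, llor ip",
    "lst ID X49329xas ipsum :ion: ip_s-",
]

# "ID ", one uppercase letter, then five captured digits (same pattern as t4).
ID_PATTERN = re.compile(r"ID [A-Z](\d{5})")


def extract_id(text):
    """Return the five digits after 'ID <letter>', or '' if there is no match."""
    match = ID_PATTERN.search(text)
    return match.group(1) if match else ""


ids = [extract_id(s) for s in samples]
print(ids)
```

Note that the pattern is case-sensitive, so a string such as "samb id 12" from the first example yields no match here.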

Using LINQ to search comma separated string

A table has a column called AssignedTo whose value is a comma-separated string. Possible values look like this:
"1"
"2"
"3"
"11"
"12"
"1,2"
"1,3"
"2,3"
"1,2,3"
"1,3,11"
"1,3,12"
If I search with the following LINQ query and value = "1":
records = records.Where(x => x.AssignedTo.Contains(value) || search == null);
it returns the records with these AssignedTo values:
"1", "11", "12", "1,2", "1,3", "1,2,3", "1,3,11", "1,3,12"
I only want the records whose AssignedTo list contains the item "1",
which are "1", "1,2", "1,3", "1,2,3", "1,3,11", "1,3,12". I do not want "11" and "12".
If I use the following LINQ query instead, still with value = "1":
records = records.Where(x => x.AssignedTo.Contains("," + value + ",") ||
                             x.AssignedTo.StartsWith(value + ",") ||
                             x.AssignedTo.EndsWith("," + value) ||
                             value == null);
it returns the records with AssignedTo value "1,2", "1,3", "1,2,3", "1,3,11", "1,3,12", but misses the record with AssignedTo value "1".
Since this is probably a search filter, doing the operation in memory isn't a good option unless the row count is guaranteed to be manageable. Ideally this should be refactored to use a proper relational structure rather than a comma-delimited string.
However, your example was mostly there; it was just missing an equality check to catch the value by itself. I'd also take the 'value == null' check out of the LINQ expression and use it as a condition for whether to add the Where clause at all. With the check inside the LINQ expression it is translated into the SQL, whereas by pre-checking you avoid the SQL conditions altogether when no value is specified.
if (!string.IsNullOrEmpty(value))
    records = records.Where(x => x.AssignedTo.Contains("," + value + ",") ||
                                 x.AssignedTo.StartsWith(value + ",") ||
                                 x.AssignedTo.EndsWith("," + value) ||
                                 x.AssignedTo == value);
This would catch "n,...", "...,n,...", "...,n", and "n".
A better method would be to split the string and search the results:
records = records.Where(x => x.AssignedTo.Split(',').Contains(value) || search == null);
Note that you can't use this method directly in an EF query, since there's no way to translate it to standard SQL. So you may want to filter with your Contains as a first pass (to reduce the number of rows fetched, false positives included) and then filter in memory:
records = records.Where(x => x.AssignedTo.Contains(value) || search == null)
                 .AsEnumerable() // do subsequent filtering in-memory
                 .Where(x => x.AssignedTo.Split(',').Contains(value) || search == null);
Or redesign your database to use related tables rather than storing a comma-delimited list of strings...
If you are building a LINQ expression against a database, the Split function will throw an error. In that case you can wrap both the column and the value in commas and use the expression below:
if (!string.IsNullOrEmpty(value))
{
    records = records.Where(x => (',' + x.AssignedTo + ',').Contains(',' + value + ','));
}
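The comma-wrapping trick and the split approach can be checked side by side with a small sketch. It is written in Python rather than C#, purely to show the string logic on the sample values from the question; the function names are made up for the example.

```python
# Sample AssignedTo values from the question.
values = ["1", "2", "3", "11", "12", "1,2", "1,3", "2,3",
          "1,2,3", "1,3,11", "1,3,12"]


def contains_item(assigned_to, value):
    """Wrap both sides in commas so only whole items can match."""
    return ("," + value + ",") in ("," + assigned_to + ",")


def contains_item_split(assigned_to, value):
    """Split on commas and look for the value among the items."""
    return value in assigned_to.split(",")


matches = [v for v in values if contains_item(v, "1")]
print(matches)
```

Both functions select the same records, including the bare "1" and excluding "11" and "12".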

Getting null fields in crystal reports

I have a customer name made up of first name, middle name and last name, and I want to concatenate these fields. When I concatenate them, the whole result comes back null for any record in which at least one of the fields is null. I have tried this formula:
{FIRST_NAME}&","&{MIDDLE_NAME}&","&{LAST_NAME}
Example: if a record has a first name and a last name but the middle name is null, the entire result is null.
How can I resolve this?
You'll probably want to wrap each field in a formula that adjusts for nulls (and likewise for MIDDLE_NAME and LAST_NAME):
// {@FIRST_NAME}
If Isnull({table.FIRST_NAME}) Then "" Else {table.FIRST_NAME}
Then create a formula to concatenate them:
// {@FULL_NAME}
{@FIRST_NAME} + "," + {@MIDDLE_NAME} + "," + {@LAST_NAME}
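The same idea, replacing each null with an empty string before joining, looks like this as a plain Python sketch. It only illustrates the logic and is not Crystal syntax; the function names and the sample name are invented for the example.

```python
def or_empty(value):
    """Map a missing value (None) to an empty string."""
    return "" if value is None else value


def full_name(first, middle, last):
    """Join the three parts with commas, treating None as empty."""
    return ",".join(or_empty(part) for part in (first, middle, last))


print(full_name("John", None, "Smith"))
```

A missing middle name then leaves an empty slot between the commas instead of wiping out the whole result.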

concatenate without losing thousands separator

I have a report that shows total sales and total probable sales.
The request was that these be shown in one table cell as "R{totalamount} (R{totprobamount})".
So I combined them in a variable with this variable expression:
"R" + $F{Totalt} +" (R" + $F{Totalp} +")"
but when I do this the thousands separator no longer shows. How can I keep it?
If you can use a separate text field for each value, don't use string concatenation; set a pattern on each text field instead. In the properties panel, give each field a pattern such as R #,##0.00.
If it has to be a single field, the expression must format the numbers itself, for example: "R" + new java.text.DecimalFormat("#,##0.00").format($F{Totalt}) + " (R" + new java.text.DecimalFormat("#,##0.00").format($F{Totalp}) + ")"
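The point of the single-field approach is that each number is formatted to a string first and only then concatenated. A rough Python equivalent, assuming two decimal places and a comma as the grouping separator, is shown below; the function name and the sample amounts are made up.

```python
def format_totals(total, probable):
    """Format both amounts with thousands separators and two decimals."""
    return "R{:,.2f} (R{:,.2f})".format(total, probable)


print(format_totals(1234567.5, 98765.432))
```

This prints R1,234,567.50 (R98,765.43).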
You can use the FORMAT function to get a thousands separator.
FORMAT({totalamount} +{totprobamount},2)
The 2 means two decimal places. The result is a string, so add it as a separate column; you can't use the same column as a numeric value.

MongoDB - Contains (LIKE) query on concatenated field

I am new to MongoDB.
I am writing an application with Spring Data and MongoDB, and I have a class with two fields: firstname and lastname.
I need a query for documents whose full name (firstname + lastname) contains a given string.
For example: firstname = "Hansen David", lastname = "Gonzalez Sastoque", and I query for David Gonzalez. In this example I expect a match.
Concatenating the two strings would solve it, but I don't know how to do that in a query.
Create a new array field (call it names) in the document and in that array put each name split by space. In your example the array would have the following contents:
hansen
david
gonzalez
sastoque
(make them all lower case to prevent case-sensitivity issues)
Before you do your query, convert your input to lower case and split it by spaces as well.
Now, you can use the $all operator to achieve your objective:
db.persons.find( { names: { $all: [ "david", "gonzalez" ] } } );
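What this approach does, lower-casing and splitting both the stored names and the search input and then requiring every search token to be present, can be sketched in Python without a database. The helper names are hypothetical, and the set comparison only mimics the effect of $all; it is not how MongoDB implements it.

```python
def name_tokens(firstname, lastname):
    """Build the lower-case 'names' array stored with the document."""
    return (firstname + " " + lastname).lower().split()


def matches_all(names, query):
    """True if every lower-cased token of the query is in names."""
    return set(query.lower().split()) <= set(names)


names = name_tokens("Hansen David", "Gonzalez Sastoque")
print(names)
print(matches_all(names, "David Gonzalez"))
```

Because the tokens are compared as a set, word order in the search input does not matter.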
You can use the $where operator in your queries:
db.users.findOne({$where: %JavaScript to match the document%})
In your case it may look like this:
db.users.findOne({$where: "this.firstname + ' ' + this.lastname == 'Hansen David Gonzalez Sastoque'"})
or this:
db.users.findOne({$where: "this.firstname.match(/David/) && this.lastname.match(/Gonzalez/)"})
The last example matches the document from your question when searching for David Gonzalez.
Update: Try the following code:
db.users.findOne({$where: "(this.firstname + ' ' + this.lastname).match('David Gonzalez'.replace(' ', '( .*)? '))"})
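The replace call above turns the search string 'David Gonzalez' into the pattern 'David( .*)? Gonzalez', which allows extra words between the two names. Here is a small sketch of the same matching in Python, assuming a two-word search string; the function name is made up, and the search input is not escaped, so regex metacharacters in it would be interpreted as such.

```python
import re


def full_name_matches(firstname, lastname, query):
    """Match query against 'firstname lastname', allowing words in between."""
    # Replace only the first space, as JavaScript's replace() does
    # when it is given a string rather than a regex.
    pattern = query.replace(" ", "( .*)? ", 1)
    return re.search(pattern, firstname + " " + lastname) is not None


print(full_name_matches("Hansen David", "Gonzalez Sastoque", "David Gonzalez"))
```

Unlike the token-based approach, this match is case-sensitive and depends on the order of the words.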
You should split your full name into a first name and a last name, then do your query on both fields, using the appropriate MongoDB query selectors.