regexp_extract function - Spark scala getting error - scala

Here are the sample records
SYSTEM, paid18.26 toward test
sys, paid $861.82 toward your
L, paid $1119.00toward your
I need to extract the data between paid and toward. I have written the statement below, but I am not getting the desired output.
Expected Output
Please let me know where exactly the error is.

Assuming the amount always comes between the strings "paid" and "toward":
val amount = df.withColumn(
  "amount",
  regexp_extract(col("message_comment_txt"), "^paid(.*)toward.*", 1)
)
The above snippet adds a new column amount to the dataset/df. It doesn't check / replace the $ symbol though. That can be replaced in the next step if this works fine as expected in all your cases.
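To check the pattern against the three sample records outside Spark, here is a minimal Python sketch of the same idea. It rests on a few assumptions that are not in the answer above: the pattern is unanchored (no leading ^) so that a prefix such as "SYSTEM, " before "paid" does not block the match, the optional $ is dropped inside the pattern, and extract_amount is a hypothetical helper name. Like regexp_extract, it returns an empty string when nothing matches.

```python
import re

# Assumption: the amount is whatever sits between "paid" and "toward",
# optionally preceded by "$" and surrounded by optional whitespace.
PAID_TOWARD = re.compile(r"paid\s*\$?\s*(.*?)\s*toward")


def extract_amount(message):
    """Return the text between 'paid' and 'toward', or "" if there is no match."""
    match = PAID_TOWARD.search(message)
    return match.group(1) if match else ""


samples = [
    "SYSTEM, paid18.26 toward test",
    "sys, paid $861.82 toward your",
    "L, paid $1119.00toward your",
]

amounts = [extract_amount(s) for s in samples]
print(amounts)
```

If the Spark column really does start with "paid", the anchored pattern from the answer behaves the same way; otherwise dropping the ^ in the Scala call may be worth trying.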


Why is Arrayformula returning only the first row

Update: sample sheet provided here: Any help will be appreciated!
Hi fellow nerds.
I'm trying to make the current column (most recent interaction date with client) display the max values (most recent dates) from ContactLog!b:b (dates of all recorded interactions), when the client name in ContactLog!A:A matches to the client name in current row column A.
After many days of trying, I've found several formulas to successfully achieve this result for the current cell only.
=MAXIFS(ContactLog!B:B, ContactLog!A:A, A:A)
=MAX(FILTER(ContactLog!B4:B, ContactLog!A4:A=VLOOKUP(A2, ContactLog!A4:B, 1, FALSE)))
=MAX(QUERY(ContactLog!A4:B, "SELECT B WHERE A = '"&VLOOKUP(A2, ContactLog!A4:B, 1, FALSE)&"'", 0))
=IF(COUNTIF(ContactLog!A:A, A2),MAX(FILTER(ContactLog!B:B, ContactLog!A:A = A2)),"")
But none of these seem to work with arrayformula, to spread to the entire column. I'd like this result to apply automatically to the entire column (wherever column A is not blank).
It's displaying the correct max value for the first cell (in which the formula is written), and I could drag the formula down, but not spreading automatically as an array.
I've tried using =match with =filter, but that keeps running into mismatched range row sizes. (I've previously solved that by using filter within a filter, but can't figure that out here).
[I have a similar issue for the nearby columns also, "most recent interaction method", and "reminders & goals". The formula there is:
=INDEX(ContactLog!C:C, MATCH(MAX(IF(ContactLog!A:A=A2, IF(ContactLog!B:B=MAX(IF(ContactLog!A:A=A2, ContactLog!B:B)), ROW(ContactLog!B:B)))), ROW(ContactLog!B:B), 0))
=IFERROR(CONCATENATE(JOIN(" • ",FILTER(ContactLog!D:D,ContactLog!A:A=A2, ContactLog!D:D<>"")),IF(INDEX(ContactLog!D:D,MAX(IF(ContactLog!A:A=A2,ROW(ContactLog!D:D))))="","","")),"")
They both work great, but I can't get them to work with arrayformula...]
What am I missing?
You can do something like this with BYROW, which lets your formula expand down the column while still being calculated row by row. Using your first option:
=BYROW(A:A, LAMBDA (each,IF(each="","",MAXIFS(ContactLog!B:B, ContactLog!A:A, each))))
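If it helps to see the row-by-row logic spelled out, here is a rough Python sketch of what the BYROW/MAXIFS combination computes. The data and the names contact_log, clients and latest_contact are made up for illustration. One difference to be aware of: for a name with no log entries this sketch returns an empty string, whereas MAXIFS in Sheets gives 0 when nothing matches.

```python
from datetime import date

# Hypothetical stand-ins for ContactLog!A:B and for the names in column A.
contact_log = [
    ("Alice", date(2023, 1, 5)),
    ("Bob", date(2023, 2, 1)),
    ("Alice", date(2023, 3, 9)),
]
clients = ["Alice", "", "Bob", "Carol"]


def latest_contact(name, log):
    """Return the most recent date logged for name, or "" if blank or unmatched."""
    if name == "":
        return ""
    dates = [when for who, when in log if who == name]
    return max(dates) if dates else ""


# One result per row of the client column, like BYROW applying the LAMBDA.
most_recent = [latest_contact(name, contact_log) for name in clients]
print(most_recent)
```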


Here's my first question on this forum, though I've read through a lot of good answers here.
Can anyone tell me what I'm doing wrong with my attempt to do a query import from one sheet to a column in another?
Here's the formula I've tried, but all my adjustments still get me a parsing error.
=QUERY(IMPORTRANGE("","Master Treatment Log (Responses)!V2:V")"WHERE Col8="'&B2&'")")
Note that importrange is only needed for imports between spreadsheets. If you only import from one sheet into another within the same spreadsheet I would suggest using filter() or query().
Assuming the value in B2 is actually a string (and not a number), you can try
=QUERY(IMPORTRANGE("","Master Treatment Log (Responses)!V2:V"), "WHERE Col8='"&B2&"'", 0)
Note the added comma before "WHERE". If you want to import a header row, change 0 to 1.
See if that helps? If not, please share a copy of your spreadsheet (sensitive data erased).

dataframe: how to groupBy/count then filter on count in Scala

Spark 1.4.1
I encounter a situation where grouping by a dataframe, then counting and filtering on the 'count' column raises the exception below
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
  .filter("count >= 2")
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found
count >= 2
Renaming the column makes the problem vanish (I suspect because there is then no conflict with the interpolated 'count' function):
df.groupBy("x").count()
  .withColumnRenamed("count", "n")
  .filter("n >= 2")
So, is that a behavior to expect, a bug or is there a canonical way to go around?
thanks, alex
When you pass a string to the filter function, the string is interpreted as SQL. Count is a SQL keyword and using count as a variable confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
  .filter($"count" >= 2)
So, is that a behavior to expect, a bug
Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and so expects parentheses to follow. It looks like a bug, or at least a serious limitation of the parser.
is there a canonical way to go around?
Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:
import org.apache.spark.sql.functions.count
df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)
I think a solution is to put count in back ticks
df.groupBy("x").count()
  .filter("`count` >= 2")

celery: Weird signs in result_backend (postgresql-db)

I am using a celery installation configured with a result_backend postgresql-database.
It's working ok so far, but I have a little problem with the "formatting" of the result in the db.
I am returning several values (a string, an Exception with an error message, an int) from my task (the function under @app.task).
But when I look at the "celery_taskmeta" table, which celery writes to when the result_backend option is set, I see some more 'crypted' values next to the expected values (screenshot of the select result).
Everything circled in red are the expected values. The 273 at the bottom for example is the octal representation of the int I returned.
Of course I can filter out all that unwanted stuff but if it contains some useful information or might be different under some other circumstances...
Does anyone know why there are those strange signs?
Thanks for your help,
Don't query the database directly to get the result data. Celery provides an API for this; see the docs.
In your case you can do this to get the output of your task and any traceback:
cur.execute("SELECT task_id FROM celery_taskmeta WHERE date_done ...")
rows = cur.fetchall()
for row in rows:
    task_id = row[0]
    # you'll need to do something different if you have GroupResult
    result = celery.AsyncResult(task_id)
    if result.traceback:
        print(result.traceback)  # assumed body: the original snippet is cut off here
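
As for why the stored value looks 'crypted': a likely explanation, assuming the database backend stores the result column as a pickled blob (worth checking against your Celery version and serializer settings), is that you are looking at pickle opcodes wrapped around your values. The small sketch below uses a made-up payload and pickle protocol 2 purely for illustration. It shows that a small int such as 187 is written as the opcode K followed by the byte 0xBB, which a bytea column displays in octal as \273.

```python
import pickle

# Hypothetical task return value: a string, an exception and an int.
payload = ("done", ValueError("bad input"), 187)

# Assumption: protocol 2 is used here only to keep the output short.
blob = pickle.dumps(payload, protocol=2)
print(blob)        # readable values mixed with opcode bytes
print(oct(187))    # the octal form of the int

# Unpickling gives the original values back, which is what the result API does for you.
restored = pickle.loads(blob)
print(restored[0], restored[1].args, restored[2])
```

This is why going through AsyncResult is the safer route: it deserializes the blob instead of leaving you to filter the bytes by hand.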

dataFrame keying using pandas groupby method

I am new to pandas and trying to learn how to work with it. I'm having a problem applying, to my own data, an example I saw in one of Wes's videos and notebooks. I have a csv file that looks like this:
I am loading it into a data frame and then grouping it by "filePath" and "vp". The code is:
res = df.groupby(['filePath','vp']).size()
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Now I'm trying to access the index like a dict, as I saw in examples, but when I'm doing
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I put in the file path, but as I understand it, and as I saw in previous examples, I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to "partially" index a MultiIndex series. I simply did it the wrong way.
The first level of the index is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name I have multiple vps.
Now, this is correct:
and will return:
Cust_2102513187 2
Cust_4062144116 8
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; you could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
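
To make the difference between the two index levels concrete, here is a small self-contained sketch. The data is invented (short file names and Cust_1/Cust_2 stand in for the real values), and it assumes a reasonably recent pandas in which a Series also supports .xs, so the conversion to a DataFrame is optional there.

```python
import pandas as pd

# Hypothetical data standing in for the CSV from the question.
df = pd.DataFrame({
    "filePath": ["a.wav", "a.wav", "a.wav", "b.wav"],
    "vp": ["Cust_1", "Cust_1", "Cust_2", "Cust_2"],
})

# Series with a two-level MultiIndex: (filePath, vp) -> number of rows.
res = df.groupby(["filePath", "vp"]).size()

# Indexing by the outer level works directly and returns the inner level.
outer = res["a.wav"]

# The inner level needs an explicit cross-section on that level...
inner = res.xs("Cust_2", level="vp")

# ...or a boolean mask built from the level values.
masked = res[res.index.get_level_values("vp") == "Cust_2"]

print(outer)
print(inner)
print(masked)
```

Indexing with res["Cust_2"] directly fails with a KeyError, because a plain key is looked up in the first level only.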