How to extract column names from spark sql Column? - scala

I am very new to the Scala world and have been stuck on a problem for quite some time now, so I'm seeking out your help!
I have a variable defined as Column (someExpression), for example:
val someExpr = Column( "concat( col(\"name\") && col(\"address\") )")
Is there a way I can get the names of the columns used in the expression inside Column()? I tried to play around with UnresolvedAttribute but wasn't able to make it work for my use case.
I'd be happy to provide any extra information or code samples if my question isn't clear in its current form.
Thank you so much!
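One possible direction, offered only as a rough sketch: it assumes the expression is built with the DataFrame DSL (concat, col) rather than a raw string, and that it's acceptable to lean on Spark's internal Catalyst classes, which are not part of the stable public API.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, concat}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// Build the Column with the DSL instead of a raw string expression.
val someExpr: Column = concat(col("name"), col("address"))

// Column.expr exposes the underlying Catalyst expression tree;
// collect the names of the unresolved attributes referenced in it.
val colNames: Seq[String] = someExpr.expr.collect {
  case a: UnresolvedAttribute => a.name
}
// colNames should contain "name" and "address"

If the expression really does start life as a SQL string, org.apache.spark.sql.functions.expr("concat(name, address)") produces a Column whose tree can be walked the same way.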

Related

Pyspark adding columns to existing dataframe

I am trying to add multiple columns to the right. How can I do that?
Attributes = ["RequestTypePesId","AgentId","UpdatedBy","CauseType","OriginatingSystem"]
for i in Attributes:
    a = df2load.select(i).distinct()
    b = a.join(b, a.select(i) == b.select(i), "fullouter")
The expected output was shown as a screenshot in the original post.
Check out this example:
https://stackoverflow.com/a/71966176/9658895
and since you’re new, you might find this article useful:
“Hello World” of PySpark for Python & Pandas User [Pandas Vs PySpark]
Lastly, before you ask your next question, please make sure that you have followed the standard protocol for asking a question. This video might help you.
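If it helps, here is a rough sketch of the side-by-side pattern the question seems to be after. It assumes df2load is the source frame; the _rn column is a throwaway row number introduced purely to give the join a key, and the ordering of values within each column is arbitrary.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

attributes = ["RequestTypePesId", "AgentId", "UpdatedBy", "CauseType", "OriginatingSystem"]

result = None
for name in attributes:
    # Distinct values of one attribute, numbered so the frames can be joined side by side.
    part = (df2load.select(name).distinct()
            .withColumn("_rn", F.row_number().over(Window.orderBy(F.lit(1)))))
    result = part if result is None else result.join(part, on="_rn", how="full_outer")

result = result.drop("_rn")
result.show()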

How to use In operator for comparing multiple values in EXCEL with Drools

I am trying to use Drools with an Excel decision table to compare some product values: I have P1, P2, P3...P10, and I am comparing them with an incoming value from the request.
I tried it in the way shown below, but I am receiving an error.
Can you please help me fix this issue? I hope I am making a simple mistake, but I'm not able to find it.
Thanks in advance.
You have too many parentheses.
The value in the product column is substituted as is into $param. The correct syntax we're looking for is ... in ("P1", "P2"); however you've provided ("P1", "P2") as the product value. This transforms to in (("P1", "P2")), which is not correct.
Remove the parentheses from the product cell: "P1", "P2".
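To make the substitution concrete (the in ($param) condition template below is hypothetical, since the decision table itself isn't shown): if the condition column is defined as

product in ($param)

then a cell containing

"P1", "P2"

expands to product in ("P1", "P2"), which is valid, while a cell containing ("P1", "P2") expands to product in (("P1", "P2")), which is what triggers the error.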

Issues with "QUERY(IMPORTRANGE)"

Here's my first question on this forum, though I've read through a lot of good answers here.
Can anyone tell me what I'm doing wrong with my attempt to do a query import from one sheet to a column in another?
Here's the formula I've tried, but all my adjustments still get me a parsing error.
=QUERY(IMPORTRANGE("https://docs.google.com/spreadsheets/d/1yGPdI0eBRNltMQ3Wr8E2cw-wNlysZd-XY3mtAnEyLLY/edit#gid=163356401","Master Treatment Log (Responses)!V2:V")"WHERE Col8="'&B2&'")")
Note that importrange is only needed for imports between spreadsheets. If you only import from one sheet into another within the same spreadsheet I would suggest using filter() or query().
Assuming the value in B2 is actually a string (and not a number), you can try
=QUERY(IMPORTRANGE("https://docs.google.com/spreadsheets/d/1yGPdI0eBRNltMQ3Wr8E2cw-wNlysZd-XY3mtAnEyLLY/edit#gid=163356401","Master Treatment Log (Responses)!V2:V"), "WHERE Col8 = '"&B2&"'", 0)
Note the added comma before "WHERE". If you want to import a header row, change 0 to 1.
See if that helps. If not, please share a copy of your spreadsheet (with sensitive data erased).

dataFrame keying using pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to use an example I saw in one of Wes's videos and notebooks on my data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a data frame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I put in the filepath, but as I understand it, and as I saw in previous examples, I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex series; I simply did it the wrong way.
The index's first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; you could, for example, test for multiple values at once:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
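Putting it together, here is a minimal sketch; the file name scores.csv is hypothetical and just stands in for the CSV shown above.

import pandas as pd

# columns: filePath, vp, score (as in the sample above)
df = pd.read_csv('scores.csv')

# Grouping by two columns yields a Series with a two-level MultiIndex.
res = df.groupby(['filePath', 'vp']).size()

# Slice the inner level via a DataFrame and .xs ...
print(pd.DataFrame(res).xs('Cust_4062144116', level=1))

# ... or keep the Series and filter on the level values directly.
print(res[res.index.get_level_values(1) == 'Cust_4062144116'])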

Crystal Formula to exclude if value does not contain a certain format

I'm trying to exclude any value from a certain field (table.value) that does not match the format AA#####A. For example, if they entered APT12345T, PT12345PT, or no value, then I want to exclude it from the report. It needs to match, for example, AP12345P. What selection formula can I use to accomplish this?
Any help is greatly appreciated.
Thanks in advance.
try reading Crystal's help topics on the mid() and isnumeric() functions.
here's an example from the help file:
Examples: The following example is applicable to both Basic and Crystal syntax:
Mid("abcdef", 3, 2)
Returns "cd".
so, in your case, you want to split your value into three pieces:
mid({table.value},1,2)
mid({table.value},3,5)
mid({table.value},8,1)
and build up a three-part boolean condition where:
the first piece is not isnumeric(), or is between 'AA' and 'ZZ', or however else you want to test for letters,
the second piece is isnumeric(), and
the third piece passes the same test as the first.
where are you getting stuck?
something like this:
not isnumeric(mid({table.field},1,2)) and
isnumeric(mid({table.field},3,5)) and
not isnumeric(mid({table.field},8,1))