DataFrame keying using the pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to use, on my own data, an example I saw in one of Wes's videos and notebooks. I have a CSV file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I use the file path, but as I understand it, and as I saw in previous examples, I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.

Rutger, you are not correct. It is possible to partially index a MultiIndex Series; I simply did it the wrong way.
The first index level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ keys) alone will fail.

You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you convert it to a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; you could, for example, test for multiple values at once:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
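Putting the pieces together, here is a minimal runnable sketch with toy data; the short file names and Cust_ labels are illustrative stand-ins for the asker's actual values:
import pandas as pd

# Toy data mimicking the CSV above
df = pd.DataFrame({
    'filePath': ['a.wav', 'a.wav', 'a.wav', 'b.wav'],
    'vp': ['Cust_1', 'Cust_2', 'Cust_1', 'Cust_1'],
    'score': [-2, -80, 25, 11],
})

res = df.groupby(['filePath', 'vp']).size()

# Partial indexing works on the first level only:
res['a.wav']                   # per-vp counts for a.wav
# res['Cust_1']                # KeyError: not a first-level key

# Full-key access works with a tuple, or by chaining partial indexes:
res[('a.wav', 'Cust_1')]       # 2
res['a.wav']['Cust_1']         # 2

# Slicing on the second level alone: .xs on a DataFrame, or boolean indexing:
res_df = pd.DataFrame(res)
res_df.xs('Cust_1', level=1)
res[res.index.get_level_values(1) == 'Cust_1']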

Check for list of substrings inside string column in PySpark

To check whether a single string is contained in the rows of a column (for example, "abc" is contained in "abcdef"), the following code is useful:
df_filtered = df.filter(df.columnName.contains('abc'))
The result would be, for example, "_wordabc", "thisabce", "2abc1".
How can I check for multiple strings (for example ['ab1','cd2','ef3']) at the same time?
I'm ideally searching for something like this:
df_filtered = df.filter(df.columnName.contains(['word1','word2','word3']))
The result would be, for example, "x_ab1", "_cd2_", "abef3".
Please post scalable solutions (no for loops, for example) because the aim is to check a big list of around 1,000 elements.
All you need is isin:
df_filtered = df.filter(df['columnName'].isin('word1', 'word2', 'word3'))
Edit
Since isin only matches complete values, not substrings, you need the rlike function to achieve your result:
words="(aaa|bbb|ccc)"
df.filter(df['columnName'].rlike(words))
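With a list of around 1,000 substrings, the alternation pattern can be built programmatically rather than written by hand. A minimal sketch, assuming the substrings are plain text (re.escape guards against regex metacharacters); the toy rows and names are illustrative:
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('x_ab1',), ('_cd2_',), ('no_match',)], ['columnName'])

substrings = ['ab1', 'cd2', 'ef3']  # stands in for the ~1000-element list
pattern = '(' + '|'.join(re.escape(s) for s in substrings) + ')'

df_filtered = df.filter(df['columnName'].rlike(pattern))
df_filtered.show()  # keeps x_ab1 and _cd2_, drops no_match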

Exporting the output of MATLAB's methodsview

MATLAB's methodsview tool is handy when exploring the API provided by external classes (Java, COM, etc.). Below is an example of how this function works:
myApp = actxserver('Excel.Application');
methodsview(myApp)
I want to keep the information in this window for future reference, by exporting it to a table, a cell array of strings, a .csv or another similar format, preferably without using external tools.
Some things I tried:
This window allows selecting one line at a time and doing "Ctrl+c Ctrl+v" on it, which results in a tab-separated text that looks like this:
Variant GetCustomListContents (handle, int32)
Such a strategy can work when there are only a few methods, but it is not viable for the usually-encountered long lists.
I could not find a way to access the table data via the figure handle (without using external tools like findjobj or uiinspect), as findall(0,'Type','Figure') "does not see" the methodsview window/figure at all.
My MATLAB version is R2015a.
Fortunately, the methodsview.m file is accessible and gives some insight into how the function works. Inside is the following comment:
%// Internal use only: option is optional and if present and equal to
%// 'noUI' this function returns methods information without displaying
%// the table.
After some trial and error, I saw that the following works:
[titles,data] = methodsview(myApp,'noui');
... and returns two arrays of type java.lang.String[][].
From there I found a couple of ways to present the data in a meaningful way:
Table:
dataTable = cell2table(cell(data));
dataTable.Properties.VariableNames = matlab.lang.makeValidName(cell(titles));
Cell array:
dataCell = [cell(titles).'; cell(data)];
Important note: in the table case, the "Return Type" column title gets renamed to ReturnType, since table variable names have to be valid MATLAB identifiers, as mentioned in the docs.

Using a CSV feeder to check a count, i.e. convert to Int

I am using Gatling 2.0.0-SNAPSHOT (from a few months ago) and I have a CSV file:
search,min_results
test,1000
testing,50
...
I would, ideally, like to check that the number of results is equal to or greater than the value in the min_results column, something like:
.get(...)
.check(
  jsonPath("$..results")
    .count
    .is(>= "${min_results}".toInt))
The last line doesn't work, as it isn't valid Scala, and in any case "${min_results}".toInt tries to convert the literal string ${min_results}, rather than its value, to an Int.
I would settle for just fixing the toInt conversion problem, though this might result in a slightly less robust script. However, I could then add a max_results column and use .in(xx to yy).
Thanks
In this case, toInt definitely won't work, as it would be called before the Gatling EL string has been resolved.
However, this code snippet would achieve what you want:
.get(...)
.check(
  jsonPath("$..results")
    .count
    .greaterThanOrEqual(session => session("min_results").as[String].toInt))
Here, as[String].toInt works as expected, since the min_results attribute has already been resolved from the session at this point.

Create ordinal array with multiple groups

I need to categorize a dataset according to different age groups. The categorization depends on whether the Sex is Male or Female. I first subset the data by gender and then use the ordinal function (the dataset is from a MATLAB example). The following code crashes on the last line, when I try to vertically concatenate the subsets:
load hospital;
subset_m=hospital(hospital.Sex=='Male',:);
subset_f=hospital(hospital.Sex=='Female',:);
edges_f=[0 20 max(subset_f.Age)];
edges_m=[0 30 max(subset_m.Age)];
labels_m = {'0-29','30+'};
labels_f = {'0-19','20+'};
subset_m.AgeGroup= ordinal(subset_m.Age,labels_m,[],edges_m);
subset_f.AgeGroup = ordinal(subset_f.Age,labels_f,[],edges_f);
vertcat(subset_m,subset_f);
Error using dataset/vertcat (line 76)
Could not concatenate the dataset variable 'AgeGroup' using VERTCAT.
Caused by:
Error using ordinal/vertcat (line 36)
Ordinal levels and their ordering must be identical.
Edit
It seems that a vital part was missing in the question; here is the answer to the corrected question. You need to use join rather than vertcat, for example:
joinFull = join(subset_f,subset_m,'LeftKeys','LastName','RightKeys','LastName','type','rightouter','mergekeys',true)
Solution to the original problem
It seems like you are actually trying to work with the wrong variable. If I change all instances of hospitalCopy to hospital, then everything works fine for me.
Perhaps you copied hospital and edited it, thus losing the validity of the input.
If you really need to have hospitalCopy, make sure to assign to it directly after load hospital.
If this does not help, try using clear all before running the code, and make sure there is no file called 'hospital' in your current directory.

Data Processing, how to approach

I have the following problem, given this XML data structure:
<level1>
  <level2ElementTypeA></level2ElementTypeA>
  <level2ElementTypeB>
    <level3ElementTypeA>String1Ineed</level3ElementTypeA>
  </level2ElementTypeB>
  ...
  <level2ElementTypeC>
    <level3ElementTypeB attribute1="...">
      <level4ElementTypeA>String2Ineed</level4ElementTypeA>
    </level3ElementTypeB>
  </level2ElementTypeC>
  ...
  <level2ElementTypeD></level2ElementTypeD>
</level1>
<level1>...</level1>
I need to create an entity which contains String1Ineed and String2Ineed.
So every time I come across a level3ElementTypeB with a certain value in attribute1, I have my String2Ineed. The ugly part is obtaining String1Ineed, which is located in the first element of type level2ElementTypeB above the current level2ElementTypeC.
My 'imperative' solution is to always keep a variable holding the last value of String1Ineed, and when I hit the criteria for String2Ineed, I simply use it. Looking at this from a plain collection-processing point of view, how would you model the backtracking logic between String1Ineed and String2Ineed? Using the State Monad?
Isn't this what XPath is for? You can find String2Ineed and then change the axis to search backwards for String1Ineed.
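A minimal sketch of that XPath idea in Python with lxml; the simplified document and the attribute value "special" are illustrative assumptions standing in for "a certain value in attribute1":
from lxml import etree

xml = b'''
<root>
  <level1>
    <level2ElementTypeB>
      <level3ElementTypeA>String1Ineed</level3ElementTypeA>
    </level2ElementTypeB>
    <level2ElementTypeC>
      <level3ElementTypeB attribute1="special">
        <level4ElementTypeA>String2Ineed</level4ElementTypeA>
      </level3ElementTypeB>
    </level2ElementTypeC>
  </level1>
</root>
'''

tree = etree.fromstring(xml)
for hit in tree.xpath('//level3ElementTypeB[@attribute1="special"]'):
    string2 = hit.findtext('level4ElementTypeA')
    # "preceding" is a reverse axis, so [1] selects the nearest
    # level3ElementTypeA before this element in document order.
    string1 = hit.xpath('preceding::level3ElementTypeA[1]/text()')[0]
    print(string1, string2)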