PigLatin - Removing A Tuple/Field From Relation - filtering

This is my first post on StackOverflow, so pardon me in advance if this is too lengthy and/or vague.
I have a Pig relation as follows:
my_relation = LOAD '/path/to/data' USING PigStorage(',')
AS (f1:chararray, f2:chararray, f3:chararray);
Now, I wanted to filter out the field 'f3' from the above relation 'my_relation'. I know I could do it like so:
my_new_relation = FOREACH my_relation GENERATE my_relation.f1, my_relation.f2;
The problem with this method comes when I have a large number of fields/tuples in the 'my_relation' relation. Say, my_relation had 900 fields/tuples, and I wanted all of them but for one. Now, with the above method, I'd have to list out 899 fields/tuples after my 'GENERATE' keyword!
My question: Is there an easy way to filter out a handful of fields/tuples from a relation in PigLatin?
Prior: My prior on Apache Pig and PigLatin in general is very weak (as can be told by the difficulty of this question). I'm still reading through the Pig documentation found here.
Thanks for reading this question! Any/all help is appreciated!

First of all, your syntax is not quite right: inside a FOREACH, fields are referenced by name, without the relation alias as a prefix. If you wanted to keep only the fields f1 and f2, you would do it like this:
my_new_relation = FOREACH my_relation GENERATE f1, f2;
As to your question, you can use a project-range expression:
my_new_relation = FOREACH my_relation GENERATE f1 .. f345, f347 .. f900;
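If the field names do not sit next to each other in a way that a range can express, one option is to generate the statement with a small script instead of typing it by hand. The following is only a sketch in Python, assuming the schema is available as a list of names; `build_generate` and its parameters are hypothetical helpers, not part of Pig.

```python
def build_generate(fields, drop, source="my_relation", target="my_new_relation"):
    """Return a Pig FOREACH ... GENERATE statement keeping every field not in `drop`."""
    drop = set(drop)
    # Preserve the original field order while skipping the unwanted names.
    kept = [name for name in fields if name not in drop]
    return f"{target} = FOREACH {source} GENERATE {', '.join(kept)};"


# Illustrative schema with five fields; drop the third one.
fields = [f"f{i}" for i in range(1, 6)]
statement = build_generate(fields, drop=["f3"])
print(statement)
```

The printed line can then be pasted into the Pig script in place of a hand-written GENERATE clause.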

You can also write your own loader that lets you specify which columns to load.

KDB:Trying to read multiple csv files at a location

I am trying to run the code below to read all CSV files available at the location C:/q/BitCoin/Input. I am getting an error and don't know what the solution is. The CSV files are standard ones with three fields.
raze{[x]
inputdir:`:C:/q/BitCoin/Input;
filelist1:key inputdir;
filelist2:` sv' inputdir,'filelist1;
filelist3:string filelist2;
r:flip`Time`Qty`Price!("ZFF";",")0:x;
select from r
} each `$filelist3
Hard-coding the file names and running the code below works, but I don't want to hard-code them:
raze {[x]
r:flip`Time`Qty`Price!("ZFF";",")0:x;
select from r
} each (`$"C:/q/BitCoin/Input/bitbayPLN.csv";`$"C:/q/BitCoin/Input/anxhkAUD.csv")
I get the error below:
An error occurred during execution of the query.
The server sent the response:
filelist3
Can someone help with this issue?
The reason you are receiving the error 'filelist3 is that filelist3 is defined inside the lambda, so it is not defined outside of it, where the each refers to it. There are various ways to overcome this, as outlined below.
First, you can take the work that builds the file list out of the lambda and put it on the right-hand side of the each.
raze{[x] r:flip`Time`Qty`Price!("ZFF";",")0:x; select from r
} each `$(string (` sv' `:C:/q/BitCoin/Input,'(key `:C:/q/BitCoin/Input)))
Alternatively, you could create a function that generates filelist3 for you and use it on the right-hand side of the each.
f:{[inputdir] filelist1:key inputdir; filelist2:` sv' inputdir,'filelist1; filelist3:string filelist2; filelist3}
raze{[x] r:flip`Time`Qty`Price!("ZFF";",")0:x; select from r
} each `$f[`:C:/q/BitCoin/Input]
I hope this helps.
Many thanks,
Joel

kdb q - lookup in nested list

Is there a neat way of looking up the key of a dictionary by an atom value if that atom is inside a value list?
Assumption: the value lists of the dictionary each have unique elements.
Example:
d:`tech`fin!(`aapl`msft;`gs`jpm) / would like to get key `fin by looking up `jpm
d?`gs`jpm / returns `fin as expected
d?`jpm / this doesn't work unfortunately
$[`jpm in d`fin;`fin;`tech] / this is the only way I can come up with
The last option does not scale well with the number of keys
Thanks!
You can take advantage of how where operates with dictionaries, and use in:
where `jpm in/:d
,`fin
Note this will return a list, so you might need to do first on the output if you want to replicate what you have above.
Why are you making this difficult on yourself? Use a table!
q)t:([] c:`tech`tech`fin`fin; sym:`aapl`msft`gs`jpm)
q)first exec c from t where sym=`jpm
You can of course do what you're asking:
first where `jpm in'd
but this doesn't extend well to vectors while the table-approach does!
q)exec c from t where sym in `jpm`gs
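I think you can take advantage of the value & key keywords to find what you're after (any is applied to each value list, so the position of the match within a list does not matter):
q)key[d]where any each value[d]in `jpm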
,`fin
Hope that helps!
Jemma
The answers you have received so far are excellent. Here's my contribution building on Ryan's answer:
{[val;dict]raze {where y in/:x}[dict]'[val]}[`msft`jpm`gs;d]
The main difference is that you can pass a list of values to be evaluated and the result will be a list of keys.
[`msft`jpm`gs;d]
Output:
`tech`fin`fin
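For readers who want to check the logic outside q, the same two ideas (scan each value list for a match, or build a reverse index once and look values up in it) can be sketched in Python. This is only an illustration; `keys_containing` and `build_reverse_index` are hypothetical names, and the reverse index relies on the question's assumption that each symbol appears under one key only.

```python
d = {"tech": ["aapl", "msft"], "fin": ["gs", "jpm"]}


def keys_containing(mapping, value):
    """Scan every value list and return the keys whose list contains `value`."""
    return [key for key, members in mapping.items() if value in members]


def build_reverse_index(mapping):
    """Map each member back to its key (assumes members are unique across keys)."""
    return {member: key for key, members in mapping.items() for member in members}


single = keys_containing(d, "jpm")

# Build the index once, then look up a whole list of values.
reverse = build_reverse_index(d)
many = [reverse[sym] for sym in ["msft", "jpm", "gs"]]

print(single)
print(many)
```

The scan costs one pass over the dictionary per lookup, while the reverse index pays that cost once, which is the same trade-off the table-based answer makes.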

GAMS: retrieve information from solution

GAMS: I think I have a pretty simple question, however I'm stuck and was wondering if someone could help here.
A simplified version of my model looks like this:
set(i,t) ;
parameter price
D;
variable p(i,t)
e(i,t);
equations
Equation1
obj.. C=sum((i,t), p(i,t)*price);
Model file /all/ ;
Solve file minimizing C using MIP ;
Display C.l;
p(i,t) and e(i,t) are related:
Equation1 .. e(i,t)=e=e(i,t-1)+p(i,t)*D
Now I want to retrieve information from the solution. Let's say I want to know at what t e(i,t) has a certain value, for example e(i,t) = x(i); formulated otherwise, find TD such that e(i,TD) = x(i), where x(i) depends on i. Does anyone know how I can write this into my GAMS model? To be clear, I do not want to change anything about my solution, and the model I have runs; I just want to retrieve this information from the given solution.
So far I have tried a couple of things and nothing worked. I think this must be simple; can anyone help? Thank you!
Try something like this:
set i /i1*i10/
t /t1*t10/;
variable e(i,t);
*some random dummy "solution"
e.l(i,t) = uniformInt(1,10);
set find5(i,t) 'find all combinations of i and t for which e.l=5';
find5(i,t)$(e.l(i,t)=5) = yes;
display e.l,find5;
Hope that helps,
Lutz

dataFrame keying using pandas groupby method

I am new to pandas and trying to learn how to work with it. I'm having a problem when trying to apply an example I saw in one of Wes's videos and notebooks to my data. I have a CSV file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp". The code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do get a result when I use the file path, but as I understand it and saw in previous examples, I should be able to use the vp keys as well. Isn't that so?
Sorry if it's a trivial one; I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to "partially" index a MultiIndex Series. I simply did it the wrong way.
The first index level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. That is, for each file name I have multiple vps.
Now, this is correct (note the raw string, so the backslashes are not treated as escapes):
res[r'E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
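The behaviour described here can be reproduced with a small, self-contained example. This is a sketch with made-up rows (only the column names and the file path come from the question), assuming a reasonably recent pandas version.

```python
import pandas as pd

# Illustrative data: one file, with one vp appearing twice.
path = r"E:\Audio\7168965711_5601_4.wav"
df = pd.DataFrame({
    "filePath": [path, path, path],
    "vp": ["Cust_9709495726", "Cust_9708568031", "Cust_9709495726"],
    "score": [-2, -80, -2],
})

# Grouping by two columns yields a Series with a two-level MultiIndex.
res = df.groupby(["filePath", "vp"]).size()

# Partial indexing by the outer level returns a Series keyed by vp.
by_file = res[path]
print(by_file)

# Indexing by an inner-level key alone is looked up in the outer level and fails.
try:
    res["Cust_9709495726"]
    inner_lookup_failed = False
except KeyError:
    inner_lookup_failed = True
print(inner_lookup_failed)
```

The outer key works because plain bracket indexing on a MultiIndex matches from the first level inward; the inner key has no match at the first level, hence the KeyError.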
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; for example, you could test for multiple values at once:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
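Both approaches can be tried end to end on a small example. The sketch below uses made-up file names and vp values, and it calls .xs directly on the Series rather than converting to a DataFrame first; that is an assumption about the pandas version in use (recent versions support it).

```python
import pandas as pd

# Illustrative data: two files, with Cust_1 appearing in both.
df = pd.DataFrame({
    "filePath": ["a.wav", "a.wav", "b.wav", "b.wav", "b.wav"],
    "vp": ["Cust_1", "Cust_2", "Cust_1", "Cust_1", "Cust_3"],
})
res = df.groupby(["filePath", "vp"]).size()

# Cross-section on the inner level: counts for Cust_1, keyed by filePath.
one = res.xs("Cust_1", level="vp")
print(one)

# Boolean mask on the inner level: keeps the full MultiIndex
# and allows several values to be tested at once.
mask = res.index.get_level_values("vp").isin(["Cust_1", "Cust_3"])
several = res[mask]
print(several)
```

Referring to the level by name ("vp") instead of position (1) is a readability choice; both forms select the same level here.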

MongoPasswordField setPassword + save

I've started working a little bit with Lift + Scala + Mongo Record, but I found a small annoyance:
Usually, to easily create a record (document), I just do:
User.createRecord.loginName("user").firstName("Name").lastName("LastName").save
But when I use the MongoPasswordField it is impossible to do it in just one line:
val userRecord = User.createRecord.loginName("user").firstName("Name").lastName("LastName")
userRecord.password.setPassword("SomePassword")
userRecord.save
Source code for the field is at http://scala-tools.org/mvnsites/liftweb-2.2/framework/scaladocs/lift-persistence/lift-mongodb-record/src/main/scala/net/liftweb/mongodb/record/field/MongoPasswordField.scala.html
Is there any way of doing this in just one line?
Or at least, can the field code be modified in some way to allow this?
I think you could do this:
User.createRecord.loginName("user").firstName("Name").lastName("LastName").password(Password("Some password")).save