How to find highly similar observations in another dataset using Spark - pyspark

I have two csv files. File 1:
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot
2,66M,J,Rock,F,1995,201211.0
3,David,HM,Lee,M,,201211.0
6,66M,,Rock,F,,201211.0
0,David,H M,Lee,,1990,201211.0
3,Marc,H,Robert,M,2000,201211.0
6,Marc,M,Robert,M,,201211.0
6,Marc,MS,Robert,M,2000,201211.0
3,David,M,Lee,,1990,201211.0
5,Paul,ABC,Row,F,2008,201211.0
3,Paul,ACB,Row,,,201211.0
4,David,,Lee,,1990,201211.0
4,66,J,Rock,,1995,201211.0
File 2:
PID,FNAME,MNAME,LNAME,GENDER,DOB,FNAMELNAMEMNAMEGENDERDOB
S2,66M,J,Rock,F,1995,66MRockJF1995
S3,David,HM,Lee,M,1990,DavidLeeHMM1990
S0,Marc,HM,Robert,M,2000,MarcRobertHMM2000
S1,Marc,MS,Robert,M,2000,MarcRobertMSM2000
S6,Paul,Row,M,2008,PaulRowM2008
S7,Sam,O,Baby,F,2018,SamBabyOF2018
For example, I want to extract the observations in File 2 that are highly similar to MarcHRobertM2000 in File 1.
My expected output will be:
S0,Marc,HM,Robert,M,2000,MarcRobertHMM2000
S1,Marc,MS,Robert,M,2000,MarcRobertMSM2000
I used the following code:
sqlContext.registerDataFrameAsTable(df2,'table')
query=""" SELECT PID, FNAMELNAMEMNAMEGENDERDOB, similarity(lower(FNAMELNAMEMNAMEGENDERDOB), 'MarcHRobertM2000') as sim
FROM table
WHERE sim>0.7 """
df=sqlContext.sql(query)
It looks like the similarity function does not exist in sqlContext's SQL, and I have no idea how to fix it. In addition, File 2 is big, around 5 GB, so I did not use fuzzywuzzy in Python, and soundex is not satisfying. Could you help me? Thank you.

You can use the Levenshtein distance function to check the similarity. Please refer to the code below. Note that the sim alias cannot be referenced in the WHERE clause, so the expression is repeated there:
query=""" SELECT PID, FNAMELNAMEMNAMEGENDERDOB, levenshtein(FNAMELNAMEMNAMEGENDERDOB, 'MarcHRobertM2000') as sim
FROM table
WHERE sim < 4 """
Also, please check https://medium.com/@mrpowers/fuzzy-matching-in-spark-with-soundex-and-levenshtein-distance-6749f5af8f28 for a good read.
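If you would rather stay in the DataFrame API than register a temporary table, the same idea can be written with pyspark.sql.functions.levenshtein. The sketch below is only illustrative: the input path, the lower-casing of both sides, and the threshold of 4 are assumptions you would adapt to your data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read File 2 (path is a placeholder) and compute the edit distance between
# the concatenated key and the target key taken from File 1.
df2 = spark.read.csv("file2.csv", header=True)
target = "MarcHRobertM2000"

result = df2.withColumn(
    "sim",
    F.levenshtein(F.lower(F.col("FNAMELNAMEMNAMEGENDERDOB")), F.lit(target.lower())),
).filter(F.col("sim") < 4)  # small distance = highly similar

result.show(truncate=False)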

Related

How to get data (PostgreSQL) in DataCamp workspace

I want to get the complete data, but it says "truncated to 2222". My question is: why does this happen even though I don't use any truncating statements? My code is like below:
SELECT *
FROM cinema.films;
and I also tried another, more specific query, with the goal that all rows can be downloaded:
SELECT *
FROM cinema.films
LIMIT 4968;
The form of the table is more or less like this; look in the lower right corner, it says "truncated to 2222". How can I avoid the truncation and get the data as a whole?
I've searched on Google and on the Stack Overflow forums. I hope that someone can help solve my problem.

Pyspark adding columns to existing dataframe

I am trying to add multiple columns to the right. How can I do that?
Attributes = ["RequestTypePesId","AgentId","UpdatedBy","CauseType","OriginatingSystem"] for i in Attributes: a = df2load.select(i).distinct() b = a.join(b,a.select(i) == b.select(i),"fullouter")
Output should be:
(screenshot of the expected output)
Check out this example:
https://stackoverflow.com/a/71966176/9658895
and since you’re new, you might find this article useful:
“Hello World” of PySpark for Python & Pandas User [Pandas Vs PySpark]
Lastly, before you ask your next question, please make sure that you have followed the standard protocol for asking a question. This video might help you.
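If the goal (as I read the screenshot) is to lay the distinct values of each attribute out side by side, one possible sketch is to number each distinct list and join the pieces on that row number. The column names below come from the question; df2load is assumed to already exist, and the row_number approach is just one way to do the alignment, not the linked answer's method.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

attributes = ["RequestTypePesId", "AgentId", "UpdatedBy", "CauseType", "OriginatingSystem"]

result = None
for c in attributes:
    # Number the distinct values of each column so the pieces can be aligned.
    # An unpartitioned window pulls the data to one partition, which is fine
    # for small distinct lists.
    part = (
        df2load.select(c).distinct()
        .withColumn("rn", F.row_number().over(Window.orderBy(c)))
    )
    result = part if result is None else result.join(part, on="rn", how="full_outer")

result.drop("rn").show()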

KDB: Trying to read multiple csv files at a location

I am trying to run the below code to read all csv files available at the location C:/q/BitCoin/Input. I am getting an error and don't know what the solution is. The csv files are standard ones with three fields.
raze{[x]
inputdir:`:C:/q/BitCoin/Input;
filelist1:key inputdir;
filelist2:` sv' inputdir,'filelist1;
filelist3:string filelist2;
r:flip`Time`Qty`Price!("ZFF";",")0:x;
select from r
} each `$filelist3
Hard-coding the file names and running the below code works, but I don't want to hard-code them.
raze {[x]
r:flip`Time`Qty`Price!("ZFF";",")0:x;
select from r
} each (`$"C:/q/BitCoin/Input/bitbayPLN.csv";`$"C:/q/BitCoin/Input/anxhkAUD.csv")
I am getting the error below:
An error occurred during execution of the query.
The server sent the response:
filelist3
Can someone help with this issue?
The reason that you are receiving the error 'filelist3 is that filelist3 is defined inside the lambda, and outside of the lambda it is not recognised or defined. There are various ways to overcome this, as outlined below.
Firstly, you can take all of the definition work done inside the lambda and put it on the right-hand side of the each.
raze{[x] r:flip`Time`Qty`Price!("ZFF";",")0:x; select from r
} each `$(string (` sv' `:C:/q/BitCoin/Input,'(key `:C:/q/BitCoin/Input)))
Or, if you wanted to, you could create a function which will generate filelist3 for you and use that on the right-hand side of the each as well.
f:{[inputdir] filelist1:key inputdir; filelist2:` sv' inputdir,'filelist1; filelist3:string filelist2; filelist3}
raze{[x] r:flip`Time`Qty`Price!("ZFF";",")0:x; select from r
} each `$f[`:C:/q/BitCoin/Input]
I hope this helps.
Many thanks,
Joel

Change the file name in tblwrite when creating a table

I want to create 15 txt tables with different names by using tblwrite in MATLAB. My attempt was this code:
for j=1:15
rownames=char(rownames1(:,j));
T(:,1)=[1:length(K_minus_t)];
table=char(T(j,1));
colnames{j}={'lambda_L';'C_L';'lambda_N';'tau_N';'lambda_c';'tau_c';};
values{j}=[coef_line lam_tau_curve lam_tau_cap];
tblwrite(values{j},colnames{j},rownames,'table.txt');
end
Unfortunately, it failed. Any help would be greatly appreciated. I will be grateful to you.

dataFrame keying using pandas groupby method

I'm new to pandas and trying to learn how to work with it. I'm having a problem when trying to use an example I saw in one of Wes's videos and notebooks on my data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I load it into a DataFrame and then group it by "filePath" and "vp"; the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to approach the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed in getting a result when I put in the filePath, but as I understood from previous examples, I should be able to use the vp keys as well, shouldn't I?
Sorry if it's a trivial one, I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to partially index a MultiIndex Series. I simply did it the wrong way.
The index's first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp, meaning that for each file name I have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame, you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a Series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible; for example, you could test for multiple values at once:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]
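Putting the pieces together, here is a small runnable sketch of the options above; the file name scores.csv is just a placeholder for the CSV shown in the question.
import pandas as pd

df = pd.read_csv("scores.csv")
grouped = df.groupby(["filePath", "vp"]).size()

# Partial indexing works on the outer level of the MultiIndex:
print(grouped["E:\\Audio\\7168965711_5601_4.wav"])

# Slicing on the inner level via .xs on the DataFrame form:
res = pd.DataFrame(grouped)
print(res.xs("Cust_4062144116", level=1))

# Or keep the Series and boolean-index on the level values:
print(grouped[grouped.index.get_level_values(1) == "Cust_4062144116"])
print(grouped[grouped.index.get_level_values(1).isin(["Cust_4062144116", "Cust_6073165338"])])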