pyspark dataframe iterate over the list - pyspark

I have a list and a PySpark dataframe like below. Today my list has only 3 elements; tomorrow it might have 5. The list is dynamic, not static.
my_list = ['4587','9920408','9920316']
a=spark.createDataFrame([(101,'~1~20448~3~22901~12214~27681~9920408~20013~19957~19993~ ~ ~ ~ ~ ~'),(102, '~1~20448~4462~4586~24739~4587~9914381~9921471~12777~ ~ ~ ~ ~ ~ ~'),(103,'~1~20448~3~22901~3891~4148~9920948~14845~4230~ ~ ~ ~ ~ ~ ~'),(104, '~1~20448~3~22901~3891~4148~9920316~4211~4212~ ~ ~ ~ ~ ~ ~')], ['ID', 'MSH'])
Now I want to create a column ind such that if msh like '%~4587~%' or msh like '%~9920408~%' or msh like '%~9920316~%' then it is 1, otherwise 0.
I tried the following and it works.
b=a.withColumn('ind', F.expr("if((msh like '%~4587~%' or msh like '%~9920408~%' or msh like '%~9920316~%'),1,0)"))
Is there a way to build the if condition dynamically, so that msh like appears n times when the list has n items?
Appreciate your support.

With multiple likes, consider using rlike with a regular expression instead, joining the patterns with | so they are all matched at once.
reg_expr = '|'.join([f'~{i}~' for i in my_list])
b = a.withColumn('ind', F.expr(fr"if(msh rlike '{reg_expr}', 1, 0)"))
b.show(truncate=False)
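If you prefer to keep plain like instead of a regex, another option (a sketch built on the same my_list and a from the question) is to OR the per-element conditions together with functools.reduce:
from functools import reduce
from pyspark.sql import functions as F

# build one like() condition per list element, then OR them all together
cond = reduce(lambda acc, c: acc | c, [F.col('msh').like(f'%~{i}~%') for i in my_list])
b = a.withColumn('ind', F.when(cond, 1).otherwise(0))
b.show(truncate=False)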

Related

Spark (Scala) modify the contents of a Dataset Column

I would like to have a Dataset, where the first column contains single words and the second column contains the filenames of the files where these words appear.
My current code looks something like this:
val path = "path/to/folder/with/files"
val tokens = spark.read.textFile(path)
.flatMap(line => line.split(" "))
.withColumn("filename", input_file_name)
tokens.show()
However this returns something like
|word1 |whole/path/to/some/file |
|word2 |whole/path/to/some/file |
|word1 |whole/path/to/some/otherfile|
(I don't need the whole path, just the last bit.) My idea to fix this was to use the map function
val tokensNoPath = tokens.
map(r => (r(0), r(1).asInstanceOf[String].split("/").lastOption))
So basically, just going to every row, grabbing the second entry and deleting everything before the last slash.
However, since I'm very new to Spark and Scala, I can't figure out how to get the syntax for this right.
Docs:
substring_index "substring_index(str, delim, count) Returns the substring from str before count occurrences of the delimiter delim... If count is negative, everything to the right of the final delimiter (counting from the right) is returned."
.withColumn("filename", substring_index(input_file_name, "/", -1))
You can split by slash and get the last element:
val tokens2 = tokens.withColumn("filename", element_at(split(col("filename"), "/"), -1))
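For readers working in PySpark (the context of the main question above), the same trick is available via pyspark.sql.functions.substring_index; a minimal sketch, assuming an existing spark session and a hypothetical input path:
from pyspark.sql.functions import explode, split, input_file_name, substring_index

# "path/to/folder/with/files" is a placeholder path
tokens = (spark.read.text("path/to/folder/with/files")
          .select(explode(split("value", " ")).alias("word"),
                  substring_index(input_file_name(), "/", -1).alias("filename")))
tokens.show(truncate=False)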

MATLAB: Count punctuation marks in table columns

I'm trying to find the number of sentences in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
As you can see from the full stops, there is one sentence in column one and 2 sentences in column 3.
At the end of the day I want a table containing nothing but the punctuation marks that end a sentence ("." or "?" or "!"), with placeholders like "" to keep the table rows the same length, so that I can calculate the total number of such marks in each column.
This is my code (not yet successful):
EqualCoumns = [2:2:max(width(Table_RandomInfoMiddle))];
for t=EqualCoumns
MiddleOnlySentenceIndicators = Table_RandomInfoMiddle((Table_RandomInfoMiddle{:, t}=='punctuation'),:);
%Remove all but "!.?", which are the only sentence enders
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == ',', :) = [];
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == ';', :) = [];
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == ':', :) = [];
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == '-', :) = [];
MiddleSentence_Nr(t) = height(MiddleOnlySentenceIndicators);
end
Right now this is almost giving good results; there is a small mistake somewhere.
In the answer I would like to request only one thing: that I have access to the results in the same table-like form. It should look something like this (edited):
Any help will be appreciated.
Thank you!
If we use the table from my previous answer, t, we can use the following solution:
punctuation_table = table();
for col=1:size(t,2)
column_name = sprintf('Punctuation count for column %d',col);
punctuation_table.(column_name) = nnz(ismember(t(:,col).Variables,{'?',',','.','!'}));
end
which will create a table like this:
punctuation_table =
1×4 table
Punctuation count for column 1 Punctuation count for column 2 Punctuation count for column 3 Punctuation count for column 4
______________________________ ______________________________ ______________________________ ______________________________
2 0 2 0

How to filter dataframe columns that start with something and end with something

I have this piece of code currently that works as intended
val rules_list = df.columns.filter(_.startsWith("rule")).toList
However, this is including some columns that I don't want. How would I add a second filter so that the total filter is "columns that start with 'rule' and end with any integer value"?
So it should return "rule_1" in the list of columns but not "rule_1_modified"
Thanks and have a great day!
You can simply add a regex expression to your filter:
val rules_list = df.columns.filter(c => c.startsWith("rule") && c.matches("^.*\\d$")).toList
You can use Python's re module like this:
import re

rules_list = []
for col_name in df.columns:
    # keep only columns that start with "rule" and end in a digit
    if re.fullmatch(r'rule.*\d', col_name):
        rules_list.append(col_name)
print(rules_list)
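As a quick sanity check (the column names below are hypothetical), the anchored pattern keeps 'rule_1' and 'rule_23' but drops 'rule_1_modified':
import re
cols = ['rule_1', 'rule_1_modified', 'rule_23', 'other']
print([c for c in cols if re.fullmatch(r'rule.*\d', c)])  # ['rule_1', 'rule_23']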

Fuzzy Compare between two hive columns using apache spark with scala

I am reading the data from 2 Hive tables. The token table has the tokens that need to be matched against the input data. The input data has a description column along with other columns. I need to split the input data and compare each split element with all the elements from the token table.
Currently I am using the me.xdrop.fuzzywuzzy.FuzzySearch library for fuzzy matching.
Below is my code snippet:
val tokens = sqlContext.sql("select token from tokens")
val desc = sqlContext.sql("select description from desceriptiontable")
val desc_tokens = desc.flatMap(_.toString().split(" "))
Now I need to iterate over desc_tokens; each element of desc_tokens should be fuzzy matched with each element of tokens, and if it exceeds an 85% match I need to replace the element from desc_tokens with the element from tokens.
Example --
My token list is
hello
this
is
token
file
sample
and my input description is
helo this is input desc sampl
the code should return
hello this is input desc sample
because hello and helo are fuzzy matched above 85%, so helo is replaced by hello. Similarly for sampl.
I made a test with this library: https://github.com/rockymadden/stringmetric
Another idea (not optimized):
// I changed the order of the tokens
val tokens = Array("this", "is", "sample", "token", "file", "hello")
val desc_tokens = Array("helo", "this", "is", "token", "file", "sampl")

val res = desc_tokens.map(str => {
  // Compute the score between str and every token
  val elem = tokens.zipWithIndex.map { case (tok, index) => (tok, index, JaroMetric.compare(str, tok).get) }
  // Get the token with the max score
  val emax = elem.maxBy { case (_, _, score) => score }
  // If emax has a score > 0.85 take it, else keep the input token
  if (emax._3 > 0.85) tokens(emax._2) else str
})
res.foreach { println }
My output:
hello
this
is
token
file
sample
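The same replace-above-threshold idea can also be sketched in plain Python (the main question's context here) using the standard library's difflib; note that SequenceMatcher.ratio is a different similarity metric from FuzzyWuzzy or Jaro, so the 0.85 threshold is only illustrative:
import difflib

tokens = ["this", "is", "sample", "token", "file", "hello"]
desc_tokens = "helo this is input desc sampl".split(" ")

def best_match(word, candidates, threshold=0.85):
    # score word against every candidate token and keep the best-scoring one
    cand, score = max(((c, difflib.SequenceMatcher(None, word, c).ratio()) for c in candidates),
                      key=lambda p: p[1])
    return cand if score > threshold else word

print(" ".join(best_match(w, tokens) for w in desc_tokens))  # hello this is input desc sample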

How does reduceByKey work [duplicate]

This question already has answers here: reduceByKey: How does it work internally? (5 answers)
Closed 5 years ago.
I am doing some work with Scala and Spark (beginner programmer and poster). The goal is to map each request (line) to a pair (userid, 1) and then sum the hits.
Can anyone explain in more detail what is happening on the 1st and 3rd lines, and what the => in line => line.split(' ') means?
Please excuse any errors in my post formatting as I am new to this website.
val userreqs = logs.map(line => line.split(' ')).
map(words => (words(2),1)).
reduceByKey((v1,v2) => v1 + v2)
Considering the below hypothetical log:
trans_id amount user_id
1 100 A001
2 200 A002
3 300 A001
4 200 A003
this is how the data is processed in Spark for each operation performed on the logs.
logs // RDD("1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003")
.map(line => line.split(' ')) // RDD(Array(1,100,A001), Array(2,200,A002), Array(3,300,A001), Array(4,200,A003))
.map(words => (words(2),1)) // RDD((A001,1), (A002,1), (A001,1), (A003,1))
.reduceByKey((v1,v2) => v1+v2) // RDD((A001,2), (A002,1), (A003,1))
line.split(' ') splits a string into an Array of String: "Hello World" => Array("Hello", "World")
reduceByKey(_+_) runs a reduce operation that groups the data by key; in this case it adds up all the values for each key. In the above example there were two occurrences of the user key A001, and the value associated with each of those keys was 1. This is reduced to the value 2 using the additive function (_ + _) provided to the reduceByKey method.
The easiest way to learn Spark and reduceByKey is to read the official documentation of PairRDDFunctions that says:
reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)] Merge the values for each key using an associative and commutative reduce function.
So it basically takes all the values per key and combines them; here it sums them into a single total per key.
Now, you may be asking yourself:
What is the key?
The key to understanding the key (pun intended) is to see how keys are generated, and that's the role of the line
map(words => (words(2),1)).
This is where you take words, pick words(2) as the key, and pair it with 1.
This is a classic map-reduce algorithm where you give 1 to all keys to reduce them in the following step.
In the end, after this map you'll have a series of key-value pairs as follows:
(hello, 1)
(world, 1)
(nice, 1)
(to, 1)
(see, 1)
(you, 1)
(again, 1)
(again, 1)
I repeated the last (again, 1) pair on purpose to show you that pairs can occur multiple times.
The series is created using the RDD.map operator, which takes a function that splits a single line and tokenizes it into words.
logs.map(line => line.split(' ')).
It reads:
For every line in logs, split the line into tokens using a space as the separator.
I'd change this line to use a regex like \\s+ so any whitespace character is treated as part of the separator.
line.split(' ') splits each line on the space character, which returns an array of strings.
For example:
"hello spark scala".split(' ') gives [hello, spark, scala]
`reduceByKey((v1,v2) => v1 + v2)` is equivalent to `reduceByKey(_ + _)`
Here is how reduceByKey works https://i.stack.imgur.com/igmG3.gif and http://backtobazics.com/big-data/spark/apache-spark-reducebykey-example/
For the same key it keeps adding all the values.
Hope this helped!
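For completeness, here is a minimal PySpark sketch of the same pipeline, using the hypothetical log lines from the first answer and assuming a local SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
logs = spark.sparkContext.parallelize(["1 100 A001", "2 200 A002", "3 300 A001", "4 200 A003"])

userreqs = (logs.map(lambda line: line.split(' '))      # split each request into fields
                .map(lambda words: (words[2], 1))       # (user_id, 1)
                .reduceByKey(lambda v1, v2: v1 + v2))   # sum the 1s per user_id
print(userreqs.collect())  # [('A001', 2), ('A002', 1), ('A003', 1)] (order may vary)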