Update values in a global dictionary from a PySpark UDF - pyspark

I have a User Defined Function (UDF) that adds a new column to a spark dataframe but it's a little slow.
The UDF calculates the edit distance between user input and a list of correctly spelled words and I was hoping to speed it up by storing the user input and the closest word match in a global dictionary. The idea is to reference the global dictionary first before spending time calculating scores for all the words again.
I'm new to Spark/PySpark so I don't know all the correct terms, but from what I've read it sounds like the executors don't keep track of global variables across threads (or something). I also read about Broadcast variables but I think those are passed as inputs and Accumulators only allow numeric data.
Here's some sample code I'm working with currently:
def guess_word(user_entry):
    user_entry = user_entry.upper().strip()
    # Check if the best match has already been calculated from a previous row,
    # if not, calculate scores and return the one with the lowest score
    if user_entry not in global_dict:
        scores = {}
        # Calculate scores against every word
        for word in word_dataset:
            word = word.upper().strip()
            if word not in scores:
                scores[word] = distance(user_entry, word)
        # Get the word with the lowest score (aka best match)
        word_guess, score = sorted(scores.items(), key=lambda kv: kv[1])[0]
        # Update the global dictionary
        global_dict[user_entry] = (word_guess, score)
    # The dictionary stores (word, score) tuples; return only the word
    word_guess, _ = global_dict[user_entry]
    return word_guess

global_dict = {}
guess_word_udf = udf(lambda x: guess_word(x), StringType())
user_data = user_data.withColumn('word_guess', guess_word_udf('user_entry'))
After running this code, global_dict is always empty. Is it possible to...
I just realized I don't need the dictionary after the UDF finishes running and this question is now pointless :D


Need to create dictionary of idf values, associating words with their idf values

I understand how to get the idf values and vocabulary using the vectorizer. With vocabulary the frequency of the word is the value and the word is the key of a dictionary, however, what I want the value to be is the idf value.
I haven't been able to try anything because I don't know how to work with sklearn.
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
The code provided above is what I was originally trying to work with.
I have since come up with a new solution that does not use scikit:
import math

total_dict = {}
for string in text_array:
    for word in string:
        if word not in total_dict.keys():  # build up a word frequency in the dictionary
            total_dict[word] = 1
        else:
            total_dict[word] += 1
for word in total_dict.keys():  # calculate the tf-idf of each word in the dictionary using this url: https://nlpforhackers.io/tf-idf/
    total_dict[word] = math.log(len(text_array) / float(1 + total_dict[word]))
    print("word", word, ":", total_dict[word])
Let me know if the code snippet above is enough to allow a reasonable estimation of what is going on. I included a link to what I was using for guidance.
You can call vectorizer.fit_transform(text) directly the first time.
It builds a vocabulary from all the words/tokens in the text.
After that you can use vectorizer.transform(anothertext) to vectorize another text with the same mapping as the previous text.
More explanation:
fit() learns the vocabulary and idf from the training set. transform() transforms documents based on the vocabulary learned by the previous fit().
So you should only call fit() once, and can then call transform() many times.
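To get the word-to-idf dictionary the question asks for, the fitted vectorizer's idf_ array can be paired with its feature names; in recent scikit-learn versions I believe dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)) does this. The sketch below shows the same idea in plain Python, assuming the smoothed formula ln((1 + n) / (1 + df)) + 1 and a simple lowercase tokenizer that keeps tokens of two or more characters; the helper name idf_by_word is made up for illustration.

```python
import math
import re


def idf_by_word(documents):
    """Map each word to a smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    n_documents = len(documents)
    document_frequency = {}
    for document in documents:
        # Lowercase and keep tokens of two or more word characters.
        tokens = set(re.findall(r"\b\w\w+\b", document.lower()))
        for token in tokens:
            document_frequency[token] = document_frequency.get(token, 0) + 1
    return {
        word: math.log((1 + n_documents) / (1 + df)) + 1
        for word, df in document_frequency.items()
    }


text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
idf = idf_by_word(text)
```

Note that idf is based on the number of documents containing a word, not on the total number of times the word occurs.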

Apache Spark: multiple outputs in one map task

TL;DR: I have a large file that I iterate over three times to get three different sets of counts out. Is there a way to get three maps out in one pass over the data?
Some more detail:
I'm trying to compute PMI between words and features that are listed in a large file. My pipeline looks something like this:
val wordFeatureCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield ((word, feature), 1)
})
And then I repeat this to get word counts and feature counts separately:
val wordCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield (word, 1)
})
val featureCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield (feature, 1)
})
(I realize I could just iterate over wordFeatureCounts to get the wordCounts and featureCounts, but that doesn't answer my question, and looking at running times in practice I'm not sure it's actually faster to do it that way. Also note that there are some reduceByKey operations and other stuff that I do with this after the counts are computed that aren't shown, as they aren't relevant to the question.)
What I would really like to do is something like this:
val (wordFeatureCounts, wordCounts, featureCounts) = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  val wfCounts = for (feature <- features) yield ((word, feature), 1)
  val wCounts = for (feature <- features) yield (word, 1)
  val fCounts = for (feature <- features) yield (feature, 1)
  (wfCounts, wCounts, fCounts)
})
Is there any way to do this with spark? In looking for how to do this, I've seen questions about multiple outputs when you're saving the results to disk (not helpful), and I've seen a bit about accumulators (which don't look like what I need), but that's it.
Also note that I can't just yield all of these results in one big list, because I need three separate maps out. If there's an efficient way to split a combined RDD after the fact, that could work, but the only way I can think of to do this would end up iterating over the data four times, instead of the three I currently do (once to create the combined map, then three times to filter it into the maps I actually want).
It is not possible to split an RDD into multiple RDDs. This is understandable if you think about how this would work under the hood. Say you split RDD x = sc.textFile("x") into a = x.filter(_.head == 'A') and b = x.filter(_.head == 'B'). Nothing happens so far, because RDDs are lazy. But now you print a.count. So Spark opens the file, and iterates through the lines. If the line starts with A it counts it. But what do we do with lines starting with B? Will there be a call to b.count in the future? Or maybe it will be b.saveAsTextFile("b") and we should be writing these lines out somewhere? We cannot know at this point. Splitting an RDD is just not possible with the Spark API.
But nothing stops you from implementing something if you know what you want. If you want to get both a.count and b.count you can map lines starting with A into (1, 0) and lines with B into (0, 1) and then sum up the tuples elementwise in a reduce. If you want to save lines with B into a file while counting lines with A, you could use an aggregator in a map before filter(_.head == 'B').saveAsTextFile.
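The tuple-counting trick can be sketched in plain Python, with functools.reduce standing in for the RDD reduce and a made-up list of lines as input. With PySpark the same two functions should work as rdd.map(to_counts).reduce(add_pairs); the function names here are only illustrative.

```python
from functools import reduce

# Made-up sample input standing in for the lines of the file.
lines = ["Apple", "Banana", "Avocado", "Blueberry", "Cherry"]


def to_counts(line):
    """Encode a line as (starts_with_A, starts_with_B) indicator counts."""
    return (int(line.startswith("A")), int(line.startswith("B")))


def add_pairs(left, right):
    """Sum two count tuples elementwise."""
    return (left[0] + right[0], left[1] + right[1])


# One pass over the data yields both counts at once.
a_count, b_count = reduce(add_pairs, map(to_counts, lines), (0, 0))
```

Lines that start with neither letter map to (0, 0) and so do not affect either count.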
The only generic solution is to store the intermediate data somewhere. One option is to just cache the input (x.cache). Another is to write the contents into separate directories in a single pass, then read them back as separate RDDs. (See Write to multiple outputs by key Spark - one Spark job.) We do this in production and it works great.
This is one of the major disadvantages of Spark compared with traditional map-reduce programming. An RDD/DF/DS can be transformed into another RDD/DF/DS, but you cannot map an RDD into multiple outputs. To avoid recomputation you need to cache the results in some intermediate RDD and then run multiple map operations to generate multiple outputs. The caching solution will work if you are dealing with reasonably sized data. But if the data is large compared to the available memory, the intermediate outputs will be spilled to disk and the advantage of caching will not be that great. Check out the discussion here - https://issues.apache.org/jira/browse/SPARK-1476. This is an old Jira but still relevant. Check out the comment by Mridul Muralidharan.
Spark needs to provide a solution where a map operation can produce multiple outputs without the need to cache. It may not be elegant from the functional programming perspective but I would argue, it would be a good compromise to achieve better performance.
I was also quite disappointed to see that this is a hard limitation of Spark over classic MapReduce. I ended up working around it by using multiple successive maps in which I filter out the data I need.
Here's a schematic toy example that performs different calculations on the numbers 0 to 49 and writes both to different output files.
from functools import partial
import os
from pyspark import SparkContext

# Generate mock data
def generate_data():
    for i in range(50):
        yield 'output_square', i * i
        yield 'output_cube', i * i * i

# Map function to siphon data to a specific output
def save_partition_to_output(part_index, part, filter_key, output_dir):
    # Initialise output file handle lazily to avoid creating empty output files
    file = None
    for key, data in part:
        if key != filter_key:
            # Pass through non-matching rows and skip
            yield key, data
            continue
        # Open file handle
        if file is None:
            file = open(os.path.join(output_dir, '{}-part{:05d}.txt'.format(filter_key, part_index)), 'w')
        # Consume data
        file.write(str(data) + '\n')
    yield from []
    # Close file handle
    if file is not None:
        file.close()

def main():
    sc = SparkContext()
    rdd = sc.parallelize(generate_data())
    # Repartition to number of outputs
    # (not strictly required, but reduces number of output files).
    # To split partitions further, use repartition() instead or
    # partition by another key (not the output name).
    rdd = rdd.partitionBy(numPartitions=2)
    # Map and filter to first output.
    rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_square', output_dir='.'))
    # Map and filter to second output.
    rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_cube', output_dir='.'))
    # Trigger execution.
    rdd.count()

if __name__ == '__main__':
    main()
This will create two output files output_square-part00000.txt and output_cube-part00000.txt with the desired output splits.

Defaultdict() the correct choice?

EDIT: mistake fixed
The idea is to read text from a file, clean it, and pair consecutive words (not permutations):
file = f.read()
words = [word.strip(string.punctuation).lower() for word in file.split()]
pairs = [(words[i]+" " + words[i+1]).split() for i in range(len(words)-1)]
Then, for each pair, create a list of all the possible individual words that can follow that pair throughout the text. The dict will look like
Thus, referencing the dictionary for a given pair will return all of the words that can follow that pair. E.g.
wordsThatFollow[('she', 'was')]
>> ['alone', 'happy', 'not']
My algorithm to achieve this involves a defaultdict(list)...
wordsThatFollow = defaultdict(list)
for i in range(len(words)-1):
    try:
        # pairs overlap, want second word of next pair
        # wordsThatFollow[tuple(pairs[i])] = pairs[i+1][1]
        # EDIT:
        wordsThatFollow[tuple(pairs[i])].append(pairs[i+1][1])
    except Exception:
        pass
I'm not so worried about the value error I have to circumvent with the 'try-except' (unless I should be). The problem is that the algorithm only successfully returns one of the followers:
wordsThatFollow[('she', 'was')]
>> ['not']
Sorry if this post is bad for the community I'm figuring things out as I go ^^
Your problem is that you are always overwriting the value, when you really want to extend it:
# Instead of this
wordsThatFollow[tuple(pairs[i])] = pairs[i+1][1]
# Do this
wordsThatFollow[tuple(pairs[i])].append(pairs[i+1][1])
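Here is a small self-contained sketch of the appending approach, using a made-up sample sentence instead of the file. It zips the word list against itself shifted by one and two positions, which also sidesteps the index error the try/except was hiding; the variable names are only illustrative.

```python
import string
from collections import defaultdict

# Made-up sample text standing in for the contents of the file.
raw_text = "She was alone. She was happy, but she was not sure."
words = [word.strip(string.punctuation).lower() for word in raw_text.split()]

words_that_follow = defaultdict(list)
# zip stops at the shortest slice, so every pair has a following word.
for first, second, follower in zip(words, words[1:], words[2:]):
    words_that_follow[(first, second)].append(follower)
```

With this sample, words_that_follow[('she', 'was')] holds all three followers in order of appearance. If duplicates are unwanted, defaultdict(set) with .add() would be the natural variation.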

Spark caching strategy

I have a Spark driver that goes like this:
EDIT - earlier version of the code was different & didn't work
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input,
    // and updates `acc` to number of outputs
  ).reduceByKey((x, y) => x.sum(y))
  totalResult = totalResult.union(stageResult)
} while(stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions will be welcome here, thanks.
I would rewrite this as follows:
do {
  stageResult = stageResult.flatMap(
    //Some code that returns zero or more outputs per input
  ).reduceByKey((x, y) => x.sum(y)).cache()
  totalResult = totalResult.union(stageResult)
} while(stageResult.count > 0)
I am fairly certain (95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this might need to be double-checked.
Then when you call totalResult.ACTION, it will put all of the cached data together.
As long as you have the memory space, then I would indeed cache everything along the way as it stores the data of each stageResult, unioning all of those data points at the end. In fact, each union does not rely on the past as that is not the semantics of RDD.union, it merely puts them together at the end. You could just as easily change your code to use a val due to RDD immutability.
As a final note, maybe the DAG visualization will help understand why there would not be recursive ramifications:

Generate unique random strings

I am writing a very small URL shortener with Dancer. It uses the REST plugin to store a posted URL in a database with a six-character string which is used by the user to access the shortened URL.
Now I am a bit unsure about my random string generation method.
sub generate_random_string{
    my $length_of_randomstring = shift; # the length of
                                        # the random string to generate
    my @chars=('a'..'z','A'..'Z','0'..'9','_');
    my $random_string;
    foreach (1..$length_of_randomstring){
        # rand @chars will generate a random
        # number between 0 and scalar @chars
        $random_string.=$chars[rand @chars];
    }
    # Start over if the string is already in the Database
    return generate_random_string(6) if database->quick_select('urls', { shortcut => $random_string });
    return $random_string;
}
This generates a six char string and calls the function recursively if the generated string is already in the DB. I know there are 63^6 possible strings but this will take some time if the database gathers more entries. And maybe it will become a nearly infinite recursion, which I want to prevent.
Are there ways to generate unique random strings, which prevent recursion?
Thanks in advance
We don't really need to be hand-wavy about how many iterations (or recursions) of your function there will be. I believe that at every invocation the number of iterations is geometrically distributed (i.e. the number of trials up to and including the first success is governed by the geometric distribution), which has mean 1/p, where p is the probability of successfully finding an unused string. I believe that p is just 1 - n/63^6, where n is the number of currently stored strings. Therefore, I think that you will need to have stored about 30 billion strings (~63^6/2) in your database before your function iterates on average more than 2 times per call (p = .5).
Furthermore, the variance of the geometric distribution is (1-p)/p^2, so even at 30 billion entries, one standard deviation is just sqrt(2). Therefore I expect that the vast majority of the time (about 97% at p = .5, since 1 - .5^5 ≈ .97) the loop will take no more than 2 + 2*sqrt(2), or ~5, iterations. In other words, I would just not worry too much about it.
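To make the arithmetic concrete, here is a short sketch that evaluates the mean, standard deviation, and the chance of finishing within a given number of attempts for a half-full table. It assumes, as above, that every attempt independently hits an unused string with probability p = 1 - n/63^6; the function names are made up for illustration.

```python
ALPHABET_SIZE = 63
STRING_LENGTH = 6
TOTAL_STRINGS = ALPHABET_SIZE ** STRING_LENGTH  # 62,523,502,209


def retry_statistics(stored_count):
    """Return (mean, standard deviation) of the number of attempts."""
    p = 1 - stored_count / TOTAL_STRINGS  # chance that one attempt is unused
    mean = 1 / p
    std_dev = ((1 - p) / p ** 2) ** 0.5
    return mean, std_dev


def probability_within(stored_count, attempts):
    """Probability that an unused string is found within `attempts` tries."""
    p = 1 - stored_count / TOTAL_STRINGS
    return 1 - (1 - p) ** attempts


# Roughly 31 billion stored strings, i.e. half of all possible strings.
mean, std_dev = retry_statistics(TOTAL_STRINGS // 2)
within_five = probability_within(TOTAL_STRINGS // 2, 5)
```

For a realistic table of, say, a few million rows, p is so close to 1 that a retry is already a rare event.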
From an academic stance this seems like an interesting program to work on. But if you're on the clock and just need random and distinct strings I'd go with the Data::GUID module.
use strict;
use warnings;
use Data::GUID qw( guid_string );
my $guid = guid_string();
Getting rid of recursion is easy; turn your recursive call into a do-while loop. For instance, split your function into two; the "main" one and a helper. The "main" one simply calls the helper and queries the database to ensure it's unique. Assuming generate_random_string2 is the helper, here's a skeleton:
do {
$string = generate_random_string2(6);
} while (database->quick_select(...));
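The same loop-instead-of-recursion shape, sketched in Python for comparison. A plain set stands in for the database lookup, and the max_attempts cap is my own addition to rule out an endless loop; neither is part of the original Dancer code.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + "_"  # 63 characters


def random_string(length):
    """Draw `length` characters uniformly from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def generate_unique_string(existing, length=6, max_attempts=100):
    """Retry in a loop until the candidate is not in `existing`."""
    for _ in range(max_attempts):
        candidate = random_string(length)
        if candidate not in existing:
            return candidate
    raise RuntimeError("no unused string found within max_attempts")
```

In a real application a unique index on the shortcut column would still be needed, since two requests could pick the same string between the check and the insert.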
As for limiting the number of iterations before getting a valid string, what about just saving the last generated string and always building your new string as a function of that?
For example, when you start off, you have no strings, so let's just say your string is 'a'. Then the next time you build a string, you get the last built string ('a') and apply a transformation on it, for instance incrementing the last character. This gives you 'b'. and so on. Eventually you get to the highest character you care for (say 'z') at which point you append an 'a' to get 'za', and repeat.
Now there is no database, just one persistent value that you use to generate the next value. Of course if you want truly random strings, you will have to make the algorithm more sophisticated, but the basic principle is the same:
Your current value is a function of the last stored value.
When you generate a new value, you store it.
Ensure your generation will produce a unique value (one that did not occur before).
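A minimal sketch of that scheme, assuming a lowercase a-z alphabet and exactly the rule described above (bump the last character, or append 'a' once it reaches 'z'). The function name next_string is made up for illustration.

```python
def next_string(last):
    """Derive the next string from the last one generated."""
    if not last:
        return "a"
    if last[-1] == "z":
        # Highest character reached: grow the string instead of wrapping.
        return last + "a"
    return last[:-1] + chr(ord(last[-1]) + 1)


# Generate the first 28 values, starting from 'a'.
sequence = ["a"]
for _ in range(27):
    sequence.append(next_string(sequence[-1]))
```

Each value is strictly greater than the one before it, so no value can repeat, but the strings are predictable and grow by one character every 25 steps.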
I've got one more idea based on using MySQL.
create table string (
    string_id int(10) not null auto_increment,
    string varchar(6) not null default '',
    primary key(string_id)
);
insert into string set string='';
update string
    set string = lpad( hex( last_insert_id() ), 6, uuid() )
    where string_id = last_insert_id();
select string from string
    where string_id = last_insert_id();
This gives you an incremental hex value which is left padded with non-zero junk.