Scala - how to use a computed variable name - scala

I am using Gatling (https://gatling.io) and struggling a bit with the Scala (I'm just learning).
I have a feeder which pulls in user data from a csv file:
val feeder = csv("seedfile.csv").circular
And I can happily access values in this file, e.g. this allows me to log in using a value from the 'user_email' column:
exec(http("SubmitLogin")
.post("/auth/login")
.formParam("email", "${user_email}")
The issue I'm having is that a range of the columns in my csv file are named item1, item2, item3, etc. I would like to iterate over these items in a loop. I was hoping Scala might have a feature like PHP's $$ variable variables (http://php.net/manual/en/language.variables.variable.php) so I could do something like:
// actual values pulled from csv file
val item1 = "i'm item 1s val"
val item2 = "i'm item 2s val"
// for i in range
var varname = "item"+i
println(s"${$varname}") //so that for i=1 would be equivalent to println($item1)
Note: I have also tried:
s"$${varname}"
s"${${varname}}"
Based on my googling and playing with the REPL, it appears this is not an option in Scala (which I guess makes sense for a statically typed language that encourages immutable data), so any advice on how to approach this the Scala way would be greatly appreciated.
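For what it's worth, the closest thing I can picture in plain Scala is keying into a Map by a computed name (the values below are made up just to illustrate the idea):
// hypothetical stand-in for one feeder row: column name -> value
val row = Map(
  "item1" -> "i'm item 1s val",
  "item2" -> "i'm item 2s val",
  "item3" -> "i'm item 3s val"
)
for (i <- 1 to 3) {
  val varname = s"item$i"
  println(row(varname)) // look the value up by the computed key
}
If Gatling exposes the feeder row through the session as something Map-like, I'm guessing the lookup would work the same way, but I'm not sure of the exact API.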

Related

Removing Data Type From Tuple When Printing In Scala

I currently have two maps: -
mapBuffer = Map[String, ListBuffer[(Int, String, Float)]]
personalMapBuffer = Map[mapBuffer, String]
The idea of what I'm trying to do is create a list of something, and then allow a user to create a personalised list which includes a comment, so they'd have their own list of maps.
I am simply trying to print information; everything above is working fine.
To print the Key from mapBuffer, I use: -
mapBuffer.foreach(line => println(line._1))
This returns: -
Sample String 1
Sample String 2
To print the same thing from personalMapBuffer, I am using: -
personalMapBuffer.foreach(line => println(line._1.map(_._1)))
However, this returns: -
List(Sample String 1)
List(Sample String 2)
I obviously would like it to just return "Sample String" and remove the List() aspect. I'm assuming this has something to do with the .map function, although this was the only way I could find to access a tuple within a tuple. Is there a simple way to remove the data type? I was hoping for something simple like: -
line._1.map(_._1).removeDataType
But obviously no such pre-function exists. I'm very new to Scala so this might be something extremely simple (which I hope it is haha) or it could be a bit more complex. Any help would be great.
Thanks.
What you see is the default List.toString behaviour. You can build your own string with the mkString operation:
val separator = ","
personalMapBuffer.foreach(line => println(line._1.map(_._1).mkString(separator)))
which will produce the desired result of Sample String 1, or Sample String 1,Sample String 2 if there are two strings.
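For example (an illustrative REPL session with made-up values):
scala> val names = List("Sample String 1", "Sample String 2")
names: List[String] = List(Sample String 1, Sample String 2)
scala> println(names)
List(Sample String 1, Sample String 2)
scala> println(names.mkString(", "))
Sample String 1, Sample String 2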
Hope this helps!
I have found a way to get the result I was looking for, however I'm not sure if it's the best way.
The .map() method just returns a collection. You can see more info on that here:- https://www.geeksforgeeks.org/scala-map-method/
By using any sort of specific element finder at the end, I'm able to return only the element and not the data type. For example: -
line._1.map(_._1).head
As I was writing this Ivan Kurchenko replied above suggesting I use .mkString. This also works and looks a little bit better than .head in my mind.
line._1.map(_._1).mkString("")
Again, I'm not 100% sure if this is the most efficient way, but if it matters for something, this approach has worked for me for now.

Update values in a global dictionary from a PySpark UDF

I have a User Defined Function (UDF) that adds a new column to a Spark DataFrame, but it's a little slow.
The UDF calculates the edit distance between user input and a list of correctly spelled words and I was hoping to speed it up by storing the user input and the closest word match in a global dictionary. The idea is to reference the global dictionary first before spending time calculating scores for all the words again.
I'm new to Spark/PySpark so I don't know all the correct terms, but from what I've read it sounds like the executors don't keep track of global variables across threads (or something). I also read about Broadcast variables but I think those are passed as inputs and Accumulators only allow numeric data.
Here's some sample code I'm working with currently:
def guess_word(user_entry):
    user_entry = user_entry.upper().strip()
    # Check if the best match has already been calculated from a previous row;
    # if not, calculate scores and return the one with the lowest score
    if user_entry not in global_dict:
        scores = {}
        # Calculate scores against every word
        for word in word_dataset:
            word = word.upper().strip()
            if word not in scores:
                scores[word] = distance(user_entry, word)
            else:
                continue
        # Get the word with the lowest score (aka best match)
        word_guess, score = sorted(scores.items(), key=lambda kv: kv[1])[0]
        # Update the global dictionary
        global_dict[user_entry] = (word_guess, score)
    else:
        word_guess = global_dict[user_entry]
    return word_guess

global_dict = {}
guess_word_udf = udf(lambda x: guess_word(x), StringType())
user_data = user_data.withColumn('word_guess', guess_word_udf('user_entry'))
After running this code, global_dict is always empty. Is it possible to...
I just realized I don't need the dictionary after the UDF finishes running and this question is now pointless :D

General purpose Tuple in Hadoop

I'm new to Hadoop, so please do not judge strictly my seemingly simple question.
The short version: What tuple data type can I use in Hadoop to store 2 longs as a single value in a sequence file?
Moreover, I want to be able to read and process this file with Apache Pig like A = LOAD '/my/file' AS (a:long, (b:long, c:long)) and with Scala & Spark like val a = sc.sequenceFile[LongWritable, DesiredTuple]("/my/file", 1).
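To make the intent concrete, the DesiredTuple I imagine would look roughly like the following (only a sketch in Scala, assuming the standard org.apache.hadoop.io.Writable interface; the class and field names are my own):
import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

// Sketch only: a value type holding two longs, serialized back to back
class DesiredTuple(var first: Long, var second: Long) extends Writable {
  def this() = this(0L, 0L) // Hadoop needs a no-arg constructor for deserialization
  override def write(out: DataOutput): Unit = {
    out.writeLong(first)
    out.writeLong(second)
  }
  override def readFields(in: DataInput): Unit = {
    first = in.readLong()
    second = in.readLong()
  }
  override def toString: String = s"$first\t$second"
}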
The full story:
I'm writing a Hadoop job in Java, and I need to output a sequence file which contains 3 long values on each line. I use the first value as a key and group the other two values together as the value in my Reducer.
I tried several variants:
Using org.apache.hadoop.mapreduce.lib.join.TupleWritable
public class MyReducer extends Reducer<...> {
    public void reduce(Context context) {
        long a, b, c;
        // ...
        context.write(a, new TupleWritable(
                new LongWritable[]{new LongWritable(b), new LongWritable(c)}));
    }
}
But the javadoc of the TupleWritable class says "This is not a general-purpose tuple type." It seemed OK for a first attempt, but I can't get my tuples back. Look at this simple script in Apache Pig:
A = LOAD '/my/file' USING org.apache.pig.piggybank.storage.SequenceFileLoader()
AS (a:long, (b:long, t:long));
DUMP A;
I got something like this:
(2220,)
(5640,)
(6240,)
...
So what is the Apache Pig way of reading Hadoop's TupleWritable from a sequence file?
Furthermore, I tried changing the sequence format to a text format: job.setOutputFormatClass(TextOutputFormat.class);
This time I just looked at one of the output files:
> hdfs dfs -cat /my/file/part-r-00000 | head
2220 [,]
5640 [,]
6240 [,]
...
So the next question is: why is there nothing in my TupleWritable value?
After that, I tried org.apache.mahout.cf.taste.hadoop.EntityEntityWritable.
For a sequence file I got the same result as before:
grunt> A = LOAD '/my/file' USING org.apache.pig.piggybank.storage.SequenceFileLoader() AS (a:long, (b:long, c:long));
(2220,)
(5640,)
(6240,)
...
For a text file I got the desired result:
2220 2 15
5640 1 9
6240 0 1
...
And the next question is: how can I read such tuples (EntityEntityWritable, and maybe other custom objects) back from a Hadoop-written sequence file?

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records as per the logic of the application. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
then I find that every record in the file begins with ArrayBuffer(...). What I need is a file with one EmployeeRecord on each line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
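To illustrate with a toy example (not the actual EmployeeRecord data), flatMap(identity) flattens one level of nesting:
val nested = sc.parallelize(Seq(Seq("a", "b"), Seq("c")))
nested.collect()                   // Array(List(a, b), List(c))
nested.flatMap(identity).collect() // Array(a, b, c)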
I also found this post suggesting the same thing. I wish I had seen it a bit earlier.

How to write this snippet in Python?

I am learning Python (I have a C/C++ background).
I need to write something practical in Python though, whilst learning. I have the following pseudocode (my first attempt at writing a Python script, since reading about Python yesterday). Hopefully, the snippet details the logic of what I want to do. BTW I am using python 2.6 on Ubuntu Karmic.
Assume the script is invoked as: script_name.py directory_path
import csv, sys, os, glob

# Can I declare that the function accepts a dictionary as first arg?
def getItemValue(item, key, defval)
    return !item.haskey(key) ? defval : item[key]

dirname = sys.argv[1]
# declare some default values here
weight, is_male, default_city_id = 100, true, 1
# fetch some data from a database table into a nested dictionary, indexed by a string
curr_dict = load_dict_from_db('foo')

# iterate through all the files matching *.csv in the specified folder
for infile in glob.glob( os.path.join(dirname, '*.csv') ):
    # get the file name (without the '.csv' extension)
    code = infile[0:-4]
    # open file, and iterate through the rows of the current file (a CSV file)
    f = open(infile, 'rt')
    try:
        reader = csv.reader(f)
        for row in reader:
            # lookup the id for the code in the dictionary
            id = curr_dict[code]['id']
            name = row['name']
            address1 = row['address1']
            address2 = row['address2']
            city_id = getItemValue(row, 'city_id', default_city_id)
            # insert row to database table
    finally:
        f.close()
I have the following questions:
Is the code written in a Pythonic enough way (is there a better way of implementing it)?
Given a table with a schema like the one shown below, how may I write a Python function that fetches data from the table and returns it in a dictionary indexed by a string (name)?
How can I insert the row data into the table? (Actually I would like to use a transaction if possible, and commit just before the file is closed.)
Table schema:
create table demo (id int, name varchar(32), weight float, city_id int);
BTW, my backend database is postgreSQL
[Edit]
Wayne et al:
To clarify, what I want is a set of rows. Each row can be indexed by a key (so that means the rows container is a dictionary, right?). Now, once we have retrieved a row by using the key, I also want to be able to access the 'columns' in the row - meaning that the row data itself is a dictionary. I don't know if Python supports multidimensional array syntax when dealing with dictionaries - but the following statement will help explain how I intend to conceptually use the data returned from the db. A statement like dataset['joe']['weight'] will first fetch the row data indexed by the key 'joe' (which is a dictionary) and then index that dictionary for the key 'weight'. I want to know how to build such a dictionary of dictionaries from the retrieved data in a Pythonic way, like you did before.
A simplistic way would be to write something like:
import pyodbc
mydict = {}
cnxn = pyodbc.connect(params)
cursor = cnxn.cursor()
cursor.execute("select user_id, user_name from users"):
for row in cursor:
    mydict[row.id] = row
Is this correct/can it be written in a more pythonic way?
To get the value from the dictionary you need to use the .get method of the dict:
>>> d = {1: 2}
>>> d.get(1, 3)
2
>>> d.get(5, 3)
3
This will remove the need for the getItemValue function. I won't comment on the existing syntax since it's clearly alien to Python. The correct syntax for a ternary in Python is:
true_val if true_false_check else false_val
>>> 'a' if False else 'b'
'b'
But as I'm saying below, you don't need it at all.
If you're using Python 2.6 or newer, you should use the with statement instead of try-finally:
with open(infile) as f:
    reader = csv.reader(f)
    ... etc
Seeing that you want to have row as a dictionary, you should be using csv.DictReader and not a plain csv.reader. However, it is unnecessary in your case. Your SQL query could just be constructed to access the fields of the row dict, in which case you wouldn't need to create the separate items city_id, name, etc. To add a default city_id to row if it doesn't exist, you could use the .setdefault method:
>>> d
{1: 2}
>>> d.setdefault(1, 3)
2
>>> d
{1: 2}
>>> d.setdefault(3, 3)
3
>>> d
{1: 2, 3: 3}
and for id, simply row['id'] = curr_dict[code]['id']
When slicing, you could skip 0:
>>> 'abc.txt'[:-4]
'abc'
Generally, Python's database libraries provide fetchone, fetchmany and fetchall methods on the cursor, which return a Row object that might support dict-like access or might just be a simple tuple. It will depend on the particular module you're using.
It looks mostly Pythonic enough for me.
The ternary operation should look like this though (I think this will return the result you expect):
return defval if not key in item else item[key]
Yeah, you can pass a dictionary (or any other value) as basically any argument. The only exception is *args and **kwargs (named by convention; technically you can use any names you want), which are expected to be the last one or two parameters, in that order.
For inserting into a DB you can use the odbc module:
import odbc
conn = odbc.odbc('servernamehere')
cursor = conn.cursor()
cursor.execute("INSERT INTO mytable VALUES (42, 'Spam on Eggs', 'Spam on Wheat')")
conn.commit()
You can read up or find plenty of examples on the odbc module - I'm sure there are other modules as well, but that one should work fine for you.
For retrieval you would use
cursor.execute("SELECT * FROM demo")
#Reads one record - returns a tuple
print cursor.fetchone()
#Reads the rest of the records - a list of tuples
print cursor.fetchall()
to make one of those records into a dictionary:
record = cursor.fetchone()
# Removes the 2nd element (at index 1) from the record
mydict[record[1]] = record[:1] + record[2:]
Though that practically screams for a generator expression if you want the whole shebang at once
mydict = dict((record[1], record[:1] + record[2:]) for record in cursor.fetchall())
which should give you all of the records packed up neatly in a dictionary, using the name as a key.
HTH
A colon is required after def:
def getItemValue(item, key, defval):
...
Boolean operators: in Python, ! -> not, && -> and, and || -> or (see http://docs.python.org/release/2.5.2/lib/boolean.html for the boolean operators). There's no ?: operator in Python; there is a (value) if (condition) else (other_value) expression, although I personally rarely use it in favour of plain ifs.
Booleans/None: True, False and None are capitalized.
Checking types of arguments: in Python, you generally don't declare the types of function parameters. You could write e.g. assert isinstance(item, dict), "dicts must be passed as the first parameter!" in the function, although this kind of "strict checking" is often discouraged as it's not always necessary in Python.
Python keywords: default isn't a reserved Python keyword and is acceptable as an argument or variable name (just for reference).
Style guidelines: PEP 8 (the Python style guideline) states that module imports should generally be one per line, though there are some exceptions (I have to admit I often don't put import sys and import os on separate lines, though I usually follow the rule otherwise).
File open modes: rt isn't valid in Python 2.x - it will work, though the t will be ignored. See also http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files. It is valid in Python 3 though, so I don't think it'd hurt if you want to force text mode, raising exceptions on binary characters (use rb if you want to read non-ASCII characters).
Working with dictionaries: Python used to use dict.has_key(key), but you should use key in dict now (it has largely replaced has_key; see http://docs.python.org/library/stdtypes.html#mapping-types-dict).
Splitting file extensions: code = infile[0:-4] could be replaced with code = os.path.splitext(infile)[0]; os.path.splitext returns e.g. ('root', '.ext'), with the dot kept in the extension (see http://docs.python.org/library/os.path.html#os.path.splitext).
EDIT: removed the multiple-variable-declarations-on-a-single-line stuff and added some formatting. Also corrected the point about rt not being a valid mode in Python: in Python 3 it is.