How to filter dataframe columns that start with something and end with something - Scala

I have this piece of code currently that works as intended
val rules_list = df.columns.filter(_.startsWith("rule")).toList
However, this is including some columns that I don't want. How would I add a second filter so that the combined condition is "columns that start with 'rule' and end with an integer value"?
So it should return "rule_1" in the list of columns, but not "rule_1_modified".
Thanks and have a great day!

You can simply add a regex check to your filter:
val rules_list = df.columns.filter(c => c.startsWith("rule") && c.matches("^.*\\d$")).toList
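If you prefer a single check, both conditions can be folded into one anchored regex; a minimal sketch against the same df from the question:

// keep columns that start with "rule" and end with a digit,
// e.g. "rule_1" matches but "rule_1_modified" does not
val rules_list = df.columns.filter(_.matches("^rule.*\\d$")).toList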

You can also use Python's re module, like this:
import re

rules_list = []
for col_name in df.columns:
    # keep only columns that start with "rule" and end with a digit
    if re.fullmatch(r'rule.*\d', col_name):
        rules_list.append(col_name)
print(rules_list)

Related

Spark (Scala) modify the contents of a Dataset Column

I would like to have a Dataset, where the first column contains single words and the second column contains the filenames of the files where these words appear.
My current code looks something like this:
import org.apache.spark.sql.functions.input_file_name

val path = "path/to/folder/with/files"
val tokens = spark.read.textFile(path)
  .flatMap(line => line.split(" "))
  .withColumn("filename", input_file_name())
tokens.show()
However this returns something like
|word1 |whole/path/to/some/file |
|word2 |whole/path/to/some/file |
|word1 |whole/path/to/some/otherfile|
(I don't need the whole path, just the last bit.) My idea to fix this was to use the map function:
val tokensNoPath = tokens.
map(r => (r(0), r(1).asInstanceOf[String].split("/").lastOption))
So basically, just going to every row, grabbing the second entry, and deleting everything before the last slash.
However, since I'm very new to Spark and Scala, I can't figure out how to get the syntax for this right.
From the docs for substring_index: "substring_index(str, delim, count) Returns the substring from str before count occurrences of the delimiter delim... If count is negative, everything to the right of the final delimiter (counting from the right) is returned."
.withColumn("filename", substring_index(input_file_name, "/", -1))
You can split by slash and get the last element:
val tokens2 = tokens.withColumn("filename", element_at(split(col("filename"), "/"), -1))
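Putting the pieces together, a minimal sketch of the whole pipeline (assuming spark is your SparkSession and spark.implicits._ is in scope, as in a shell or notebook) could look like this:

import org.apache.spark.sql.functions.{input_file_name, substring_index}
import spark.implicits._

val path = "path/to/folder/with/files"

// one row per word, plus the name of the file it came from (directories stripped)
val tokens = spark.read.textFile(path)
  .flatMap(line => line.split(" "))
  .withColumn("filename", substring_index(input_file_name(), "/", -1))

tokens.show(truncate = false)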

Apply a text-preprocessing function to a dataframe column in scala spark

I want to create a function to handle the text preprocessing in a problem I am facing with text data. I am familiar with Python and pandas DataFrames, and my usual approach would be to write a function and then use pandas' apply method to apply it to all the elements in a column. However, I don't know where to begin to accomplish this here.
So, I created two functions to handle the replacements. The problem is that I don't know how to put more than one replace inside this method. I need to make about 20 replacements for three separate dataframes, so solving it this way would take me 60 lines of code. Is there a way to do all the replacements inside a single function and then apply it to all the elements in a dataframe column in Scala?
def removeSpecials: String => String = _.replaceAll("$", " ")
def removeSpecials2: String => String = _.replaceAll("?", " ")
val udf_removeSpecials = udf(removeSpecials)
val udf_removeSpecials2 = udf(removeSpecials2)
val consolidated2 = consolidated.withColumn("product_description", udf_removeSpecials($"product_description"))
val consolidated3 = consolidated2.withColumn("product_description", udf_removeSpecials2($"product_description"))
consolidated3.show()
Well, you can simply chain each replacement after the previous one, like this:
def removeSpecials: String => String = _.replaceAll("\\$", " ").replaceAll("\\?", " ")
(Both characters need escaping, because replaceAll treats its first argument as a regular expression.)
But in a case like this, where the replacement string is the same, it is better to combine them into a single regular expression and avoid multiple replaceAll calls:
def removeSpecials: String => String = _.replaceAll("\\$|\\?", " ")
Note that \\ is used as the escape character.
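As a minimal sketch (reusing the consolidated DataFrame and product_description column from the question, and assuming spark.implicits._ is in scope for the $ syntax), all of the replacements can then go through a single UDF and a single withColumn call; extend the regex with every other character you need to strip:

import org.apache.spark.sql.functions.udf

// every character to be replaced lives in one regex
def removeSpecials: String => String = _.replaceAll("\\$|\\?", " ")

val udf_removeSpecials = udf(removeSpecials)

val cleaned = consolidated
  .withColumn("product_description", udf_removeSpecials($"product_description"))

cleaned.show()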

Spark - read text file, strip off first X and last Y rows using monotonically_increasing_id

I have to read in files from vendors, that can get potentially pretty big (multiple GB). These files may have multiple header and footer rows I want to strip off.
Reading the file in is easy:
val rawData = spark.read
.format("csv")
.option("delimiter","|")
.option("mode","PERMISSIVE")
.schema(schema)
.load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex", monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?
This happens because first returns a Row. To access the first element of the row you must do:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
Converting the row to a string gives "[100000]", which of course is not a valid Long; that's why the cast is failing.
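Putting it together, a sketch of the whole header/footer strip, reusing the names and cutoffs from the question (note the IDs are increasing with row order but not consecutive, so compare against the observed max rather than doing arithmetic on it):

import org.apache.spark.sql.functions.{col, max, monotonically_increasing_id}

val withRN = rawData.withColumn("aIndex", monotonically_increasing_id())

// drop the header rows (IDs 0 and 1)
val noHeader = withRN.filter(col("aIndex").geq(2))

// first returns a Row, so pull the max index out as a Long
val maxRN = withRN.agg(max(col("aIndex"))).first.getLong(0)

// drop the last row (the footer); adapt the same idea for more footer rows
val noFooter = noHeader.filter(col("aIndex").lt(maxRN))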

PySpark - ValueError: Cannot convert column into bool

So I've seen this solution:
ValueError: Cannot convert column into bool
which has the solution I think. But I'm trying to make it work with my dataframe and can't figure out how to implement it.
My original code:
if df2['DayOfWeek'] >= 6:
    df2['WeekendOrHol'] = 1
this gives me the error:
Cannot convert column into bool: please use '&' for 'and', '|' for
'or', '~' for 'not' when building DataFrame boolean expressions.
So based on the above link I tried:
from pyspark.sql.functions import when
when((df2['DayOfWeek']>=6),df2['WeekendOrHol'] = 1)
when(df2['DayOfWeek']>=6,df2['WeekendOrHol'] = 1)
but this is incorrect as it gives me an error too.
To update a column based on a condition you need to use when, like this:
from pyspark.sql import functions as F

# When `DayOfWeek` >= 6, set `WeekendOrHol` to 1;
# otherwise keep the current value of `WeekendOrHol` - or you could do something else.
# If no otherwise is provided, non-matching rows are set to None.
df2 = df2.withColumn(
    'WeekendOrHol',
    F.when(F.col('DayOfWeek') >= 6, F.lit(1))
     .otherwise(F.col('WeekendOrHol'))
)
Hope this helps, good luck!
Best answer as provided by pault:
df2 = df2.withColumn("WeekendOrHol", (df2["DayOfWeek"] >= 6).cast("int"))
This is a duplicate of this question.

find line number in an unstructured file in scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting.
val filePath: String = "myfile"
val myfile = sc.textFile(filePath)
val ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern, but I also want something more like a tuple of (MyPattern line, line number).
Thanks in advance,
You can use zipWithIndex, as eliasah pointed out in a comment (probably the most succinct way to do this, using the direct tuple accessor syntax), or do it like so, using pattern matching in the filter:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter {
  case (line, lineNumber) => line.contains("MyPattern")
}.collect
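As a small follow-up, zipWithIndex is 0-based, so if you want human-style 1-based line numbers in the tuples you can shift them after collecting; a sketch:

// shift the 0-based indices to 1-based line numbers and print the matches
matchingLineAndLineNumberTuples
  .map { case (line, idx) => (line, idx + 1) }
  .foreach { case (line, lineNo) => println(s"line $lineNo: $line") }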