Match percentage of 2 strings - spark sql - pyspark

I have a requirement to check the match percentage of 2 columns from a table.
For example:
Sample data:
ColA    ColB
AAB     Aab
AACC    Aacc
WER     Wer
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

spark.udf.register('similar', similar)
Output:
similar('AAB','Aab')
Out[16]: 0.3333333333333333
I am able to achieve the requirement using the SequenceMatcher library, but the issue is that I am not able to use that function inside Spark SQL and I am facing the error below. Is there any other way we can achieve the same?
df=spark.sql(f"""SELECT ColA,ColB,Similar(ColA,ColB) FROM test""")
display(df)
Error:
PythonException: 'AttributeError: 'SequenceMatcher' object has no attribute 'matching_blocks'', from , line 4. Full traceback below:

• SequenceMatcher accepts strings along with a junk-value criterion.
• If any of the input strings are None, this error occurs. Empty strings work.
Make sure None inputs are replaced by empty strings.
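A minimal sketch of a null-safe version of the UDF, assuming the same test table and ColA/ColB column names from the question (the explicit DoubleType return type is my addition):

from difflib import SequenceMatcher
from pyspark.sql.types import DoubleType

def similar(a, b):
    # Guard against NULLs coming from the table: SequenceMatcher cannot
    # handle None, but empty strings work fine.
    return SequenceMatcher(None, a or "", b or "").ratio()

spark.udf.register("similar", similar, DoubleType())

df = spark.sql("SELECT ColA, ColB, similar(ColA, ColB) AS match_ratio FROM test")
df.show()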

Related

What is PySpark SQL equivalent function for pyspark.pandas.DataFrame.to_string?

Pandas API function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_string.html
Another answer, though it doesn't work for me: pyspark : Convert DataFrame to RDD[string]
Following the advice from the above post, I tried:
data.rdd.map(lambda row: [str(c) for c in row])
Then I get this error
TypeError: 'PipelinedRDD' object is not iterable
I would like it to output rows of strings, similar to to_string() above. Is this possible?
Would pyspark.sql.DataFrame.show satisfy your expectations about the console output? You can sort the df via pyspark.sql.DataFrame.sort before printing if required.
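A sketch of both options; data is the DataFrame from the question, and "some_column" is only a placeholder:

# Print a formatted table to the console; sort first if you need a fixed order.
data.sort("some_column").show(truncate=False)  # "some_column" is a placeholder

# If you really want Python strings, collect the mapped RDD to the driver;
# the RDD itself is lazy and cannot be iterated directly, which is the usual
# source of that TypeError.
rows_as_strings = data.rdd.map(lambda row: ", ".join(str(c) for c in row)).collect()
for line in rows_as_strings:
    print(line)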

How to remove the first 2 rows using zipWithIndex in Spark Scala

I have two headers in the file and have to remove them. I tried with zipWithIndex; it assigns the index from zero onwards, but it throws an error when performing the filter condition on it.
val data=spark.sparkContext.textFile(filename)
val s=data.zipWithIndex().filter(row=>row[0]>1) --> throwing error here
Any help here please.
Sample data:
===============
sno,empno,name --> need to remove
c1,c2,c3 ==> need to remove
1,123,ramakrishna
2,234,deepthi
Error: identifier expected but integer literal found
value row of type (String,Long) does not take type parameters.
not found: type <error>
If you have to remove rows, you can use a row number and remove them easily by filtering on row_num > 2.
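For reference, the zipWithIndex route from the question also works once the (String, Long) pair is accessed with Scala tuple syntax instead of row[0]; a minimal sketch:

val data = spark.sparkContext.textFile(filename)
val withoutHeaders = data
  .zipWithIndex()                        // each element becomes (line, index)
  .filter { case (_, idx) => idx > 1 }   // indices 0 and 1 are the two header rows
  .map { case (line, _) => line }
withoutHeaders.take(5).foreach(println)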

Scala Spark - Cannot resolve a column name

This should be pretty straightforward, but I'm having an issue with the following code:
val test = spark.read
.option("header", "true")
.option("delimiter", ",")
.csv("sample.csv")
test.select("Type").show()
test.select("Provider Id").show()
test is a dataframe like so:
Type    Provider Id
A       asd
A       bsd
A       csd
B       rrr
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve '`Provider Id`' given input columns: [Type, Provider Id];;
'Project ['Provider Id]
It selects and shows the Type column just fine, but I couldn't get it to work for Provider Id. I wondered if it was because the column name has a space, so I tried using backticks and removing/replacing the space, but nothing seemed to work. It also runs fine when I use the Spark 3.x libraries but fails with Spark 2.1.x (and I need to use 2.1.x).
Additional: I tried changing the CSV column order from Type, Provider Id to Provider Id, Type. The error was the opposite: Provider Id shows, but Type now throws the exception.
Any suggestions?
test.printSchema()
You can use the result from printSchema() to see how exactly Spark read your columns in, then use that name in your code.
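One possible cause (an assumption, not confirmed by the question) is stray whitespace or an invisible character in the header row, which older Spark versions do not strip. printSchema() shows the exact stored name, and trimming all column names is a quick way to rule that out:

test.printSchema()

// If the printed name has leading/trailing spaces, normalise every column name
// and select with the clean name afterwards.
val cleaned = test.columns.foldLeft(test) { (df, name) =>
  df.withColumnRenamed(name, name.trim)
}
cleaned.select("Provider Id").show()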

flatten a spark data frame's column values and put it into a variable

Spark version 1.60, Scala version 2.10.5.
I have a spark-sql dataframe df like this,
+----------------+--------------------------------+
| address        | attributes                     |
+----------------+--------------------------------+
| 1314 44 Avenue | Tours, Mechanics, Shopping     |
| 115 25th Ave   | Restaurant, Mechanics, Brewery |
+----------------+--------------------------------+
From this dataframe, I would like to get the values below:
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMap instead. I found this, so I tried to put it into a variable using:
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
I was thinking that if I can get an output using explode, I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend you use a Spark 2.x version. In Cloudera, when you issue "spark-shell", it launches the 1.6.x version; however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin.
But if you need a Spark 1.6 RDD solution, try this.
import spark.implicits._
import scala.collection.mutable._
val df = Seq(("1314 44 Avenue",Array("Tours", "Mechanics", "Shopping")),
("115 25th Ave",Array("Restaurant", "Mechanics", "Brewery"))).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[mutable.WrappedArray[String]]("attributes") ).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String as its first argument (the name of the added column), but you're passing it a Column here: df.withColumn(df("attributes"), ...).
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a column you can use df("columnName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.
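For completeness, the withColumn form from the question, corrected along those lines (a sketch; the select version above is the safer route on Spark 1.6):

val exploded = df.withColumn("attributes", explode($"attributes")) // String name, Column argument
val allValues = exploded.select("attributes").distinct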

Conditionally map through rows in CSV file in Scala / Spark to produce another CSV file

I am quite new to Scala / Spark and I have been thrown into the deep end. I have been trying hard for several weeks to find a solution to a seemingly simple problem in Scala 2.11.8, but have been unable to find a good one.
I have a large database in csv format close to 150 GB, with plenty of null values, which needs to be reduced and cleaned based on the values of individual columns.
The schema of the original CSV file is as follows:
Column 1: Double
Columnn 2: Integer
Column 3: Double
Column 4: Double
Columnn 5: Integer
Column 6: Double
Columnn 7: Integer
So, I want to conditionally map through all the rows of the CSV file and export the results to another CSV file with the following conditions for each row:
If the value for column 4 is not null, then the values for columns 4, 5, 6 and 7 of that row should be stored as an array called lastValuesOf4to7. (In the dataset if the element in column 4 is not null, then columns 1, 2 and 3 are null and can be ignored)
If the value of column 3 is not null, then the values of columns 1, 2 and 3 and the four elements from the lastValuesOf4to7 array, as described above, should be exported as a new row into another CSV file called condensed.csv. (In the dataset if the element in column 3 is not null, then columns 4, 5, 6 & 7 are null and can be ignored)
So in the end I should get a csv file called condensed.csv, which has 7 columns.
I have tried using the following code in Scala but have not been able to progress further:
import scala.io.Source

object structuringData {
  def main(args: Array[String]) {
    val data = Source.fromFile("/path/to/file.csv")
    var lastValuesOf4to7 = Array("0", "0", "0", "0")
    val lines = data.getLines // Get the lines of the file
    val splitLine = lines.map(s => s.split(',')).toArray // This gives an out of memory error since the original file is huge.
    data.close
  }
}
As you can see from the code above, I have tried to move it into an array but have been unable to progress further since I am unable to process each line individually.
I am quite certain that there must be a straightforward solution to processing CSV files in Scala / Spark.
Use the spark-csv package, then use a SQL query to query the data, apply the filters according to your use case, and export the result at the end.
If you are using Spark 2.0.0, CSV support is built into spark-sql; if you are using an older version, add the spark-csv dependency accordingly.
You can find a link to spark-csv here.
You can also look at the example here: http://blog.madhukaraphatak.com/analysing-csv-data-in-spark/
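A rough sketch of that approach, assuming Spark 2.x with the built-in CSV reader (on 1.x, use format("com.databricks.spark.csv") from the spark-csv package instead); c3 and c4 below are placeholder column names:

val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/file.csv")

raw.createOrReplaceTempView("raw")

// Keep only the rows that carry data in column 3 or column 4. The "carry forward
// the last values of columns 4-7" step would additionally need a window function
// (e.g. last(..., ignoreNulls = true)) over a well-defined ordering column.
val filtered = spark.sql("SELECT * FROM raw WHERE c3 IS NOT NULL OR c4 IS NOT NULL")

filtered.write.option("header", "true").csv("/path/to/condensed")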
Thank you for the response. I managed to create a solution myself using a Bash script. I had to start with a blank condensed.csv file first. My code shows how easy it was to achieve this:
#!/bin/bash
OLDIFS=$IFS
IFS=","
last1=0
last2=0
last3=0
last4=0
while read f1 f2 f3 f4 f5 f6 f7
do
    if [[ $f4 != "" ]]; then
        last1=$f4
        last2=$f5
        last3=$f6
        last4=$f7
    elif [[ $f3 != "" ]]; then
        echo "$f1,$f2,$f3,$last1,$last2,$last3,$last4" >> path/to/condensed.csv
    fi
done < $1
IFS=$OLDIFS
If the script is saved with the name extractcsv.sh, then it should be run using the following format:
$ ./extractcsv.sh path/to/original/file.csv
This only goes to confirm my observation that ETL is easier in Bash than in Scala. Thank you for your help, though.