This is the data, in text file format. I need to find the top salary for every city.
first_name last_name city county salary
--------------------------------------------------------
James Butt New Orleans Orleans 250000
Josephine Darakjy Brighton Livingston 300000
Art Venere Bridgeport Gloucester 400000
Leota Dilliard Bridgeport Gloucester 430000
> val scq = sc.textFile("path.txt")
> scq.flatMap(al=>al.split("\n")).sortBy(_._5,ascending = false).collect.take(5).foreach(println)
// sorting on salary
But I am getting the error "value _5 is not a member of String". When I use toString, the error becomes "value _5 is not a member of Char".
How should it be handled?
Try this:
> val scq = sc.textFile("path.txt")
> val d = scq.map(_.split("\t")).sortBy(_.apply(4).toInt, ascending = false) // compare salaries as numbers, not strings
This will produce an RDD[Array[String]] as the output. If you want to view the rows as tuples, you can do the following:
> val d1 = d.map(c => (c(0), c(1), c(2), c(3), c(4))) // Prefer case class over this always
> d1.collect.foreach(println)
This will produce the following output:
(Leota,Dilliard,Bridgeport,Gloucester,430000)
(Art,Venere,Bridgeport,Gloucester,400000)
(Josephine,Darakjy,Brighton,Livingston,300000)
(James,Butt,New Orleans,Orleans,250000)
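The sort above orders the whole file by salary, while the question asks for the top salary per city, which needs a per-key reduction. A minimal plain-Python sketch of that logic, assuming tab-separated fields and using the four sample rows (the function name top_salary_by_city is hypothetical, not part of any library):

```python
def top_salary_by_city(lines):
    """Return {city: highest salary} for tab-separated records."""
    best = {}
    for line in lines:
        fields = line.split("\t")
        # Skip the header, the dashed separator, and malformed rows.
        if len(fields) != 5 or not fields[4].strip().isdigit():
            continue
        city, salary = fields[2], int(fields[4])
        # Compare salaries as numbers, not as strings.
        if salary > best.get(city, -1):
            best[city] = salary
    return best


sample = [
    "first_name\tlast_name\tcity\tcounty\tsalary",
    "--------------------------------------------------------",
    "James\tButt\tNew Orleans\tOrleans\t250000",
    "Josephine\tDarakjy\tBrighton\tLivingston\t300000",
    "Art\tVenere\tBridgeport\tGloucester\t400000",
    "Leota\tDilliard\tBridgeport\tGloucester\t430000",
]
result = top_salary_by_city(sample)
print(result)
```

On an RDD the same idea would usually be a map to (city, salary) pairs followed by a reduce per key that keeps the maximum.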
Related
I have a CSV file with the headers FileName, ColumnName, Rule and RuleDetails.
Based on RuleDetails, I need to count the values in the given column (for example INSTALLDATE) that do not match the date format specified in RuleDetails.
I have to pass ColumnName and RuleDetails dynamically.
I tried the code below:
from pyspark.sql.functions import *
DateFields = []
for rec in df_tabledef.collect():
    if rec["Rule"] == "DATEFORMAT":
        DateFields.append(rec["Columnname"])
        DateFormatValidvalues = [str(x) for x in rec["Ruledetails"].split(",") if x]
        DateFormatString = ",".join([str(elem) for elem in DateFormatValidvalues])
DateColsString = ",".join([str(elem) for elem in DateFields])
output = (
    df_tabledata.select(DateColsString)
    .where(
        DateColsString
        not in (datetime.strptime(DateColsString, DateFormatString), "DateFormatString")
    )
    .count()
)
display(output)
The expected output is the count of records that do not match the given date format.
For example, if 4 out of 10 records are not in YYYY-MM-DD format, the count should be 4.
Running the above code produces an error.
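The counting logic itself can be sketched in plain Python, outside Spark. This is only an illustration under stated assumptions: the mapping from the rule notation (YYYY, MM, DD) to strptime directives is assumed, and the names to_strptime_format and count_mismatches are hypothetical.

```python
from datetime import datetime

# Assumed mapping from the rule's notation to strptime directives.
TOKEN_TO_DIRECTIVE = {"YYYY": "%Y", "MM": "%m", "DD": "%d"}


def to_strptime_format(rule_format):
    """Translate e.g. 'YYYY-MM-DD' into '%Y-%m-%d'."""
    fmt = rule_format
    for token, directive in TOKEN_TO_DIRECTIVE.items():
        fmt = fmt.replace(token, directive)
    return fmt


def count_mismatches(values, rule_format):
    """Count the values that cannot be parsed with the given date format."""
    fmt = to_strptime_format(rule_format)
    mismatches = 0
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            # TypeError covers missing (None) values, ValueError bad formats.
            mismatches += 1
    return mismatches


install_dates = [
    "2021-01-15", "2020-02-29", "2019-12-31", "2022-07-04", "2023-03-10",
    "2018-11-05", "15/01/2021", "2021-13-01", "", "01-15-2021",
]
bad_count = count_mismatches(install_dates, "YYYY-MM-DD")
print(bad_count)
```

Note that strptime is lenient about zero padding, so a value such as 2021-1-5 would still count as matching here. In PySpark the same idea is often expressed by parsing the column with to_date and counting the rows where the result is null.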
I am new to PySpark and have purchased a book to enhance my PySpark skills. I am stuck while using a function.
Function
def filterDuplicates ( ( userID, ratings ) ):
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2
I am getting an error because of the two consecutive parentheses. This step receives an RDD, which is basically a list of tuples as shown below:
[(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0)))]
The final result should contain only distinct (movie ID, rating) pairs for each viewer.
So in the above-given example, 196 is viewer ID, 242 is movie ID and 3.0 is rating given by viewer.
Kindly advise if I need to download a different version of python to use double parenthesis. Presently I have Python 3.7 installed on my machine.
Thanks,
AJ
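Python 3 no longer allows tuple unpacking in a function's parameter list (it was removed by PEP 3113), so the double parentheses are a syntax error on Python 3.7. You do not need a different Python version. If you want the tuple to be the parameter of the function, name the whole tuple and index into it:
def filterDuplicates ( userData ):
    userId = userData[0]
    ratings = userData[1]
    movie1 = ratings[0][0]
    rating1 = ratings[0][1]
    movie2 = ratings[1][0]
    rating2 = ratings[1][1]
    return movie1 < movie2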
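A more compact variant unpacks the tuple inside the function body, which Python 3 still allows. The sketch below applies it to the two sample records from the question with a plain list comprehension instead of an RDD filter; the snake_case name filter_duplicates is just an illustrative rename.

```python
def filter_duplicates(user_ratings):
    # Unpack in the body; Python 3 does not allow this in the signature.
    user_id, ratings = user_ratings
    (movie1, rating1), (movie2, rating2) = ratings
    # Keep only pairs whose first movie ID is smaller than the second,
    # which drops self-pairs and mirrored duplicates.
    return movie1 < movie2


pairs = [(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0)))]
kept = [pair for pair in pairs if filter_duplicates(pair)]
print(kept)
```

With an RDD, the same function can be passed unchanged to filter.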
I have been working with Spark and Scala for the past 2 months and am new to this technology. I built the select columns (with regexp_replace) as a List[String] and passed the list to the DataFrame select, and it throws a "Cannot resolve" error. Below are the steps I followed and tried.
Defining the val:
Defining the column which I would like to identify in the src data frame
val col_name = "region_id"
Defining the column which will be used to replace the src data frame column from ref data frame
val surr_key_col_name = "surrogate_key"
I have created two Data frames as shown below
src_df
region id | region_name | region_code
10001189 | Spain | SP09 8545
10001765 | Africa | AF97 6754
ref_df
region id | surrogate_key
1189 | 2345
1765 | 8978
val src_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/src_details.csv")
val ref_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/ref_details.csv")
I'm iterating through the source header to find the column that needs a regex match-and-replace with a column value from the other DataFrame, and storing the resulting expressions in a List to pass to the DataFrame's select:
val src_header_rec = src_df.columns.toList
//Loop through source file header to identify the region_id and replace it with surrogate_id by doing a pattern match( I don't want to replace the
for (src_header_cols <- src_header_rec) {
  if (col_name == src_header_cols) {
    src_column_names :+="regexp_replace("+"$"+s""""src.$src_header_cols""""+","+"$"+s""""ref.$src_header_cols""""+","+"$"+s""""ref.$surr_key_col_name""""+")"+".as("+s""""$src_header_cols""""+")"
  }
  else {
    src_column_names :+= "src."+src_header_cols
  }
}
After building the select columns in the List[String] using the for loop above, I pass them to select to create final_df:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
If I pass the columns directly in the DataFrame select, without using the List[String], my regexp_replace substitution works:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(regexp_replace($"src.region_id",$"ref.region_id",$"ref.surrogate_key").as("region_id"))
I'm not sure why it's not working when I pass it as a List[String].
When I remove the regexp_replace substitution from the for loop and pass the List[String] to the DataFrame select, it works properly.
This code works very well with the DataFrame select:
for (src_header_cols <- src_header_rec) {
  if (col_name == src_header_cols) {
    src_column_names :+= "ref."+surr_key_col_name
  }
  else {
    src_column_names :+= "src."+src_header_cols
  }
}
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)===ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
The result/output Data Frame I'm trying to derive is
final_df
region id | region_name | region_code
1000**2345** | Spain | SP09 8545
1000**8978** | Africa | AF97 6754
So, when I build the DataFrame select list in the for loop with regexp_replace as a List and use it, it throws a "Cannot resolve" error.
I have also tried creating temporary views of the DataFrames and using the same regexp in a SQL select statement over those views, and that worked. The code that worked is below.
// This for loop scans the header list and builds a regexp_replace expression for whichever column matches. Here, region_id from the DataFrame header matches the variable value defined above.
for (src_header_cols <- src_header_rec) {
  if (col_name == src_header_cols) {
    src_column_names :+= "regexp_replace(src."+s"$src_header_cols"+",ref."+s"$ref_col_name"+",ref."+s"$surr_key_col_name"+")"+s" $src_header_cols"
  }
  else {
    src_column_names :+= "src."+src_header_cols
  }
}
//Creating Temporary view to apply SQL queries on it
src_df.createOrReplaceTempView("src")
ref_df.createOrReplaceTempView("ref")
//Framing SQL statements to be passed while querying
val selectExpr_1 = "select "+src_column_names.mkString(",")
val selectExpr_2 = selectExpr_1+" from src left outer join ref on(src."+s"$col_name"+" = ref."+s"$ref_col_name"+")"
// Creating a final Data Frame using the SQL statement created
val src_policy_masked_df = spark.sql(s"$selectExpr_2")
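The string assembly in this working temp-view version can be sketched as follows, in Python for brevity. The function name build_query is hypothetical, and ref_col_name (used above but never defined in the snippets) is assumed here to hold "region_id".

```python
def build_query(header, col_name, ref_col_name, surr_key_col_name):
    """Assemble the SELECT statement that is run against the temp views."""
    select_items = []
    for column in header:
        if column == col_name:
            # Only the matching column is wrapped in regexp_replace.
            select_items.append(
                f"regexp_replace(src.{column}, ref.{ref_col_name}, "
                f"ref.{surr_key_col_name}) AS {column}"
            )
        else:
            select_items.append(f"src.{column}")
    return (
        "SELECT " + ", ".join(select_items)
        + f" FROM src LEFT OUTER JOIN ref ON (src.{col_name} = ref.{ref_col_name})"
    )


query = build_query(
    ["region_id", "region_name", "region_code"],
    col_name="region_id",
    ref_col_name="region_id",  # assumption: not defined in the original
    surr_key_col_name="surrogate_key",
)
print(query)
```

This version works because the whole string is parsed as SQL. By contrast, select called with plain strings treats each string as a column name, which is consistent with the "Cannot resolve" error; in the DataFrame API, selectExpr (or expr) is the call that parses strings as SQL expressions.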
I'm trying to convert a dataframe from long to wide as suggested in How to pivot DataFrame?
However, the SQL seems to misinterpret the entries of the countries list as columns of the table. Below are the messages I saw in the console, followed by the sample data and code from the above link. Does anyone know how to resolve this?
Messages from the scala console:
scala> val myDF1 = sqlc2.sql(query)
org.apache.spark.sql.AnalysisException: cannot resolve 'US' given input columns id, tag, value;
Sample data and code from the linked question. I have:
id tag value
1 US 50
1 UK 100
1 Can 125
2 US 75
2 UK 150
2 Can 175
and I want:
id US UK Can
1 50 100 125
2 75 150 175
I can create a list with the value I want to pivot and then create a string containing the sql query I need.
val countries = List("US", "UK", "Can")
val numCountries = countries.length - 1
var query = "select *, "
for (i <- 0 to numCountries-1) {
  query += "case when tag = " + countries(i) + " then value else 0 end as " + countries(i) + ", "
}
query += "case when tag = " + countries.last + " then value else 0 end as " + countries.last + " from myTable"
myDataFrame.registerTempTable("myTable")
val myDF1 = sqlContext.sql(query)
Country codes are string literals and should be enclosed in quotes; otherwise the SQL parser treats them as column names:
val caseClause = countries.map(
  x => s"""CASE WHEN tag = '$x' THEN value ELSE 0 END as $x"""
).mkString(", ")
val aggClause = countries.map(x => s"""SUM($x) AS $x""").mkString(", ")
val query = s"""
  SELECT id, $aggClause
  FROM (SELECT id, $caseClause FROM myTable) tmp
  GROUP BY id"""
sqlContext.sql(query)
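The effect of quoting the literals can be checked without a Spark cluster. The sketch below builds the same query shape in Python and runs it against an in-memory SQLite table as a stand-in for Spark SQL. That substitution is an assumption made purely for illustration; the ORDER BY is added only to make the row order deterministic.

```python
import sqlite3

countries = ["US", "UK", "Can"]

# Quote the country code where it is a value, leave it bare where it is an alias.
case_clause = ", ".join(
    f"CASE WHEN tag = '{c}' THEN value ELSE 0 END AS {c}" for c in countries
)
agg_clause = ", ".join(f"SUM({c}) AS {c}" for c in countries)
query = (
    f"SELECT id, {agg_clause} "
    f"FROM (SELECT id, {case_clause} FROM myTable) tmp "
    "GROUP BY id ORDER BY id"
)

rows = [
    (1, "US", 50), (1, "UK", 100), (1, "Can", 125),
    (2, "US", 75), (2, "UK", 150), (2, "Can", 175),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER, tag TEXT, value INTEGER)")
conn.executemany("INSERT INTO myTable VALUES (?, ?, ?)", rows)
pivoted = conn.execute(query).fetchall()
conn.close()
print(pivoted)
```

Each id collapses to one row with one column per country, matching the wide layout requested in the question.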
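The question is, why even bother building SQL strings from scratch? The same pivot can be written with the DataFrame API:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, sum, when}
// The $"..." syntax additionally needs the SQLContext implicits in scope.
def genCase(x: String) = {
  when($"tag" <=> lit(x), $"value").otherwise(0).alias(x)
}
def genAgg(f: Column => Column)(x: String) = f(col(x)).alias(x)
df
  .select($"id" :: countries.map(genCase): _*)
  .groupBy($"id")
  .agg($"id".alias("dummy"), countries.map(genAgg(sum)): _*)
  .drop("dummy")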
I have a few tables, let's say two for simplicity. I can create them this way:
...
val tableA = new Table[(Int,Int)]("tableA"){
  def a = column[Int]("a")
  def b = column[Int]("b")
}
val tableB = new Table[(Int,Int)]("tableB"){
  def a = column[Int]("a")
  def b = column[Int]("b")
}
I want a query that retrieves value 'a' from tableA together with the matching 'a' values from tableB, as a list nested inside each result row.
My result should be:
List[(a,List(b))]
So far I have got to this point in the query:
def createSecondItr(b1: NamedColumn[Int]) = for {
  b2 <- tableB if b1 === b2.b
} yield b2.a
val q1 = for {
  a1 <- tableA
  listB = createSecondItr(a1.b)
} yield (a1.a, listB)
I didn't test the code, so there might be errors in it. My problem is that I cannot retrieve data from the results.
To understand the question, think of trains and their classes: you search for trains after 12 pm and need a result set containing each train's name, with the classes that train has as a list inside the train's result.
I don't think you can do this directly in ScalaQuery. What I would do is to do a normal join and then manipulate the result accordingly:
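import scala.collection.mutable.{HashMap, Set, MultiMap}
// Group a list of (key, value) pairs into a map from key to the set of its values.
def list2multimap[A, B](list: List[(A, B)]) =
  list.foldLeft(new HashMap[A, Set[B]] with MultiMap[A, B]) { (acc, pair) => acc.addBinding(pair._1, pair._2) }
// Braces are needed so that each generator can sit on its own line.
val q = for {
  a <- tableA
  b <- tableB
  if a.b === b.b
} yield (a.a, b.a)
list2multimap(q.list)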
The list2multimap is taken from https://stackoverflow.com/a/7210191/66686
The code is written without assistance of an IDE, compiler or similar. Consider the debugging free training :-)
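The grouping step that list2multimap performs on the joined rows can also be sketched in a few lines of Python. The sample pairs below are made up for illustration, and list_to_multimap is a hypothetical name mirroring the Scala helper.

```python
from collections import defaultdict


def list_to_multimap(pairs):
    """Group (key, value) pairs into {key: set of values}."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return dict(grouped)


# Hypothetical join output: (a from tableA, a from tableB).
joined_rows = [(1, 10), (1, 11), (2, 20), (1, 10)]
result = list_to_multimap(joined_rows)
print(result)
```

Because the values are collected into sets, a pair that appears twice in the join output is stored only once.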