I am pretty new to Scala. I am currently working with Scala 2 and PostgreSQL as the database. I have written a left join query in Slick like the one below:
val executors = TableQuery[Executors]

val innerJoin = (for {
  (rel, a) <- executors joinLeft executors on ((e1, e2) =>
    e1.column1 === e2.column1 && e1.column2 === e2.column2
  ) if rel.id === id
} yield rel.name)
When I print innerJoin.result.statements.headOption it gives me the desired query.
The problem I am facing is that instead of yielding rel.name I want a.name, but I get the error: value name is not a member of slick.lifted.Rep[Option[Executors]].
I checked the Slick documentation, but I am not sure what I am missing here.
I have fixed it with a.map(_.name). Reference: https://scala-slick.org/doc/3.0.0/queries.html
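For reference, this is roughly what the fixed query looks like. Because a comes from the right side of joinLeft, it is a Rep[Option[Executors]], so its columns have to be reached through map (a minimal sketch based on the query above; the val name here is just illustrative):

val fixedQuery = (for {
  (rel, a) <- executors joinLeft executors on ((e1, e2) =>
    e1.column1 === e2.column1 && e1.column2 === e2.column2
  ) if rel.id === id
} yield a.map(_.name)) // Rep[Option[String]], since the right side may be missing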
I'm trying to filter on two fields of two datasets stored in CSV files. I've already applied an inner join between dataset1.csv and dataset2.csv.
This is my starting code:
case class Customer(
  Customer_ID: Int,
  Customer_Name: String,
  Account_Number: Double,
  Marital_Status: String,
  Age: Int,
  Contact: Double,
  Location: String,
  Monthly_Income_USD: Double,
  Yearly_Balance_USD: Double,
  Job_type: String,
  Credit_Card: String
)

case class HouseLoan(
  Customer_ID: Int,
  Account_Number: Double,
  House_Loan_Amount_USD: Double,
  Total_Installment: Int,
  Installment_Pending: Int,
  Loan_Defaulter: String
)
// Key each record by Customer_ID (column 0) so the two RDDs can be joined on it
val custRDD = sc.textFile("dataset1.csv")
  .map(_.split(","))
  .map(r => (
    r(0),
    Customer(r(0).toInt, r(1), r(2).toDouble, r(3), r(4).toInt, r(5).toDouble, r(6), r(7).toDouble, r(8).toDouble, r(9), r(10))
  ))

val houseRDD = sc.textFile("dataset2.csv")
  .map(_.split(","))
  .map(r => (
    r(0),
    HouseLoan(r(0).toInt, r(1).toDouble, r(2).toDouble, r(3).toInt, r(4).toInt, r(5))
  ))
val joinTab = custRDD.join(houseRDD)
joinTab.collect().foreach(println)
Up to here everything is okay; the result is shown in the image.
Now, I need:
the field of Customer_ID (the key of both tables)
the Job_Type
for those who are:
"Doctor" and
House_Loan_Amount_USD is more than 1000000
using join.
I tried something like
val joinTab = custRDD.filter{record => (record.split(",")(9) == "Doctor").join(houseRDD)filter{record => (record.split(",")(2) > 100000}
but it is obviously wrong, and I'm still a noob with Apache Spark.
Note: I can't use Spark SQL because I'm learning this topic at my university (Spark Core / RDDs), so I must do it with an RDD join.
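For what it's worth, one possible way to express this with the RDD API (a sketch only, reusing custRDD and houseRDD from above, keeping the Customer_ID string as the join key, and filtering on the case-class fields after the join instead of re-splitting strings):

// Join on Customer_ID, then keep doctors whose house loan is above 1,000,000 USD
val doctorsWithBigLoans = custRDD.join(houseRDD)
  .filter { case (_, (cust, loan)) =>
    cust.Job_type == "Doctor" && loan.House_Loan_Amount_USD > 1000000
  }
  .map { case (customerId, (cust, _)) => (customerId, cust.Job_type) }

doctorsWithBigLoans.collect().foreach(println)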
Scala/Spark newbie here. I have inherited some old code which I have refactored and have been trying to use to retrieve data from Scylla. The code looks like:
val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"
var selectData = List[Row]()
dataRdd.foreachPartition { iter =>
  // Build up a cluster that we can connect to
  // Start a session with the cluster by connecting to it.
  val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
  var batchCounter = 0
  val session = cluster.connect(tableConfig.keySpace)
  val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)
  iter.foreach {
    case (test_name: String) =>
      // Get results
      val testResults = session.execute(preparedStatement.bind(test_name))
      if (testResults != null) {
        val testResult = testResults.one()
        if (testResult != null) {
          val user_id = testResult.getString("user_id")
          selectData ::= Row(user_id, test_name)
        }
      }
  }
  session.close()
  cluster.close()
}
println("Head is =======> ")
println(selectData.head)
The above does not return any data and fails with a null pointer exception because the selectData list is empty, although there is definitely data in there that matches the select statement. I feel like how I'm doing it is not correct, but I can't figure out what needs to change in order to get this fixed, so any help is much appreciated.
PS: The whole idea of me using a list to keep the results is so that I can use that list to create a DataFrame. I'd be grateful if you could point me in the right direction here.
If you look into the definition of the foreachPartition function, you will see that by definition it can't return anything, because its return type is Unit. On top of that, the closure runs on the executors while selectData lives on the driver, so each executor appends to its own copy and the driver's list stays empty.
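If you did need to keep the manual lookup, a sketch of the same logic with mapPartitions (so each partition returns its rows instead of mutating a driver-side list) could look roughly like this, reusing the ScyllaConnector helper, TEST_QUERY and Row from the question:

// Each partition opens its own session, looks up its names, and returns the matching rows
val selectDataRdd = dataRdd.mapPartitions { iter =>
  val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
  val session = cluster.connect(tableConfig.keySpace)
  val preparedStatement = session.prepare(TEST_QUERY)
  val rows = iter.flatMap { case test_name: String =>
    Option(session.execute(preparedStatement.bind(test_name)).one())
      .map(r => Row(r.getString("user_id"), test_name))
  }.toList // materialise before closing the session, since iterators are lazy
  session.close()
  cluster.close()
  rows.iterator
}
// selectDataRdd can then be turned into a DataFrame with createDataFrame and a matching schema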
Anyway, this is a very bad way of querying data from Cassandra/Scylla from Spark. For that there is the Spark Cassandra Connector, which should work with Scylla as well because of the protocol compatibility.
To read a DataFrame from Cassandra, just do:
spark.read
.format("cassandra")
.option("keyspace", "ksname")
.option("table", "tab")
.load()
The documentation is quite detailed, so it is worth reading.
I'm trying to run, in Slick 3.1, a transaction that contains two updates. The second update is plain SQL using Slick's sqlu interpolator. This is my attempt:
val table = TableQuery[TableDB]
val update1 = table.filter(f => f.name === name).update(rec)
val update2 = sqlu"UPDATE table2 SET field = 1 WHERE field = 2"
val action = (for {
  _ <- update1
  _ <- update2 // <-- compilation error here
} yield ()).transactionally
val future = db.run(action.asTry)
// ... rest of the code
Slick complains on the update2 line with the following messages:
Implicit conversion found: ⇒ augmentString(): scala.collection.immutable.StringOps
type mismatch;
 found   : scala.collection.immutable.IndexedSeq[Unit]
 required: slick.dbio.DBIOAction[?,?,?]
Is it possible to make this work in a single database transaction?
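For what it's worth, these two messages together (augmentString kicking in, and an IndexedSeq[Unit] where a DBIOAction is expected) suggest that update2 is being treated as a plain String rather than as a Slick action. One common cause, and this is an assumption since the imports are not shown, is that the sqlu interpolator in scope is not the one from the profile's api._. A minimal sketch of how the composition is usually written in Slick 3.1:

import slick.driver.PostgresDriver.api._ // brings the sql/sqlu interpolators and .transactionally into scope

val update1 = table.filter(_.name === name).update(rec)
val update2: DBIO[Int] = sqlu"UPDATE table2 SET field = 1 WHERE field = 2"

// Both actions compose into a single transactional DBIO
val action = (for {
  _ <- update1
  _ <- update2
} yield ()).transactionally

val future = db.run(action.asTry)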
I am trying to dedupe event records using the hiveContext in Spark with Scala.
Going from the df to an rdd gives a compilation error saying "object Tuple23 is not a member of package scala". There is a known limitation that Scala tuples cannot have more than 22 elements.
Is there any other way to dedupe?
val events = hiveContext.table("default.my_table")
val valid_events = events.select(
events("key1"),events("key2"),events("col3"),events("col4"),events("col5"),
events("col6"),events("col7"),events("col8"),events("col9"),events("col10"),
events("col11"),events("col12"),events("col13"),events("col14"),events("col15"),
events("col16"),events("col17"),events("col18"),events("col19"),events("col20"),
events("col21"),events("col22"),events("col23"),events("col24"),events("col25"),
events("col26"),events("col27"),events("col28"),events("col29"),events("epoch")
)
//events are deduped based on latest epoch time
val valid_events_rdd = valid_events.rdd.map(t => {
((t(0),t(1)),(t(2),t(3),t(4),t(5),t(6),t(7),t(8),t(9),t(10),t(11),t(12),t(13),t(14),t(15),t(16),t(17),t(18),t(19),t(20),t(21),t(22),t(23),t(24),t(25),t(26),t(28),t(29)))
})
// reduce by key so we will only get one record for every primary key
val reducedRDD = valid_events_rdd.reduceByKey((a,b) => if ((a._29).compareTo(b._29) > 0) a else b)
//Get all the fields
reducedRDD.map(r => r._1 + "," + r._2._1 + "," + r._2._2).collect().foreach(println)
Off the top of my head:
use case classes, which no longer have a size limit; just keep in mind that case classes won't work correctly in the Spark REPL,
use Row objects directly and extract only the keys (see the sketch after the window example below),
operate directly on a DataFrame:
import org.apache.spark.sql.functions.{col, max}
val maxs = df
.groupBy(col("key1"), col("key2"))
.agg(max(col("epoch")).alias("epoch"))
.as("maxs")
df.as("df")
.join(maxs,
col("df.key1") === col("maxs.key1") &&
col("df.key2") === col("maxs.key2") &&
col("df.epoch") === col("maxs.epoch"))
.drop(maxs("epoch"))
.drop(maxs("key1"))
.drop(maxs("key2"))
or with a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

val w = Window.partitionBy($"key1", $"key2").orderBy($"epoch".desc)
df.withColumn("rn", rowNumber().over(w)).where($"rn" === 1).drop("rn")
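And here is a rough outline for the Row-based option mentioned above: key each Row by (key1, key2) and keep the Row with the largest epoch (a sketch only; it assumes epoch is stored as a Long, so adjust getAs to the actual type):

// Key by the two primary-key columns and keep the whole Row with the latest epoch
val reducedRows = valid_events.rdd
  .keyBy(r => (r.get(0), r.get(1)))                // (key1, key2)
  .reduceByKey((a, b) =>
    if (a.getAs[Long]("epoch") > b.getAs[Long]("epoch")) a else b)
  .values

// Convert back to a DataFrame with the original schema if needed
val dedupedDf = hiveContext.createDataFrame(reducedRows, valid_events.schema)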
I am using Scala and Slick, and I am trying to execute a simple query with two conditions.
import JiraData._
import org.scala_tools.time.Imports._
import scala.slick.driver.PostgresDriver.simple._
val today = new DateTime()
val yesterday = today.plusDays(-1)
implicit val session = Database.forURL("jdbc:postgresql://localhost/jira-performance-manager",
  driver = "org.postgresql.Driver",
  user = "jira-performance-manager",
  password = "jira-performance-manager").withSession { implicit session =>
    val activeUsers = users.filter(_.active === true)
    for (activeUser <- activeUsers) {
      val activeUserWorkogs = worklogs.filter(x => x.username === activeUser.name && x.workDate === yesterday)
    }
  }
But I receive an error:
Error:(20, 95) value === is not a member of scala.slick.lifted.Column[org.scala_tools.time.Imports.DateTime]
Note: implicit value session is not applicable here because it comes after the application point and it lacks an explicit result type
val activeUserWorkogs = worklogs.filter(x => x.username === activeUser.name && x.workDate === yesterday)
^
What's wrong here? How can I get a list of results filtered by two conditions?
scala-tools.time uses Joda DateTime; see https://github.com/jorgeortiz85/scala-time/blob/master/src/main/scala/org/scala_tools/time/Imports.scala. Slick does not have built-in support for Joda types. There is slick-joda-mapper: https://github.com/tototoshi/slick-joda-mapper. Or it is easy to add the mapping yourself: http://slick.typesafe.com/doc/2.1.0/userdefined.html#using-custom-scalar-types-in-queries
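For illustration, a hand-rolled mapping for Slick 2.1 might look roughly like this (a sketch only; slick-joda-mapper ships equivalent, better-tested mappers):

import java.sql.Timestamp
import org.joda.time.DateTime
import scala.slick.driver.PostgresDriver.simple._

// Teach Slick to store a Joda DateTime as a SQL timestamp, so === works on DateTime columns
implicit val jodaDateTimeColumnType =
  MappedColumnType.base[DateTime, Timestamp](
    dt => new Timestamp(dt.getMillis),
    ts => new DateTime(ts.getTime)
  )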
As a side note: something like
for (activeUser <- activeUsers) {
  val activeUserWorkogs = worklogs.filter(...)
looks like it is heading in the wrong direction. It will run another query for each active user. It is better to use a join, or to run a single accumulated query for the worklogs of all active users (see the sketch below).
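For example, a single query along these lines (a sketch only, reusing the users and worklogs tables from the question and assuming a DateTime column mapping like the one above is in scope) fetches yesterday's worklogs for all active users in one round trip:

// One round trip: join worklogs to active users and filter on the date
val activeUserWorklogs = for {
  u <- users if u.active === true
  w <- worklogs if w.username === u.name && w.workDate === yesterday
} yield (u.name, w)

val results = activeUserWorklogs.list // runs the query inside the withSession block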