How to get columns to show in a select statement in Spark Scala

I am using the code below to select columns from two tables. I am using Spark with Scala 2.11.11 and the code runs, but it only returns the package ID and the number of packages. I need to see package ID, number of packages, first name, and last name in the result set. What am I missing in my code?
import org.apache.spark.sql.functions._
import spark.implicits._
flData_csv
.toDF("packageId", "flId", "date", "to", "from")
customers_csv.toDF("packageId", "firstName", "lastName")
flData_csv
.join(customers_csv, Seq("packageId"))
.select("packageId", "count", "firstName", "lastName")
.withColumnRenamed("packageId", "Package ID").groupBy("Package ID").count()
.withColumnRenamed("count", "Number of Packages")
.filter(col("count") >= 20)
.withColumnRenamed("firstName", "First Name")
.withColumnRenamed("lastName", "Last Name")
.show(100)

After reading your code, I notice that there's a .groupBy call after the packageId renaming. After a .groupBy call, you're left with only the group key(s) (Package ID in this case) and whatever comes out of the aggregation.
I think adding firstName and lastName as group keys would solve your problem. Here's a sample:
flData_csv
.join(customers_csv, Seq("packageId"))
// group by the customer name columns as well so they survive the aggregation
.groupBy("packageId", "firstName", "lastName")
.count()
.filter(col("count") >= 20)
.withColumnRenamed("packageId", "Package ID")
.withColumnRenamed("count", "Number of Packages")
.withColumnRenamed("firstName", "First Name")
.withColumnRenamed("lastName", "Last Name")
.show(100)

Related

Can't insert string to Delta Table using Update in Pyspark

I have encountered an issue where it will not let me insert a string using update and it returns an error. I'm running Databricks Runtime 6.5 (includes Apache Spark 2.4.5, Scala 2.11), and it does not work on the 6.4 runtime either.
I have a Delta table with the following columns, partitioned by the created date:
ID string
, addressLineOne string
, addressLineTwo string
, addressLineThree string
, addressLineFour string
, matchName string
, createdDate
I'm running a process that hits an API and updates the matchName column.
Using PySpark, if I do this, just to test writing:
deltaTable.update(col("ID") == "ABC123", {"matchName ": "example text"})
I get the following error:
Py4JJavaError: An error occurred while calling o1285.update.
: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'example
If I try this, changing the string to 123, it updates without an issue:
deltaTable.update(col("ID") == "ABC123", {"matchName ": "123"})
Yet if I use SQL and run
UPDATE myTable SET matchName = "Some text" WHERE ID = "ABC123"
it updates fine. I've searched and can't find a similar issue. Any suggestions? Have I missed something obvious?
Looks like you have an extra space after matchName in your Python code.
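With the space removed, the key resolves to the actual column. For reference, here is a rough Scala-API equivalent of the intended update (a sketch only; the table path below is hypothetical):
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

val deltaTable = DeltaTable.forPath(spark, "/path/to/myTable") // hypothetical path
// "matchName" with no trailing space resolves to the real column,
// so a plain string literal can be assigned to it
deltaTable.update(
  col("ID") === "ABC123",
  Map("matchName" -> lit("example text")))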

Compare current and all next values in records in map in Spark

I am new to Scala. Sample data:
1,"jack",34.5
2,"jackk",14.5
3,"jacky",24.5
4,"jack",64.5
And many more.
I want to compare each field of the first record with that field in all the other records, then the second with all the others, and so on. (Please don't consider syntax.)
I have written the code below:
import org.apache.spark.sql.Row

val data = sc.parallelize(Seq((1,"jack",34.5),
(2,"jackk",14.5),
(3,"jacky",24.5),
(4,"jack",64.5)))
val res = data.map{ f =>
val rr = f._1.equals(f._1) // here the same field is compared with itself, but I want to compare the current record with all the next records
Row(rr)
}
Example:
"jack" with "jackk"
"jack" with "jacky"
"jack" with "jack"
"jackk" with "jacky"
"jackk" with "jack"
"jacky" with "jack"
I am using .map because I want the code to be executed on the cluster.
Please give some suggestions.
Thanks in advance.
Try something like this:
data.cartesian(data).map(pair => compare(pair._1, pair._2))
But be aware that the 'cartesian' operation takes N*N space.
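If only the "current record vs. all later records" pairs are needed (as in the example list above), the cartesian result can be filtered before comparing. A minimal sketch, assuming the first tuple element is a unique, increasing id as in the sample data:
val data = sc.parallelize(Seq(
  (1, "jack", 34.5),
  (2, "jackk", 14.5),
  (3, "jacky", 24.5),
  (4, "jack", 64.5)))

val comparisons = data.cartesian(data)
  .filter { case (a, b) => a._1 < b._1 }             // keep only "current vs. later" pairs
  .map { case (a, b) => (a._2, b._2, a._2 == b._2) } // compare the name fields

comparisons.collect().foreach(println)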

From Keyword Not Found Where Expected Error in Self Join

I have a data table Employees. I want to show the employee name, the employee number, the manager number, and the manager name of the employee who has the lowest salary in the company. I decided to perform a self join, and here's my code:
select worker.employee_id, worker.last_name "Worker Last Name",
worker.salary manager.last_name "Manager Last Name", manager.manager_id
from employees worker join employees manager
on worker.manager_id = manager.employee_id
having worker.salary = (select min(salary)
from employees);
However, when I run this, the error "from keyword not found where expected" pops up. What should I do?
Oops, I realized my own mistakes. I forgot to place a comma between worker.salary and manager.last_name, and I should have used WHERE instead of HAVING.
select worker.employee_id, worker.last_name "Worker Last Name",
worker.salary, manager.last_name "Manager Last Name", manager.manager_id
from employees worker join employees manager
on worker.manager_id = manager.employee_id
where worker.salary = (select min(salary)
from employees);
After fixing those two mistakes, the code runs fine.

How can I validate fields in a table in Vaadin with Scala

How can I validate a field in a Vaadin table? For example, the year field with a regex:
val persons: BeanContainer[Int, Person] =
new BeanContainer[Int, Person](classOf[Person])
persons.setBeanIdProperty("id")
persons.addBean(new Person("Thomas", "Mann", 1929, 123123))
persons.addBean(new Person("W. B.", "Yeats", 1923, 643454))
persons.addBean(new Person("Günter", "Grass", 1999, 743523))
// create table
val table: Table = new Table("Nobel Prize for Literature", persons)
table.setVisibleColumns(Array("id", "firstName", "lastName", "year"))
table.setColumnHeader("lastName", "last name")
table.setColumnHeader("firstName", "first name")
table.setColumnHeader("year", "year")
// create a validator
val yearValidator = new RegexpValidator("[1-2][0-9]{3}",
"year must be a number 1000-2999.");
// TODO check the year field!
table.addValidator(yearValidator)
I create a RegexpValidator, but how can I attach the validator to the right field?
You have to intercept the creation of the fields with a field factory and add the validators there:
table.setTableFieldFactory(new DefaultFieldFactory() {
    @Override
    public Field createField(Item item, Object propertyId, Component uiContext) {
        Field field = super.createField(item, propertyId, uiContext);
        if ("year".equals(propertyId)) {
            field.addValidator(new RegexpValidator("[1-2][0-9]{3}",
                    "year must be a number 1000-2999."));
        }
        return field;
    }
});
(Java, not Scala, but it should be straightforward to translate this to Scala.)
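A rough Scala translation of the same idea (an untested sketch; it assumes the same Vaadin types used above, Item, Component, Field, DefaultFieldFactory and RegexpValidator, are in scope, and the exact createField overload may differ between Vaadin versions):
table.setTableFieldFactory(new DefaultFieldFactory() {
  override def createField(item: Item, propertyId: AnyRef, uiContext: Component): Field = {
    val field = super.createField(item, propertyId, uiContext)
    if ("year" == propertyId) {
      // attach the validator only to the field that edits the "year" property
      field.addValidator(new RegexpValidator("[1-2][0-9]{3}",
        "year must be a number 1000-2999."))
    }
    field
  }
})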

How to do CRUD operations on domain models using Casbah for MongoDB?

There is a tutorial on Casbah:
http://api.mongodb.org/scala/casbah/current/tutorial.html
But I find it hard to follow the tutorial as I am still learning Scala.
All I want to find out is how to do simple CRUD operations using Casbah to begin with, before I move on to anything more advanced.
Given the domain models below:
class Hotel(var name: String, var stars: Int, val address: Address)
class Address(var street: String, var city: String, var postCode: String, var country: String)
val address = new Address(street = "1234 st", city = "edmond", postCode = "1232234", country = "USA")
val hotel = new Hotel(name = "Super Nice", stars = 4, address = address)
val address2 = new Address(street = "main st", city = "edmond", postCode = "1232234", country = "USA")
val hotel2 = new Hotel(name = "Big Hotel", stars = 4, address = address2)
Given the above, what Casbah code would achieve these tasks?
(1) save both hotels in MongoDB
(2) find all hotels that have 4 or more stars; this should give me a list over which I can iterate
(3) find the hotel named "Super Nice" and change its name to "Ultra Nice"
(4) get the addresses of all hotels, change the country to lower case, and save them back to the database
Here you can see how to insert data: Casbah wiki
If you want to directly save case classes (without needing a MongoDBObject) in MongoDB you should have a look at Salat and SalatDao: Salat presentation
In my opinion, the answers to questions (2)-(4) can be found easily in the Casbah and Salat documentation.
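To make it concrete, here is a minimal, untested sketch of tasks (1)-(4) with plain Casbah, assuming a local mongod and a hypothetical test.hotels collection; with Salat the manual conversion to MongoDBObject would go away:
import com.mongodb.casbah.Imports._

val coll = MongoClient()("test")("hotels") // hypothetical database/collection names

// hand-rolled conversion from the domain model to a MongoDBObject
def toDbo(h: Hotel) = MongoDBObject(
  "name" -> h.name,
  "stars" -> h.stars,
  "address" -> MongoDBObject(
    "street" -> h.address.street,
    "city" -> h.address.city,
    "postCode" -> h.address.postCode,
    "country" -> h.address.country))

// (1) save both hotels
coll.insert(toDbo(hotel))
coll.insert(toDbo(hotel2))

// (2) find all hotels with 4 or more stars, as a list to iterate over
val fourStarsOrMore = coll.find("stars" $gte 4).toList
fourStarsOrMore.foreach(println)

// (3) find the hotel named "Super Nice" and rename it to "Ultra Nice"
coll.update(MongoDBObject("name" -> "Super Nice"), $set("name" -> "Ultra Nice"))

// (4) lower-case each hotel's country and save the document back
coll.find().foreach { dbo =>
  val addr = dbo.as[DBObject]("address")
  addr.put("country", addr.as[String]("country").toLowerCase)
  coll.save(dbo)
}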