Hi, I am very new to PySpark and haven't written any PySpark code yet, so I need help running a SQL query in PySpark using Python.
Can you please tell me how to create a DataFrame, create a view on it, and run a SQL query on top of it?
Which modules are required to run the query?
Can you please help me run it?
The data comes from the file TERR.txt.
SQL query:
select a.id as nmitory_id, a.dscrptn as nmitory_desc, a.nm as terr_nm, a.pstn_type,
       a.parnt_terr as parnt_nm_id, b.nm as parnt_terr_nm, a.start_dt, a.type,
       CASE
         WHEN substr(a.nm, 1, 6) IN ('105-30', '105-31', '105-32', '105-41', '105-42', '105-43',
                                     '200-CD', '200-CG', '200-CO', '200-CP', '200-CR', '200-DG') THEN 'JBI'
         WHEN substr(a.nm, 1, 6) IN ('100-SC', '105-05', '105-06', '105-07', '105-08', '105-13',
                                     '105-71', '105-72', '105-73') THEN 'JP'
         WHEN substr(a.nm, 1, 6) IN ('103-16') THEN 'JT'
         WHEN substr(a.nm, 1, 6) IN ('105-51', '200-HA', '200-HF', '200-HT', '105-HT') THEN 'JSA'
         WHEN substr(a.nm, 1, 6) IN ('105-61', '200-PR') THEN 'PR'
         WHEN substr(a.nm, 1, 3) IN ('302') THEN 'Canada - MEM'
         WHEN substr(a.nm, 1, 3) IN ('301') THEN 'Canada - MSL'
         ELSE 'Unspecified'
       END AS DEPARTMENT,
       CASE
         WHEN substr(a.nm, 1, 6) IN ('105-06', '105-07', '105-08') THEN 'CVM MSL'
         WHEN substr(a.nm, 1, 6) IN ('100-SC', '105-13') THEN 'CVM CSS'
         WHEN substr(a.nm, 1, 6) IN ('105-41', '200-CD') THEN 'Derm MSL'
         WHEN substr(a.nm, 1, 6) IN ('105-42', '200-CG') THEN 'Gastro MSL'
         WHEN substr(a.nm, 1, 6) IN ('105-31') THEN 'Heme Onc MSL'
         WHEN substr(a.nm, 1, 6) IN ('200-DG') THEN 'Imm MD'
         WHEN substr(a.nm, 1, 6) IN ('103-16') THEN 'ID MSL'
         WHEN substr(a.nm, 1, 6) IN ('200-CP') THEN 'Imm Ops'
         WHEN substr(a.nm, 1, 6) IN ('105-05', '105-71', '105-72', '105-73') THEN 'Neuro MSL'
         WHEN substr(a.nm, 1, 6) IN ('105-30', '200-CO') THEN 'Onc MSL'
         WHEN substr(a.nm, 1, 6) IN ('105-61', '200-PR') THEN 'Puerto Rico MSL'
         WHEN substr(a.nm, 1, 6) IN ('105-43', '200-CR') THEN 'Rheum MSL'
         WHEN substr(a.nm, 1, 6) IN ('105-51', '200-HF') THEN 'RWVE Field'
         WHEN substr(a.nm, 1, 6) IN ('105-32') THEN 'Solid Tumor MSL'
         WHEN substr(a.nm, 1, 6) IN ('200-HT', '105-HT') THEN 'RWVE Pop Health'
         WHEN substr(a.nm, 1, 6) IN ('301-PC') THEN 'Canada - PC MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-VR') THEN 'Canada - VR/ONC MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-SO') THEN 'Canada - Hematology (Myeloid) MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-ON') THEN 'Canada - Hematology (Lymphoid) MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-IP') THEN 'Canada - CNS MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-RD') THEN 'Canada - Rheum MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-IB') THEN 'Canada - Gastro MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-DE') THEN 'Canada - Derm MSL'
         WHEN substr(a.nm, 1, 6) IN ('301-SE') THEN 'Canada - Biologics MSL'
         WHEN substr(a.nm, 1, 6) IN ('302-PC') THEN 'Canada - PC MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-VR') THEN 'Canada - VR/ONC MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-SO') THEN 'Canada - Hematology (Myeloid) MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-ON') THEN 'Canada - Hematology (Lymphoid) MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-IP') THEN 'Canada - CNS MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-RD') THEN 'Canada - Rheum MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-IB') THEN 'Canada - Gastro MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-DE') THEN 'Canada - Derm MEM'
         WHEN substr(a.nm, 1, 6) IN ('302-SE') THEN 'Canada - Biologics MEM'
         ELSE 'Unspecified'
       END AS FRANCHISE
from outbound.terr a left outer join outbound.terr b on a.parnt_terr = b.id
You should create a temp view and query it.
For example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample").getOrCreate()
# read.load() defaults to Parquet; for a delimited text file use the CSV reader instead
df = spark.read.option("header", "true").option("inferSchema", "true").csv("TERR.txt")
df.createTempView("example")
df2 = spark.sql("SELECT * FROM example")
Save your query in a string variable and, given a SparkSession object, use SparkSession.sql to run the query against the registered view:
df.createTempView('TABLE_X')
query = "SELECT * FROM TABLE_X"
df = spark.sql(query)
To read a CSV file into Spark:
def read_csv_spark(spark, file_path):
    # "com.databricks.spark.csv" is the legacy spark-csv package name;
    # on Spark 2+ the built-in short name "csv" works as well.
    df = (
        spark.read.format("com.databricks.spark.csv")
        .options(header="true", inferSchema="true")
        .load(file_path)
    )
    return df

df = read_csv_spark(spark, "/path/to/file.csv")
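Putting the pieces together for the question's TERR.txt file, here is a minimal end-to-end sketch. It assumes TERR.txt is a delimited file with a header row and column names matching the query (id, dscrptn, nm, pstn_type, parnt_terr, start_dt, type); the only import needed is pyspark.sql.SparkSession. Since a temp view cannot be named outbound.terr, the DataFrame is registered as terr and the FROM clause joins the view with itself:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("terr_query").getOrCreate()

# Assumption: comma-delimited with a header row; change sep/header to match TERR.txt.
terr_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .csv("TERR.txt")
)

# Register the DataFrame so it can be referenced from SQL.
terr_df.createOrReplaceTempView("terr")

query = """
select a.id as nmitory_id, a.dscrptn as nmitory_desc, a.nm as terr_nm, a.pstn_type,
       a.parnt_terr as parnt_nm_id, b.nm as parnt_terr_nm, a.start_dt, a.type,
       CASE
         WHEN substr(a.nm, 1, 6) IN ('105-30', '105-31', '105-32') THEN 'JBI'
         -- paste the remaining WHEN branches of both CASE expressions from the query above
         ELSE 'Unspecified'
       END AS DEPARTMENT
from terr a left outer join terr b on a.parnt_terr = b.id
"""

result = spark.sql(query)
result.show(20, truncate=False)
createOrReplaceTempView is used here instead of createTempView so the script can be re-run without an "already exists" error; the view only lives for the lifetime of the SparkSession.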
Let's look at the following piece of code:
def foo(s1: Set[Int], s2: Set[Int], s3: Set[Int]): Set[Set[Int]] = {
for {
ss1 <- s1
ss2 <- s2
ss3 <- s3
} yield Set(ss1, ss2, ss3)
}
How can an analogous function be defined for def foo(ss: Set[Int]*)?
It's almost the same as the usual cartesian product, except that you have to cram all the results into sets instead of collecting them in ordered tuples:
/** Forms cartesian product of sets,
* then collapses each resulting tuple into a set.
*/
def collapsedCartesian[A](sets: Set[A]*): Set[Set[A]] = sets match
  case Seq() => Set(Set.empty)
  case Seq(h, t @ _*) => for a <- h; b <- collapsedCartesian(t: _*) yield (b + a)
Note that here, the + adds an element to a set: set + elem, which is an oddly asymmetric operation to be denoted by such a symmetric symbol.
The outcome seems reasonably irregular:
collapsedCartesian(Set(1, 2), Set(3, 4)).foreach(println)
println("---")
collapsedCartesian(Set(1, 2), Set(1, 2)).foreach(println)
println("---")
collapsedCartesian(Set(1, 2, 3), Set(4, 5), Set(6, 7)).foreach(println)
println("---")
collapsedCartesian(Set(1, 2, 3), Set(2, 3, 4), Set(4, 5)).foreach(println)
gives:
Set(3, 1)
Set(4, 1)
Set(3, 2)
Set(4, 2)
---
Set(1)
Set(2, 1)
Set(2)
---
Set(7, 5, 1)
Set(6, 4, 2)
Set(6, 4, 1)
Set(7, 4, 1)
Set(6, 5, 1)
Set(7, 5, 3)
Set(7, 4, 2)
Set(6, 5, 2)
Set(6, 4, 3)
Set(7, 5, 2)
Set(7, 4, 3)
Set(6, 5, 3)
---
Set(5, 3, 1)
Set(5, 4, 2)
Set(5, 4, 1)
Set(4, 2)
Set(4, 1)
Set(5, 3)
Set(5, 3, 2)
Set(5, 4, 3)
Set(4, 2, 1)
Set(5, 2, 1)
Set(4, 3, 1)
Set(4, 3, 2)
Set(5, 2)
Set(4, 3)
Please don't ask how to do it in Spark, this exponentially exploding stuff is obviously useless for any dataset with more than just a couple of entries.
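For comparison, the same collapsing product can be written without explicit recursion as a fold over the argument sets; this is just a sketch of an alternative, not part of the original answer:
// Non-recursive variant (sketch): start from the set containing the empty set
// and extend every partial result with each element of the next argument set.
def collapsedCartesianFold[A](sets: Set[A]*): Set[Set[A]] =
  sets.foldLeft(Set(Set.empty[A])) { (acc, s) =>
    for (partial <- acc; a <- s) yield partial + a
  }
collapsedCartesianFold(Set(1, 2), Set(3, 4)) produces the same four two-element sets as above.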
In Scala, what would be the right way of selecting elements of a list based on the position of two elements? Suppose I have the list below and I would like to select all the elements between 2 and 7, including them (note: not greater than/smaller than, but the elements that come after 2 and before 7 in the list):
scala> val l = List(1, 14, 2, 17, 35, 9, 12, 7, 9, 40)
l: List[Int] = List(1, 14, 2, 17, 35, 9, 12, 7, 9, 40)
scala> def someMethod(l: List[Int], from: Int, to: Int) : List[Int] = {
| // some code here
| }
someMethod: (l: List[Int], from: Int, to: Int)List[Int]
scala> someMethod(l, 2, 7)
res0: List[Int] = List(2, 17, 35, 9, 12, 7)
Expected output:
For lists that don't contain 2 and/or 7: an empty list
Input: (1, 2, 2, 2, 3, 4, 7, 8); Output: (2, 2, 2, 3, 4, 7)
Input: (1, 2, 3, 4, 7, 7, 7, 8); Output: (2, 3, 4, 7)
Input: (1, 2, 3, 4, 7, 1, 2, 3, 5, 7, 8); Output: ((2, 3, 4, 7), (2, 3, 5, 7))
Too bad that regex engines work only with strings, not with general lists - it would be really nice if you could find all matches for something like L.*?R with two arbitrary delimiters L and R. Since regex doesn't apply here, you have to build a little automaton yourself. Here is one way to do it:
@annotation.tailrec
def findDelimitedSlices[A](
  xs: List[A],
  l: A,
  r: A,
  revAcc: List[List[A]] = Nil
): List[List[A]] = {
  xs match {
    case h :: t => if (h == l) {
      val idx = xs.indexOf(r)
      if (idx >= 0) {
        val (s, rest) = xs.splitAt(idx + 1)
        findDelimitedSlices(rest, l, r, s :: revAcc)
      } else {
        revAcc.reverse
      }
    } else {
      findDelimitedSlices(t, l, r, revAcc)
    }
    case Nil => revAcc.reverse
  }
}
Input:
for (example <- List(
  List(1, 2, 2, 2, 3, 4, 7, 8),
  List(1, 2, 3, 4, 7, 7, 7, 8),
  List(1, 2, 3, 4, 7, 1, 2, 3, 5, 7, 8)
)) {
  println(findDelimitedSlices(example, 2, 7))
}
Output:
List(List(2, 2, 2, 3, 4, 7))
List(List(2, 3, 4, 7))
List(List(2, 3, 4, 7), List(2, 3, 5, 7))
You're looking for slice:
@ l.slice(2, 7)
res1: List[Int] = List(2, 17, 35, 9, 12)
@ l.slice(2, 8)
res2: List[Int] = List(2, 17, 35, 9, 12, 7)
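Note that slice works on indices, not on element values, so it matches the expected output above only because the value 2 happens to sit at index 2. If you want to slice between the first occurrences of the two values themselves, a small helper along these lines (my own sketch, not from the answers above) would do it:
// Hypothetical helper: the first slice between `from` and `to` by value, inclusive.
def sliceBetween[A](l: List[A], from: A, to: A): List[A] = {
  val i = l.indexOf(from)
  val j = if (i >= 0) l.indexOf(to, i + 1) else -1
  if (i >= 0 && j >= 0) l.slice(i, j + 1) else Nil
}

sliceBetween(List(1, 14, 2, 17, 35, 9, 12, 7, 9, 40), 2, 7)
// List(2, 17, 35, 9, 12, 7)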
I'm trying to get List(0,1,2,...n)
Is there a cleaner/better way than:
scala> List(0 to 9)
res0: List[scala.collection.immutable.Range.Inclusive] = List(Range(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
scala> List(0 to 9).flatten
res1: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
The best way might be:
(0 to 9).toList
scala> List.range(0, 10)
res0: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Also
List(0 to 9: _*)
I suspect though that List.range is the most efficient one.
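All three expressions build the same list; a quick sanity check (just a sketch):
assert((0 to 9).toList == List.range(0, 10))
assert(List.range(0, 10) == List(0 to 9: _*))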
Imagine a function combineSequences: (seqs: Set[Seq[Int]])Set[Seq[Int]] that combines sequences when the last item of first sequence matches the first item of the second sequence. For example, if you have the following sequences:
(1, 2)
(2, 3)
(5, 6, 7, 8)
(8, 9, 10)
(3, 4, 10)
The result of combineSequences would be:
(5, 6, 7, 8, 8, 9, 10)
(1, 2, 2, 3, 3, 4, 10)
Because sequences 1, 2, and 5 combine together. If multiple sequences could combine to give different results, the decision is arbitrary. For example, if we have the sequences:
(1, 2)
(2, 3)
(2, 4)
There are two correct answers. Either:
(1, 2, 2, 3)
(2, 4)
Or:
(1, 2, 2, 4)
(2, 3)
I can only think of a very imperative and fairly opaque implementation. I'm wondering if anyone has a solution that would be more idiomatic scala. I've run into related problems a few times now.
Certainly not the most optimized solution, but I've gone for readability.
def combineSequences[T](seqs: Set[Seq[T]]): Set[Seq[T]] = {
  if (seqs.isEmpty) seqs
  else {
    val (seq1, otherSeqs) = (seqs.head, seqs.tail)
    // Look for a sequence that starts where seq1 ends ...
    otherSeqs.find(_.headOption == seq1.lastOption) match {
      case Some(seq2) => combineSequences(otherSeqs - seq2 + (seq1 ++ seq2))
      case None =>
        // ... or one that ends where seq1 starts; otherwise seq1 is finished.
        otherSeqs.find(_.lastOption == seq1.headOption) match {
          case Some(seq2) => combineSequences(otherSeqs - seq2 + (seq2 ++ seq1))
          case None => combineSequences(otherSeqs) + seq1
        }
    }
  }
}
REPL test:
scala> val seqs = Set(Seq(1, 2), Seq(2, 3), Seq(5, 6, 7, 8), Seq(8, 9, 10), Seq(3, 4, 10))
seqs: scala.collection.immutable.Set[Seq[Int]] = Set(List(1, 2), List(2, 3), List(8, 9, 10), List(5, 6, 7, 8), List(3, 4, 10))
scala> combineSequences( seqs )
res10: Set[Seq[Int]] = Set(List(1, 2, 2, 3, 3, 4, 10), List(5, 6, 7, 8, 8, 9, 10))
scala> val seqs = Set(Seq(1, 2), Seq(2, 3, 100), Seq(5, 6, 7, 8), Seq(8, 9, 10), Seq(100, 4, 10))
seqs: scala.collection.immutable.Set[Seq[Int]] = Set(List(100, 4, 10), List(1, 2), List(8, 9, 10), List(2, 3, 100), List(5, 6, 7, 8))
scala> combineSequences( seqs )
res11: Set[Seq[Int]] = Set(List(5, 6, 7, 8, 8, 9, 10), List(1, 2, 2, 3, 100, 100, 4, 10))
I have a problem in Maple.
If I have a matrix:
Matrix1 := Matrix(2, 2, {(1, 1) = 31, (1, 2) = -80, (2, 1) = -50, (2, 2) = 43});
I want to decide if it is in the below list:
MatrixList := [Matrix(2, 2, {(1, 1) = 31, (1, 2) = -80, (2, 1) = -50, (2, 2) = 43}), Matrix(2, 2, {(1, 1) = -61, (1, 2) = 77, (2, 1) = -48, (2, 2) = 9})];
I did the following:
evalb(Matrix1 in MatrixList);
but got "false".
Why? And how do I write a program that decides whether a matrix is
contained in a list of matrices?
The in test returns false because Matrices are mutable objects that evalb compares by address (identity) rather than entry by entry. Here's a much cheaper way than DrC's, using LinearAlgebra:-Equal for the entrywise comparison:
ormap(LinearAlgebra:-Equal, MatrixList, Matrix1)