Run SQL query on a DataFrame via PySpark

I would like to run a SQL query on a DataFrame, but do I have to create a view on the DataFrame first?
Is there an easier way?
df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')
I want to run some complex queries against this DataFrame. For example, I can do
df.createOrReplaceTempView("temp_view")
df_new = spark.sql("select id, max(foo) from temp_view group by id")
but do I have to register it as a view before querying it?
I know there is an equivalent DataFrame operation; the query above is only an example.

You can just do
df.select('id', 'foo')
This will return a new Spark DataFrame with columns id and foo.
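The group-by query from the question can also be expressed directly with the DataFrame API, without registering a temporary view. A minimal sketch, assuming the df defined above:
from pyspark.sql import functions as F

# DataFrame equivalent of "select id, max(foo) from temp_view group by id"
df_new = df.groupBy('id').agg(F.max('foo').alias('max_foo'))
df_new.show()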


Compare 2 pyspark dataframe columns and change values of another column based on it

I have a DataFrame generated by a graph algorithm I wrote. The problem is that I want the underlying component IDs to stay essentially the same after every run of the graph code.
This is a sample dataframe generated:
df = spark.createDataFrame(
[
(1, 'A1'),
(1, 'A2'),
(1, 'A3'),
(2, 'B1'),
(2, 'B2'),
(3, 'B3'),
(4, 'C1'),
(4, 'C2'),
(4, 'C3'),
(4, 'C4'),
(5, 'D1'),
],
['old_comp_id', 'db_id']
)
After another run the values change completely, so the new run has values like these,
df2 = spark.createDataFrame(
[
(2, 'A1'),
(2, 'A2'),
(2, 'A3'),
(3, 'B1'),
(3, 'B2'),
(3, 'B3'),
(1, 'C1'),
(1, 'C2'),
(1, 'C3'),
(1, 'C4'),
(4, 'D1'),
],
['new_comp_id', 'db_id']
)
So I need to compare the values between the two DataFrames above and change the component ID based on the associated database ID:
if the db_id values are the same, keep the component ID from the first DataFrame;
if they are different, assign a completely new comp_id (new_comp_id = max(old_comp_id) + 1).
This is what I have come up with so far:
old_ids = df.groupBy("old_comp_id").agg(F.collect_set(F.col("db_id")).alias("old_db_id"))
new_ids = df2.groupBy("new_comp_id").agg(F.collect_set(F.col("db_id")).alias("new_db_id"))
joined = new_ids.join(old_ids,old_ids.old_comp_id == new_ids.new_comp_id,"outer")
joined.withColumn("update_comp", F.when( F.col("new_db_id") == F.col("old_db_id"), F.col('old_comp_id')).otherwise(F.max(F.col("old_comp_id")+1))).show()
To use aggregate functions on non-aggregated columns, you should use window functions.
First, rename df2's db_id column so it does not collide, then outer-join the DataFrames on the database id:
from pyspark.sql.functions import when, col, max

df2 = df2.withColumnRenamed("db_id", "new_db_id")
joinedDF = df.join(df2, df["db_id"] == df2["new_db_id"], "outer")
Then start building the window spec: partition by db_id and order by old_comp_id descending, so that the rows with the highest old_comp_id come first.
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
.partitionBy("db_id")\
.orderBy(desc("old_comp_id"))\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
Then, you build the max column using the windowSpec
from pyspark.sql.functions import max
maxCompId = max(col("old_comp_id")).over(windowSpec)
Then, apply it in the select:
joinedDF.select(col("db_id"), when(col("new_db_id").isNotNull(), col("old_comp_id")).otherwise(maxCompId+1).alias("updated_comp")).show()
For more information, please refer to the documentation (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window)
Hope this helps

How to join two spark RDD

I have 2 Spark RDDs. The first contains a mapping between indices and ids (which are strings), and the second contains tuples of related indices:
val ids = spark.sparkContext.parallelize(Array[(Int, String)](
(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))).toDF("index", "idx")
val relationships = spark.sparkContext.parallelize(Array[(Int, Int)](
(1, 3), (2, 3), (4, 5))).toDF("index1", "index2")
I want to join these RDDs somehow (or merge, or use SQL, or whatever the best Spark practice is) so that I end up with the related ids instead.
The result of the combined RDD should be:
("a", "c"), ("b", "c"), ("d", "e")
Any idea how I can achieve this in an optimal way, without loading either RDD into an in-memory map (in my scenario these RDDs can contain millions of records)?
You can approach this by creating two views from the DataFrames as follows:
relationships.createOrReplaceTempView("relationships");
ids.createOrReplaceTempView("ids");
Next, run the following SQL query, which performs an inner join between the relationships and ids views to generate the required result:
import sqlContext.sql;
val result = spark.sql("""select t.index1, id.idx from
(select id.idx as index1, rel.index2
from relationships rel
inner join
ids id on rel.index1=id.index) t
inner join
ids id
on id.index=t.index2
""");
result.show()
Another approach uses the DataFrame API directly, without creating views:
relationships.as("rel").
join(ids.as("ids"), $"ids.index" === $"rel.index1").as("temp").
join(ids.as("ids"), $"temp.index2"===$"ids.index").
select($"temp.idx".as("index1"), $"ids.idx".as("index2")).show

pyspark feed one RDD to another using the 'in' clause

I have a pyspark RDD (myRDD) whose rows are variable-length lists of IDs, such as
[['a', 'b', 'c'], ['d','f'], ['g', 'h', 'i','j']]
I have a pyspark dataframe (myDF) with columns ID and value.
I want to query myDF with the query:
outputDF = myDF.where(col("ID").isin(id_list)).select(F.collect_set("value").alias("my_values"))
where id_list is an element from the myRDD, such as ['d','f'] or ['a', 'b', 'c'].
An example would be:
outputDF = myDF.where(col("ID").isin(['d', 'f'])).select(F.collect_set("value").alias("my_values"))
What is a parallelizable way to use the RDD to query the DF like this?
Assuming your dataframe column "ID" is of type StringType(), you want to keep the ID values that appear in any of your RDD's rows.
First, let's transform the RDD into a one-column dataframe with a unique ID for each row:
from pyspark.sql import HiveContext
import pyspark.sql.functions as psf

hc = HiveContext(sc)
ID_df = hc.createDataFrame(
    myRDD.map(lambda row: [row]),
    ['ID']
).withColumn("uniqueID", psf.monotonically_increasing_id())
We'll explode it so that each row has only one ID value:
ID_df = ID_df.withColumn('ID', psf.explode(ID_df.ID))
We can now join the original dataframe; the inner join will serve as a filter:
myDF = myDF.join(ID_df, "ID", "inner")
collect_set is an aggregation function, so you need some kind of groupBy before using it, for instance on the newly created row ID:
myDF.groupBy("uniqueID").agg(
psf.collect_set("ID").alias("ID")
)
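Putting the steps together, a minimal end-to-end sketch (assuming a running SparkSession named spark, and the myRDD and myDF from the question):
import pyspark.sql.functions as psf

# one row per ID list, tagged with a unique row id, then exploded to one ID per row
ID_df = (spark.createDataFrame(myRDD.map(lambda row: [row]), ['ID'])
              .withColumn('uniqueID', psf.monotonically_increasing_id())
              .withColumn('ID', psf.explode('ID')))

# the inner join acts as the "in" filter; then aggregate values per original list
outputDF = (myDF.join(ID_df, 'ID', 'inner')
                .groupBy('uniqueID')
                .agg(psf.collect_set('value').alias('my_values')))
outputDF.show()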

How to sort RDD entries using two features simultaneously?

I have a Spark RDD whose entries I want to sort in an organized manner. Say each entry is a tuple with 3 elements (name, phonenumber, timestamp). I want to sort the entries first by phonenumber and then by timestamp, without disturbing the ordering already established by phonenumber (so timestamp only breaks ties within each phonenumber group). Is there a Spark function to do this?
(I am using Spark 2.x with Scala)
To sort an RDD by multiple elements, you can use the sortBy function. Below is some sample code in Python; you can implement it similarly in other languages.
tmp = [('a', 1), ('a', 2), ('1', 3), ('1', 4), ('2', 5)]
sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1]), False).collect()
Regards,
Neeraj
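For the question's exact (name, phonenumber, timestamp) tuples, a minimal PySpark sketch (with hypothetical sample data, assuming a SparkContext named sc as above) sorts by the second element and breaks ties with the third:
records = [('alice', '555-0101', 3), ('bob', '555-0100', 7),
           ('carol', '555-0101', 1), ('dave', '555-0100', 2)]

# sort by phonenumber first, then by timestamp within each phonenumber
sorted_rdd = sc.parallelize(records).sortBy(lambda x: (x[1], x[2]))
print(sorted_rdd.collect())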
You can also use the sortBy function on an RDD, as below:
val df = spark.sparkContext.parallelize(Seq(
("a","1", "2017-03-10"),
("b","12", "2017-03-9"),
("b","123", "2015-03-12"),
("c","1234", "2015-03-15"),
("c","12345", "2015-03-12")
))//.toDF("name", "phonenumber", "timestamp")
df.sortBy(x => (x._1, x._3)).foreach(println)
Output:
(c,1234,2015-03-15)
(c,12345,2015-03-12)
(b,12,2017-03-9)
(b,123,2015-03-12)
(a,1,2017-03-10)
If you have a DataFrame created with toDF("name", "phonenumber", "timestamp"),
then you could simply do
df.sort("name", "timestamp")
Hope this helps!

How to convert a SQL query output (dataframe) into an array list of key value pairs in Spark Scala?

I created a dataframe in the Spark Scala shell for SFPD incidents. I queried the data for counts per Category and the result is a dataframe. I want to plot this data as a graph using Wisp. Here is my dataframe:
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT| 362266|
|OTHER OFFENSES| 257197|
| NON-CRIMINAL| 189857|
| ASSAULT| 157529|
| VEHICLE THEFT| 109733|
| DRUG/NARCOTIC| 108712|
| VANDALISM| 91782|
| WARRANTS| 85837|
| BURGLARY| 75398|
|SUSPICIOUS OCC| 64452|
+--------------+--------+
I want to convert this dataframe into an array/list of key-value pairs. So I want a result like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this dataframe (t) into an RDD with val rddt = t.rdd, and then used flatMapValues:
rddt.flatMapValues(x=>x).collect()
but I still couldn't get the required result.
Or is there a way to directly give the dataframe output into Wisp?
In pyspark it'd be as below. Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Map the test data, reformatting it from an RDD of Rows to an RDD of tuples, then use collect to extract all the tuples as a list:
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
Here's the Scala Spark Row documentation that should help you convert this to Scala Spark code
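Applied to the question's dataframe (assumed here to be bound to a variable t with the columns Category and catcount), the same PySpark pattern yields the requested key-value pairs directly:
# collect the (Category, catcount) pairs as a list of tuples
pairs = t.rdd.map(lambda row: (row['Category'], row['catcount'])).collect()
# [('LARCENY/THEFT', 362266), ('OTHER OFFENSES', 257197), ...]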