PySpark Dataframe from Python Dictionary without Pandas - pyspark

I am trying to convert the following Python dict into PySpark DataFrame but I am not getting expected output.
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
df_dict = sc.parallelize([dict_lst]).toDF() # Result not as expected
df_dict.show()
Is there a way to do this without using Pandas?

Quoting myself:
I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
So the easiest thing is to convert your dictionary into this format. You can easily do this using zip():
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#| a| 10|
#| b| 20|
#| c| 30|
#+-------+-------+
The above assumes that all of the lists are the same length. If this is not the case, you would have to use itertools.izip_longest (python2) or itertools.zip_longest (python3).
from itertools import izip_longest as zip_longest # use this for python2
#from itertools import zip_longest # use this for python3
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30, 40]}
column_names, data = zip(*dict_lst.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
#+-------+-------+
#|letters|numbers|
#+-------+-------+
#| a| 10|
#| b| 20|
#| c| 30|
#| null| 40|
#+-------+-------+

Your dict_lst is not really the format you want to adopt to create a dataframe. It would be better if you had a list of dict instead of a dict of list.
This code creates a DataFrame from you dict of list :
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
dict_lst = {'letters': ['a', 'b', 'c'],
'numbers': [10, 20, 30]}
values_lst = dict_lst.values()
nb_rows = [len(lst) for lst in values_lst]
assert min(nb_rows)==max(nb_rows) #We must have the same nb of elem for each key
row_lst = []
columns = dict_lst.keys()
for i in range(nb_rows[0]):
row_values = [lst[i] for lst in values_lst]
row_dict = {column: value for column, value in zip(columns, row_values)}
row = Row(**row_dict)
row_lst.append(row)
df = sqlContext.createDataFrame(row_lst)

Using pault's answer above I imposed a specific schema on my dataframe as follows:
import pyspark
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dictToDF').getOrCreate()
get data:
dict_lst = {'letters': ['a', 'b', 'c'],'numbers': [10, 20, 30]}
data = dict_lst.values()
create schema:
from pyspark.sql.types import *
myschema= StructType([ StructField("letters", StringType(), True)\
,StructField("numbers", IntegerType(), True)\
])
create df from dictionary - with schema:
df=spark.createDataFrame(zip(*data), schema = myschema)
df.show()
+-------+-------+
|letters|numbers|
+-------+-------+
| a| 10|
| b| 20|
| c| 30|
+-------+-------+
show df schema:
df.printSchema()
root
|-- letters: string (nullable = true)
|-- numbers: integer (nullable = true)

You can also use a Python List to quickly prototype a DataFrame. The idea is based from Databricks's tutorial.
df = spark.createDataFrame(
[(1, "a"),
(1, "a"),
(1, "b")],
("id", "value"))
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 1| a|
| 1| b|
+---+-----+

Try this out :
dict_lst = [{'letters': 'a', 'numbers': 10},
{'letters': 'b', 'numbers': 20},
{'letters': 'c', 'numbers': 30}]
df_dict = sc.parallelize(dict_lst).toDF() # Result as expected
Output:
>>> df_dict.show()
+-------+-------+
|letters|numbers|
+-------+-------+
| a| 10|
| b| 20|
| c| 30|
+-------+-------+

The most efficient approach is to use Pandas
import pandas as pd
spark.createDataFrame(pd.DataFrame(dict_lst))

Related

Filter df by date using pyspark

everyone!!
I have tried to filter a dataset in pyspark. I had to filter the column date (date type) and I have written this code, but there is somwthing wrong: the dataset is empty. Someone could tell me how to fix it?
df = df.filter((F.col("date") > "2018-12-12") & (F.col("date") < "2019-12-12"))
Tanks
You need first to make sure date column is in date format then use lit for your filter:
df exemple:
df = spark.createDataFrame(
[
('20/12/2018', '1', 50),
('18/01/2021', '2', 23),
('31/01/2022', '3', -10)
], ['date', 'id', 'value']
)
df.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|20/12/2018| 1| 50|
|18/01/2021| 2| 23|
|31/01/2022| 3| -10|
+----------+---+-----+
from pyspark.sql import functions as F
df\
.withColumn('date', F.to_date('date', 'd/M/y'))\
.filter((F.col('date') > F.lit('2018-12-12')) & (F.col("date") < F.lit('2019-12-12')))\
.show()
+----------+---+-----+
| date| id|value|
+----------+---+-----+
|2018-12-20| 1| 50|
+----------+---+-----+

How to creat a pyspark DataFrame inside of a loop?

How to creat a pyspark DataFrame inside of a loop? In this loop in each iterate I am printing 2 values print(a1,a2). now I want to store all these value in a pyspark dataframe.
Initially, before the loop, you could create an empty dataframe with your preferred schema. Then, create a new df for each loop with the same schema and union it with your original dataframe. Refer the code below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.getOrCreate()
schema = StructType([
StructField('a1', StringType(), True),
StructField('a2', StringType(), True)
])
df = spark.createDataFrame([],schema)
for i in range(1,5):
a1 = i
a2 = i+1
newRow = spark.createDataFrame([(a1,a2)], schema)
df = df.union(newRow)
print(df.show())
This gives me the below result where the values are appended to the df in each loop.
+---+---+
| a1| a2|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+

show all the matched string in pyspark dataframe

I wanted to show all the filtered results of similar matched string.
codes:
# Since most of the stackoverflow questionaire-s and also answerer-s are all super SMART and leave out all the necessary import libraries and required codes before using pyspark so that the readers can crack their minds in researching more instead of getting direct answer, I share the codes from beginning as below in order for future reviewers.
# Import libraries
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
import numpy as np
# Initiate the session
spark = SparkSession\
.builder\
.appName('Operations')\
.getOrCreate()
# sc = SparkContext()
sc =SparkContext.getOrCreate()
# Create dataframe 1
sdataframe_temp = spark.createDataFrame([
(1,2,'3'),
(2,2,'yes')],
['a', 'b', 'c']
)
# Create Dataframe 2
sdataframe_temp2 = spark.createDataFrame([
(4,6,'yes'),
(5,7,'yes')],
['a', 'b', 'c']
)
# Combine dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
# Filter out the columns based on respective rules
sdataframe_temp\
.filter(sdataframe_union_1_2['c'] == 'yes')\
.select(['a', 'b'])\ # I wish to stick with dataframe method if possible.
.show()
Output:
+---+---+
| a| b|
+---+---+
| 2| 2|
+---+---+
Expected output:
+---+---+
| a| b|
+---+---+
| 2| 2|
+---+---+
| 4| 6|
+---+---+
| 5| 7|
+---+---+
Can anyone please give some suggestions or improvement?
Here's a way using unionByName:
df = (sdataframe_temp1
.unionByName(sdataframe_temp2)
.where("c == 'yes'")
.drop('c'))
df.show()
+---+---+
| a| b|
+---+---+
| 2| 2|
| 4| 6|
| 5| 7|
+---+---+
you should change last line of code. for col function you should import from pyspark.sql.functions
from pyspark.sql.functions import *
sdataframe_union_1_2\
.filter(col('c') == 'yes')\
.select(['a', 'b'])\ # I wish to stick with dataframe method if possible.
.show()
or
sdataframe_union_1_2\
.filter(sdataframe_union_1_2['c'] == 'yes')\
.select(['a', 'b'])\ # I wish to stick with dataframe method if possible.
.show()
you have to select data from sdataframe_union_1_2 and you are selecting from sdataframe_temp that's why you are getting one record.

Spark: Joining with array

I need to join a dataframe with a string column to one with array of string so that if one of the values in the array is matched, the rows will join.
I tried this but I guess it's not support.
Any other way to do this?
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("test")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
left.join(right,"col1")
Throws:
org.apache.spark.sql.AnalysisException: cannot resolve '(col1
=col1)' due to data
type mismatch: differing types in '(col1 =
col1)' (int and array).;;
One option is to create an UDF for building your join condition:
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"),(Array(3),"No"))).toDF("col1", "col2")
val checkValue = udf {
(array: WrappedArray[Int], value: Int) => array.contains(value)
}
val result = left.join(right, checkValue(right("col1"), left("col1")), "inner")
result.show
+----+------+----+
|col1| col1|col2|
+----+------+----+
| 1|[1, 2]| Yes|
| 2|[1, 2]| Yes|
| 3| [3]| No|
+----+------+----+
The most succinct way to do this is to use the array_contains spark sql expression as shown below, that said I've compared the performance of this with the performance of doing an explode and join as shown in a previous answer and the explode seems more performant.
import org.apache.spark.sql.functions.expr
import spark.implicits._
val left = Seq(1, 2, 3).toDF("col1")
val right = Seq((Array(1, 2), "Yes"),(Array(3),"No")).toDF("col1", "col2").withColumnRenamed("col1", "col1_array")
val joined = left.join(right, expr("array_contains(col1_array, col1)")).show
+----+----------+----+
|col1|col1_array|col2|
+----+----------+----+
| 1| [1, 2]| Yes|
| 2| [1, 2]| Yes|
| 3| [3]| No|
+----+----------+----+
Note you can't use the org.apache.spark.sql.functions.array_contains function directly as it requires the second argument to be a literal as opposed to a column expression.
You could use explode on you Array column before the join. Explode creates a new line for each element in the array :
right = right.withColumn("exploded_col",explode(right("col1")))
right.show()
+------+----+--------------+
| col1|col2|exploded_col_1|
+------+----+--------------+
|[1, 2]| Yes| 1|
|[1, 2]| Yes| 2|
| [3]| No| 3|
+------+----+--------------+
Then you can easily join with your first dataset.

How to aggregate a Spark data frame to get a sparse vector using Scala?

I have a data frame like the one below in Spark, and I want to group it by the id column and then for each line in the grouped data I need to create a sparse vector with elements from the weight column at indices specified by the index column. The length of the sparse vector is known, say 1000 for this example.
Dataframe df:
+-----+------+-----+
| id|weight|index|
+-----+------+-----+
|11830| 1| 8|
|11113| 1| 3|
| 1081| 1| 3|
| 2654| 1| 3|
|10633| 1| 3|
|11830| 1| 28|
|11351| 1| 12|
| 2737| 1| 26|
|11113| 3| 2|
| 6590| 1| 2|
+-----+------+-----+
I have read this which is sort of similar of what I want to do, but for a rdd. Does anyone know of a good way to do this for a data frame in Spark using Scala?
My attempt so far is to first collect the weights and indices as lists like this:
val dfWithLists = df
.groupBy("id")
.agg(collect_list("weight") as "weights", collect_list("index") as "indices"))
which looks like:
+-----+---------+----------+
| id| weights| indices|
+-----+---------+----------+
|11830| [1, 1]| [8, 28]|
|11113| [1, 3]| [3, 2]|
| 1081| [1]| [3]|
| 2654| [1]| [3]|
|10633| [1]| [3]|
|11351| [1]| [12]|
| 2737| [1]| [26]|
| 6590| [1]| [2]|
+-----+---------+----------+
Then I define a udf and do something like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
def toSparseVector: ((Array[Int], Array[BigInt]) => Vector) = {(a1, a2) => Vectors.sparse(1000, a1, a2.map(x => x.toDouble))}
val udfToSparseVector = udf(toSparseVector)
val dfWithSparseVector = dfWithLists.withColumn("SparseVector", udfToSparseVector($"indices", $"weights"))
but this doesn't seem to work, and it feels like there should be an easier way to do it without needing to collecting the weights and indices to lists first.
I'm pretty new to Spark, Dataframes and Scala, so any help is highly appreciated.
You have to collect them as vectors must be local, single machine: https://spark.apache.org/docs/latest/mllib-data-types.html#local-vector
For creating the sparse vectors you have 2 options, using unordered (index, value) pairs or specifying the indices and values arrays:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$
If you can get the data into a different format (pivoted), you could also make use of the VectorAssembler:
https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
With some small tweaks you can get your approach working:
:paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val df = Seq((11830,1,8), (11113, 1, 3), (1081, 1,3), (2654, 1, 3), (10633, 1, 3), (11830, 1, 28), (11351, 1, 12), (2737, 1, 26), (11113, 3, 2), (6590, 1, 2)).toDF("id", "weight", "index")
val dfWithFeat = df
.rdd
.map(r => (r.getInt(0), (r.getInt(2), r.getInt(1).toDouble)))
.groupByKey()
.map(r => LabeledPoint(r._1, Vectors.sparse(1000, r._2.toSeq)))
.toDS
dfWithFeat.printSchema
dfWithFeat.show(10, false)
// Exiting paste mode, now interpreting.
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
+-------+-----------------------+
|label |features |
+-------+-----------------------+
|11113.0|(1000,[2,3],[3.0,1.0]) |
|2737.0 |(1000,[26],[1.0]) |
|10633.0|(1000,[3],[1.0]) |
|1081.0 |(1000,[3],[1.0]) |
|6590.0 |(1000,[2],[1.0]) |
|11830.0|(1000,[8,28],[1.0,1.0])|
|2654.0 |(1000,[3],[1.0]) |
|11351.0|(1000,[12],[1.0]) |
+-------+-----------------------+
dfWithFeat: org.apache.spark.sql.Dataset[org.apache.spark.mllib.regression.LabeledPoint] = [label: double, features: vector]