Description: How can I convert a dictionary dataset to a DataFrame in PySpark?
Error / Not expected result: I tried
df = spark.createDataFrame([Row(**i) for i in bounds])
but get:
TypeError: Can not infer schema for type: <class 'str'>
This code:
rdd = sc.parallelize(bounds)
rdd.map(lambda x: (x,)).toDF().show()
and others give unexpected results.
Expected result:
My DataSet:
Your input to createDataFrame() has an incorrect format. It should look like this:
[("price", {"q1":1, "q3": 3, "upper": 10, "lower":2} ),
("carAge", {"q1":1, "q3": 3, "upper": 11, "lower":1})]
This is a list of tuples (a list of lists would also work), where each tuple has two elements: the first is a string and the second is a dictionary. Each tuple holds all the data for one row of the future Spark dataframe, and the two elements per tuple mean there will be 2 columns in the dataframe you'll create.
To bring your dictionary data to the format above, use this line of code:
[(x, dct[x]) for x in dct.keys()]
where dct is your original dictionary, as shown in the My DataSet image.
Then you can create the Spark dataframe as follows:
df = (spark.createDataFrame([(x, dct[x]) for x in dct.keys()],
                            schema=["Columns", "dct_col"]))
This dataframe will have only two columns. The second column, "dct_col", is the dictionary (map) column, and you can extract "q1", "q3", and the other fields into their own columns as follows:
df_expected_result = (df
    .withColumn("q1", df.dct_col["q1"])
    .withColumn("q3", df.dct_col["q3"])
    .withColumn("lower", df.dct_col["lower"])
    .withColumn("upper", df.dct_col["upper"]))
Related
I am reading the data from a Store table in Snowflake. I want to pass the date from the dataframe maxdatefromtbl to my query in Spark SQL to filter records.
This condition (s"CREATED_DATE!='$maxdatefromtbl'") is not working as expected.
var retail = spark.read.format("snowflake").options(options).option("query","Select MAX(CREATED_DATE) as CREATED_DATE from RSTORE").load()
val maxdatefromtbl = retail.select("CREATED_DATE").toString
var retailnew = spark.read.format("snowflake").options(options).option("query","Select * from RSTORE").load()
var finaldataresult = retailnew.filter(s"CREATED_DATE!='$maxdatefromtbl'")
Select a single value from the retail dataframe to use in the filter.
val maxdatefromtbl = retail.select("CREATED_DATE").collect().head.getString(0)
var finaldataresult = retailnew.filter(col("CREATED_DATE") =!= maxdatefromtbl)
The type of retail.select("CREATED_DATE") is DataFrame, and DataFrame.toString returns the schema rather than the value of the single row you have. Please see the following example from a Spark shell.
scala> val s = Seq(1, 2, 3).toDF()
scala> s.select("value").toString
res0: String = [value: int]
In the first line of the solution snippet above, collect() wraps the dataframe (which has a single row in your case) in an array; head takes the first element of that array, and getString(0) gets the value of the cell at index 0 as a String. Please see the DataFrame and Row documentation pages for more information.
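For reference, the same pattern in PySpark would look roughly like this (a sketch, assuming retail and retailnew are loaded the same way as above):
from pyspark.sql import functions as F

# collect() returns a list of Row objects; take the first Row, then its first field
maxdatefromtbl = retail.select("CREATED_DATE").collect()[0][0]

finaldataresult = retailnew.filter(F.col("CREATED_DATE") != maxdatefromtbl)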
So, I have n strings that I can keep either in an array or in a list, like this:
val checks = Array("check1", "check2", "check3", "check4", "check5")
val checks: List[String] = List("check1", "check2", "check3", "check4", "check5")
Now, I have a Spark dataframe df and I want to add a column with the values present in this List/Array. (It is guaranteed that the number of items in my List/Array will be exactly equal to the number of rows in the dataframe, i.e. n.)
I tried doing:
df.withColumn("Value", checks)
But that didn't work. What would be the best way to achieve this?
You need to add it as an array column as follows:
val df2 = df.withColumn("Value", array(checks.map(lit):_*))
If you want a single value for each row, you can get the array element:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, expr, lit, row_number}

val df2 = df.withColumn("Value", array(checks.map(lit): _*))
  .withColumn("rn", row_number().over(Window.orderBy(lit(1))) - 1)
  .withColumn("Value", expr("Value[rn]"))
  .drop("rn")
I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside the agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the values of each row as input to the agg, but I am not able to access them.
Any idea how to approach this?
If you can change the strings in the sequences column to be valid SQL expressions, then this is possible to solve. Spark provides a function expr that takes a SQL string and converts it into a Column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
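A rough PySpark counterpart of the same idea, assuming df2 holds the SQL strings in its sequences column and keys is the list of grouping column names:
from pyspark.sql import functions as F

# collect the SQL strings and turn each one into a Column with expr
seqs = [F.expr(r["sequences"]) for r in df2.select("sequences").collect()]

result = df.groupBy(*keys).agg(*seqs)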
I have a PySpark dataframe with 100 cols:
df1 = [(col1, string), (col2, double), (col3, bigint), ... and so on]
I have another PySpark dataframe df2 with the same column count and column names but different datatypes.
df2 = [(col1, bigint), (col2, double), (col3, string), ... and so on]
How do I make the datatypes of all the cols in df2 the same as the ones present in df1 for their respective cols?
It should happen iteratively, and if the datatypes already match, the column should not be changed.
If, as you said, the column names and column count match, then you can simply loop over the schema of df1 and cast the columns to the dataTypes of df1:
from pyspark.sql import functions as F

df2 = df2.select([F.col(c.name).cast(c.dataType) for c in df1.schema])
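A quick self-contained sketch of this approach, using two hypothetical tiny dataframes in place of your 100-column ones:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-ins: same column names, different types
df1 = spark.createDataFrame([("a", 1.0, 2)], ["col1", "col2", "col3"])   # string, double, bigint
df2 = spark.createDataFrame([(1, 2.0, "7")], ["col1", "col2", "col3"])   # bigint, double, string

# cast every df2 column to the type declared in df1's schema
df2 = df2.select([F.col(c.name).cast(c.dataType) for c in df1.schema])

df2.printSchema()   # col1 is now string and col3 is now bigint; col2 is unchanged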
You can use the cast function:
from pyspark.sql import functions as f

# get the (column name, dtype) pairs for each DF
df1_schema = df1.dtypes
df2_schema = df2.dtypes

# iterate through the columns and cast the ones whose types differ
for (c1, d1), (c2, d2) in zip(df1_schema, df2_schema):
    # if the datatypes already match, leave the column alone; otherwise cast to df1's type
    if d1 != d2:
        df2 = df2.withColumn(c2, f.col(c2).cast(d1))
Say I have a dataframe that resulted from a sequence of transformations. It looks like the following:
id matrixRow
0 [1,2,3]
1 [4,5,6]
2 [7,8,9]
Each row actually corresponds to a row of a matrix.
How can I convert the matrixRow column of the dataframe to RowMatrix?
After numerous tries, here's one solution:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rdd = df.rdd.map(
  row => Vectors.dense(row.getAs[Seq[Double]](1).toArray) // get the second column value as Seq[Double], convert it to an Array, then wrap it in a dense Vector
)
val rowMatrix = new RowMatrix(rdd)
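For completeness, a rough PySpark counterpart using the RDD-based mllib API (a sketch, assuming the array column holds doubles and sits at index 1 as above):
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

# wrap the array in each row's second column in a dense mllib vector
rdd = df.rdd.map(lambda row: Vectors.dense(row[1]))
mat = RowMatrix(rdd)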