What is wrong with my code? I am using PySpark to convert the data type of a column.
company_df=company_df.withColumn("Revenue" ,company_df("Revenue").cast(DoubleType())) \
.withColumn("GROSS_PROFIT",company_df("GROSS_PROFIT").cast(DoubleType())) \
.withColumn("Net_Income" ,company_df("Net_Income").cast(DoubleType())) \
.withColumn("Enterprise_Value" ,company_df("Enterprise_Value").cast(DoubleType())) \
I am getting this error:
AttributeError: 'DataFrame' object has no attribute 'cast'
A short, clean, scalable solution
Change some columns, leave the rest untouched
import pyspark.sql.functions as F
# This is not part of the solution, just the creation of a sample dataframe
# df = spark.createDataFrame([(10, 1,2,3,4),(20, 5,6,7,8)],'Id int, Revenue int ,GROSS_PROFIT int ,Net_Income int ,Enterprise_Value int')
cols_to_cast = ["Revenue" ,"GROSS_PROFIT" ,"Net_Income" ,"Enterprise_Value"]
df = df.select([F.col(c).cast('double') if c in cols_to_cast else c for c in df.columns])
df.printSchema()
root
|-- Id: integer (nullable = true)
|-- Revenue: double (nullable = true)
|-- GROSS_PROFIT: double (nullable = true)
|-- Net_Income: double (nullable = true)
|-- Enterprise_Value: double (nullable = true)
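If you prefer chained withColumn calls, a minimal equivalent sketch (using the same cols_to_cast list assumed above) that overwrites each column in place:
import pyspark.sql.functions as F

cols_to_cast = ["Revenue", "GROSS_PROFIT", "Net_Income", "Enterprise_Value"]
for c in cols_to_cast:
    # cast each listed column to double, keeping its original name
    df = df.withColumn(c, F.col(c).cast("double"))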
If this helps:
from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 0),
(2, 1),
(3 ,1),
(4, 1),
(5, 0),
(6 ,0),
(7, 1),
(8 ,1),
(9 ,1),
(10, 1),
(11, 0),
(12, 0)],
('Time' ,'Tag1'))
df = df.withColumn('a', col('Time').cast('integer')).withColumn('a1', col('Tag1').cast('double'))
df.printSchema()
df.show()
Alternatively to @wwnde's answer, you could do something like below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
company_df = (company_df.withColumn("Revenue_cast" , col("Revenue_cast").cast(DoubleType()))
.withColumn("GROSS_PROFIT_cast", col("GROSS_PROFIT").cast(DoubleType()))
.withColumn("Net_Income_cast" , col("Net_Income").cast(DoubleType()))
.withColumn("Enterprise_Value_cast", col("Enterprise_Value").cast(DoubleType()))
)
Or,
company_df = (company_df.withColumn("Revenue_cast" , company_df["Revenue"].cast(DoubleType()))
.withColumn("GROSS_PROFIT_cast", company_df["GROSS_PROFIT".cast(DoubleType()))
.withColumn("Net_Income_cast" , company_df["Net_Income".cast(DoubleType()))
.withColumn("Enterprise_Value_cast", company_df["Enterprise_Value"].cast(DoubleType()))
)
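For context, cast is a method on Column, not on DataFrame, and company_df("Revenue") is Scala-style syntax; in PySpark you select the column with brackets or col() before casting, which is what both variants above do. A minimal sketch of the direct fix to one line of the original snippet:
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# reference the column (brackets or col()), then cast it
company_df = company_df.withColumn("Revenue", company_df["Revenue"].cast(DoubleType()))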
Related
In order to apply PCA from pyspark.ml.feature, I need to convert an array<float> column (org.apache.spark.sql.types.ArrayType) to org.apache.spark.ml.linalg.VectorUDT.
Say I have the following dataframe:
df = spark.createDataFrame([
('string1',[5.0,4.0,0.5]),
('string2',[2.0,0.76,7.54]),
], schema='a string, b array<float>')
Whereas a = Vectors.dense(df.select('b').head(1)[0][0]) seems to work for one row, I was wondering how I can apply this conversion to all the rows.
You'd have to map it back to an RDD and manually create a Vector using a lambda function:
from pyspark.ml.linalg import Vectors
# df = ... # your df
df2 = df.rdd.map(lambda x: (x['a'], Vectors.dense(x['b']))).toDF(['a', 'b'])
df2.show()
df2.printSchema()
+-------+------------------------------------------+
|a |b |
+-------+------------------------------------------+
|string1|[5.0,4.0,0.5] |
|string2|[2.0,0.7599999904632568,7.539999961853027]|
+-------+------------------------------------------+
root
|-- a: string (nullable = true)
|-- b: vector (nullable = true)
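If you are on Spark 3.1 or later, pyspark.ml.functions.array_to_vector can do the same conversion without dropping to the RDD API; a minimal sketch, assuming that Spark version:
from pyspark.ml.functions import array_to_vector
from pyspark.sql.functions import col

# convert the array<float> column into an ml vector column
df2 = df.withColumn("b", array_to_vector(col("b")))
df2.printSchema()  # b: vector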
Using PySpark 2.4, I am doing a left join of a dataframe on itself.
df = df.alias("t1") \
.join(df.alias("t2"),
col(t1_anc_ref) == col(t2_anc_ref), "left")
The resulting structure of this join is the following:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
I would like to be able to drop the penultimate column of this dataframe (anc_ref_1).
Using the column name is not possible, as there are duplicates. So instead of this, I select the column by index and then try to drop it:
col_to_drop = len(df.columns) - 2
df= df.drop(df[col_to_drop])
However, that gives me the following error:
pyspark.sql.utils.AnalysisException: "Reference 'anc_ref_1' is
ambiguous, could be: t1.anc_ref_1, t2.anc_ref_1.;"
Question:
When I print the schema, there is no mention of t1 and t2 in the column names, yet they are mentioned in the stack trace. Why is that, and can I use them to reference a column?
I tried df.drop("t2.anc_ref_1") but it had no effect (no column dropped)
EDIT: it works with df.drop(col("t2.anc_ref_1"))
How can I handle the duplicate column names? I would like to rename/drop so that the result is:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true) -> dropped
|-- anc_ref_2: string (nullable = true) -> renamed to anc_ref_3
Option 1
Drop the column by referring to the original source dataframe.
Data
df = spark.createDataFrame([('Value1', 'Something'),
                            ('Value2', '1057873 1057887'),
                            ('Value3', 'Something Something'),
                            ('Value4', None),
                            ('Value5', '13139'),
                            ('Value6', '1463451 1463485'),
                            ('Value7', 'Not In Database'),
                            ('Value8', '1617275 16288')],
                           ('anc_ref_1', 'anc_ref'))
df.show()
Code
df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
df1 = df_as1.join(df_as2, df_as1.anc_ref == df_as2.anc_ref, "left").drop(df_as1.anc_ref_1)#.drop(df_as2.anc_ref)
df1.show()
Option 2
Use the column name (a string) to join, and then select the join column:
df_as1.join(df_as2, "anc_ref", "left").select('anc_ref',df_as1.anc_ref_1).show()
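To get exactly the shape the question asks for (drop the duplicate anc_ref_1, keep the duplicate anc_ref_2 under the name anc_ref_3), a sketch using alias-qualified references; the column names and join key below follow the question and are illustrative:
from pyspark.sql.functions import col

joined = (df.alias("t1")
            .join(df.alias("t2"), col("t1.anc_ref_1") == col("t2.anc_ref_1"), "left")
            .select(col("t1.anc_ref_1"),
                    col("t1.anc_ref_2"),
                    col("t2.anc_ref_2").alias("anc_ref_3")))  # t2.anc_ref_1 is simply not selected
joined.printSchema()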
I create a new column and cast it to integer, but the column is not nullable. How can I make the new column nullable?
from pyspark.sql import functions as F
from pyspark.sql import types as T
zschema = T.StructType([T.StructField("col1", T.StringType(), True),\
T.StructField("col2", T.StringType(), True),\
T.StructField("time", T.DoubleType(), True),\
T.StructField("val", T.DoubleType(), True)])
df = spark.createDataFrame([("a","b", 1.0,2.0), ("a","b", 2.0,3.0) ], zschema)
df.printSchema()
df.show()
df = df.withColumn("xcol" , F.lit(0))
df = df.withColumn( "xcol" , F.col("xcol").cast(T.IntegerType()) )
df.printSchema()
df.show()
df1 = df.rdd.toDF()
df1.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- time: double (nullable = true)
|-- val: double (nullable = true)
|-- xcol: long (nullable = true)
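The rdd round-trip above does make xcol nullable, but toDF re-infers the Python ints as long. If it helps, a minimal sketch of one way to keep IntegerType while making the field nullable: rebuild the DataFrame against an explicit schema.
from pyspark.sql import types as T

# copy the existing fields, but declare xcol as a nullable IntegerType
new_fields = [T.StructField("xcol", T.IntegerType(), True) if f.name == "xcol" else f
              for f in df.schema.fields]
df_nullable = spark.createDataFrame(df.rdd, T.StructType(new_fields))
df_nullable.printSchema()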
I can make a Spark DataFrame with a vector column with the toDF method.
val dataset = Seq((1.0, org.apache.spark.ml.linalg.Vectors.dense(0.0, 10.0, 0.5))).toDF("id", "userFeatures")
scala> dataset.printSchema()
root
|-- id: double (nullable = false)
|-- userFeatures: vector (nullable = true)
scala> dataset.schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(id,DoubleType,false), StructField(userFeatures,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
I'm not sure how to create a vector column with the createDataFrame method. There isn't a VectorType type in org.apache.spark.sql.types.
This doesn't work:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.ml.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
df.printSchema()
To create a Spark vector column with createDataFrame, you can use the following code:
val rows = spark.sparkContext.parallelize(
List(
Row(1.0, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
)
)
val schema = List(
StructField("id", DoubleType, true),
StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
rows,
StructType(schema)
)
df.show()
+---+---------+
| id| features|
+---+---------+
|1.0|[1.0,2.0]|
+---+---------+
df.printSchema()
root
|-- id: double (nullable = true)
|-- features: vector (nullable = true)
The actual issue was an incompatible type: org.apache.spark.ml.linalg.Vectors.dense is not a valid external type for a schema of vector, so we have to switch to the mllib package instead of the ml package.
I hope it helps!
Note: I am using Spark v2.3.0. Also, class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg.
For reference - https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
I have a dataframe (df1) with 2 StringType fields.
Field1 (StringType): X
Field2 (StringType): 20180101
All I am trying to do is create another dataframe (df2) from df1 with 2 fields-
Field1 (StringType): X
Field2 (DateType): 2018-01-01
I am using the code below:
df2=df1.select(
col("field1").alias("f1"),
unix_timestamp(col("field2"),"yyyyMMdd").alias("f2")
)
df2.show
df2.printSchema
For field 2 I tried multiple things (unix_timestamp, from_unixtimestamp, to_date, cast("date")) but nothing worked.
I need the following schema as output:
df2.printSchema
|-- f1: string (nullable = false)
|-- f2: date (nullable = false)
I'm using Spark 2.1
to_date seems to work fine for what you need:
import org.apache.spark.sql.functions._
val df1 = Seq( ("X", "20180101"), ("Y", "20180406") ).toDF("c1", "c2")
val df2 = df1.withColumn("c2", to_date($"c2", "yyyyMMdd"))
df2.show
// +---+----------+
// | c1| c2|
// +---+----------+
// | X|2018-01-01|
// | Y|2018-04-06|
// +---+----------+
df2.printSchema
// root
// |-- c1: string (nullable = true)
// |-- c2: date (nullable = true)
[UPDATE]
For Spark 2.1 or prior, to_date doesn't take a format string as a parameter, hence explicit string formatting to the standard yyyy-MM-dd format using, say, regexp_replace is needed:
val df2 = df1.withColumn(
"c2", to_date(regexp_replace($"c2", "(\\d{4})(\\d{2})(\\d{2})", "$1-$2-$3"))
)