I am trying to read data from a table stored in a CSV file. The file does not have a header, so when I query the table using Spark SQL, all the results are null.
I have tried creating a schema struct, and while it does display when I call printSchema(), running select * from tableName does not work: all values are null. I have also tried StructType() with .add( colName ) instead of StructField, and that yielded the same results.
val schemaStruct1 = StructType(
StructField( "AgreementVersionID", IntegerType, true )::
StructField( "ProgramID", IntegerType, true )::
StructField( "AgreementID", IntegerType, true )::
StructField( "AgreementVersionNumber", IntegerType, true )::
StructField( "AgreementStatusID", IntegerType, true )::
StructField( "AgreementEffectiveDate", DateType, true )::
StructField( "AgreementEffectiveDateDay", IntegerType, true )::
StructField( "AgreementEndDate", DateType, true )::
StructField( "AgreementEndDateDay", IntegerType, true )::
StructField( "MasterAgreementNumber", IntegerType, true )::
StructField( "MasterAgreementEffectiveDate", DateType, true )::
StructField( "MasterAgreementEffectiveDateDay", IntegerType, true )::
StructField( "MasterAgreementEndDate", DateType, true )::
StructField( "MasterAgreementEndDateDay", IntegerType, true )::
StructField( "SalesContactName", StringType, true )::
StructField( "RevenueSubID", IntegerType, true )::
StructField( "LicenseAgreementContractTypeID", IntegerType, true )::Nil
)
val df1 = session.read
.option( "header", true )
.option( "delimiter", "," )
.schema( schemaStruct1 )
.csv( LicenseAgrmtMaster )
df1.printSchema()
df1.createOrReplaceTempView( "LicenseAgrmtMaster" )
Printing this gives me the following schema, which is correct:
root
|-- AgreementVersionID: integer (nullable = true)
|-- ProgramID: integer (nullable = true)
|-- AgreementID: integer (nullable = true)
|-- AgreementVersionNumber: integer (nullable = true)
|-- AgreementStatusID: integer (nullable = true)
|-- AgreementEffectiveDate: date (nullable = true)
|-- AgreementEffectiveDateDay: integer (nullable = true)
|-- AgreementEndDate: date (nullable = true)
|-- AgreementEndDateDay: integer (nullable = true)
|-- MasterAgreementNumber: integer (nullable = true)
|-- MasterAgreementEffectiveDate: date (nullable = true)
|-- MasterAgreementEffectiveDateDay: integer (nullable = true)
|-- MasterAgreementEndDate: date (nullable = true)
|-- MasterAgreementEndDateDay: integer (nullable = true)
|-- SalesContactName: string (nullable = true)
|-- RevenueSubID: integer (nullable = true)
|-- LicenseAgreementContractTypeID: integer (nullable = true)
However, querying the table yields only null values, even though the underlying data is not null. I need to be able to read this table in order to join it to another and complete a stored procedure.
I would suggest going through the steps below; you can then adapt the code to your needs.
val df = session.read.option( "delimiter", "," ).csv( "<Path of your file/dir>" )
val column_names = Seq( "name", "id" ) // example only: list the exact number of columns in your file
val dfWithHeader = df.toDF( column_names:_* )
// now you have column names and the data should be there too; check the types and cast where needed
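Applied to the original example, a minimal sketch would be to stop treating the first row as a header and keep the explicit schema. If a field fails to parse against the schema (the DateType columns are the usual suspects), Spark's default PERMISSIVE mode can leave the whole row as nulls, so the dateFormat value below is an assumption you should adjust to match your file:
val df1 = session.read
  .option( "header", false )            // the file has no header row
  .option( "delimiter", "," )
  .option( "dateFormat", "yyyy-MM-dd" ) // assumed format of the DateType columns
  .schema( schemaStruct1 )
  .csv( LicenseAgrmtMaster )

df1.createOrReplaceTempView( "LicenseAgrmtMaster" )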
I got an error while using the code below to drop a nested column with PySpark. Why is this not working? I tried using a tilde (~) instead of != as the error suggests, but that doesn't work either. So what do you do in that case?
def drop_col(df, struct_nm, delete_struct_child_col_nm):
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm,
                            df.select("{}.*".format(struct_nm)).columns)
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    return df.withColumn(struct_nm, struct(fields_to_keep))
I built a simple example with a struct column and a few dummy columns:
from pyspark import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lit, col, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
schema = StructType(
[
StructField('addresses',
StructType(
[StructField("state", StringType(), True),
StructField("street", StringType(), True),
StructField("country", StringType(), True),
StructField("code", IntegerType(), True)]
)
)
]
)
rdd = [({'state': 'pa', 'street': 'market', 'country': 'USA', 'code': 100},),
({'state': 'ca', 'street': 'baker', 'country': 'USA', 'code': 101},)]
df = sql_context.createDataFrame(rdd, schema)
df = df.withColumn('id', monotonically_increasing_id())
df = df.withColumn('name', lit('test'))
print(df.show())
print(df.printSchema())
Output:
+--------------------+-----------+----+
| addresses| id|name|
+--------------------+-----------+----+
|[pa, market, USA,...| 8589934592|test|
|[ca, baker, USA, ...|25769803776|test|
+--------------------+-----------+----+
root
|-- addresses: struct (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- country: string (nullable = true)
| |-- code: integer (nullable = true)
|-- id: long (nullable = false)
|-- name: string (nullable = false)
To drop the whole struct column, you can simply use the drop function:
df2 = df.drop('addresses')
print(df2.show())
Output:
+-----------+----+
| id|name|
+-----------+----+
| 8589934592|test|
|25769803776|test|
+-----------+----+
To drop specific fields in a struct column, it's a bit more complicated - there are some other similar questions here:
Dropping a nested column from Spark DataFrame
Dropping nested column of Dataframe with PySpark
In any case, I found them to be a bit complicated - my approach would just be to reassign the original column with the subset of struct fields you want to keep:
columns_to_keep = ['country', 'code']
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+----------+-----------+----+
| addresses| id|name|
+----------+-----------+----+
|[USA, 100]| 8589934592|test|
|[USA, 101]|25769803776|test|
+----------+-----------+----+
Alternatively, if you just wanted to specify the columns you want to remove rather than the columns you want to keep:
columns_to_remove = ['country', 'code']
all_columns = df.select("addresses.*").columns
columns_to_keep = list(set(all_columns) - set(columns_to_remove))
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+------------+-----------+----+
| addresses| id|name|
+------------+-----------+----+
|[pa, market]| 8589934592|test|
| [ca, baker]|25769803776|test|
+------------+-----------+----+
Hope this helps!
I create a new column and cast it to integer, but the column is not nullable. How can I make the new column nullable?
from pyspark.sql import functions as F
from pyspark.sql import types as T
zschema = T.StructType([T.StructField("col1", T.StringType(), True),\
T.StructField("col2", T.StringType(), True),\
T.StructField("time", T.DoubleType(), True),\
T.StructField("val", T.DoubleType(), True)])
df = spark.createDataFrame([("a","b", 1.0,2.0), ("a","b", 2.0,3.0) ], zschema)
df.printSchema()
df.show()
df = df.withColumn("xcol" , F.lit(0))
df = df.withColumn( "xcol" , F.col("xcol").cast(T.IntegerType()) )
df.printSchema()
df.show()
df1 = df.rdd.toDF()
df1.printSchema()
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- time: double (nullable = true)
|-- val: double (nullable = true)
|-- xcol: long (nullable = true)
I want to change the schema of an existing dataframe, but while changing the schema I'm getting an error. Is it possible to change the existing schema of a dataframe?
val customSchema=StructType(
Array(
StructField("data_typ", StringType, nullable=false),
StructField("data_typ", IntegerType, nullable=false),
StructField("proc_date", IntegerType, nullable=false),
StructField("cyc_dt", DateType, nullable=false),
));
val readDF=
+------------+--------------------+-----------+--------------------+
|DatatypeCode| Description|monthColNam| timeStampColNam|
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
val rows= readDF.rdd
val readDF1 = sparkSession.createDataFrame(rows,customSchema)
expected result
val newdf=
+------------+--------------------+-----------+--------------------+
|data_typ_cd | data_typ_desc|proc_dt | cyc_dt |
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
Any help will be appreciated.
You can do something like this to change the datatype from one to another.
I have created a dataframe similar to yours, as below:
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.types._
var df = Seq(("03099","Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),("03307","Elapsed Day Factor", "201867", "2018-05-31 18:25:00"))
.toDF("DatatypeCode","data_typ", "proc_date", "cyc_dt")
df.printSchema()
df.show()
This gives me the following output:
root
|-- DatatypeCode: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date: string (nullable = true)
|-- cyc_dt: string (nullable = true)
+------------+--------------------+---------+-------------------+
|DatatypeCode| data_typ|proc_date| cyc_dt|
+------------+--------------------+---------+-------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:00|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:00|
+------------+--------------------+---------+-------------------+
As you can see in the schema above, all the columns are of type String. Now, to change the column proc_date to Integer type and cyc_dt to Date type, I will do the following:
df = df.withColumnRenamed("DatatypeCode", "data_type_code")
df = df.withColumn("proc_date_new", df("proc_date").cast(IntegerType)).drop("proc_date")
df = df.withColumn("cyc_dt_new", df("cyc_dt").cast(DateType)).drop("cyc_dt")
and when you check the schema of this dataframe
df.printSchema()
then it gives the following output with the new column names:
root
|-- data_type_code: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date_new: integer (nullable = true)
|-- cyc_dt_new: date (nullable = true)
You cannot change the schema like this. The schema object passed to createDataFrame has to match the data, not the other way around:
To parse timestamp data, use the corresponding functions, for example as in Better way to convert a string field into timestamp in Spark.
To change other types, use the cast method, for example as in how to change a Dataframe column from String type to Double type in pyspark.
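For the dataframe in the question, a minimal sketch of those two suggestions might look like the following (the target column names are taken from the expected result; to_timestamp assumes Spark 2.2+, and you can cast to DateType instead if you only need the date):
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.IntegerType
// rename the columns and cast each one to the desired type instead of
// swapping out the schema wholesale
val newdf = readDF.select(
  col("DatatypeCode").as("data_typ_cd"),
  col("Description").as("data_typ_desc"),
  col("monthColNam").cast(IntegerType).as("proc_dt"),
  to_timestamp(col("timeStampColNam")).as("cyc_dt")
)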
I need to create a schema for a dataframe in Spark. I have no problem creating regular StructFields, such as StringType, IntegerType. However, I want to create a StructField for a tuple.
I have tried the following:
StructType([
StructField("dst_ip", StringType()),
StructField("port", StringType())
])
However, it throws an error
"list object has no attribute 'name'"
Is it possible to create a StructField for a tuple type?
You can define a StructType inside of a StructField:
schema = StructType(
[
StructField(
"myTuple",
StructType(
[
StructField("dst_ip", StringType()),
StructField("port", StringType())
]
)
)
]
)
df = sqlCtx.createDataFrame([], schema)
df.printSchema()
#root
# |-- myTuple: struct (nullable = true)
# | |-- dst_ip: string (nullable = true)
# | |-- port: string (nullable = true)
The class StructType, used to define the structure of a DataFrame, is the data type representing a Row, and it consists of a list of StructFields.
In order to define a tuple datatype for a column (say columnA), you need to wrap the StructType of the tuple's elements into a StructField. Note that StructFields need names, since they represent columns.
Define tuple StructField as a new StructType:
columnA = StructField('columnA', StructType([
StructField("dst_ip", StringType()),
StructField("port", StringType())
])
)
Define schema containing columnA and columnB (of type FloatType):
mySchema = StructType([ columnA, StructField("columnB", FloatType())])
Apply schema to dataframe:
data =[{'columnA': ('x', 'y'), 'columnB': 1.0}]
# data = [Row(columnA=('x', 'y'), columnB=1.0)] (needs from pyspark.sql import Row)
df = spark.createDataFrame(data, mySchema)
df.printSchema()
# root
# |-- columnA: struct (nullable = true)
# | |-- dst_ip: string (nullable = true)
# | |-- port: string (nullable = true)
# |-- columnB: float (nullable = true)
Show dataframe:
df.show()
# +-------+-------+
# |columnA|columnB|
# +-------+-------+
# | [x, y]| 1.0|
# +-------+-------+
(this is just the longer version of the other answer)
I have a streaming dataframe with three columns: time, col1, and col2.
+-----------------------+-------------------+--------------------+
|time |col1 |col2 |
+-----------------------+-------------------+--------------------+
|2018-01-10 15:27:21.289|0.4988615628926717 |0.1926744113882285 |
|2018-01-10 15:27:22.289|0.5430687338123434 |0.17084552928040175 |
|2018-01-10 15:27:23.289|0.20527770821641478|0.2221980020202523 |
|2018-01-10 15:27:24.289|0.130852802747647 |0.5213147910202641 |
+-----------------------+-------------------+--------------------+
The datatype of col1 and col2 is variable. It could be a string or numeric datatype.
So I have to calculate statistics for each column.
For string columns, calculate only the valid count and invalid count.
For timestamp columns, calculate only min and max.
For numeric columns, calculate min, max, average and mean.
I have to compute all statistics in a single query.
Right now, I compute them with three separate queries, one for each type of column.
Enumerate the cases you want and select. For example, if the stream is defined as:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
val schema = StructType(Seq(
StructField("v", TimestampType),
StructField("x", IntegerType),
StructField("y", StringType),
StructField("z", DecimalType(10, 2))
))
val df = spark.readStream.schema(schema).format("csv").load("/tmp/foo")
then the statistics can be computed as follows (the comments show the schema of the result):
val stats = df.select(df.dtypes.flatMap {
case (c, "StringType") =>
Seq(count(c) as s"valid_${c}", count("*") - count(c) as s"invalid_${c}")
case (c, t) if Seq("TimestampType", "DateType") contains t =>
Seq(min(c), max(c))
case (c, t) if (Seq("FloatType", "DoubleType", "IntegerType") contains t) || t.startsWith("DecimalType") =>
Seq(min(c), max(c), avg(c), stddev(c))
case _ => Seq.empty[Column]
}: _*)
// root
// |-- min(v): timestamp (nullable = true)
// |-- max(v): timestamp (nullable = true)
// |-- min(x): integer (nullable = true)
// |-- max(x): integer (nullable = true)
// |-- avg(x): double (nullable = true)
// |-- stddev_samp(x): double (nullable = true)
// |-- valid_y: long (nullable = false)
// |-- invalid_y: long (nullable = false)
// |-- min(z): decimal(10,2) (nullable = true)
// |-- max(z): decimal(10,2) (nullable = true)
// |-- avg(z): decimal(14,6) (nullable = true)
// |-- stddev_samp(z): double (nullable = true)
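As a hypothetical usage sketch (not part of the original answer): because stats aggregates over a streaming DataFrame, the query has to be started with complete output mode, for example:
// start the streaming query; streaming aggregations require "complete" output mode
val query = stats.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()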