How to convert schema from text file to scala dataframe - scala

We have a text file containing the following data:
String Col1
String Col2
List <Object> Col3 [
String Col3_1
String Col3_2
]
String Col4
String Col5
We need to convert this to a Scala schema as:
val Schema = new StructType()
  .add("Col1", StringType)
  .add("Col2", StringType)
  .add("Col3", ArrayType(new StructType()
    .add("Col3_1", StringType)
    .add("Col3_2", StringType)))
  .add("Col4", StringType)
  .add("Col5", StringType)
The data will be nested JSON and the schema will keep evolving, hence an automated solution is required.
A schema registry would be a good option, but it is not an option for the current implementation.
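Since the data is nested JSON and the schema keeps evolving, one minimal sketch of an automated approach (assuming a representative sample of the JSON is available at a placeholder path) is to let Spark infer the StructType from the sample and reuse, or persist, that inferred schema instead of hand-writing it:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().appName("schema-inference").getOrCreate()

// Infer the schema (including the nested Col3 structure) from a sample of the JSON.
// "path/to/sample.json" and "path/to/data.json" are placeholder paths.
val inferredSchema: StructType = spark.read
  .option("multiLine", "true")
  .json("path/to/sample.json")
  .schema

// Reuse the inferred schema to read the full dataset with a fixed, explicit schema
val df = spark.read.schema(inferredSchema).json("path/to/data.json")
df.printSchema()

// The schema can also be persisted as JSON and restored later, so the
// StructType does not have to be re-coded each time the schema evolves
val schemaJson: String = inferredSchema.json
val restoredSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

Because the StructType is derived from the data itself, it tracks schema evolution without manual edits, which is one way to get an automated solution without a schema registry.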

Related

Is there any way to change one Spark DF's datatypes to match another DF's datatypes?

I have one Spark DF1 with datatypes:
string (nullable = true)
integer (nullable = true)
timestamp (nullable = true)
And one more spark DF2 (which I created from Pandas DF)
object
int64
object
Now I need to change the DF2 datatypes to match the DF1 datatypes. Is there any common way to do that? Because every time I may get different columns and different datatypes.
Is there any way to assign the DF1 data types to some StructType and use that for DF2?
Suppose you have two dataframes, data1_sdf and data2_sdf. You can use a dataframe's schema to extract a column's data type with data1_sdf.select('column_name').schema[0].dataType.
Here's an example where the data2_sdf columns are cast to data1_sdf's types within a select:
from pyspark.sql import functions as func

data2_sdf. \
    select(*[func.col(c).cast(data1_sdf.select(c).schema[0].dataType) if c in data1_sdf.columns else c for c in data2_sdf.columns])
If you make sure that your first object is a string-like column and your third object is a timestamp-like column, you can try this method:
df2 = spark.createDataFrame(
    df2.rdd, schema=df1.schema
)
However, this method is not guaranteed to work; some conversions are not valid (e.g. string to integer). It may also perform poorly. Therefore, it is better to use cast to change the data type of each column, as shown above.

How to get datatype for specific field name from schema attribute of pyspark dataframe (from parquet files)?

I have a folder of parquet files that I am reading into a PySpark session. How can I inspect / parse the individual schema field types and other info (e.g. for the purpose of comparing schemas between dataframes to see exact type differences)?
I can see the parquet schema and specific field names with something like...
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
sparkSession = SparkSession.builder.appName("data_debugging").getOrCreate()
df = sparkSession.read.option("header", "true").parquet("hdfs://hw.co.local:8020/path/to/parquets")
df.schema # or df.printSchema()
df.fieldNames()
So I can see the schema
StructType(List(StructField(SOME_FIELD_001,StringType,true),StructField(SOME_FIELD_002,StringType,true),StructField(SOME_FIELD_003,StringType,true)))
but I am not sure how to get the values for specific fields, e.g. something like...
df.schema.getType("SOME_FIELD_001")
or
df.schema.getData("SOME_FIELD_001") #type: dict
Does anyone know how to do something like this?
This function collects (name, type, nullability) in a dict and makes it easy to look up info based on a dataframe's column name.
If the dataframe is named df, the metadata dict will be called df.meta.
name = df  # enter name of dataframe here

def metadata(name):  # function for getting metadata in a dict
    null = [str(n.nullable) for n in name.schema.fields]   # nullability
    types = [str(i.dataType) for i in name.schema.fields]  # type
    both = [list(a) for a in zip(types, null)]              # combine type + nullability
    names = name.columns                                    # names of columns
    final = {}                                              # create dict
    for key in names:
        for value in both:
            final[key] = value
            both.remove(value)
            break
    return final

name.meta = metadata(name)  # final dict is called df.meta
                            # if name=df2, final dict will be df2.meta
Now you can compare the column info of different dataframes.
Example:
Input: df.meta
Output: {'col1': ['StringType', 'True'],
'col2': ['StringType', 'True'],
'col3': ['LongType', 'True'],
'col4': ['StringType', 'True']}
#get column info
Input: df.meta['col1']
Output: ['StringType', 'True']
#compare column type + nullability
Input: df.meta['col1'] == df2.meta['col1']
Output: True/False
#compare only column type
Input: df.meta['col1'][0] == df2.meta['col1'][0]
Output: True/False
#compare only nullability
Input: df.meta['col1'][1] == df2.meta['col1'][1]
Output: True/False
Method 1:
You can use the df.dtypes method to get each field name along with its datatype, which can then be converted to a dict object as shown below:
myschema = dict(df.dtypes)
Now, you can obtain the datatypes as shown below,
myschema.get('some_field_002')
Output:
'string'
Method 2:
Alternatively, if you want the datatypes as pyspark.sql.types objects, you can use the df.schema attribute and create a custom schema dictionary as shown below:
myschema = dict(map(lambda x: (x.name, x.dataType), df.schema.fields))
print(myschema.get('some_field_002'))
Output:
StringType

Spark dataframe cast column for Kudu compatibility

(I am new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table having the same structure, in Spark, through Kudu. I am getting an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data type of a Spark DataFrame to make it compatible with Kudu?
This is intended to be a 1-to-1 copy of data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark+Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).
I would like to specify that "column #1, with name SOME_COL, which is a NUMBER in Oracle, should be mapped to a LongType, which is supported in Kudu".
How can I do that?
// This works
val df: DataFrame = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc("jdbc:oracle:thin:#(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)

// This does not work
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala

// So let's see the Spark data types
df.dtypes.foreach { case (colName, colType) => println(s"$colName: $colType") }

// Spark data type:  SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu data type:   SOME_COL BIGINT
Apparently, we can specify a custom schema when reading from a JDBC data source.
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
That worked. I was able to specify a customSchema like so:
col1 Long, col2 Timestamp, col3 Double, col4 String
and with that, the code works:
import spark.implicits._

val df: Dataset[case_class_for_table] = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc("jdbc:oracle:thin:#(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
  .as[case_class_for_table]

kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
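For reference, here is a minimal sketch (not shown in the original answer) of one way to wire the customSchema string into the Oracle read above, assuming it is passed through the same props object used in the JDBC call; the column names are placeholders:

// Sketch: attach the customSchema to the JDBC properties already used above.
// SOME_COL / SOME_OTHER_COL are placeholder column names.
props.put("customSchema", "SOME_COL Long, SOME_OTHER_COL String")

val dfTyped: DataFrame = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc("jdbc:oracle:thin:#(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)

// SOME_COL now arrives as LongType instead of DecimalType(38,0),
// which Kudu can map to BIGINT.
dfTyped.printSchema()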

I have a JSON string column in my dataframe; I already tried to extract the JSON string columns using PySpark

df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
(Screenshots of the input dataframe and the expected output dataframe were included in the original post but are not reproduced here.)
I am taking a guess that your dataframe has a column "body" which is a JSON string, and you want to parse the JSON and extract an element from it.
First you need to define or extract the JSON schema, then parse the JSON string and extract its elements as columns. From the extracted columns, you can select the desired ones.
from pyspark.sql import functions as F

json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema))\
    .select("body_json.*").select('CustomerType')
display(df2)

How to read decimal logical type into spark dataframe

I have an Avro file containing a decimal logicalType as follow:
"type":["null",{"type":"bytes","logicalType":"decimal","precision":19,"scale":2}]
When I try to read the file with the Scala Spark library, the df schema is:
MyField: binary (nullable = true)
How can I convert it into a decimal type?
You can specify a schema in the read operation:
import org.apache.spark.sql.types._

val schema = new StructType()
  .add(StructField("MyField", DecimalType(19, 2)))
or you can cast the column:
import org.apache.spark.sql.functions.{col, udf}

val binToInt: String => Integer = Integer.parseInt(_, 2)
val binToIntegerUdf = udf(binToInt)
df.withColumn("MyField", binToIntegerUdf(col("MyField").cast("string")))