Reading CSV to DataFrame with dynamic custom schema with PySpark

I'm working with Databricks in a notebook. I want to read CSV files with a custom schema.
I'd like to be able to loop over all the csv files in a folder and read them with their respective schema.
So I have a schema for each csv file:
csv_1 = StructType([
    StructField('foo', StringType(), False),
    StructField('bar', StringType(), True),
])
csv_2 = StructType([
    StructField('foo', StringType(), False),
    StructField('bar', StringType(), True),
    StructField('baz', StringType(), True),
])
csv_3 = StructType([
    StructField('bar', StringType(), True)
])
Then I have this loop:
for file in os.listdir(path):
    filename = os.path.splitext(file)[0]
    dataframes[filename] = spark.read.csv(os.path.join(path, file), header=True, schema=???)
I guess I probably need to use some mapping somewhere but I'm not sure how.

filename_to_related_mapping = {
    'name1': csv_1,
    'name2': csv_2,
}
for file in os.listdir(path):
    filename = os.path.splitext(file)[0]
    dataframes[filename] = spark.read.csv(os.path.join(path, file), header=True, schema=filename_to_related_mapping[filename])
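Anyway, it's just a CSV. Another way is not to pass a schema at all: with header=True Spark takes the column names from the file, and with inferSchema=True it also infers the column types (otherwise every column is read as a string).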
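One way to make the mapping approach more forgiving is to decide up front which files have a schema and which do not, so a stray file does not raise a KeyError in the middle of the loop. The following is only a sketch: plan_csv_reads is a hypothetical helper name, and the schema values are string stand-ins for the real StructType objects so that the snippet runs without Spark.

```python
import os


def plan_csv_reads(file_names, schemas_by_stem):
    """Pair each CSV file with its schema; collect files that have no schema."""
    planned, skipped = {}, []
    for file_name in sorted(file_names):
        stem, ext = os.path.splitext(file_name)
        if ext.lower() == ".csv" and stem in schemas_by_stem:
            planned[stem] = (file_name, schemas_by_stem[stem])
        else:
            skipped.append(file_name)
    return planned, skipped


# Stand-in values; in the notebook these would be csv_1, csv_2, ...
schemas_by_stem = {"name1": "<csv_1>", "name2": "<csv_2>"}

planned, skipped = plan_csv_reads(
    ["name2.csv", "notes.txt", "name1.csv", "name9.csv"], schemas_by_stem
)

# With a SparkSession available, the plan could then be consumed like this:
# for stem, (file_name, schema) in planned.items():
#     dataframes[stem] = spark.read.csv(
#         os.path.join(path, file_name), header=True, schema=schema
#     )
```

Files that end up in skipped can be logged or read with schema inference instead, depending on what the notebook needs.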


Spark Structured Stream qubole Kinesis connector errors out with "Got an exception while fetching credentials"

I am using the following code to write to Kinesis from a Spark Structured Streaming job. It fails with the error below. The AWS credentials have admin access, and I can use the AWS console with them. What could be the issue here?
22/03/16 13:46:34 ERROR AWSInstanceProfileCredentialsProviderWithRetries: Got an exception while fetching credentials org.apache.spark.sql.kinesis.shaded.amazonaws.SdkClientException: Unable to load credentials from service endpoint
val finalDF ="CAST(rand() AS STRING) as partitionKey"),
val query = finalDF.writeStream
  .option("streamName", "sparkstream2")
  .option("endpointUrl", "")
  .option("region", "us-east-1")
  .option("awsAccessKey", "") // Creds removed
  .option("awsSecretKey", "") // Creds removed
  .option("checkpointLocation", "chk-point-dir")
The printSchema output looks as follows:
|-- partitionKey: string (nullable = false)
|-- data: string (nullable = true)
I am using the connector from Qubole.
I had this issue too. Add .option("awsUseInstanceProfile", "false") to the writer. The kinesis-sql package doesn't handle AWS credentials by default as one would expect. I found the lead in a GitHub issue for the connector.
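For PySpark users, the same set of options can be collected in a dictionary. This is a sketch only: kinesis_sink_options is a hypothetical helper, the placeholder strings stand for values that were removed from the question, and the option names are the ones used in the question plus the awsUseInstanceProfile flag from the answer.

```python
def kinesis_sink_options(stream_name, endpoint_url, region,
                         access_key, secret_key, checkpoint_dir):
    """Build the sink options from the question, with instance-profile lookup disabled."""
    return {
        "streamName": stream_name,
        "endpointUrl": endpoint_url,
        "region": region,
        "awsAccessKey": access_key,
        "awsSecretKey": secret_key,
        "checkpointLocation": checkpoint_dir,
        # Per the answer: stop the connector from looking up instance-profile
        # credentials so that the explicit keys above are used.
        "awsUseInstanceProfile": "false",
    }


options = kinesis_sink_options(
    "sparkstream2", "<endpoint-url>", "us-east-1",
    "<access-key>", "<secret-key>", "chk-point-dir",
)

# Assumed usage (the format name "kinesis" is an assumption about the connector):
# query = final_df.writeStream.format("kinesis").options(**options).start()
```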

TypeError: StructType can not accept object

I'm trying to convert this JSON string to a DataFrame in Databricks:
a = """{ "id": "a",
"message_type": "b",
"data": [ {"c":"abcd","timestamp":"2022-03-
01T13:10:00+00:00","e":0.18,"f":0.52} ]}"""
The schema I defined for the data is this:
StructField("data", ArrayType(StructType([
When I run this command
df = sqlContext.createDataFrame(sc.parallelize([a]), schema)
I get this error:
PythonException: 'TypeError: StructType can not accept object '{ "id": "a",\n"message_type": "JobMetric",\n"data": [ {"c":"abcd","timestamp":"2022-03- \n01T13:10:00+00:00","e":0.18,"f":0.52=} ]' in type <class 'str'>'. Full traceback below:
Could anyone help me with this? I would much appreciate it!
Your a variable is wrong.
"data": [ "{"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-
01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439}" ]
Should be
"data": [ {"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-
01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439} ]
Also check that the field names in the data match the names in your schema (for example JobId vs. job_id, and Timestamp vs. timestamp).
The issue is that when you pass a StructType schema, createDataFrame expects each RDD element to be a row-like object (a tuple, list, dict or Row) with one value per field. In your current scenario each element is just a plain string. To fix it, first parse the string into a Python object with json.loads, and then create the RDD from that. See the logic below for details -
Input Data -
a = """{"run_id": "1640c68e-5f02-4f49-943d-37a102f90146",
"message_type": "JobMetric",
"data": [ {"JobId":"ATLUPS10m2101V1","timestamp":"2022-03-01T13:10:00+00:00",
"number1":0.9098145961761475,"number2":0.5294908881187439} ]}"""
Converting to an RDD using the parsed JSON object -
from pyspark.sql.types import *
import json
StructField("data", ArrayType(StructType([
df = spark.createDataFrame(data=sc.parallelize([json.loads(a)]),schema=schema)
Output -
|run_id |message_type|data |
|1640c68e-5f02-4f49-943d-37a102f90146|JobMetric |[{ATLUPS10m2101V1, 2022-03-01T13:10:00+00:00, 0.9098145961761475, 0.5294908881187439}]|
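The difference between the failing and the working call can be seen without Spark. The sketch below parses the same kind of string with json.loads and shows that the result is a dict with one entry per top-level field, which is the row-like shape a StructType schema can accept. The number1 and number2 keys are taken from the other answer in this thread, and the final createDataFrame call is left as a comment because it needs a running SparkSession and the full schema.

```python
import json

a = """{"run_id": "1640c68e-5f02-4f49-943d-37a102f90146",
"message_type": "JobMetric",
"data": [ {"JobId":"ATLUPS10m2101V1","timestamp":"2022-03-01T13:10:00+00:00",
"number1":0.9098145961761475,"number2":0.5294908881187439} ]}"""

# Before parsing: one opaque string, which is what triggers the TypeError.
is_plain_string = isinstance(a, str)

# After parsing: a dict whose keys line up with the top-level schema fields.
record = json.loads(a)
top_level_fields = sorted(record)
first_item = record["data"][0]

# With a SparkSession and the schema from the answer above:
# df = spark.createDataFrame(data=sc.parallelize([record]), schema=schema)
```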

Sink Kafka Stream to MongoDB using PySpark Structured Streaming

My Spark:
spark = SparkSession\
.config("spark.streaming.stopGracefullyOnShutdown", "true")\
Mongo URI:
input_uri_weld = 'mongodb://'
output_uri_weld = 'mongodb://'
Function for writing stream batches to Mongo:
def save_to_mongodb_collection(current_df, epoc_id, mongodb_collection_name):
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.option("spark.mongodb.output.uri", output_uri_weld) \
Kafka Stream:
kafka_df = spark.readStream\
.option("kafka.bootstrap.servers", kafka_broker)\
.option("subscribe", kafka_topic)\
.option("startingOffsets", "earliest")\
Write to Mongo:
mongo_writer = df_parsed.write\
.option("spark.mongodb.output.uri", output_uri_weld)\
And my spark.conf file:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at
I found a solution.
Since I couldn't find the right Mongo driver for Structured Streaming, I worked out another approach.
Now I connect to MongoDB directly and use foreach(...) instead of foreachBatch(...). My code looks like this:
import pymongo
from pymongo import MongoClient

local_url = "mongodb://localhost:27017"


def write_machine_df_mongo(target_df):
    # foreach() calls this once per row, so target_df is a single Row
    cluster = MongoClient(local_url)
    db = cluster["test_db"]
    collection = db.test1

    post = {
        "machine_id": target_df.machine_id,
        "proc_type": target_df.proc_type,
        "sensor1_id": target_df.sensor1_id,
        "sensor2_id": target_df.sensor2_id,
        "time": target_df.time,
        "sensor1_val": target_df.sensor1_val,
        "sensor2_val": target_df.sensor2_val,
    }
    # Assumed completion: the original snippet is cut off after the dict
    collection.insert_one(post)
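Because foreach runs once per row, opening a new client inside the function can become expensive. PySpark's foreach also accepts an object with open, process and close methods, which allows one connection per partition and epoch. The class below is a sketch of that variant, not the answerer's code: MongoRowWriter and client_factory are hypothetical names, and the client is injected so the class itself does not depend on pymongo.

```python
class MongoRowWriter:
    """Writer object for DataStreamWriter.foreach: one client per partition and epoch."""

    def __init__(self, client_factory, db_name, collection_name):
        # client_factory is a zero-argument callable returning a MongoClient-like object
        self._client_factory = client_factory
        self._db_name = db_name
        self._collection_name = collection_name
        self._client = None
        self._collection = None

    def open(self, partition_id, epoch_id):
        # Called once before the rows of a partition are processed
        self._client = self._client_factory()
        self._collection = self._client[self._db_name][self._collection_name]
        return True  # True tells Spark to go ahead and process the rows

    def process(self, row):
        # Called once per row; Row.asDict() turns the row into a plain document
        self._collection.insert_one(row.asDict())

    def close(self, error):
        # Called after the partition is done, with the error if one occurred
        if self._client is not None:
            self._client.close()


# Hypothetical usage, assuming pymongo is installed and df_parsed is a streaming DataFrame:
# from pymongo import MongoClient
# writer = MongoRowWriter(lambda: MongoClient("mongodb://localhost:27017"), "test_db", "test1")
# query = df_parsed.writeStream.foreach(writer).start()
```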

Read / Write data from HBase using Pyspark

I am trying to read data from HBase using PySpark and I am getting many weird errors. Below is a sample snippet of my code.
Please suggest a solution.
empdata = ''.join("""
{
    'table': {
        'namespace': 'default',
        'name': 'emp'
    },
    'rowkey': 'key',
    'columns': {
        'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
        'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
    }
}
""")
df = sqlContext \
    .read \
    .options(catalog=empdata) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
I am using the following versions:
HBase 2.1.6,
Pyspark 2.3.2, Hadoop 3.1
I ran the code as follows:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories --files /etc/hbase/conf/hbase-site.xml
The error is
An error occurred while calling o71.load. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

How to convert a SnowflakeCursor in to a pySpark dataframe

The result of sending a SQL statement to Snowflake is a SnowflakeCursor. How can I easily convert it into a PySpark DataFrame?
When using Databricks, we can use the Snowflake connector to load the result of a SQL statement into a DataFrame. Note that specifying sfRole may be the key to gaining access to your database objects.
options = {
    "sfUrl": "",
    "sfUser": user,
    "sfPassword": pw,
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "Accountadmin",
    "sfWarehouse": "wh"
}
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", strCheckingSQL) \
    .load()