Reading CSV files into DataFrames with dynamic custom schemas in PySpark

I'm working in a Databricks notebook and I want to read CSV files with custom schemas.
I'd like to loop over all the CSV files in a folder and read each one with its respective schema.
So I have a schema for each CSV file:
from pyspark.sql.types import StructType, StructField, StringType

csv_1 = StructType([
    StructField('foo', StringType(), False),
    StructField('bar', StringType(), True),
])
csv_2 = StructType([
    StructField('foo', StringType(), False),
    StructField('bar', StringType(), True),
    StructField('baz', StringType(), True),
])
csv_3 = StructType([
    StructField('bar', StringType(), True),
])
Then I have this loop:
import os

dataframes = {}
for file in os.listdir(path):
    filename = os.path.splitext(file)[0]
    dataframes[filename] = spark.read.csv(path + file, header=True, schema=???)
I guess I probably need to use some mapping somewhere but I'm not sure how.

Build a dictionary that maps each file name to its schema, then look the schema up inside the loop:

filename_to_related_mapping = {
    'name1': csv_1,
    'name2': csv_2,
    ...
}

for file in os.listdir(path):
    filename = os.path.splitext(file)[0]
    dataframes[filename] = spark.read.csv(path + file, header=True, schema=filename_to_related_mapping[filename])
Anyway, it's just a CSV; another option is to not pass a schema at all and let Spark infer it dynamically.
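For example, a minimal sketch of the inference route, assuming the same path variable and dataframes dict as above:

dataframes = {}
for file in os.listdir(path):
    filename = os.path.splitext(file)[0]
    # No explicit schema: Spark samples the file and infers the column types
    dataframes[filename] = spark.read.csv(path + file, header=True, inferSchema=True)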

Related

Spark Structured Stream qubole Kinesis connector errors out with "Got an exception while fetching credentials"

I am using the following code to write to Kinesis from a Spark Structured Streaming job. It errors out with the following error. The AWS credentials have admin access and I am able to log in to the AWS console with them. What could be the issue here?
22/03/16 13:46:34 ERROR AWSInstanceProfileCredentialsProviderWithRetries: Got an exception while fetching credentials org.apache.spark.sql.kinesis.shaded.amazonaws.SdkClientException: Unable to load credentials from service endpoint
val finalDF = rawDF.select(expr("CAST(rand() AS STRING) as partitionKey"),
  to_json(struct("*")).alias("data"))

finalDF.printSchema()

val query = finalDF.writeStream
  .outputMode("update")
  .format("kinesis")
  .option("streamName", "sparkstream2")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("region", "us-east-1")
  .option("awsAccessKey", "") // Creds removed
  .option("awsSecretKey", "") // Creds removed
  .option("checkpointLocation", "chk-point-dir")
  .start()

query.awaitTermination()
spark.stop()
The printSchema output looks as follows:
root
|-- partitionKey: string (nullable = false)
|-- data: string (nullable = true)
I am using the connector from qubole
https://github.com/qubole/kinesis-sql
I had this issue too. Add .option("awsUseInstanceProfile", "false") to the writer; the kinesis-sql package doesn't handle explicit AWS credentials by default the way one would expect. I found a GitHub issue on the connector's repository that pointed me in this direction.
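For reference, a rough PySpark sketch of where that option fits (the original snippet is Scala; this assumes the same qubole connector jar is on the classpath and a streaming DataFrame named finalDF as above):

query = (finalDF.writeStream
    .outputMode("update")
    .format("kinesis")
    .option("streamName", "sparkstream2")
    .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
    .option("region", "us-east-1")
    .option("awsAccessKey", "")   # creds removed
    .option("awsSecretKey", "")   # creds removed
    # Tell the connector not to try the instance-profile credentials provider
    .option("awsUseInstanceProfile", "false")
    .option("checkpointLocation", "chk-point-dir")
    .start())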

TypeError: StructType can not accept object

I'm trying to convert this JSON string to a DataFrame in Databricks:
a = """{ "id": "a",
"message_type": "b",
"data": [ {"c":"abcd","timestamp":"2022-03-
01T13:10:00+00:00","e":0.18,"f":0.52} ]}"""
the schema I defined for the data is this
schema = StructType(
    [
        StructField("id", StringType(), False),
        StructField("message_type", StringType(), False),
        StructField("data", ArrayType(StructType([
            StructField("c", StringType(), False),
            StructField("timestamp", StringType(), False),
            StructField("e", DoubleType(), False),
            StructField("f", DoubleType(), False),
        ]))),
    ]
)
and when I run this command
df = sqlContext.createDataFrame(sc.parallelize([a]), schema)
I get this error
PythonException: 'TypeError: StructType can not accept object '{ "id": "a",\n"message_type": "JobMetric",\n"data": [ {"c":"abcd","timestamp":"2022-03- \n01T13:10:00+00:00","e":0.18,"f":0.52=} ]' in type <class 'str'>'. Full traceback below:
If anyone could help me with this, I would much appreciate it!
Your a variable is wrong.
"data": [ "{"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439}" ]
Should be
"data": [ {"JobId":"ATLUPS10m2101V1","Timestamp":"2022-03-01T13:10:00+00:00","number1":0.9098145961761475,"number2":0.5294908881187439} ]
Also check whether the field names line up with your schema, e.g. JobId vs. job_id and Timestamp vs. timestamp.
The issue is that when you pass a plain string against a struct schema, Spark expects row-like data that matches the schema, but in your current scenario it is getting just a string object. To fix it, first convert your string to a JSON object (a Python dict) and then create an RDD from it. See the logic below for details:
Input data:
a = """{"run_id": "1640c68e-5f02-4f49-943d-37a102f90146",
        "message_type": "JobMetric",
        "data": [ {"JobId":"ATLUPS10m2101V1","timestamp":"2022-03-01T13:10:00+00:00",
                   "score":0.9098145961761475,
                   "severity":0.5294908881187439
                  }
                ]
       }"""
Converting to an RDD using a JSON object:
from pyspark.sql.types import *
import json

schema = StructType(
    [
        StructField("run_id", StringType(), False),
        StructField("message_type", StringType(), False),
        StructField("data", ArrayType(StructType([
            StructField("JobId", StringType(), False),
            StructField("timestamp", StringType(), False),
            StructField("score", DoubleType(), False),
            StructField("severity", DoubleType(), False),
        ]))),
    ]
)

df = spark.createDataFrame(data=sc.parallelize([json.loads(a)]), schema=schema)
df.show(truncate=False)
Output:
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|run_id |message_type|data |
+------------------------------------+------------+--------------------------------------------------------------------------------------+
|1640c68e-5f02-4f49-943d-37a102f90146|JobMetric |[{ATLUPS10m2101V1, 2022-03-01T13:10:00+00:00, 0.9098145961761475, 0.5294908881187439}]|
+------------------------------------+------------+--------------------------------------------------------------------------------------+

Sink Kafka Stream to MongoDB using PySpark Structured Streaming

My Spark:
spark = SparkSession \
    .builder \
    .appName("Demo") \
    .master("local[3]") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
    .getOrCreate()
Mongo URI:
input_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
output_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
Function for writing stream batches to Mongo:
def save_to_mongodb_collection(current_df, epoc_id, mongodb_collection_name):
    current_df.write \
        .format("com.mongodb.spark.sql.DefaultSource") \
        .mode("append") \
        .option("spark.mongodb.output.uri", output_uri_weld) \
        .save()
Kafka Stream:
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()
Write to Mongo:
mongo_writer = df_parsed.write \
    .format('com.mongodb.spark.sql.DefaultSource') \
    .mode('append') \
    .option("spark.mongodb.output.uri", output_uri_weld) \
    .save()
& my spark.conf file:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0
Error:
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
I found a solution.
Since I couldn't find the right MongoDB driver for Structured Streaming, I worked out another approach.
Now I use a direct connection to MongoDB and foreach(...) instead of foreachBatch(...). My code in the testSpark.py file looks like this:
....
import pymongo
from pymongo import MongoClient
local_url = "mongodb://localhost:27017"
def write_machine_df_mongo(target_df):
    cluster = MongoClient(local_url)
    db = cluster["test_db"]
    collection = db.test1
    post = {
        "machine_id": target_df.machine_id,
        "proc_type": target_df.proc_type,
        "sensor1_id": target_df.sensor1_id,
        "sensor2_id": target_df.sensor2_id,
        "time": target_df.time,
        "sensor1_val": target_df.sensor1_val,
        "sensor2_val": target_df.sensor2_val,
    }
    collection.insert_one(post)

machine_df.writeStream \
    .outputMode("append") \
    .foreach(write_machine_df_mongo) \
    .start()

Read / Write data from HBase using Pyspark

I am trying to read data from HBase using Pyspark and I am getting many weird errors. Below is a sample snippet of my code.
Please suggest any solution.
empdata = ''.join("""
{
    'table': {
        'namespace': 'default',
        'name': 'emp'
    },
    'rowkey': 'key',
    'columns': {
        'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
        'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
    }
}
""".split())

df = sqlContext \
    .read \
    .options(catalog=empdata) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.show()
I have used the below versions:
HBase 2.1.6,
Pyspark 2.3.2, Hadoop 3.1
I have run the code as follows:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml
The error is
An error occurred while calling o71.load. : java.lang.NoClassDefFoundError: org/apache/spark/Logging

How to convert a SnowflakeCursor in to a pySpark dataframe

The result of sending a SQL statement to Snowflake is a SnowflakeCursor. How would I easily convert it into a PySpark DataFrame?
Thanks!
When using Databricks (https://docs.databricks.com/data/data-sources/snowflake.html), we can use spark.read to load the result of a SQL statement into a DataFrame. Note that specifying the sfRole may be the key to gaining access to your database objects.
options = {
    "sfUrl": "https://yourinstance.snowflakecomputing.com/",
    "sfUser": user,
    "sfPassword": pw,
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "Accountadmin",
    "sfWarehouse": "wh"
}

df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", strCheckingSQL) \
    .load()
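As a quick usage sketch, with a hypothetical query string (the original post does not show what strCheckingSQL contains):

# Hypothetical query for illustration only; substitute your own SQL
strCheckingSQL = "SELECT CURRENT_DATE() AS today"

df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", strCheckingSQL) \
    .load()

df.show()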