Polars DataFrame save to sql - python-polars

Is there a way to save Polars DataFrame into a database, MS SQL for example?
ConnectorX library doesn’t seem to have that option.

Polars doesn't support direct writing to a database. You can proceed in two ways:
Export the DataFrame to an intermediate format (such as .csv using .write_csv()), then import it into the database.
Process it in memory: you can convert the DataFrame into a simpler data structure using .to_dicts(). The result is a list of dictionaries, each containing one row in key/value format. At that point it is easy to insert the rows into a database using SQLAlchemy or a library specific to your database of choice.
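A minimal sketch of the in-memory route, under some assumptions: the rows list is hard-coded to stand in for the output of df.to_dicts(), the standard-library sqlite3 module stands in for the real database driver, and the table name scores and the helper insert_rows are made up for illustration. With MS SQL you would swap in your own connection and that driver's placeholder syntax.

```python
import sqlite3

# Stand-in for df.to_dicts(): one dict per row, keyed by column name.
rows = [
    {"id": 1, "name": "alice", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.0},
]

def insert_rows(conn, table, rows):
    """Insert a list of row dicts into an existing table; return the row count."""
    if not rows:
        return 0
    columns = list(rows[0])
    placeholders = ", ".join(f":{c}" for c in columns)
    # Table and column names are interpolated into the statement,
    # so they must come from trusted code, never from user input.
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, rows)
    return len(rows)

# In-memory SQLite database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
inserted = insert_rows(conn, "scores", rows)
stored = conn.execute("SELECT id, name, score FROM scores ORDER BY id").fetchall()
```

The values are passed as bound parameters rather than formatted into the SQL string, which avoids quoting problems in the data itself.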

Related

spark Dataframe vs pandas-on-spark Dataframe

I'm quite new to PySpark, and I'm confused about the difference between a Spark DataFrame (created, for example, from an RDD) and a pandas-on-Spark DataFrame.
Are those the same object? Looking at the type, it seems they are different classes.
What's the core difference, if any? (I know that with a pandas-on-Spark DataFrame you can use almost the same syntax as pandas on a distributed DataFrame, but I'm wondering if this is the only difference.)
Thanks
Answering directly:
Are those the same object? Looking at the type, it seems they are different classes.
No, they are completely different objects (classes).
What's the core difference, if any?
A PySpark DataFrame is an object from the PySpark library with its own API. It can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
A pandas-on-Spark DataFrame and a pandas DataFrame are similar. However, the former is distributed and the latter lives on a single machine. When converting from one to the other, the data is transferred between multiple machines and the single client machine.
A pandas DataFrame is an object from the pandas library, also with its own API, and it can also be constructed in a wide range of ways.
Also, I recommend checking this documentation about pandas on Spark.

Ways of Pyspark tabular data manipulation

I'm very new to PySpark and I'm kind of confused about data manipulation. What I learned lately is that we can manipulate tabular data with SQL queries or with the PySpark DataFrame built-in methods. My questions are:
Is there another way to manipulate tabular data in PySpark other than with SQL queries or with the DataFrame built-in methods?
Why do some people manipulate the data with SQL and others with the built-in methods? I mean, it's mentioned that Spark DataFrames can act like SQL tables, so why use the built-in functions?
As a best practice, when should the data be manipulated with SQL queries and when with the DataFrame built-in methods?
I'm sorry if this is a basic question, but I'm very new at this and I have been looking for articles to answer my questions, to no avail.

Importing spark.sql.catalyst library in spark

I am trying to find a way to interpret the table names from spark sql.
The answer given here is in Scala: "How to get table names from SQL query?"
I want to change this into PySpark.
For that, I want to import the library org.apache.spark.sql.catalyst.analysis.UnresolvedRelation (or its equivalent) into PySpark.
Can this be done?

AWS Glue: How to handle nested JSON with varying schemas

Objective:
We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.
Background:
The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.
Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.
What we've tried/referenced so far:
Pointing the AWS Glue Crawler to the S3 bucket results in hundreds of tables with a consistent top-level schema (the attributes listed above), but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL Job that would read from all of these tables and load them into a single table.
Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). When setting columns to STRUCT, it requires a defined schema - but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario (since we want to keep some of the JSON intact, rather than flattening it entirely). Redshift Spectrum supports scalar JSON data as of a couple of weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appears to help with handling the hundreds of tables created by the Glue Crawler.
Question:
How would we use Glue (or some other method) to allow us to parse just the first level of these records - while ignoring the varying schemas below the elements at the top level - so that we can access it from Spectrum or load it physically into Redshift?
I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.
Thanks!
I'm not sure you can do this with a table definition, but you can accomplish this with an ETL job by using a mapping function to cast the top level values as JSON strings. Documentation: [link]
import json
from awsglue.transforms import Map
# Mapping function: serialize each top-level value to a JSON string
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec
# glueContext is the GlueContext created at the top of the Glue job script
old_df = glueContext.create_dynamic_frame.from_options(
    's3',
    {"paths": ['s3://...']},
    "json")
# Apply flatten to every DynamicRecord in the DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)
From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or directly into Redshift from my understanding, although I haven't tried it.
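To see what that mapping function does to a single record, here is a small sketch that runs outside Glue on a plain dict. The sample record is hypothetical, shaped only loosely like the top-level fields named in the question, and this flatten builds a new dict instead of modifying the record in place.

```python
import json

def flatten(rec):
    """Replace every top-level value with its JSON string form."""
    return {key: json.dumps(value) for key, value in rec.items()}

# Hypothetical record; real DynamoDB Streams records carry more fields.
record = {
    "EventName": "MODIFY",
    "SequenceNumber": "111",
    "Keys": {"id": {"S": "42"}},
    "NewImage": {"id": {"S": "42"}, "tags": {"L": [{"S": "a"}]}},
}

flat = flatten(record)

# Nested values become strings that can be parsed back later on demand.
restored = json.loads(flat["NewImage"])
```

One thing this makes visible: json.dumps also wraps plain string values in quotes, so "MODIFY" becomes '"MODIFY"'. If that is unwanted, a variant could serialize only values that are dicts or lists and pass scalars through unchanged.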
This is a limitation of Glue as of now. Have you taken a look at Glue Classifiers? It's the only piece I haven't used yet, but it might suit your needs. You can define a JSON path for a field or something like that.
Other than that, Glue Jobs are the way to go. It's Spark in the background, so you can do pretty much everything. Set up a development endpoint and play around with it. I've run into various roadblocks over the last three weeks and decided to completely forgo any and all Glue functionality and use only Spark; that way it's both portable and actually works.
One thing you might need to keep in mind when setting up the dev endpoint is that the IAM role must have a path of "/", so you will most probably need to create a separate role manually that has this path. The one automatically created has a path of "/service-role/".
You should add a Glue classifier, preferably with the JSON path $[*].
When you crawl the JSON file in S3, it will read the first line of the file.
You can create a Glue job to load the Data Catalog table for this JSON file into Redshift.
My only problem here is that Redshift Spectrum has problems reading JSON tables in the Data Catalog.
Let me know if you have found a solution.
The procedure I found useful for flattening nested JSON:
ApplyMapping for the first level as datasource0;
Explode struct or array objects to get rid of the element level: df1 = datasource0.toDF().select(id,col1,col2,...,explode(coln).alias(coln)), where explode requires from pyspark.sql.functions import explode;
Select the JSON objects that you would like to keep intact: intact_json = df1.select(id, itct1, itct2,..., itctm);
Transform df1 back to a DynamicFrame, Relationalize it, and drop the intact columns with dataframe.drop_fields(itct1, itct2,..., itctm);
Join the relationalized table with the intact table on the 'id' column.
As of 12/20/2018, I was able to manually define a table with the first-level JSON fields as columns of type STRING. In the Glue script, the DynamicFrame then has those columns as strings. From there, you can do an Unbox operation of type json on the fields. This parses the fields as JSON and derives the real schema. Combining Unbox with Filter allows you to process heterogeneous JSON schemas from the same input, provided you can loop through a list of schemas.
One word of caution, however: this is incredibly slow. I think that Glue is downloading the source files from S3 during each iteration of the loop. I've been trying to find a way to persist the initial source data, but it looks like .toDF derives the schema of the string JSON fields even if you specify them as Glue StringType. I'll add a comment here if I can figure out a solution with better performance.

Data Analysis Scala on Spark

I am new to Scala, and I have to use Scala and Spark's SQL, MLlib and GraphX to perform some analysis on a huge data set. The analyses I want to do are:
Customer life cycle value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located in a Hadoop cluster.
My question is about the optimal approach to access the data and perform the above calculations:
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? Or
Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? Or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
A DataFrame gives you SQL-like syntax to work with the data, whereas an RDD gives you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine will optimise your queries, much like SQL query optimisation. This is not available for RDDs.
As you are new to Scala, it's highly recommended to use the DataFrame API initially and pick up the RDD API later as requirements dictate.
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It can infer data types automatically (via its inferSchema option). If the file has a header row, it can use that for the column names; otherwise you can construct the schema using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0, it supports the CSV data source by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use.
https://github.com/databricks/spark-csv/issues/367