I'm very new to GCP (Google Cloud Platform), so I hope my question doesn't sound too silly.
Background:
The main goal is to gather a few large tables from BigQuery and apply a few transformations. Because of the size of the tables I'm planning to use Dataproc, deploying a PySpark script; ideally I would be able to use sqlContext to apply a few SQL queries to the DataFrames (tables pulled from BQ). Finally, I could easily dump this info into a file within a Cloud Storage bucket.
Questions :
Can I use import google.datalab.bigquery as bq within my PySpark script?
Is this proposed approach the most efficient, or should I consider an alternative? Keep in mind that I need to create many temporary queries, which is why I thought of Spark.
I expect to use pandas and bq to read the query results as a pandas DataFrame, following this example. Later, I might use sc.parallelize from Spark to transform the pandas DataFrame into a Spark DataFrame. Is this approach the right one?
my script
Update:
After a back-and-forth with @Tanvee, who kindly attended to this question, we concluded that GCP requires an intermediate staging step when you need to read BigQuery data into Dataproc. Briefly, your Spark or Hadoop job needs a temporary Cloud Storage bucket where the table data is staged before it is brought into Spark.
References:
BigQuery Connector
Deployment
Thanks so much.
You will need to use the BigQuery connector for Spark. There are some examples in the GCP documentation here and here. It will create an RDD, which you can convert to a DataFrame, and then you will be able to perform all the typical transformations. Hope that helps.
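For reference, a minimal PySpark sketch of that pattern, loosely following the Dataproc documentation's Hadoop BigQuery connector example; the project, dataset, table, and staging bucket names are placeholders, and the connector jar is assumed to be available on the cluster.

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-to-dataproc").getOrCreate()
sc = spark.sparkContext

# Placeholder project/dataset/table; the connector stages the table
# in a temporary Cloud Storage path before Spark reads it.
conf = {
    "mapred.bq.project.id": "my-project",
    "mapred.bq.gcs.bucket": "my-temp-bucket",
    "mapred.bq.temp.gcs.path": "gs://my-temp-bucket/bq_staging",
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_dataset",
    "mapred.bq.input.table.id": "my_table",
}

# Each record arrives as (row id, JSON string) through the Hadoop input format.
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)

# Keep only the JSON values and let Spark infer a schema for SQL-style work.
df = spark.read.json(table_rdd.map(lambda kv: kv[1]))
df.createOrReplaceTempView("my_table")
spark.sql("SELECT COUNT(*) FROM my_table").show()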
You can directly use the following options to connect to a BigQuery table from Spark.
You can also use the spark-bigquery connector https://github.com/samelamin/spark-bigquery to run your queries directly on Dataproc using Spark.
https://github.com/GoogleCloudPlatform/spark-bigquery-connector is a newer connector, currently in beta. It is a Spark DataSource API for BigQuery and is easy to use.
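A hedged sketch of how the DataSource API version is typically used; the table name and output bucket are placeholders, not taken from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-datasource").getOrCreate()

# Placeholder fully-qualified table name.
df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.my_table")
      .load())

df.createOrReplaceTempView("my_table")
result = spark.sql("SELECT col_a, COUNT(*) AS n FROM my_table GROUP BY col_a")

# Dump the result into a Cloud Storage bucket, as described in the question.
result.write.mode("overwrite").parquet("gs://my-bucket/output/")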
Please refer to the following link:
Dataproc + BigQuery examples - any available?
My company is implementing Azure Data Explorer (ADX) as a backend. They also want to use Databricks for Data Science projects including data exploration. I'm in charge of connecting Databricks to ADX.
I first tried the Azure Kusto package.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
...
df = dataframe_from_result_table(RESPONSE.primary_results[0])
Full steps here
Functionally this works well.
But it completely loses the lazy evaluation feature of both ADX and Databricks/Spark.
I think that's because df is just a pandas DataFrame; also, if I try to convert this to a Hive table it persists the data, which is not what we want: we need fresh online data and don't want a local copy.
The next thing I tried was to load this into a Spark DataFrame. I tried the following code (after installing the relevant libraries):
df = (spark.read
    .format("com.microsoft.kusto.spark.datasource")
    .option("kustoCluster", KUSTO_URI)
    .option("kustoDatabase", KUSTO_DATABASE)
    .option("kustoQuery", "some_table_in_adx")
    .option("kustoAadAppId", CLIENT_ID)
    .option("kustoAadAppSecret", CLIENT_SECRET)
    .option("kustoAadAuthorityID", AAD_TENANT_ID)
    .load())
which again loads the data into the spark dataframe without any issue.
However, performance-wise it's far away from a direct query in ADX. A count on a table of 600 thousand records takes sub-second time in ADX, while it takes more than 20 seconds in the Databricks notebook on a DS3_V2 (14 GB, 4 cores).
Before even considering a saveAsTable or createOrReplaceTempView, I wonder why I'm experiencing this performance issue. So my questions are:
Does this connection use lazy loading (I know doing a count is not the right way to check that)?
If not, is there any way to get lazy loading instead of loading the full table into a DataFrame before doing operations?
What would happen if I create a Hive table from the Spark DataFrame? Will it copy the data, or will it remain a virtual table pointing to ADX?
Thanks for your help
For the Python part -
A pandas DataFrame is not a Spark DataFrame, so it is not lazily computed; to use the two together you can use Spark's parallelize/createDataFrame to turn the pandas DataFrame into a Spark one.
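A minimal sketch of that handoff, assuming df is the pandas DataFrame returned by dataframe_from_result_table above and spark is the ambient session in the Databricks notebook:

# The data has already been pulled to the driver by the Kusto SDK, so the
# ADX read itself is not lazy; only the subsequent Spark work is distributed.
spark_df = spark.createDataFrame(df)
spark_df.createOrReplaceTempView("adx_snapshot")
spark_df.count()   # Spark action, but it runs over the copied data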
For the Spark ADX connector -
This is indeed lazy loading. Nothing is evaluated until some action is requested, like the count.
If the count is done with Spark syntax, i.e. spark.read.kusto...count(), then all the data is first brought into Spark and the count runs on it there, so 20 seconds sounds legitimate. To compare with a native query, simply change the value of "kustoQuery" to "some_table_in_adx | count", which runs the count on the ADX side and uploads only the final integer result to Spark.
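A sketch of that push-down variant, reusing the option names from the question's snippet:

# Push the aggregation down to ADX: only the final count crosses into Spark.
count_df = (spark.read
    .format("com.microsoft.kusto.spark.datasource")
    .option("kustoCluster", KUSTO_URI)
    .option("kustoDatabase", KUSTO_DATABASE)
    .option("kustoQuery", "some_table_in_adx | count")
    .option("kustoAadAppId", CLIENT_ID)
    .option("kustoAadAppSecret", CLIENT_SECRET)
    .option("kustoAadAuthorityID", AAD_TENANT_ID)
    .load())

count_df.show()   # a single row holding the count computed on the ADX side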
The connector offers either a simple Kusto query or a distributed export command via the readMode option; ForceSingleMode performs a simple Kusto query from the driver node, as explained in the docs.
As Hive tables operate over DataFrames, once you create a DataFrame from the Kusto read, table operations on it will also be lazily evaluated.
I'm currently developing a Glue ETL script in PySpark that needs to query my Glue Data Catalog's partitions and join that information with other Glue tables programmatically.
At the moment, I'm able to do this with Athena using SELECT * FROM db_name.table_name$partitions JOIN table_name2 ON ..., but it looks like this doesn't work with Spark SQL. The closest thing I've been able to find is SHOW PARTITIONS db_name.table_name, which doesn't seem to cut it.
Does anyone know an easy way I can leverage Glue ETL / Boto3 (Glue API) / PySpark to query my partition information in a SQL-like manner?
For the time being, the only possible workaround seems to be the get_partitions() method in Boto3, but that looks like a lot more complex work to deal with on my end. I already have my Athena queries to get the information I need, so if there's ideally a way to replicate getting my tables' partitions in a similar way using SQL, that'd be amazing. Please let me know, thank you!
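For illustration, a hedged sketch of that Boto3 workaround, turning get_partitions() output into something joinable in Spark SQL; the database, table, and partition key names are placeholders:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
glue = boto3.client("glue")

# Page through the catalog partitions of a (placeholder) table.
rows = []
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="db_name", TableName="table_name"):
    for part in page["Partitions"]:
        # 'Values' is ordered to match the table's partition keys.
        rows.append(tuple(part["Values"]))

# Placeholder partition key names; adjust to the actual table definition.
partitions_df = spark.createDataFrame(rows, ["year", "month", "day"])
partitions_df.createOrReplaceTempView("table_name_partitions")

# Roughly equivalent to the Athena $partitions join from the question.
spark.sql("""
    SELECT *
    FROM table_name_partitions p
    JOIN table_name2 t ON p.year = t.year
""").show()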
For those interested, an alternative workaround I've been able to find but still need to test out is the Athena API with the Boto3 client. I may also possibly use the AWS Wrangler integrated with Athena to retrieve a dataframe.
I have my Spark project in Scala, and I want to use Redshift as my data warehouse. I found that the spark-redshift repo exists, but Databricks made it private a couple of years ago and no longer supports it publicly.
What's the best option right now to deal with Amazon Redshift and Spark (Scala)?
This is a partial answer, as I have only used Spark->Redshift in a real-world use case and never benchmarked Spark's read performance from Redshift.
When it comes to writing from Spark to Redshift, by far the most performant way I could find was to write Parquet to S3 and then use Redshift COPY to load the data. Writing to Redshift through JDBC also works, but it is several orders of magnitude slower than the former method. Other storage formats could be tried as well, but I would be surprised if any row-oriented format could beat Parquet, since Redshift stores data internally in a columnar format. Another columnar format supported by both Spark and Redshift is ORC.
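Below is a hedged PySpark sketch of that write path (the same pattern applies from Scala); df stands for the DataFrame being written, and the bucket, IAM role, cluster endpoint, and table names are placeholders. The COPY is issued over an ordinary psycopg2 connection to Redshift.

import psycopg2

# 1) Write the Spark DataFrame as Parquet to a staging prefix on S3.
df.write.mode("overwrite").parquet("s3a://my-bucket/staging/my_table/")

# 2) Ask Redshift to bulk-load the staged Parquet files with COPY.
copy_sql = """
    COPY my_schema.my_table
    FROM 's3://my-bucket/staging/my_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...")
try:
    with conn, conn.cursor() as cur:   # commits on success
        cur.execute(copy_sql)
finally:
    conn.close()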
I never came across a use case of reading large amounts of data from Redshift with Spark, as it feels more natural to load all the data into Redshift and use it there for joins and aggregations. It is probably not cost-efficient to use Redshift just as bulk storage and use another engine for joins and aggregations. For reading small amounts of data, JDBC works fine. For large reads, my best guess is the UNLOAD command and S3.
I have little experience with Hive and am currently learning Spark with Scala. I am curious to know whether Hive on Tez is really faster than Spark SQL. I searched many forums with test results, but they compared older versions of Spark and most were written in 2015. The main points are summarized below:
ORC does about the same as Parquet in Spark
The Tez engine gives performance comparable to the Spark engine
Joins are better/faster in Hive than in Spark
I feel like Hortonworks supports Hive more than Spark, and Cloudera vice versa.
Sample links:
link1
link2
link3
Initially I thought Spark would be faster than anything because of its in-memory execution, but after reading some articles I realized that the existing Hive is also being improved with new concepts like Tez, ORC, LLAP, etc.
We currently run PL/SQL on Oracle and are migrating to big data since volumes are increasing. My requirements are a kind of ETL batch processing, with detailed data involved in every weekly batch run. Data volume will increase greatly soon.
Input/lookup data are in CSV/text formats and are loaded into tables
Two input tables with 5 million rows and 30 columns
30 lookup tables used to generate each column of the output table, which contains around 10 million rows and 220 columns
Multiple joins involved, such as inner and left outer, since many lookup tables are used
Kindly advise which of the methods below I should choose for better performance and readability, and to make it easy to include minor column updates in future production deployments.
Method 1:
Hive on Tez with ORC tables
Python UDFs via the TRANSFORM option
Joins with performance tuning such as map joins
Method 2:
Spark SQL with Parquet format, converted from text/CSV
Scala for UDFs
Multiple inner and left outer joins performed in Spark
The best way to implement a solution to your problem is as below.
To load the data into tables, Spark looks like a good option to me. You can read the tables from the Hive metastore, perform the incremental updates using some kind of window functions, and register the results in Hive. While ingesting, as data is populated from various lookup tables, you are able to write the code programmatically in Scala.
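A hedged PySpark sketch of that flow (table, key, and column names are placeholders; the equivalent calls exist in the Scala API):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

# enableHiveSupport lets Spark read from and register tables in the Hive metastore.
spark = (SparkSession.builder
         .appName("weekly-etl")
         .enableHiveSupport()
         .getOrCreate())

input_df = spark.table("staging.weekly_input")      # placeholder input table
lookup_df = spark.table("reference.lookup_one")     # placeholder lookup table

# Example incremental logic: keep only the latest record per business key.
w = Window.partitionBy("business_key").orderBy(F.col("load_ts").desc())
latest = (input_df
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))

# Join against the lookups, then register the result back in Hive so business
# users can query it through their usual Hive (Tez) tooling.
result = latest.join(lookup_df, on="lookup_key", how="left")
result.write.mode("overwrite").format("orc").saveAsTable("warehouse.weekly_output")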
But at the end of the day, there needs to be a query engine that is very easy to use. Since your Spark program registers the tables with Hive, you can use Hive.
Hive supports three execution engines:
Spark
Tez
Mapreduce
Tez is mature; Spark is evolving with various contributions from Facebook and the community.
Business users can understand Hive very easily as a query engine, as it is much more mature in the industry.
In short, use Spark to process the data for daily processing and register the results with Hive.
Create business users in Hive.
I'm new to Spark. I'm a web developer by background, not familiar with big data.
Let's say I have a portal website. Users' behavior and actions are stored in 5 sharded MongoDB clusters.
How do I analyze this with Spark?
Or can Spark get the data directly from any database (Postgres/MongoDB/MySQL/...)?
I ask because most websites use a relational DB as the back-end database.
Should I export all the data in the website's databases into HBase?
I store all the user logs in PostgreSQL; is it practical to export the data into HBase or another database Spark prefers?
It seems that would create a lot of duplicated data if I copy it into a new database.
Does my big data setup need any framework besides Spark?
To analyze the data in the website's databases,
I don't see the reason I would need HDFS, Mesos, ...
How can I make Spark workers access the data in the PostgreSQL databases?
I only know how to read data from a text file,
and I have seen some code about loading data from HDFS://.
But I don't have an HDFS system now; should I set one up for this purpose?
Spark is a distributed compute engine, so it expects files to be accessible from all nodes. Here are some choices you might consider:
There seems to be a Spark-MongoDB connector. This post explains how to get it working.
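A hedged sketch of reading a collection through that connector, using the 2.x/3.x option names; the connection URI, database, collection, and package version are placeholders, and newer 10.x releases of the connector use different option names.

from pyspark.sql import SparkSession

# Requires the connector on the classpath, e.g. submitted with
#   --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
spark = (SparkSession.builder
         .appName("mongo-analytics")
         .config("spark.mongodb.input.uri",
                 "mongodb://mongos-host:27017/portal.user_events")
         .getOrCreate())

events = spark.read.format("mongo").load()
events.createOrReplaceTempView("user_events")
spark.sql("SELECT action, COUNT(*) AS n FROM user_events GROUP BY action").show()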
Export the data out of MongoDB into Hadoop, and then use Spark to process the files. For this, you need to have a Hadoop cluster running.
If you are on Amazon, you can put the files in S3 and access them from Spark.