I am working on an IoT data pipeline where messages arrive every second from multiple devices into a Postgres database. Postgres holds data for only two days; after two days the data is flushed, so at any point it contains only the last two days. Now I need to archive data from Postgres to HDFS daily. The parameters I have are:
deviceid, timestamp, year, month, day, temperature, humidity
I want to archive it daily into HDFS and query that data with Hive. For that I need to create an external partitioned table in Hive with deviceid, year, and month as partitions. I have tried the following options, but they are not working:
I have tried using Sqoop for the data copy, but it cannot create dynamic folders based on the different deviceid, year, and month values, so the external Hive table cannot pick up the partitions.
I used Sqoop import with the --hive-import attribute so that data could be copied directly into a Hive table, but in this case it overwrites the existing table, and I am also not sure whether this works for a partitioned table.
Please suggest some solutions for the archival.
Note: I am using Azure services, so Azure Data Factory is also an option.
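For concreteness, the kind of daily job and partition layout I am aiming for looks roughly like the sketch below (connection details, table names, and paths are made up, and the Postgres JDBC driver would need to be on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-archival").enableHiveSupport().getOrCreate()

# Pull the last day's readings from Postgres over JDBC (placeholder connection).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://pg-host:5432/iot")
      .option("dbtable",
              "(SELECT deviceid, \"timestamp\", year, month, day, temperature, humidity "
              "FROM readings WHERE \"timestamp\" >= now() - interval '1 day') t")
      .option("user", "archiver")
      .option("password", "secret")
      .load())

# Write to HDFS with dynamic partition folders deviceid=/year=/month=.
(df.write
   .mode("append")
   .partitionBy("deviceid", "year", "month")
   .parquet("hdfs:///archive/iot_readings"))

# The external Hive table would point at hdfs:///archive/iot_readings and be
# PARTITIONED BY (deviceid, year, month); new folders can be registered after each
# run with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.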
I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and saving it in parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using df.write.saveAsTable('orders'); however, I am running into some issues doing the incremental load following the initial load. In particular, I have not been able to find a way to reliably insert/update information into an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using df.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
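For reference, a minimal sketch of this read-modify-overwrite pattern (the table name, key column, and incoming-data path are simplified placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

current = spark.table("orders")
updates = spark.read.parquet("abfss://lake@mystorage.dfs.core.windows.net/orders_incremental")

# "Upsert" inside the DataFrame: keep the incoming rows, plus the existing rows
# that have no matching key in the incoming batch.
merged = updates.unionByName(
    current.join(updates.select("order_id"), on="order_id", how="left_anti")
)

# This is the write that fails, because 'orders' is still part of merged's lineage:
merged.write.saveAsTable("orders", mode="overwrite", format="parquet")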
A solution indicated by this suggests creating a temporary table and then inserting from that, but that still results in the above error.
Another solution in this post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this, Spark gives me FileNotFound errors regarding metadata.
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.
I am using the Export Collections to BigQuery extension with Firestore to export data to BigQuery. I can export the data, but the table is not getting partitioned, even though I enabled table partitioning when configuring the extension at creation time.
I had the same issue today when exporting data from Firestore to BigQuery and doing a historical load with the npx firebaseextensions/fs-bq-import-collection command.
These are my findings.
If you configure the extension to partition based on ingestion time, you have to wait for the first new record to trigger the streaming extension, which then creates the partitioned table.
If you just create an extension and immediately execute a historical load, the table created won't be partitioned. I tested this 3 times today.
A question for you -- did you execute a historical load, or did you import only new records?
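If it helps, here is a quick way to check whether the exported table actually ended up partitioned, using the BigQuery Python client (the project, dataset, and table names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.firestore_export.my_collection_raw_changelog")
print(table.time_partitioning)  # None means the table was created without partitioning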
I am currently scraping data and dumping it into a Cloud SQL Postgres database. This data grows quickly, by roughly 3 GB/day, and I'm looking to keep data for at least 3 months, so I need an efficient way to execute queries. Therefore, I've connected my Cloud SQL instance to BigQuery. The following is an example of a query that I'm running on BigQuery, but I'm skeptical: I'm not sure whether the query is being executed in Postgres or in BigQuery.
SELECT * FROM EXTERNAL_QUERY("project.us-cloudsql-instance", "SELECT date_trunc('day', created_at) d, variable1, AVG(variable2) FROM my_table GROUP BY 1,2 ORDER BY d;");
It seems like the query is being executed in PostgreSQL, though, not BigQuery. Is this true? If it is, is there a way for me to load data from PostgreSQL into BigQuery in real time and execute queries directly in BigQuery?
I think you are using federated queries. These queries are intended to let BigQuery collect data from a Cloud SQL instance:
BigQuery Cloud SQL federation enables BigQuery to query data residing in Cloud SQL in real-time, without copying or moving data. It supports both MySQL (2nd generation) and PostgreSQL instances in Cloud SQL.
The query is executed in Cloud SQL, which can lead to lower performance than running it in BigQuery.
EXTERNAL_QUERY executes the query in Cloud SQL and returns the results to BigQuery as a temporary table.
Now, the current ways to load data into BigQuery are: from Cloud Storage, from other Google services such as Google Ad Manager and Google Ads, from a readable data source, by inserting individual records using streaming inserts, with DML statements, or with the BigQuery I/O transform in a Dataflow pipeline.
This solution, which is pretty similar to what you need, is well worth a look:
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
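For illustration, the high-watermark extract quoted above could look roughly like this in plain Python (the table, column, and connection details are made-up assumptions, and real code would keep the watermark in the orchestrator's state rather than hard-coding it):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql+psycopg2://etl:secret@10.0.0.5:5432/scraper")

# Normally persisted by the orchestrator (e.g. an Airflow Variable or a state table).
last_watermark = "2020-05-01 00:00:00"

query = f"""
    SELECT *
    FROM my_table
    WHERE created_at >= TIMESTAMP '{last_watermark}' - INTERVAL '1 hour'  -- lookback for late rows
"""

batch = pd.read_sql(query, engine)
batch.to_parquet("/tmp/my_table_batch.parquet")  # then upload to GCS and load into BigQuery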
Although, technically, it is possible to rewrite the query so that the aggregation runs in BigQuery:
SELECT TIMESTAMP_TRUNC(created_at, DAY) AS d, variable1, AVG(variable2)
FROM EXTERNAL_QUERY("project.us-cloudsql-instance",
                    "SELECT created_at, variable1, variable2 FROM my_table")
GROUP BY 1, 2
ORDER BY d;
It is not recommended, though. It is better to do as much aggregation and filtering as possible in Cloud SQL to reduce the amount of data that has to be transferred from Cloud SQL to BigQuery.
I have to create an app which transfers data from Snowflake to Postgres every day. Some tables in Postgres are truncated before migration and all data from the corresponding Snowflake table is copied, while for other tables, only data newer than the last timestamp in Postgres is copied from Snowflake.
This job has to run sometime at night, not during the day when customers are using the service.
What is the best way to do this ?
Do you have constraints limiting your choices in:
ETL or bulk data tooling
Development languages?
According to this site, you can create a foreign data wrapper in PostgreSQL for Snowflake.
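In the simplest case (plain Python, no dedicated ETL tool), the two copy modes described in the question could look roughly like the sketch below; the connection settings, table names, and timestamp column are assumptions, and everything is fetched into memory for brevity:

import snowflake.connector
import psycopg2
from psycopg2.extras import execute_values

sf = snowflake.connector.connect(account="my_account", user="etl_user", password="secret",
                                 warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC")
pg = psycopg2.connect("host=pg-host dbname=app user=etl_user password=secret")

def full_refresh(table):
    # Full refresh: truncate the Postgres table and reload everything from Snowflake.
    rows = sf.cursor().execute(f"SELECT * FROM {table}").fetchall()
    with pg.cursor() as dst:
        dst.execute(f"TRUNCATE TABLE {table}")
        execute_values(dst, f"INSERT INTO {table} VALUES %s", rows)
    pg.commit()

def incremental_copy(table, ts_column):
    # Incremental: copy only rows newer than the latest timestamp already in Postgres.
    with pg.cursor() as dst:
        dst.execute(f"SELECT MAX({ts_column}) FROM {table}")
        last_ts = dst.fetchone()[0]
    rows = sf.cursor().execute(
        f"SELECT * FROM {table} WHERE {ts_column} > %s", (last_ts,)
    ).fetchall()
    if rows:
        with pg.cursor() as dst:
            execute_values(dst, f"INSERT INTO {table} VALUES %s", rows)
        pg.commit()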
Currently I have exported the data as sharded files to Google Cloud Storage, downloaded them to a server, and am streaming them into the partitioned table, but the problem is that it takes a long time: it streams about 1 GB in 40 minutes. Please help me make it faster. My machine has 12 cores and 20 GB of RAM.
You can load data from Google Cloud Storage directly into your partition using a load job (through the API, the bq CLI, or other methods).
To update data in a specific partition, append a partition decorator to the name of the partitioned table when loading data into the table. A partition decorator represents a specific date and takes the form:
$YYYYMMDD
For example, the following command replaces the data in the entire partition for the date January 1, 2016 (20160101) in a partitioned table named mydataset.table1 with content loaded from a Cloud Storage bucket:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON 'mydataset.table1$20160101' gs://[MY_BUCKET]/replacement_json.json
Note: Because partitions in a partitioned table share the table schema, replacing data in a partition will not replace the schema of the table. Instead, the schema of the new data must be compatible with the table schema. To update the schema of the table with the load job, use configuration.load.schemaUpdateOptions.
Read more https://cloud.google.com/bigquery/docs/creating-partitioned-tables
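If you prefer doing this from code rather than the bq CLI, a minimal sketch with the BigQuery Python client (using the same bucket, table, and partition as the example above) could look like this:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # equivalent of --replace
)

# The "$20160101" partition decorator targets just that day's partition.
load_job = client.load_table_from_uri(
    "gs://[MY_BUCKET]/replacement_json.json",
    "mydataset.table1$20160101",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish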