I want to extract data from the Strava API to a personal Snowflake or BigQuery warehouse.
I already have the script for the API. I was thinking of using Airflow to schedule the extract daily. But if I run Airflow locally, it will stop every time I close my computer, right?
So my question is: what is the best way to extract and load daily data (very small volume) without depending on a running computer?
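For reference, the kind of DAG I have in mind is roughly this (just a sketch: run_strava_extract is a placeholder for my existing script, and a managed Airflow service such as Cloud Composer or MWAA is one way to avoid the local-machine dependency):

```python
# Minimal sketch of a daily extract DAG (Airflow 2.x).
# `run_strava_extract` is a placeholder for the existing API/load script.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_strava_extract():
    # Placeholder: call the Strava API script and load results into the warehouse.
    ...


with DAG(
    dag_id="strava_daily_extract",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_and_load",
        python_callable=run_strava_extract,
    )
```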
Thanks a lot!
Lucas
Related
I am using the Delta Lake OSS version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using Delta Lake.
My question is: is there a well-known way to access this gold table data and deliver it to a web dashboard, for example?
In my understanding, you need a running Spark session to query a Delta table.
So one possible solution could be to write a web API which executes these Spark queries.
You could also write the gold results to a database like Postgres to access them, but that seems like just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where the data is located - HDFS/S3/...), etc. Possible approaches are:
Run Spark in local mode inside your application - it may require a lot of memory, etc.
Run the Thrift JDBC/ODBC server as a separate process and access the data via JDBC/ODBC
Read the data directly using the Delta Standalone Reader library for the JVM, or via the delta-rs library, which works with Rust/Python/Ruby (see the sketch below)
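As an illustration, reading a gold table with the delta-rs Python bindings could look roughly like this (a sketch: the table path is a placeholder, and for object-store paths you would also need to configure credentials):

```python
# Minimal sketch using the deltalake package (delta-rs Python bindings).
# No Spark session is needed; the table path is a placeholder.
from deltalake import DeltaTable

dt = DeltaTable("/data/gold/sales_cube")  # placeholder path to the gold table

# Load into a pandas DataFrame, e.g. to serve from a web API
df = dt.to_pandas()
print(df.head())
```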
We have a screen which has a number of fields from different tables. I need to extract those fields from the tables and keep the data in an Excel sheet. How can I do this?
Is this a "one-time" data-transfer, or will it be an ongoing, automated process?
IMHO, for a "one-time" data-extraction, the easiest way to accomplish that is using ODBC. Historically, I've used ODBC to import the data into Microsoft Access. From there, it's extremely easy to export the data into Excel.
For a regularly-occurring, automated method, I think using the CpyToImpF command works the best. It takes a little trial-and-error to get the process working, but once you've got it set up, it can run in a regularly scheduled job to export the data. (Google the syntax for the command, and try it yourself.)
HTH,
Dave
There is a tool called iEXL which will create .xlsx spreadsheets natively on the AS400.
If you are after only data from one screen, we will help you, and the price would be adjusted to take this into account.
WWW.iEXLSOFTWARE.COM
I think I have a pretty simple, conceptual question.
I have a database where I run queries. Let's just say it's a Postgres database to make things simple.
I would like to run a certain query on a schedule ... let's say every night ... and store the results of that query in a .csv file that is uploaded automatically to a certain S3 bucket.
I know people have been able to implement this workflow with Jenkins, but I am looking for a simpler solution. Any ideas?
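To make the workflow concrete, a minimal sketch of the nightly job I have in mind (psycopg2 and boto3 are assumptions, and the query, file path, and bucket are placeholders):

```python
# Sketch of a nightly "query -> CSV -> S3" job.
# psycopg2 and boto3 are assumed; the DSN, query, and bucket are placeholders.
import csv

import boto3
import psycopg2


def export_query_to_s3():
    conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT * FROM my_table")  # placeholder for the scheduled query
        with open("/tmp/result.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col.name for col in cur.description])  # header row
            writer.writerows(cur.fetchall())

    # Upload the CSV to the target bucket
    boto3.client("s3").upload_file("/tmp/result.csv", "my-bucket", "exports/result.csv")


if __name__ == "__main__":
    export_query_to_s3()  # schedule with cron, e.g. "0 2 * * *"
```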
I am trying to set up real-time change data capture between two different MySQL databases using Talend Studio. I was able to successfully create a job that uses the publish/subscribe model and picks up only the changed data from the source and populates it in the target database.
I could not find documentation on setting up CDC in real time, i.e. as soon as a new row is inserted in the source database it will be picked up by the job and populated in the target database. The Talend job would run continuously to look for possible changes in the source.
My question: is scheduling the Talend job with some scheduler at the desired interval the only option in this case?
Thanks in advance.
You could also use database triggers for create, update, and delete, and use those triggers to push the data somewhere or to start a process.
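For example, one common pattern is to have the triggers write into a changelog table and run a small poller against it; a rough sketch of the polling side (all names are placeholders, and pymysql is just one possible client):

```python
# Rough sketch: triggers on the source table write into a `changelog` table,
# and this poller pushes new rows onward. All names/credentials are placeholders.
import time

import pymysql


def poll_changelog():
    conn = pymysql.connect(host="localhost", user="etl", password="secret", database="source_db")
    last_id = 0
    while True:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, row_id, operation FROM changelog WHERE id > %s ORDER BY id",
                (last_id,),
            )
            for change_id, row_id, operation in cur.fetchall():
                # Push the change to the target database or a queue here
                print(operation, row_id)
                last_id = change_id
        conn.commit()
        time.sleep(1)  # near-real-time polling interval


if __name__ == "__main__":
    poll_changelog()
```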
My company is starting to use Mongo, and we are thinking about the best way to extract data from MongoDB and send it to our data warehouse.
My question focuses on the extract part of the process. As I see it, the best way is to expose an API on the service that is built on top of Mongo, which the ETL process (invoked by a job from the data warehouse) will call with some specific query that will probably filter by a time range (i.e. a start date and end date for every record).
Does that sound right, am I missing something, or maybe there is a better way than that?
Initially I was thinking about running mongoexport every X interval, but according to the documentation it does not seem so good performance-wise.
Thanks in advance!
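To make the idea concrete, the date-windowed extract I have in mind would look roughly like this (a sketch with pymongo; the database, collection, and field names are placeholders):

```python
# Rough sketch of a date-windowed extract with pymongo.
# Connection string, db/collection, and field names are placeholders.
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["app_db"]["events"]

# Pull only the records updated inside the ETL window (e.g. the last day)
end_date = datetime.utcnow()
start_date = end_date - timedelta(days=1)

for doc in collection.find({"updated_at": {"$gte": start_date, "$lt": end_date}}):
    # Hand each document to the ETL/load step here
    print(doc["_id"])
```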
Give Pentaho Kettle a try.
https://anonymousbi.wordpress.com/2012/07/25/creating-pentaho-reports-from-mongodb/
I am using Alteryx Designer to extract from MongoDB with the dedicated connector and load the data into Tableau, with optional data prep in between.
Works pretty well!
Alteryx can write to most DBs though...