I have 6.5 GB of data (900,000 rows) in my input table (tPostgresqlInput), and I am trying to load the same data into my output table (tPostgresqlOutput). While running the job I get no response from the input table. Is there any solution for loading the data? Please refer to my attachment.
You may need to develop a strategy to retrieve more manageable chunks of data, for example by dividing the data up based on row IDs. That way it does not take as much memory or time to retrieve the data.
You could also increase the default memory limit for the job from 1 GB to a higher number.
If your job runs on the same network as your database server, that can also improve performance.
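If you go the chunking route, here is a minimal sketch of the idea, outside of Talend and using pandas/SQLAlchemy purely for illustration; the table name, the id column, and the connection string are all assumptions:

```python
# Hypothetical sketch: copy the table in ID-ranged slices instead of one big query.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")  # assumed URL

CHUNK = 100_000  # rows per slice; tune to the memory you can spare
max_id = pd.read_sql_query("SELECT max(id) AS max_id FROM my_table", engine)["max_id"].iloc[0]

for start in range(0, int(max_id) + 1, CHUNK):
    slice_df = pd.read_sql_query(
        f"SELECT * FROM my_table WHERE id >= {start} AND id < {start + CHUNK}", engine
    )
    # append each slice to the target table so memory stays bounded
    slice_df.to_sql("my_table_copy", engine, if_exists="append", index=False)
```

In Talend itself the same idea corresponds to parameterising the input component's query with a WHERE clause on an ID range and iterating over the ranges.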
Make sure you enable Use Cursor in the input component's advanced settings; the default value of 1,000 is fine.
Also enable Batch Size on the output, which does something similar.
With these enabled, Talend will work with 1,000 records at a time.
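For a sense of what the cursor setting does, here is a rough, hypothetical equivalent in psycopg2 (the connection details and table name are made up): a server-side cursor streams the result set in fixed-size batches instead of pulling all 900,000 rows into memory at once.

```python
# Illustration only: a PostgreSQL server-side cursor fetching 1,000 rows per round trip.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=user password=secret host=localhost")  # assumed
with conn.cursor(name="stream_input") as cur:  # naming the cursor makes it server-side
    cur.itersize = 1000                        # mirrors the 1k default of Use Cursor
    cur.execute("SELECT * FROM my_table")
    for row in cur:
        pass  # hand each row (or an accumulated batch) to the output side
conn.close()
```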
If these two tables are in the same database, you can try to use the Talend ELT components to push your processing down to the database. Take a look at the following set of components:
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide60EN/tELTPostgresqlInput
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide60EN/tELTPostgresqlMap
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide60EN/tELTPostgresqlOutput
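Under the hood the ELT approach amounts to a single statement that runs entirely inside PostgreSQL, so the 6.5 GB never has to travel through the Talend JVM. A rough equivalent in plain SQL, shown here through SQLAlchemy with placeholder table names, would be:

```python
# Rough equivalent of the ELT pushdown: the database does the copy itself.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/mydb")  # assumed URL
with engine.begin() as conn:  # begin() commits the transaction on success
    conn.execute(text("INSERT INTO target_table SELECT * FROM source_table"))
```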
Related
So, basically, I have a website that will be used by people to modify filters and then click 'download', and the resulting Excel file will have the data specified by their filters. There are about 125,000+ data points in my PostgreSQL database, and I currently load them in the background using a simple
df = pd.read_sql_query('select * from real_final_sale_data', con = engine)
The only problem is that this quickly overwhelms Heroku's memory allowance on 1 dyno (512 MB), but I do not understand why this is happening or what the solution is.
For instance, when I run this on my computer and call df.info(), it shows the DataFrame only uses about 30 MB, so why does reading it suddenly consume so much memory on Heroku?
Thank you so much for your time!
The solution that ended up working was to apply some of the filters in the SQL query itself. I had been doing a SELECT * without filtering anything in SQL, so my table of roughly 120,000 rows and 30 columns put a strain on Heroku's dyno. It's definitely recommended to either read in chunks or do some filtering when querying the database.
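For anyone hitting the same limit, the two approaches mentioned above (filtering in SQL and reading in chunks) look roughly like this; the sale_date column, the filter value, and the connection URL are made-up examples:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host:5432/mydb")  # assumed URL

# 1) Push the user's filters into the query so far fewer rows come back.
query = text("SELECT * FROM real_final_sale_data WHERE sale_date >= :start")  # hypothetical column
df = pd.read_sql_query(query, engine, params={"start": "2020-01-01"})

# 2) Or stream the full result in chunks instead of one big DataFrame.
for i, chunk in enumerate(pd.read_sql_query("SELECT * FROM real_final_sale_data",
                                            engine, chunksize=10_000)):
    # append each chunk to the export as it arrives, writing the header only once
    chunk.to_csv("export.csv", mode="a", index=False, header=(i == 0))
```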
I want to get last-hour and last-day aggregation results from Druid. Most of the queries I run are ad hoc. I have two questions:
1. Is it a good idea to ingest all raw data without rollup? Without rollup, can I re-index the data multiple times? For example, one task re-indexes the data to find unique user counts for each hour, and another task re-indexes the same data to find the total count for each 10 minutes.
2. If rollup is enabled to compute some basic summaries, it prevents getting information back out of the raw data (because it is summarized). When I later want to re-index the data, some useful information may no longer be available. Is it good practice to enable rollup in streaming mode?
Whether to enable rollup depends on your data size. Normally we keep the data outside of Druid so it can be replayed and re-indexed into different data sources. If your data is a reasonable size, you can set your segment granularity to hour/day/week/month, ensuring that each segment doesn't exceed the ideal segment size (roughly 500 MB is recommended), and set the query granularity to none at index time, so you can do the unique and total count aggregations at query time.
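Expressed as the granularitySpec fragment of an ingestion spec, that suggestion would look roughly like this (only this fragment is shown, with day segments as an example; the rest of the spec is omitted):

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "NONE",
  "rollup": false
}
```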
You can also set your query granularity at index time to 10 minutes, and it will still give you uniques over 1 hour and the total count received in 1 hour.
Also, you can index data into multiple data sources if that's what you are asking. If you are re-indexing data into the same data source, it will create duplicates and skew your results.
It depends on your use case. Rollup will give you better performance and space optimization in the Druid cluster. Ideally, I would suggest keeping your archived data separate in a replayable format so it can be reused.
How many rows can a web data connector handle when importing data into Tableau? What is the maximum number of rows I can generally import?
There are no limitations on how many rows of data you bring back with your web data connector; performance scales pretty well as you bring back more and more rows, so it's really just a matter of how much time you are willing to wait.
The total performance will be a combination of:
1. The time it takes for you to retrieve data from the API.
2. The time it takes our database to create an extract with that data once your web data connector passes it back to Tableau.
#2 will be comparable to the time it would take to create an extract from an Excel file with the same schema and size as the data in your web data connector.
On a related note, the underlying database (the Tableau Data Engine) handles a large number of rows well but is not as well suited to handling a large number of columns, so our guidance is to bring back fewer than 60 columns if possible.
How can we effectively store millions of records coming from our Justdial scraper engine into MongoDB?
We are currently running the script manually to load the .json data into MongoDB. It took 8.5 hours just to insert the data into the database, our database is growing rapidly (we get duplicate data, which we currently clean up after inserting all the records), and the process also consumes a lot of RAM. Is there a better way to do this?
Thanks
It looks like you need to shard your load across many servers; that is the most efficient scenario - see here for more info.
As there is a huge chunk of data to digest, if you have to run this on a single server, please consider:
use a dedicated SSD drive for the mongo data directory (and for the indexes as well)
use WiredTiger as the storage engine (which is the default from 3.2)
divide the input file, as this will reduce swapping (as the saying goes: you can eat an elephant, but not in one sitting :-) )
build batches of a fixed number of documents (let's say 1,000) and insert them together, instead of processing documents one by one or all at once - see the sketch below
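To make the last point concrete, here is a minimal sketch with pymongo; the collection name, the listing_id key used to detect duplicates, and the batch size are all assumptions:

```python
# Sketch: batched inserts with a unique index so duplicates are rejected up front.
import json
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["scraper"]["listings"]               # assumed database/collection names

# Reject duplicates at insert time (keyed on a hypothetical listing_id field).
coll.create_index("listing_id", unique=True)

with open("justdial_dump.json") as f:
    records = json.load(f)  # for very large files, a streaming parser would avoid loading it all

BATCH = 1000
for i in range(0, len(records), BATCH):
    try:
        # ordered=False keeps inserting past duplicate-key errors within the batch
        coll.insert_many(records[i:i + BATCH], ordered=False)
    except BulkWriteError:
        pass  # duplicates in this batch were skipped; everything else was written
```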
I'm working with Talend ETL to transfer data between two Salesforce orgs. I'm trying to run preliminary tests to make sure everything is set up properly.
Is there a way to limit the number of rows being transferred? The database has over 50,000 rows, and I only want to send over 15 or 20.
Thank you.
On the Talend side, you can use tSampleRow to process only a limited number of the rows that were retrieved. For example, you can use a line number range to process only rows 1-50.