I'm reading 8 million rows from a source, enriching some columns, and storing the data to HDFS.
For enrichment, I need to left join the above table with 4 other tables.
The source table's data is encrypted, so I wrote a Hive function to decrypt the data of one column, which becomes the join criterion for the other 4 tables.
All 4 tables have the same column in encrypted form, so before joining I decrypt that column in each of them.
Issue:
While joining any two of these tables, the job fails with a container error.
Steps I did:
Read the source data and wrote it to Parquet format, then repartitioned the data.
Created a temp view on top of that data, which I use in the queries (see the sketch after the config below).
Running config:
Driver memory: 45 GB
Executor memory: 50 GB
Executor cores: 5
Number of executors: 15
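For context, here is a minimal PySpark sketch of the setup described above; the decrypt function, table names, and paths are hypothetical stand-ins for the real ones:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical Hive UDF (assumed registered elsewhere as "decrypt") that decrypts the join-key column.
source = (spark.read.parquet("/data/source_parquet")        # source already written as Parquet
               .withColumn("join_key", F.expr("decrypt(encrypted_col)"))
               .repartition(200, "join_key"))               # repartition on the join key
source.createOrReplaceTempView("source_v")

# Left join with the 4 lookup tables, decrypting their key column the same way.
enriched = spark.sql("""
    SELECT s.*, t1.attr1, t2.attr2, t3.attr3, t4.attr4
    FROM source_v s
    LEFT JOIN (SELECT decrypt(enc_col) AS join_key, attr1 FROM lookup1) t1 ON s.join_key = t1.join_key
    LEFT JOIN (SELECT decrypt(enc_col) AS join_key, attr2 FROM lookup2) t2 ON s.join_key = t2.join_key
    LEFT JOIN (SELECT decrypt(enc_col) AS join_key, attr3 FROM lookup3) t3 ON s.join_key = t3.join_key
    LEFT JOIN (SELECT decrypt(enc_col) AS join_key, attr4 FROM lookup4) t4 ON s.join_key = t4.join_key
""")

enriched.write.mode("overwrite").parquet("/data/enriched")
```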
Could someone please suggest how to load this data and how to avoid the container issue?
I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and save it in Parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using spark.write.saveAsTable('orders'); however, I am running into issues doing incremental loads after the initial one. In particular, I have not been able to find a way to reliably insert/update information in an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using spark.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
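For reference, a minimal sketch of the pattern described above (table, column, and path names are illustrative placeholders), showing where the error appears:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the existing table and apply the incremental changes in memory...
current = spark.read.table("orders")
updates = spark.read.parquet("abfss://raw@mylake.dfs.core.windows.net/orders_increment")  # hypothetical path

merged = (current.join(updates, "order_id", "left_anti")   # keep rows that were not updated
                 .unionByName(updates))                    # plus the new/updated rows

# ...then try to write it back over the same table: this is the step that fails with
# "Cannot overwrite table 'orders' that is also being read from".
merged.write.mode("overwrite").format("parquet").saveAsTable("orders")
```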
A solution indicated by this suggests creating a temporary table and then inserting through it, but that still results in the above error.
Another solution in this post suggests writing the data into a temporary table, dropping the target table, and then renaming the temporary table, but upon doing this Spark gives me FileNotFound errors regarding metadata.
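Continuing the sketch above, the two workarounds mentioned look roughly like this (again just an illustration; the comments reflect what the asker reports, not claims about why):

```python
# Variant 1: stage the merged data in a temporary table, then insert from it.
merged.write.mode("overwrite").saveAsTable("orders_staging")
spark.sql("INSERT OVERWRITE TABLE orders SELECT * FROM orders_staging")  # reportedly still hits the overwrite error

# Variant 2: write to a staging table, drop the target, rename the staging table.
merged.write.mode("overwrite").saveAsTable("orders_staging")
spark.sql("DROP TABLE IF EXISTS orders")
spark.sql("ALTER TABLE orders_staging RENAME TO orders")  # reportedly leads to FileNotFound errors about metadata
```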
I know Delta tables can fix this issue pretty reliably, but our company is not yet ready to move over to Databricks.
All suggestions are greatly appreciated.
A simple source read from a Postgres table (fetching 3 columns out of 20) is taking a huge amount of time. I feed this read into a stream lookup, where I fetch one column of information.
Here is the log:
2020/05/15 07:56:03 - load_identifications - Step **Srclkp_Individuals.0** ended successfully, processed 4869591 lines. ( 7632 lines/s)
2020/05/15 07:56:03 - load_identifications - Step LookupIndiv.0 ended successfully, processed 9754378 lines. ( 15288 lines/s)
The table input query is:
SELECT
id as INDIVIDUAL_ID,
org_ext_loc
FROM
individuals
This table is in Postgres with barely 20 columns and about 4.8 million rows.
This is Pentaho Data Integration 7.1; server details are below.
**Our server information**:
OS : Oracle Linux 7.3
RAM : 65707 MB
HDD Capacity : 2 Terabytes
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 16
CPU MHz: 2294.614
I am connecting to Postgres using JDBC.
I don't know what else I can do to get throughput of about 15K rows/sec.
Check the transformation properties under Miscellaneous:
Nr of rows in rowset
Feedback size
Also check whether your table has a proper index.
When you use a table input with a stream lookup, the way Pentaho runs the stream lookup is slower than using a database lookup. As #nsousa suggested, I checked this with a Dummy step and learned that Pentaho handles each type of step differently.
Even though the database lookup and the stream lookup fall into the same category, the database lookup performs better in this situation.
The Pentaho documentation gives some ideas and suggestions on this.
We have a table with 15 million records, and one of the columns stores a huge XML document. The requirement is to generate 30 different text files, each containing different fields of the XML, over all 15+ million rows of the table.
All 30 jobs will run at the same time.
We often run into ReadTimeoutException. Due to time constraints, we cannot consider caching solutions.
How can we mitigate these read timeout exceptions? Any help would be greatly appreciated.
Below are the Spring Batch and Cassandra versions used:
Cassandra 3.11, with Spring Batch 3.x as the unload framework.
Currently I export the data as sharded files to Google Cloud Storage, download them to a server, and stream them into the partitioned table, but the problem is that it takes a long time: it streams about 1 GB in 40 minutes. Please help me make this faster. My machine has 12 cores and 20 GB of RAM.
You can load data from Google Cloud Storage directly into your partition using the bq command-line tool, an API call, or other methods.
To update data in a specific partition, append a partition decorator to the name of the partitioned table when loading data into the table. A partition decorator represents a specific date and takes the form:
$YYYYMMDD
For example, the following command replaces the data in the entire partition for the date January 1, 2016 (20160101) in a partitioned table named mydataset.table1 with content loaded from a Cloud Storage bucket:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON 'mydataset.table1$20160101' gs://[MY_BUCKET]/replacement_json.json
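The same partition load can also be issued from the BigQuery Python client library; a rough equivalent is sketched below (bucket, dataset, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # equivalent of --replace
)

# The "$20160101" decorator targets a single partition of the table.
load_job = client.load_table_from_uri(
    "gs://my_bucket/replacement_json.json",
    "mydataset.table1$20160101",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```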
Note: Because partitions in a partitioned table share the table schema, replacing data in a partition will not replace the schema of the table. Instead, the schema of the new data must be compatible with the table schema. To update the schema of the table with the load job, use configuration.load.schemaUpdateOptions.
Read more https://cloud.google.com/bigquery/docs/creating-partitioned-tables
I have a text input file that is 7.5 GB in size and contains 65 million records, and I want to push that data into an Amazon Redshift table.
But after processing 5.6 million records, the job is no longer moving.
What could be the issue? Is there any limitation with tFileOutputDelimited? The job has been running for 3 hours.
Below is the job I created to push data into the Redshift table; a rough sketch of the SQL behind the tRedShiftRow steps follows the diagram.
tFileInputDelimited(.text)---tMap--->tFileOutputDelimited(csv)
|
|
tS3Put(copy output file to S3) ------> tRedShiftRow(createTempTable)--> tRedShiftRow(COPY to Temp)
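For clarity, the SQL those two tRedShiftRow steps would run is roughly the following, sketched here through psycopg2; the connection details, table names, bucket, and IAM role are placeholders:

```python
import psycopg2

# Hypothetical connection details for the Redshift cluster.
conn = psycopg2.connect(host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="etl_user", password="REPLACE_ME")
conn.autocommit = True
cur = conn.cursor()

# tRedShiftRow(createTempTable): stage the incoming rows.
cur.execute("CREATE TEMP TABLE orders_staging (LIKE orders);")

# tRedShiftRow(COPY to Temp): bulk-load the CSV that tS3Put uploaded.
cur.execute("""
    COPY orders_staging
    FROM 's3://my-bucket/output.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS CSV;
""")
```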
The limitation comes from the tMap component; it is not a good choice for dealing with large amounts of data. In your case, you have to enable the "Store temp data" option to overcome tMap's memory consumption limitation.
This is well described in the Talend Help Center.
It looks like tFileOutputDelimited(csv) is causing the problem; a single file can't handle more than a certain amount of data (not sure, though). Try to find a way to load only a portion of the parent input file and commit it to Redshift, then repeat the process until the parent input file is completely processed.
Use AWS Glue to push your file data from S3 to Redshift. AWS Glue will push the large data into Redshift easily without any issue.
Steps:
1: Create a connection to your Redshift cluster.
2: Create a database and two tables:
a: Data-from-S3 (this will be used to crawl the file data from S3)
b: data-to-redshift (add the Redshift connection)
3: Create a job:
a: In Data source, select the "Data-from-S3" table
b: In Data target, select the "data-to-redshift" table
4: Run the job.
Note: You can also automate this with a Lambda and an SNS trigger. A rough script-level sketch of the same job follows below.
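If the Glue job is authored as a script rather than entirely in the console, it would look roughly like this; the database, table, and connection names are the placeholders used in the steps above:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled S3 data from the Glue Data Catalog ("Data-from-S3" table).
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="data_from_s3",
)

# Write it to Redshift through the catalog connection ("data-to-redshift").
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="data-to-redshift",
    connection_options={"dbtable": "orders", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/glue-temp/",
)
```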
You can use the COPY command to load large data into AWS Redshift; if the COPY command doesn't support a txt file, then we need a CSV file. Processing 65 million records in one go will create issues, so we need to split and run: create 65 iterations and process 1 million rows at a time. To implement this, use a tLoop and set the values inside the component, and use tLoop's global variables in the header and limit values of the tFileInputDelimited component (a plain-Python sketch of the splitting idea follows below).
Job:
tLoop ----> tFileInputDelimited ----> tMap (if needed) ----> tFileOutputDelimited
Also enable the "Store temp data" option to handle the memory issue.
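Outside Talend, the split-and-load idea can be sketched in plain Python (file names and the chunk size are arbitrary placeholders); each chunk would then go through the S3 upload and COPY steps:

```python
from itertools import islice

CHUNK_ROWS = 1_000_000  # roughly 65 chunks for 65 million records

with open("input.txt", encoding="utf-8") as src:
    chunk_no = 0
    while True:
        rows = list(islice(src, CHUNK_ROWS))   # read the next 1M lines
        if not rows:
            break
        chunk_no += 1
        with open(f"chunk_{chunk_no:03d}.csv", "w", encoding="utf-8") as out:
            out.writelines(rows)
```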