What is an efficient way of date-partitioning 2 TB of data in BigQuery? - streaming

Currently I export the data as sharded files to Google Cloud Storage, download them to a server, and stream them into the partitioned table, but the problem is that it takes a long time: it streams roughly 1 GB every 40 minutes. Please help me make it faster. My machine has 12 cores and 20 GB of RAM.

You can load data from Google Cloud Storage directly into your partition using an API call or other methods.
To update data in a specific partition, append a partition decorator to the name of the partitioned table when loading data into the table. A partition decorator represents a specific date and takes the form:
$YYYYMMDD
For example, the following command replaces the data in the entire partition for the date January 1, 2016 (20160101) in a partitioned table named mydataset.table1 with content loaded from a Cloud Storage bucket:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON 'mydataset.table1$20160101' gs://[MY_BUCKET]/replacement_json.json
Note: Because partitions in a partitioned table share the table schema, replacing data in a partition will not replace the schema of the table. Instead, the schema of the new data must be compatible with the table schema. To update the schema of the table with the load job, use configuration.load.schemaUpdateOptions.
Read more: https://cloud.google.com/bigquery/docs/creating-partitioned-tables
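If you go the API route, here is a minimal sketch of the same partition load using the google-cloud-bigquery Python client (the bucket and table names are the same placeholders as in the bq example above). A load job from Cloud Storage is a batch operation, so it avoids streaming altogether:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # like --replace

# The $YYYYMMDD decorator targets a single partition of the table.
table_ref = client.dataset("mydataset").table("table1$20160101")

load_job = client.load_table_from_uri(
    "gs://[MY_BUCKET]/replacement_json.json", table_ref, job_config=job_config)
load_job.result()  # wait for the load job to complete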

Related

Loading dataframe from dynamodb - aws glue pyspark

I'm trying to read records from a DynamoDB table. I have tried using a dynamic frame, but since the table has 8 million records, filtering takes too long. I don't actually need to load all 8 million records into the DataFrame. Instead of applying a filter on the dynamic frame, I want to know whether there is an option to load the DataFrame by passing a query, so that only a few records are loaded and it works faster.
You can load the DataFrame by passing a query to spark.sql(), but first you will have to run an AWS Glue crawler on the DynamoDB table so that a corresponding table is created in the AWS Glue Data Catalog. You can then use this catalog table to read the data into a Spark DataFrame directly.
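A rough sketch of that approach (the database, table and column names below are made up, and the Glue job is assumed to use the Data Catalog as its Hive metastore):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Query the catalog table created by the crawler; only rows matching the
# predicate end up in the resulting DataFrame.
df = spark.sql("""
    SELECT *
    FROM my_catalog_db.my_dynamo_table
    WHERE event_date >= '2019-01-01'
""")
df.show(5)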

Container error with exit code 134 while writing Parquet

I'm reading 8 million rows from a source, trying to enrich the columns, and storing the data to HDFS.
For enrichment, I need to left join the source table with four other tables.
The source data is encrypted, so I wrote a Hive function to decrypt the column that becomes the join key for the other four tables.
All four tables have the same column in encrypted form, so before joining I decrypt that column as well.
Issue:
While joining any two of the tables I get a container error.
Steps I did:
Read the source data and wrote it to Parquet format, then repartitioned the data.
Created a temp view on the above data, which I use in the queries.
Running config:
Driver memory - 45 GB
Executor memory - 50 GB
Executor cores - 5
Number of executors - 15
Could someone please suggest how to load this data and how to avoid the container issue?
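A simplified sketch of the steps described above (the database and table names, the decrypt UDF, and the join key are placeholders, and only two of the four lookup joins are shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Read the source, decrypt the join column with the (already registered)
#    Hive UDF, repartition on it and stage the result as Parquet.
src = spark.sql("SELECT *, decrypt_udf(enc_col) AS join_key FROM src_db.source_table")
src.repartition(200, "join_key").write.mode("overwrite").parquet("/staging/source_parquet")

# 2. Register a temp view on the staged data and enrich it by left-joining
#    the lookup tables on the decrypted column.
spark.read.parquet("/staging/source_parquet").createOrReplaceTempView("src")
enriched = spark.sql("""
    SELECT s.*, l1.attr1, l2.attr2
    FROM src s
    LEFT JOIN lookup_db.lookup1 l1 ON s.join_key = decrypt_udf(l1.enc_col)
    LEFT JOIN lookup_db.lookup2 l2 ON s.join_key = decrypt_udf(l2.enc_col)
""")
enriched.write.mode("overwrite").parquet("/enriched/output")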

Daily data archival from Postgres to Hive/HDFS

I am working on an IoT data pipeline and receive messages every second from multiple devices into a Postgres database. Postgres keeps only the last two days of data; after two days the data is flushed, so at any point it holds just the most recent two days. I now need to archive the data daily from Postgres to HDFS. The fields I have are:
deviceid, timestamp, year, month, day, temperature, humidity
I want to archive it daily into HDFS and query that data with Hive. For that I need to create an external partitioned table in Hive with deviceid, year and month as partitions. I have tried the following options, but they are not working:
Using Sqoop to copy the data, but it cannot create dynamic folders based on deviceid, year and month, so the external Hive table cannot pick up the partitions.
Using sqoop import with the --hive-import option so that the data is copied directly into a Hive table, but this overwrites the existing table, and I am not sure whether it works for partitioned tables at all.
Please suggest some solutions for the archival.
Note: I am using Azure services, so Azure Data Factory is also an option.
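For reference, the intended layout sketched with PySpark (the JDBC URL, credentials, table name and paths are placeholders, and year/month/day are assumed to be integer columns); the daily pull is written partitioned by deviceid/year/month so that an external Hive table can pick the partitions up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Pull yesterday's readings from Postgres.
daily = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/iotdb")
    .option("dbtable",
            "(SELECT * FROM readings "
            "WHERE make_date(year, month, day) = current_date - 1) AS t")
    .option("user", "etl_user")
    .option("password", "****")
    .load())

# Write into HDFS laid out the way the external Hive table expects.
(daily.write.mode("append")
      .partitionBy("deviceid", "year", "month")
      .parquet("hdfs:///archive/iot/readings"))

# New folders still have to be registered as partitions in Hive after each
# run, e.g. with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.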

Materialised View in Clickhouse not populating

I am currently working on a project which needs to ingest data from a Kafka topic (JSON format) and write it directly into ClickHouse. I followed the method suggested in the ClickHouse documentation:
Step 1: Created a ClickHouse consumer which writes into a table (say, level1).
Step 2: I performed a select query on 'level1' and it gives me a set of results, but it is not particularly useful as the data can be read only once.
Step 3: I created a materialised view that takes data from the engine (level1) and puts it into a previously created table (say, level2). While writing into 'level2', the aggregation is at the day level (done by converting the timestamp in level1 to a datetime).
Therefore, the data in 'level2' is: day + all columns of 'level1'.
I intend to use this view (level2) as the base for any future aggregation (say, at level3).
Problem 1: 'level2' is being created but data is not being populated in it, i.e., when I perform a basic select query (select * from level2 limit 10) on the view, the output is "0 rows in set".
Is it because of the day-level aggregation, so it might only populate at the end of the day? Can I ingest data from 'level2' in real time?
Problem 2: Is there a way of reading the same data from my engine 'level1', multiple times?
Problem 3: Is there a way to convert Avro to JSON while reading from a kafka topic? Or can Clickhouse write data (in Avro format) directly into 'level1' without any conversion?
EDIT: There is latency in ClickHouse when retrieving data from Kafka. I had to make changes in the users.xml file on my ClickHouse server (changed max_block_size).
Problem 1: 'level2' is being created but data is not being populated in it, i.e., when I perform a basic select query (select * from level2 limit 10) on the view, the output is "0 rows in set".
This might be related to the default settings of the Kafka storage engine, which always starts consuming data from the latest offset. You can change this behavior by adding
<kafka>
    <auto_offset_reset>earliest</auto_offset_reset>
</kafka>
to config.xml.
Problem 2: Is there a way of reading the same data from my engine 'level1', multiple times?
You'd better avoid reading from the Kafka storage engine directly. You can set up a dedicated materialized view M1 for 'level1' and use that to populate 'level2' as well. Reading from M1 is then repeatable; a sketch follows below.
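A rough sketch of that layout, issued here through the Python clickhouse-driver package (the column names, types and ORDER BY key are illustrative only, and 'level2' is assumed to already exist as in the question; the same statements can of course be run from clickhouse-client): a plain MergeTree table m1 keeps a durable, re-readable copy of everything the Kafka table 'level1' consumes, and 'level2' is then fed from m1 instead of from the Kafka engine directly.

from clickhouse_driver import Client

client = Client(host="localhost")

# Durable, re-readable landing table for the Kafka stream.
client.execute("""
    CREATE TABLE IF NOT EXISTS m1 (ts DateTime, value Float64)
    ENGINE = MergeTree ORDER BY ts
""")

# Materialized view that drains the Kafka engine table 'level1' into m1.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS level1_to_m1 TO m1
    AS SELECT ts, value FROM level1
""")

# Feed the previously created table 'level2' from m1 rather than from Kafka;
# this view is triggered on every insert block that lands in m1.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS m1_to_level2 TO level2
    AS SELECT toDate(ts) AS day, value FROM m1
""")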
Problem 3: Is there a way to convert Avro to JSON while reading from a kafka topic? Or can Clickhouse write data (in Avro format) directly into 'level1' without any conversion?
No, though you can try Cap'n Proto, which should provide performance similar to Avro's and is supported directly by ClickHouse.

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically generates names for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically split the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
I have also posted this question in the AWS Glue forum; here is the link: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown-predicates feature; however, it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections, though. A sketch of the predicate for partitioned S3 data follows below.
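This is roughly how the pushdown predicate looks for a catalog table backed by partitioned S3 data (the database, table and partition keys are made up); as noted, it does not yet help with a JDBC source such as RDS PostgreSQL:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_partitioned_table",
    push_down_predicate="year = '2019' and month >= '06'")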
It's not possible to specify the names of the output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying an object from one location to another, so it is a costly and non-atomic operation); see the sketch below.
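A sketch of such a rename with boto3 (the bucket and key prefixes are placeholders); each "rename" is really a copy followed by a delete:

import boto3

s3 = boto3.client("s3")
bucket = "my-output-bucket"

# Copy every object written by the job to its desired name, then delete the
# original; this is not atomic and transfers each object again.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="glue-output/run-20190601/")
for obj in resp.get("Contents", []):
    old_key = obj["Key"]
    new_key = old_key.replace("part-", "myreport-")  # choose your naming scheme
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": old_key},
                   Key=new_key)
    s3.delete_object(Bucket=bucket, Key=old_key)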
You can't really control the size of the output files directly. There is an option to control the number of output files using coalesce, though. Also, starting with Spark 2.2, it is possible to cap the number of records per file by setting the config spark.sql.files.maxRecordsPerFile. Both options are sketched below.
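Both options, sketched with placeholder database, table and path names:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

df = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table").toDF()

# Fewer, larger files: cap the number of partitions before writing.
df.coalesce(4).write.mode("overwrite").parquet("s3://my-output-bucket/coalesced/")

# Or cap the rows per file (Spark 2.2+); partitions larger than the cap are split.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)
df.write.mode("overwrite").parquet("s3://my-output-bucket/capped/")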