Partition a file by column and create partition files named after the column values in PySpark (Databricks)

I want to partition a file based on one column and generate one partition file per column value.
File 1:
Store  Product Id
001    1001
001    1002
002    1003
002    1004
The file should be partitioned on the Store column, and instead of creating
part000
part001
it should create
001.xml
002.xml
Thanks in advance.
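
A minimal sketch of one way to do this, assuming the spark-xml library is attached to the cluster and the notebook-global dbutils is available; the paths, the Store column name, and the XML tags are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input: a file with columns Store and ProductId.
df = spark.read.option("header", True).csv("/mnt/input/file1.csv")

# Spark does not let you choose part-file names, so write one directory
# per Store value and rename the single part file afterwards.
for row in df.select("Store").distinct().collect():
    store = row["Store"]
    out_dir = f"/mnt/output/{store}"
    (df.filter(df.Store == store)
       .coalesce(1)  # force a single part file per store
       .write.format("com.databricks.spark.xml")
       .option("rootTag", "rows")
       .option("rowTag", "row")
       .mode("overwrite")
       .save(out_dir))
    # Databricks-specific: rename the part file to <store>.xml
    part = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
    dbutils.fs.mv(part, f"/mnt/output/{store}.xml")

Note that this collects the distinct values to the driver, so it only makes sense when the number of distinct Store values is small.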

Related

Copy data activity in ADF is not deleting data in Azure Table Storage

Right now, I am using the ADF Copy Data activity to copy data from Azure SQL to Azure Table Storage.
It inserts/replaces/updates data while loading whenever the ADF pipeline is triggered. This behavior is controlled by the "Insert type" option in the Sink of the Copy Data activity.
But it is not deleting records in the destination (ATS table).
How do I sync Azure SQL data with Azure Table Storage (including deleted data)?
Ex:
Source SQL table: Employee
Id  Name
1   User1
2   User2
Now, using this Copy Data activity, these 2 rows are synced to ATS.
Destination ATS table: Employee
PartitionKey  RowKey   Timestamp         Name
1             NewGuid  2022-07-22 11:30  User1
2             NewGuid  2022-07-22 11:30  User2
Now the source SQL table is updated as below:
Id  Name
1   User2
3   User3
Id 2 was deleted, the Name for Id 1 was updated, and Id 3 was added.
If I run the pipeline again, ATS is updated as below:
PartitionKey  RowKey   Timestamp         Name
1             NewGuid  2022-07-22 12:30  User2
2             NewGuid  2022-07-22 11:30  User2
3             NewGuid  2022-07-22 12:30  User3
Here, PartitionKey 2 is not deleted, but the insert and update are done.
How can this record be deleted as well using the Copy Data activity sync?
Have you tried using a stored procedure for that case?
I reproduced this and got the same result.
AFAIK, the Copy activity does not delete any data from the target; it overwrites the data in the target with the source data.
The Sink settings also describe it as an insert type.
Azure Table Storage supports update or replace from outside (which we can treat as a delete, since the old data is removed) only when the new record's PartitionKey and RowKey match an existing record in the target.
That is not happening here with your last row, which is why it is not updated.
You can raise a feature request for deletion via the Copy activity for this type of storage here.
When your data is small, you can try this manual approach to delete such records:
Create a new unique column used only for the Table Storage RowKey. Assign your regular table Id to the PartitionKey and this column to the RowKey. Then, whenever you want to delete old records and update with new ones, supply the old RowKey value.
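
A rough sketch of that manual approach (the table and column names here are assumptions): add a stable unique column to the source SQL table and project it as the RowKey in the Copy activity's source query:

-- Hypothetical: a stable unique column to drive the ATS RowKey.
ALTER TABLE Employee
    ADD RowKeyGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID();

-- Source query for the Copy activity: map Id to PartitionKey and
-- RowKeyGuid to RowKey, so later runs can target the same ATS row.
SELECT CAST(Id AS NVARCHAR(50))         AS PartitionKey,
       CAST(RowKeyGuid AS NVARCHAR(50)) AS RowKey,
       Name
FROM Employee;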

How to create month-wise partitions: only 12 tables for multiple years

I am new to table partitioning. I want to partition a table by range on the inserted_on column; around 40,000 records are inserted into this table daily.
I have tried creating a partition table like this:
CREATE TABLE My.table_name_fy2022_01 PARTITION OF My.table_name FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');
But this way I would have to create 12 tables per year, which I don't want to do.
My question is: how do I partition the table so that there are only 12 partitions (one per month), with each record stored in the partition for its month?
For example:
partition table June:
record of 2022-06-20 inserts into June,
record of 2023-06-16 inserts into June,
record of 2024-06-10 inserts into June,
and so on.
PARTITION BY HASH should be used, like:
PARTITION BY HASH(MONTH(inserted_on)) PARTITIONS 12;
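
Note that the syntax above is MySQL, while the PARTITION OF syntax in the question is PostgreSQL. A rough PostgreSQL sketch of the same idea (the column list is an assumption) uses list partitioning on the month so that all years share the same 12 partitions:

-- Sketch: 12 fixed partitions keyed on the month of inserted_on.
CREATE TABLE My.table_name (
    id          bigint,
    inserted_on timestamp NOT NULL
) PARTITION BY LIST ((EXTRACT(MONTH FROM inserted_on)));

CREATE TABLE My.table_name_m01 PARTITION OF My.table_name FOR VALUES IN (1);
CREATE TABLE My.table_name_m02 PARTITION OF My.table_name FOR VALUES IN (2);
-- ... repeat for months 3 through 11 ...
CREATE TABLE My.table_name_m12 PARTITION OF My.table_name FOR VALUES IN (12);

With this layout, rows from 2022-06-20, 2023-06-16, and 2024-06-10 all land in the June partition, as the question asks; the trade-off is that every partition mixes all years, so you can no longer drop a single year cheaply.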

MySQL Workbench table looping

In the MySQL Workbench tool, this is what I need to accomplish. Table A contains one unique member. Table B can have anywhere from 1 to 7 rows for that member, because this table contains all the different IDs a member can have, i.e., the first row contains the SSN, the 2nd row contains the subscriber_id, etc. I need to link the 2 tables, obtaining ALL the member IDs in the Table B rows, BUT I only want to produce one output line per member. Here is an example of the output line:
member_name ssn# subscriber# another_id another_id another_id
John doe 123456789 634 0018 blank 3876
Basically, what I'm trying to do is loop through Table B to obtain each type of ID for a member.
Thank you for any suggestions.
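
One common way to do this without an actual loop is conditional aggregation. A sketch, assuming Table B has columns member_id, id_type, and id_value (the real names aren't shown in the question):

-- Pivot the 1-to-7 id rows into one output line per member.
SELECT a.member_name,
       MAX(CASE WHEN b.id_type = 'ssn'        THEN b.id_value END) AS ssn,
       MAX(CASE WHEN b.id_type = 'subscriber' THEN b.id_value END) AS subscriber_id,
       MAX(CASE WHEN b.id_type = 'other_1'    THEN b.id_value END) AS another_id_1,
       MAX(CASE WHEN b.id_type = 'other_2'    THEN b.id_value END) AS another_id_2
FROM table_a AS a
JOIN table_b AS b ON b.member_id = a.member_id
GROUP BY a.member_id, a.member_name;

ID types a member doesn't have simply come out as NULL, which matches the "blank" in the example output.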

How to transfer data from a csv with multiple/single delimiter to postgres DB

Hi, I have a dataset in CSV that I want to import into a Postgres DB in a particular format.
The data in the CSV looks like this:
1::comedy*drama*horror
2::suspense*thriller
Now I want to import this into a Postgres table with two columns, id and genre, where id is a foreign key, as:
id genre
1 comedy
1 drama
1 horror
2 suspense
2 thriller
Appreciate your help thanks!
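
A sketch of one approach, assuming the target table is called movie_genre and the file is loaded through psql (the names are placeholders): load each raw line into a one-column staging table, then split it with split_part, string_to_array, and unnest:

-- Staging table holding one raw line per row.
CREATE TABLE staging (line text);

-- In psql: \copy staging FROM 'data.csv'

-- Split "1::comedy*drama*horror" into (1, comedy), (1, drama), (1, horror).
INSERT INTO movie_genre (id, genre)
SELECT split_part(line, '::', 1)::int AS id,
       unnest(string_to_array(split_part(line, '::', 2), '*')) AS genre
FROM staging;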

Hive partitioning external table based on range

I want to partition an external table in Hive based on ranges of numbers, say numbers 1 to 100 go to one partition. Is it possible to do this in Hive?
I am assuming here that you have a table with some records from which you want to load data into an external table that is partitioned by some field, say RANGEOFNUMS.
Now, suppose we have a table called testtable with columns name and value. The contents are:
India,1
India,2
India,3
India,3
India,4
India,10
India,11
India,12
India,13
India,14
Now, suppose we have an external table called testext with the same columns plus a partition column, say RANGEOFNUMS.
Now you can do this:
insert into table testext partition (rangeofnums="your value")
select * from testtable where value >= 1 and value <= 5;
This way, all records from testtable with value 1 to 5 will go into one partition of the external table.
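
If you'd rather not run one INSERT per range, a hedged variation is Hive dynamic partitioning, deriving the partition value from the data (the bucket width of 100 follows the question; the label format is an assumption):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Compute a range label per row: 1-100, 101-200, and so on.
-- The dynamic partition column must come last in the SELECT list.
INSERT INTO TABLE testext PARTITION (rangeofnums)
SELECT name,
       value,
       CONCAT(CAST(FLOOR((value - 1) / 100) * 100 + 1 AS STRING), '-',
              CAST((FLOOR((value - 1) / 100) + 1) * 100 AS STRING)) AS rangeofnums
FROM testtable;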
This scenario is my assumption only. Please comment if this is not the scenario you have.
Achyut