Annotate point cloud with information from csv file - annotations

I have a point cloud file (ply format) from a forest scene which is not annotated. The information about the class of the vegetation is inside a csv file. I was wondering how can I extract the information about the classes from the csv file in order to acquire the annotated point cloud.

Related

Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in account storage. This is run every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, dataset for the source and dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing)
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contact etc.) that essentially does the following:
accesses all of the correct files, using a wildcard path along the lines of: Raw/CRM/ * / * / * /campaign.parquet
selects the columns that I need
Rename column headings
in some cases, just take the most recent data (using the RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM///*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory, but when I tell it which dataset to use, I point it to the dataset I created in my original pipeline - but this dataset has all 12 entities in the dataset and the data flow activity produces the error: "No value provided for Parameter 'cw_fileName" - but I don't see any place when configuring the data flow to specify a parameter (its not under source settings, source options, projection, optimize or inspect)
using Azure Data Factory, tried to add a script - but in trying to connect to my SQL script in Synapse - I don't know my Service Principal Key for the synapse workspace
using a notebook Databricks, I tried to mount my container but got an error along the lines that "adding secret to Databricks scope doesn't work in Standard Tier" so couldn't proceed
using Synapse, but as expected, it wants things in SQL whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction. What's the best approach that I should take? And if its one that I've described above, how do I go about getting past the issue I've described?
Pass the data flow dataset parameter values from the pipeline data flow activity settings.

Specify parquet file name when saving in Databricks to Azure Data Lake

Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:
append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')
a folder called Covid_Cases gets created and there are parquet files with random names inside of it.
What I would like to do is to use the saved parquet file in Data Factory copy activity. In order to do that, I need to specify the parquet file's name, otherwise I can't point to a specific file.
Since spark is executing in distributed mode and files or their revatives, e.g. dataframes, are being processed in parallel, processed data will be stored in different files in same folder . You can use folder level name to Data Factory copy activity. But you really want to make it single file, you can use below approach ,
save_location= "/mnt/adls/covid/base/Covid19_Cases"+year
parquet_location = save_location+"temp.folder"
file_location = save_location+'export.parquet'
df.repartition(1).write.parquet(path=parquet_location, mode="append", header="true")
file = dbutils.fs.ls(parquet_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(parquet_location, recurse=True)

Azure Data Factory data flow file sink

I am using a .csv file to import data into an Azure SQL database. After the data import is complete I am now moving the source file from the Source container to myArchive container. I am now trying to save the filename as SaleData_yyyyMMdd_HHmm.csv, but, I have the folder with this name getting created and the file is broken down into multiple part files (part-00000-, part-00001-,...). Could you please guide me on how to specify the filename with current data & timestamp.
File System: myArchive
Folder Path: concat('SalesDepartment/Warehouse1/','SaleData_',toString(currentTimestamp(),'yyyyMMdd_HHmm'),'.csv')
Folder path can be mentioned directly in the sink dataset. (Note, my source and sink both are delimited type)
For filename,
Under sink data set, create a parameter to pass file name and use it in the file name portion of dataset.
Use the below expression in copy activity sink's parameter value
#concat('SaleData_',formatDateTime(utcnow(),'yyyyMMdd_HHmm'),'.csv')
Remember, this just copies your source in a different name. We need to add a delete activity to delete the original source file.
If you are using a dataflow,
make sure you are choosing single partition in the optimize tab of Sink instead of Use current Partitioning.
Then, go to Settings, choose Output to SIngle file. Under filename, mention the expression with timestamp.
concat('SaleData_',toString(currentUTC('yyyyMMdd_HHmm')),'.csv')

Data Flow Sink is generating zero byte files at each folder location

In one of the data factory which I am currently using, the data flow sink generates file at every folder location.
Suppose I want to generate a csv as MyData.csv inside Folder1/Folder2/ location.
It does generates MyData.csv inside Folder1/Folder2 location but additionally it generates zero byte file at every Folder location. So my hierarchy looks something like below
/Folder1/Folder2 (0 byte file)
/Folder1 (0 byte file)
/Folder1/Folder2/MyData.csv (Desired Output)
In old Data factory I was not facing the above issue, I checked the data set and linked service connection. Everything is same. Also checked whether the storage account has a problem, but using the same storage account but different data factory, issue persists. Kindly give suggestions

Reading file from Google Drive with Talend

I need to read an uploaded file in Google Drive and perform X transformation with it. As per my reading, the single way to do it is by downloading the file to my local machine with the Talend component and then, reading from there.
If it is correct, I cannot figure what would be the file name assuming that I don't want to use the exact name of the file.
I found http://meowbi.com/2018/02/23/getting-google-sheet-gdrive-talend/ and it is exactly what I need - read from Google Drive, check the file name and proceed if the file name is X. What is unclear for me is what they used in tJava.
The output schema of tGoogleDriveList component's Main row contains a field name that is the file name you're looking for. Using Iterate row is less straightforward as you need to extract values from GlobalMap. In the article you cited they get file name by "tGoogleDriveList_1_TITLE" key of the GlobalMap.
Main row between tGoogleDriveList and tJava
For more details please look into the Talend Reference for Google Drive components. The Listing files and folders in Google Drive section should be particularly topical for your case.