PowerShell: Remove duplicate entries from columns, keeping only the first unique value in its original cell of a CSV file

I am pretty new to PowerShell. I have been trying to remove duplicate entries from the columns of a CSV file, keeping only the first occurrence of each value in its original cell.
Any help is greatly appreciated. Thanks in advance.
Input:
1,TestSCript,Test1,Data1,passed
1,TestSCript,Test1,Data2,passed
1,TestSCript,Test2,Data3,passed
1,TestSCript,Test2,Data4,passed
1,TestSCript,Test2,Data5,passed
1,TestSCript,Test3,Data6,passed
1,TestSCript,Test3,Data7,passed
1,TestSCript,Test4,Data8,passed
1,TestSCript,Test5,Data9,passed
Expected Result:
1,TestSCript,Test1,Data1,passed
,,,Data2,passed
,,Test2,Data3,passed
,,,Data4,passed
,,,Data5,passed
,,Test3,Data6,passed
,,,Data7,passed
,,Test4,Data8,passed
,,Test5,Data9,passed
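This can be done in plain PowerShell. A minimal sketch, assuming the input above sits in .\Input.csv with no header row (the file names are placeholders): blank each leading cell that repeats the previous row, and keep everything from the first differing column onward, which reproduces the expected result exactly.
$prev = $null
Get-Content .\Input.csv | ForEach-Object {
    $cells = $_ -split ','
    $out   = $cells.Clone()           # working copy we can blank
    if ($prev) {
        for ($i = 0; $i -lt $cells.Count; $i++) {
            if ($cells[$i] -eq $prev[$i]) { $out[$i] = '' } else { break }
        }
    }
    $prev = $cells                    # remember the original (unblanked) row
    $out -join ','
} | Set-Content .\Output.csv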

Related

How to keep the original column after applying data validation in the same column

I have a task to validate decimal and date fields. I am able to validate the decimal and date fields in the same column, but I am not able to keep the old column values.
Input:
id,amt1
1,123
2,321
3,345
4,543
5,789
Current Output:
id,amt1
1,12.3
2,32.1
3,34.5
4,54.3
5,78.9
Expected Output:
id,amt1,original_amt1_values
1,12.3,123
2,32.1,321
3,34.5,345
4,54.3,543
5,78.9,789
Below is the code. I am able to validate the decimal field but not able to keep the original values. Kindly help me with this; I want to keep the original column in the DataFrame itself.
SourceFileDF = SourceFileDF.withColumn("amt1", DecimalConversion(col("amt1")))
DecimalConversion is my UDF and SourceFileDF is my DataFrame.
You can use a temporary column name for "amt1" and then rename the columns. Note that DataFrame transformations return a new DataFrame rather than modifying the original, so each result has to be reassigned:
SourceFileDF = SourceFileDF.withColumn("amt1_converted", DecimalConversion(col("amt1")))
SourceFileDF = SourceFileDF.withColumnRenamed("amt1", "original_amt1_values")
SourceFileDF = SourceFileDF.withColumnRenamed("amt1_converted", "amt1")
You can use select and provide the alias in a single statement:
sourceFileDF.select(
  DecimalConversion($"amt1").as("amt1"),
  $"amt1".as("original_amt1_values")
)
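An alternative sketch, not taken from the answers above: copy the original values into the new column first, then overwrite amt1 in place, so no rename is needed (column names follow the question):
SourceFileDF = SourceFileDF.withColumn("original_amt1_values", col("amt1"))
SourceFileDF = SourceFileDF.withColumn("amt1", DecimalConversion(col("amt1")))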

Trouble with looping function into structure index

I'm relatively new to MATLAB and would really appreciate any help.
Currently, I have a function (we'll call it readf) that reads in data from a single ASCII file into a struct of multiple fields (we'll call it cdata).
names = cellstr(char('A','B','C','D','E','F','G'));
cdata = readf('filestring','dataNames',names);
The function works fine and gives me the correct output of a struct with these field names, with the value of each field name being a cell array of the corresponding data.
My task is to create a for loop that uses this readf function to read in a folder of these ascii files at once. I'm trying to work it so that the for loop creates a struct with an index of the different cdata structs. After trying a few different methods, I am stumped.
This is what I have so far.
files = struct2cell(dir('folderstring')); %creates a cell array of the names of the files within the folder
for ii=length(files);
cdata(ii) = readf([folderstring,files(1,1:ii),names],'dataName',names);
end;
This is currently giving me the following error.
"Error using horzcat
Dimensions of matrices being concatenated are not consistent."
I am not sure what is wrong. How can I fix this code so I can read in all the data from a folder at once? Is there a better and more efficient way to do this than making an index to this struct? Perhaps a cell array of different structures, or even a structure of nested structures? Thanks!
Change:
for ii=length(files);
cdata(ii) = readf([folderstring,files(1,1:ii),names],'dataName',names);
end;
To:
for ii = 1:length(files) % CHECK that length(files) gives you the right number of files
    % Brace-index the file-name row of the struct2cell output so the path is a
    % plain character vector, and pass 'dataNames' as in the working single-file call.
    cdata(ii) = readf([folderstring, files{1,ii}], 'dataNames', names);
end
% CHECK that files{1,ii}, for ii = 1, 2, 3, etc., gives you the correct file names.
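A more robust variant (my own sketch, not part of the original answer): keep the dir struct instead of flattening it, let fullfile build the paths, and drop the '.' and '..' entries that dir always returns.
d = dir(folderstring);            % assumes folderstring is the folder path
d = d(~[d.isdir]);                % drop '.', '..', and any subfolders
names = cellstr(char('A','B','C','D','E','F','G'));
for ii = 1:numel(d)
    % fullfile joins folder and file name with the correct separator
    cdata(ii) = readf(fullfile(folderstring, d(ii).name), 'dataNames', names);
end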

How to give column names after one-hot encoding with sklearn?

Here is my question; I hope someone can help me figure it out.
To explain: there are more than 10 categorical columns in my data set, and each of them has 200-300 categories. I want to convert them into binary values. To do that, I first used LabelEncoder to convert the string categories into numbers.
After LabelEncoder, I used OneHotEncoder from scikit-learn, and it worked. But the problem is, I need the column names after one-hot encoding. For example, take column A with categorical values before encoding.
A = [1,2,3,4,..]
It should be like that after encoding,
A-1, A-2, A-3
Does anyone know how to assign column names (old column name - value name or number) after one-hot encoding?
I need the columns to have names because I trained an ANN, and every time new data comes up I cannot convert all the past data again and again. So I want to add just the new ones every time. Thanks anyway.
As @Vivek Kumar mentioned, you can use the pandas function get_dummies() instead of OneHotEncoder. I wanted to preserve a version of my initial DataFrame, so I did the following:
import pandas as pd
DataFrame2 = pd.get_dummies(DataFrame)
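Note that get_dummies already produces names in the "original column + separator + category value" form the question asks for, and the separator is configurable. A small sketch with hypothetical data:
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x']})            # hypothetical column
encoded = pd.get_dummies(df, columns=['A'], prefix_sep='-')
print(encoded.columns.tolist())                      # ['A-x', 'A-y']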
I used the following code to rename each one-hot encoded column to "original name_one-hot encoded name". So for your example it would give A_1, A_2, A_3. Feel free to change the "_" below to "-".
import numpy as np
import pandas as pd

#Create list of columns with "object" dtype
cat_cols = [col for col in df_pro.columns if df_pro[col].dtype == object]
#Find the array of new columns from one-hot encoding
#(ohenc is the fitted OneHotEncoder and cat_arr is its transformed output)
cat_labels = ohenc.categories_
#Convert the array of arrays into one flat list
cat_labels = np.concatenate(cat_labels).ravel().tolist()
#Use a list comprehension to generate the new list of labels
cat_labels_new = [(col + "_" + label) for label in cat_labels for col in cat_cols
                  if label in df_pro[col].values.tolist()]
#Create a new DataFrame of transformed columns using the new labels
cat_ohc = pd.DataFrame(cat_arr, columns=cat_labels_new)
#Concat with the original DataFrame and drop the original "object" columns
df_pro = pd.concat([df_pro.drop(cat_cols, axis=1), cat_ohc], axis=1)
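Depending on your scikit-learn version, the fitted encoder can also generate these labels for you: recent releases expose OneHotEncoder.get_feature_names_out (older ones call it get_feature_names). A minimal sketch, assuming ohenc was fitted on the cat_cols columns:
#Ask the fitted encoder for labels like "A_1", "A_2", ...
new_cols = ohenc.get_feature_names_out(cat_cols)
cat_ohc = pd.DataFrame(cat_arr, columns=new_cols)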

Is PapaParse adding an empty string to the end of its data array?

Papa Parse seems wise, but I think he might be giving me null. I'm just:
Papa.parse(countries);
Here, countries is a string containing the XMLHttpRequest response for the countries CSV file from the timezone database here:
https://timezonedb.com/download
But Papa Parse seems to have added an empty array to the end of its data array. So when I'm searching and sorting through the array, that one empty guy at the end is giving me trouble. I can write around it, but it's not ideal, and I thought Papa Parse was supposed to make these kinds of CSV parsing problems go away. Am I parsing wrong?
You need to use skipEmptyLines: true in the parse config. For example:
Papa.parse(this.csvData, { skipEmptyLines: true })
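Applied to the question's code, a minimal sketch (countries is the CSV string fetched from timezonedb.com):
const result = Papa.parse(countries, { skipEmptyLines: true });
// result.data no longer ends with the empty row produced by the
// trailing newline at the end of the file
console.log(result.data.length);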
It was adding an empty line to my iteration as well. I decided to skip it by stopping the loop one element early:
for (let i = 0; i < data.length - 1; i++) {
    // process data[i]; the empty entry at the end is never visited
}
You can also remove empty entries from a record after parsing. For example, to remove empty values from the header row (note that filter returns a new array):
headers = headers.filter(Boolean);

Excel SUM current column (via Excel::Template)

I'm using Excel::Template to generate a series of Excel files via Perl. However, I need to do a SUM over the current column. I know I can do
=SUM(3:15)
but that gives the sum of ALL columns in rows 3-15. Is there an easier way to do what I'm trying to do?
=SUM(INDIRECT(ADDRESS(<row_start>,COLUMN()) & ":" & ADDRESS(<row_end>,COLUMN())))
gives me exactly what I need; I found it on MrExcel.com. The way it works: ADDRESS(row, COLUMN()) builds a reference to that row in the current column, and INDIRECT turns the concatenated "C3:C15"-style string into a real range.
For column C,
=SUM(C3:C15)
Since =SUM(...) is just a string, you may have to parametrize the column if you don't know it before runtime. For instance:
$str = "=SUM(" . $col_char . "3:" . $col_char . "15)";
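A small sketch of how $col_char might be derived from a numeric column index (my assumption, since the original snippet leaves it undefined; chr arithmetic like this only covers columns A through Z):
my $col_idx  = 2;                                  # hypothetical 0-based index => column 'C'
my $col_char = chr(ord('A') + $col_idx);
my $str      = "=SUM(" . $col_char . "3:" . $col_char . "15)";   # "=SUM(C3:C15)"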