I have a table with 108 columns, and many of the values are NA or blank. I want to get only the rows where every column is filled in with a value; if any column contains NA or a blank, the entire row should be dropped. I know writing a UDF would make the job easier, but is there a Hive query for this? I would be glad to know.
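Since HiveQL has no built-in "all columns filled" predicate, the query is just a long chain of per-column conditions. Rather than typing 108 of them by hand, one option is to generate the clause with a short script. A minimal sketch, assuming PySpark with Hive support is available, that the table is called my_table (hypothetical), and that missing values are NULL, blank, or the literal string 'NA':

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "my_table" is a placeholder; substitute your real table name.
columns = spark.table("my_table").columns  # all 108 column names

# Require every column to be non-NULL, non-blank, and not the literal 'NA'.
predicate = " AND ".join(
    f"({c} IS NOT NULL AND trim(cast({c} AS string)) NOT IN ('', 'NA'))"
    for c in columns
)
clean = spark.sql(f"SELECT * FROM my_table WHERE {predicate}")

The same string could be assembled in any scripting language and pasted into the Hive CLI if Spark is not in the picture.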
I have a dataset containing roughly 50 columns. I want to check the data inputs and show how many blanks, if any, are in each column.
I originally started creating a calculated field per column, using SUM(IF ISNULL([Column]) THEN 1 ELSE 0 END). This works, but it doesn't seem very efficient if I want to do it for many columns.
I tried a pivot table as well; however, I need the data in its original form for other analysis I would like to do.
Would appreciate any help.
Thanks
Fiona
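If the check can happen before the data reaches Tableau, one pass over all ~50 columns takes only a couple of lines of pandas. A minimal sketch, assuming the data can be exported to a CSV called data.csv (hypothetical):

import pandas as pd

# Empty cells are parsed as NaN by default, so a single isna() pass
# counts the blanks in every column at once.
df = pd.read_csv("data.csv")
blanks = df.isna().sum()
print(blanks.sort_values(ascending=False))

The result could then be brought into Tableau as its own small data source rather than as one calculated field per column.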
I have a solution which goes like this:
df1 --> dataframe 1, with 50 columns of data
df2 --> dataframe 2, a footer/trailer with 3 columns of data, like Trailer, count of rows, date
So I added the remaining 47 columns as "", "", "", ... and so on,
so that I can union the two dataframes:
df3 = df1.union(df2)
Now when I save it:
# write everything out as a single CSV part file with a header row
df3.coalesce(1).write.format("com.databricks.spark.csv") \
    .option("header", "true").mode("overwrite") \
    .save(output_blob_path)
So now I am getting the footer as well, like this:
Trailer,400,20210805,"","","","","","","" .. and so on.
Can anyone suggest how to remove these trailing ,"","","" double-quoted fields from the last row? I want to save this file in a blob container. It would be very helpful.
You can try defining the structure of the dataframe so that the entire row is treated as a single column, for both files, and then perform the union. That way you don't need to add extra columns to dataframe 2 and then get stuck in the tricky situation of removing the extra quoted fields after the union.
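A minimal sketch of that idea (df1, df2, and output_blob_path are the names from the question; everything else is an assumption):

from pyspark.sql import functions as F

# Collapse each dataframe to one string column, so the union needs no
# padding columns and the writer has no empty fields to quote.
# Note: concat_ws skips NULLs, so fill them first if field positions matter.
body = df1.select(F.concat_ws(",", *df1.columns).alias("value"))
footer = df2.select(F.concat_ws(",", *df2.columns).alias("value"))

# text() writes no header row, so prepend one the same way if it is needed.
body.union(footer).coalesce(1).write.mode("overwrite").text(output_blob_path)

If you stay with the CSV writer instead, padding df2 with NULLs rather than "" strings (or, in Spark 2.4+, setting the writer's emptyValue option) may also stop the empty fields from being written as "".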
I have a view, A, with 20 columns, which forms my primary data. I have a table, B, which lists some of the columns from A and contains data I want to exclude from A.
For example, table B will have 6 columns, 2 of which are 'customer' and 'country', containing the data 'HP' and 'America'. These columns exist in A, but I want to write a query that brings back data from A except for any rows that have the combination HP and America.
There are 6 columns, and table B can have any combination of rows: anywhere between 1 and all 6 columns of a row could be filled in, or there could be a row which has 5 columns filled in, and another row with a different 5 columns filled in, and so on.
I want to be prepared for any possible combination of the 6 columns, with the query searching A for each combination and excluding any rows with that data from B.
I have tried this:
SELECT *
FROM A T1
WHERE NOT EXISTS
    (SELECT *
     FROM [dbo].[ExcludedItems] T2
     WHERE ReportNumber = 1
       AND (T1.job = ISNULL(T2.job, T1.job)
        AND T1.CustomerName = ISNULL(T2.CustomerName, T1.CustomerName)
        AND T1.COUNTRY = ISNULL(T2.COUNTRY, T1.COUNTRY)
        AND T1.CONTINENT = ISNULL(T2.CONTINENT, T1.CONTINENT)
        AND T1.continer = ISNULL(T2.ContainerName, T1.continer)
        AND T1.UnscheduledJob = ISNULL(T2.unscheduledJob, T1.UnscheduledJob)
        AND T1.[Price] = ISNULL(T2.Price, T1.Price)
        AND T1.[Haulage] = ISNULL(T2.[Haulage], T1.[Haulage])
        AND T1.SiteAdress = ISNULL(T2.SiteAddress, T1.SiteAdress)
        AND T1.Delta = ISNULL(T2.Delta, T1.Delta)
        AND T1.Cost = ISNULL(T2.Cost, T1.Cost)))
The problem is that the result set is not correct. I tried with a smaller column sample and was able to exclude the correct combination of customer and country, but when I introduce a 3rd or 4th column combination I can eyeball the result set and immediately see it's incorrect. I'm not sure if I have to use multiple NOT EXISTS clauses, one for each possible combination; I was hoping not to.
A constraint is that A has to be a view, not a table; otherwise I would have used variables in some manner and wrapped the whole thing in a stored procedure.
I'd appreciate any help. The fallback is to manually add to the code each time an item combination is supplied in B!
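For what it's worth, the intended "empty columns in B act as wildcards" semantics can be sanity-checked outside SQL. A toy pandas illustration, not a fix, with made-up data and only 2 of the 6 columns:

import pandas as pd

# Stand-ins for view A and exclusion table B; names and data are made up.
A = pd.DataFrame({"CustomerName": ["HP", "HP", "Dell"],
                  "COUNTRY": ["America", "India", "America"]})
B = pd.DataFrame({"CustomerName": ["HP"], "COUNTRY": ["America"]})

# A row of A is excluded if some rule row in B matches it on every
# non-null column of that rule (null rule columns match anything).
def matches(row, rule):
    return all(pd.isna(v) or row[c] == v for c, v in rule.items())

keep = A[~A.apply(lambda r: any(matches(r, rule)
                                for _, rule in B.iterrows()), axis=1)]
print(keep)  # HP/India and Dell/America survive; HP/America is excluded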
I have two fields that contain concatenated strings. The first field contains medical codes and the second field contains the descriptions of those codes. I don't want to break these out into multiple fields, because some of them would contain hundreds of splits. Is there any way to break them out into one row per code, like below? The code and description values are separated by a semicolon (;).
code description
----- ------------
80400 description1
80402 description2
A sample of the data:
One way: you can custom split the two columns on ';', which will create a separate column for every entry, and then pivot the code columns and the description columns separately.
One issue is that you can't guarantee every code is mapped to the correct description.
Another way is to export the data to an Excel sheet, split and pivot the columns there, match the codes and descriptions, and then use that Excel file as the data source for Tableau.
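If a scripted reshape is an option, the pairwise split can be done while keeping each code aligned with its description by position. A minimal pandas sketch with made-up column names and data:

import pandas as pd

# Made-up sample mirroring the desired output above.
df = pd.DataFrame({
    "codes": ["80400;80402"],
    "descriptions": ["description1;description2"],
})

# Split both fields on ';' and explode them together (pandas 1.3+),
# so the n-th code stays paired with the n-th description.
out = df.assign(code=df["codes"].str.split(";"),
                description=df["descriptions"].str.split(";"))
out = out.explode(["code", "description"])[["code", "description"]]
print(out)

This assumes every row has the same number of codes and descriptions; rows where the counts differ are exactly the mapping problem mentioned above.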
I'm working in Pentaho 4.4.1-GA (Kettle / PDI). The database is Postgres.
I need to be able to insert multiple records into a fact table based on the fields that come from a single record. The single record contains fields:
productcode1, price1
productcode2, price2
productcode3, price3
...
productcode10,price10
So if there was a value for each of the 10 productcode/price pairs, then I'd need to insert a total of 10 records into the fact table. If there were values for 4 of the combinations, then I'd need to insert 4 records into the fact table, and so on. All field values for the fact records would be identical except for the PK (generated by a sequence), the product codes, and the prices.
I figure that I need some type of looping construct which would let me check whether or not a value is present for each productcodeN field and, if so, do an insert/update step on the fact table with the desired field values. I'm just not sure how to do this in Pentaho.
Any ideas? All suggestions are welcome :)
Thank You,
Rakesh
Could you give a sample input and output for your scenario?
From your example data I can infer that if there are 10 different product codes and only 4 product prices, you want to have 4 records inserted into your table. Is that so?
Well, for a start you can add a constant value of 1 to those records by filtering for NOT NULL and then use a Group By step to count the number of 1s; this would give you the count. By the way, it would be helpful if you could provide more details on which columns you would be loading, as there are ways to make a PDI transformation execute multiple times.
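In PDI itself, this columns-to-rows conversion is what the Row Normaliser step is typically used for, followed by a filter on empty product codes. Purely to illustrate the target shape, here is a pandas sketch with a made-up orderid key and three of the ten pairs:

import pandas as pd

# One wide input record; only two of the pairs are filled in.
wide = pd.DataFrame([{
    "orderid": 1,
    "productcode1": "A", "price1": 10.0,
    "productcode2": "B", "price2": 12.5,
    "productcode3": None, "price3": None,
}])

# Unpivot the productcodeN/priceN pairs into one row per pair...
long = pd.wide_to_long(wide, stubnames=["productcode", "price"],
                       i="orderid", j="slot").reset_index()
# ...then keep only the pairs that actually have a product code.
long = long[long["productcode"].notna()]
print(long[["orderid", "productcode", "price"]])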