I have some corrupted rows in my large CSV file where data values are shifted because of missing line breaks, so values end up under the wrong column header. For example, if three columns exist in my table, , , , after corruption, I start to see values like , , .
Is there a way to drop all rows where, for example, I see a non-integer value in a column that I know should be an Int?
What you can do is loop through the lines and filter out every line where line.split(",").length doesn't equal the number of columns you expect. Something like this:
import scala.io.Source
val n = 5 // the number of columns you expect
// keep only rows that split into exactly n fields (limit -1 keeps trailing empty fields)
Source.fromFile(input_file).getLines().toSeq.map(_.split(",", -1)).filter(_.length == n)
This should do what you want :)
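The answer above only checks the number of fields. A minimal Python sketch that also applies the type check from the question might look like the following. The function name keep_valid_rows, its parameters, and the sample data are made up for illustration; it assumes the file is plain comma-separated text that the standard csv module can read.

```python
import csv
import io

def keep_valid_rows(lines, n_columns, int_columns):
    """Yield only rows with n_columns fields whose int_columns parse as integers."""
    for row in csv.reader(lines):
        if len(row) != n_columns:
            continue  # shifted or merged row: wrong number of fields
        try:
            for i in int_columns:
                int(row[i])
        except ValueError:
            continue  # a value that should be an integer is not one
        yield row

# Made-up sample: the third line lost a line break, the fourth has text in column 0.
sample = io.StringIO("1,apple,3\n2,pear,5\n3,plum,7,4,kiwi,9\nx,fig,2\n")
clean = list(keep_valid_rows(sample, n_columns=3, int_columns=[0, 2]))
print(clean)
```

Because the function is a generator, passing an open file object instead of the StringIO should let a large file be filtered without loading it into memory all at once.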
I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns that consist entirely of missing values, and I want to drop them. I have been looking up ways to retrieve the number of missing values in each column, but the results are displayed as a table instead of giving me the total number of null values as a plain number.
The following code shows the number of missing values in a column but displays it in a table format:
from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()
I have tried the following:
This one does not work as intended, since it doesn't drop any columns:
for c in data.columns:
    if(data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count()):
        data = data.drop(c)
data.show()
This one I am currently trying but takes ages to execute
for c in data.columns:
    if(data.filter(data[c].isNull()).count() == data.count()):
        data = data.drop(c)
data.show()
Is there a way to get ONLY the number? Thanks
If you need the number itself instead of a printed table, use .collect() instead of .show():
list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()
What you get is a list of Row objects containing all the information in the table; here the count itself is list_of_values[0][0].
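The loop in the question starts at least one Spark job per column, which is probably why it takes so long. A possible alternative, sketched below, collects the non-null count of every column in a single aggregation and then drops the columns whose count is zero. The helper names all_null_columns and drop_all_null_columns and the example column names are hypothetical. Note that this sketch only treats null as missing, not NaN, because count() skips nulls but not NaN values.

```python
def all_null_columns(non_null_counts):
    """Return the column names whose non-null count is zero."""
    return [name for name, n in non_null_counts.items() if n == 0]

def drop_all_null_columns(data):
    """Drop every column of a Spark DataFrame that contains only nulls."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql.functions import col, count

    # count(column) counts only non-null values, so one aggregation covers all columns.
    row = data.select([count(col(c)).alias(c) for c in data.columns]).collect()[0]
    return data.drop(*all_null_columns(row.asDict()))

# The decision step on its own, with made-up counts:
print(all_null_columns({"code": 120, "allergens_en": 0, "brands": 87}))
```

Usage would be data = drop_all_null_columns(data), followed by data.show().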
I'm using MATLAB to read in COVID-19 data provided by Johns Hopkins as a .csv file using urlread, but I'm not sure how to use textscan in the next step to convert the string into a table. The first two columns of the .csv file are strings specifying the region, followed by a large number of columns containing the registered number of infections by date.
Currently, I just save the string returned by urlread locally and open this file with importdata afterwards, but surely there should be a more elegant solution.
You have mixed up two things: either you want to read from the downloaded CSV file using textscan (and fopen and fclose, of course), or you want to use urlread (or rather webread, as MATLAB recommends not using urlread anymore). I go with the latter, since I have never done this myself ^^
So, first we read in the data and split it into rows
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv";
% read raw data as single character array
web = webread(url);
% split the array into a cell array representing each row of the table
row = strsplit(web,'\n');
Then we allocate a table (pre-allocation is good practice in MATLAB because arrays are stored in contiguous memory, so tell MATLAB beforehand how much space you need):
len = length(row);
% get the CSV-header as information about the number of columns
Head = strsplit(row{1},',');
% allocate table
S = strings(len,2);
N = NaN(len,length(Head)-2);
T = [table(strings(len,1),strings(len,1),'VariableNames',Head(1:2)),...
repmat(table(NaN(len,1)),1,length(Head)-2)];
% rename columns of table
T.Properties.VariableNames = Head;
Note that I used a little trick to allocate so many separate columns of NaNs by repeating a single table. However, concatenating this table with the table of strings is difficult, as both contain the column names var1 and var2. That is why I renamed the columns of the first table right away.
Now we can actually fill the table (which is a bit nasty thanks to whoever thought it was a good idea to write "Korea, South" into a comma-separated file):
for i = 2:len
    % split this row into columns
    col = strsplit(row{i},',');
    % quick conversion
    num = str2double(col);
    % keep strings where the result is NaN
    lg = isnan(num);
    str = cellfun(@string,col(lg));
    T{i,1} = str(1);
    T{i,2} = strjoin(str(2:end)); % this is a nasty workaround necessary due to "Korea, South"
    T{i,3:end} = num(~lg);
end
This should also work for the days to come. Let me know what you are actually going to do with the data.
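For comparison, here is a minimal Python sketch of the same parsing step. It assumes that fields containing a comma, such as "Korea, South", are wrapped in double quotes in the file, in which case a CSV parser keeps them as one field and no workaround is needed. The function name parse_table and the two-row excerpt are made up for illustration and are not real data.

```python
import csv
import io

# Made-up excerpt in the same layout: two text columns, then numeric columns.
text = (
    "Province/State,Country/Region,1/22/20,1/23/20\n"
    ',"Korea, South",1,1\n'
    "Hubei,China,444,444\n"
)

def parse_table(text, n_text_columns=2):
    """Split CSV text into a header and (labels, numbers) rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines, e.g. a trailing newline
        labels = fields[:n_text_columns]
        # empty numeric cells become NaN instead of raising an error
        numbers = [float(x) if x else float("nan") for x in fields[n_text_columns:]]
        rows.append((labels, numbers))
    return header, rows

header, rows = parse_table(text)
print(rows[0])
```

Unlike splitting on every comma, this also keeps an empty first column in place, so the region name never shifts into the wrong position.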
I have a dimension I am showing in a text table that can have one of 3 possibilities "A", "B", or "C" and I want at all times to have A, B and C shown in a text table even if one of them has 0 occurrences. The issue is that I am filtering this based on date, so it is possible that for example B may not exist, but I still want to have a 0 printed for B.
I have gone to Analysis -> Table layout -> show empty rows which will show "B", but in the count display it shows a blank. How can I get it to display a 0?
This problem is well known among Tableau users, and I have not yet seen a generic Tableau-only solution. All proper solutions start with injecting rows into your data, which I assume you do not want.
The method below only works if you have a date dimension in the view and the dates without data are not completely filtered out; you will then see zeros even for dates that have no data, as shown in the screenshot below.
If you filter out the dates without data, you will unfortunately keep seeing NULLs.
If you are using the SUM of Number of Records as your occurrences, then you may create a calculated field as below and use it in your pane:
ZN(LOOKUP(SUM([Number of Records]),0))
You can leave the default table calculation as Automatic so the results are computed along Table (across).
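To make the intended result concrete outside Tableau, here is a small Python sketch of what the zero-filling amounts to: every expected category gets a count, and a category with no rows gets 0 instead of a blank. The function name counts_with_zeros and the sample values are made up; this is not Tableau code, only an illustration of the output the calculated field is meant to produce.

```python
from collections import Counter

def counts_with_zeros(values, categories=("A", "B", "C")):
    """Count occurrences per category, reporting 0 for categories that never occur."""
    observed = Counter(values)
    return {c: observed.get(c, 0) for c in categories}

# After a date filter, "B" may not occur at all:
print(counts_with_zeros(["A", "C", "A"]))
```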
I have two columns of data. Some of the data in the first column repeats (they represent questions). The data in the second column is unique (they represent multiple answers to the same question).
I need to merge all the data in the second column for each unique value in the first column. e.g.:
Q,A
1,yes.
1,is possible.
2,no.
2,not possible.
2,cannot do this.
2,impossible.
3,maybe.
merged to:
Q,A
1,yes.is possible.
2,no.not possible.cannot do this.impossible.
3,maybe.
Something like this is crude but may be adequate:
=IF(A1=A2,C1&B2,B2)
copied down to suit. Then select the last entry (identifiable with something like =A1=A2 copied down to suit) for each Question number.
Questions in column A sorted in order
Answers in column B
In C1 use =B1
In C2 use =IF(A2=A1,C1&B2,B2)
Drag down formula in C2.
It will keep adding the lines together as long as the question remains the same. When it gets to a new question, it'll start a new string. The last time each question is listed will be the complete string in column C.
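The same running concatenation can be sketched in Python, assuming the rows are already sorted by question as in the spreadsheet approach. The function name merge_answers is made up; the rows are the example from the question.

```python
from itertools import groupby

def merge_answers(rows):
    """Merge the answers of consecutive rows that share the same question.

    rows: (question, answer) pairs, already sorted by question.
    """
    return [
        (question, "".join(answer for _, answer in group))
        for question, group in groupby(rows, key=lambda r: r[0])
    ]

rows = [
    ("1", "yes."), ("1", "is possible."),
    ("2", "no."), ("2", "not possible."), ("2", "cannot do this."), ("2", "impossible."),
    ("3", "maybe."),
]
merged = merge_answers(rows)
for question, answers in merged:
    print(question + "," + answers)
```

groupby only merges adjacent rows, which mirrors the formula: if the questions are not sorted, the same question can appear in several groups.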
Create a 2 column project in Google Refine
Sort by Q column (if not already sorted) and make sort permanent
Blank Down on Q column to remove duplicate values
On A column, do Edit Cells -> Merge multi-valued cells
I need help, but it's a little hard for me to express this correctly in English, so please be patient with me.
I've got a cell array which for example has 10 rows and 10 columns.
I fill each row of the cell array in a for loop, and it is possible that a row won't get any values. The result is that, for example, rows 2 and 4 exist but there is no third row:
t{2,1},...,t{2,10} exist
t{4,1},...,t{4,10} exist
but there is no t{3,1},...,t{3,10}
Now I want to check whether the third row exists or not.
I tried:
if t{3,1}
but it did not work, and there is no function like:
if exists(t{3,1})
What should I do?
t{3,1} does exist, it's just empty. Therefore what you need is something along the lines of:
if ~isempty(t{3,1})