I am trying to sort the published column by date, so I can perform text mining on specific months.
The two CSV files I am working with can be found on Kaggle. Kaggle link
I have tried to make the date arrangeable by the following code:
Guardians_Russia_Ukraine <- read_csv("news_conflict/Guardians_Russia_Ukraine.csv",
col_types = cols(published = col_character()))
Guardians_Russia_Ukraine %>% mutate(published1 = parse_date(published))
The new published1 column comes back as NA for every row, so it is not usable.
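parse_date() returns NA whenever the strings are not already in the default ISO yyyy-mm-dd form, so it usually needs an explicit format. A minimal sketch, assuming the published values look something like "24 Feb 2022" (inspect head(Guardians_Russia_Ukraine$published) first and adjust the format string to the real pattern; if the values also contain a time, parse_datetime() with a matching format is the one to use):
library(readr)
library(dplyr)

Guardians_Russia_Ukraine <- Guardians_Russia_Ukraine %>%
  mutate(published1 = parse_date(published, format = "%d %b %Y"))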
I have uploaded the file here. These are some lines from my txt file:
RSN1146_KOCAELI_AFY000 1.345178e-02
RSN1146_KOCAELI_AFY090 1.493577e-02
RSN1146_KOCAELI_AFYDWN 5.350641e-03
RSN4003_SANSIMEO_25862-UP 4.869095e-03
RSN4003_SANSIMEO_25862090 1.199087e-02
RSN4003_SANSIMEO_25862360 1.181286e-02
I would like to remove the rows whose names end in DWN (the 3rd line) or -UP (the 4th line), so the data will only have:
RSN1146_KOCAELI_AFY000 1.345178e-02
RSN1146_KOCAELI_AFY090 1.493577e-02
RSN4003_SANSIMEO_25862090 1.199087e-02
RSN4003_SANSIMEO_25862360 1.181286e-02
Then, I want to obtain the maximum value for RSN1146 & RSN4003.
I tried to read the file with the code below:
Data=fopen('maxPGA.txt','r');
readfile=fscanf(Data,'%c %s')
This is odd: I cannot perform any further analysis because the data are not imported as two columns in MATLAB. Any solution for this?
I tried:
Data= importdata('maxPGA.txt')
as well, but in this case the data end up split into two separate groups (the names and the numbers).
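For what it's worth, fscanf flattens everything into one character stream, which is why the two columns never come apart. textscan keeps the name and the value separate; a minimal sketch, assuming the file is whitespace-separated exactly as shown above and a fairly recent MATLAB (R2016b or later, for endsWith/extractBefore):
fid = fopen('maxPGA.txt', 'r');
C = textscan(fid, '%s %f');          % two whitespace-separated columns
fclose(fid);
names = C{1};
pga   = C{2};

% Drop the entries whose name ends in DWN or -UP
keep  = ~(endsWith(names, 'DWN') | endsWith(names, '-UP'));
names = names(keep);
pga   = pga(keep);

% Maximum value per RSN prefix (the text before the first underscore)
rsn = extractBefore(names, '_');
[g, rsnNames] = findgroups(rsn);
maxPGA = splitapply(@max, pga, g);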
Here is my question; I hope someone can help me figure it out.
To explain: there are more than 10 categorical columns in my data set, and each of them has 200-300 categories. I want to convert them into binary values. For that, I first used LabelEncoder to convert the string categories into numbers. The LabelEncoder code and its output are shown below.
After LabelEncoder, I used OneHotEncoder from scikit-learn and it worked. But the problem is that I need the column names after one-hot encoding. For example, column A has these categorical values before encoding:
A = [1,2,3,4,..]
After encoding, it should look like this:
A-1, A-2, A-3
Does anyone know how to assign column names of the form (old column name - value name or number) after one-hot encoding? Here is my one-hot encoding and its output;
I need named columns because I trained an ANN, and every time new data comes in I cannot re-encode all the past data again and again; I just want to add the new rows each time. Thanks anyway.
As @Vivek Kumar mentioned, you can use the pandas function get_dummies() instead of OneHotEncoder. I wanted to preserve a version of my initial DataFrame, so I did the following:
import pandas as pd
DataFrame2 = pd.get_dummies(DataFrame)
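Regarding the column names specifically, get_dummies already prefixes each dummy column with the original column name; a small illustration with placeholder values (prefix_sep controls whether you get A_1 or A-1 style names):
import pandas as pd

df = pd.DataFrame({"A": ["1", "2", "3"]})
pd.get_dummies(df, columns=["A"], prefix_sep="-")   # columns: A-1, A-2, A-3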
I used the following code to rename each one-hot encoded column to "original name_one-hot encoded name". For your example it would give A_1, A_2, A_3. Feel free to change the "_" below to "-".
import numpy as np
import pandas as pd

# Assumes ohenc is the fitted OneHotEncoder and cat_arr is its (dense) transform
# of the categorical columns of df_pro
# Create list of columns with "object" dtype
cat_cols = [col for col in df_pro.columns if df_pro[col].dtype == object]
# Find the array of new columns from one-hot encoding
cat_labels = ohenc.categories_
# Convert the per-column arrays into one flat list
cat_labels = np.concatenate(cat_labels).ravel().tolist()
# Use a list comprehension to generate the new "column_value" labels
cat_labels_new = [(col + "_" + label) for label in cat_labels for col in cat_cols
                  if label in df_pro[col].values.tolist()]
# Create a new DataFrame of the transformed columns using the new labels
cat_ohc = pd.DataFrame(cat_arr, columns=cat_labels_new, index=df_pro.index)
# Concat with the original DataFrame and drop the original columns (only the "object" dtype ones)
df_out = pd.concat([df_pro.drop(columns=cat_cols), cat_ohc], axis=1)
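On a recent scikit-learn (1.0 or later), the fitted encoder can also produce those labels directly, which avoids the manual list building. A sketch, assuming ohenc was fit on df_pro[cat_cols] and cat_arr is its dense output:
# Assumes scikit-learn >= 1.0
cat_labels_new = ohenc.get_feature_names_out(cat_cols).tolist()   # e.g. ['A_1', 'A_2', ...]
cat_ohc = pd.DataFrame(cat_arr, columns=cat_labels_new, index=df_pro.index)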
I am quite new to Scala / Spark and have been thrown into the deep end. I have been trying for several weeks to find a solution to a seemingly simple problem on Scala 2.11.8, but have been unable to find a good one.
I have a large database in CSV format, close to 150 GB, with plenty of null values, which needs to be reduced and cleaned based on the values of individual columns.
The schema of the original CSV file is as follows:
Column 1: Double
Column 2: Integer
Column 3: Double
Column 4: Double
Column 5: Integer
Column 6: Double
Column 7: Integer
So, I want to conditionally map through all the rows of the CSV file and export the results to another CSV file with the following conditions for each row:
If the value for column 4 is not null, then the values for columns 4, 5, 6 and 7 of that row should be stored as an array called lastValuesOf4to7. (In the dataset if the element in column 4 is not null, then columns 1, 2 and 3 are null and can be ignored)
If the value of column 3 is not null, then the values of columns 1, 2 and 3 and the four elements from the lastValuesOf4to7 array, as described above, should be exported as a new row into another CSV file called condensed.csv. (In the dataset if the element in column 3 is not null, then columns 4, 5, 6 & 7 are null and can be ignored)
So in the end I should get a csv file called condensed.csv, which has 7 columns.
I have tried using the following code in Scala but have not been able to progress further:
import scala.io.Source

object structuringData {
  def main(args: Array[String]) {
    val data = Source.fromFile("/path/to/file.csv")
    var lastValuesOf4to7 = Array("0", "0", "0", "0")
    val lines = data.getLines // Get the lines of the file
    val splitLine = lines.map(s => s.split(',')).toArray // This gives an out of memory error since the original file is huge.
    data.close
  }
}
As you can see from the code above, I have tried to move the data into an array but have been unable to progress further, since I cannot process each line individually.
I am quite certain that there must be a straightforward solution to processing CSV files in Scala / Spark.
Use the spark-csv package, then use a SQL query to query the data, apply the filters for your use case, and export the result at the end.
If you are using Spark 2.0.0, CSV support is already built into spark-sql; if you are using an older version, add the spark-csv dependency accordingly.
You can find a link to spark-csv here.
You can also look at the example here: http://blog.madhukaraphatak.com/analysing-csv-data-in-spark/
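For reference, the read-filter-export pattern described above looks roughly like this in Spark 2.x (a sketch only: the column names _c0 to _c6 are the defaults Spark assigns to a header-less CSV, and the row-order-dependent carry-forward of lastValuesOf4to7 would additionally need an explicit ordering column, for example a line number, plus a window function such as last(..., ignoreNulls = true)):
import org.apache.spark.sql.SparkSession

object StructuringData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("structuringData").getOrCreate()

    // Read the header-less CSV; the columns come in as _c0 ... _c6
    val df = spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv("/path/to/file.csv")

    // Filter with plain SQL, e.g. the rows where column 3 (_c2) is not null
    df.createOrReplaceTempView("raw")
    val withCol3 = spark.sql("SELECT _c0, _c1, _c2 FROM raw WHERE _c2 IS NOT NULL")

    // Export the result (Spark writes a directory of part files)
    withCol3.write.option("header", "false").csv("/path/to/condensed")

    spark.stop()
  }
}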
Thank you for the response. I managed to create a solution myself using a Bash script. I had to start with a blank condensed.csv file first. My code shows how easy it was to achieve this:
#!/bin/bash
OLDIFS=$IFS
IFS=","
last1=0
last2=0
last3=0
last4=0
while read f1 f2 f3 f4 f5 f6 f7
do
  if [[ $f4 != "" ]]; then
    last1=$f4
    last2=$f5
    last3=$f6
    last4=$f7
  elif [[ $f3 != "" ]]; then
    echo "$f1,$f2,$f3,$last1,$last2,$last3,$last4" >> path/to/condensed.csv
  fi
done < $1
IFS=$OLDIFS
If the script is saved with the name extractcsv.sh then it should be run using the following format:
$ ./extractcsv.sh path/to/original/file.csv
This only confirms my observation that ETL is easier in Bash than in Scala. Thank you for your help, though.
I'm trying to import data from a text file into the workspace by using the readtable function.
The text file structure is pretty simple, composed of 4 columns of types date, time, integer and float respectively, as shown in the following minimal example:
2013-07-07 05:15:19 8 213.0
2013-07-07 05:15:19 11 109.0
2013-07-07 05:15:20 14 33.5
2013-07-07 05:15:24 56 182.0
When I try to load the data like this:
data = readtable(filename,...
'Format','%{yyyy-MM-dd}D %{HH:mm:ss}D %d %f %*[^\n]',...
'ReadVariableNames',false);
I get the following error:
Error using textscan
Badly formed format string.
Error in table/readTextFile (line 160)
raw = textscan(fid,format,'delimiter',delimiter,'whitespace',whiteSpace, ...
Error in table.readFromFile (line 41)
t = table.readTextFile(filename,otherArgs);
Error in readtable (line 114)
t = table.readFromFile(filename,varargin);
If I try this instead:
data = readtable(filename,...
'Format','%{yyyy-MM-dd}D%{HH:mm:ss}D%d%f%*[^\n]',...
'Delimiter',' ',...
'ReadVariableNames',false);
I get exactly the same error.
I've checked the MathWorks online documentation, but I was unable to find a solution to my problem.
EDIT: Actually, the desired table format would be to have a datetime column replacing the date and time columns. What I'm doing is joining date and time manually after reading the table. If you know a way to import the table merging those 2 variables straight away, that would be great.
To start, if you read your data like this:
data = readtable('data.txt','Delimiter',' ','ReadVariableNames',false)
you will get an N-by-4 table that you can manipulate as much as you like.
You can read about how to work with data imported as a table here.
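Regarding the edit about merging the date and time columns: after the plain readtable call above, the two text columns can be combined into a single datetime column. A sketch, assuming they come in as Var1 and Var2 and are read as text (if your release auto-detects them as datetime and duration, data.Var1 + data.Var2 does the same job):
data = readtable('data.txt', 'Delimiter', ' ', 'ReadVariableNames', false);
data.Timestamp = datetime(strcat(data.Var1, {' '}, data.Var2), ...
                          'InputFormat', 'yyyy-MM-dd HH:mm:ss');
data.Var1 = [];   % drop the separate date column
data.Var2 = [];   % drop the separate time column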
I have been stuck on this for a few days. I have a folder with hundreds of shapefiles. I want to add an attribute field to each shapefile giving the date from the shapefile's name. The shapefile name includes the Landsat path/row, year, and Julian date (e.g. 1800742003032.shp). I want just the date '2003032' to be added under a "Date" field.
Here's what I have so far:
arcpy.env.workspace = r"C:\Users\mkelly\Documents\Namibia\Raster_Water\1993\Polygons"
for fc in arcpy.ListFeatureClasses("*", "ALL"):
    print str("processing" + fc)
    field = "DATE"
    expression = str(fc)[6:13]
    arcpy.AddField_management(fc, field, "TEXT")
    arcpy.CalculateField_management(fc, field, "expression", "PYTHON")
Results:
processing1800742003032.shp
processing1800742009136.shp
processing1820732010289.shp
end Processing...
It runs perfectly (on a sample of 3 shapefiles), but the problem is that when I open the shapefiles in ArcMap, they all have the same date. The results show that it processed each of the 3 shapefiles, and the AddField_management call must have worked because all of the fields are populated. So there is an issue with either the expression or the CalculateField command.
How can I get it to populate the specific date for each shapefile, and not just have all of them be '2003032'? There are no error messages.
Thanks in advance!
I figured it out! For CalculateField_management, the expression should not be in quotes. It should be: arcpy.CalculateField_management(fc, field, expression, "PYTHON")
This post may have been a waste of time, but at least maybe it will help someone with a similar problem in the future.
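For completeness, a sketch of the loop with that fix applied:
import arcpy

arcpy.env.workspace = r"C:\Users\mkelly\Documents\Namibia\Raster_Water\1993\Polygons"

for fc in arcpy.ListFeatureClasses("*", "ALL"):
    print("processing " + fc)
    field = "DATE"
    expression = str(fc)[6:13]          # e.g. "2003032"
    arcpy.AddField_management(fc, field, "TEXT")
    # Pass the variable itself, not the literal string "expression"; a value
    # containing letters would need to be quoted first, e.g.
    # expression = "'{0}'".format(str(fc)[6:13])
    arcpy.CalculateField_management(fc, field, expression, "PYTHON")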