How to append sorted data to a new csv file - python-3.7

"i want to sort data in interval and save it to new csv file with same old file indexes,please help"
b = 0
i = 0
sorted_data = []
for i in range(0, int(max_ht), interval):
    s = df[(df[' Height'] > b) & (df[' Height'] <= i)]
    b = i
    sorted_data.append(s)
sorted_data = pd.DataFrame(sorted_data)
sorted_data.to_csv(" sorted_data")
I got this warning:
/home/Ami/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/range.py:465: RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
  return self._int64index.union(other)
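One possible fix, sketched below on the assumption that df, max_ht and interval are defined as in the question and that the column really is named ' Height' (with the leading space): coerce the column to numbers so int/str comparisons cannot occur, and combine the per-interval slices with pd.concat, which keeps each row's original index (pd.DataFrame on a list of DataFrames does not do what you want here). The output file name "sorted_data.csv" is just an example.

import pandas as pd

# Make sure the column is numeric; non-numeric values become NaN instead of
# triggering "'<' not supported between instances of 'int' and 'str'".
df[' Height'] = pd.to_numeric(df[' Height'], errors='coerce')

binned = []
lower = 0
for upper in range(interval, int(max_ht) + interval, interval):
    # rows whose height falls in the interval (lower, upper]
    binned.append(df[(df[' Height'] > lower) & (df[' Height'] <= upper)])
    lower = upper

# pd.concat preserves the original row indexes, so the new csv keeps the old file's indexes.
pd.concat(binned).to_csv("sorted_data.csv")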


Convert csv file to map

I have a csv file containing a list of abbreviations and their full values such that the file looks like the below
original,mappedValue
bbc,britishBroadcastingCorporation
ch4,channel4
I want to convert this csv file into a Map such that it is of the form
val x:Map[String,String] = Map("bbc"->"britishBroadcastingCorporation", "ch4"->"channel4")
I have tried using the below:
Source.fromFile("pathToFile.csv").getLines().drop(1).map(_.split(","))
but this leaves me with an Iterator[Array[String]]
You are close: split gives you an array. You have to convert each line into a tuple and then build the map:
Source.fromFile("/home/agr/file.csv").getLines().drop(1).map(csv=> (csv.split(",")(0),csv.split(",")(1))).toMap
res4: scala.collection.immutable.Map[String,String] = Map(bbc -> britishBroadcastingCorporation, ch4 -> channel4)
In real life, you would check for the existence of bad rows, filtering out array splits whose length is less than 2, or perhaps putting those into another bin as bad data, etc.

Group word based on length using pyspark

I would like to group the data based on the length using pyspark.
a = sc.parallelize(("number", "algebra", "int", "str", "raj"))
Expected output is in the form
(("int","str","raj"),("number"),("algebra"))
a = sc.parallelize(("number", "algebra", "int", "str", "raj"))
a.collect()
['number', 'algebra', 'int', 'str', 'raj']
Now, do the following steps to get the final output:
# Create a tuple of (length of the word, the word itself).
a = a.map(lambda x: (len(x), x))
# Group by key (the word length), then keep only the grouped words.
a = a.groupByKey().mapValues(lambda x: list(x)).map(lambda x: x[1])
a.collect()
[['int', 'str', 'raj'], ['number'], ['algebra']]
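Note that groupByKey makes no promise about the order in which the groups come back; if the output must be ordered by word length, as in the expected output above, a sortByKey step can be added. A minimal sketch, assuming the same SparkContext sc as above (result is just a local name):

a = sc.parallelize(("number", "algebra", "int", "str", "raj"))

# (length, word) pairs -> group by length -> order groups by length -> keep only the words
result = (a.map(lambda x: (len(x), x))
           .groupByKey()
           .sortByKey()
           .map(lambda kv: list(kv[1])))

result.collect()
# [['int', 'str', 'raj'], ['number'], ['algebra']]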

To split data into good and bad rows and write to output file using Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write them to separate files in HDFS.
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I checked the first record, the value could be seen in the spark shell:
good.first()
But when I try to write to an output file, I see the records below.
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], which leads to the array objects being written instead of the string values.
You should convert each array of strings back to a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")

stepwise regression: Undefined function 'stepwiselm' for input arguments of type 'cell'

I have a .txt file which I converted first to a table Ta (Ta = readtable('xxx.txt')) and then to an array Aa (Aa = table2array(Ta)). The .txt file contains 220 rows and 12 columns, but the table and the array only have 219 rows and 1 column. Where did I go wrong?
Then, when I tried to do stepwise regression, I got the error message: Undefined function 'stepwiselm' for input arguments of type 'cell'.
My code was: mdl = stepwiselm(Aa)
In the .txt file, the first row contains text, e.g. elevation, height, yields, etc. I thought I could use these names to define the predictor variables and the response variable. But since these names are lost in Aa, how should I write the code for stepwise regression? Thanks!
Try the following
delim = ' ';
nrhdr = 1;
A = importdata('A-100spreg2-raa06a.txt', delim, nrhdr);
A.data will be your data, A.textdata your header. A ".txt" does not contain columns, so you need to specify a delimiter (I assumed a space). You can then use your A.data in your stepwise function.
As you indicated, you want column 10 as y, and I assume the others as X, so use:
stepwise(A.data(:,1:9),A.data(:,10))
I wouldn't use the headers for anything other than creating labels in figures.

Perl Parsing Log/Storing Results/Reading Results

A while back I created a log parser. The logs can range from several thousand lines up to millions of lines. I store the parsed entries in an array of hash refs.
I am looking for suggestions on how to store my output, so that I can quickly read it back in if the script is run again (this prevents the need to re-parse the log).
The end goal is to have a web interface that will allow users to create queries (basically treating the parsed output like it existed within a database).
I have already considered writing the output of Data::Dumper to a file.
Here is an example array entry printed with Data::Dumper:
$VAR =
{
    'weekday' => 'Sun',
    'index' => 26417,
    'timestamp' => '1316326961',
    'text' => 'sys1 NSP
Test.cpp 1000
This is a example error message.
',
    'errname' => 'EM_TEST',
    'time' => {
        'array' => [
            2011,
            9,
            18,
            '06',
            22,
            41
        ],
        'stamp' => '20110918062241',
        'whole' => '06:22:41',
        'hour' => '06',
        'sec' => 41,
        'min' => 22
    },
    'month' => 'Sep',
    'errno' => '2261703',
    'dayofmonth' => 18,
    'unknown2' => '1',
    'unknown3' => '1',
    'year' => 2011,
    'unknown1' => '0',
    'line' => 219154
},
Is there a more efficient way of accomplishing my goal?
If your output is an object (or if you want to make it into an object), then you can use KiokuDB (along with a database back end of your choice). If not, then you can use Storable. Of course, if your data structure essentially mimics a CSV file, then you can just write the output to file. Or you can output the data into a JSON object that you can store in a file. Or you can forgo the middleman and simply use a database.
You mentioned that your data structure is an "array of hashes" (presumably you mean an array of hash references). If the keys of each hash reference are the same, then you can store this in CSV.
You're unlikely to get a specific answer without being more specific about your data.
Edit: Now that you've posted some sample data, you can simply write this to a CSV file or a database with the values for index,timestamp,text,errname,errno,unknown1,unknown2,unknown3, and line.
use Storable;
# ... fill %hash with the parsed entries ...
store \%hash, 'file';            # serialize the hash to disk
%hash = ();                      # clear it, just to show the round trip
%hash = %{ retrieve('file') };   # read it back in on the next run
# ... use %hash as before ...
You can always use KiokuDB, Storable or what have you, but if you are planning to do aggregation, a relational database (or some other data store that supports queries) may be the best solution in the longer run. A lightweight data store with an SQL engine like SQLite, which doesn't require running a database server, could be a good starting point.