Predicting answers from data in a CSV file when user asks a question - neural-network

I have a CSV file with columns on crop production.
Columns of the CSV file
If a user asks a question like
"What should I produce in Maharashtra in 2019?"
the user should get an answer based on the historical data in the CSV.
Can someone help me with the implementation?

For a good answer (training), you probably need the amount of crop requested or sold rather than the amount produced.
A first implementation could return the average of past years' production, possibly weighting the most recent years more heavily.
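A minimal sketch of that first implementation in Python with pandas, assuming (hypothetically) the CSV has State, Crop, Year and Production columns; the file and column names are placeholders, not the asker's actual schema:

import pandas as pd

# Hypothetical column names: State, Crop, Year, Production.
df = pd.read_csv("crop_production.csv")

def recommend_crop(state, year, lookback=5):
    # History for the state, restricted to the `lookback` years before `year`.
    hist = df[(df["State"] == state)
              & (df["Year"] < year)
              & (df["Year"] >= year - lookback)]
    # Linear weights: the most recent year counts `lookback` times as much as the oldest.
    weights = hist["Year"] - (year - lookback) + 1
    score = (hist["Production"] * weights).groupby(hist["Crop"]).sum() \
            / weights.groupby(hist["Crop"]).sum()
    return score.idxmax()

print(recommend_crop("Maharashtra", 2019))

This is only the naive baseline the answer describes; a trained model (ideally on demand or sales data) would replace recommend_crop once that data is available.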

Related

Best solution for weather data warehouse netcdf or grib [closed]

Problem:
Looking for the best solution to store a large amount of weather data and make it easily available to a team of machine learning specialists.
Initially I'm fetching data from cds.climate.copernicus.eu in netCDF or GRIB format. There will be around 10-20 TB in GRIB or netCDF.
Requirements:
ML specialists can easily query data for a given location (point, polygon) in a given time range.
Results are returned in reasonable time.
Ideas:
Postgres. I thought that maybe pg could handle that amount of data, but the problem I encountered is that loading the data into Postgres would take ages, and it would need much more space than 10-20 TB (because I planned to store it in a row-like format with two tables, Point and WeatherMeasurement). Is it a good idea? Does anyone have experience with this kind of data in pg?
Amazon Redshift. Would it be a good approach to use this for weather data? How do I load netCDF or GRIB into it? I have zero experience with cloud solutions like this.
Files. Just store the data in the GRIB or netCDF files and write a simplified Python interface to fetch data from them. But will the queries be fast enough? Does anyone have experience with this?
For data of this size that you want to sub-select quickly along multiple dimensions, I'd lean toward Redshift. You will want to pay attention to how you want to query the data and design the data model to provide the fastest access for the needed subsets. You may want to get some help setting this up initially, as a trial-and-error approach will take a while at this data size. Also, Redshift isn't cheap at this scale, so ask the budget questions too. The cost can be reduced if the database only needs to be up part of the time.
Files aren't a terrible idea as long as you can partition the data such that only a subset of files needs to be accessed for any query. A partitioning strategy based on YEAR, MONTH, LAT-decade and LON-decade might work - you'll need to understand what queries have to be performed and how fast (what is "reasonable time"?). This approach will cost the least.
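As a small illustrative sketch of that pruning idea (the layout and decade-sized buckets are assumptions, not a tested design): given a date range and a point, list only the partition prefixes a query needs to touch.

from datetime import date

def partition_prefixes(start: date, end: date, lat: float, lon: float):
    """Yield the year/month/lat-decade/lon-decade prefixes touched by a query."""
    lat_dec = int(lat // 10) * 10          # e.g. 47.3 -> 40, -23.5 -> -30
    lon_dec = int(lon // 10) * 10
    d = start.replace(day=1)
    while d <= end:
        yield f"year={d.year}/month={d.month:02d}/lat={lat_dec}/lon={lon_dec}/"
        d = date(d.year + (d.month == 12), d.month % 12 + 1, 1)  # first day of next month

for prefix in partition_prefixes(date(2020, 12, 20), date(2021, 1, 10), 47.3, 8.5):
    print(prefix)   # year=2020/month=12/... and year=2021/month=01/...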
There is also a combo option - Redshift Spectrum. Redshift can use in-database information AND data stored in S3 within the same queries. Again, setting up the Redshift data model and the S3 partitioning will be critical, but this combo could give you a valuable mix of properties.
For any of these options you will want to convert to a more database-friendly format like Parquet (or even CSV). This conversion process, along with how new data is merged in, needs to be understood; there are lots of cloud tools to help with this processing.
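A rough sketch of one such conversion in Python, assuming xarray and a Parquet engine such as pyarrow are installed; the file name is a placeholder, and a large archive would be converted one file or chunk at a time to keep memory bounded:

import xarray as xr

ds = xr.open_dataset("era5_sample.nc")             # GRIB works too, via engine="cfgrib"
df = ds.to_dataframe().reset_index()               # (time, latitude, longitude) become columns
df.to_parquet("era5_sample.parquet", index=False)  # queryable from Spectrum / Athena / Spark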
Given the size of the data you are working with, I'll stress again that learning as you go will be time consuming. You will likely want to find experts in the tools you are working with (and at the data sizes you have) to get up to speed quickly.

Managing 1 million records per month insert/select [closed]

I am working on a real-time feed which gives me real-time data.
There are up to 1 million records per month and I need to provide reports based on these records.
I chose MongoDB as it performs well when fetching records.
I am facing issues managing that data, because there are already 12 million records.
Do I need to keep all the data month-wise?
Should I use a different collection per month?
There are a lot of select queries for analytics reports and the like.
It depends on how you want to use the data; that's up to you to decide. There is nothing wrong with a lot of data, you just need to limit your heavy queries with the same logic as a cache (easier access, but less fresh). A common method is:
You have a "raw data" table which contains your millions of records. This table is very large but contains 'pure' data. You want to access it as little as possible, as it will be slow.
The next table is less accurate and summarizes the information you need. In your case this could be a 'month_summary' which you create after a month ends. That way you still have the complete dataset, but also a small table with the relevant info (e.g. num lines, sumOfX, averageOfY, etc). Your heavy query now runs once per month, and you can base your stats on its output.
If you need data per week, you'd make a 'week_summary' table. Or if you need stats per day, you make it per day; 365 entries per year is still a whole lot less than millions.
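A minimal sketch of that monthly roll-up in Python with pymongo, assuming a raw_data collection with a timestamp field ts and a numeric field value (both placeholders), and MongoDB 4.2+ for $merge:

from pymongo import MongoClient

db = MongoClient()["mydb"]                          # placeholder connection and db name

pipeline = [
    {"$group": {
        "_id": {"year": {"$year": "$ts"}, "month": {"$month": "$ts"}},
        "numLines":   {"$sum": 1},
        "sumOfX":     {"$sum": "$value"},
        "averageOfY": {"$avg": "$value"},
    }},
    # Upsert the results into the small summary collection.
    {"$merge": {"into": "month_summary", "whenMatched": "replace"}},
]
db.raw_data.aggregate(pipeline)

Run this once at the end of each month (e.g. from a cron job) and point the analytics queries at month_summary instead of the raw collection.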

How to efficiently partition a large amount of data? [closed]

I wonder what would be a more efficient way to partition Parquet data when storing it in S3.
In my cluster I currently have a data folder with a huge number of Parquet files. I would like to change the way I save data in order to simplify data retrieval.
I have two options. One option is to store Parquet files in the following folder path:
PARTITION_YEAR=2017/PARTITION_MONTH=07/PARTITION_DAY=12/my-parquet-files-go-here
or
PARTITION_DATE=20170712/my-parquet-files-go-here
Which of these two alternatives would you recommend if I need to read a range of 7 days in Spark using spark.read.parquet?
Which alternative would be faster?
Since in both cases you are storing data with daily granularity, given an appropriate implementation at read time the two should be equivalent, but the former allows finer-grained pruning based on your needs: you can easily get data for a whole year, a single month or a single day (or a combination of those) with well-supported glob patterns.
I'd encourage you to use the former solution to be more flexible, since for your current use case the efficiency doesn't change significantly.
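As an illustration, a minimal PySpark sketch of such glob-based reads against the year/month/day layout; the bucket name is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glob_read").getOrCreate()

# A whole month via a wildcard on the day partition.
july = spark.read.parquet(
    "s3a://my-bucket/data/PARTITION_YEAR=2017/PARTITION_MONTH=07/PARTITION_DAY=*")

# A 7-day range straddling two months, via brace expansion.
# (Use spark.read.option("basePath", "s3a://my-bucket/data/") if the
# partition columns are needed in the resulting DataFrame.)
week = spark.read.parquet(
    "s3a://my-bucket/data/PARTITION_YEAR=2017/PARTITION_MONTH=06/PARTITION_DAY={26,27,28,29,30}",
    "s3a://my-bucket/data/PARTITION_YEAR=2017/PARTITION_MONTH=07/PARTITION_DAY={01,02}",
)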
I would strongly advise against having many, many folders in your S3 store. Why? Spark uses S3 connectors which mimic directory trees through multiple HTTP requests: the deeper and wider the tree, the more inefficient this becomes, not least because AWS S3 throttles HTTP requests.
The year/month/day naming scheme works well with Hive & Spark, but if you go into too much depth (by day, by hour) then you may see worse performance than if you didn't.
The answer is quite simple... it depends on how you will query the data!
If you are querying purely on a range of days, then the second option is the easiest:
SELECT ...
FROM table
WHERE date BETWEEN ... AND ...
If you partition by month and day, you'd have to write a WHERE clause that uses both fields, which is awkward if the desired 7-day range straddles two months (e.g. 2018-05-27 to 2018-06-02):
SELECT ...
FROM table
WHERE (month = 5 and day BETWEEN 27 AND 31) OR
      (month = 6 and day BETWEEN 1 AND 2)
This is the way to make such partitions work, but it is not very convenient to code against.
Thus, if you are filtering with a WHERE on the date, then partition by date!
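For completeness, a minimal PySpark sketch of the second layout, relying on partition pruning when filtering on PARTITION_DATE; bucket and dates are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("date_range").getOrCreate()

df = (
    spark.read
         .option("basePath", "s3a://my-bucket/data/")            # keep PARTITION_DATE as a column
         .parquet("s3a://my-bucket/data/")
         .where("PARTITION_DATE BETWEEN 20170706 AND 20170712")  # pruned to 7 partitions
)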

File Age Reporting [closed]

I often find my answers on this site without having to ask, but on this occasion I need more personalized assistance. I hope someone can point me in the right direction.
I have been trying to draw a report off my NAS to get statistics on the age and size of the data, so I can attempt to provide a chargeback/showback solution.
I have managed to do this mostly with PowerShell using Get-ChildItem, and I have even tried tapping into .NET using [System.IO.Directory]::EnumerateDirectories and other methods. All these solutions work, but they seem really slow at gathering this information, especially compared to JAM TreeSize, which fishes it out fairly quickly.
Note that I have also tried multi-threading in PowerShell, thinking that collecting the data from several starting points in parallel would be quicker, but I have had largely mixed results.
I'm hoping someone else has tackled this sort of project before and found a nice, quick(er) way of doing it. I am open to other languages as well.
A few quick notes: I am doing this in PowerShell v5. I have also started learning a bit of Python, so suggestions there would be a great place for me to learn.
Edit:
OK, here are some examples.
Times:
TreeSize takes 10 seconds.
PowerShell Get-ChildItem takes about 2 minutes.
PowerShell .NET calls take about 2 minutes.
Number of objects counted: 60,000; total size: 120 GB.
Get-ChildItem with -Recurse will get you all file objects in a specified location, including their attributes such as last accessed time and size in bytes. With .NET you need to use a combination of EnumerateFiles etc., then loop over the results with FileInfo to get the file objects in the given location and inspect their attributes.
In terms of multithreading, I will point you to some links I used, as it would be too much to add here. I have tried creating a runspace pool, and I also tried manually running two separate runspaces to compare results; they were much the same. As for why I am obsessed with times: while the test directory above takes only 2 minutes, some volumes on my NAS contain millions of files. The one test I did took an hour and a half to complete, and doing the same for the other volumes would take many hours. I just want to find speeds closer to TreeSize.
Edit: I have marked the robocopy workaround as the answer. However, if you do have any suggestions for a different language or procedure, please feel free to comment; it is something I will look into in the future.
I've been there, and getting what you want is tricky, to say the least: TreeSize reads the information directly from the MFT, while Get-ChildItem acts at a higher level, already in the OS. Hence the large difference in speed.
So if you want to speed up your report, you really need to go under the hood and code something at a lower level.
For me, even if it wasn't the fastest solution, I settled on a compromise and used robocopy /l /log:c:\mylog.txt (which doesn't copy a byte and just logs the files to mylog.txt), and then parsed the log. You can play with the multithreading option (/MT:[N], where N is 8 by default) to speed it up.
What I find useful with this method is that, if I need further investigation, I have all the data I need in a file, so it is faster to query. Static, not updated, but when you're talking about millions of files, a snapshot of a certain moment is a good approach, I think.
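Since the asker mentioned being open to Python, here is a minimal sketch (an alternative to the accepted robocopy workaround, not a replacement for it) that walks the tree with os.scandir, which reuses the directory-entry metadata and so avoids a separate stat call per file on Windows; the share path and the one-year cutoff are placeholders:

import os
from datetime import datetime, timedelta

def scan(root):
    """Yield (path, size_bytes, last_access) for every file under root."""
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)      # cached by scandir on Windows
                    yield entry.path, st.st_size, datetime.fromtimestamp(st.st_atime)

cutoff = datetime.now() - timedelta(days=365)
total = stale = 0
for path, size, atime in scan(r"\\nas\share"):                  # placeholder UNC path
    total += size
    if atime < cutoff:
        stale += size
print(f"total {total / 2**30:.1f} GiB, not accessed in a year: {stale / 2**30:.1f} GiB")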

Change a 1500 column data-set for easier front-end manipulation [closed]

I have a data-set that consists of 1500 columns and 6500 rows, and I am trying to figure out the best way to shape the data for web-based, user-interactive visualizations.
What I am trying to do is make the data more interactive and create an admin console that allows anyone to filter the data visually.
The front-end could potentially be based on Crossfilter, D3 and dc.js and give the user basically endless filtering possibilities (date, value, country). In addition there will be some predefined views like the top and bottom 10 values.
I have seen and tested some great examples like this one, but after testing they did not really fit the large number of columns I had, and they were based on a full JSON dump from MongoDB. This resulted in very long loading times and a loss of full interactivity with the data.
So in the end my question is: what is the best approach (starting with normalization) to get the data shaped the right way so it can be manipulated from a front-end? Reducing the number of columns is a priority.
A quick look at the piece of data that you shared suggests that the dataset is highly denormalized. To allow for querying and visualization from a database backend, I would suggest normalizing. This is no small bit of software work, but in the end you will have relational data, which is much easier to deal with.
It's hard to guess where you would start, but from the bit of data you showed there would be a country table, an event table of some sort, and probably some tables of enumerated values.
In any case you will have a hard time finding a DB engine that allows that many columns. The row count is not a problem. I think in the end you will want a DB with dozens of tables.
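As a hedged illustration of the very first reshaping step (not the full relational model described above), a wide-to-long melt in pandas; the file name and identifier columns are guesses for the sake of the example:

import pandas as pd

wide = pd.read_csv("dataset.csv")            # placeholder for the 6500 x 1500 data-set

# Keep the identifying columns and melt the remaining ~1500 columns into rows.
long = wide.melt(
    id_vars=["country", "date"],             # hypothetical identifier columns
    var_name="indicator",
    value_name="value",
).dropna(subset=["value"])

print(long.head())

The long table has a handful of columns and maps naturally onto the country / event / enumeration tables suggested above; it is also much friendlier to Crossfilter, which works on one record per observation.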