User behavior analysis, Stack Overflow public data dump - data-dump

I have a question: what would be the best way to figure out which time zone a particular user is in, based on the location field data? A considerable number of users seem to have this field populated with some data; the form, however, is far from normalized.
While I am figuring out ways to normalize user locations and infer time zones, I wonder if someone has done this before and could share some experience, or maybe (ideally) there is some magic web service I can ask for time zones by a given location?
So far I am running a fairly simple process: tokenizing the field, sorting, grouping by frequency, and assigning time zones manually based on my best knowledge.
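For reference, a rough sketch of that process (Node.js; the token-to-timezone mapping is hand-maintained and all names here are placeholders):

    // Hand-maintained mapping from location tokens to IANA time zones,
    // grown as frequent tokens get reviewed.
    const tzByToken = {
      'london': 'Europe/London',
      'nyc': 'America/New_York',
      'sydney': 'Australia/Sydney',
    };

    // Split a raw location string into normalized tokens.
    function tokenize(location) {
      return location.toLowerCase()
        .split(/[,\/]/)
        .map(t => t.trim())
        .filter(Boolean);
    }

    // Return the time zone of the first recognized token, or null.
    function guessTimezone(location) {
      for (const token of tokenize(location)) {
        if (tzByToken[token]) return tzByToken[token];
      }
      return null;
    }

    // Frequency report of tokens from still-unresolved locations,
    // most common first, for the next round of manual assignment.
    function unresolvedReport(locations) {
      const counts = {};
      for (const loc of locations) {
        if (guessTimezone(loc)) continue;
        for (const token of tokenize(loc)) {
          counts[token] = (counts[token] || 0) + 1;
        }
      }
      return Object.entries(counts).sort((a, b) => b[1] - a[1]);
    }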

Related

How to enter estimated dates in a PSQL timestamp field?

An existing database uses a timestamp field, but sometimes the user only knows the year and month of the event. How should the data be stored when only the month, the day, or the time is known? Keep in mind this is for appointment data, e.g. a "time of appointment" field, but sometimes it's historical in nature, like "saw my lawyer, I think it was June 2019".
I'm really looking for a general, language-independent solution, likely focusing on relatively standard DB data structures and field types for handling the limitations of a timestamp field in this regard - a common programming situation I have named "an estimated timestamp".
It appears a timestamp field doesn't allow a date such as "2019-07-00T00:00", which would be the ideal solution if it did, as it: a) maintains natural sort order and b) provides a clear indicator that it's an estimated date without adding an estimated T/F field.
What solutions have you come up with for such a situation, with the understanding that this DB data is accessed by many web-based front ends?
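To make the fallback concrete, here is a sketch of the kind of thing I have in mind: store the timestamp normalized to the start of the known period, plus a precision column recording how much of it is real (a generalization of the estimated T/F flag above). Node.js with the pg client; all table and column names are placeholders:

    const { Client } = require('pg');

    async function main() {
      const client = new Client();
      await client.connect();

      // Full-resolution timestamp plus an explicit precision marker.
      await client.query(`
        CREATE TABLE IF NOT EXISTS appointment (
          id          serial PRIMARY KEY,
          occurred_at timestamp NOT NULL,
          prec        text NOT NULL DEFAULT 'exact'
            CHECK (prec IN ('exact', 'day', 'month', 'year'))
        )`);

      // "saw my lawyer, I think it was June 2019":
      // normalize to the first instant of the period, mark month-precision.
      await client.query(
        'INSERT INTO appointment (occurred_at, prec) VALUES ($1, $2)',
        ['2019-06-01T00:00:00', 'month']
      );

      await client.end();
    }

    main().catch(console.error);

This keeps the natural sort order of a plain timestamp, and the precision column tells each front end how much of the value to render.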

How can I handle a very large database without losing performance?

I want to develop an application, and I'm worried about its performance once the number of users and the amount of stored data increase.
I don't know the best way to implement a program that works with really large data and does things like searching it, finding and retrieving user information, searching text, and so on, in real time without any delay.
Let me explain the problem in more detail.
For example, suppose I have chosen MongoDB as the database and we have at least five million users, and a user wants to log in to the system, sending a username and password.
The first thing we should do is find the user with that username and then check the password. In MongoDB we would use something like the find method to get the user's information:
Users.find({ username: entered_username })
Then we get the user's information and check the password.
But find has to search for the username among five million users, and it has to run for every authentication request, which causes heavy processing on the system.
Unfortunately, this problem is not limited to finding a user: if we want to search text when we have a lot of texts and posts in the database, the problem is even bigger.
I don't know how big companies like Facebook and LinkedIn search through millions of records in such a short span of time. I don't actually want to build something like Facebook, but I do have a large amount of data and I'm looking for a good way to handle it.
Is there any framework, or a way to structure the data in the database, that helps find data quickly? Should I use a particular data structure?
I found the open-source project Elasticsearch, which helps with faster searching, but if I find something with Elasticsearch, how do I find it in MongoDB too, for example to update the data? If I use Elasticsearch, do I still need MongoDB, or can I use Elasticsearch as both a database and a search engine simultaneously?
If I use Elasticsearch and MongoDB together, I would have two copies of my data, one in MongoDB and one in Elasticsearch, and those two copies would be separate. I wish Elasticsearch could search within MongoDB so I would not have to keep two copies of the data.
Thank you for helping me find a good way forward and understand what I should do.
When you talk about performance, it usually boils down to three things:
Your design
Your definition of "quick", and
How much you're willing to pay
Your design
MongoDB is great if you want to iterate on your data model; it can scale horizontally and is very quick if used properly. Elasticsearch, on the other hand, is not a database, but it is very quick for searching. A traditional relational database will be useful if you know exactly what your data looks like and don't expect it to change much, or if it is relational by nature.
You can, for example, use a relational database for user login, use MongoDB for everything else, and use Elastic for textual, searchable data. There is no rule that tells you to keep everything within a single database.
Make sure you understand indexing, and know how to utilize it to its fullest potential. The fastest hardware will not help you if you don't design your database properly.
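For example, for the login lookup in the question, a unique index on username turns a full collection scan over five million users into a quick B-tree lookup. A minimal mongo-shell sketch (collection and field names follow the question's example):

    // One-time setup: index the field used for login lookups.
    db.users.createIndex({ username: 1 }, { unique: true });

    // The same query as in the question now walks the index
    // instead of scanning every document.
    db.users.find({ username: entered_username });

    // Verify: the plan should show IXSCAN rather than COLLSCAN.
    db.users.find({ username: entered_username }).explain('executionStats');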
Conclusion: use any tool you need, combine if necessary, but understand their strengths and weaknesses.
Your definition of "quick"
How "quick" is quick enough for your application? Is 100ms quick enough? Is 10ms quick enough? Remember that more performance you ask of the machine, more expensive it will be. You can get more performance with a better design, but design can only go so far.
Usually this boils down to what is acceptable for you and your client. Not every application needs a sub-10ms response time. There's plenty of applications that can tolerate queries that return in seconds.
Conclusion: determine what is acceptable, and design accordingly.
How much you're willing to pay
Of course, it all depends on how much you're willing to pay for the hardware that needs to host all that stuff. MongoDB might be open source, but you need some place to host it. Also, you cannot expect magic: you can't throw thousands of queries and updates per second at it and expect it to be blazing fast when you only give it 1 GB of RAM.
Conclusion: never under-provision to save money if you want your application to be successful.

Calculating and reporting Data Completeness

I have been working for some time on measuring data completeness and creating actionable reports for our HRIS system.
Until now I have used Excel, but now that the requirements for reporting have stabilized and the need for quicker response times has increased, I want to move the work to another level. At the same time, I also want more detailed options for distinguishing between different units.
As an example, I am looking at missing fields. For each employee in every company, I simply want to count how many fields are missing.
For other fields I am looking to validate the data - like birthdays compared to hiring dates, thresholds for different values, employee groups compared to responsibility level, and so on.
My question is where to move from here. Is there any language that is better than the others for importing lists, evaluating fields in those lists, and then quantifying the results at the company and other levels? I want to be able to extract data from our different systems, then have a program do all the calculations and summarize the findings in some way. (I consider it to be a good learning experience.)
I've done something like this in the past and sort of cheated. I wrote a program that ran nightly, identified missing fields (not required by the system but necessary for data integrity), and dumped them to an incomplete-record table that was cleared each night before the process ran. I then sent batch emails about the missing element(s) to each responsible group (Payroll/Benefits/Compensation/HR Admin) so the missing data could be added. I used .NET against an Oracle database and sent emails via Lotus Notes, but a similar design should work in just about any environment.
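As a sketch of the check itself, the nightly identification step can be as simple as this (plain Node.js here rather than the .NET I used; the required-field list and record shape are placeholders):

    // Fields not enforced by the system but needed for data integrity.
    const REQUIRED = ['birthDate', 'hireDate', 'department', 'payGrade'];

    // Which required fields are empty on a given employee record?
    function missingFields(employee) {
      return REQUIRED.filter(f => employee[f] == null || employee[f] === '');
    }

    // Roll the counts up per company for the summary report.
    function completenessByCompany(employees) {
      const summary = {};
      for (const e of employees) {
        const missing = missingFields(e);
        const s = summary[e.company] ||
          (summary[e.company] = { employees: 0, missing: 0, byField: {} });
        s.employees += 1;
        s.missing += missing.length;
        for (const f of missing) s.byField[f] = (s.byField[f] || 0) + 1;
      }
      return summary;
    }

    // Example of a cross-field validation: birth date precedes hire date.
    function validDates(e) {
      return !e.birthDate || !e.hireDate ||
        new Date(e.birthDate) < new Date(e.hireDate);
    }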

Store static locations and info iOS

Let's say that I have GPS coordinates for 1000 stores and a short text for each one. What would be the best way to store this information? SQL? One more thing to consider is how to load the information into the app. It doesn't seem smart to load everything at once; the best way seems to be loading only the stores in a specific area, but how do I search for those stores? Is that easy to do in SQL? As you can see, I don't have much experience with database programming.
Storing the data in an SQLite3 file would be best for the moment: you have lots of data, and through a DB you can fetch just the required data (via a query) on demand.
As for getting the stores in a specific area, assume you want to locate nearby stores within a 10 km radius.
This will not be hard to fetch from the DB.
I am no server expert, but I can give you a guideline that will work:
You have the current lat/long, and you have a lat/long in the table for each row as well.
You can put a distance-calculation formula in the WHERE clause's condition, and the query will return the records of nearby places (within 10 km).
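To make that concrete, here is a rough sketch (Node.js with the better-sqlite3 package standing in for whatever SQLite binding you use on iOS; schema and numbers are placeholders). A cheap bounding box in the WHERE clause narrows the candidates, then the exact great-circle distance is checked on the few remaining rows:

    const Database = require('better-sqlite3');
    const db = new Database('stores.db');

    db.exec(`CREATE TABLE IF NOT EXISTS stores (
      id   INTEGER PRIMARY KEY,
      name TEXT,
      info TEXT,
      lat  REAL,
      lon  REAL
    )`);

    // Haversine great-circle distance in kilometres.
    function distanceKm(lat1, lon1, lat2, lon2) {
      const toRad = d => d * Math.PI / 180;
      const dLat = toRad(lat2 - lat1);
      const dLon = toRad(lon2 - lon1);
      const a = Math.sin(dLat / 2) ** 2 +
                Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
                Math.sin(dLon / 2) ** 2;
      return 6371 * 2 * Math.asin(Math.sqrt(a));
    }

    function storesNearby(lat, lon, radiusKm) {
      // Bounding-box pre-filter: 1 degree of latitude is roughly 111 km.
      const dLat = radiusKm / 111;
      const dLon = radiusKm / (111 * Math.cos(lat * Math.PI / 180));
      const rows = db.prepare(
        `SELECT * FROM stores
         WHERE lat BETWEEN ? AND ?
           AND lon BETWEEN ? AND ?`
      ).all(lat - dLat, lat + dLat, lon - dLon, lon + dLon);
      // Exact distance check on the small candidate set.
      return rows.filter(r => distanceKm(lat, lon, r.lat, r.lon) <= radiusKm);
    }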

Would MongoDB be a good fit for my industry?

I work in the promotional products industry. We sell pretty much anything that you can print, embroider, engrave, or customize by any other method. Popular products are pens, mugs, shirts, caps, etc. Because we have such a large variety of products, storing information about them, including all the possible product options, decoration options, and all associated extra charges, gets extremely complicated. So much so that, although many have tried, no one has been able to provide industry product data in such a way that you could algorithmically turn it into an eCommerce store without some degree of data massaging. It seems near impossible to store this information properly in a relational database. I am curious whether MongoDB, or any other NoSQL option, would allow me to model the information in a way that makes it easier to store and manipulate our product data than an RDBMS like MySQL. The company I work for is over 100 years old and has been using DB2 on an AS/400 for many years. I'll need some good reasons to convince them to go with a non-relational DB solution.
A common example product in our industry is the Bic Clic Stic Pen, which has over 20 color options each for barrel and trim colors, and even more colors to choose from for silkscreen decoration. Then you can choose additional options for what type of ink to use. There are multiple options for packaging. After all that is selected, you have an additional option for rush processing. All of these options may or may not have additional charges, which can depend on how many pens you order or how many colors are in your decoration. Pricing is usually based on quantity, so ordering 250 pens would cost more per pen than ordering 1000. Similarly, the extra charge for getting special ink would be cheaper per pen when you order 1000 than when you order 250.
Without wanting to sound harsh, this has the ring of a silver bullet question.
You have an inherently complex business domain. It's not clear to me that a different way of storing your data will have any impact on that complexity - storing documents rather than relational data probably doesn't make it easier to price your pen at $0.02 less if the customer orders more than 250.
I'd recommend focusing on the business domain, and not worrying too much about the storage mechanism. I'm a big fan of Domain-Driven Design - this sounds like a perfect case for that approach.
Using a document database won't solve your problem completely, but it probably can help.
If your documents represent the options available on a product and an order for that product, in most cases you will be accessing each document as a whole - nothing you can't do with SQL, but a good fit for a document database. Since the structure of the documents is flexible, it is relatively easy to add an object within the document as a complex type for a particular option or rule without changing the database.
However, that only helps with the data - your real problem is on the UI side. The two documents together map directly to the order form, but whatever method you use to define the options/rules, some of the products are going to end up with extremely complex settings pages.
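For illustration, a rough sketch of what such a product document might look like (mongo shell; every field name and number here is invented for the pen example above):

    db.products.insertOne({
      name: 'Bic Clic Stic Pen',
      // Base price per pen drops with quantity.
      priceBreaks: [
        { minQty: 250,  unitPrice: 0.89 },
        { minQty: 1000, unitPrice: 0.79 }
      ],
      options: [
        { name: 'barrelColor', choices: ['blue', 'red' /* 20+ colors */] },
        { name: 'trimColor',   choices: ['white', 'black'] },
        { name: 'inkType',
          choices: ['standard', 'gel'],
          // Option-specific extra charges, also quantity-dependent.
          charges: [
            { choice: 'gel',
              priceBreaks: [
                { minQty: 250,  unitPrice: 0.10 },
                { minQty: 1000, unitPrice: 0.05 }
              ] }
          ] }
      ],
      decoration: { method: 'silkscreen', maxColors: 4 },
      rushProcessing: { available: true, flatCharge: 40.0 }
    });

Each product can carry a different options array, so a new option or rule is a new sub-document rather than a schema change.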
Yes, MongoDB is what you need. It doesn't enforce a strict document structure, so you'll be able to create the set of models you need and embed them into your product page in any order and combination you need. It's actually possible to work with this data without describing the real model fields directly, so I (for example) can use fields my Rails application doesn't know about at all.
MongoDB is also extremely easy to set up for replication and sharding, and it supports the GridFS virtual filesystem, so you can store images for your products together with the documents that describe them and easily manipulate them as a single object.
You should definitely give it a try.
UPD: In any case, it would be good to keep your RDBMS for financial data and number crunching, like grouping reports for sales analysis and so on. NoSQL databases aren't very good at this.