It's a long story, but I am trying to save an internal website from the pointy hair bosses who see no value from it anymore and will be flicking the switch at some point in the future. I feel the information contained is important and future generations will want to use it. No, it's not some adult site, but since it's some big corp, I can't say any more.
The problem is, the site is a mess of ASP and Flash that only works under IE7 and is buggy under IE8 and 32bit only even. All the urls are session style and are gibberish. The flash objects itself pull extra information with GET request to ASP objects. It's really poorly designed for scraping. :)
So my idea is to do a tcpdump as I navigate the entire site. Then somehow dump the result of every GET into a sql database. Then with a little messing with the host file, redirect every request to some cgi script that will look for a matching get request in the database and return the data. So the entire site will be located in an SQL database in URL/Data keypairs. Flat file may also work.
In theory, I think this is the only way to go about this. The only problem I see is if they do some client side ActiveX/Flash stuff that generates session URLs that will be different each time.
Anyway, I know Perl, and the idea seems simple with the right modules, so I think I can do most of the work in that, but I am open to any other ideas before I get started. Maybe this exist already?
Thanks for any input.
To capture I wouldn't use tcpdump, but either the crawler itself or a webproxy that can be tweaked to save everything, e.g. Fiddler, Squid, or mod_proxy.
Related
this isn't a specific programming related question, but more so a conceptual/software engineering related question.
I'm a new web dev hire at a small local company, who was given a really cool chance to learn and grow as a professional. They were kind enough to give me a chance, and I'd like to be proactive in learning as much about how their back-end system is working as I can, considering it's what I'll be working in most of the time.
From what I've gathered, their entire in-house built job tracking interface is built in Perl (will the aid of css, js, and sql), where the html pages are generated and spat out as the user wants to access them.
For example, if I want to access a specific job, it'll look like this in the user's url. https://tracking.ourcompanywebsite/jobtracker/job/1234
On the internal side, I know we have a "viewing" script that would be called something like "JobView" that will literally query all of the fields in the perl script, and structure an html page around that data we are requesting.
My question is, how the fudge is this happening? How does a user putting in that address on the url trigger a perl script to run on our server, and generate a page that is spat back out to the user?
I guess that's my main curiosity. In your average bare bones web development courses in college, I learned to make your html, css, and js files. When you want to view a web page, you simply put the directory of that html page, and it constructs everything around that.
When you put a directory to a perl file in a browser, it will just open that raw perl code haha.
I'm sure there may be some modules and various add-ons in our software that allows this to all work, that I may be missing, so please forgive me.
I know you guys don't have the codebase in front of you, but I figured conceptually there is something to be learned that doesn't necessarily need all of the specifics.
I hope that this question could be used for any other amateur devs having the same questions.
Consider the following two snippets:
cat file | program
printf 'foo\n' | cat | program
In the first snippet, cat reads its output from a file. In the second, it gets it from another program. But program doesn't care about any of that. It just reads whatever was provided to its STDIN.
The web browser is like program. It doesn't care where the web server got the HTML or image or whatever it requested. It sends a URL, and it receives a response with a document from the web server.
The web server, like cat, can obtain what it needs from multiple sources. Specifically, it can be configured to get the requested document in a few different ways.
The "default" would be to map the URL to a directory and return the file found there. But that's not the only option. There are two other major options commonly found in web servers:
Common Gateway Interface (CGI)
Some web servers can be configured to run a program based on the URL received. Information about the request is passed to the program, which is tasked with producing a response. The web server simply returns the output of this program to requesting browser.
FastCGI
It can be quite wasteful to spawn a new child for each request. FastCGI allows a web server to talk to an existing persistent process or pool of processes that listen for requests from the webserver. Again, the web server simply returns the response from this request to the requesting browser.
Theres a ton of videos and websites trying to explain backend vs frontend, but unfortunately none of them explains it in a way that you know how to develop a backend - driven website (at least I haven't found anything good).
So, I wanted to ensure that I understood it and kindly ask you to confirm or correct me on this topic.
Example:
I wanted to build Mini - Google. I have a Database containing 1000 stored websites.
Assumption #1:
Everytime I type something into the search bar, the autofill suggestions change. This means, everytime i type, another website / API gets called returning the current autofill suggestions. On a developer site, this means the website e.g. is a Python script which gets called with the current word typed in as a Parameter and is returning all suggestions as e.g. JSON:
// Client Side Script
function ontype(input):
suggestions = get("https://api.googlemini.com/suggestions?q=" + str(input))
show(suggestions)
Assumption #2:
This also means I could manually call the website containing the Python script, providing a random word and it would always return a JSON containing the autofill suggestions for that word.
Question #1:
If A#1 turns out true but A#2 turns out false, how could I prevent a user from randomly accessing the "API" while still returning results when called by a script?
Assumption #3:
After pressing enter, my website googlemini.com/search?... would be called. As google.com/search reloads everytime searching for a new query (or going to page 2 etc.), I assume, instead of calling an API, when the server gets the client request, it first searches through its database, sorts the results and then returns a whole html as a static webpage:
// Server Side Script
#app.route("/search")
function oncall():
query = getparam("q")
results = searchdatabase(query)
html = buildhtml(results)
return html
Question #2:
Often, I hear (or at least understand it this way) that database and webserver are 2 seperate servers. How would that work? Wouldn't that mean the database server needs to be accessible to the web too (of course it would have security layers etc., but technically it would)? How could I access the database server from the webserver?
Question #3:
Are there, on a technical basis, any other ways to build backend services?
That's it. I would also appreciate any recommendations like videos, websites or others to learn how to technically setup and / or secure backend servers.
Thanks in advance.
For your first question you can yes there is a way to prevent miss use.
What you can do is add identifier to api like Auth token to identify a user and every time a user access the api you can save the count on the server n whenever the count has exceeded a limit within a time span you can reject the call. And the limit can be set in such a way that it doesn't trouble the honest user and punishes the wrong one. There are even more complex and effective methods but this is the basic idea.
For question number to let me explain you a simple concept a database is a very efficient, resourcefull and expensive data storage solution we never want it to be used in a general sense as varible store or something. We always want to access the database in call get the data process the data update the data. So we do it data way and its not necessary you make sepreate server for data base. The thing is we mostly make databse to be accessible to various platforms android, ios, windows. So its better to add some abstraction and keep data base as a separte entity.
For the last, I am not well aware about what you meant by other but I am listing some backend teechnologies, some of these might be used in isolation some of these not some other tools as well.
Django
FLask
Djnago rest
GraphQL
SQL
PHP
Node
Deno
I am completely stuck on where to start with getting a log-in area for a Clojure site I am building (for fun).
I've looked at several resources, which I'll post below, mercilessly copy/pasted code, and the closest I can get is one of two situations:
The login page takes the login but says that the login failed, though, as far as I can tell, the login matches.
Or I get this error: No method in multimethod '->sql' for dispatch value: null
I'm not sure how to interpret the above error: is this specifying that I need a multi-method or is it specifying that I need to check for null? The null requirement makes no sense at all. I'm not really asking but if anyone wants to give an explanation, that is great.
I tested the output by comparing the results-to-select queries from raw non-hashed data, I've went through 5 variations on this theme, using everything from page-to-page calls to creating new defpartials, multi-methods, defn, etc.
Sources I have used (unfortunately, I can't list all of them being a first-time poster):
This one uses Clojure -> Korma -> PostgreSQL, but the code doesn't seem to work for multiple users?
http://www.vijaykiran.com/2012/01/17/web-application-development-with-clojure-part-2/
This one shows how to use Noir and PostgreSQL (Yes, I am using Noir):
https://yogthos.net:11794/blog/23-Noir+tutorial+-+part+2
The 4Clojure site, but that one uses CongoMongo:
The Heroku Twitter clone, but no mention of how to create logins for one person, much less several.
I also bought Programming Clojure from O'Reilly Press, but once again, nothing about how to create a log-in area.
FIRST EDIT: I was asked to create a github repository of a stand-alone site. This includes a working "Account Creation" area that is found in the welcome.clj file and only a form of the Login area in login.clj.
I was attempting to get some of the same errors working as I had last night and also attempting to get this working before I uploaded the files. I don't have any reasonable starting points yet, thus there is no beginning implementation as of yet. I'm seriously embarrassed at the solutions I've been coming up with, thus I don't want to post them. I get conceptually what I should do, but for some reason, I can't seem to translate this. This is my first github account: my background is Python, Scheme a'la SICP, and some Python + PostgreSQL marketing program I built.
SECOND EDIT: Ack! I can't seem to get the thing to work at all... Yeah, I spent well over 20 minutes (hours) on this one, so I have just have to admit that I don't yet have the requisite knowledge to accomplish this, no matter how many sources I look to. I committed the updated files and all the odd things I tried, including all the variations on login box to running raw SQL. The closest I can come is getting it so that I don't get any errors, but no evidence at all that someone is logged in. Thanks so much for the help and suggestions. I'll most certainly return to this later.
https://github.com/dt1/noirKormaLogin
There are a couple of issues that I see. First, in datapass.clj, you're creating an entity with no content. I'm not sure how Korma handles that. It's trying to thread results as inputs to other functions, so I could see how nil gets introduced there.
Secondly, you'll need something to handle the login post. (defpage ...) only handles GET requests by default. You'll need a separate defpage to handle the post. Something along these lines:
(defpage [:post "/login"] {:keys [user-name pwd]}
(if-let [user (db/find-user user)]
(if (noir.util.crypt/compare pwd (:password user))
(do
(noir.session/put! :some-key some-value)
(noir.response/redirect "/success"))
noir.response/redirect "/failed-to-login"))
(noir.response/redirect "/failed-to-login"))
session/put! is how you put data into the session. The default is to use an in-memory store. You'll need to add Ring middleware to use persistent sessions (look at Session Stores).
Also, as luck would have, someone just posted an authentication app for Noir... you may want to take a look: https://github.com/xavi/noir-auth-app
We have a working web application, which has been developed with ExtJS for client side, and Struts, Spring, Hibernate for server side. now, we are considering to migrate to GXT (or may be GWT itself). The thing is I'm very new to GWT/GXT. and we are trying to decide whether we go down this road or not.
1) Until now, we have 2 domains for our web-app. one is that the application (Struts+...) have been deployed to, and the other is mainly a cookie-less custom CDN. The transfer between client and server is mostly XHR requests, sending/receiving JSON and/or JSONP. But with the new approach ahead of us, I began to understand that we are supposed to have only ONE domain, for the whole GXT application. Is it correct or I forgot to consider something here?
and if not, Is it possible that we deployed just part of the application (i.e. com.ourcompany.webapp.gxt.server.*) to the main server, and the contents that have been compiled and generated by the GWT compiler to the other CDN-like domain?
2) The other big issue we are facing is that the current application is consists of mostly 3 huge modules. One is responsible for "SignIn", the other is for "Webtop", and the third one is "Modules which each users has access to". The latter has been generated on the server due to "access rights" of each users, and obviously could be different from one user to the other.
The only thing I could find on this matter, which might be related is Code Splitting. Although I'm not totally sure if this would be the right solution for this.
We want that the application, on Start Up, checks whether user has been logged in or not. if not, loads the SignIn sets of javascript files (i.e webapp.signin.nocache.js), then after user has entered the correct username/password, unloads this signin file and loads webtop.nocache.js AND modules.nocache.js.
I would be really appreciated if you could help me out.
1) If your GWT app is loaded from a different domain than you have to face the same origin policy. You can not do a xhr to a different domain. You could use the ScriptTagProxy to get around this. But it does not feel very netural.
2) You can use CodeSplitting in order to automatically load a particular part of your application dynamically. All you have to do is to warp your splitt point into an async call.
A detailed compile report gives you a pretty good overview how well code splitting is working.
But CodeSplitting does not unload already loaded code. If its really importend to do so you have to redirect the user to another url in order to load the appropriate user depended module.
Once Javascript code has been loaded and executed its impossible to remove the code from the browsers memory.
Grettings,
Peter
I'm attempting to move a web app we have (written in Perl) from an IIS6 server to an IIS7.5 server.
Everything seems to be parsing correctly, I'm just having some issues getting the app to actually work.
The app is basically a couple forms. You fill the first one out, click submit, it presents you with another form based on what checkboxes you selected (using includes and such).
I can get past the first form once... but then after that it stops working and pops up the generated error message. After looking into the code and such, it basically states that there aren't any checkboxes selected.
I know the app writes data into .dat files... (at what point, I'm not sure yet), but I don't see those being created. I've looked at file/directory permissions and seemingly I have MORE permissions on the new server than I did on the last. The user/group for the files/dirs are different though...
Would that have anything to do with it? Why would it pass me on to the next form, displaying the correct "modules" I checked the first time and then not any other time after that? (it seems to reset itself after a while)
I know this is complicated so if you have any questions for me, please ask and I'll answer to the best of my ability :).
Btw, total idiot when it comes to Perl.
EDIT AGAIN
I've removed the source as to not reveal any security vulnerabilities... Thanks for pointing that out.
I'm not sure what else to do to show exactly what's going on with this though :(.
I'd recommend verifying, step by step, that what you think is happening is really happening. Start by watching the HTTP request from your browser to the web server - are the arguments your second perl script expects actually being passed to the server? If not, you'll need to fix the first script.
(start edit)
There's lots of tools to watch the network traffic.
Wireshark will read the traffic as it passes over the network (you can run it on the sending or receiving system, or any system on the collision domain).
You can use a proxy server, like WebScarab (free), Burp, Paros, etc. You'll have to configure your browser to send traffic to the proxy server, which will then forward the requests to the server. These particular servers are intended to aid testing, in that you'll be able to mess with the requests as they go by (and much more)
As Sinan indicates, you can use browser addons like Fx LiveHttpHeaders, or Tamper Data, or Internet Explorer's developer kit (IIRC)
(end edit)
Next, you should print out all CGI arguments that the second perl script receives. That way, you'll know what the script really thinks it gets.
Then, you can enable verbose logging in IIS, so that it logs the full HTTP request.
This will get you closer to the source of the problem - you'll know if it's (a) the first script not creating correct HTML, resulting in an incomplete HTTP request from the browser, (b) the IIS server not receiving the CGI arguments for some odd reason, or (c) the arguments aren't getting from the IIS server and into the perl script (or, possibly, that the perl script is not correctly accessing the arguments).
Good luck!
What you need to do is clear.
There is a lot of weird excess baggage in the script. There seemed to be no subroutines. Just one long series of commands with global variables.
It is time to start refactoring.
Get one thing running at a time.
I saw HTML::Template there but you still had raw HTML mixed in with code. Separate code from presentation.