Web crawler in Node.js that dynamically spiders whole websites.
IMPORTANT: This is a DEVELOPMENT tool, therefore it SHOULD NOT be used against a website you DO NOT OWN!
It helps you map and process entire websites by spidering them and parsing each page in a smart way. It follows all the links and tests the forms several times, which makes it possible to check the whole website effectively.
This project was born with the aim of improving legacy code, but it is not restricted to that.
With this in mind, the usage of salmonJS can vary based on your own needs, such as checking legacy code for dead code or profiling the web app's performance.
Here are a few suggestions about its usage:
- Improve the legacy code
- Check the dead code (enabling the code coverage server-side)
- Discover 500 Internal Server Errors
- Discover notices and warnings
- SQL profiling
- Process forms (it creates simple test cases to be completed manually)
- Automatically process JS events attached to DOM nodes
- Get the page content for each URL
- Get the screenshot for each URL
- URLs list
- Execution times
- Page output
- Page load
- Command Line Interface
- Catch and handle all the events bound to DOM elements (regardless of how they have been set)
- Follow any 3xx redirect, JS document.location and meta redirect (can be disabled)
- Ignore duplicated URLs / requests and external URLs
- HTTP authentication
- Generate a report for each page crawled, with:
- HTTP headers
- HTTP method
- Data sent (GET and POST)
- Page output
- Execution time
- Console messages
- Alerts, Confirmations & Prompts
- List of successful and failed requests
- Multiple crawlers working asynchronously, one URL each
- Support for the following HTML tags: a, area, base, form, frame, iframe, img, input, link, script
- URL normalisation
- Process the web page using PhantomJS
- Process the output content only if it's HTML
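The URL normalisation and the filtering of duplicated and external URLs listed above can be illustrated with a minimal sketch. It is not salmonJS's own code: `normaliseUrl` and `filterLinks` are hypothetical helpers built on the standard WHATWG `URL` class, and the chosen rules (drop the fragment, sort the query parameters, compare hosts) are assumptions about what "normalised" might reasonably mean.

```javascript
// Hypothetical helper: resolve a link against the page URL and normalise it.
// Returns null for anything that is not a parsable http(s) URL.
function normaliseUrl(href, baseUrl) {
  let url;
  try {
    url = new URL(href, baseUrl); // resolves relative links, lowercases scheme and host
  } catch (err) {
    return null;
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  url.hash = '';           // assumption: fragments do not identify a different page
  url.searchParams.sort(); // assumption: query-parameter order is irrelevant
  return url.href;
}

// Hypothetical helper: keep only unique URLs on the same host as the page.
function filterLinks(hrefs, baseUrl, seen = new Set()) {
  const baseHost = new URL(baseUrl).host;
  const result = [];
  for (const href of hrefs) {
    const normalised = normaliseUrl(href, baseUrl);
    if (normalised === null) continue;
    if (new URL(normalised).host !== baseHost) continue; // external URL
    if (seen.has(normalised)) continue;                  // duplicated URL
    seen.add(normalised);
    result.push(normalised);
  }
  return result;
}

const links = filterLinks(
  [
    '/about?b=2&a=1',
    '/about?a=1&b=2#team',            // same page as the first entry
    'http://other.example.org/',      // external
    'mailto:me@example.com',          // not http(s)
    'HTTP://WWW.EXAMPLE.COM/contact', // differs only in letter case
  ],
  'http://www.example.com/index.html'
);
console.log(links);
// [ 'http://www.example.com/about?a=1&b=2', 'http://www.example.com/contact' ]
```

Passing the same `seen` set to every call would let several pages share one record of what has already been collected.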
Here is the list of main dependencies:
You can install it directly from npm:
[user@hostname ~]$ npm install salmonjs -g
or you can download the source code from GitHub and run these commands:
[user@hostname ~/salmonjs]$ npm install
Change the file src/config.js according to your needs.
Here is an example of a test case file:
; Test Case File
; generated by salmonJS v0.3.0 (http://fabiocicerchia.github.io/salmonjs) at Sat, 01 Jan 1970 00:00:00 GMT
; url = http://www.example.com
; id = http___www_example_com

[GET]
variable1=value1

[POST]
variable1=value1
variable2=value2
variable3=@/path/to/file.ext ; use @ in front to use the upload feature (the file MUST exists)

[COOKIE]
name=value

[HTTP_HEADERS]
header=value

[CONFIRM]
Message=true ; true = OK, false = Cancel

[PROMPT]
Question="Answer"
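To make the layout of the test case file concrete, here is a minimal sketch of how such a file could be read. It is not the parser salmonJS uses: `parseTestCase` is a hypothetical function, and it assumes one directive per line, `;` as the comment marker (also after a value, when preceded by whitespace) and optional double quotes around values.

```javascript
// Hypothetical parser for the INI-like test case layout shown above.
function parseTestCase(text) {
  const sections = {};
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith(';')) continue; // blank line or full-line comment
    const header = line.match(/^\[([A-Z_]+)\]$/);
    if (header) {
      current = sections[header[1]] = {}; // start a new section, e.g. POST
      continue;
    }
    const eq = line.indexOf('=');
    if (current === null || eq === -1) continue; // ignore lines outside a section
    const key = line.slice(0, eq).trim();
    const value = line
      .slice(eq + 1)
      .replace(/\s+;.*$/, '')   // assumption: whitespace + ";" starts a trailing comment
      .trim()
      .replace(/^"(.*)"$/, '$1'); // assumption: surrounding quotes are not part of the value
    current[key] = value;
  }
  return sections;
}

const sample = `
; url = http://www.example.com
[GET]
variable1=value1
[POST]
variable3=@/path/to/file.ext ; use @ in front to use the upload feature
[CONFIRM]
Message=true ; true = OK, false = Cancel
[PROMPT]
Question="Answer"
`;

const parsed = parseTestCase(sample);
console.log(parsed.POST.variable3); // @/path/to/file.ext
console.log(parsed.CONFIRM.Message); // true (as a string)
console.log(parsed.PROMPT.Question); // Answer
```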
salmonJS v0.3.0
Copyright (C) 2013 Fabio Cicerchia <firstname.lastname@example.org>

Web Crawler in Node.js to spider dynamically whole websites.
Usage: ./bin/salmonjs

Options:
  --uri            The URI to be crawled                 [required]
  -u, --username   Username for HTTP authentication
  -p, --password   Password for HTTP authentication
  -d, --details    Store details for each page           [default: false]
  -f, --follow     Follows redirects                     [default: false]
  --disable-stats  Disable anonymous report usage stats  [default: false]
  --help           Show the help
[user@hostname ~]$ salmonjs --uri "http://www.google.com"
[user@hostname ~]$ salmonjs --uri "www.google.com"
[user@hostname ~]$ salmonjs --uri "/tmp/file.html"
[user@hostname ~]$ salmonjs --uri "file.html"
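If you want to start the crawler from another Node.js script, a small wrapper around the command line could look like the sketch below. `buildSalmonArgs` and `runSalmon` are hypothetical names, not part of salmonJS. The sketch only uses options listed in the help output, and it assumes that `--details` and `--follow` are plain boolean switches and that `salmonjs` is on the PATH (global npm install).

```javascript
const { spawn } = require('node:child_process');

// Hypothetical helper: turn an options object into salmonjs arguments.
function buildSalmonArgs({ uri, username, password, details = false, follow = false }) {
  if (!uri) throw new Error('--uri is required');
  const args = ['--uri', uri];
  if (username) args.push('--username', username);
  if (password) args.push('--password', password);
  if (details) args.push('--details'); // assumption: boolean switch
  if (follow) args.push('--follow');   // assumption: boolean switch
  return args;
}

// Hypothetical wrapper: resolves with the exit code of the salmonjs process.
function runSalmon(options) {
  return new Promise((resolve, reject) => {
    const child = spawn('salmonjs', buildSalmonArgs(options), { stdio: 'inherit' });
    child.on('error', reject); // e.g. salmonjs is not installed
    child.on('close', (code) => resolve(code));
  });
}

const exampleArgs = buildSalmonArgs({
  uri: 'http://www.example.com',
  details: true,
  follow: true,
});
console.log(exampleArgs.join(' '));
// --uri http://www.example.com --details --follow
```

`runSalmon` is only defined here, not called, so running the snippet does not start a crawl.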
[user@hostname ~/salmonjs]$ npm test
- Start processing a URL
- Open a system process to PhantomJS
- Open the URL
- If there is a JS event, put it into a dedicated stack
- Inject custom event listener
- Override existing event listener
- Collect all the relevant info from the page for the report
- On load complete, execute the events in the stack
- Start to process the web page
- Get all the links from the page content
- Normalise and filter by uniqueness all the URLs collected
- Get all the JS events bound to DOM elements
- Clone the web page for each new combination in the page (confirm)
- Put the web page instance in a dedicated stack for each JS event
- Process all the web pages in the stack
- Get all the links from the page content
- Reiterate until there are no more JS events
- If there is an error, retry up to 5 times
- Collect all the data sent by the parser
- Create test cases for POST data with normalised fields
- Get POST test cases for current URL
- Launch a new crawler for each test case
- Store details in report file
- Increase the counter for possible crawlers to be launched based on the links
- Check whether the links have already been processed
- If not, launch a new process for each link
- If there are no more links to be processed, check whether there are still sub-crawlers running
- If not, terminate the process
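The core of the steps above (process a URL, retry on error up to 5 times, skip links that were already processed, stop when nothing is left) can be sketched as follows. This is a simplified illustration, not the salmonJS implementation: it runs sequentially instead of launching sub-crawlers, and `processPage` is a hypothetical stand-in for the PhantomJS step, here replaced by a fake in-memory site.

```javascript
const MAX_ATTEMPTS = 5; // "retry up to 5 times"

// Retry an async operation; rethrow the last error once all attempts failed.
async function withRetry(operation, attempts = MAX_ATTEMPTS) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Hypothetical crawl loop. `processPage(url)` must resolve to the list of
// absolute URLs found on that page.
async function crawl(startUrl, processPage) {
  const processed = new Set([startUrl]); // URLs already queued or visited
  const pending = [startUrl];
  const report = [];
  while (pending.length > 0) {
    const url = pending.shift();
    try {
      const links = await withRetry(() => processPage(url));
      report.push({ url, ok: true });
      for (const link of links) {
        if (!processed.has(link)) { // only queue links not seen before
          processed.add(link);
          pending.push(link);
        }
      }
    } catch (err) {
      report.push({ url, ok: false, error: err.message });
    }
  }
  return report; // no more links to process: the crawl terminates
}

// Fake site used instead of a real browser: page /b fails twice, then works.
const fakeSite = {
  'http://www.example.com/': ['http://www.example.com/a', 'http://www.example.com/b'],
  'http://www.example.com/a': ['http://www.example.com/', 'http://www.example.com/b'],
  'http://www.example.com/b': [],
};
let flakyCalls = 0;
async function fakeProcessPage(url) {
  if (url.endsWith('/b') && ++flakyCalls < 3) throw new Error('temporary failure');
  return fakeSite[url];
}

const crawlResult = crawl('http://www.example.com/', fakeProcessPage);
crawlResult.then((report) => console.log(report));
```

Each page appears once in the report even though the fake pages link to each other, and `/b` succeeds on its third attempt.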
For a list of bugs please go to the GitHub Issue Page.
Copyright (C) 2013 Fabio Cicerchia email@example.com
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.