Have you ever created an offline archive of a website that you like? Which “offline downloader” did you use? Hopefully this post will give you a rough idea of how to traverse a website (and do whatever you want with its content). My approach to traversing a website can be summed up in the following steps:
1. Define the startup page
2. Collect all hyperlinks on the page that have not been visited and put them into a queue
3. Do something with the content
4. Dequeue a hyperlink and return to step 2
5. Repeat until there are no more links in the queue
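The steps above amount to a breadth-first traversal with a visited set. As a rough sketch of how this could look (using only Python's standard library; the `crawl` function, `handle_page` callback, and `max_pages` limit are my own illustrative names, not part of any released tool):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, handle_page, max_pages=100):
    """Breadth-first site traversal following the steps above."""
    queue = deque([start_url])  # step 1: start from the startup page
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()   # step 4: dequeue the next hyperlink
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        handle_page(url, html)  # step 3: do something with the content
        # step 2: collect hyperlinks and queue the unvisited ones
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in visited:
                queue.append(absolute)
    return visited
```

The `visited` set is what keeps the traversal from looping forever on pages that link back to each other, and `urljoin` resolves relative links against the page they were found on. A real downloader would also need to restrict itself to one domain and respect robots.txt, which this sketch omits.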
I have built my own prototype that suits my needs. It might need further tweaks and improvements before I release it to the public. My main goals are adding the ability to define characteristics of the pages that will be traversed in step 2, and letting you choose what to do with each page's content. Stay tuned for further updates.