TECHNICAL NOTES - Web Crawler Test Site
=======================================
Last Updated: 2024
Version: 1.0

IMPLEMENTATION DETAILS
----------------------

1. URL Structure
   - Uses both absolute paths (/about/) and relative paths (team.html)
   - Includes trailing slashes for directories
   - Mixed use of index.html and directory URLs

2. Link Patterns
   - Navigation menu on every page (intentional duplication)
   - Cross-section linking (products → resources, about → contact)
   - Multiple links to same destination for duplicate testing

3. Crawl Depth Examples
   Level 0: (starting point)
   Level 1: /, /about/, /products/, /resources/, /contact.html
   Level 2: /about/team.html, /products/product1.html, etc.
   Level 3: /resources/docs/guide.html, /resources/docs/faq.html

4. Robots.txt Testing
   User-agent: *
   - Default crawl delay: 2 seconds
   - Disallowed: /private/

   User-agent: MunicipalCrawler/1.0
   - Specific crawl delay: 1 second
   - Same restrictions

5. File Types
   - HTML files: Should be parsed for links
   - TXT files: Should be downloaded but not parsed
   - Both types should be saved as documents

6. Expected Crawler Behavior
   - Start at index.html
   - Parse HTML for links
   - Queue new URLs (check for duplicates)
   - Respect robots.txt rules
   - Implement BFS to depth limit
   - Download linked files

7. Edge Cases Tested
   - Same link with different anchor text
   - Relative vs absolute URL to same page
   - Links to restricted areas
   - Deep nesting (3 levels)
   - Multiple file downloads

8. Performance Considerations
   - Small file sizes for quick testing
   - Clear structure for easy debugging
   - Consistent naming conventions

END OF TECHNICAL NOTES
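
APPENDIX: EXAMPLE CRAWL SKETCH
------------------------------

The behavior listed under "Expected Crawler Behavior" and "Edge Cases
Tested" can be illustrated with a minimal Python sketch. It is not the
crawler these notes were written for. The host name test.local, the
in-memory PAGES dictionary, and the function and class names are all
assumptions made for illustration, and depth here counts link hops from
whatever start URL is passed in. The sketch shows a breadth-first crawl to
a depth limit, duplicate detection after resolving relative links to
absolute URLs, and a robots.txt check for /private/. Crawl delays and file
downloads are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser

BASE = "http://test.local"  # hypothetical host, not taken from the notes

# Hypothetical in-memory stand-in for the test site (path -> HTML body).
PAGES = {
    "/": '<a href="/about/">About</a> <a href="about/">About us</a> '
         '<a href="/private/secret.html">Private</a>',
    "/about/": '<a href="team.html">Team</a> <a href="/">Home</a>',
    "/about/team.html": '<a href="/resources/docs/guide.html">Guide</a>',
    "/resources/docs/guide.html": "",
}

# Rules modeled on section 4 of the notes.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/

User-agent: MunicipalCrawler/1.0
Crawl-delay: 1
Disallow: /private/
"""


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, max_depth, agent="MunicipalCrawler/1.0"):
    """Breadth-first crawl from start; returns (url, depth) in visit order."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())

    seen = {start}                 # every URL ever queued, for deduplication
    queue = deque([(start, 0)])    # FIFO queue gives breadth-first order
    visited = []

    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue               # depth limit reached: do not follow links

        parser = LinkParser()
        parser.feed(PAGES.get(url[len(BASE):], ""))
        for href in parser.links:
            # Resolve relative links and drop #fragments so that
            # "/about/" and "about/" map to the same absolute URL.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute in seen or not robots.can_fetch(agent, absolute):
                continue
            seen.add(absolute)
            queue.append((absolute, depth + 1))
    return visited


if __name__ == "__main__":
    for url, depth in crawl(BASE + "/", max_depth=3):
        print(depth, url)
```

In this sketch the two links to the about page on the start page differ
in anchor text and in form (absolute vs relative) but resolve to one URL,
so the page is queued once. The link into /private/ is rejected by the
robots.txt check before it reaches the queue.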