TECHNICAL NOTES - Web Crawler Test Site
=======================================

Last Updated: 2024
Version: 1.0

IMPLEMENTATION DETAILS
----------------------

1. URL Structure
   - Links use both absolute paths (/about/) and relative paths (team.html)
   - Directory URLs end with a trailing slash
   - Some links point to index.html explicitly, others to the bare directory URL
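
   Example (sketch): the URL forms above can be reduced to one comparable
   form by resolving each href against the page it appears on. The host
   name and the collapse_index option are assumptions made for
   illustration; these notes do not say whether index.html and the bare
   directory URL should count as the same page.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(page_url, href, collapse_index=False):
    """Resolve href against page_url and drop any #fragment."""
    scheme, netloc, path, query, _fragment = urlsplit(urljoin(page_url, href))
    # Optional (assumption): treat ".../index.html" as the directory URL.
    if collapse_index and path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return urlunsplit((scheme, netloc, path, query, ""))


base = "http://localhost:8000"  # hypothetical host for the test site

print(normalize(base + "/about/", "team.html"))      # relative link
print(normalize(base + "/products/", "/about/"))     # absolute path
print(normalize(base + "/about/", "/index.html", collapse_index=True))
```

   With collapse_index left off, index.html and / remain two distinct
   URLs, which is the more conservative choice.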

2. Link Patterns
   - Navigation menu on every page (intentional duplication)
   - Cross-section linking (products → resources, about → contact)
   - Multiple links to the same destination, to test duplicate detection

3. Crawl Depth Examples
   Level 0: index.html (starting point)
   Level 1: /, /about/, /products/, /resources/, /contact.html
   Level 2: /about/team.html, /products/product1.html, etc.
   Level 3: /resources/docs/guide.html, /resources/docs/faq.html
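
   Example (sketch): depth levels like the ones above fall out of a
   breadth-first walk over the link graph. The LINKS table below is a
   hypothetical, hand-written approximation of the site; the real pages
   may link differently.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/index.html": ["/", "/about/", "/products/", "/resources/", "/contact.html"],
    "/": ["/about/", "/products/"],
    "/about/": ["/about/team.html", "/contact.html"],
    "/products/": ["/products/product1.html", "/resources/"],
    "/products/product1.html": [
        "/resources/docs/guide.html",
        "/resources/docs/faq.html",
    ],
}


def bfs_depths(links, start, max_depth):
    """Return {page: depth} for every page reachable within max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depths:  # first visit gives the shortest depth
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


for page, depth in bfs_depths(LINKS, "/index.html", max_depth=3).items():
    print(depth, page)
```

   Lowering max_depth to 2 leaves the /resources/docs/ pages undiscovered
   in this graph, which is one way to check the depth limit.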

4. Robots.txt Testing
   User-agent: *
   - Default crawl delay: 2 seconds
   - Disallowed: /private/
   
   User-agent: MunicipalCrawler/1.0
   - Specific crawl delay: 1 second
   - Disallowed: /private/ (same restriction as the default group)
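
   Example (sketch): one way to check these rules is Python's standard
   urllib.robotparser. The robots.txt text below is reconstructed from the
   bullets above and is an assumption, not a copy of the site's file. It
   names the agent group "MunicipalCrawler" without the version, because
   robotparser compares only the part of the crawler's name before the
   slash. The host name is also hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, reconstructed from the notes.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/

User-agent: MunicipalCrawler
Crawl-delay: 1
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

AGENT = "MunicipalCrawler/1.0"
BASE = "http://localhost:8000"  # hypothetical host

print(rules.crawl_delay(AGENT))                       # delay for the named crawler
print(rules.crawl_delay("OtherBot"))                  # delay from the default group
print(rules.can_fetch(AGENT, BASE + "/about/"))       # allowed area
print(rules.can_fetch(AGENT, BASE + "/private/x.html"))  # restricted area
```

   A crawler would normally load the file with set_url() and read()
   instead of an inline string, then sleep for the returned delay between
   requests.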

5. File Types
   - HTML files: Should be parsed for links
   - TXT files: Should be downloaded but not parsed
   - Both types should be saved as documents
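
   Example (sketch): the rule above can be expressed as a small lookup on
   the URL's extension. Treating directory URLs as HTML and ignoring other
   extensions are assumptions; a crawler could also decide from the
   Content-Type response header instead.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def plan_for(url):
    """Decide whether a URL's content is saved and whether it is parsed."""
    path = urlsplit(url).path
    suffix = PurePosixPath(path).suffix.lower()
    # Assumption: a directory URL such as /about/ serves an HTML page.
    if path.endswith("/") or suffix == ".html":
        return {"save": True, "parse_links": True}
    if suffix == ".txt":
        return {"save": True, "parse_links": False}
    # Assumption: file types not mentioned in the notes are skipped.
    return {"save": False, "parse_links": False}


for url in ["/about/team.html", "/about/", "/notes.txt", "/logo.png"]:
    print(url, plan_for(url))
```

   The file names /notes.txt and /logo.png are placeholders used only to
   exercise each branch.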

6. Expected Crawler Behavior
   - Start at index.html
   - Parse HTML for links
   - Queue newly discovered URLs, skipping duplicates
   - Respect robots.txt rules
   - Traverse breadth-first (BFS) up to the configured depth limit
   - Download linked files
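
   Example (sketch): the "parse HTML for links" and "queue new URLs" steps
   could look like this with the standard html.parser module. The class
   and function names, the sample markup, and the host are all
   hypothetical; fetching, robots.txt checks, and the depth limit are left
   out to keep the example short.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute, _fragment = urldefrag(urljoin(self.page_url, value))
                self.links.append(absolute)


def new_urls(page_url, html, seen):
    """Return links not seen before, in page order, and mark them seen."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    fresh = []
    for url in parser.links:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


page = "http://localhost:8000/about/"  # hypothetical host
html = (
    '<a href="/about/">About</a>'
    '<a href="team.html">Team</a>'
    '<a href="/about/team.html">Meet the team</a>'
)
print(new_urls(page, html, seen={page}))
```

   Because duplicates are detected on the resolved URL, the two team links
   count once even though their anchor text and URL form differ.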

7. Edge Cases Tested
   - Same link with different anchor text
   - Relative vs absolute URL to same page
   - Links to restricted areas
   - Deep nesting (3 levels)
   - Multiple file downloads

8. Performance Considerations
   - Small file sizes for quick testing
   - Clear structure for easy debugging
   - Consistent naming conventions

END OF TECHNICAL NOTES