README - Test Site Web Crawler Validation
=========================================

INSTALLATION
------------
No installation required. This is a static test site.

USAGE
-----
1. Point your web crawler to the index.html file
2. Configure crawl parameters (depth, delay, etc.)
3. Monitor crawler behavior

EXPECTED BEHAVIOR
-----------------
A properly functioning crawler should:
- Respect robots.txt rules
- Not visit /private/ directory
- Wait at least 1 second between requests (for MunicipalCrawler)
- Detect and skip duplicate URLs
- Handle both relative and absolute links
- Download and process text files

TEST SCENARIOS
--------------
1. Depth Testing
   - Set max_depth=2 to exclude docs/ pages
   - Set max_depth=3 to include all pages

2. Duplicate Detection
   - Multiple links to same pages should be visited only once
   - Check crawler logs for duplicate skipping

3. Robots.txt Compliance
   - Crawler should read and parse robots.txt
   - Should not access /private/ section
   - Should implement specified crawl delay

4. File Handling
   - Should download .txt files
   - Should parse .html files for links

TROUBLESHOOTING
---------------
- If crawler accesses /private/, check robots.txt parsing
- If visiting duplicates, verify URL normalization
- If too fast, check crawl delay implementation

For support, see contact.html or visit the FAQ.
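
EXAMPLE SKETCH
--------------
The robots.txt, duplicate-detection, and link-handling checks described
above can be sketched with the Python standard library. This is a minimal
illustration, not the crawler's actual implementation. The robots.txt text,
the example.test host, the docs/readme.txt file name, and the helper names
(load_rules, normalize_url, plan_visits) are assumptions made for this
example; the real test site's files may differ.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative rules only; the site's real robots.txt may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: MunicipalCrawler
Crawl-delay: 1
Disallow: /private/

User-agent: *
Disallow: /private/
"""

# ASSUMPTION: hypothetical host standing in for wherever the site is served.
BASE_URL = "http://example.test/index.html"


def load_rules(robots_txt):
    """Parse robots.txt text without making a network request."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def normalize_url(base, href):
    """Resolve href against base, drop the #fragment, lowercase the host."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    parts = urlsplit(absolute)
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def plan_visits(base, hrefs, rules, agent):
    """Return the URLs to fetch, in order, skipping duplicates and
    anything robots.txt disallows for the given user agent."""
    seen = set()
    planned = []
    for href in hrefs:
        url = normalize_url(base, href)
        if url in seen:
            continue  # duplicate link to a page already handled
        seen.add(url)
        if rules.can_fetch(agent, url):
            planned.append(url)
    return planned


if __name__ == "__main__":
    rules = load_rules(SAMPLE_ROBOTS_TXT)
    links = ["contact.html", "http://example.test/contact.html",
             "/private/secret.html", "docs/readme.txt"]
    for url in plan_visits(BASE_URL, links, rules, "MunicipalCrawler"):
        print(url)
    print("crawl delay:", rules.crawl_delay("MunicipalCrawler"))
```

A crawler built along these lines would sleep for the reported crawl delay
between requests; if /private/ URLs or repeated pages still show up in the
visit list, the parsing and normalization steps are the first places to look.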