Ryan Malloy 2063d81e60 test: add comprehensive tests for security.txt and canary.txt
Added 38 new tests (16 + 22) covering all features of the new generators:

## security.txt Tests (16 tests)
- RFC 9116 field validation (Canonical, Contact, Expires)
- Automatic mailto: prefix handling for email contacts
- Auto-expiration calculation (1 year from generation)
- Multiple contact methods support
- Multiple encryption keys
- All optional fields: acknowledgments, preferredLanguages, policy, hiring
- Proper field ordering compliance

## canary.txt Tests (22 tests)
- Compact field: value format validation
- Frequency-based expiration (daily: 2d, weekly: 10d, monthly: 35d, quarterly: 100d, yearly: 380d)
- Statement filtering (only non-received statements appear)
- Default statements vs custom statements
- Function-based dynamic statements
- Blockchain proof formatting (Network:Address:TxHash)
- Personnel duress statement
- Verification field
- Previous canary references
- Contact with mailto: prefix
- Organization and frequency fields

Test suite now at 72 total tests (up from 34), all passing.
2025-11-03 08:19:38 -07:00

@astrojs/discovery

Comprehensive discovery integration for Astro - handles robots.txt, llms.txt, humans.txt, and sitemap generation

Overview

This integration provides automatic generation of all standard discovery files for your Astro site, making it easily discoverable by search engines, LLMs, and humans.

Features

  • 🤖 robots.txt - Dynamic generation with LLM bot support
  • 🧠 llms.txt - AI assistant discovery and instructions
  • 👥 humans.txt - Human-readable credits and tech stack
  • 🗺️ sitemap.xml - Automatic sitemap generation
  • Dynamic URLs - Adapts to your site config
  • 🎯 Smart Caching - Optimized cache headers
  • 🔧 Fully Customizable - Override any section

Installation

npx astro add @astrojs/discovery

Or manually:

npm install @astrojs/discovery

Quick Start

Basic Setup

// astro.config.mjs
import { defineConfig } from 'astro/config';
import discovery from '@astrojs/discovery';

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    discovery()
  ]
});

That's it! This will generate:

  • /robots.txt
  • /llms.txt
  • /humans.txt
  • /sitemap-index.xml

With Configuration

// astro.config.mjs
import { defineConfig } from 'astro/config';
import discovery from '@astrojs/discovery';

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    discovery({
      // Robots.txt configuration
      robots: {
        crawlDelay: 2,
        additionalAgents: [
          {
            userAgent: 'CustomBot',
            allow: ['/api'],
            disallow: ['/admin']
          }
        ]
      },

      // LLMs.txt configuration
      llms: {
        description: 'Your site description for AI assistants',
        apiEndpoints: [
          { path: '/api/chat', description: 'Chat endpoint' },
          { path: '/api/search', description: 'Search API' }
        ],
        instructions: `
          When helping users with our site:
          1. Check documentation first
          2. Use provided API endpoints
          3. Follow brand guidelines
        `
      },

      // Humans.txt configuration
      humans: {
        team: [
          {
            name: 'Jane Doe',
            role: 'Creator & Developer',
            contact: 'jane@example.com',
            location: 'San Francisco, CA'
          }
        ],
        thanks: [
          'The Astro team',
          'Open source community'
        ],
        site: {
          lastUpdate: 'auto', // or specific date
          language: 'English',
          doctype: 'HTML5',
          ide: 'VS Code',
          techStack: ['Astro', 'TypeScript', 'React']
        },
        story: 'Your project story...',
        funFacts: [
          'Built with love',
          'Coffee-powered development'
        ]
      },

      // Sitemap configuration
      sitemap: {
        // Passed through to @astrojs/sitemap
        filter: (page) => !page.includes('/admin'),
        changefreq: 'weekly',
        priority: 0.7
      }
    })
  ]
});

API Reference

discovery(options?)

Options

robots

Configuration for robots.txt generation.

Type:

interface RobotsConfig {
  crawlDelay?: number;
  allowAllBots?: boolean;
  llmBots?: {
    enabled?: boolean;
    agents?: string[]; // Custom LLM bot names
  };
  additionalAgents?: Array<{
    userAgent: string;
    allow?: string[];
    disallow?: string[];
  }>;
  customRules?: string; // Raw robots.txt content to append
}

Default:

{
  crawlDelay: 1,
  allowAllBots: true,
  llmBots: {
    enabled: true,
    agents: [
      'Anthropic-AI',
      'Claude-Web',
      'GPTBot',
      'ChatGPT-User',
      'cohere-ai',
      'Google-Extended'
    ]
  }
}

Example:

discovery({
  robots: {
    crawlDelay: 2,
    llmBots: {
      enabled: true,
      agents: ['CustomAIBot', 'AnotherBot']
    },
    additionalAgents: [
      {
        userAgent: 'BadBot',
        disallow: ['/']
      }
    ]
  }
})

llms

Configuration for llms.txt generation.

Type:

interface LLMsConfig {
  enabled?: boolean;
  description?: string;
  keyFeatures?: string[];
  importantPages?: Array<{
    name: string;
    path: string;
    description?: string;
  }>;
  instructions?: string;
  apiEndpoints?: Array<{
    path: string;
    method?: string;
    description: string;
  }>;
  techStack?: {
    frontend?: string[];
    backend?: string[];
    ai?: string[];
    other?: string[];
  };
  brandVoice?: string[];
  customSections?: Record<string, string>;
}

Example:

discovery({
  llms: {
    description: 'E-commerce platform for sustainable products',
    keyFeatures: [
      'AI-powered product recommendations',
      'Carbon footprint calculator',
      'Subscription management'
    ],
    instructions: `
      When helping users:
      1. Check product availability via API
      2. Suggest sustainable alternatives
      3. Calculate shipping costs
    `,
    apiEndpoints: [
      {
        path: '/api/products',
        method: 'GET',
        description: 'List all products'
      },
      {
        path: '/api/calculate-footprint',
        method: 'POST',
        description: 'Calculate carbon footprint'
      }
    ]
  }
})

humans

Configuration for humans.txt generation.

Type:

interface HumansConfig {
  enabled?: boolean;
  team?: Array<{
    name: string;
    role?: string;
    contact?: string;
    location?: string;
    twitter?: string;
    github?: string;
  }>;
  thanks?: string[];
  site?: {
    lastUpdate?: string | 'auto';
    language?: string;
    doctype?: string;
    ide?: string;
    techStack?: string[];
    standards?: string[];
    components?: string[];
    software?: string[];
  };
  story?: string;
  funFacts?: string[];
  philosophy?: string[];
  customSections?: Record<string, string>;
}

Example:

discovery({
  humans: {
    team: [
      {
        name: 'Alice Developer',
        role: 'Lead Developer',
        contact: 'alice@example.com',
        location: 'New York',
        github: 'alice-dev'
      }
    ],
    thanks: [
      'Coffee',
      'Stack Overflow community',
      'My rubber duck'
    ],
    story: `
      This project started when we realized that...
    `,
    funFacts: [
      'Written entirely on a mechanical keyboard',
      'Fueled by 347 cups of coffee',
      'Built during a 48-hour hackathon'
    ]
  }
})

sitemap

Configuration passed to @astrojs/sitemap.

Type:

interface SitemapConfig {
  filter?: (page: string) => boolean;
  customPages?: string[];
  i18n?: {
    defaultLocale: string;
    locales: Record<string, string>;
  };
  changefreq?: 'always' | 'hourly' | 'daily' | 'weekly' | 'monthly' | 'yearly' | 'never';
  lastmod?: Date;
  priority?: number;
  serialize?: (item: SitemapItem) => SitemapItem | undefined;
}

Example:

discovery({
  sitemap: {
    filter: (page) => !page.includes('/admin') && !page.includes('/draft'),
    changefreq: 'daily',
    priority: 0.8
  }
})

caching

Configure HTTP cache headers for discovery files.

Type:

interface CachingConfig {
  robots?: number; // seconds
  llms?: number;
  humans?: number;
  sitemap?: number;
}

Default:

{
  robots: 3600,    // 1 hour
  llms: 3600,      // 1 hour
  humans: 86400,   // 24 hours
  sitemap: 3600    // 1 hour
}
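
As a rough illustration of how these values could end up in a response header, the sketch below merges user overrides over the defaults and builds a Cache-Control value. The helper name `cacheControlFor` and the `public, max-age=N` header shape are assumptions made for this example; the README does not document the exact header the integration emits.

```typescript
// Hypothetical helper, not part of the documented API.
type DiscoveryFile = 'robots' | 'llms' | 'humans' | 'sitemap';

// Defaults copied from the listing above (values in seconds).
const defaultCaching: Record<DiscoveryFile, number> = {
  robots: 3600,
  llms: 3600,
  humans: 86400,
  sitemap: 3600,
};

// Use the override when one is given, otherwise fall back to the default.
function cacheControlFor(
  file: DiscoveryFile,
  overrides: Partial<Record<DiscoveryFile, number>> = {},
): string {
  const seconds = overrides[file] ?? defaultCaching[file];
  // Assumed header shape; the real integration may add other directives.
  return `public, max-age=${seconds}`;
}

console.log(cacheControlFor('llms', { llms: 1800 })); // public, max-age=1800
console.log(cacheControlFor('humans')); // public, max-age=86400
```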

Advanced Usage

Custom Templates

You can provide custom templates for any file:

discovery({
  templates: {
    robots: (config, siteURL) => `
User-agent: *
Allow: /

# Custom content
Sitemap: ${siteURL}/sitemap-index.xml
    `,

    llms: (config, siteURL) => `
# ${config.description}

Visit ${siteURL} for more information.
    `
  }
})
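
Two details are easy to get wrong in a template like the ones above: the template literal starts and ends with blank lines, and joining the site URL with a hand-written `/` yields a double slash when the URL already ends in one (which is always the case for a `URL` object). The sketch below shows one way to handle both. The helper names `siteFileUrl` and `tidy` are hypothetical, and whether the integration passes `siteURL` as a string or a `URL` is not stated here, so the sketch accepts either.

```typescript
// Hypothetical helpers for writing custom templates.

// Resolve a root-level file name against the site URL without doubling slashes.
function siteFileUrl(siteURL: string | URL, file: string): string {
  return new URL(file, siteURL).href;
}

// Drop the blank lines a multi-line template literal adds at both ends
// and finish with a single trailing newline.
function tidy(text: string): string {
  return text.trim() + '\n';
}

// Same shape as the robots template above: (config, siteURL) => string.
const robotsTemplate = (_config: unknown, siteURL: string | URL): string =>
  tidy(`
User-agent: *
Allow: /

# Custom content
Sitemap: ${siteFileUrl(siteURL, 'sitemap-index.xml')}
`);

console.log(robotsTemplate({}, 'https://example.com/'));
```

Note that `new URL(file, base)` resolves relative to the last slash in the base, so a site served from a sub-path needs a trailing slash on that path for the file to land under it.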

Conditional Generation

Disable specific files in certain environments:

discovery({
  robots: {
    enabled: import.meta.env.PROD // Only in production
  },
  llms: {
    enabled: true // Always generate
  },
  humans: {
    enabled: import.meta.env.DEV // Only in development
  }
})

Dynamic Content

Use functions for dynamic content:

import fs from 'node:fs';

discovery({
  llms: {
    description: () => {
      const pkg = JSON.parse(fs.readFileSync('./package.json', 'utf-8'));
      return `${pkg.name} - ${pkg.description}`;
    },
    apiEndpoints: async () => {
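      // Load from OpenAPI spec. loadOpenAPISpec is a placeholder for your own
      // loader, assumed here to return `paths` as an array.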
      const spec = await loadOpenAPISpec();
      return spec.paths.map(path => ({
        path: path.url,
        method: path.method,
        description: path.summary
      }));
    }
  }
})
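
For readers wondering how an option that may be a plain value, a function, or an async function can be handled uniformly, here is a minimal sketch. `MaybeDynamic` and `resolveOption` are names invented for this example and are not taken from the integration's source; the only assumption is that the option value itself is never a function type.

```typescript
// Hypothetical types and helper for illustration only.
type MaybeDynamic<T> = T | (() => T | Promise<T>);

// Return the value as-is, or call it and await the result if it is a function.
async function resolveOption<T>(value: MaybeDynamic<T>): Promise<T> {
  if (typeof value === 'function') {
    return await (value as () => T | Promise<T>)();
  }
  return value as T;
}

async function demo(): Promise<void> {
  const fixed = await resolveOption<string>('Static description');
  const lazy = await resolveOption<string>(() => 'Computed description');
  const remote = await resolveOption<string[]>(async () => ['/api/chat', '/api/search']);
  console.log(fixed, lazy, remote);
}

void demo();
```

Awaiting the call covers both the synchronous and the asynchronous case, since awaiting a non-promise value simply yields that value.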

Integration with Other Tools

With @astrojs/sitemap

The discovery integration automatically includes @astrojs/sitemap, so you don't need to install it separately. Configuration is passed through:

discovery({
  sitemap: {
    // All @astrojs/sitemap options work here
    filter: (page) => !page.includes('/secret'),
    changefreq: 'weekly'
  }
})

With Content Collections

Automatically extract information from content collections:

discovery({
  llms: {
    importantPages: async () => {
      const docs = await getCollection('docs');
      return docs.map(doc => ({
        name: doc.data.title,
        path: `/docs/${doc.slug}`,
        description: doc.data.description
      }));
    }
  }
})

With Environment Variables

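Use environment variables to keep values such as contact addresses out of your config file. Anything written to humans.txt is served publicly, so do not put secrets here: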

discovery({
  humans: {
    team: [
      {
        name: 'Developer',
        contact: process.env.PUBLIC_CONTACT_EMAIL
      }
    ]
  }
})

Output

The integration generates the following files:

/robots.txt

User-agent: *
Allow: /

# Sitemaps
Sitemap: https://example.com/sitemap-index.xml

# LLM-specific resources
User-agent: Anthropic-AI
User-agent: Claude-Web
User-agent: GPTBot
Allow: /llms.txt

# Crawl delay
Crawl-delay: 1
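
To make the mapping from configuration to output more concrete, the sketch below renders a simplified config into robots.txt text. It is not the integration's actual generator: `RobotsSketchConfig` and `renderRobots` are hypothetical, and the config is flattened (`llmAgents` instead of `llmBots.agents`) to keep the example short. One deliberate difference from the sample above: the sketch places `Crawl-delay` inside the `User-agent: *` group, because robots.txt directives apply to the group they appear in.

```typescript
// Hypothetical, simplified shapes for illustration.
interface AgentRule {
  userAgent: string;
  allow?: string[];
  disallow?: string[];
}

interface RobotsSketchConfig {
  crawlDelay?: number;
  llmAgents?: string[];
  additionalAgents?: AgentRule[];
}

function renderRobots(config: RobotsSketchConfig, siteURL: string): string {
  // Default group for all crawlers.
  const lines: string[] = ['User-agent: *', 'Allow: /'];
  if (config.crawlDelay !== undefined) {
    lines.push(`Crawl-delay: ${config.crawlDelay}`);
  }
  lines.push('');

  lines.push('# Sitemaps', `Sitemap: ${new URL('sitemap-index.xml', siteURL).href}`, '');

  // One shared group that points LLM crawlers at /llms.txt.
  const llmAgents = config.llmAgents ?? [];
  if (llmAgents.length > 0) {
    lines.push('# LLM-specific resources');
    for (const agent of llmAgents) lines.push(`User-agent: ${agent}`);
    lines.push('Allow: /llms.txt', '');
  }

  // One group per additional agent.
  for (const rule of config.additionalAgents ?? []) {
    lines.push(`User-agent: ${rule.userAgent}`);
    for (const path of rule.allow ?? []) lines.push(`Allow: ${path}`);
    for (const path of rule.disallow ?? []) lines.push(`Disallow: ${path}`);
    lines.push('');
  }

  return lines.join('\n').trimEnd() + '\n';
}

console.log(
  renderRobots(
    {
      crawlDelay: 2,
      llmAgents: ['GPTBot', 'Claude-Web'],
      additionalAgents: [{ userAgent: 'BadBot', disallow: ['/'] }],
    },
    'https://example.com',
  ),
);
```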

/llms.txt

# Project Name - Description

> Short tagline

## Site Information
- Name: Project Name
- Description: Full description
- URL: https://example.com

## For AI Assistants
Instructions for AI assistants...

## API Endpoints
- GET /api/endpoint - Description
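
The API Endpoints lines in the sample above could be produced by something like the following sketch. `renderApiEndpoints` is a hypothetical helper, and falling back to `GET` when `method` is omitted is an assumption; the README does not say what the integration does in that case.

```typescript
// Mirrors the apiEndpoints entries in LLMsConfig.
interface ApiEndpoint {
  path: string;
  method?: string;
  description: string;
}

// Hypothetical renderer: one "- METHOD path - description" line per endpoint.
function renderApiEndpoints(endpoints: ApiEndpoint[]): string {
  if (endpoints.length === 0) return '';
  const items = endpoints.map(
    // Assumption: a missing method is shown as GET.
    (e) => `- ${(e.method ?? 'GET').toUpperCase()} ${e.path} - ${e.description}`,
  );
  return ['## API Endpoints', ...items].join('\n');
}

console.log(
  renderApiEndpoints([
    { path: '/api/products', description: 'List all products' },
    { path: '/api/calculate-footprint', method: 'post', description: 'Calculate carbon footprint' },
  ]),
);
```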

/humans.txt

/* TEAM */

Name: Developer Name
Role: Position
Contact: email@example.com

/* THANKS */
- Thank you note 1
- Thank you note 2

/* SITE */
Tech stack and details...

/sitemap-index.xml

Standard XML sitemap index that links to the generated sitemap files listing your pages.

Best Practices

1. Set Your Site URL

Always configure site in your Astro config:

export default defineConfig({
  site: 'https://example.com', // Required!
  integrations: [discovery()]
});

2. Keep humans.txt Updated

Update your team information and tech stack regularly:

discovery({
  humans: {
    site: {
      lastUpdate: 'auto' // Automatically uses current date
    }
  }
})

3. Be Specific with LLM Instructions

Provide clear, actionable instructions for AI assistants:

discovery({
  llms: {
    instructions: `
      When helping users:
      1. Always check API documentation first
      2. Use the /api/search endpoint for queries
      3. Format responses in markdown
      4. Include relevant links
    `
  }
})

4. Filter Private Pages

Exclude admin, draft, and private pages:

discovery({
  sitemap: {
    filter: (page) => {
      return !page.includes('/admin') &&
             !page.includes('/draft') &&
             !page.includes('/private');
    }
  },
  robots: {
    additionalAgents: [
      {
        userAgent: '*',
        disallow: ['/admin', '/draft', '/private']
      }
    ]
  }
})
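
The sitemap filter and the robots rules above repeat the same three paths, and `includes` also matches unrelated pages such as `/blog/administration/`. One possible refinement, sketched below, keeps a single list of private prefixes and matches them against whole path segments. `privatePrefixes` and `isPublicPage` are names made up for this example. Unlike the `includes` version, `/draft` here no longer matches `/drafts/...`, so add such prefixes explicitly if you need them.

```typescript
// Single source of truth for paths that should stay out of discovery files.
const privatePrefixes = ['/admin', '/draft', '/private'];

// True when the page is not under any private prefix.
// Accepts either a full URL or a bare path; the base URL is only used to parse bare paths.
function isPublicPage(page: string): boolean {
  const { pathname } = new URL(page, 'https://example.com');
  return !privatePrefixes.some(
    (prefix) => pathname === prefix || pathname.startsWith(`${prefix}/`),
  );
}

// Options object that could be passed to discovery(); same shape as the example above.
const discoveryOptions = {
  sitemap: { filter: isPublicPage },
  robots: {
    additionalAgents: [{ userAgent: '*', disallow: privatePrefixes }],
  },
};

console.log(isPublicPage('https://example.com/admin/users/')); // false
console.log(isPublicPage('https://example.com/blog/administration/')); // true
console.log(discoveryOptions.robots.additionalAgents[0].disallow.join(', '));
```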

5. Optimize Cache Headers

Balance freshness with server load:

discovery({
  caching: {
    robots: 3600,    // 1 hour - changes rarely
    llms: 1800,      // 30 min - may update instructions
    humans: 86400,   // 24 hours - credits don't change often
    sitemap: 3600    // 1 hour - content changes moderately
  }
})

Troubleshooting

Files Not Generating

  1. Check your output mode:

export default defineConfig({
  output: 'hybrid', // or 'server'
  // ...
});

  2. Verify site URL is set:

export default defineConfig({
  site: 'https://example.com' // Must be set!
});

  3. Check for conflicts: Remove any existing /public/robots.txt or similar static files.

Wrong URLs in Files

Make sure your site config matches your production domain:

export default defineConfig({
  site: import.meta.env.PROD
    ? 'https://production.com'
    : 'http://localhost:4321'
});

LLM Bots Not Respecting Instructions

  • Ensure /llms.txt is accessible
  • Check robots.txt allows LLM bots
  • Verify content is properly formatted

Sitemap Issues

Check @astrojs/sitemap documentation for detailed troubleshooting: https://docs.astro.build/en/guides/integrations-guide/sitemap/

Migration Guide

From Manual Files

If you have existing static files in /public, remove them:

rm public/robots.txt
rm public/humans.txt
rm public/sitemap.xml

Then configure the integration with your existing content:

discovery({
  humans: {
    team: [/* your existing team data */],
    thanks: [/* your existing thanks */]
  }
})

From @astrojs/sitemap

Replace:

import sitemap from '@astrojs/sitemap';

export default defineConfig({
  integrations: [sitemap()]
});

With:

import discovery from '@astrojs/discovery';

export default defineConfig({
  integrations: [
    discovery({
      sitemap: {
        // Your existing sitemap config
      }
    })
  ]
});

Examples

E-commerce Site

discovery({
  robots: {
    crawlDelay: 2,
    additionalAgents: [
      {
        userAgent: 'PriceBot',
        disallow: ['/checkout', '/account']
      }
    ]
  },
  llms: {
    description: 'Online store for sustainable products',
    keyFeatures: [
      'Eco-friendly product catalog',
      'Carbon footprint calculator',
      'Sustainable shipping options'
    ],
    apiEndpoints: [
      { path: '/api/products', description: 'Product catalog' },
      { path: '/api/calculate-carbon', description: 'Carbon calculator' }
    ]
  },
  sitemap: {
    filter: (page) =>
      !page.includes('/checkout') &&
      !page.includes('/account')
  }
})

Documentation Site

discovery({
  llms: {
    description: 'Technical documentation for our API',
    instructions: `
      When helping users:
      1. Search documentation before answering
      2. Provide code examples from /examples
      3. Link to relevant API reference pages
      4. Suggest similar solutions from FAQ
    `,
    importantPages: async () => {
      const docs = await getCollection('docs');
      return docs
        .filter(doc => doc.data.featured)
        .map(doc => ({
          name: doc.data.title,
          path: `/docs/${doc.slug}`,
          description: doc.data.description
        }));
    }
  },
  humans: {
    team: [
      {
        name: 'Documentation Team',
        contact: 'docs@example.com'
      }
    ],
    thanks: [
      'Our amazing community contributors',
      'Technical writers worldwide'
    ]
  }
})

Personal Blog

discovery({
  llms: {
    description: 'Personal blog about web development',
    brandVoice: [
      'Casual and friendly',
      'Technical but accessible',
      'Focus on practical examples'
    ]
  },
  humans: {
    team: [
      {
        name: 'Jane Blogger',
        role: 'Writer & Developer',
        twitter: '@janeblogger',
        github: 'jane-dev'
      }
    ],
    story: `
      Started this blog to document my journey learning web development.
      Went from tutorial hell to building real projects. Now sharing
      what I've learned to help others on their journey.
    `,
    funFacts: [
      'All posts written in markdown',
      'Powered by coffee and curiosity',
      'Deployed automatically on every commit'
    ]
  }
})

Performance

The integration is designed for minimal performance impact:

  • Build Time: Adds ~100-200ms to build process
  • Runtime: All files are statically generated at build time
  • Caching: Smart HTTP cache headers reduce server load
  • Bundle Size: Zero client-side JavaScript

Contributing

We welcome contributions! See our Contributing Guide.

License

MIT

Credits

Built with inspiration from:

  • The Astro community
  • humanstxt.org initiative
  • The llms.txt proposal (llmstxt.org)
  • Web standards organizations

Made with ❤️ by the Astro community
