astro-discovery/README.md
Ryan Malloy d25dde4627 feat: initial implementation of @astrojs/discovery integration
This commit introduces a comprehensive Astro integration that automatically
generates discovery files for websites:

Features:
- robots.txt with LLM bot support (Anthropic-AI, GPTBot, etc.)
- llms.txt for AI assistant context and instructions
- humans.txt for team credits and site information
- Automatic sitemap integration via @astrojs/sitemap

Technical Details:
- TypeScript implementation with full type safety
- Configurable HTTP caching headers
- Custom template support for all generated files
- Sensible defaults with extensive customization options
- Date-based versioning (2025.11.03)

Testing:
- 34 unit tests covering all generators
- Test coverage for robots.txt, llms.txt, and humans.txt
- Integration with Vitest

Documentation:
- Comprehensive README with examples
- API reference documentation
- Contributing guidelines
- Example configurations (minimal and full)
2025-11-03 07:36:39 -07:00


# @astrojs/discovery
> Comprehensive discovery integration for Astro - handles robots.txt, llms.txt, humans.txt, and sitemap generation
## Overview
This integration provides automatic generation of all standard discovery files for your Astro site, making it easily discoverable by search engines, LLMs, and humans.
## Features
- 🤖 **robots.txt** - Dynamic generation with LLM bot support
- 🧠 **llms.txt** - AI assistant discovery and instructions
- 👥 **humans.txt** - Human-readable credits and tech stack
- 🗺️ **sitemap.xml** - Automatic sitemap generation
- **Dynamic URLs** - Adapts to your `site` config
- 🎯 **Smart Caching** - Optimized cache headers
- 🔧 **Fully Customizable** - Override any section
## Installation
```bash
npx astro add @astrojs/discovery
```
Or manually:
```bash
npm install @astrojs/discovery
```
## Quick Start
### Basic Setup
```typescript
// astro.config.mjs
import { defineConfig } from 'astro/config';
import discovery from '@astrojs/discovery';

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    discovery()
  ]
});
```
That's it! This will generate:
- `/robots.txt`
- `/llms.txt`
- `/humans.txt`
- `/sitemap-index.xml`
### With Configuration
```typescript
// astro.config.mjs
import { defineConfig } from 'astro/config';
import discovery from '@astrojs/discovery';
export default defineConfig({
site: 'https://example.com',
integrations: [
discovery({
// Robots.txt configuration
robots: {
crawlDelay: 2,
additionalAgents: [
{
userAgent: 'CustomBot',
allow: ['/api'],
disallow: ['/admin']
}
]
},
// LLMs.txt configuration
llms: {
description: 'Your site description for AI assistants',
apiEndpoints: [
{ path: '/api/chat', description: 'Chat endpoint' },
{ path: '/api/search', description: 'Search API' }
],
instructions: `
When helping users with our site:
1. Check documentation first
2. Use provided API endpoints
3. Follow brand guidelines
`
},
// Humans.txt configuration
humans: {
team: [
{
name: 'Jane Doe',
role: 'Creator & Developer',
contact: 'jane@example.com',
location: 'San Francisco, CA'
}
],
thanks: [
'The Astro team',
'Open source community'
],
site: {
lastUpdate: 'auto', // or specific date
language: 'English',
doctype: 'HTML5',
ide: 'VS Code',
techStack: ['Astro', 'TypeScript', 'React']
},
story: 'Your project story...',
funFacts: [
'Built with love',
'Coffee-powered development'
]
},
// Sitemap configuration
sitemap: {
// Passed through to @astrojs/sitemap
filter: (page) => !page.includes('/admin'),
changefreq: 'weekly',
priority: 0.7
}
})
]
});
```
## API Reference
### `discovery(options?)`
#### Options
##### `robots`
Configuration for robots.txt generation.
**Type:**
```typescript
interface RobotsConfig {
  crawlDelay?: number;
  allowAllBots?: boolean;
  llmBots?: {
    enabled?: boolean;
    agents?: string[]; // Custom LLM bot names
  };
  additionalAgents?: Array<{
    userAgent: string;
    allow?: string[];
    disallow?: string[];
  }>;
  customRules?: string; // Raw robots.txt content to append
}
```
**Default:**
```typescript
{
  crawlDelay: 1,
  allowAllBots: true,
  llmBots: {
    enabled: true,
    agents: [
      'Anthropic-AI',
      'Claude-Web',
      'GPTBot',
      'ChatGPT-User',
      'cohere-ai',
      'Google-Extended'
    ]
  }
}
```
**Example:**
```typescript
discovery({
  robots: {
    crawlDelay: 2,
    llmBots: {
      enabled: true,
      agents: ['CustomAIBot', 'AnotherBot']
    },
    additionalAgents: [
      {
        userAgent: 'BadBot',
        disallow: ['/']
      }
    ]
  }
})
```
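
To make the option semantics concrete, here is a minimal, self-contained sketch of how a config like the one above could be turned into robots.txt text. It is not the integration's actual generator: `renderRobots` and `RobotsSketchConfig` are hypothetical names, and the ordering of the emitted groups is an assumption.

```typescript
// Hypothetical sketch: NOT the integration's real generator.
interface AgentRule {
  userAgent: string;
  allow?: string[];
  disallow?: string[];
}

interface RobotsSketchConfig {
  crawlDelay?: number;
  llmBots?: { enabled?: boolean; agents?: string[] };
  additionalAgents?: AgentRule[];
  customRules?: string;
}

function renderRobots(config: RobotsSketchConfig, siteURL: string): string {
  const lines: string[] = ['User-agent: *', 'Allow: /'];
  if (config.crawlDelay !== undefined) {
    lines.push(`Crawl-delay: ${config.crawlDelay}`);
  }
  // Resolve the sitemap location against the site URL.
  lines.push('', `Sitemap: ${new URL('/sitemap-index.xml', siteURL).href}`);

  // LLM bots share one group that points them at /llms.txt.
  const llmAgents = config.llmBots?.enabled === false ? [] : config.llmBots?.agents ?? [];
  if (llmAgents.length > 0) {
    lines.push('', ...llmAgents.map((agent) => `User-agent: ${agent}`), 'Allow: /llms.txt');
  }

  // Each additional agent gets its own group.
  for (const rule of config.additionalAgents ?? []) {
    lines.push('', `User-agent: ${rule.userAgent}`);
    for (const path of rule.allow ?? []) lines.push(`Allow: ${path}`);
    for (const path of rule.disallow ?? []) lines.push(`Disallow: ${path}`);
  }

  if (config.customRules) lines.push('', config.customRules.trim());
  return lines.join('\n') + '\n';
}

const robotsText = renderRobots(
  {
    crawlDelay: 2,
    llmBots: { enabled: true, agents: ['CustomAIBot', 'AnotherBot'] },
    additionalAgents: [{ userAgent: 'BadBot', disallow: ['/'] }],
  },
  'https://example.com',
);
console.log(robotsText);
```

Running the sketch prints one group per agent, which is a quick way to check that a rule such as `disallow: ['/']` lands under the agent you intended.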
##### `llms`
Configuration for llms.txt generation.
**Type:**
```typescript
interface LLMsConfig {
enabled?: boolean;
description?: string;
keyFeatures?: string[];
importantPages?: Array<{
name: string;
path: string;
description?: string;
}>;
instructions?: string;
apiEndpoints?: Array<{
path: string;
method?: string;
description: string;
}>;
techStack?: {
frontend?: string[];
backend?: string[];
ai?: string[];
other?: string[];
};
brandVoice?: string[];
customSections?: Record<string, string>;
}
```
**Example:**
```typescript
discovery({
llms: {
description: 'E-commerce platform for sustainable products',
keyFeatures: [
'AI-powered product recommendations',
'Carbon footprint calculator',
'Subscription management'
],
instructions: `
When helping users:
1. Check product availability via API
2. Suggest sustainable alternatives
3. Calculate shipping costs
`,
apiEndpoints: [
{
path: '/api/products',
method: 'GET',
description: 'List all products'
},
{
path: '/api/calculate-footprint',
method: 'POST',
description: 'Calculate carbon footprint'
}
]
}
})
```
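
The sketch below shows one way such a config could map onto the sections of an llms.txt file. It is an illustration under stated assumptions rather than the integration's real output: `renderLlms` is a hypothetical helper, the `## Key Features` heading is invented for the example, and defaulting a missing `method` to `GET` is an assumption.

```typescript
// Hypothetical sketch: section names and defaults are assumptions.
interface LlmsSketchConfig {
  description?: string;
  keyFeatures?: string[];
  instructions?: string;
  apiEndpoints?: Array<{ path: string; method?: string; description: string }>;
}

function renderLlms(config: LlmsSketchConfig, siteURL: string): string {
  const sections: string[] = [`# ${config.description ?? siteURL}`];
  sections.push(`## Site Information\n- URL: ${siteURL}`);

  if (config.keyFeatures?.length) {
    const items = config.keyFeatures.map((feature) => `- ${feature}`);
    sections.push('## Key Features\n' + items.join('\n'));
  }

  if (config.instructions) {
    // Drop blank lines and the indentation inherited from the config file.
    const body = config.instructions
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line !== '')
      .join('\n');
    sections.push(`## For AI Assistants\n${body}`);
  }

  if (config.apiEndpoints?.length) {
    const items = config.apiEndpoints.map(
      (endpoint) => `- ${endpoint.method ?? 'GET'} ${endpoint.path} - ${endpoint.description}`,
    );
    sections.push('## API Endpoints\n' + items.join('\n'));
  }

  return sections.join('\n\n') + '\n';
}

const llmsText = renderLlms(
  {
    description: 'E-commerce platform for sustainable products',
    instructions: `
      When helping users:
      1. Check product availability via API
    `,
    apiEndpoints: [
      { path: '/api/products', method: 'GET', description: 'List all products' },
      { path: '/api/calculate-footprint', method: 'POST', description: 'Calculate carbon footprint' },
    ],
  },
  'https://example.com',
);
console.log(llmsText);
```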
##### `humans`
Configuration for humans.txt generation.
**Type:**
```typescript
interface HumansConfig {
enabled?: boolean;
team?: Array<{
name: string;
role?: string;
contact?: string;
location?: string;
twitter?: string;
github?: string;
}>;
thanks?: string[];
site?: {
lastUpdate?: string | 'auto';
language?: string;
doctype?: string;
ide?: string;
techStack?: string[];
standards?: string[];
components?: string[];
software?: string[];
};
story?: string;
funFacts?: string[];
philosophy?: string[];
customSections?: Record<string, string>;
}
```
**Example:**
```typescript
discovery({
humans: {
team: [
{
name: 'Alice Developer',
role: 'Lead Developer',
contact: 'alice@example.com',
location: 'New York',
github: 'alice-dev'
}
],
thanks: [
'Coffee',
'Stack Overflow community',
'My rubber duck'
],
story: `
This project started when we realized that...
`,
funFacts: [
'Written entirely on a mechanical keyboard',
'Fueled by 347 cups of coffee',
'Built during a 48-hour hackathon'
]
}
})
```
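
As a rough illustration of how these fields could end up in the file, the sketch below renders a `/* TEAM */`, `/* THANKS */`, and `/* SITE */` section and resolves `lastUpdate: 'auto'` to a date. `renderHumans` is a hypothetical helper, and the `YYYY/MM/DD` date format is an assumption borrowed from common humans.txt examples, not something this README specifies.

```typescript
// Hypothetical sketch: NOT the integration's real generator.
interface TeamMember {
  name: string;
  role?: string;
  contact?: string;
  location?: string;
}

interface HumansSketchConfig {
  team?: TeamMember[];
  thanks?: string[];
  site?: { lastUpdate?: string };
}

function renderHumans(config: HumansSketchConfig, now: Date = new Date()): string {
  const sections: string[] = [];

  if (config.team?.length) {
    const members = config.team.map((member) =>
      [
        `Name: ${member.name}`,
        member.role ? `Role: ${member.role}` : '',
        member.contact ? `Contact: ${member.contact}` : '',
        member.location ? `Location: ${member.location}` : '',
      ]
        .filter((line) => line !== '')
        .join('\n'),
    );
    sections.push('/* TEAM */\n' + members.join('\n\n'));
  }

  if (config.thanks?.length) {
    sections.push('/* THANKS */\n' + config.thanks.map((item) => `- ${item}`).join('\n'));
  }

  if (config.site?.lastUpdate) {
    // 'auto' is replaced by the (UTC) date of the build.
    const date =
      config.site.lastUpdate === 'auto'
        ? now.toISOString().slice(0, 10).split('-').join('/')
        : config.site.lastUpdate;
    sections.push(`/* SITE */\nLast update: ${date}`);
  }

  return sections.join('\n\n') + '\n';
}

const humansText = renderHumans(
  {
    team: [{ name: 'Alice Developer', role: 'Lead Developer', location: 'New York' }],
    thanks: ['Coffee', 'My rubber duck'],
    site: { lastUpdate: 'auto' },
  },
  new Date('2025-11-03T12:00:00Z'),
);
console.log(humansText);
```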
##### `sitemap`
Configuration passed to `@astrojs/sitemap`.
**Type:**
```typescript
interface SitemapConfig {
filter?: (page: string) => boolean;
customPages?: string[];
i18n?: {
defaultLocale: string;
locales: Record<string, string>;
};
changefreq?: 'always' | 'hourly' | 'daily' | 'weekly' | 'monthly' | 'yearly' | 'never';
lastmod?: Date;
priority?: number;
serialize?: (item: SitemapItem) => SitemapItem | undefined;
}
```
**Example:**
```typescript
discovery({
sitemap: {
filter: (page) => !page.includes('/admin') && !page.includes('/draft'),
changefreq: 'daily',
priority: 0.8
}
})
```
##### `caching`
Configure HTTP cache headers for discovery files.
**Type:**
```typescript
interface CachingConfig {
  robots?: number; // seconds
  llms?: number;
  humans?: number;
  sitemap?: number;
}
```
**Default:**
```typescript
{
  robots: 3600,  // 1 hour
  llms: 3600,    // 1 hour
  humans: 86400, // 24 hours
  sitemap: 3600  // 1 hour
}
```
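
The values are plain durations in seconds. A minimal sketch of how they could translate into a `Cache-Control` header follows. The `public, max-age=N` form and the `cacheControlFor` helper are assumptions for illustration; the README does not state the exact header the integration emits.

```typescript
// Hypothetical sketch: the header format is an assumption.
type DiscoveryFile = 'robots' | 'llms' | 'humans' | 'sitemap';

// Defaults copied from the table above (seconds).
const defaultCaching: Record<DiscoveryFile, number> = {
  robots: 3600,
  llms: 3600,
  humans: 86400,
  sitemap: 3600,
};

function cacheControlFor(
  file: DiscoveryFile,
  overrides: Partial<Record<DiscoveryFile, number>> = {},
): string {
  const seconds = overrides[file] ?? defaultCaching[file];
  if (!Number.isInteger(seconds) || seconds < 0) {
    throw new RangeError(`caching.${file} must be a non-negative integer, got ${seconds}`);
  }
  return `public, max-age=${seconds}`;
}

console.log(cacheControlFor('humans'));
console.log(cacheControlFor('llms', { llms: 1800 }));
```

Validating the number up front catches mistakes such as passing milliseconds or a negative value before they reach a response header.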
## Advanced Usage
### Custom Templates
You can provide custom templates for any file:
```typescript
discovery({
  templates: {
    robots: (config, siteURL) => `
User-agent: *
Allow: /
# Custom content
Sitemap: ${siteURL}/sitemap-index.xml
`,
    llms: (config, siteURL) => `
# ${config.description}
Visit ${siteURL} for more information.
`
  }
})
```
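
A template is just a function from `(config, siteURL)` to a string. The sketch below shows one plausible way a custom template could take precedence over a built-in one, and why multi-line template literals may need tidying: they start with a newline and carry whatever indentation the source file has. `renderFile` and `tidy` are hypothetical names, and whether the integration normalizes whitespace like this is an assumption.

```typescript
// Hypothetical sketch: names and whitespace handling are assumptions.
type Template<C> = (config: C, siteURL: string) => string;

// Remove the leading newline and per-line indentation that a multi-line
// template literal picks up from the surrounding source file.
function tidy(raw: string): string {
  return raw
    .split('\n')
    .map((line) => line.trim())
    .join('\n')
    .trim() + '\n';
}

// Use the custom template when one is provided, otherwise the fallback.
function renderFile<C>(
  config: C,
  siteURL: string,
  fallback: Template<C>,
  custom?: Template<C>,
): string {
  return tidy((custom ?? fallback)(config, siteURL));
}

interface SketchConfig {
  description?: string;
}

const defaultRobots: Template<SketchConfig> = () => 'User-agent: *\nAllow: /';
const customRobots: Template<SketchConfig> = (_config, siteURL) => `
  User-agent: *
  Allow: /
  Sitemap: ${siteURL}/sitemap-index.xml
`;

const templated = renderFile({}, 'https://example.com', defaultRobots, customRobots);
const untouched = renderFile({}, 'https://example.com', defaultRobots);
console.log(templated);
console.log(untouched);
```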
### Conditional Generation
Disable specific files in certain environments:
```typescript
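// Note: `import.meta.env` may not be populated while the config file is
// evaluated. If these flags come back undefined, derive them from
// `process.env` instead.
discovery({
  robots: {
    enabled: import.meta.env.PROD // Only in production
  },
  llms: {
    enabled: true // Always generate
  },
  humans: {
    enabled: import.meta.env.DEV // Only in development
  }
})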
```
### Dynamic Content
Use functions for dynamic content:
```typescript
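import fs from 'node:fs';

// `loadOpenAPISpec` is a placeholder for your own loader. This example assumes
// it resolves to an object whose `paths` is an array of { url, method, summary }.
discovery({
  llms: {
    description: () => {
      const pkg = JSON.parse(fs.readFileSync('./package.json', 'utf-8'));
      return `${pkg.name} - ${pkg.description}`;
    },
    apiEndpoints: async () => {
      // Load from OpenAPI spec
      const spec = await loadOpenAPISpec();
      return spec.paths.map(path => ({
        path: path.url,
        method: path.method,
        description: path.summary
      }));
    }
  }
})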
```
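
Accepting either a value or a function that produces one (synchronously or asynchronously) is a common pattern. The sketch below shows how such an option could be resolved to a plain value before rendering. `MaybeDynamic` and `resolveDynamic` are hypothetical names, not exports of this package.

```typescript
// Hypothetical sketch: these helpers are not part of the integration's API.
type MaybeDynamic<T> = T | (() => T | Promise<T>);

// Resolve a static value, a sync factory, or an async factory to a value.
// Assumes T itself is never a function type.
async function resolveDynamic<T>(value: MaybeDynamic<T>): Promise<T> {
  return typeof value === 'function'
    ? (value as () => T | Promise<T>)()
    : value;
}

interface Endpoint {
  path: string;
  description: string;
}

async function demo(): Promise<[string, string, Endpoint[]]> {
  const fixed = await resolveDynamic<string>('Static description');
  const computed = await resolveDynamic<string>(() => `demo-site - ${'built on demand'}`);
  const endpoints = await resolveDynamic<Endpoint[]>(async () => [
    { path: '/api/products', description: 'List all products' },
  ]);
  return [fixed, computed, endpoints];
}

const demoResult = demo();
demoResult.then(([fixed, computed, endpoints]) => {
  console.log(fixed, computed, endpoints.length);
});
```

Because every form funnels through one `await`, the rendering code never needs to know which form the user chose.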
## Integration with Other Tools
### With @astrojs/sitemap
The discovery integration automatically includes `@astrojs/sitemap`, so you don't need to install it separately. Configuration is passed through:
```typescript
discovery({
sitemap: {
// All @astrojs/sitemap options work here
filter: (page) => !page.includes('/secret'),
changefreq: 'weekly'
}
})
```
### With Content Collections
Automatically extract information from content collections:
```typescript
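// `getCollection` is exported by 'astro:content'. Whether that module can be
// used here depends on where the integration evaluates this callback.
discovery({
  llms: {
    importantPages: async () => {
      const docs = await getCollection('docs');
      return docs.map(doc => ({
        name: doc.data.title,
        path: `/docs/${doc.slug}`,
        description: doc.data.description
      }));
    }
  }
})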
```
### With Environment Variables
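Use environment variables for values that differ between deployments. Everything written to humans.txt is publicly readable, so do not put secrets here: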
```typescript
discovery({
  humans: {
    team: [
      {
        name: 'Developer',
        contact: process.env.PUBLIC_CONTACT_EMAIL
      }
    ]
  }
})
```
## Output
The integration generates the following files:
### `/robots.txt`
```
User-agent: *
Allow: /
# Sitemaps
Sitemap: https://example.com/sitemap-index.xml
# LLM-specific resources
User-agent: Anthropic-AI
User-agent: Claude-Web
User-agent: GPTBot
Allow: /llms.txt
# Crawl delay
Crawl-delay: 1
```
### `/llms.txt`
```
# Project Name - Description
> Short tagline
## Site Information
- Name: Project Name
- Description: Full description
- URL: https://example.com
## For AI Assistants
Instructions for AI assistants...
## API Endpoints
- GET /api/endpoint - Description
```
### `/humans.txt`
```
/* TEAM */
Name: Developer Name
Role: Position
Contact: email@example.com
/* THANKS */
- Thank you note 1
- Thank you note 2
/* SITE */
Tech stack and details...
```
### `/sitemap-index.xml`
Standard XML sitemap with all your pages.
## Best Practices
### 1. **Set Your Site URL**
Always configure `site` in your Astro config:
```typescript
export default defineConfig({
site: 'https://example.com', // Required!
integrations: [discovery()]
});
```
### 2. **Keep humans.txt Updated**
Update your team information and tech stack regularly:
```typescript
discovery({
humans: {
site: {
lastUpdate: 'auto' // Automatically uses current date
}
}
})
```
### 3. **Be Specific with LLM Instructions**
Provide clear, actionable instructions for AI assistants:
```typescript
discovery({
llms: {
instructions: `
When helping users:
1. Always check API documentation first
2. Use the /api/search endpoint for queries
3. Format responses in markdown
4. Include relevant links
`
}
})
```
### 4. **Filter Private Pages**
Exclude admin, draft, and private pages:
```typescript
discovery({
  sitemap: {
    filter: (page) => {
      return !page.includes('/admin') &&
        !page.includes('/draft') &&
        !page.includes('/private');
    }
  },
  robots: {
    additionalAgents: [
      {
        userAgent: '*',
        disallow: ['/admin', '/draft', '/private']
      }
    ]
  }
})
```
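
Keeping the sitemap filter and the robots rules in sync is easier when both derive from one list. The sketch below is one way to do that. It also matches on whole path segments, since a plain `includes('/draft')` would also exclude a page such as `/blog/drafting-tips/`. `isPrivate` and `privatePrefixes` are hypothetical names, and the sketch assumes the filter receives absolute page URLs, as `@astrojs/sitemap` passes them.

```typescript
// Hypothetical sketch: helper names are illustrative.
const privatePrefixes = ['/admin', '/draft', '/private'];

// True when the URL's path is one of the prefixes or lies beneath one.
function isPrivate(pageURL: string, prefixes: string[]): boolean {
  const { pathname } = new URL(pageURL);
  return prefixes.some(
    (prefix) => pathname === prefix || pathname.startsWith(prefix + '/'),
  );
}

// Candidate value for `sitemap.filter`.
const sitemapFilter = (page: string): boolean => !isPrivate(page, privatePrefixes);

// Candidate entry for `robots.additionalAgents`.
const robotsRule = { userAgent: '*', disallow: [...privatePrefixes] };

console.log(sitemapFilter('https://example.com/admin/users'));
console.log(sitemapFilter('https://example.com/blog/drafting-tips/'));
console.log(robotsRule.disallow.join(', '));
```

Note that robots.txt `Disallow` rules match by plain prefix, so `Disallow: /draft` still covers `/drafting-tips`. Add a trailing slash to the robots entries if that is too broad.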
### 5. **Optimize Cache Headers**
Balance freshness with server load:
```typescript
discovery({
caching: {
robots: 3600, // 1 hour - changes rarely
llms: 1800, // 30 min - may update instructions
humans: 86400, // 24 hours - credits don't change often
sitemap: 3600 // 1 hour - content changes moderately
}
})
```
## Troubleshooting
### Files Not Generating
1. **Check your output mode:**
```typescript
export default defineConfig({
output: 'hybrid', // or 'server'
// ...
});
```
2. **Verify site URL is set:**
```typescript
export default defineConfig({
site: 'https://example.com' // Must be set!
});
```
3. **Check for conflicts:**
Remove any existing `/public/robots.txt` or similar static files.
### Wrong URLs in Files
Make sure your `site` config matches your production domain:
```typescript
export default defineConfig({
site: import.meta.env.PROD
? 'https://production.com'
: 'http://localhost:4321'
});
```
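
Another possible cause of malformed URLs is string concatenation in a custom template: if `site` ends with a slash, `${siteURL}/sitemap-index.xml` produces a double slash. The sketch below uses the standard `URL` constructor to avoid that. `absoluteURL` is a hypothetical helper, not part of this package.

```typescript
// Hypothetical helper: resolve a root-relative path against the site URL.
function absoluteURL(site: string, path: string): string {
  return new URL(path, site).href;
}

const withSlash = absoluteURL('https://example.com/', '/sitemap-index.xml');
const withoutSlash = absoluteURL('https://example.com', '/llms.txt');
const naive = `${'https://example.com/'}/sitemap-index.xml`;

console.log(withSlash);    // https://example.com/sitemap-index.xml
console.log(withoutSlash); // https://example.com/llms.txt
console.log(naive);        // https://example.com//sitemap-index.xml
```

A path starting with `/` resolves against the origin, so this approach drops any sub-path in `site` (for example `https://example.com/docs/`). Use a relative path without the leading slash if the files should live under that sub-path.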
### LLM Bots Not Respecting Instructions
- Ensure `/llms.txt` is accessible
- Check robots.txt allows LLM bots
- Verify content is properly formatted
### Sitemap Issues
Check `@astrojs/sitemap` documentation for detailed troubleshooting:
https://docs.astro.build/en/guides/integrations-guide/sitemap/
## Migration Guide
### From Manual Files
If you have existing static files in `/public`, remove them:
```bash
rm public/robots.txt
rm public/humans.txt
rm public/sitemap.xml
```
Then configure the integration with your existing content:
```typescript
discovery({
humans: {
team: [/* your existing team data */],
thanks: [/* your existing thanks */]
}
})
```
### From @astrojs/sitemap
Replace:
```typescript
import sitemap from '@astrojs/sitemap';
export default defineConfig({
integrations: [sitemap()]
});
```
With:
```typescript
import discovery from '@astrojs/discovery';
export default defineConfig({
integrations: [
discovery({
sitemap: {
// Your existing sitemap config
}
})
]
});
```
## Examples
### E-commerce Site
```typescript
discovery({
robots: {
crawlDelay: 2,
additionalAgents: [
{
userAgent: 'PriceBot',
disallow: ['/checkout', '/account']
}
]
},
llms: {
description: 'Online store for sustainable products',
keyFeatures: [
'Eco-friendly product catalog',
'Carbon footprint calculator',
'Sustainable shipping options'
],
apiEndpoints: [
{ path: '/api/products', description: 'Product catalog' },
{ path: '/api/calculate-carbon', description: 'Carbon calculator' }
]
},
sitemap: {
filter: (page) =>
!page.includes('/checkout') &&
!page.includes('/account')
}
})
```
### Documentation Site
```typescript
discovery({
llms: {
description: 'Technical documentation for our API',
instructions: `
When helping users:
1. Search documentation before answering
2. Provide code examples from /examples
3. Link to relevant API reference pages
4. Suggest similar solutions from FAQ
`,
importantPages: async () => {
const docs = await getCollection('docs');
return docs
.filter(doc => doc.data.featured)
.map(doc => ({
name: doc.data.title,
path: `/docs/${doc.slug}`,
description: doc.data.description
}));
}
},
humans: {
team: [
{
name: 'Documentation Team',
contact: 'docs@example.com'
}
],
thanks: [
'Our amazing community contributors',
'Technical writers worldwide'
]
}
})
```
### Personal Blog
```typescript
discovery({
llms: {
description: 'Personal blog about web development',
brandVoice: [
'Casual and friendly',
'Technical but accessible',
'Focus on practical examples'
]
},
humans: {
team: [
{
name: 'Jane Blogger',
role: 'Writer & Developer',
twitter: '@janeblogger',
github: 'jane-dev'
}
],
story: `
Started this blog to document my journey learning web development.
Went from tutorial hell to building real projects. Now sharing
what I've learned to help others on their journey.
`,
funFacts: [
'All posts written in markdown',
'Powered by coffee and curiosity',
'Deployed automatically on every commit'
]
}
})
```
## Performance
The integration is designed for minimal performance impact:
- **Build Time**: Adds ~100-200ms to build process
- **Runtime**: All files are statically generated at build time
- **Caching**: Smart HTTP cache headers reduce server load
- **Bundle Size**: Zero client-side JavaScript
## Contributing
We welcome contributions! See our [Contributing Guide](CONTRIBUTING.md).
## License
MIT
## Related
- [@astrojs/sitemap](https://docs.astro.build/en/guides/integrations-guide/sitemap/)
- [humanstxt.org](https://humanstxt.org/)
- [llms.txt spec](https://llmstxt.org/)
- [robots.txt spec](https://developers.google.com/search/docs/crawling-indexing/robots/intro)
## Credits
Built with inspiration from:
- The Astro community
- humanstxt.org initiative
- The llms.txt proposal (llmstxt.org)
- Web standards organizations
---
**Made with ❤️ by the Astro community**