Complete how-to guide documentation

Add comprehensive problem-oriented how-to guides following Diátaxis framework:
- Block specific bots from crawling the site
- Customize LLM instructions for AI assistants
- Add team members to humans.txt
- Filter sitemap pages
- Configure cache headers for discovery files
- Environment-specific configuration
- Integration with Astro content collections
- Custom templates for discovery files
- ActivityPub/Fediverse integration via WebFinger

Each guide provides:
- Clear prerequisites
- Step-by-step solutions
- Multiple approaches/variations
- Expected outcomes
- Alternative approaches
- Common issues and troubleshooting

Total: 9 guides, 6,677 words
Ryan Malloy 2025-11-08 23:32:22 -07:00
parent f8d4e10ffc
commit 74cffc2842
18 changed files with 4456 additions and 337 deletions

View File

@@ -1,31 +1,231 @@
---
title: Warrant Canaries
description: Understanding warrant canaries and transparency mechanisms
---
A warrant canary is a method for organizations to communicate the **absence** of secret government orders through regular public statements. The concept comes from the canaries coal miners once carried - their silence indicated danger.
## The Gag Order Problem
Certain legal instruments (National Security Letters in the US, similar mechanisms elsewhere) can compel organizations to:
1. Provide user data or access to systems
2. Never disclose that the request was made
This creates an information asymmetry - users can't know if their service provider has been compromised by government orders.
Warrant canaries address this by inverting the communication: instead of saying "we received an order" (which is forbidden), the organization regularly says "we have NOT received an order."
If the statement stops or changes, users can infer something happened.
## How It Works
A simple canary statement:
```
As of 2024-11-08, Example Corp has NOT received:
- National Security Letters
- FISA court orders
- Gag orders preventing disclosure
- Secret government requests for user data
- Requests to install surveillance capabilities
```
The organization publishes this monthly. Users monitor it. If November's update doesn't appear, or the statements change, users know to investigate.
The canary communicates through **absence** rather than disclosure.
## Legal Theory and Limitations
Warrant canaries operate in a legal gray area. The theory:
- Compelled speech (forcing you to lie) may violate free speech rights
- Choosing to remain silent is protected
- Government can prevent disclosure but cannot compel false statements
This hasn't been extensively tested in court. Canaries are no guarantee, but they provide a transparency mechanism where direct disclosure is prohibited.
Important limitations:
- **No legal precedent**: Courts haven't ruled definitively on validity
- **Jurisdictional differences**: What works in one country may not in another
- **Sophistication of threats**: Adversaries may compel continued updates
- **Interpretation challenges**: Absence could mean many things
Canaries are part of a transparency strategy, not a complete solution.
## What Goes in a Canary
The integration's default statements cover common government data requests:
**National Security Letters (NSLs)**: US administrative subpoenas for subscriber information
**FISA court orders**: Foreign Intelligence Surveillance Act orders
**Gag orders**: Any order preventing disclosure of requests
**Surveillance requests**: Secret requests for user data
**Backdoor requests**: Demands to install surveillance capabilities
You can customize these or add organization-specific concerns.
## Frequency and Expiration
Canaries must update regularly. The frequency determines trust:
**Daily**: Maximum transparency, high maintenance burden
**Weekly**: Good for high-security contexts
**Monthly**: Standard for most organizations
**Quarterly**: Minimum for credibility
**Yearly**: Too infrequent to be meaningful
The integration auto-calculates expiration based on frequency:
- Daily: 2 days
- Weekly: 10 days
- Monthly: 35 days
- Quarterly: 100 days
- Yearly: 380 days
These provide buffer time while ensuring staleness is obvious.
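If you want to reproduce this expiration math outside the integration (for a monitoring script, say), it's a small calculation. A minimal sketch using the buffer values listed above; the type and function names are illustrative:
```typescript
type CanaryFrequency = 'daily' | 'weekly' | 'monthly' | 'quarterly' | 'yearly';
// Buffer days per frequency, mirroring the defaults listed above
const EXPIRATION_BUFFER_DAYS: Record<CanaryFrequency, number> = {
  daily: 2,
  weekly: 10,
  monthly: 35,
  quarterly: 100,
  yearly: 380,
};
// Expiration timestamp for a canary published at `publishedAt`
function canaryExpiration(frequency: CanaryFrequency, publishedAt = new Date()): Date {
  const bufferMs = EXPIRATION_BUFFER_DAYS[frequency] * 24 * 60 * 60 * 1000;
  return new Date(publishedAt.getTime() + bufferMs);
}
console.log(canaryExpiration('monthly').toISOString()); // roughly 35 days out
```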
## The Personnel Statement
A sophisticated addition is the personnel statement:
```
Key Personnel Statement: All key personnel with access to
infrastructure remain free and under no duress.
```
This addresses scenarios where individuals are compelled to act under physical threat or coercion.
If personnel are compromised, the statement can be omitted without violating gag orders (since it's not disclosing a government request).
## Verification Mechanisms
Mere publication isn't enough - users need to verify authenticity:
### PGP Signatures
Sign canary.txt with your organization's PGP key:
```
Verification: https://example.com/canary.txt.asc
```
This proves the canary came from you and hasn't been tampered with.
### Blockchain Anchoring
Publish a hash of the canary to a blockchain:
```
Blockchain-Proof: ethereum:0x123...abc:0xdef...789
Blockchain-Timestamp: 2024-11-08T12:00:00Z
```
This creates an immutable, time-stamped record that the canary existed at a specific moment.
Anyone can verify the canary matches the blockchain hash, preventing retroactive alterations.
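Verification itself is straightforward. A sketch of the client side, assuming the anchored value is a SHA-256 hex digest of the raw canary text (a common convention; your proof format may differ):
```typescript
import { createHash } from 'node:crypto';
// Re-hash the published canary and compare against the digest recorded on-chain
async function verifyCanaryHash(canaryUrl: string, anchoredHexDigest: string): Promise<boolean> {
  const response = await fetch(canaryUrl);
  const text = await response.text();
  const digest = createHash('sha256').update(text).digest('hex');
  return digest === anchoredHexDigest.toLowerCase();
}
// Usage: pass the hex digest extracted from the Blockchain-Proof field
// await verifyCanaryHash('https://example.com/canary.txt', '<sha256-hex-from-proof>');
```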
### Previous Canary Links
Link to the previous canary:
```
Previous-Canary: https://example.com/canary-2024-10.txt
```
This creates a chain of trust. If an attacker compromises your site and tries to backdate canaries, the chain breaks.
## What Absence Means
If a canary stops updating or changes, it doesn't definitively mean government compromise. Possible reasons:
- Organization received a legal order (the intended signal)
- Technical failure prevented update
- Personnel forgot or were unable to update
- Organization shut down or changed practices
- Security incident prevented trusted publication
Users must interpret absence in context. Multiple verification methods help distinguish scenarios.
## Building Trust Over Time
A new canary has limited credibility. Trust builds through:
1. **Consistency**: Regular updates on schedule
2. **Verification**: Multiple cryptographic proofs
3. **Transparency**: Clear explanation of canary purpose and limitations
4. **History**: Years of reliable updates
5. **Community**: External monitoring and verification
Organizations should start canaries early, before they're needed, to build this trust.
## The Integration's Approach
This integration makes canaries accessible:
**Auto-expiration**: Calculated from frequency
**Default statements**: Cover common concerns
**Dynamic generation**: Functions can generate statements at build time
**Verification support**: Links to PGP signatures and blockchain proofs
**Update reminders**: Clear expiration in content
You configure once, the integration handles timing and formatting.
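A configuration sketch of what that might look like. The option names here (`canary`, `frequency`, `statements`, and so on) are assumptions for illustration - see the [Canary Reference](/reference/canary/) for the integration's actual API:
```typescript
// astro.config.mjs - illustrative option names, not the confirmed API
import { defineConfig } from 'astro/config';
import discovery from 'astro-discovery'; // assumed package name

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    discovery({
      canary: {
        frequency: 'monthly',                 // expiration auto-calculated (~35 days)
        statements: [
          'National Security Letters',
          'FISA court orders',
          'Gag orders preventing disclosure',
        ],
        personnelStatement: true,             // include the key personnel line
        verification: 'https://example.com/canary.txt.asc', // PGP signature URL
      },
    }),
  ],
});
```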
## When to Use Canaries
Canaries make sense for:
- Organizations handling sensitive user data
- Services likely to receive government data requests
- Privacy-focused companies
- Organizations operating in multiple jurisdictions
- Platforms used by activists, journalists, or vulnerable groups
They're less relevant for:
- Personal blogs without user data
- Purely informational sites
- Organizations that can't commit to regular updates
- Contexts where legal risks outweigh benefits
## Practical Considerations
**Update process**: Who's responsible for monthly updates?
**Backup procedures**: What if primary person is unavailable?
**Legal review**: Has counsel approved canary language and process?
**Monitoring**: Who watches for expiration?
**Communication**: How will users be notified of canary changes?
**Contingency**: What's the plan if you must stop publishing?
These operational questions matter as much as the canary itself.
## The Limitations
Canaries are not magic:
- They rely on legal interpretations that haven't been tested
- Sophisticated adversaries may compel continued updates
- Absence is ambiguous - could be many causes
- Only useful for orders that come with gag provisions
- Don't address technical compromises or insider threats
They're one tool in a transparency toolkit, not a complete solution.
## Real-World Examples
**Tech companies**: Some publish annual or quarterly canaries as part of transparency reports
**VPN providers**: Many use canaries to signal absence of data retention orders
**Privacy-focused services**: Canaries are common among services catering to privacy-conscious users
**Open source projects**: Some maintainers publish personal canaries about project compromise
The practice is growing as awareness of surveillance increases.
## Related Topics
- [Security.txt](/explanation/security-explained/) - Complementary transparency for security issues
- [Canary Reference](/reference/canary/) - Complete configuration options
- [Blockchain Verification](/how-to/canary-verification/) - Setting up cryptographic proofs

View File

@@ -3,29 +3,306 @@ title: Understanding humans.txt
description: The human side of discovery files
---
In a web dominated by machine-readable metadata, humans.txt is a delightful rebellion. It's a file written by humans, for humans, about the humans who built the website you're visiting.
## The Initiative
humans.txt emerged in 2008 from a simple observation: websites have extensive metadata for machines (robots.txt, sitemaps, structured data) but nothing to credit the people who built them.
The initiative proposed a standard format for human-readable credits, transforming the impersonal `/humans.txt` URL into a space for personality, gratitude, and transparency.
## What Makes It Human
Unlike other discovery files optimized for parsing, humans.txt embraces readability and creativity:
```
/* TEAM */
Developer: Jane Doe
Role: Full-stack wizardry
Location: Portland, OR
Favorite beverage: Cold brew coffee
/* THANKS */
- Stack Overflow (for everything)
- My rubber duck debugging companion
- Coffee, obviously
/* SITE */
Built with: Blood, sweat, and JavaScript
Fun fact: Deployed 47 times before launch
```
Notice the tone - casual, personal, fun. This isn't corporate boilerplate. It's a connection between builders and users.
## Why It Matters
On the surface, humans.txt seems frivolous. Who cares about credits buried in a text file?
But consider the impact:
**Recognition**: Developers, designers, and content creators work in the shadows. Humans.txt brings them into the light.
**Transparency**: Users curious about how your site works can see the tech stack and team behind it.
**Recruitment**: Talented developers browse humans.txt files. Listing your stack and philosophy attracts aligned talent.
**Culture**: A well-crafted humans.txt reveals company culture and values better than any about page.
**Humanity**: In an increasingly automated web, humans.txt reminds us that real people built this.
## The Standard Sections
The initiative proposes several standard sections:
### TEAM
Credits for everyone who contributed:
```
/* TEAM */
Name: Alice Developer
Role: Lead Developer
Contact: alice@example.com
Twitter: @alicedev
From: Brooklyn, NY
```
List everyone - developers, designers, writers, managers. Projects are team efforts.
### THANKS
Acknowledgments for inspiration, tools, and support:
```
/* THANKS */
- The Astro community
- Open-source maintainers everywhere
- Our beta testers
- Late night playlist creators
```
This section humanizes development. We build on the work of others.
### SITE
Technical details about the project:
```
/* SITE */
Last update: 2024-11-08
Language: English / Markdown
Doctype: HTML5
IDE: VS Code with Vim keybindings
Components: Astro, React, TypeScript
Standards: HTML5, CSS3, ES2022
```
This satisfies developer curiosity and provides context for technical decisions.
## Going Beyond the Standard
The beauty of humans.txt is flexibility. Many sites add custom sections:
**STORY**: The origin story of your project
**PHILOSOPHY**: Development principles and values
**FUN FACTS**: Easter eggs and behind-the-scenes details
**COLOPHON**: Typography and design choices
**ERRORS**: Humorous changelog of mistakes
These additions transform humans.txt from credits into narrative.
## The Integration's Approach
This integration generates humans.txt with opinionated defaults but encourages customization:
**Auto-dating**: `lastUpdate: 'auto'` uses current build date
**Flexible structure**: Add any custom sections you want
**Dynamic content**: Generate team lists from content collections
**Rich metadata**: Include social links, locations, and personal touches
The goal is making credits easy enough that you'll actually maintain them.
## Real-World Examples
**Humanstxt.org** (the initiative's site):
```
/* TEAM */
Creator: Abel Cabans
Site: http://abelcabans.com
Twitter: @abelcabans
Location: Sant Cugat del Vallès, Barcelona, Spain
/* THANKS */
- All the people who have contributed
- Spread the word!
/* SITE */
Last update: 2024/01/15
Standards: HTML5, CSS3
Components: Jekyll
Software: TextMate, Git
```
Clean, simple, effective.
**Creative Agency** (fictional but typical):
```
/* TEAM */
Creative Director: Max Wilson
Role: Visionary chaos coordinator
Contact: max@agency.com
Fun fact: Has never missed a deadline (barely)
Designer: Sarah Chen
Role: Pixel perfectionist
Location: San Francisco
Tool of choice: Figma, obviously
Developer: Jordan Lee
Role: Code whisperer
From: Remote (currently Bali)
Coffee order: Oat milk cortado
/* THANKS */
- Our clients for trusting us with their dreams
- The internet for cat videos during crunch time
- Figma for not crashing during presentations
/* STORY */
We started in a garage. Not for dramatic effect - office
space in SF is expensive. Three friends with complementary
skills and a shared belief that design should be delightful.
Five years later, we're still in that garage (now with
better chairs). But we've shipped products used by millions
and worked with brands we admired as kids.
We believe in:
- Craftsmanship over shortcuts
- Accessibility as a baseline, not a feature
- Open source as community participation
- Making the web more fun
/* SITE */
Built with: Astro, Svelte, TypeScript, TailwindCSS
Deployed on: Cloudflare Pages
Font: Inter (because we're not monsters)
Colors: Custom palette inspired by Bauhaus
Last rewrite: 2024 (the third time's the charm)
```
Notice the personality, the details, the humanity.
## The "Last Update" Decision
The `lastUpdate` field presents a philosophical question: should it reflect content updates or just site updates?
**Content perspective**: Change date when humans.txt content changes
**Site perspective**: Change date when any part of the site deploys
The integration defaults to site perspective (auto-update on every build). This ensures the date always reflects current site state, even if humans.txt content stays static.
But you can override with a specific date if you prefer manual control.
## Social Links and Contact Info
humans.txt is a great place for social links:
```
/* TEAM */
Name: Developer Name
Twitter: @username
GitHub: username
LinkedIn: /in/username
Mastodon: @username@instance.social
```
This provides discoverable contact information without cluttering your UI.
It's particularly valuable for open-source projects where contributors want to connect.
## The Gratitude Practice
Writing a good THANKS section is a gratitude practice. It forces you to acknowledge the shoulders you stand on:
- Which open-source projects made your work possible?
- Who provided feedback, testing, or encouragement?
- What tools, resources, or communities helped you learn?
- Which mistakes taught you valuable lessons?
This reflection benefits you as much as it credits others.
## Humor and Personality
humans.txt invites creativity. Some examples:
```
/* FUN FACTS */
- Entire site built during one caffeinated weekend
- 437 commits with message "fix typo"
- Originally designed in Figma, rebuilt in Sketch, launched from code
- The dog in our 404 page is the CEO's actual dog
- We've used Comic Sans exactly once (regrettably)
```
This personality differentiates you and creates connection.
## When Not to Use Humor
Professional context matters. A bank's humans.txt should be more restrained than a gaming startup's.
Match the tone to your audience and brand. Personality doesn't require jokes.
Simple sincerity works too:
```
/* TEAM */
We're a team of 12 developers across 6 countries
working to make financial services more accessible.
/* THANKS */
To the users who trust us with their financial data -
we take that responsibility seriously every day.
```
## Maintenance Considerations
humans.txt requires maintenance:
- Update when team members change
- Refresh tech stack as you adopt new tools
- Add new thanks as you use new resources
- Keep contact information current
The integration helps by supporting dynamic content:
```typescript
humans: {
team: await getCollection('team'), // Auto-sync with team content
site: {
lastUpdate: 'auto', // Auto-update on each build
techStack: Object.keys(deps) // Extract from package.json
}
}
```
This reduces manual maintenance burden.
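If you go the collection route, the `team` collection needs a schema. A minimal sketch - the field names are assumptions, so align them with whatever your humans.txt output actually reads:
```typescript
// src/content/config.ts
import { defineCollection, z } from 'astro:content';
// Fields mirror the TEAM entries shown earlier; adjust as needed
const team = defineCollection({
  type: 'data',
  schema: z.object({
    name: z.string(),
    role: z.string(),
    location: z.string().optional(),
    contact: z.string().email().optional(),
    twitter: z.string().optional(),
  }),
});
export const collections = { team };
```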
## The Browse Experience
Most users never see humans.txt. And that's okay.
The file serves several audiences:
**Curious users**: The 1% who look behind the curtain
**Developers**: Evaluating tech stack for integration or inspiration
**Recruiters**: Understanding team culture and capabilities
**You**: Reflection and gratitude practice during creation
It's not about traffic - it's about transparency and humanity.
## Related Topics
- [Content Collections Integration](/how-to/content-collections/) - Auto-generate team lists
- [Humans.txt Reference](/reference/humans/) - Complete configuration options
- [Examples](/examples/blog/) - See humans.txt in context

View File

@@ -1,31 +1,213 @@
---
title: Understanding llms.txt
description: How AI assistants discover and understand your website
---
llms.txt is the newest member of the discovery file family, emerging in response to a fundamental shift in how content is consumed on the web. While search engines index and retrieve, AI language models read, understand, and synthesize.
## Why AI Needs Different Guidance
Traditional search engines need to know **what exists and where**. They build indexes mapping keywords to pages.
AI assistants need to know **what things mean and how to use them**. They need context, instructions, and understanding of relationships between content.
Consider the difference:
**Search engine thinking**: "This page contains the word 'API' and is located at /docs/api"
**AI assistant thinking**: "This site offers a REST API at /api/endpoint that requires authentication. When users ask how to integrate, I should explain the auth flow and reference the examples at /docs/examples"
llms.txt bridges this gap by providing **semantic context** that goes beyond structural metadata.
## The Information Architecture
llms.txt follows a simple, human-readable structure:
```
# Site Description
> One-line tagline
## Site Information
Basic facts about the site
## For AI Assistants
Instructions and guidelines
## Important Pages
Key resources to know about
## API Endpoints
Available programmatic access
```
This structure mirrors how you'd brief a human assistant about your site. It's not rigid XML or JSON - it's conversational documentation optimized for language model consumption.
## What to Include
The most effective llms.txt files provide:
**Description**: Not just what your site is, but **why it exists**. "E-commerce platform" is weak. "E-commerce platform focused on sustainable products with carbon footprint tracking" gives context.
**Key Features**: The 3-5 things that make your site unique or particularly useful. These help AI assistants understand what problems you solve.
**Important Pages**: Not a sitemap (that's what sitemap.xml is for), but the **handful of pages** that provide disproportionate value. Think: getting started guide, API docs, pricing.
**Instructions**: Specific guidance on how AI should represent your content. This is where you establish voice, correct common misconceptions, and provide task-specific guidance.
**API Endpoints**: If you have programmatic access, describe it. AI assistants can help users integrate with your service if they know endpoints exist.
## The Instruction Set Pattern
The most powerful part of llms.txt is the instructions section. This is where you teach AI assistants how to be helpful about your site.
Effective instructions are:
**Specific**: "When users ask about authentication, explain we use OAuth2 and point them to /docs/auth"
**Actionable**: "Check /api/status before suggesting users try the API"
**Context-aware**: "Remember that we're focused on accessibility - always mention a11y features"
**Preventive**: "We don't offer feature X - suggest alternatives Y or Z instead"
Think of it as training an employee who'll be answering questions about your product. What would you want them to know?
## Brand Voice and Tone
AI assistants can adapt their responses to match your brand if you provide guidance:
```
## Brand Voice
- Professional but approachable
- Technical accuracy over marketing speak
- Always mention open-source nature
- Emphasize privacy and user control
```
This helps ensure AI representations of your site feel consistent with your actual brand identity.
## Tech Stack Transparency
Including your tech stack serves multiple purposes:
1. **Helps AI assistants answer developer questions** ("Can I use this with React?" - "Yes, it's built on React")
2. **Aids troubleshooting** (knowing the framework helps diagnose integration issues)
3. **Attracts contributors** (developers interested in your stack are more likely to contribute)
Be specific but not exhaustive. "Built with Astro, TypeScript, and Tailwind" is better than listing every npm package.
## API Documentation
If your site offers APIs, llms.txt should describe them at a high level:
```
## API Endpoints
- GET /api/products - List all products
Authentication: API key required
Returns: JSON array of product objects
- POST /api/calculate-carbon - Calculate carbon footprint
Authentication: Not required
Accepts: JSON with cart data
Returns: Carbon footprint estimate
```
This isn't meant to replace full API documentation - it's a quick reference so AI assistants know what's possible.
## The Relationship with robots.txt
robots.txt and llms.txt work together:
**robots.txt** says: "AI bots, you can access these paths"
**llms.txt** says: "Here's how to understand what you find there"
The integration coordinates them automatically:
1. robots.txt includes rules for LLM user-agents
2. Those rules reference llms.txt
3. LLM bots follow robots.txt to respect boundaries
4. Then read llms.txt for guidance on content interpretation
## Dynamic vs. Static Content
llms.txt can be either static (same content always) or dynamic (generated at build time):
**Static**: Your site description and brand voice rarely change
**Dynamic**: Current API endpoints, team members, or feature status might update frequently
The integration supports both approaches. You can provide static strings or functions that generate content at build time.
This is particularly useful for:
- Extracting API endpoints from OpenAPI specs
- Listing important pages from content collections
- Keeping tech stack synchronized with package.json
- Generating context from current deployment metadata
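A sketch of the dynamic approach, following the same pattern as the humans.txt example elsewhere in these docs. The `llms` option names are illustrative assumptions, and `getCollection` and `deps` are assumed to be available in your config the same way that example assumes:
```typescript
// Passed to the same discovery() call - illustrative option names
llms: {
  description: 'Sustainable e-commerce platform with carbon footprint tracking', // static
  importantPages: async () => {
    // Dynamic: resolved at build time from a content collection
    const guides = await getCollection('docs');
    return guides.filter((g) => g.data.featured).map((g) => `/docs/${g.slug}/`);
  },
  techStack: Object.keys(deps), // e.g. extracted from package.json dependencies
}
```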
## What Not to Include
llms.txt should be concise and focused. Avoid:
**Comprehensive documentation**: Link to it, don't duplicate it
**Entire sitemaps**: That's what sitemap.xml is for
**Legal boilerplate**: Keep it in your terms of service
**Overly specific instructions**: Trust AI to handle common cases
**Marketing copy**: Be informative, not promotional
Think of llms.txt as **strategic context**, not exhaustive documentation.
## Measuring Impact
Unlike traditional SEO, llms.txt impact is harder to measure directly. You won't see "llms.txt traffic" in analytics.
Instead, look for:
- AI assistants correctly representing your product
- Reduction in mischaracterizations or outdated information
- Appropriate use of your APIs by AI-assisted developers
- Consistency in how different AI systems describe your site
The goal is **accurate representation**, not traffic maximization.
## Privacy and Data Concerns
A common concern: "Doesn't llms.txt help AI companies train on my content?"
Important points:
1. **AI training happens regardless** of llms.txt - they crawl public content anyway
2. **llms.txt doesn't grant permission** - it provides context for content they already access
3. **robots.txt controls access** - if you don't want AI crawlers, use robots.txt to block them
4. **llms.txt helps AI represent you accurately** - better context = better representation
Think of it this way: if someone's going to talk about you, would you rather they have accurate information or guess?
## The Evolution of AI Context
llms.txt is a living standard, evolving as AI capabilities grow:
**Current**: Basic site description and instructions
**Near future**: Structured data about capabilities, limitations, and relationships
**Long term**: Semantic graphs of site knowledge and interconnections
By adopting llms.txt now, you're positioning your site to benefit as these capabilities mature.
## Real-World Patterns
**Documentation sites**: Emphasize how to search docs, common pitfalls, and where to find examples
**E-commerce**: Describe product categories, search capabilities, and checkout process
**SaaS products**: Explain core features, authentication, and API availability
**Blogs**: Highlight author expertise, main topics, and content philosophy
The pattern that works best depends on how people use AI to interact with your type of content.
## Related Topics
- [AI Integration Strategy](/explanation/ai-integration/) - Broader AI considerations
- [Robots.txt Coordination](/explanation/robots-explained/) - How robots.txt and llms.txt work together
- [LLMs.txt Reference](/reference/llms/) - Complete configuration options

View File

@@ -1,31 +1,182 @@
---
title: How robots.txt Works
description: Understanding robots.txt and web crawler communication
---
Robots.txt is the oldest and most fundamental discovery file on the web. Since 1994, it has served as the **polite agreement** between website owners and automated crawlers about what content can be accessed and how.
## The Gentleman's Agreement
robots.txt is not a security mechanism - it's a social contract. It tells crawlers "please don't go here" rather than "you cannot go here." Any crawler can ignore it, and malicious ones often do.
This might seem like a weakness, but it's actually a strength. The file works because the overwhelming majority of automated traffic comes from legitimate crawlers (search engines, monitoring tools, archive services) that want to be good citizens of the web.
Think of it like a "No Trespassing" sign on private property. It won't stop determined intruders, but it clearly communicates boundaries to honest visitors and provides legal/ethical grounds for addressing violations.
## What robots.txt Solves
Before robots.txt, early search engines would crawl websites aggressively, sometimes overwhelming servers or wasting bandwidth on administrative pages. Website owners had no standard way to communicate crawling preferences.
robots.txt provides three critical capabilities:
**1. Access Control**: Specify which paths crawlers can and cannot visit
**2. Resource Management**: Set crawl delays to prevent server overload
**3. Signposting**: Point crawlers to important resources like sitemaps
## The User-Agent Model
robots.txt uses a "user-agent" model where rules target specific bots:
```
User-agent: *
Disallow: /admin/
User-agent: GoogleBot
Allow: /api/
```
This allows fine-grained control. You might allow Google to index your API documentation while blocking other crawlers. Or permit archive services to access historical content while disallowing marketing bots.
The `*` wildcard matches all user-agents, providing default rules. Specific user-agents override these defaults for their particular bot.
## The LLM Bot Challenge
The emergence of AI language models created a new category of web consumers. Unlike traditional search engines that index for retrieval, LLMs process content for training data and context.
This raises different concerns:
- Training data usage and attribution
- Content representation accuracy
- Server load from context gathering
- Different resource needs (full pages vs. search snippets)
The integration addresses this by providing dedicated rules for LLM bots (GPTBot, Claude-Web, Anthropic-AI, etc.) while pointing them to llms.txt for additional context.
## Allow vs. Disallow
A common point of confusion is the relationship between Allow and Disallow directives.
**Disallow**: Explicitly forbids access to a path
**Allow**: Creates exceptions to Disallow rules
Consider this example:
```
User-agent: *
Disallow: /admin/
Allow: /admin/public/
```
This says "don't crawl /admin/ except for /admin/public/ which is allowed." The Allow creates a specific exception to the broader Disallow.
Without any rules, everything is implicitly allowed. You don't need `Allow: /` - that's the default state.
## Path Matching
Path patterns in robots.txt support wildcards and prefix matching:
- `/api/` matches `/api/` and everything under it
- `/api/private` matches that specific path
- `*.pdf` matches any URL containing `.pdf`
- `/page$` matches `/page` but not `/page/subpage`
The most specific matching rule wins. If both `/api/` and `/api/public/` have rules for the same user-agent, the longer path takes precedence.
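These rules are simple enough to approximate in a few lines. A sketch (not a full parser) that converts a robots.txt path pattern into a regular expression and applies the longest-match rule:
```typescript
// Convert a robots.txt path pattern to a RegExp: '*' is a wildcard, a trailing '$' anchors the end
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\\\$$/, '$')                 // restore a trailing '$' as an end anchor
    .replace(/\*/g, '.*');                 // '*' matches any sequence of characters
  return new RegExp('^' + escaped);
}
type Rule = { type: 'allow' | 'disallow'; pattern: string };
// Among matching rules, the most specific (longest pattern) wins; no match means allowed
function isAllowed(path: string, rules: Rule[]): boolean {
  const matching = rules.filter((r) => patternToRegExp(r.pattern).test(path));
  if (matching.length === 0) return true;
  matching.sort((a, b) => b.pattern.length - a.pattern.length);
  return matching[0].type === 'allow';
}
console.log(isAllowed('/admin/public/page', [
  { type: 'disallow', pattern: '/admin/' },
  { type: 'allow', pattern: '/admin/public/' },
])); // true - the longer Allow rule takes precedence
```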
## Crawl-Delay: The Double-Edged Sword
Crawl-delay tells bots to wait between requests:
```
Crawl-delay: 2
```
This means "wait 2 seconds between page requests." It's useful for:
- Protecting servers with limited resources
- Preventing rate limiting from triggering
- Managing bandwidth costs
But there's a trade-off: slower crawling means it takes longer for your content to be indexed. Set it too high and you might delay important updates from appearing in search results.
The integration defaults to 1 second - a balanced compromise between politeness and indexing speed.
## Sitemap Declaration
One of robots.txt's most valuable features is sitemap declaration:
```
Sitemap: https://example.com/sitemap-index.xml
```
This tells crawlers "here's a comprehensive list of all my pages." It's more efficient than discovering pages through link following and ensures crawlers know about pages that might not be linked from elsewhere.
The integration automatically adds your sitemap reference, keeping it synchronized with your Astro site URL.
## Common Mistakes
**Blocking CSS/JS**: Some sites block `/assets/` thinking it saves bandwidth. This prevents search engines from rendering your pages correctly, harming SEO.
**Disallowing Everything**: `Disallow: /` blocks all crawlers completely. This is rarely what you want - even internal tools need access.
**Forgetting About Dynamic Content**: If your search or API routes generate content dynamically, consider whether crawlers should access them.
**Security Through Obscurity**: Don't rely on robots.txt to hide sensitive content. Use proper authentication instead.
## Why Not Just Use Authentication?
You might wonder why we need robots.txt if we can protect content with authentication.
The answer is that most website content should be publicly accessible - that's the point. You want search engines to index your blog, documentation, and product pages.
robots.txt lets you have **public content that crawlers respect** without requiring authentication. It's about communicating intent, not enforcing access control.
## The Integration's Approach
This integration generates robots.txt with opinionated defaults:
- Allow all bots by default (the web works best when discoverable)
- Include LLM-specific bots with llms.txt guidance
- Reference your sitemap automatically
- Set a reasonable 1-second crawl delay
- Provide easy overrides for your specific needs
You can customize any aspect, but the defaults represent best practices for most sites.
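A sketch of overriding those defaults. As before, treat the option names as assumptions and check the [Robots.txt Reference](/reference/robots/) for the real configuration shape:
```typescript
// Passed to discovery() - illustrative option names
robots: {
  crawlDelay: 2,                 // override the default 1-second delay
  rules: [
    { userAgent: '*', disallow: ['/admin/', '/drafts/'] },
    { userAgent: 'GPTBot', allow: ['/'] }, // LLM bots still get llms.txt guidance
  ],
  sitemap: true,                 // keep the automatic sitemap reference
}
```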
## Looking at Real-World Examples
**Wikipedia** (`robots.txt`):
```
User-agent: *
Disallow: /wiki/Special:
Crawl-delay: 1
Sitemap: https://en.wikipedia.org/sitemap.xml
```
Simple and effective. Block special admin pages, allow everything else.
**GitHub** (simplified):
```
User-agent: *
Disallow: /search/
Disallow: */pull/
Allow: */pull$/
```
Notice how they block pull request search but allow individual pull request pages. This prevents crawler loops while keeping content accessible.
## Verification and Testing
After deploying, verify your robots.txt:
1. Visit `yoursite.com/robots.txt` directly
2. Use Google Search Console's robots.txt tester
3. Check specific user-agent rules with online validators
4. Monitor crawler behavior in server logs
The file is cached aggressively by crawlers, so changes may take time to propagate.
## Related Topics
- [SEO Impact](/explanation/seo/) - How robots.txt affects search rankings
- [LLMs.txt Integration](/explanation/llms-explained/) - Connecting bot control with AI guidance
- [Robots.txt Reference](/reference/robots/) - Complete configuration options

View File

@@ -1,31 +1,277 @@
---
title: Security.txt Standard (RFC 9116)
description: Understanding RFC 9116 and responsible vulnerability disclosure
---
security.txt, standardized as RFC 9116 in 2022, solves a deceptively simple problem: when a security researcher finds a vulnerability in your website, how do they tell you about it?
## The Responsible Disclosure Problem
Before security.txt, researchers faced a frustrating journey:
1. Find vulnerability in example.com
2. Search for security contact information
3. Check footer, about page, contact page
4. Try info@, security@, admin@ email addresses
5. Hope someone reads it and knows what to do with it
6. Wait weeks for response (or get none)
7. Consider public disclosure out of frustration
This process was inefficient for researchers and dangerous for organizations. Vulnerabilities went unreported or were disclosed publicly before fixes could be deployed.
## The RFC 9116 Solution
RFC 9116 standardizes a machine-readable file at `/.well-known/security.txt` containing:
- **Contact**: How to reach your security team (required)
- **Expires**: When this information becomes stale (required)
- **Canonical**: The authoritative location of this file
- **Encryption**: PGP keys for encrypted communication
- **Acknowledgments**: Hall of fame for researchers
- **Policy**: Your disclosure policy URL
- **Preferred-Languages**: Languages you can handle reports in
- **Hiring**: Security job opportunities
This provides a **standardized, discoverable, machine-readable** security contact mechanism.
## Why .well-known?
The `/.well-known/` directory is an RFC 8615 standard for site-wide metadata. It's where clients expect to find standard configuration files.
By placing security.txt in `/.well-known/security.txt`, the RFC ensures:
- **Consistent location**: No guessing where to find it
- **Standard compliance**: Follows web architecture patterns
- **Tool support**: Security scanners can automatically check for it
The integration generates security.txt at the correct location automatically.
## The Required Fields
RFC 9116 mandates two fields:
### Contact
At least one contact method (email or URL):
```
Contact: mailto:security@example.com
Contact: https://example.com/security-contact
Contact: tel:+1-555-0100
```
Multiple contacts provide redundancy. If one channel fails, researchers have alternatives.
Email addresses automatically get `mailto:` prefixes. URLs should point to security contact forms or issue trackers.
### Expires
An ISO 8601 timestamp indicating when to stop trusting this file:
```
Expires: 2025-12-31T23:59:59Z
```
This is critical - it prevents researchers from reporting to stale contacts that are no longer monitored.
The integration defaults to `expires: 'auto'`, setting expiration to one year from build time. This ensures the field updates on every deployment.
## Optional but Valuable Fields
### Encryption
URLs to PGP public keys for encrypted vulnerability reports:
```
Encryption: https://example.com/pgp-key.txt
Encryption: openpgp4fpr:5F2DE18D3AFE0FD7A1F2F5A3E4562BB79E3B2E80
```
This enables researchers to send sensitive details securely, preventing disclosure to attackers monitoring email.
### Acknowledgments
URL to your security researcher hall of fame:
```
Acknowledgments: https://example.com/security/hall-of-fame
```
Public recognition motivates responsible disclosure. Researchers appreciate being credited for their work.
### Policy
URL to your vulnerability disclosure policy:
```
Policy: https://example.com/security/disclosure-policy
```
This clarifies expectations: response timelines, safe harbor provisions, bug bounty details, and disclosure coordination.
### Preferred-Languages
Languages your security team can handle:
```
Preferred-Languages: en, es, fr
```
This helps international researchers communicate effectively. Use ISO 639-1 language codes.
### Hiring
URL to security job openings:
```
Hiring: https://example.com/careers/security
```
Talented researchers who find vulnerabilities might be hiring prospects. This field provides a connection point.
## The Canonical Field
The Canonical field specifies the authoritative location:
```
Canonical: https://example.com/.well-known/security.txt
```
This matters for:
- **Verification**: Ensures you're reading the correct version
- **Mirrors**: Multiple domains can reference the same canonical file
- **Historical context**: Archives know which version was authoritative
The integration sets this automatically based on your site URL.
## Why Expiration Matters
The Expires field isn't bureaucracy - it's safety.
Consider a scenario:
1. Company sets up security.txt pointing to security@company.com
2. Security team disbands, email is decommissioned
3. Attacker registers the lapsed domain and recreates security@company.com
4. Researcher reports vulnerability to attacker's email
5. Attacker has vulnerability details before the company does
Expiration prevents this. If security.txt is expired, researchers know not to trust it and must find alternative contact methods.
Best practice: Set expiration to 1 year maximum. The integration's `'auto'` option handles this.
## Security.txt in Practice
A minimal production security.txt:
```
Canonical: https://example.com/.well-known/security.txt
Contact: mailto:security@example.com
Expires: 2025-11-08T00:00:00.000Z
```
A comprehensive implementation:
```
Canonical: https://example.com/.well-known/security.txt
Contact: mailto:security@example.com
Contact: https://example.com/security-report
Expires: 2025-11-08T00:00:00.000Z
Encryption: https://example.com/pgp-key.asc
Acknowledgments: https://example.com/security/researchers
Preferred-Languages: en, de, ja
Policy: https://example.com/security/disclosure
Hiring: https://example.com/careers/security-engineer
```
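A configuration sketch that would produce a file like the comprehensive example above. The field names follow RFC 9116, but the exact option shape is an assumption - the [Security.txt Reference](/reference/security/) has the authoritative options:
```typescript
// Passed to discovery() - illustrative option names
security: {
  contact: ['security@example.com', 'https://example.com/security-report'], // mailto: added automatically
  expires: 'auto',                                                          // one year from build time
  encryption: 'https://example.com/pgp-key.asc',
  acknowledgments: 'https://example.com/security/researchers',
  preferredLanguages: ['en', 'de', 'ja'],
  policy: 'https://example.com/security/disclosure',
  hiring: 'https://example.com/careers/security-engineer',
}
```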
## Common Mistakes
**Using relative URLs**: All URLs must be absolute (`https://...`)
**Missing mailto: prefix**: Email addresses need `mailto:` - the integration adds this automatically
**Far-future expiration**: Don't set expiration 10 years out. Keep it to 1 year maximum.
**No monitoring**: Set up alerts when security.txt approaches expiration
**Stale contacts**: Verify listed contacts still work
## Building a Disclosure Program
security.txt is the entry point to vulnerability disclosure, but you need supporting infrastructure:
**Monitoring**: Watch the security inbox religiously
**Triage process**: Quick initial response (even if just "we're investigating")
**Fix timeline**: Clear expectations about patch development
**Disclosure coordination**: Work with researcher on public disclosure timing
**Recognition**: Credit researchers in release notes and acknowledgments page
The integration makes the entry point easy. The program around it requires organizational commitment.
## Security Through Transparency
Some organizations hesitate to publish security.txt, fearing it invites attacks.
The reality: security researchers are already looking. security.txt helps them help you.
Without it:
- Vulnerabilities go unreported
- Researchers waste time finding contacts
- Frustration leads to premature public disclosure
- You look unprofessional to security community
With it:
- Clear channel for responsible disclosure
- Faster vulnerability reports
- Better researcher relationships
- Professional security posture
## Verification and Monitoring
After deploying security.txt:
1. Verify it's accessible at `/.well-known/security.txt`
2. Check field formatting with RFC 9116 validators
3. Test contact methods work
4. Set up monitoring for expiration date
5. Create calendar reminder to refresh before expiration
Many organizations set up automated checks that alert if security.txt will expire within 30 days.
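A sketch of such a check, suitable for a scheduled CI job - it uses only `fetch` and date arithmetic, so wire the alerting into whatever tooling you already run:
```typescript
// Warn if a site's security.txt expires within the next `warnDays` days
async function checkSecurityTxt(origin: string, warnDays = 30): Promise<void> {
  const res = await fetch(new URL('/.well-known/security.txt', origin));
  if (!res.ok) throw new Error(`security.txt not found (HTTP ${res.status})`);
  const text = await res.text();
  const match = text.match(/^Expires:\s*(.+)$/im);
  if (!match) throw new Error('No Expires field - file is not RFC 9116 compliant');
  const daysLeft = (new Date(match[1].trim()).getTime() - Date.now()) / 86_400_000;
  if (daysLeft < warnDays) {
    console.warn(`security.txt expires in ${Math.floor(daysLeft)} days - refresh it`);
  } else {
    console.log(`security.txt valid for ${Math.floor(daysLeft)} more days`);
  }
}
checkSecurityTxt('https://example.com');
```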
## Integration with Bug Bounty Programs
If you run a bug bounty program, reference it in your policy:
```
Policy: https://example.com/bug-bounty
```
This connects researchers to your incentive program immediately.
security.txt and bug bounties work together - the file provides discovery, the program provides incentive structure.
## Legal Considerations
security.txt should coordinate with your legal team's disclosure policy.
Consider including:
- Safe harbor provisions (no legal action against good-faith researchers)
- Scope definition (what systems are in/out of scope)
- Rules of engagement (don't exfiltrate data, etc.)
- Disclosure timeline expectations
These protect both your organization and researchers.
## Related Topics
- [Canary.txt Explained](/explanation/canary-explained/) - Complementary transparency mechanism
- [Security.txt Reference](/reference/security/) - Complete configuration options
- [Security Best Practices](/how-to/environment-config/) - Securing your deployment

View File

@@ -1,31 +1,327 @@
---
title: SEO & Discoverability
description: How discovery files improve search engine optimization
---
Discovery files and SEO have a symbiotic relationship. While some files (like humans.txt) don't directly impact rankings, others (robots.txt, sitemaps) are foundational to how search engines understand and index your site.
## Robots.txt: The SEO Foundation
robots.txt is one of the first files search engines request. It determines:
- Which pages can be crawled and indexed
- How aggressively to crawl (via crawl-delay)
- Where to find your sitemap
- Special instructions for specific bots
### Crawl Budget Optimization
Search engines allocate limited resources to each site - your "crawl budget." robots.txt helps you spend it wisely:
**Block low-value pages**: Admin sections, search result pages, and duplicate content waste crawl budget
**Allow high-value content**: Ensure important pages are accessible
**Set appropriate crawl-delay**: Balance thorough indexing against server load
Example SEO-optimized robots.txt:
```
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /search?
Disallow: /*?sort=*
Disallow: /api/
Crawl-delay: 1
Sitemap: https://example.com/sitemap-index.xml
```
This blocks non-content pages while allowing crawlers to efficiently index your actual content.
### The CSS/JS Trap
A common SEO mistake:
```
# DON'T DO THIS
Disallow: /assets/
Disallow: /*.css
Disallow: /*.js
```
This prevents search engines from fully rendering your pages. Modern SEO requires JavaScript execution for SPAs and interactive content.
The integration doesn't block assets by default - this is intentional and SEO-optimal.
### Sitemap Declaration
The `Sitemap:` directive in robots.txt is critical for SEO. It tells search engines:
- All your pages exist (even if not linked)
- When pages were last modified
- Relative priority of pages
- Alternative language versions
This dramatically improves indexing coverage and freshness.
## Sitemaps: The SEO Roadmap
Sitemaps serve multiple SEO functions:
### Discoverability
Pages not linked from your navigation can still be indexed. This matters for:
- Deep content structures
- Recently published pages not yet linked
- Orphaned pages with valuable content
- Alternative language versions
### Update Frequency
The `<lastmod>` element signals content freshness:
```xml
<url>
<loc>https://example.com/article</loc>
<lastmod>2024-11-08T12:00:00Z</lastmod>
<changefreq>weekly</changefreq>
</url>
```
Search engines prioritize recently updated content. Fresh `lastmod` dates encourage re-crawling.
### Priority Hints
The `<priority>` element suggests relative importance:
```xml
<url>
<loc>https://example.com/important-page</loc>
<priority>0.9</priority>
</url>
<url>
<loc>https://example.com/minor-page</loc>
<priority>0.3</priority>
</url>
```
This is a hint, not a directive. Search engines use it along with other signals.
### International SEO
For multilingual sites, sitemaps declare language alternatives:
```xml
<url>
<loc>https://example.com/page</loc>
<xhtml:link rel="alternate" hreflang="es"
href="https://example.com/es/page"/>
<xhtml:link rel="alternate" hreflang="fr"
href="https://example.com/fr/page"/>
</url>
```
This prevents duplicate content penalties while ensuring all language versions are indexed.
## LLMs.txt: The AI SEO Frontier
Traditional SEO optimizes for search retrieval. llms.txt optimizes for AI representation - the emerging frontier of discoverability.
### AI-Generated Summaries
Search engines increasingly show AI-generated answer boxes. llms.txt helps ensure these summaries:
- Accurately represent your content
- Use your preferred terminology and brand voice
- Highlight your key differentiators
- Link to appropriate pages
### Voice Search Optimization
Voice assistants rely on AI understanding. llms.txt provides:
- Natural language context for your content
- Clarification of ambiguous terms
- Guidance on how to answer user questions
- References to authoritative pages
This improves your chances of being the source for voice search answers.
### Content Attribution
When AI systems reference your content, llms.txt helps ensure:
- Proper context is maintained
- Your brand is correctly associated
- Key features aren't misrepresented
- Updates propagate to AI models
Think of it as structured data for AI agents.
## Humans.txt: The Indirect SEO Value
humans.txt doesn't directly impact rankings, but it supports SEO indirectly:
### Technical Transparency
Developers evaluating integration with your platform check humans.txt for tech stack info. This can lead to:
- Backlinks from integration tutorials
- Technical blog posts mentioning your stack
- Developer community discussions
All of which generate valuable backlinks and traffic.
### Brand Signals
A well-crafted humans.txt signals:
- Active development and maintenance
- Professional operations
- Transparent communication
- Company culture
These contribute to overall site authority and trustworthiness.
## Security.txt: Trust Signals
Security.txt demonstrates professionalism and security-consciousness. While not a ranking factor, it:
- Builds trust with security-conscious users
- Prevents security incidents that could damage SEO (hacked site penalties)
- Shows organizational maturity
- Enables faster vulnerability fixes (preserving site integrity)
Search engines penalize compromised sites heavily. security.txt helps prevent those penalties.
## Integration SEO Benefits
This integration provides several SEO advantages:
### Consistency
All discovery files reference the same site URL from your Astro config. This prevents:
- Mixed http/https signals
- www vs. non-www confusion
- Subdomain inconsistencies
Consistency is an underrated SEO factor.
### Freshness
Auto-generated timestamps keep discovery files fresh:
- Sitemaps show current lastmod dates
- security.txt expiration updates with each build
- canary.txt timestamps reflect current build
Fresh content signals active maintenance.
### Correctness
The integration handles RFC compliance automatically:
- security.txt follows RFC 9116 exactly
- robots.txt uses correct syntax
- Sitemaps follow XML schema
- WebFinger implements RFC 7033
Malformed discovery files can harm SEO. The integration prevents errors.
## Monitoring SEO Impact
Track discovery file effectiveness:
**Google Search Console**:
- Sitemap coverage reports
- Crawl statistics
- Indexing status
- Mobile usability
**Crawl behavior analysis**:
- Server logs showing crawler patterns
- Crawl-delay effectiveness
- Blocked vs. allowed URL ratio
- Time to index new content
**AI representation monitoring**:
- How AI assistants describe your site
- Accuracy of information
- Attribution and links
- Brand voice consistency
## Common SEO Mistakes
### Over-blocking
Blocking too much harms SEO:
```
# Too restrictive
Disallow: /blog/?
Disallow: /products/?
```
This might block legitimate content URLs. Be specific:
```
# Better
Disallow: /blog?*
Disallow: /products?sort=*
```
### Sitemap bloat
Including every URL hurts more than helps:
- Don't include parameter variations
- Skip pagination (keep to representative pages)
- Exclude search result pages
- Filter out duplicate content
Quality over quantity.
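A sketch of that kind of filtering, modeled on the `filter` callback pattern used by @astrojs/sitemap - whether this integration exposes the same hook is an assumption, so see [Sitemap Optimization](/how-to/filter-sitemap/) for the supported approach:
```typescript
// Exclude parameter variations, search results, and deep pagination from the sitemap
sitemap: {
  filter: (page: string) =>
    !page.includes('?') &&           // parameter variations
    !page.includes('/search/') &&    // search result pages
    !/\/page\/\d+\/$/.test(page),    // pagination beyond the first page
}
```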
### Ignoring crawl errors
Monitor Search Console for:
- 404s in sitemap
- Blocked resources search engines need
- Redirect chains
- Server errors
Fix these promptly - they impact ranking.
### Stale sitemaps
Ensure sitemaps update with your content:
- New pages appear quickly
- Deleted pages are removed
- lastmod timestamps are accurate
- Priority reflects current importance
The integration's automatic generation ensures freshness.
## Future SEO Trends
Discovery files will evolve with search:
**AI-first indexing**: Search engines will increasingly rely on structured context (llms.txt) rather than pure crawling
**Federated discovery**: WebFinger and similar protocols may influence how distributed content is discovered and indexed
**Transparency signals**: Files like security.txt and canary.txt may become trust signals in ranking algorithms
**Structured data expansion**: Discovery files complement schema.org markup as structured communication channels
By implementing comprehensive discovery now, you're positioned for these trends.
## Related Topics
- [Robots.txt Configuration](/reference/robots/) - SEO-optimized robot settings
- [Sitemap Optimization](/how-to/filter-sitemap/) - Filtering for better SEO
- [AI Integration Strategy](/explanation/ai-integration/) - Preparing for AI-first search

View File

@@ -1,31 +1,309 @@
---
title: WebFinger Protocol (RFC 7033)
description: Understanding WebFinger and federated resource discovery
---
WebFinger (RFC 7033) solves a fundamental problem of the decentralized web: how do you discover information about a resource (person, service, device) when you only have an identifier?
## The Discovery Challenge
On centralized platforms, discovery is simple. Twitter knows about @username because it's all in one database. But in decentralized systems (email, federated social networks, distributed identity), there's no central registry.
WebFinger provides a standardized way to ask: "Given this identifier (email, account name, URL), what can you tell me about it?"
## The Query Pattern
WebFinger uses a simple HTTP GET request:
```
GET /.well-known/webfinger?resource=acct:alice@example.com
```
This asks: "What do you know about alice@example.com?"
The server responds with a JSON Resource Descriptor (JRD) containing links, properties, and metadata about that resource.
## Real-World Use Cases
### ActivityPub / Mastodon
When you follow `@alice@example.com` on Mastodon, your instance:
1. Queries `example.com/.well-known/webfinger?resource=acct:alice@example.com`
2. Gets back Alice's ActivityPub profile URL
3. Fetches her profile and posts from that URL
4. Subscribes to updates
WebFinger is the discovery layer that makes federation work.
### OpenID Connect
OAuth/OpenID providers use WebFinger for issuer discovery:
1. User enters email address
2. Client extracts domain
3. Queries WebFinger for OpenID configuration
4. Discovers authentication endpoints
5. Initiates OAuth flow
This enables "email address as identity" without hardcoding provider lists.
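For illustration, an OIDC client's issuer lookup might look like the sketch below; real clients also validate the response and handle multiple links, and the email address is of course hypothetical:
```typescript
// Discover an OpenID issuer for "alice@example.com" via WebFinger (sketch)
async function discoverIssuer(email: string): Promise<string | undefined> {
  const domain = email.split('@')[1];
  const url = new URL(`https://${domain}/.well-known/webfinger`);
  url.searchParams.set('resource', `acct:${email}`);
  url.searchParams.set('rel', 'http://openid.net/specs/connect/1.0/issuer');
  const jrd = await (await fetch(url)).json();
  // The issuer link's href points at the OpenID provider's configuration
  return jrd.links?.find(
    (l: { rel: string; href?: string }) =>
      l.rel === 'http://openid.net/specs/connect/1.0/issuer'
  )?.href;
}
```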
### Contact Discovery
Email clients and contact apps use WebFinger to discover:
- Profile photos and avatars
- Public keys for encryption
- Social media profiles
- Calendar availability
- Preferred contact methods
## The JRD Response Format
A WebFinger response looks like:
```json
{
"subject": "acct:alice@example.com",
"aliases": [
"https://example.com/@alice",
"https://example.com/users/alice"
],
"properties": {
"http://schema.org/name": "Alice Developer"
},
"links": [
{
"rel": "self",
"type": "application/activity+json",
"href": "https://example.com/users/alice"
},
{
"rel": "http://webfinger.net/rel/profile-page",
"type": "text/html",
"href": "https://example.com/@alice"
},
{
"rel": "http://webfinger.net/rel/avatar",
"type": "image/jpeg",
"href": "https://example.com/avatars/alice.jpg"
}
]
}
```
**Subject**: The resource being described (often same as query)
**Aliases**: Alternative identifiers for the same resource
**Properties**: Key-value metadata (property names must be URIs)
**Links**: Related resources with relationship types
## Link Relations
The `rel` field uses standardized link relation types:
**IANA registered**: `self`, `alternate`, `canonical`, etc.
**WebFinger specific**: `http://webfinger.net/rel/profile-page`, etc.
**Custom/domain-specific**: Any URI works
This extensibility allows WebFinger to serve many use cases while remaining standardized.
## Static vs. Dynamic Resources
The integration supports both approaches:
### Static Resources
Define specific resources explicitly:
```typescript
webfinger: {
resources: [
{
resource: 'acct:alice@example.com',
links: [...]
}
]
}
```
Use this for a small, known set of identities.
### Content Collection Integration
Generate resources dynamically from Astro content collections:
```typescript
webfinger: {
collections: [{
name: 'team',
resourceTemplate: 'acct:{slug}@example.com',
linksBuilder: (member) => [...]
}]
}
```
This auto-generates WebFinger responses for all collection entries. Add a team member to your content collection, and they become discoverable via WebFinger automatically.
## Template Variables
Resource and subject templates support variables:
- `{slug}`: Collection entry slug
- `{id}`: Collection entry ID
- `{data.fieldName}`: Any field from entry data
- `{siteURL}`: Your configured site URL
Example:
```typescript
resourceTemplate: 'acct:{data.username}@{siteURL.hostname}'
```
For a team member with `username: 'alice'` on `example.com`, this generates:
`acct:alice@example.com`
## CORS and Security
WebFinger responses include:
```
Access-Control-Allow-Origin: *
```
This is intentional - WebFinger is designed for public discovery. If information shouldn't be public, don't put it in WebFinger.
The protocol assumes:
- Resources are intentionally discoverable
- Information is public or intended for sharing
- Authentication happens at linked resources, not discovery layer
## Rel Filtering
Clients can request specific link types:
```
GET /.well-known/webfinger?resource=acct:alice@example.com&rel=self
```
The server returns only links matching that relation type. This reduces bandwidth and focuses the response.
The integration handles this automatically.
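Conceptually, the filtering is just a match against the `rel` values in the query string - a simplified sketch of the idea, not the integration's actual code:
```typescript
// Simplified rel filtering over a JRD document (illustrative only)
interface JrdLink { rel: string; type?: string; href?: string }
interface Jrd { subject: string; links?: JrdLink[] }
function filterByRel(jrd: Jrd, requestUrl: URL): Jrd {
  const rels = requestUrl.searchParams.getAll('rel');
  if (rels.length === 0) return jrd; // no rel params: return everything
  return {
    ...jrd,
    links: (jrd.links ?? []).filter((link) => rels.includes(link.rel)),
  };
}
```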
## Why Dynamic Routes
Unlike other discovery files, WebFinger uses a dynamic route (`prerender: false`). This is because:
1. Query parameters determine the response
2. Content collection resources may be numerous
3. Responses are lightweight enough to generate on-demand
Static generation would require pre-rendering every possible query, which is impractical for collections.
## Building for Federation
If you want your site to participate in federated protocols:
**Enable WebFinger**: Makes your users/resources discoverable
**Implement ActivityPub**: Provide the linked profile/actor endpoints
**Support WebFinger lookup**: Allow others to discover your resources
WebFinger is the discovery layer; ActivityPub (or other protocols) provide the functionality.
## Team/Author Discovery
A common pattern for blogs and documentation:
```typescript
webfinger: {
collections: [{
name: 'authors',
resourceTemplate: 'acct:{slug}@myblog.com',
linksBuilder: (author) => [
{
rel: 'http://webfinger.net/rel/profile-page',
href: `https://myblog.com/authors/${author.slug}`,
type: 'text/html'
},
{
rel: 'http://webfinger.net/rel/avatar',
href: author.data.avatar,
type: 'image/jpeg'
}
],
propertiesBuilder: (author) => ({
'http://schema.org/name': author.data.name,
'http://schema.org/email': author.data.email
})
}]
}
```
Now `acct:alice@myblog.com` resolves to Alice's author page, avatar, and contact info.
## Testing WebFinger
After deployment:
1. Query directly: `curl 'https://example.com/.well-known/webfinger?resource=acct:alice@example.com'`
2. Use WebFinger validators/debuggers
3. Test from federated clients (Mastodon, etc.)
4. Verify CORS headers are present
5. Check rel filtering works
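A short script can automate the first two checks - a sketch only, using a hypothetical account and domain:
```typescript
// webfinger-smoke-test.ts - quick post-deploy check
const resource = 'acct:alice@example.com';
const res = await fetch(
  `https://example.com/.well-known/webfinger?resource=${encodeURIComponent(resource)}`
);
console.log('status:', res.status); // expect 200
console.log('cors:', res.headers.get('access-control-allow-origin')); // expect *
const jrd = await res.json();
console.log('subject matches:', jrd.subject === resource);
console.log('has self link:', jrd.links?.some((l: { rel: string }) => l.rel === 'self'));
```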
## Privacy Considerations
WebFinger makes information **discoverable**. Consider:
- Don't expose private email addresses or contact info
- Limit to intentionally public resources
- Understand that responses are cached
- Remember `Access-Control-Allow-Origin: *` makes responses widely accessible
If information shouldn't be public, don't include it in WebFinger responses.
## Beyond Social Networks
WebFinger isn't just for social media. Other applications:
**Device discovery**: IoT devices announcing capabilities
**Service discovery**: API endpoints and configurations
**Calendar/availability**: Free/busy status and booking links
**Payment addresses**: Cryptocurrency addresses and payment methods
**Professional profiles**: Credentials, certifications, and portfolios
The protocol is general-purpose resource discovery.
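As one example, the same `resources` configuration shown earlier could advertise a payment page using the IANA-registered `payment` link relation - the account and URLs here are purely illustrative:
```typescript
// Sketch: a WebFinger resource pointing at a payment/donation page
discovery({
  webfinger: {
    enabled: true,
    resources: [
      {
        resource: 'acct:donations@example.com', // hypothetical account
        links: [
          {
            rel: 'payment', // IANA-registered link relation
            type: 'text/html',
            href: 'https://example.com/donate'
          }
        ]
      }
    ]
  }
})
```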
## The Integration's Approach
This integration makes WebFinger accessible without boilerplate:
- Auto-generates from content collections
- Handles template variable substitution
- Manages CORS and rel filtering
- Provides type-safe configuration
- Supports both static and dynamic resources
You define the mappings, the integration handles the protocol.
## When to Use WebFinger
Enable WebFinger if:
- You want to participate in federated protocols
- Your site has user profiles or authors
- You're building decentralized services
- You want discoverable team members
- You're implementing OAuth/OpenID
Skip it if:
- Your site is purely informational with no identity component
- You don't want to expose resource discovery
- You're not integrating with federated services
## Related Topics
- [ActivityPub Integration](/how-to/activitypub/) - Building on WebFinger for federation
- [WebFinger Reference](/reference/webfinger/) - Complete configuration options
- [Content Collections](/how-to/content-collections/) - Dynamic resource generation

@@ -1,31 +1,130 @@
---
title: Why Use Discovery Files?
description: Understanding the importance of discovery files for modern websites
---
Discovery files are the polite introduction your website makes to the automated systems that visit it every day. Just as you might put up a sign directing visitors to your front door, these files tell bots, AI assistants, search engines, and other automated systems where to go and what they can do.
## The Discovery Problem
Every website faces a fundamental challenge: how do automated systems know what your site contains, where security issues should be reported, or how AI assistants should interact with your content?
Without standardized discovery mechanisms, each bot must guess. Search engines might crawl your entire site inefficiently. AI systems might misrepresent your content. Security researchers won't know how to contact you responsibly. Federated services can't find your user profiles.
Discovery files solve this by providing **machine-readable contracts** that answer specific questions:
- **robots.txt**: "What can I crawl and where?"
- **llms.txt**: "How should AI assistants understand and represent your site?"
- **humans.txt**: "Who built this and what technologies were used?"
- **security.txt**: "Where do I report security vulnerabilities?"
- **canary.txt**: "Has your organization received certain legal orders?"
- **webfinger**: "How do I discover user profiles and federated identities?"
## Why Multiple Files?
You might wonder why we need separate files instead of one unified discovery document. The answer lies in **separation of concerns** and **backwards compatibility**.
Each file serves a distinct audience and purpose:
- **robots.txt** targets web crawlers and has been the standard since 1994
- **llms.txt** addresses the new reality of AI assistants processing web content
- **humans.txt** provides transparency for developers and users curious about your stack
- **security.txt** (RFC 9116) offers a standardized security contact mechanism
- **canary.txt** enables transparency about legal obligations
- **webfinger** (RFC 7033) enables decentralized resource discovery
Different systems read different files. A search engine ignores humans.txt. A developer looking at your tech stack won't read robots.txt. A security researcher needs security.txt, not your sitemap.
This modularity also means you can adopt discovery files incrementally. Start with robots.txt and sitemap.xml, add llms.txt when you want AI assistance, enable security.txt when you're ready to accept vulnerability reports.
## The Visibility Trade-off
Discovery files involve an important trade-off: **transparency versus obscurity**.
By publishing robots.txt, you tell both polite crawlers and malicious scrapers about your site structure. Security.txt reveals your security team's contact information. Humans.txt exposes your technology stack.
This is deliberate. Discovery files embrace the principle that **security through obscurity is not security**. The benefits of standardized, polite communication with automated systems outweigh the minimal risks of exposing this information.
Consider that:
- Attackers can discover your tech stack through other means (HTTP headers, page analysis, etc.)
- Security.txt makes responsible disclosure easier, reducing time-to-fix for vulnerabilities
- Robots.txt only controls *polite* bots - malicious actors ignore it anyway
- The transparency builds trust with users, developers, and security researchers
## The Evolution of Discovery
Discovery mechanisms have evolved alongside the web itself:
**1994**: robots.txt emerges as an informal standard for crawler communication
**2000s**: Sitemaps become essential for SEO as the web grows exponentially
**2008**: humans.txt proposed to add personality and transparency to websites
**2017**: RFC 9116 standardizes security.txt after years of ad-hoc security contact methods
**2023**: llms.txt proposed as AI assistants become major consumers of web content
**2024**: Warrant canaries and webfinger integration emerge for transparency and federation
Each new discovery file addresses a real need that emerged as the web ecosystem grew. The integration brings them together because **modern websites need to communicate with an increasingly diverse set of automated visitors**.
## Discovery as Infrastructure
Think of discovery files as **critical infrastructure for your website**. They're not optional extras - they're the foundation for how your site interacts with the broader web ecosystem.
Without proper discovery files:
- Search engines may crawl inefficiently, wasting your server resources
- AI assistants may misunderstand your content or ignore important context
- Security researchers may struggle to report vulnerabilities responsibly
- Developers can't easily understand your technical choices
- Federated services can't integrate with your user profiles
With comprehensive discovery:
- You control how bots interact with your site
- AI assistants have proper context for representing your content
- Security issues can be reported through established channels
- Your tech stack and team are properly credited
- Your site integrates seamlessly with federated protocols
## The Cost-Benefit Analysis
Setting up discovery files manually for each project is tedious and error-prone. You need to:
- Remember the correct format for each file type
- Keep URLs and sitemaps synchronized with your site config
- Update expiration dates for security.txt and canary.txt
- Maintain consistency across different discovery mechanisms
- Handle edge cases and RFC compliance
An integration automates all of this, ensuring:
- **Consistency**: All discovery files reference the same site URL
- **Correctness**: RFC compliance is handled automatically
- **Maintenance**: Expiration dates and timestamps update on each build
- **Flexibility**: Configuration changes propagate to all relevant files
- **Best Practices**: Sensible defaults that you can override as needed
The cost is minimal - a single integration in your Astro config. The benefit is comprehensive, standards-compliant discovery across your entire site.
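In practice, that looks something like the sketch below. The package name and option shapes are assumptions here - follow the integration's install guide for the exact import:
```typescript
// astro.config.mjs - sketch of wiring the integration into an Astro project
import { defineConfig } from 'astro/config';
import discovery from '@astrojs/discovery'; // assumed package name
export default defineConfig({
  site: 'https://example.com', // discovery files derive their URLs from this
  integrations: [
    discovery({
      robots: { /* crawler rules */ },
      llms: { description: 'What this site is about' },
      security: { contact: 'security@example.com' },
    }),
  ],
});
```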
## Looking Forward
As the web continues to evolve, discovery mechanisms will too. We're already seeing:
- AI systems becoming more sophisticated in how they consume web content
- Federated protocols gaining adoption for decentralized social networks
- Increased emphasis on security transparency and responsible disclosure
- Growing need for machine-readable metadata as automation increases
Discovery files aren't a trend - they're fundamental communication protocols that will remain relevant as long as automated systems interact with websites.
By implementing comprehensive discovery now, you're **future-proofing** your site for whatever new automated visitors emerge next.
## Related Topics
- [SEO Implications](/explanation/seo/) - How discovery files affect search rankings
- [AI Integration Strategy](/explanation/ai-integration/) - Making your content AI-friendly
- [Architecture](/explanation/architecture/) - How the integration works internally

@@ -3,29 +3,382 @@ title: ActivityPub Integration
description: Connect with the Fediverse via WebFinger
---
Enable WebFinger to make your site discoverable on Mastodon and other ActivityPub-compatible services in the Fediverse.
## Prerequisites
- Integration installed and configured
- Understanding of ActivityPub and WebFinger protocols
- Knowledge of your site's user or author structure
- ActivityPub server endpoints (or static actor files)
## Basic Static Profile
Create a single discoverable profile:
```typescript
// astro.config.mjs
discovery({
webfinger: {
enabled: true,
resources: [
{
resource: 'acct:yourname@example.com',
subject: 'acct:yourname@example.com',
aliases: [
'https://example.com/@yourname'
],
links: [
{
rel: 'http://webfinger.net/rel/profile-page',
type: 'text/html',
href: 'https://example.com/@yourname'
},
{
rel: 'self',
type: 'application/activity+json',
href: 'https://example.com/users/yourname'
}
]
}
]
}
})
```
Query: `GET /.well-known/webfinger?resource=acct:yourname@example.com`
## Multiple Authors
Enable discovery for all blog authors:
```typescript
discovery({
webfinger: {
enabled: true,
resources: [
{
resource: 'acct:alice@example.com',
links: [
{
rel: 'self',
type: 'application/activity+json',
href: 'https://example.com/users/alice'
},
{
rel: 'http://webfinger.net/rel/profile-page',
href: 'https://example.com/authors/alice'
}
]
},
{
resource: 'acct:bob@example.com',
links: [
{
rel: 'self',
type: 'application/activity+json',
href: 'https://example.com/users/bob'
},
{
rel: 'http://webfinger.net/rel/profile-page',
href: 'https://example.com/authors/bob'
}
]
}
]
}
})
```
## Dynamic Authors from Content Collection
Load authors from Astro content collection:
**Step 1**: Create authors collection:
```typescript
// src/content.config.ts
const authorsCollection = defineCollection({
type: 'data',
schema: z.object({
name: z.string(),
email: z.string().email(),
bio: z.string(),
avatar: z.string().url(),
mastodon: z.string().optional(),
})
});
```
**Step 2**: Add author data:
```yaml
# src/content/authors/alice.yaml
name: Alice Developer
email: alice@example.com
bio: Full-stack developer and writer
avatar: https://example.com/avatars/alice.jpg
mastodon: '@alice@mastodon.social'
```
**Step 3**: Configure WebFinger collection:
```typescript
discovery({
webfinger: {
enabled: true,
collections: [{
name: 'authors',
resourceTemplate: 'acct:{slug}@example.com',
linksBuilder: (author) => [
{
rel: 'http://webfinger.net/rel/profile-page',
type: 'text/html',
href: `https://example.com/authors/${author.slug}`
},
{
rel: 'http://webfinger.net/rel/avatar',
type: 'image/jpeg',
href: author.data.avatar
},
{
rel: 'self',
type: 'application/activity+json',
href: `https://example.com/users/${author.slug}`
}
],
propertiesBuilder: (author) => ({
'http://schema.org/name': author.data.name,
'http://schema.org/description': author.data.bio
}),
aliasesBuilder: (author) => [
`https://example.com/@${author.slug}`
]
}]
}
})
```
## Create ActivityPub Actor Endpoint
WebFinger discovery requires an ActivityPub actor endpoint. Create it:
```typescript
// src/pages/users/[author].json.ts
import type { APIRoute } from 'astro';
import { getCollection } from 'astro:content';
export async function getStaticPaths() {
const authors = await getCollection('authors');
return authors.map(author => ({
params: { author: author.slug }
}));
}
export const GET: APIRoute = async ({ params, site }) => {
const authors = await getCollection('authors');
const author = authors.find(a => a.slug === params.author);
if (!author) {
return new Response(null, { status: 404 });
}
const actor = {
'@context': [
'https://www.w3.org/ns/activitystreams',
'https://w3id.org/security/v1'
],
'type': 'Person',
'id': `${site}users/${author.slug}`,
'preferredUsername': author.slug,
'name': author.data.name,
'summary': author.data.bio,
'url': `${site}authors/${author.slug}`,
'icon': {
'type': 'Image',
'mediaType': 'image/jpeg',
'url': author.data.avatar
},
'inbox': `${site}users/${author.slug}/inbox`,
'outbox': `${site}users/${author.slug}/outbox`,
'followers': `${site}users/${author.slug}/followers`,
'following': `${site}users/${author.slug}/following`,
};
return new Response(JSON.stringify(actor, null, 2), {
status: 200,
headers: {
'Content-Type': 'application/activity+json'
}
});
};
```
## Link from Mastodon
Users can find your profile on Mastodon:
1. Go to Mastodon search
2. Enter `@yourname@example.com`
3. Mastodon queries WebFinger at your site
4. Gets ActivityPub actor URL
5. Displays profile with follow button
## Add Profile Link in Bio
Link your Mastodon profile:
```typescript
discovery({
webfinger: {
enabled: true,
collections: [{
name: 'authors',
resourceTemplate: 'acct:{slug}@example.com',
linksBuilder: (author) => {
const links = [
{
rel: 'self',
type: 'application/activity+json',
href: `https://example.com/users/${author.slug}`
}
];
// Add Mastodon link if available
if (author.data.mastodon) {
const mastodonUrl = author.data.mastodon.startsWith('http')
? author.data.mastodon
: `https://mastodon.social/${author.data.mastodon}`;
links.push({
rel: 'http://webfinger.net/rel/profile-page',
type: 'text/html',
href: mastodonUrl
});
}
return links;
}
}]
}
})
```
## Testing WebFinger
Test your WebFinger endpoint:
```bash
# Build the site
npm run build
npm run preview
# Test WebFinger query
curl "http://localhost:4321/.well-known/webfinger?resource=acct:alice@example.com"
```
Expected response:
```json
{
"subject": "acct:alice@example.com",
"aliases": [
"https://example.com/@alice"
],
"links": [
{
"rel": "http://webfinger.net/rel/profile-page",
"type": "text/html",
"href": "https://example.com/authors/alice"
},
{
"rel": "self",
"type": "application/activity+json",
"href": "https://example.com/users/alice"
}
]
}
```
## Test ActivityPub Actor
Verify actor endpoint:
```bash
curl "http://localhost:4321/users/alice" \
-H "Accept: application/activity+json"
```
Should return actor JSON with inbox, outbox, followers, etc.
## Configure CORS
WebFinger requires CORS headers:
The integration automatically adds:
```
Access-Control-Allow-Origin: *
```
For production with an ActivityPub server, configure appropriate CORS in your hosting.
## Implement Full ActivityPub
For complete Fediverse integration:
1. **Implement inbox**: Handle incoming activities (follows, likes, shares)
2. **Implement outbox**: Serve your posts/activities
3. **Generate keypairs**: For signing activities
4. **Handle followers**: Maintain follower/following lists
5. **Send activities**: Notify followers of new posts
This is beyond WebFinger scope. Consider using:
- [Bridgy Fed](https://fed.brid.gy/) for easy federation
- [WriteFreely](https://writefreely.org/) for federated blogging
- [GoToSocial](https://gotosocial.org/) for self-hosted instances
## Expected Result
Your site becomes discoverable in the Fediverse:
1. Users search `@yourname@example.com` on Mastodon
2. Mastodon fetches WebFinger from `/.well-known/webfinger`
3. Gets ActivityPub actor URL
4. Displays your profile
5. Users can follow/interact (if full ActivityPub implemented)
## Alternative Approaches
**Static site**: Use WebFinger for discovery only, point to external Mastodon account.
**Proxy to Mastodon**: WebFinger points to your Mastodon instance.
**Bridgy Fed**: Use Bridgy Fed to handle ActivityPub protocol, just provide WebFinger.
**Full implementation**: Build complete ActivityPub server with inbox/outbox.
## Common Issues
**WebFinger not found**: Ensure `webfinger.enabled: true` and resources/collections configured.
**CORS errors**: Integration adds CORS automatically. Check if hosting overrides headers.
**Actor URL 404**: Create the actor endpoint at the URL specified in WebFinger links.
**Mastodon can't find profile**: Ensure `rel: 'self'` link with `type: 'application/activity+json'` exists.
**Incorrect format**: WebFinger must return valid JRD JSON. Test with curl.
**Case sensitivity**: Resource URIs are case-sensitive. `acct:alice@example.com` ≠ `acct:Alice@example.com`
## Additional Resources
- [WebFinger RFC 7033](https://datatracker.ietf.org/doc/html/rfc7033)
- [ActivityPub Spec](https://www.w3.org/TR/activitypub/)
- [Mastodon Documentation](https://docs.joinmastodon.org/)
- [Bridgy Fed](https://fed.brid.gy/)

@@ -3,29 +3,248 @@ title: Add Team Members
description: Add team member information to humans.txt
---
Document your team and contributors in humans.txt for public recognition.
## Prerequisites
- Integration installed and configured
- Team member information (names, roles, contact details)
- Permission from team members to share their information
## Add a Single Team Member
Configure basic team information:
```typescript
// astro.config.mjs
discovery({
humans: {
team: [
{
name: 'Jane Developer',
role: 'Lead Developer',
contact: 'jane@example.com'
}
]
}
})
```
## Add Multiple Team Members
Include your full team:
```typescript
discovery({
humans: {
team: [
{
name: 'Jane Developer',
role: 'Lead Developer',
contact: 'jane@example.com',
location: 'San Francisco, CA'
},
{
name: 'John Designer',
role: 'UI/UX Designer',
contact: 'john@example.com',
location: 'New York, NY'
},
{
name: 'Sarah Product',
role: 'Product Manager',
location: 'London, UK'
}
]
}
})
```
## Include Social Media Profiles
Add Twitter and GitHub handles:
```typescript
discovery({
humans: {
team: [
{
name: 'Alex Dev',
role: 'Full Stack Developer',
contact: 'alex@example.com',
twitter: '@alexdev',
github: 'alex-codes'
}
]
}
})
```
## Load from Content Collections
Dynamically generate team list from content:
```typescript
import { getCollection } from 'astro:content';
discovery({
humans: {
team: async () => {
const teamMembers = await getCollection('team');
return teamMembers.map(member => ({
name: member.data.name,
role: member.data.role,
contact: member.data.email,
location: member.data.city,
twitter: member.data.twitter,
github: member.data.github
}));
}
}
})
```
Create a content collection in `src/content/team/`:
```yaml
# src/content/team/jane.yaml
name: Jane Developer
role: Lead Developer
email: jane@example.com
city: San Francisco, CA
twitter: '@janedev'
github: jane-codes
```
## Load from External Source
Fetch team data from your API or database:
```typescript
discovery({
humans: {
team: async () => {
const response = await fetch('https://api.example.com/team');
const teamData = await response.json();
return teamData.members.map(member => ({
name: member.fullName,
role: member.position,
contact: member.publicEmail,
location: member.location
}));
}
}
})
```
## Add Acknowledgments
Thank contributors and inspirations:
```typescript
discovery({
humans: {
team: [/* ... */],
thanks: [
'The Astro team for the amazing framework',
'All our open source contributors',
'Stack Overflow community',
'Our beta testers',
'Coffee and late nights'
]
}
})
```
## Include Project Story
Add context about your project:
```typescript
discovery({
humans: {
team: [/* ... */],
story: `
This project was born from a hackathon in 2024. What started as
a weekend experiment grew into a tool used by thousands. Our team
came together from different timezones and backgrounds, united by
a passion for making the web more discoverable.
`.trim()
}
})
```
## Add Fun Facts
Make it personal:
```typescript
discovery({
humans: {
team: [/* ... */],
funFacts: [
'Built entirely remotely across 4 continents',
'Powered by 1,247 cups of coffee',
'Deployed on a Friday (we live dangerously)',
'First commit was at 2:47 AM',
'Named after a recurring inside joke'
]
}
})
```
## Verify Your Configuration
Build and check the output:
```bash
npm run build
npm run preview
curl http://localhost:4321/humans.txt
```
## Expected Result
Your humans.txt will contain formatted team information:
```
/* TEAM */
Name: Jane Developer
Role: Lead Developer
Contact: jane@example.com
From: San Francisco, CA
Twitter: @janedev
GitHub: jane-codes
Name: John Designer
Role: UI/UX Designer
Contact: john@example.com
From: New York, NY
/* THANKS */
The Astro team for the amazing framework
All our open source contributors
Coffee and late nights
```
## Alternative Approaches
**Privacy-first**: Use team roles without names or contact details for privacy.
**Department-based**: Group team members by department rather than listing individually.
**Rotating spotlight**: Highlight different team members each month using dynamic content.
## Common Issues
**Missing permissions**: Always get consent before publishing personal information.
**Outdated information**: Keep contact details current. Use dynamic loading to stay fresh.
**Too much detail**: Stick to professional information. Avoid personal addresses or phone numbers.
**Special characters**: Use plain ASCII in humans.txt. Avoid emojis unless necessary.

@@ -1,31 +1,169 @@
---
title: Block Specific Bots
description: Control which bots can crawl your site using robots.txt rules
---
Block unwanted bots or user agents from accessing specific parts of your site.
## Prerequisites
- Integration installed and configured
- Basic familiarity with robots.txt format
- Knowledge of which bot user agents to block
## Block a Single Bot Completely
To prevent a specific bot from crawling your entire site:
```typescript
// astro.config.mjs
discovery({
robots: {
additionalAgents: [
{
userAgent: 'BadBot',
disallow: ['/']
}
]
}
})
```
This creates a rule that blocks `BadBot` from all pages.
## Block Multiple Bots
Add multiple entries to the `additionalAgents` array:
```typescript
discovery({
robots: {
additionalAgents: [
{ userAgent: 'BadBot', disallow: ['/'] },
{ userAgent: 'SpamCrawler', disallow: ['/'] },
{ userAgent: 'AnnoyingBot', disallow: ['/'] }
]
}
})
```
## Block Bots from Specific Paths
Allow a bot access to most content, but block sensitive areas:
```typescript
discovery({
robots: {
additionalAgents: [
{
userAgent: 'PriceBot',
allow: ['/'],
disallow: ['/checkout', '/account', '/api']
}
]
}
})
```
**Order matters**: Specific rules (`/checkout`) should come after general rules (`/`).
## Disable All LLM Bots
To block all AI crawler bots:
```typescript
discovery({
robots: {
llmBots: {
enabled: false
}
}
})
```
This removes the allow rules for Anthropic-AI, Claude-Web, GPTBot, and other LLM crawlers.
## Block Specific LLM Bots
Keep some LLM bots while blocking others:
```typescript
discovery({
robots: {
llmBots: {
enabled: true,
agents: ['Anthropic-AI', 'Claude-Web'] // Only allow these
},
additionalAgents: [
{ userAgent: 'GPTBot', disallow: ['/'] },
{ userAgent: 'Google-Extended', disallow: ['/'] }
]
}
})
```
## Add Custom Rules
For complex scenarios, use `customRules` to add raw robots.txt content:
```typescript
discovery({
robots: {
customRules: `
# Block aggressive crawlers
User-agent: AggressiveBot
Crawl-delay: 30
Disallow: /
# Special rule for search engine
User-agent: Googlebot
Allow: /api/public
Disallow: /api/private
`.trim()
}
})
```
## Verify Your Configuration
After configuration, build your site and check `/robots.txt`:
```bash
npm run build
npm run preview
curl http://localhost:4321/robots.txt
```
Look for your custom agent rules in the output.
## Expected Result
Your robots.txt will contain entries like:
```
User-agent: BadBot
Disallow: /
User-agent: PriceBot
Allow: /
Disallow: /checkout
Disallow: /account
```
Blocked bots should respect these rules and avoid crawling restricted areas.
## Alternative Approaches
**Server-level blocking**: For malicious bots that ignore robots.txt, consider blocking at the server/firewall level.
**User-agent detection**: Implement server-side detection to return 403 Forbidden for specific user agents.
**Rate limiting**: Use crawl delays to slow down aggressive crawlers rather than blocking them completely.
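For the user-agent detection approach, Astro middleware can refuse requests before they reach your pages. A sketch for server-rendered or hybrid sites - static-only builds would need equivalent rules at the CDN or web-server layer, and the bot names here are placeholders:
```typescript
// src/middleware.ts - return 403 for known-bad user agents (example list)
import { defineMiddleware } from 'astro:middleware';
const BLOCKED_AGENTS = ['BadBot', 'SpamCrawler']; // hypothetical bot names
export const onRequest = defineMiddleware((context, next) => {
  const ua = context.request.headers.get('user-agent') ?? '';
  if (BLOCKED_AGENTS.some((agent) => ua.includes(agent))) {
    return new Response('Forbidden', { status: 403 });
  }
  return next();
});
```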
## Common Issues
**Bots ignoring rules**: robots.txt is advisory only. Malicious bots may not respect it.
**Overly broad patterns**: Be specific with disallow paths. `/api` blocks `/api/public` too.
**Typos in user agents**: User agent strings are case-sensitive. Check bot documentation for exact values.

@@ -3,29 +3,224 @@ title: Set Cache Headers
description: Configure HTTP caching for discovery files
---
Optimize cache headers for discovery files to balance freshness with server load and client performance.
## Prerequisites
- Integration installed and configured
- Understanding of HTTP caching concepts
- Knowledge of your content update frequency
## Set Cache Duration for All Files
Configure caching in seconds:
```typescript
// astro.config.mjs
discovery({
caching: {
robots: 3600, // 1 hour
llms: 3600, // 1 hour
humans: 86400, // 24 hours
security: 86400, // 24 hours
canary: 3600, // 1 hour
webfinger: 3600, // 1 hour
sitemap: 3600 // 1 hour
}
})
```
These values set `Cache-Control: public, max-age=<seconds>` headers.
## Short Cache for Frequently Updated Content
Update canary.txt daily? Use short cache:
```typescript
discovery({
caching: {
canary: 1800 // 30 minutes
}
})
```
Bots will check for updates more frequently.
## Long Cache for Static Content
Rarely change humans.txt? Cache longer:
```typescript
discovery({
caching: {
humans: 604800 // 1 week (7 days)
}
})
```
Reduces server load for static content.
## Disable Caching for Development
Different caching for development vs production:
```typescript
discovery({
caching: import.meta.env.PROD
? {
// Production: aggressive caching
robots: 3600,
llms: 3600,
humans: 86400
}
: {
// Development: no caching
robots: 0,
llms: 0,
humans: 0
}
})
```
Zero seconds means no caching (always fresh).
## Match Cache to Update Frequency
Align with your content update schedule:
```typescript
discovery({
caching: {
// Updated hourly via CI/CD
llms: 3600, // 1 hour
// Updated daily
canary: 7200, // 2 hours (some buffer)
// Updated weekly
humans: 86400, // 24 hours
// Rarely changes
robots: 604800, // 1 week
security: 2592000 // 30 days
}
})
```
## Conservative Caching
When in doubt, cache shorter:
```typescript
discovery({
caching: {
robots: 1800, // 30 min
llms: 1800, // 30 min
humans: 3600, // 1 hour
sitemap: 1800 // 30 min
}
})
```
Ensures content stays relatively fresh.
## Aggressive Caching
Optimize for performance when content is stable:
```typescript
discovery({
caching: {
robots: 86400, // 24 hours
llms: 43200, // 12 hours
humans: 604800, // 1 week
security: 2592000, // 30 days
sitemap: 86400 // 24 hours
}
})
```
## Understand Cache Behavior
Different cache durations affect different use cases:
**robots.txt** (crawl bots):
- Short cache (1 hour): Quickly reflect changes to bot permissions
- Long cache (24 hours): Reduce load from frequent bot checks
**llms.txt** (AI assistants):
- Short cache (1 hour): Keep instructions current
- Medium cache (6 hours): Balance freshness and performance
**humans.txt** (curious visitors):
- Long cache (24 hours - 1 week): Team info changes rarely
**security.txt** (security researchers):
- Long cache (24 hours - 30 days): Contact info is stable
**canary.txt** (transparency):
- Short cache (30 min - 1 hour): Must be checked frequently
## Verify Cache Headers
Test with curl:
```bash
npm run build
npm run preview
# Check cache headers
curl -I http://localhost:4321/robots.txt
curl -I http://localhost:4321/llms.txt
curl -I http://localhost:4321/humans.txt
```
Look for `Cache-Control` header in the response:
```
Cache-Control: public, max-age=3600
```
## Expected Result
Browsers and CDNs will cache files according to your settings. Subsequent requests within the cache period will be served from cache, reducing server load.
For a 1-hour cache:
1. First request at 10:00 AM: Server serves fresh content
2. Request at 10:30 AM: Served from cache
3. Request at 11:01 AM: Cache expired, server serves fresh content
## Alternative Approaches
**CDN-level caching**: Configure caching at your CDN (Cloudflare, Fastly) rather than in the integration.
**Surrogate-Control header**: Use `Surrogate-Control` for CDN caching while controlling browser cache separately.
**ETags**: Add ETag support for efficient conditional requests.
**Vary header**: Consider adding `Vary: Accept-Encoding` for compressed responses.
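If your deployment runs through an adapter (SSR or hybrid), one way to layer these on is middleware - a sketch only; the exact header names and values depend on your CDN, and prerendered static files are better handled in the hosting platform's header config:
```typescript
// src/middleware.ts - add CDN-oriented headers to discovery files (sketch)
import { defineMiddleware } from 'astro:middleware';
const DISCOVERY_PATHS = new Set(['/robots.txt', '/llms.txt', '/humans.txt']);
export const onRequest = defineMiddleware(async (context, next) => {
  const response = await next();
  if (!DISCOVERY_PATHS.has(context.url.pathname)) return response;
  const headers = new Headers(response.headers);
  headers.set('Surrogate-Control', 'max-age=86400'); // CDN cache: 24 hours
  headers.set('Vary', 'Accept-Encoding');
  return new Response(response.body, { status: response.status, headers });
});
```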
## Common Issues
**Cache too long**: Content changes not reflected quickly. Reduce cache duration.
**Cache too short**: High server load from repeated requests. Increase cache duration.
**No caching in production**: Check if your hosting platform overrides headers.
**Stale content after updates**: Deploy a new version with a build timestamp to bust caches.
**Different behavior in CDN**: CDN may have its own caching rules. Check CDN configuration.
## Cache Duration Guidelines
**Rule of thumb**:
- Update frequency = Daily → Cache 2-6 hours
- Update frequency = Weekly → Cache 12-24 hours
- Update frequency = Monthly → Cache 1-7 days
- Update frequency = Rarely → Cache 7-30 days
**Special cases**:
- Canary.txt: Cache < update frequency (if daily, cache 2-12 hours)
- Security.txt: Cache longer (expires field handles staleness)
- Development: Cache 0 or very short (60 seconds)

@@ -3,29 +3,376 @@ title: Use with Content Collections
description: Integrate with Astro content collections
---
Automatically generate discovery content from your Astro content collections for dynamic, maintainable configuration.
## Prerequisites
- Integration installed and configured
- Astro content collections set up
- Understanding of async configuration functions
## Load Team from Collection
Create a team content collection and populate humans.txt:
**Step 1**: Define the collection schema:
```typescript
// src/content.config.ts
import { defineCollection, z } from 'astro:content';
const teamCollection = defineCollection({
type: 'data',
schema: z.object({
name: z.string(),
role: z.string(),
email: z.string().email(),
location: z.string().optional(),
twitter: z.string().optional(),
github: z.string().optional(),
})
});
export const collections = {
team: teamCollection
};
```
**Step 2**: Add team members:
```yaml
# src/content/team/alice.yaml
name: Alice Johnson
role: Lead Developer
email: alice@example.com
location: San Francisco, CA
github: alice-codes
```
```yaml
# src/content/team/bob.yaml
name: Bob Smith
role: Designer
email: bob@example.com
location: New York, NY
twitter: '@bobdesigns'
```
**Step 3**: Load in discovery config:
```typescript
// astro.config.mjs
import { getCollection } from 'astro:content';
discovery({
humans: {
team: async () => {
const members = await getCollection('team');
return members.map(member => ({
name: member.data.name,
role: member.data.role,
contact: member.data.email,
location: member.data.location,
twitter: member.data.twitter,
github: member.data.github
}));
}
}
})
```
## Generate Important Pages from Docs
List featured documentation pages in llms.txt:
**Step 1**: Add featured flag to doc frontmatter:
```markdown
---
# src/content/docs/getting-started.md
title: Getting Started Guide
description: Quick start guide for new users
featured: true
---
```
**Step 2**: Load featured docs:
```typescript
discovery({
llms: {
importantPages: async () => {
const docs = await getCollection('docs');
return docs
.filter(doc => doc.data.featured)
.sort((a, b) => (a.data.order || 0) - (b.data.order || 0))
.map(doc => ({
name: doc.data.title,
path: `/docs/${doc.slug}`,
description: doc.data.description
}));
}
}
})
```
## WebFinger from Author Collection
Make blog authors discoverable via WebFinger:
**Step 1**: Define authors collection:
```typescript
// src/content.config.ts
const authorsCollection = defineCollection({
type: 'data',
schema: z.object({
name: z.string(),
email: z.string().email(),
bio: z.string(),
avatar: z.string().url(),
mastodon: z.string().url().optional(),
website: z.string().url().optional()
})
});
```
**Step 2**: Add author data:
```yaml
# src/content/authors/alice.yaml
name: Alice Developer
email: alice@example.com
bio: Full-stack developer and open source enthusiast
avatar: https://example.com/avatars/alice.jpg
mastodon: https://mastodon.social/@alice
website: https://alice.dev
```
**Step 3**: Configure WebFinger:
```typescript
discovery({
webfinger: {
enabled: true,
collections: [{
name: 'authors',
resourceTemplate: 'acct:{slug}@example.com',
linksBuilder: (author) => [
{
rel: 'http://webfinger.net/rel/profile-page',
type: 'text/html',
href: `https://example.com/authors/${author.slug}`
},
{
rel: 'http://webfinger.net/rel/avatar',
type: 'image/jpeg',
href: author.data.avatar
},
...(author.data.mastodon ? [{
rel: 'self',
type: 'application/activity+json',
href: author.data.mastodon
}] : [])
],
propertiesBuilder: (author) => ({
'http://schema.org/name': author.data.name,
'http://schema.org/description': author.data.bio
})
}]
}
})
```
Query with: `GET /.well-known/webfinger?resource=acct:alice@example.com`
## Load API Endpoints from Spec
Generate API documentation from a collection:
```typescript
// src/content.config.ts
const apiCollection = defineCollection({
type: 'data',
schema: z.object({
path: z.string(),
method: z.enum(['GET', 'POST', 'PUT', 'DELETE', 'PATCH']),
description: z.string(),
public: z.boolean().default(true)
})
});
```
```yaml
# src/content/api/search.yaml
path: /api/search
method: GET
description: Search products by name, category, or tag
public: true
```
```typescript
discovery({
llms: {
apiEndpoints: async () => {
const endpoints = await getCollection('api');
return endpoints
.filter(ep => ep.data.public)
.map(ep => ({
path: ep.data.path,
method: ep.data.method,
description: ep.data.description
}));
}
}
})
```
## Multiple Collections
Combine data from several collections:
```typescript
discovery({
humans: {
team: async () => {
const [coreTeam, contributors] = await Promise.all([
getCollection('team'),
getCollection('contributors')
]);
return [
...coreTeam.map(m => ({ ...m.data, role: `Core - ${m.data.role}` })),
...contributors.map(m => ({ ...m.data, role: `Contributor - ${m.data.role}` }))
];
},
thanks: async () => {
const sponsors = await getCollection('sponsors');
return sponsors.map(s => s.data.name);
}
}
})
```
## Filter and Sort Collections
Control which items are included:
```typescript
discovery({
llms: {
importantPages: async () => {
const allDocs = await getCollection('docs');
return allDocs
// Only published docs
.filter(doc => doc.data.published !== false)
// Only important ones
.filter(doc => doc.data.priority === 'high')
// Sort by custom order
.sort((a, b) => {
const orderA = a.data.order ?? 999;
const orderB = b.data.order ?? 999;
return orderA - orderB;
})
// Map to format
.map(doc => ({
name: doc.data.title,
path: `/docs/${doc.slug}`,
description: doc.data.description
}));
}
}
})
```
## Localized Content
Support multiple languages:
```typescript
discovery({
llms: {
importantPages: async () => {
const docs = await getCollection('docs');
// Group by language
const enDocs = docs.filter(d => d.slug.startsWith('en/'));
const esDocs = docs.filter(d => d.slug.startsWith('es/'));
// Return English docs, with links to translations
return enDocs.map(doc => ({
name: doc.data.title,
path: `/docs/${doc.slug}`,
description: doc.data.description,
// Could add: translations: ['/docs/es/...']
}));
}
}
})
```
## Cache Collection Queries
Optimize build performance:
```typescript
// Cache at module level
let cachedTeam = null;
discovery({
humans: {
team: async () => {
if (!cachedTeam) {
const members = await getCollection('team');
cachedTeam = members.map(m => ({
name: m.data.name,
role: m.data.role,
contact: m.data.email
}));
}
return cachedTeam;
}
}
})
```
## Expected Result
Content collections automatically populate discovery files:
**Adding a team member**:
1. Create `src/content/team/new-member.yaml`
2. Run `npm run build`
3. humans.txt includes new member
**Marking a doc as featured**:
1. Add `featured: true` to frontmatter
2. Run `npm run build`
3. llms.txt lists the new important page
## Alternative Approaches
**Static data**: Use plain JavaScript objects when data rarely changes.
**External API**: Fetch from CMS or API during build instead of using collections.
**Hybrid**: Use collections for core data, enhance with API data.
## Common Issues
**Async not awaited**: Ensure you use `async () => {}` and `await getCollection()`.
**Build-time only**: Collections are loaded at build time, not runtime.
**Type errors**: Ensure collection schema matches the data structure you're mapping.
**Missing data**: Check that collection files exist and match the schema.
**Slow builds**: Cache collection queries if used multiple times in config.

@@ -3,29 +3,417 @@ title: Custom Templates
description: Create custom templates for discovery files
---
Override default templates to fully customize the output format of discovery files.
## Prerequisites
- Integration installed and configured
- Understanding of the file formats (robots.txt, llms.txt, etc.)
- Knowledge of template function signatures
## Override robots.txt Template
Complete control over robots.txt output:
```typescript
// astro.config.mjs
discovery({
templates: {
robots: (config, siteURL) => {
const lines = [];
// Custom header
lines.push('# Custom robots.txt');
lines.push(`# Site: ${siteURL.hostname}`);
lines.push('# Last generated: ' + new Date().toISOString());
lines.push('');
// Default rule
lines.push('User-agent: *');
lines.push('Allow: /');
lines.push('');
// Add sitemap
lines.push(`Sitemap: ${new URL('sitemap-index.xml', siteURL).href}`);
return lines.join('\n') + '\n';
}
}
})
```
## Override llms.txt Template
Custom format for AI instructions:
```typescript
discovery({
templates: {
llms: async (config, siteURL) => {
const lines = [];
// Header
lines.push(`=`.repeat(60));
lines.push(`AI ASSISTANT GUIDE FOR ${siteURL.hostname.toUpperCase()}`);
lines.push(`=`.repeat(60));
lines.push('');
// Description
const description = typeof config.description === 'function'
? config.description()
: config.description;
if (description) {
lines.push(description);
lines.push('');
}
// Instructions
if (config.instructions) {
lines.push('IMPORTANT INSTRUCTIONS:');
lines.push(config.instructions);
lines.push('');
}
// API endpoints in custom format
if (config.apiEndpoints && config.apiEndpoints.length > 0) {
lines.push('AVAILABLE APIs:');
config.apiEndpoints.forEach(ep => {
lines.push(` [${ep.method || 'GET'}] ${ep.path}`);
lines.push(` → ${ep.description}`);
});
lines.push('');
}
// Footer
lines.push(`=`.repeat(60));
lines.push(`Generated: ${new Date().toISOString()}`);
return lines.join('\n') + '\n';
}
}
})
```
## Override humans.txt Template
Custom humans.txt format:
```typescript
discovery({
templates: {
humans: (config, siteURL) => {
const lines = [];
lines.push('========================================');
lines.push(' HUMANS BEHIND THE SITE ');
lines.push('========================================');
lines.push('');
// Team in custom format
if (config.team && config.team.length > 0) {
lines.push('OUR TEAM:');
lines.push('');
config.team.forEach((member, i) => {
if (i > 0) lines.push('---');
lines.push(`Name : ${member.name}`);
if (member.role) lines.push(`Role : ${member.role}`);
if (member.contact) lines.push(`Email : ${member.contact}`);
if (member.github) lines.push(`GitHub : https://github.com/${member.github}`);
lines.push('');
});
}
// Stack info
if (config.site?.techStack) {
lines.push('BUILT WITH:');
lines.push(config.site.techStack.join(' | '));
lines.push('');
}
return lines.join('\n') + '\n';
}
}
})
```
## Override security.txt Template
Custom security.txt with additional fields:
```typescript
discovery({
templates: {
security: (config, siteURL) => {
const lines = [];
// Canonical (required by RFC 9116)
const canonical = config.canonical ||
new URL('.well-known/security.txt', siteURL).href;
lines.push(`Canonical: ${canonical}`);
// Contact (required)
const contacts = Array.isArray(config.contact)
? config.contact
: [config.contact];
contacts.forEach(contact => {
const contactValue = contact.includes('@') && !contact.startsWith('mailto:')
? `mailto:${contact}`
: contact;
lines.push(`Contact: ${contactValue}`);
});
// Expires (recommended)
const expires = config.expires === 'auto'
? new Date(Date.now() + 365 * 24 * 60 * 60 * 1000).toISOString()
: config.expires;
if (expires) {
lines.push(`Expires: ${expires}`);
}
// Optional fields
if (config.encryption) {
const encryptions = Array.isArray(config.encryption)
? config.encryption
: [config.encryption];
encryptions.forEach(enc => lines.push(`Encryption: ${enc}`));
}
if (config.policy) {
lines.push(`Policy: ${config.policy}`);
}
if (config.acknowledgments) {
lines.push(`Acknowledgments: ${config.acknowledgments}`);
}
// Add custom comment
lines.push('');
lines.push('# Thank you for helping keep our users safe!');
return lines.join('\n') + '\n';
}
}
})
```
## Override canary.txt Template
Custom warrant canary format:
```typescript
discovery({
templates: {
canary: (config, siteURL) => {
const lines = [];
const today = new Date().toISOString().split('T')[0];
lines.push('=== WARRANT CANARY ===');
lines.push('');
lines.push(`Organization: ${config.organization || siteURL.hostname}`);
lines.push(`Date Issued: ${today}`);
lines.push('');
lines.push('As of this date, we confirm:');
lines.push('');
// List what has NOT been received
const statements = typeof config.statements === 'function'
? config.statements()
: config.statements || [];
statements
.filter(s => !s.received)
.forEach(statement => {
lines.push(`✓ NO ${statement.description} received`);
});
lines.push('');
lines.push('This canary will be updated regularly.');
lines.push('Absence of an update should be considered significant.');
lines.push('');
if (config.verification) {
lines.push(`Verification: ${config.verification}`);
}
return lines.join('\n') + '\n';
}
}
})
```
## Combine Default Generator with Custom Content
Use default generator, add custom content:
```typescript
import { generateRobotsTxt } from '@astrojs/discovery/generators';
discovery({
templates: {
robots: (config, siteURL) => {
// Generate default content
const defaultContent = generateRobotsTxt(config, siteURL);
// Add custom rules
const customRules = `
# Custom section
User-agent: MySpecialBot
Crawl-delay: 20
Allow: /special
# Rate limiting comment
# Please be respectful of our server resources
`.trim();
return defaultContent + '\n\n' + customRules + '\n';
}
}
})
```
## Load Template from File
Keep templates separate:
```typescript
// templates/robots.txt.js
export default (config, siteURL) => {
return `
User-agent: *
Allow: /
Sitemap: ${new URL('sitemap-index.xml', siteURL).href}
`.trim() + '\n';
};
```
```typescript
// astro.config.mjs
import robotsTemplate from './templates/robots.txt.js';
discovery({
templates: {
robots: robotsTemplate
}
})
```
## Conditional Template Logic
Different templates per environment:
```typescript
discovery({
templates: {
llms: import.meta.env.PROD
? (config, siteURL) => {
// Production: detailed guide
return `# Production site guide\n...detailed content...`;
}
: (config, siteURL) => {
// Development: simple warning
return `# Development environment\nThis is a development site.\n`;
}
}
})
```
## Template with External Data
Fetch additional data in template:
```typescript
discovery({
templates: {
llms: async (config, siteURL) => {
// Fetch latest API spec
const response = await fetch('https://api.example.com/openapi.json');
const spec = await response.json();
const lines = [];
lines.push(`# ${siteURL.hostname} API Guide`);
lines.push('');
lines.push('Available endpoints:');
Object.entries(spec.paths).forEach(([path, methods]) => {
Object.keys(methods).forEach(method => {
lines.push(`- ${method.toUpperCase()} ${path}`);
});
});
return lines.join('\n') + '\n';
}
}
})
```
## Verify Custom Templates
Test your templates:
```bash
npm run build
npm run preview
# Check each file
curl http://localhost:4321/robots.txt
curl http://localhost:4321/llms.txt
curl http://localhost:4321/humans.txt
curl http://localhost:4321/.well-known/security.txt
```
Ensure format is correct and content appears as expected.
## Expected Result
Your custom templates completely control output format:
**Custom robots.txt**:
```
# Custom robots.txt
# Site: example.com
# Last generated: 2025-11-08T12:00:00.000Z
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap-index.xml
```
**Custom llms.txt**:
```
============================================================
AI ASSISTANT GUIDE FOR EXAMPLE.COM
============================================================
Your site description here
IMPORTANT INSTRUCTIONS:
...
```
## Alternative Approaches
**Partial overrides**: Extend default generators rather than replacing entirely.
**Post-processing**: Generate default content, then modify it with string manipulation.
**Multiple templates**: Use different templates based on configuration flags.
## Common Issues
**Missing newline at end**: Ensure template returns content ending with `\n`.
**Async templates**: llms.txt template can be async, others are sync. Don't mix.
**Type errors**: Template signature must match: `(config: Config, siteURL: URL) => string`
**Breaking specs**: security.txt and robots.txt have specific formats. Don't break them.
**Config not available**: Only config passed to that section is available. Can't access other sections.

@@ -1,31 +1,255 @@
---
title: Customize LLM Instructions
description: Provide custom instructions for AI assistants using llms.txt
---
Configure how AI assistants interact with your site by customizing instructions in llms.txt.
## Prerequisites
- Integration installed and configured
- Understanding of your site's main use cases
- Knowledge of your API endpoints (if applicable)
## Add Basic Instructions
Provide clear guidance for AI assistants:
```typescript
// astro.config.mjs
discovery({
llms: {
description: 'Technical documentation for the Discovery API',
instructions: `
When helping users with this site:
1. Check the documentation before answering
2. Provide code examples when relevant
3. Link to specific documentation pages
4. Use the search API for queries
`.trim()
}
})
```
## Highlight Key Features
Guide AI assistants to important capabilities:
```typescript
discovery({
llms: {
description: 'E-commerce platform for sustainable products',
keyFeatures: [
'Carbon footprint calculator for all products',
'Subscription management with flexible billing',
'AI-powered product recommendations',
'Real-time inventory tracking'
]
}
})
```
## Document Important Pages
Direct AI assistants to critical resources:
```typescript
discovery({
llms: {
importantPages: [
{
name: 'API Documentation',
path: '/docs/api',
description: 'Complete API reference with examples'
},
{
name: 'Getting Started Guide',
path: '/docs/quick-start',
description: 'Step-by-step setup instructions'
},
{
name: 'FAQ',
path: '/help/faq',
description: 'Common questions and solutions'
}
]
}
})
```
## Describe Your APIs
Help AI assistants use your endpoints correctly:
```typescript
discovery({
llms: {
apiEndpoints: [
{
path: '/api/search',
method: 'GET',
description: 'Search products by name, category, or tag'
},
{
path: '/api/products/:id',
method: 'GET',
description: 'Get detailed product information'
},
{
path: '/api/calculate-carbon',
method: 'POST',
description: 'Calculate carbon footprint for a cart'
}
]
}
})
```
## Set Brand Voice Guidelines
Maintain consistent communication style:
```typescript
discovery({
llms: {
brandVoice: [
'Professional yet approachable',
'Focus on sustainability and environmental impact',
'Use concrete examples, not abstract concepts',
'Avoid jargon unless explaining technical features',
'Emphasize long-term value over short-term savings'
]
}
})
```
## Load Content Dynamically
Pull important pages from content collections:
```typescript
import { getCollection } from 'astro:content';
discovery({
llms: {
importantPages: async () => {
const docs = await getCollection('docs');
// Filter to featured pages only
return docs
.filter(doc => doc.data.featured)
.map(doc => ({
name: doc.data.title,
path: `/docs/${doc.slug}`,
description: doc.data.description
}));
}
}
})
```
## Add Custom Sections
Include specialized information:
```typescript
discovery({
llms: {
customSections: {
'Data Privacy': `
We are GDPR compliant. User data is encrypted at rest and in transit.
Data retention policy: 90 days for analytics, 7 years for transactions.
`.trim(),
'Rate Limits': `
API rate limits:
- Authenticated: 1000 requests/hour
- Anonymous: 60 requests/hour
- Burst: 20 requests/second
`.trim(),
'Support Channels': `
For assistance:
- Documentation: https://example.com/docs
- Email: support@example.com (response within 24h)
- Community: https://discord.gg/example
`.trim()
}
}
})
```
## Environment-Specific Instructions
Different instructions for development vs production:
```typescript
discovery({
llms: {
instructions: import.meta.env.PROD
? `Production site - use live API endpoints at https://api.example.com`
: `Development site - API endpoints may be mocked or unavailable`
}
})
```
## Verify Your Configuration
Build and check the output:
```bash
npm run build
npm run preview
curl http://localhost:4321/llms.txt
```
Look for your instructions, features, and API documentation in the formatted output.
## Expected Result
Your llms.txt will contain structured information:
```markdown
# example.com
> E-commerce platform for sustainable products
---
## Key Features
- Carbon footprint calculator for all products
- AI-powered product recommendations
## Instructions for AI Assistants
When helping users with this site:
1. Check the documentation before answering
2. Provide code examples when relevant
## API Endpoints
- `GET /api/search`
Search products by name, category, or tag
Full URL: https://example.com/api/search
```
AI assistants will use this information to provide accurate, context-aware help.
## Alternative Approaches
**Multiple llms.txt files**: Create llms-full.txt for comprehensive docs, llms.txt for summary.
**Dynamic generation**: Use a build script to extract API docs from OpenAPI specs.
**Language-specific versions**: Generate different files for different locales (llms-en.txt, llms-es.txt).
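As a sketch of the dynamic-generation idea, the `apiEndpoints` list could be derived from an OpenAPI document at build time. The spec URL and the use of each operation's `summary` field are assumptions; adapt them to your API:
```typescript
// astro.config.mjs — derive apiEndpoints from an OpenAPI spec (illustrative)
const spec = await fetch('https://api.example.com/openapi.json').then((r) => r.json());

const apiEndpoints = Object.entries(spec.paths).flatMap(([path, methods]) =>
  Object.entries(methods).map(([method, operation]) => ({
    path,
    method: method.toUpperCase(),
    description: operation.summary ?? ''
  }))
);

discovery({
  llms: { apiEndpoints }
})
```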
## Common Issues
**Too much information**: Keep it concise. AI assistants prefer focused, actionable guidance.
**Outdated instructions**: Use `lastUpdate: 'auto'` or automate updates from your CMS.
**Missing context**: Don't assume knowledge. Explain domain-specific terms and workflows.
**Unclear priorities**: List most important pages/features first. AI assistants may prioritize early content.
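For the outdated-instructions issue above, the automatic timestamp mentioned there is a one-line setting:
```typescript
discovery({
  llms: {
    lastUpdate: 'auto' // regenerate the timestamp on every build
  }
})
```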


@@ -3,29 +3,322 @@ title: Environment-specific Configuration
description: Use different configs for dev and production
---
Configure different settings for development and production environments to optimize for local testing vs deployed sites.
## Prerequisites
- Integration installed and configured
- Understanding of Astro environment variables
- Knowledge of your deployment setup
## Basic Environment Switching
Use `import.meta.env.PROD` to detect production:
```typescript
// astro.config.mjs
discovery({
robots: {
// Block all bots in development
allowAllBots: import.meta.env.PROD
}
})
```
Development: Bots blocked. Production: Bots allowed.
## Different Site URLs
Use different domains for staging and production:
```typescript
export default defineConfig({
site: import.meta.env.PROD
? 'https://example.com'
: 'http://localhost:4321',
integrations: [
discovery({
// Config automatically uses correct site URL
})
]
})
```
## Conditional Feature Enablement
Enable security.txt and canary.txt only in production:
```typescript
discovery({
security: import.meta.env.PROD
? {
contact: 'security@example.com',
expires: 'auto'
}
: undefined, // Disabled in development
canary: import.meta.env.PROD
? {
organization: 'Example Corp',
contact: 'canary@example.com',
frequency: 'monthly'
}
: undefined // Disabled in development
})
```
## Environment-Specific Instructions
Different LLM instructions for each environment:
```typescript
discovery({
llms: {
description: import.meta.env.PROD
? 'Production e-commerce platform'
: 'Development/Staging environment - data may be test data',
instructions: import.meta.env.PROD
? `
When helping users:
1. Use production API at https://api.example.com
2. Data is live - be careful with modifications
3. Refer to https://docs.example.com for documentation
`.trim()
: `
Development environment - for testing only:
1. API endpoints may be mocked
2. Database is reset nightly
3. Some features may not work
`.trim()
}
})
```
## Custom Environment Variables
Use `.env` files for configuration:
```bash
# .env.production
PUBLIC_SECURITY_EMAIL=security@example.com
PUBLIC_CANARY_ENABLED=true
PUBLIC_CONTACT_EMAIL=contact@example.com
# .env.development
PUBLIC_SECURITY_EMAIL=dev-security@localhost
PUBLIC_CANARY_ENABLED=false
PUBLIC_CONTACT_EMAIL=dev@localhost
```
Then use in config:
```typescript
discovery({
security: import.meta.env.PUBLIC_SECURITY_EMAIL
? {
contact: import.meta.env.PUBLIC_SECURITY_EMAIL,
expires: 'auto'
}
: undefined,
humans: {
team: [
{
name: 'Team',
contact: import.meta.env.PUBLIC_CONTACT_EMAIL
}
]
}
})
```
## Staging Environment
Support three environments: dev, staging, production:
```typescript
const ENV = import.meta.env.MODE; // 'development', 'staging', or 'production'
const siteURLs = {
development: 'http://localhost:4321',
staging: 'https://staging.example.com',
production: 'https://example.com'
};
export default defineConfig({
site: siteURLs[ENV],
integrations: [
discovery({
robots: {
// Block bots in dev and staging
allowAllBots: ENV === 'production',
additionalAgents: ENV !== 'production'
? [{ userAgent: '*', disallow: ['/'] }]
: []
},
llms: {
description: ENV === 'production'
? 'Production site'
: `${ENV} environment - not for public use`
}
})
]
})
```
Run with: `astro build --mode staging`
## Different Cache Headers
Aggressive caching in production, none in development:
```typescript
discovery({
caching: import.meta.env.PROD
? {
// Production: cache aggressively
robots: 86400,
llms: 3600,
humans: 604800
}
: {
// Development: no caching
robots: 0,
llms: 0,
humans: 0
}
})
```
## Feature Flags
Use environment variables as feature flags:
```typescript
discovery({
webfinger: {
enabled: import.meta.env.PUBLIC_ENABLE_WEBFINGER === 'true',
resources: [/* ... */]
},
canary: import.meta.env.PUBLIC_ENABLE_CANARY === 'true'
? {
organization: 'Example Corp',
frequency: 'monthly'
}
: undefined
})
```
Set in `.env`:
```bash
PUBLIC_ENABLE_WEBFINGER=false
PUBLIC_ENABLE_CANARY=true
```
## Test vs Production Data
Load different team data per environment:
```typescript
import { getCollection } from 'astro:content';
discovery({
humans: {
team: import.meta.env.PROD
? await getCollection('team') // Real team
: [
{
name: 'Test Developer',
role: 'Developer',
contact: 'test@localhost'
}
]
}
})
```
## Preview Deployments
Handle preview/branch deployments:
```typescript
const isPreview = import.meta.env.PREVIEW === 'true';
const isProd = import.meta.env.PROD && !isPreview;
discovery({
robots: {
allowAllBots: isProd, // Block on previews too
additionalAgents: !isProd
? [
{
userAgent: '*',
disallow: ['/']
}
]
: []
}
})
```
## Verify Environment Config
Test each environment:
```bash
# Development
npm run dev
curl http://localhost:4321/robots.txt
# Production build
npm run build
npm run preview
curl http://localhost:4321/robots.txt
# Staging (if configured)
astro build --mode staging
```
Check that content differs appropriately.
## Expected Result
Each environment produces appropriate output:
**Development** - Block all:
```
User-agent: *
Disallow: /
```
**Production** - Allow bots:
```
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap-index.xml
```
## Alternative Approaches
**Config files per environment**: Create `astro.config.dev.mjs` and `astro.config.prod.mjs`.
**Build-time injection**: Use build tools to inject environment-specific values.
**Runtime checks**: For SSR sites, check headers or hostname at runtime.
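A sketch of the config-files-per-environment approach: keep one file per environment and pick it with Astro's `--config` flag. The file name and the integration's package name below are assumptions; mirror whatever your main config imports:
```typescript
// astro.config.prod.mjs — production-only config (file name is illustrative)
import { defineConfig } from 'astro/config';
import discovery from 'astro-discovery'; // adjust to the actual package name

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    discovery({
      robots: { allowAllBots: true },
      security: { contact: 'security@example.com', expires: 'auto' }
    })
  ]
});
```
Build with `astro build --config astro.config.prod.mjs`, and keep a matching `astro.config.dev.mjs` for local work.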
## Common Issues
**Environment variables not available**: Ensure variables are prefixed with `PUBLIC_` for client access.
**Wrong environment detected**: `import.meta.env.PROD` is true for any production build, including preview/branch deployments of that build — use an explicit flag (like `PREVIEW` above) to tell them apart.
**Undefined values**: Provide fallbacks for missing environment variables.
**Inconsistent builds**: Document which environment variables affect the build for reproducibility.
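A minimal fallback pattern for missing variables, using the names from earlier in this guide:
```typescript
// astro.config.mjs — fall back to a safe default when a variable is unset
const securityEmail = import.meta.env.PUBLIC_SECURITY_EMAIL ?? 'security@example.com';

discovery({
  security: { contact: securityEmail, expires: 'auto' }
})
```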


@@ -3,29 +3,240 @@ title: Filter Sitemap Pages
description: Control which pages appear in your sitemap
---
Exclude pages from your sitemap to keep it focused on publicly accessible, valuable content.
## Prerequisites
- Integration installed and configured
- Understanding of which pages should be public
- Knowledge of your site's URL structure
## Exclude Admin Pages
Block administrative and dashboard pages:
```typescript
// astro.config.mjs
discovery({
sitemap: {
filter: (page) => !page.includes('/admin')
}
})
```
This removes all URLs containing `/admin` from the sitemap.
## Exclude Multiple Path Patterns
Filter out several types of pages:
```typescript
discovery({
sitemap: {
filter: (page) => {
return !page.includes('/admin') &&
!page.includes('/draft') &&
!page.includes('/private') &&
!page.includes('/test');
}
}
})
```
## Exclude by File Extension
Remove API endpoints or non-HTML pages:
```typescript
discovery({
sitemap: {
filter: (page) => {
return !page.endsWith('.json') &&
!page.endsWith('.xml') &&
!page.includes('/api/');
}
}
})
```
## Include Only Specific Directories
Allow only documentation and blog posts:
```typescript
discovery({
sitemap: {
filter: (page) => {
const url = new URL(page);
const path = url.pathname;
return path.startsWith('/docs/') ||
path.startsWith('/blog/') ||
path === '/';
}
}
})
```
## Exclude by Environment
Different filtering for development vs production:
```typescript
discovery({
sitemap: {
filter: (page) => {
// Exclude drafts in production
if (import.meta.env.PROD && page.includes('/draft')) {
return false;
}
// Exclude test pages in production
if (import.meta.env.PROD && page.includes('/test')) {
return false;
}
return true;
}
}
})
```
## Filter Based on Page Metadata
Use frontmatter or metadata to control inclusion:
```typescript
discovery({
sitemap: {
serialize: (item) => {
// Exclude pages marked as noindex
// Note: You'd need to access page metadata here
// This is a simplified example
return item;
},
filter: (page) => {
// Basic path-based filtering
return !page.includes('/internal/');
}
}
})
```
## Combine with Custom Pages
Add non-generated pages while filtering others:
```typescript
discovery({
sitemap: {
filter: (page) => !page.includes('/admin'),
customPages: [
'https://example.com/special-page',
'https://example.com/external-content'
]
}
})
```
## Use Regular Expressions
Advanced pattern matching:
```typescript
discovery({
sitemap: {
filter: (page) => {
// Exclude pages with query parameters
if (page.includes('?')) return false;
// Exclude paginated pages except first page
if (/\/page\/\d+/.test(page)) return false;
// Exclude temp or staging paths
if (/\/(temp|staging|wip)\//.test(page)) return false;
return true;
}
}
})
```
## Filter User-Generated Content
Exclude user profiles or dynamic content:
```typescript
discovery({
sitemap: {
filter: (page) => {
  const path = new URL(page).pathname;
  // Include the main user directory page
  if (path === '/users' || path === '/users/') return true;
  // Exclude individual user pages
  if (path.startsWith('/users/')) return false;
  // Exclude comment threads
  if (path.includes('/comments/')) return false;
  return true;
}
}
})
```
## Verify Your Filter
Test your filter logic:
```bash
npm run build
npm run preview
# Check sitemap
curl http://localhost:4321/sitemap-index.xml
# Look for excluded pages (should not appear)
curl http://localhost:4321/sitemap-0.xml | grep '/admin'
```
If grep returns nothing, your filter is working.
## Expected Result
Your sitemap will only contain allowed pages. Excluded pages won't appear:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
</url>
<url>
<loc>https://example.com/blog/post-1</loc>
</url>
<!-- No /admin, /draft, or /private pages -->
</urlset>
```
## Alternative Approaches
**robots.txt blocking**: Block crawling entirely using robots.txt instead of just omitting from sitemap.
**Meta robots tag**: Add `<meta name="robots" content="noindex">` to pages you want excluded.
**Separate sitemaps**: Create multiple sitemap files for different sections, only submit public ones.
**Dynamic generation**: Generate sitemaps at runtime based on user permissions or content status.
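For instance, the robots.txt approach listed first can sit alongside the sitemap filter, reusing the `additionalAgents` shape shown in the environment guide (the paths are illustrative):
```typescript
discovery({
  robots: {
    // Ask crawlers to stay out entirely, not just omit the pages from the sitemap
    additionalAgents: [
      { userAgent: '*', disallow: ['/admin', '/draft', '/private'] }
    ]
  },
  sitemap: {
    filter: (page) => !page.includes('/admin')
  }
})
```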
## Common Issues
**Too restrictive**: Double-check your filter doesn't exclude important pages. Test thoroughly.
**Case sensitivity**: URL paths are case-sensitive. `/Admin` and `/admin` are different.
**Trailing slashes**: Be consistent. `/page` and `/page/` may both exist. Handle both.
**Query parameters**: Decide whether to include pages with query strings. Usually exclude them.
**Performance**: Complex filter functions run for every page. Keep logic simple for better build times.
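To sidestep the case-sensitivity and trailing-slash pitfalls above, normalize the path once before testing it (a small illustrative sketch):
```typescript
discovery({
  sitemap: {
    filter: (page) => {
      // Lowercase and strip any trailing slash so '/Admin/' and '/admin' hit the same rules
      const path = new URL(page).pathname.toLowerCase().replace(/\/$/, '');
      return !path.startsWith('/admin') && !path.startsWith('/draft');
    }
  }
})
```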


@@ -29,7 +29,7 @@
]
},
"howto": {
"status": "ready",
"branch": "docs/howto-content",
"worktree": "docs-howto",
"pages": [
@@ -44,10 +44,20 @@
"how-to/activitypub.md"
],
"dependencies": ["tutorials"],
"completed_pages": [
"how-to/block-bots.md",
"how-to/customize-llm-instructions.md",
"how-to/add-team-members.md",
"how-to/filter-sitemap.md",
"how-to/cache-headers.md",
"how-to/environment-config.md",
"how-to/content-collections.md",
"how-to/custom-templates.md",
"how-to/activitypub.md"
]
},
"reference": {
"status": "ready",
"branch": "docs/reference-content",
"worktree": "docs-reference",
"pages": [
@@ -64,7 +74,19 @@
"reference/typescript.md"
],
"dependencies": [],
"completed_pages": [
"reference/configuration.md",
"reference/api.md",
"reference/robots.md",
"reference/llms.md",
"reference/humans.md",
"reference/security.md",
"reference/canary.md",
"reference/webfinger.md",
"reference/sitemap.md",
"reference/cache.md",
"reference/typescript.md"
]
},
"explanation": {
"status": "executing",