The Bluesky team published Proposal 0008: User Intents for Data Reuse last year, along with a discussion thread. The proposal describes a robots.txt-style mechanism for users to declare preferences about how their public data gets reused. It covers four broad categories: generative AI, protocol bridging, bulk datasets, and public archiving. The design is deliberately simple: a single record with tri-state booleans.
I think time-stamped tri-state booleans with strong defaults make sense, but we also need extension and inheritance to handle real-world use cases that warrant finer granularity. A user who is fine with retrieval-augmented generation but opposed to training-data inclusion is making a meaningfully different statement than someone who denies all AI usage. And users should be able to carve out exceptions for specific entities or content types without abandoning their defaults.
This post introduces community.lexicon.preference.ai, a lexicon schema that decomposes AI preferences into distinct categories and adds a scoped override mechanism for exceptions.
Three questions
The design starts from three questions.
How does a user signal their immediate AI preferences? A record in the user's ATProto repository, discoverable via getRecord, broadcast over the firehose, and included in CAR exports. The record lives at the well-known key self in the community.lexicon.preference.ai collection so any consumer can find it with a single call.
How does a user signal they've changed their preferences? Two layers. The record itself carries an updatedAt timestamp that tells polling consumers "something changed, re-evaluate." Each individual preference also carries its own updatedAt, so a consumer can determine exactly which preference changed and when. This matters for compliance pipelines that need cutoff dates. The ATProto commit log provides the full historical audit trail for anyone who needs it.
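The per-preference timestamps make cutoff checks straightforward. A minimal sketch, assuming a compliance pipeline that snapshotted data at some date and wants to know whether a specific preference has changed since (the record shape follows the lexicon below; the cutoff date is illustrative):

```python
from datetime import datetime, timezone

def changed_since(preference: dict, cutoff: datetime) -> bool:
    """True if this specific preference changed after a compliance cutoff."""
    ts = datetime.fromisoformat(preference["updatedAt"].replace("Z", "+00:00"))
    return ts > cutoff

# A pipeline that snapshotted data on 2026-01-01 re-checks one preference:
pref = {"allow": False, "updatedAt": "2026-04-04T12:00:00.000Z"}
cutoff = datetime(2026, 1, 1, tzinfo=timezone.utc)
needs_review = changed_since(pref, cutoff)  # changed after the snapshot
```

Because each preference carries its own timestamp, a pipeline only re-evaluates the categories that actually changed, rather than reprocessing on every record touch.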
How does a user carve out exceptions? Multiple records in the same collection. The self-keyed record is the default policy. Additional records keyed by TID are scoped overrides that target specific entities (by DID or domain) or specific collections (by NSID). Each override only needs to declare the preferences it changes. Everything else falls through to the default.
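Discovery can be sketched against the standard com.atproto.repo.getRecord and com.atproto.repo.listRecords XRPC endpoints. The helpers below just build the request URL and query parameters for any HTTP client; the PDS host and DID are placeholders (a real consumer resolves the PDS from the user's DID document):

```python
COLLECTION = "community.lexicon.preference.ai"

def default_record_request(pds: str, repo: str) -> tuple[str, dict]:
    """Request for the account-wide default policy at the well-known key 'self'."""
    return (
        f"{pds}/xrpc/com.atproto.repo.getRecord",
        {"repo": repo, "collection": COLLECTION, "rkey": "self"},
    )

def override_records_request(pds: str, repo: str) -> tuple[str, dict]:
    """Request listing the whole collection; every record other than
    'self' is a TID-keyed scoped override."""
    return (
        f"{pds}/xrpc/com.atproto.repo.listRecords",
        {"repo": repo, "collection": COLLECTION},
    )

# Placeholder host and DID for illustration only:
url, params = default_record_request("https://pds.example.com", "did:plc:alice")
```

Two requests cover everything: one getRecord call for the default policy, one listRecords call for all overrides.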
Preference categories
The Bluesky proposal groups all generative AI usage under a single syntheticContentGeneration flag. This lexicon breaks that apart into four categories.
training: Use of data as input for training, fine-tuning, distillation, or RLHF. This is what most users think about when they think about AI and their data.
inference: Use of data at inference time for retrieval, RAG, or context injection. The data is used but not baked into model weights.
syntheticContent: Use of data to generate new content or interactions derived from user data. Style imitation, content generation, synthetic personas.
embedding: Use of data for vector embeddings or semantic indexing. The Bluesky proposal explicitly excludes embeddings from its AI category. Some users will want control over this too, so it gets its own knob.
Each preference is tri-state: allow (true), deny (false), or undefined (field omitted). Undefined means the user has expressed no opinion, and consumers are left to their own policy decisions.
Scoping and overrides
Every record declares a scope that says what it applies to. This is a union of three types.
globalScope is the account-wide default. The record at key self should carry this scope. If a consumer finds no matching override for their situation, the global scope record is what applies.
entityScope targets a specific AI consumer identified by DID or domain. This is the "allow Anthropic for training even though my default is deny" case.
collectionScope targets a specific NSID in the user's repository. This is the "deny all AI usage of my images even though my default allows inference" case.
Making scope required on every record (rather than optional with an implicit "global if missing") avoids ambiguity when someone accidentally creates two scopeless records. Every record is self-describing. A consumer reading any single record knows exactly what it applies to without inspecting the record key.
Resolution is merge, not replace. An entity override that only declares training: { allow: true } inherits the global record's stance on inference, embedding, and everything else. Overrides are surgical.
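A minimal sketch of this fall-through, treating a preference set as a plain dict keyed by category: a shallow merge is all that's needed, because overrides only declare the categories they change.

```python
def merge_preferences(default: dict, override: dict) -> dict:
    """Merge, not replace: categories the override declares win;
    everything else falls through to the default policy."""
    return {**default, **override}

global_prefs = {
    "training": {"allow": False},
    "inference": {"allow": True},
}
entity_override = {"training": {"allow": True}}

effective = merge_preferences(global_prefs, entity_override)
# training comes from the override; inference falls through from the default.
```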
The lexicon
{
"lexicon": 1,
"id": "community.lexicon.preference.ai",
"description": "Declares a user's preferences regarding AI usage of their public data. A record at key 'self' with globalScope establishes default preferences. Additional records keyed by TID establish scoped overrides for specific entities or content collections.",
"defs": {
"main": {
"type": "record",
"key": "any",
"record": {
"type": "object",
"required": ["updatedAt", "scope", "preferences"],
"properties": {
"updatedAt": {
"type": "string",
"format": "datetime",
"description": "Timestamp of the most recent change to this record."
},
"scope": {
"type": "union",
"description": "What this record's preferences apply to.",
"refs": ["#globalScope", "#entityScope", "#collectionScope"]
},
"preferences": {
"type": "ref",
"ref": "#preferenceSet"
}
}
}
},
"preferenceSet": {
"type": "object",
"description": "A set of AI usage preferences. Omitted fields mean undefined (no declared preference).",
"properties": {
"training": {
"type": "ref",
"ref": "#preference",
"description": "Use as input for training, fine-tuning, distillation, or RLHF of ML models."
},
"inference": {
"type": "ref",
"ref": "#preference",
"description": "Use at inference time for retrieval, RAG, or context injection."
},
"syntheticContent": {
"type": "ref",
"ref": "#preference",
"description": "Use to generate synthetic content or interactions derived from user data."
},
"embedding": {
"type": "ref",
"ref": "#preference",
"description": "Use for vector embeddings or semantic indexing."
}
}
},
"preference": {
"type": "object",
"required": ["allow", "updatedAt"],
"properties": {
"allow": {
"type": "boolean",
"description": "Whether this usage is permitted (true) or denied (false)."
},
"updatedAt": {
"type": "string",
"format": "datetime",
"description": "When this specific preference was last changed."
}
}
},
"globalScope": {
"type": "object",
"description": "Account-wide default. The record at key 'self' should carry this scope."
},
"entityScope": {
"type": "object",
"description": "Scopes preferences to a specific AI consumer.",
"required": ["entity"],
"properties": {
"entity": {
"type": "string",
"description": "DID or domain of the entity this override applies to."
}
}
},
"collectionScope": {
"type": "object",
"description": "Scopes preferences to a specific record collection in the user's repository.",
"required": ["collection"],
"properties": {
"collection": {
"type": "string",
"format": "nsid",
"description": "NSID of the collection this override applies to."
}
}
}
}
}
Example records
Global default (key: self)
A user who denies training and synthetic content generation but allows inference. Embedding is left undefined.
{
"$type": "community.lexicon.preference.ai",
"updatedAt": "2026-04-04T12:00:00.000Z",
"scope": {
"$type": "community.lexicon.preference.ai#globalScope"
},
"preferences": {
"training": {
"allow": false,
"updatedAt": "2026-04-04T12:00:00.000Z"
},
"inference": {
"allow": true,
"updatedAt": "2026-04-04T12:00:00.000Z"
},
"syntheticContent": {
"allow": false,
"updatedAt": "2026-04-04T12:00:00.000Z"
}
}
}
Entity override (key: TID)
The same user grants a specific entity permission to use their data for training, overriding the global deny. All other preferences inherit from the global default.
{
"$type": "community.lexicon.preference.ai",
"updatedAt": "2026-04-04T13:00:00.000Z",
"scope": {
"$type": "community.lexicon.preference.ai#entityScope",
"entity": "did:plc:example-ai-company"
},
"preferences": {
"training": {
"allow": true,
"updatedAt": "2026-04-04T13:00:00.000Z"
}
}
}
Collection override (key: TID)
The same user denies all AI usage of records in a specific collection, regardless of the global default.
{
"$type": "community.lexicon.preference.ai",
"updatedAt": "2026-04-04T14:00:00.000Z",
"scope": {
"$type": "community.lexicon.preference.ai#collectionScope",
"collection": "app.bsky.feed.post"
},
"preferences": {
"training": {
"allow": false,
"updatedAt": "2026-04-04T14:00:00.000Z"
},
"inference": {
"allow": false,
"updatedAt": "2026-04-04T14:00:00.000Z"
},
"syntheticContent": {
"allow": false,
"updatedAt": "2026-04-04T14:00:00.000Z"
},
"embedding": {
"allow": false,
"updatedAt": "2026-04-04T14:00:00.000Z"
}
}
}
Consumer resolution
A consumer resolving preferences for a given request follows this order:
1. Check for an entity-scoped override matching the consumer's DID or domain.
2. Check for a collection-scoped override matching the content's NSID.
3. Fall back to the global default at key self.
For any matched override, declared preferences take effect and undeclared preferences fall through to the global default. If the global default also omits a preference, the result is undefined and the consumer applies their own policy.
When both an entity override and a collection override match, the more specific combination should arguably win, but specifying that properly gets complicated. For v1, I'd recommend treating compound matches as undefined behavior and encouraging consumers to apply whichever override they find first. Compound scope resolution is worth specifying properly in a future version once real usage patterns emerge.
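The full resolution order can be sketched as one pass over the user's records. This is an assumed reference implementation, not normative: it matches scope `$type` values by fragment suffix, checks entity overrides before collection overrides (consistent with the v1 "first found" recommendation above), and merges the winning override over the global default.

```python
def resolve(records: list[dict], consumer: str, content_nsid: str) -> dict:
    """Resolve effective preferences for one (consumer, content) pair.

    records: every record in the user's preference collection.
    consumer: the consumer's DID or domain.
    content_nsid: the NSID of the content being evaluated.
    """
    default: dict = {}
    entity_match = collection_match = None
    for rec in records:
        scope = rec["scope"]
        t = scope["$type"]
        if t.endswith("#globalScope"):
            default = rec["preferences"]
        elif t.endswith("#entityScope") and scope["entity"] == consumer:
            entity_match = rec["preferences"]
        elif t.endswith("#collectionScope") and scope["collection"] == content_nsid:
            collection_match = rec["preferences"]
    # Step 1: entity override; step 2: collection override; step 3: global only.
    override = entity_match if entity_match is not None else (collection_match or {})
    # Declared override categories win; the rest falls through to the default.
    return {**default, **override}

def decide(effective: dict, category: str):
    """Tri-state read: True (allow), False (deny), None (undefined)."""
    pref = effective.get(category)
    return None if pref is None else pref["allow"]
```

Run against the example records above, a request from the permitted entity resolves training to allow (from the override), inference to allow (falling through from the global default), and embedding to undefined.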
Relationship to the Bluesky user intents proposal
This lexicon is complementary to Proposal 0008, not a replacement. The Bluesky proposal owns the broad categories (bridging, archiving, bulk datasets) and the coarse syntheticContentGeneration flag at the protocol level. community.lexicon.preference.ai decomposes the AI dimension with finer granularity and adds the exception mechanism that the proposal deliberately omits.
A consumer could check both: the user-intents record for the high-level signal, and the AI preference records for nuance. If the user-intents record denies syntheticContentGeneration and the AI preference record allows inference, the AI preference record is the more specific signal and should take precedence for inference-related use cases.
The IETF is also working on related standards through the AI Preferences working group, including Short Usage Preference Strings and a vocabulary for AI training preferences. As those standards mature, the preference categories in this lexicon can evolve to align with whatever consensus vocabulary emerges. The scoping and override mechanism is independent of the specific categories and should remain stable.
What's next
A PR to add this lexicon to lexicon.community is live at lexicon-community/lexicon #72. If you want to discuss the design or propose changes, open an issue or find me on Bluesky. Once the lexicon lands, the immediate next steps are getting it into Lexicon Garden and other indexes for discoverability, building a simple settings UI, and writing a reference consumer that demonstrates the resolution logic.