A RAG demo is a notebook that calls an embedding model, stuffs three chunks into a prompt, and returns plausible text. A production RAG system is a governed pipeline: it knows where every answer came from, refuses to leak PII, never talks about topics you forbid, runs over private networking, and costs a predictable amount per month. This guide builds that second thing on Amazon Bedrock — Knowledge Bases, a vector store, Guardrails, VPC endpoints, and the observability you need to trust it in front of users.
From prototype to production: the architecture
The moving parts and how they fit:
User query
|
v
[App in private subnet] --(VPC endpoint)--> [Bedrock Agent Runtime]
| |
| RetrieveAndGenerate | 1. embed query
| | 2. vector search
| v
| [Vector store: OpenSearch
| Serverless / Aurora pgvector]
| ^
| | ingestion job
| [Knowledge Base] <-- [S3 data source]
| |
| Guardrail applied on input + output |
v v
[Grounded answer + citations] [Foundation model: generation]
Three Bedrock surfaces matter here, and IAM policies and VPC endpoints are per-service: bedrock is the control plane (model access, Guardrail and Knowledge Base definitions); bedrock-runtime invokes models directly (InvokeModel, Converse); bedrock-agent-runtime runs the RAG orchestration (Retrieve, RetrieveAndGenerate).
| Layer | Service | What it owns |
|---|---|---|
| Model access | Bedrock console / bedrock |
Which foundation models the account may call |
| Knowledge | Bedrock Knowledge Bases | Chunking, embeddings, ingestion, vector store wiring |
| Retrieval + generation | bedrock-agent-runtime |
Vector search and grounded answer assembly |
| Safety | Bedrock Guardrails | Content filters, PII, denied topics, grounding checks |
| Network + crypto | VPC endpoints, KMS | Private path, encryption at rest |
Assume us-east-1 and the AWS CLI v2 throughout. Generation examples use Anthropic Claude on Bedrock; embeddings use Amazon Titan Text Embeddings v2.
Step 1 — Model access and throughput
Nothing works until you request access to the specific models you intend to use. This is a one-time per-account, per-region action in the Bedrock console under Model access, and it is the single most common reason a first Retrieve call fails with AccessDeniedException.
You need access to two model families: an embeddings model for the Knowledge Base and a text model for generation. Confirm what is granted:
# Models your account can actually call, by output modality
aws bedrock list-foundation-models \
--region us-east-1 \
--by-output-modality TEXT \
--query "modelSummaries[].modelId" --output table
aws bedrock list-foundation-models \
--region us-east-1 \
--by-output-modality EMBEDDING \
--query "modelSummaries[].modelId" --output table
On-demand vs. provisioned throughput
On-demand is pay-per-token with shared, account-level service quotas. It is correct for almost every workload starting out. Provisioned Throughput buys dedicated capacity in model units on an hourly commitment (1-month or 6-month terms are cheaper than no-commitment). Reach for it only when you have a sustained, predictable request rate that bumps into on-demand throttling, or a latency SLA that the shared pool cannot guarantee.
Do not buy Provisioned Throughput to fix sporadic
ThrottlingExceptions. First request a quota increase on the relevant on-demand TPM/RPM quota in Service Quotas, and add client-side retry with exponential backoff. A 6-month commitment to fix a bursty dev workload is an expensive mistake.
A second knob: cross-region inference profiles (IDs prefixed by a geography, e.g. us.anthropic...) route requests across regions to raise effective throughput and resilience. Use the inference profile ID rather than the bare model ID for production generation — it is the better default.
Step 2 — Building the Knowledge Base
A Knowledge Base owns the ingestion pipeline: it reads a data source, chunks the documents, embeds the chunks, and writes vectors into your store. Land your corpus in S3 first; S3 is the most common and best-understood data source (others include web crawlers, Confluence, SharePoint, and Salesforce).
Chunking strategy
Chunking is the decision that most affects retrieval quality, and it is awkward to change later because re-chunking means re-embedding everything. Bedrock offers several strategies:
| Strategy | Best for | Trade-off |
|---|---|---|
| Fixed-size | Uniform prose, simplest baseline | Can split mid-thought |
| Default | General purpose (~300 tokens) | Reasonable, opinionated |
| Hierarchical | Long structured docs; parent/child context | More vectors, more cost |
| Semantic | Topic-coherent boundaries | Higher ingestion cost |
| None | You pre-chunked upstream | You own the splitting |
Start with fixed-size around 300-500 tokens with ~20% overlap, measure retrieval, and only move to hierarchical or semantic if recall is poor on long documents. Overlap matters: it keeps a sentence that straddles a boundary retrievable from either chunk.
Embeddings
Titan Text Embeddings v2 supports 256, 512, or 1024 dimensions — 1024 is the quality default; drop to 512 if storage cost dominates and you tolerate slightly lower recall. The dimension you pick must match the vector field dimension in the store exactly, or ingestion fails.
Create the Knowledge Base pointing at an existing vector store (built in Step 3) with an OpenSearch Serverless configuration:
{
"name": "kb-product-docs",
"roleArn": "arn:aws:iam::111122223333:role/BedrockKBRole",
"knowledgeBaseConfiguration": {
"type": "VECTOR",
"vectorKnowledgeBaseConfiguration": {
"embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
"embeddingModelConfiguration": {
"bedrockEmbeddingModelConfiguration": { "dimensions": 1024 }
}
}
},
"storageConfiguration": {
"type": "OPENSEARCH_SERVERLESS",
"opensearchServerlessConfiguration": {
"collectionArn": "arn:aws:aoss:us-east-1:111122223333:collection/abc123",
"vectorIndexName": "bedrock-kb-index",
"fieldMapping": {
"vectorField": "bedrock-kb-vector",
"textField": "AMAZON_BEDROCK_TEXT_CHUNK",
"metadataField": "AMAZON_BEDROCK_METADATA"
}
}
}
}
aws bedrock-agent create-knowledge-base --region us-east-1 \
--cli-input-json file://kb.json
Then attach the S3 data source and run an ingestion job. Re-run the ingestion job whenever the corpus changes — Bedrock syncs incrementally:
aws bedrock-agent start-ingestion-job --region us-east-1 \
--knowledge-base-id KB123 --data-source-id DS456
Step 3 — Choosing the vector store
Bedrock can manage an OpenSearch Serverless collection for you, or you bring Aurora PostgreSQL with pgvector. (Other supported stores include Pinecone, MongoDB Atlas, Neptune Analytics, and the newer S3 Vectors option.) The two you will weigh most often:
| Dimension | OpenSearch Serverless | Aurora PostgreSQL + pgvector |
|---|---|---|
| Setup | Bedrock can create it for you | You provision the cluster |
| Floor cost | Minimum OCU billing, always-on | Scales toward zero with Serverless v2 |
| Best when | You want managed, fast to stand up | You already run Postgres / want SQL joins |
| Operational model | Search-native | Familiar RDBMS, one less system |
OpenSearch Serverless is the fastest path and the right default if you do not already operate Postgres. Be aware it has a non-trivial minimum OCU cost even when idle — a real consideration for low-traffic internal tools.
Aurora pgvector wins when you already run Postgres and want one fewer system to operate, or when you want to filter vectors with ordinary SQL WHERE clauses against the same database. Bedrock connects to Aurora via the RDS Data API using credentials in Secrets Manager. The schema must exist before you create the Knowledge Base — the table needs a vector column, a text column, a metadata jsonb column, and a primary key:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE bedrock_kb (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
embedding vector(1024), -- MUST match the embedding dimension
chunks text, -- the text chunk
metadata jsonb -- source metadata for citations
);
-- HNSW index for approximate nearest-neighbour search
CREATE INDEX ON bedrock_kb
USING hnsw (embedding vector_cosine_ops);
The vector(1024) dimension and the vector_cosine_ops distance operator must line up with how the Knowledge Base embeds. Cosine is the standard choice for Titan embeddings.
Step 4 — Retrieve-and-generate with citations
With the Knowledge Base populated, the application calls bedrock-agent-runtime. RetrieveAndGenerate does the whole loop — embed query, search, assemble a grounded prompt, generate — and returns the answer with citations mapping each span back to source chunks. This citation payload is what separates a defensible enterprise answer from a confident hallucination.
import boto3
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
resp = client.retrieve_and_generate(
input={"text": "What is our data retention policy for EU customers?"},
retrieveAndGenerateConfiguration={
"type": "KNOWLEDGE_BASE",
"knowledgeBaseConfiguration": {
"knowledgeBaseId": "KB123",
"modelArn": "arn:aws:bedrock:us-east-1:111122223333:inference-profile/"
"us.anthropic.claude-sonnet-4-20250514-v1:0",
"retrievalConfiguration": {
"vectorSearchConfiguration": {"numberOfResults": 5}
},
"generationConfiguration": {
"guardrailConfiguration": {
"guardrailId": "gr-abc123",
"guardrailVersion": "1"
}
}
}
},
)
print(resp["output"]["text"])
for c in resp["citations"]:
for ref in c["retrievedReferences"]:
print("source:", ref["location"])
Two knobs do most of the work. numberOfResults controls how many chunks feed the model — more context costs more tokens and can dilute relevance; 4-6 is a sane starting band. The retrieval config also supports a SEMANTIC or HYBRID search type — hybrid (vector similarity plus keyword matching) noticeably improves recall when queries contain exact identifiers, error codes, or SKUs.
To run retrieval and generation separately — to re-rank chunks, inject your own system prompt, or call a model outside the Knowledge Base — use Retrieve for raw chunks and then Converse against bedrock-runtime. RetrieveAndGenerate is the right default; the split gives you control when you need it.
Step 5 — Safety and compliance with Guardrails
A Guardrail is a policy object you attach at invocation time. The same Guardrail can protect direct Converse calls and RetrieveAndGenerate — define once, apply everywhere. It evaluates both the user input and the model output. Four controls carry most production requirements:
- Content filters — strength-graded filters (NONE/LOW/MEDIUM/HIGH) for hate, insults, sexual content, violence, misconduct, and a prompt-attack filter for jailbreak attempts.
- Denied topics — natural-language definitions of subjects the assistant must refuse (e.g. “investment advice”), independent of whether the content is otherwise harmful.
- Sensitive information / PII — detect-and-
ANONYMIZE(mask) orBLOCKfor built-in PII types, plus your own regex patterns. - Contextual grounding check — scores whether the answer is grounded in the retrieved source and relevant to the query; below your threshold, it blocks. This is the strongest available lever against RAG hallucination.
aws bedrock create-guardrail --region us-east-1 \
--name "support-assistant-guardrail" \
--blocked-input-messaging "I can't help with that request." \
--blocked-outputs-messaging "I can't provide that information." \
--content-policy-config '{
"filtersConfig": [
{"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"}
]
}' \
--sensitive-information-policy-config '{
"piiEntitiesConfig": [
{"type": "EMAIL", "action": "ANONYMIZE"},
{"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"}
]
}' \
--topic-policy-config '{
"topicsConfig": [
{"name": "investment-advice", "type": "DENY",
"definition": "Recommendations to buy, sell, or hold specific securities."}
]
}'
PROMPT_ATTACKonly makes sense withoutputStrength: NONE— it is an input-side filter. Setting an output strength on it is a configuration smell. Note that the contextual grounding check is its own policy block configured separately from content filters.
Guardrails are versioned. The create call produces a working DRAFT; cut an immutable numbered version with create-guardrail-version and reference that version from your application. Never point production at DRAFT — it changes under you whenever someone edits the Guardrail.
Step 6 — Private networking, KMS, and least-privilege IAM
By default Bedrock calls traverse the public AWS API endpoints. For a system handling regulated data, keep traffic on the AWS network with interface VPC endpoints (PrivateLink). You need one per service surface your app touches:
# Runtime endpoint for Converse / InvokeModel
aws ec2 create-vpc-endpoint --region us-east-1 \
--vpc-id vpc-0abc --vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.bedrock-runtime \
--subnet-ids subnet-0a subnet-0b \
--security-group-ids sg-0endpoint \
--private-dns-enabled
# Agent runtime endpoint for Retrieve / RetrieveAndGenerate
aws ec2 create-vpc-endpoint --region us-east-1 \
--vpc-id vpc-0abc --vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.bedrock-agent-runtime \
--subnet-ids subnet-0a subnet-0b \
--security-group-ids sg-0endpoint \
--private-dns-enabled
With --private-dns-enabled, the normal service hostnames resolve to private endpoint IPs inside the VPC, so the SDK needs no code change. The endpoint security group must permit HTTPS (443) from your app’s security group, and an endpoint policy can scope which actions and resources are reachable through it.
For encryption, use a customer-managed KMS key for the Knowledge Base (and the OpenSearch Serverless collection / Aurora cluster) rather than the AWS-managed default — it lets you audit key usage in CloudTrail and revoke access decisively. The Bedrock service role needs kms:Decrypt and kms:GenerateDataKey on that key.
IAM is where most teams over-grant. The Knowledge Base service role needs exactly: read on the S3 data source, the specific embedding model, the vector store API (aoss:APIAccessAll for OpenSearch Serverless, or rds-data:* plus the secret for Aurora), and the KMS key. The application’s role is narrower still — it only invokes runtime APIs and should be pinned to specific model and Guardrail ARNs:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "InvokeRagWithGuardrail",
"Effect": "Allow",
"Action": [
"bedrock:RetrieveAndGenerate",
"bedrock:Retrieve",
"bedrock:InvokeModel"
],
"Resource": [
"arn:aws:bedrock:us-east-1:111122223333:knowledge-base/KB123",
"arn:aws:bedrock:us-east-1::foundation-model/*",
"arn:aws:bedrock:us-east-1:111122223333:inference-profile/*",
"arn:aws:bedrock:us-east-1:111122223333:guardrail/gr-abc123"
]
}
]
}
Observability, evaluation, and cost control
You cannot operate what you cannot see. Three pillars:
Model-invocation logging. Off by default. Turn it on to capture full request/response (and embeddings) to CloudWatch Logs or S3 — essential for debugging bad answers and for audit:
aws bedrock put-model-invocation-logging-configuration --region us-east-1 \
--logging-config '{
"cloudWatchConfig": {
"logGroupName": "/bedrock/model-invocations",
"roleArn": "arn:aws:iam::111122223333:role/BedrockLoggingRole"
},
"textDataDeliveryEnabled": true,
"embeddingDataDeliveryEnabled": true
}'
Evaluation. Do not eyeball quality. Bedrock offers RAG evaluation (and model evaluation) jobs that score retrieval and end-to-end answers against a dataset using metrics like correctness, completeness, and faithfulness/groundedness. Run an eval job on every chunking, embedding-dimension, or numberOfResults change so you can prove a tweak helped instead of guessing.
Cost control. Two layers. First, Application Inference Profiles let you tag inference and attribute Bedrock spend per team or app in Cost Explorer — without tagging, a shared account becomes an un-splittable bill. Second, set CloudWatch alarms on token-count and invocation metrics so a runaway loop or a load test does not become a surprise invoice. Throttling is a cost control, not just a reliability one.
Enterprise scenario
A fintech support team shipped a Bedrock RAG assistant over policy PDFs. It passed every demo, then in the first week of production a customer asked, “What’s the refund window?” and got a fluent, completely wrong answer citing a deprecated 2022 policy. The grounding check was on, so why did it pass? Two findings. First, the old PDF was still in the S3 prefix — nobody had deleted it, and Bedrock’s incremental sync only adds and updates; it does not remove an object from the index just because you stopped caring about it. The stale chunk was legitimately “grounded.” Second, hybrid search surfaced it because the deprecated doc used the exact phrase “refund window” while the current policy said “return period.”
The fix was twofold. They moved retired documents out of the data-source prefix entirely and forced a clean re-sync, because deleting from S3 only drops from the index on the next ingestion job:
aws s3 mv s3://kb-policy-docs/active/refund-2022.pdf \
s3://kb-archive/refund-2022.pdf
aws bedrock-agent start-ingestion-job --region us-east-1 \
--knowledge-base-id KB123 --data-source-id DS456
Then they added a metadata filter so retrieval only ever sees current documents, tagging each chunk with a status field in its .metadata.json sidecar and filtering on it:
"retrievalConfiguration": {
"vectorSearchConfiguration": {
"numberOfResults": 5,
"filter": {"equals": {"key": "status", "value": "active"}}
}
}
The lesson: grounding answers the question “is this supported by a retrieved chunk?” — not “is this chunk one you should still be serving?” Corpus hygiene and metadata filtering are the controls for the second question, and no Guardrail substitutes for them.
Verify
Prove the full path end to end before declaring victory:
# 1. The KB is ACTIVE and the last ingestion succeeded
aws bedrock-agent get-knowledge-base --region us-east-1 \
--knowledge-base-id KB123 --query "knowledgeBase.status"
aws bedrock-agent list-ingestion-jobs --region us-east-1 \
--knowledge-base-id KB123 --data-source-id DS456 \
--query "ingestionJobSummaries[0].status"
# 2. Retrieval returns chunks for a known query
aws bedrock-agent-runtime retrieve --region us-east-1 \
--knowledge-base-id KB123 \
--retrieval-query '{"text": "data retention policy"}' \
--query "retrievalResults[].location"
# 3. The Guardrail blocks what it should (expect the blocked-input message)
aws bedrock-runtime apply-guardrail --region us-east-1 \
--guardrail-identifier gr-abc123 --guardrail-version 1 \
--source INPUT \
--content '[{"text": {"text": "My card is 4111111111111111"}}]'
# 4. Traffic is private: resolves to a 10.x address inside the VPC
nslookup bedrock-agent-runtime.us-east-1.amazonaws.com
A green run is: KB ACTIVE, ingestion COMPLETE, retrieve returns sources, apply-guardrail reports GUARDRAIL_INTERVENED, and the runtime hostname resolves to a private IP.
Production checklist
Pitfalls
The recurring ones, in order of how often they bite:
- Dimension mismatch. The embedding model dimension and the vector store field dimension must be identical. A mismatch fails ingestion with an opaque error. Decide the dimension once, write it everywhere.
- Pointing production at
DRAFT. A Guardrail DRAFT mutates whenever anyone edits it. Always cut a numbered version and reference that. - Forgetting to re-ingest. New documents in S3 are invisible until an ingestion job runs. Automate the sync (e.g. on an S3 event or schedule) or stale answers become a silent bug.
- OpenSearch Serverless idle cost. It bills a minimum OCU floor even at zero traffic. For a low-volume internal tool, that floor can dwarf token costs — model the total before committing.
- Skipping the grounding check. Content filters stop harmful text; they do nothing about a confidently wrong, ungrounded answer. The contextual grounding check is the control that does, and leaving it off is the most common RAG-safety gap.
Get these six steps right — access, ingestion, the right store, grounded generation with citations, Guardrails on both sides, and a private encrypted path — and you have a Bedrock RAG system you can put in front of real users and defend in an audit. Next, wrap it in an evaluation pipeline so every prompt or chunking change is measured, not guessed.