Create a GenAI knowledge base

03/27/2025

After you've deployed the AI infrastructure and identified the data sources that you'll integrate in your knowledge base from your FSx for ONTAP datastores, you are ready to build the knowledge base using workload factory. As part of this step, you'll also define the AI characteristics and create conversation starters.

About this task

Knowledge bases have two data integration modes: public mode and Enterprise mode.

Public mode

A knowledge base can be used without integrating data sources from your organization. In this case, an application integrated with the knowledge base will only provide results from publicly available information on the internet. This is known as a public mode integration.

Enterprise mode

In most cases you'll want to integrate data sources from your organization into the knowledge base. This is known as an Enterprise mode integration because it provides knowledge from your enterprise.

Data sources from your organization may contain personally identifiable information (PII). To safeguard this sensitive information, you can enable data guardrails when creating and configuring knowledge bases. Data guardrails, powered by BlueXP classification, identify and mask PII, making it inaccessible and irretrievable.

Learn about BlueXP classification.

BlueXP workload factory for GenAI does not mask sensitive personal information (SPII). Refer to types of sensitive personal data for more information about this type of data.

Data guardrails can be enabled or disabled at any time. Each time you enable or disable data guardrails, workload factory rescans the entire knowledge base from scratch, which incurs a cost.

Create and configure the knowledge base

When you create the knowledge base, you define characteristics such as the Amazon Bedrock AI models and the embedding format that it uses.

Steps

Log in to workload factory using one of the console experiences.
In the AI workloads tile, select Deploy & manage.
From the Knowledge bases tab, select Add knowledge base.
On the Define knowledge base page, configure the knowledge base settings:
1. Name: Enter the name you want to use for the knowledge base.
2. Description: Enter a detailed description for the knowledge base.
3. Embedding model: The embedding model defines how your data will be converted into vector embeddings for the knowledge base. Workload factory supports the following models:
  - Titan Embeddings G1 - Text
  - Titan Embedding Text v2
  - Titan Multimodal Embeddings G1
    
    Note that you must have already enabled the embedding model from Amazon Bedrock.
    
    Learn more about Amazon Titan
4. Chat model: Choose from Anthropic Claude or Amazon Nova chat models that are integrated in Amazon Bedrock. Note that you must have already enabled the chat model from Amazon Bedrock.
  
  Learn more about using these models with Amazon Bedrock:
  - Anthropic's Claude in Amazon Bedrock
  - Getting started with Amazon Nova in the Amazon Bedrock console
5. Data guardrails: Choose whether you want to enable or disable data guardrails. Learn about data guardrails, powered by BlueXP classification.
  
  The following prerequisites must be met to enable data guardrails.
  - A service account is required to communicate with BlueXP classification. You must have the Organization admin role on your BlueXP tenancy account for service account creation. A member who has the Organization admin role can complete all actions in BlueXP. Learn how to add a role to a member in BlueXP
  - The AI engine must have access to the BlueXP API endpoint.
  - You'll need to do the following as described in BlueXP classification documentation:
    
    Create a BlueXP Connector
    
    Ensure that your environment can meet the prerequisites
    
    Deploy BlueXP classification
  The data guardrails feature is not supported when ingesting structured data files such as CSV, JSON, JSONP, or Parquet.
6. Conversation starters: Choose whether you want to provide up to four conversation starter prompts that are displayed to users who interact with a chatbot that uses this knowledge base. We recommend that you enable this setting.
  
  If you activate conversation starters, "Automatic mode" is selected by default. "Manual mode" can be enabled only after you've added data sources to your knowledge base. Learn how to modify knowledge base settings.
7. FSx for ONTAP file system: When you define a new knowledge base, workload factory creates a new Amazon FSx for NetApp ONTAP volume to store it. Choose an existing file system name and SVM (also called a storage VM) where the new volume will be created.
8. Snapshot policy: Choose a snapshot policy from the list of existing policies defined in the workload factory storage inventory. Recurring snapshots of the knowledge base will automatically be created at a frequency based on the snapshot policy you select.
  
  If the snapshot policy you need doesn't exist, you can create a snapshot policy on the storage VM that contains the volume.
Select Create knowledge base to add the knowledge base to GenAI.

A progress indicator appears while the knowledge base is created.

After the knowledge base is created, you have the option to add a data source to your new knowledge base or to end the process without adding a data source. We recommend that you select Add data source and add one or more data sources now.
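
As a planning aid, the settings described above can be collected and sanity-checked before you open the Define knowledge base page. The following is a minimal sketch only: the `KnowledgeBaseSettings` class, its field names, and the `validate` method are hypothetical and are not part of any workload factory API. Only the embedding model names and the limit of four conversation starters come from the steps above.

```python
from dataclasses import dataclass, field

# Embedding models listed in the configuration steps above.
SUPPORTED_EMBEDDING_MODELS = {
    "Titan Embeddings G1 - Text",
    "Titan Embedding Text v2",
    "Titan Multimodal Embeddings G1",
}
MAX_CONVERSATION_STARTERS = 4


@dataclass
class KnowledgeBaseSettings:
    """Hypothetical container mirroring the Define knowledge base fields."""
    name: str
    description: str
    embedding_model: str
    chat_model: str
    file_system: str
    svm: str
    snapshot_policy: str
    data_guardrails: bool = False
    conversation_starters: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means no issue was found."""
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if self.embedding_model not in SUPPORTED_EMBEDDING_MODELS:
            problems.append(f"unsupported embedding model: {self.embedding_model}")
        if len(self.conversation_starters) > MAX_CONVERSATION_STARTERS:
            problems.append("at most four conversation starters are allowed")
        return problems


# Example values are placeholders, not real resource names.
settings = KnowledgeBaseSettings(
    name="support-kb",
    description="Product support documents",
    embedding_model="Titan Embedding Text v2",
    chat_model="Anthropic Claude",
    file_system="example-file-system",
    svm="example-svm",
    snapshot_policy="example-policy",
)
print(settings.validate())
```

The sketch does not check whether the models are enabled in Amazon Bedrock or whether the data guardrails prerequisites are met; those checks still need to be done in the respective consoles.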

Add data sources to the knowledge base

You can add one or more data sources to populate the knowledge base with your organization's data.

About this task

The maximum number of supported data sources is 10.

Steps

After you select Add data source, the Select a file system page displays.
Select a file system: Select the FSx for ONTAP file system where your data source files reside and select Next.
Select a volume: Select the volume on which your data source files reside and select Next.

When selecting files stored using the SMB protocol, you'll need to enter the Active Directory information, which includes the domain, IP address, user name, and password.
Select a data source: Select the location where your files are saved, which can be an entire volume or a specific folder or subfolder in the volume, and then select Next.

Configurations: Configure how the data source ingests information from your files, and which files it includes in scans:

Define data source: In the Chunking strategy section, define how the GenAI engine splits data source content into chunks when the data source is integrated with a knowledge base. You can choose one of the following strategies:
- Multi-sentence chunking: Organizes information from your data source into sentence-defined chunks. You can choose how many sentences make up each chunk (up to 100).
- Overlap-based chunking: Organizes information from your data source into character-defined chunks that can overlap neighboring chunks. You can choose the size of each chunk in characters, and how much each chunk overlaps with adjacent chunks. You can configure a chunk size of between 50 and 3000 characters, and an overlap percentage of between 1% and 99%.
  
  Choosing a high overlap percentage can greatly increase storage requirements with only slight improvements in retrieval accuracy.
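
To make the two chunking strategies concrete, here is a minimal Python sketch of each. It is an illustration under stated assumptions, not the GenAI engine's actual implementation: the function names are hypothetical, the sentence splitter is deliberately naive, and only the limits (up to 100 sentences, 50 to 3000 characters, 1% to 99% overlap) are taken from the descriptions above.

```python
import re


def multi_sentence_chunks(text: str, sentences_per_chunk: int) -> list[str]:
    """Group consecutive sentences into chunks of a fixed sentence count."""
    if not 1 <= sentences_per_chunk <= 100:
        raise ValueError("sentences_per_chunk must be between 1 and 100")
    # Naive splitter (assumption): a sentence ends at ., !, or ? plus whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def overlap_chunks(text: str, chunk_size: int, overlap_percent: int) -> list[str]:
    """Cut text into fixed-size character chunks that overlap their neighbors."""
    if not 50 <= chunk_size <= 3000:
        raise ValueError("chunk_size must be between 50 and 3000 characters")
    if not 1 <= overlap_percent <= 99:
        raise ValueError("overlap_percent must be between 1 and 99")
    overlap = chunk_size * overlap_percent // 100
    step = max(1, chunk_size - overlap)  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "x" * 1000
low = overlap_chunks(sample, chunk_size=100, overlap_percent=10)
high = overlap_chunks(sample, chunk_size=100, overlap_percent=90)
print(len(low), len(high))
```

In this sketch, the same 1000-character input yields 11 chunks at 10% overlap and 91 chunks at 90% overlap, which shows why a high overlap percentage increases storage requirements.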

File filtering: Configure which files are included in scans:

In the File types support section, choose to either include all types of files, or select individual file types for inclusion in the data source scans.

If you include images or PDF files, BlueXP workload factory for GenAI parses text in the images (including images in PDF documents), and this incurs a higher cost.

When you include text data from images, GenAI cannot mask personally identifiable information (PII) in the scanned text data as it is sent from your environment to AWS. However, once the data is stored, all PII is masked in the GenAI database.

Your choice to include image files in scans is related to the knowledge base chat model. If you include image files in scans, the chat model must support images. If image file types are selected here, you cannot switch the knowledge base to a chat model that does not support image files.

In the File modification time filter section, choose to enable or disable inclusion of files based on their modification time. If you enable modification time filtering, select a date range from the list.

If you include files based on a modification date range, files that no longer satisfy the range (that is, files that have not been modified within the date range you specify) are excluded from the periodic scan, and the data source no longer includes them.
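
The combined effect of the file type and modification time filters can be pictured with a short Python sketch. This is an assumption-laden illustration, not how workload factory scans a volume: the `select_files` function and its parameters are hypothetical, and the date range is modeled simply as "modified within the last N days".

```python
from datetime import datetime, timedelta
from pathlib import Path


def select_files(root, extensions=None, modified_within_days=None, now=None):
    """Return files under root that pass the type and modification time filters.

    extensions: set of lowercase suffixes such as {".pdf"}; None includes all types.
    modified_within_days: None disables modification time filtering.
    """
    now = now or datetime.now()
    cutoff = None
    if modified_within_days is not None:
        cutoff = now - timedelta(days=modified_within_days)

    selected = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if extensions is not None and path.suffix.lower() not in extensions:
            continue  # file type not selected for this data source
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if cutoff is not None and modified < cutoff:
            continue  # outside the date range, so excluded from the scan
        selected.append(path)
    return selected
```

Calling `select_files("/mnt/data", extensions={".pdf"}, modified_within_days=30)` (with a placeholder path) would return only PDF files changed in the last 30 days. A file that stops being modified drops out of later results once it ages past the cutoff, which mirrors the periodic scan behavior described above.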

In the Permission aware section, which is available only when the data source you selected is on a volume that uses the SMB protocol, you can enable or disable permission-aware responses:
- Enabled: Users of the chatbot who access this knowledge base will only get responses to queries from data sources to which they have access.
- Disabled: Users of the chatbot will receive responses using content from all integrated data sources.
Select Add to add this data source to your knowledge base.
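
The difference between the enabled and disabled permission-aware settings can be sketched in a few lines of Python. This is a simplified model for illustration only: the `Chunk` class, the `allowed_users` field, and the `visible_chunks` function are hypothetical, and a plain set of user names stands in for the SMB and Active Directory access rights that the real feature relies on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieved chunk with a simplified access list."""
    text: str
    source: str
    allowed_users: frozenset  # stand-in for SMB access rights (assumption)


def visible_chunks(chunks, user, permission_aware):
    """Return the chunks that may be used to answer a query from this user."""
    if not permission_aware:
        # Disabled: content from all integrated data sources is used.
        return list(chunks)
    # Enabled: only content the user has access to is used.
    return [c for c in chunks if user in c.allowed_users]


chunks = [
    Chunk("Quarterly revenue summary", "finance-share", frozenset({"alice"})),
    Chunk("Office opening hours", "public-share", frozenset({"alice", "bob"})),
]
print([c.source for c in visible_chunks(chunks, "bob", permission_aware=True)])
print([c.source for c in visible_chunks(chunks, "bob", permission_aware=False)])
```

With the setting enabled, the hypothetical user `bob` only gets answers drawn from `public-share`; with it disabled, both sources contribute.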

Result

Workload factory starts embedding the data source into your knowledge base. The status changes from "Embedding" to "Embedded" when the data source is completely embedded.

After you add a single data source to the knowledge base, you can test it locally in the chatbot simulator window and make any required changes before you make the chatbot available to your users. You can also follow the same steps to add additional data sources to the knowledge base.
