
Create a GenAI knowledge base

Contributors netapp-mwallis

After you've deployed the AI infrastructure and identified the data sources from your FSx for ONTAP data stores that you'll integrate into your knowledge base, you are ready to build the knowledge base using Workload Factory. As part of this step, you'll also define the AI characteristics and create conversation starters.

Ensure that your environment meets the requirements for knowledge bases before proceeding.

About this task

Knowledge bases have two data integration modes: public mode and Enterprise mode.

Public mode

A knowledge base can be used without integrating data sources from your organization. In this case, an application integrated with the knowledge base will only provide results from publicly available information on the internet. This is known as a public mode integration.

Enterprise mode

In most cases you'll want to integrate data sources from your organization into the knowledge base. This is known as an Enterprise mode integration because it provides knowledge from your enterprise.

Data sources from your organization may contain Personally Identifiable Information (PII). To safeguard this sensitive information, you can enable data guardrails when creating and configuring knowledge bases. Data guardrails, powered by NetApp Data Classification, identify and mask PII, making it inaccessible and irretrievable.

Note NetApp Workload Factory for GenAI does not mask sensitive personal information (SPII). Refer to types of sensitive personal data for more information about this type of data.
Note You can enable or disable data guardrails at any time. If you change the data guardrails setting, Workload Factory rescans the entire knowledge base from scratch, which incurs a cost.

Create and configure the knowledge base

The knowledge base defines characteristics such as the Bedrock AI models and embedding format that you want to use to create your knowledge base.

Steps
  1. Log in to Workload Factory using one of the console experiences.

  2. In the AI workloads tile, select Deploy & manage.

  3. From the Knowledge bases & Connectors menu, select the Create New dropdown and choose NetApp GenAI knowledge base for Bedrock.

  4. On the Create NetApp GenAI knowledge base page, configure the knowledge base settings:

Knowledge base details
  1. Name: Enter the name you want to use for the knowledge base.

  2. Description: Enter a detailed description for the knowledge base.

  3. Bedrock: Choose the region where Amazon Bedrock is available for your AWS account.

Ingestion
  1. Embedding model:

    • Choose an embedding model to use for the knowledge base. The embedding model defines how your data will be converted into vector embeddings for the knowledge base. Workload Factory supports the following models:

    • Titan Embeddings G1 - Text

    • Titan Embedding Text v2

    • Titan Multimodal Embeddings G1

    • Embed English

    • Embed Multilingual

      Note that you must have already enabled the embedding model from Amazon Bedrock.

    • If applicable, select the inference type that matches the configuration of the selected embedding model.

  2. Data guardrails: Choose whether you want to enable or disable data guardrails. Learn about data guardrails, powered by NetApp Data Classification.

    The following prerequisites must be met to enable data guardrails.

    Note The data guardrails feature is not supported when ingesting structured data files such as CSV, JSON, JSONP, or Parquet.

Chat and Retrieval settings
  1. Chat model:

    • Choose from various chat models that are integrated in Amazon Bedrock. Note that you must have already enabled the chat model from Amazon Bedrock.

    • If applicable, select the inference type that matches the configuration of the selected model.

  2. Chat settings:

    • Choose a temperature for the chatbot to configure the randomness and creativity of responses. A lower temperature results in more predictable responses, and a higher temperature results in more varied responses.

    • Choose a maximum response length to configure how detailed responses are. Longer response lengths use more response tokens, and can incur a higher cost.

  3. Thinking mode: When thinking mode is enabled, the chatbot will take more time to process queries and the results will usually be more accurate. When you enable thinking mode, you can control how many reasoning tokens are used when generating results. Using more reasoning tokens can lead to more accurate responses, but might incur a higher cost.

  4. Reranking: Enable or disable reranking, which can improve the relevance and quality of query results. Choose a standard chat model or a specialized reranker model to use for reranking. Reranker model options are only shown if they are available in your region. Select the inference type that matches the configuration of the selected model.

  5. Conversation starters: Choose whether you want to provide up to four conversation starter prompts that are displayed to users who interact with a chatbot that uses this knowledge base. We recommend that you enable this setting.

    If you activate conversation starters, "Automatic mode" is selected by default. "Manual mode" can be enabled only after you've added data sources to your knowledge base. Learn how to modify knowledge base settings.

Storage definitions
  1. FSx for ONTAP file system: When you define a new knowledge base, Workload Factory creates a new Amazon FSx for NetApp ONTAP volume to store it. Choose an existing file system name and SVM (also called a storage VM) where the new volume will be created.

  2. Snapshot policy: Choose a snapshot policy from the list of existing policies defined in the Workload Factory storage inventory. Recurring snapshots of the knowledge base will automatically be created at a frequency based on the snapshot policy you select.

    If the snapshot policy you need doesn't exist, you can create a snapshot policy on the storage VM that contains the volume.

  3. S3 Bucket: If chatbot query results contain structured data, GenAI can store the results in an S3 bucket. To use this feature, enable the Activate S3 Bucket setting and choose an S3 bucket that is associated with your account from the list. When these results are stored in an S3 bucket, you can download them using the download link within the chat session.

  5. Select Create knowledge base to add the knowledge base to GenAI.

    A progress indicator appears while the knowledge base is created.

    After the knowledge base is created, you have the option to add a data source to your new knowledge base or to end the process without adding a data source. We recommend that you select Add data source and add one or more data sources now.
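
The temperature option under Chat settings can be easier to reason about with a small numeric example. The following is a minimal, generic sketch of how temperature typically reshapes a language model's token probabilities. It is not Workload Factory or Amazon Bedrock code, and the function name and the example scores are assumptions made for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

# A low temperature concentrates probability on the top-scoring token.
low = softmax_with_temperature(logits, 0.2)

# A higher temperature spreads probability across more candidates.
high = softmax_with_temperature(logits, 1.5)

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

In this sketch the top candidate receives almost all of the probability at the low temperature and only a little over half at the higher one, which matches the guidance above: lower values give more predictable responses and higher values give more varied ones.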

Add data sources to the knowledge base

You can add one or more data sources to populate the knowledge base with your organization's data.

About this task

The maximum number of supported data sources is 10.

Steps
  1. After you select Add data source, select the type of data source you want to add:

    • Add FSx for ONTAP file system (use files from an existing FSx for ONTAP volume)

    • Add file system (use files from a generic SMB or NFS share)

Add an FSx for ONTAP file system
  1. Select a file system: Select the FSx for ONTAP file system where your data source files reside and select Next.

  2. Select a volume: Select the volume on which your data source files reside and select Next.

    When selecting files stored using the SMB protocol, you'll need to enter the Active Directory information, which includes the domain, IP address, user name, and password.

  3. Select a data source: Select the data source location based on where you have saved the files. This can be an entire volume or a specific folder or subfolder in the volume. Then select Next.

  4. Configurations: Configure how the data source ingests information from your files, and which files it includes in scans:

    • Define data source: In the Chunking strategy section, define how the GenAI engine splits data source content into chunks when the data source is integrated with a knowledge base. You can choose one of the following strategies:

      • Multi-sentence chunking: Organizes information from your data source into sentence-defined chunks. You can choose how many sentences make up each chunk (up to 100).

      • Overlap-based chunking: Organizes information from your data source into character-defined chunks that can overlap neighboring chunks. You can choose the size of each chunk in characters, and how much each chunk overlaps with adjacent chunks. You can configure a chunk size of between 50 and 3000 characters, and an overlap percentage of between 1% and 99%.

        Note Choosing a high overlap percentage can greatly increase storage requirements with only slight improvements in retrieval accuracy.
    • File filtering: Configure which files are included in scans:

      • In the File types support section, choose to either include all types of files, or select individual file types for inclusion in the data source scans.

        If you include images or PDF files, NetApp Workload Factory for GenAI parses text in the images (including images in PDF documents), and this incurs a higher cost.

        When including text data from images, GenAI is unable to mask Personally-Identifiable Information (PII) from the image as the scanned text data is sent from your environment to AWS. However, once the data is stored, all PII is masked in the GenAI database.

        Note Your choice to include image files in scans is related to the knowledge base chat model. If you include image files in scans, the chat model must support images. If image file types are selected here, you cannot switch the knowledge base to a chat model that does not support image files.
      • In the File modification time filter section, choose to enable or disable inclusion of files based on their modification time. If you enable modification time filtering, select a date range from the list.

        Note If you filter files by a modification date range, files that fall outside the range (that is, files that have not been modified within the range you specify) are excluded from the periodic scan, and the data source no longer includes them.
  5. In the Permission aware section, which is available only when the data source you selected is on a volume that uses the SMB protocol, you can enable or disable permission-aware responses:

    • Enabled: Users of the chatbot who access this knowledge base will only get responses to queries from data sources to which they have access.

    • Disabled: Users of the chatbot will receive responses using content from all integrated data sources.

  6. Select Add to add this data source to your knowledge base.
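
The two chunking strategies described in the Configurations step can be sketched in a few lines of Python. This is an illustration only, not the GenAI engine's implementation: the function names, the simple punctuation-based sentence splitting, and the way the overlap percentage is converted into a character count are all assumptions. The limits (up to 100 sentences, 50 to 3000 characters, 1% to 99% overlap) come from the settings described above.

```python
import re


def multi_sentence_chunks(text, sentences_per_chunk):
    """Group consecutive sentences into chunks of a fixed sentence count."""
    if not 1 <= sentences_per_chunk <= 100:
        raise ValueError("sentences_per_chunk must be between 1 and 100")
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def overlap_chunks(text, chunk_size, overlap_percent):
    """Cut text into fixed-size character chunks that overlap their neighbors."""
    if not 50 <= chunk_size <= 3000:
        raise ValueError("chunk_size must be between 50 and 3000 characters")
    if not 1 <= overlap_percent <= 99:
        raise ValueError("overlap_percent must be between 1 and 99")
    overlap = chunk_size * overlap_percent // 100
    step = max(1, chunk_size - overlap)  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "a" * 200
moderate = overlap_chunks(sample, 100, 50)
heavy = overlap_chunks(sample, 100, 90)
print(len(moderate), sum(len(c) for c in moderate))
print(len(heavy), sum(len(c) for c in heavy))
```

In this sketch each new chunk advances by the chunk size minus the overlap, so the total number of stored characters grows by roughly 1 / (1 - overlap). At 90% overlap, most characters are stored about ten times, which is why a high overlap percentage can greatly increase storage requirements.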

Add a generic NFS file system
  1. Select a file system: Enter the IP address or FQDN for the filesystem host where your data source files reside, choose the NFS protocol for the network share, and select Next.

  2. Select a data source: Select the data source location based on where you have saved the files. This can be an entire volume or a specific folder or subfolder in the volume. Then select Next.

    Note In some cases, you might need to enter the NFS export name manually and select Retrieve directories to display the available directories. You can choose to select the entire export, or only specific folders from the export.
  3. Configurations: Configure how the data source ingests information from your files, and which files it includes in scans:

    • Define data source: In the Chunking strategy section, define how the GenAI engine splits data source content into chunks when the data source is integrated with a knowledge base. You can choose one of the following strategies:

      • Multi-sentence chunking: Organizes information from your data source into sentence-defined chunks. You can choose how many sentences make up each chunk (up to 100).

      • Overlap-based chunking: Organizes information from your data source into character-defined chunks that can overlap neighboring chunks. You can choose the size of each chunk in characters, and how much each chunk overlaps with adjacent chunks. You can configure a chunk size of between 50 and 3000 characters, and an overlap percentage of between 1% and 99%.

        Note Choosing a high overlap percentage can greatly increase storage requirements with only slight improvements in retrieval accuracy.
    • File filtering: Configure which files are included in scans:

      • In the File types support section, choose to either include all types of files, or select individual file types for inclusion in the data source scans.

        If you include images or PDF files, NetApp Workload Factory for GenAI parses text in the images (including images in PDF documents), and this incurs a higher cost.

        When including text data from images, GenAI is unable to mask Personally-Identifiable Information (PII) from the image as the scanned text data is sent from your environment to AWS. However, once the data is stored, all PII is masked in the GenAI database.

        Note Your choice to include image files in scans is related to the knowledge base chat model. If you include image files in scans, the chat model must support images. If image file types are selected here, you cannot switch the knowledge base to a chat model that does not support image files.
      • In the File modification time filter section, choose to enable or disable inclusion of files based on their modification time. If you enable modification time filtering, select a date range from the list.

        Note If you filter files by a modification date range, files that fall outside the range (that is, files that have not been modified within the range you specify) are excluded from the periodic scan, and the data source no longer includes them.
  4. Select Add data source to add this data source to your knowledge base.
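
The effect of the File modification time filter on periodic scans can be illustrated with a short sketch. It assumes the selected date range behaves like a rolling "modified within the last N days" window, which this page does not state, and the function name and sample file names are hypothetical.

```python
from datetime import datetime, timedelta


def files_in_range(files, days, now):
    """Return the paths whose modification time falls within the last `days` days.

    `files` maps a path to its modification time (a datetime).
    """
    cutoff = now - timedelta(days=days)
    return sorted(path for path, mtime in files.items() if mtime >= cutoff)


# Hypothetical files and modification times.
files = {
    "policies/handbook.docx": datetime(2025, 1, 30),
    "archive/old-plan.pdf": datetime(2024, 1, 1),
}

# A scan at the end of January includes only the recently modified file.
first_scan = files_in_range(files, 30, datetime(2025, 1, 31))

# A later scan includes nothing, because no file was modified within the window.
later_scan = files_in_range(files, 30, datetime(2025, 6, 1))

print(first_scan)
print(later_scan)
```

Under this assumption, a file that was included in an earlier scan drops out of the data source once its modification time no longer falls within the range, unless the file is modified again.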

Add a generic SMB file system
  1. Select file system:

    1. Enter the IP address or FQDN for the filesystem host where your data source files reside.

    2. Choose the SMB protocol for the network share.

    3. Enter the Active Directory information, which includes the domain, IP address, user name, and password.

    4. Select Next.

  2. Select a data source: Select the data source location based on where you have saved the files. This can be an entire volume or a specific folder or subfolder in the volume. Then select Next.

    Note In some cases, you might need to enter the SMB share name manually and select Retrieve directories to display the available directories. You can choose to select the entire share, or only specific folders from the share.
  3. Configurations: Configure how the data source ingests information from your files, and which files it includes in scans:

    • Define data source: In the Chunking strategy section, define how the GenAI engine splits data source content into chunks when the data source is integrated with a knowledge base. You can choose one of the following strategies:

      • Multi-sentence chunking: Organizes information from your data source into sentence-defined chunks. You can choose how many sentences make up each chunk (up to 100).

      • Overlap-based chunking: Organizes information from your data source into character-defined chunks that can overlap neighboring chunks. You can choose the size of each chunk in characters, and how much each chunk overlaps with adjacent chunks. You can configure a chunk size of between 50 and 3000 characters, and an overlap percentage of between 1% and 99%.

        Note Choosing a high overlap percentage can greatly increase storage requirements with only slight improvements in retrieval accuracy.
    • Permission aware: Enable or disable permission-aware responses:

      • Enabled: Users of the chatbot who access this knowledge base will only get responses to queries from data sources to which they have access.

      • Disabled: Users of the chatbot will receive responses using content from all integrated data sources.

    • File filtering: Configure which files are included in scans:

      • In the File types support section, choose to either include all types of files, or select individual file types for inclusion in the data source scans.

        If you include images or PDF files, NetApp Workload Factory for GenAI parses text in the images (including images in PDF documents), and this incurs a higher cost.

        When including text data from images, GenAI is unable to mask Personally-Identifiable Information (PII) from the image as the scanned text data is sent from your environment to AWS. However, once the data is stored, all PII is masked in the GenAI database.

        Note Your choice to include image files in scans is related to the knowledge base chat model. If you include image files in scans, the chat model must support images. If image file types are selected here, you cannot switch the knowledge base to a chat model that does not support image files.
      • In the File modification time filter section, choose to enable or disable inclusion of files based on their modification time. If you enable modification time filtering, select a date range from the list.

        Note If you filter files by a modification date range, files that fall outside the range (that is, files that have not been modified within the range you specify) are excluded from the periodic scan, and the data source no longer includes them.
  4. Select Add data source to add this data source to your knowledge base.
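
The difference between the Enabled and Disabled settings for permission-aware responses can be pictured as a filter applied to retrieved content before the chatbot answers. The sketch below is conceptual only. This page does not describe how Workload Factory evaluates SMB permissions, so the `Chunk` type, the group names, and the `visible_chunks` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A piece of embedded content plus the groups allowed to read its source file."""
    text: str
    allowed_groups: frozenset


def visible_chunks(chunks, user_groups, permission_aware):
    """Return the chunks that may be used to answer a given user's query."""
    if not permission_aware:
        # Disabled: content from all integrated data sources is used.
        return list(chunks)
    # Enabled: keep only chunks whose source the user is allowed to read.
    return [c for c in chunks if c.allowed_groups & user_groups]


# Hypothetical content and group memberships.
chunks = [
    Chunk("Salary bands for 2025", frozenset({"hr"})),
    Chunk("Office opening hours", frozenset({"hr", "engineering"})),
]

engineer_view = visible_chunks(chunks, {"engineering"}, permission_aware=True)
unfiltered_view = visible_chunks(chunks, {"engineering"}, permission_aware=False)

print([c.text for c in engineer_view])
print([c.text for c in unfiltered_view])
```

With the setting enabled, the engineering user in this sketch receives answers drawn only from the shared document. With it disabled, the same user could receive answers drawn from both.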

Result

Workload Factory begins embedding the data source into your knowledge base. The status changes from "Embedding" to "Embedded" when the data source is completely embedded.

After you add a single data source to the knowledge base, you can test it locally in the chatbot simulator window and make any required changes before you make the chatbot available to your users. You can also follow the same steps to add additional data sources to the knowledge base.