AI Data Engine

Create data collections in AI Data Engine Console

Contributors netapp-dbagwell

Data collections are the core retrieval-augmented generation (RAG) building blocks in AI Data Engine (AIDE). As a data engineer or data scientist, you define which files belong in a collection, configure embedding and indexing options, and publish the collection so that applications can query it through a retrieval endpoint.

You'll perform all data collection tasks in the AI Data Engine Console.

Before you begin
  • You have data engineer or data scientist privileges in AI Data Engine Console (https://<cluster_management_ip>/console).

  • You have access to at least one workspace with metadata extracted and in Ready state.

  • You have explored the workspace metadata and identified queries or filters that define meaningful subsets of data.

  • The AI Data Engine software license is installed and inferencing features are enabled.

Create a data collection from workspace metadata

Steps
  1. Navigate to Data Curator > Workspaces and select the workspace that contains your target data.

  2. Select Add data collection.

  3. On the Create new data collection page, do the following:

    1. Enter a name (for example, Support_KB_RAG_EN) and a description for the collection.

    2. Choose whether the collection should be:

      • Dynamic: New files are automatically identified and added to the data collection based on the filtering criteria you define. This happens during workspace refreshes.

      • Static: You choose which files are included in the collection. You can change the file selection while the data collection is in draft state. After the data collection moves to Published state, you can no longer edit it.

  4. Specify the source subset:

    1. Use keywords and filters (file type, timestamps, and other attributes) to find the relevant files to include.

      Note You can select a file name to open a preview window of the content.
  5. Add these files to the data collection.

  6. Select Save to finalize the collection.

Result

You have defined the scope of the data collection and added the required files to it. AIDE generates embeddings and builds the vector index when you publish the collection.

Tip Create small, focused collections (for example, per use case or domain) rather than a single "everything" collection. This improves retrieval relevance and manageability.
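
The difference between dynamic and static collections described in the steps above can be illustrated with a minimal Python sketch. This is a conceptual model only, not AIDE code: the FileMeta record, the make_filter helper, and both collection classes are hypothetical names introduced for illustration, and the sketch assumes that a workspace refresh simply re-applies the saved filter to the current file metadata.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical metadata record for one file in a workspace."""
    path: str
    file_type: str
    modified: datetime


# A filter is any predicate over file metadata.
Filter = Callable[[FileMeta], bool]


def make_filter(file_type: str, modified_after: datetime) -> Filter:
    """Build a filter on file type and modification timestamp."""
    return lambda f: f.file_type == file_type and f.modified > modified_after


class DynamicCollection:
    """Stores the filtering criteria; matching files are added on each refresh."""

    def __init__(self, criteria: Filter) -> None:
        self.criteria = criteria
        self.members = set()

    def refresh(self, workspace: Iterable[FileMeta]) -> None:
        # Add every file that currently matches the saved criteria.
        self.members |= {f.path for f in workspace if self.criteria(f)}


class StaticCollection:
    """Stores an explicit file list that is frozen once published."""

    def __init__(self, paths: Iterable[str]) -> None:
        self.members = set(paths)
        self.published = False

    def add(self, path: str) -> None:
        # Editing is only allowed while the collection is still a draft.
        if self.published:
            raise RuntimeError("published static collections cannot be edited")
        self.members.add(path)

    def publish(self) -> None:
        self.published = True
```

In this model, calling refresh on a DynamicCollection after new matching files appear grows its membership without any manual step, whereas a StaticCollection changes only through explicit add calls made before publish.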

Publish a data collection

Publish the data collection to make it queryable by AI applications through a RAG retrieval endpoint. Publishing generates vector embeddings from your selected files and indexes them for semantic search. After the collection reaches Ready state, its endpoint becomes available for data scientists to integrate into notebooks, pipelines, and AI applications for retrieval-augmented generation (RAG) and search.

Tip For large collections, consider scheduling the initial publish and major republishes during off-peak times to minimize resource contention.

Steps
  1. Navigate to Data Curator > Data collections and select the options menu (three horizontal blue dots) for your data collection.

  2. Select Publish.

  3. Select a default or custom optimization configuration.

  4. Select Publish to initiate the data transformation.

  5. In the AI Data Engine Console, open the collection detail view (Data Curator > Data collections) to monitor status updates.

Result

The collection reaches the Ready state and is available for use by downstream applications and data scientists.
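
If a pipeline needs to wait for the Ready state before it starts querying, a generic polling loop is one possible approach. The sketch below is hedged accordingly: this page documents status checks only in the console, so get_status is a hypothetical callable that you would have to supply yourself, and the timeout and interval defaults are arbitrary illustration values.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 3600.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "Ready" or the timeout elapses.

    get_status is a hypothetical, caller-supplied function that returns the
    collection's current state as a string. Returns True if "Ready" was
    observed and False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "Ready":
            return True
        if time.monotonic() >= deadline:
            return False
        # Wait before the next check to avoid hammering the status source.
        sleep(interval_s)
```

Passing sleep as a parameter keeps the function easy to test without real delays.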

From Data Curator > Data collections, you can select Copy URI to obtain the information needed to access the data collection using an API.
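
As a rough illustration of how a copied URI might be used from application code, the following Python sketch builds a JSON request with the standard library. Everything about the request shape is an assumption: this page does not document the retrieval API, so the use of HTTP POST, the body fields query and top_k, and the function name build_retrieval_request are hypothetical placeholders. Consult the AI Data Engine API reference for the actual contract, including authentication.

```python
import json
import urllib.request


def build_retrieval_request(
    collection_uri: str, query: str, top_k: int = 5
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to a copied collection URI.

    ASSUMPTION: the POST method and the "query" / "top_k" body fields are
    hypothetical; they are not taken from AI Data Engine documentation.
    """
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        collection_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The function only constructs the request object. Sending it (for example, with urllib.request.urlopen) and parsing the response are left out because the response format is not described here.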

Update or delete a data collection

Over time, you might need to refine or retire data collections. Refining a collection can involve adjusting filters to add or remove files, changing embedding settings, or updating the collection description. Deleting a collection removes it permanently and makes its retrieval endpoint unavailable.

Update a data collection

You can update a data collection when it's in draft state.

Steps
  1. Navigate to Data Curator > Data collections.

  2. Select the collection you want to modify.

  3. Choose Edit.

  4. Adjust any of the following:

    • Name and description

    • Filters (paths, file types, classification tags)

    • Embedding and chunking settings

  5. Save your changes.

  6. Publish the collection again so that the new definition and embeddings take effect.

Result

A new indexing job runs with the updated configuration, and the collection returns to a Ready state when complete.

Delete a collection

Deleting a collection is permanent. Ensure that no production application still depends on the collection's retrieval endpoint before deleting it.
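
One way to check for remaining dependencies is to search your application repositories for the collection's copied URI. The sketch below is a generic text search, not an AIDE feature: the find_references function, the list of file suffixes, and the assumption that applications store the URI as plain text in source or configuration files are all illustrative.

```python
from pathlib import Path
from typing import Iterable, List


def find_references(
    root: Path,
    collection_uri: str,
    suffixes: Iterable[str] = (".py", ".json", ".yaml", ".yml", ".toml", ".ipynb"),
) -> List[Path]:
    """Return files under root whose text contains collection_uri."""
    wanted = set(suffixes)
    hits = []
    for path in sorted(root.rglob("*")):
        # Skip directories and file types that are unlikely to hold config.
        if not path.is_file() or path.suffix not in wanted:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # Unreadable file; ignore it in this sketch.
        if collection_uri in text:
            hits.append(path)
    return hits
```

An empty result does not prove that nothing depends on the endpoint (the URI might be stored in a secrets manager or environment variable), so treat the search as one input to the decision rather than a guarantee.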

Steps
  1. Navigate to Data Curator > Data collections, and select the options menu (three horizontal blue dots) for the collection.

  2. Choose Delete.

  3. Confirm the deletion.

Result

The collection definition and its embeddings are removed from AI Data Engine. Applications that query the former retrieval endpoint will fail.

What's next?