Learn how AI Data Engine (AIDE) data engineers and data scientists work with AIDE components
As a data engineer or data scientist, you use the AI Data Engine Console to explore workspaces you have been granted access to, create and manage data collections, perform semantic searches, and integrate retrieval endpoints into AI/ML workflows.
Data engineers focus on transforming raw data into AI-ready datasets by building collections, configuring embedding pipelines, and controlling which users can access published collections. Data scientists focus on leveraging curated datasets for analysis, model training, and GenAI applications, without managing access control or infrastructure.
Data user component access
| Component | Access level | Data engineer workflow | Data scientist workflow |
|---|---|---|---|
| AI Data Engine Console | Manage (create, edit, delete) | The AI Data Engine Console is your primary interface for day-to-day tasks, including data discovery, collection management, pipeline configuration, and publishing RAG or retrieval endpoints, for the workspaces you are authorized to access. | The AI Data Engine Console is your primary interface for data exploration, refining and versioning collections within workspaces you can access, and connecting curated datasets and retrieval endpoints to analysis, modeling, and GenAI workflows. |
| ONTAP REST API | Manage (create, edit, delete) | You use the REST API to automate collection lifecycle operations, trigger and monitor embedding pipelines, and programmatically integrate data workflows with external tools. | You use the REST API to programmatically access data collections, run vector search queries, and integrate retrieval endpoints into AI/ML applications and agentic frameworks. |
| Workspaces | View/use (read-only) | You explore your assigned workspaces to identify and understand available data sources before building collections. | You search your assigned workspaces to locate files and objects relevant to specific research or modeling tasks. |
| Data collections | Manage (create, edit, delete) | You build data collections by selecting and filtering source data using tags, classification, and other attributes, and you manage the full collection lifecycle from creation and versioning through publishing as RAG endpoints for AI use. You also manage which data scientists and other users can access each collection. | You create, select, annotate, version, and refine data collections within the workspaces you have been given access to. You use these collections as the basis for semantic search and GenAI workflows. |
| Metadata catalog | Query/use (consume for workflows) | You use the metadata catalog to evaluate and select data sources for ingestion, running queries to locate relevant files and confirm they meet the requirements of the collections you are building within your assigned workspaces. | You search and filter metadata across the workspaces you can access to locate files and objects needed for analysis or model training, relying on the catalog structure that has been built and maintained by data engineers. |
| Vector database | | You trigger embedding pipelines, monitor vectorization status, configure chunking and embedding parameters, and expose retrieval endpoints backed by vector search. Applications and agents then query these endpoints via the API for semantic search and RAG workflows. | You run semantic search queries against embeddings generated by data engineer-managed pipelines and integrate retrieval results into GenAI or RAG workflows for context-aware model responses. You do not configure chunking, embeddings, or pipeline parameters. |
| Classifiers | Use (consume classified data) | You use classification results to annotate and tag source data during collection preparation, ensuring that content entering your pipelines is properly labeled for downstream AI workflows. | You consume pre-classified data to ensure that only compliant and relevant content is used in your analysis and modeling. |
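
The division of work described for the vector database can be illustrated with a small, self-contained sketch. It is not the AI Data Engine API: every function name and parameter below (`chunk_text`, `embed`, `build_index`, `search`, `chunk_size`, `overlap`, `top_k`) is a hypothetical stand-in, and the bag-of-words "embedding" is a deliberately simple substitute for a real embedding model. The sketch only shows the general idea that chunking and embedding choices are fixed when the index is built (the data engineer side), while a query is later embedded and ranked against the stored vectors (the data scientist side).

```python
import math
import re
from collections import Counter

# Hypothetical illustration only: none of these names belong to AI Data Engine.


def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into overlapping word windows (stand-in for configurable chunking)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: term counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents, chunk_size=8, overlap=2):
    """Pipeline side: chunk each document and store one vector per chunk."""
    index = []
    for doc_id, text in documents.items():
        for chunk in chunk_text(text, chunk_size, overlap):
            index.append((doc_id, chunk, embed(chunk)))
    return index


def search(index, query, top_k=2):
    """Consumer side: embed the query and return the top_k most similar chunks."""
    query_vec = embed(query)
    scored = [
        (cosine_similarity(query_vec, vec), doc_id, chunk)
        for doc_id, chunk, vec in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    documents = {
        "a": "quarterly revenue report for the storage business unit",
        "b": "employee onboarding checklist and badge request form",
    }
    index = build_index(documents)
    for score, doc_id, chunk in search(index, "storage revenue", top_k=1):
        print(doc_id, round(score, 3), chunk)
```

In this sketch, `search` never touches `chunk_size` or `overlap`, which mirrors the split in the table: retrieval consumers query whatever index the pipeline produced, and changing chunking or embedding behavior means rebuilding the index.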