Build and operate well-architected workloads

Contributors netapp-rlithman

Workload Factory, the NetApp management suite for Amazon FSx for NetApp ONTAP, helps you maintain and operate reliable, secure, efficient, and cost-effective storage and database configurations that align with the AWS Well-Architected Framework. Workload Factory provides daily analysis of your storage and database workloads, recommendations, and automatic fixes to promote healthy workload operations. By automating this process, Workload Factory minimizes human error and ensures consistency in workload management.

How it works

Workload Factory analyzes Amazon FSx for NetApp ONTAP file systems, Microsoft SQL Server, and Oracle database deployments daily. The daily analysis provides the well-architected status along with insights and recommendations, including options to automatically fix configuration issues, so that your deployments meet best practices and operate efficiently.

After the daily analysis completes, configurations appear as "optimized" or "not optimized" in the Well-architected dashboard for the deployment. The dashboard shows the total optimization score, configuration issues by category, and a list of configuration issues with recommendations. Workload Factory can fix some issues automatically; others require manual intervention, and for those Workload Factory provides detailed instructions to help you implement the recommended changes.

Because requirements for storage and database workloads vary, you can dismiss the analysis of specific configurations that don't apply to your environments. This helps you avoid unnecessary alerts and inaccurate optimization results. When a specific configuration analysis is dismissed, the configuration isn't included in the total optimization score.

Why it matters

Manually applying best practices across large storage or database environments is difficult. Workload Factory streamlines the process by combining analysis and remediation of suboptimal configurations in the Workload Factory console. Fixes applied from the console reduce the risk of human error and promote uniformity in storage and database management. With automation, configurations are applied the same way every time and are kept that way over time, which helps maintain high standards of performance and reliability across your entire storage infrastructure.

Get started with Workload Factory to detect and correct misconfigurations

Get started with Workload Factory by signing up and creating an account, adding credentials, and establishing connectivity so that Workload Factory can manage AWS resources directly, and then optimize your workloads by using Amazon FSx for NetApp ONTAP.

Best practices and recommendations for storage workloads

Workload Factory provides an in-depth view into ONTAP configuration best practices for storage management. Specifically, the Storage workload within Workload Factory analyzes storage configurations for compliance with the pillars of the AWS Well-Architected Framework, and provides recommendations and remediation for suboptimal configurations. From the well-architected status dashboard in Storage, you'll find insights that help you implement well-architected best practices to deliver optimal performance and reliability for your FSx for ONTAP file systems.

The well-architected analysis categorizes configurations in the following pillars of the framework: reliability, security, operational excellence, cost optimization, and performance efficiency.

Reliability

Reliability ensures that workloads perform their intended functions correctly and consistently, even when there are disruptions.

  • Schedule FSx for ONTAP backups

    FSx for ONTAP: Backing up your volumes helps support data retention and compliance needs. Use FSx for ONTAP backup to implement a centrally managed, automated backup and retention strategy for your data.

  • Schedule local snapshots

    Schedule local snapshots for efficient backup and quick restores. Snapshots are instant, point-in-time images of your volumes.

  • Cross-region replication

    Cross-region replication ensures that your data is replicated to another AWS region, providing enhanced data durability and availability. Workload Factory recommends configuring cross-region replication for disaster recovery and compliance requirements.

  • Set up data replication

    To extend data reliability, data can be replicated to an FSx for ONTAP file system in the same region or in another region. Set up data replication to support migration, disaster recovery, and long-term retention across file systems.

  • Increase SSD capacity threshold

    The SSD storage tier capacity should not exceed 80% utilization on an ongoing basis. Exceeding this threshold might impact data reads and writes to your capacity pool storage tier and reduce the throughput capacity of your file system. Running out of capacity might result in data volumes becoming read-only, and services trying to write new data might fail.

  • Match labels to ensure data reliability

    The snapshot policy labels of the source volume and the replication policy labels must match to ensure data reliability.

  • Increase file capacity threshold

    The file capacity threshold should be raised to avoid hitting the volume capacity limit. Low file capacity (inodes) prevents writing additional data to the volume. Workload Factory recommends staying below 80% utilization of the available file capacity on an ongoing basis. Available file capacity is required to create new files in the volume.
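The label-matching requirement in this list can be sketched as a simple set comparison. This is a minimal illustration, not how Workload Factory performs the check: the function name and the sample label values are hypothetical, and it assumes that "match" means every label used by the replication policy is also produced by the source volume's snapshot policy.

```python
def missing_replication_labels(snapshot_policy_labels, replication_policy_labels):
    """Return replication-policy labels that the source snapshot policy does not produce.

    Assumption: labels "match" when every replication-policy label also
    appears among the snapshot-policy labels of the source volume.
    """
    source = {label.strip() for label in snapshot_policy_labels}
    wanted = {label.strip() for label in replication_policy_labels}
    return sorted(wanted - source)


# Hypothetical label values, for illustration only.
source_labels = ["hourly", "daily", "weekly"]
replication_labels = ["daily", "weekly", "monthly"]

missing = missing_replication_labels(source_labels, replication_labels)
print(missing)
```

An empty result means that every replication label has a matching snapshot label. Any label that is returned would need to be added to the snapshot policy or removed from the replication policy.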

Security

Security emphasizes protecting data, systems, and assets through risk assessments and mitigation strategies.

  • Enable ARP/AI

    NetApp Autonomous Ransomware Protection with AI (ARP/AI) enhances cyber resiliency and ensures active protection for volumes against evolving ransomware threats. Workload Factory recommends enabling ARP/AI for all volumes.

  • Unauthorized access to volumes

    Volumes serving application data using iSCSI should not allow NAS access in parallel. Workload Factory recommends that volumes accessed via the iSCSI protocol not be accessible through any additional protocols.

Operational excellence

Operational excellence focuses on delivering optimal architecture and business value.

  • Enable automatic capacity management

    Automatic capacity management should be enabled to regularly ensure that the SSD tier doesn't exceed the threshold.

  • Volume capacity utilization threshold

    Workload Factory recommends that volume capacity doesn't exceed 80% utilization on an ongoing basis. Exceeding this threshold might impact data reads and writes to your application. Volume capacity increases can be manual or automatic using the volume autogrow feature.

  • Volume utilization nearing full

    When a volume is nearing full capacity, Workload Factory recommends taking action to increase the volume capacity to avoid potential application disruptions.

  • Cache relationship write mode

    For optimal performance, Workload Factory recommends the cache relationship write mode that best suits your workload. Write-around mode provides better performance for read-heavy workloads with small files, whereas write-back mode provides better performance for write-heavy workloads with large files.
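The 80% utilization guidance for volume capacity in this list comes down to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, sizes are plain integers in any consistent unit (for example GiB), and the resize calculation is one way to get back under the threshold, not a description of how the volume autogrow feature behaves.

```python
UTILIZATION_LIMIT_PCT = 80  # threshold cited in the text


def utilization_pct(used, total):
    """Return used capacity as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used * 100 / total


def exceeds_limit(used, total, limit_pct=UTILIZATION_LIMIT_PCT):
    """True when utilization is strictly above the limit."""
    return utilization_pct(used, total) > limit_pct


def min_size_for_limit(used, limit_pct=UTILIZATION_LIMIT_PCT):
    """Smallest integer capacity at which `used` is at or below the limit.

    Expects integer inputs; uses integer ceiling division to avoid rounding errors.
    """
    return -(-used * 100 // limit_pct)


print(exceeds_limit(850, 1000))   # 85% used
print(min_size_for_limit(850))    # capacity needed to return to 80% or less
```

The same check applies to the SSD tier and to file capacity (inodes), which the storage recommendations also hold to 80%.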

Cost optimization

Cost optimization aims to deliver business value while minimizing costs.

  • Optimize TCO by tiering cold data

    Cold data tiering should be enabled to reduce SSD storage tier utilization. Applying a tiering policy to every volume is recommended. FSx for ONTAP scans the data continuously to detect cold data and move it to the capacity storage pool tier without disruption.

  • Enable storage efficiencies

    Storage efficiencies (compaction, compression, and deduplication) should be enabled to optimize storage utilization and reduce the SSD tier cost.

  • Unnecessary snapshot and backup deletion

    Snapshots and backups that are no longer needed should be deleted to reduce costs.

Best practices and recommendations for database workloads

Workload Factory provides a set of best practices and recommendations for operating well-architected database workloads. The well-architected analysis assesses Microsoft SQL Server and Oracle Database configurations and settings related to storage sizing, storage layout, storage configuration, compute, application (SQL Server), and resiliency.

Storage sizing

  • Storage tier

    For optimal storage performance, provision FSx for ONTAP volumes on the primary SSD tier. Using the capacity pool tier may result in slower performance and higher latency.

  • File system headroom

    To optimize storage performance, provision file system capacity at 1.35 times the total size of the provisioned volumes.

    File system headroom percentages are as follows:

    • Under-provisioned: < 35%

    • Optimized: 35-100%

    • Over-provisioned: > 100%

  • Log drive size

    Ensure accurate sizing and regular monitoring of the SQL Server log drive to prevent issues such as transaction rollbacks, database unavailability, data corruption, and performance degradation caused by a full log drive.

    Log drive size percentages are as follows:

    • Under-provisioned: < 20%

    • Optimized: 20-30%

    • Over-provisioned: > 30%

  • TempDB drive size

    Ensure accurate sizing and regular monitoring of the SQL Server TempDB to optimize performance and maintain overall stability. Properly configured TempDB prevents performance issues and instability. Insufficient space or high contention can lead to query slowdowns, application timeouts, and system crashes.

    TempDB drive size percentages are as follows:

    • Under-provisioned: < 10%

    • Optimized: 10-20%

    • Over-provisioned: > 20%
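The sizing bands above can be expressed as a small lookup. This is a sketch under stated assumptions: the metric keys and function names are hypothetical, and headroom is assumed to be the file system capacity beyond the provisioned volumes, expressed as a percentage of their total size (which is consistent with 1.35 times corresponding to 35%). The text does not say what the log drive and TempDB drive percentages are measured against, so the classifier takes the percentage as given.

```python
# (lower, upper) bounds of the "optimized" band in percent, taken from the text.
OPTIMIZED_BANDS = {
    "file_system_headroom": (35, 100),
    "log_drive": (20, 30),
    "tempdb_drive": (10, 20),
}


def classify(metric, percent):
    """Map a percentage to the status names used in the text."""
    low, high = OPTIMIZED_BANDS[metric]
    if percent < low:
        return "under-provisioned"
    if percent > high:
        return "over-provisioned"
    return "optimized"


def headroom_pct(file_system_capacity, total_volume_size):
    """Assumed definition: capacity beyond the volumes, as a % of total volume size."""
    if total_volume_size <= 0:
        raise ValueError("total_volume_size must be positive")
    return (file_system_capacity - total_volume_size) * 100 / total_volume_size


# A file system sized at 1.35 times 1000 GiB of provisioned volumes.
print(classify("file_system_headroom", headroom_pct(1350, 1000)))
print(classify("log_drive", 15))
print(classify("tempdb_drive", 25))
```

Values that fall exactly on a boundary (for example 35% headroom) are treated as optimized, matching the inclusive ranges listed above.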

Storage layout

  • Data files (.mdf) placement

    Separating data and log files onto different drives improves performance by allowing simultaneous I/O activity, independent backup schedules, and improved restore functionality. We recommend separating data and log LUN paths into different volumes for smaller databases. This separation is required when there is more than one large database (> 500 GiB).

  • Log files (.ldf) placement

    Separating data and log files onto different drives improves performance by allowing simultaneous I/O activity, independent backup schedules, and improved restore functionality. We recommend separating data and log LUN paths into different volumes for smaller databases. This separation is required when there is more than one large database (> 500 GiB).

  • TempDB placement

    Isolate TempDB I/O and avoid I/O contention from other databases by placing TempDB on its own dedicated drive. This optimization improves overall SQL Server performance and stability. Failure to do so can result in significant I/O bottlenecks, slower query performance, and potential system instability.

Storage configuration

  • ONTAP configuration

    Entity: Volume

    Settings:

    • Thin provisioning (-space-guarantee = none)

    • Autosize on

    • Autosize-mode = grow

    • Fractional reserve = 0%

    • Snapshot copy reserve = 0%

    • Snapshot autodelete (volume/oldest first)

    • Space-mgmt-try-first = volume_grow

    Recommendation: To optimize storage efficiency and cost-effectiveness, configure thin provisioning, autosize, and space management options for your FSx for ONTAP volumes. Without thin provisioning, storage is allocated upfront, leading to inefficient use and higher costs due to over-provisioning; static allocation results in paying for unused capacity, increasing expenses; lack of dynamic allocation hampers scalability and flexibility, impacting performance; and without space reclamation, deleted data occupies space, reducing efficiency.

    Entity: Volume

    Settings:

    • Tiering-policy = snapshot-only

    • Tiering-minimum-cooling-days = 7

    Recommendation: For optimal database performance and cost efficiency, Workload Factory recommends moving only snapshots to the capacity tier. This strategy ensures high performance while reducing costs. It is especially recommended to tier snapshots that are older than 7 days.

    Entity: LUN

    Setting: OS type = windows_2008

    Recommendation: ONTAP LUN OS type value should match the operating system partitioning scheme to achieve I/O alignment. Incorrect configuration may result in suboptimal performance.

    Entity: LUN

    Setting: Space reservation enabled

    Recommendation: When space reservation is enabled, ONTAP reserves enough space in the volume so that writes to those LUNs do not fail because of a lack of disk space.

    Entity: LUN

    Setting: Space allocation enabled

    Recommendation: This option ensures that FSx for ONTAP notifies the EC2 host when the volume is full and cannot accept writes. This setting also allows FSx for ONTAP to automatically reclaim space when SQL Server on the EC2 host deletes data. If disabled, write failures are possible and space might be inefficiently utilized.

  • Windows storage configuration

    Entity: Microsoft Multipath I/O (MPIO)

    Settings:

    • Status = Enabled

    • Policy = Round Robin

    • Number of sessions = 5

    Recommendation: To ensure optimal uptime and data access consistency for Microsoft SQL Server databases on EC2 with underlying LUNs provisioned in FSx for ONTAP, Workload Factory recommends enabling and configuring Multipath I/O (MPIO). MPIO provides multiple paths to FSx for ONTAP, enhancing both resiliency and performance. This best practice protects against potential data loss or downtime by maintaining data access even if a component fails.

    Entity: Allocation unit size

    Setting: NTFS allocation unit size = 64K

    Recommendation: Set NTFS allocation unit size to 64K to better utilize disk space, reduce fragmentation, and improve file read/write performance. Failure to configure this properly might lead to inefficient disk usage and degraded performance.
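One way to reason about the storage configuration recommendations above is as a comparison between observed settings and a table of recommended values. The sketch below is only an illustration: the dictionary keys are hypothetical names chosen for readability (they are not ONTAP or Windows parameter names), the table covers most but not all of the listed settings, and it does not reflect how Workload Factory collects or evaluates these values.

```python
# Recommended values taken from the text; key names are hypothetical.
RECOMMENDED = {
    "volume": {
        "space_guarantee": "none",
        "autosize_mode": "grow",
        "fractional_reserve_pct": 0,
        "snapshot_reserve_pct": 0,
        "space_mgmt_try_first": "volume_grow",
        "tiering_policy": "snapshot-only",
        "tiering_minimum_cooling_days": 7,
    },
    "lun": {
        "os_type": "windows_2008",
        "space_reservation": True,
        "space_allocation": True,
    },
    "windows": {
        "mpio_enabled": True,
        "mpio_policy": "Round Robin",
        "sessions": 5,
        "ntfs_allocation_unit_bytes": 64 * 1024,
    },
}


def find_deviations(entity, actual):
    """Return {setting: (actual, recommended)} for every setting that differs.

    A setting that is missing from `actual` is reported with an actual value of None.
    """
    return {
        name: (actual.get(name), expected)
        for name, expected in RECOMMENDED[entity].items()
        if actual.get(name) != expected
    }


# Hypothetical observed volume: thick-provisioned, otherwise as recommended.
observed_volume = dict(RECOMMENDED["volume"], space_guarantee="volume")
print(find_deviations("volume", observed_volume))
```

Keeping the recommendations in a single table makes it straightforward to add or adjust a setting without changing the comparison logic.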

Compute

  • Compute rightsizing

    To ensure optimal performance and cost efficiency for your SQL Server EC2 instance, we recommend rightsizing based on your workload demands. If your current instance is under-provisioned, upgrading will enhance CPU, memory, and I/O capacity. If it is over-provisioned, downgrading will maintain performance while reducing costs.

  • Operating system patch

    Whenever possible, apply the latest patches to ensure security and stability. Applying the latest patch helps protect your SQL Server databases from vulnerabilities and significantly improves overall system reliability.

  • Network adapter settings

    Accurate configuration of receive side scaling (RSS) is essential for optimal network performance in Microsoft SQL Server instances. RSS distributes network processing across multiple processors, preventing bottlenecks and enhancing system performance. Workload Factory recommends the following RSS settings:

    • Disable TCP Offloading Features: Ensure all TCP offloading features are disabled.

    • Number of Receive Queues: Set to 8 if vCPUs > 8. Set to the number of vCPUs if vCPUs ≤ 8.

    • RSS Profile: Set to NUMAStatic.

    • Base Processor Number: Set to 2.

      Following these settings will improve the performance and reliability of your Microsoft SQL Server instances. We suggest that you test the recommended settings to determine performance improvements before making changes to your production environment.
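The receive queue rule above amounts to capping the queue count at 8. As a minimal sketch, with hypothetical function and key names and no claim about how the settings are applied on the host:

```python
MAX_RECEIVE_QUEUES = 8  # cap stated in the text


def recommended_receive_queues(vcpus):
    """8 when vCPUs > 8, otherwise the number of vCPUs."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return min(vcpus, MAX_RECEIVE_QUEUES)


def recommended_rss_settings(vcpus):
    """Collect the RSS recommendations from the text for a given vCPU count."""
    return {
        "tcp_offloading_enabled": False,
        "receive_queues": recommended_receive_queues(vcpus),
        "rss_profile": "NUMAStatic",
        "base_processor_number": 2,
    }


for vcpus in (4, 8, 16):
    print(vcpus, recommended_receive_queues(vcpus))
```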

Application (SQL Server)

  • License

    The SQL Server license assessment and recommendation are provided at the host level.

    Not optimized: A license is considered "not optimized" when Workload Factory detects that your database infrastructure doesn't use any of the commercial software license features you're paying for. An unoptimized license might result in unnecessary costs.

    Optimized: A license is considered "optimized" when the commercial software license for your databases meets your performance requirements.

  • Microsoft SQL Server patch

    Whenever possible, apply the latest patches to ensure security and stability. Applying the latest patch helps protect your SQL Server databases from vulnerabilities and significantly improves overall system reliability.

  • MAXDOP

    Set the Maximum Degree of Parallelism (MAXDOP) to optimize query performance by balancing parallel processing. Accurate MAXDOP configuration enhances performance and efficiency. Setting MAXDOP to 4, 8, or 16 generally provides the best results in most use cases. We recommend that you test your workload and monitor for any parallelism-related wait types such as CXPACKET.

Reliability

  • Schedule FSx for ONTAP backups

    Backing up your Microsoft SQL Server volumes is crucial for supporting your data retention and compliance requirements. Use FSx for ONTAP backup to implement a centrally managed, automated backup and retention strategy for your SQL Server data.

  • Schedule local snapshots

    Schedule local snapshots for efficient backup and quick restores. Snapshots are instant, point-in-time images of your volumes.

  • Cross-region replication

    Cross-region replication ensures that your data is replicated to another AWS region, providing enhanced data durability and availability. Workload Factory recommends configuring cross-region replication for disaster recovery and compliance requirements.