Security and best practices for MLOps on Azure

Important: Core Security Principles
  • Least privilege everywhere
  • Prefer identity-based (RBAC) access
  • Separate dev, staging, and prod
  • Make pipelines reproducible and auditable
  • Log enough to detect issues without leaking sensitive data

Identity and access (RBAC, role-based access control)

  • Assign roles at the smallest scope possible (resource group, workspace, storage container)
  • Separate duties
    • Data owners manage data access
    • ML engineers run training jobs
    • Release owners deploy to production
Tip: Best Practice

Use Azure’s built-in roles (e.g., Reader, Contributor, AzureML Data Scientist) where possible. Create custom roles only when necessary.

Secrets management

  • Store secrets in Azure Key Vault
  • Access Key Vault using Managed Identity (avoid connection strings in code)
  • Do not put secrets in
    • Git repos
    • notebooks
    • environment variables committed to YAML files
  • Rotate secrets regularly if any must exist
Warning: Common Mistake

Never hardcode connection strings or API keys in notebooks or code. Always use Key Vault with Managed Identity.
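A minimal sketch of this pattern (the helper name and env-var fallback are illustrative, not an Azure API): in Azure you pass in a Key Vault `SecretClient` built with `DefaultAzureCredential`, so Managed Identity supplies the credential and no key material appears in the repo; locally you fall back to an environment variable for development only.

```python
import os
from typing import Optional


def get_secret(name: str, client: Optional[object] = None) -> str:
    """Resolve a secret without hardcoding it anywhere.

    In Azure, pass a Key Vault client built with Managed Identity:

        from azure.identity import DefaultAzureCredential
        from azure.keyvault.secrets import SecretClient
        client = SecretClient("https://<vault>.vault.azure.net",
                              DefaultAzureCredential())

    Locally (tests, dev boxes) fall back to an environment variable.
    """
    if client is not None:
        # Key Vault path: the credential comes from Managed Identity,
        # so no connection string or API key lives in code.
        return client.get_secret(name).value
    # Dev-only fallback; never commit the variable's value.
    return os.environ[name.upper().replace("-", "_")]
```

Keeping the fallback behind the same function means the Key Vault path is the default in deployed code and the secret name is the only thing the caller knows.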

Data protection

  • Store data in Azure Storage (Blob or ADLS Gen2)
  • Disable public blob access
  • Use RBAC for data access instead of storage account keys when possible
  • Keep clear data zones
    • raw
    • processed
    • features
    • evaluation
  • Track dataset versions (Azure ML Data assets) and record which version trained each model
Note: Data Versioning

Tracking which dataset version trained each model is critical for reproducibility and debugging production issues.
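One way to keep the zones consistent is to derive every ADLS Gen2 path from a single helper; the layout below (one container per zone, version suffix per dataset) and the account placeholder are assumptions for illustration, not a prescribed convention.

```python
def data_path(zone: str, dataset: str, version: int,
              account: str = "<storage-account>") -> str:
    """Build a consistent ADLS Gen2 path for a data zone and dataset version.

    Zones mirror the list above; the account name is a placeholder
    supplied per environment (dev / staging / prod).
    """
    zones = ("raw", "processed", "features", "evaluation")
    if zone not in zones:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"abfss://{zone}@{account}.dfs.core.windows.net/{dataset}/v{version}"
```

Recording the returned path (with its version) alongside each training run ties the model back to the exact data it saw.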

Network protection

  • Prefer private access when data is sensitive
Tip: Production Recommendation

For production workloads with sensitive data, use Azure Private Link to keep traffic within the Azure network.

Reproducible pipelines

  • Run preprocessing and training as Azure ML Jobs (not only notebooks)
  • Pin dependencies (requirements or conda environment)
  • Version everything in git
    • code
    • environment files
    • pipeline definitions
    • deployment configs
  • Capture metadata in every run
    • data version
    • git commit SHA
    • metrics
    • environment version
Important: Reproducibility Checklist

Every training run should be fully reproducible. If you can’t recreate a model from its metadata, your pipeline needs improvement.
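The metadata capture above can be sketched as a small helper that every training entry point calls before logging; the function names are illustrative, and the git lookup degrades to "unknown" outside a repo rather than failing the run.

```python
import json
import subprocess


def current_git_sha() -> str:
    """Return the checked-out commit SHA, or 'unknown' outside a repo."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_metadata(data_version: str, environment: str, metrics: dict) -> str:
    """Bundle the metadata every training run should record, as JSON."""
    return json.dumps({
        "data_version": data_version,  # e.g. an Azure ML data asset version
        "git_sha": current_git_sha(),  # exact code state
        "environment": environment,    # pinned environment name/version
        "metrics": metrics,            # summary metrics for the run
    }, sort_keys=True)
```

If you can rebuild the environment, check out the SHA, and fetch the data version from this record, the run is reproducible; if any field is "unknown", that is the gap to fix.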

Model registry and promotion

  • Register models in Azure ML Model Registry
  • Require minimum metadata tags
    • data_version
    • git_sha
    • metric_summary
    • owner
  • Use promotion stages
    • candidate
    • validated
    • production
  • Avoid “latest” for prod deployments (deploy specific versions only)
Warning: Deployment Safety

Never deploy “latest” to production. Always deploy specific, validated model versions with known performance characteristics.
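The promotion rules above can be enforced with a simple gate in the deployment pipeline; here the registry record is stood in by a plain dict, so the shape is an assumption rather than the Azure ML API.

```python
REQUIRED_TAGS = {"data_version", "git_sha", "metric_summary", "owner"}


def ready_to_deploy(model: dict) -> bool:
    """Gate production deployment: a pinned explicit version, all
    required metadata tags, and a stage of at least 'validated'."""
    version = model.get("version")
    if version in (None, "", "latest"):
        return False  # deploy explicit versions only, never "latest"
    if not REQUIRED_TAGS <= set(model.get("tags", {})):
        return False  # mandatory metadata is missing
    return model.get("stage") in ("validated", "production")
```

Running this check in CI before the deploy step turns the policy into a hard failure instead of a convention.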

Deployment hardening

  • Enable authentication
  • Validate inputs (types, ranges, schema) before scoring
  • Do not log raw sensitive payloads
    • log request IDs, latency, error codes, and summary stats instead
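A sketch of payload-free telemetry, assuming the fields above (the function name and record shape are illustrative): the scoring handler logs a generated request ID, latency, status, and a summary stat, and never the raw features.

```python
import logging
import uuid

logger = logging.getLogger("scoring")


def request_telemetry(latency_ms: float, status: int, n_features: int) -> dict:
    """Build a log record with no raw payload: request ID, latency,
    status code, and summary stats only."""
    record = {
        "request_id": str(uuid.uuid4()),  # correlates calls without PII
        "latency_ms": round(latency_ms, 1),
        "status": status,
        "n_features": n_features,         # a summary stat, not the features
    }
    logger.info("scored request: %s", record)
    return record
```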
Tip: Input Validation

Always validate input data before scoring. Invalid inputs can cause errors, security issues, or incorrect predictions.
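As a minimal sketch of pre-scoring validation (the schema shape — field name mapped to type and range — is an assumption for illustration): collect every problem rather than failing on the first, so the caller gets a complete error response.

```python
def validate_payload(payload: dict, schema: dict) -> list:
    """Check presence, type, and range for each field before scoring.

    `schema` maps field name -> (type, min, max); returns a list of
    human-readable problems (empty means the payload is acceptable).
    """
    errors = []
    for field, (ftype, lo, hi) in schema.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
            continue
        value = payload[field]
        if not isinstance(value, ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
        elif not (lo <= value <= hi):
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors
```

The endpoint rejects the request (with the error list, not the payload) whenever the returned list is non-empty.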

Monitoring and alerts

  • Use Application Insights + Log Analytics for endpoint telemetry
  • Monitor service health
    • latency
    • error rate
    • throughput
  • Monitor model signals
    • feature drift (compare to training stats)
    • prediction drift
  • Set alerts to notify owners when thresholds are exceeded
Note: Drift Detection

Model drift and data drift are inevitable in production. Set up monitoring to detect them early before they impact users.
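A deliberately simple drift check, comparing live feature values to stored training statistics as the list above suggests; production systems often use PSI or KS tests instead, but the alerting shape is the same.

```python
from statistics import mean


def mean_drift(train_mean: float, train_std: float,
               live_values: list, threshold: float = 3.0) -> bool:
    """Flag a feature as drifted when its live mean sits more than
    `threshold` training standard deviations from the training mean."""
    if train_std == 0:
        # Constant feature in training: any change at all is drift.
        return mean(live_values) != train_mean
    z = abs(mean(live_values) - train_mean) / train_std
    return z > threshold
```

Run this per feature on a schedule over recent requests, and route a `True` result to the alert that notifies the owner.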

Code Safety

  • Require PR reviews and CI checks before merge
  • Pin dependencies and avoid installing random packages at runtime
  • Store images in Azure Container Registry (ACR)
Tip: Dependency Management

Pin exact versions in requirements.txt (e.g., pandas==2.0.3 not pandas>=2.0). This prevents unexpected breaking changes.
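Pinning can be enforced as one of the CI checks mentioned above; a sketch (the function name is illustrative) that flags any requirements line lacking an exact `==` pin:

```python
def unpinned(requirements: list) -> list:
    """Return requirement lines that are not pinned to an exact
    version -- suitable as a CI gate before merge."""
    return [
        line.strip() for line in requirements
        if line.strip()                      # skip blank lines
        and not line.strip().startswith("#")  # skip comments
        and "==" not in line                 # flag anything not pinned
    ]
```

Failing the build when this returns a non-empty list keeps `>=` and unversioned packages out of production images.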

Compute hygiene and cost guardrails

  • Prefer autoscaling compute clusters for jobs and scale down to zero
  • Stop/delete unused compute instances
  • Set budgets and alerts per resource group
  • Tag resources with owner and environment
Warning: Cost Management

Unused compute instances can drain your Azure credits quickly. Always stop or delete them when not in use, and set up budget alerts.
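The tagging rule above is easy to audit mechanically; a sketch (resource inventory shape assumed, not an Azure API) that a scheduled job could run to find resources missing the required tags:

```python
REQUIRED_RESOURCE_TAGS = {"owner", "environment"}


def untagged(resources: dict) -> list:
    """Return names of resources missing a required tag, so a
    scheduled job can notify owners (or flag orphans with none)."""
    return sorted(
        name for name, tags in resources.items()
        if not REQUIRED_RESOURCE_TAGS <= set(tags)
    )
```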
