Mastering Prometheus Integration: From Batch Jobs to Production Monitoring
Introduction
Prometheus has become the de facto standard for monitoring modern infrastructure, but integrating batch jobs and scheduled workflows presents unique challenges. This blog post explores advanced Prometheus integration patterns, focusing on the Pushgateway pattern for batch job monitoring and best practices for production deployments.
Understanding Prometheus Architecture Patterns
Pull vs Push: When to Use Each
Prometheus traditionally operates on a pull model, where the Prometheus server scrapes metrics from target applications. However, batch jobs and short-lived processes require a different approach.
Pull Model Limitations for Batch Jobs
- Ephemeral Nature: Batch jobs may complete before Prometheus can scrape them
- Network Topology: Jobs behind firewalls or NAT may not be directly accessible
- Scheduling Conflicts: Scrape intervals may not align with job execution windows
Push Model Benefits
- Job Lifecycle Independence: Metrics persist after job completion
- Network Flexibility: Works in complex network environments
- Batch Compatibility: Perfect for scheduled and ad-hoc jobs
Implementing Pushgateway Integration
Basic Pushgateway Setup
The Pushgateway acts as an intermediary, receiving pushed metrics and exposing them for Prometheus to scrape:
# Push metrics to the Pushgateway (the payload must end with a newline,
# hence the echo pipe)
echo "metric_name value" | curl -X POST \
-H "Content-Type: text/plain" \
--data-binary @- \
"http://pushgateway:9091/metrics/job/job_name"
Advanced Integration Patterns
Structured Metric Formatting
METRIC_DATA="msisdn_stock_available{job=\"$JOB_NAME\",instance=\"$INSTANCE_NAME\"} $MSISDN_COUNT
msisdn_last_updated{job=\"$JOB_NAME\",instance=\"$INSTANCE_NAME\"} $(date +%s)
"
This approach demonstrates:
- Multi-metric Payloads: Sending multiple related metrics in a single request
- Timestamp Integration: Including execution timestamps for correlation
- Label Consistency: Maintaining consistent labeling across metrics
Error Handling and Reliability
curl_response=$(curl -X POST \
-H "Content-Type: text/plain" \
--data-binary "$METRIC_DATA" \
--write-out "%{http_code}" \
--silent \
--output /dev/null \
"$PUSHGATEWAY_URL/metrics/job/$JOB_NAME/instance/$INSTANCE_NAME") if [ "$curl_response" -eq 200 ]; then
echo "✅ Successfully pushed metrics to Prometheus Gateway"
else
echo "❌ Failed to push metrics (HTTP $curl_response)"
exit 1
fi
Robust error handling ensures:
- HTTP Status Validation: Verifying successful metric delivery
- Graceful Failure: Appropriate error messages and exit codes
- Monitoring Health: Ability to monitor the monitoring system itself
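For scripts that push metrics from several places, this pattern is worth wrapping in a small helper. The sketch below is illustrative; the function name is ours, and the variables ($PUSHGATEWAY_URL, $JOB_NAME, $INSTANCE_NAME) are assumed to be set as in the earlier examples:
# Reusable push helper: returns non-zero when the Pushgateway does not answer 200
push_metrics() {
    local payload="$1"
    local status
    status=$(curl -X POST \
        -H "Content-Type: text/plain" \
        --data-binary "$payload" \
        --write-out "%{http_code}" --silent --output /dev/null \
        "$PUSHGATEWAY_URL/metrics/job/$JOB_NAME/instance/$INSTANCE_NAME")
    [ "$status" -eq 200 ]
}

# Usage
push_metrics "$METRIC_DATA" || exit 1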
Metric Design Best Practices
Naming Conventions
Meaningful Names
# Good
msisdn_stock_available
http_requests_duration_seconds
database_connections_active

# Bad
stock
requests
connections
Unit Specification
# Time-based metrics
response_time_seconds
process_uptime_seconds

# Count-based metrics
items_processed_total
errors_total

# Gauge metrics
memory_usage_bytes
cpu_utilization_percent
Label Strategy
Consistent Labeling
msisdn_stock_available{
job="msisdn_stock_monitor",
instance="wireless_fulfillment_scripts",
environment="production",
region="us-east-1"
}
Label Cardinality Management
# Good - Low cardinality
labels="job=\"$JOB_NAME\",instance=\"$INSTANCE_NAME\",environment=\"$ENV\"" # Bad - High cardinality (avoid user_id, session_id, etc.)
labels="job=\"$JOB_NAME\",user_id=\"$USER_ID\",session=\"$SESSION\""
Every unique combination of label values creates a separate time series, so high-cardinality labels can overwhelm Prometheus and should be avoided.
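A quick way to keep an eye on cardinality is to count how many series a metric currently has via the Prometheus HTTP API. This sketch assumes a reachable server at $PROMETHEUS_URL and jq for parsing; both are assumptions, not part of the setup above:
# Count the distinct series behind a metric name (high or growing counts are a warning sign)
SERIES_COUNT=$(curl -s -G "$PROMETHEUS_URL/api/v1/series" \
    --data-urlencode 'match[]=msisdn_stock_available' | \
    jq '.data | length')
echo "msisdn_stock_available currently has $SERIES_COUNT series"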
Production Integration Patterns
Environment-Specific Configuration
# Environment-based Pushgateway selection
case "$ENVIRONMENT" in
"production")
PUSHGATEWAY_URL="https://pushgateway.prod.company.com:9091"
;;
"staging")
PUSHGATEWAY_URL="https://pushgateway.staging.company.com:9091"
;;
"development")
PUSHGATEWAY_URL="https://pushgateway.dev.company.com:9091"
;;
esac
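A sensible hardening step (illustrative, not part of the original script) is to fail fast when the case statement falls through without selecting a URL:
# Abort early if no Pushgateway was selected for this environment
if [ -z "${PUSHGATEWAY_URL:-}" ]; then
    echo "No Pushgateway configured for environment: $ENVIRONMENT" >&2
    exit 1
fi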
Authentication and Security
API Token Authentication
curl -X POST \
-H "Content-Type: text/plain" \
-H "Authorization: Bearer <REDACTED>
--data-binary "$METRIC_DATA" \
"$PUSHGATEWAY_URL/metrics/job/$JOB_NAME"
TLS Configuration
curl -X POST \
--cacert /path/to/ca-bundle.crt \
--cert /path/to/client.crt \
--key /path/to/client.key \
-H "Content-Type: text/plain" \
--data-binary "$METRIC_DATA" \
"$PUSHGATEWAY_URL/metrics/job/$JOB_NAME"
Metric Lifecycle Management
Metric Cleanup
# Delete metrics for completed job
curl -X DELETE \
"$PUSHGATEWAY_URL/metrics/job/$JOB_NAME/instance/$INSTANCE_NAME"
Metric Grouping
# Group related metrics by instance
curl -X POST \
-H "Content-Type: text/plain" \
--data-binary "$METRIC_DATA" \
"$PUSHGATEWAY_URL/metrics/job/$JOB_NAME/instance/$INSTANCE_NAME"
Advanced Monitoring Patterns
Composite Metrics
# Calculate and push derived metrics
SUCCESS_RATE=$(echo "scale=2; $SUCCESSFUL_OPERATIONS * 100 / $TOTAL_OPERATIONS" | bc)

METRIC_DATA="
job_success_rate{job=\"$JOB_NAME\"} $SUCCESS_RATE
job_operations_successful{job=\"$JOB_NAME\"} $SUCCESSFUL_OPERATIONS
job_operations_total{job=\"$JOB_NAME\"} $TOTAL_OPERATIONS
job_duration_seconds{job=\"$JOB_NAME\"} $JOB_DURATION
"
Histogram Metrics for Batch Jobs
# Custom histogram implementation for batch processing
declare -A histogram_buckets
histogram_buckets["le_1"]=0
histogram_buckets["le_5"]=0
histogram_buckets["le_10"]=0
histogram_buckets["le_inf"]=0 # Process items and populate buckets
for duration in $processing_times; do
if [ "$duration" -le 1 ]; then
((histogram_buckets["le_1"]++))
elif [ "$duration" -le 5 ]; then
((histogram_buckets["le_5"]++))
elif [ "$duration" -le 10 ]; then
((histogram_buckets["le_10"]++))
fi
((histogram_buckets["le_inf"]++))
done # Push histogram data
HISTOGRAM_DATA="
processing_duration_seconds_bucket{le=\"1\"} ${histogram_buckets["le_1"]}
processing_duration_seconds_bucket{le=\"5\"} ${histogram_buckets["le_5"]}
processing_duration_seconds_bucket{le=\"10\"} ${histogram_buckets["le_10"]}
processing_duration_seconds_bucket{le=\"+Inf\"} ${histogram_buckets["le_inf"]}
processing_duration_seconds_count ${histogram_buckets["le_inf"]}
processing_duration_seconds_sum $total_processing_time
"
Operational Excellence
Health Check Integration
# Include health check metrics
HEALTH_STATUS=1 # 1 for healthy, 0 for unhealthy

if ! check_dependency_health; then
    HEALTH_STATUS=0
fi

METRIC_DATA="$METRIC_DATA
job_health_status{job=\"$JOB_NAME\"} $HEALTH_STATUS
"
Performance Monitoring
# Track job performance metrics
START_TIME=$(date +%s)

# ... job execution ...

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

METRIC_DATA="$METRIC_DATA
job_execution_duration_seconds{job=\"$JOB_NAME\"} $DURATION
job_last_execution_timestamp{job=\"$JOB_NAME\"} $END_TIME
"
Resource Usage Tracking
# Memory and CPU usage tracking
MEMORY_USAGE=$(ps -o pid,vsz,rss -p $$ | tail -1 | awk '{print $3}')
CPU_USAGE=$(ps -o pid,pcpu -p $$ | tail -1 | awk '{print $2}')

METRIC_DATA="$METRIC_DATA
job_memory_usage_kb{job=\"$JOB_NAME\"} $MEMORY_USAGE
job_cpu_usage_percent{job=\"$JOB_NAME\"} $CPU_USAGE
"
Alerting Integration
Threshold-Based Alerts
# Prometheus alerting rule
- alert: MSISDNStockLow
  expr: msisdn_stock_available < 1000
  for: 5m
  labels:
    severity: warning
    service: msisdn_management
  annotations:
    summary: "MSISDN stock is running low"
    description: "Available MSISDN count is {{ $value }}, below threshold of 1000"
Job Failure Detection
- alert: BatchJobFailed
  expr: time() - job_last_execution_timestamp > 86400
  for: 10m
  labels:
    severity: critical
    service: batch_processing
  annotations:
    summary: "Batch job has not run successfully in 24 hours"
    description: "Job {{ $labels.job }} last successful execution was {{ $value }} seconds ago"
Grafana Dashboard Integration
Dashboard Design Patterns
Time-series Visualization
# PromQL for stock levels over time
msisdn_stock_available{job="msisdn_stock_monitor"}
Success Rate Calculation
# Calculate success rate
rate(job_operations_successful[5m]) / rate(job_operations_total[5m]) * 100
Performance Trends
# Average job duration over time
avg_over_time(job_execution_duration_seconds[1h])
Multi-dimensional Analysis
# Stock levels by region
sum by (region) (msisdn_stock_available)

# Success rates by environment
avg by (environment) (job_success_rate)
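Panel queries can be sanity-checked outside Grafana against the Prometheus HTTP API. This sketch assumes a reachable server at $PROMETHEUS_URL and jq for formatting; both are assumptions:
# Run a dashboard query directly against Prometheus and print region/value pairs
curl -s -G "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode 'query=sum by (region) (msisdn_stock_available)' | \
    jq '.data.result[] | {region: .metric.region, value: .value[1]}'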
Troubleshooting Common Issues
Metric Retention
Pushgateway Cleanup
# Regular cleanup of stale metrics
curl -X DELETE "$PUSHGATEWAY_URL/metrics/job/$OLD_JOB_NAME"
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true # keep the job/instance labels attached by the pushing scripts
    static_configs:
      - targets: ['pushgateway:9091']
    scrape_interval: 5s
    metrics_path: /metrics
Network and Connectivity Issues
Retry Logic
retry_count=0
max_retries=3

while [ $retry_count -lt $max_retries ]; do
    if curl -X POST "$PUSHGATEWAY_URL/metrics/job/$JOB_NAME" \
        -H "Content-Type: text/plain" \
        --data-binary "$METRIC_DATA"; then
        break
    fi
    ((retry_count++))
    sleep $((2 ** retry_count)) # Exponential backoff
done
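Depending on how critical the metrics are, exhausting the retries can be treated as a hard failure (an optional addition, not part of the loop above):
# Surface the failure to the caller once all retries are used up
if [ "$retry_count" -eq "$max_retries" ]; then
    echo "Failed to push metrics after $max_retries attempts" >&2
    exit 1
fi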
Connection Validation
# Validate connectivity before pushing metrics
if ! curl -s --head "$PUSHGATEWAY_URL" > /dev/null; then
echo "Cannot connect to Pushgateway at $PUSHGATEWAY_URL"
exit 1
fi
Performance Optimization
Batch Processing
# Batch multiple metrics in single request
METRIC_BATCH=""
for metric in "${metrics[@]}"; do
METRIC_BATCH="$METRIC_BATCH$metric"$'\n'
done

curl -X POST \
-H "Content-Type: text/plain" \
--data-binary "$METRIC_BATCH" \
"$PUSHGATEWAY_URL/metrics/job/$JOB_NAME"
Compression
# Use gzip compression for large metric payloads
curl -X POST \
-H "Content-Type: text/plain" \
-H "Content-Encoding: gzip" \
--data-binary @<(echo "$METRIC_DATA" | gzip) \
"$PUSHGATEWAY_URL/metrics/job/$JOB_NAME"
Conclusion
Effective Prometheus integration requires understanding both the technical implementation details and the operational patterns that ensure reliable monitoring. The Pushgateway pattern enables monitoring of batch jobs and scheduled workflows while maintaining the benefits of Prometheus's powerful querying and alerting capabilities.
Key success factors include:
- Proper Metric Design: Meaningful names, consistent labels, and appropriate metric types
- Robust Integration: Error handling, retry logic, and health checking
- Security Implementation: Authentication, encryption, and access controls
- Operational Excellence: Cleanup procedures, performance optimization, and troubleshooting
By following these patterns and practices, organizations can build reliable, scalable monitoring systems that provide deep visibility into batch job performance and infrastructure health.
Technical Summary
- Pushgateway Integration: Proper use of push patterns for batch job monitoring
- Metric Design: Best practices for naming, labeling, and structuring metrics
- Error Handling: Robust error detection and retry mechanisms
- Security: Authentication, TLS, and access control implementation
- Performance: Optimization techniques for large-scale deployments
- Operations: Monitoring, alerting, and troubleshooting strategies