【Envoy-02】Monitoring, Performance, and Troubleshooting

Posted by Hao Liang's Blog on Friday, December 8, 2023

1. Envoy Observability

Concept:

  • Mechanisms to observe Envoy’s state
  • Debugging and monitoring Envoy

Overview:

  • Admin interface
    • stats
    • config dump
    • clusters
    • log level
  • Debug logs
  • Access logs
  • Metrics Collection
  • Tracing

2. Admin Interface

  • /stats: counters, gauges, and histograms describing Envoy's current state (e.g. how many requests were received, how many succeeded, how many failed)
  • /config_dump: dumps the configuration Envoy currently has loaded
  • /clusters: the actual membership of each cluster, with per-host statistics
  • /logging: shows the active loggers and changes their log levels

# https://github.com/solo-io/hoot/blob/master/02-observe/stats.yaml
admin:
  access_log_path: /dev/stdout
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: edge_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: namespace.local_service
                      virtual_clusters:
                        - name: actions
                          headers:
                            - name: ":path"
                              prefix_match: "/foo"
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: somecluster }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: somecluster
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: somecluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 8082

Run Envoy with the stats config:

# envoy -c ./stats.yaml

The admin UI is available at http://127.0.0.1:9901

Envoy stats

stats: http://127.0.0.1:9901/stats?filter=&format=html&type=All&histogram_buckets=cumulative

# cluster.<cluster_name>.<stats_name>: <stats>
cluster.somecluster.assignment_stale: 0
cluster.somecluster.assignment_timeout_received: 0
cluster.somecluster.assignment_use_cached: 0
cluster.somecluster.bind_errors: 0
cluster.somecluster.circuit_breakers.default.cx_open: 0
cluster.somecluster.circuit_breakers.default.cx_pool_open: 0
cluster.somecluster.circuit_breakers.default.rq_open: 0
cluster.somecluster.circuit_breakers.default.rq_pending_open: 0
cluster.somecluster.circuit_breakers.default.rq_retry_open: 0
cluster.somecluster.circuit_breakers.high.cx_open: 0
cluster.somecluster.circuit_breakers.high.cx_pool_open: 0
cluster.somecluster.circuit_breakers.high.rq_open: 0
cluster.somecluster.circuit_breakers.high.rq_pending_open: 0
cluster.somecluster.circuit_breakers.high.rq_retry_open: 0
cluster.somecluster.default.total_match_count: 73
cluster.somecluster.lb_healthy_panic: 0
cluster.somecluster.lb_local_cluster_not_ok: 0
cluster.somecluster.lb_recalculate_zone_structures: 0
cluster.somecluster.lb_subsets_active: 0
cluster.somecluster.lb_subsets_created: 0
cluster.somecluster.lb_subsets_fallback: 0
cluster.somecluster.lb_subsets_fallback_panic: 0
cluster.somecluster.lb_subsets_removed: 0
cluster.somecluster.lb_subsets_selected: 0
cluster.somecluster.lb_zone_cluster_too_small: 0
cluster.somecluster.lb_zone_no_capacity_left: 0
cluster.somecluster.lb_zone_number_differs: 0
cluster.somecluster.lb_zone_routing_all_directly: 0
cluster.somecluster.lb_zone_routing_cross_zone: 0
cluster.somecluster.lb_zone_routing_sampled: 0
cluster.somecluster.max_host_weight: 1
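
The plain-text output above is one "name: value" pair per line, so it is easy to post-process. Here is a minimal sketch, assuming the text has already been fetched from the /stats endpoint (for example with curl). The helper name parse_stats is hypothetical and not part of Envoy, and it deliberately skips histogram lines and any other value that is not a plain integer.

```python
def parse_stats(text):
    """Parse Envoy /stats plain-text output into {name: int}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last ": " so the stat name is kept whole.
        name, sep, value = line.rpartition(": ")
        if not sep:
            continue
        try:
            stats[name] = int(value)
        except ValueError:
            # Histograms and text values are not plain integers; skip them.
            continue
    return stats


# A few lines copied from the output above.
sample = """\
cluster.somecluster.bind_errors: 0
cluster.somecluster.default.total_match_count: 73
cluster.somecluster.max_host_weight: 1
"""

stats = parse_stats(sample)
# Keep only the stats that are not zero, which is usually what you scan for.
nonzero = {name: value for name, value in stats.items() if value != 0}
print(nonzero)
```

Filtering out zero values this way makes a long stats page much quicker to read when you are hunting for errors.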

Envoy config dump

config dump: http://127.0.0.1:9901/config_dump?resource=&mask=&name_regex=
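
The config dump is JSON, so a script can pull out just the part you care about. The sketch below lists cluster names. It assumes the dump has a top-level "configs" list whose clusters section carries "static_clusters" and "dynamic_active_clusters" entries; the sample is a hand-written, heavily trimmed stand-in for a real dump, and cluster_names is a hypothetical helper.

```python
import json


def cluster_names(config_dump):
    """Return the names of static and dynamically loaded clusters."""
    names = []
    for section in config_dump.get("configs", []):
        # Only the clusters section is of interest here (assumed type name).
        if not section.get("@type", "").endswith("ClustersConfigDump"):
            continue
        for key in ("static_clusters", "dynamic_active_clusters"):
            for entry in section.get(key, []):
                names.append(entry["cluster"]["name"])
    return names


# Hand-written, trimmed sample; a real dump contains many more fields.
sample = json.loads("""
{
  "configs": [
    {"@type": "type.googleapis.com/envoy.admin.v3.BootstrapConfigDump"},
    {
      "@type": "type.googleapis.com/envoy.admin.v3.ClustersConfigDump",
      "static_clusters": [
        {"cluster": {"name": "somecluster", "type": "STRICT_DNS"}}
      ]
    }
  ]
}
""")

print(cluster_names(sample))
```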

Envoy cluster discovery

clusters: http://127.0.0.1:9901/clusters

Current members of the cluster:

somecluster::observability_name::somecluster
somecluster::default_priority::max_connections::1024
somecluster::default_priority::max_pending_requests::1024
somecluster::default_priority::max_requests::1024
somecluster::default_priority::max_retries::3
somecluster::high_priority::max_connections::1024
somecluster::high_priority::max_pending_requests::1024
somecluster::high_priority::max_requests::1024
somecluster::high_priority::max_retries::3
somecluster::added_via_api::false
somecluster::127.0.0.1:8082::cx_active::0
somecluster::127.0.0.1:8082::cx_connect_fail::0
somecluster::127.0.0.1:8082::cx_total::0
somecluster::127.0.0.1:8082::rq_active::0
somecluster::127.0.0.1:8082::rq_error::0
somecluster::127.0.0.1:8082::rq_success::0
somecluster::127.0.0.1:8082::rq_timeout::0
somecluster::127.0.0.1:8082::rq_total::0
somecluster::127.0.0.1:8082::hostname::127.0.0.1
somecluster::127.0.0.1:8082::health_flags::healthy
somecluster::127.0.0.1:8082::weight::1
somecluster::127.0.0.1:8082::region::
somecluster::127.0.0.1:8082::zone::
somecluster::127.0.0.1:8082::sub_zone::
somecluster::127.0.0.1:8082::canary::false
somecluster::127.0.0.1:8082::priority::0
somecluster::127.0.0.1:8082::success_rate::-1
somecluster::127.0.0.1:8082::local_origin_success_rate::-1
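
Each line of the /clusters output is a set of fields joined by "::", which makes it simple to group the per-host values. A minimal sketch, assuming IPv4 or hostname endpoints as in the output above (an IPv6 address contains "::" itself and would need a smarter split); parse_clusters is a hypothetical helper.

```python
def parse_clusters(text):
    """Group per-host fields as {(cluster, host): {field: value}}."""
    hosts = {}
    for line in text.splitlines():
        parts = line.strip().split("::")
        # Host lines have four fields. The circuit-breaker settings
        # (default_priority / high_priority) also do, so filter them out.
        if len(parts) != 4 or parts[1].endswith("_priority"):
            continue
        cluster, host, field, value = parts
        hosts.setdefault((cluster, host), {})[field] = value
    return hosts


# A few lines copied from the output above.
sample = """\
somecluster::default_priority::max_connections::1024
somecluster::added_via_api::false
somecluster::127.0.0.1:8082::cx_active::0
somecluster::127.0.0.1:8082::health_flags::healthy
somecluster::127.0.0.1:8082::region::
"""

hosts = parse_clusters(sample)
# Hosts whose health_flags is anything other than "healthy".
unhealthy = [key for key, fields in hosts.items()
             if fields.get("health_flags") != "healthy"]
print(hosts)
print(unhealthy)
```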

Envoy log level

log level: http://127.0.0.1:9901/logging

active loggers:
  admin: info
  alternate_protocols_cache: info
  aws: info
  assert: info
  backtrace: info
  cache_filter: info
  client: info
  config: info
  connection: info
  conn_handler: info
  decompression: info
  dns: info
  dubbo: info
  envoy_bug: info
  ext_authz: info
  ext_proc: info
  rocketmq: info
  file: info
  filter: info
  ...

Change the Envoy log level to debug:

# static: set at startup
# envoy -c ./stats.yaml -l debug

# dynamic: change at runtime through the admin interface
# curl -XPOST "localhost:9901/logging?level=debug"
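
The same runtime change can be scripted. The sketch below builds the /logging URL and sends the POST with the Python standard library. The "level" query parameter changes every logger, and a "logger name = level" parameter changes a single one; logging_url and set_log_level are hypothetical helper names, and the admin address is the one from stats.yaml.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ADMIN = "http://127.0.0.1:9901"  # admin address from stats.yaml


def logging_url(level, logger=None, admin=ADMIN):
    """Build the admin URL that sets all loggers, or just one, to `level`."""
    query = {logger: level} if logger else {"level": level}
    return f"{admin}/logging?{urlencode(query)}"


def set_log_level(level, logger=None, admin=ADMIN):
    """Send the POST request; this needs a running Envoy to succeed."""
    request = Request(logging_url(level, logger, admin), method="POST")
    with urlopen(request, timeout=5) as response:
        return response.read().decode()


print(logging_url("debug"))
# http://127.0.0.1:9901/logging?level=debug
print(logging_url("debug", logger="connection"))
# http://127.0.0.1:9901/logging?connection=debug
```

Raising a single logger such as connection keeps the debug output manageable compared with switching everything to debug.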

3. Access logs

# https://github.com/solo-io/hoot/blob/master/02-observe/accesslogs.yaml
admin:
  access_log_path: /dev/stdout
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                access_log:
                  - name: "envoy.access_loggers.file"
                    filter:
                      status_code_filter:
                        comparison:
                          op: GE
                          value:
                            default_value: 400
                            runtime_key: "filter.request_type"
                    typed_config:
                      "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"
                      path: /dev/stdout
                stat_prefix: edge_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: namespace.local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: somecluster }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: somecluster
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: somecluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 8082

The crucial part of the access log config:

                access_log:
                  - name: "envoy.access_loggers.file"
                    filter:
                      status_code_filter:
                        comparison:
                          op: GE
                          value:
                            default_value: 400
                            runtime_key: "filter.request_type"
                    typed_config:
                      "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"
                      path: /dev/stdout

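This filter writes an access log entry to /dev/stdout (the terminal) whenever the response status code is greater than or equal to 400. The value 400 is only the default: if the runtime key filter.request_type is set, its value is used as the threshold instead.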
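
To make the comparison concrete, here is a small model of that decision in Python. It is only an illustration of the GE comparison with a runtime override, not Envoy's implementation; should_log and the runtime dictionary are hypothetical stand-ins.

```python
def should_log(status_code, runtime=None,
               runtime_key="filter.request_type", default_value=400):
    """Model a status_code_filter with op: GE."""
    runtime = runtime or {}
    # The runtime value wins when the key is defined; otherwise the default.
    threshold = runtime.get(runtime_key, default_value)
    return status_code >= threshold


print(should_log(200))  # False: below the default threshold
print(should_log(503))  # True: 503 >= 400
print(should_log(404, runtime={"filter.request_type": 500}))  # False: 404 < 500
```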

# flush the access log every 1 ms
# envoy -c ./accesslogs.yaml --file-flush-interval-msec 1

The 503 responses logged to the terminal (the UF response flag indicates an upstream connection failure):

[2023-12-09T05:33:20.740Z] "GET / HTTP/1.1" 503 UF 0 151 8 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "27d6dca9-1170-4f92-be1f-214a260e639b" "127.0.0.1:10000" "127.0.0.1:8082"
[2023-12-09T05:33:21.086Z] "GET /favicon.ico HTTP/1.1" 503 UF 0 151 0 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "22f9a6d6-8c43-474d-91fa-394d364fbb9e" "127.0.0.1:10000" "127.0.0.1:8082"
[2023-12-09T05:33:43.391Z] "GET /123 HTTP/1.1" 503 UF 0 151 1 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "7ca8235b-b4c2-4f42-ab9e-c4a98f65bf85" "127.0.0.1:10000" "127.0.0.1:8082"
[2023-12-09T05:33:43.505Z] "GET /favicon.ico HTTP/1.1" 503 UF 0 151 2 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" "711fb9fb-09b5-4df2-b4e7-6f17e06d43bc" "127.0.0.1:10000" "127.0.0.1:8082"
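
These lines can be split into named fields with a regular expression. The sketch below assumes the layout seen in the output above: start time, request line, status code, response flags, bytes received, bytes sent, duration, upstream service time, then the quoted X-Forwarded-For, user agent, request ID, authority, and upstream host. A custom log format would need a different pattern, and parse_access_log is a hypothetical helper.

```python
import re

# Field order assumed from the sample output above.
LINE = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d+) (?P<flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) '
    r'(?P<upstream_time>\S+) '
    r'"(?P<forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" '
    r'"(?P<upstream_host>[^"]*)"'
)


def parse_access_log(line):
    """Return the fields of one access log line as a dict, or None."""
    match = LINE.match(line.strip())
    return match.groupdict() if match else None


# One of the lines printed above.
sample = (
    '[2023-12-09T05:33:43.391Z] "GET /123 HTTP/1.1" 503 UF 0 151 1 - "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" '
    '"7ca8235b-b4c2-4f42-ab9e-c4a98f65bf85" "127.0.0.1:10000" "127.0.0.1:8082"'
)

entry = parse_access_log(sample)
print(entry["path"], entry["status"], entry["flags"], entry["upstream_host"])
```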

4. Metrics Collection

Prometheus integration: http://127.0.0.1:9901/stats/prometheus?filter=

# TYPE envoy_cluster_assignment_stale counter
envoy_cluster_assignment_stale{envoy_cluster_name="somecluster"} 0
# TYPE envoy_cluster_assignment_timeout_received counter
envoy_cluster_assignment_timeout_received{envoy_cluster_name="somecluster"} 0
# TYPE envoy_cluster_assignment_use_cached counter
envoy_cluster_assignment_use_cached{envoy_cluster_name="somecluster"} 0
# TYPE envoy_cluster_bind_errors counter
envoy_cluster_bind_errors{envoy_cluster_name="somecluster"} 0
# TYPE envoy_cluster_default_total_match_count counter
envoy_cluster_default_total_match_count{envoy_cluster_name="somecluster"} 55
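
In practice a Prometheus server scrapes this endpoint, but the text format is simple enough to read in a script as well. The following is a simplified sketch for lines shaped like the ones above; it is not a full parser for the exposition format (for instance, label values containing escaped quotes or a closing brace are not handled), and parse_prometheus is a hypothetical helper.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_prometheus(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# TYPE" / "# HELP" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels,
                        float(match.group("value"))))
    return samples


# Lines copied from the output above.
text = """\
# TYPE envoy_cluster_bind_errors counter
envoy_cluster_bind_errors{envoy_cluster_name="somecluster"} 0
# TYPE envoy_cluster_default_total_match_count counter
envoy_cluster_default_total_match_count{envoy_cluster_name="somecluster"} 55
"""

samples = parse_prometheus(text)
for name, labels, value in samples:
    print(name, labels, value)
```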

5. Tracing

Jaeger integration: Envoy traces the requests that pass through it and reports the spans to Jaeger, an open-source distributed tracing system.

# https://github.com/solo-io/hoot/blob/master/02-observe/jaeger.yaml
admin:
  access_log_path: /dev/stdout
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      traffic_direction: OUTBOUND
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                generate_request_id: true
                tracing:
                  provider:
                    name: envoy.tracers.dynamic_ot
                    typed_config:
                      "@type": type.googleapis.com/envoy.config.trace.v3.DynamicOtConfig
                      library: ./libjaegertracing.so.0.4.2
                      config:
                        service_name: edge-proxy
                        sampler:
                          type: const
                          param: 1
                        reporter:
                          localAgentHostPort: 127.0.0.1:6831
                        headers:
                          jaegerDebugHeader: jaeger-debug-id
                          jaegerBaggageHeader: jaeger-baggage
                          traceBaggageHeaderPrefix: edgectx-
                        baggage_restrictions:
                          denyBaggageOnInitializationFailure: false
                          hostPort: ""
                stat_prefix: edge_http
                use_remote_address: true
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: namespace.local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          decorator:
                            operation: fetchContent
                          route:
                            cluster: somecluster
                            rate_limits:
                              - actions:
                                  - {source_cluster: {}}
                                  - {generic_key: {descriptor_value: some_value}}
                http_filters:
                  - name: envoy.filters.http.rate_limit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
                      domain: "domain"
                      timeout: 5s
                      rate_limit_service:
                        grpc_service:
                          timeout: 5s
                          envoy_grpc:
                            cluster_name: rate-limit
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: somecluster
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: somecluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 8082
    - name: rate-limit
      http2_protocol_options: {}
      connect_timeout: 0.25s
      type: logical_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: rate-limit
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 10004

    - name: jaeger
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: jaeger
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 9411

# envoy -c jaeger.yaml

6. Reference