[P1BIGROCKS-1539] Cluster logging ADR
This is part of the BigBang Core Opensource Initiative (&74 (closed)).
Summary
Produce an ADR that evaluates and compares the following:
- Elasticsearch, Fluentbit (or alternative) and Kibana
- OpenSearch, Fluentbit (or alternative) and Kibana
- Loki, Promtail (or alternative), and Grafana
EFK is currently the logging solution for BigBang core, but because of licensing constraints and the desire to move all BigBang core products to free and open-source tools, it is being put under evaluation.
The goal is to identify the alternative solutions (Loki, OpenSearch) and compare features, performance, cost, and IronBank feasibility/maintenance in order to decide whether we stick with EFK.
Definition of Done
- ADR with recommendation of tool
- Features
- Performance
- Cost
- IB feasibility
- Hub and spoke model for easily centralizing/tagging/viewing multicluster logs
Extra points of contention
- Does it fully support mTLS with Istio?
- EFK docs claim they do, but the fine print says, "only if you don't use paid features such as sso", which makes it unusable
Summary
TLDR
EFK vs Open Distro
- AWS Open Distro is a free fork of EFK v7.10, taken before Elastic changed its license. Minor lift and shift, no support, and some EFK features are unavailable.
EFK/Open Distro vs PLG
- PLG has a different underlying design, but it works on the same data through a different interface. It puts logs and metrics in the same pane of glass, has a potentially smaller BigBang footprint, has fewer "paywalled" features than EFK, and is potentially cheaper due to its architecture and licensing.
EFK - Elasticsearch Fluentbit Kibana
DEFINITIONS
- Elasticsearch is a real-time, distributed object storage, search and analytics engine. It excels at indexing semi-structured data such as logs. The information is serialized as JSON documents, indexed in real time, and distributed across the nodes in the cluster. For full-text search, Elasticsearch uses an inverted index that lists every unique word and the documents it appears in; this is built on the Apache Lucene search engine library.
- FluentD is a data collector that unifies data collection and consumption. It tries to structure data as JSON as much as possible. It has a plugin architecture and is supported by hundreds of community-provided plugins for many use cases.
- Kibana is the visualization engine for elasticsearch data, with features like time-series analysis, machine learning, graph and location analysis.
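To make the inverted-index idea in the Elasticsearch definition concrete, here is a minimal sketch in Python. It is not how Lucene is implemented (Lucene adds analyzers, scoring, and compressed on-disk segments); the tokenizer, the function names, and the sample log lines are assumptions made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each unique token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical log lines keyed by document ID.
logs = {
    1: "connection refused by upstream host",
    2: "upstream host returned 503",
    3: "user login succeeded",
}
index = build_inverted_index(logs)
matches = search(index, "upstream host")  # documents 1 and 2
```

Because every word of every log line gets an index entry, lookups are fast, but the index grows with the volume of log text. That is the storage and memory cost the PLG comparison refers to.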
NOTES
- Currently integrated and part of Big Bang
- Current teams have monitoring, dashboards, and alerts based around EFK
- Many features, including SSO, are locked behind a paid license
- Resource hungry, forcing some edge users to disable it
- High customer familiarity.
License Type
- Elastic License v2
- Tiered Paid License Model
RBAC SUPPORT
- Free Role-based access control for controlling user access to cluster APIs and indexes.
- Document-level security requires a license.
- RBAC management and control is very limited without SSO, which requires a paid license
SSO SUPPORT
- Requires license
- Single sign-on and Active Directory/LDAP authentication.
HELM
OPERATOR
mTLS Compatibility
- No, issues arise when using paid features that require TLS, like SSO.
Ironbank feasibility
- Product has already been integrated.
Open Distro & Fluent Bit
DEFINITIONS
- Elasticsearch is a real-time, distributed object storage, search and analytics engine. It excels at indexing semi-structured data such as logs. The information is serialized as JSON documents, indexed in real time, and distributed across the nodes in the cluster. For full-text search, Elasticsearch uses an inverted index that lists every unique word and the documents it appears in; this is built on the Apache Lucene search engine library.
- FluentD is a data collector that unifies data collection and consumption. It tries to structure data as JSON as much as possible. It has a plugin architecture and is supported by hundreds of community-provided plugins for many use cases.
- Kibana is the visualization engine for elasticsearch data, with features like time-series analysis, machine learning, graph and location analysis.
NOTES
- Based on Upstream open source Elasticsearch.
- Upstream open source Elasticsearch development ended with 7.10 (Elasticsearch is currently on v7.12.*), when the move to a non-open source license was announced. Development will continue on a fork of the Elasticsearch and Kibana 7.10 base under the Apache 2.0 license.
- All dashboards, filters, and alerts created in current EFK design should transfer over.
- Resource hungry, would force some edge users to disable.
- Will feel the same to customers.
License Type
- Apache License 2.0
- Open Distro for Elasticsearch and all included plugins are licensed under the Apache License, Version 2.0.
RBAC SUPPORT
- Yes, Roles contain any combination of cluster-wide permissions, index-specific permissions, document and field-level security, and tenants.
- RBAC is not locked behind licensing.
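As a rough illustration of how one role can combine those permission types, the sketch below builds a role document in the shape we understand the Open Distro security plugin to accept (for example in `roles.yml` or through its roles REST API). Treat the field names and action groups as assumptions to verify against the Open Distro documentation for the version in use. The index pattern, tenant name, and DLS query are hypothetical.

```python
import json


def build_role(index_pattern, dls_query, visible_fields, tenant):
    """Build a role combining cluster, index, document/field, and tenant rules."""
    return {
        # Cluster-wide permissions (assumed read-only action group).
        "cluster_permissions": ["cluster_composite_ops_ro"],
        # Index-specific permissions, narrowed further by DLS and FLS.
        "index_permissions": [
            {
                "index_patterns": [index_pattern],
                # Document-level security: a query DSL filter stored as a string.
                "dls": json.dumps(dls_query),
                # Field-level security: only these fields are returned.
                "fls": list(visible_fields),
                "allowed_actions": ["read"],
            }
        ],
        # Tenant permissions control access to Kibana saved objects.
        "tenant_permissions": [
            {
                "tenant_patterns": [tenant],
                "allowed_actions": ["kibana_all_read"],
            }
        ],
    }


# Hypothetical role: team-a may read only its own namespace's log documents.
role = build_role(
    index_pattern="logs-*",
    dls_query={"term": {"kubernetes.namespace_name": "team-a"}},
    visible_fields=["@timestamp", "message"],
    tenant="team-a",
)
print(json.dumps(role, indent=2))
```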
SSO SUPPORT
- OpenID Connect, LDAP/Active Directory, SAML, Kerberos, JSON web tokens, TLS certificates, and Proxy authentication/SSO for user authentication
HELM
OPERATOR
- None
mTLS Compatibility
- Istio mTLS compatible if Open Distro's default TLS is disabled; further work and testing required.
Ironbank feasibility
- Work required; a good amount of the existing effort on Elasticsearch and Kibana will transfer over.
PLG - Promtail Loki Grafana
DEFINITIONS
- Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It indexes only metadata and doesn’t index the content of the log. This design decision makes it very cost-effective and easy to operate.
- Promtail is an agent that ships the logs from the local system to the Loki cluster.
- Grafana is the visualization tool which consumes data from Loki data sources.
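The "indexes only metadata" design can be sketched as a toy store that indexes label sets and leaves log text unindexed, so a text filter becomes a scan over the selected streams. This is an illustrative model only, not Loki's actual storage engine; the class name, method names, and label values are assumptions.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store that indexes label sets only, never the log text."""

    def __init__(self):
        # (label, value) -> set of stream keys. This is the only index.
        self.label_index = defaultdict(set)
        # stream key -> raw log lines, stored unindexed (a "chunk").
        self.chunks = defaultdict(list)

    def push(self, labels, line):
        """Append a line to the stream identified by its full label set."""
        stream = tuple(sorted(labels.items()))
        for pair in stream:
            self.label_index[pair].add(stream)
        self.chunks[stream].append(line)

    def query(self, selector, contains=""):
        """Pick streams by label match, then scan their lines for a substring."""
        if selector:
            per_label = [self.label_index.get(p, set()) for p in selector.items()]
            candidates = set.intersection(*per_label)
        else:
            candidates = set(self.chunks)
        return [
            line
            for stream in sorted(candidates)
            for line in self.chunks[stream]
            if contains in line
        ]


store = LabelIndexedStore()
store.push({"app": "api", "cluster": "spoke-1"}, "GET /health 200")
store.push({"app": "api", "cluster": "spoke-1"}, "GET /orders 500 timeout")
store.push({"app": "db", "cluster": "spoke-1"}, "slow query timeout")
errors = store.query({"app": "api"}, contains="timeout")
```

The index holds one entry per label pair rather than one per word, which is why it stays small. The trade-off is that text matching is a brute-force scan of the chunks the labels select.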
NOTES
- Grafana could now provide a "single pane of glass" for metrics and logs enabling users to seamlessly switch between metrics and logs, helping with root cause analysis.
- Uses significantly fewer resources than Elasticsearch.
- Would require building new dashboards and alerts.
- Promtail works similarly to fluentd/fluentbit: it pulls logs from containers and delivers them to Loki.
- If Promtail is not able to capture data, fluentd/fluentbit have the ability to send data to Loki.
- Loki is an extremely cost-effective solution because of the design decision to avoid indexing the actual log data. Only metadata is indexed, which saves on storage and memory (cache). It can use object storage, which is normally cheaper than the block storage required by Elasticsearch clusters.
- Multi-tenant support
- Potential loss in advanced alerts and visualization because logs are stored in plaintext form tagged with a set of label names and values. Further testing required.
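For reference, the sketch below shows the kind of payload an agent such as Promtail or fluentbit delivers to Loki: plain log lines grouped under a label set. It builds a JSON body in the shape accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`) but does not send it. The helper name and the label values are assumptions; a `cluster` label is one possible way to tag logs for a hub-and-spoke multicluster view.

```python
import json
import time


def build_loki_push_payload(labels, lines, now_ns=None):
    """Build a push body: one stream of (timestamp, line) pairs under one label set."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Timestamps are Unix epoch nanoseconds encoded as strings. The index i is
    # added only so that entries in this example stay distinct and ordered.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


# Hypothetical labels identifying the source cluster and workload.
payload = build_loki_push_payload(
    {"cluster": "spoke-1", "namespace": "team-a", "app": "api"},
    ["GET /health 200", "GET /orders 500 timeout"],
)
body = json.dumps(payload)
```

Only the label set is indexed on arrival; the lines themselves are stored as-is, which is the behavior the note on alerting and visualization refers to.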
License Type
RBAC SUPPORT
- Folder or dashboard permissions assigned to your team (Admin, Editor, or Viewer).
- Folder or dashboard permissions assigned to your user account (Admin, Editor, or Viewer).
- Data source permissions require a Grafana Enterprise license. For more information, refer to "Data source permissions" in the Grafana Enterprise documentation.
SSO SUPPORT
- Yes, Grafana SSO is currently integrated and available in the Big Bang stack
HELM
OPERATOR
- None
mTLS Compatibility
- Istio mTLS compatible
Ironbank feasibility
- Grafana has already been integrated.
- Would require work for Promtail and Loki.
Opensearch
- Not ready for consideration/evaluation; the project started in early 2021 and is still in beta.
RELEVANT LINKS
- https://grafana.com/docs/loki/latest/overview/comparisons/
- https://www.infracloud.io/blogs/logging-in-kubernetes-efk-vs-plg-stack/
- https://opendistro.github.io/for-elasticsearch/faq.html
- https://www.elastic.co/what-is/opensearch
- https://www.opensearch.org/
- https://github.com/opensearch-project/OpenSearch
- https://grafana.com/docs/loki/latest/clients/fluentd/