EFK issues with BB 1.5.0
When deploying to OCP on Azure or to AKS, pods in the logging namespace throw errors and logs only partially make it to Kibana.
For AKS, one of the 4 logging-ek-es-data pods and one of the 3 logging-ek-es-master pods have this exception:
{{logging-ek-es-data-0}{r4piFGtiTeeOSJ5Rt5_2lA}{Yfs_rAB2S2-fxF7Pf36RCQ}{10.244.2.8}{10.244.2.8:9300}{dirt}{k8s_node_name=aks-default-42013760-vmss000000, xpack.installed=true, transform.node=true}}, term: 1, version: 113, reason: ApplyCommitRequest{term=1, version=113, sourceNode={logging-ek-es-master-1}{r1nFFAuwTsm7_nRHFpPHUw}{hqDfI3dOTe2zPuAgWE0A-g}{10.244.3.11}{10.244.3.11:9300}{mr}{k8s_node_name=aks-default-42013760-vmss000002, xpack.installed=true, transform.node=false}}", "cluster.uuid": "rnNEEZyGTgmFo-0yBG4tWw", "node.id": "E0hv4YMhQyS7zqj89yWO5g" }

{"type": "server", "timestamp": "2021-04-21T16:17:05,844Z", "level": "WARN", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "logging-ek", "node.name": "logging-ek-es-data-2", "message": "caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:43288}", "cluster.uuid": "rnNEEZyGTgmFo-0yBG4tWw", "node.id": "E0hv4YMhQyS7zqj89yWO5g" , "stacktrace": [
  "io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?",
  "at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:471) ~[netty-codec-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[netty-codec-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.49.Final.jar:4.1.49.Final]",
  "at java.lang.Thread.run(Thread.java:832) [?:?]",
  "Caused by: javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?",
  "at sun.security.ssl.SSLEngineInputRecord.bytesInCompletePacket(SSLEngineInputRecord.java:146) ~[?:?]",
  "at sun.security.ssl.SSLEngineInputRecord.bytesInCompletePacket(SSLEngineInputRecord.java:64) ~[?:?]",
  "at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:612) ~[?:?]",
  "at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:506) ~[?:?]",
  "at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:482) ~[?:?]",
  "at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:637) ~[?:?]",
  "at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:282) ~[netty-handler-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1372) ~[netty-handler-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) ~[netty-handler-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[netty-handler-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[netty-codec-4.1.49.Final.jar:4.1.49.Final]",
  "at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[netty-codec-4.1.49.Final.jar:4.1.49.Final]",
  "... 16 more"] }
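For context, ECK serves the logging-ek-es-http endpoint over HTTPS by default, which is what makes a plaintext client trip the warning above. A rough sketch of where that is controlled on the ECK Elasticsearch resource is below; BB renders this object for logging-ek, so the name, namespace, version, and nodeSet values shown are placeholders from our cluster layout, not our actual manifest:

# Illustrative ECK Elasticsearch resource -- not our manifest. TLS on the HTTP
# layer is on by default and is controlled under spec.http.tls; BB 1.5.0
# generates this object for the logging-ek cluster.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: logging-ek
  namespace: logging
spec:
  version: 7.12.0            # placeholder version
  http:
    tls:
      selfSignedCertificate:
        disabled: false      # ECK default: logging-ek-es-http serves TLS with a self-signed cert
  nodeSets:
    - name: data
      count: 4               # matches the 4 logging-ek-es-data pods we see
    - name: master
      count: 3               # matches the 3 logging-ek-es-master pods we see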
Potentially related: all of the fluent-bit pods spit out this error (many times) around startup:
[2021/04/21 16:07:03] [error] [filter:kubernetes:kubernetes.0] kubelet upstream connection error
[2021/04/21 16:07:03] [error] [filter:kubernetes:kubernetes.0] kubelet upstream connection error
[2021/04/21 16:07:03] [error] [filter:kubernetes:kubernetes.0] kubelet upstream connection error
[2021/04/21 16:07:03] [error] [filter:kubernetes:kubernetes.0] kubelet upstream connection error
and once fully started up, they spit out this every 10 seconds (indefinitely):
[2021/04/21 16:07:51] [error] [upstream] connection #69 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #77 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #65 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #93 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #97 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #87 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #106 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #93 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #94 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #102 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #97 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #80 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #87 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #93 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #69 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #66 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #89 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #72 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #78 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #65 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #87 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #92 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #95 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #78 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #107 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #92 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #108 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #112 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #113 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #104 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #109 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #94 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #80 to logging-ek-es-http:9200 timed out after 10 seconds
[2021/04/21 16:07:51] [error] [upstream] connection #78 to logging-ek-es-http:9200 timed out after 10 seconds
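For what it's worth, the "Unrecognized SSL message, plaintext connection?" warning on the Elasticsearch side combined with these fluent-bit timeouts to logging-ek-es-http:9200 reads like a client speaking plaintext to a TLS endpoint. Below is a sketch of the kind of fluent-bit Elasticsearch output stanza one would check for this, written against the upstream fluent-bit chart's config.outputs value; the key path (fluentbit.values.config.outputs), Match pattern, and credential wiring are assumptions, not our configuration:

# Sketch only -- this is what we would expect the rendered output section to
# resemble if fluent-bit is meant to talk TLS to the ECK HTTP service.
fluentbit:
  values:
    config:
      outputs: |
        [OUTPUT]
            Name            es
            Match           kube.*
            Host            logging-ek-es-http
            Port            9200
            HTTP_User       elastic
            HTTP_Passwd     <password from the ES elastic-user secret>
            Logstash_Format On
            # tls must be On if the ES HTTP layer serves TLS (the ECK default)
            tls             On
            tls.verify      Off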
To reproduce this, deploy BB 1.5.0 onto OCP on Azure or AKS. We do not override any helm chart values for OCP on Azure. The only overrides we use for AKS (to increase the virtual memory map count) are attached as a screenshot, since pasting them inline mangles the formatting; a rough sketch of their shape follows.
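The screenshot is authoritative; the sketch below only shows the general shape of the override, assuming it raises vm.max_map_count via an init container on the ECK-managed Elasticsearch pods. The values path (logging.values.elasticsearch.data.initContainers) is illustrative and may not match the BB 1.5.0 chart's real key names:

# Sketch only -- see the attached screenshot for the actual AKS overrides.
logging:
  values:
    elasticsearch:
      data:
        initContainers:
          - name: sysctl
            securityContext:
              privileged: true
              runAsUser: 0
            # standard approach for raising the memory-map limit Elasticsearch needs
            command: ["sh", "-c", "sysctl -w vm.max_map_count=262144"]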