
ES 2.3.3: Persistently High Search Threads and CPU Usage

Background

A real production case, dedicated to the comrades still stuck in the same pit.
Let's call it Project A. Project A runs on a dedicated Elasticsearch cluster. As more and more business consumers came on board, the cluster showed a strange pattern: every once in a while the search thread pool on one node would climb to its maximum and stay there, so subsequent requests piled up in the thread pool queue; once the queue was full, further search requests were rejected. Along with the rising thread count, CPU usage on that node approached 100%, as shown below.

(Figure: CPU usage spiking)
(Figure: search thread_pool spiking)
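
The search thread pool in ES 2.x is a fixed-size pool backed by a bounded queue; once every worker is busy and the queue is full, new tasks are rejected back to the caller. The following minimal Java sketch reproduces that rejection mechanism in isolation (the pool size, queue size and task duration are made-up numbers, not the cluster's actual search pool settings):

import java.util.concurrent.*;

public class RejectionDemo {
    public static void main(String[] args) throws Exception {
        // Fixed pool with a bounded queue, analogous to the "search" thread pool:
        // when all workers are busy and the queue is full, submissions are rejected.
        ThreadPoolExecutor search = new ThreadPoolExecutor(
                4, 4,                                   // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10),           // bounded queue, like the search pool's queue_size
                new ThreadPoolExecutor.AbortPolicy());  // reject instead of blocking the caller

        int rejected = 0;
        for (int i = 0; i < 100; i++) {
            try {
                search.execute(() -> {
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                });
            } catch (RejectedExecutionException e) {
                rejected++;                             // what an ES client sees as a rejected search request
            }
        }
        System.out.println("rejected tasks: " + rejected);
        search.shutdownNow();
    }
}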

Cluster Information

  • ES version: 2.3.3
  • Total indices: 5
  • Average documents per index: 1 billion (excluding replicas)
  • Document size: 0.5 ~ 1 KB

Performance Requirements for Project A

  • QPS: 8000+
  • Concurrency: 400 ~ 600
  • Latency: 50 ms (99.99th percentile)

Collecting and Analyzing the Diagnostics

Hardware

  • There was not enough evidence to blame the hardware, so no conclusion was drawn for now.

Application

  • DevOps could not obtain further application-level monitoring, so no conclusion could be drawn there either.

ES Cluster

  • Given how critical the production cluster is and its low-latency requirements, we could not collect further diagnostics on it, e.g. thread dumps (jstack) or heap dumps (jmap).
  • Build a new cluster (one quarter the size of the production cluster), reproduce the problem by simulating production concurrency and query logic, and collect the diagnostics for analysis:
    • Collect thread dumps from the problematic node (see the sketch after this list).
    • Record the latency of every search request (slow log).
    • Collect GC information from the problematic node.
    • Collect the log files of the problematic node.
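
Besides running jstack against the ES process, the same information can be gathered in-process with the standard java.lang.management API. The following minimal sketch (a hypothetical helper, not the tooling we actually used) prints only the threads that are BLOCKED on a monitor, which is exactly the kind of frame shown in the dumps below:

import java.lang.management.*;

public class BlockedThreadDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Dump all threads of the current JVM, including the monitors they own or wait on.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            if (info.getThreadState() != Thread.State.BLOCKED) {
                continue; // only interested in threads blocked on a monitor
            }
            System.out.println('"' + info.getThreadName() + '"'
                    + " BLOCKED waiting to lock " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }
}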

Tell-tale Signs

A BLOCKED thread hiding in the thread dump:

"elasticsearch[ES-000-001][search][T#14]" #10682 daemon prio=5 os_prio=0 tid=0x00007fc1a408b800 nid=0x21fdef waiting for monitor entry [0x00007fbf0a28a000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.lucene.search.LRUQueryCache.get(LRUQueryCache.java:218)
- waiting to lock <0x00000002c143aa90> (a org.elasticsearch.indices.cache.query.IndicesQueryCache$1)
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:622)
at org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
at org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
at org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:655)
at org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:269)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:818)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:384)
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113)
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:366)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:378)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

During the reproduction window the query cache miss count was indeed growing (the yellow line is the miss count):

(Figure: query cache miss count)

Searching the community turned up two relevant issues, LUCENE-7237 and ES #21566:

  • LUCENE-7237, LRUQueryCache should rather not cache than wait on a lock (a sketch of the idea follows this list)
  • ES #21566,term queries are not cached anymore, leaving space in the history for more complex queries
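
The gist of LUCENE-7237 is that a search thread should rather skip the cache than queue up behind its lock. The sketch below is a simplified illustration of that idea under assumed names (NonBlockingQueryCache, getOrSkip); it is not the actual Lucene patch:

import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Simplified illustration of the idea behind LUCENE-7237: instead of every
// search thread blocking on the cache's monitor, try to acquire the lock and
// simply treat a contended cache as a miss.
class NonBlockingQueryCache<K, V> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<K, V> cache;

    NonBlockingQueryCache(Map<K, V> cache) {
        this.cache = cache;
    }

    /** Returns the cached value, or null if it is absent OR the lock is busy. */
    V getOrSkip(K key) {
        if (!lock.tryLock()) {
            return null;          // cache is contended: behave as a cache miss instead of blocking
        }
        try {
            return cache.get(key);
        } finally {
            lock.unlock();
        }
    }
}

Trading a few cache hits for never blocking the search threads removes precisely the pile-up visible in the LRUQueryCache.get frames above.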

Running the stress test again after applying the fixes

  • The search thread_pool problem was still not solved, but the CPU no longer spiked.
    (Figure: CPU usage)
    (Figure: search thread_pool)
  • BLOCKED threads showed up again:

"elasticsearch[ES-000-002][search][T#47]" #329 daemon prio=5 os_prio=0 tid=0x00007fe414093000 nid=0x3d5cde waiting for monitor entry [0x00007fd386ce5000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:107)
- waiting to lock <0x00000002d7e68998> (a java.lang.ref.ReferenceQueue$Lock)
at org.apache.lucene.util.WeakIdentityMap.reap(WeakIdentityMap.java:237)
at org.apache.lucene.util.WeakIdentityMap.put(WeakIdentityMap.java:130)
at org.apache.lucene.store.ByteBufferIndexInput.buildSlice(ByteBufferIndexInput.java:291)
at org.apache.lucene.store.ByteBufferIndexInput.clone(ByteBufferIndexInput.java:255)
at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.clone(ByteBufferIndexInput.java:389)
at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.initIndexInput(SegmentTermsEnum.java:115)
at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:148)
at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:507)
at org.apache.lucene.index.TermContext.build(TermContext.java:94)
at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:192)
at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
at org.apache.lucene.search.BooleanWeight.<init>(BooleanWeight.java:57)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:239)
at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
at org.apache.lucene.search.BooleanWeight.<init>(BooleanWeight.java:57)
at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:239)
at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887)
at org.elasticsearch.search.internal.ContextIndexSearcher.createNormalizedWeight(ContextIndexSearcher.java:100)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:384)
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113)
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:366)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:378)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
=================================================================================================
"elasticsearch[ES-000-003][search][T#49]" #335 daemon prio=5 os_prio=0 tid=0x00007fa29c08e800 nid=0x2096ea waiting for monitor entry [0x00007f9b2e8e7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.elasticsearch.common.lucene.ShardCoreKeyMap.getShardId(ShardCoreKeyMap.java:104)
- waiting to lock <0x00000002c119aa00> (a org.elasticsearch.common.lucene.ShardCoreKeyMap)
at org.elasticsearch.indices.cache.query.IndicesQueryCache$1.getOrCreateStats(IndicesQueryCache.java:95)
at org.elasticsearch.indices.cache.query.IndicesQueryCache$1.onMiss(IndicesQueryCache.java:185)
at org.apache.lucene.search.LRUQueryCache.get(LRUQueryCache.java:236)
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:711)
at org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:269)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:818)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:384)
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113)
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:366)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:378)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Fixes for the two blocked threads above were again found in the community:

  • LUCENE-7409, Look into making MMapDirectory's unmap safer (patch for the first blocked thread)
  • ES #22316, Improved concurrency of ShardCoreKeyMap (patch for the second blocked thread; a sketch of the idea follows this list)
  • LUCENE-7651, Javadocs build fails with Java 8 update 121 (an extra patch I pulled in so the test cases would pass on JDK 8)
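
ES #22316 attacks the contention seen in the third dump, where every cache miss ends up in the synchronized ShardCoreKeyMap.getShardId. The sketch below illustrates the general idea of giving the hot lookup a lock-free path while keeping the rare registration path synchronized; class and method names are placeholders, not the actual ShardCoreKeyMap implementation:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the hot read path goes through a ConcurrentHashMap and never blocks,
// while the rare add/remove paths still synchronize to keep bookkeeping consistent.
class CoreKeyToShardId {
    private final ConcurrentHashMap<Object, String> shardIdByCoreKey = new ConcurrentHashMap<>();
    private final Map<String, Integer> openSegmentsPerShard = new HashMap<>();

    /** Hot path: called for every query-cache event, must not contend on a monitor. */
    String getShardId(Object coreCacheKey) {
        return shardIdByCoreKey.get(coreCacheKey);
    }

    /** Cold path: called once per segment when it is opened. */
    synchronized void add(Object coreCacheKey, String shardId) {
        if (shardIdByCoreKey.putIfAbsent(coreCacheKey, shardId) == null) {
            openSegmentsPerShard.merge(shardId, 1, Integer::sum);
        }
    }

    /** Cold path: called when a segment is closed. */
    synchronized void remove(Object coreCacheKey) {
        String shardId = shardIdByCoreKey.remove(coreCacheKey);
        if (shardId != null) {
            openSegmentsPerShard.merge(shardId, -1, Integer::sum);
        }
    }
}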

Running the stress test again after applying the fixes

  • The problem appears to be solved. After running the test several more times, the results looked like this:
    (Figure: CPU usage)
    (Figure: search thread_pool)

Summary and Recommendations

Summary

  • 5 patches were applied in total:
    • Lucene
      • LUCENE-7237 LRUQueryCache should rather not cache than wait on a lock
      • LUCENE-7651 Javadocs build fails with Java 8 update 121
      • LUCENE-7409 Look into making MMapDirectory's unmap safer
    • ES
      • #21566 term queries are not cached anymore, leaving space in the history for more complex queries (the index must be closed and reopened, and all requests fail during that window)
      • #22316 Improved concurrency of ShardCoreKeyMap
  • Since the upstream patches are all based on much newer versions, whether the backport introduces regressions can only be guarded by the coverage of the test cases; anything not covered carries a regression risk.
  • Judging from the monitoring, the patched cluster currently looks very healthy and shows no cluster-level anomalies; whether it affects the online business cannot yet be assessed from the infrastructure layer alone.

Recommendations

  • Functional regression and performance regression testing still has to be done; after all, it only counts as good when the business says it is good.
  • Finally, my personal recommendation is to upgrade the cluster; stop wrestling with Elasticsearch versions below 6.3 and Lucene versions below 6.2.
