Grouping (group by) in Lucene

xtuhcy

Blog category:

lucene

lucene grouping

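I have recently been optimizing our search service. The old service was very simple, and it now needs to return grouped statistics over the query results to improve the user experience.

After some Googling, I found three main ways to implement this:

1. Use the collect hook in search and write a custom Collector that outputs the grouped statistics;

2. Bobo-Browse, an open-source Lucene plugin;

3. The grouping module, available since Lucene 3.2.

In the end I chose the grouping module.

The grouping module offers two approaches: a two-pass method and a single-pass method. The single-pass method is only available from Lucene 3.3 onward.

The single-pass method is very efficient, but it requires a special marker to be added at indexing time: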

To use the single-pass BlockGroupingCollector, first, at indexing time, you must ensure all docs in each group are added as a block, and you have some way to find the last document of each group. One simple way to do this is to add a marker binary field:
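To make the idea in that quote concrete, here is a minimal sketch in plain Java. It is not the Lucene API: the class BlockGroupingSketch, the Doc type, and the groupEnd flag are hypothetical stand-ins for an index, a document, and the marker field. It only illustrates why contiguous blocks plus an end-of-group marker allow grouping in a single pass.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Lucene.
class BlockGroupingSketch {
    /** A toy document. groupEnd plays the role of the marker field on the last doc of a block. */
    static final class Doc {
        final String group;
        final String text;
        boolean groupEnd;

        Doc(String group, String text) {
            this.group = group;
            this.text = text;
        }
    }

    /** "Index time": append one whole group as a contiguous block and mark its last document. */
    static void addBlock(List<Doc> index, List<Doc> oneGroup) {
        if (oneGroup.isEmpty()) {
            return;
        }
        oneGroup.get(oneGroup.size() - 1).groupEnd = true;
        index.addAll(oneGroup);
    }

    /** "Search time": one pass in index order; a group is complete once its marker is reached. */
    static Map<String, Integer> countHitsPerGroup(List<Doc> index, Predicate<Doc> query) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int hitsInCurrentBlock = 0;
        for (Doc doc : index) {
            if (query.test(doc)) {
                hitsInCurrentBlock++;
            }
            if (doc.groupEnd) {
                // The block is finished: record it if anything matched, then reset.
                if (hitsInCurrentBlock > 0) {
                    counts.put(doc.group, hitsInCurrentBlock);
                }
                hitsInCurrentBlock = 0;
            }
        }
        return counts;
    }
}
```

Because every group is finished as soon as its marker is seen, no second search is needed. The price is the one the quote names: the index has to be built block by block.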

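Our site's traffic is still small and I did not want to rebuild the index, so I went with the two-pass method. The first pass collects the groups; if you only need the groups themselves and not the number of hits in each, this one pass is enough. The second pass collects the search results for each group, which also gives the number of hits per group.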
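The two passes can be sketched in plain Java as follows. This is a simplified illustration, not the grouping module itself: TwoPassGroupingSketch and Hit are hypothetical names, the hits are an in-memory list, and groups are ranked by their best score as an assumed stand-in for sorting by relevance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
class TwoPassGroupingSketch {
    /** A toy search hit: the value of the group field plus a relevance score. */
    static final class Hit {
        final String group;
        final float score;

        Hit(String group, float score) {
            this.group = group;
            this.score = score;
        }
    }

    /** Pass 1: find the groups, ranked by their best-scoring hit, and keep the top N. */
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Float> bestScore = new HashMap<>();
        for (Hit hit : hits) {
            bestScore.merge(hit.group, hit.score, Float::max);
        }
        List<String> groups = new ArrayList<>(bestScore.keySet());
        groups.sort((a, b) -> Float.compare(bestScore.get(b), bestScore.get(a)));
        int n = Math.max(0, Math.min(topNGroups, groups.size()));
        return new ArrayList<>(groups.subList(0, n));
    }

    /** Pass 2: revisit the hits and count those that fall into the selected groups. */
    static Map<String, Integer> secondPass(List<Hit> hits, List<String> topGroups) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String group : topGroups) {
            totals.put(group, 0);
        }
        for (Hit hit : hits) {
            if (totals.containsKey(hit.group)) {
                totals.merge(hit.group, 1, Integer::sum);
            }
        }
        return totals;
    }
}
```

The second pass has to see the hits again because the set of top groups is only known once the first pass has finished.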

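Because the two-pass method runs the search twice, it is less efficient, so a caching mechanism, CachingCollector, was introduced. With it, the second pass can be served directly from memory, provided the cached hits fit within the configured RAM limit.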
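The cache-and-replay idea can be illustrated with a small plain-Java sketch. It is not Lucene's CachingCollector: ReplayCacheSketch is a hypothetical class, and it caps the cache by a number of document IDs (an assumption made for simplicity) rather than by megabytes of RAM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical illustration only; this is not Lucene's CachingCollector.
class ReplayCacheSketch {
    private final IntConsumer delegate;
    private final int maxCachedDocs;
    // Becomes null once the cache overflows; from then on nothing is buffered.
    private List<Integer> cached = new ArrayList<>();

    ReplayCacheSketch(IntConsumer delegate, int maxCachedDocs) {
        this.delegate = delegate;
        this.maxCachedDocs = maxCachedDocs;
    }

    /** Forward each hit to the first-pass consumer and remember it if there is room. */
    void collect(int docId) {
        delegate.accept(docId);
        if (cached == null) {
            return;
        }
        if (cached.size() >= maxCachedDocs) {
            cached = null; // too many hits: give up caching and free the buffer
        } else {
            cached.add(docId);
        }
    }

    /** True if every collected hit is still held in memory. */
    boolean isCached() {
        return cached != null;
    }

    /** Feed the remembered hits to the second-pass consumer without searching again. */
    void replay(IntConsumer other) {
        if (cached == null) {
            throw new IllegalStateException("cache overflowed; re-run the search instead");
        }
        for (int docId : cached) {
            other.accept(docId);
        }
    }
}
```

The caller checks isCached() before choosing between replay and a second search, which is the same decision the snippet further down makes with cachedCollector.isCached().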

The code snippet is as follows:

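// Package names below are those of the Lucene 3.x grouping module.
import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.search.CachingCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.SearchGroup;
import org.apache.lucene.search.grouping.TermFirstPassGroupingCollector;
import org.apache.lucene.search.grouping.TermSecondPassGroupingCollector;
import org.apache.lucene.search.grouping.TopGroups;

// indexSearcher is an IndexSearcher field of the enclosing class.
// Returns a map from each group value to the number of hits in that group.
public Map<String, Integer> groupBy(Query query, String field, int topCount) {
  Map<String, Integer> map = new HashMap<String, Integer>();

  long begin = System.currentTimeMillis();
  int topNGroups = topCount;
  int groupOffset = 0;
  int maxDocsPerGroup = 100;
  int withinGroupOffset = 0;
  try {
    // First pass: find the top groups, caching the hits as they are collected.
    TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector(field, Sort.RELEVANCE, topNGroups);
    boolean cacheScores = true;
    double maxCacheRAMMB = 4.0;
    CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
    indexSearcher.search(query, cachedCollector);
    Collection<SearchGroup<String>> topGroups = c1.getTopGroups(groupOffset, true);
    if (topGroups == null) {
      // No groups matched: return the empty map rather than null.
      return map;
    }

    // Second pass: collect the documents that belong to each top group.
    TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector(field, topGroups, Sort.RELEVANCE, Sort.RELEVANCE, maxDocsPerGroup, true, true, true);
    if (cachedCollector.isCached()) {
      // Cache fit within maxCacheRAMMB, so we can replay it:
      cachedCollector.replay(c2);
    } else {
      // Cache was too large; must re-execute query:
      indexSearcher.search(query, c2);
    }

    TopGroups<String> tg = c2.getTopGroups(withinGroupOffset);
    GroupDocs<String>[] gds = tg.groups;
    for (GroupDocs<String> gd : gds) {
      map.put(gd.groupValue, gd.totalHits);
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  long end = System.currentTimeMillis();
  System.out.println("group by time :" + (end - begin) + "ms");
  return map;
}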

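Notes on the parameters:

groupField: the field to group by (field in the code above)
groupSort: how the groups are sorted
topNGroups: the maximum number of groups to return
groupOffset: offset used for paging through groups
withinGroupSort: how results are sorted within each group
maxDocsPerGroup: the maximum number of results returned per group
withinGroupOffset: offset used for paging within a group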
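As a rough illustration of how an offset and a page size interact, here is a plain-Java sketch. GroupPagingSketch and page are hypothetical names, not part of Lucene. The assumption is that paging works by skipping the first offset entries of an already sorted list, which is also why the grouping module's documentation sizes the first-pass collector with groupOffset + topNGroups when paging through groups.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
class GroupPagingSketch {
    /** Return up to pageSize entries of an already sorted list, skipping the first offset entries. */
    static <T> List<T> page(List<T> sorted, int offset, int pageSize) {
        if (offset < 0 || pageSize < 0) {
            throw new IllegalArgumentException("offset and pageSize must be non-negative");
        }
        int from = Math.min(offset, sorted.size());
        // Computed in long so that from + pageSize cannot overflow.
        int to = (int) Math.min((long) sorted.size(), (long) from + pageSize);
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

The same helper applies at both levels: with groupOffset and topNGroups over the sorted groups, and with withinGroupOffset and maxDocsPerGroup over the sorted results inside one group.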

2011-08-09 14:11

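#5 xtuhcy 2012-07-05

For lack of time I never got around to testing the single-pass method, haha. If anyone gets it working, please share it too.

#4 ternus 2012-06-20

I am running into the same problem now. Has anyone managed to get the single-pass method working? If so, please explain how. My QQ: 309754782. Thanks.

#3 zhaoshijie 2012-06-15

Hi, I hope you can post the single-pass version as well. I would be very grateful, thanks!

#2 wyyl1 2011-12-07

Thank you very much!
I searched today for a Lucene solution, and the one you describe works. However, I spent a long time on the single-pass method and could not get it to work. I hope we can discuss the single-pass method when you have time.

I am using the Lucene 3.5.0 components,
following http://lucene.apache.org/java/3_5_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html
The single-pass method did not work for me.

#1 xu101q 2011-10-20

Nice, learned something! The first time I read this I did not pay much attention; on the second read I realized how useful it is.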
