对于某些集合,此方法可能是可行的,但需要注意以下几点:
- 基数聚合使用HyperLogLog ++算法来 近似 基数。对于低基数字段,此近似值可能完全准确,而对于高基数字段,则近似值不那么准确。
- 术语对于 许多 术语而言,聚合可能在计算上很昂贵,因为每个存储桶都需要构建在内存中,然后序列化以响应。
您可能可以跳过基数汇总来获取大小,而只需将其int.MaxValue作为术语汇总的大小即可。在速度方面效率较低的另一种方法是滚动浏览范围内的所有文档,使用源过滤器仅返回您感兴趣的字段。我希望使用Scroll方法可以减轻群集的压力,但我建议您监视您采用的任何方法。
这是对Stack Overflow数据集(2016年6月,IIRC)上这两种方法的比较,研究了两年前的今天和一年前的今天的独特提问者。
术语汇总void Main() { var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200")); var connectionSettings = new ConnectionSettings(pool) .MapDefaultTypeIndices(d => d .Add(typeof(Question), NDC.StackOverflowIndex) ); var client = new Elasticclient(connectionSettings); var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2); var yearago = DateTime.UtcNow.Date.AddYears(-1); var searchResponse = client.Search<Question>(s => s .Size(0) .Query(q => q .Daterange(c => c.Field(p => p.CreationDate) .GreaterThan(twoYearsAgo) .Lessthan(yearago) ) ) .Aggregations(a => a .Terms("unique_users", c => c .Field(f => f.OwnerUserId) .Size(int.MaxValue) ) ) ); var uniqueOwnerUserIds = searchResponse.Aggs.Terms("unique_users").Buckets.Select(b => b.KeyAsstring).ToList(); // 3.83 seconds // unique question askers: 795352 Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}"); }滚动API
void Main() { var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200")); var connectionSettings = new ConnectionSettings(pool) .MapDefaultTypeIndices(d => d .Add(typeof(Question), NDC.StackOverflowIndex) ); var client = new Elasticclient(connectionSettings); var uniqueOwnerUserIds = new HashSet<int>(); var twoYearsAgo = DateTime.UtcNow.Date.AddYears(-2); var yearago = DateTime.UtcNow.Date.AddYears(-1); var searchResponse = client.Search<Question>(s => s .source(sf => sf .Include(ff => ff .Field(f => f.OwnerUserId) ) ) .Size(10000) .Scroll("1m") .Query(q => q .Daterange(c => c .Field(p => p.CreationDate) .GreaterThan(twoYearsAgo) .Lessthan(yearago) ) ) ); while (searchResponse.Documents.Any()) { foreach (var document in searchResponse.Documents) { if (document.OwnerUserId.HasValue) uniqueOwnerUserIds.Add(document.OwnerUserId.Value); } searchResponse = client.Scroll<Question>("1m", searchResponse.ScrollId); } client.ClearScroll(c => c.ScrollId(searchResponse.ScrollId)); // 91.8 seconds // unique question askers: 795352 Console.WriteLine($"unique question askers: {uniqueOwnerUserIds.Count}"); }
术语汇总比Scroll API方法快24倍。
解决方法我想获取给定期间内唯一数字用户ID的列表。
假设字段为userId,时间字段为startTime,我成功获得如下结果;
HashSet<int> hashUserIdList= new HashSet<int>(); // guarantees to store unique userIds. // Step 1. get unique number of userIds var total = client.Search<Log>(s => s .Query(q => q .DateRange(c => c.Field(p => p.startTime) .GreaterThan(FixedDate))) .Aggregations(a => a .Cardinality("userId_cardinality",c => c .Field("userId")))) .Aggs.Cardinality("userId_cardinality"); int totalCount = (int)total.Value; // Step 2. get unique userId values by Terms aggregation. var response = client.Search<Log>(s => s .Source(source => source.Includes(inc => inc.Field("userId"))) .Query(q => q .DateRange(c => c.Field(p => p.startTime) .GreaterThan(FixedDate))) .Aggregations(a => a .Terms("userId_terms",c => c .Field("userId").Size(totalCount)))) .Aggs.Terms("userId_terms"); // Step 3. store unique userIds to HashSet. foreach (var element in response.Buckets) { hashUserIdList.Add(int.Parse(element.Key)); }
它 可以工作,
但效率不高,因为(1)totalCount首先获取,并且(2)它定义Size(totalCount)由于存储桶溢出(如果结果有成千上万个),可能导致500个服务器错误。
以某种foreach方式进行迭代会很好,但是我无法使它们按大小迭代100。我在这里放了From/ Size或Skip/
Take,但是返回值不可靠。
如何正确编码?
评论列表