
Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5KB Of Memory

This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month.


At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality of the set is large.

To better understand the challenge of determining the cardinality of large sets, let's imagine that you have a 16-character ID and you'd like to count the number of distinct IDs that you've seen in your logs. Here is an example:

4f67bfc603106cb2

These 16 characters represent 128 bits. 65K IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits, or 45 gigabytes of storage. And that is just the space the ID field requires! We could take a simplistic approach to getting the cardinality of IDs in our daily events. The most straightforward idea is to use an in-memory hash set that contains the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records is unique, the hash set would still take 119 gigabytes of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigabytes of memory to count distinct elements this way, and that is only to count a single day's worth of unique IDs. The problem only gets more difficult if we want to count weeks or months of data. We certainly don't have a single machine with several hundred gigabytes of free memory sitting around, so we needed a better solution.
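To make the naive approach concrete, here is a minimal sketch of the in-memory hash set idea. The class and method names are illustrative, not Clearspring's code:

import java.util.HashSet;
import java.util.Set;

// Naive exact counting: keep every distinct ID in memory.
// At ~1 billion unique 16-character IDs, this is where the
// hundred-plus gigabytes of RAM go.
public class NaiveDistinctCounter {
    private final Set<String> seen = new HashSet<>();

    public void offer(String id) {
        seen.add(id); // duplicates are ignored by the set
    }

    public long cardinality() {
        return seen.size(); // exact, but memory grows with every new ID
    }
}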

One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. The basic idea with a bitmap is to map the input dataset to a bit field using a hash function, where each input element uniquely maps to one of the bits in the field. This produces zero collisions and reduces the space required to count each unique element to 1 bit. While bitmaps drastically reduce the space requirements of the naive set implementation described above, they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. For example, if you want to count to one billion using a bitmap, you will need one billion bits, or roughly 120 megabytes for each counter. Sparse bitmaps can be compressed to gain space efficiency, but that is not always helpful.
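A sketch of the bitmap idea, assuming (as the text does) an idealized collision-free mapping from IDs to bit positions. BitmapCounter and its use of hashCode are illustrative assumptions:

import java.util.BitSet;

// Exact counting with a bitmap: one bit per possible element.
// Counting to one billion needs one billion bits (~120 MB) per counter.
public class BitmapCounter {
    private final BitSet bits;
    private final int size;

    public BitmapCounter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    public void offer(String id) {
        // Idealized: assume the hash maps each distinct ID to its own bit.
        bits.set(Math.floorMod(id.hashCode(), size));
    }

    public int cardinality() {
        return bits.cardinality(); // population count of set bits
    }
}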

Luckily, cardinality estimation is a popular area of research. We've leveraged this research to provide an open-source implementation of cardinality estimators, set membership detection, and top-k algorithms.

Cardinality estimation algorithms trade space for accuracy. To illustrate this point, we counted the number of distinct words in all of Shakespeare's works using three different counting techniques. Note that our input dataset has extra data in it, so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a Hyper LogLog Counter. Here are the results:

Counter       Bytes Used   Count   Error
HashSet       10447016     67801   0%
Linear        3384         67080   1%
HyperLogLog   512          70002   3%

The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashSet that requires nearly 10 megabytes of space, and you can easily see why cardinality estimators are useful. In applications where accuracy is not paramount, which is true for most web-scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.

Linear Probabilistic Counter

The Linear Probabilistic Counter is space efficient and allows the implementer to specify the desired level of accuracy. This algorithm is useful when space efficiency is important but you need to be able to control the error in your results. The algorithm works in a two-step process. The first step allocates a bitmap in memory, initialized to all zeros. A hash function is then applied to each entry in the input data. The result of the hash function maps the entry to a bit in the bitmap, and that bit is set to 1. In the second step, the algorithm counts the number of empty bits and uses that number as input to the following equation to get the estimate.

n = -m ln(Vn)

In the equation, m is the size of the bitmap and Vn is the ratio of empty bits to the size of the map. The important thing to note is that the size of the original bitmap can be much smaller than the expected max cardinality. How much smaller depends on how much error you can tolerate in the result. Because the size of the bitmap, m, is smaller than the total number of distinct elements, there will be collisions. These collisions are what save the space, but they are also the source of the error in the estimate. So by controlling the size of the original map we can estimate the number of collisions, and therefore the amount of error we will see in the end result.
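Putting the two steps and the equation together, a minimal linear probabilistic counter might look like the sketch below. The FNV-1a hash and the sizes are illustrative assumptions, not the stream-lib implementation:

import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Linear probabilistic counting: hash every entry into an m-bit map,
// then estimate n = -m * ln(Vn), where Vn is the fraction of empty bits.
public class LinearCounter {
    private final BitSet map;
    private final int m;

    public LinearCounter(int m) {
        this.m = m;
        this.map = new BitSet(m);
    }

    public void offer(String item) {
        map.set(Math.floorMod(hash(item), m)); // step 1: set the mapped bit
    }

    public double estimate() {
        int empty = m - map.cardinality();     // step 2: count the empty bits
        double vn = (double) empty / m;        // Vn = empty bits / map size
        // Note: if no bits are empty (vn == 0), the map was too small
        // for the stream and the estimate diverges.
        return -m * Math.log(vn);              // n = -m ln(Vn)
    }

    // FNV-1a 32-bit hash; any well-mixed hash function works here.
    private static int hash(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }
}

As a worked example of the equation: with m = 1024 bits and 300 bits still empty, Vn is 300/1024 ≈ 0.293 and the estimate is -1024 × ln(0.293) ≈ 1257 distinct elements.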

Hyper LogLog

The Hyper LogLog Counter's name is self-descriptive. The name comes from the fact that you can estimate the cardinality of a set with maximum cardinality Nmax using just loglog(Nmax) + O(1) bits. Like the Linear Counter, the Hyper LogLog counter allows the designer to specify the desired accuracy tolerances. In Hyper LogLog's case, this is done by defining the desired relative standard deviation and the max cardinality you expect to count. Most counters work by taking an input data stream, M, and applying a hash function to that set, h(M). This yields S = h(M), an observable multiset of binary strings in {0,1}^∞. Hyper LogLog extends this concept by splitting the hashed input stream into m substreams and maintaining one observable for each of them. Taking the average of these observables yields a counter whose accuracy improves as m grows, yet one that requires only a constant number of operations per element of the input set. The result is that, according to the authors of this paper (http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf), this counter can count one billion distinct items with an accuracy of 2% using only 1.5 kilobytes of space. Compare that to the roughly 120 megabytes required by a bitmap, let alone the hundred-plus gigabytes required by the HashSet implementation, and the efficiency of this algorithm becomes obvious.
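The following is a simplified, illustrative HyperLogLog sketch built from the description above: the first b bits of each hash pick one of m = 2^b registers, and each register remembers the longest run of leading zeros it has observed in its substream. The constants and the FNV-1a hash are assumptions for the sketch; the published algorithm adds further range corrections:

import java.nio.charset.StandardCharsets;

// Simplified HyperLogLog: m = 2^b registers, register-wise max of
// leading-zero ranks, harmonic-mean estimate with bias correction.
public class HyperLogLogSketch {
    final int b;              // log2 of the number of registers
    final int m;              // number of registers, m = 2^b
    final byte[] registers;

    public HyperLogLogSketch(int b) {
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    public void offer(String item) {
        long x = hash64(item);
        int idx = (int) (x >>> (64 - b));   // first b bits: register index
        long rest = x << b;                 // remaining bits, left-aligned
        int rank = (rest == 0)              // position of the first 1-bit
                 ? (64 - b + 1)
                 : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    public double cardinality() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias correction (m >= 128)
        double est = alpha * m * (double) m / sum; // harmonic-mean estimate
        if (est <= 2.5 * m && zeros > 0) {       // small-range correction:
            est = m * Math.log((double) m / zeros); // fall back to linear counting
        }
        return est;
    }

    // FNV-1a 64-bit hash; any well-mixed 64-bit hash works.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte bb : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (bb & 0xffL);
            h *= 0x100000001b3L;
        }
        return h;
    }
}

Per the standard HyperLogLog analysis, the relative error is about 1.04/sqrt(m), so b = 11 (2048 registers) gives roughly 2.3%; packing the registers into 5 bits each instead of whole bytes is what brings the footprint near the 1.5KB figure quoted above.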

Merging Distributed Counters

We've shown that using the counters described above we can estimate the cardinality of large sets. However, what can you do if your raw input dataset does not fit on a single machine? This is exactly the problem we face at Clearspring. Our data is spread out over hundreds of servers, and each server contains only a partial subset of the total dataset. This is where the fact that we can merge the contents of a set of distributed counters is crucial. The idea is a little mind-bending, but if you take a moment to think about it, the concept is not that much different from basic cardinality estimation. Because the counters represent the cardinality as a set of bits in a map, we can take two compatible counters and merge their bits into a single map. The algorithms already handle collisions, so we can still get a cardinality estimate with the desired precision even though we never brought all of the input data to a single machine. This is terribly useful and saves us a lot of time and effort moving data around our network.
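As a sketch of what merging looks like for the two illustrative counters above (both sides must use the same size, parameters, and hash; these methods are assumptions layered on the earlier sketches, not stream-lib's API):

// Added inside the LinearCounter sketch: the union of two bitmaps is the
// bitwise OR, exactly as if every input had hit a single counter.
public void merge(LinearCounter other) {
    map.or(other.map);
}

// Added inside the HyperLogLogSketch: keep, per register, the longest
// leading-zero run observed by either machine.
public void merge(HyperLogLogSketch other) {
    for (int i = 0; i < m; i++) {
        registers[i] = (byte) Math.max(registers[i], other.registers[i]);
    }
}

Both merges are associative and commutative, so hundreds of servers can each build a local counter and ship only a few kilobytes to a coordinator that folds them together.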

Next Steps

Hopefully this post has helped you better understand the concept and application of probabilistic counters. If estimating the cardinality of large sets is a problem and you happen to use a JVM based language then you should check out the stream-lib project — it provides implementations of the algorithms described above as well as several other stream-processing utilities.
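For reference, using stream-lib's HyperLogLog looks roughly like the sketch below; check the class and method names against the version you depend on, and note that the 0.02 relative standard deviation and the IDs are illustrative:

import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

public class StreamLibExample {
    public static void main(String[] args) throws Exception {
        HyperLogLog hll = new HyperLogLog(0.02);  // target ~2% relative std. dev.
        hll.offer("4f67bfc603106cb2");
        hll.offer("4f67bfc603106cb2");            // duplicates count once
        hll.offer("5e8a1f9d24c07ab3");            // hypothetical second ID
        System.out.println(hll.cardinality());    // ~2

        HyperLogLog other = new HyperLogLog(0.02);
        other.offer("b12d9e77f0a4c5d8");          // built on another machine
        ICardinality merged = hll.merge(other);   // distributed counters combine
        System.out.println(merged.cardinality()); // ~3
    }
}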

For a deeper treatment of Hyper LogLog, see http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf and http://www.ic.unicamp.br/~celio/peer2peer/math/bitmap-algorithms/durand03loglog.pdf

Source: High Scalability, via http://blog.csdn.net/hguisu/article/details/8433731
