High-Speed Dynamic Packet Filtering

Luca
[email protected]

Abstract
A problem encountered while monitoring gigabit networks is the need to filter only those packets that are interesting for a given task while ignoring the others. Popular packet filtering technologies enable users to specify complex filters but do not usually allow multiple filters to be specified. This paper describes the design and implementation of a new dynamic packet filtering solution that allows users to specify several IP filters simultaneously with almost no packet loss, even on heavily loaded gigabit links. The advantage is that modern traffic monitoring applications such as P2P, IPTV and VoIP monitoring, and lawful interception can dynamically set packet filters to efficiently discard packets inside the operating system kernel according to the traffic, calls and users being monitored.

Keywords
Passive packet capture, packet filtering, traffic monitoring, Linux kernel.

1. INTRODUCTION
With the advent of gigabit networks, many existing applications such as IDSs (Intrusion Detection Systems), traffic monitoring applications, and packet sniffers are faced with problems such as high packet loss and CPU utilization due to the amount of traffic to be analyzed. Industry and academia have produced solutions, both in hardware [9] [18] [7] [27] [28] and in software [1] [19] [10] [26] [29], able to accelerate packet capture in order to avoid packet loss due to high incoming packet rates. Nevertheless, both in theory and in practice, monitoring applications still need to process incoming packets at high speed in order not to lose packets. The reason is that all the above solutions are suitable for accelerating packet capture at wire speed with minimum packet size, but offer very little in terms of packet filtering, as usually the number of filters that can be specified is quite limited.
In other application domains such as lawful interception [25] the problem is even worse, as network operators usually provide a copy of the packets flowing through a link to which several hundred users are connected, while the law states that only the traffic of intercepted users may actually be captured and analyzed. This means that if an xDSL user is intercepted, only a few tens of pps (packets per second) need to be captured out of the millions of pps flowing on the link. The analysis of VoIP (Voice over IP) traffic is also challenging, as monitoring applications analyze signaling protocols such as SIP and H.323 in order to dynamically figure out the IP address and port on which the video/voice RTP stream for a given call will be carried. In general, P2P traffic [22] [23] [24] is the most difficult traffic to track, as it is by nature very dynamic, often encrypted/scrambled, and based on continuously evolving protocols.
In a nutshell, the above examples demonstrate that high-speed packet capture without advanced kernel filtering capabilities is useless in many scenarios; this is because the overall system performance can be improved only if both the kernel and the applications do not waste CPU cycles pushing to user space unnecessary packets that will later be discarded. Furthermore, modern applications require dynamic packet filtering based on simple VLAN/IP address/port number criteria, whereas popular packet filtering facilities such as BPF (Berkeley Packet Filter) or router-based ASIC filtering [16] allow one (or a few) static filters whose reconfiguration has an impact

on active packet capture applications in terms of packet loss. The aim of this paper is to prove that it is possible to provide a positive answer to the above challenges, even using pure software-based approaches, without the need to adopt costly hardware-based packet capture cards. Of course this work applies to cases where packet filtering needs to be performed, as it brings no advantage to applications that need to analyze the whole input stream without any filtering at all.

2. MOTIVATION AND SCOPE OF WORK
The problem of packet filtering is well known [12] and has been tackled very often in the literature. Historically the BPF (Berkeley Packet Filter) [4], dated 1990, an evolution of early packet filtering efforts carried out at Carnegie-Mellon University, is still the most widely used solution to the problem. BPF includes a packet filtering machine able to execute programs. Each program is an array of instructions that sequentially execute actions on a pseudo-machine state. The popular tcpdump tool allows the filtering program to be easily inspected (e.g. tcpdump -d "tcp and port 80"). In BPF, checking whether a given packet is TCP takes 8 instructions; a slightly more complex check, such as verifying whether a packet is HTTP, takes more than double that number of instructions. This example shows that even pretty simple filters can require a large number of instructions, and this number increases significantly if boolean operators are used in the filter. The release of BPF greatly stimulated the research community, and over the years several improvements to BPF have been produced [5] [11] [14]. Unfortunately all these efforts basically present the same characteristics:
- Only very few filters can be specified; in general only one filter can be specified, although it can be divided into sub-filters linked with boolean operators that can be arranged on a filter graph.
- The filtering expression is implemented using pseudo instructions whose number is proportional to the number and complexity of the filters.
- Adding/removing filters requires a general reconfiguration that can lead to packet loss, as capturing activities might be temporarily interrupted.
- In some cases, packet filtering is accelerated using network processors that are no longer available on the market.
The conclusion is that the family of BPF-based approaches is suitable for applications that require one arbitrarily complex filter, such as a packet sniffer like tcpdump or wireshark, but not for dynamic applications such as those previously listed.
Hardware-based packet filters, such as those based on FPGA/NPU-accelerated [21] [15] cards, do not generally present limitations in terms of filtering speed, as filters operate at wire rate with no packet loss. Instead the limitations are:
- Very few filters can be specified (usually in the 8-64 range), as they are limited by the space available on the silicon/RAM used for storing filters. The more complex the filter, the fewer filters can be specified. This is because filters are apparently implemented in a similar way to BPF, where the length of the filtering program depends on the filter complexity.
- Filters are usually very basic compared to the richness and expressiveness of BPF, and sometimes they need to be specified in byte-code.
- Negative filtering (e.g. not (tcp and port 80)) is often not supported.
- Adding/removing filters can require a general reconfiguration of the hardware (e.g. when FPGAs are used); in some cases this can take up to a minute, which is not compatible with most traffic monitoring applications, which need to operate continuously.
- Filters are often not able to handle mixed encapsulations. For instance, if MPLS-tagged packets are mixed with plain (ethernet IP) packets and VLAN-tagged packets, the filter is not

able to operate properly, as filters usually assume a single type of encapsulation. This is because filters are defined with offsets that change according to the encapsulation; hence they fail when mixed encapsulations are used, which is very common on high-speed links where different kinds of traffic are transported using tagging.
ASIC-based filtering facilities present in some network routers, such as those running Juniper JunOS or Cisco IOS, allow packets to be filtered efficiently using filtering expressions that are not as powerful as BPF but sufficient for most applications. The drawback of using routers as packet filters is the cost, complexity and feasibility of the whole solution, in addition to the fact that routers have not been designed to be used as packet filters subject to continuous/frequent configuration changes (e.g. tracking RTP streams based on signaling when VoIP traffic is analyzed).
The lack of inexpensive solutions for dynamic packet filtering in host-based systems has been the motivation for this research work. In fact both hardware and software-based solutions, with the exception of ASIC-based filtering that is too costly to be deployed in several scenarios, have interesting features but do not satisfy the requirements of modern dynamic network monitoring applications (e.g. VoIP monitoring, lawful interception, multimedia/IPTV, P2P traffic monitoring) and of applications that need to handle hundreds of filtering rules, such as host-based IDSs:
- Ability to specify a thousand different IP packet filters. A VoIP gateway can very well support a thousand calls simultaneously, and an IDS can have several hundred active rules.
- Ability to dynamically add/remove filters without having to interrupt existing applications. In other words, filter reconfiguration should not require stopping the whole system even for a short period of time; instead, the system must be able to operate while filters are manipulated.
- The filter processing speed and the memory being used should not be proportional to the number of filters, but independent of their number and complexity.
- Filters do not need to be as rich as BPF: header-based filtering on VLAN/protocol/IP address/port number is enough for the targeted applications.
- Due to the nature of the applications for which this filter has been designed, only "precise" filters (e.g. host X and port Y) are supported. Features like ranges (e.g. port < 1024) and sets (e.g. host X or host Y or host Z) are not supported as, if necessary, they can be expanded into several precise filters, which are supported by the system. Note that with a little effort it is also possible to support subnetworks, which have not been taken into account as they are very seldom used to monitor the above targeted applications.
- In order to achieve reasonable performance, the system can have a limited false-positive filtering rate (i.e. the filtering system can report that a given packet matches the filters even if it is not so) but no false negatives (i.e. no packet matching one or more filters should be discarded by mistake by the system).
In a nutshell, the goal of this work is to create a low-cost system based on commodity hardware, able to filter packets at wire speed directly in the kernel, so that dynamic monitoring applications receive only those packets they are interested in. This does not require the filtering process to be fully accurate, as long as dynamic in-kernel filtering dramatically reduces the amount of work that applications need to carry on, also because applications might still need to do some extra filtering based on more complex criteria (e.g. TCP connection state, or the last message exchanged on the connection). The goal will be achieved only if this system has a very high performance compared to the traditional approach of "get every packet and filter it in user space" applied by most applications, or whenever (e.g.
in FPGA/NPU filtering systems) it is not possible to define as many filters as necessary for the application due to hardware limitations.
This work will also be beneficial to applications such as IDSs that need to handle hundreds of rules of the form alert <source> <destination> (<extra criteria>). In fact for each of the above rules a filtering rule '<source> <destination>' can be set, so that packets are filtered early into

the kernel and the IDS gets only those packets that are potentially interesting. Doing the same with BPF would result in a giant filter ((packet header filter for rule 1) or (packet header filter for rule 2) or ...) that would be inefficient and hard for BPF implementations to handle.
In general the scope of this work is not to replace BPF-like filters, which are useful for several applications, but to implement a pre-BPF filtering layer designed around the requirements of emerging traffic monitoring applications.

Figure 1. Network Packet Journey (user space: packet consumption; kernel space: BPF filtering (optional), dynamic filtering, network device driver)

For this reason, dynamic filtering is implemented directly inside the network device driver because:
- This is the earliest place in the system where a packet shows up. This means that if packets are dropped in this component due to filtering, all components on top of the device driver (kernel and user-space applications) will benefit from it. Instead, if dynamic filtering is implemented on top of the device driver or, even worse, in user space, the amount of work needed to move a packet from the device to the filtering component is wasted for those packets that do not match any filter.
- This is the layer below the kernel code where BPF (or other BPF-like filters) resides, hence dynamic filtering can be used by BPF as a pre-filtering stage in order to filter out all those packets that will definitely not match any BPF filter. From the user's point of view, this means that dynamic filtering is transparent to existing applications, so no modification at all is necessary in order to take advantage of it.
- Dynamic filtering can take advantage of existing packet capture acceleration facilities that can further accelerate the journey of filtered packets through the kernel.
Reducing (by means of filtering) the amount of work that an application has to carry on in order to achieve a certain task has a positive effect on the overall system performance.
Later in this paper it will be shown that early (in-kernel) and efficient packet filtering leads to better application performance than using a non-filtering accelerated network driver with filtering inside the monitoring application. On the other hand, it is out of the scope of this paper to discuss how user-space application performance can be improved in terms of packet capture and analysis performance.
The following sections describe the design choices and the implementation of the filtering, and give an evaluation of the performance of the solution.

3. THE DESIGN OF DYNAMIC FILTERS
In order to achieve the planned goal of having thousands of filters that can be dynamically added and removed, for the reasons explained before, it is obviously not feasible to base the solution on a pseudo-machine like the one used by the BPF family. Instead it is necessary to use a different solution that guarantees constant memory consumption regardless of the number of filters, with low/limited false positives and no false negatives. Bloom filters [2] [3] [6] [20] are a perfect solution to the above

problem, as they are space-efficient, do not present false negatives, and allow elements to be dynamically added to the filter set. A bloom filter is an array of m bits, all initially set to zero, and a set of k different hash functions, each of which maps a key value into the 0..m-1 range. When a key element is added to the set, each hash is applied to the key, and all the bits that correspond to the results of the hash computations are set to one. Checking whether an element belongs to the set is pretty simple: the k hashes are applied to the element, and if all the bits that correspond to the results of the hash functions are set to one, then the element belongs to the set. Bloom filters have several advantages with respect to other efficient data structures such as binary search trees and tries, as the time needed to add an item or to check whether an element belongs to the filtering set is constant, independent of the number of elements that belong to the set, similar to what happens with sparse hash tables [8]. The disadvantage is that elements can be added to blooms but cannot be removed, as each bit may have been set to one by several hashes; hence, if during removal it is set to zero, this will break the logic. For this reason, if removal is necessary a counting bloom [13] is used, where each bloom array element is not a bit but a counter that is incremented/decremented every time the bit would be set/reset. Unlike filtering sets based on hash tables, adding an element never fails due to the overflow of the data structure, although the false positive rate increases as elements are added to the set.
Even if bloom filters seem to have several interesting features and few limitations, they are generally used only in hardware, as their implementation in software can be rather costly because:
- Every time an element is checked for membership in the set, k hash functions need to be calculated.
This is not a problem in hardware, where hashes are computed in parallel, whereas in software they are computed sequentially, with obvious limitations in terms of speed. It is worth noting that computing them in parallel threads and synchronizing the results using semaphores does not bring any speed advantage.
- In order to both add and remove elements from the filter set, it is necessary to use counting blooms, which unfortunately have the drawback of using too much memory, as they replace bits with counters; this is a problem in particular when they are implemented inside the kernel, where the contiguous memory necessary for allocating the bloom array is always an issue.
Starting from the principle that a bloom-like approach is a good solution to the problem of keeping a large number of filters in memory, the author tried to distill a hybrid solution able to offer the advantages of blooms in a way that can be efficiently implemented in software. It is clear that, given a large number of filters, procedural algorithms (e.g. all those based on pseudo-machines such as BPF) cannot be used, whereas a "compression" algorithm such as hashes or blooms is a good approach, as it creates a small "fingerprint" that is easy to store and maintain. For this reason the author has decided to implement software bloom filters with the following differences with respect to the original recipe:
- The number of hash functions is limited to 2 (k = 2). If the bloom size is large and the hash function is good, this solution does not lead to many false positives, while providing good performance as only two hashes need to be computed in the worst case whenever a packet is checked for inclusion in the filter set.

Figure 2. Counting Bloom Implementation (collision list elements: bloom array index, bloom counter, pointer to next element)

- As counting blooms are necessary but the author is not willing to pay the drawback in terms of large memory usage, a novel implementation of counting blooms is presented, based on the fact that hash collisions (i.e. two different filters producing the same hash result) are pretty rare, in particular with a large bloom array. For this reason the bloom array is still implemented with single bits, but whenever a new element is added to the filter set, in case of collision, a list of collisions is maintained. For instance, if a new filter is added and the result of the hash function is x, and bloom[x] is one, the collision list is searched for x. If the collision element does not exist, it is created and its counter is set to two, the original value of bloom[x] plus one (the collision); if it already exists, it is incremented. If a filter with hash y is removed, before setting bloom[y] to zero the collision list is searched for y; if it exists, the corresponding bloom counter is decremented by one, and if it does not exist bloom[y] is set to zero. In case the bloom counter is decremented and the new value is one, the collision element for y is removed from the list, as there are no more collisions for such an element. With this solution it is possible to implement counting blooms at little cost, as only the extra collision list needs to be maintained with respect to the original bloom.
Before discussing implementation details and performance, it is worth analyzing how the proposed solution behaves in terms of probability of false positives, i.e. packets that pass the filtering but do not match any rule. The standard equation for a bloom filter is q = 1 - (1 - 1/m)^(nk) ≈ 1 - e^(-nk/m), while the probability of a false positive is q^k, where:
- q is the probability that a random bit of the bloom filter is 1;
- n is the number of elements that have been added to the bloom filter;
- m is the number of bits used to implement the bloom filter;
- k is the number of hash functions which, as stated before, is set to 2.
The increased false positive probability due to setting k to 2 (a relatively low value, as in hardware usually 8 or more hashes are used) can be balanced by using a large value for m, which is generally not as much of a problem in software as it is in hardware, even when coding inside the operating system kernel, which has more constraints than user space.
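The counting-bloom variant described above (a plain bit array plus a collision list that tracks only the rare positions set more than once) can be sketched in user-space C as follows. This is a minimal sketch, not the paper's driver code: the FNV-1a hash, the array size and all names are assumptions of this example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOOM_BITS (1 << 21)            /* m: ~2 Mbit of filter state   */
#define K 2                             /* k = 2 hash functions         */

struct collision {                      /* node of the collision list   */
    uint32_t index;                     /* bloom array index            */
    uint32_t counter;                   /* how many keys set this bit   */
    struct collision *next;
};

static uint8_t bloom[BLOOM_BITS / 8];   /* single bits, not counters    */
static struct collision *collisions;

/* Two cheap hashes derived from FNV-1a with different seeds (assumed). */
static uint32_t hash(const void *key, size_t len, uint32_t seed)
{
    const uint8_t *p = key;
    uint32_t h = 2166136261u ^ seed;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h % BLOOM_BITS;
}

static int  bit_test(uint32_t i)  { return bloom[i / 8] & (1 << (i % 8)); }
static void bit_set(uint32_t i)   { bloom[i / 8] |=  (1 << (i % 8)); }
static void bit_clear(uint32_t i) { bloom[i / 8] &= ~(1 << (i % 8)); }

static void bloom_add(const void *key, size_t len)
{
    for (uint32_t s = 0; s < K; s++) {
        uint32_t i = hash(key, len, s);
        if (!bit_test(i)) { bit_set(i); continue; }
        /* Collision: record the extra reference in the list. */
        struct collision *c;
        for (c = collisions; c && c->index != i; c = c->next) ;
        if (c) { c->counter++; continue; }
        c = malloc(sizeof(*c));
        c->index = i; c->counter = 2;   /* original bit plus this key   */
        c->next = collisions; collisions = c;
    }
}

static void bloom_remove(const void *key, size_t len)
{
    for (uint32_t s = 0; s < K; s++) {
        uint32_t i = hash(key, len, s);
        struct collision **pp = &collisions, *c;
        for (c = collisions; c && c->index != i; c = c->next)
            pp = &c->next;
        if (!c) { bit_clear(i); continue; }        /* no collision recorded */
        if (--c->counter == 1) { *pp = c->next; free(c); }
    }
}

static int bloom_check(const void *key, size_t len)
{
    for (uint32_t s = 0; s < K; s++)
        if (!bit_test(hash(key, len, s)))
            return 0;                   /* definitely not in the set    */
    return 1;                           /* probably in the set          */
}
```

With k = 2, n = 1000 filters and m around 2,000,000 bits, the formula above gives a false positive probability of roughly 10^-6, consistent with the ~244 KB figure quoted in the paper.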
Doing some simple computations [17], and supposing that 1000 filters are set, the memory necessary for having a false positive probability of 10^-6 is about 244 KB. This means that, implementing blooms properly, it is possible to dramatically reduce the amount of packets to be analyzed by the monitoring application at the cost of a few KB of memory, with no false negatives, and with a constant filtering time, as it is proportional neither to the number of filters nor to the bloom dictionary size.
Bloom filters are rather easy to implement, as they are a contiguous array of bits that are set to 0/1 according to a hash function calculated on some input value, which is usually a concatenation of specific packet properties such as VLAN id, IP address, protocol and port. This concatenation is not unique, as the way packets are filtered out depends on the nature of the monitoring application.

Application Type       Packet Filtering Criteria
P2P                    IP (P2P server). Port, used to track default P2P ports. P2P session (e.g. TCP, IP address/port). Payload.
Lawful Interception    Intercepted IP address. Port (e.g. radius/1812) regardless of the user.
VoIP, Streaming, IPTV  Signaling protocol (e.g. H.323, SIP: UDP and port 5060). RTP audio/video (UDP, IP address/port). RTCP for audio/video synchronization.
IDS                    Home network (IP address/mask) to be analyzed. Port (e.g. http/80) where signatures are searched for a match.

Table I. Comparison of Filtering Criteria based on Application Type

The previous table shows that there are no common criteria for filtering traffic among the listed applications, although all of them are pretty similar. In order to satisfy all of them, the author decided to compute the value of the bloom hash function on the concatenation of protocol, IP address and port.

Figure 3. Bloom Hash Calculation (8-bit protocol | 32-bit IP address | 16-bit port)

This means that for each incoming packet, the hash function is in the worst case calculated twice, on both the source and the destination IP address/port, using the same protocol value. Note that if the match using the first hash fails, the second hash is not computed at all. If some filtering rules require a wildcard, the value used for the corresponding hash element is set to zero. For instance, in the rule "TCP/any-address/port 80" the value for any-address is set to zero.
Wildcard implementation however is not cheap, as:
- When the rule is added to the dictionary, its wildcard value is zero. Note that this algorithm does not lead to errors whenever the wildcard is used on a field where zero is a valid value (e.g. protocol id 0 corresponds to IP).
- When a packet is checked for a match, for each wildcard-enabled field both the real packet value and the wildcard need to be used in the hash. For instance, if the port can have a wildcard value, two hashes need to be calculated (protocol/ip/port and protocol/ip/0), hence the bloom dictionary is checked twice, once per hash value.
The consequence is that using wildcards with blooms is possible if this practice is limited to one or two fields at most, because wildcard support significantly increases the filtering time. Another practice to be avoided is handling large value ranges, such as IP addresses with a small netmask (e.g. /16), because it would be necessary to expand all the addresses and compute a hash for each of them.
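The key layout of Figure 3 and the wildcard double lookup can be sketched as follows. The bloom itself is reduced to a toy single-hash bit array so that the lookup order (source key, destination key, then the zeroed wildcard key) stands out; the hash and all names are assumptions of this sketch, not the paper's code.

```c
#include <stdint.h>
#include <string.h>

#define KEY_LEN 7                      /* 1 + 4 + 2 bytes, as in Figure 3 */
#define BLOOM_BITS 65536

static uint8_t bloom[BLOOM_BITS / 8];

static void make_key(uint8_t proto, uint32_t ip, uint16_t port,
                     uint8_t key[KEY_LEN])
{
    key[0] = proto;                    /* 8-bit protocol            */
    memcpy(key + 1, &ip, 4);           /* 32-bit IP address         */
    memcpy(key + 5, &port, 2);         /* 16-bit port               */
}

static uint32_t hash(const uint8_t *key)
{
    uint32_t h = 2166136261u;          /* FNV-1a, single hash (toy) */
    for (int i = 0; i < KEY_LEN; i++) { h ^= key[i]; h *= 16777619u; }
    return h % BLOOM_BITS;
}

static void filter_add(uint8_t proto, uint32_t ip, uint16_t port)
{
    uint8_t key[KEY_LEN];
    make_key(proto, ip, port, key);    /* ip or port 0 == wildcard  */
    uint32_t i = hash(key);
    bloom[i / 8] |= 1 << (i % 8);
}

static int filter_check(uint8_t proto, uint32_t ip, uint16_t port)
{
    uint8_t key[KEY_LEN];
    make_key(proto, ip, port, key);
    uint32_t i = hash(key);
    return (bloom[i / 8] >> (i % 8)) & 1;
}

/* Per packet: try the source key, then the destination key; if rules
 * with a wildcard address exist, also try the key with address zeroed. */
static int packet_matches(uint8_t proto, uint32_t src_ip, uint16_t src_port,
                          uint32_t dst_ip, uint16_t dst_port)
{
    return filter_check(proto, src_ip, src_port) ||
           filter_check(proto, dst_ip, dst_port) ||
           filter_check(proto, 0, dst_port);     /* any-address rule  */
}
```

Note how enabling the wildcard on one field adds one extra dictionary lookup per packet, which is why the text above recommends limiting wildcards to one or two fields.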
If range support is mandatory, it is then recommended to implement several blooms, one per address space (e.g. bloom1 handles proto/IP address with mask 32/port, bloom2 handles proto/IP address with mask 24/port, and so on), rather than having a single bloom and computing the hash with all the possible range values. Luckily all the monitoring applications taken into account in this work require precise matching with limited wildcard support; however it is worth mentioning that if value range and wildcard support is strongly required, either some tricks are used or bloom filters need to be avoided, as they are not adequate in terms of performance and false positive rate. In general, the work described in this paper is quite general, and it can be implemented in different ways from case to case, depending on the measurement requirements of the monitoring application.

4. IMPLEMENTING DYNAMIC FILTERING
While it is desirable to have a perfect filtering algorithm, monitoring applications can tolerate a relatively low false positive rate if this dramatically reduces the amount of work that they would otherwise have to carry on without any kind of filtering. Therefore the idea is to implement two-stage packet filtering: in addition to the already implemented traffic filtering and selection facilities, a lower filtering layer based on bloom filters is used to reduce the number of packets that are forwarded to the application only to be discarded later on.
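The per-netmask multi-bloom approach suggested in the previous section (one bloom per address space instead of exploding a range into exact entries) can be sketched as follows; the array layout, the toy single-hash bloom and all names are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS 65536
#define NUM_MASKS 2
static const int masks[NUM_MASKS] = { 32, 24 };    /* one bloom each */
static uint8_t blooms[NUM_MASKS][BLOOM_BITS / 8];

static uint32_t apply_mask(uint32_t ip, int bits)
{
    if (bits >= 32) return ip;                     /* /32: exact     */
    return ip & ~(0xffffffffu >> bits);            /* clear host bits */
}

static uint32_t hash(uint8_t proto, uint32_t ip, uint16_t port)
{
    uint8_t key[7] = { proto };
    uint32_t h = 2166136261u;                      /* FNV-1a (toy)   */
    memcpy(key + 1, &ip, 4);
    memcpy(key + 5, &port, 2);
    for (int i = 0; i < 7; i++) { h ^= key[i]; h *= 16777619u; }
    return h % BLOOM_BITS;
}

/* A /24 rule is stored once (host bits cleared) in the /24 bloom,
 * instead of being exploded into 256 exact-match entries. */
static void rule_add(uint8_t proto, uint32_t net, int bits, uint16_t port)
{
    for (int b = 0; b < NUM_MASKS; b++)
        if (masks[b] == bits) {
            uint32_t i = hash(proto, apply_mask(net, bits), port);
            blooms[b][i / 8] |= 1 << (i % 8);
        }
}

/* A packet is checked against every per-mask bloom with its address
 * truncated to that mask: NUM_MASKS lookups, not 2^host-bits. */
static int rule_check(uint8_t proto, uint32_t ip, uint16_t port)
{
    for (int b = 0; b < NUM_MASKS; b++) {
        uint32_t i = hash(proto, apply_mask(ip, masks[b]), port);
        if ((blooms[b][i / 8] >> (i % 8)) & 1)
            return 1;
    }
    return 0;
}
```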

Figure 4. Two Stage Traffic Filtering (monitoring application in user space, bloom filtering in the kernel)

In order to efficiently implement bloom filtering and avoid modifying existing applications, the best place to put it is inside the kernel, as:
- Early packet filtering (as close as possible to the network adapter from which packets are received) reduces the amount of work necessary to drop filtered packets.
- Implementing it under the BPF layer used by most applications makes it transparent to the application while significantly improving the overall performance.
- Bloom complexity and memory requirements are rather limited, which makes blooms suitable to be implemented inside the kernel, where there are more limitations than in user space.
The Linux 2.6 operating system has been selected as the reference platform for the implementation. For performance reasons, blooms should be implemented as close as possible to the network adapter, and the author has decided to implement them inside the network device driver rather than in the kernel proper, as this is the first place where incoming packets show up. The drawback of this design choice is that the implementation is NIC-dependent, even if, as explained later in this paper, porting the code across different adapters is pretty simple, as all the network device drivers share the same data structures, hence most of the code works with no change. The author selected the Intel GE adapter (e1000 driver) as the reference card, and then ported the code to the Broadcom GE adapter (tg3 driver) in order to verify the code portability and demonstrate that the work is not tied to one adapter but is pretty general.
On Linux, packets are fetched from network adapters using NAPI, a network API that implements efficient packet polling. This means that whenever there is an incoming packet just received by the adapter and ready to be handled by the driver, NAPI does the job by calling a function (e.g.
tg3_rx() on the Broadcom driver) that polls packets out of the adapter.

Figure 5. Packet Journey from NIC to User Space (NIC → NIC device driver with bloom filtering → NAPI → Linux kernel → user space)

Bloom filtering is implemented inside this function, so that only those packets that satisfy the bloom filters are returned, whereas those having no corresponding filtering rule are dropped at this stage and never handled by NAPI. The advantage of this solution is that filtered-out packets do not enter the protocol stack at all, hence there is no need to allocate the memory (called an skbuff or socket buffer on Linux) used to store a packet that would then be deallocated later when the packet is dropped. Reducing the number of memory (de)allocations is by itself a significant performance improvement, as this has to be done for each incoming packet.
NAPI is interrupt-driven: an interrupt is generated as soon as the adapter receives an incoming packet. At this point NAPI takes over control, disables NIC interrupts and starts polling packets as long as there are packets to receive. When no more packets are available,

interrupts are restored. In the current Linux kernel implementation, packet polling is a single thread of execution, meaning that only one NIC at a time can be polled and that packet polling is a start-and-stop activity: at each NAPI polling session only some packets are fetched, then the kernel carries on other activities, then polling continues. Although this mechanism is rather efficient, it does not fully exploit multi-core/multi-processor architectures, where it would be possible to run several polling threads on different cores/CPUs. For this reason the author has implemented bloom filtering in two flavors:
- In the first implementation, the function called by NAPI for polling packets (e.g. tg3_rx() for Broadcom adapters) has been enhanced with bloom filtering, so that as soon as NAPI fetches packets they are filtered before being passed on. This approach requires very few code changes, as it is simply an enhancement of the existing NIC driver.
- In the second implementation, as soon as the network card is initialized, a kernel thread is allocated for each active NIC. This thread can be pinned to a physical core/CPU (a.k.a. processor affinity) or be bounced across CPUs (the standard behavior for kernel threads under Linux). Each thread acts as a private NAPI poller for the NIC, as it continuously polls packets just as NAPI does (e.g. disabling/enabling interrupts as needed). In this solution, packets are polled as soon as they become available, then filtered with blooms and if th
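The rx-path filtering described above, where the driver consults the bloom filter before allocating an skbuff, can be modeled in user space as follows. This is a simulation of the idea, not the actual e1000/tg3 patch: the packet structure, the single exact-match rule standing in for the bloom lookup, and all names are assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

struct pkt { uint8_t proto; uint32_t dst_ip; uint16_t dst_port; };

static int skb_allocations;                  /* counts simulated skbuffs */

/* Stand-in for the in-driver bloom lookup (here: one exact rule). */
static int bloom_match(const struct pkt *p)
{
    return p->proto == 17 && p->dst_ip == 0x0a000001 && p->dst_port == 5060;
}

static void *alloc_skb_for(const struct pkt *p)
{
    (void)p;
    skb_allocations++;                       /* models dev_alloc_skb()   */
    return &skb_allocations;                 /* dummy non-NULL buffer    */
}

/* Model of the poll function: only matching packets get a buffer and
 * are handed up the stack; the rest are dropped on the spot, costing
 * no memory (de)allocation at all. */
static int rx_poll(const struct pkt *ring, int budget)
{
    int passed = 0;
    for (int i = 0; i < budget; i++) {
        if (!bloom_match(&ring[i]))
            continue;                        /* dropped: no skb, no copy */
        if (alloc_skb_for(&ring[i]) != NULL)
            passed++;                        /* would go to the stack    */
    }
    return passed;
}
```

The point of the sketch is the ordering: the filter check happens before any buffer is allocated, which is where the per-packet (de)allocation savings described above come from.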
