
Mellanox Messaging Accelerator (VMA) Library for Linux
User Manual
Rev 8.6.10

www.mellanox.com
Mellanox Technologies

NOTE:
THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT (“PRODUCT(S)”) AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES “AS-IS” WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Mellanox Technologies
350 Oakmead Parkway Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Copyright 2018. Mellanox Technologies Ltd. All Rights Reserved.

Mellanox, Mellanox logo, Accelio, BridgeX, CloudX logo, CompustorX, Connect-IB, ConnectX, CoolBox, CORE-Direct, EZchip, EZchip logo, EZappliance, EZdesign, EZdriver, EZsystem, GPUDirect, InfiniHost, InfiniBridge, InfiniScale, Kotura, Kotura logo, Mellanox CloudRack, Mellanox CloudXMellanox, Mellanox Federal Systems, Mellanox HostDirect, Mellanox Multi-Host, Mellanox Open Ethernet, Mellanox OpenCloud, Mellanox OpenCloud Logo, Mellanox PeerDirect, Mellanox ScalableHPC, Mellanox StorageX, Mellanox TuneX, Mellanox Connect Accelerate Outperform logo, Mellanox Virtual Modular Switch, MetroDX, MetroX, MLNX-OS, NP-1c, NP-2, NP-3, NPS, Open Ethernet logo, PhyX, PlatformX, PSIPHY, SiPhy, StoreX, SwitchX, Tilera, Tilera logo, TestX, TuneX, The Generation of Open Ethernet logo, UFM, Unbreakable Link, Virtual Protocol Interconnect, Voltaire and Voltaire logo are registered trademarks of Mellanox Technologies, Ltd.

All other trademarks are property of their respective owners.

For the most updated list of Mellanox trademarks, visit http://www.mellanox.com/page/trademarks

Doc #: DOC-00393

Table of Contents

Document Revision History
About this Manual
1 Introduction to VMA
  1.1 VMA Overview
  1.2 Basic Features
  1.3 Target Applications
  1.4 Advanced VMA Features
2 VMA Library Architecture
  2.1 Top-Level
  2.2 VMA Internal Thread
  2.3 Socket Types
3 Installing VMA
4 Configuring VMA
  4.1 Configuring libvma.conf
    4.1.1 Configuring Target Application or Process
    4.1.2 Configuring Socket Transport Control
    4.1.3 Example of VMA Configuration
  4.2 VMA Configuration Parameters
    4.2.1 Configuration Parameters Values
    4.2.2 Beta Level Features Configuration Parameters
  4.3 Loading VMA Dynamically
5 Advanced Features
  5.1 Packet Pacing
    5.1.1 Prerequisites
    5.1.2 Usage
  5.2 Precision Time Protocol (PTP)
    5.2.1 Prerequisites
    5.2.2 Usage
  5.3 On-Device-Memory
    5.3.1 Prerequisites
    5.3.2 Verifying On-Device-Memory Capability in the Hardware
    5.3.3 On-Device-Memory Statistics
  5.4 TCP QUICKACK Threshold
  5.5 Linux Guest over Windows Hypervisor
    5.5.1 Prerequisites
    5.5.2 Windows Hypervisor Configuration
    5.5.3 VMA Daemon Design
    5.5.4 TAP Statistics
6 Using sockperf with VMA
7 Example - Running sockperf Ping-pong Test
8 VMA Extra API
  8.1 Overview of the VMA Extra API
  8.2 Using VMA Extra API
  8.3 Control Off-load Capabilities During Run-Time
    8.3.1 Adding libvma.conf Rules During Run-Time
    8.3.2 Creating Sockets as Off-loaded or Not-Off-loaded
  8.4 Packet Filtering
    8.4.1 Zero Copy recvfrom()
    8.4.2 Freeing Zero Copied Packet Buffers
  8.5 Dump fd Statistics using VMA Logger
  8.6 "Dummy Send" to Improve Low Message Rate Latency
    8.6.1 Verifying “Dummy Send” capability in HW
    8.6.2 “Dummy Packets” Statistics
  8.7 Multi Packet Receive Queue
    8.7.1 Prerequisites
    8.7.2 Usage
  8.8 SocketXtreme
    8.8.1 Polling For VMA Completions
    8.8.2 Getting Number of Attached Rings
    8.8.3 Getting ring FD
    8.8.4 Free VMA packets
    8.8.5 Decrement VMA Buffer Reference Counter
    8.8.6 Increment VMA Buffer Reference Counter
    8.8.7 Usage example
    8.8.8 Installation
    8.8.9 Limitations
9 Debugging, Troubleshooting, and Monitoring
  9.1 Monitoring – the vma_stats Utility
    9.1.1 Examples
  9.2 Debugging
    9.2.1 VMA Logs
    9.2.2 Ethernet Counters
    9.2.3 tcpdump
    9.2.4 NIC Counters
  9.3 Peer Notification Service
  9.4 Troubleshooting
Appendix A: Sockperf - UDP/TCP Latency and Throughput Benchmarking Tool
  A.1 Overview
    A.1.1 Advanced Statistics and Analysis
  A.2 Configuring the Routing Table for Multicast Tests
  A.3 Latency with Ping-pong Test
    A.3.1 UDP Ping-pong
    A.3.2 TCP Ping-pong
    A.3.3 TCP Ping-pong using VMA
  A.4 Bandwidth and Packet Rate With Throughput Test
    A.4.1 UDP MC Throughput
    A.4.2 UDP MC Throughput using VMA
    A.4.3 UDP MC Throughput Summary
  A.5 sockperf Subcommands
    A.5.1 Additional Options
    A.5.2 Sending Bursts
    A.5.3 SocketXtreme
  A.6 Debugging sockperf
  A.7 Troubleshooting sockperf
Appendix B: Multicast Routing
  B.1 Multicast Interface Definitions
Appendix C: Acronyms

List of Tables

Table 1: Document Revision History
Table 2: Typography
Table 3: Target Process Statement Options
Table 4: Socket Transport Statement Options
Table 5: Configuration Parameter Values
Table 6: Beta Level Configuration Parameter Values
Table 7: add_conf_rule Parameters
Table 8: add_conf_rule Parameters
Table 9: Packet Filtering Callback Function Parameters
Table 10: Zero-copy recvfrom Parameters
Table 11: Freeing Zero-copy Datagram Parameters
Table 12: Dump fd Statistics Parameters
Table 13: "Dummy Send" Parameters
Table 14: vma_stats Utility Options
Table 15: UDP MC Throughput Results
Table 16: Available Subcommands
Table 17: General sockperf Options
Table 18: Client Options
Table 19: Server Options
Table 20: Acronym Table

Document Revision History

Table 1: Document Revision History

Revision | Date | Description

Rev 8.6.10 | July 5, 2018
  - Added VMA_STATS_SHMEM_DIR as a new VMA parameter to Table 5: Configuration Parameter Values.
  - Removed VMA_RX_SW_CSUM parameter.
  - Updated section Linux Guest over Windows Hypervisor.
  - Updated section 8.8.9: Limitations by removing two SocketXtreme limitations.
  - Added a new issue to section 9.4: Troubleshooting.
  - Updated the examples in Appendix A: Sockperf - UDP/TCP Latency and Throughput Benchmarking Tool.

Rev 8.5.7 | March 1, 2018
  - Updated section 8.8 by renaming vmapoll (Explicit Ring Polling) to SocketXtreme and performing several changes throughout the section
  - Added a new value for the VMA configuration parameter VMA_TCP_CC_ALGO
  - Added the following sections:
      - SocketXtreme to Sockperf Appendix
      - Loading VMA Dynamically
      - Linux Guest over Windows Hypervisor

Rev 8.4.10 | December 4, 2017
  - Added section TCP QUICKACK Threshold
  - Added VMA_TCP_QUICKACK and VMA_TCP_QUICKACK configuration parameters (see section VMA Configuration Parameters)
  - Updated section sockperf Subcommands

Rev 8.4.8 | October 31, 2017
  - Added the following sections:
      - On-Device-Memory
      - Prerequisites
      - Verifying On-Device-Memory Capability in the Hardware
      - On-Device-Memory Statistics
  - Added the VMA_TRIGGER_DUMMY_SEND_GETSOCKNAME configuration parameter (see section VMA Configuration Parameters)
  - Added the VMA_RING_DEV_MEM_TX configuration parameter (see section Beta Level Features Configuration Parameters)
  - Updated the Example in section VMA Configuration Parameters
  - Updated section Troubleshooting: added Issue #6

Rev 8.3.7 | June 30, 2017
  - Added the following sections and their subsections:
      - Packet Pacing
      - Precision Time Protocol (PTP)
  - Updated the following section:
      - VMA Configuration Parameters: added VMA_HW_TS_CONVERSION

Rev 8.3.5 | May 31, 2017
  - Added the following section:
      - VMA Internal Thread
  - Updated the following sections:
      - Multi Packet Receive Queue

Rev 8.2.10 | March 28, 2017
  - Updated the following sections:
      - VMA Configuration Parameters
      - Latency with Ping-pong Test
      - Bandwidth and Packet Rate With Throughput Test

About this Manual

This manual describes Mellanox Messaging Accelerator (VMA) Library for Linux.

Audience

This manual is primarily intended for:
  - Market data professionals
  - Messaging specialists
  - Software engineers and architects
  - Systems administrators tasked with installing/uninstalling/maintaining VMA
  - ISV partners who want to test/integrate their traffic-consuming/producing applications with VMA.

Document Conventions

The following lists conventions used in this document.

NOTE: Identifies important information that contains helpful suggestions.

CAUTION: Alerts you to the risk of personal injury, system damage, or loss of data.

WARNING: Warns you that failure to take or avoid a specific action might result in personal injury or a malfunction of the hardware or software. Be aware of the hazards involved with electrical circuitry and be familiar with standard practices for preventing accidents before you work on any equipment.

Typography

The following table describes typographical conventions in Mellanox documentation. All terms refer to isolated terms within body text or regular table text unless otherwise mentioned in the Notes column.

Table 2: Typography

Term, Construct, Text Block | Example | Notes
File name, pathname | /opt/ufm/conf/gv.cfg |
Console session (code) | -> flashClear <CR> | Complete sample line or block. Comprises both input and output. The code can also be shaded.
Linux shell prompt | # | The "#" character stands for the Linux shell prompt.
Mellanox CLI Guest Mode | Switch > | Mellanox CLI Guest Mode.
Mellanox CLI admin mode | Switch # | Mellanox CLI admin mode
String | < > or [ ] | Strings in < > or [ ] are descriptions of what will actually be shown on the screen, for example, the contents of <your ip> could be 192.168.1.1
Management GUI label, item name | New Network, New Environment | Management GUI labels and item names appear in bold, whether or not the name is explicitly displayed (for example, buttons and icons).
User text entered into Manager, e.g., to assign as the name of a logical object | "Env1", "Network1" | Note the quotes. The text entered does not include the quotes.

Related Documentation

For additional relevant information, refer to the latest revision of the following documents:
  - Mellanox Messaging Accelerator (VMA) Library for Linux Release Notes (DOC-00329)
  - Mellanox Messaging Accelerator (VMA) Installation Guide (DOC-10055)
  - Performance Tuning Guidelines for Mellanox Network Adapters (DOC 3368)

1 Introduction to VMA

1.1 VMA Overview

The Mellanox Messaging Accelerator (VMA) library is a network-traffic offload, dynamically-linked user-space Linux library which serves to transparently enhance the performance of socket-based networking-heavy applications over an InfiniBand or Ethernet network. VMA has been designed for latency-sensitive and throughput-demanding, unicast and multicast applications. VMA can be used to accelerate producer applications and consumer applications, and enhances application performance by orders of magnitude without requiring any modification to the application code.

The VMA library accelerates TCP and UDP socket applications by offloading traffic from the user-space directly to the network interface card (NIC) or Host Channel Adapter (HCA), without going through the kernel and the standard IP stack (kernel-bypass). VMA increases overall traffic packet rate, reduces latency, and improves CPU utilization.

1.2 Basic Features

The VMA library utilizes the direct hardware access and advanced polling techniques of RDMA-capable network cards. Utilization of InfiniBand's and Ethernet's direct hardware access enables the VMA kernel bypass, which causes the VMA library to bypass the kernel's network stack for all IP network traffic transmit and receive socket API calls. Thus, applications using the VMA library gain many benefits, including:
  - Reduced context switches and interrupts, which result in:
      - Lower latencies
      - Higher throughput
      - Improved CPU utilization
  - Minimal buffer copies between user data and hardware. VMA needs only a single copy to transfer a unicast or multicast offloaded packet between hardware and the application's data buffers.

1.3 Target Applications

Good application candidates for VMA include, but are not limited to:
  - Fast transaction-based network applications, which require a high rate of request-response type operations over TCP or UDP unicast. This also includes any send/receive to/from an external network entity, such as a Market Data Order Gateway application working with an exchange.
  - Market-data feed-handler software which consumes multicast data feeds (and which often uses multicast as a distribution mechanism downstream), such as Wombat WDF and Reuters RMDS, or any home-grown feed handlers.
  - Messaging applications responsible for producing/consuming relatively large amounts of multicast data, including applications that use messaging middleware, such as Tibco Rendezvous (RV).
  - Caching/data distribution applications, which utilize quick network transactions for cache creation/state maintenance, such as MemCacheD and Redis.
  - Applications that handle distributed denial of service (DDoS) and web services applications with a heavy load of DNS requests.
  - Messaging applications, such as UMS Informatica, with which VMA 6.4 was certified.
  - Any other applications that make heavy use of multicast or unicast and that require any combination of the following:
      - Higher Packets per Second (PPS) rates than with kernel.
      - Lower data distribution latency.
      - Lower CPU utilization by the multicast consuming/producing application in order to support further application scalability.

1.4 Advanced VMA Features

The VMA library provides several significant advantages:
  - The underlying wire protocol used for the unicast and multicast solution is standard TCP and UDP IPv4, which is interoperable with any TCP/UDP/IP networking stack. Thus, the opposite side of the communication can be any machine with any OS, and can be located on an InfiniBand or an Ethernet network.

    NOTE: VMA uses a standard protocol that enables an application to use the VMA for asymmetric acceleration purposes. A 'TCP server side' only application, a 'multicast consuming' only or 'multicast publishing' only application can leverage this, while remaining compatible with Ethernet or IPoIB peers.

  - Kernel bypass for unicast and multicast transmit and receive operations. This delivers much lower CPU overhead since TCP/IP stack overhead is not incurred
  - Reduced number of context switches. All VMA software is implemented in user space in the user application's context. This allows the server to process a significantly higher packet rate than would otherwise be possible
  - Minimal buffer copies. Data is transferred from the hardware (NIC/HCA) straight to the application buffer in user space, with only a single intermediate user space buffer and zero kernel IO buffers
  - Fewer hardware interrupts for received/transmitted packets
  - Fewer of the queue congestion problems witnessed in standard TCP/IP applications
  - Supports legacy socket applications with no need for application code rewrite
  - Maximizes Messages per second (MPS) rates
  - Minimizes message latency
  - Reduces latency spikes (outliers)
  - Lowers the CPU usage required to handle traffic
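
To make the "no application code rewrite" point concrete, the following is a minimal sketch of the kind of program these features refer to: a UDP request-response (ping-pong) exchange written only against the standard socket API. It is an illustration and not part of VMA; the function names, the loopback address, and the message contents are assumptions chosen so that the sketch runs on any machine.

```python
import socket
import threading
from typing import List


def run_echo_server(sock: socket.socket, count: int) -> None:
    """Echo `count` datagrams back to whichever peer sent them."""
    for _ in range(count):
        data, peer = sock.recvfrom(2048)
        sock.sendto(data, peer)


def ping_pong(messages: List[bytes]) -> List[bytes]:
    """Send each message to a local echo server and collect the replies."""
    # Plain datagram socket: nothing here is specific to any acceleration library.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.settimeout(5.0)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    addr = server.getsockname()

    worker = threading.Thread(target=run_echo_server, args=(server, len(messages)))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(5.0)
    replies = []
    try:
        for msg in messages:
            client.sendto(msg, addr)           # request
            reply, _ = client.recvfrom(2048)   # response
            replies.append(reply)
    finally:
        client.close()
        worker.join(timeout=5.0)
        server.close()
    return replies


if __name__ == "__main__":
    print(ping_pong([b"ping-1", b"ping-2"]))
```

Because the program uses only bind, sendto, and recvfrom, it should in principle be a candidate for transparent acceleration through the LD_PRELOAD mechanism described in section 2.1, for example by starting it as `LD_PRELOAD=libvma.so python3 pingpong.py` (the file name is hypothetical). Loopback is used here only to keep the sketch self-contained; offloading would presumably require addresses that route through a supported adapter.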

2 VMA Library Architecture

2.1 Top-Level

The VMA library is a dynamically linked user-space library. Use of the VMA library does not require any code changes or recompiling of user applications. Instead, it is dynamically loaded via the Linux OS environment variable, LD_PRELOAD. However, it is possible to load the VMA library dynamically without using the LD_PRELOAD parameter, which requires minor application modifications, as described in TBD.

When a user application transmits TCP and UDP, unicast and multicast IPv4 data, or listens for such network traffic data, the VMA library:
  - Intercepts the socket receive and send calls made to the stream socket or datagram socket address families.
  - Implements the underlying work in user space (instead of allowing the buffers to pass on to the usual OS network kernel libraries).

VMA implements the native RDMA verbs API. The native RDMA verbs have been extended into the Ethernet RDMA-capable NICs, enabling the packets to pass directly between the user application and the InfiniBand HCA or Ethernet NIC, bypassing the kernel and its TCP/UDP handling network stack.

You get the benefit of the native RDMA verbs API without making any changes to your applications. The VMA library does all the heavy lifting under the hood, while transparently presenting the same standard socket API to the application, thus redirecting the data flow.

The VMA library operates in a standard networking stack fashion to serve multiple network interfaces.

The VMA library behaves according to the way the application calls the bind, connect, and setsockopt directives and the way the administrator sets the route lookup to determine the interface to be used for the socket traffic. The library knows whether data is passing to or from an InfiniBand HCA or Ethernet NIC. If the data is passing to/from a supported HCA or Ethernet NIC, the VMA library intercepts the call and does the bypass work. If the data is passing to/from an unsupported HCA or Ethernet NIC, the VMA library passes the call to the usual kernel libraries responsible for handling network traffic. Thus, the same application can listen in on m