Posted May 9, 2024 (author: 电脑垃圾清理专家)
Datasheet
NVIDIA NetQ
Holistic and real-time visibility, troubleshooting,
and monitoring.
As cloud-scale networking becomes the enterprise norm, so does complexity.
Network operators must manage constant change within multiple network layers,
and polling-based legacy network management tools simply cannot adapt.
Network operators often struggle with operational challenges, such as network
disruption caused by maintenance and configuration changes, and simple
misconfigurations can have a significant impact on operator workloads. Furthermore,
business networks are usually large and complex, which means the set of tasks a
network administrator needs to perform can quickly overwhelm manual efforts. This
requires a shift to both modern networking and modern operational tools.
NVIDIA® NetQ™ is a highly scalable, modern network operations tool that provides
actionable visibility for the NVIDIA Spectrum™ Ethernet platform, including NVIDIA®
Cumulus® Linux and SONiC (software for open networking in the cloud), as well as
NVIDIA data processing units (DPUs).
NetQ is built to accelerate NVIDIA platforms, including NVIDIA EGX™, NVIDIA DGX™ POD
and NVIDIA OVX™ SuperPOD, and AI solution stacks such as NVIDIA AI
Enterprise and NVIDIA LaunchPad. NetQ uses fabric-wide telemetry data to provide
visibility and troubleshooting of the overlay and underlay network in real time,
delivering the following benefits:
> Network-outage prevention using validation and functional testing with network
continuous integration/continuous delivery (CI/CD), in conjunction with the
NVIDIA Air platform.
> Rapid root-cause detection using network telemetry data, including NVIDIA
What Just Happened® (WJH) data from NVIDIA switches, reducing mean time
to innocence.
> Fabric-wide latency and buffer occupancy analysis of all the paths of a 4-tuple or
5-tuple flow to identify congestion points impacting application performance.
> Network-wide telemetry database to optimize network usage supporting GUI, CLI,
API, and plug-ins (Grafana, etc.).
> Multiple event notification integrations (Syslog, PagerDuty, Slack, email, and
Generic Webhook).
Key Features
> Validations
> Trace
> WJH
> Flow telemetry analysis
> RoCE monitoring
> Events and threshold crossing alerts
> Notification channels
> Software upgrade management
> Snapshot and compare
> Topology
> Microservices architecture
> High-availability clustering
> APIs for integration
Proof Points
> Simplified scaling of Cumulus
Linux
> Speeds mean time to innocence
> Reduces opex
> Cuts downtime
> Increases productivity
> Simplifies upgrades
> Reduces security risks
> Maximizes value of network
infrastructure
NVIDIA NetQ | Datasheet | 1
As part of the NVIDIA Spectrum platform, NetQ is tested and validated with
NVIDIA’s full portfolio of Ethernet networking technology, including NVIDIA
BlueField® DPUs. As part of an end-to-end, switch-to-host solution, NetQ is critical
for powering accelerated workloads and delivers the high performance and innovative
feature set needed to supercharge cloud-native applications at scale.
[Diagram, bottom to top. Switches and Hosts: NetQ Agents on Cumulus Linux and
SONiC switches and a DOCA agent on DPUs send telemetry to the NVIDIA NetQ Server.
NVIDIA NetQ Server: On-Premise Telemetry Aggregator (OPTA) for telemetry
aggregation, messaging system and analytics applications, NoSQL database,
API gateway, and notification system. Orchestration and Operations / Systems
Integration: PagerDuty, Syslog, Slack, Email, Webhook, REST API, Grafana, CLI, GUI.]
Figure 1: NetQ real-time telemetry data collection and deep analytics.
Protect Network Integrity With Validations and CI/CD
Network configuration changes and software upgrades can cause numerous trouble
tickets because of the inability to test before deploying in production. When this
happens, a large amount of data is collected and stored in multiple tools, which
makes it difficult to correlate events to resolve issues. NetQ can be used as the
functional test platform for the network CI/CD in conjunction with NVIDIA Air.
Customers benefit from testing the new configuration with NetQ in the NVIDIA
Air environment (“digital twin”) and fixing errors before deploying to their production
network (“physical twin”). In the physical production network, NetQ validations
provide insight into the live state of the network, shorten troubleshooting times,
and prevent network issues like MTU mismatch, VLAN misconfigurations, and more.
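To make the idea of a validation concrete, here is a minimal Python sketch of an MTU mismatch check across link endpoints. It is an illustration only, not NetQ code or the NetQ API: the `Link` record, the device and port names, and the `find_mtu_mismatches` helper are all hypothetical, and the sketch assumes the interface data has already been collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One link and the MTU configured on each end (hypothetical record)."""
    a_device: str
    a_port: str
    a_mtu: int
    b_device: str
    b_port: str
    b_mtu: int


def find_mtu_mismatches(links):
    """Return the links whose two endpoints are configured with different MTUs."""
    return [link for link in links if link.a_mtu != link.b_mtu]


# Hypothetical sample data: the second link has a mismatch.
links = [
    Link("leaf01", "swp1", 9216, "spine01", "swp1", 9216),
    Link("leaf02", "swp1", 1500, "spine01", "swp2", 9216),
]

mismatches = find_mtu_mismatches(links)
for link in mismatches:
    print(
        f"MTU mismatch: {link.a_device}:{link.a_port}={link.a_mtu} "
        f"vs {link.b_device}:{link.b_port}={link.b_mtu}"
    )
```

Running the same check against the digital twin before a change and against the production network after it is one way such a test could fit into a CI/CD pipeline.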
Rapid Root Cause Detection
NetQ greatly reduces time-to-innocence by pinpointing and isolating faults
caused by network state changes. Working hand in hand with Cumulus Linux and
SONiC, NetQ enables organizations to validate network state, both during regular
operations and for post-mortem diagnostic analysis. NetQ provides both a CLI, for
on-box interaction during troubleshooting, and a robust GUI, for a visual,
high-level dashboard.
With the NetQ trace capability, paths are verified, providing additional information
that NetQ uses to discover misconfigurations along all the hops simultaneously,
speeding the time to resolution. NetQ trace allows users to view all of the paths
between devices to find potential problems.
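The idea of viewing all of the paths between two devices can be sketched with a small graph search. The Python example below is a simplified illustration, not how NetQ trace is implemented: it walks a hypothetical adjacency map (`topology`) and lists every loop-free path, whereas a real trace would follow the actual forwarding state on each hop. The device names and the `all_simple_paths` helper are assumptions made for the example.

```python
def all_simple_paths(topology, src, dst, path=None):
    """Return every loop-free path from src to dst in an adjacency map."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    paths = []
    for neighbor in topology.get(src, []):
        if neighbor not in path:  # never revisit a device on the same path
            paths.extend(all_simple_paths(topology, neighbor, dst, path))
    return paths


# Hypothetical two-leaf, two-spine fabric with one server per leaf.
topology = {
    "server01": ["leaf01"],
    "leaf01": ["server01", "spine01", "spine02"],
    "spine01": ["leaf01", "leaf02"],
    "spine02": ["leaf01", "leaf02"],
    "leaf02": ["spine01", "spine02", "server02"],
    "server02": ["leaf02"],
}

paths = all_simple_paths(topology, "server01", "server02")
for hops in paths:
    print(" -> ".join(hops))
```

With every candidate path listed, a per-hop check (for example, MTU or VLAN consistency) can be applied to each one to locate the hop where a path breaks.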
The NetQ agents running on switches and hosts monitor various events in real time,
like interface state, BGP neighbors, MACs, and routes, providing a single source of
truth for data center-wide events. These events can be viewed via NetQ CLI, GUI,
and multiple third-party notification services like PagerDuty or Slack.
Deploy Reliable Networking with WJH and Flow Telemetry
WJH is a hardware-accelerated telemetry feature available on NVIDIA Spectrum
switches that streams detailed and contextual telemetry data for analysis. WJH
provides real-time visibility into problems in the network, such as hardware packet
drops due to misconfigurations, buffer congestion, ACL, or layer 1 problems.
WJH provides telemetry data from the switches collected by NetQ, extending GUI
and CLI functionality to WJH as well. When WJH capabilities are combined with
NetQ, packet drops can be identified anywhere in the fabric to improve network
reliability by:
> Viewing current or historic drop information, including the reason for the drop.
> Identifying problematic flows or endpoints and pinpointing exactly where
communication is failing in the network.
> Including contextual WJH drop information in the output with NetQ trace.
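As a rough illustration of working with drop information, the Python sketch below groups drop events by reason and filters them by switch. The event fields (`switch`, `port`, `reason`), the sample values, and both helper functions are hypothetical; they do not reflect the actual WJH record format or any NetQ interface.

```python
from collections import Counter

# Hypothetical drop events; real WJH records carry different and richer fields.
drops = [
    {"switch": "leaf01", "port": "swp1", "reason": "ACL"},
    {"switch": "leaf01", "port": "swp2", "reason": "buffer congestion"},
    {"switch": "spine01", "port": "swp1", "reason": "ACL"},
]


def drops_by_reason(events):
    """Count drop events per drop reason."""
    return Counter(event["reason"] for event in events)


def drops_for_switch(events, switch):
    """Return only the drop events reported by one switch."""
    return [event for event in events if event["switch"] == switch]


summary = drops_by_reason(drops)
leaf01_drops = drops_for_switch(drops, "leaf01")

for reason, count in summary.most_common():
    print(f"{reason}: {count}")
```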
gRPC network management interface (gNMI) can also be used to collect WJH data
from the NetQ Agent.
While WJH is always-on—detecting packet drops, latency, and congestion
events—flow telemetry provides on-demand analysis of specific application flows.
NetQ, working with Cumulus Linux, samples packets matching a 4-tuple or 5-tuple
application flow, then analyzes and reports per-switch latency (maximum, minimum,
average) and buffer occupancy details along the path of the flow. The NetQ GUI reports all the
possible paths, paths in use, and per-path details. On each switch, NetQ shows
minimum latency, maximum latency, average latency, and buffer occupancy.
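The per-switch figures described above can be illustrated with a short aggregation. This Python sketch is an assumption-laden example rather than NetQ's implementation: the `FlowKey` and `Sample` types, the field names, the units, and the sample values are hypothetical. It matches samples against a 5-tuple and computes minimum, maximum, and average latency plus peak buffer occupancy for each switch.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class FlowKey(NamedTuple):
    """5-tuple identifying an application flow (hypothetical type)."""
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


class Sample(NamedTuple):
    """One sampled packet observed on one switch (hypothetical type)."""
    switch: str
    flow: FlowKey
    latency_ns: int
    buffer_occupancy_pct: float


def per_switch_stats(samples, flow):
    """Aggregate latency and buffer occupancy per switch for one flow."""
    by_switch = defaultdict(list)
    for sample in samples:
        if sample.flow == flow:  # keep only packets matching the 5-tuple
            by_switch[sample.switch].append(sample)
    return {
        switch: {
            "min_latency_ns": min(s.latency_ns for s in group),
            "max_latency_ns": max(s.latency_ns for s in group),
            "avg_latency_ns": mean(s.latency_ns for s in group),
            "max_buffer_occupancy_pct": max(s.buffer_occupancy_pct for s in group),
        }
        for switch, group in by_switch.items()
    }


flow = FlowKey("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)
other = FlowKey("10.0.0.1", "10.0.0.2", "tcp", 40001, 80)

samples = [
    Sample("leaf01", flow, 800, 10.0),
    Sample("leaf01", flow, 1200, 30.0),
    Sample("spine01", flow, 500, 5.0),
    Sample("leaf01", other, 9000, 90.0),  # different flow, ignored
]

stats = per_switch_stats(samples, flow)
for switch, values in stats.items():
    print(switch, values)
```

A 4-tuple match would work the same way with one field left out of the key.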
By combining WJH with flow telemetry analysis, network operators can proactively
identify the root cause of server and application issues, and inform the server or
application administrator about the possible outage or performance impact.
NetQ Components and Deployment Options
NetQ Components
> NetQ Agents run on Cumulus Linux and SONiC switches and other certified Linux
systems, such as Ubuntu®, Red Hat®, and CentOS hosts. NetQ Agents capture
network data and other state information in real time and transmit the data to
the NetQ Server.
> NetQ Server consists of telemetry data collection software (the “on-premises
telemetry aggregator”, or OPTA), data analytics applications, and the database. The
NetQ applications and database can be deployed on-premises or consumed as a
cloud-based service.
NetQ on Customer Premises
In this deployment option, all NetQ components are deployed on customer premises.
Deployments can span a single site or multiple sites.
> Single-site deployment: NetQ Agents running on switches and hosts collect and
transmit data to the NetQ OPTA, which hosts the NetQ applications and database.
> Multi-site deployment: The NetQ Agents at each site collect and transmit data
from the switches and hosts to the local OPTA. The OPTAs then transmit the data
to a common NetQ applications server for processing and storage.
For high availability, OPTAs and applications with storage can be deployed as a cluster.
NetQ as a Cloud Service
NetQ as a cloud service is similar to the multi-site deployment: the OPTAs
run on premises at the customer site and securely connect to the NetQ multi-tenant
cloud service operated and maintained by NVIDIA.
Ready to Get Started?
To learn more about NetQ, visit:
/en-us/networking/ethernet-switching/netq
© 2023 NVIDIA Corporation and affiliates. All rights reserved. NVIDIA, the NVIDIA logo, NetQ, Spectrum, and
Cumulus Linux are trademarks and/or registered trademarks of NVIDIA Corporation and affiliates in the U.S. and
other countries. All other trademarks and copyrights are the property of their respective owners. 2705480. MAR23
Published by: admin. When reposting, please credit the source: http://www.yc00.com/xitong/1715188830a2579689.html