Discussion:
Best practice to get counters from a huge amount of routers.
Richard Mayers
2016-05-07 15:20:53 UTC
Hi folks,

For my master's thesis I am working on a load-balancing project, and I
need to know the link usage every second if possible. For that I set the
refresh interval to 1 second, so everything is good so far.

My problem is that I am working with big topologies and may have 200
or more routers. If I get the counters by polling, it takes forever: I
cannot poll the routers one by one, and even with threads it would stop
scaling at some point.

What would be the best way to get all the counters?

Since I am simulating everything on a single machine, I can cheat and
write the counters to a file; however, that will not be useful when
I test my solution in a real network.

Kind regards,
Richard
Ilya Etingof
2016-05-07 16:18:41 UTC
I’d try to solve that by running a poller over asynchronous sockets:

http://net-snmp.sourceforge.net/tutorial/tutorial-5/toolkit/asyncapp/asyncapp.c
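
The linked example is C code built on the Net-SNMP asynchronous API. As a rough illustration of the same idea in Python, the sketch below issues all requests concurrently from one thread and gives each device its own timeout. It does not speak SNMP: `fetch_counters` is a hypothetical stand-in for one request/response exchange, and the router names, the simulated latencies and the dead device "r7" are made up for the example.

```python
import asyncio
import random
import time


async def fetch_counters(router: str) -> dict:
    """Hypothetical stand-in for one SNMP exchange with `router`."""
    # Simulated network and agent latency; "r7" plays a device that never answers in time.
    await asyncio.sleep(10.0 if router == "r7" else random.uniform(0.01, 0.05))
    return {"ifHCInOctets.1": random.randrange(2**32)}


async def poll_one(router: str, timeout: float):
    """Poll one router; return (router, receive timestamp, counters or None)."""
    try:
        counters = await asyncio.wait_for(fetch_counters(router), timeout)
    except asyncio.TimeoutError:
        counters = None  # a silent device must not stall the others
    return router, time.time(), counters


async def poll_all(routers, timeout: float = 0.5):
    """Run all polls concurrently; results come back in input order."""
    return await asyncio.gather(*(poll_one(r, timeout) for r in routers))


if __name__ == "__main__":
    routers = [f"r{i}" for i in range(200)]
    start = time.perf_counter()
    results = asyncio.run(poll_all(routers))
    failed = [r for r, _, c in results if c is None]
    print(f"{len(results)} routers polled in {time.perf_counter() - start:.2f}s; no answer from {failed}")
```

With this structure the whole round takes about as long as the slowest allowed response (the timeout), not the sum of all response times.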
Richard Mayers
2016-05-08 10:11:02 UTC
Hi,

Thanks for the replies.

What about using traps? Can I make the routers send me the counters
every second? Is that hard to set up?

Kind regards,
Richard
Post by James R. Leu
For my day job we tried open source packages for SNMP polling:
mrtg
cricket
They were fine for small numbers and long intervals, but will not scale to
your usage.
I wrote an implementation from scratch that handles 10K variables
on a 5-minute interval. So it can be done.
Issues I think you will run into:
router CPU: hitting a device every second for a number of interfaces may consume
too much CPU time to be practical.
storage IO: if you are successful in retrieving the data, standard DB storage,
even RRD, will not be able to handle the IO load.
Things to consider:
Bulk SNMP gets
RMON
local device scripting: I believe Cisco routers can do local Tcl scripting,
and Junos-based devices have local scripting as well.
Directory-based queues with small files storing samples
Good luck with your work.
_______________________________________________
Net-snmp-users mailing list
https://lists.sourceforge.net/lists/listinfo/net-snmp-users
--
James R. Leu
Jurkiewicz Jean-Marc
2016-05-09 09:25:31 UTC
Hi,

Like James, I have written tools that handle several tens of thousands of variables on a 5-minute interval.

I am not at all trying to discourage you from your project, but there are some concerns you should keep in mind:

Please always remember that the primary role of the network and the network equipment you are trying to manage is to transport data.
I mean payload data, not management data. Have you estimated or calculated the share of the bandwidth you will consume "just" for management?
(You did not mention how many counters per router you intend to collect.)
The measurement should not bias what is measured (or only to a minimal extent).
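
A back-of-the-envelope version of that estimate might look like the sketch below. The 200 routers come from the original post; everything else is an assumption chosen for illustration: 40 counters per router (for example 20 interfaces with one inbound and one outbound octet counter each), about 30 bytes per returned variable binding, and about 100 bytes of IP/UDP/SNMP overhead per response. Only response traffic is counted.

```python
def management_bps(routers, counters_per_router, interval_s,
                   bytes_per_varbind=30, overhead_bytes=100):
    """Rough response-traffic estimate in bits per second.

    bytes_per_varbind and overhead_bytes are assumed round numbers,
    not measured values; requests are ignored.
    """
    per_poll = overhead_bytes + counters_per_router * bytes_per_varbind
    return routers * per_poll * 8 / interval_s


# 200 routers, 40 counters each, polled every second
every_second = management_bps(routers=200, counters_per_router=40, interval_s=1)
# the same counters on a 5-minute interval
every_5_min = management_bps(routers=200, counters_per_router=40, interval_s=300)
print(f"{every_second / 1e6:.2f} Mbit/s vs {every_5_min / 1e3:.2f} kbit/s")
```

Under these assumptions the one-second poll costs roughly 2 Mbit/s of responses at the management station, 300 times more than the 5-minute poll. Real numbers depend on OID lengths, counter sizes and how many packets each response needs.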

Switches and routers are not designed to assist the management station to this extent.
SNMP is not the best method (high CPU load, low priority on the equipment). How about NetFlow? (nfsen is a wonderful free tool, if you can accept a 5-minute period.)

There is a big difference between what can be done and what makes sense to do.

In any case, GetBulk (if you need several counters per router) seems more appropriate than traps.

There are some questions you should consider and find answers to:

How are the timeouts and retries of your SNMP requests configured (you intend to address some "real" equipment that may respond with latency)?
A two-second timeout with 3 retries doesn't really make sense when polling every second.
What is the impact of a non-responding (faulty or defective) device, or of several faulty devices (let's say 10)? What happens to the 190 others?
What is the latency of the network that interconnects your 200 routers?
How do you timestamp the collected data?
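
To illustrate why the timestamp question matters, here is a minimal sketch that derives a link rate from two timestamped samples of the same octet counter. The `Sample` type and the `link_bps` function are hypothetical names, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds, recorded when the response was received
    octets: int       # raw counter value as returned by the agent


def link_bps(prev: Sample, curr: Sample, counter_bits: int = 64) -> float:
    """Average bit rate between two samples of the same octet counter."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    # Modular subtraction absorbs a single counter wrap between the samples.
    delta = (curr.octets - prev.octets) % (2 ** counter_bits)
    return delta * 8 / elapsed


# A nominal 1-second poll whose second response arrived 0.25 s late.
prev = Sample(timestamp=100.0, octets=1_000_000)
curr = Sample(timestamp=101.25, octets=1_125_000)
print(link_bps(prev, curr))             # rate using the real elapsed time
print((curr.octets - prev.octets) * 8)  # rate if exactly 1 s is assumed
```

In this example the rate computed with real timestamps is 800 kbit/s, while assuming an exact one-second interval would report 1 Mbit/s. At one-second polling, a few hundred milliseconds of jitter is a large relative error.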

Good luck with your work.

JM


Richard Mayers
2016-05-09 13:16:53 UTC
Post by Jurkiewicz Jean-Marc
Hi,
Hi again, and thanks for the answers. Much appreciated! I have left my
comments inline.
Post by Jurkiewicz Jean-Marc
Like James, I have written tools that handle several tens of thousands of variables on a 5-minute interval.
Please always remember that the primary role of the network and the network equipment you are trying to manage is to transport data.
I mean payload data, not management data. Have you estimated or calculated the share of the bandwidth you will consume "just" for management?
(You did not mention how many counters per router you intend to collect.)
The measurement should not bias what is measured (or only to a minimal extent).
Well, that should not be a problem, since I have everything simulated
on a server (using Mininet and Quagga routers). I have a dedicated
out-of-band interface on every device just for SNMP and NetFlow.
Currently I get the SNMP data on the same machine and store it in a
file (all the devices share the same file system), so it's kind of
"cheating", but as a first step it's what I was told to do. Even with
that solution I am having CPU problems because of all the processes
polling the counters and writing them to a file every second.
Post by Jurkiewicz Jean-Marc
Switches and routers are not designed to assist the management station to this extent.
SNMP is not the best method (high CPU load, low priority on the equipment). How about NetFlow? (nfsen is a wonderful free tool, if you can accept a 5-minute period.)
I am already using NetFlow, but for another purpose; SNMP was only
for the counters, to know the load on every link in "real time" (if I
manage to accomplish that).
Post by Jurkiewicz Jean-Marc
There is a big difference between what can be done and what makes sense to do.
In any case, GetBulk (if you need several counters per router) seems more appropriate than traps.
I can have up to ... 20 interfaces per router.
Post by Jurkiewicz Jean-Marc
How are the timeouts and retries of your SNMP requests configured (you intend to address some "real" equipment that may respond with latency)?
A two-second timeout with 3 retries doesn't really make sense when polling every second.
So far I have not considered that. I assume some kind of ideal
scenario where everything works and everything is "under my
control", like my simulation or a data center.
Post by Jurkiewicz Jean-Marc
What is the impact of a non-responding (faulty or defective) device, or of several faulty devices (let's say 10)? What happens to the 190 others?
What is the latency of the network that interconnects your 200 routers?
The latency should be very small, 1) because I am using the
out-of-band network for monitoring, and 2) because my idea is to
load-balance in data center networks.
Post by Jurkiewicz Jean-Marc
How do you timestamp the collected data?
I don't; I assume that all the data from the routers comes from the
same sampling period.

I don't really know what to do. I need to know the load per link as
fast as possible so I can improve the load-balancing decisions.

Thanks a lot,
Richard
Fredrik Björk
2016-05-09 13:13:39 UTC
Hi!

Yes, lots of things to consider... I have done some information
gathering on CMTSes (cable modem routers), reading out several thousand
parameters. In general I can say that an snmpBULKwalk (command line) or
its equivalent in other languages is MUCH faster than reading
individual objects if you need several adjacent objects.

# time for ((i=1001; i<1025; i++)); do snmpget -v 2c -c public 192.168.8.51 IF-MIB::ifInOctets.$i; done
IF-MIB::ifInOctets.1001 = Counter32: 1105357170
...
real 0m0.386s
user 0m0.316s
sys 0m0.020s

# time snmpbulkwalk -v 2c -c public 192.168.8.51 IF-MIB::ifInOctets
IF-MIB::ifInOctets.1001 = Counter32: 1105357170
...
real 0m0.055s
user 0m0.008s
sys 0m0.008s

Above, the snmpbulkwalk even reads three extra OIDs. Obviously, CLI/bash
is not the way to go when you have a need for speed, I just used it to
demonstrate the difference. The waiting time is mostly in the switch
anyway. I have used PHP, and that's not optimal, but good enough for my
needs.
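
One way to see where the difference comes from is to count request/response round trips. The sketch below is only that arithmetic, under stated assumptions: a walk with GETNEXT-style single requests needs one round trip per object, while a bulk walk needs enough GETBULK responses to cover every object in the subtree plus at least one variable binding past its end. The max-repetitions value of 10 is an assumption, and the model ignores agents that return fewer bindings than requested.

```python
def get_requests(objects: int) -> int:
    """One request per object when each is fetched individually."""
    return objects


def bulkwalk_requests(objects: int, max_repetitions: int = 10) -> int:
    """GETBULK requests needed to walk a subtree of `objects` entries.

    The walk stops only once a response contains a binding outside the
    subtree, so objects + 1 bindings must be retrieved in total.
    """
    return objects // max_repetitions + 1


for n in (24, 200, 1000):
    print(n, get_requests(n), bulkwalk_requests(n))
```

For the 24 ports in the timing example this model gives 24 round trips against 3, which is in line with the measured difference being dominated by waiting on the device.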

If you need, for instance, most info from the ifEntry-tree, like the red
ones below:

IF-MIB::ifIndex.1001 = INTEGER: 1001
IF-MIB::ifDescr.1001 = STRING: Alcatel-Lucent 1/1
IF-MIB::ifType.1001 = INTEGER: ethernetCsmacd(6)
IF-MIB::ifMtu.1001 = INTEGER: 9216
IF-MIB::ifSpeed.1001 = Gauge32: 0
IF-MIB::ifPhysAddress.1001 = STRING: e8:e7:32:2:c5:2a
IF-MIB::ifAdminStatus.1001 = INTEGER: up(1)
IF-MIB::ifOperStatus.1001 = INTEGER: down(2)
IF-MIB::ifLastChange.1001 = Timeticks: (48888100) 5 days, 15:48:01.00
IF-MIB::ifInOctets.1001 = Counter32: 1105357170
IF-MIB::ifInUcastPkts.1001 = Counter32: 159748417
IF-MIB::ifInNUcastPkts.1001 = Counter32: 0
IF-MIB::ifInDiscards.1001 = Counter32: 0
IF-MIB::ifInErrors.1001 = Counter32: 0
IF-MIB::ifInUnknownProtos.1001 = Counter32: 0
IF-MIB::ifOutOctets.1001 = Counter32: 1778801908
IF-MIB::ifOutUcastPkts.1001 = Counter32: 2041113484
IF-MIB::ifOutNUcastPkts.1001 = Counter32: 0
IF-MIB::ifOutDiscards.1001 = Counter32: 23
IF-MIB::ifOutErrors.1001 = Counter32: 0
IF-MIB::ifOutQLen.1001 = Gauge32: 0
IF-MIB::ifSpecific.1001 = OID: SNMPv2-SMI::zeroDotZero

I'd definitely recommend walking the entire ifEntry tree instead of
doing several separate walks (again, CLI only for educational purposes):

snmpbulkwalk -v 2c -c public 192.168.38.51 IF-MIB::ifEntry -m all

It will probably be much more efficient to discard the extra info (black
lines above) instead of doing multiple walks. I think your
switches/routers will agree too, but that's very vendor- and even
product-dependent. The downside is a little more traffic over the
network, but I'd say it's negligible.

I store all values retrieved in a database (I use MySQL or Postgres, but
choice is free) and then I can have the front-end pick out values
whenever convenient.

I even use the "bulkwalk" strategy to monitor almost 250 emergency
phones spread over 28 AudioCodes phone-to-SIP concentrators for a
customer. We can detect a "hook off" event in less than 5 seconds (the
criterion) by using 4 or 5 parallel jobs that walk the AudioCodes units
constantly. There are 24 ports in each unit, so selecting only the
active ports and reading just those would have required far more time
and many more parallel jobs.

There are functions in sFlow and similar that can send/push info on the
amount of traffic to an sFlow server if traffic volumes are all you're
interested in. I'm often also interested in errors, queues and so on,
but I could settle for sFlow based traffic every minute and poll errors
and such less often.

/Fredrik
James Leu
2016-05-09 14:02:24 UTC
RMON might come in useful. If you configure each interface for the
statistics group with 300 buckets and an interval of 1 second,
you will receive a trap every 300 seconds telling you the buckets are
full; you then need to poll the node to gather the data and issue a
reset, which empties the buckets. Using RMON in this way can be very
memory intensive.
--
James R. Leu
***@mindspring.com
Fredrik Björk
2016-05-09 14:07:07 UTC
Hi again!

If that's all you need, you can do a walk with PHP (or a faster
alternative) and push the data into a database with a timestamp. From
there you can monitor the links as often as you like, as long as you
remember that the data you read is only as fresh as the last poll that
fed the database.


// $RouterIP and $community are assumed to be set earlier in the script
$ifInOctArray = snmp2_real_walk($RouterIP, $community, "IF-MIB::ifHCInOctets")
    or die("$argv[0]: unable to walk OID IF-MIB::ifHCInOctets");
$ifOutOctArray = snmp2_real_walk($RouterIP, $community, "IF-MIB::ifHCOutOctets")
    or die("$argv[0]: unable to walk OID IF-MIB::ifHCOutOctets");

Stuff it into your database of choice and start reading the data! Use
ifHCOutOctets instead of ifOutOctets, since the 32-bit counters in the
latter wrap within a minute once traffic exceeds roughly 570 Mbps, so
polling every minute can no longer tell how much data passed.
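
The wrap limit follows directly from the counter width: an octet counter of n bits overflows after 2^n bytes. A small sketch of that arithmetic (function names are made up for the example):

```python
def seconds_to_wrap(rate_bps: float, counter_bits: int = 32) -> float:
    """Time for an octet counter to wrap at a constant bit rate."""
    return (2 ** counter_bits) * 8 / rate_bps


def max_unambiguous_bps(poll_interval_s: float, counter_bits: int = 32) -> float:
    """Highest constant rate at which the counter wraps at most once per poll."""
    return (2 ** counter_bits) * 8 / poll_interval_s


print(f"32-bit, 60 s polling: {max_unambiguous_bps(60) / 1e6:.1f} Mbit/s")
print(f"32-bit at 1 Gbit/s wraps after {seconds_to_wrap(1e9):.1f} s")
print(f"32-bit at 10 Gbit/s wraps after {seconds_to_wrap(10e9):.2f} s")
```

At one-second polling a 32-bit counter is still unambiguous up to about 34 Gbit/s, so the 64-bit ifHC counters matter most for longer intervals or for gaps caused by missed polls.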

Use this to graph the data directly from the database (if you need to
create graphs; if not, just read the data from the database):

https://oss.oetiker.ch/rrdtool/doc/rrdgraph_libdbi.en.html

/Fredrik
Post by Richard Mayers
I don't really know what to do, I need to know the load per link as
fast as possible so I can improve the load balancing decisions.
Andrew
2016-05-09 14:09:28 UTC
There's netflow/sflow for real-time traffic counting/monitoring.
Laurent Dumont
2016-05-12 19:14:17 UTC
A bit late on this, but at your scale it's definitely worth looking
into monitoring solutions based around SNMP. Solutions like Shinken,
Nagios and Icinga are all designed around SNMP polling and can probably
be tuned for 1- or 5-second polling of your metrics. You can even send
all the data into InfluxDB and make pretty graphs with Grafana.

Writing your own tool seems overkill for this.

Cheers!

Laurent