%include "ahu.mgp"
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%pcache 0
%charset "iso8859-1"
%filter "./resettime"
%endfilter
%filter "./counttime.pl 1"
%endfilter
Traffic Shaping for the User and Developer
%center
Ottawa Linux Symposium 2002
%size 5, fore "red", center
No manual available. Ask me, if you have problems
(only try to guess answer yourself at first 8))
-- Alexey N. Kuznetsov in README.iproute2+tc
%size 4, right, fore "yellow"
bert hubert
PowerDNS BV
ahu@ds9a.nl
http://ds9a.nl/
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%filter "./counttime.pl 2"
%endfilter
%size 9, center
Welcome
%prefix 20, size 10
Goals: show enough to bootstrap learning
User
Kernel developer
Live demos
Gurus have License to Intervene
Higher pace than LK2001 - pay attention!
More information in the paper
%prefix 0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%filter "./counttime.pl 3"
%endfilter
Order of things
%prefix 20
LARTC introduction
Demo of challenges faced
Queues, Queueing Disciplines
Filters
Classful qdiscs
Hands On
Exciting other things
Development
%prefix 0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%filter "./counttime.pl 4"
%endfilter
LARTC: 2 years and going
%prefix 20, size 10
Started to fill documentation gap
1500 member active mailinglist
150 pages of printed pdf
Manpages for most features
OPN IRC channel
Google on 'linux routing' finds us as first hit
%prefix 0
%%%%%%%%%%%%
%page
%filter "./counttime.pl 5"
%endfilter
Demonstration of challenge
%prefix 20, size 10
115k2 uplink speed
10 packet queue
PPP, default kernel settings
Nullmodem
Representative of domestic uplinks
Same principle holds for megabit links
%prefix 0
%%%%%%%%%%%%
%page
%filter "./counttime.pl 6"
%endfilter
Queues are our friend and enemy
Queues sit between userspace and the interface and determine how data \
gets
%cont, fore "red"
SENT
%cont, fore "white"
.
Queues:
* Buffer output in excess of your bandwidth
This prevents packet loss in case of bursty traffic
* Create latency, which hurts interactivity
Your keystrokes must traverse a long queue. TCP/IP tries to fill any queue \
you offer it!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%page
%filter "./counttime.pl 7"
%endfilter
Queues: between user & hardware
Applications send data to the kernel. The kernel enqueues data to the queueing \
discipline and immediately tries to run the queue ('kick') to the hardware.
%pause
qdisc_run() dequeue()s as many packets as the network adapter will accept, \
or until the queue is empty or no longer wants to send. That last case is what \
we call 'shaping'.
%%%%%%%%%%%%%%%
%page
%filter "./counttime.pl 8"
%endfilter
The Queue you are already using
Default queue is "pfifo_fast", which has three bands. dequeue() returns data \
from the 'front band' first; that band corresponds to the TOS setting 'minimum delay'.
A typical pfifo_fast queue might look like this:
%font "courier"
  [Port22 Port6667 ICMP] ->
  [Port80 Port80]        -> eth0
  [Port25 Port25]
%font "standard"
DEMO!
%%%%%%%%%%%
%page
%filter "./counttime.pl 9"
%endfilter
The Token Bucket Filter Qdisc
Rapid operation, high bandwidth shaping, Network-friendly operations
%font "courier"
%mark, pause, font "courier"
    |X|
    |X|
    |X|
    |_|
  3 2 1 0 |
%again, mark, pause
    |X|           | |
    |X|           | |
    |X|           | |
    |_|           |_|
  3 2 1 0 |     4 3 | 2 1 0
%again, mark, pause
    |X|           | |
    |X|           | |   ZzzZz
    |X|           | |
    |_|           |_|
  3 2 1 0 |     4 3 | 2 1 0
%font "standard"
Until the bucket is full, TBF receives new tokens at a constant rate. If a \
dequeue() comes in while it is out of tokens, it throttles for just the right \
number of *ticks*.
%font "courier"
%again, pause
    |X|           | |           | |
    |X|           | |           | |
    |X|           | |           | |
    |_|           |_|           |_|
  3 2 1 0 |     4 3 | 2 1 0     4 | 3 2 .
%font "standard"
%%%%%%%%%%%
%page
%filter "./counttime.pl 10"
%endfilter
TBF configuration
%font "courier"
  # tc qdisc add dev ppp0 root tbf \
    rate 220kbit limit 3000 \
    burst 1500
%pause,font "standard"
Parameters:
dev ppp0 root tbf
add directly to ppp0, a token bucket filter
%pause
rate 220kbit
220kbit worth of tokens/second added
%pause
limit 3000 burst 1500
3000 bytes buffer, 1500 bytes in bucket max
%%%%%%%%%%%%
%page
%filter "./counttime.pl 11"
%endfilter
Stochastic Fairness Queuing
SFQ is a pure queue - it never delays, only reorders.
%pause
* Each 'session' may send a packet in turn
%pause
* Uses a hash to conserve memory
%pause
* Perturbs the hash to restore fairness
%pause
* Often needs an additional shaper to be useful
%page
%filter "./counttime.pl 12"
%endfilter
Random Early Detection
Normally when a queue is full, packets get dropped from the tail.
This causes packet loss when traffic spikes (for small queues) or latency (with big queues).
RED can keep a long queue but drops random packets early to indicate congestion.
Very good for high bandwidth applications.
%page
%filter "./counttime.pl 13"
%endfilter
Classful qdiscs
* Have internal structure - like pfifo_fast
%pause
* Filled in by userspace commands
%pause
* Need 'filters' to 'classify' traffic
%pause
* Queueing disciplines delay, reorder or drop data
%pause
* A classful queue can also divide bandwidth
%pause
* Classful queues *contain* other queues - it is not a tree!
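%page
Sketch: TBF token accounting
A minimal model in C of the token bucket from the TBF slides, not the kernel's \
implementation. The names tbf_model, tbf_init and tbf_dequeue are made up for \
this illustration. It assumes rate in bits/second and burst in bytes, and leaves \
out the 'limit' queue.
%font "courier", size 4
```c
#include <assert.h>

/* Hypothetical model of a token bucket; all names are illustrative only. */
struct tbf_model {
    double rate_bytes_per_sec; /* how fast tokens are added */
    double burst_bytes;        /* bucket size: tokens never exceed this */
    double tokens;             /* current fill, one token per byte */
    double last_time;          /* time of the previous update, in seconds */
};

static void tbf_init(struct tbf_model *t, double rate_bits_per_sec,
                     double burst_bytes)
{
    assert(rate_bits_per_sec > 0.0 && burst_bytes > 0.0);
    t->rate_bytes_per_sec = rate_bits_per_sec / 8.0;
    t->burst_bytes = burst_bytes;
    t->tokens = burst_bytes;   /* assumption: the bucket starts full */
    t->last_time = 0.0;
}

/*
 * Ask to send pkt_len bytes at time 'now' (seconds).
 * Returns 0.0 if the packet may go out now (tokens are consumed),
 * otherwise the number of seconds to throttle before trying again.
 */
static double tbf_dequeue(struct tbf_model *t, double now, double pkt_len)
{
    /* Refill at a constant rate, but only until the bucket is full. */
    t->tokens += (now - t->last_time) * t->rate_bytes_per_sec;
    if (t->tokens > t->burst_bytes)
        t->tokens = t->burst_bytes;
    t->last_time = now;

    if (t->tokens >= pkt_len) {
        t->tokens -= pkt_len;
        return 0.0;
    }
    /* Out of tokens: wait exactly as long as the missing tokens take. */
    return (pkt_len - t->tokens) / t->rate_bytes_per_sec;
}
```
%font "standard"
With rate 220kbit (27500 bytes/second) and burst 1500, a second 1500 byte \
packet sent back to back has to wait 1500/27500 seconds, about 55 ms.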
%page
%filter "./counttime.pl 14"
%endfilter
Flow of packets in a classful queue
%font "courier"
%mark
   classful qdisc
           +--------
           |- qdisc1
           |
write()->  |- qdisc2
           |
           |- qdisc3
           +--------
%pause,again,mark, font "courier"
   classful qdisc
           +--------
           |- qdisc1
           |
write()->  |- qdisc2
           |
           |- qdisc3
           +--------
filter()
determines
where to enqueue()
%pause,again,mark
   classful qdisc
           +--------
           |- qdisc1
           |
write()->  |- qdisc2
           |
           |* qdisc3
           +--------
filter()
determines
where to enqueue()
%pause,again,mark
   classful qdisc
           +----------+
           |- qdisc1  |
           |          |
write()->  |- qdisc2  | ->dequeue()
           |          |
           |* qdisc3  |
           +----------+
filter()
determines
where to enqueue()
%pause,again,mark
   classful qdisc
           +----------+
           |- qdisc1  |
           |          |
write()->  |- qdisc2  | ->dequeue()
           |          |
           |* qdisc3  |
           +----------+
filter()                tries dequeue()..
determines
where to enqueue()
%pause,again,mark
   classful qdisc
           +----------+
           |- qdisc1 ?|
           |          |
write()->  |- qdisc2  | ->dequeue()
           |          |
           |* qdisc3  |
           +----------+
filter()                tries dequeue()..
determines
where to enqueue()
%pause,again,mark
   classful qdisc
           +----------+
           |- qdisc1 ?|
           |          |
write()->  |- qdisc2 ?| ->dequeue()
           |          |
           |* qdisc3  |
           +----------+
filter()                tries dequeue()..
determines
where to enqueue()
%pause,again,mark
   classful qdisc
           +----------+
           |- qdisc1 ?|
           |          |
write()->  |- qdisc2 ?| ->dequeue()
           |          |
           |* qdisc3 !|
           +----------+
filter()                tries dequeue()..
determines              until success!
where to enqueue()
%pause, font "standard"
This is the 'PRIO' qdisc.
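%page
Sketch: PRIO enqueue and dequeue
The flow in the diagram can be sketched in C as follows. This is a toy model, \
not kernel code: prio_model, classify, prio_enqueue and prio_dequeue are \
hypothetical names, a packet is reduced to its destination port, and the \
port-to-band mapping is an arbitrary stand-in for real filters.
%font "courier", size 4
```c
#include <assert.h>
#include <stddef.h>

#define BANDS    3   /* number of inner queues, as in the diagram */
#define BAND_CAP 8   /* arbitrary small capacity per inner queue */

/* One inner FIFO; a "packet" is just its destination port here. */
struct band {
    int pkts[BAND_CAP];
    size_t head, len;
};

struct prio_model {
    struct band bands[BANDS];
};

/* Stand-in for filter(): decides where a packet is enqueued. */
static int classify(int dport)
{
    if (dport == 22)
        return 0;    /* interactive traffic goes to the first band */
    if (dport == 6699)
        return 2;    /* bulk traffic goes to the last band */
    return 1;        /* everything else */
}

/* Returns 0 on success, -1 if the chosen band is full (tail drop). */
static int prio_enqueue(struct prio_model *q, int dport)
{
    assert(q != NULL);
    struct band *b = &q->bands[classify(dport)];
    if (b->len == BAND_CAP)
        return -1;
    b->pkts[(b->head + b->len) % BAND_CAP] = dport;
    b->len++;
    return 0;
}

/* Tries each band in order until one yields a packet; -1 if all are empty. */
static int prio_dequeue(struct prio_model *q)
{
    for (int i = 0; i < BANDS; i++) {
        struct band *b = &q->bands[i];
        if (b->len > 0) {
            int dport = b->pkts[b->head];
            b->head = (b->head + 1) % BAND_CAP;
            b->len--;
            return dport;
        }
    }
    return -1;
}
```
%font "standard"
Packets enqueued as 80, 6699, 22, 80 come back out as 22, 80, 80, 6699: a lower \
band is only tried once every band before it is empty.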
%page
%filter "./counttime.pl 15"
%endfilter
The CBQ Queueing Discipline
Unshaped 100mbit/s link, 40mbit/s traffic
Average packet=10000bits, 100 usec/packet:
%font "courier"
  [activity]
   |
   |_______^^^^_______^^^^_______^^^^
   |
   +-------------------------->[time]
           100us
%font "standard"
Link is idle 60% of time -> 150usec between packets
%page
%filter "./counttime.pl 16"
%endfilter
CBQ for shaping
* Calculates for each packet if it came earlier or later \
than the calculated idle prediction
%pause
* In case of <40mbit/s load and average packets, average \
time difference will be >0. Too much traffic, <0.
%pause
* Moving average idle has an upper cap. If it is too negative, queue \
is shut down for a number of ticks!
%page
%filter "./counttime.pl 17"
%endfilter
HTB qdisc
%leftfill
%pause
Hierarchical Token Bucket
Shares same shaping qualities as TBF
%pause
Easy link sharing
Limit certain kinds of traffic, prioritize others
%pause
Easy hierarchical sharing
Multiple agencies, multiple kinds of traffic
%pause
One drawback
Not in the main kernel yet. Jamal & Werner are working with Martin
%page
%filter "./counttime.pl 18"
%endfilter
Configuration basics
We start at the root of the device, ppp0:
%font "courier"
  # tc qdisc add dev ppp0 root \
    handle 1: htb
%font "standard"
Installs HTB as the root qdisc, names it 1:0.
%pause
%font "courier"
  # tc class add dev ppp0 parent 1: \
    classid 1:1 htb rate 100kbps \
    burst 2k
%font "standard"
This attaches a shaping HTB class to the HTB root, 100kbps with a 2k bucket.
%font "courier"
%page
%filter "./counttime.pl 19"
%endfilter
Now add the classes
%font "courier"
  # tc class add dev ppp0 \
    parent 1:1 classid 1:10 htb \
    rate 10kbps ceil 50kbps burst 2k
  # tc class add dev ppp0 \
    parent 1:1 classid 1:11 htb \
    rate 90kbps burst 2k
%font "standard"
The first class is guaranteed 10kbps of the 100kbps, but can grow to 50, if available.
The second class, however, can take up to 90kbps.
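%page
Sketch: rate and ceil
A simplified model in C of how 'rate' and 'ceil' interact for the two classes \
above. It is not HTB's actual algorithm: htb_class_model and htb_share are \
hypothetical names, and spare bandwidth is simply handed out in class order, \
whereas the real qdisc works with tokens, priorities and quanta.
%font "courier", size 4
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-class model; all values share one unit (e.g. kbps). */
struct htb_class_model {
    double rate;     /* guaranteed share */
    double ceil;     /* upper limit when borrowing from the parent */
    double demand;   /* what the class would like to send */
    double granted;  /* result: what the class is allowed to send */
};

static double min2(double a, double b)
{
    return a < b ? a : b;
}

/*
 * Assumption: the guarantees fit into the parent (sum of rate <= parent_rate).
 * Pass 1 hands every class its guarantee, pass 2 lends what is left.
 */
static void htb_share(struct htb_class_model *c, size_t n, double parent_rate)
{
    assert(c != NULL && parent_rate >= 0.0);
    double left = parent_rate;

    /* Pass 1: each class gets its demand, up to its guaranteed rate. */
    for (size_t i = 0; i < n; i++) {
        c[i].granted = min2(c[i].demand, c[i].rate);
        left -= c[i].granted;
    }

    /* Pass 2: lend spare parent bandwidth, never beyond a class's ceil. */
    for (size_t i = 0; i < n && left > 0.0; i++) {
        double want = min2(c[i].demand, c[i].ceil) - c[i].granted;
        if (want <= 0.0)
            continue;
        double extra = min2(want, left);
        c[i].granted += extra;
        left -= extra;
    }
}
```
%font "standard"
With a 100 parent, class 1:10 (rate 10, ceil 50) gets 50 while 1:11 only asks \
for 20, but falls back to its guaranteed 10 once 1:11 uses its full 90.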
%page
%filter "./counttime.pl 20"
%endfilter
Filtering to classify traffic
When a packet enters the qdisc, it needs to be classified. \
This is done with 'tc filters':
%font "courier"
  # U32="tc filter add dev ppp0 \
    protocol ip parent 1:0 prio 1 \
    u32"
  # $U32 match ip dport 25 0xffff \
    flowid 1:10
  # $U32 match ip sport 80 0xffff \
    flowid 1:11
%font "standard"
The u32 match is *very* generic and can match everything. \
Baroque syntax, however.
%page
%filter "./counttime.pl 21"
%endfilter
Werner Almesberger's tcng
Creates 'tc' commands based on a simple configuration:
%font "courier"
  prio {
      class(1) if tcp_dport == 22;
      class(3) if tcp_dport == 6699;
      class(2) if 1;
  }
%font "standard"
Makes better tc filters than you will!
%page
%filter "./counttime.pl 22"
%endfilter
Output of tcng tcc
%font "courier", size 4
  tc qdisc add dev eth0 handle 1:0 root prio
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 1:0:0 u32 divisor 1
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match u8 0x6 0xff at 9 offset at 0 mask 0f00 shift 6 eat link 1:0:0
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 2:0:0 u32 divisor 1
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 1:0:1 u32 ht 1:0:0 match u16 0x16 0xffff at 2 classid 1:1
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match u8 0x6 0xff at 9 offset at 0 mask 0f00 shift 6 eat link 2:0:0
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 3:0:0 u32 divisor 1
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 2:0:1 u32 ht 2:0:0 match u16 0x1a2b 0xffff at 2 classid 1:3
  tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match u32 0 0 at 0 classid 1:2
%font "standard"
%page
%filter "./counttime.pl 23"
%endfilter
Incoming Shaping using policer
Linux normally has no incoming queue.
A policing filter - with built-in token bucket filter, but no queue:
%font "courier"
  # tc filter add dev ppp0 parent ffff: protocol ip prio 50 u32 match \
    ip src 0.0.0.0/0 police rate 100kbit burst 10k drop flowid :1
%font "standard"
Jamal is working on enhancing this syntax, can access iptables targets.
%page
%filter "./counttime.pl 24"
%endfilter
The Intermediate Queue
Is a netfilter hook into the special 'imq' device:
%font "courier"
  $ iptables -t mangle -A PREROUTING -i eth0 -j IMQ --todev 0
  $ ip link set imq0 up
%font "standard"
To slow down incoming http traffic:
%font "courier"
  $ iptables -t mangle -A PREROUTING -i eth0 -p tcp --sport 80 -j IMQ --todev 0
  $ tc qdisc add dev imq0 root tbf rate 220kbit limit 3000 burst 1500
%font "standard"
%page
%filter "./counttime.pl 25"
%endfilter
The Wondershaper Demo
* Maintain low latency for interactive traffic at all times
* Allow 'surfing' at reasonable speeds while up or downloading
* Make sure uploads don't harm downloads, and the other way around
* Have the ability to mark certain hosts/ports as 'low priority'
%page
%filter "./counttime.pl 24"
%endfilter
Development of Qdiscs
The struct:
%font "courier", size 3.5
  struct Qdisc_ops
  {
      struct Qdisc_ops *next;
      struct Qdisc_class_ops *cl_ops;
      char id[IFNAMSIZ];
      int priv_size;
      int (*enqueue)(struct sk_buff *, struct Qdisc *);
      struct sk_buff *(*dequeue)(struct Qdisc *);
      int (*requeue)(struct sk_buff *, struct Qdisc *);
      int (*drop)(struct Qdisc *);
      int (*init)(struct Qdisc *, struct rtattr *arg);
      void (*reset)(struct Qdisc *);
      void (*destroy)(struct Qdisc *);
      int (*change)(struct Qdisc *, struct rtattr *arg);
      int (*dump)(struct Qdisc *, struct sk_buff *);
  };
%font "standard"
%page
%filter "./counttime.pl 24"
%endfilter
How the kernel interacts with a qdisc
enqueue is called from dev_queue_xmit() in net/core/dev.c:
%font "courier"
  /* Grab device queue */
  spin_lock_bh(&dev->queue_lock);
  q = dev->qdisc;
  if (q->enqueue) {
      int ret = q->enqueue(skb, q);
      qdisc_run(dev);
      spin_unlock_bh(&dev->queue_lock);
      return ret == NET_XMIT_BYPASS ? NET_XMIT_SUCCESS : ret;
  }
%font "standard"
%page
%filter "./counttime.pl 24"
%endfilter
qdisc_run()
In include/net/pkt_sched.h:
%font "courier"
  static inline void qdisc_run(struct net_device *dev)
  {
      while (!netif_queue_stopped(dev) &&
             qdisc_restart(dev)<0)
          /* NOTHING */;
  }
%font "standard"
Keeps restarting until the device is stopped or the qdisc has nothing more to send.
%page
%filter "./counttime.pl 24"
%endfilter
qdisc_restart()
In net/sched/sch_generic.c:
%font "courier", size 4
  int qdisc_restart(struct net_device *dev)
  {
      struct Qdisc *q = dev->qdisc;
      struct sk_buff *skb;
      /* Dequeue packet */
      if ((skb = q->dequeue(q)) != NULL) {
          if (!netif_queue_stopped(dev)) {
              if (netdev_nit)
                  dev_queue_xmit_nit(skb, dev);
              if (dev->hard_start_xmit(skb, dev) == 0) {
                  dev->xmit_lock_owner = -1;
                  return -1;
%font "standard"
%page
%filter "./counttime.pl 25"
%endfilter
Conclusion
%prefix 20, size 5
Advanced infrastructure
Works at home
Works at backbone routers
Extensible
%prefix 0
Questions?