ack-tegra

Author	SHA1	Message	Date
David S. Miller	b8dad61cc7	ipv4: If fib metrics are default, no need to grab ref to FIB info. The fib metric memory in this case is static in the kernel image, so we don't need to reference count it since it's never going to go away on us. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-28 14:07:16 -08:00
David S. Miller	725d1e1b45	ipv4: Attach FIB info to dst_default_metrics when possible If there are no explicit metrics attached to a route, hook fi->fib_info up to dst_default_metrics. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-28 14:05:05 -08:00
David S. Miller	9c150e82ac	ipv4: Allocate fib metrics dynamically. This is the initial gateway towards super-sharing metrics if they are all set to zero for a route. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-28 14:01:25 -08:00
John W. Linville	3e11210d46	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 Conflicts: drivers/net/wireless/ath/ath9k/init.c	2011-01-28 16:23:14 -05:00
Julia Lawall	efe1cf0c57	net/wireless/nl80211.c: Avoid call to genlmsg_cancel genlmsg_cancel subtracts some constants from its second argument before calling nlmsg_cancel. nlmsg_cancel then calls nlmsg_trim on the same arguments. nlmsg_trim tests for NULL before doing any computation, but a NULL second argument to genlmsg_cancel is no longer NULL due to the initial subtraction. Nothing else happens in this execution, so the call to genlmsg_cancel is simply unnecessary in this case. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression data; @@ if (data == NULL) { ... * genlmsg_cancel(..., data); ... return ...; } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2011-01-28 15:46:23 -05:00
Ben Greear	4914b3bb7f	mac80211: Add sdata state and flags to debugfs. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2011-01-28 15:46:23 -05:00
Johannes Berg	6d744bacee	mac80211: add MCS information to radiotap This adds the MCS information we currently get from the drivers into radiotap. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2011-01-28 15:44:29 -05:00
Juuso Oikarinen	45cbad6a12	cfg80211: Allow non-zero indexes for device specific pair-wise ciphers Some vendor specific cipher suites require non-zero key indexes for pairwise keys, but as of currently, the cfg80211 does not allow it. As validating they cipher parameters for vendor specific cipher suites is the job of the driver or hardware/firmware, change the cfg80211 to allow also non-zero pairwise key indexes for vendor specific ciphers. Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2011-01-28 15:44:27 -05:00
Thomas Jacob	6a4ddef2a3	netfilter: xt_iprange: add IPv6 match debug print code Signed-off-by: Thomas Jacob <jacob@internet24.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-01-28 19:33:13 +01:00
David S. Miller	a4daad6b09	net: Pre-COW metrics for TCP. TCP is going to record metrics for the connection, so pre-COW the route metrics at route cache entry creation time. This avoids several atomic operations that have to occur if we COW the metrics after the entry reaches global visibility. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 22:01:53 -08:00
David S. Miller	8571a19c4a	Merge branch 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6	2011-01-27 16:00:37 -08:00
Eric Dumazet	ccf434380d	net: fix dev_seq_next() Commit `c6d14c8456` (net: Introduce for_each_netdev_rcu() iterator) added a race in dev_seq_next(). The rcu_dereference() call should be done _before_ testing the end of list, or we might return a wrong net_device if a concurrent thread changes net_device list under us. Note : discovered thanks to a sparse warning : net/core/dev.c:3919:9: error: incompatible types in comparison expression (different address spaces) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 15:02:56 -08:00
David S. Miller	065825402c	net: Store ipv4/ipv6 COW'd metrics in inetpeer cache. Please note that the IPSEC dst entry metrics keep using the generic metrics COW'ing mechanism using kmalloc/kfree. This gives the IPSEC routes an opportunity to use metrics which are unique to their encapsulated paths. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:59:31 -08:00
David S. Miller	1397e171f1	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-01-27 14:59:08 -08:00
David S. Miller	8f2771f2b8	ipv6: Remove route peer binding assertions. They are bogus. The basic idea is that I wanted to make sure that prefixed routes never bind to peers. The test I used was whether RTF_CACHE was set. But first of all, the RTF_CACHE flag is set at different spots depending upon which ip6_rt_copy() caller you're talking about. I've validated all of the code paths, and even in the future where we bind peers more aggressively (for route metric COW'ing) we never bind to prefix'd routes, only fully specified ones. This even applies when addrconf or icmp6 routes are allocated. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:55:22 -08:00
Eric Dumazet	c2aa3665cf	net: add kmemcheck annotation in __alloc_skb() pskb_expand_head() triggers a kmemcheck warning when copy of skb_shared_info is done in pskb_expand_head() This is because destructor_arg field is not necessarily initialized at this point. Add kmemcheck_annotate_variable() call in __alloc_skb() to instruct kmemcheck this is a normal situation. Resolves bugzilla.kernel.org 27212 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=27212 Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:41:06 -08:00
Kurt Van Dijck	6d3a9a6854	net: fix validate_link_af in rtnetlink core I'm testing an API that uses IFLA_AF_SPEC attribute. In the rtnetlink core , the set_link_af() member of the rtnl_af_ops struct receives the nested attribute (as I expected), but the validate_link_af() member receives the parent attribute. IMO, this patch fixes this. Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:39:21 -08:00
Eric Dumazet	389f2a18c6	econet: remove compiler warnings net/econet/af_econet.c: In function ‘econet_sendmsg’: net/econet/af_econet.c:494: warning: label ‘error’ defined but not used net/econet/af_econet.c:268: warning: unused variable ‘sk’ Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Phil Blundell <philb@gnu.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:15:54 -08:00
David S. Miller	144001bddc	inetpeer: Mark metrics as "new" in fresh inetpeer entries. Set the RTAX_LOCKED metric to INETPEER_METRICS_NEW (basically, all ones) on fresh inetpeer entries. This way code can determine if default metrics have been loaded in from a routing table entry already. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 13:52:16 -08:00
Thomas Jacob	705ca14717	netfilter: xt_iprange: typo in IPv4 match debug print code Signed-off-by: Thomas Jacob <jacob@internet24.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-01-27 10:56:32 +01:00
David S. Miller	62fa8a846d	net: Implement read-only protection and COW'ing of metrics. Routing metrics are now copy-on-write. Initially a route entry points it's metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'. The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states. For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache. Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully. But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to. TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location. Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary. Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks. The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-26 20:51:05 -08:00
David S. Miller	b4e69ac670	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-01-26 13:49:30 -08:00
David S. Miller	7cc2edb834	xfrm6: Don't forget to propagate peer into ipsec route. Like ipv4, we have to propagate the ipv6 route peer into the ipsec top-level route during instantiation. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-26 13:41:03 -08:00
Johannes Berg	ba99d93b3d	mac80211: use DECLARE_EVENT_CLASS For events that include only the local struct as their parameter, we can use DECLARE_EVENT_CLASS and save quite some binary size across segments as well lines of code. text data bss dec hex filename 375745 19296 916 395957 60ab5 mac80211.ko.before 367473 17888 916 386277 5e4e5 mac80211.ko.after -8272 -1408 0 -9680 -25d0 delta Some more tracepoints with identical arguments could be combined like this but for now this is the one that benefits most. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2011-01-26 16:15:45 -05:00
Eric Dumazet	144ce879b0	net_sched: sch_mqprio: dont leak kernel memory mqprio_dump() should make sure all fields of struct tc_mqprio_qopt are initialized. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-26 13:15:29 -08:00
David S. Miller	9b6941d8b1	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6	2011-01-26 11:49:49 -08:00
Patrick McHardy	2e0348c449	Merge branch 'connlimit' of git://dev.medozas.de/linux	2011-01-26 16:28:45 +01:00
Jan Engelhardt	ad86e1f27a	netfilter: xt_connlimit: pick right dstaddr in NAT scenario xt_connlimit normally records the "original" tuples in a hashlist (such as "1.2.3.4 -> 5.6.7.8"), and looks in this list for iph->daddr when counting. When the user however uses DNAT in PREROUTING, looking for iph->daddr -- which is now 192.168.9.10 -- will not match. Thus in daddr mode, we need to record the reverse direction tuple ("192.168.9.10 -> 1.2.3.4") instead. In the reverse tuple, the dst addr is on the src side, which is convenient, as count_them still uses &conn->tuple.src.u3. Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2011-01-26 13:01:39 +01:00
Linus Lüssing	dd58ddc692	batman-adv: Fix kernel panic when fetching vis data on a vis server The hash_iterate removal introduced a bug leading to a kernel panic when fetching the vis data on a vis server. That commit forgot to rename one variable name, which this commit fixes now. Reported-by: Russell Senior <russell@personaltelco.net> Signed-off-by: Linus Lüssing <linus.luessing@web.de> Signed-off-by: Sven Eckelmann <sven@narfation.org>	2011-01-25 23:58:33 +01:00
Jerry Chu	44f5324b5d	TCP: fix a bug that triggers large number of TCP RST by mistake This patch fixes a bug that causes TCP RST packets to be generated on otherwise correctly behaved applications, e.g., no unread data on close,..., etc. To trigger the bug, at least two conditions must be met: 1. The FIN flag is set on the last data packet, i.e., it's not on a separate, FIN only packet. 2. The size of the last data chunk on the receive side matches exactly with the size of buffer posted by the receiver, and the receiver closes the socket without any further read attempt. This bug was first noticed on our netperf based testbed for our IW10 proposal to IETF where a large number of RST packets were observed. netperf's read side code meets the condition 2 above 100%. Before the fix, tcp_data_queue() will queue the last skb that meets condition 1 to sk_receive_queue even though it has fully copied out (skb_copy_datagram_iovec()) the data. Then if condition 2 is also met, tcp_recvmsg() often returns all the copied out data successfully without actually consuming the skb, due to a check "if ((chunk = len - tp->ucopy.len) != 0) {" and "len -= chunk;" after tcp_prequeue_process() that causes "len" to become 0 and an early exit from the big while loop. I don't see any reason not to free the skb whose data have been fully consumed in tcp_data_queue(), regardless of the FIN flag. We won't get there if MSG_PEEK is on. Am I missing some arcane cases related to urgent data? Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-25 13:46:30 -08:00
Felix Fietkau	eb3e554b4b	mac80211: fix a crash in ieee80211_beacon_get_tim on change_interface Some drivers (e.g. ath9k) do not always disable beacons when they're supposed to. When an interface is changed using the change_interface op, the mode specific sdata part is in an undefined state and trying to get a beacon at this point can produce weird crashes. To fix this, add a check for ieee80211_sdata_running before using anything from the sdata. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Cc: stable@kernel.org Signed-off-by: John W. Linville <linville@tuxdriver.com>	2011-01-25 16:28:56 -05:00
Eric Dumazet	26ad787962	pktgen: speedup fragmented skbs We spend lot of time clearing pages in pktgen. (Or not clearing them on ipv6 and leaking kernel memory) Since we dont modify them, we can use one zeroed page, and get references on it. This page can use NUMA affinity as well. Define pktgen_finalize_skb() helper, used both in ipv4 and ipv6 Results using skbs with one frag : Before patch : Result: OK: 608980458(c608978520+d1938) nsec, 1000000000 (100byte,1frags) 1642088pps 1313Mb/sec (1313670400bps) errors: 0 After patch : Result: OK: 345285014(c345283891+d1123) nsec, 1000000000 (100byte,1frags) 2896158pps 2316Mb/sec (2316926400bps) errors: 0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-25 13:26:05 -08:00
David S. Miller	73a8bd74e2	ipv6: Revert 'administrative down' address handling changes. This reverts the following set of commits: `d1ed113f16` ("ipv6: remove duplicate neigh_ifdown") `29ba5fed1b` ("ipv6: don't flush routes when setting loopback down") `9d82ca98f7` ("ipv6: fix missing in6_ifa_put in addrconf") `2de7957072` ("ipv6: addrconf: don't remove address state on ifdown if the address is being kept") `8595805aaf` ("IPv6: only notify protocols if address is compeletely gone") `27bdb2abcc` ("IPv6: keep tentative addresses in hash table") `93fa159abe` ("IPv6: keep route for tentative address") `8f37ada5b5` ("IPv6: fix race between cleanup and add/delete address") `84e8b803f1` ("IPv6: addrconf notify when address is unavailable") `dc2b99f71e` ("IPv6: keep permanent addresses on admin down") because the core semantic change to ipv6 address handling on ifdown has broken some things, in particular "disable_ipv6" sysctl handling. Stephen has made several attempts to get things back in working order, but nothing has restored disable_ipv6 fully yet. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Tested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-25 12:49:08 -08:00
Andy Adamson	778be232a2	NFS do not find client in NFSv4 pg_authenticate The information required to find the nfs_client cooresponding to the incoming back channel request is contained in the NFS layer. Perform minimal checking in the RPC layer pg_authenticate method, and push more detailed checking into the NFS layer where the nfs_client can be found. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-01-25 15:26:51 -05:00
Sage Weil	42961d2333	libceph: fix socket write error handling Pass errors from writing to the socket up the stack. If we get -EAGAIN, return 0 from the helper to simplify the callers' checks. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-25 08:19:34 -08:00
Sage Weil	98bdb0aa00	libceph: fix socket read error handling If we get EAGAIN when trying to read from the socket, it is not an error. Return 0 from the helper in this case to simplify the error handling cases in the caller (indirectly, try_read). Fix try_read to pass any error to it's caller (con_work) instead of almost always returning 0. This let's us respond to things like socket disconnects. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-25 08:17:48 -08:00
Tejun Heo	ada609ee2a	workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER WQ_RESCUER is now an internal flag and should only be used in the workqueue implementation proper. Use WQ_MEM_RECLAIM instead. This doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de>	2011-01-25 14:35:54 +01:00
Changli Gao	9f4e1ccd80	netfilter: ipvs: fix compiler warnings Fix compiler warnings when IP_VS_DBG() isn't defined. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-25 23:17:51 +10:00
Vlad Dogaru	a512b92b3a	net: add sysfs entry for device group The group of a network device can be queried or changed from userspace using sysfs. For example, considering sysfs mounted in /sys, one can change the group that interface lo belongs to: echo 1 > /sys/class/net/lo/group Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 23:23:28 -08:00
Eugene Teo	b7c7d01aae	net: clear heap allocation for ethtool_get_regs() There is a conflict between commit `b00916b1` and `a77f5db3`. This patch resolves the conflict by clearing the heap allocation in ethtool_get_regs(). Cc: stable@kernel.org Signed-off-by: Eugene Teo <eugeneteo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 21:05:17 -08:00
Hans Schillstrom	07924709f6	IPVS netns BUG, register sysctl for root ns The newly created table was not used when register sysctl for a new namespace. I.e. sysctl doesn't work for other than root namespace (init_net) Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-25 12:13:08 +10:00
David S. Miller	d80bc0fd26	ipv6: Always clone offlink routes. Do not handle PMTU vs. route lookup creation any differently wrt. offlink routes, always clone them. Reported-by: PK <runningdoglackey@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 16:01:58 -08:00
Michał Mirosław	acd1130e87	net: reduce and unify printk level in netdev_fix_features() Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will be used by ethtool and might spam dmesg unnecessarily. This converts the function to use netdev_info() instead of plain printk(). As a side effect, bonding and bridge devices will now log dropped features on every slave device change. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:45:15 -08:00
Michał Mirosław	04ed3e741d	net: change netdev->features to u32 Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:32:47 -08:00
Michał Mirosław	57422dc530	net: Move check of checksum features to netdev_fix_features() Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:29:11 -08:00
John Fastabend	3dce38a02d	dcbnl: make get_app handling symmetric for IEEE and CEE DCBx The IEEE get/set app handlers use generic routines and do not require the net_device to implement the dcbnl_ops routines. This patch makes it symmetric so user space and drivers do not have to handle the CEE version and IEEE DCBx versions differently. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:19:55 -08:00
Ben Hutchings	c445477d74	net: RPS: Enable hardware acceleration of RFS Allow drivers for multiqueue hardware with flow filter tables to accelerate RFS. The driver must: 1. Set net_device::rx_cpu_rmap to a cpu_rmap of the RX completion IRQs (in queue order). This will provide a mapping from CPUs to the queues for which completions are handled nearest to them. 2. Implement net_device_ops::ndo_rx_flow_steer. This operation adds or replaces a filter steering the given flow to the given RX queue, if possible. 3. Periodically remove filters for which rps_may_expire_flow() returns true. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 14:53:01 -08:00
Eric Dumazet	fd0273c503	tcp: fix bug in listening_get_next() commit `a8b690f98b` (tcp: Fix slowness in read /proc/net/tcp) introduced a bug in handling of SYN_RECV sockets. st->offset represents number of sockets found since beginning of listening_hash[st->bucket]. We should not reset st->offset when iterating through syn_table[st->sbucket], or else if more than ~25 sockets (if PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next() with a too small st->offset Next time we enter tcp_seek_last_pos(), we are not able to seek past already found sockets. Reported-by: PK <runningdoglackey@yahoo.com> CC: Tom Herbert <therbert@google.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 14:41:20 -08:00
David S. Miller	3408404a4c	inetpeer: Use correct AVL tree base pointer in inet_getpeer(). Family was hard-coded to AF_INET but should be daddr->family. This fixes crashes when unlinking ipv6 peer entries, since the unlink code was looking up the base pointer properly. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 14:38:09 -08:00
Michal Schmidt	d1dc7abf2f	GRO: fix merging a paged skb after non-paged skbs Suppose that several linear skbs of the same flow were received by GRO. They were thus merged into one skb with a frag_list. Then a new skb of the same flow arrives, but it is a paged skb with data starting in its frags[]. Before adding the skb to the frag_list skb_gro_receive() will of course adjust the skb to throw away the headers. It correctly modifies the page_offset and size of the frag, but it leaves incorrect information in the skb: ->data_len is not decreased at all. ->len is decreased only by headlen, as if no change were done to the frag. Later in a receiving process this causes skb_copy_datagram_iovec() to return -EFAULT and this is seen in userspace as the result of the recv() syscall. In practice the bug can be reproduced with the sfc driver. By default the driver uses an adaptive scheme when it switches between using napi_gro_receive() (with skbs) and napi_gro_frags() (with pages). The bug is reproduced when under rx load with enough successful GRO merging the driver decides to switch from the former to the latter. Manual control is also possible, so reproducing this is easy with netcat: - on machine1 (with sfc): nc -l 12345 > /dev/null - on machine2: nc machine1 12345 < /dev/zero - on machine1: echo 1 > /sys/module/sfc/parameters/rx_alloc_method # use skbs echo 2 > /sys/module/sfc/parameters/rx_alloc_method # use pages - See that nc has quit suddenly. [v2: Modified by Eric Dumazet to avoid advancing skb->data past the end and to use a temporary variable.] Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 14:27:18 -08:00

... 108 109 110 111 112 ...

23369 Commits