Summary of changes from v2.5.34 to v2.5.35
============================================

ppc64: remove some unnecessary sign extensions

ppc64: remove ancient stat syscalls

ppc64: add mmap64 support

ppc64: add sendfile64 support and restore ioperm syscall

ppc64: Don't force O_LARGEFILE on for 32 bit apps. From sparc64

ppc64: merge in changes from x86 irq balance code

ppc64: Update the fake pci read code to handle a return of all 1s.

ppc64: Fix sys32_readahead wrapper to obey ABI wrt passing long longs

ppc64: remove status, no longer used

ppc64: Remove use of

ppc64: remove some old code

ppc64: clean up syscall table, making it obvious which are obsolete and which are 32 bit only

ppc64: Remove old keyboard code

ppc64: fixes for 2.5.32

hvc_console: stop HVC console while xmon is running

ppc64: make udelay a barrier, fixes problem with input layer keyboard probing

ppc64: defconfig update

ppc64: config.in cleanup

ppc64: Add security and AIO syscalls

ppc64: copy FE0 and FE1 bits into MSR when ptracing

ppc64: warn when registering duplicate ioctls

ppc64: Compile in LLC, needed for token ring

ppc64: turn off token ring for the moment, it oopses

ACPI trivial cleanups (Kochi Takayoshi)

This fixes problems in serport.c found by Russell King:
1) Problem with current->state in serport_ldisc_read. Solved by using wait_event_interruptible()
2) Problem when serport_ldisc_read() is entered twice. Solved using set_bit et al.
3) Complex naming of the serio ports. Using tty_name() instead.
4) Possible stack overflows in name generations. Using tty_name() instead.

Because x86-64 also always reserves the kbd region, we must not call request_region() in i8042-io.h, like we don't for i386, alpha, etc.

By John Belmonte - improvements to Toshiba ACPI driver:
1) Fix sscanf
2) Add TV out support
3) Add hotkey status
4) Add version info

ACPI Config.in update by Christoph Hellwig
- 3 space indents
- one menu for all arches instead of duplicating
- define_*s moved below the real questions

Remove obsolete OSL functions (Kochi Takayoshi)

ifdef some arch-specific ACPI code

[PATCH] uhci, doc + cleanup

Another UHCI patch. I'm sending this since Dan said he was going to start teaching "uhci-hcd" how to do control and interrupt queueing, and this may help. Granted it checks out (I didn't test the part that has a chance to break, though it "looks right"), I think it should get merged in at some point.

What it does:
- updates and adds some comments/docs
- gets rid of a "magic number" calling convention, instead passing an explicit flag UHCI_PTR_DEPTH or UHCI_PTR_BREADTH (self-doc :)
- deletes bits of unused/dead code
- updates the append-to-qh code:
  * start using list_for_each() ... clearer than handcrafted loops, and it prefetches too. Lots of places should get updated to do this, IMO.
  * re-orders some stuff to fix a sequencing problem
  * adds ascii-art to show how the urb queueing is done (based on some email Johannes sent me recently)

That sequencing problem is that when splicing a QH between A and B, it currently splices A-->QH before QH-->B ... so that if the HC is looking at that chunk of schedule at that time, everything starting at B will be ignored during the rest of that frame. (Since the QH is initted to have UHCI_PTR_TERM next, stopping the schedule scan.)

I said "problem" not "bug" since in the current code it would probably (what does that "PIIX bug" do??) just reduce control/bulk throughput. That's because the logic is only appending towards the end of each frame's schedule, where the FSBR loopback kicks in.

[PATCH] Re: [patch 2.5.31-bk5] uhci, misc

This patch has some small UHCI bugfixes
- on submit error, frees memory and (!)
  returns error code
- root hub should disconnect only once
- pci pool code shouldn't be given GFP_DMA
- uses del_timer_sync(), which behaves on SMP, not del_timer()

and cleanups:
- use container_of
- doesn't replicate so much hcd state
- no such status -ECONNABORTED
- uses bus_name in procfs, not "hc0", "hc1" etc

[PATCH] pci_free_consistent on ohci initialisation failure

The trace at the end of the message shows the init failure.

[PATCH] Lexar USB CF Reader

Two weeks ago I sent this patch to the listed USB storage maintainer (mdharm-usb@one-eyed-alien.net) and have not yet heard back. The attached patch adds support for the Lexar USB CF Reader identified by id_product 0xb002, version 0x0113 (which is the version I have). This patch is against the 2.4.19 kernel, sorry if this is the wrong address to send this stuff to. Thanks.

[PATCH] ehci locking

I've been chasing problems on a KT333 based system, with the 8253 southbridge and EHCI 1.0 (!), and this fixes at least some of them:
- locking updates:
  * a few routines weren't protected right
  * less irqsave thrashing for schedule lock
- adds a watchdog timer that should fire when the STS_IAA interrupt seems to be missing.
- gives ports back to companion UHCI/OHCI on rmmod
- re-enables faulted QH only after all its completion callbacks have done their work
- removes an oops I've seen when usb-storage unlinks stuff. (it seemed confused about error handling, but that's not a reason to oops.)
- minor cleanup: deadcode rm, etc

Right now the watchdog just barks, and that mechanism might go away (or into the shared hcd code). Sometimes the issue it reports seems to clear up by itself, but sometimes not...

[PATCH] Re: updated ehci patch ...

* keep watchdog on shorter leash, and just do standard irq processing when it barks. this means I can use a somewhat iffy vt8235 mobo.
* updates to the driverfs debug output, including using S_IRUGO so anyone can gawk.
* some updates, mostly to use a new hcd_to_bus(), so this version also compiles on a (slightly patched) 2.4.20-pre5 kernel. (*)

[PATCH] PATCH: usb-storage: fix software eject

This patch fixes the recently broken software eject of media. At least, it should. I'm back to having compile problems again, but the fix should be pretty self-evident.

[PATCH] ohci-hcd endpoint scheduling, driverfs

This patch cleans up some messy parts of this driver, and was pleasantly painless.
- gets rid of ED dma hashtables
  * less memory needed
  * also less (+faster) code
  * ... rewrites all ED scheduling ops, they now use cpu addresses, like EHCI and UHCI do already
- simplifies ED scheduling (no dma hashtables)
  * control and bulk lists are now doubly linked
  * periodic tree still singly linked; driver uses a new CPU view "shadow" of the hardware framelist
  * previous periodic code was cryptic, almost read-only
  * simpler tree code for EDs with {branch,period}
- bugfixes periodic scheduling
  * when CONFIG_USB_BANDWIDTH, checks per-frame load against the limit; no more dodgy accounting
  * handles iso period != 1; interrupt and iso schedule EDs with the same routine (HW sees special TDs)
  * credit usbfs with bandwidth for endpoints, not URBs
- adds driverfs output (when CONFIG_USB_DEBUG)
  * resembles EHCI: 'async' (control+bulk) and 'periodic' (interrupt+iso) files show schedules
  * shows only queue heads (EDs) just now (*)
- has minor text and code cleanups, etc

Now that this logic has morphed into more comprehensible form, I know what to borrow into the EHCI code!

(*) It shows TDs on the td_list, but this patch won't put them there. A queue fault handling update will.

[PATCH] USB: pegasus driver patch

one more adapter, changed company name and forgotten flag

USB: remove __NO_VERSION__

Thanks to Rusty "trivial" Russell

[PATCH] 2.5.32-usb

This patch appears not to be in 2.5.32, but applies cleanly.

The following patch fixes 3 problems in USB:
1.
   Don't pci_map buffers when we know we're not going to pass them to a device. This was first noticed on ARM (no surprises here); the root hub code, rh_call_control(), placed data into the buffer and then called usb_hcd_giveback_urb(). This function called pci_unmap_single() on this region which promptly destroyed the data that rh_call_control() had placed there. This led to a corrupted device descriptor and the "too many configurations" message.
2. If controller->hcca is NULL, don't try to dereference it.
3. If we free the root hub (in ohci-hcd.c or uhci-hcd.c), don't leave a dangling pointer around to trip us up in usb_disconnect(). EHCI appears to get this right.

USB: clean up the error path in create_special_files() for usbfs

Thanks to David Brownell for pointing out the problem here.

A small documentation update and an unused constant removal.

PPC32: Use vunmap rather than vfree in iounmap.

ACPI: Remove interpreter debugger and kdb directories. These ultimately didn't prove useful enough to be used on a regular basis.

[PATCH] Feiya 5-in-1 Card Reader

I have a USB 5-in-1 Card Reader, that will read CF and SM and SD/MMC. Under Linux it appears as three SCSI devices. For today, the report is on the CF part.

The CF part works fine under ordinary usb-storage SCSI simulation, with one small problem: 8 and 32 MB cards, that are detected as having 15872 and 63488 sectors by other readers, are detected as having 15873 and 63489 sectors by this Feiya reader (0x090c / 0x1132). In the good old days probably nobody would have noticed, but these days the partition reading code also wants to read the last sector. This results in the SCSI code taking the device off line:

[USB storage does a READ_10, which fails since the sector is past the end of the disk. Then it tries a READ_6 and nothing ever happens, probably because the device does not support READ_6.
Then the error handler does an abort which triggers some bugs in scsiglue.c and transport.c, then the error handler does a device reset, then a host reset, then a bus reset, and finally the device is taken offline.]

The patch below does not address any bugs in the SCSI error code (a big improvement would be just to rip it all out - this error code never achieves anything useful but has crashed many a machine) and does not fix the USB code either. It just adds a flag to the unusual_devices section mentioning that this device (my revision is 1.00) has this bug.

Without the patch the kernel crashes, or insmod usb-storage hangs. With the patch the CF part of the device works perfectly.

(Another change is to only print "Fixing INQUIRY data" when really something is changed, not when the data was OK already.)

Andries

ACPI: Do not do certain bits of APIC config if CONFIG_ACPI_HT_ONLY is set.

[TIGON3]: Merge to version 1.1
- When not low-power, only set GPIO enables in lclctrl on 5700 chips
- Follow all writes to foo DMAC_MODE with a readback and udelay(40)
- Be explicit about the fact that the driver disables wake-on-lan by default and how the user may enable it
- A few NIC_SRAM_DATA_CFG_foo bits were wrong or missing
- Clock control programming for some chips when going to low power mode was wrong.
- Bump driver version/reldata for release
- PCI write posting fixes
  * Sanitize every PCI write that requires a delay afterwards by doing a dummy read back from the register.
  * Handle the interesting case of this when doing a core-clock reset by using PCI config space indirect writes to GRC_MISC_CFG since we cannot do an MMIO read back from the chip during this reset event because it clears MMIO space enable in PCI_CONFIG
  * Add a new tg3_flag TG3_FLAG_MBOX_WRITE_REORDER which is set on chipsets that may violate PCI write ordering rules; when set we always read back from tx/rx ring mailbox registers after a write to guarantee the writes appear to the chip in order.
- Make sure to always enable AS_MASTER bits when necessary
- PHY reset fixes
  * Always reset PHY on init, for every chip revision
  * Program 5703 specific PHY stuff after the reset
  * Always enable Ethernet@WireSpeed after that reset
  * Always set ADVERTISE_PAUSE_CAP in initial adv reg.

[PATCH] two byte offset for kaweth

this is the two byte offset patch to kaweth for 2.5 to prevent MIPS crashing and speed up other arches.

[PATCH] usbnet, add YOPY device IDs

A now-happy Yopy user sent me these IDs.

Bump up JFS_LINK_MAX from 64K to 4G. Taking advantage of the change of i_nlink from nlink_t to unsigned int.

arch/sparc/config.in: Add missing parts for modern fashion configs.
arch/sparc/defconfig: Supply working defconfig to show what is working, what is not.

[SPARC]: Kill remaining remnants of kgdb support.

[SPARC64]: Cleanup serial_console declarations.

[SPARC]: Get 2.5.x building once more.

drivers/serial/sunzilog.c: Fix build of sparc32 probing code.

ppc64: add arg to do_fork and fix ELF_AUX entries as done in ppc32

Extended attribute fixes for JFS.

[TIGON3] Initial TCP segmentation offload support.

[TIGON3] Fix typos in TSO changes.

[TIGON3]: Force use of PCI config space reg writes when loading firmware.

[TIGON3]: Disable TSO for now, tso firmware can hang tx cpu.

[TCP]: Delay tstamp state commit in input fast path until we verify csum.

PPC32: Update the PCI config-space access functions for PReP. These got missed in my previous commit.

PPC32: rearrange includes in arch/ppc/kernel/irq.c to fix a compile error.

[PATCH] Toshiba.c IRQ Patch (Christoph Hellwig eats people?)

Somewhere around 2.5.31 the method for setting and clearing interrupts changed:

  From-                    To-
  save_flags(flags);       local_irq_save(flags);
  cli();
  restore_flags(flags);    local_irq_restore(flags);

Though bordering on trivial, including toshiba support with stock 2.5.34 fails to compile, which this patch seems to fix.
This patch fixes this issue and has worked reliably for me under 2.5.31, though it is untested on 2.5.32 and 2.5.33 because I didn't manage to get those to work.

A note to those that are a bit rough on kernel patch newbies.... submitting a kernel patch for the very first time is a rather intimidating experience so please don't chew my head off unless it's absolutely necessary.

See my point? I was so worried that Christoph Hellwig is going to come to my house and eat me I forgot to include the patch itself. :)

[PATCH] USB storage: abort bug fix

Also, have you sent in the one-line fix I found for the abort bug? Andries found that it cured his BUG_ON problem. In case you didn't save a copy of it, I've included it below.

[PATCH] [PATCH] (repost) fix for big endian machines in scanner.c

This patch fixes a problem with big endian machines and scanner drivers which use the SCANNER_IOCTL_CTRLMSG ioctl. The big endian to little endian swap was done twice, resulting in a no-op.

[PATCH] [PATCH 2.5.33+] ohci and iso-in

I added a bug in 2.5.23 when cleaning up something that was broken ... it wasn't broken in quite the way I had thought at the time! This fixes a problem some folk have reported recently with ISO-IN, by masking a common non-error outcome.

Please merge to Linus' tree, on top of the one patch you already have queued. Thanks to Nemosoft for such quick turnaround on testing!

Compaq PCI Hotplug driver: fixed __FUNCTION__ usages

[PATCH] USB: se401 driver update

PCI: hotplug core cleanup to get pci hotplug working again
- removed pci_announce_device_to_drivers() prototype as the function is long gone
- always call /sbin/hotplug when pci devices are added to the system if so configured (this includes during the system bring up.)

ACPI: Do not compile functions not used in HT_ONLY mode

[PATCH] IBM PCI Hotplug driver update
- fix polling logic
- add ability to write [chassis/rxe]#slot# instead of just slot#

[PATCH] IBM PCI Hotplug driver update for ISA based controllers

[PATCH] IBM PCI Hotplug driver update for PCI based controllers

PCI: export pci_scan_bus() as the IBM PCI Hotplug driver needs it.

PCI Hotplug: remove pci_*_nodev() prototypes as the functions are gone. The pci_bus_* functions should be used instead.

[PATCH] 2.5.34 kernel-api DocBook fix

Update kernel-api.tmpl to reflect mtrr changes so that the docs will build.

[PATCH] 2.5.34: recalc_sigpending missing for modules

When recalc_sigpending was converted from inline to real function, appropriate EXPORT_SYMBOL() was not created. Needed at least for ncpfs and lockd.

[PATCH] : Grammatical fixes

Documentation/porting: s/are/and/
Documentation/directory-locking: s/that means// was repeated

[PATCH] Re: Performance issue in 2.5.32+

- The early startup code was changed so smp_prepare_cpus() is now called before do_basic_setup(). do_basic_setup() is where mtrr_init() is called, and mtrr_init_secondary_cpu() depends on mtrr_init() having been called.
- mtrr_init_boot_cpu() was removed from the AP startup code. This was a SMP-only hack that made sure mtrr_init() happened when SMP was enabled. That's right - two different code paths to do the same thing, obscured by compile-time defines.

The appended patch makes sure mtrr_init() is called before smp_prepare_cpus(). It's ugly, and I'll work on a cleaner solution, but James: could you try it and see if it fixes your performance issues?

Compaq PCI Hotplug driver: changed calls to pci_*_nodev() to pci_bus_*()

IBM PCI Hotplug driver: changed calls to pci_*_nodev() to pci_bus_*()

Get Intel model name from the CPU

Reorganize the mtrr init sequence a bit.

All mtrr init now happens during the initcall sequence, after all CPUs have been brought up.
mtrr_init() calls a static init_other_cpus(), which fires off a function on all other cpus to replicate the state across all of them.

arch/i386/kernel/smpboot.c::smp_callin() had the following:

	#ifdef CONFIG_MTRR
		/*
		 * Must be done before calibration delay is computed
		 */
		mtrr_init_secondary_cpu ();
	#endif

I couldn't figure this one out. The P4 manual says nothing about this, nor could I find any other documentation about it. The P4 manual says only that state must be synchronized across all CPUs, which it is. And, it happens before anything else is executed on the other CPUs, and before any devices or drivers have been brought up.

The cyrix mtrr code was also updated to handle this style of SMP initialization.

ACPI: Fix possible sleeping at interrupt context (Matthew Wilcox)

Never _ever_ BUG() if you don't have to

Cset exclude: greg@kroah.com|ChangeSet|20020905153320|19047

[PATCH] USER_HZ & NTP problems

I've been playing with different HZ values in the 2.4 kernel for a while now, and apparently Linus also has decided to introduce a USER_HZ constant (I used CLOCKS_PER_SEC) while raising the HZ value on x86 to 1000.

On x86 timekeeping has shown to be relatively fragile when raising HZ (OK, I tried HZ=2048 which is quite high) because of the way the interrupt timer is configured to fire HZ times each second. This is done by configuring a divisor in the timer chip (LATCH) which divides a certain clock (1193180) and makes the chip fire interrupts at the resulting frequency.

Now comes the catch: NTP requires a clock accuracy of 500 ppm. For some HZ values the clock is not accurate enough to meet this requirement, hence NTP won't work well.

An example HZ value is 1020 which exceeds the 500 ppm requirement. In this case the best approximation is 1019.8 Hz. The xtime.tv_usec value is raised by 980 each tick, which means that after one second the tv_usec value has increased by 999404 (should be 1000000), which is an accuracy of 596 ppm.

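The arithmetic in the USER_HZ & NTP entry can be sketched as follows. This is a minimal illustration, not code from the patch: it assumes that both the timer divisor (LATCH) and the per-tick microsecond increment are rounded to the nearest integer, which is an assumption that happens to reproduce the per-HZ figures this entry lists. The function names are hypothetical.

```python
# Hedged sketch of the HZ accuracy arithmetic described in the entry above.
# Assumption (not stated in the changelog): LATCH and the per-tick usec
# increment are both computed with round-to-nearest integer division.

CLOCK_TICK_RATE = 1193180  # timer chip input clock in Hz, as quoted in the entry


def latch(hz):
    """Integer divisor programmed into the timer chip for a requested HZ."""
    return (CLOCK_TICK_RATE + hz // 2) // hz


def actual_hz(hz):
    """Interrupt frequency the chip really produces with that divisor."""
    return CLOCK_TICK_RATE / latch(hz)


def usec_per_tick(hz):
    """Whole microseconds added to xtime.tv_usec on every tick."""
    return (1_000_000 + hz // 2) // hz


def ppm_error(hz):
    """Microseconds gained or lost per real second, which equals ppm."""
    counted = actual_hz(hz) * usec_per_tick(hz)
    return abs(counted - 1_000_000)


if __name__ == "__main__":
    for hz in (100, 1000, 1020, 1024, 2000, 2008, 2011, 2048):
        print(hz, latch(hz), round(actual_hz(hz), 1), round(ppm_error(hz)))
```

For HZ=1020 this gives a divisor of 1170 and an increment of 980 usec per tick, as in the text. The unrounded frequency yields an error of roughly 584 ppm; the 596 ppm quoted above comes from rounding the frequency to 1019.8 Hz first. Either way the value is over the 500 ppm limit.
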
Some more examples:

  HZ    Accuracy (ppm)
  ----  --------------
   100              17
  1000             151
  1024             632
  2000             687
  2008             343
  2011              18
  2048            1249

What I've been doing is replace tv_usec by tv_nsec, meaning xtime is now a timespec instead of a timeval. This allows the accuracy to be improved by a factor of 1000 for any (well ... any?) HZ value.

Of course all kinds of calculations had to be improved as well. The ACTHZ constant is introduced to approximate the actual HZ value; it's used to do some approximations of other related values.

[PATCH] 2.5.34 ufs/super.c

This is needed since 2.5.32 to successfully mount a UFS partition.

[PATCH] cdrom.c is the only file to include asm/fcntl.h

drivers/cdrom/cdrom.c is the only file (apart from include/linux/fcntl.h) that includes asm/fcntl.h. This changes that and should have no effect. I need to do this before I consolidate the asm/fcntl.h files into linux/fcntl.h (coming next - again).

ppc64: INIT_SIGNALS fix

ppc64: add rwlock_is_locked

[SPARC] sparc 2.5.x again
- Little woops in the new PCI configuration routines
- Removal of last CONFIG_SUN_SERIAL occurrences
- sunzilog initialized itself even if obio is not present, also remove pointless goto
- sunru oopsed outright trying to use iobase

[PATCH] 2.5.34 floppy driver init/exit fixes

The 2.5 floppy driver has for a long time had two init/exit bugs:
1. It calls register_sys_device() on init, but fails to call unregister_sys_device() in exit. This leads to data structure corruption if floppy is a module and it gets unloaded.
2. It calls register_sys_device() early on init, but fails to call unregister_sys_device() if init fails. Again, this leads to data structure corruption.

The patch below fixes both these problems.

[PATCH] undo 2.5.34 ftape damage

In the 2.5.33->2.5.34 step someone removed "export-objs" from drivers/char/ftape/lowlevel/Makefile, which makes it impossible to build ftape as a module since it _does_ have a number of EXPORT_SYMBOL's. This reverts that change.

[PATCH] PCI individual resource handling

This merges the changes from 2.4-ac that allow drivers to enable (and mark as used) only a subset of PCI resources, for those drivers that need it (at this point apparently only the i845 IDE controller).

Move around IDE files to match 2.4.20-pre5-ac4 layout. Do this before applying patches, for clarity and for keeping bk revision history.

Add Makefile's for the new arm/ legacy/ pci/ pci/ directories

[PATCH] blk_fs_request()

Add blk_fs_request(rq) to avoid testing rq->flags & REQ_CMD directly.

[PATCH] IDE pci ids

Update IDE pci ids to match 2.4.20-pre5-ac4 levels.

[PATCH] hdreg command updates etc

Update hdreg to match 2.4 levels.
o Use consistent SRV_STAT instead of SERVICE_STAT
o Add sector count status bits for tcq
o Add various missing commands
o hd_driveid update

[PATCH] Missing IDE partition 3 of 3 on 2.5.34

devfs side fixed thus:

[PATCH] Re: do_syslog/__down_trylock lockup in current BK

This fixes the lockup. The bug happened because reparenting in the CLONE_THREAD case was done in a fundamentally non-atomic way, which was asking for various races to happen: eg. the target parent gets reparented to the currently exiting thread ...

(the non-CLONE_THREAD case is safe because nothing reparents init.)

the solution is to make all of reparenting atomic (including the forget_original_parent() bit) - this is possible with some reorganization done in signal.c and exit.c. This also made some of the loops simpler.

[PATCH] writer throttling fix

The patch fixes a few problems in the writer throttling code. Mainly in the situation where a single large file is being written out.

That file could be parked on sb->locked_inodes due to pdflush writeback, and the writer throttling path coming out of balance_dirty_pages() forgot to look for inodes on ->locked_inodes.

The net effect was that the amount of dirty memory was exceeding the limit set in /proc/sys/vm/dirty_async_ratio, possibly to the point where the system gets seriously choked.

The patch removes sb->locked_inodes altogether and teaches the throttling code to look for inodes on sb->s_io as well as sb->s_dirty.

Also, just leave unwritten dirty pages on mapping->io_pages, and unwritten dirty inodes on sb->s_io. Putting them back onto ->dirty_pages and ->dirty_inodes was fairly pointless, given that both lists need to be looked at.

[PATCH] pass the correct flags to aops->releasepage()

Restore the gfp_mask in the VM's call to a_ops->releasepage(). We can block in there again, and XFS (at least) can use that.

[PATCH] exact dirty state accounting

Some adjustments to global dirty page accounting.

Previously, dirty page accounting counted all dirty pages. Even dirty anonymous pages. This has potential to upset the throttling logic in balance_dirty_pages(). Particularly as I suspect we should decrease the dirty memory writeback thresholds by a lot.

So this patch changes it so that we only account for dirty pagecache pages which have backing store. Not anonymous pages, not swapcache, not in-memory filesystem pages.

To support this, the `memory_backed' boolean has been added to struct backing_dev_info. When an address space's backing device is marked as memory-backed, the core kernel knows to not include that mapping's pages in the dirty memory accounting.

For memory-backed mappings, dirtiness is a way of pinning the page, and there's nothing the kernel can do to clean the page to make it freeable.

driverfs, tmpfs, and ramfs have been converted to mark their mappings as memory-backed.

The ramdisk driver hasn't been converted. I have a separate patch for ramdisk, which fails to fix the longstanding problems in there :(

With this patch, /bin/sync now sends /proc/meminfo:Dirty to zero, which is rather comforting.

[PATCH] discontigmem code cleanup #1

Patch from Martin Bligh.

"This mainly changes the PLAT_MY_MACRO_IS_ALL_CAPS() stuff to be normal_macro(), and takes out some unnecessary redirection of function names.
No functionality changes, nothing touched outside i386 discontigmem ... just makes code readable. Rumour has it that the PLAT_* stuff came from IRIX - I don't see that as a good reason to make the Linux code unreadable.

Tested on 16-way NUMA-Q."

[PATCH] discontigmem code cleanup #2

Patch from Martin Bligh

"This mainly just rips out some magic extra structures in the boot time code to determine node sizes, and counts in pages instead of bytes. Oh, and I put the code that allocates pgdat into allocate_pgdat, instead of find_max_pfn_node, which seems like an incongruous home for it.

No functionality changes, nothing touched outside i386 discontigmem ... just makes code cleaner and more readable.

Tested on 16-way NUMA-Q."

[PATCH] reduce the default dirty memory thresholds

Writeback parameter tuning. Somewhat experimental, but heading in the right direction, I hope.

- Allowing 40% of physical memory to be dirtied on massive ia32 boxes is unreasonable. It pins too many buffer_heads and contributes to page reclaim latency.

  The patch changes the initial value of /proc/sys/vm/dirty_background_ratio, dirty_async_ratio and (the presently non-functional) dirty_sync_ratio so that they are reduced when the highmem:lowmem ratio exceeds 4:1.

  These ratios are scaled so that as the highmem:lowmem ratio goes beyond 4:1, the maximum amount of allowed dirty memory ceases to increase. It is clamped at the amount of memory which a 4:1 machine is allowed to use.

- Aggressive reduction in the dirty memory threshold at which background writeback cuts in. 2.4 uses 30% of ZONE_NORMAL. 2.5 uses 40% of total memory. This patch changes it to 10% of total memory (if total memory <= 4G. Even less otherwise - see above).

  This means that:
  - Much more writeback is performed by pdflush.
  - When the application is generating dirty data at a moderate rate, background writeback cuts in much earlier, so memory is cleaned more promptly.
  - Reduces the risk of user applications getting stalled by writeback.
  - Will damage dbench numbers. It turns out that the damage is fairly small, and dbench isn't a worthwhile workload for optimisation.

- Moderate reduction in the dirty level at which the write(2) caller is forced to perform writeback (throttling). Was 40% of total memory. Is now 30% of total memory (if total memory <= 4G, less otherwise).

  This is to reduce page reclaim latency, and generally because allowing processes to flood the machine with dirty data is a bad thing in mixed workloads.

[PATCH] buffer_head takedown for bighighmem machines

This patch addresses the excessive consumption of ZONE_NORMAL by buffer_heads on highmem machines. The algorithms which decide which buffers to shoot down are fairly dumb, but they only cut in on machines with large highmem:lowmem ratios and the code footprint is tiny.

The buffer.c change implements the buffer_head accounting - it sets the upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL.

A possible side-effect of this change is that the kernel will perform more calls to get_block() to map pages to disk. This will only be observed when a file is being repeatedly overwritten - this is the only case in which the "cached get_block result" in the buffers is useful.

I did quite some testing of this back in the delalloc ext2 days, and was not able to come up with a test in which the cached get_block result was measurably useful. That's for ext2, which has a fast get_block().

A desirable side effect of this patch is that the kernel will be able to cache much more blockdev pagecache in ZONE_NORMAL, so there are more ext2/3 indirect blocks in cache, so with some workloads, less I/O will be performed.

In mpage_writepage(): if the number of buffer_heads is excessive then buffers are stripped from pages as they are submitted for writeback. This change is only useful for filesystems which are using the mpage code. That's ext2 and ext3-writeback and JFS.
An mpage patch for reiserfs was floating about but seems to have got lost.

There is no need to strip buffers for reads because the mpage code does not attach buffers for reads.

These are perhaps not the most appropriate buffer_heads to toss away. Perhaps something smarter should be done to detect file overwriting, or to toss the 'oldest' buffer_heads first.

In refill_inactive(): if the number of buffer_heads is excessive then strip buffers from pages as they move onto the inactive list. This change is useful for all filesystems. This approach is good because pages which are being repeatedly overwritten will remain on the active list and will retain their buffers, whereas pages which are not being overwritten will be stripped.

[PATCH] rmap pte_chain speedup and space saving

The pte_chains presently consist of a pte pointer and a `next' link. So there's a 50% memory wastage here as well as potential for a lot of misses during walks of the singly-linked per-page list.

This patch increases the pte_chain structure to occupy a full cacheline. There are 7, 15 or 31 pte pointers per structure rather than just one. So the wastage falls to a few percent and the number of misses during the walk is reduced.

The patch doesn't make much difference in simple testing, because in those tests the pte_chain list from the previous page has good cache locality with the next page's list.

The patch sped up Anton's "10,000 concurrently exiting shells" test by 3x or 4x. It gives a 10% reduction in system time for a kernel build on 16p NUMAQ.

It saves memory and reduces the amount of work performed in the slab allocator.

Pages which are mapped by only a single process continue to not have a pte_chain. The pointer in struct page points directly at the mapping pte (a "PageDirect" pte pointer). Once the page is shared a pte_chain is allocated and both the new and old pte pointers are moved into it.

We used to collapse the pte_chain back to a PageDirect representation in page_remove_rmap().
That has been changed. That collapse is now performed inside page reclaim, via page_referenced(). The thinking here is that if a page was previously shared then it may become shared again, so leave the pte_chain structure in place. But if the system is under memory pressure then start reaping them anyway.

[PATCH] resurrect CONFIG_HIGHPTE

Bill Irwin's patch to fix up pte's in highmem.

With CONFIG_HIGHPTE, the direct pte pointer in struct page becomes the 64-bit physical address of the single pte which is mapping this page. If the page is not PageDirect then page->pte.chain points at a list of pte_chains, which each now contain an array of 64-bit physical addresses of the pte's which are mapping the page.

The functions rmap_ptep_map() and rmap_ptep_unmap() are used for mapping and unmapping the page which backs the target pte.

The patch touches all architectures (adding do-nothing compatibility macros and inlines). It generally mangles lots of header files and may break non-ia32 compiles. I've had it in testing since 2.5.31.

atari_rootsec.h moved to fs/partitions/atari.h, but somehow the version in include/linux didn't get deleted.

The scheduler should complain not just about interrupts, but also about being called whenever we're holding any other preemption locks.

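A side note on the rmap pte_chain entry above: the "7, 15 or 31 pte pointers per structure" figures are consistent with one cacheline of pointer-sized slots, minus one slot for the `next' link. The sketch below only illustrates that arithmetic. The 4-byte slot size and the 32, 64 and 128 byte cacheline sizes are assumptions chosen because they match the quoted numbers, and the function names are hypothetical.

```python
# Hedged sketch of the pte_chain sizing arithmetic from the rmap entry.
# Assumptions (not stated in the changelog): 4-byte pointer slots and
# cachelines of 32, 64 or 128 bytes; one slot is reserved for the next link.


def ptes_per_chain(cacheline_bytes, slot_bytes=4):
    """pte slots left in a cacheline-sized pte_chain after the next link."""
    return cacheline_bytes // slot_bytes - 1


def link_overhead(cacheline_bytes, slot_bytes=4):
    """Fraction of the structure spent on the next link rather than ptes."""
    return 1 / (cacheline_bytes // slot_bytes)


if __name__ == "__main__":
    # The old layout: one pte pointer plus one next link, i.e. two slots.
    print("old layout overhead:", link_overhead(8))
    for line in (32, 64, 128):
        print(line, ptes_per_chain(line), link_overhead(line))
```

Under these assumptions the old two-slot layout spends half of each structure on the link, while the cacheline-sized layout spends between about 3% and 12.5%, in line with the entry's "50%" and "a few percent".
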
[PATCH] drivers_net_pcmcia_fmvj18x_cs.c save_flags unsigned check The function save_flags must use an unsigned long parameter instead of a long (signed) one. This trivial patch solves the problem. [PATCH] Typos in drivers_s390_net_iucv.h [PATCH] drivers_net_arcnet_arcnet.c save_flags unsigned check The function save_flags must use unsigned long instead of long (signed). This trivial patch solves the problem. [PATCH] drivers_net_hamradio_scc.c save_flags unsigned check The function save_flags must use unsigned long instead of long (signed). This trivial patch solves the problem. [PATCH] Comment fix asm-i386_hardirq.h [PATCH] drivers_net_3c505.c save_flags unsigned check The function save_flags must use unsigned long instead of long (signed). This trivial patch solves the problem. [PATCH] Domsch zip code change Trivial patch changes my zip code. Applies to 2.4.x and 2.5.x trees. [PATCH] [patch, 2.5] fix errorpath in apne.c [PATCH] header cleanup - drivers_char_dz.c has the normal idempotent construction. The attached file removes the second #include. [PATCH] drivers_net_pcmcia_3c574_cs.c save_flags unsigned check The function save_flags must use an unsigned long parameter instead of a long (signed) one. This trivial patch solves the problem. [PATCH] drivers_net_ni65.c save_flags unsigned check The function save_flags must use unsigned long instead of long (signed). This trivial patch solves the problem. [PATCH] Designated initializers for shm The old form of designated initializers is obsolete: we need to replace it with the ISO C form before 2.6. Gcc has always supported both forms anyway. [PATCH] typo: include_linux_pci_ids.h s_DEVIDE_DEVICE [PATCH] Designated initializers for cs46xx drivers The old form of designated initializers is obsolete: we need to replace it with the ISO C form before 2.6. Gcc has always supported both forms anyway. [PATCH] [patch 2.5] at1700 trivial Bad error path: ret is already set to -ENODEV, so there is no need to set it again before jumping out.
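The designated-initializer conversions listed above (shm, cs46xx) replace the old GNU `field: value' spelling with the ISO C `.field = value' spelling. A minimal sketch of what such a conversion looks like follows; struct demo_ops and its members are made up for illustration and do not come from any of the patched drivers.

```c
#include <stddef.h>

/* Hypothetical operations table, loosely shaped like a kernel "ops" struct. */
struct demo_ops {
    int (*open)(void);
    int (*close)(void);
    const char *name;
};

static int demo_open(void)  { return 1; }
static int demo_close(void) { return 2; }

/*
 * Old GNU extension, shown only as a comment because it is obsolete:
 *
 *     static struct demo_ops demo = {
 *         open:  demo_open,
 *         close: demo_close,
 *     };
 */

/* ISO C form. Members that are not named (here: name) are zero-initialised. */
static struct demo_ops demo = {
    .open  = demo_open,
    .close = demo_close,
};
```

Both spellings produce the same object; only the ISO C one is portable to compilers other than gcc.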
[PATCH] remove duplicated AGP Config.in drivers/char/Config.in still has a complete copy of agp/Config.in. It's an exact cut-n-paste - the md5sums even match. (: [PATCH] 2.5.31 spell_typo fix [PATCH] drivers_scsi_aic7xxx_aic7xxx_core.c, typo: the the [PATCH] 2.5.31_drivers_char_lp.c This is a trivial patch already applied in the -ac tree for the 2.4.19 kernel. The patch for lp.c avoids +/- operations with 0 and explicitly marks some debug information as KERN_INFO or KERN_ERR. [PATCH] drivers_net_de600.c save_flags unsigned check The function save_flags must use unsigned long instead of long (signed). This trivial patch solves the problem. [PATCH] header cleanup - drivers_char_serial_tx3912.c has the normal idempotent construction. The attached file removes the second #include. [PATCH] Re: header cleanup - drivers_ieee1394_sbp2.c has the normal idempotent construction on every architecture. The attached file removes the second #include. [PATCH] drivers_net_pcmcia_aironet4500_cs.c save_flags unsigned check The function save_flags must use an unsigned long parameter instead of a long (signed) one. This trivial patch solves the problem. [PATCH] drivers_net_pcmcia_smc91c92_cs.c The function save_flags must use an unsigned long parameter instead of a long (signed) one. This trivial patch solves the problem. [PATCH] drivers_net_at1700.c save_flags unsigned check The function save_flags must use unsigned long instead of long (signed). This trivial patch solves the problem. [PATCH] designated initializer patches for fs_nfs Here are some patches for C99 initializers in fs/nfs. Patches are against 2.5.32. [PATCH] sleeping file locks - Add FL_SLEEP flag to indicate we intend to sleep and therefore desire to be placed on the block list. Use it for POSIX & flock locks. - Remove locks_block_on. - Change posix_unblock_lock to eliminate a race that will appear once we don't use the BKL any more. - Update the comment for locks_same_owner() and rename it to posix_same_owner().
- Change locks_mandatory_area() to allocate its lock on the stack and call posix_lock_file() instead of repeating that logic. - Rename the "caller" parameter to posix_lock_file() to "request" to better show that this is not to be inserted directly. - Redo some of the proc code a little. Stop exposing kernel addresses to userspace (whoever thought _that_ was a good idea?!) and show how we should be printing the device name. The last part is ifdeffed out to avoid breaking lslk. - Remove FL_BROKEN. And there was much rejoicing. [PATCH] ftape EXPORT_SYMBOL damage clean-up The reason for the ftape mess-up of export-objs is the usage of the strange FT_KSYM macro in ftape_syms.c. That exists solely for backwards compatibility with kernel 2.1.18 and older. Better clean it up. [PATCH] Remove unused Config.help When drivers/serial was split off, the following helptexts should have been deleted, but weren't. [PATCH] remove SERIAL_IO_GSC SERIAL_IO_GSC was a mistake and should never have been added. Oops, lost ID in 2.4.x merge Missing , yet testing the kernel version arm icside update Update of the legacy ide controller drivers. mainly the IN_BYTE -> inb() and preparation for truly modular low level drivers. aec62xx update alim15x3 update amd74xx update cmd640 update cmd64x update cs5530 update cy82c693 update hpt34x update hpt366 update it8172 update ns87145 update opti621 update promise update pdcadma update piix update rz1000 update serverworks update sis5513 update sl82c105 update slc90e66 update trm290 update via update adma100 update generic ide pci init code add driver for pci ide nvidia chipset add low level driver for sis sata controller ppc low level ide driver updates ide-cd updates: o kill silly ide_cdrom_end_request() function, it only duplicates ide core code.
o use the atapi error, status, ireason, etc types o use ide-iops functions, not IN_BYTE etc o use blk_fs_request() where appropriate o limit retries on MEDIUM_ERROR sense key o use new ide_end_request() that handles nr_sectors o rename ->reinit to ->attach ide-disk updates: o ide-iops changes o ide_end_request() now takes a nr_sectors argument, driver->end_request as well o remove idedisk_end_request(), it's a duplicate of ide core helper o byte -> u8 o ->reinit is now ->attach (to match 2.4.20-pre5-ac) ide-dma updates: o ide-iops changes o driver->end_request and ide_end_request changes o ->dmaproc() is now split into separate functions o work on new mmio adapters o init cleanup ide-floppy updates: o byte -> u8 o remove various status register definitions, these are now ata (atapi) generic o ide-iops changes o remove idefloppy_end_request(), dupe of ide core helper o driver->end_request changes o lots of style cleanups o update to new dma interface o ->reinit to ->attach updates ide-geometry updates: o byte -> u8 o small style cleanups new pci init code ide-pnp updates: o remove *_FUNC abstraction o remove MODULE ifdefs o small style changes ide-probe updates: o byte -> u8 o drive_is_flashcard() moved to probe code o ide-iops changes o various cleanups o remove useless ide_lock debug stuff ide-proc updates: o remove low level driver ifdef mess o allow "host" to register into proc list instead ide-tape update: o byte -> u8 o remove various register structs, it's ide general now o ide-iops changes o various style cleanups o update to new ide-dma api o remove idetape_do_end_request(), dupe of ide core helper o ->reinit to ->attach changes ide-taskfile updates: o ide-iops changes (mainly moving stuff to ide-iops.c) o byte -> u8 o update to new ide-dma api o driver->end_request changes o various style cleanups o remove ALTSTAT_SCREW_UP stuff o WAIT_CMD -> WAIT_WORSTCASE interrupt timeout o add (commented out) various ata commands to match 2.4.20-pre5-ac o move the 
flagged_* interrupt handlers ide_modes.h updates: o byte -> u8 ide core updates, and addition of ide-iops.c update ide/ Makefile to match new file/dir layout ide configure updates add ide-lib helpers ide-scsi updates: o byte -> u8 o use atapi register definitions o update to ide-iops changes o driver->end_request() changes o update to new ide-dma api o ->reinit to ->attach arch ide updates. mainly ide_ioreg_t type changes, and removal of silly old irq and region registration etc. missed pdc4030.h update: o silly IS_PDC4030_DRIVE definition ide_map_buffer() and ide_unmap_buffer() could cause imbalanced calls to bio_kmap/kunmap_irq(), which would screw the preemption count. pass in rq to ide_unmap_buffer() as well to make the right decision. bio.h: clean up with bio_kmap_irq() thing properly. remove the micro optimization of _not_ calling kmap_atomic() if this isn't a highmem page. we could keep that and do the inc_preempt_count() ourselves, but I'm not sure it's worth it and this is cleaner. JFS: add permission checks before getting or setting xattrs [TIGON3]: Do not reference vlgrp unless TG3_VLAN_TAG_USED is set. [PATCH] alpha update - signal update; make do_signal use generic get_signal_to_deliver() - irqs_disabled macro - remove vmlinux.lds.s target from arch/alpha/Makefile since it works correctly in the top level Makefile - extra argument for pcibios_enable_device (most likely we'll never use it though...) [PATCH] zftape: Cleanup zftape_syms.c Removed compatibility cruft from zftape_syms.c. There is no need to be compatible with kernel 2.1.18 and older. Replaced FT_KSYM with direct call to EXPORT_SYMBOL. [PATCH] drivers/char/Makefile: Remove pty.o from export-objs Remove pty.o from the export-objs list, since pty.c does not export any symbols. A /* EXPORT_SYMBOL */ comment may have fooled the original author. [PATCH] exit.c compilation warning fix I forgot to remove an unused label in the deadlock fix patch. 
[PATCH] sys_exit_group(), threading, 2.5.34 This is another step to have better threading support under Linux, it implements the sys_exit_group() system call. It's a straightforward extension of the generic 'thread group' concept, which extension also comes handy to solve a number of problems when implementing POSIX threads. POSIX exit() [the C library function] has the following semantics: all thread have to exit and the waiting parent has to get the exit code that was specified for the exit() function. It also has to be ensured that every thread has truly finished its work by the time the parent gets the notification. The exit code has to be propagated properly to the parent thread even if not the thread group leader calls the exit() function. Normal single-thread exit is done via the pthread_exit() function, which calls sys_exit(). Previous incarnations of Linux POSIX threads implementations chose the following solution: send a 'thread management' signal to the thread group leader via tkill(), which thread goes around and kills every thread in the group (except itself), then calls sys_exit() with the proper exit code. Both old libpthreads and NGPT use this solution. This works to a certain degree, unless a userspace threading library uses the initial thread for normal thread work [like the new libpthreads], which 'work' can cause the initial thread to exit prematurely. At this point the threading library has to catch the group leader in pthread_exit() and has to keep the management thread 'hanging around' artificially, waiting for the management signal. Besides being slightly confusing to users ('why is this thread still around?') even this variant is unrobust: if the initial thread is killed by the kernel (SIGSEGV or any other thread-specific event that triggers do_exit()) then the thread goes away without the thread library having a chance to intervene. 
the sys_exit_group() syscall implements the mechanism within the kernel, which, besides robustness, is also *much* faster. Instead of the threading library having to tkill() every thread available, the kernel can use the already existing 'broadcast signal' capability. (the threading library cannot use broadcast signals because that would kill the initial thread as well.) as a side-effect of the completion mechanism used by sys_exit_group() it was also possible to make the initial thread hang around as a zombie until every other thread in the group has exited. A 'Z' state thread is much easier to understand by users - it's around because it has to wait for all other threads to exit first. and as a side-effect of the initial thread hanging around in a guaranteed way, there are three advantages: - signals sent to the thread group via sys_kill() work again. Previously if the initial thread exited then all subsequent sys_kill() calls to the group PID failed with a -ESRCH. - the get_pid() function got faster: it does not have to check for tgid collision anymore. - procps has an easier job displaying threaded applications - since the thread group leader is always around, no thread group can 'hide' from procps just because the thread group leader has exited. [ - NOTE: the same mechanism can/will also be used by the upcoming threaded-coredumps patch. ] there's also another (small) advantage for threading libraries: eg. the new libpthreads does not even have any notion of 'group of threads' anymore - it does not maintain any global list of threads. Via this syscall it can purely rely on the kernel to manage thread groups. the patch itself does some internal changes to the way a thread exits: now the unhashing of the PID and the signal-freeing is done atomically. This is needed to make sure the thread group leader unhashes itself precisely when the last thread group member has exited. 
(the sys_exit_group() syscall has been used by glibc's new libpthreads code for the past couple of weeks and the concept is working just fine.) LLC: small cleanups, leave debug on for a while . dprintk already puts the log level . fix some comments to match new behaviour LLC: tcpfying the beast . s/mac_indicate/llc_rcv/g . s/llc_sap_send_ev/llc_sap_state_process/g . s/llc_station_send_ev/llc_station_state_process/g . s/llc_conn_send_ev/llc_conn_state_process/g . fix some comments wrt current behaviour . s/llc_find_sock/llc_lookup_established/g . llc_sock_alloc now receives the protocol family as a parameter, will be used by llc_lookup_listener to properly handle multiple upper layer protocols . s/inline/__inline__/g LLC: sys_listen already checks for backlog > SOMAXCONN also remove tests against SOCK_SEQPACKET, it is not supported in llc_ui_create. LLC: llc_build_and_send_pkt Rename llc_data_req_handler with llc_build_and_send_pkt, following my plan to have LLC look more like TCP/IP and to slowly remove all the ugly prim types and sap->{req,ind,conf}. No problems with Appletalk up to now as it only uses UI and I'm up to now only concentrating on connection mode, so that we can remove all the duplicated work in core and PF_LLC. LLC: kill llc_prim_data and LLC_PRIM_DATA for sap->ind() and sap->conf() On the road to kill all prims, llc_prim_data bits the dust, now the core queues the data directly and takes care of the conf semantics, i.e. waking up the upper layer when the confirmation arrives. Maybe I'll have to put more info on skb->cb for conf and ind, but for PF_LLC this is enough for now. Have to check NetBEUI tho. But we can always add back removed features, better than having features that nobody uses :-) [LLC] split llc_pdu_router into llc_{station,sap,conn}_rcv [LLC] llc_ui_wait_for_data and socket locking fixes . now llc_ui_accept uses llc_ui_wait_for_data (llc_ui_recvmsg probably will use it too, we'll see) . 
all the llc_ui_wait_for_ now receive the timeout in jiffies, not in seconds . use sk_rcvtimeo() . release_sock before going to sleep in the llc_ui_wait_for functions . llc_ui_release has to get the socket lock [LLC] use llc_mac_{match,null} in more places [LLC] turn tons of simple pdu functions into returning void All of those functions cannot possibly fail, so there is no point in always returning 0. I'll probably turn all of them into inlines in the future too. [LLC] use just one struct sock per connection With this PF_LLC is tightly integrated with the core and that is a good thing 8) . kill llc_ui_opt, the only non-duplicated bit is struct sockaddr_llc and this now lives in llc_opt . remove debug code from llc_sk_alloc/free (previously llc_sock_alloc/free) . the skbs allocated for event processing don't need to have any payload at all, just the skb->cb is enough, so remove the bogus 1 from alloc_skb calls . llc_conn_disc put on death row . llc_process_tmr_ev callers have to hold the socket lock . the request functions in llc_if.c don't hold the socket lock anymore, it's up to their callers on the socket layer (llc_sock.c) . llc_sk_alloc now receives a priority for sk_alloc call and is the only way to alloc a new sock (from llc_mac and llc_sock, bottom and top) . added the traditional struct sock REFCNT_DEBUG support for llc . llc_sock was simplified and is on the zen route to cleanliness, wait for the next patches, it'll shrink a lot when I zap all the crap (as in not needed) list handling, using the existing list maintained in struct llc_sap for that, probably splitting it in two, one for listening sockets and other for (being) established ones. Ah, and the sap->ind and sap->req and friends will die. [PATCH] Thread deadlock fix.. This fixes the old-pthreads breakage i can reproduce. the fix is to only do the thread-group exit-completion logic in case of thread-groups. [TIGON3]: Fix slight perf regression from TSO changes.
- Keep cache of previously written vlan_tag value in TX ring. Avoid the TX descriptor write if they match. [VLAN] Use unregister_netdevice to prevent rtnl double-lock. - vlan_device_event is called by the networking with rtnl_lock held already, so if we use unregister_netdev we hang trying to get the rtnl semaphore again. [TIGON3]: New way to flush posted writes of GRC_MISC_CFG. - The indirect register trick does not work so well on some 5701 variants, so just read back PCI_COMMAND to do this. [NAPI]: Do not check netif_running() in netif_rx_schedule_prep. Allocate system call numbers: 250 and 251 for hugetlb, with 252 for exit_group [PATCH] HandyTech HandyLink patch HandyTech's Braille displays support a USB port, those are implemented with a GoHubs usb serial converter. The only difference is that the pID is 0x1200, not 0x1000. [PATCH] usbnet, Epson client * Tells about some Epson firmware that uses this as part of a Linux interop solution (PDA-ish SoCs, hmm) * Includes some GeneSys info from emails * Minor cleanups [PATCH] ehci misc fixes This removes some bugs: - a short read problem with control requests - only creates one control qh (memleak fix) - adds an omitted hardware handshake - reset timeout in octal, say what? - a couple BUG()s outlived their value Plus it deletes unused stub code for split ISO and updates some internal doc. [PATCH] fix for error handling in microtek [PATCH] new ids for hpusbscsi new device ids for hpusbscsi [PATCH] open/close fix for kaweth this handles the error case. USB: compile time fix for previous kaweth patch. [PATCH] 2.5.X config: USB speedtouch driver Minor nit: the subject driver depends on ATM, so a config-time check to see if ATM support is enabled is appropriate. LLC: llc_lookup_listener With this LLC_CONN_PRIM and friends went to the death row, next patch will introduce llc_establish_connection, turning on the electric chair switch for LLC_CONN_PRIM et al. [LLC] llc_establish_connection & LLC_CONN_PRIM bits the bucket . 
Bzzzt, rest in peace LLC_DATA_PRIM. We won't miss you. . In the process I also killed sap->resp and all of the functions it was calling, the Procom guys left this in the codebase but _nobody_ was actually using it. Few small fixes for Q40 keyboard support. [PATCH] ehci, async idle timeout One more patch: this turns off async schedule processing if there are no control or bulk transactions for a while (currently HZ/3). Consequence: no PCI accesses unless there's work to do. (And a FIXME comment is gone!) The following patch shaves six bytes from the loaded size of pcspkr.o and another 90 elsewhere in the .o file. Change "D: Drivers=" to "H: Handlers=" in /proc/bus/input/devices. [PATCH] Building list of drives in right order ata_attach in linux-2.5.34/drivers/ide/ide.c builds a list of IDE drives that do not yet have a device driver bound to them, in case ide-disk, ide-scsi, or whatever driver you want to use is not loaded yet. The problem was that ata_attach was adding to the head of the list, so the list was being built in reverse order. So, if you had two IDE disks, and ide-disk was a loadable module, the devfs entries for the disks would be numbered in reverse (the first disk would be /dev/discs/disc1, and the second would be /dev/discs/disc0). This fixes the problem by changing the relevant list_add to list_add_tail. Incidentally, the generic code in drivers/base/ already does it this way. [LLC] llc_send_disc & LLC_DISC_PRIM bites the dust [PATCH] fixes for races in kaweth probe using init_etherdev(0, 0) in probe is a race. The struct net_device must be allocated and filled before init_etherdev is called, or there's a race which creates a network interface that isn't usable. The patch for kaweth for 2.5 fixes it. [LLC] add missing kfree_skb in llc_conn_state_process This one fixes a skb leak in disconnection notification.
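The drive-ordering fix above comes down to the difference between inserting at the head and at the tail of a circular doubly-linked list. The following is a userspace sketch of that difference, not the kernel's list.h: the list helpers are re-implemented here under the same names for illustration only, and struct drive and collect_ids() are hypothetical.

```c
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry between two neighbours that are known to be adjacent. */
static void list_insert(struct list_head *entry,
                        struct list_head *prev, struct list_head *next)
{
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
}

/* Insert right after the head: the newest entry is seen first. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    list_insert(entry, head, head->next);
}

/* Insert right before the head: the newest entry is seen last. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    list_insert(entry, head->prev, head);
}

/* Hypothetical drive record; link must stay the first member. */
struct drive {
    struct list_head link;
    int id;
};

/* Copy the ids into out[] in list order and return how many were copied. */
static int collect_ids(struct list_head *head, int *out, int max)
{
    int n = 0;
    for (struct list_head *p = head->next; p != head && n < max; p = p->next)
        out[n++] = ((struct drive *)p)->id;
    return n;
}
```

Adding drives 0 and 1 with list_add() yields the order 1, 0 when the list is walked, which matches the reversed numbering described in the entry; list_add_tail() yields 0, 1.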
[PATCH] usbmidi patch I have changed the name of a local variable "l" to be "j", because with some fonts should be difficult to see if [1+l+i] means [2+i] or what. [PATCH] UML arch (user-mode Linux) This patch implements UML for 2.5.34. [LLC] remove unsupported flowcontrol prim bits kbuild: Fix up non-verbose mode Just some cosmetical changes to align output in non-verbose mode. kbuild: Fix copying of shipped files When using cp to copy the shipped file to its actual name, permissions would be preserved, particularly the copy would be read-only when the original was (BitKeeper) read-only, leading to an error when executing the rule a second time. So now we use cat, which will generate a writable file. JFS: cleanup -- Remove excessive typedefs kbuild: Use normal rule for preprocessing vmlinux.lds.S Use the same rule as in Rules.make for preprocessing vmlinux.lds.S, that also gives automatic dependency tracking. This means we should also use the standard AFLAGS_... instead of CPPFLAGS_... to provide specific additional flags. [PATCH] Typo in do_syslog/__down_trylock lockup fix Linus spotted one cut-n-pasto ('tracing' argument) but didn't see the other: we were walking the ptrace_children list by the sibling field. So we got garbage for your task_structs when this happened. If the list wasn't empty, it would crash. Strace detaches from all tasks when it receives a Control-C so only with enough threads and SMP would this be easily seen. I needed this small patch if i8042.c is built as a module. Franz. Exporting kbd_pt_regs in keyboard.c. [PATCH] md - 1 of 3 - Remove BUG in md.c that change in 2.5.33 triggers. Since 2.5.33, the blk_dev[].queue is called without the device open, so md_queue_proc can no-longer assume that the device is open. [PATCH] md - 2 of 3 - Fix bug in raid5 AGAIN That recent bug fix in raid5 just changed the bug, it didn't fix it. I think that the original code was actually wrong, which didn't help. 
This time, the code actually matches the nearby comment, that has been expanded a bit, so I feel somewhat more confident that it is actually right. [PATCH] md - 3 of 3 - Fix compile errors when tracing enabled in MD both md.c and raid5.c can be compiled with debugging and compile errors in this code aren't normally noticed as they aren't even compiled. Now the debugging messages are compiled but optimised out so we will always see the errors. Current errors are fixed. [PATCH] kNFSd 1: New structure initialisers for lockd. Just the new structure initialisers. [PATCH] kNFSd 2: Lockd to shutdown without engaging with nfsd Currently, when lockd wants to invalidate all its clients, it asks nfsd to iterate through them. Now it iterates itself. [PATCH] kNFSd 3: Increase separation between lockd and nfsd. lockd currently asks nfsd for a 'client handle' for each request. This is used as a key for finding (or creating) a 'nlm_host' structure, so that there is only one of these per client...almost. There can currently be up to 4 nlm_hosts for a given client, depending on protocol (udp/tcp) or version (v1 or v4). But this isn't handled very well. So the question is: is there any advantage in having only one nlm_host per real host, or should we simply have one for each IP address that makes requests, whether they are separate hosts or not. The nlm_host structure is used: 1/ to hold a lockd rpc client for talking to the remote lockd. Having multiple lockd clients cannot hurt except possibly to waste a little space. 2/ to identify resources to free when we receive notification from statd that a client has restarted. As statd gets a hostname and looks up all IP addresses, and then sends a notification for each IP for which it has a registration, there is no need to minimise the number of nlm_host structures (each of which register for monitoring). 3/ to identify resources to free when a client sends a "free_all" request.
If a client uses multiple IP addresses to create locks, and then sends free_all from just one IP address we will lose here. However it is not clear that a client would ever want to send a free_all request, and the linux client doesn't seem to, so there is unlikely to be any loss here. This patch does not ask nfsd for a client identifier, but rather finds an nlm_host based on IP, version, protocol (udp/tcp) and whether we are acting as NFS server or client. All of this information is then placed in the cookie that is passed to statd and returned by statd when the client restarts. Previously only the IP address was passed in the cookie, so possibly not all nlm_host structures would have been found. Because of these changes, lockd does not need to know anything about the nfsd export table, so the interface to nfsd is much more narrow. Another consequence is that when nfsd is told to delete a client, it cannot tell lockd to forget all the locks for that client. However it is not clear that lockd should ever forget any locks unless it is told to shutdown (or simulate a shutdown), and in any case, the current nfsd admin tools never tell nfsd to delete a client anyway. [PATCH] kNFSd 4: Discard svc_uidmap structure It is un-used and never will be. uid mapping will be done a different way (if at all). [PATCH] kNFSd 5: Get rid of ex_parent from svc_export I was never entirely sure what it was for, but it is not used now, only set, so it can go. [PATCH] kNFSd 6: Expose anon uid and gid in /proc/fs/nfs/exports Don't print if default, which should be "-2", but is currently 65534.. We really need a 32bit uid interface for 2.6. [PATCH] kNFSd 7: Discard cl_idlen It is never used [PATCH] kNFSd 8: Don't store path in exports table. Instead, use d_path to find path from dentry/vfsmnt. This requires allocating a buffer at exp_open time, and releasing it when closing.
[PATCH] kNFSd 9: Discard cl_addr We currently store the address list with each client and use it only to print out comments on /proc/fs/nfs/exports While these can be helpful, they are not critical and could be added back later after we restructure the exports table. [PATCH] kNFSd 10: Discard ex_dev and ex_ino from svc_export They can be deduced from ex_dentry [PATCH] kNFSd 11: Remove problematic "security" checks when NFS exporting. The nfs server currently doesn't allow you to export both a directory and an ancestor of that directory on the same filesystem. This check is more of a problem than a solution and can be done in user-space if needed, so it is removed. The potential for a security problem is because the files below the lower directory could be accessed as though they were under either of the export points, and so the access control that is applied might not be what is expected (by the naive admin). e.g. export /a as readwrite and /a/b as readonly. Then /a/b/c can be accessed readwrite as it is in /a, which might not be the intent. Alerting the user to this can be done in userspace though. The current restriction also stops exporting / as readonly and /tmp as read-write which some people want to do. Provided /tmp is also exported subtree_check (the default) there is no security issue here. [PATCH] kNFSd 12: Change exp_parent to walk directory tree, not hash table. Currently get_parent (needed to find the exportpoint above a given dentry) walks the hash table of export points checking each with is_subdir. Now it walks up the d_parent link checking each for membership in the hashtable. nfsd_lookup currently does that walk too (when crossing a mountpoint backwards) so the code gets unified. This approach makes more sense as we move towards a cache for export information that can be filled on demand. It also assumes less about the hash table (which will change). [PATCH] kNFSd 13: Separate out the multiple keys in the export hash table.
Currently each entry in the export table had two hash chains going through it, one for hash-by-dev/ino, one for hash-by-fsid. This is contrary to the goal of a simple hash table structure. The two hash-tables per client are replaced by one which stores 'exp_key's which contain the key (as a file handle fragment) and a pointer to the real export entry. The export entries are then all stored in a single hash table indexed by client+vfsmount+dentry. [PATCH] kNFSd 14: Filehandle lookup makes use of new export table structure. Filehandle lookup currently breaks out the interesting pieces of a filehandle and passes them to exp_get or exp_get_fsid, which put the pieces back into a filehandle fragment. We define a new interface "exp_find" which does a lookup based on a filehandle fragment to avoid this double handling. In the process, common code in exp_get_key and exp_get_fsid_key is united into exp_find_key. Also, filehandle composition now uses the mk_fsid_v? inline functions. [PATCH] kNFSd 15: Unite per-client export key hash tables. Instead of a separate hash table per client we now have one hash table which includes the client in the key. [PATCH] kNFSd 16: Remove per-client list of exports. This is used: to iterate all exports when making /proc/fs/nfs/exports to find all exports of a client to unexport them. The first can just as easily be done by iterating the export_table hash table. The second is very rarely called and can be done by iterating the hash table looking for exports for the given client. [PATCH] md - Fix problems with freeing gendisk in md.c md currently tries to set_capacity() *after* freeing the gendisk structure. It also frees the gendisk even when switching to read-only. That patch open-codes free_mddev (which is only called once) and cleans all this up. [LLC] save sockaddr_llc info in connection packets Also only unassign the sock from the sap if the socket is not zapped, because autobind can fail, leaving it unassigned...
Noticed with llcping/llcpingd from Jay, that I'm using now to test PF_LLC SOCK_DGRAM (xid, test, ui). Also add more debugging calls, disabled by default in mainline. [NAPI]: Set SCHED before dev->open, clear if fails. Restore netif_running check to netif_rx_schedule_prep. [TIGON3]: Use spin_lock_irqsave in tg3_interrupt, fixes SMP hang. [TIGON3]: Add 5704 support. ppc64: xtime.tv_nsec fixes ppc64: DISCONTIGMEM updates, rework to be like x86 version ppc64: add in_atomic ppc64: updates from Rochester ppc64: EEH update from Todd Inglett ppc64: Allocate RTAS above OF, from Peter Bergner ppc64: new pci config methods, from Todd Inglett ppc64: updates from Rochester ppc64: UP compile fixes [LLC] kill sap->req() Intermediate patch for the PF_LLC SOCK_DGRAM prim clean-up, now PF_LLC is prims in the sending side, now to hack the core to not use prims to send to PF_LLC. This also fixes a skb leak on llc_sap_state_process. [PATCH] ptrace-fix-2.5.34-A2, BK-curr I distilled the attached fix-patch from Daniel's bigger patch - it includes all fixes for all currently known ptrace related breakages, which include things like bad behavior (crash) if the tracer process dies unexpectedly. [PATCH] sys_exit() threading improvements, BK-curr This implements the 'keep the initial thread around until every thread in the group exits' concept in a different, less intrusive way, along your suggestions. There is no exit_done completion handling anymore, freeing of the task is still done by wait4(). This has the following side-effect: detached threads/processes can only be started within a thread group, not in a standalone way. (This also fixes the bugs introduced by the ->exit_done code, which made it possible for a zombie task to be reactivated.) I've introduced the p->group_leader pointer, which can/will be used for other purposes in the future as well - since from now on the thread group leader is always existent. 
Right now it's used to notify the parent of the thread group leader from the last non-leader thread that exits [if the thread group leader is a zombie already].

[TIGON3]: GRC_MISC_CFG_BOARD_ID_5704CIOBE is wrong...

kernel/signal.c: Not all systems have SIGSTKFLT.

[SPARC]: Catchup with signal infrastructure changes.

[SPARC]: pcibios_enable_device has new mask argument.

[SPARC64]: timespecs now have tv_nsec in place of tv_usec.

[SPARC64]: Delete do_gettimeofday asm.

[SPARC]: Update ide headers.
WARNING: this is known broken, fixes coming from Jens Axboe.
- Jens needs to separate out the IN/OUT macros to distinguish accesses to the IDE_DATA register from the rest. On big-endian platforms the IDE_DATA register should be accessed in big-endian for it to all work out correctly or at least be compatible with the behavior existing before the IDE platform macro interface changes in 2.5.x

[SPARC64]: Add rwlock_is_locked and in_atomic.

arch/sparc64/defconfig: Update.

arch/sparc/kernel/check_asm.sh: Handle output from newer versions of GCC.

[SPARC]: Add rwlock_is_locked.

[SPARC]: Add is_atomic.

[SPARC]: Update for tv_nsec in xtime.

[SPARC]: Add irqs_disabled.

[SPARC]: Add kmap_atomic_to_page.

[SPARC]: Add sys_exit_group syscall entries.

net/ipv4/ip_options.c: IPOPT_END padding needs to increment optptr.

include/asm-sparc/hardirq.h: Fix comment.

[LLC]: Fix build bustage.

[PATCH] NMI watchdog SMP fix
This makes NMIs work - otherwise they go to CPU 0 only and any hard lockup on the other CPUs will not be detected by the nmi_watchdog.

[PATCH] readv/writev speedup
This is Janet Morgan's patch which converts the readv/writev code to submit all segments for IO before waiting on them, rather than submitting each segment separately. This is a critical performance fix for O_DIRECT reads and writes. Prior to this change, O_DIRECT vectored IO was forced to wait for completion against each segment of the iovec rather than submitting all segments and waiting on the lot.
ie: for ten segments, this code will be ten times faster.
There will also be moderate improvements for buffered IO - smaller code paths, plus writev() only takes i_sem once.
The patch ended up quite large unfortunately - turned out that the only sane way to implement this without duplicating significant amounts of code (the generic_file_write() bounds checking, all the O_DIRECT handling, etc) was to redo generic_file_read() and generic_file_write() to take an iovec/nr_segs pair rather than `buf, count'.
New exported functions generic_file_readv() and generic_file_writev() have been added:

ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos);
ssize_t generic_file_writev(struct file *file, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos);

If a driver does not use these in its file_operations then it will continue to use the old readv/writev code, which sits in a loop calling fops->read() or fops->write().
ext2, ext3, JFS and the blockdev driver are currently using this capability.
Some coding cleanups were made in fs/read_write.c. Mainly:
- pass "READ" or "WRITE" around to indicate the direction of the operation, rather than the (confusing, inverted) VERIFY_READ/VERIFY_WRITE.
- Use the identifier `nr_segs' everywhere to indicate the iovec length rather than `count', which is often used to indicate the number of bytes in the syscall. It was confusing the heck out of me.
- Some cleanups to the raw driver.
- Some additional generality in fs/direct_io.c: the core `struct dio' used to be a "populate-and-go" thing. Janet has broken that up so you can initialise a struct dio once, then loop around feeding it more file segments, then wait on completion against everything.
- In a couple of places we needed to handle the situation where we knew, a-priori, that the user was going to get a short read or write. File size limit exceeded, read past i_size, etc.
We handled that by shortening the iovec in-place with iov_shorten(). Which is not particularly pretty, but neither were the alternatives.

[PATCH] Use a sync iocb for generic_file_read
This adds support for synchronous iocbs and converts generic_file_read to use a sync iocb to call into generic_file_aio_read. The tests I've run with lmbench on a piii-866 showed no difference in file re-read speed when forced to use a completion path via aio_complete and an -EIOCBQUEUED return from generic_file_aio_read -- people with slower machines might want to test this to see if we can tune it any better. Also, a bug fix to correct a missing call into the aio code from the fork code is present. This patch sets things up for making generic_file_aio_read actually asynchronous.

[LLC] remove all tmr ev structs & fix psnap and p8022 wrt ui sending
. No need for the timer_running member on llc_timer, we only need it in one place, and timer_pending is equivalent. One more procom OS generalisation killed.
. Move the skb->protocol assignment in llc_build_and_send_pkt routines and llc_ui_send_data to the caller, this is the common practice in Linux networking code (think netif_rx) and required to keep the request functions in psnap and p8022 simple.
. Remove the rpt_status (report status) ev members, not used at all, not even in the original procom code.
. Convert psnap and p8022 request functions to use llc_ui_build_and_send_ui_pkt, removing all the prim cruft.

[PATCH] PATCH - cset 1.497.59.25 breaks MD autodetect
The partition changes shifted a lot of indexes down one, but this one shouldn't have been shifted...

[PATCH] thread exit deadlock bug
This fixes the Mozilla SMP lockup in the exit path.

[PATCH] signal failures in nightly LTP test
On 13 Sep 2002, Paul Larson wrote:
>
> The nightly LTP test against the 2.5 kernel bk tree last night turned up
> some test failures we don't normally see. These failures did not show
> up in the run from the previous night.
[...]
> I found what was breaking this, looks like it was this change from your
> shared thread signals patch:
> -	if (sig < 1 || sig > _NSIG ||
> -	    (act && (sig == SIGKILL || sig == SIGSTOP)))
> +	if (sig < 1 || sig > _NSIG || (act && sig_kernel_only(sig)))
This fixes this bug and a number of others in the same class - the signal behavior bitmasks should never be consulted before making sure that the signal is in the word range.

[PATCH] 2.5.34-bk fcntl lockup
This fixes an endless loop without schedule, which happens as soon as smbd invokes fcntl64(7, F_SETLK64, ...).
fcntl_setlk64 gets cmd F_SETLK64, not the F_SETLK tested in the loop. Maybe the return value from posix_lock_file should be changed to -EINPROGRESS or -EJUKEBOX instead of testing the passed cmd in callers, but this one-liner works too.
If you prefer changing the posix_lock_file return value to clearly distinguish between -EAGAIN and lock request queued, I'll do that.

[PATCH] hide-threads-2.5.34-C1
I fixed up the 'remove thread group inferiors from the tasklist' patch.
I think i managed to find a reasonably good construct to iterate over all threads:

	do_each_thread(g, p) {
		...
	} while_each_thread(g, p);

the only caveat with this is that the construct suggests a single-loop - while it's two loops internally - and 'break' will not work. I added a comment to sched.h that warns about this, but perhaps it would help more to have naming that suggests two loops:

	for_each_process_do_each_thread(g, p) {
		...
	} while_each_thread(g, p);

but this looks a bit too long. I dont know. We might as well use it all unrolled and no helper macros - although with the above construct it's pretty straightforward to iterate over all threads in the system.

Make sure MTRR setting is atomic on SMP, since
- HT CPU's can share the MTRR state between cores
- the code uses static variables that are shared

[PATCH] wait4-fix-2.5.34-A0, BK-curr
the attached patch (against BK-curr) fixes a sys_wait4() bug noticed by Ulrich Drepper.
The kernel would not block properly if there are eligible children delayed due to the new delayed thread-group-leader logic.
The solution is to introduce a new 'eligible child' type - and skip over delayed children but set the wait4 flag nevertheless. The libpthreads testcase that failed due to it now works fine.

[PATCH] clone-fix-2.5.34-A0, BK-curr
This fixes a clone-flags bug noticed by Roland McGrath.
The current CLONE_DETACHED & CLONE_THREAD forcing code did things in the wrong order, which makes it possible to force an oops the following way:

	main ()
	{
		syscall(120, 0x00400000);
	}

instead of changing the order of CLONE_SIGHAND and CLONE_THREAD flag forcing (which would fix the bug), the proper approach is to fail with -EINVAL if invalid combinations of clone flags are detected. This change does not affect existing applications.

[PATCH] detached-fix-2.5.34-A0, BK-curr
This fixes three resource accounting related bugs introduced by detached threads:
- the 'child CPU usage' fields were updated in wait4 until now - this was slightly buggy for a number of reasons, eg. if the exit_code writeout faults then it's possible to trigger this code multiple times.
- those threads that do not go through wait4 were not properly accounted.
- sched_exit() was incorrectly assuming that current == parent. In the detached case p->parent is the real parent.
with this patch applied things like 'time' work again for new-style threaded apps.

[PATCH] exit-thread-2.5.34-A0, BK-curr
This optimizes sys_exit_group() to only take the siglock if it's a true thread group. Boots & works fine.

[PATCH] wait4-fix-2.5.34-B2, BK-curr
This fixes a number of bugs that broke ptrace:
- wait4 must not inhibit TASK_STOPPED processes even for thread group leaders.
- do_notify_parent() should not delay the notification of parents if the thread in question is ptraced.
strace now works as expected for CLONE_THREAD applications as well.
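
To illustrate the approach taken in clone-fix-2.5.34-A0 above (reject bad flag combinations with -EINVAL rather than silently forcing flags), here is a minimal user-space sketch. It is not the kernel's code: check_clone_flags() is a hypothetical helper, the flag values are restated from memory of the 2.5-era headers, and the two rules it enforces (CLONE_THREAD needs CLONE_SIGHAND, CLONE_DETACHED needs CLONE_THREAD) are an assumption about which combinations count as invalid.

```c
#include <errno.h>

/* Flag values restated here (assumed to match the 2.5-era
 * <linux/sched.h>) so the sketch builds without kernel headers. */
#define CLONE_SIGHAND  0x00000800UL
#define CLONE_THREAD   0x00010000UL
#define CLONE_DETACHED 0x00400000UL

/*
 * Hypothetical helper, not taken from the patch: returns 0 when the
 * combination is acceptable and -EINVAL otherwise.
 *
 * Assumed rules:
 *   - CLONE_THREAD requires CLONE_SIGHAND
 *   - CLONE_DETACHED requires CLONE_THREAD
 */
static int check_clone_flags(unsigned long flags)
{
	if ((flags & CLONE_THREAD) && !(flags & CLONE_SIGHAND))
		return -EINVAL;
	if ((flags & CLONE_DETACHED) && !(flags & CLONE_THREAD))
		return -EINVAL;
	return 0;
}
```

Under these assumptions the flag word from the test case above, 0x00400000 on its own, is rejected up front instead of reaching the code that used to oops, while a full CLONE_DETACHED | CLONE_THREAD | CLONE_SIGHAND request is still accepted.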

[PATCH] exit-fix-2.5.34-C0, BK-curr
This fixes one more exit-time resource accounting issue - and it's also a speedup and a thread-tree (to-be thread-aware pstree) visual improvement.
In the current code we reparent detached threads to the init thread. This worked but was not very nice in ps output: threads showed up as being related to init. There was also a resource-accounting issue, upon exit they update their parent's (ie. init's) rusage fields - effectively losing these statistics. Eg. 'time' under-reports CPU usage if the threaded app is Ctrl-C-ed prematurely.
The solution is to reparent threads to the group leader - this is now very easy since we have p->group_leader cached and it's also valid all the time. It's also somewhat faster for applications that use CLONE_THREAD but do not use the CLONE_DETACHED feature.

[PATCH] thread-exec-2.5.34-B1, BK-curr
This implements one of the last missing POSIX threading details - exec() semantics. Previous kernels had code that tried to handle it, but that code had a number of disadvantages:
- it only worked if the exec()-ing thread was the thread group leader, creating an asymmetry. This does not work if the thread group leader has exited already.
- it was racy: it sent a SIGKILL to every thread in the group but did not wait for them to actually process the SIGKILL. It did a yield() but that is not enough. All 'other' threads have to finish processing before we can continue with the exec().
This adds the same logic, but extended with the following enhancements:
- works from non-leader threads just as much as the thread group leader.
- waits for all other threads to exit before continuing with the exec().
- reuses the PID of the group.
It would perhaps be a more generic approach to add a new syscall, sys_ungroup() - which would do largely what de_thread() does in this patch.
But it's not really needed now - posix_spawn() is currently implemented via starting a non-CLONE_THREAD helper thread that does a sys_exec().
There's no API currently that needs a direct exec() from a thread - but it could be created (such as pthread_exec_np()). It would have the advantage of not having to go through a helper thread, but the difference is minimal.

Use CLONE_KERNEL for the common kernel thread flags.

PPC32: extra argument for pcibios_enable_resources/device

PPC32: add argument to INIT_SIGNALS use in arch/ppc/kernel/process.c

PPC32: convert xtime usage from timeval to timespec

PPC32: define atomic_add_negative

PPC32: allocate syscall #s for alloc/free_hugepages and exit_group and add exit_group to the syscall table.

PPC32: remove the ppc32-specific ide_fix_driveid. There is a perfectly good one in drivers/ide/ide-iops.c now.

PPC32: define kmap_atomic_to_page

PPC32: remove unused IDE functions from include/asm-ppc/ide.h. This gets rid of ide_request/free_irq, ide_get/release_lock, ide_check/request/release_region etc.

PPC32: define rwlock_is_locked().

[PATCH] thread exec fix, BK-curr
The broadcast SIGKILL was kept pending in the new thread as well, and killed it prematurely ...

Linux v2.5.35