Data Recovery

Welcome to DjLizard's data recovery guide.

As this is an editorial with lots of opinions, anecdotal evidence, and personal experience, first-person informal particles "I", "me", and "my" will be used. This is unlike the rest of this wiki which is written in a hard-facts technical style. My apologies if you were expecting something more professional.

If your drive is physically failing, DO NOT use filesystem-level recovery software (OnTrack, GetDataBack, and so on) just yet. You must fix the physical problem before you fix filesystem problems. Get a clone of your failing drive first, then check out the list of filesystem-level fix/recovery software at the bottom of SH/SC's data recovery page.

Introduction

The meat of this guide describes using RIPLinux (and Linux in general) to clone a failing drive, but you can use RIPLinux (or most Linux rescue discs) and this guide to perform other data tasks such as securely wiping drives, backing up data, or simply shuffling partitions or drive/partition images around between devices. While I typically focus on hard disks, most tools can work on any block devices supported by Linux, including USB flash drives, CD-ROMs, floppy disks, zip disks, individual disks from a RAID, and so on. Although the procedures in this article can be used to work with any platform's data (including Mac volumes if you stay at a low level), this guide will primarily focus on rescuing data from PCs running Microsoft Windows that have failing hard drives.

I'm also going to assume that readers of this guide don't have much prior Linux knowledge (although that is extremely helpful) so if you are a power user, try not to be offended by my hand holding.

Tools required

  • A Linux distribution that can compile/install badblocks, ddrescue, fdisk, and testdisk. The distribution can be a live CD, or installed and running on a dedicated data recovery machine.

Preparation

If you are a technician, you should consider making a dedicated data recovery machine. I suggest the following components for it:

  • A 1GHz or faster Athlon Thunderbird or Intel P4 "Prescott" processor. (recommended)
  • At least 256KB L2 cache on your CPU. (recommended)

Faster and newer processors are obviously better. Try not to use Celerons or Durons unless absolutely necessary, and then only if they are faster than 1.5GHz. Avoid any processor that has 0KB L2 cache (such as certain Celerons).

  • At least 256MB of RAM (required) or 512MB+. (recommended)
  • One or more controller cards that work in RIPLinux, for additional ATA/100 IDE or SATA ports (optional).
  • An ethernet card that is recognized by RIPLinux (optional).
  • For the technicians/shops: a network-attached storage (preferably in a protective RAID mode), for storing copies of drives or large trees of customer data. At my shop I use a 1TB RAID 5 volume.
  • A known-good destination hard drive that is either larger than the source drive, or the same nominal size with a last LBA address greater than or equal to the source drive's (required for drive clone+recover).
  • A known-good fully-charged battery backup/UPS unit (highly recommended).
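
The destination size requirement above comes down to comparing sector counts, since the last LBA address is simply the sector count minus one. Here is a minimal sketch of that comparison. The helper name dest_is_big_enough is my own invention, the device names in the comment are only placeholders, and I'm assuming your rescue system ships the blockdev utility (which reports a device's size in 512-byte sectors via --getsz).

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the destination has at
# least as many sectors as the source. Both arguments are sector counts.
dest_is_big_enough() {
    src_sectors=$1
    dst_sectors=$2
    [ "$dst_sectors" -ge "$src_sectors" ]
}

# On a real system the counts could come from blockdev, for example:
#   dest_is_big_enough "$(blockdev --getsz /dev/hda)" "$(blockdev --getsz /dev/hdc)" \
#       && echo "destination is large enough"
```

Two drives sold as the same size can still differ by a few thousand sectors, which is why comparing the actual counts is safer than trusting the label.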

Preparation notes

  • If you're working from the machine the drive came from rather than your own known-good system, be sure that everything is stable on the system by running diagnostics such as Memtest86+. If possible, run the OEM's diagnostics. Dell, for example, usually has a diagnostic partition or bootable floppy/CD you can download at support.dell.com. A service tag is necessary so that you can be sure you download the right version for your Dell model. While I suggest that you run all of the tests, you should at least run the CPU and memory tests. Dell's diagnostics are surprisingly reliable and accurate! At the very least be sure you're using a machine that isn't going to crash Linux. The Linux kernel isn't known for its crashing - so take kernel panics (basically, the Linux equivalent of a Blue Screen of Death) as a big warning sign.
  • It is possible (with enough effort) to disassemble RIPLinux and compile a new kernel that supports your controller card, ethernet card, etc, and build a new disc out of it. This is beyond the scope of this guide, but the RIPLinux website has a documentation area which contains a text file/FAQ that has a few instructions for those who are familiar enough with Linux to understand them. I've built my own disc and custom kernel so that I can support more devices and remove some of the features I don't need, reducing the size of my RIPLinux disc image.

IDE preparation

  • Don't use 40 conductor cables - always use 80 conductor cables. 40 conductor cables can't handle bus errors very well, especially when a drive's controller board is going out. 80 conductor cables support the fastest DMA modes as well, and speed counts when a drive doesn't have long to live. During a recovery, the drive will probably go in and out of various DMA modes, and 40 conductor cables effectively cut the available speed options in half.
  • In some cases, with quirky/medieval drives, you may end up having to use a 40 conductor cable anyway, so have one on hand.
  • Always connect the drives as master of their own channels unless there is no other choice.
  • IDE controllers have two channels: Primary and Secondary. Each channel can have two devices on it known as Master and Slave. If your controller has more than two ports on it, then it is technically more than one controller. You want to connect your source and destination drives to their own channels so that there cannot be any IDE crosstalk between them which usually corrupts data very badly when a drive's electronics are failing. If you can't connect the drives to their own channels because you don't have an IDE expansion card and your CD-ROM drive is using the secondary controller, then try to boot a Linux rescue distribution from a USB drive. You really don't want to connect a hard drive to the same controller as an optical drive:
  • Do not chain hard drives to optical drives, LS-120 drives, ZIP drives, or any other ATAPI/ATA device other than a hard drive. Data loss is possible, and probable. If you connect a known-good drive and your BIOS detects its model as a garbled mess, start sweating.

If you must chain the source and destination drives together, follow these conditionals and recommendations:

  • If one of the drives is failing, put the failing drive at the end of the IDE cable (the connector furthest from the motherboard connector end) and the known-good destination drive in the middle of the IDE cable.
  • If you are simply cloning a working drive or partition to another working drive/partition, then put the drive with the oldest manufacturing date at the end of the cable and the newer drive in the middle.
  • Set the failing drive to the "Slave" jumper position, and the good drive to "Master". If you're not working with a failing drive, set whichever drive is newer (and is in the middle of the cable) to "Master".
  • Never use cable select; you need full control over both the cable position and the logical addressing.
  • Don't use RAID cards. They're made to send up a warning flare (usually by way of an incredibly loud/annoying piezo buzzer) and automatically dismount drives at the first sign of a problem. This will cause your clone to fail pretty much the instant it hits a bad spot on the drive. You want to use a controller that is going to let you get away with having a failing, flaky drive attached to it. That means you need a cheap PCI card. PCI cards usually have better performance and fewer issues than motherboard IDE chipsets. Since you may be booting a Linux distribution from CD-ROM, you will have already lost an entire channel anyway (since you shouldn't be connecting any drives to the same channel as a CD-ROM, as mentioned above). Invest in a cheap PCI IDE controller. RAID cards also typically don't support S.M.A.R.T. at all, or at least not in the same fashion as a regular IDE controller.

SATA preparation

  • When using SATA drives, the channels don't matter at all - you can connect them to any ports you like, but you need to be able to identify which is which before you clone anything. The SATA controller also must be recognized by RIPLinux (or your favorite rescue system) or you will not be able to manipulate them at all. You may have to compile your own kernel modules or recompile the rescue system's kernel in order to recognize your controller. The term "SATA controller" refers to both add-in cards and your motherboard's onboard SATA chipset. If known-good SATA drives aren't detected by RIPLinux you may need to use a SATA to IDE bridge until you can get a kernel module (i.e., driver) for the SATA controller.
  • If your SATA controller doesn't support SATA-II mode (3.0Gb/s) then you may need to put a jumper on the drive which tells it to only use SATA-I mode (1.5Gb/s). Consult the drive's label to see where to place the jumper. Some newer controllers that don't actually support SATA-II are able to work with the drive without a jumper, but if the drive isn't detected or doesn't work correctly, consider jumpering the drive to force SATA-I mode.
  • It's perfectly fine to clone from SATA to IDE or IDE to SATA - the data doesn't really care where it is and there are no differences in the controller types that would affect things at the filesystem level. The only time it will matter is if you are trying to get an OS to boot after the clone. Windows, for example, won't boot if you've swapped the controller type out from under it, but you can remedy even this by performing a repair install. Don't forget to hit F6 at the beginning of text-mode setup (for Windows 2000 and Windows XP) and specify a controller driver if necessary. If your SATA controller is running in legacy or compatibility mode (which is up to your SATA controller's BIOS) then you may not need a driver for Windows (or even a repair install) in order to boot after cloning from IDE to SATA.
  • Don't use SATA RAID cards. They're exactly as bad as IDE RAID cards for the purpose of cloning and recovery (see above).

Software to avoid

Spinrite

GRC.com's Spinrite basically does what it says it does (minus the mumbo-jumbo and snake-oil technobabble that Steve Gibson uses to describe its capabilities), but it does a few things that I don't like:

  • At level 2 and higher, it writes back to the failing drive. Even though Spinrite tries to disable automatic reallocation (which is not a magical thing that only Spinrite can do, by the way) it does end up writing data back to the drive, which is generally very bad practice.
  • It can get "stuck" at the first failing sector it finds, trying for hours (or days) without giving up or moving on. You could have moved on and gotten all of the "known-good" data from the rest of the drive (and come back to the bad sector later) but Spinrite doesn't support that. By the time it gives up the drive heads/electronics/platters could be fully shot and you just lost your last chance for data recovery.
  • Sometimes, instead of getting stuck, it flies right by a sector, stating (in effect) that some, but not all, of the data was copied out of the sector. After it allows the drive to reallocate the sector, the rest of the data that wasn't copied is just plain lost.
  • Since Spinrite forces reallocation after a sector is "recovered", there is a potential to actually run out of remappable sectors. All drives have a finite pool of sectors set aside just for remaps, and it is entirely possible to run out before Spinrite is done with the drive. This has actually happened to me once.
  • There's not a lot of useful command-line switches. Why can't you silence the annoying beeps and bloops that the program makes with a command-line switch (as some PC speakers are really loud and not easy to disable)? Although there is a "DynaStat" command line parameter which increases (or decreases - your choice) Spinrite's threshold for giving up on a failing sector, Spinrite generally gives up before retrieving all of a sector no matter what it's set to. If you set it too high, it'll never give up and you may be sitting there for days/weeks at one bad spot.
  • Bugs in Spinrite 6 often cause the S.M.A.R.T. monitoring window to become corrupted and unusable until the program is restarted.

It is because of all of these problems that I can no longer recommend Spinrite to anyone.

Level 1 is the only acceptable level to use on a drive, because it doesn't perform writes. It doesn't matter though - why bother with Spinrite at all?

I suggest replacing Spinrite with smartctl (for monitoring S.M.A.R.T.), badblocks (for testing for bad sectors), and ddrescue (for recovering drives with bad sectors). See further below for usage and descriptions of all three.

chkdsk /R

chkdsk is Microsoft's filesystem repair utility for FAT12, FAT16, FAT32, and NTFS filesystems.

Here are the main reasons you shouldn't use the /R flag:

  1. Its actions are undocumented, and some of its status messages are ridiculously terse ("found and fixed one or more errors").
  2. It doesn't tell you what files are being modified (or if they're even being modified) when working with unreadable extents ("File record segment XYZ is unreadable").
  3. Most of the time, it doesn't tell you what sectors are bad (as in LBA sector number).
  4. Even when it is giving you less terse messages about sectors ("Read failure with status 0xXYZ at offset 0xXYZ for 0xXYZ bytes") there still isn't enough information provided to use another tool (offset from what? Is this specified in LBA sector addressing or MFT logical addressing?)
  5. You can't do a heavy-duty check without also performing repairs, so even the offset information in the previous item is useless after chkdsk does whatever it does.
  6. It doesn't tell you if it was able to recover your data out of a bad sector.
  7. It often doesn't tell you what is corrupted or why it gives up when it gives up ("one or more unrecoverable partitions").
  8. When recovering bad sectors, it immediately writes data back to the drive in order to force reallocation, but it won't tell you if it was successful or not, and writing to a failing drive is the worst thing you can do.
  9. When the parent folder information is lost, the files are moved to X:\FOUND.### where ### is a number starting at 000. chkdsk doesn't tell you whether files put here are intact or not (and when processing a drive with bad sectors, you can usually assume the files placed here have holes in them or are just plain destroyed).

Don't use chkdsk on a drive that is failing, especially not with the /R flag.

You should use the /F flag, however, because chkdsk is a decent free NTFS filesystem-level repair utility. NOTE: You should only use the /F flag if you know the drive has absolutely no bad sectors and is working completely perfectly. Server 2003, XP Professional x64, and Vista have a new chkdsk flag (/v), which prints extra information about modified files, but this is only useful for NTFS. For FAT32, it prints a list of every file being checked.

There are several versions of chkdsk; this is a list (from best to worst) of chkdsk intelligence:

  • Vista / XP Professional x64 / Server 2003 (same version of chkdsk)
  • XP Home / Professional x86 (same version of chkdsk)
  • XP Recovery Console
  • 2000
  • 2000 Recovery Console
  • NTFSDOS Pro (this depends on the version of NT you use as a source to build your NTFSDOS boot disk, but it's generally a bit below the capabilities of the equivalent Recovery Console version of chkdsk)

As you can see, it's highly recommended to use Vista's chkdsk when possible (even through WinPE or the Vista install disc) but never with the /R flag. When using Vista's chkdsk, use '/F /V' whenever you are working with NTFS. x64 Professional, 2003 Server, and Vista share the same kernel (and chkdsk code) so you can interchange the word "Vista" in the above statements with 2003 Server or x64 Pro.

I suggest using badblocks to scan a drive for bad sectors and then clone it using ddrescue if there are any. Once you have cloned the data to a known-good drive, then you should use chkdsk /F (/F /V in Vista/x64/2003) to repair the higher-level damage. If you find a single error with chkdsk, run chkdsk again. Continue running chkdsk until there are no more errors. (See below for instructions for both badblocks and ddrescue.)

Norton Ghost

Norton Ghost for DOS is an IT admin favorite - it is pretty good at making images of filesystems, compressing them, deploying them, etc. But should you use it for recovery? No. Ghost will crash or abort when an error occurs (or can crash/abort even if you set it to ignore all errors), it doesn't do any recovery of unreadable sectors, and it rearranges the MFT. When you Ghost a drive using the default settings, it does not do a sector copy - instead it makes every file contiguous with no gaps of space between files and rewrites the MFT, which is the part that usually causes Windows systems to be unbootable after a Ghost (I'm referring to drives that are working fine). Ghost was not meant for recovery, so don't even bother with it for this purpose. For all other uses (imaging, deployment, etc), however, Ghost is just fine.

Defragmenters

Disk defragmenters try to make most files contiguous, and in doing so, write over free space areas. Writing back to a drive that's failing is a cardinal sin in data recovery, so obviously this is to be avoided. There is nothing a disk defragmenter can do to help your situation; it can only cause more issues.

Using RIPLinux

Booting methods

When you first boot stock RIPLinux, you will be presented with an ISOLINUX menu which asks you to choose from various booting methods.

  1. The first menu item boots RIPLinux using the initramfs method.
  2. The second menu item "(skip keymap prompt)" also boots using the initramfs method, but it doesn't ask you which keymap to use. Keymaps are only useful if your keyboard is NOT a US QWERTY keyboard.
  3. The third menu item boots without using initramfs.
  4. The fourth menu item boots without initramfs and skips the keymap prompt.

initramfs and non-initramfs are two methods for mounting the recovery system's root in memory. Initramfs unpacks the rescue system to the '/' mount point and then loads the kernel, and the non-initramfs method loads the kernel first and then unpacks the root after the kernel is done with its startup. With the non-initramfs method you get to see all of the kernel messages first, which should aid in debugging problems with booting RIPLinux. If you put RIPLinux on a USB drive (as described on the RIPLinux site) then you should use the non-initramfs method as it boots a bit faster than the initramfs method, but this depends on your system.

If you downloaded the "X" version of RIPLinux, then you will have four more menu items identical to the first four, except that at the end it boots into an X Window System server (aka "X11"). For now, let's stay out of graphical environments; they're not going to make things as easy as you might think. Also, you must learn to walk (console) before you can run (GUI). Furthermore, running X11 dramatically decreases the amount of memory available - and since this is a live CD copied into memory, memory usage is quite important.

Basically, if you are in the US using a QWERTY keyboard, just select the second menu item, which uses the initramfs and skips the keymap prompt. Otherwise, choose the appropriate keymap for your keyboard by selecting the first menu item.

Cheat codes

Linux kernel options are usually called "cheat codes". If you have trouble booting RIPLinux on a particular system, you'll need to try one or more cheat codes to enable or disable certain kernel options/functions. You can get around buggy BIOSes and faulty hardware this way, also. You should test the memory (with RIPLinux's memtest86 menu item) and then check for BIOS updates for the machine before trying cheat codes.

To use one or more cheat codes, highlight one of the RIPLinux menu options (initramfs, skip keymap, etc) and press Tab to get to the line editor. Simply append one or more options.

The following cheat codes are listed in order of decreasing success in resolving problems. The first option in the options column is the suggested option. All other options are listed in arbitrary order. Append an item from the "options" column to the end of the cheat code to use it (unless there is no equals sign to the cheat code, in which case there are no options). Example: clocksource=pit

Cheat Code | Options | Meaning
clocksource= | pit, pmtmr, tsc, acpi_pm, hpet | Changes the kernel timer mode. clocksource=pit (Programmable Interval Timer) is the most compatible and may even be necessary to boot a 2.6 kernel on a 64-bit or dual core machine, depending on your system configuration and motherboard's quirks. clocksource=tsc is the default in most Linux configurations unless an HPET is detected.
pci= | noacpi, routeirq | pci=noacpi helps avoid broken/buggy ACPI BIOSes. This prevents the kernel from using ACPI for IRQ routing ("steering") or PCI scanning. pci=routeirq tells the kernel to use IRQ routing (sometimes known as "steering" in certain versions of Microsoft Windows) on all PCI devices, whether their drivers implicitly call for it or not. noacpi and routeirq can be used at the same time: pci=noacpi,routeirq
noapic | (none) | Ignores I/O APICs if present.
nolapic | (none) | Ignores local APICs if present.

Moving around in console

Once the system boots, you will receive a login prompt. Type root and press Enter. You are now at a root shell, which is denoted by a hash mark (#). Whenever you see the # prompt, your shell is ready to receive orders.

You can have more than one shell open at a time. Hold Alt and press the F-keys (F1, F2, etc) to access the other terminals. You can log in to multiple terminals and have processes running on each.

The prompt

RIPLinux's default console prompt is simply the root hash mark, and this doesn't change no matter where you are in the system. I've found it helpful to modify the prompt to display pertinent information such as the current virtual terminal you are on and the current path. Here is the command to change the prompt to my recommendation:

export PS1="(\\l)\\w\\$ "

I have gone a bit further than that; I use a colorized prompt:

export PS1="\[\e[1;30m\](\[\e[0;37m\]\\l\[\e[1;30m\])\[\e[1;33m\]\\w\[\e[1;37m\]\\$ \[\e[0m\]"

I also change the cursor mode using the following command:

echo -e '\033[?18;12;0c'

I'm not even going to attempt to explain any of that, so here's an animation of both in action:

I really like the way the cursor highlights text in this mode (as shown in the animation). It makes it easier to see where you are when you're using an old monitor with low brightness/contrast or when you're swimming in a sea of boring light gray text.

For more information on customizing the prompt, read the Bash Prompt HOWTO.
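
If you'd like the colorized prompt decoded after all, here is a rough sketch that builds the same string from named pieces. The variable names are my own labels for the ANSI color codes (they aren't anything bash defines), and the color descriptions assume a standard Linux console palette.

```shell
# Color escapes, each wrapped in \[ \] so bash does not count them as visible width.
dark_gray='\[\e[1;30m\]'    # bold black
light_gray='\[\e[0;37m\]'   # normal white
yellow='\[\e[1;33m\]'       # bold yellow
white='\[\e[1;37m\]'        # bold white
reset='\[\e[0m\]'           # back to default attributes

# \l = basename of the terminal device (e.g. tty1)
# \w = current working directory
# \$ = "#" when you are root, "$" otherwise
PS1="${dark_gray}(${light_gray}\\l${dark_gray})${yellow}\\w${white}\\$ ${reset}"
export PS1
```

The doubled backslashes are there because the string is in double quotes; bash collapses each pair into the single backslash that the prompt escapes need.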

At this point you may want to look at some Linux primers (books or web pages) to get yourself comfortable with the various common commands and shell built-ins.

Finding your devices

If you are dealing with mass storage devices, the easiest way to see if they are being detected properly is to issue the following command at the shell prompt: fdisk -lu

(That's a lowercase L after the dash.)

Example output:

Disk /dev/hda: 3MB, 3133440 bytes
4 heads, 17 sectors/track, 90 cylinders, total 6120 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id   System
/dev/hda1              17        6119        3051+  83   Linux

This output came from a virtual machine where I created a 3MB hard disk image just for testing purposes. Real-world values will obviously be much different. /dev/hda is my first hard disk, which contains 1 partition of type 83 (Linux). The -l flag tells fdisk to list the detected devices and partitions and the 'u' part of the flag tells it to display units as sectors instead of the archaic cylinder notation. NOTE: the "Device Boot" part is actually two separate columns ("Device" and "Boot"). If my /dev/hda1 device were bootable, there would be an asterisk (*) under the second "o" in "Boot".
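
If the relationship between the Start, End, and Blocks columns isn't obvious, this small sketch reproduces the arithmetic. It assumes 512-byte sectors and 1024-byte blocks, as in the example output; the function name part_blocks is made up for illustration.

```shell
# Prints the fdisk-style "Blocks" value for a partition, given its first and
# last sector (512-byte sectors, 1024-byte blocks). A trailing "+" marks an
# odd sector count, meaning half a block is left over.
part_blocks() {
    start=$1
    end=$2
    sectors=$((end - start + 1))
    blocks=$((sectors / 2))
    if [ $((sectors % 2)) -eq 1 ]; then
        echo "${blocks}+"
    else
        echo "$blocks"
    fi
}

part_blocks 17 6119   # the /dev/hda1 row above; prints 3051+
```

The same sector math explains the header line: 6120 sectors times 512 bytes is the 3133440 bytes reported for the whole disk.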

As a side note, you can also cat /proc/partitions to get a formatted text dump of fixed drives and partitions.

/dev/

In most UNIX-like systems (such as Linux, BSD, etc) there is a /dev/ hierarchy on the filesystem that contains special "node" files that the kernel uses to communicate to a device through its device driver. In RIPLinux (and most Linux distributions in general), IDE devices are referenced as /dev/hdXY and SCSI/SCSI-emulation devices (which includes SATA and USB mass-storage devices) are referenced as /dev/sdXY. The 'X' part of XY will be a letter from 'a' to 'z' identifying a drive (for IDE, the letter follows the drive's controller channel and master/slave position). The 'Y' part is a number (starting at 1) which specifies a partition on that drive.

In most cases, /dev/hda1 will be the first partition of the first IDE hard drive, /dev/hdb1 will be the first partition on the second hard drive, and /dev/hda references the entire first hard drive. If you have a PCI IDE controller in addition to the IDE controller on your motherboard, then drives connected to it may be referenced by a much higher letter (e.g., /dev/hde, /dev/hdf, etc). A PCI SATA controller might end up as /dev/sdc and /dev/sdd if you already have two SATA channels on your motherboard, USB drives attached, or card readers built into/connected to your machine.

It is important to know the distinction between /dev/hda (for example) and /dev/hda1. If you are cloning an entire drive, you must leave off the partition number. This way, the entire hard drive's structure (including all partitions, the partition table, and the boot sector code) is copied to the other drive. If you are only cloning a single partition then a destination partition must already exist on the other drive. If /dev/hda has no partitions on it yet (a blank drive, for example), there won't even be a /dev/hda1.

As mentioned before, mass storage devices (such as USB flash drives) are typically detected as if they were SCSI devices, so they show up as /dev/sda, /dev/sdb, etc.
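
To make the whole-drive versus partition distinction concrete, here is a tiny sketch that strips the partition number from a node name. The function name whole_disk is hypothetical, and the trick only applies to the /dev/hdXY and /dev/sdXY naming described above (not to nodes like /dev/fd0, where the number is a device ID).

```shell
# Prints the whole-drive node for a /dev/hdXY or /dev/sdXY style name by
# removing the longest suffix that starts with a digit.
whole_disk() {
    echo "${1%%[0-9]*}"
}

whole_disk /dev/hda1    # prints /dev/hda
whole_disk /dev/sdb12   # prints /dev/sdb
whole_disk /dev/hda     # already a whole drive; prints /dev/hda
```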

/dev/ notes

  • Internal floppy drives are known as /dev/fdX where X is a number (starting at 0). /dev/fd0 is the typical "A:" as you would know it in DOS/Windows. /dev/fd1 would be "B:" if you had a secondary floppy drive.
  • Floppy disks don't have partitions, so the numbers always indicate device ID.
  • /dev/cdrom is a symbolic link that points to the first optical drive, typically /dev/hdc in a normal IDE system. hda and hdb are Primary Master and Primary Slave, and hdc and hdd are Secondary Master and Secondary Slave in this case.
  • You can use most tools (like badblocks and ddrescue) to test or recover /dev/fd0 or /dev/cdrom just as you would /dev/hdXY or /dev/sdXY. Note: I recommend that you avoid /dev/cdrom for now, and instead directly reference the correct /dev/ node for the optical drive you wish to work with (such as /dev/hdc in a typical IDE system). This will help avoid confusion.
  • With the old static /dev/ style, all nodes exist whether or not a device is present. With the new udev system, nodes in /dev/ don't exist until there's an actual device there. udev makes it easier to determine how your devices are being referenced because you can easily read a directory listing of /dev/, whereas a static /dev/ directory listing is incredibly long and useless. Fortunately, RIPLinux uses udev, but your distribution of choice may not have adopted the 2.6 kernel series or udev yet.

smartctl

smartctl (which is part of smartmontools) allows you to query the S.M.A.R.T. attributes of a S.M.A.R.T.-supporting drive (henceforth spelled "SMART"). SMART allows the drive to conduct tests on its own and keep track of various metrics including performance and reliability. You should check the SMART status of your hard drive before subjecting it to any scans that could stress it out. SMART may not be completely reliable in all cases (especially where a vendor has taken liberties with how to report data into the standard attributes) but if SMART thinks your drive is failing, it's failing. Immediately move on to cloning it with ddrescue.

Important SMART attributes

Decimal Hexadecimal Name
005 0x05 Reallocated_Sector_Ct
194 0xC2 Temperature_Celsius
196 0xC4 Reallocated_Event_Count
197 0xC5 Current_Pending_Sector
198 0xC6 Offline_Uncorrectable
199 0xC7 UDMA_CRC_Error_Count

Reading S.M.A.R.T.

When looking at a SMART status list of these attributes, pay attention to the "raw" value. When I use the word "generally" in the following sections about the important attributes, I mean that most vendors put "normalized" values in the raw column of these attributes or smartctl knows how to normalize them. The smartmontools project maintains a database of drive models and how to normalize some of the raw values. If your drive is in the database, normalizations (via the smartctl -v flag) are performed automatically. If you know how to normalize a value, you can use the -v flag to specify how to format a certain attribute ID. For example, if you knew that attribute 9 (Power_On_Hours) for your particular drive model was actually stored as minutes, then you could specify "-v 9,minutes" in your command line to have the raw value formatted for you.

Values that have not been normalized are not really usable by the end-user, as they contain values that could mean anything, rather than being a nicely-formatted literal count of something. The attributes in the above table are normalized by most vendors and/or are recognized by smartctl.

Only the hard drive and its manufacturer know the formulas for turning nonsensical numbers in unnormalized values into usable data because of how loose S.M.A.R.T. implementation is. For unnormalized "raw" values, you can look at the normalized regular "value" column to see a theoretical rating (which is more like a ratio) for a drive. If the "value" falls below the "threshold" then that attribute is critical and either performance or reliability has been affected.

Unfortunately, S.M.A.R.T. doesn't consider your drive failing until it's far beyond what we would consider normal. Sometimes it takes thousands of reallocated sectors before S.M.A.R.T. considers a drive to be failing!
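
Since the interesting part is the raw column of just a handful of attributes, a filter like the following can cut the noise. This is only a sketch: it assumes the usual ten-column attribute table that smartctl -A prints (ID, name, flag, value, worst, thresh, type, updated, when failed, raw value), the function name smart_watchlist is my own, and the sample numbers are made up for illustration.

```shell
# Reads `smartctl -A` style output on stdin and prints NAME=RAW for the six
# attributes from the table of important SMART attributes.
smart_watchlist() {
    awk '$1 ~ /^(5|194|196|197|198|199)$/ { print $2 "=" $10 }'
}

# Made-up sample rows in the assumed column layout.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4521
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3'

printf '%s\n' "$sample" | smart_watchlist
```

On a real drive you would feed it live data instead, for example: smartctl -A /dev/hda | smart_watchlist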

5 - Reallocated_Sector_Ct

This is generally the literal count of how many bad sectors have already been added to the grown defects list that the drive keeps track of internally. NOTE: modern hard drives only remap sectors when a write occurs, so there will be a lot of cases where this attribute's raw value is 0 but the drive has numerous bad sectors pending reallocation (see below). When a write occurs to a bad sector, the drive will add the sector to its internal "G-List" and swap out the sector with a spare from a pool of factory-defined sectors set aside just for this purpose. If this number is more than 0, your drive has begun its downward spiral of failure, and you may have already lost data (depending on whether the drive was able to read all of the data before reallocating it or not).

194 - Temperature_Celsius

This is exactly what it sounds like. Not all drives support this attribute, however. A good drive temperature is around 35-45 Celsius.

196 - Reallocated_Event_Count

This is generally the literal count of all remap attempts by the drive (both successful and unsuccessful). This may differ from the Reallocated_Sector_Ct - it will usually be higher. Watch this attribute as well.

197 - Current_Pending_Sector

This is generally the literal count of how many remaps are going to happen the next time a write occurs to the sectors marked as pending reallocation. If this number is greater than 0 then your drive has begun its downward spiral of failure. If there are current pending sectors but no previously reallocated sectors, then you may be able to get a perfect clone of the drive whose data will work as if nothing was ever wrong, provided that chkdsk /R has never touched the drive.

198 - Offline_Uncorrectable

This is generally the literal count of how many sectors were detected as being faulty the last time a SMART self-test was performed. smartctl supports running various self-tests through the -t flag.

199 - UDMA_CRC_Error_Count

This is generally the literal count of how many times the controller encountered an error while processing an ATA command in UDMA mode. It also counts how many times a CRC checksum has mismatched during operations. Usually, this indicates a problem with the cabling or drive electronics. These errors can also be triggered by incorrect IDE device chaining, by a faulty IDE or SATA cable, or by a device firmware bug that coincides with a chipset detection problem (for example, a SATA II 3.0Gb/s drive misdetecting a SATA I 1.5Gb/s controller).
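Since I treat any non-zero raw count on the sector-related attributes as a warning sign, here is a small sketch that pulls just those out of smartctl -A output. The function name smart_sector_warnings is made up, and it assumes the raw value is the 10th column and a plain integer for attributes 5, 196, 197, and 198. That is common, but not guaranteed across vendors.

```shell
#!/bin/sh
# Sketch: print the sector-health attributes (IDs 5, 196, 197, 198) whose
# raw value is greater than zero. Reads `smartctl -A` output on stdin.
# Assumption: RAW_VALUE is column 10 and is a plain integer for these IDs.
smart_sector_warnings() {
    awk '
        ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) && NF >= 10 {
            raw = $10 + 0
            if (raw > 0)
                printf "%s raw=%d\n", $2, raw
        }
    '
}

# Hypothetical usage (needs root):
#   smartctl -A /dev/device | smart_sector_warnings
```

If this prints anything at all, I would stop using the drive and go straight to cloning it.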

badblocks

badblocks is a simple program that can thoroughly test a hard drive's integrity at the sector level through software. I normally recommend one of two syntaxes, but feel free to view the man page (man badblocks) to see the rest of the flags and descriptions of the options.

If you know that your drive is failing, do not use badblocks -- skip ahead to ddrescue.

Non-destructive test

badblocks -sv /dev/device

-s is for showing progress (otherwise it sits there silently).

-v is for verbosity (prints out extra information).

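This performs a basic read-only test which will help you find basic surface errors. Most of the time, this is all that is needed. If a bad block is encountered, its block number is printed on a new line. Note that this number is counted in badblocks' own block size (1024 bytes unless you pass -b), so it is not necessarily the drive's LBA sector number.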
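badblocks counts in units of its own block size (1024 bytes by default, changeable with -b) rather than in drive sectors. If you want to line a reported block up with sector numbers from another tool, the arithmetic looks something like the sketch below. The helper name badblock_to_sectors is hypothetical, and the 512-byte sector size is an assumption that you should override for 4K-native drives.

```shell
#!/bin/sh
# Sketch: convert a block number printed by badblocks into the range of
# drive sectors that the block covers.
# Assumptions: block size defaults to 1024 bytes (badblocks without -b),
# sector size defaults to 512 bytes, and the block size is a whole
# multiple of the sector size.
badblock_to_sectors() {
    block=$1
    block_size=${2:-1024}
    sector_size=${3:-512}
    per_block=$(( block_size / sector_size ))   # sectors in one block
    first=$(( block * per_block ))
    last=$(( first + per_block - 1 ))
    echo "$first-$last"
}

# Usage: badblock_to_sectors BLOCK [BLOCK_SIZE] [SECTOR_SIZE]
```

For example, block 1000 with the defaults covers sectors 2000 through 2001.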

If you see even ONE error, press CTRL+C to abort the scan and prepare to do a recovery using ddrescue (as described further below). You do not want to do anything to a drive with one or more bad blocks (like virus scanning, defragmenting it, copying files to/from it, and especially not booting the OS that is on it). Anything you do can be hazardous to the data in that area, or you may even run into (or cause) additional problems.

Destructive test

The following command will erase all data on the drive you are testing.

The second syntax I use (which is ridiculously dangerous) is:

badblocks -svw /dev/device

The extra -w flag performs a DESTRUCTIVE READ-WRITE test with NO CONFIRMATION, which erases all of the data on the drive by writing a set of byte patterns to it and then reading them back for verification.

Obviously you only want to do this if you've backed up your important data or if you were just going to format it anyway. This mode of badblocks is good for rooting out deeper problems (sectors that read correctly most of the time and aren't being caught as bad sectors) or for burn-in tests. If you receive errors during this test, you should first replace the drive's data cable, then the power supply, or try the test with the same drive on another computer/controller, and finally RMA/replace the hard drive. I rarely use this mode anymore as the read-only scan usually finds the error(s) so I just end up replacing the drive.

If you have found one or more bad sectors on your drive, ddrescue it to a safe location before doing anything else.

ddrescue

ddrescue, like *NIX `dd`, will sector copy an input file/device to an output file/device. The difference between ddrescue and dd is that ddrescue automates data recovery. If a bad block is encountered, it is logged and a large area is skipped. ddrescue attempts to read all of the "known-good" data in the first pass. After finishing all of the known-good data, ddrescue "splits error areas". It continues splitting these areas until the hardware block size is reached. It does not truncate the output file; it leaves holes that it can fill in during area splitting. This means that if you have multiple sources of the same exact data, you can rescue each one of their inputs to the same output file using the same log and ddrescue will use them to fill in the data. For this to work, however, the inputs have to be exactly the same stream of data from start to finish.

After ddrescue finishes the first pass and finishes splitting error areas, all of the sectors it could not read are not tried again. ddrescue quits at that point.

If you use the -r flag (see below), you can have it retry, which is very effective in squeezing a bit more data out of a drive. Sometimes you have to leave a drive cloning for a week or more, but the longer you wait and the more retries you allow, the more likely you are to get 100% of the data. I suggest -1 for infinite retries; cancel it with CTRL+C when you get tired of waiting.

NOTE: I do not recommend dd_rescue or dd_rhelp as they have been quite buggy for me - I gave up on them quickly. ddrescue has never failed me. It's the best of dd_rescue and dd_rhelp without all of the bugs/crashing.

ddrescue needs at least two parameters, but four are recommended if you are doing any sort of recovery:

ddrescue input output logfile (other options)

There are a lot of options for controlling the block size and other functions but I highly recommend using the retries option, which is -r. If you give -r the value of -1 (negative one; not a switch) it will continue retrying forever. You can let ddrescue clone as much as it can and when you give up, press CTRL+C to abort.

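If you were going to clone /dev/hda to /dev/hdb, then you would use this syntax (current versions of GNU ddrescue will also want -f, for "force", before they agree to overwrite an output that is a device rather than a regular file):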

ddrescue /dev/hda /dev/hdb /hda.log -r -1

If you wanted to rescue a scratched CD-ROM and had mounted a Samba network share to /mnt/win, you could use this syntax:

ddrescue /dev/hdc /mnt/win/myiso.iso /hdc.log -r -1
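Mixing up the input and output is the classic way to destroy the drive you were trying to save, so here is a hedged sketch of a helper that only assembles the command line and prints it for review. It runs nothing. The name clone_cmd is made up, and the -f handling reflects my understanding that current GNU ddrescue refuses to overwrite a device without it.

```shell
#!/bin/sh
# Sketch: build (but do not run) a ddrescue command line after a couple
# of sanity checks. clone_cmd is a hypothetical helper, not part of ddrescue.
clone_cmd() {
    src=${1:-}
    dst=${2:-}
    log=${3:-}
    if [ -z "$src" ] || [ -z "$dst" ] || [ -z "$log" ]; then
        echo "usage: clone_cmd input output logfile" >&2
        return 1
    fi
    if [ "$src" = "$dst" ]; then
        echo "input and output are the same: $src" >&2
        return 1
    fi
    force=""
    # Only add -f when the output is an existing block device.
    if [ -b "$dst" ]; then
        force=" -f"
    fi
    # Paths containing spaces are not quoted in the printed command.
    echo "ddrescue$force $src $dst $log -r -1"
}
```

Read the printed line twice, confirm which device is which, and only then paste it into the shell.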

Once you have a perfect clone of your drive, then use a filesystem repair utility on it. If it's FAT32 or NTFS, use chkdsk.

Log file

As ddrescue runs, it updates the log file every 30 seconds with its results. It is a human- and machine-readable file which contains starting points, extents, and results, in that order.

Symbol   Meaning
+        All data from this extent was copied successfully.
-        No data from this extent was copied successfully.
/        This area was skipped due to errors, and has not yet been split.
?        This area has not yet been read at all.
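Because the log is plain text, it is easy to total up how many bytes fall under each symbol. The sketch below assumes the layout I have seen in ddrescue logs: comment lines starting with #, then one current-position line, then "position size symbol" rows with hexadecimal numbers. The function name is my own, and any symbol not in the table above is lumped into "other".

```shell
#!/bin/sh
# Sketch: total the bytes per status symbol in a ddrescue log file.
# Assumed layout: '#' comment lines, then one current-position line,
# then rows of "pos size status" with hexadecimal pos and size.
ddrescue_log_summary() {
    logfile=$1
    good=0; bad=0; skipped=0; untried=0; other=0
    seen_current=0
    while read -r pos size status rest; do
        case $pos in
            ''|'#'*) continue ;;          # blank lines and comments
        esac
        if [ "$seen_current" -eq 0 ]; then
            seen_current=1                # skip the current-position line
            continue
        fi
        bytes=$(( $size ))                # hex (0x...) to decimal
        case $status in
            '+') good=$(( good + bytes )) ;;
            '-') bad=$(( bad + bytes )) ;;
            '/') skipped=$(( skipped + bytes )) ;;
            '?') untried=$(( untried + bytes )) ;;
            *)   other=$(( other + bytes )) ;;
        esac
    done < "$logfile"
    echo "good=$good bad=$bad skipped=$skipped untried=$untried other=$other"
}

# Usage: ddrescue_log_summary /path/to/hda.log
```

Watching the bad and skipped totals shrink between runs is a decent way to decide whether more retries are still paying off.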

If you specify a log file, ddrescue can resume where it left off if you cancel it (with CTRL+C, for example). You should store the log in a place that won't be lost if the computer is shut down. When using a live CD, this means you will have to store the log somewhere over the network, on a floppy disk, or USB thumb drive, rather than storing it on the RAM disk filesystem.
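If you aren't sure whether a directory on a live CD is backed by RAM, you can ask what filesystem it sits on before trusting it with the log. This is a minimal sketch using GNU stat. The helper names are hypothetical, and the list of "volatile" filesystem type names is my assumption about what live discs typically use, so adjust it for your distribution.

```shell
#!/bin/sh
# Sketch: warn when a directory lives on a RAM-backed filesystem.
# The type names below are an assumed list, not an exhaustive one.
is_volatile_fstype() {
    case $1 in
        tmpfs|ramfs|rootfs|overlayfs) return 0 ;;
        *) return 1 ;;
    esac
}

check_log_dir() {
    # GNU stat: -f reports on the filesystem, %T is its type name.
    fstype=$(stat -f -c %T "$1") || return 2
    if is_volatile_fstype "$fstype"; then
        echo "WARNING: $1 is on $fstype; the log may vanish on shutdown"
        return 1
    fi
    echo "OK: $1 is on $fstype"
}

# Usage: check_log_dir /mnt/usb
```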

Securely erasing data

Another use for ddrescue is securely wiping a device. It isn't really any faster (or slower) than any other method, but it produces a nice status window providing information about throughput and current position while it does it.

The following syntaxes will erase all of the data on the selected device.

If you wanted to zero-fill the primary hard drive, use this syntax:

ddrescue /dev/zero /dev/hda

If you wanted to securely wipe it, use this syntax:

ddrescue /dev/urandom /dev/hda

You don't need a log file for this kind of syntax since the log only stores events about the source input. /dev/zero and /dev/urandom aren't ever going to have errors, so there's no need for a log.

The most basic way to zero a device is simply:

cat /dev/zero > /dev/device

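It won't show any progress while it runs. When it reaches the end of the device, cat stops with a "No space left on device" write error (that's expected) and you'll be returned to the # shell prompt.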
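Since cat gives you no feedback, you may want to confirm afterwards that the device really reads back as zeros. Here is a rough sketch using cmp against /dev/zero. The function name verify_zeroed is made up, it relies on cmp supporting -n (GNU and BSD cmp both do, as far as I know), and you have to supply the number of bytes to check yourself.

```shell
#!/bin/sh
# Sketch: check that the first SIZE bytes of a device or image are zero.
# Read-only. If the target is shorter than SIZE, cmp hits end-of-file and
# this reports a failure, so pass the real size.
verify_zeroed() {
    target=$1
    size=$2
    if cmp -s -n "$size" "$target" /dev/zero; then
        echo "all zero"
    else
        echo "non-zero data found"
        return 1
    fi
}

# For a whole block device, the size in bytes can be read with (needs root):
#   blockdev --getsize64 /dev/device
# Usage: verify_zeroed /dev/device SIZE_IN_BYTES
```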

TestDisk

TestDisk is a program that repairs partition tables and boot sectors. Visit the TestDisk wiki for usage information, downloads, etc. For the most part, you simply "Analyze" the problematic drive and "Write" the discovered partition table once that option is available. There are so many possibilities that it is difficult to get into the details here. I will update this section once I have some good examples.

FAQ

Bad sectors

Q: What is a bad sector?
A: A bad sector is any hardware sector where one or more bits cannot be read, or whose contents do not pass the checksum. Bad sectors are caused by physical defects such as weakly magnetized or warped/damaged sections of the platter itself.

Q: Why do you consider my drive failing if I only have one bad sector?
A: Whatever physical problem caused the first failure will cause more, and usually soon. You can consider your drive failed/failing when it has 1 pending or reallocated sector because it is not possible for it to get better (being that it's a hardware problem). Hard drives last anywhere from 1 week (!) to 10 years, so your mileage will always vary.

  • You can get the count of pending or previously reallocated sectors by looking at the values for the important S.M.A.R.T. attributes (listed above).

Q: Can it be a software problem? Can it be caused by a virus?
A: Bad sectors cannot be caused by software problems (especially not viruses).

Clicking

Q: My drive is clicking on startup and it fails to identify to the BIOS.
A: You will need professional data recovery assistance; there is no way around it. The head stack or System Areas could be damaged. See below under "Non-detection" for slightly more information.

Q: Everything else is fine (it detects and reads), but my drive clicks sometimes when I access a certain file or directory.
A: If this isn't a problem with the drive's head stack, ddrescue may be able to save everything on your drive except for the specific area that is causing the clicking. At any rate, it is time for a new drive.

Non-detection

Q: My drive isn't being detected by the BIOS. Can I just swap the drive's controller board for another of the same model?
A: Did you remember to plug in the power lead, is the controller known-good, and is the cable known-good? ;) or
A: Since software can't help you, seek a data recovery professional. I recommend visiting hddguru.com if you want to learn about hard drive internals. You could even post on their forums for assistance, though it's likely they'll refer you out if it's something beyond your means. Also, the likelihood of a board swap working is incredibly low unless you find an identically matched donor board. This is because the firmware has programming tailored to that drive and board. You would probably need to do some desoldering to swap chips from the source board to the donor board. Definitely seek a professional.

Q: My drive works fine but I can't get Windows to see it.
A: If it's in an external enclosure, did you set the drive's jumper setting to Master (for most drives) or Single (for Western Digital drives)? If Windows and Disk Management can see it but it refuses to mount, you may want to try TestDisk from Linux with the drive natively attached (i.e., not through an enclosure). Obviously you need to be sure that your enclosure works, too. Try another enclosure or connecting the drive natively.
A: Did your drive have Norton GoBack on it? If so, you have to change the partition type from '44' (the GoBack partition ID) to whatever filesystem is actually under there. NTFS is ID 07 and FAT32 is 0c. You can use a tool such as PTEDIT (from the Partition Magic suite) or Linux fdisk (a bit more advanced) or Linux cfdisk (easier than fdisk).
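Before changing anything, it helps to see which type ID is actually stored. On a classic MBR disk the partition table starts at byte 446, each of the four entries is 16 bytes long, and the type ID is the fifth byte of an entry. The sketch below only reads that byte. The function name mbr_part_type is hypothetical, and it only makes sense for MBR (not GPT) disks and primary partitions 1 through 4.

```shell
#!/bin/sh
# Sketch: print the type ID (two hex digits) of primary partition N (1-4)
# from an MBR. Read-only; works on a device or on an image file.
# Assumed layout: table at byte 446, 16-byte entries, type ID at offset 4.
mbr_part_type() {
    disk=$1
    part=${2:-1}
    offset=$(( 446 + 16 * (part - 1) + 4 ))
    dd if="$disk" bs=1 skip="$offset" count=1 2>/dev/null \
        | od -An -tx1 | tr -d ' \n'
    echo
}

# Usage (needs root for a real device): mbr_part_type /dev/device 1
```

A result of 44 would point at GoBack, while 07 or 0c means the ID already matches NTFS or FAT32 and the problem lies elsewhere.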

Not spinning

Q: My drive doesn't spin - should I put it in the freezer?
A: Freezing a drive only solves two very specific problems: bearing lubrication viscosity and stiction (where the head actually gets stuck to the landing zone/platter). Since it doesn't spin and it's a crapshoot from here, you might as well try it. Just make sure you seal the bag nicely so that no moisture develops on the drive's controller board. If this works, it only lasts about 30 minutes, so get your data quickly. The "iPod slap" method may actually work for stiction issues, but you do not want to try this while the drive is powered and you should not try it repeatedly.
A: If your problem is not caused by bearing lubrication viscosity problems, then it's probably time for a board swap, which requires professional assistance because of the firmware/System Area differences (see "Non-detection" above). Post on hddguru.com's forums for more information if necessary; they are a great resource (but be nice and be patient).

This page

Q: It seems unfinished.
A: It is. Once you have a clone of a failing drive, you can begin playing with its filesystem until you can read it. That's what's currently missing from this page - filesystem-level recovery and repair.