spearfoot / disk-burnin-and-testing
Shell script for burn-in and testing of new or re-purposed drives
License: Other
I ran your script on an 8 TB disk, but it finished after just a couple of days. I discovered that badblocks had exited immediately for no apparent reason: it simply said that it had finished, and the script continued with the next step.
When I ran badblocks myself I got the message
/dev/sdb is apparently in use by the system; it's not safe to run badblocks!
Apparently I had forgotten to unmount the drive. It would be nice if the script checked whether the device is mounted as soon as you run it and warned you.
Thanks for an awesome project!
And as a result, it skips execution of the badblocks program.
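A pre-flight check along these lines could catch the mounted-device case before badblocks silently bails out. This is only a sketch: the function name and the grep on /proc/mounts are my assumptions, and it is Linux-specific.

```shell
#!/bin/sh
# Sketch of a mount guard (Linux-specific): refuse to proceed if the
# device or any of its partitions appears in /proc/mounts.
is_mounted() {
  # $1 = device path, e.g. /dev/sdb; also matches partitions like /dev/sdb1
  grep -q "^$1" /proc/mounts
}

drive="/dev/sdb"    # example target; substitute your device
if is_mounted "$drive"; then
  echo "Error: $drive (or a partition on it) is mounted; aborting." >&2
else
  echo "$drive is not mounted; safe to test."
fi
```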
WDC Green model: WDC_WD20EARX-00PASB0
I was reading https://www.reddit.com/r/freenas/comments/adgef1/slow_sequential_write_speed_new_8_disk_raidz2/ where they did not get the expected performance due to one drive, and the SMART test did not indicate this. Is there a need to do a performance test as part of your burn-in and testing of new drives?
If there is, I think such a test is within the scope of this script, to identify faulty or degraded drives early.
Which tool could be used, I do not know.
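For what it's worth, a crude sequential-throughput probe could flag a drive that reads much slower than its identical siblings. This is a sketch, not part of the script: the dd invocation and the "compare against siblings" heuristic are my assumptions.

```shell
#!/bin/sh
# Rough sequential-read probe: dd a fixed amount and let dd report the rate.
# A drive reading far slower than identical siblings may be degraded.
seq_read_probe() {
  # $1 = device or file to read, $2 = MiB to read
  dd if="$1" of=/dev/null bs=1M count="$2" 2>&1 | tail -n 1
}

# Read-only example (does not modify the drive):
# seq_read_probe /dev/sdX 1024
```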
Thanks for the tool! In case you'd care to include it in the readme:
Fedora 33 Workstation
WDC_WD140EDFZ-11A0VA0 (RED?)
I'm curious: why is the badblocks test performed four times with four different patterns? Is there an option to run a single pattern, and if so, what would be the best pattern to use, even if it's not as rigorous?
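For reference, badblocks does accept a -t option to supply a test pattern, and a single pass with -t random is a common compromise when the default four passes (0xaa, 0x55, 0xff, 0x00) take too long. A sketch follows; the wrapper function is mine, and note that -w is destructive:

```shell
#!/bin/sh
# Single-pattern destructive write test instead of the default four
# patterns. "-t random" runs one pass with a random pattern.
single_pattern_badblocks() {
  # $1 = device, e.g. /dev/sdX  (WARNING: -w destroys all data on it)
  badblocks -b 4096 -wsv -t random "$1"
}

# single_pattern_badblocks /dev/sdX
```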
This is a great script. It would be great to take some features from nwipe, such as showing the time remaining until completion and a PDF report of what was done and what passed or failed.
Hi. I'm new to FreeNAS and setting up my first NAS. Been slowly working my way through figuring everything out.
I'm running your script as we speak on 1 drive of mine. It just returned this:
+-----------------------------------------------------------------------------
+ Run badblocks test on drive /dev/da0: Sat Jun 20 15:46:58 EDT 2020
+-----------------------------------------------------------------------------
Checking for bad blocks in read-write mode
From block 0 to 1465130645
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
6.86% done, 31:09 elapsed. (0/0/0 errors)
I'm not sure what "Inappropriate ioctl" means, and this thing still has 9+ hours to go before it's done. Should I be concerned?
If it helps, the disk I'm running it on is a Seagate Enterprise NAS drive. 6 TB. SATA 3.0 (6 Gbps). Model number is ST6000NM0115-1YZ.
I don't think or expect you to help me with my particular device. Just wondering if you might share some insight?
Thanks!
I saw the note about the long testing times, and looked up expected times for badblocks on the disks I'm using (4TB). I found this useful answer on superuser, which mentioned that adjusting the value used for the "-c" flag made a big difference to the speed:
badblocks -svn /dev/sdb
To get to 1%: 1 Hour
To get to 10%: 8 hours 40 minutes
badblocks -svn -b 512 -c 32768 /dev/sda
To get to 1%: 35 Minutes
To get to 10%: 4 hours 10 minutes
badblocks -svn -b 512 -c 65536 /dev/sda
To get to 1%: 16 Minutes
To get to 10%: 2 hours 35 minutes
I naturally wondered if there's a downside to setting a higher "-c" value. Another helpful answer mentioned this:
The -c option corresponds to how many blocks should be checked at once. Batch reading/writing, basically. This option does not affect the integrity of your results, but it does affect the speed at which badblocks runs. badblocks will (optionally) write, then read, buffer, check, repeat for every N blocks as specified by -c. If -c is set too low, this will make your badblocks runs take much longer than ordinary, as queueing and processing a separate IO request incurs overhead, and the disk might also impose additional overhead per-request. If -c is set too high, badblocks might run out of memory. If this happens, badblocks will fail fairly quickly after it starts. Additional considerations here include parallel badblocks runs: if you're running badblocks against multiple partitions on the same disk (bad idea), or against multiple disks over the same IO channel, you'll probably want to tune -c to something sensibly high given the memory available to badblocks so that the parallel runs don't fight for IO bandwidth and can parallelize in a sane way.
I'm currently testing 6x 4TB disks and my memory use is under 300M, so that doesn't seem to be much of an issue. Is there another reason this option isn't used by the script?
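As a rough sanity check on memory: the buffer badblocks allocates is on the order of the -c value times the -b block size (read-write mode keeps a comparison buffer as well, roughly doubling that). This is my back-of-envelope estimate, not something taken from the script:

```shell
#!/bin/sh
# Back-of-envelope buffer size for a badblocks run: -c blocks-at-once
# multiplied by -b block size, expressed in MiB.
buffer_mib() {
  # $1 = -c value (blocks at once), $2 = -b value (block size in bytes)
  echo $(( $1 * $2 / 1024 / 1024 ))
}

buffer_mib 65536 512    # prints 32  (MiB per buffer)
```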
First of all - thank you for publishing this little gem.
I'm new to this NAS game, bought used disks (16x 2 TB), and wanted to know exactly what I've got on my hands.
So I made myself a bootable USB stick with Ubuntu 18.04, ensured that all tools were available, and fetched this script. It ran very fast at first, and I wondered: "large disks may take a long time"... hmm, what constitutes a large disk?
Then I read the entire readme, carefully, and lo and behold, hidden there in the middle: "disable dry run". Shame on me for not RTFM. But bubbling this up to the top would be very helpful for newcomers.
Lastly, I derived a "clever" method of running the tool for many disks (since I have quite a few drives and didn't want to sit and wait for each one to finish):
ls /dev/sd[a-z] | cut -d'/' -f3 | sudo parallel -I{} ./wrapper.sh {}
# wrapper.sh contains this:
#!/bin/bash -xe
./disk-burnin.sh "$1" > "logs/$1.log"
What I'm in doubt about is: is this a good method? Does the parallel running degrade performance or in any way invalidate the test? I know this also tries to test my CD drive on /dev/sdr,
but hey, worst-case it fails :-)
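On the CD-drive point: one way to avoid feeding optical drives and partitions to the wrapper is to let lsblk filter for whole disks. A sketch; the function name is mine, and it assumes util-linux's lsblk:

```shell
#!/bin/sh
# List only whole-disk devices (TYPE "disk"), skipping optical drives
# ("rom") and partitions ("part"), using lsblk from util-linux.
list_whole_disks() {
  lsblk -ndo NAME,TYPE | awk '$2 == "disk" { print $1 }'
}

# list_whole_disks | sudo parallel -I{} ./wrapper.sh {}
```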
From this, I also feel it would be nice if the script accepted a full device path rather than a bare device name; to me it would be more logical to look in /dev/disk/by-path/ to figure out which disks to test.
I would be more than happy to submit a PR with these changes, I just didn't want to do too much without understanding what I'm actually doing.
EDIT, More questions:
It seems the polling logic does not work with smartmontools release 6.6 (dated 2016-05-07 at 11:17:46 UTC), due to a changed output format (this might be Ubuntu 18.04 related). Also, the mentioned version has an option to run the self-test in the foreground; is there a particular reason for not doing this (maybe because the option didn't exist earlier)?
So in summary, the questions are:
I hope this is at least somewhat helpful feedback. :-)
/Nwillems
Pro:
Contra:
I have been doing this for years without issues btw, ref: https://github.com/ypid/scripts/blob/master/badblocks_and_secure_erase
It appears the issue relates to weak logic surrounding SAS models. However, more test cases should be provided to confirm whether this is a protocol difference or a difference in how manufacturers report SMART data.
It appears the script is incorrectly parsing smartctl results, as the script reports the following:
but sudo smartctl --all /dev/sda clearly shows the expected data.
Expected behavior: correctly parse the results of smartctl so the script can function accordingly.
Hi, thank you so much for writing this little script! There were a few issues I had running it on a new 8 TB WD WD80EZAZ-11TDBA0 in a My Book external hard drive enclosure.
First and foremost, running the script without any modification returned "Please specify device type with the -d option." After a bit of Googling, I found a post from 2014 (https://bugs.freedesktop.org/show_bug.cgi?id=79379) that led me to the solution: adding -d sat after every instance of smartctl in the code. It was quick and dirty, but it worked. I don't think this can be implemented directly in the script, because it may cause breakage for others, but I did want to post it somewhere others can find it if they run into the same issue. This looks to be a problem with smartctl not automatically recognizing the connector.
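One way the script could cope without hard-coding -d sat everywhere is to probe the device once and reuse the option. This is a sketch; the function and the fallback order are my assumptions, not the script's actual behavior:

```shell
#!/bin/sh
# Probe a device once: if plain smartctl can't identify it (common with
# USB-SATA bridges), fall back to forcing the SAT pass-through driver.
smartctl_opts_for() {
  # $1 = device path, e.g. /dev/sdX
  if smartctl -i "$1" >/dev/null 2>&1; then
    echo ""
  else
    echo "-d sat"
  fi
}

# opts=$(smartctl_opts_for /dev/sdX)
# smartctl $opts -a /dev/sdX
```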
I also have some suggestions for the readme file. Since root privileges were required when I ran the script, it might be useful to let people know they can run it in a single line on the terminal as sudo bash ./disk-burnin.sh sdX. Secondly, since the user does need to set the Dry_Run variable to 0, it might also be helpful to bold the line "The script is distributed with 'dry runs' enabled, so you will need to edit the Dry_Run variable, setting it to 0, in order to actually perform tests on drives." or even have that echoed whenever a user runs the script. (I'm not an IT guy or programmer by trade, so I know that modifying scripts is something that might trip newcomers up.)
Thanks for your help with writing this script and making it available to others!
There is no check for the availability of smartmontools. When smartmontools isn't installed, running the script gives:
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 263: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 264: smartctl: not found
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
[2020-10-05 09:49:59 UTC] + Started burn-in
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
[2020-10-05 09:49:59 UTC] Host:ubuntu-server
[2020-10-05 09:49:59 UTC] OS Flavor: Linux
[2020-10-05 09:49:59 UTC] Drive: /dev/sdc
[2020-10-05 09:49:59 UTC] Disk Type: non-mechanical
[2020-10-05 09:49:59 UTC] Drive Model:
[2020-10-05 09:49:59 UTC] Serial Number:
[2020-10-05 09:49:59 UTC] Short test duration: minutes
[2020-10-05 09:49:59 UTC] 0 seconds
[2020-10-05 09:49:59 UTC] Extended test duration: minutes
[2020-10-05 09:49:59 UTC] 0 seconds
[2020-10-05 09:49:59 UTC] Log file:/home/rakoczy/diskc/burnin-.log
[2020-10-05 09:49:59 UTC] Bad blocks file:/home/rakoczy/diskc/burnin-.bb
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
[2020-10-05 09:49:59 UTC] + Running SMART short test
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 1: eval: smartctl: not found
[2020-10-05 09:49:59 UTC] SMART short test started, awaiting completion for 0 seconds ...
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 483: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 490: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 483: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 490: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 483: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 490: eval: smartctl: not found
^C
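A guard like the following at the top of the script would fail fast with one clear message instead of a stream of "smartctl: not found" errors. This is a sketch; the function name is mine:

```shell
#!/bin/sh
# Fail fast if required external tools are missing from PATH.
require_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || {
      echo "Error: required tool '$tool' not found in PATH" >&2
      return 1
    }
  done
}

# require_tools smartctl badblocks || exit 1
```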
Hello, thanks for this script and the write up on your blog.
I'm running your script on a new disk under FreeBSD after having 1 of 3 new disks fail on me. (I'm a total noob, and the last thing I expected (trying to rescue my raid from near death) was a problem with the new disk!)
Anyway, I'm now running a burn in on the RMA'd replacement, but I forgot to execute this first:
sysctl kern.geom.debugflags=0x10
Should I now:
I read somewhere that after you've set this kernel flag you should unset it again later (e.g. by rebooting) to avoid 'problems'... (Note that my pool is currently online, DEGRADED but backed up, as I'm using the FreeNAS box itself to burn in the new disk.)
Sorry for the noob questions, and thanks for any advice,
Dan.
I found a good manual for burn-in testing on reddit:
In steps 3-5, he also makes an additional check with ZFS and f3write.
Would it make sense to add these steps in your script, too?
When trying to run this script on some 18 TB drives, badblocks threw the following error:
badblocks: Value too large for defined data type invalid end block (4394582016): must be 32-bit value
It seems this is most likely because the block size is not large enough for drives of this size, so the block count overflows badblocks' 32-bit limit. Can we get a dynamic block size based on drive size, or another command-line parameter to set this manually if we choose?
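badblocks stores block numbers in 32-bit values, so the block count must stay below 2^32. A sketch of choosing the smallest power-of-two block size that fits (the function is mine; in practice -b 8192 suffices for 18 TB drives):

```shell
#!/bin/sh
# Pick the smallest power-of-two block size that keeps the badblocks
# block count below 2^32 (its 32-bit limit).
pick_block_size() {
  # $1 = device size in bytes (e.g. from: blockdev --getsize64 /dev/sdX)
  bs=1024
  while [ $(( $1 / bs )) -ge 4294967296 ]; do
    bs=$(( bs * 2 ))
  done
  echo "$bs"
}

pick_block_size 18000000000000    # 18 TB -> prints 8192
```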
Hi, I came across this script and it's incredibly helpful, so thanks! One question I have is why you opted to change the block-size flag to -b 8192 rather than, say, doubling the blocks checked at once using -c?
I found that running badblocks with -b 4096 was writing at around 25 MB/s, which would have resulted in my 8 TB drive completing after 16 days. By modifying the call to badblocks to use -b 4096 -c 128 (double the default), I saw write speeds nearly double. I didn't fancy going higher, just to avoid any potential issues with badblocks misreporting anything, but figured there must be a sweet spot somewhere for larger drives?
Thanks.