May be mostly moot by the time this makes it through to the BALUG
list server and out. In any case ...
----- Forwarded message from Michael.Paoli(a)cal.berkeley.edu -----
Date: Tue, 20 Nov 2018 18:59:24 -0800
From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
Subject: outage: [www.]{sf-lug,balug}.org
To: SF-LUG <sf-lug(a)linuxmafia.com>
And ... outage again :-\
Guesstimating if it's the "usual", will likely be on-line again
by sometime Wednesday evening (I may or may not get to it before then).
Per earlier:
impacts all [*.]sf-lug.{org,com} & [*.]balug.org
SF-LUG lists remain up and on-line (at least as far as I'm aware).
Also, DNS mostly not impacted (in general, slaves remain functional),
though there may be some additional latencies on DNS due to failovers
to other nameservers.
> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
> Subject: "all better now:" Re: outage: [www.]{sf-lug,balug}.org
> Date: Wed, 10 Jan 2018 20:15:04 -0800
> And ... again, same deal, went off-line at:
> <~= 2018-01-11T01:23:21+00:00 2018-01-10T17:23:21-08:00
> and back on-line by:
>> ~= 2018-01-11T04:04:06+00:00 2018-01-10T20:04:06-08:00
> ... again, swift kick to the power switch on DSL modem to reset it,
> and "all better" - again ... at least for now.
>
>> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
>> Subject: "all better now:" Re: outage: [www.]{sf-lug,balug}.org
>> (ETR: eveningish)
>> Date: Wed, 06 Dec 2017 18:08:09 -0800
>
>> And ... swift kick to the power switch on DSL modem to reset it,
>> and "all better now".
>> Looks like the outage started around:
>> BALUG PING: ping6 2001:470:1f04:19e::2 FAILED 2017-12-06T20:53:56+00:00
>> GW PING: ping 198.144.194.233 FAILED 2017-12-06T20:54:09+00:00
>> and service restored a few minutes ago
>>
>>
>>> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
>>> Subject: outage: [www.]{sf-lug,balug}.org (ETR: eveningish)
>>> Date: Wed, 06 Dec 2017 14:46:38 -0800
>>
>>> So,
>>>
>>> They were up and on-line this morning, at least as late as about
>>> mid-morning or later, but off-line now.
>>> POTS line appears to be working, but not IP,
>>> likely the DSL modem needs to be reset (powercycled) again ...
>>> but don't have means to do that remotely.
>>>
>>> Presumably I'll have this resolved sometime this evening once I'm
>>> on-site again.
>>>
>>> impacts all [*.]sf-lug.{org,com} & [*.]balug.org
>>> SF-LUG lists remain up and on-line (at least as far as I'm aware).
----- End forwarded message -----
Tossing this one onto (suitable) list, because ... well, why not! :-)
> From: "Rick Moen" <rick(a)linuxmafia.com>
> Subject: Re: Login to BALUG Wiki
> Date: Thu, 15 Nov 2018 17:57:40 -0800
> Quoting Michael Paoli (Michael.Paoli(a)cal.berkeley.edu):
>
>> Seems more logical would be to have that information on the
>> "self-"registration wiki page ... otherwise folks may not see it
>> anyway.
>
> After successful login, I put a notice on
> https://www.wiki.balug.org/wiki/doku.php , near the top.
Thanks ... I also tweaked the language (very) slightly just for slightly
better approximation of (mostly historical) reality.
> I ack your point that autoregistrations were few and far between, but
> lack of any information about how to get a login meant that anyone
> wanting to help would have no idea how to proceed.
>
> Sorry, but what 'self-' registration page? Not sure what this reference
> is to.
Heck, been so long I forget where the self-registration wiki page even
is/was.
>> I figured manual was "quite good enough" to stop the immediate issue
>> (the high volume of bot registrations was causing the wiki to bog down
>> and fail in annoying ways).
>
> I sympathise.
>
> Restoring login _with_ CAPTCHA plugin would be a 'have your cake and eat
> it too' solution, IMO -- if/when you get around to it.
Yep *somewhere* on the "todo" list.
Background/history - have a look at:
curl -s --range 134225-268633 https://www.archive.balug.org/log.txt
And *also* very handy for me too ... looks like I'd *disabled* the
registration page - so likely it doesn't show at all, or just won't
let one self-register.
Greetings! This is an advisory about ns1.linuxmafia.com DNS nameserver
downtime having ended. Root cause: AT&T (_not_ my ISP) sabotaged
my ADSL at their local exchange around 8am Tuesday, then took about
2 days and 7 hours to find and fix their problem. All services
are back.
ns1.linuxmafia.com is back to doing auth. nameservice, as
arranged, for the following domains of yours:
balug.org (slave)
sf-lug.com (slave)
sf-lug.org (slave)
Evidence below is via fugly shell script ~/bin/testns that
I just cranked out:
#!/bin/bash
# Print the SOA serial reported by each nameserver that whois lists for a domain.
domain=$1
for ns in $(whois "$domain" | grep "Name Server" | \
    awk '{ print $3 }' | tr '\r\n' ' ');
do
    echo -n "$ns" 'is '
    dig +short @"$ns". "$domain". SOA | awk '{print $3}'
done
:r! bin/testns balug.org
NS1.LINUXMAFIA.COM is 1540540908
NS1.SVLUG.ORG is 1540540908
NS1.BALUG.ORG is 1540540908
:r! bin/testns sf-lug.com
NS.PRIMATE.NET is 1540541199
NS1.LINUXMAFIA.COM is 1540541199
NS1.SF-LUG.COM is 1540541199
:r! bin/testns sf-lug.org
NS1.LINUXMAFIA.COM is 1540541435
NS.PRIMATE.NET is 1539414019
NS1.SVLUG.ORG is 1540541435
NS1.SF-LUG.ORG is 1540541435
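In the sf-lug.org run above, one secondary (NS.PRIMATE.NET) still reports an older serial than the others. What follows is a minimal sketch, not part of the original testns script, that reads testns-style output on stdin and flags any nameserver whose serial is lower than the highest one seen. The name check_serials is made up for illustration, and comparing serials as plain integers is an assumption that ignores serial-number wraparound.

```shell
#!/bin/bash
# check_serials: hypothetical helper, not part of the original testns script.
# Reads lines of the form "<nameserver> is <serial>" on stdin, prints every
# nameserver whose serial is lower than the highest serial seen, and returns
# non-zero if any such nameserver was found. Other lines are ignored.
check_serials() {
    awk '
        $2 == "is" && $3 ~ /^[0-9]+$/ {
            ns[NR] = $1
            serial[NR] = $3
            if ($3 + 0 > max + 0) max = $3
        }
        END {
            lagging = 0
            for (i in ns) {
                if (serial[i] + 0 < max + 0) {
                    printf "%s lags: %s < %s\n", ns[i], serial[i], max
                    lagging = 1
                }
            }
            exit lagging
        }
    '
}
```

Possible usage, assuming the script is still at ~/bin/testns: `~/bin/testns sf-lug.org | check_serials`. A non-zero exit status then means at least one nameserver has not yet picked up the latest zone.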
DO NOT REPLY ALL! (*unless* you're a member of **both** lists)
Just an FYI mostly, but temporarily we have a bit of reduced
redundancy on DNS for balug.org/sf-lug.org/sf-lug.com
Expecting this to be relatively temporary (at least once AT&T finally
gets their sh*t together ... at least for a little bit).
Anyway, in case anyone wonders or notices. Hopefully it will
be *all better soon* ... at least **relatively** soon.
Impacts should mostly be pretty minimal - some initial queries might
occasionally take a bit longer (possibly timing out on first
(randomly selected) authoritative nameserver), but should otherwise
generally have almost zero impact (due to TTLs, data will generally
be cached for a while once successfully resolved).
$ (for d in balug.org sf-lug.org sf-lug.com; do dig +noall +answer \
+nottl "$d". NS | sed -e 's/[ ]\{1,\}/ /g'; done) | fgrep \
linuxmafia.com
balug.org. IN NS ns1.linuxmafia.com.
sf-lug.org. IN NS ns1.linuxmafia.com.
sf-lug.com. IN NS ns1.linuxmafia.com.
$
----- Forwarded message from rickmoen(a)gmail.com -----
Date: Wed, 24 Oct 2018 01:59:36 -0700
From: "Rick Moen" <rickmoen(a)gmail.com>
Reply-To: rick(a)deirdre.net
Subject: ns1.linuxmafia.com downtime
To: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
Greetings! This is an advisory about current downtime of my
ns1.linuxmafia.com DNS nameserver, starting about 8am on Tuesday, Oct.
23rd. Near as I and Mike Durkin, proprietor of Raw Bandwidth
Communications ('RBC', my ISP) have been able to determine, AT&T somehow
sabotaged my household ADSL, and thus took my entire household including
the server offline. Mike is now trying to get them to fix their screw-up.
Meantime, ns1.linuxmafia.com is _not_ doing auth. nameservice, as
arranged, for the following domains of yours:
balug.org (slave)
sf-lug.com (slave)
sf-lug.org (slave)
I'm advising everyone I'm doing auth. DNS for, of the ongoing outage,
so this is your notice. Hope to give better news soon.
----- End forwarded message -----
BALUG VM was down for a fair while earlier today.
It has now been up again for over 7 hours.
Looks like there was an I/O hiccup on the physical host,
which didn't particularly impact the physical host itself, but
was enough of an interruption (delay) that the BALUG VM kernel
panicked.
Did have a 3rd hard drive testing, etc. on the physical host
at the time ... might've hit issues and possibly it did a bus
reset? Who knows for sure. Anyway ...
Went down sometime after:
2018-09-02T01:27:36-07:00
and was brought back up around:
2018-09-02T13:39:30-07:00
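For anyone wanting the length of that window, here is a small sketch of the arithmetic. It assumes GNU date (for its -d option), and the function name outage_duration is invented for illustration. Since the host went down "sometime after" the first timestamp, the result is an upper bound.

```shell
#!/bin/bash
# outage_duration: hypothetical helper; assumes GNU date (the -d option).
# Takes two ISO 8601 timestamps (start, end) and prints the elapsed
# time between them as HH:MM:SS.
outage_duration() {
    local start end secs
    start=$(date -d "$1" +%s) || return 1
    end=$(date -d "$2" +%s) || return 1
    secs=$((end - start))
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}
```

Called as `outage_duration 2018-09-02T01:27:36-07:00 2018-09-02T13:39:30-07:00`, this prints 12:11:54, so the VM was down for at most a bit over twelve hours.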
Various bits I noted in log:
$ curl -s --range 375155-378925 http://www.archive.balug.org/log.txt
2018-09-02 Michael Paoli
host crashed sometime after:
2018-09-02T01:27:36-07:00
but probably before about:
2018-09-02T01:35:00-07:00
on console, we got:
# [54894.969741] sd 0:0:0:0: [sda] tag#3 ABORT operation started
[54900.078084] sd 0:0:0:0: ABORT operation timed-out.
[54900.080312] sd 0:0:0:0: [sda] tag#2 ABORT operation started
[54905.198438] sd 0:0:0:0: ABORT operation timed-out.
[54905.200517] sd 0:0:0:0: [sda] tag#1 ABORT operation started
[54905.357128] Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"/build/linux-AcJpTp/linux-4.9.110/drivers/scsi/sym53c8xx_2/sym_hipd.c", line
3399
[54905.357128]
[54905.367774] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-8-amd64
#1 Debian 4.9.110-3+deb9u4
[54905.370776] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[54905.372768] 0000000000000000 ffffffff84f31e54 ffff9e2f75d5a300
ffff9e2f7fc03e50
[54905.375471] ffffffff84d7f6ad 0000000000000020 ffff9e2f7fc03e60
ffff9e2f7fc03df8
[54905.378226] 3ea9db08406f9671 0000000100d04ae4 ffffffffc048a250
ffffffffc0489e80
[54905.380982] Call Trace:
[54905.381867] <IRQ> [54905.382541] [<ffffffff84f31e54>] ?
dump_stack+0x5c/0x78
[54905.384428] [<ffffffff84d7f6ad>] ? panic+0xe4/0x23f
[54905.386164] [<ffffffffc048512e>] ? sym_interrupt+0x1c9e/0x1e80 [sym53c8xx]
[54905.388543] [<ffffffffc03aa010>] ?
usb_hcd_poll_rh_status+0x170/0x170 [usbcore]
[54905.391102] [<ffffffffc03a9fc9>] ?
usb_hcd_poll_rh_status+0x129/0x170 [usbcore]
[54905.393627] [<ffffffffc03aa010>] ?
usb_hcd_poll_rh_status+0x170/0x170 [usbcore]
[54905.396144] [<ffffffff84ce7562>] ? call_timer_fn+0x32/0x120
[54905.398071] [<ffffffffc047ea4b>] ? sym53c8xx_intr+0x3b/0x70 [sym53c8xx]
[54905.400386] [<ffffffff84cd418e>] ? __handle_irq_event_percpu+0x7e/0x1a0
[54905.402673] [<ffffffff84cd42e0>] ? handle_irq_event_percpu+0x30/0x70
[54905.404898] [<ffffffff84cd4359>] ? handle_irq_event+0x39/0x60
[54905.406901] [<ffffffff84cd7870>] ? handle_fasteoi_irq+0xa0/0x170
[54905.409001] [<ffffffff84c27faf>] ? handle_irq+0x1f/0x30
[54905.410834] [<ffffffff852187ee>] ? do_IRQ+0x4e/0xe0
[54905.412528] [<ffffffff85216556>] ? common_interrupt+0x96/0x96
[54905.414523] <EOI> [54905.415216] [<ffffffff852151f0>] ?
__sched_text_end+0x1/0x1
[54905.417231] [<ffffffff852154c2>] ? native_safe_halt+0x2/0x10
[54905.419235] [<ffffffff8521520a>] ? default_idle+0x1a/0xd0
[54905.421137] [<ffffffff84cbc7da>] ? cpu_startup_entry+0x1ca/0x240
[54905.423215] [<ffffffff8593df5e>] ? start_kernel+0x447/0x467
[54905.425186] [<ffffffff8593d120>] ? early_idt_handler_array+0x120/0x120
[54905.427438] [<ffffffff8593d408>] ? x86_64_start_kernel+0x14c/0x170
[54905.429842] Kernel Offset: 0x3c00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[54905.433484] ---[ end Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"/build/linux-AcJpTp/linux-4.9.110/drivers/scsi/sym53c8xx_2/sym_hipd.c", line
3399
[54905.433484]
... also noted within that same timeframe, on physical host, there
were some storage related events ... but no hard failures seen on that
physical host and no outages or failures or such observed on that
physical host:
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sda [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 69
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 70
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 31 to 30
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 66
$
Amusing EximConfig overzealousness.
----- Forwarded message from Mail Delivery System <Mailer-Daemon(a)linuxmafia.com> -----
Date: Fri, 22 Jun 2018 22:11:56 -0700
From: Mail Delivery System <Mailer-Daemon(a)linuxmafia.com>
To: rick(a)linuxmafia.com
Subject: Mail delivery failed: returning message to sender
This message was created automatically by mail delivery software.
A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:
balug-talk(a)lists.balug.org
SMTP error from remote mail server after end of data:
host mx.lists.balug.org [198.144.194.238]: 550-Rejected message body text:
URL link to prohibited file:
550-http://linuxmafia.com/faq/Admin/linuxmafia.com
550-.
550-[EximConfig-2.5-balug.org-Body-Reject]
550-.
550-.Verify: verified-29120-balug-talk(a)lists.balug.org
550-Contact: postmaster(a)balug.org
550-.
550-Sorry, your message has been rejected because
550-its body text/content is prohibited for the
550-above reason.
550-.
550-We apologise if you have sent a legitimate
550-message and it has been blocked. If this is
550-the case, please re-send adding verified-29120-
550-to the beginning of the E-mail address of each
550-recipient. If you do this, your message will
550-get through these restrictions.
550-.
550-If your message has been incorrectly blocked,
550-please let us know at the above contact address.
550 .
------ This is a copy of the message, including all the headers. ------
Return-path: <rick(a)linuxmafia.com>
Received: from rick by linuxmafia.com with local (Exim 4.72)
(envelope-from <rick(a)linuxmafia.com>)
id 1fWapc-00029x-Cx
for balug-talk(a)lists.balug.org; Fri, 22 Jun 2018 22:11:28 -0700
Date: Fri, 22 Jun 2018 22:11:28 -0700
From: Rick Moen <rick(a)linuxmafia.com>
To: balug-talk(a)lists.balug.org
Subject: Re: [BALUG-Talk] (forw) Re: [License-review] Fwd: [Non-DoD Source]
Resolution on NOSA 2.0
Message-ID: <20180623051128.GD32401(a)linuxmafia.com>
References: <20180622212441.GA32401(a)linuxmafia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20180622212441.GA32401(a)linuxmafia.com>
Organization: If you lived here, you'd be $HOME already.
X-Mas: Bah humbug.
X-Clacks-Overhead: GNU Terry Pratchett
User-Agent: Mutt/1.5.20 (2009-06-14)
X-SA-Exim-Connect-IP: <locally generated>
X-SA-Exim-Mail-From: rick(a)linuxmafia.com
X-SA-Exim-Scanned: No (on linuxmafia.com); SAEximRunCond expanded to false
Just because, I'll sharpen some of those points that are relevant to
Tuesday's BALUG discussion.
Speaker Liz Krumbach, after her excellent talk about DC/OS, gave
generously of her time for a Q&A session with attendees sitting around
the table with her at Henry's Hunan.
In passing, Liz mentioned (and I'll paraphrase, here, and apologise in
advance if I misstate anything) that, although her heart is still in the
notion of independent Internet hosting by individuals and community
institutions, the reality is that pretty much everything's moving to
some form of hosted cloud storage, and that even she and her husband now
freely put some fairly sensitive personal data in Google Docs these
days, even though they know all about the drawbacks. (Fair enough.
That's her perception, and her and her husband's choice, and I know Liz
well enough to understand her actions to be thoughtful.)
She also pointed out, again perfectly reasonably, that increasingly most
people don't even host their own files in practically any category,
streaming their video content, streaming their music, and so on. Even
though they _could_ be more independent.
Two BALUG attendees, whom I could name (but will be nice), _then_ chimed in,
going rather far beyond what Liz said (while acting as if they were
agreeing), saying that anyone who attempts to run a personal Internet
server, and they said they knew there were some sitting around the
table, were certifiable lunatics, and that it was utterly impractical.
(Hi, I'm Rick Moen, listadmin of most of the Bay Area's Linux User Group
mailing lists and owner/operator of linuxmafia.com / unixmercenary.net,
home of two LUGs, a major community calendar, etc. on old castoff
machines I run literally on fixed IP on home ADSL in my garage. I've
done this sort of thing since the 1980s, when as a staff accountant with
no IT background, I started running my own Internet servers because
nobody told me I couldn't. I just threw things together, tinkered a
little, learned on the fly, and it worked. _Later_, I became a system
administrator as my profession.)
> The Scheme community failed to take basic steps to ensure that there was
> functional, periodically tested failover to independent machine
> resources, and also tested offsite backups.
The core of this is: Do you have a credible plan of action for each
of the things that might plausibly be expected to go wrong? Planning
for Internet infrastructure, no different from planning for _any_
organised activity, requires contingency planning. Running a
department, or pretty much anything else? Then, make sure people know
what to do or not do if there's a fire. Or a power outage. Or a
flooded floor in an office building because the sprinklers went off or
the pipes broke upstairs.
In the case of computers and network infrastructure, it's the same
principle, and thus not really different.
Hardware: Moving-rust mass storage, fans, cables, and input devices
wear out most often, from mechanical stress. SSDs wear out too, from a
different type of stress. Do you have a plan when, not if, that
happens?
Software: Upgrades can go tragically bottom-side up if coincidental to a
hardware failure. (Ask me how I know.) Tired sysadmins, or any junior
SA armed with root access, are even more dangerous to systems than a
programmer bearing a screwdriver. Mishaps happen. Do you have a
documented plan for _what_ to back up and exactly how to restore, and
have it somewhere safe? (I do, and archive.org and cache.google.com
back it up for me:
http://linuxmafia.com/faq/Admin/linuxmafia.com-backup.html )
Is your DNS free of single points of failure, and are multiple people able
and willing to administer the domain and DNS?
Do you have either failover or a credible plan to bring replacement
hardware / software online if it's important but not urgent? Like:
> Life is imperfect, so I know these things usually amount to 'We didn't
> have time and energy to do more than a half-assed job, so that's what we
> did', but, seriously, consider what the low-hanging fruit would have
> been, just an afternoon's work on the failover problem:
>
> 1. Separate person sets up a *ix machine on static IP in his/her
> (separate) premises. Today, this could be on a junk P4 or PIII.
>
> 2. That person coordinates with the first guy to do daily rsyncs
> of all important data to the failover box.
>
> 3. In the ideal case, the failover box would also be fully configured
> as a hot spare, but this is a fine point. It would probably suffice
> for a few qualified experts to agree that all essential data are being
> captured daily. If production host fails or the guy in charge goes
> crazy, parties able to control DNS flip it, and if necessary there's a
> mad scramble to make the failover host fully functional. The main thing
> that cannot be done without is the data.
>
> 4. Some ongoing monitoring is required, to make sure failover
> replication doesn't silently break.
>
> For extra credit, once every six months, _deliberately_ flip the
> failover switch. Nothing is quite as good at proving ability to
> failover as doing failover.
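Steps 2 and 4 of the quoted list (daily rsync plus a check that replication has not silently broken) might be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the timestamp-file approach, and any hosts or paths used with it are hypothetical, and the freshness check assumes GNU date (for its -r FILE option).

```shell
#!/bin/bash
# Hypothetical sketch of steps 2 and 4; names, hosts and paths are placeholders.

# replicate SRC DEST STAMP: mirror SRC to DEST with rsync and, only if
# rsync succeeds, update the timestamp file STAMP.
replicate() {
    rsync -a --delete "$1" "$2" && touch "$3"
}

# replica_is_fresh STAMP MAX_AGE_SECONDS: succeed only if STAMP exists and
# was last modified no more than MAX_AGE_SECONDS ago (assumes GNU date -r).
replica_is_fresh() {
    local stamp=$1 max_age=$2 now mtime
    [ -e "$stamp" ] || return 1
    now=$(date +%s)
    mtime=$(date -r "$stamp" +%s) || return 1
    [ $((now - mtime)) -le "$max_age" ]
}
```

On the failover box, a daily cron job could run something like `replicate primary.example.org:/srv/data/ /srv/replica/ /var/tmp/replica.stamp`, and a separate job could run `replica_is_fresh /var/tmp/replica.stamp 172800 || echo "failover replication is stale" >&2`. Because the stamp file is touched only after a successful rsync, a replication that quietly stops working shows up as a stale stamp rather than going unnoticed.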
People look at the rickety old PIII in my garage and think 'Eek, that's
fragile.' Sure it is. But highly replaceable, as witness the fact that
linuxmafia.com has gone through about eight motherboards in a variety of
throwaway, worthless server boxes, and as many sets of hard drives.
It certainly can die. And then I just deploy another (and this time
better) one.
If I wanted to, if I weren't too lazy to do the work, it wouldn't be
difficult to do hot-sparing. Constructing the same server twice is
really no more difficult than doing it once, and then at your leisure
you can set up periodic rsync backup scripts to update the spare box's
data, to slave the spare's database process off the master's so it's
always kept updated, and so on. Resources required: Two throwaway
server machines, two static IP addresses, a little thinking and testing,
part of a weekend. For extra credit, have a hot-spare offsite
somewhere.
For additional extra credit, enroll all instructions for constructing a
spare machine from bare metal (or minimal OS install) into your choice
of configuration management software (Chef, Puppet, Ansible, etc.) to
make spinning one up even faster and easier.
If this is lunacy, it's the lunacy that gets things done. And the point
is: It's not friggin' brain surgery, including the high availability /
failover pieces that give you for free all the _important_ bits of what
the big boys charge money for. The trick is to _know what you're doing_
and verify that the important parts are covered.
And knowing what you're doing isn't that hard. I figured it out just by
tinkering and paying attention back when I was a _staff accountant_.
No, you won't achieve five nines of reliability, but you won't need to;
you're not running NASDAQ. You won't be completely immune to DDoSing,
but you won't need to; you're not Cloudfront.net.
What you _can_ do without too much difficulty as an aspiring technical
Linux user -- because this happens to be what Linux is really good at --
is basic, not too fancy, generic Internet services that Linux has done
extremely well for around a quarter century: e-mail, mailing lists,
regular ol' Web sites, ssh, a medley of other things. Making the
contents be easily portable indefinitely comes along for free, which is
definitely not true of more-complex online content and fancier hosted
services. And doing sysadminly assurance of failover, backup/restore,
etc. is not a lot harder.
Of the two guys who called anyone who makes the effort and walks the walk of
open source and autonomous Internet presence a 'lunatic',
one is a San Franciscan who relies on a partimus.org Internet virthost
that he appears to not do anything to keep operating (that host being
not even an autonomous Internet server, but rather just a virtual-hosted
domain on Dreamhost, the specialty WordPress hosting provider terrible
at e-mail/mailing lists that BALUG's admins, mostly Michael Paoli and
partly me, ran a special project just to get BALUG off their hosting).
The other is a Berkeley guy with an ever-changing name who sharecrops
for Google, i.e., using nothing but free-of-charge Google services.
Neither of them has to my knowledge ever even _aspired_ to learn how to
set up and run Internet infrastructure for the Linux Internet community,
yet both are sure that someone who does is a 'lunatic', and that what many
such people have done pretty easily and for peanuts in monetary outlay
is hopelessly impractical.
I see. Right. Puts it in context.
--
Cheers, "A recursive .sig
Rick Moen Can impart wisdom and truth.
rick(a)linuxmafia.com Call proc signature()"
McQ! (4x80) -- WalkingTheWalk on Slashdot
----- End forwarded message -----