| Author |
Message |
Faeandar
Guest
|
Posted:
Sat Dec 04, 2004 5:15 am Post subject:
Re: Distributed filesystems (was: Re: PanFS?) |
|
|
On Fri, 3 Dec 2004 18:25:44 -0500, "Bill Todd"
<billtodd@metrocast.net> wrote:
| Quote: |
"Faeandar" <mr_castalot@yahoo.com> wrote in message
news:6hg1r0h8hide29uohvkj5kbdaaa3i19ifp@4ax.com...
On Fri, 3 Dec 2004 14:29:21 -0500, "Bill Todd"
billtodd@metrocast.net> wrote:
...
In implementations like Lustre, it refers to low-level local metadata
stored
on all storage units, plus file-level metadata stored in a coordinating
central metadata server (or metadata server cluster). The low-level
local
metadata is completely partitioned (i.e., of only local significance) and
hence has no scaling problem - but by offloading that portion of metadata
management from the central metadata server helps it scale too.
Help me understand this. If file level metadata is coordinated by a
single server how is that not a large bottleneck?
By making the server suitably hefty. Or, as noted above, by using a
metadata server cluster to divvy up the load (though even a relatively
modest SMP system can handle impressive levels of metadata management before
clustering it is necessary for anything save to increase availability via
fail-over mechanisms).
For reads I can see
why it's not, but for writes (in the 10,00 node cluster they talk
about) this would be like trying to stuff the Atlantic through a hose.
At least in my understanding.
The actual user data doesn't pass through the metadata server, just the
metadata operations. So in a product like Lustre (or, IIUC, Storage Tank),
the metadata server merely need be made aware of the write operation to
update things like last-modified date and EOF mark, while the grunt-level
data servers handle the actual movement to disk (and at least potentially
any interlocking involved, though the interaction of byte-range locks with
full-file locks on a file large enough to span many data servers becomes an
interesting challenge then).
Directory operations may often place a heavier load on metadata servers than
even write-intensive file activity.
|
I understand only metadata ops are passed through a metadata server
and not the data itself, but in the case of high performance computing
that load is sizeable. And the more node servers you add to the
cluster the more that load increases.
And locking, though handled at the node server, still has to be
distributed so the other nodes know what's locked and what's not.
This also can be large for compute farms.
I see how you thought I was an idiot with the Atlantic analogy. I was
just illustrating a point, exagerated though it was.
| Quote: |
...
Until SAN bandwidths and latencies approach those of local RAM much more
closely than they do today, there will likely be little reason to use
hardware management of distributed file caches - especially since the
most
effective location for such caching is very often (save in cases of
intense
contention) at the client itself, where special hardware is least likely
to
be found.
My guess is we will see something like NVRAM cards inside the hosts
all connected via IB or some other high bandwidth low latency
interconnect. For now though gigabit seems to do the trick primarily
for the reason you mentioned.
Although I gotta say, it is extremely appealing to use NVRAM in thos
hosts anyway just for the added write performance boost. If only they
could be clustered by themselves and not require host-based
clustering.... Ah well, someday.
Use of NVRAM is eminently feasible today, and has (or at least with suitable
design need have) relatively little connection to distributed caching (let
alone with hardware support for same) - especially if clients aren't
completely trustworthy.
|
It certainly is feasible, but having battery backed NVRAM does little
good if it isn't sync'd with a failover partner or pool. In the case
of a critical host failure for long periods of time that data is
essentially lost.
~F |
|
| Back to top |
|
 |
Bill Todd
Guest
|
Posted:
Sat Dec 04, 2004 7:30 am Post subject:
Re: Distributed filesystems (was: Re: PanFS?) |
|
|
"Faeandar" <mr_castalot@yahoo.com> wrote in message
news:3rv1r0h8dbdb18b8vrg861vikbrjppthhm@4ax.com...
....
| Quote: | I understand only metadata ops are passed through a metadata server
and not the data itself, but in the case of high performance computing
that load is sizeable.
|
My impression is that HPC applications don't typically care all that much
about consistent last-modified dates. And for files too large to be handled
locally at a single data-server node, it's certainly possible for the
metadata server just to track which data server currently contains the
file's EOF and off-load end-of-file-mark maintenance to that server (thus
the metadata server is only bothered each time the file grows or shrinks
enough for its EOF mark to move to another data server - or not at all if
the file is confined to a single data server, if that's where individual
file metadata is managed).
The secret of off-loading centralized metadata servers is to minimize the
need to talk with them. 'Object-oriented' storage takes the first step by
isolating the metadata servers from the details of file placement below the
node level (even if the metadata server tracks placement explicitly at the
node level rather than uses some algorithm to determine distribution across
the data servers, that's still a major reduction in metadata server
workload). The next step is to isolate the metadata servers from metadata
management associated with individual files, farming that off to the data
server on which the file resides (or the first data server of a set over
which the file is spread). The final step is to distribute directories as
well, eliminating the metadata servers entirely and making the system a
group of peers, but this requires some ingenuity to avoid significantly
increasing the overhead of path look-ups.
And the more node servers you add to the
| Quote: | cluster the more that load increases.
And locking, though handled at the node server, still has to be
distributed so the other nodes know what's locked and what's not.
|
The only node that needs to know what's locked is the node controlling the
data that's locked (and perhaps a mirroring partner, for availability):
everyone has to go there for that portion of the file anyway, in the kind of
system I was describing (where clients have sufficient intelligence to
target data servers directly in most cases after acquiring the file map from
the metadata server). It's also nice for that data server to be able to
delegate some locking to a specific client in the absense of contention, but
that doesn't change the basic approach.
....
| Quote: | Use of NVRAM is eminently feasible today, and has (or at least with
suitable
design need have) relatively little connection to distributed caching
(let
alone with hardware support for same) - especially if clients aren't
completely trustworthy.
It certainly is feasible, but having battery backed NVRAM does little
good if it isn't sync'd with a failover partner or pool. In the case
of a critical host failure for long periods of time that data is
essentially lost.
|
NVRAM works most efficiently when in the service of a single server. Since
you have to allow for the failure of that server anyway, you just arrange
partnering at the server level without any special considerations for NVRAM
that may be present for performance reasons. In fact, if you're
sufficiently confident in your UPSs and have the two partners on separate
ones, you can dispense with NVRAM entirely (the UPS-backed system RAM will
suffice, as long as you can be sure that at least one of the mirror partners
will continue running long enough to dump its contents to disk and revert to
more conservative strategies if its partner fails).
In general, when trying to avoid single points of failure, using
coarse-granularity partnering with near-commodity-level components is far
less expensive than any attempt to avoid failure points within any single
node. The only real exception to this is when you're not shooting merely
for availability but absolute reliability, and must therefore use redundant
everything (with hardware comparators) within each single server to guard
against the possibility that it will fail without noticing the problem and
proceed to corrupt its portion of the database (even then, you still need
geographically-separated mirror partners to guard against site-level
failures, so the cost can easily reach an order of magnitude greater than
simple commodity server mirroring).
- bill |
|
| Back to top |
|
 |
Guest
|
Posted:
Sat Dec 04, 2004 10:17 am Post subject:
Re: Distributed filesystems (was: Re: PanFS?) |
|
|
In article <jrp1r0hj9qjckuiq0vr7i6guj8adeqamrf@4ax.com>,
Faeandar <mr_castalot@yahoo.com> wrote:
....
| Quote: | Can it scale to 10,000 nodes? Well not currently (read-only not
withstanding)
|
I've heard of 1024-node clusters today. Ask some of the 1st-tier
vendors. And everyone in the field is working on improving
scalability. Clearly, object-storage is one of the tools in the
arsenal. As is NVRAM, but that has problems (that Bill explained
well).
| Quote: | but 10 years ago scaling this type of file system to
anything beyond 1 server was voodoo in the open systems world.
|
What was there in the open systems world 10 years ago? 10 years ago
you could run Linux version 0.99.14 (which IMHO had ext but not ext2
file systems, it was my first Linux) if you wanted open-source, or
dream of Linux V1, or you could buy BSDI. I think none of the fully
free *bsds was available yet: I remember considering the Jolitzes
386bsd in 1995 (or am I one year late?).
In the commercial world, there was Ultrix on Alpha, SunOS, HP-UX V8,
AIX 3, and some IRIX version (and NextStep, if you were into
masochism). If you wanted cluster computing, you could buy a big SGI,
or an IBM SP2 (were they available in 1994? I don't remember).
I refuse to count Windows as "open systems", even though MCS =
wolfpack (the cluster version of WinNT) was already on peoples mind
then.
On the other hand, 10 years ago, a very-high-quality cluster file
system had already been available for a decade, in the form of
VAXcluster. I've personally used VAXclusters of up to 90 machines, in
the late 80s. And if you wanted multi-node fault-tolerant NVRAM, it
was already available in the form of "servernet", the Tandem's
backplane technology.
It's actually very sad, how little we have accomplished since then.
The destruction of Digital and of Tandem contributed much to the lack
of progress.
--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy _firstname_@lr_dot_los-gatos_dot_ca.us |
|
| Back to top |
|
 |
Wolf
Guest
|
Posted:
Sat Dec 04, 2004 7:17 pm Post subject:
Re: Distributed filesystems (was: Re: PanFS?) |
|
|
<_firstname_@lr_dot_los-gatos_dot_ca.us> wrote in message
news:1102137471.806300@smirk...
| Quote: |
I've heard of 1024-node clusters today. Ask some of the 1st-tier
vendors. And everyone in the field is working on improving
scalability. Clearly, object-storage is one of the tools in the
arsenal. As is NVRAM, but that has problems (that Bill explained
well).
|
We have ~1800 compute nodes. 150TB of disk space. Scalability
is a major problem. For instance, we push NFS further than it was
probably intended. So I see a GFS as a solution for helping not only
our performance, but scalability also.
| Quote: | In the commercial world, there was Ultrix on Alpha, SunOS, HP-UX V8,
AIX 3, and some IRIX version (and NextStep, if you were into
masochism). If you wanted cluster computing, you could buy a big SGI,
or an IBM SP2 (were they available in 1994? I don't remember).
|
don't forget CRAY and a bunch of multi-proc proprietary solutions. But
they were all very expensive. |
|
| Back to top |
|
 |
|
|
|
|