| Author |
Message |
Eric P.
Guest
|
Posted:
Wed Jan 26, 2005 7:34 pm Post subject:
Re: CAS and LL/SC |
|
|
Bernd Paysan wrote:
| Quote: |
Eric P. wrote:
Now tell me there's no DMA unit in the chipset ;-).
I'm not sure what you mean. Each device has its own DMA unit if it
needs one.
The DMA unit has two parts. The device wants to read from or write to main
memory, and the chipset has to satisfy that request. An IOP might sit
between the DMA request and the real memory, handling things like virtual
memory or scatter gathering (scatter gathering is mostly used to implement
the fragmented DMA transfers you get when translating virtual to physical
addresses).
|
Yes, I just don't call that DMA, I call it Scatter Gather logic
in the Bus Adapter Unit. A rose by another name.
I left the virtual memory mgmnt out of my description.
The main cpu must walk the page table and pin the data pages
and any page table pages in the hierarchy that reference them.
The buffer address and page table root physical address are
stuffed into the cmd pkt so the IOP can program the SG table.
So what is being offloaded is:
- SG setup and tear down (and wait states)
- Device DMA register setup (and wait states)
- A small amount of driver code (few hundred instructions)
BTW the IOP must be able to directly read and write all of physical
memory, which these days means it needs a 64 bit register set,
though you might get by with a smaller, multi clock cycle ALU.
But it also needs its own instr. and data memory, so there needs
to be some way of distinguishing IOP addresses from host addresses.
This might be as simple as reserving chunks of physical address
range above that used by the main cpu. The IOP would pick off this
range and route the access to its local RAM. It would also need
a small MMU so device drivers can use position independent code.
| Quote: | Actually, DMA is a sort of "poor man's IOP". The DMA takes over the most
tedious workload, the bulk data transfer. Platforms with IO processors
don't need DMA, the device just requests and acknowledges transfers like an
IDE device in PIO mode.
|
I would call it more optimized than poor. A DMA control is basically
just a couple of counters driving the read & write control lines.
Its response time is that of the bus arbitration logic, and
it can keep the bus bandwidth fully utilized.
A processor can accomplish the same task, but to handle multiple
devices requires interrupt driven PIO logic which has a lot more
overhead and gets nowhere near DMA bandwidth.
| Quote: | If there is an IOP, then I assume you wanted to use it to program
the device DMA unit. The device driver in the main cpu constructs a
command packet and hands it to IOP, probably stuffs it in a double
linked list and sets an attention flag. A single linked list from
which the pkt ptr would then be copied in and the list order
reversed would also work and avoids the spinlocks.
The IOP, running a miniature quasi real time OS, polls the attention
flag, sees the new pkt, parses pkt and runs the same driver code that
it would have run on the main cpu to program the 'slow' device
dma registers. It then starts the dma.
When IO completes, device interrupts IOP, which marks cmd pkt
as complete, stuffs pkt in a reply queue, interrupts main cpu,
and starts next IO.
Yes, that's the general idea. If you add an IOP to the PC architecture, you
would still have the possibility to bypass it, for compatibility reasons.
The IOP does not necessarily have to run a full "device driver", it's also
possible if the command packet contains enough details (i.e. a sort of
"mini-program") to send details through on-the-fly (like "you have to set
register x to y, and then poll register z for bit q"). And for
packet-oriented IOs like USB or SCSI (SATA), it is sufficient if the lowest
level (the IOP) knows how to send and receive packets.
|
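The single-linked-list handoff described in the quoted text (push packets without a spinlock, then have the IOP copy the list pointer and reverse the order) could be sketched roughly as below. This is only an illustration using C11 atomics on one shared-memory machine; `cmd_pkt`, `host_submit` and `iop_take_all` are hypothetical names, and a real host/IOP pair would also need the attention flag and whatever bus-level ordering the hardware requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical command packet; the field names are illustrative only. */
struct cmd_pkt {
    struct cmd_pkt *next;
    int             opcode;
};

/* Head of the pending list, shared between host CPU and IOP. */
static _Atomic(struct cmd_pkt *) pending_head = NULL;

/* Host side: push a packet with a CAS loop, no spinlock needed. */
static void host_submit(struct cmd_pkt *pkt)
{
    assert(pkt != NULL);
    struct cmd_pkt *old = atomic_load(&pending_head);
    do {
        pkt->next = old;   /* link in front of the current head */
    } while (!atomic_compare_exchange_weak(&pending_head, &old, pkt));
}

/* IOP side: take the whole list in one atomic swap, then reverse it
 * so packets come out in submission (FIFO) order. */
static struct cmd_pkt *iop_take_all(void)
{
    struct cmd_pkt *lifo = atomic_exchange(&pending_head, NULL);
    struct cmd_pkt *fifo = NULL;
    while (lifo != NULL) {
        struct cmd_pkt *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}
```

Because the consumer always detaches the entire list at once, this pattern sidesteps the ABA problem that a pop-one-at-a-time CAS stack would have.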
Yes I thought of that too - attaching a driver code snippet to a
cmd pkt. That kind of ability might save writing a whole new driver.
It needs to support a whole IO subsystem
- Load and unload drivers
- Create and destroy devices
- Queue device IO Request Packets
- Cancel individual IO requests
- Housekeeping (Reset devices, blah blah)
All of the above functions require descriptors in IOP local
memory, so there is some form of heap management system.
| Quote: | snip
That's why IOPs have to be multithreaded, so you don't even have to plug in
something physical.
|
Having a multithreaded RTOS can simplify _some_ driver writing
for many non demanding devices. But for performance I would stick
with interrupt driven state machines as they require no context
load/unload. Support both and let the driver writer decide.
Eric |
|
 |
FredK
Guest
|
Posted:
Wed Jan 26, 2005 7:55 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:ct7j4t$nc9$1@osl016lin.hda.hydro.com...
| Quote: | FredK wrote:
No. The problems overlap, but are not identical. To provide a way to
access IO space, the platform groups invented a "sparse" address
space in which some of the low-order address bits of the VA would
be used to create the length and offset of the partial word read or
write - *and* the data itself would need to be shifted into the correct
alignment for the operation - this was termed "swizzling".
Since they only had three bits left over at the low end, they had to
embed some info in at least one high-order address bit as well, i.e.
you'd need each memory-mapped device to be mapped several times.
|
Because the CPU was not decoding the low order bits, and sparse
space was purely done by the platform, the middle part of the address
was shifted typically by 5 bits, and the bits just above the low-order
bits were loaded with a value that indicated the offset and length.
One of the high order bits in the address indicated that it was a
sparse address.
So a 16-bit write, for example would look something like:
char *tmp;
_memory_barrier();
volatile unsigned int *pIO;
tmp = CSR_ADDRESS + ((port_address << swizzle_shift) | (1 <<
(swizzle_shift - 2)));
pIO = (unsigned int *)tmp;
*pIO = data_out << ((port_address & 3) * 8);
where CSR_ADDRESS is the base VA of the sparse mapping (say the base ISA
space)
port_address is the normal port address (byte offset) into the space
swizzle_shift is the platform-specific sparse shift value (like 5 or 7)
data_out is the 16-bit value
Very ugly, but it only had to be done once in a macro or routine. The cost
was
performance - as well as the fact the driver could not simply do a int16
operation
directly to the address in an easy fashion. |
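The address and data arithmetic in the snippet above can be separated from the actual device access, which makes it easy to check on any machine. The sketch below only reproduces the two computations from the post (sparse offset with the length code at `swizzle_shift - 2`, and the byte-lane shift); the function names are hypothetical, and the exact encoding on a real platform may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset into the sparse window for a 16-bit access to port_address.
 * The length-code placement (1 << (shift - 2)) follows the post above;
 * real platforms may encode it differently. */
static uintptr_t sparse_word_offset(uint32_t port_address, unsigned swizzle_shift)
{
    assert(swizzle_shift >= 2 && swizzle_shift < 32);
    return ((uintptr_t)port_address << swizzle_shift)
         | ((uintptr_t)1 << (swizzle_shift - 2));
}

/* Move the 16-bit value into the byte lane selected by the low two
 * address bits (little-endian lane numbering). */
static uint32_t swizzle_data16(uint16_t data_out, uint32_t port_address)
{
    return (uint32_t)data_out << ((port_address & 3u) * 8u);
}
```

With a shift of 5, port 0x3C2 lands at offset 0x7848 and its data moves into the upper half of the 32-bit word.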
|
 |
Eric P.
Guest
|
Posted:
Wed Jan 26, 2005 9:14 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
Jan Vorbrüggen wrote:
| Quote: |
But those apps were broken to begin with, their brokenness was just
exposed more easily on Alpha.
Not so. Replacing atomic byte and word stores with read-modify-write
sequences *changes* such accesses to be interruptible on that
platform. Granted such usage is an unstated assumption, but so
what - who guaranteed that 32 bit int accesses are atomic?
These are assumptions that are valid on most (all?) platforms
but that one, and are used by lots of code including that of their
existing customers' VAX code.
The VAX architecture guaranteed that memory accesses on a uniprocessor
were atomic, as were certain instructions that did read-modify-writes
- for instance, the queue instructions. However, in all multiprocessor
systems only the use of the interlocked instructions (queue manipulation,
setting/clearing bits, ADAWI - add aligned word interlocked) was
guaranteed this property across processors. Thus, the programmer had to know
when accessing shared data structures in what synchronization domain he
was working. Most of that knowledge was hidden behind appropriate APIs.
Where that isn't possible - e.g., interactions between an AST and mainline
code - there are clear instructions on the Dos and Donts, and OS support
for critical sections in the form of temporarily disabling ASTs from user
mode.
Summary: Any thread-based/parallel code on VAX that accessed a shared
data structure by just using code that was oblivious of the fact was
broken to begin with - but you would get away with it in many common
situations (e.g., on a uniprocessor system or even a small SMP that
hadn't reduced quantum). Nothing new here - the same holds true of many
other systems. And that was my point - even all those guys who thought
"all the world's a VAX" were already breaking the rules on the VAX, and
they got their just rewards when they tried to run their junk elsewhere.
|
True, but not what I was referring to.
These are effects that result from non atomic operations on a
single variable. Most people know that 'i++' is not atomic.
But how many expect 'i=0' to be corrupted because some other
thread performed a 'j=1' (a completely different variable)
at the same instant?
Without byte and word stores, word tearing can result in corruption
of *apparently distinct variables* due to false sharing.
For example, two apparently distinct fields
struct RecT
{ int16 fieldA;
int16 fieldB;
};
FieldA is updated only by ThreadA, fieldB is updated only by ThreadB.
Because of false sharing and word tearing the updates can clobber
each other. This can occur in interrupts, ASTs, and threads
because both sequences must do read-modify-write on 64 bit values.
This can also occur for character strings. Updates to two
supposedly distinct strings can clobber each other and there
is almost no way to watch out for it.
The only way to avoid this on the original Alpha 21064 is to
have *every* byte or word update do a LL/SC loop, or to
customize the code for it to avoid this problem.
Eric |
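The clobbering described above can be replayed deterministically without real threads. The sketch below is an illustration only: it packs the two 16-bit fields into one 32-bit word (an assumed layout, with fieldA in the low half) and walks through one unlucky interleaving of the two load/merge/store sequences.

```c
#include <stdint.h>

/* One 32-bit memory word holding two 16-bit fields:
 * fieldA in bits 0..15, fieldB in bits 16..31 (assumed layout). */
static uint32_t shared_word;

/* A software "16-bit store" has two halves: load the whole word,
 * then merge the new field and store the whole word back. */
static uint32_t load_word(void) { return shared_word; }

static void store_fieldA(uint32_t snapshot, uint16_t v)
{
    shared_word = (snapshot & 0xFFFF0000u) | v;
}

static void store_fieldB(uint32_t snapshot, uint16_t v)
{
    shared_word = (snapshot & 0x0000FFFFu) | ((uint32_t)v << 16);
}

/* Replay one unlucky interleaving; returns the final word. */
static uint32_t replay_torn_update(void)
{
    shared_word = 0;
    uint32_t seen_by_a = load_word();   /* ThreadA loads the word        */
    uint32_t seen_by_b = load_word();   /* ThreadB loads the same word   */
    store_fieldA(seen_by_a, 0x1111);    /* ThreadA writes fieldA         */
    store_fieldB(seen_by_b, 0x2222);    /* ThreadB stores its stale copy */
    return shared_word;
}
```

ThreadB's store is built from a snapshot taken before ThreadA's update, so fieldA silently reverts to zero even though ThreadB never named it.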
|
 |
Jan Vorbrüggen
Guest
|
Posted:
Wed Jan 26, 2005 10:13 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
| Quote: | Without byte and word stores, word tearing can result in corruption
of *apparently distinct variables* due to false sharing.
|
As I said in my previous post, this was already true on VAX, at the very
least for multiprocessor systems.
Jan |
|
 |
FredK
Guest
|
Posted:
Wed Jan 26, 2005 10:40 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
"Jan Vorbrüggen" <jvorbrueggen-not@mediasec.de> wrote in message
news:35q1e9F4ql4h5U1@individual.net...
| Quote: | Without byte and word stores, word tearing can result in corruption
of *apparently distinct variables* due to false sharing.
As I said in my previous post, this was already true on VAX, at the very
least for multiprocessor systems.
|
No, I believe that two *adjacent* bytes in the same longword each
being written by a different CPU on a VAX will not "corrupt" the
other byte. This happens because the cache coherency logic
will do the read/mod/write of the byte as some form of exclusive
cache access with intent to write. The same as an Alpha with
byte/word instructions.
But without byte/word instructions, the program itself had to fetch
at minimum the 32-bit word, modify it, and write it back - which
shifts the responsibility for coherency to the program - hence the
need to LL/SC. |
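As a rough illustration of what "the program itself" has to do, here is a byte store built from a retry loop on the containing 32-bit word. C11 has no portable LL/SC, so the sketch substitutes `atomic_compare_exchange_weak`, which compilers typically lower to an LL/SC loop on machines that have one; `store_byte_atomic` is a hypothetical helper, not anything from the Alpha toolchain.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Store one byte into a shared 32-bit word without disturbing the
 * other three. The compare-exchange loop stands in for the LL/SC
 * sequence: retry if another writer changed the word in between. */
static void store_byte_atomic(_Atomic uint32_t *word, unsigned byte_index,
                              uint8_t value)
{
    unsigned shift = (byte_index & 3u) * 8u;
    uint32_t mask  = (uint32_t)0xFFu << shift;
    uint32_t old   = atomic_load(word);
    uint32_t desired;
    do {
        /* On failure, `old` is refreshed with the current contents. */
        desired = (old & ~mask) | ((uint32_t)value << shift);
    } while (!atomic_compare_exchange_weak(word, &old, desired));
}
```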
|
 |
Eric P.
Guest
|
Posted:
Wed Jan 26, 2005 11:14 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
Jan Vorbrüggen wrote:
| Quote: |
Without byte and word stores, word tearing can result in corruption
of *apparently distinct variables* due to false sharing.
As I said in my previous post, this was already true on VAX, at the very
least for multiprocessor systems.
Jan
|
Oh, sorry... I didn't realize that is what you were saying.
In that case: Nope :-) VAX Arch Handbook, page 25:
"In the VAX architecture, implicit sharing is transparent
to the programmer. The memory system is implemented such that
the basis of access for independent modification is the byte.
... independent modifying accesses to adjacent bytes produces
the same result regardless of the order of execution."
On the other hand, there were also unconfirmed rumors floating
about that it might not actually work right on some SMP machines.
We had some lock free code that would randomly fail for no reason,
and wrapping the accesses with mutex locks fixed the problem.
Eric |
|
 |
Nick Maclaren
Guest
|
Posted:
Thu Jan 27, 2005 1:21 am Post subject:
Re: CAS and LL/SC |
|
|
In article <1106764313.689843@haldjas.folklore.ee>,
Sander Vesik <sander@haldjas.folklore.ee> wrote:
| Quote: |
In 15 years, machines with 1000 simultaneous hw threads from a combination
of smp + mt are imho likely to be rather commonplace. Using say cumulative
36 bits per thread (after all, the number of software threads is likely to
be far higher), it gets you to 48 bits. That's just 8 bits away from hitting
flags. So large systems running on high end machines in 15 years would
imho not feel comfortable with having 56 bits of address + 8 bits of flags,
especially if the kernel takes away one of the bits anyway.
|
Well, firstly, I disagree - for some very solid reasons - but let that
pass. The number of threads is irrelevant - what is important is the
total amount of memory in a single image.
I had reason for saying that I don't believe that it will continue to
increase at "Moore's Law" speeds, and I stand by that.
Regards,
Nick Maclaren. |
|
 |
Per Schröder
Guest
|
Posted:
Thu Jan 27, 2005 3:39 am Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
Eric P. wrote:
| Quote: | On the other hand, there were also unconfirmed rumors floating
about that it might not actually work right on some SMP machines.
We had some lock free code that would randomly fail for no reason,
and wrapping the accesses with mutex locks fixed the problem.
|
While an SMP VAX processor might not reorder memory accesses as aggressively
as an Alpha, it can have a store buffer. If you use ordinary memory
access instructions and write a memory cell, that write could sit in a
buffer in the VAX CPU and be invisible to other CPUs. The VAX does not have
memory barrier instructions, but by using an "I" instruction (ADAWI,
REMQxI, INSQxI, BBSSI, etc), you can achieve a similar effect.
The mutex locks you added probably executed such instructions, adding
something like a memory barrier around your memory access.
/Per Schröder |
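In portable C the same idea (make earlier writes visible before the flag that announces them) is usually expressed with release/acquire atomics rather than interlocked instructions. The sketch below is an analogy in C11 terms, not VAX code; `publish` and `try_consume` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int         payload;          /* ordinary, non-atomic data   */
static atomic_bool ready = false;    /* flag that publishes payload */

/* Writer: the release store orders the payload write before the
 * flag becomes visible, whatever buffering the CPU does. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above, so a
 * reader that sees the flag also sees the payload. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

A mutex lock/unlock pair gives at least this much ordering, which fits the observation that wrapping the accesses in mutexes made the failures go away.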
|
 |
Terje Mathisen
Guest
|
Posted:
Thu Jan 27, 2005 4:45 am Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
FredK wrote:
| Quote: | Because the CPU was not decoding the low order bits, and sparse
space was purely done by the platform, the middle part of the address
was shifted typically by 5 bits, and the bits just above the low-order
bits were loaded with a value that indicated the offset and length.
One of the high order bits in the address indicated that it was a
sparse address.
So a 16-bit write, for example would look something like:
char *tmp;
_memory_barrier();
volatile unsigned int *pIO;
tmp = CSR_ADDRESS + ((port_address << swizzle_shift) | (1 <<
(swizzle_shift - 2)));
pIO = (unsigned int *)tmp;
*pIO = data_out << ((port_address & 3) * 8);
|
_Very_ obviously little endian. :-)
| Quote: | where CSR_ADDRESS is the base VA of the sparse mapping (say the base ISA
space)
port_address is the normal port address (byte offset) into the space
swizzle_shift is the platform-specific sparse shift value (like 5 or 7)
data_out is the 16-bit value
Very ugly, but it only had to be done once in a macro or routine. The cost
was
performance - as well as the fact the driver could not simply do a int16
operation
directly to the address in an easy fashion.
|
Since this setup could be done once (as long as the IO window addr
stayed put (?)), I'd assume you'd normally generate the needed pointers
during driver init, and then simply do regular (membar protected and
preshifted) accesses to them?
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
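The "generate the needed pointers during driver init" idea might look something like the following. It is a sketch under the same assumptions as the snippet quoted above (length code at `swizzle_shift - 2`, little-endian byte lanes); `sparse_reg16` and its helpers are hypothetical names, and the memory barrier and the store itself are left to the caller.

```c
#include <stdint.h>

/* Hypothetical per-register handle built once at driver init. */
struct sparse_reg16 {
    uintptr_t sparse_addr;  /* full sparse-space address, ready to use */
    unsigned  lane_shift;   /* bit shift that selects the byte lane    */
};

static struct sparse_reg16 sparse_reg16_init(uintptr_t csr_base,
                                             uint32_t port_address,
                                             unsigned swizzle_shift)
{
    struct sparse_reg16 r;
    r.sparse_addr = csr_base
                  + (((uintptr_t)port_address << swizzle_shift)
                     | ((uintptr_t)1 << (swizzle_shift - 2)));
    r.lane_shift  = (port_address & 3u) * 8u;
    return r;
}

/* Per-access work shrinks to one shift; the caller still supplies
 * the memory barrier and the actual store. */
static uint32_t sparse_reg16_encode(const struct sparse_reg16 *r,
                                    uint16_t data_out)
{
    return (uint32_t)data_out << r->lane_shift;
}
```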
|
 |
Andrew Reilly
Guest
|
Posted:
Thu Jan 27, 2005 5:38 am Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
On Thu, 27 Jan 2005 01:45:06 +0100, Terje Mathisen wrote:
| Quote: | Since this setup could be done once (as long as the IO window addr
stayed put (?)), I'd assume you'd normally generate the needed pointers
during driver init, and then simply do regular (membar protected and
preshifted) accesses to them?
|
Probably not if you were using portable driver code, as you would be in
Linux/*BSD. Might be able to get away with it on Ultrix, but I imagine
that even VMS drivers wanted to remain portable to equivalent Vax boxes.
I don't imagine that this sort of thing would be a performance bottleneck,
anyway. Most of the heavy IO lifting is done with DMA these days, rather
than banging on device registers.
--
Andrew |
|
 |
Rupert Pigott
Guest
|
Posted:
Thu Jan 27, 2005 6:33 am Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
Jan Vorbrüggen wrote:
| Quote: | That's essentially the same problem. In all cases (threads, memory
mapping and interrupts), the problems and solutions have a lot in
common.
I disagree. A lot of accesses to memory don't care about being
thread-safe or not, or being interrupted and repeated later, etc - that's
the basis for your often-repeated argument, Nick, that a memory system in
a parallel processor should not support cache coherency on every access
but only on those where the programmer says it is required, and subject
to certain restrictions. All those nice properties don't hold for your
run-of-the-mill (E)ISA and even PCI device - you often cannot repeat
the device memory access without things going wrong, whether you access
a register as an entity one, two or four bytes long might make a
difference to its operation, and so on. Of course, you might say that these
designs |
Ah, the old Transputer block move gotcha where it trips
over a FIFO in the address range (ie: sometimes double
pumps the FIFO)... There are ways to mitigate that of
course, but they are ugly. The only realistic alternative
is to actually fix the broken devices, or hang them off
some smart controller that hides the ugly details. ;)
Cheers,
Rupert |
|
 |
FredK
Guest
|
Posted:
Thu Jan 27, 2005 6:56 am Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:ct9a23$2rl$1@osl016lin.hda.hydro.com...
| Quote: | FredK wrote:
Because the CPU was not decoding the low order bits, and sparse
space was purely done by the platform, the middle part of the address
was shifted typically by 5 bits, and the bits just above the low-order
bits were loaded with a value that indicated the offset and length.
One of the high order bits in the address indicated that it was a
sparse address.
So a 16-bit write, for example would look something like:
char *tmp;
_memory_barrier();
volatile unsigned int *pIO;
tmp = CSR_ADDRESS + ((port_address << swizzle_shift) | (1 <<
(swizzle_shift - 2)));
pIO = (unsigned int *)tmp;
*pIO = data_out << ((port_address & 3) * 8);
_Very_ obviously little endian. :-)
|
Well, yup. I don't believe in Big-Endian ;-)
| Quote: | Since this setup could be done once (as long as the IO window addr
stayed put (?)), I'd assume you'd normally generate the needed pointers
during driver init, and then simply do regular (membar protected and
preshifted) accesses to them?
|
In a driver, I would probably simply call the system routine that
effectively does something like the above, because for graphics the
drivers usually weren't doing anything performance-critical.
For the DDX code, it varied depending on the design. VGA-like devices
sucked, but the cost of doing swizzling paled next to the cost of keeping
the writes ordered - especially pre-EV56/EV6. The simplest way to do it
was to generate a structure that matched the sparse space layout, and
then assign the IO base address as the pointer to the structure. But
you *still* needed to put the data into the right byte lane, and deal with
write ordering.
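A structure that matches the sparse layout could be built by padding each register out to one sparse slot. The sketch below assumes a shift of 5 (32 sparse bytes per dense byte address), a base port aligned to 4, and a made-up four-register device; it ignores the length-code bits, so it shows only the layout and byte-lane bookkeeping, not a complete access routine.

```c
#include <stddef.h>
#include <stdint.h>

#define SWIZZLE_SHIFT 5u                      /* assumed platform value */
#define SLOT_BYTES    (1u << SWIZZLE_SHIFT)   /* sparse bytes per port  */

/* One dense byte address expands to one 32-byte slot in sparse space. */
struct sparse_slot {
    volatile uint32_t data;                   /* byte lanes live here   */
    uint8_t           pad[SLOT_BYTES - sizeof(uint32_t)];
};

/* Hypothetical device with four consecutive byte-wide ports. */
struct demo_regs {
    struct sparse_slot index;    /* port base + 0 */
    struct sparse_slot value;    /* port base + 1 */
    struct sparse_slot status;   /* port base + 2 */
    struct sparse_slot control;  /* port base + 3 */
};

/* Lane shift for the port at a given slot offset within the struct,
 * assuming the base port address is a multiple of 4. */
static unsigned lane_shift_for(size_t slot_offset)
{
    return (unsigned)((slot_offset / SLOT_BYTES) & 3u) * 8u;
}
```

The struct gets the addressing right for free, but as the post says, the data still has to be shifted into the correct lane and the writes still have to be ordered.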
Most PCI graphics provided at least 32-bit registers which gets
rid of the need for "most" byte IO (although some did a few stupid
RAMDAC designs that clung to the VGA design). Frame buffer
logic was all hand crafted, but most graphics these days have a
fast enough DMA path for image data so that tuned frame buffer
code isn't much (if at all) faster even in the special cases -
especially if the IO path isn't tuned for PIO (for example, the Alpha
EV45 will get upward of 80% of DMA performance using properly
done sequential PIO - without the DMA setup time and latency,
but the EV7 absolutely blows chunks for PIO).
Doing sick and twisted things to get around the need for memory
barriers for register writes was a lot more fun. EV6 made this a lot
cheaper with a lightweight write memory barrier. But you still spend
time making sure that your code only emits one when it absolutely
positively has to. |
|
 |
FredK
Guest
|
Posted:
Thu Jan 27, 2005 7:05 am Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
"Andrew Reilly" <andrew-newspost@areilly.bpc-users.org> wrote in message
news:pan.2005.01.27.00.38.17.452299@areilly.bpc-users.org...
| Quote: | On Thu, 27 Jan 2005 01:45:06 +0100, Terje Mathisen wrote:
Since this setup could be done once (as long as the IO window addr
stayed put (?)), I'd assume you'd normally generate the needed pointers
during driver init, and then simply do regular (membar protected and
preshifted) accesses to them?
Probably not if you were using portable driver code, as you would be in
Linux/*BSD. Might be able to get away with it on Ultrix, but I imagine
that even VMS drivers wanted to remain portable to equivalent Vax boxes.
I don't imagine that this sort of thing would be a performance bottleneck,
anyway. Most of the heavy IO lifting is done with DMA these days, rather
than banging on device registers.
|
Yes. 4-5 years ago, I would have spent time hand crafting frame buffer
PIO logic. Today, by and large you can set up the image data into a DMA
buffer and get about the same or better performance depending on the
platform. Even better is when there is some nice logic - like the 3DLabs
devices where you can actually debug things using PIO into a FIFO ring
address space, and the same sequence can be used for the DMA packet
when you get it all working and switch to using DMA.
However, a lot of crappy XFree86 DDXs still do the majority of their 2D
drawing using PIO. And worse - some of these also support 3D - so the
code thrashes between stopping all the OpenGL DMA, doing X11 PIO
to registers, to the FB, and then start doing more OpenGL DMA.
Yuck. Buggy. Slow. But you get what you pay for. Let's not even
talk about the DRI/DRM crock. |
|
 |
glen herrmannsfeldt
Guest
|
Posted:
Thu Jan 27, 2005 1:52 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
Terje Mathisen wrote:
| Quote: | Nick Maclaren wrote:
|
(snip)
| Quote: | IMHO C wasn't the worst problem:
Memory-mapped IO to 8/16/32-bit device registers with destructive read or
write was a harder problem, and in this case DEC couldn't simply define
away the problem either. The workaround entailed using alternate address
ranges afair. It seemed like a horrible hack at the time, and it
probably generated quite a few bugs in kernel mode drivers. :-(
|
I remember multiple address spaces on the VME bus, I believe 24 and
32 bit addresses, and 8, 16, and 32 bit data bus width could be
separately addressed. SunOS had devices (like /dev/vme24d32) that
could address them, and using mmap() could be memory mapped.
How strange using I/O devices to map memory to memory, but it
did work.
-- glen |
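For readers who have not seen the device-node approach, mapping a window through `mmap()` generally looks like the sketch below. It uses only standard POSIX calls; `map_device_window` is a hypothetical helper, the path is whatever the caller passes in, and nothing here is specific to SunOS or to the VME device named above.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `length` bytes of a device node; returns NULL on any failure. */
static void *map_device_window(const char *path, size_t length)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *window = mmap(NULL, length, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */

    return window == MAP_FAILED ? NULL : window;
}
```

Accesses through the returned pointer would normally go via a `volatile` pointer of the width the device expects.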
|
 |
glen herrmannsfeldt
Guest
|
Posted:
Thu Jan 27, 2005 1:54 pm Post subject:
Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM |
|
|
Jan Vorbrüggen wrote:
| Quote: | It originally had only full register loads and stores, which is
unsuitable for implementing C. That was fixed.
I believe the driving force in providing less-than-32-bit-read and
-writes was memory-mapped I/O, not implementing C (which was needed
from day one in any case). The workarounds in hard- and software for
the first systems were Not Pretty and prone to misuse and errors, so
the necessary instructions were added.
|
The story I heard was that it was Microsoft and NT, though
by the time it was actually done NT/Alpha was dead.
-- glen |
|